Alat berikut memvisualisasikan apa yang dilakukan komputer langkah demi langkah saat menjalankan program tersebut Show
Editor Kode Python Kontribusikan kode dan komentar Anda melalui Disqus Sebelumnya. Rumah Latihan Pencarian dan Penyortiran Python Berapa tingkat kesulitan latihan ini? Mudah Sedang KerasUji keterampilan Pemrograman Anda dengan kuis w3resource Ikuti kami di Facebook dan Twitter untuk pembaruan terbaru. Piton. Kiat Hari IniDekomposisi koleksi Asumsikan kita memiliki fungsi yang mengembalikan tuple dari dua nilai dan kita ingin menetapkan setiap nilai ke variabel terpisah. Salah satu caranya adalah dengan menggunakan pengindeksan seperti di bawah ini abc = (5, 10) x = abc[0] y = abc[1] print(x, y) Keluaran 5 10_ There is a better option that allows us to do the same operation in one line x, y = abc print(x, y) Keluaran 5 10_ It can be extended to a tuple with more than 2 values or some other data structures such as lists or sets The pandas I/O API is a set of top level In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object05 functions accessed like that generally return a pandas object. The corresponding In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object07 functions are object methods that are accessed like . Below is a table containing available In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object09 and In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object10 Format Type Data Description Reader Writer text CSV text Fixed-Width Text File text JSON text HTML text LaTeX text XML text Local clipboard binary MS Excel binary OpenDocument binary Format HDF5 binary Format Bulu binary Bentuk Parket binary Format ORC binary Status binary SAS binary SPSS binary Format Acar Python SQL SQL SQL Google BigQuery adalah perbandingan kinerja informal untuk beberapa metode IO ini Catatan Untuk contoh yang menggunakan kelas In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object_11, pastikan Anda mengimpornya dengan In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object12 untuk Python 3 File CSV & teksFungsi pekerja keras untuk membaca file teks (a. k. a. file datar) adalah. Lihat untuk beberapa strategi lanjutan Opsi penguraianmenerima argumen umum berikut Dasarfilepath_or_buffer beragamBaik jalur ke file (a , , atau In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object17), URL (termasuk lokasi http, ftp, dan S3), atau objek apa pun dengan metode In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object18 (seperti file terbuka atau )sep str, default ke In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object20 untuk , In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object22 untuk Pembatas untuk digunakan. Jika sep adalah In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object24, mesin C tidak dapat secara otomatis mendeteksi pemisah, tetapi mesin parsing Python dapat, artinya yang terakhir akan digunakan dan secara otomatis mendeteksi pemisah dengan alat sniffer bawaan Python,. Selain itu, pemisah yang lebih panjang dari 1 karakter dan berbeda dari In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object26 akan ditafsirkan sebagai ekspresi reguler dan juga akan memaksa penggunaan mesin parsing Python. Perhatikan bahwa pembatas regex cenderung mengabaikan data yang dikutip. Contoh regex. In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object27delimiter str, default In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object24 Alternative argument name for sep delim_whitespace boolean, default FalseSpecifies whether or not whitespace (e. g. In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object29 or In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object30) will be used as the delimiter. Equivalent to setting In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object31. If this option is set to In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object32, nothing should be passed in for the In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object33 parameter Column and index locations and namesheader int or list of ints, defaultIn [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object34 Row number(s) to use as the column names, and the start of the data. Default behavior is to infer the column names. if no names are passed the behavior is identical to In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object35 and column names are inferred from the first line of the file, if column names are passed explicitly then the behavior is identical to In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object36. Explicitly pass In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object35 to be able to replace existing names The header can be a list of ints that specify row locations for a MultiIndex on the columns e. g. In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object38. Intervening rows that are not specified will be skipped (e. g. 2 in this example is skipped). Note that this parameter ignores commented lines and empty lines if In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object39, so header=0 denotes the first line of data rather than the first line of the filenames array-like, default In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object24 List of column names to use. If file contains no header row, then you should explicitly pass In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object36. Duplicates in this list are not allowedindex_col int, str, sequence of int / str, or False, optional, default In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object24 Column(s) to use as the row labels of the In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object43, either given as string name or column index. If a sequence of int / str is given, a MultiIndex is used Catatan In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object44 can be used to force pandas to not use the first column as the index, e. g. when you have a malformed file with delimiters at the end of each line The default value of In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object24 instructs pandas to guess. If the number of fields in the column header row is equal to the number of fields in the body of the data file, then a default index is used. If it is larger, then the first columns are used as index so that the remaining number of fields in the body are equal to the number of fields in the header The first row after the header is used to determine the number of columns, which will go into the index. If the subsequent rows contain less columns than the first row, they are filled with In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object46 This can be avoided through In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object47. This ensures that the columns are taken as is and the trailing data are ignoredusecols list-like or callable, default In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object24 Return a subset of the columns. If list-like, all elements must either be positional (i. e. integer indices into the document columns) or strings that correspond to column names provided either by the user in In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object49 or inferred from the document header row(s). If In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object49 are given, the document header row(s) are not taken into account. For example, a valid list-like In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object47 parameter would be In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object52 or In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object53 Element order is ignored, so In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object54 is the same as In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object55. To instantiate a DataFrame from In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object56 with element order preserved use In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object57 for columns in In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object58 order or In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object59 for In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object60 order If callable, the callable function will be evaluated against the column names, returning names where the callable function evaluates to True In [1]: import pandas as pd In [2]: from io import StringIO In [3]: data = "col1,col2,col3\na,b,1\na,b,2\nc,d,3" In [4]: pd.read_csv(StringIO(data)) Out[4]: col1 col2 col3 0 a b 1 1 a b 2 2 c d 3 In [5]: pd.read_csv(StringIO(data), usecols=lambda x: x.upper() in ["COL1", "COL3"]) Out[5]: col1 col3 0 a 1 1 a 2 2 c 3 Using this parameter results in much faster parsing time and lower memory usage when using the c engine. The Python engine loads the data first before deciding which columns to drop squeeze boolean, defaultIn [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object61 If the parsed data only contains one column then return a In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object62 Deprecated since version 1. 4. 0. Append In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object63 to the call to In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object64 to squeeze the data. prefix str, default In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object24 Prefix to add to column numbers when no header, e. g. ‘X’ for X0, X1, … Deprecated since version 1. 4. 0. Use a list comprehension on the DataFrame’s columns after calling In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object66. In [6]: data = "col1,col2,col3\na,b,1" In [7]: df = pd.read_csv(StringIO(data)) In [8]: df.columns = [f"pre_{col}" for col in df.columns] In [9]: df Out[9]: pre_col1 pre_col2 pre_col3 0 a b 1mangle_dupe_cols boolean, default In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object32 Duplicate columns will be specified as ‘X’, ‘X. 1’…’X. N’, rather than ‘X’…’X’. Passing in In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object61 will cause data to be overwritten if there are duplicate names in the columns Tidak digunakan lagi sejak versi 1. 5. 0. The argument was never implemented, and a new argument where the renaming pattern can be specified will be added instead. General parsing configurationdtype Type name or dict of column -> type, defaultIn [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object24 Data type for data or columns. E. g. In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object70 Use In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object15 or In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object72 together with suitable In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object73 settings to preserve and not interpret dtype. If converters are specified, they will be applied INSTEAD of dtype conversion New in version 1. 5. 0. Support for defaultdict was added. Specify a defaultdict as input where the default determines the dtype of the columns which are not explicitly listed. engine {In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object74, In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object75, In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object76} Parser engine to use. The C and pyarrow engines are faster, while the python engine is currently more feature-complete. Multithreading is currently only supported by the pyarrow engine New in version 1. 4. 0. The “pyarrow” engine was added as an experimental engine, and some features are unsupported, or may not work correctly, with this engine. converters dict, defaultIn [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object24 Dict of functions for converting values in certain columns. Keys can either be integers or column labels true_values list, defaultIn [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object24 Values to consider as In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object32false_values list, default In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object24 Values to consider as In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object61skipinitialspace boolean, default In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object61 Skip spaces after delimiter skiprows list-like or integer, defaultIn [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object24 Line numbers to skip (0-indexed) or number of lines to skip (int) at the start of the file If callable, the callable function will be evaluated against the row indices, returning True if the row should be skipped and False otherwise In [10]: data = "col1,col2,col3\na,b,1\na,b,2\nc,d,3" In [11]: pd.read_csv(StringIO(data)) Out[11]: col1 col2 col3 0 a b 1 1 a b 2 2 c d 3 In [12]: pd.read_csv(StringIO(data), skiprows=lambda x: x % 2 != 0) Out[12]: col1 col2 col3 0 a b 2skipfooter int, default In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object84 Number of lines at bottom of file to skip (unsupported with engine=’c’) nrows int, defaultIn [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object24 Number of rows of file to read. Useful for reading pieces of large files low_memory boolean, defaultIn [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object32 Internally process the file in chunks, resulting in lower memory use while parsing, but possibly mixed type inference. To ensure no mixed types either set In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object61, or specify the type with the In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object88 parameter. Note that the entire file is read into a single In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object43 regardless, use the In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object90 or In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object91 parameter to return the data in chunks. (Only valid with C parser)memory_map boolean, default False If a filepath is provided for In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object92, map the file object directly onto memory and access the data directly from there. Using this option can improve performance because there is no longer any I/O overhead NA and missing data handlingna_values scalar, str, list-like, or dict, defaultIn [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object24 Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values. See below for a list of the values interpreted as NaN by default keep_default_na boolean, defaultIn [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object32 Whether or not to include the default NaN values when parsing the data. Depending on whether In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object73 is passed in, the behavior is as follows
Note that if In [21]: data = "col_1\n1\n2\n'A'\n4.22" In [22]: df = pd.read_csv(StringIO(data), converters={"col_1": str}) In [23]: df Out[23]: col_1 0 1 1 2 2 'A' 3 4.22 In [24]: df["col_1"].apply(type).value_counts() Out[24]: <class 'str'> 4 Name: col_1, dtype: int6410 is passed in as In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object61, the In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object96 and In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object73 parameters will be ignoredna_filter boolean, default In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object32 Detect missing value markers (empty strings and the value of na_values). In data without any NAs, passing In [21]: data = "col_1\n1\n2\n'A'\n4.22" In [22]: df = pd.read_csv(StringIO(data), converters={"col_1": str}) In [23]: df Out[23]: col_1 0 1 1 2 2 'A' 3 4.22 In [24]: df["col_1"].apply(type).value_counts() Out[24]: <class 'str'> 4 Name: col_1, dtype: int6415 can improve the performance of reading a large fileverbose boolean, default In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object61 Indicate number of NA values placed in non-numeric columns skip_blank_lines boolean, defaultIn [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object32 If In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object32, skip over blank lines rather than interpreting as NaN values Datetime handlingparse_dates boolean atau daftar int atau nama atau daftar daftar atau dict, defaultIn [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object61.
Catatan A fast-path exists for iso8601-formatted dates infer_datetime_format boolean, defaultIn [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object61 If In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object32 and parse_dates is enabled for a column, attempt to infer the datetime format to speed up the processingkeep_date_col boolean, default In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object61 If In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object32 and parse_dates specifies combining multiple columns then keep the original columnsdate_parser function, default In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object24 Function to use for converting a sequence of string columns to an array of datetime instances. The default uses In [21]: data = "col_1\n1\n2\n'A'\n4.22" In [22]: df = pd.read_csv(StringIO(data), converters={"col_1": str}) In [23]: df Out[23]: col_1 0 1 1 2 2 'A' 3 4.22 In [24]: df["col_1"].apply(type).value_counts() Out[24]: <class 'str'> 4 Name: col_1, dtype: int6429 to do the conversion. pandas will try to call date_parser in three different ways, advancing to the next if an exception occurs. 1) Pass one or more arrays (as defined by parse_dates) as arguments; 2) concatenate (row-wise) the string values from the columns defined by parse_dates into a single array and pass that; and 3) call date_parser once for each row using one or more strings (corresponding to the columns defined by parse_dates) as argumentsdayfirst boolean, default In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object61 DD/MM format dates, international and European format cache_dates boolean, default TrueIf True, use a cache of unique, converted dates to apply the datetime conversion. May produce significant speed-up when parsing duplicate date strings, especially ones with timezone offsets New in version 0. 25. 0 Iterationiterator boolean, defaultIn [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object61 Return In [21]: data = "col_1\n1\n2\n'A'\n4.22" In [22]: df = pd.read_csv(StringIO(data), converters={"col_1": str}) In [23]: df Out[23]: col_1 0 1 1 2 2 'A' 3 4.22 In [24]: df["col_1"].apply(type).value_counts() Out[24]: <class 'str'> 4 Name: col_1, dtype: int6432 object for iteration or getting chunks with In [21]: data = "col_1\n1\n2\n'A'\n4.22" In [22]: df = pd.read_csv(StringIO(data), converters={"col_1": str}) In [23]: df Out[23]: col_1 0 1 1 2 2 'A' 3 4.22 In [24]: df["col_1"].apply(type).value_counts() Out[24]: <class 'str'> 4 Name: col_1, dtype: int6433chunksize int, default In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object24 Return In [21]: data = "col_1\n1\n2\n'A'\n4.22" In [22]: df = pd.read_csv(StringIO(data), converters={"col_1": str}) In [23]: df Out[23]: col_1 0 1 1 2 2 'A' 3 4.22 In [24]: df["col_1"].apply(type).value_counts() Out[24]: <class 'str'> 4 Name: col_1, dtype: int6432 object for iteration. See below Quoting, compression, and file formatcompression {In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object34, In [21]: data = "col_1\n1\n2\n'A'\n4.22" In [22]: df = pd.read_csv(StringIO(data), converters={"col_1": str}) In [23]: df Out[23]: col_1 0 1 1 2 2 'A' 3 4.22 In [24]: df["col_1"].apply(type).value_counts() Out[24]: <class 'str'> 4 Name: col_1, dtype: int6437, In [21]: data = "col_1\n1\n2\n'A'\n4.22" In [22]: df = pd.read_csv(StringIO(data), converters={"col_1": str}) In [23]: df Out[23]: col_1 0 1 1 2 2 'A' 3 4.22 In [24]: df["col_1"].apply(type).value_counts() Out[24]: <class 'str'> 4 Name: col_1, dtype: int6438, In [21]: data = "col_1\n1\n2\n'A'\n4.22" In [22]: df = pd.read_csv(StringIO(data), converters={"col_1": str}) In [23]: df Out[23]: col_1 0 1 1 2 2 'A' 3 4.22 In [24]: df["col_1"].apply(type).value_counts() Out[24]: <class 'str'> 4 Name: col_1, dtype: int6439, In [21]: data = "col_1\n1\n2\n'A'\n4.22" In [22]: df = pd.read_csv(StringIO(data), converters={"col_1": str}) In [23]: df Out[23]: col_1 0 1 1 2 2 'A' 3 4.22 In [24]: df["col_1"].apply(type).value_counts() Out[24]: <class 'str'> 4 Name: col_1, dtype: int6440, In [21]: data = "col_1\n1\n2\n'A'\n4.22" In [22]: df = pd.read_csv(StringIO(data), converters={"col_1": str}) In [23]: df Out[23]: col_1 0 1 1 2 2 'A' 3 4.22 In [24]: df["col_1"].apply(type).value_counts() Out[24]: <class 'str'> 4 Name: col_1, dtype: int6441, In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object24, In [21]: data = "col_1\n1\n2\n'A'\n4.22" In [22]: df = pd.read_csv(StringIO(data), converters={"col_1": str}) In [23]: df Out[23]: col_1 0 1 1 2 2 'A' 3 4.22 In [24]: df["col_1"].apply(type).value_counts() Out[24]: <class 'str'> 4 Name: col_1, dtype: int6443}, default In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object34 For on-the-fly decompression of on-disk data. If ‘infer’, then use gzip, bz2, zip, xz, or zstandard if In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object92 is path-like ending in ‘. gz’, ‘. bz2’, ‘. zip’, ‘. xz’, ‘. zst’, respectively, and no decompression otherwise. If using ‘zip’, the ZIP file must contain only one data file to be read in. Set to In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object24 for no decompression. Can also be a dict with key In [21]: data = "col_1\n1\n2\n'A'\n4.22" In [22]: df = pd.read_csv(StringIO(data), converters={"col_1": str}) In [23]: df Out[23]: col_1 0 1 1 2 2 'A' 3 4.22 In [24]: df["col_1"].apply(type).value_counts() Out[24]: <class 'str'> 4 Name: col_1, dtype: int6447 set to one of { In [21]: data = "col_1\n1\n2\n'A'\n4.22" In [22]: df = pd.read_csv(StringIO(data), converters={"col_1": str}) In [23]: df Out[23]: col_1 0 1 1 2 2 'A' 3 4.22 In [24]: df["col_1"].apply(type).value_counts() Out[24]: <class 'str'> 4 Name: col_1, dtype: int6439, In [21]: data = "col_1\n1\n2\n'A'\n4.22" In [22]: df = pd.read_csv(StringIO(data), converters={"col_1": str}) In [23]: df Out[23]: col_1 0 1 1 2 2 'A' 3 4.22 In [24]: df["col_1"].apply(type).value_counts() Out[24]: <class 'str'> 4 Name: col_1, dtype: int6437, In [21]: data = "col_1\n1\n2\n'A'\n4.22" In [22]: df = pd.read_csv(StringIO(data), converters={"col_1": str}) In [23]: df Out[23]: col_1 0 1 1 2 2 'A' 3 4.22 In [24]: df["col_1"].apply(type).value_counts() Out[24]: <class 'str'> 4 Name: col_1, dtype: int6438, In [21]: data = "col_1\n1\n2\n'A'\n4.22" In [22]: df = pd.read_csv(StringIO(data), converters={"col_1": str}) In [23]: df Out[23]: col_1 0 1 1 2 2 'A' 3 4.22 In [24]: df["col_1"].apply(type).value_counts() Out[24]: <class 'str'> 4 Name: col_1, dtype: int6441} and other key-value pairs are forwarded to In [21]: data = "col_1\n1\n2\n'A'\n4.22" In [22]: df = pd.read_csv(StringIO(data), converters={"col_1": str}) In [23]: df Out[23]: col_1 0 1 1 2 2 'A' 3 4.22 In [24]: df["col_1"].apply(type).value_counts() Out[24]: <class 'str'> 4 Name: col_1, dtype: int6452, In [21]: data = "col_1\n1\n2\n'A'\n4.22" In [22]: df = pd.read_csv(StringIO(data), converters={"col_1": str}) In [23]: df Out[23]: col_1 0 1 1 2 2 'A' 3 4.22 In [24]: df["col_1"].apply(type).value_counts() Out[24]: <class 'str'> 4 Name: col_1, dtype: int6453, In [21]: data = "col_1\n1\n2\n'A'\n4.22" In [22]: df = pd.read_csv(StringIO(data), converters={"col_1": str}) In [23]: df Out[23]: col_1 0 1 1 2 2 'A' 3 4.22 In [24]: df["col_1"].apply(type).value_counts() Out[24]: <class 'str'> 4 Name: col_1, dtype: int6454, or In [21]: data = "col_1\n1\n2\n'A'\n4.22" In [22]: df = pd.read_csv(StringIO(data), converters={"col_1": str}) In [23]: df Out[23]: col_1 0 1 1 2 2 'A' 3 4.22 In [24]: df["col_1"].apply(type).value_counts() Out[24]: <class 'str'> 4 Name: col_1, dtype: int6455. As an example, the following could be passed for faster compression and to create a reproducible gzip archive. In [21]: data = "col_1\n1\n2\n'A'\n4.22" In [22]: df = pd.read_csv(StringIO(data), converters={"col_1": str}) In [23]: df Out[23]: col_1 0 1 1 2 2 'A' 3 4.22 In [24]: df["col_1"].apply(type).value_counts() Out[24]: <class 'str'> 4 Name: col_1, dtype: int6456 Changed in version 1. 1. 0. dict option extended to support In [21]: data = "col_1\n1\n2\n'A'\n4.22" In [22]: df = pd.read_csv(StringIO(data), converters={"col_1": str}) In [23]: df Out[23]: col_1 0 1 1 2 2 'A' 3 4.22 In [24]: df["col_1"].apply(type).value_counts() Out[24]: <class 'str'> 4 Name: col_1, dtype: int6457 and In [21]: data = "col_1\n1\n2\n'A'\n4.22" In [22]: df = pd.read_csv(StringIO(data), converters={"col_1": str}) In [23]: df Out[23]: col_1 0 1 1 2 2 'A' 3 4.22 In [24]: df["col_1"].apply(type).value_counts() Out[24]: <class 'str'> 4 Name: col_1, dtype: int6458. Changed in version 1. 2. 0. Previous versions forwarded dict entries for ‘gzip’ to In [21]: data = "col_1\n1\n2\n'A'\n4.22" In [22]: df = pd.read_csv(StringIO(data), converters={"col_1": str}) In [23]: df Out[23]: col_1 0 1 1 2 2 'A' 3 4.22 In [24]: df["col_1"].apply(type).value_counts() Out[24]: <class 'str'> 4 Name: col_1, dtype: int6459. thousands str, default In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object24 Thousands separator decimal str, defaultIn [21]: data = "col_1\n1\n2\n'A'\n4.22" In [22]: df = pd.read_csv(StringIO(data), converters={"col_1": str}) In [23]: df Out[23]: col_1 0 1 1 2 2 'A' 3 4.22 In [24]: df["col_1"].apply(type).value_counts() Out[24]: <class 'str'> 4 Name: col_1, dtype: int6461 Character to recognize as decimal point. E. g. use In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object20 for European datafloat_precision string, default None Specifies which converter the C engine should use for floating-point values. The options are In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object24 for the ordinary converter, In [21]: data = "col_1\n1\n2\n'A'\n4.22" In [22]: df = pd.read_csv(StringIO(data), converters={"col_1": str}) In [23]: df Out[23]: col_1 0 1 1 2 2 'A' 3 4.22 In [24]: df["col_1"].apply(type).value_counts() Out[24]: <class 'str'> 4 Name: col_1, dtype: int6464 for the high-precision converter, and In [21]: data = "col_1\n1\n2\n'A'\n4.22" In [22]: df = pd.read_csv(StringIO(data), converters={"col_1": str}) In [23]: df Out[23]: col_1 0 1 1 2 2 'A' 3 4.22 In [24]: df["col_1"].apply(type).value_counts() Out[24]: <class 'str'> 4 Name: col_1, dtype: int6465 for the round-trip converterlineterminator str (length 1), default In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object24 Character to break file into lines. Only valid with C parser quotechar str (length 1)The character used to denote the start and end of a quoted item. Quoted items can include the delimiter and it will be ignored quoting int orIn [21]: data = "col_1\n1\n2\n'A'\n4.22" In [22]: df = pd.read_csv(StringIO(data), converters={"col_1": str}) In [23]: df Out[23]: col_1 0 1 1 2 2 'A' 3 4.22 In [24]: df["col_1"].apply(type).value_counts() Out[24]: <class 'str'> 4 Name: col_1, dtype: int6467 instance, default In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object84 Control field quoting behavior per In [21]: data = "col_1\n1\n2\n'A'\n4.22" In [22]: df = pd.read_csv(StringIO(data), converters={"col_1": str}) In [23]: df Out[23]: col_1 0 1 1 2 2 'A' 3 4.22 In [24]: df["col_1"].apply(type).value_counts() Out[24]: <class 'str'> 4 Name: col_1, dtype: int6467 constants. Use one of In [21]: data = "col_1\n1\n2\n'A'\n4.22" In [22]: df = pd.read_csv(StringIO(data), converters={"col_1": str}) In [23]: df Out[23]: col_1 0 1 1 2 2 'A' 3 4.22 In [24]: df["col_1"].apply(type).value_counts() Out[24]: <class 'str'> 4 Name: col_1, dtype: int6470 (0), In [21]: data = "col_1\n1\n2\n'A'\n4.22" In [22]: df = pd.read_csv(StringIO(data), converters={"col_1": str}) In [23]: df Out[23]: col_1 0 1 1 2 2 'A' 3 4.22 In [24]: df["col_1"].apply(type).value_counts() Out[24]: <class 'str'> 4 Name: col_1, dtype: int6471 (1), In [21]: data = "col_1\n1\n2\n'A'\n4.22" In [22]: df = pd.read_csv(StringIO(data), converters={"col_1": str}) In [23]: df Out[23]: col_1 0 1 1 2 2 'A' 3 4.22 In [24]: df["col_1"].apply(type).value_counts() Out[24]: <class 'str'> 4 Name: col_1, dtype: int6472 (2) or In [21]: data = "col_1\n1\n2\n'A'\n4.22" In [22]: df = pd.read_csv(StringIO(data), converters={"col_1": str}) In [23]: df Out[23]: col_1 0 1 1 2 2 'A' 3 4.22 In [24]: df["col_1"].apply(type).value_counts() Out[24]: <class 'str'> 4 Name: col_1, dtype: int6473 (3)doublequote boolean, default In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object32 Ketika In [21]: data = "col_1\n1\n2\n'A'\n4.22" In [22]: df = pd.read_csv(StringIO(data), converters={"col_1": str}) In [23]: df Out[23]: col_1 0 1 1 2 2 'A' 3 4.22 In [24]: df["col_1"].apply(type).value_counts() Out[24]: <class 'str'> 4 Name: col_1, dtype: int64_75 ditentukan dan In [21]: data = "col_1\n1\n2\n'A'\n4.22" In [22]: df = pd.read_csv(StringIO(data), converters={"col_1": str}) In [23]: df Out[23]: col_1 0 1 1 2 2 'A' 3 4.22 In [24]: df["col_1"].apply(type).value_counts() Out[24]: <class 'str'> 4 Name: col_1, dtype: int6476 bukan In [21]: data = "col_1\n1\n2\n'A'\n4.22" In [22]: df = pd.read_csv(StringIO(data), converters={"col_1": str}) In [23]: df Out[23]: col_1 0 1 1 2 2 'A' 3 4.22 In [24]: df["col_1"].apply(type).value_counts() Out[24]: <class 'str'> 4 Name: col_1, dtype: int6473, tunjukkan apakah akan menginterpretasikan dua elemen berurutan In [21]: data = "col_1\n1\n2\n'A'\n4.22" In [22]: df = pd.read_csv(StringIO(data), converters={"col_1": str}) In [23]: df Out[23]: col_1 0 1 1 2 2 'A' 3 4.22 In [24]: df["col_1"].apply(type).value_counts() Out[24]: <class 'str'> 4 Name: col_1, dtype: int6475 di dalam bidang sebagai satu elemen In [21]: data = "col_1\n1\n2\n'A'\n4.22" In [22]: df = pd.read_csv(StringIO(data), converters={"col_1": str}) In [23]: df Out[23]: col_1 0 1 1 2 2 'A' 3 4.22 In [24]: df["col_1"].apply(type).value_counts() Out[24]: <class 'str'> 4 Name: col_1, dtype: int6475escapechar str (length 1), default In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object24 One-character string used to escape delimiter when quoting is In [21]: data = "col_1\n1\n2\n'A'\n4.22" In [22]: df = pd.read_csv(StringIO(data), converters={"col_1": str}) In [23]: df Out[23]: col_1 0 1 1 2 2 'A' 3 4.22 In [24]: df["col_1"].apply(type).value_counts() Out[24]: <class 'str'> 4 Name: col_1, dtype: int6473comment str, default In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object24 Indicates remainder of line should not be parsed. If found at the beginning of a line, the line will be ignored altogether. This parameter must be a single character. Like empty lines (as long as In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object39), fully commented lines are ignored by the parameter In [21]: data = "col_1\n1\n2\n'A'\n4.22" In [22]: df = pd.read_csv(StringIO(data), converters={"col_1": str}) In [23]: df Out[23]: col_1 0 1 1 2 2 'A' 3 4.22 In [24]: df["col_1"].apply(type).value_counts() Out[24]: <class 'str'> 4 Name: col_1, dtype: int6484 but not by In [21]: data = "col_1\n1\n2\n'A'\n4.22" In [22]: df = pd.read_csv(StringIO(data), converters={"col_1": str}) In [23]: df Out[23]: col_1 0 1 1 2 2 'A' 3 4.22 In [24]: df["col_1"].apply(type).value_counts() Out[24]: <class 'str'> 4 Name: col_1, dtype: int6485. For example, if In [21]: data = "col_1\n1\n2\n'A'\n4.22" In [22]: df = pd.read_csv(StringIO(data), converters={"col_1": str}) In [23]: df Out[23]: col_1 0 1 1 2 2 'A' 3 4.22 In [24]: df["col_1"].apply(type).value_counts() Out[24]: <class 'str'> 4 Name: col_1, dtype: int6486, parsing ‘#empty\na,b,c\n1,2,3’ with In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object35 will result in ‘a,b,c’ being treated as the headerencoding str, default In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object24 Encoding to use for UTF when reading/writing (e. g. In [21]: data = "col_1\n1\n2\n'A'\n4.22" In [22]: df = pd.read_csv(StringIO(data), converters={"col_1": str}) In [23]: df Out[23]: col_1 0 1 1 2 2 'A' 3 4.22 In [24]: df["col_1"].apply(type).value_counts() Out[24]: <class 'str'> 4 Name: col_1, dtype: int6489). dialect str or instance, default In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object24 If provided, this parameter will override values (default or not) for the following parameters. In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object33, In [21]: data = "col_1\n1\n2\n'A'\n4.22" In [22]: df = pd.read_csv(StringIO(data), converters={"col_1": str}) In [23]: df Out[23]: col_1 0 1 1 2 2 'A' 3 4.22 In [24]: df["col_1"].apply(type).value_counts() Out[24]: <class 'str'> 4 Name: col_1, dtype: int6493, In [21]: data = "col_1\n1\n2\n'A'\n4.22" In [22]: df = pd.read_csv(StringIO(data), converters={"col_1": str}) In [23]: df Out[23]: col_1 0 1 1 2 2 'A' 3 4.22 In [24]: df["col_1"].apply(type).value_counts() Out[24]: <class 'str'> 4 Name: col_1, dtype: int6494, In [21]: data = "col_1\n1\n2\n'A'\n4.22" In [22]: df = pd.read_csv(StringIO(data), converters={"col_1": str}) In [23]: df Out[23]: col_1 0 1 1 2 2 'A' 3 4.22 In [24]: df["col_1"].apply(type).value_counts() Out[24]: <class 'str'> 4 Name: col_1, dtype: int6495, In [21]: data = "col_1\n1\n2\n'A'\n4.22" In [22]: df = pd.read_csv(StringIO(data), converters={"col_1": str}) In [23]: df Out[23]: col_1 0 1 1 2 2 'A' 3 4.22 In [24]: df["col_1"].apply(type).value_counts() Out[24]: <class 'str'> 4 Name: col_1, dtype: int6475, and In [21]: data = "col_1\n1\n2\n'A'\n4.22" In [22]: df = pd.read_csv(StringIO(data), converters={"col_1": str}) In [23]: df Out[23]: col_1 0 1 1 2 2 'A' 3 4.22 In [24]: df["col_1"].apply(type).value_counts() Out[24]: <class 'str'> 4 Name: col_1, dtype: int6476. If it is necessary to override values, a ParserWarning will be issued. Lihat dokumentasi untuk detail lebih lanjut Error handlingerror_bad_lines boolean, optional, defaultIn [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object24 Lines with too many fields (e. g. a csv line with too many commas) will by default cause an exception to be raised, and no In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object43 will be returned. If In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object61, then these “bad lines” will dropped from the In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object43 that is returned. See below Deprecated since version 1. 3. 0. The In [25]: df2 = pd.read_csv(StringIO(data)) In [26]: df2["col_1"] = pd.to_numeric(df2["col_1"], errors="coerce") In [27]: df2 Out[27]: col_1 0 1.00 1 2.00 2 NaN 3 4.22 In [28]: df2["col_1"].apply(type).value_counts() Out[28]: <class 'float'> 4 Name: col_1, dtype: int6403 parameter should be used instead to specify behavior upon encountering a bad line instead. warn_bad_lines boolean, optional, default In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object24 If error_bad_lines is In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object61, and warn_bad_lines is In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object32, a warning for each “bad line” will be output Deprecated since version 1. 3. 0. The In [25]: df2 = pd.read_csv(StringIO(data)) In [26]: df2["col_1"] = pd.to_numeric(df2["col_1"], errors="coerce") In [27]: df2 Out[27]: col_1 0 1.00 1 2.00 2 NaN 3 4.22 In [28]: df2["col_1"].apply(type).value_counts() Out[28]: <class 'float'> 4 Name: col_1, dtype: int6403 parameter should be used instead to specify behavior upon encountering a bad line instead. on_bad_lines (‘error’, ‘warn’, ‘skip’), default ‘error’ Specifies what to do upon encountering a bad line (a line with too many fields). Allowed values are
New in version 1. 3. 0 Specifying column data typesYou can indicate the data type for the whole In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object43 or individual columns In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object Fortunately, pandas offers more than one way to ensure that your column(s) contain only one In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object88. If you’re unfamiliar with these concepts, you can see to learn more about dtypes, and to learn more about In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object72 conversion in pandas For instance, you can use the In [25]: df2 = pd.read_csv(StringIO(data)) In [26]: df2["col_1"] = pd.to_numeric(df2["col_1"], errors="coerce") In [27]: df2 Out[27]: col_1 0 1.00 1 2.00 2 NaN 3 4.22 In [28]: df2["col_1"].apply(type).value_counts() Out[28]: <class 'float'> 4 Name: col_1, dtype: int6411 argument of In [21]: data = "col_1\n1\n2\n'A'\n4.22" In [22]: df = pd.read_csv(StringIO(data), converters={"col_1": str}) In [23]: df Out[23]: col_1 0 1 1 2 2 'A' 3 4.22 In [24]: df["col_1"].apply(type).value_counts() Out[24]: <class 'str'> 4 Name: col_1, dtype: int64 Or you can use the function to coerce the dtypes after reading in the data, In [25]: df2 = pd.read_csv(StringIO(data)) In [26]: df2["col_1"] = pd.to_numeric(df2["col_1"], errors="coerce") In [27]: df2 Out[27]: col_1 0 1.00 1 2.00 2 NaN 3 4.22 In [28]: df2["col_1"].apply(type).value_counts() Out[28]: <class 'float'> 4 Name: col_1, dtype: int64 which will convert all valid parsing to floats, leaving the invalid parsing as In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object46 Ultimately, how you deal with reading in columns containing mixed dtypes depends on your specific needs. In the case above, if you wanted to In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object46 out the data anomalies, then is probably your best option. However, if you wanted for all the data to be coerced, no matter the type, then using the In [25]: df2 = pd.read_csv(StringIO(data)) In [26]: df2["col_1"] = pd.to_numeric(df2["col_1"], errors="coerce") In [27]: df2 Out[27]: col_1 0 1.00 1 2.00 2 NaN 3 4.22 In [28]: df2["col_1"].apply(type).value_counts() Out[28]: <class 'float'> 4 Name: col_1, dtype: int6411 argument of would certainly be worth trying Catatan In some cases, reading in abnormal data with columns containing mixed dtypes will result in an inconsistent dataset. If you rely on pandas to infer the dtypes of your columns, the parsing engine will go and infer the dtypes for different chunks of the data, rather than the whole dataset at once. Consequently, you can end up with column(s) with mixed dtypes. For example, In [29]: col_1 = list(range(500000)) + ["a", "b"] + list(range(500000)) In [30]: df = pd.DataFrame({"col_1": col_1}) In [31]: df.to_csv("foo.csv") In [32]: mixed_df = pd.read_csv("foo.csv") In [33]: mixed_df["col_1"].apply(type).value_counts() Out[33]: <class 'int'> 737858 <class 'str'> 262144 Name: col_1, dtype: int64 In [34]: mixed_df["col_1"].dtype Out[34]: dtype('O') will result with In [25]: df2 = pd.read_csv(StringIO(data)) In [26]: df2["col_1"] = pd.to_numeric(df2["col_1"], errors="coerce") In [27]: df2 Out[27]: col_1 0 1.00 1 2.00 2 NaN 3 4.22 In [28]: df2["col_1"].apply(type).value_counts() Out[28]: <class 'float'> 4 Name: col_1, dtype: int6419 containing an In [25]: df2 = pd.read_csv(StringIO(data)) In [26]: df2["col_1"] = pd.to_numeric(df2["col_1"], errors="coerce") In [27]: df2 Out[27]: col_1 0 1.00 1 2.00 2 NaN 3 4.22 In [28]: df2["col_1"].apply(type).value_counts() Out[28]: <class 'float'> 4 Name: col_1, dtype: int6420 dtype for certain chunks of the column, and In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object15 for others due to the mixed dtypes from the data that was read in. It is important to note that the overall column will be marked with a In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object88 of In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object72, which is used for columns with mixed dtypes Specifying categorical dtypeIn [25]: df2 = pd.read_csv(StringIO(data)) In [26]: df2["col_1"] = pd.to_numeric(df2["col_1"], errors="coerce") In [27]: df2 Out[27]: col_1 0 1.00 1 2.00 2 NaN 3 4.22 In [28]: df2["col_1"].apply(type).value_counts() Out[28]: <class 'float'> 4 Name: col_1, dtype: int6424 columns can be parsed directly by specifying In [25]: df2 = pd.read_csv(StringIO(data)) In [26]: df2["col_1"] = pd.to_numeric(df2["col_1"], errors="coerce") In [27]: df2 Out[27]: col_1 0 1.00 1 2.00 2 NaN 3 4.22 In [28]: df2["col_1"].apply(type).value_counts() Out[28]: <class 'float'> 4 Name: col_1, dtype: int6425 or In [25]: df2 = pd.read_csv(StringIO(data)) In [26]: df2["col_1"] = pd.to_numeric(df2["col_1"], errors="coerce") In [27]: df2 Out[27]: col_1 0 1.00 1 2.00 2 NaN 3 4.22 In [28]: df2["col_1"].apply(type).value_counts() Out[28]: <class 'float'> 4 Name: col_1, dtype: int6426 In [35]: data = "col1,col2,col3\na,b,1\na,b,2\nc,d,3" In [36]: pd.read_csv(StringIO(data)) Out[36]: col1 col2 col3 0 a b 1 1 a b 2 2 c d 3 In [37]: pd.read_csv(StringIO(data)).dtypes Out[37]: col1 object col2 object col3 int64 dtype: object In [38]: pd.read_csv(StringIO(data), dtype="category").dtypes Out[38]: col1 category col2 category col3 category dtype: object Individual columns can be parsed as a In [25]: df2 = pd.read_csv(StringIO(data)) In [26]: df2["col_1"] = pd.to_numeric(df2["col_1"], errors="coerce") In [27]: df2 Out[27]: col_1 0 1.00 1 2.00 2 NaN 3 4.22 In [28]: df2["col_1"].apply(type).value_counts() Out[28]: <class 'float'> 4 Name: col_1, dtype: int6424 using a dict specification In [39]: pd.read_csv(StringIO(data), dtype={"col1": "category"}).dtypes Out[39]: col1 category col2 object col3 int64 dtype: object Specifying In [25]: df2 = pd.read_csv(StringIO(data)) In [26]: df2["col_1"] = pd.to_numeric(df2["col_1"], errors="coerce") In [27]: df2 Out[27]: col_1 0 1.00 1 2.00 2 NaN 3 4.22 In [28]: df2["col_1"].apply(type).value_counts() Out[28]: <class 'float'> 4 Name: col_1, dtype: int6425 will result in an unordered In [25]: df2 = pd.read_csv(StringIO(data)) In [26]: df2["col_1"] = pd.to_numeric(df2["col_1"], errors="coerce") In [27]: df2 Out[27]: col_1 0 1.00 1 2.00 2 NaN 3 4.22 In [28]: df2["col_1"].apply(type).value_counts() Out[28]: <class 'float'> 4 Name: col_1, dtype: int6424 whose In [25]: df2 = pd.read_csv(StringIO(data)) In [26]: df2["col_1"] = pd.to_numeric(df2["col_1"], errors="coerce") In [27]: df2 Out[27]: col_1 0 1.00 1 2.00 2 NaN 3 4.22 In [28]: df2["col_1"].apply(type).value_counts() Out[28]: <class 'float'> 4 Name: col_1, dtype: int6430 are the unique values observed in the data. For more control on the categories and order, create a In [25]: df2 = pd.read_csv(StringIO(data)) In [26]: df2["col_1"] = pd.to_numeric(df2["col_1"], errors="coerce") In [27]: df2 Out[27]: col_1 0 1.00 1 2.00 2 NaN 3 4.22 In [28]: df2["col_1"].apply(type).value_counts() Out[28]: <class 'float'> 4 Name: col_1, dtype: int6431 ahead of time, and pass that for that column’s In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object88 In [40]: from pandas.api.types import CategoricalDtype In [41]: dtype = CategoricalDtype(["d", "c", "b", "a"], ordered=True) In [42]: pd.read_csv(StringIO(data), dtype={"col1": dtype}).dtypes Out[42]: col1 category col2 object col3 int64 dtype: object When using In [25]: df2 = pd.read_csv(StringIO(data)) In [26]: df2["col_1"] = pd.to_numeric(df2["col_1"], errors="coerce") In [27]: df2 Out[27]: col_1 0 1.00 1 2.00 2 NaN 3 4.22 In [28]: df2["col_1"].apply(type).value_counts() Out[28]: <class 'float'> 4 Name: col_1, dtype: int6433, “unexpected” values outside of In [25]: df2 = pd.read_csv(StringIO(data)) In [26]: df2["col_1"] = pd.to_numeric(df2["col_1"], errors="coerce") In [27]: df2 Out[27]: col_1 0 1.00 1 2.00 2 NaN 3 4.22 In [28]: df2["col_1"].apply(type).value_counts() Out[28]: <class 'float'> 4 Name: col_1, dtype: int6434 are treated as missing values In [6]: data = "col1,col2,col3\na,b,1" In [7]: df = pd.read_csv(StringIO(data)) In [8]: df.columns = [f"pre_{col}" for col in df.columns] In [9]: df Out[9]: pre_col1 pre_col2 pre_col3 0 a b 10 This matches the behavior of In [25]: df2 = pd.read_csv(StringIO(data)) In [26]: df2["col_1"] = pd.to_numeric(df2["col_1"], errors="coerce") In [27]: df2 Out[27]: col_1 0 1.00 1 2.00 2 NaN 3 4.22 In [28]: df2["col_1"].apply(type).value_counts() Out[28]: <class 'float'> 4 Name: col_1, dtype: int6435 Catatan Dengan In [25]: df2 = pd.read_csv(StringIO(data)) In [26]: df2["col_1"] = pd.to_numeric(df2["col_1"], errors="coerce") In [27]: df2 Out[27]: col_1 0 1.00 1 2.00 2 NaN 3 4.22 In [28]: df2["col_1"].apply(type).value_counts() Out[28]: <class 'float'> 4 Name: col_1, dtype: int6425, kategori yang dihasilkan akan selalu diuraikan sebagai string (tipe objek). If the categories are numeric they can be converted using the function, or as appropriate, another converter such as When In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object88 is a In [25]: df2 = pd.read_csv(StringIO(data)) In [26]: df2["col_1"] = pd.to_numeric(df2["col_1"], errors="coerce") In [27]: df2 Out[27]: col_1 0 1.00 1 2.00 2 NaN 3 4.22 In [28]: df2["col_1"].apply(type).value_counts() Out[28]: <class 'float'> 4 Name: col_1, dtype: int6431 with homogeneous In [25]: df2 = pd.read_csv(StringIO(data)) In [26]: df2["col_1"] = pd.to_numeric(df2["col_1"], errors="coerce") In [27]: df2 Out[27]: col_1 0 1.00 1 2.00 2 NaN 3 4.22 In [28]: df2["col_1"].apply(type).value_counts() Out[28]: <class 'float'> 4 Name: col_1, dtype: int6430 ( all numeric, all datetimes, etc. ), konversi dilakukan secara otomatis In [6]: data = "col1,col2,col3\na,b,1" In [7]: df = pd.read_csv(StringIO(data)) In [8]: df.columns = [f"pre_{col}" for col in df.columns] In [9]: df Out[9]: pre_col1 pre_col2 pre_col3 0 a b 11 Naming and using columnsHandling column namesA file may or may not have a header row. pandas assumes the first row should be used as the column names In [6]: data = "col1,col2,col3\na,b,1" In [7]: df = pd.read_csv(StringIO(data)) In [8]: df.columns = [f"pre_{col}" for col in df.columns] In [9]: df Out[9]: pre_col1 pre_col2 pre_col3 0 a b 12 By specifying the In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object49 argument in conjunction with In [21]: data = "col_1\n1\n2\n'A'\n4.22" In [22]: df = pd.read_csv(StringIO(data), converters={"col_1": str}) In [23]: df Out[23]: col_1 0 1 1 2 2 'A' 3 4.22 In [24]: df["col_1"].apply(type).value_counts() Out[24]: <class 'str'> 4 Name: col_1, dtype: int6484 you can indicate other names to use and whether or not to throw away the header row (if any) In [6]: data = "col1,col2,col3\na,b,1" In [7]: df = pd.read_csv(StringIO(data)) In [8]: df.columns = [f"pre_{col}" for col in df.columns] In [9]: df Out[9]: pre_col1 pre_col2 pre_col3 0 a b 13 Jika tajuk berada di baris selain yang pertama, berikan nomor baris ke In [21]: data = "col_1\n1\n2\n'A'\n4.22" In [22]: df = pd.read_csv(StringIO(data), converters={"col_1": str}) In [23]: df Out[23]: col_1 0 1 1 2 2 'A' 3 4.22 In [24]: df["col_1"].apply(type).value_counts() Out[24]: <class 'str'> 4 Name: col_1, dtype: int6484. This will skip the preceding rows In [6]: data = "col1,col2,col3\na,b,1" In [7]: df = pd.read_csv(StringIO(data)) In [8]: df.columns = [f"pre_{col}" for col in df.columns] In [9]: df Out[9]: pre_col1 pre_col2 pre_col3 0 a b 14 Catatan Default behavior is to infer the column names. if no names are passed the behavior is identical to In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object35 and column names are inferred from the first non-blank line of the file, if column names are passed explicitly then the behavior is identical to In [13]: import numpy as np In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11" In [15]: print(data) a,b,c,d 1,2,3,4 5,6,7,8 9,10,11 In [16]: df = pd.read_csv(StringIO(data), dtype=object) In [17]: df Out[17]: a b c d 0 1 2 3 4 1 5 6 7 8 2 9 10 11 NaN In [18]: df["a"][0] Out[18]: '1' In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"}) In [20]: df.dtypes Out[20]: a int64 b object c float64 d Int64 dtype: object36 Duplicate names parsing
Jika file atau header berisi nama duplikat, panda secara default akan membedakannya untuk mencegah penimpaan data In [6]: data = "col1,col2,col3\na,b,1" In [7]: df = pd.read_csv(StringIO(data)) In [8]: df.columns = [f"pre_{col}" for col in df.columns] In [9]: df Out[9]: pre_col1 pre_col2 pre_col3 0 a b 1_5 Tidak ada lagi data duplikat karena In [25]: df2 = pd.read_csv(StringIO(data)) In [26]: df2["col_1"] = pd.to_numeric(df2["col_1"], errors="coerce") In [27]: df2 Out[27]: col_1 0 1.00 1 2.00 2 NaN 3 4.22 In [28]: df2["col_1"].apply(type).value_counts() Out[28]: <class 'float'> 4 Name: col_1, dtype: int6448 secara default, yang mengubah serangkaian kolom duplikat 'X', ..., 'X' menjadi 'X', 'X. 1’, …, ‘X. N' Memfilter kolom ( |