Binary files in Python (w3schools)


Python: Tips of the Day

Decomposing a collection

Assume we have a function that returns a tuple of two values and we want to assign each value to a separate variable. One way is to use indexing, as shown below:

abc = (5, 10)
x = abc[0]
y = abc[1]
print(x, y)

Output

5 10

There is a better option that allows us to do the same operation in one line

x, y = abc
print(x, y)

Output

5 10

This can be extended to a tuple with more than two values, or to other data structures such as lists or sets.
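The same unpacking syntax can be sketched for longer tuples and for other collections. The variable names below are illustrative only and do not come from the original tip.

```python
# Unpack a tuple with more than two values.
point = (1, 2, 3)
x, y, z = point

# Lists unpack the same way.
first, second = [10, 20]

# Star-unpacking collects the remaining items into a list.
head, *rest = [1, 2, 3, 4]

# Sets can be unpacked too, but they are unordered, so sort them first
# whenever the assignment order matters.
low, high = sorted({7, 3})

print(x, y, z)
print(first, second)
print(head, rest)
print(low, high)
```

The number of names on the left must match the number of items on the right unless a starred name is used to absorb the remainder.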

The pandas I/O API is a set of top-level reader functions accessed like pandas.read_csv() that generally return a pandas object. The corresponding writer functions are object methods that are accessed like DataFrame.to_csv(). Below is a table containing available readers and writers.
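As a minimal sketch of the reader/writer split described above, the following round trip writes a small DataFrame to CSV text with a writer method and reads it back with a top-level reader function. The sample data is made up for illustration.

```python
from io import StringIO

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

# Writer: a DataFrame method. With no path argument, to_csv returns the CSV text.
csv_text = df.to_csv(index=False)

# Reader: a top-level function that returns a pandas object.
round_trip = pd.read_csv(StringIO(csv_text))

print(round_trip)
```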

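Format Type   Data Description        Reader           Writer
text          CSV                     read_csv         to_csv
text          Fixed-Width Text File   read_fwf
text          JSON                    read_json        to_json
text          HTML                    read_html        to_html
text          LaTeX                                    Styler.to_latex
text          XML                     read_xml         to_xml
text          Local clipboard         read_clipboard   to_clipboard
binary        MS Excel                read_excel       to_excel
binary        OpenDocument            read_excel
binary        HDF5 Format             read_hdf         to_hdf
binary        Feather Format          read_feather     to_feather
binary        Parquet Format          read_parquet     to_parquet
binary        ORC Format              read_orc         to_orc
binary        Stata                   read_stata       to_stata
binary        SAS                     read_sas
binary        SPSS                    read_spss
binary        Python Pickle Format    read_pickle      to_pickle
SQL           SQL                     read_sql         to_sql
SQL           Google BigQuery         read_gbq         to_gbq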

The pandas documentation also includes an informal performance comparison for some of these IO methods.

Note

For examples that use the StringIO class, make sure you import it with from io import StringIO for Python 3.

CSV & text files

The workhorse function for reading text files (a.k.a. flat files) is read_csv(). See the cookbook for some advanced strategies.

Parsing options

read_csv() accepts the following common arguments:

Basic

filepath_or_buffer : various

Either a path to a file (a str, pathlib.Path, or py._path.local.LocalPath), a URL (including http, ftp, and S3 locations), or any object with a read() method (such as an open file or StringIO).
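The input kinds listed for filepath_or_buffer can be sketched as follows. This is an illustration under stated assumptions: the file name example.csv and the temporary directory are hypothetical, and URL input is left out so the snippet runs offline.

```python
import tempfile
from io import StringIO
from pathlib import Path

import pandas as pd

csv_text = "a,b\n1,2\n3,4\n"

# 1) Any object with a read() method, such as StringIO.
from_buffer = pd.read_csv(StringIO(csv_text))

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "example.csv"  # hypothetical file name
    path.write_text(csv_text)

    # 2) A path on disk, given here as a pathlib.Path.
    from_path = pd.read_csv(path)

    # 3) An already-open file handle.
    with open(path, newline="") as handle:
        from_handle = pd.read_csv(handle)

print(from_buffer.equals(from_path), from_buffer.equals(from_handle))
```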

sep : str, defaults to ',' for read_csv(), \t for read_table()

Delimiter to use. If sep is None, the C engine cannot automatically detect the separator, but the Python parsing engine can, meaning the latter will be used and will automatically detect the separator with Python's built-in sniffer tool, csv.Sniffer. In addition, separators longer than 1 character and different from '\s+' will be interpreted as regular expressions and will also force the use of the Python parsing engine. Note that regex delimiters are prone to ignoring quoted data. Regex example: '\\r\\t'.
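A short sketch of the three separator behaviours described for sep: an explicit single character, automatic detection with sep=None, and a multi-character separator treated as a regular expression. The sample strings are invented for illustration, and engine="python" is passed explicitly where the Python parser is required.

```python
from io import StringIO

import pandas as pd

semi = "a;b;c\n1;2;3\n4;5;6"

# Explicit single-character delimiter (handled by the default C engine).
explicit = pd.read_csv(StringIO(semi), sep=";")

# sep=None asks the Python engine to detect the delimiter with csv.Sniffer.
sniffed = pd.read_csv(StringIO(semi), sep=None, engine="python")

# A separator longer than one character is interpreted as a regular expression.
multi = "a::b::c\n1::2::3"
regex = pd.read_csv(StringIO(multi), sep="::", engine="python")

print(explicit.equals(sniffed))
print(regex)
```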

delimiter : str, default None

Alternative argument name for sep.

delim_whitespace : boolean, default False

Specifies whether or not whitespace (e.g. ' ' or '\t') will be used as the delimiter. Equivalent to setting sep='\s+'. If this option is set to True, nothing should be passed in for the delimiter parameter.
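Whitespace-delimited input can be sketched with the sep spelling that the text gives as equivalent to delim_whitespace=True. The sample text, which mixes runs of spaces and tabs, is made up for illustration.

```python
from io import StringIO

import pandas as pd

text = "a b   c\n1 2   3\n4\t5\t6"

# sep=r"\s+" splits each line on any run of spaces or tabs.
df = pd.read_csv(StringIO(text), sep=r"\s+")

print(df)
```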

Column and index locations and names

header : int or list of ints, default 'infer'

Row number(s) to use as the column names, and the start of the data. Default behavior is to infer the column names: if no names are passed, the behavior is identical to header=0 and column names are inferred from the first line of the file; if column names are passed explicitly, the behavior is identical to header=None. Explicitly pass header=0 to be able to replace existing names.

The header can be a list of ints that specify row locations for a MultiIndex on the columns, e.g. [0,1,3]. Intervening rows that are not specified will be skipped (e.g. 2 in this example is skipped). Note that this parameter ignores commented lines and empty lines if skip_blank_lines=True, so header=0 denotes the first line of data rather than the first line of the file.
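The header behaviours described above can be compared side by side in a small sketch. The data and the replacement names x, y, z are illustrative assumptions.

```python
from io import StringIO

import pandas as pd

data = "a,b,c\n1,2,3\n4,5,6"

# Default (header="infer", no names): the first line supplies the column names.
inferred = pd.read_csv(StringIO(data))

# header=0 plus names: the first line is consumed as a header and then replaced.
renamed = pd.read_csv(StringIO(data), header=0, names=["x", "y", "z"])

# header=None: every line is treated as data and columns are numbered 0, 1, 2.
no_header = pd.read_csv(StringIO(data), header=None)

print(inferred.columns.tolist())
print(renamed.columns.tolist(), renamed.shape)
print(no_header.columns.tolist(), no_header.shape)
```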

names : array-like, default None

List of column names to use. If the file contains no header row, then you should explicitly pass header=None. Duplicates in this list are not allowed.

index_col : int, str, sequence of int / str, or False, optional, default None

Column(s) to use as the row labels of the DataFrame, either given as string name or column index. If a sequence of int / str is given, a MultiIndex is used.
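A brief sketch of index_col given as a name, as a position, and as a sequence. The year/city/value data is invented for illustration.

```python
from io import StringIO

import pandas as pd

data = "year,city,value\n2020,Oslo,1\n2020,Rome,2\n2021,Oslo,3"

# A single column, selected by name or by position.
by_name = pd.read_csv(StringIO(data), index_col="year")
by_position = pd.read_csv(StringIO(data), index_col=0)

# A sequence of columns produces a MultiIndex on the rows.
multi = pd.read_csv(StringIO(data), index_col=["year", "city"])

print(by_name.index.name)
print(type(multi.index).__name__, list(multi.index.names))
```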

Note

index_col=False can be used to force pandas to not use the first column as the index, e.g. when you have a malformed file with delimiters at the end of each line.

The default value of None instructs pandas to guess. If the number of fields in the column header row is equal to the number of fields in the body of the data file, then a default index is used. If it is larger, then the first columns are used as index so that the remaining number of fields in the body are equal to the number of fields in the header.

The first row after the header is used to determine the number of columns, which will go into the index. If the subsequent rows contain fewer columns than the first row, they are filled with NaN.

This can be avoided through usecols. This ensures that the columns are taken as is and the trailing data are ignored.
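The malformed-file case from the note can be sketched as follows, using a made-up sample in which every data line ends with a trailing delimiter. Warnings are silenced here on the assumption that some pandas versions emit a ParserWarning for this input.

```python
import warnings
from io import StringIO

import pandas as pd

# Each data line has a trailing comma, so it has one more field than the header.
data = "a,b,c\n4,apple,bat,\n8,orange,cow,"

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # assumption: a ParserWarning may be raised

    # Default index_col=None: pandas guesses and uses the first column as the index.
    guessed = pd.read_csv(StringIO(data))

    # index_col=False: the first column stays a regular column.
    forced = pd.read_csv(StringIO(data), index_col=False)

print(guessed)
print(forced)
```

With the default, the values 4 and 8 become row labels and column c is filled with NaN; with index_col=False they stay in column a and the trailing empty field is dropped.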

usecols : list-like or callable, default None

Return a subset of the columns. If list-like, all elements must either be positional (i.e. integer indices into the document columns) or strings that correspond to column names provided either by the user in names or inferred from the document header row(s). If names are given, the document header row(s) are not taken into account. For example, a valid list-like usecols parameter would be [0, 1, 2] or ['foo', 'bar', 'baz'].

Element order is ignored, so usecols=[0, 1] is the same as [1, 0]. To instantiate a DataFrame from data with element order preserved, use pd.read_csv(data, usecols=['foo', 'bar'])[['foo', 'bar']] for columns in ['foo', 'bar'] order or pd.read_csv(data, usecols=['foo', 'bar'])[['bar', 'foo']] for ['bar', 'foo'] order.
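The ordering rule for list-like usecols can be checked with a small sketch. The foo/bar/baz data is illustrative.

```python
from io import StringIO

import pandas as pd

data = "foo,bar,baz\n1,2,3\n4,5,6"

# Element order in usecols is ignored: columns come back in file order.
file_order = pd.read_csv(StringIO(data), usecols=["bar", "foo"])

# To impose an order, reindex the columns after reading.
reordered = pd.read_csv(StringIO(data), usecols=["foo", "bar"])[["bar", "foo"]]

# Positional selection uses integer indices into the file's columns.
by_position = pd.read_csv(StringIO(data), usecols=[0, 2])

print(file_order.columns.tolist())
print(reordered.columns.tolist())
print(by_position.columns.tolist())
```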

If callable, the callable function will be evaluated against the column names, returning names where the callable function evaluates to True

In [1]: import pandas as pd

In [2]: from io import StringIO

In [3]: data = "col1,col2,col3\na,b,1\na,b,2\nc,d,3"

In [4]: pd.read_csv(StringIO(data))
Out[4]: 
  col1 col2  col3
0    a    b     1
1    a    b     2
2    c    d     3

In [5]: pd.read_csv(StringIO(data), usecols=lambda x: x.upper() in ["COL1", "COL3"])
Out[5]: 
  col1  col3
0    a     1
1    a     2
2    c     3

Using this parameter results in much faster parsing time and lower memory usage when using the C engine. The Python engine loads the data first before deciding which columns to drop.

squeeze boolean, default False

If the parsed data only contains one column then return a Series.

Deprecated since version 1.4.0: Append .squeeze("columns") to the call to the reader function (such as read_csv) to squeeze the data.
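The replacement for the deprecated squeeze option can be sketched as follows. The sample data and variable names are made up for illustration only.

```python
from io import StringIO

import pandas as pd

# Hypothetical single-column data, used only for illustration.
data = "value\n1\n2\n3"

# read_csv returns a one-column DataFrame; squeeze("columns") converts it to a Series.
series = pd.read_csv(StringIO(data)).squeeze("columns")
print(type(series).__name__)
```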

prefix str, default None

Prefix to add to column numbers when no header, e.g. ‘X’ for X0, X1, …

Deprecated since version 1.4.0: Use a list comprehension on the DataFrame’s columns after calling read_csv.

In [6]: data = "col1,col2,col3\na,b,1"

In [7]: df = pd.read_csv(StringIO(data))

In [8]: df.columns = [f"pre_{col}" for col in df.columns]

In [9]: df
Out[9]: 
  pre_col1 pre_col2  pre_col3
0        a        b         1

mangle_dupe_cols boolean, default True

Duplicate columns will be specified as ‘X’, ‘X.1’, …, ‘X.N’, rather than ‘X’, …, ‘X’. Passing in False will cause data to be overwritten if there are duplicate names in the columns.

Deprecated since version 1.5.0: The argument was never implemented, and a new argument where the renaming pattern can be specified will be added instead.
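A minimal sketch of the default renaming of duplicate column names, relying only on default read_csv behavior (the mangle_dupe_cols argument itself is not passed). The sample data is hypothetical.

```python
from io import StringIO

import pandas as pd

# Hypothetical data with a duplicated column name "x".
data = "x,x,y\n1,2,3"

df = pd.read_csv(StringIO(data))

# The second "x" is renamed so that no column is overwritten.
print(list(df.columns))  # ['x', 'x.1', 'y']
```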

General parsing configuration

dtype Type name or dict of column -> type, default None

Data type for data or columns. E.g. {'a': np.float64, 'b': np.int32, 'c': 'Int64'}. Use str or object together with suitable na_values settings to preserve and not interpret dtype. If converters are specified, they will be applied INSTEAD of dtype conversion.

New in version 1.5.0: Support for defaultdict was added. Specify a defaultdict as input where the default determines the dtype of the columns which are not explicitly listed.
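As a rough illustration of using dtype to preserve values rather than interpret them, the sketch below keeps leading zeros by reading one column as str. The column names and data are assumptions made for the example.

```python
from io import StringIO

import pandas as pd

# Hypothetical data: "code" has leading zeros that integer parsing would drop.
data = "code,amount\n007,1.5\n042,2.0"

# Only "code" is forced to str; "amount" is still inferred as a float column.
df = pd.read_csv(StringIO(data), dtype={"code": str})
print(df["code"].tolist())  # ['007', '042']
```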

engine {'c', 'python', 'pyarrow'}

Parser engine to use. The C and pyarrow engines are faster, while the python engine is currently more feature-complete. Multithreading is currently only supported by the pyarrow engine.

New in version 1.4.0: The “pyarrow” engine was added as an experimental engine, and some features are unsupported, or may not work correctly, with this engine.
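A small sketch of selecting the parser engine explicitly. It compares only the C and python engines, since pyarrow is an optional dependency that may not be installed; the data is hypothetical.

```python
from io import StringIO

import pandas as pd

# Hypothetical data, used only for illustration.
data = "a,b\n1,2\n3,4"

# The same input parsed by two different engines.
df_c = pd.read_csv(StringIO(data), engine="c")
df_python = pd.read_csv(StringIO(data), engine="python")

# For simple input like this, both engines produce the same frame.
print(df_c.equals(df_python))
```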

converters dict, default None

Dict of functions for converting values in certain columns. Keys can either be integers or column labels.
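A minimal sketch of a converter keyed by column label. The price format and the lambda are assumptions chosen for illustration.

```python
from io import StringIO

import pandas as pd

# Hypothetical data: prices carry a leading currency symbol.
data = "name,price\nwidget,$1.50\ngadget,$2.25"

# The converter receives each raw cell as a string and returns the parsed value.
df = pd.read_csv(
    StringIO(data),
    converters={"price": lambda cell: float(cell.lstrip("$"))},
)
print(df["price"].tolist())  # [1.5, 2.25]
```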

true_values list, default None

Values to consider as True.

false_values list, default None

Values to consider as False.
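The true_values and false_values options can be sketched together as below. The "yes"/"no" markers and column names are made up for the example.

```python
from io import StringIO

import pandas as pd

# Hypothetical data using "yes"/"no" as boolean markers.
data = "id,active\n1,yes\n2,no"

df = pd.read_csv(StringIO(data), true_values=["yes"], false_values=["no"])

# The column is parsed as booleans instead of strings.
print(df["active"].tolist())  # [True, False]
```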

skipinitialspace boolean, default False

Skip spaces after delimiter.
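A short sketch of the effect of skipinitialspace on both header names and values, using hypothetical data with a space after each comma.

```python
from io import StringIO

import pandas as pd

# Hypothetical data with a space after every delimiter.
data = "a, b\n1, x\n2, y"

raw = pd.read_csv(StringIO(data))
clean = pd.read_csv(StringIO(data), skipinitialspace=True)

print(list(raw.columns))    # ['a', ' b']
print(list(clean.columns))  # ['a', 'b']
```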

skiprows list-like or integer, default None

Line numbers to skip (0-indexed) or number of lines to skip (int) at the start of the file.

If callable, the callable function will be evaluated against the row indices, returning True if the row should be skipped and False otherwise:

In [10]: data = "col1,col2,col3\na,b,1\na,b,2\nc,d,3"

In [11]: pd.read_csv(StringIO(data))
Out[11]: 
  col1 col2  col3
0    a    b     1
1    a    b     2
2    c    d     3

In [12]: pd.read_csv(StringIO(data), skiprows=lambda x: x % 2 != 0)
Out[12]: 
  col1 col2  col3
0    a    b     2
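For comparison with the callable form, here is a sketch of the list form of skiprows, where the entries are 0-indexed line numbers of the file (line 0 is the header). The data is hypothetical.

```python
from io import StringIO

import pandas as pd

# Hypothetical data: line 0 is the header, lines 1-3 are data rows.
data = "col1,col2\na,1\nb,2\nc,3"

# Skip file line 1 (the first data row); the header on line 0 is kept.
df = pd.read_csv(StringIO(data), skiprows=[1])
print(df["col1"].tolist())  # ['b', 'c']
```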

skipfooter int, default 0

Number of lines at bottom of file to skip (unsupported with engine=’c’).
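A minimal sketch of dropping a trailing summary line. Because skipfooter is unsupported with the C engine, the python engine is requested explicitly; the data is made up.

```python
from io import StringIO

import pandas as pd

# Hypothetical data with a trailing "total" line that is not a real record.
data = "a,b\n1,2\n3,4\ntotal,6"

df = pd.read_csv(StringIO(data), skipfooter=1, engine="python")
print(len(df))  # 2
```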

nrows int, default None

Number of rows of file to read. Useful for reading pieces of large files.
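A brief sketch of reading only the first rows of a file with nrows, using hypothetical data.

```python
from io import StringIO

import pandas as pd

# Hypothetical data with four data rows.
data = "a,b\n1,2\n3,4\n5,6\n7,8"

# Only the first two data rows are parsed; the header is not counted.
head = pd.read_csv(StringIO(data), nrows=2)
print(head["a"].tolist())  # [1, 3]
```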

low_memory boolean, default True

Internally process the file in chunks, resulting in lower memory use while parsing, but possibly mixed type inference. To ensure no mixed types either set False, or specify the type with the dtype parameter. Note that the entire file is read into a single DataFrame regardless, use the chunksize or iterator parameter to return the data in chunks. (Only valid with C parser)
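To actually receive the data in pieces, the chunksize parameter mentioned above can be used. The sketch below is one possible pattern with hypothetical data: it fixes the dtype up front and iterates over the chunks.

```python
from io import StringIO

import pandas as pd

# Hypothetical data with five data rows.
data = "a,b\n1,2\n3,4\n5,6\n7,8\n9,10"

# With chunksize set, read_csv returns an iterator of DataFrames.
# Passing dtype avoids per-chunk type inference differences.
chunks = pd.read_csv(StringIO(data), chunksize=2, dtype={"a": "int64", "b": "int64"})
sizes = [len(chunk) for chunk in chunks]
print(sizes)  # [2, 2, 1]
```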

memory_map boolean, default False

If a filepath is provided for filepath_or_buffer, map the file object directly onto memory and access the data directly from there. Using this option can improve performance because there is no longer any I/O overhead.
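A minimal sketch of memory_map with a real file path. The temporary file and its contents are assumptions made so the example is self-contained; any performance benefit depends on the file and system.

```python
import os
import tempfile

import pandas as pd

# Write a small hypothetical CSV to a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.csv")
    with open(path, "w") as handle:
        handle.write("a,b\n1,2\n3,4\n")

    # memory_map only applies when a file path is passed, as done here.
    df = pd.read_csv(path, memory_map=True)

print(df["a"].tolist())  # [1, 3]
```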

NA and missing data handling

na_values scalar, str, list-like, or dict, default None

Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values. See below for a list of the values interpreted as NaN by default.
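The per-column dict form of na_values can be sketched as follows. The "-" marker and column names are hypothetical.

```python
from io import StringIO

import pandas as pd

# Hypothetical data where "-" means "missing" only in column "a".
data = "a,b\n-,-\n1,2"

# The dict restricts the extra NA marker to column "a".
df = pd.read_csv(StringIO(data), na_values={"a": ["-"]})

print(df["a"].isna().tolist())  # [True, False]
print(df["b"].tolist())         # ['-', '2']
```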

keep_default_na boolean, default True

Whether or not to include the default NaN values when parsing the data. Depending on whether na_values is passed in, the behavior is as follows:

  • If keep_default_na is True, and na_values are specified, na_values is appended to the default NaN values used for parsing.

  • If keep_default_na is True, and na_values are not specified, only the default NaN values are used for parsing.

  • If keep_default_na is False, and na_values are specified, only the NaN values specified in na_values are used for parsing.

  • If keep_default_na is False, and na_values are not specified, no strings will be parsed as NaN.

Note that if na_filter is passed in as False, the keep_default_na and na_values parameters will be ignored.
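The interaction between keep_default_na and na_values can be sketched with a small comparison. The "missing" marker and the data are assumptions for illustration; "NA" is one of the strings pandas treats as NaN by default.

```python
from io import StringIO

import pandas as pd

# Hypothetical data: "NA" is a default NaN marker, "missing" is a custom one.
data = "a,b\nNA,x\n1,missing"

# Custom marker is added to the defaults: both "NA" and "missing" become NaN.
combined = pd.read_csv(StringIO(data), na_values=["missing"])

# Defaults are dropped: only "missing" becomes NaN, "NA" stays a string.
custom_only = pd.read_csv(StringIO(data), na_values=["missing"], keep_default_na=False)

print(combined["a"].isna().tolist())  # [True, False]
print(custom_only["a"].tolist())      # ['NA', '1']
```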

na_filter boolean, default
In [13]: import numpy as np

In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11"

In [15]: print(data)
a,b,c,d
1,2,3,4
5,6,7,8
9,10,11

In [16]: df = pd.read_csv(StringIO(data), dtype=object)

In [17]: df
Out[17]: 
   a   b   c    d
0  1   2   3    4
1  5   6   7    8
2  9  10  11  NaN

In [18]: df["a"][0]
Out[18]: '1'

In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"})

In [20]: df.dtypes
Out[20]: 
a      int64
b     object
c    float64
d      Int64
dtype: object
32

Detect missing value markers (empty strings and the value of na_values). In data without any NAs, passing na_filter=False can improve the performance of reading a large file.
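As a minimal sketch of what na_filter changes, the snippet below reads the same small CSV twice. The data string and variable names are made up for illustration; only read_csv and its na_filter parameter come from the text above.

```python
from io import StringIO

import pandas as pd

# Hypothetical data: the second row has an empty field in column "b".
data = "a,b\n1,x\n2,\n"

# Default (na_filter=True): the empty field is detected as a missing value.
with_filter = pd.read_csv(StringIO(data))

# na_filter=False: no NA detection, so the empty field stays an empty string.
without_filter = pd.read_csv(StringIO(data), na_filter=False)

print(with_filter["b"].isna().tolist())
print(without_filter["b"].tolist())
```

Skipping NA detection only makes sense when the file is known to contain no missing values; otherwise empty fields silently become empty strings.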

verbose boolean, default False

Indicate the number of NA values placed in non-numeric columns.

skip_blank_lines boolean, default True

If True, skip over blank lines rather than interpreting them as NaN values.
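The effect of skip_blank_lines can be seen on a small example. This is a sketch with made-up data, not taken from the original guide.

```python
from io import StringIO

import pandas as pd

# Hypothetical data with one blank line between the two data rows.
data = "a,b\n1,2\n\n3,4\n"

# Default: the blank line is skipped, giving two rows.
skipped = pd.read_csv(StringIO(data))

# skip_blank_lines=False: the blank line becomes a row of NaN values.
kept = pd.read_csv(StringIO(data), skip_blank_lines=False)

print(len(skipped), len(kept))
print(kept)
```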

Datetime handling

parse_dates boolean or list of ints or names or list of lists or dict, default False.
  • If True -> try parsing the index

  • If [1, 2, 3] -> try parsing columns 1, 2, 3 each as a separate date column

  • If [[1, 3]] -> combine columns 1 and 3 and parse as a single date column

  • If {'foo': [1, 3]} -> parse columns 1, 3 as date and call result ‘foo’

Note

A fast-path exists for iso8601-formatted dates.
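A short sketch of the two simplest parse_dates forms, the boolean form and the flat list form. The column names and data are hypothetical, and the list-of-lists and dict forms are deliberately not exercised here.

```python
from io import StringIO

import pandas as pd

# Hypothetical two-column CSV with ISO 8601 dates.
data = "date,value\n2023-01-05,1\n2023-02-10,2\n"

# List form: parse the named column as dates.
by_name = pd.read_csv(StringIO(data), parse_dates=["date"])

# Boolean form: parse the index, here taken from the first column.
by_index = pd.read_csv(StringIO(data), index_col=0, parse_dates=True)

print(by_name.dtypes)
print(type(by_index.index).__name__)
```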

infer_datetime_format boolean, default False

If True and parse_dates is enabled for a column, attempt to infer the datetime format to speed up the processing.

keep_date_col boolean, default False

If True and parse_dates specifies combining multiple columns, then keep the original columns.

date_parser function, default None

Function to use for converting a sequence of string columns to an array of datetime instances. The default uses dateutil.parser.parser to do the conversion. pandas will try to call date_parser in three different ways, advancing to the next if an exception occurs: 1) pass one or more arrays (as defined by parse_dates) as arguments; 2) concatenate (row-wise) the string values from the columns defined by parse_dates into a single array and pass that; and 3) call date_parser once for each row using one or more strings (corresponding to the columns defined by parse_dates) as arguments.

dayfirst boolean, default False

DD/MM format dates, international and European format.
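To illustrate dayfirst, the sketch below parses the same ambiguous date strings both ways. The data is invented for the example.

```python
from io import StringIO

import pandas as pd

# Hypothetical data: "01/02/2023" is ambiguous between 2 January and 1 February.
data = "date,value\n01/02/2023,1\n03/04/2023,2\n"

# Default: the first number is read as the month.
month_first = pd.read_csv(StringIO(data), parse_dates=["date"])

# dayfirst=True: the first number is read as the day.
day_first = pd.read_csv(StringIO(data), parse_dates=["date"], dayfirst=True)

print(month_first["date"][0])
print(day_first["date"][0])
```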

cache_dates boolean, default True

If True, use a cache of unique, converted dates to apply the datetime conversion. May produce significant speed-up when parsing duplicate date strings, especially ones with timezone offsets.

New in version 0.25.0.

Iteration

iterator boolean, default False

Return TextFileReader object for iteration or getting chunks with get_chunk().

chunksize int, default None

Return TextFileReader object for iteration. See below.
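The two iteration options can be sketched as follows. The data is generated for illustration, and the variable names are hypothetical.

```python
from io import StringIO

import pandas as pd

# Hypothetical data: ten rows with columns "a" and "b".
data = "a,b\n" + "".join(f"{i},{i * 2}\n" for i in range(10))

# chunksize: iterate over the file in DataFrames of at most four rows each.
sizes = []
total = 0
for chunk in pd.read_csv(StringIO(data), chunksize=4):
    sizes.append(len(chunk))
    total += int(chunk["b"].sum())

# iterator=True: request chunks of an arbitrary size on demand.
reader = pd.read_csv(StringIO(data), iterator=True)
first_three = reader.get_chunk(3)
reader.close()

print(sizes, total)
print(first_three)
```

Processing chunk by chunk keeps only one piece of the file in memory at a time, which is the usual reason to reach for these options on large files.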

Quoting, compression, and file format

compression {'infer', 'gzip', 'bz2', 'zip', 'xz', 'zstd', None, dict}, default 'infer'

For on-the-fly decompression of on-disk data. If ‘infer’, then use gzip, bz2, zip, xz, or zstandard if filepath_or_buffer is path-like ending in ‘.gz’, ‘.bz2’, ‘.zip’, ‘.xz’, ‘.zst’, respectively, and no decompression otherwise. If using ‘zip’, the ZIP file must contain only one data file to be read in. Set to None for no decompression. Can also be a dict with key 'method' set to one of {'zip', 'gzip', 'bz2', 'zstd'} and other key-value pairs are forwarded to zipfile.ZipFile, gzip.GzipFile, bz2.BZ2File, or zstandard.ZstdDecompressor. As an example, the following could be passed for faster compression and to create a reproducible gzip archive: compression={'method': 'gzip', 'compresslevel': 1, 'mtime': 1}.

Changed in version 1.1.0: dict option extended to support gzip and bz2.

Changed in version 1.2.0: Previous versions forwarded dict entries for ‘gzip’ to gzip.open.
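A minimal round-trip sketch of the compression options, assuming a reasonably recent pandas (1.2 or later for the dict form shown). The file name and DataFrame contents are hypothetical.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.csv.gz")  # hypothetical file name

    # Dict form: "method" selects the codec, the other keys are forwarded to gzip.
    df.to_csv(
        path,
        index=False,
        compression={"method": "gzip", "compresslevel": 1, "mtime": 1},
    )

    # compression="infer" (the default) picks gzip from the ".gz" suffix.
    inferred = pd.read_csv(path)

    # The codec can also be named explicitly.
    explicit = pd.read_csv(path, compression="gzip")

print(inferred)
```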

thousands str, default None

Thousands separator.

decimal str, default '.'

Character to recognize as decimal point. E.g. use ',' for European data.
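As an illustration of thousands and decimal working together, the sketch below reads European-style numbers. The semicolon separator and the data are assumptions made for the example.

```python
from io import StringIO

import pandas as pd

# Hypothetical European-style data: ";" separates fields,
# "." groups thousands and "," marks the decimal point.
data = "item;price\nA;1.234,56\nB;7,5\n"

df = pd.read_csv(StringIO(data), sep=";", thousands=".", decimal=",")

print(df)
print(df["price"].dtype)
```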

float_precision string, default None

Specifies which converter the C engine should use for floating-point values. The options are None for the ordinary converter, high for the high-precision converter, and round_trip for the round-trip converter.
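A small sketch of selecting the round-trip converter, using made-up values. With round_trip the parsed floats are expected to match what Python's own float() produces for the same text.

```python
from io import StringIO

import pandas as pd

# Hypothetical data with values that are not exactly representable in binary.
data = "x\n0.1\n1e-7\n"

# The round-trip converter requires the default C engine.
df = pd.read_csv(StringIO(data), float_precision="round_trip")

print(df["x"].tolist())
```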

lineterminator str (length 1), default None

Character to break file into lines. Only valid with C parser.

quotechar str (length 1)

The character used to denote the start and end of a quoted item. Quoted items can include the delimiter and it will be ignored

quoting int or csv.QUOTE_* instance, default 0

Control field quoting behavior per csv.QUOTE_* constants. Use one of QUOTE_MINIMAL (0), QUOTE_ALL (1), QUOTE_NONNUMERIC (2) or QUOTE_NONE (3).
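The difference between the default quoting mode and QUOTE_NONE can be sketched like this. The data is hypothetical; the constants come from Python's standard csv module.

```python
import csv
from io import StringIO

import pandas as pd

# Hypothetical data where the first column is wrapped in double quotes.
data = 'a,b\n"x",1\n"y",2\n'

# Default (QUOTE_MINIMAL): the quotes delimit the field and are stripped.
minimal = pd.read_csv(StringIO(data))

# QUOTE_NONE: quote characters are treated as ordinary data.
unquoted = pd.read_csv(StringIO(data), quoting=csv.QUOTE_NONE)

print(minimal["a"].tolist())
print(unquoted["a"].tolist())
```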

doublequote boolean, default True

When quotechar is specified and quoting is not QUOTE_NONE, indicate whether or not to interpret two consecutive quotechar elements inside a field as a single quotechar element.
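A minimal sketch of the default doublequote behaviour, with an invented field that contains doubled quote characters.

```python
from io import StringIO

import pandas as pd

# Hypothetical data: inside the quoted field, "" stands for one literal quote.
data = 'a,b\n"say ""hi""",1\n'

# doublequote=True is the default, so each "" collapses to a single ".
df = pd.read_csv(StringIO(data))

print(df["a"][0])
```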

escapechar str (length 1), default None

One-character string used to escape delimiter when quoting is QUOTE_NONE.
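A sketch of escapechar with quoting disabled, assuming the default C parser. The backslash as escape character and the data are choices made for this example.

```python
import csv
from io import StringIO

import pandas as pd

# Hypothetical data: the first field contains a comma escaped with a backslash.
data = "a,b\nx\\,y,1\n"

df = pd.read_csv(StringIO(data), quoting=csv.QUOTE_NONE, escapechar="\\")

# The escaped comma is kept as data instead of splitting the field.
print(df["a"][0])
```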

comment str, default None

Indicates remainder of line should not be parsed. If found at the beginning of a line, the line will be ignored altogether. This parameter must be a single character. Like empty lines (as long as skip_blank_lines=True), fully commented lines are ignored by the parameter header but not by skiprows. For example, if comment='#', parsing ‘#empty\na,b,c\n1,2,3’ with header=0 will result in ‘a,b,c’ being treated as the header.
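The header example described above can be sketched as follows. The second data row with a trailing comment is an addition made for illustration.

```python
from io import StringIO

import pandas as pd

# First line is fully commented; the last row carries a trailing comment.
data = "#empty\na,b,c\n1,2,3\n4,5,6#note\n"

df = pd.read_csv(StringIO(data), comment="#", header=0)

# The commented line is ignored, so "a,b,c" becomes the header.
print(list(df.columns))
print(df.values.tolist())
```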

encoding str, default None

Encoding to use for UTF when reading/writing (e.g. 'utf-8').

dialect str or csv.Dialect instance, default None

If provided, this parameter will override values (default or not) for the following parameters: delimiter, doublequote, escapechar, skipinitialspace, quotechar, and quoting. If it is necessary to override values, a ParserWarning will be issued. See the csv.Dialect documentation for more details.
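A minimal sketch of passing a dialect, assuming a colon-delimited input. The ColonDialect class and the sample data are hypothetical, made up for illustration; csv.excel is the standard-library dialect it extends.

```python
import csv
from io import StringIO

import pandas as pd


class ColonDialect(csv.excel):
    """Hypothetical dialect: like csv.excel, but fields are separated by ':'."""

    delimiter = ":"


# Made-up sample data using ':' as the field separator.
data = "a:b:c\n1:2:3\n4:5:6"

# The delimiter is taken from the dialect instead of the sep/delimiter default.
df = pd.read_csv(StringIO(data), dialect=ColonDialect())
print(df)
```

Because no conflicting delimiter or quoting arguments are passed here, nothing has to be overridden and no ParserWarning is expected.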

Error handling

error_bad_lines boolean, optional, default None

Lines with too many fields (e.g. a csv line with too many commas) will by default cause an exception to be raised, and no DataFrame will be returned. If False, then these “bad lines” will be dropped from the DataFrame that is returned. See below.

Deprecated since version 1.3.0: The on_bad_lines parameter should be used instead to specify behavior upon encountering a bad line.

warn_bad_lines boolean, optional, default None

If error_bad_lines is False, and warn_bad_lines is True, a warning for each “bad line” will be output.

Deprecated since version 1.3.0: The on_bad_lines parameter should be used instead to specify behavior upon encountering a bad line.

on_bad_lines (‘error’, ‘warn’, ‘skip’), default ‘error’

Specifies what to do upon encountering a bad line (a line with too many fields). Allowed values are:

  • ‘error’, raise a ParserError when a bad line is encountered

  • ‘warn’, print a warning when a bad line is encountered and skip that line

  • ‘skip’, skip bad lines without raising or warning when they are encountered

New in version 1.3.0.
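The difference between ‘error’ and ‘skip’ can be sketched as follows. The sample data is made up for illustration: its second data row has four fields where the header declares three. This assumes pandas 1.3.0 or later, where on_bad_lines is available.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample: the row "4,5,6,7" has one field too many.
data = "a,b,c\n1,2,3\n4,5,6,7\n8,9,10"

# Default behaviour (on_bad_lines="error"): parsing stops with a ParserError.
try:
    pd.read_csv(StringIO(data))
    raised = False
except pd.errors.ParserError:
    raised = True

# on_bad_lines="skip" drops the malformed row without raising or warning.
df = pd.read_csv(StringIO(data), on_bad_lines="skip")
print(raised)
print(df)
```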

Specifying column data types

You can indicate the data type for the whole DataFrame or individual columns:

In [13]: import numpy as np

In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11"

In [15]: print(data)
a,b,c,d
1,2,3,4
5,6,7,8
9,10,11

In [16]: df = pd.read_csv(StringIO(data), dtype=object)

In [17]: df
Out[17]: 
   a   b   c    d
0  1   2   3    4
1  5   6   7    8
2  9  10  11  NaN

In [18]: df["a"][0]
Out[18]: '1'

In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"})

In [20]: df.dtypes
Out[20]: 
a      int64
b     object
c    float64
d      Int64
dtype: object

Fortunately, pandas offers more than one way to ensure that your column(s) contain only one dtype. If you’re unfamiliar with these concepts, see the pandas documentation to learn more about dtypes and about object conversion in pandas.

For instance, you can use the converters argument of read_csv():

In [21]: data = "col_1\n1\n2\n'A'\n4.22"

In [22]: df = pd.read_csv(StringIO(data), converters={"col_1": str})

In [23]: df
Out[23]: 
  col_1
0     1
1     2
2   'A'
3  4.22

In [24]: df["col_1"].apply(type).value_counts()
Out[24]: 
<class 'str'>    4
Name: col_1, dtype: int64

Or you can use the to_numeric() function to coerce the dtypes after reading in the data,

In [25]: df2 = pd.read_csv(StringIO(data))

In [26]: df2["col_1"] = pd.to_numeric(df2["col_1"], errors="coerce")

In [27]: df2
Out[27]: 
   col_1
0   1.00
1   2.00
2    NaN
3   4.22

In [28]: df2["col_1"].apply(type).value_counts()
Out[28]: 
<class 'float'>    4
Name: col_1, dtype: int64

which will convert all valid values to floats, leaving the invalid ones as NaN.

Ultimately, how you deal with reading in columns containing mixed dtypes depends on your specific needs. In the case above, if you wanted to NaN out the data anomalies, then to_numeric() is probably your best option. However, if you wanted all the data to be coerced, no matter the type, then using the converters argument of read_csv() would certainly be worth trying.

Note

In some cases, reading in abnormal data with columns containing mixed dtypes will result in an inconsistent dataset. If you rely on pandas to infer the dtypes of your columns, the parsing engine will infer the dtypes for different chunks of the data, rather than the whole dataset at once. Consequently, you can end up with column(s) with mixed dtypes. For example,

In [29]: col_1 = list(range(500000)) + ["a", "b"] + list(range(500000))

In [30]: df = pd.DataFrame({"col_1": col_1})

In [31]: df.to_csv("foo.csv")

In [32]: mixed_df = pd.read_csv("foo.csv")

In [33]: mixed_df["col_1"].apply(type).value_counts()
Out[33]: 
<class 'int'>    737858
<class 'str'>    262144
Name: col_1, dtype: int64

In [34]: mixed_df["col_1"].dtype
Out[34]: dtype('O')

will result in mixed_df containing an int dtype for certain chunks of the column, and str for others due to the mixed dtypes from the data that was read in. It is important to note that the overall column will be marked with a dtype of object, which is used for columns with mixed dtypes.

Specifying categorical dtype

Categorical columns can be parsed directly by specifying dtype='category' or dtype=CategoricalDtype(categories, ordered).

In [35]: data = "col1,col2,col3\na,b,1\na,b,2\nc,d,3"

In [36]: pd.read_csv(StringIO(data))
Out[36]: 
  col1 col2  col3
0    a    b     1
1    a    b     2
2    c    d     3

In [37]: pd.read_csv(StringIO(data)).dtypes
Out[37]: 
col1    object
col2    object
col3     int64
dtype: object

In [38]: pd.read_csv(StringIO(data), dtype="category").dtypes
Out[38]: 
col1    category
col2    category
col3    category
dtype: object

Individual columns can be parsed as a Categorical using a dict specification:

In [39]: pd.read_csv(StringIO(data), dtype={"col1": "category"}).dtypes
Out[39]: 
col1    category
col2      object
col3       int64
dtype: object

Specifying dtype='category' will result in an unordered Categorical whose categories are the unique values observed in the data. For more control on the categories and order, create a CategoricalDtype ahead of time, and pass that for that column’s dtype.

In [40]: from pandas.api.types import CategoricalDtype

In [41]: dtype = CategoricalDtype(["d", "c", "b", "a"], ordered=True)

In [42]: pd.read_csv(StringIO(data), dtype={"col1": dtype}).dtypes
Out[42]: 
col1    category
col2      object
col3       int64
dtype: object

When using dtype=CategoricalDtype, “unexpected” values outside of dtype.categories are treated as missing values.
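A short sketch of this behaviour, reusing the three-column sample data from the categorical examples. The category list deliberately omits "c", so the row containing "c" should come back as a missing value.

```python
from io import StringIO

import pandas as pd
from pandas.api.types import CategoricalDtype

data = "col1,col2,col3\na,b,1\na,b,2\nc,d,3"

# "c" is deliberately left out of the allowed categories.
dtype = CategoricalDtype(["a", "b", "d"])

col1 = pd.read_csv(StringIO(data), dtype={"col1": dtype})["col1"]
print(col1)
```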

This matches the behavior of Categorical.set_categories().

Note

With dtype='category', the resulting categories will always be parsed as strings (object dtype). If the categories are numeric they can be converted using the to_numeric() function, or as appropriate, another converter such as to_datetime().

When dtype is a CategoricalDtype with homogeneous categories (all numeric, all datetimes, etc.), the conversion is done automatically.
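One possible way to convert string categories to numbers after reading, sketched with the same sample data. The variable names string_categories and numeric_categories are illustrative only.

```python
from io import StringIO

import pandas as pd

data = "col1,col2,col3\na,b,1\na,b,2\nc,d,3"

# With dtype="category", the categories of col3 are read as strings.
df = pd.read_csv(StringIO(data), dtype="category")
string_categories = list(df["col3"].cat.categories)

# Convert the category labels to numbers and swap them in.
numeric_categories = pd.to_numeric(df["col3"].cat.categories)
df["col3"] = df["col3"].cat.rename_categories(numeric_categories)
print(df["col3"])
```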

Naming and using columns

Handling column names

A file may or may not have a header row. pandas assumes the first row should be used as the column names.
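A minimal sketch of the default behaviour, using made-up sample data whose first row holds the column names:

```python
from io import StringIO

import pandas as pd

# Made-up sample: the first row is the header.
data = "a,b,c\n1,2,3\n4,5,6\n7,8,9"

df = pd.read_csv(StringIO(data))
print(df)
```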

By specifying the names argument in conjunction with header you can indicate other names to use and whether or not to throw away the header row (if any).
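Both combinations can be sketched as follows. The replacement names foo, bar and baz and the sample data are made up for illustration.

```python
from io import StringIO

import pandas as pd

data = "a,b,c\n1,2,3\n4,5,6\n7,8,9"
new_names = ["foo", "bar", "baz"]

# header=0: the existing header row is discarded and replaced by new_names.
replaced = pd.read_csv(StringIO(data), names=new_names, header=0)

# header=None: the first row is kept as ordinary data under new_names.
kept = pd.read_csv(StringIO(data), names=new_names, header=None)

print(replaced)
print(kept)
```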

If the header is in a row other than the first, pass the row number to header. This will skip the preceding rows.
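For instance, assuming a file with one line of junk before the header (the sample data is made up), header=1 selects the second line as the header:

```python
from io import StringIO

import pandas as pd

# The first line is not part of the table; the header is on line index 1.
data = "skip this skip it\na,b,c\n1,2,3\n4,5,6\n7,8,9"

df = pd.read_csv(StringIO(data), header=1)
print(df)
```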

Note

Default behavior is to infer the column names: if no names are passed the behavior is identical to header=0 and column names are inferred from the first non-blank line of the file; if column names are passed explicitly then the behavior is identical to header=None.

Duplicate names parsing

Deprecated since version 1.5.0: mangle_dupe_cols was never implemented, and a new argument where the renaming pattern can be specified will be added instead.

If the file or header contains duplicate names, pandas will by default distinguish between them so as to prevent overwriting data.
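A small sketch with made-up data in which the column name a appears twice:

```python
from io import StringIO

import pandas as pd

# The header repeats the name "a".
data = "a,b,a\n0,1,2\n3,4,5"

df = pd.read_csv(StringIO(data))
print(df.columns.tolist())
```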

There is no more duplicate data because mangle_dupe_cols=True by default, which modifies a series of duplicate columns 'X', ..., 'X' to become 'X', 'X.1', ..., 'X.N'.

Filtering columns (usecols)

The usecols argument allows you to select any subset of the columns in a file, either using the column names, position numbers or a callable.
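The three selection styles can be sketched like this; the sample data and variable names are made up for illustration.

```python
from io import StringIO

import pandas as pd

data = "a,b,c,d\n1,2,3,foo\n4,5,6,bar\n7,8,9,baz"

# Select by column name.
by_name = pd.read_csv(StringIO(data), usecols=["b", "d"])

# Select by zero-based position.
by_position = pd.read_csv(StringIO(data), usecols=[0, 2, 3])

# Select with a callable that receives each column name.
by_callable = pd.read_csv(
    StringIO(data), usecols=lambda name: name.upper() in ["A", "C"]
)

print(by_name.columns.tolist())
print(by_position.columns.tolist())
print(by_callable.columns.tolist())
```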

The usecols argument can also be used to specify which columns not to use in the final result.
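A sketch of excluding columns with a callable, assuming made-up sample data with columns a through d:

```python
from io import StringIO

import pandas as pd

data = "a,b,c,d\n1,2,3,foo\n4,5,6,bar\n7,8,9,baz"

# Keep every column whose name is not "a" or "c".
df = pd.read_csv(StringIO(data), usecols=lambda name: name not in ["a", "c"])
print(df)
```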

In this case, the callable is specifying that we exclude the “a” and “c” columns from the output.

Comments and empty lines

Ignoring line comments and empty lines

If the comment parameter is specified, then completely commented lines will be ignored. By default, completely blank lines will be ignored as well.
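As an illustration, assuming made-up data that mixes blank lines and one fully commented line with the real rows:

```python
from io import StringIO

import pandas as pd

# Blank lines and the "# commented line" row should not reach the result.
data = "\na,b,c\n\n# commented line\n1,2,3\n\n4,5,6"

df = pd.read_csv(StringIO(data), comment="#")
print(df)
```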

If skip_blank_lines=False, then read_csv will not ignore blank lines.
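The two behaviours can be compared side by side. This is a small sketch on made-up strings; the variable names are assumptions, not part of the original text.

```python
from io import StringIO

import pandas as pd

# A fully commented line and blank lines are both dropped when comment="#".
commented = "a,b,c\n\n# commented line\n1,2,3\n\n4,5,6"
skipped = pd.read_csv(StringIO(commented), comment="#")

# With skip_blank_lines=False each blank line becomes a row of missing values.
blanks = "a,b,c\n\n1,2,3\n\n\n4,5,6"
kept = pd.read_csv(StringIO(blanks), skip_blank_lines=False)

print(len(skipped), len(kept))
```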

Warning

The presence of ignored lines might create ambiguities involving line numbers: the header parameter uses row numbers (ignoring commented and empty lines), while skiprows uses line numbers (including commented and empty lines).
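A short sketch of the difference between row numbers and line numbers. The input strings and variable names are assumptions chosen to make the contrast visible.

```python
from io import StringIO

import pandas as pd

# header counts rows after commented lines are ignored, so row 1 is "A,B,C".
data = "#comment\na,b,c\nA,B,C\n1,2,3"
by_row = pd.read_csv(StringIO(data), comment="#", header=1)

# skiprows counts physical lines, so it skips "A,B,C" and "#comment".
data = "A,B,C\n#comment\na,b,c\n1,2,3"
by_line = pd.read_csv(StringIO(data), comment="#", skiprows=2)

print(by_row.columns.tolist(), by_line.columns.tolist())
```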

If both header and skiprows are specified, header will be relative to the end of skiprows.
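As an illustration of combining the two options, the sketch below skips four physical lines and then takes the header from row 1 of what remains. The sample text is an assumption.

```python
from io import StringIO

import pandas as pd

data = (
    "# empty\n"
    "# second empty line\n"
    "# third emptyline\n"
    "X,Y,Z\n"
    "1,2,3\n"
    "A,B,C\n"
    "1,2.,4.\n"
    "5.,NaN,10.0\n"
)

# skiprows=4 drops the three comment lines and "X,Y,Z";
# header=1 is then counted from the first remaining line, selecting "A,B,C".
df = pd.read_csv(StringIO(data), comment="#", skiprows=4, header=1)
print(df)
```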

Comments

Sometimes comments or metadata may be included in a file. By default, the parser includes the comments in the output. We can suppress the comments using the comment keyword.
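A sketch of both cases, using an in-memory string instead of a file on disk. The patient records are invented sample data.

```python
from io import StringIO

import pandas as pd

data = (
    "ID,level,category\n"
    "Patient1,123000,x # really unpleasant\n"
    "Patient2,23000,y # wouldn't take his medicine\n"
    "Patient3,1234018,z # awesome\n"
)

# Default: the trailing comments stay inside the last field.
with_comments = pd.read_csv(StringIO(data))

# comment="#" discards everything from "#" to the end of each line.
without_comments = pd.read_csv(StringIO(data), comment="#")

print(with_comments["category"].tolist())
print(without_comments["category"].str.strip().tolist())
```

The values are stripped in the second print because the space that preceded each "#" may remain in the field.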

Dealing with Unicode data

The encoding argument should be used for encoded Unicode data, which will result in byte strings being decoded to Unicode in the result.

Some formats which encode all characters as multiple bytes, like UTF-16, won't parse correctly at all without specifying the encoding.
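A minimal sketch of reading encoded bytes, assuming Latin-1 input. The sample words and variable names are illustrative only.

```python
from io import BytesIO

import pandas as pd

text = "word,length\nTräumen,7\nGrüße,5"

# Simulate a file that was saved with the Latin-1 encoding.
raw = text.encode("latin-1")

# Passing the matching encoding decodes the bytes back to proper strings.
df = pd.read_csv(BytesIO(raw), encoding="latin-1")
print(df["word"].tolist())
```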

Index columns and trailing delimiters

If a file has one more column of data than the number of column names, the first column will be used as the DataFrame's row names. Ordinarily, you can achieve this behavior using the index_col option.

There are some exception cases when a file has been prepared with delimiters at the end of each data line, confusing the parser. To explicitly disable the index column inference and discard the last column, pass index_col=False.
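The following sketch contrasts the implicit index with the trailing-delimiter case. The sample rows are assumptions made for illustration.

```python
from io import StringIO

import pandas as pd

# Three column names but four values per row: the first value becomes the index.
data = "a,b,c\n4,apple,bat,5.7\n8,orange,cow,10"
implicit = pd.read_csv(StringIO(data))

# Each row ends with a delimiter, which looks like an extra empty column.
trailing = "a,b,c\n4,apple,bat,\n8,orange,cow,"
fixed = pd.read_csv(StringIO(trailing), index_col=False)

print(implicit.index.tolist())
print(fixed)
```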

If a subset of data is being parsed using the usecols option, the index_col specification is based on that subset, not the original data.
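A small sketch of that interaction, on invented data: with two columns selected, position 0 refers to the first selected column rather than the first column of the file.

```python
from io import StringIO

import pandas as pd

data = "a,b,c,d\nA,a,1,one\nB,b,2,two"

# Only "b" and "c" are read; index_col=0 therefore picks "b", not "a".
df = pd.read_csv(StringIO(data), usecols=["b", "c"], index_col=0)
print(df)
```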

Date Handling

Specifying date columns

To better facilitate working with datetime data, read_csv uses the keyword arguments parse_dates and date_parser to allow users to specify a variety of columns and date/time formats to turn the input text data into datetime objects.

The simplest case is to just pass in parse_dates=True.
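A minimal sketch of the simplest case, assuming the dates live in the first column and that this column is also used as the index. The data is made up.

```python
from io import StringIO

import pandas as pd

data = "date,A,B\n2009-01-01,1,2\n2009-01-02,3,4"

# parse_dates=True asks pandas to try parsing the index as dates.
df = pd.read_csv(StringIO(data), index_col=0, parse_dates=True)
print(type(df.index).__name__)
```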

It is often the case that we may want to store date and time data separately, or store various date fields separately. The parse_dates keyword can be used to specify a combination of columns to parse the dates and/or times from.

You can specify a list of column lists to parse_dates; the resulting date columns will be prepended to the output (so as to not affect the existing column order) and the new column names will be the concatenation of the component column names.

By default the parser removes the component date columns, but you can choose to retain them via the keep_date_col keyword.

Note that if you wish to combine multiple columns into a single date column, a nested list must be used. In other words, parse_dates=[1, 2] indicates that the second and third columns should each be parsed as separate date columns, while parse_dates=[[1, 2]] means the two columns should be parsed into a single column.
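The sketch below shows the flat-list form, where each listed column is parsed on its own. For the combined case it deliberately swaps in a different technique: the columns are joined after reading with pd.to_datetime instead of a nested parse_dates list. That swap is an assumption made so the example does not depend on which pandas version is installed; the data and names are invented.

```python
from io import StringIO

import pandas as pd

# Flat list: the second and third columns are each parsed as dates.
data = "id,start,end\n1,2020-01-01,2020-01-05\n2,2020-02-01,2020-02-03"
separate = pd.read_csv(StringIO(data), parse_dates=[1, 2])

# Combined column built after reading (stand-in for the nested-list form).
split = "date,time,value\n2020-01-01,10:30:00,1\n2020-01-02,11:45:00,2"
combined = pd.read_csv(StringIO(split))
combined["timestamp"] = pd.to_datetime(combined["date"] + " " + combined["time"])

print(separate.dtypes)
print(combined["timestamp"].tolist())
```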

You can also use a dict to specify custom names for the resulting columns.

It is important to remember that if multiple text columns are to be parsed into a single date column, then a new column is prepended to the data. The index_col specification is based off of this new set of columns rather than the original data columns.

Note

If a column or index contains an unparsable date, the entire column or index will be returned unaltered as an object data type. For non-standard datetime parsing, use to_datetime() after pd.read_csv.
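A sketch of parsing after the read, assuming one value in the column is not a date. Using errors="coerce" is one possible choice here, not the only one; it turns unparsable entries into missing values.

```python
from io import StringIO

import pandas as pd

data = "when,value\n2020-01-01,1\nnot a date,2"

# Read the column as plain text first.
df = pd.read_csv(StringIO(data))

# Convert afterwards; entries that cannot be parsed become NaT.
df["when"] = pd.to_datetime(df["when"], errors="coerce")
print(df)
```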

Note

read_csv has a fast_path for parsing datetime strings in ISO 8601 format, e.g. "2000-01-01T00:01:02+00:00" and similar variations. If you can arrange for your data to store datetimes in this format, load times will be significantly faster; speedups of about 20x have been observed.

Date parsing functions

Finally, the parser allows you to specify a custom date_parser function to take full advantage of the flexibility of the date parsing API.

pandas will try to call the date_parser function in the following ways, in order. If an exception is raised, the next one is tried:

  1. date_parser is first called with one or more arrays as arguments, as defined using parse_dates (e.g., date_parser(['2013', '2013'], ['1', '2']))

  2. If #1 fails, date_parser is called with all the columns concatenated row-wise into a single array (e.g., date_parser(['2013 1', '2013 2']))

Note that performance-wise, you should try these methods of parsing dates in order:

  1. Try to infer the format using infer_datetime_format=True (see section below)

  2. If you know the format, use pd.to_datetime(): date_parser=lambda x: pd.to_datetime(x, format=...)

  3. If you have a really non-standard format, use a custom date_parser function. For optimal performance, this should be vectorized, i.e., it should accept arrays as arguments
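When the format is known, the conversion can also be applied to the column after reading. The sketch below takes that route rather than passing date_parser, which is an assumption made to keep the example independent of the pandas version; the format string and data are illustrative.

```python
from io import StringIO

import pandas as pd

data = "day,value\n30/12/2011,1\n31/12/2011,2"

# Read the dates as text, then convert the whole column in one vectorized call.
df = pd.read_csv(StringIO(data))
df["day"] = pd.to_datetime(df["day"], format="%d/%m/%Y")
print(df["day"].tolist())
```

Passing an explicit format avoids any guessing and works on the whole array at once, which matches the advice about vectorized parsing.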

Parsing a CSV with mixed timezones

pandas cannot natively represent a column or index with mixed timezones. If your CSV file contains columns with a mixture of timezones, the default result will be an object-dtype column with strings, even with parse_dates.

To parse the mixed-timezone values as a datetime column, pass a partially-applied to_datetime() with utc=True as the date_parser.
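A sketch of the partially-applied converter. Here it is applied to the column after reading instead of being passed as date_parser; that substitution is an assumption made for portability across pandas versions, and the timestamps are sample values.

```python
from functools import partial
from io import StringIO

import pandas as pd

content = "a\n2000-01-01T00:00:00+05:00\n2000-01-01T00:00:00+06:00"

# to_datetime with utc=True fixed in advance.
to_utc = partial(pd.to_datetime, utc=True)

df = pd.read_csv(StringIO(content))
# Both offsets are converted to a single UTC-aware datetime column.
df["a"] = to_utc(df["a"])
print(df["a"])
```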

Inferring datetime format

If you have parse_dates enabled for some or all of your columns, and your datetime strings are all formatted the same way, you may get a large speed up by setting infer_datetime_format=True. If set, pandas will attempt to guess the format of your datetime strings, and then use a faster means of parsing the strings. 5-10x parsing speeds have been observed. pandas will fall back to the usual parsing if either the format cannot be guessed or the format that was guessed cannot properly parse the entire column of strings. So in general, infer_datetime_format should not have any negative consequences if enabled.

Here are some examples of datetime strings that can be guessed (all representing December 30th, 2011 at 00:00:00):

  • “20111230”

  • “2011/12/30”

  • “20111230 00:00:00”

  • “12/30/2011 00:00:00”

  • “30/Dec/2011 00:00:00”

  • “30/December/2011 00:00:00”

Note that infer_datetime_format is sensitive to dayfirst. With dayfirst=True, it will guess "01/12/2011" to be December 1st. With dayfirst=False (default) it will guess "01/12/2011" to be January 12th.
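The effect of dayfirst on an ambiguous string can be seen directly with pd.to_datetime. This sketch uses that function on a single string as a stand-in for the inference done while reading a file, which is an assumption for illustration.

```python
import pandas as pd

# The same text, read with the day first and with the month first (default).
day_first = pd.to_datetime("01/12/2011", dayfirst=True)
month_first = pd.to_datetime("01/12/2011")

print(day_first.date(), month_first.date())
```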

International date formats

While US date formats tend to be MM/DD/YYYY, many international formats use DD/MM/YYYY instead. For convenience, a dayfirst keyword is provided.
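A sketch of reading the same text under both conventions. The sample rows and variable names are assumptions.

```python
from io import StringIO

import pandas as pd

data = "date,value,cat\n1/6/2000,5,a\n2/6/2000,10,b\n3/6/2000,15,c"

# Default: month first, so "1/6/2000" is January 6th.
us = pd.read_csv(StringIO(data), parse_dates=[0])

# dayfirst=True: "1/6/2000" is June 1st.
intl = pd.read_csv(StringIO(data), parse_dates=[0], dayfirst=True)

print(us["date"][0].date(), intl["date"][0].date())
```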

Writing CSVs to binary file objects

New in version 1.2.0.

df.to_csv(..., mode="wb") allows writing a CSV to a file object opened in binary mode. In most cases, it is not necessary to specify mode, as pandas will auto-detect whether the file object is opened in text or binary mode.
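A minimal sketch using an in-memory binary buffer, assuming pandas 1.2 or later. The frame contents are made up, and the round trip through read_csv is only there to show that the bytes are valid CSV.

```python
import io

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

# BytesIO behaves like a file opened in binary mode; no mode argument is given.
buffer = io.BytesIO()
df.to_csv(buffer, index=False, encoding="utf-8")

# Rewind and read the bytes back to check the round trip.
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored)
```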

Specifying method for floating-point conversion

The parameter float_precision can be specified in order to use a specific floating-point converter during parsing with the C engine. The options are the ordinary converter, the high-precision converter, and the round-trip converter (which is guaranteed to round-trip values after writing to a file).
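The sketch below measures how far each converter lands from Python's own float() for one long decimal string. The helper name parse and the test value are assumptions for illustration.

```python
from io import StringIO

import pandas as pd

val = "0.3066101993807095471566981359501369297504425048828125"
data = "a,b,c\n1,2,{0}".format(val)


def parse(precision):
    """Read the sample with the given float_precision and return column c."""
    frame = pd.read_csv(StringIO(data), engine="c", float_precision=precision)
    return frame["c"][0]


# Absolute difference from float(val) for each converter.
errors = {name: abs(parse(name) - float(val)) for name in (None, "high", "round_trip")}
print(errors)
```

The round-trip converter is the one expected to reproduce float(val) exactly; the other two may differ in the last bits.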

Thousand separators

For large numbers that have been written with a thousands separator, you can set the `thousands` keyword to a string of length 1 so that integers will be parsed correctly.

By default, numbers with a thousands separator will be parsed as strings. Setting the `thousands` keyword allows them to be parsed as integers.
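A short sketch of both behaviours, assuming a small pipe-separated sample (the column names and values are made up for illustration):

```python
from io import StringIO

import pandas as pd

# Hypothetical sample: the "level" column uses "," as a thousands separator.
data = (
    "ID|level|category\n"
    "Patient1|123,000|x\n"
    "Patient2|23,000|y\n"
    "Patient3|1,234,018|z"
)

# Without the keyword the column is read as text.
as_text = pd.read_csv(StringIO(data), sep="|")

# With thousands="," the separators are stripped and the column becomes integer.
as_ints = pd.read_csv(StringIO(data), sep="|", thousands=",")

print(as_text["level"].tolist())
print(as_ints["level"].tolist())
```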

NA values

To control which values are parsed as missing values (which are signified by `NaN`), specify a string in `na_values`. If you specify a list of strings, then all values in it are considered to be missing values. If you specify a number (a `float`, like `5.0`, or an `integer`, like `5`), the corresponding equivalent values will also imply a missing value (in this case effectively `[5.0, 5]` are recognized as `NaN`).

To completely override the default values that are recognized as missing, specify `keep_default_na=False`.

The default values recognized as `NaN` include the empty string and markers such as `NA`, `N/A`, `NULL`, `NaN`, and `nan`.

Let us consider some examples.

With `na_values=[5]`, both `5` and `5.0` will be recognized as `NaN`, in addition to the defaults. A string will first be interpreted as a numerical `5`, then as a `NaN`.
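A minimal sketch of the numeric case, using an inline sample rather than a file (the data is an assumption for illustration):

```python
from io import StringIO

import pandas as pd

# Hypothetical sample: column "a" holds 5 written two ways, column "b" a default NA marker.
data = "a,b\n5,x\n5.0,y\n7,NA"

df = pd.read_csv(StringIO(data), na_values=[5])

# Both "5" and "5.0" become missing; "NA" is still missing because defaults are kept.
print(df)
```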

With `keep_default_na=False` and `na_values=[""]`, only an empty field will be recognized as `NaN`.

With `keep_default_na=False` and `na_values=["NA", "0"]`, both `NA` and `0` as strings are `NaN`.

With `na_values=["Nope"]`, the default values, in addition to the string `"Nope"`, are recognized as `NaN`.
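The three combinations described above can be sketched like this; the sample strings are assumptions made up for the demonstration:

```python
from io import StringIO

import pandas as pd

data = "a,b\nNA,1\n,2\n0,3\nx,4"

# Only the empty field is treated as missing; "NA" and "0" stay as text.
only_empty = pd.read_csv(StringIO(data), keep_default_na=False, na_values=[""])

# Only the strings "NA" and "0" are treated as missing; the empty field is kept.
na_and_zero = pd.read_csv(
    StringIO(data), keep_default_na=False, na_values=["NA", "0"]
)

# Defaults stay active and "Nope" is added to them.
nope = pd.read_csv(StringIO("a\nNope\nNA\nx"), na_values=["Nope"])

print(only_empty["a"].isna().tolist())
print(na_and_zero["a"].isna().tolist())
print(nope["a"].isna().tolist())
```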

Infinity

`inf`-like values will be parsed as `np.inf` (positive infinity), and `-inf` as `-np.inf` (negative infinity). Parsing ignores the case of the value, meaning `Inf` will also be parsed as `np.inf`.
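A brief sketch, assuming a one-column sample with mixed-case spellings:

```python
from io import StringIO

import pandas as pd

# Hypothetical sample mixing lower-case and capitalised infinity markers.
data = "x\ninf\n-inf\nInf\n-Inf\n1.5"

df = pd.read_csv(StringIO(data))

# The column is numeric; the first four entries are infinities.
print(df["x"].tolist())
```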

Returning Series

Using the `squeeze` keyword, the parser will return output with a single column as a `Series`.

Deprecated since version 1.4.0: Users should append `.squeeze("columns")` to the DataFrame returned by `read_csv` instead.
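The recommended replacement can be sketched as follows, with made-up sample data:

```python
from io import StringIO

import pandas as pd

# Hypothetical sample: one index column and one data column.
data = "id,level\nPatient1,123000\nPatient2,23000\nPatient3,1234018"

frame = pd.read_csv(StringIO(data), index_col=0)

# A single-column DataFrame collapses to a Series.
series = frame.squeeze("columns")

print(type(series).__name__)
print(series)
```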

Boolean values

The common values `True`, `False`, `TRUE`, and `FALSE` are all recognized as boolean. Occasionally you might want to recognize other values as being boolean. To do this, use the `true_values` and `false_values` options.
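A minimal sketch, assuming a column that spells booleans as `Yes` and `No`:

```python
from io import StringIO

import pandas as pd

data = "a,b,c\n1,Yes,2\n3,No,4"

# Map the custom spellings onto booleans while parsing.
df = pd.read_csv(StringIO(data), true_values=["Yes"], false_values=["No"])

print(df["b"].tolist())
```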

Handling "bad" lines

Some files may have malformed lines with too few fields or too many. Lines with too few fields will have NA values filled in the trailing fields. Lines with too many fields will raise an error by default.

You can elect to skip bad lines by passing `on_bad_lines="skip"`, or `on_bad_lines="warn"` to skip them with a warning.

Alternatively, pass a callable function to handle the bad line if `engine="python"`. The bad line will be a list of strings that was split by the `sep`.
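The three behaviours can be sketched on one small sample. The data and the helper `keep_first_three` are assumptions for illustration; the callable form requires a pandas version that supports it (1.4 or later).

```python
from io import StringIO

import pandas as pd

# The third line has four fields while the header declares three.
data = "a,b,c\n1,2,3\n4,5,6,7\n8,9,10"

# Default behaviour: the extra field raises a ParserError.
try:
    pd.read_csv(StringIO(data))
    raised = False
except pd.errors.ParserError:
    raised = True

# Drop the malformed line silently.
skipped = pd.read_csv(StringIO(data), on_bad_lines="skip")


def keep_first_three(fields):
    """Receive the bad line as a list of strings and return a repaired row."""
    return fields[:3]


# Repair the malformed line instead of dropping it (python engine only).
repaired = pd.read_csv(
    StringIO(data), on_bad_lines=keep_first_three, engine="python"
)

print(raised)
print(skipped)
print(repaired)
```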

You can also use the `usecols` parameter to eliminate extraneous column data that appears in some lines but not others.

If you want to keep all data, including the lines with too many fields, you can specify a sufficient number of `names`. This ensures that lines with not enough fields are filled with `NaN`.
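Both options can be sketched on the same kind of sample; the data is made up for illustration:

```python
from io import StringIO

import pandas as pd

# The third line carries one field more than the header.
data = "a,b,c\n1,2,3\n4,5,6,7\n8,9,10"

# Keep only the first three columns; the extra field is ignored.
trimmed = pd.read_csv(StringIO(data), usecols=[0, 1, 2])

# Supply four names so every field fits. Because names are given explicitly,
# the original header line is read as an ordinary data row.
widened = pd.read_csv(StringIO(data), names=["a", "b", "c", "d"])

print(trimmed)
print(widened)
```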

Dialect

The `dialect` keyword gives greater flexibility in specifying the file format. By default it uses the Excel dialect, but you can specify either the dialect name or a `csv.Dialect` instance.

Suppose you have data with unenclosed quotes. By default, `read_csv` uses the Excel dialect and treats the double quote as the quote character, which causes it to fail when it finds a newline before it finds the closing double quote. We can get around this using `dialect`.
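One way to sketch the workaround is to copy the Excel dialect from the standard `csv` module and turn quoting off. The sample string is an assumption for illustration:

```python
import csv
from io import StringIO

import pandas as pd

# The second line opens a double quote that is never closed.
data = 'label1,label2,label3\nindex1,"a,c,e\nindex2,b,d,f'

# Start from the Excel dialect and disable quote handling.
dia = csv.excel()
dia.quoting = csv.QUOTE_NONE

df = pd.read_csv(StringIO(data), dialect=dia)

# Rows have one more field than the header, so the first field becomes the index.
print(df)
```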

All of the dialect options can be specified separately by keyword arguments.

Another common dialect option is `skipinitialspace`, which skips any whitespace after a delimiter.
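Two such keyword arguments are sketched below; the sample strings are assumptions for illustration:

```python
from io import StringIO

import pandas as pd

# Rows separated by "~" instead of a newline.
tilde_data = "a,b,c~1,2,3~4,5,6"
by_tilde = pd.read_csv(StringIO(tilde_data), lineterminator="~")

# Fields padded with a space after each comma.
spaced_data = "a, b, c\n1, 2, 3\n4, 5, 6"
trimmed = pd.read_csv(StringIO(spaced_data), skipinitialspace=True)

print(by_tilde)
print(list(trimmed.columns))
```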

The parsers make every attempt to "do the right thing" and not be fragile. Type inference is a pretty big deal. If a column can be coerced to integer dtype without altering the contents, the parser will do so. Any non-numeric columns will come through as object dtype, as with the rest of pandas objects.

Quoting and escape characters

Quotes (and other escape characters) in embedded fields can be handled in any number of ways. One way is to use backslashes.
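For backslash-escaped quotes, a minimal sketch using the `escapechar` option of `read_csv` might look like this (the sample text is made up):

```python
from io import StringIO

import pandas as pd

# The quoted field contains double quotes escaped with backslashes.
data = 'a,b\n"hello, \\"Bob\\", nice to see you",5'

df = pd.read_csv(StringIO(data), escapechar="\\")

print(df["a"][0])
```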

Files with fixed width columns

While `read_csv` reads delimited data, the `read_fwf` function works with data files that have known and fixed column widths. The function parameters to `read_fwf` are largely the same as `read_csv`, with two extra parameters and a different usage of the `delimiter` parameter:

  • `colspecs`: A list of pairs (tuples) giving the extents of the fixed-width fields of each line as half-open intervals (i.e., [from, to[ ). The string value 'infer' can be used to instruct the parser to try detecting the column specifications from the first 100 rows of the data. The default behavior, if not specified, is to infer.

  • `widths`: A list of field widths which can be used instead of 'colspecs' if the intervals are contiguous.

  • `delimiter`: Characters to consider as filler characters in the fixed-width file. Can be used to specify the filler character of the fields if it is not spaces (e.g., '~').

To parse a typical fixed-width data file into a DataFrame, we only need to supply the column specifications to the read_fwf function along with the file name.
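A minimal sketch of calling read_fwf with explicit column specifications may help here. The sample rows, the column names id, x and y, and the character boundaries are made up for illustration and are not the original data file; an in-memory buffer stands in for the file name.

```python
from io import StringIO

import pandas as pd

# Hypothetical fixed-width sample (not the original data file):
# characters 0-5 hold an id, 8-13 and 16-21 hold two numeric fields.
data = (
    "id8141  360.24  149.91\n"
    "id1594  444.95  166.98\n"
    "id1849  364.13  183.62\n"
)

# Each pair is a half-open interval [from, to) of character positions.
colspecs = [(0, 6), (8, 14), (16, 22)]

df = pd.read_fwf(
    StringIO(data),
    colspecs=colspecs,
    header=None,             # the sample has no header row
    names=["id", "x", "y"],  # illustrative column names
)
print(df)
```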

Note how the parser automatically picks column names when the header=None argument is specified. Alternatively, you can supply just the column widths for contiguous columns.
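The widths variant of read_fwf can be sketched as follows, assuming a made-up sample in which the three fields are 6, 8 and 8 characters wide and directly adjacent. The data is hypothetical and only serves to show the call.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample: three contiguous fields of 6, 8 and 8 characters.
data = (
    "id8141  360.24  149.91\n"
    "id1594  444.95  166.98\n"
    "id1849  364.13  183.62\n"
)

# Field widths in characters; padding inside each field is stripped.
widths = [6, 8, 8]

df_widths = pd.read_fwf(StringIO(data), widths=widths, header=None)
print(df_widths)
```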

The parser takes care of extra whitespace around the columns, so it is fine to have extra separation between the columns in the file.

By default, read_fwf will try to infer the file's colspecs by using the first 100 rows of the file. It can do so only when the columns are aligned and correctly separated by the provided delimiter (the default delimiter is whitespace).
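As a small sketch of the inference behavior of read_fwf, the call below passes neither colspecs nor widths, so the column boundaries are detected from the data. The sample is hypothetical and deliberately well aligned, which is the case the inference is meant for.

```python
from io import StringIO

import pandas as pd

# Hypothetical, cleanly aligned sample separated by whitespace.
data = (
    "id8141  360.24  149.91\n"
    "id1594  444.95  166.98\n"
    "id1849  364.13  183.62\n"
)

# No colspecs or widths: read_fwf infers the column boundaries.
df_inferred = pd.read_fwf(StringIO(data), header=None)
print(df_inferred)
```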

read_fwf supports the dtype parameter for specifying the types of parsed columns to be different from the inferred type.
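A short sketch of the dtype parameter of read_fwf, using a tiny made-up two-column sample: without dtype the columns are inferred as integers, while the second call requests a float column and a string column.

```python
from io import StringIO

import pandas as pd

# Hypothetical two-column sample with a header row.
data = "a  b\n1  2\n3  4\n"

# Types inferred by the parser.
inferred = pd.read_fwf(StringIO(data)).dtypes

# Types requested explicitly per column.
df_typed = pd.read_fwf(StringIO(data), dtype={"a": "float64", "b": str})
print(inferred)
print(df_typed.dtypes)
```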

Indexes

Files with an "implicit" index column

When a file has one less entry in the header than the number of data columns, read_csv assumes that the first column is to be used as the index of the DataFrame.
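The implicit-index case for read_csv can be illustrated with a small sketch. The sample below is made up: its header names three columns while each data row has four fields, so the first field of every row becomes the index.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample: 3 header entries, 4 fields per data row.
data = "A,B,C\n20090101,a,1,2\n20090102,b,3,4\n20090103,c,4,5"

df = pd.read_csv(StringIO(data))

# The first field of each row is used as the index.
print(df)
print(df.index)
```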

Note that the dates are not parsed automatically. In that case you need to request date parsing explicitly, as before.

Reading an index with a MultiIndex

Suppose you have data indexed by two columns. The index_col argument to read_csv can take a list of column numbers to turn multiple columns into a MultiIndex for the index of the returned object.
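A minimal sketch of passing a list to index_col in read_csv, assuming a made-up sample whose first two columns (year and indiv, names chosen for illustration) together identify each row:

```python
from io import StringIO

import pandas as pd

# Hypothetical sample indexed by the first two columns.
data = (
    "year,indiv,zit,xit\n"
    "1977,A,1.2,0.6\n"
    "1977,B,1.5,0.5\n"
    "1978,A,0.2,0.06\n"
)

# Columns 0 and 1 together form a MultiIndex on the rows.
df = pd.read_csv(StringIO(data), index_col=[0, 1])
print(df)

# Rows are then addressed by a (year, indiv) tuple.
print(df.loc[(1977, "B"), "zit"])
```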

Reading columns with a MultiIndex

By specifying a list of row locations for the header argument, you can read in a MultiIndex for the columns. Specifying non-consecutive rows will skip the intervening rows.

read_csv is also able to interpret a more common format of multi-column indices.
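The following sketch shows a list passed to the header argument of read_csv. The sample is hypothetical: its first two rows hold the two levels of the column labels, and the first field of each data row is used as the row index.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample: rows 0 and 1 are the two column-label levels.
data = ",a,a,b\n,q,r,s\none,1,2,3\ntwo,4,5,6\n"

df = pd.read_csv(StringIO(data), header=[0, 1], index_col=0)
print(df)

# Columns are addressed by (level 0, level 1) tuples.
print(df[("b", "s")])
```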

Note

If index_col is not specified (e.g. you don't have an index, or wrote it with df.to_csv(..., index=False)), then any names on the columns index will be lost.

Automatically "sniffing" the delimiter

read_csv is capable of inferring delimited (not necessarily comma-separated) files, as pandas uses the csv.Sniffer class of the csv module. For this, you have to specify sep=None.
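A small sketch of delimiter sniffing with read_csv, using a made-up colon-separated sample. The python engine is selected explicitly here because sep=None is not handled by the C engine.

```python
from io import StringIO

import pandas as pd

# Hypothetical colon-separated sample.
data = "a:b:c\n1:2:3\n4:5:6\n"

# sep=None asks the parser to detect the delimiter.
df = pd.read_csv(StringIO(data), sep=None, engine="python")
print(df)
```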

Reading multiple files to create a single DataFrame

It is best to use concat() to combine multiple files. See the cookbook for an example.
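One way to combine several files with concat() is sketched below. In-memory buffers stand in for real files so the example is self-contained; with files on disk, the list would hold paths instead. The sample contents are made up.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-ins for several CSV files with the same columns.
parts = ["a,b\n1,2\n", "a,b\n3,4\n"]

# Read each source into its own DataFrame, then stack them.
frames = [pd.read_csv(StringIO(part)) for part in parts]
combined = pd.concat(frames, ignore_index=True)
print(combined)
```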

Iterating through files chunk by chunk

Suppose you wish to iterate through a (potentially very large) file lazily rather than reading the entire file into memory. By specifying a chunksize to read_csv, the return value will be an iterable object of type TextFileReader.
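Chunked reading with read_csv might look like the sketch below. The ten-row sample is generated for illustration, and the chunk size of 4 is an arbitrary choice; the reader is used as a context manager, which assumes pandas 1.2 or later.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample: ten rows of (i, i squared).
data = "x,y\n" + "".join(f"{i},{i * i}\n" for i in range(10))

sizes = []
total = 0
with pd.read_csv(StringIO(data), chunksize=4) as reader:
    for chunk in reader:  # each chunk is a DataFrame of up to 4 rows
        sizes.append(len(chunk))
        total += int(chunk["y"].sum())

print(sizes, total)
```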

Changed in version 1.2: read_csv/json/sas return a context manager when iterating through a file.

Specifying iterator=True will also return the TextFileReader object.
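With iterator=True, rows can be pulled on demand through the reader's get_chunk method, as in this sketch. The five-row sample and the chunk sizes are made up for illustration.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample: five rows.
data = "x,y\n" + "".join(f"{i},{i * 10}\n" for i in range(5))

with pd.read_csv(StringIO(data), iterator=True) as reader:
    first = reader.get_chunk(3)  # first three rows
    rest = reader.get_chunk(2)   # next two rows

print(first)
print(rest)
```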

Specifying the parser engine

pandas currently supports three engines: the C engine, the python engine, and an experimental pyarrow engine (requires the pyarrow package). In general, the pyarrow engine is fastest on larger workloads and is equivalent in speed to the C engine on most other workloads. The python engine tends to be slower than the pyarrow and C engines on most workloads. However, the pyarrow engine is much less robust than the C engine, which in turn lacks a few features compared to the python engine.

Where possible, pandas uses the C parser (specified as engine="c"), but it may fall back to Python if C-unsupported options are specified.

Currently, options unsupported by the C and pyarrow engines include:

  • sep other than a single character (e.g. regex separators)

  • skipfooter

  • sep=None with delim_whitespace=False

Specifying any of the above options will produce a ParserWarning unless the python engine is selected explicitly using engine="python".
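The fallback to the python engine can be observed with a short sketch. The sample and the two-character separator "--" are made up; a separator longer than one character is treated as a regular expression, which the C engine does not support.

```python
import warnings
from io import StringIO

import pandas as pd

# Hypothetical sample using a two-character separator.
data = "a--b--c\n1--2--3\n"

# Selecting the python engine explicitly avoids the warning.
df = pd.read_csv(StringIO(data), sep="--", engine="python")

# Without engine="python", pandas falls back and emits a ParserWarning.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    df_fallback = pd.read_csv(StringIO(data), sep="--")

fell_back = any(
    issubclass(w.category, pd.errors.ParserWarning) for w in caught
)
print(fell_back)
```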

Options that are unsupported by the pyarrow engine which are not covered by the list above include:

  • float_precision

  • chunksize

  • comment

  • nrows

  • thousands

  • memory_map

  • dialect

  • warn_bad_lines

  • error_bad_lines

  • on_bad_lines

  • delim_whitespace

  • quoting

  • lineterminator

  • converters

  • decimal

  • iterator

  • dayfirst

  • infer_datetime_format

  • In [35]: data = "col1,col2,col3\na,b,1\na,b,2\nc,d,3"
    
    In [36]: pd.read_csv(StringIO(data))
    Out[36]: 
      col1 col2  col3
    0    a    b     1
    1    a    b     2
    2    c    d     3
    
    In [37]: pd.read_csv(StringIO(data)).dtypes
    Out[37]: 
    col1    object
    col2    object
    col3     int64
    dtype: object
    
    In [38]: pd.read_csv(StringIO(data), dtype="category").dtypes
    Out[38]: 
    col1    category
    col2    category
    col3    category
    dtype: object
    
    _23

  • In [21]: data = "col_1\n1\n2\n'A'\n4.22"
    
    In [22]: df = pd.read_csv(StringIO(data), converters={"col_1": str})
    
    In [23]: df
    Out[23]: 
      col_1
    0     1
    1     2
    2   'A'
    3  4.22
    
    In [24]: df["col_1"].apply(type).value_counts()
    Out[24]: 
    <class 'str'>    4
    Name: col_1, dtype: int64
    
    _95

  • In [35]: data = "col1,col2,col3\na,b,1\na,b,2\nc,d,3"
    
    In [36]: pd.read_csv(StringIO(data))
    Out[36]: 
      col1 col2  col3
    0    a    b     1
    1    a    b     2
    2    c    d     3
    
    In [37]: pd.read_csv(StringIO(data)).dtypes
    Out[37]: 
    col1    object
    col2    object
    col3     int64
    dtype: object
    
    In [38]: pd.read_csv(StringIO(data), dtype="category").dtypes
    Out[38]: 
    col1    category
    col2    category
    col3    category
    dtype: object
    
    _25

Specifying these options with engine='pyarrow' will raise a ValueError.
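
A minimal sketch of this behaviour, assuming pandas 1.4 or later (where the pyarrow engine is available). The sample data is made up, and converters is used here only as one option that is assumed to be unsupported by that engine. The error is caught rather than allowed to propagate so that the message can be inspected.

```python
from io import StringIO

import pandas as pd

data = "col_1\n1\n2\n'A'\n4.22"

# converters is assumed to be unsupported by the pyarrow engine, so pandas
# is expected to reject this call before any parsing takes place.
error_message = None
try:
    pd.read_csv(StringIO(data), engine="pyarrow", converters={"col_1": str})
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

If the option is needed, dropping engine="pyarrow" (so that the default engine is used) avoids the error.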

Reading/writing remote files

You can pass a URL to many of pandas' IO functions to read or write remote files, for example to read a CSV file served over HTTP(S).
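
A minimal sketch of reading from a URL. To stay runnable without network access it uses a file:// URL that points at a temporary file (the file name and contents are made up for illustration); an http:// or https:// URL is passed to read_csv in the same way.

```python
import tempfile
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmp_dir:
    # Write a small CSV locally so the example needs no network access.
    csv_path = Path(tmp_dir) / "example.csv"
    csv_path.write_text("col1,col2\n1,2\n3,4\n")

    # Path.as_uri() builds a file:// URL; read_csv accepts URLs as well as paths.
    url = csv_path.as_uri()
    df = pd.read_csv(url)

print(df)
```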

New in version 1.3.0.

Custom headers can be sent alongside HTTP(s) requests by passing a dictionary of header key-value mappings to the storage_options keyword argument.
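
The sketch below illustrates this with a throwaway local HTTP server, so that no external site is contacted. The handler class, the header value and the served CSV are all hypothetical; the server simply echoes back the User-Agent header it receives, which makes it possible to see that the dictionary passed as storage_options was sent with the request. It assumes pandas 1.3.0 or later.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import pandas as pd


class EchoAgentHandler(BaseHTTPRequestHandler):
    """Hypothetical test handler: returns the request's User-Agent as a CSV."""

    def do_GET(self):
        agent = self.headers.get("User-Agent", "")
        body = f"agent\n{agent}\n".encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/csv")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # suppress per-request logging to keep the output quiet


# Port 0 asks the operating system for any free port.
server = HTTPServer(("127.0.0.1", 0), EchoAgentHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

url = f"http://127.0.0.1:{server.server_port}/data.csv"
headers = {"User-Agent": "pandas-example"}
try:
    # For HTTP(S) URLs, storage_options is interpreted as request headers.
    df = pd.read_csv(url, storage_options=headers)
finally:
    server.shutdown()
    server.server_close()

print(df)
```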

All URLs that are not local files or HTTP(s) are handled by fsspec, if installed, and its various filesystem implementations (including Amazon S3, Google Cloud, SSH, FTP, webHDFS…). Some of these implementations require additional packages to be installed; for example, S3 URLs require the s3fs library.

When dealing with remote storage systems, you might need extra configuration with environment variables or config files in special locations. For example, to access data in your S3 bucket, you will need to define credentials in one of the several ways listed in the S3Fs documentation. The same is true for several of the storage backends, and you should consult the fsspec documentation both for implementations built into fsspec and for those not included in the main fsspec distribution.

You can also pass parameters directly to the backend driver. For example, if you do not have S3 credentials, you can still access public data by specifying an anonymous connection with storage_options={"anon": True}.

New in version 1.2.0.

fsspec also allows complex URLs, for accessing data in compressed archives, local caching of files, and more. To cache the above example locally, you would prefix the URL with simplecache:: and pass storage_options={"s3": {"anon": True}}, which specifies that the "anon" parameter is meant for the "s3" part of the implementation, not for the caching implementation. Note that this caches to a temporary directory for the duration of the session only, but you can also specify a permanent store.
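
The difference between the plain and the cached form can be sketched as follows. The helper name s3_read_arguments and the bucket path are hypothetical, and the helper only builds the arguments; it performs no network access. The actual read would additionally require the s3fs package and a real public bucket.

```python
def s3_read_arguments(bucket_path: str, cache_locally: bool = False):
    """Return the (url, storage_options) pair for an anonymous S3 read.

    Hypothetical helper for illustration only; it builds the arguments
    and does not contact any storage backend.
    """
    if cache_locally:
        # Chained URL: options are keyed by protocol so that "anon" reaches
        # the s3 layer rather than the simplecache layer.
        return f"simplecache::s3://{bucket_path}", {"s3": {"anon": True}}
    # Plain S3 URL: options are passed straight to the s3 backend.
    return f"s3://{bucket_path}", {"anon": True}


plain_url, plain_options = s3_read_arguments("example-bucket/data.csv")
cached_url, cached_options = s3_read_arguments(
    "example-bucket/data.csv", cache_locally=True
)
print(plain_url, plain_options)
print(cached_url, cached_options)

# With s3fs installed and a real public bucket, the read itself would be:
# df = pd.read_csv(cached_url, storage_options=cached_options)
```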

Writing to CSV format

The Series and DataFrame objects have an instance method to_csv which allows storing the contents of the object as a comma-separated-values file. The function takes a number of arguments. Only the first is required.

  • path_or_buf: A string path to the file to write, or a file object. If a file object, it must be opened with newline=''

  • sep: Field delimiter for the output file (default ",")

  • na_rep: A string representation of a missing value (default '')

  • float_format: Format string for floating point numbers

  • columns: Columns to write (default None)

  • header: Whether to write out the column names (default True)

  • index: Whether to write row (index) names (default True)

  • index_label: Column label(s) for index column(s) if desired. If None (default), and header and index are True, then the index names are used. (A sequence should be given if the DataFrame uses MultiIndex)

  • mode: Python write mode, default 'w'

  • encoding: A string representing the encoding to use if the contents are non-ASCII, for Python versions prior to 3

  • lineterminator: Character sequence denoting line end (default os.linesep)

  • quoting: Set quoting rules as in the csv module (default csv.QUOTE_MINIMAL). Note that if you have set a float_format, then floats are converted to strings and csv.QUOTE_NONNUMERIC will treat them as non-numeric

  • quotechar: Character used to quote fields (default '"')

  • doublequote: Control quoting of quotechar in fields (default True)

  • escapechar: Character used to escape sep and quotechar when appropriate (default None)

  • chunksize: Number of rows to write at a time

  • date_format: Format string for datetime objects
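
A short sketch of how a few of these arguments combine. The frame and its column names are made up for illustration. When no path or buffer is given, to_csv returns the CSV text as a string, which is convenient for inspecting the result.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"name": ["a", "b"], "value": [1.5, np.nan], "unused": [10, 20]}
)

csv_text = df.to_csv(
    sep=";",                    # field delimiter
    na_rep="NA",                # text written for missing values
    float_format="%.2f",        # two decimal places for floats
    columns=["name", "value"],  # write only these columns
    index=False,                # omit the row labels
)

# Split on line boundaries so the result does not depend on the platform's
# line terminator.
lines = csv_text.splitlines()
print(lines)
```

Here the header row is written because header defaults to True, and the unused column is dropped because it is not listed in columns.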

Writing a formatted string

The DataFrame object has an instance method to_string which allows control over the string representation of the object. All arguments are optional:

  • In [35]: data = "col1,col2,col3\na,b,1\na,b,2\nc,d,3"
    
    In [36]: pd.read_csv(StringIO(data))
    Out[36]: 
      col1 col2  col3
    0    a    b     1
    1    a    b     2
    2    c    d     3
    
    In [37]: pd.read_csv(StringIO(data)).dtypes
    Out[37]: 
    col1    object
    col2    object
    col3     int64
    dtype: object
    
    In [38]: pd.read_csv(StringIO(data), dtype="category").dtypes
    Out[38]: 
    col1    category
    col2    category
    col3    category
    dtype: object
    
    63 default None, for example a StringIO object

  • In [35]: data = "col1,col2,col3\na,b,1\na,b,2\nc,d,3"
    
    In [36]: pd.read_csv(StringIO(data))
    Out[36]: 
      col1 col2  col3
    0    a    b     1
    1    a    b     2
    2    c    d     3
    
    In [37]: pd.read_csv(StringIO(data)).dtypes
    Out[37]: 
    col1    object
    col2    object
    col3     int64
    dtype: object
    
    In [38]: pd.read_csv(StringIO(data), dtype="category").dtypes
    Out[38]: 
    col1    category
    col2    category
    col3    category
    dtype: object
    
    40 default None, which columns to write

  • In [35]: data = "col1,col2,col3\na,b,1\na,b,2\nc,d,3"
    
    In [36]: pd.read_csv(StringIO(data))
    Out[36]: 
      col1 col2  col3
    0    a    b     1
    1    a    b     2
    2    c    d     3
    
    In [37]: pd.read_csv(StringIO(data)).dtypes
    Out[37]: 
    col1    object
    col2    object
    col3     int64
    dtype: object
    
    In [38]: pd.read_csv(StringIO(data), dtype="category").dtypes
    Out[38]: 
    col1    category
    col2    category
    col3    category
    dtype: object
    
    65 default None, minimum width of each column

  • In [35]: data = "col1,col2,col3\na,b,1\na,b,2\nc,d,3"
    
    In [36]: pd.read_csv(StringIO(data))
    Out[36]: 
      col1 col2  col3
    0    a    b     1
    1    a    b     2
    2    c    d     3
    
    In [37]: pd.read_csv(StringIO(data)).dtypes
    Out[37]: 
    col1    object
    col2    object
    col3     int64
    dtype: object
    
    In [38]: pd.read_csv(StringIO(data), dtype="category").dtypes
    Out[38]: 
    col1    category
    col2    category
    col3    category
    dtype: object
    
    38 default
    In [13]: import numpy as np
    
    In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11"
    
    In [15]: print(data)
    a,b,c,d
    1,2,3,4
    5,6,7,8
    9,10,11
    
    In [16]: df = pd.read_csv(StringIO(data), dtype=object)
    
    In [17]: df
    Out[17]: 
       a   b   c    d
    0  1   2   3    4
    1  5   6   7    8
    2  9  10  11  NaN
    
    In [18]: df["a"][0]
    Out[18]: '1'
    
    In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"})
    
    In [20]: df.dtypes
    Out[20]: 
    a      int64
    b     object
    c    float64
    d      Int64
    dtype: object
    
    46, representation of NA value

  • In [35]: data = "col1,col2,col3\na,b,1\na,b,2\nc,d,3"
    
    In [36]: pd.read_csv(StringIO(data))
    Out[36]: 
      col1 col2  col3
    0    a    b     1
    1    a    b     2
    2    c    d     3
    
    In [37]: pd.read_csv(StringIO(data)).dtypes
    Out[37]: 
    col1    object
    col2    object
    col3     int64
    dtype: object
    
    In [38]: pd.read_csv(StringIO(data), dtype="category").dtypes
    Out[38]: 
    col1    category
    col2    category
    col3    category
    dtype: object
    
    68 default None, a dictionary (by column) of functions each of which takes a single argument and returns a formatted string

  • In [35]: data = "col1,col2,col3\na,b,1\na,b,2\nc,d,3"
    
    In [36]: pd.read_csv(StringIO(data))
    Out[36]: 
      col1 col2  col3
    0    a    b     1
    1    a    b     2
    2    c    d     3
    
    In [37]: pd.read_csv(StringIO(data)).dtypes
    Out[37]: 
    col1    object
    col2    object
    col3     int64
    dtype: object
    
    In [38]: pd.read_csv(StringIO(data), dtype="category").dtypes
    Out[38]: 
    col1    category
    col2    category
    col3    category
    dtype: object
    
    39 default None, a function which takes a single (float) argument and returns a formatted string; to be applied to floats in the
    In [13]: import numpy as np
    
    In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11"
    
    In [15]: print(data)
    a,b,c,d
    1,2,3,4
    5,6,7,8
    9,10,11
    
    In [16]: df = pd.read_csv(StringIO(data), dtype=object)
    
    In [17]: df
    Out[17]: 
       a   b   c    d
    0  1   2   3    4
    1  5   6   7    8
    2  9  10  11  NaN
    
    In [18]: df["a"][0]
    Out[18]: '1'
    
    In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"})
    
    In [20]: df.dtypes
    Out[20]: 
    a      int64
    b     object
    c    float64
    d      Int64
    dtype: object
    
    43

  • In [35]: data = "col1,col2,col3\na,b,1\na,b,2\nc,d,3"
    
    In [36]: pd.read_csv(StringIO(data))
    Out[36]: 
      col1 col2  col3
    0    a    b     1
    1    a    b     2
    2    c    d     3
    
    In [37]: pd.read_csv(StringIO(data)).dtypes
    Out[37]: 
    col1    object
    col2    object
    col3     int64
    dtype: object
    
    In [38]: pd.read_csv(StringIO(data), dtype="category").dtypes
    Out[38]: 
    col1    category
    col2    category
    col3    category
    dtype: object
    
    71 default True, set to False for a
    In [13]: import numpy as np
    
    In [14]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11"
    
    In [15]: print(data)
    a,b,c,d
    1,2,3,4
    5,6,7,8
    9,10,11
    
    In [16]: df = pd.read_csv(StringIO(data), dtype=object)
    
    In [17]: df
    Out[17]: 
       a   b   c    d
    0  1   2   3    4
    1  5   6   7    8
    2  9  10  11  NaN
    
    In [18]: df["a"][0]
    Out[18]: '1'
    
    In [19]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"})
    
    In [20]: df.dtypes
    Out[20]: 
    a      int64
    b     object
    c    float64
    d      Int64
    dtype: object
    
    43 with a hierarchical index to print every MultiIndex key at each row

  • index_names: default True, will print the names of the indices

  • index: default True, will print the index (i.e., row labels)

  • header: default True, will print the column labels

  • justify: default left, will print column headers left- or right-justified
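The formatting options listed above can be tried on a small frame. This is a minimal sketch, assuming the options belong to DataFrame.to_string; the sample data and variable names are hypothetical.

```python
import pandas as pd

# Hypothetical sample data, not taken from the original text.
df = pd.DataFrame({"price": [1.5, 2.25], "qty": [3, 40]}, index=["apple", "pear"])

# float_format is called on each float cell; index=False drops the row labels.
text = df.to_string(float_format=lambda v: f"{v:.2f}", index=False)
print(text)

# header=False drops the column labels but keeps the row labels.
no_header = df.to_string(header=False)
print(no_header)

# sparsify=False repeats every MultiIndex key on every row.
mi = pd.MultiIndex.from_tuples([("a", 1), ("a", 2)])
wide = pd.DataFrame({"v": [10, 20]}, index=mi)
dense = wide.to_string(sparsify=False)
print(dense)
```

Passing a callable for float_format keeps the rounding rule next to the call instead of changing global display options.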

The Series object also has a to_string method, but with only the buf, na_rep, and float_format arguments. There is also a length argument which, if set to True, will additionally output the length of the Series
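A short sketch of the Series variant, assuming Series.to_string with the na_rep and length arguments; the values are made up for illustration.

```python
import pandas as pd

# Hypothetical Series with one missing value.
s = pd.Series([1.5, None, 3.0])

# na_rep replaces the missing value; length=True appends a "Length: N" footer.
text = s.to_string(na_rep="-", length=True)
print(text)
```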

JSON

Read and write JSON format files and strings

A Series or DataFrame can be converted to a valid JSON string. Use to_json with optional parameters

  • path_or_buf: the pathname or buffer to write the output. This can be None, in which case a JSON string is returned

  • orient

    Series
    • default is index

    • allowed values are {split, records, index}

    DataFrame
    • default is columns

    • allowed values are {split, records, index, columns, values, table}

    The format of the JSON string

    split: dict like {index -> [index], columns -> [columns], data -> [values]}

    records: list like [{column -> value}, … , {column -> value}]

    index: dict like {index -> {column -> value}}

    columns: dict like {column -> {index -> value}}

    values: just the values array

    table: adhering to the JSON Table Schema

  • date_format: string, type of date conversion, ‘epoch’ for timestamp, ‘iso’ for ISO8601

  • double_precision: the number of decimal places to use when encoding floating point values, default 10

  • force_ascii: force encoded string to be ASCII, default True

  • date_unit: the time unit to encode to, governs timestamp and ISO8601 precision. One of ‘s’, ‘ms’, ‘us’ or ‘ns’ for seconds, milliseconds, microseconds and nanoseconds respectively. Default ‘ms’

  • default_handler: the handler to call if an object cannot otherwise be converted to a suitable format for JSON. Takes a single argument, which is the object to convert, and returns a serializable object

  • lines: if the records orient is used, writes each record on its own line as JSON

Note

NaN's, NaT's and None will be converted to null, and datetime objects will be converted based on the date_format and date_unit parameters
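The null conversion can be checked by parsing the output with the standard json module. This is a minimal sketch with hypothetical data; the original example from this spot is not reproduced here.

```python
import json

import numpy as np
import pandas as pd

# Hypothetical frame holding a NaN and a None.
df = pd.DataFrame({"A": [1.0, np.nan], "B": ["x", None]})

# With no path given, to_json returns the JSON document as a string.
text = df.to_json()
print(text)

# Missing values come back as JSON null, which json.loads maps to None.
parsed = json.loads(text)
```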

Orient options

There are a number of different options for the format of the resulting JSON file / string. Consider the following DataFrame and Series
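A stand-in pair of objects for trying the orient options, assuming a small 3x3 frame and a named Series. The names dfjo and sjo and the values are illustrative assumptions.

```python
import pandas as pd

# Hypothetical 3x3 frame with string row labels.
dfjo = pd.DataFrame(
    {"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]},
    index=["x", "y", "z"],
)

# Hypothetical Series sharing the same labels, with a name.
sjo = pd.Series({"x": 15, "y": 16, "z": 17}, name="D")

print(dfjo)
print(sjo)
```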

Column oriented (the default for DataFrame) serializes the data as nested JSON objects with column labels acting as the primary index
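A minimal sketch of the column orientation, using a hypothetical frame named dfjo:

```python
import json

import pandas as pd

# Hypothetical frame; names and values are illustrative.
dfjo = pd.DataFrame(
    {"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]},
    index=["x", "y", "z"],
)

# Outer keys are column labels, inner keys are index labels.
text = dfjo.to_json(orient="columns")
parsed = json.loads(text)
print(text)
```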

Index oriented (the default for Series) is similar to column oriented, but the index labels are now primary
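A minimal sketch of the index orientation for both object types, using hypothetical objects dfjo and sjo:

```python
import json

import pandas as pd

# Hypothetical objects; names and values are illustrative.
dfjo = pd.DataFrame(
    {"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]},
    index=["x", "y", "z"],
)
sjo = pd.Series({"x": 15, "y": 16, "z": 17}, name="D")

# Outer keys are index labels, inner keys are column labels.
df_parsed = json.loads(dfjo.to_json(orient="index"))

# For a Series the result is a flat label -> value object.
s_text = sjo.to_json(orient="index")
s_parsed = json.loads(s_text)
print(df_parsed)
print(s_parsed)
```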

Record oriented serializes the data to a JSON array of column -> value records; index labels are not included. This is useful for passing DataFrame data to plotting libraries, for example the JavaScript library d3.js
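A minimal sketch of the record orientation, using hypothetical objects dfjo and sjo:

```python
import json

import pandas as pd

# Hypothetical objects; names and values are illustrative.
dfjo = pd.DataFrame(
    {"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]},
    index=["x", "y", "z"],
)
sjo = pd.Series({"x": 15, "y": 16, "z": 17}, name="D")

# One JSON object per row; the row labels x, y, z are dropped.
df_records = json.loads(dfjo.to_json(orient="records"))

# A Series becomes a plain array of its values.
s_records = json.loads(sjo.to_json(orient="records"))
print(df_records)
print(s_records)
```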

Value oriented is a bare-bones option which serializes to nested JSON arrays of values only; column and index labels are not included
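A minimal sketch of the values orientation, using a hypothetical frame named dfjo:

```python
import json

import pandas as pd

# Hypothetical frame; names and values are illustrative.
dfjo = pd.DataFrame(
    {"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]},
    index=["x", "y", "z"],
)

# Each inner array is one row; no labels are written at all.
values = json.loads(dfjo.to_json(orient="values"))
print(values)
```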

Split oriented serializes to a JSON object containing separate entries for values, index and columns. Name is also included for Series
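A minimal sketch of the split orientation, using hypothetical objects dfjo and sjo:

```python
import json

import pandas as pd

# Hypothetical objects; names and values are illustrative.
dfjo = pd.DataFrame(
    {"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]},
    index=["x", "y", "z"],
)
sjo = pd.Series({"x": 15, "y": 16, "z": 17}, name="D")

# Labels and data are stored in separate arrays, so their order is explicit.
df_split = json.loads(dfjo.to_json(orient="split"))

# The Series variant carries "name" in place of "columns".
s_split = json.loads(sjo.to_json(orient="split"))
print(df_split)
print(s_split)
```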

Table oriented serializes to the JSON Table Schema, allowing for the preservation of metadata including but not limited to dtypes and index names
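A minimal sketch of the table orientation, using a hypothetical frame named dfjo. The output has a "schema" entry describing the fields and a "data" entry with the rows.

```python
import json

import pandas as pd

# Hypothetical frame; names and values are illustrative.
dfjo = pd.DataFrame(
    {"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]},
    index=["x", "y", "z"],
)

table = json.loads(dfjo.to_json(orient="table"))

# An unnamed index is written as a field called "index".
field_names = [field["name"] for field in table["schema"]["fields"]]
print(field_names)
print(table["data"][0])
```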

Note

Any orient option that encodes to a JSON object will not preserve the ordering of index and column labels during round-trip serialization. If you wish to preserve label ordering use the split option as it uses ordered containers

Date handling

Writing in ISO date format
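A minimal sketch of ISO output, assuming to_json with date_format="iso"; the frame dfd is hypothetical. Whether a trailing "Z" is written for timezone-naive values has varied between pandas versions, so only the leading part of the string is relied on here.

```python
import json

import pandas as pd

# Hypothetical frame with one datetime column.
dfd = pd.DataFrame(
    {"when": pd.to_datetime(["2013-01-01", "2013-01-02"]), "n": [1, 2]}
)

# date_format="iso" writes datetimes as ISO 8601 strings.
text = dfd.to_json(date_format="iso")
parsed = json.loads(text)
print(text)
```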

Writing in ISO date format, with microseconds
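A minimal sketch of microsecond precision, assuming date_unit="us" together with date_format="iso"; the timestamp is hypothetical.

```python
import json

import pandas as pd

# Hypothetical timestamp carrying microseconds.
dfd = pd.DataFrame({"when": [pd.Timestamp("2013-01-01 00:00:00.123456")]})

# date_unit="us" keeps six fractional digits instead of the default three.
text = dfd.to_json(date_format="iso", date_unit="us")
parsed = json.loads(text)
print(text)
```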

Epoch timestamps, in seconds
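A minimal sketch of epoch output in seconds, assuming date_format="epoch" with date_unit="s"; the frame dfd is hypothetical.

```python
import json

import pandas as pd

# Hypothetical frame with one datetime column.
dfd = pd.DataFrame({"when": pd.to_datetime(["2013-01-01", "2013-01-02"])})

# Datetimes are written as integer seconds since 1970-01-01 UTC.
text = dfd.to_json(date_format="epoch", date_unit="s")
parsed = json.loads(text)
print(text)
```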

Writing to a file, with a date index and a date column
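A minimal sketch of writing to a file, assuming a temporary directory and the hypothetical file name test.json. ISO output is requested explicitly so the dates stay readable in the file.

```python
import json
import os
import tempfile

import pandas as pd

# Hypothetical frame with a date index and a date column.
idx = pd.date_range("2013-01-01", periods=2, freq="D")
dfj2 = pd.DataFrame({"A": [1, 2], "date": pd.Timestamp("2013-01-01")}, index=idx)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.json")
    # Passing a path writes the JSON to disk instead of returning a string.
    dfj2.to_json(path, date_format="iso")
    with open(path) as fh:
        parsed = json.load(fh)

print(parsed)
```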

Fallback behavior

If the JSON serializer cannot handle the container contents directly, it will fall back in the following manner

  • if the dtype is unsupported (e.g. np.complex_) then the default_handler, if provided, will be called for each value, otherwise an exception is raised

  • if an object is unsupported it will attempt the following

    • check if the object has defined a toDict method and call it. A toDict method should return a dict which will then be JSON serialized

    • invoke the default_handler if one was provided

    • convert the object to a dict by traversing its contents. However this will often fail with an OverflowError or give unexpected results

In general the best approach for unsupported objects or dtypes is to provide a default_handler. For example, a frame that contains complex numbers cannot be serialized directly, but it can be dealt with by specifying a simple default_handler such as str.
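A minimal sketch of the default_handler approach, assuming str is an acceptable stand-in for whatever conversion an application actually needs. The frame contents are made up for illustration.

```python
import pandas as pd

# A column holding a complex value, which the serializer cannot encode itself.
frame = pd.DataFrame([1.0, 2.0, complex(1.0, 2.0)])

# str is the simplest possible handler; any callable returning a
# JSON-serializable object could be passed instead.
encoded = frame.to_json(default_handler=str)
print(encoded)
```

Each unsupported value is passed through the handler, so the complex numbers end up as quoted strings in the output.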

Reading JSON

Reading a JSON string to pandas object can take a number of parameters. The parser will try to parse a DataFrame if typ is not supplied or is None. To explicitly force Series parsing, pass typ="series"
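As a small illustration of forcing Series parsing, the sketch below reads a flat JSON object. The payload is invented for the example, and the string is wrapped in StringIO because recent pandas versions discourage passing literal JSON text directly.

```python
from io import StringIO

import pandas as pd

# Hypothetical flat JSON object: keys become the Series index.
payload = '{"a": 1, "b": 2, "c": 3}'

as_series = pd.read_json(StringIO(payload), typ="series")
print(as_series)
```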

  • filepath_or_buffer : a VALID JSON string or file handle / StringIO. The string could be a URL. Valid URL schemes include http, ftp, S3, and file. For file URLs, a host is expected. For instance, a local file could be file://localhost/path/to/table.json

  • typ : type of object to recover (series or frame), default 'frame'

  • orient :

    Series
    • default is 'index'

    • allowed values are {'split', 'records', 'index'}

    DataFrame
    • default is 'columns'

    • allowed values are {'split', 'records', 'index', 'columns', 'values', 'table'}

    The format of the JSON string

    split : dict like {index -> [index], columns -> [columns], data -> [values]}

    records : list like [{column -> value}, … , {column -> value}]

    index : dict like {index -> {column -> value}}

    columns : dict like {column -> {index -> value}}

    values : just the values array

    table : adhering to the JSON Table Schema

  • dtype : if True, infer dtypes; if a dict of column to dtype, then use those; if False, then don't infer dtypes at all. Default is True. Applies only to the data

  • convert_axes : boolean, try to convert the axes to the proper dtypes, default is True

  • convert_dates : a list of columns to parse for dates; if True, then try to parse date-like columns, default is True

  • keep_default_dates : boolean, default True. If parsing dates, then parse the default date-like columns

  • numpy : direct decoding to NumPy arrays, default is False. Supports numeric data only, although labels may be non-numeric. Also note that the JSON ordering MUST be the same for each term if numpy=True

  • precise_float : boolean, default False. Set to enable usage of higher precision (strtod) function when decoding string to double values. The default (False) is to use fast but less precise builtin functionality

  • date_unit : string, the timestamp unit to detect if converting dates. Default None. By default the timestamp precision will be detected; if this is not desired, then pass one of 's', 'ms', 'us' or 'ns' to force timestamp precision to seconds, milliseconds, microseconds or nanoseconds respectively

  • lines : reads file as one JSON object per line

  • encoding : the encoding to use to decode py3 bytes

  • chunksize : when used in combination with lines=True, return a JsonReader which reads in chunksize lines per iteration
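The lines and chunksize parameters from the list above can be sketched as follows. The line-delimited payload and the chunk size of 2 are assumptions picked to keep the example small.

```python
from io import StringIO

import pandas as pd

# Hypothetical line-delimited JSON: one object per line.
json_lines = '{"a": 1, "b": 2}\n{"a": 3, "b": 4}\n{"a": 5, "b": 6}\n'

# Read everything at once.
whole = pd.read_json(StringIO(json_lines), lines=True)

# Read in chunks: the return value is an iterator of DataFrames.
reader = pd.read_json(StringIO(json_lines), lines=True, chunksize=2)
chunk_lengths = [len(chunk) for chunk in reader]

print(whole)
print(chunk_lengths)
```

Chunked reading is mainly useful when the file is too large to hold in memory as a single frame.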

The parser will raise one of ValueError/TypeError/AssertionError if the JSON is not parseable
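A defensive read might catch those exception types together. This is only a sketch: the truncated payload is invented, and which of the three exceptions is raised for a given input is not specified here.

```python
from io import StringIO

import pandas as pd

malformed = '{"a": [1, 2'  # deliberately truncated JSON

try:
    pd.read_json(StringIO(malformed))
except (ValueError, TypeError, AssertionError) as exc:
    # Record the exception type so the caller can report or log it.
    error_type = type(exc).__name__
else:
    error_type = None

print(error_type)
```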

If a non-default orient was used when encoding to JSON, be sure to pass the same option here so that decoding produces sensible results; see the orient options above for an overview
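To make the orient layouts concrete, the sketch below encodes one small frame with several orient values and decodes one of them again. The frame and its labels are assumptions for illustration; the table orient is left out to keep the output short.

```python
import json
from io import StringIO

import pandas as pd

# Hypothetical 2x2 frame with string row labels.
frame = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=["x", "y"])

encoded = {}
for orient in ["split", "records", "index", "columns", "values"]:
    encoded[orient] = frame.to_json(orient=orient)
    print(orient, encoded[orient])

# Decoding with the same orient that was used for encoding recovers the labels.
restored = pd.read_json(StringIO(encoded["split"]), orient="split")
print(restored)
```

Note that records and values drop the row labels entirely, so they cannot be recovered on the way back.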

Data conversion

The default of convert_axes=True, dtype=True, and convert_dates=True will try to parse the axes, and all of the data into appropriate types, including dates. If you need to override specific dtypes, pass a dict to dtype. convert_axes should only be set to False if you need to preserve string-like numbers (e.g. '1', '2') in an axis
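The effect of convert_axes on string-like labels can be checked with a short sketch. The frame and its labels are assumptions; printing both results shows how the index differs between the two calls.

```python
from io import StringIO

import pandas as pd

# Hypothetical frame whose row labels are numeric-looking strings.
frame = pd.DataFrame({"a": [10, 20]}, index=["1", "2"])
encoded = frame.to_json()

# Default behaviour: the parser attempts to convert the axes.
converted = pd.read_json(StringIO(encoded))
# With convert_axes=False the labels are kept as the strings found in the JSON.
preserved = pd.read_json(StringIO(encoded), convert_axes=False)

print(converted.index)
print(preserved.index)
```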

Note

Large integer values may be converted to dates if convert_dates=True and the data and / or column labels appear 'date-like'. The exact threshold depends on the date_unit specified. 'date-like' means that the column label meets one of the following criteria

  • it ends with '_at'

  • it ends with '_time'

  • it begins with 'timestamp'

  • it is 'modified'

  • it is 'date'
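The label-based date detection can be illustrated with two columns that hold the same integer. This is a sketch under assumptions: the column names and the epoch-seconds value are invented for the example.

```python
from io import StringIO

import pandas as pd

# Both columns hold the same epoch-seconds integer; only the label differs.
payload = '{"created_at": {"0": 1356998400}, "value": {"0": 1356998400}}'

# Default settings: the label ending in "_at" is treated as date-like.
default = pd.read_json(StringIO(payload))
# keep_default_dates=False switches the label-based detection off.
untouched = pd.read_json(StringIO(payload), keep_default_dates=False)

print(default.dtypes)
print(untouched.dtypes)
```

If a column with a date-like label really holds plain integers, disabling the default detection or passing an explicit dtype avoids the surprise.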

Warning

When reading JSON data, automatic coercing into dtypes has some quirks

  • an index can be reconstructed in a different order from serialization, that is, the returned order is not guaranteed to be the same as before serialization

  • a column that was float data will be converted to integer if it can be done safely, e.g. a column of 1.

  • bool columns will be converted to integer on reconstruction

Thus there are times when you may want to specify dtypes explicitly via the dtype keyword argument

Reading from a JSON string

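As a minimal sketch of this case (the sample text and the name json_text are illustrative assumptions, not the original example), a JSON document in the default column-oriented layout can be wrapped in StringIO and handed to read_json:

```python
from io import StringIO

import pandas as pd

# Hypothetical sample in the default "columns" layout: {column -> {index -> value}}
json_text = '{"A": {"0": 1.5, "1": 2.5}, "B": {"0": "x", "1": "y"}}'

# Wrapping the text in StringIO works across pandas versions; newer releases
# deprecate passing the raw JSON string directly.
df = pd.read_json(StringIO(json_text))
print(df)
```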

Reading from a file

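A small sketch of reading from a file, assuming a throwaway file named test.json in a temporary directory (both are placeholders chosen for this illustration):

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"A": [1.5, 2.5], "B": ["x", "y"]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "test.json")  # hypothetical file name
    df.to_json(path)             # write the frame to disk as JSON
    loaded = pd.read_json(path)  # read it back by passing the path

print(loaded)
```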

Don’t convert any data (but still convert axes and dates)

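One way to illustrate this, assuming a made-up three-column sample, is to pass dtype=object so that the column values are left as generic Python objects:

```python
from io import StringIO

import pandas as pd

# Hypothetical sample with integer, float and boolean columns
json_text = (
    '{"ints": {"0": 1, "1": 2},'
    ' "floats": {"0": 0.5, "1": 1.5},'
    ' "bools": {"0": true, "1": false}}'
)

# dtype=object asks read_json not to coerce the column data
raw = pd.read_json(StringIO(json_text), dtype=object)
print(raw.dtypes)
```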

Specify dtypes for conversion

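A sketch of per-column conversion, assuming a sample with columns named A and bools (the names and the chosen target dtypes are illustrative):

```python
from io import StringIO

import pandas as pd

json_text = '{"A": {"0": 1.5, "1": 2.5}, "bools": {"0": true, "1": false}}'

# A dict maps column names to the dtype each column should be cast to
typed = pd.read_json(
    StringIO(json_text),
    dtype={"A": "float32", "bools": "int8"},
)
print(typed.dtypes)
```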

Preserve string indices

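The following sketch assumes a small frame whose index labels are the strings "0" and "1". By default the labels are coerced to numbers on reading; passing convert_axes=False keeps them as strings:

```python
from io import StringIO

import pandas as pd

frame = pd.DataFrame({"a": [1.0, 2.0]}, index=["0", "1"])
json_text = frame.to_json()

# Default behaviour: axis labels that look numeric are converted
converted = pd.read_json(StringIO(json_text))

# convert_axes=False leaves the string labels untouched
preserved = pd.read_json(StringIO(json_text), convert_axes=False)

print(converted.index)
print(preserved.index)
```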

Dates written in nanoseconds need to be read back in nanoseconds

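A rough sketch of the round trip, assuming a single column named date (a label that read_json treats as date-like). The same date_unit is given on both sides so the epoch values are interpreted consistently:

```python
from io import StringIO

import pandas as pd

dates = pd.DataFrame({"date": pd.to_datetime(["2013-01-01", "2013-01-02"])})

# Write epoch timestamps with nanosecond resolution
json_ns = dates.to_json(date_unit="ns")

# Read them back, stating that the stored integers are nanoseconds
restored = pd.read_json(StringIO(json_ns), date_unit="ns")
print(restored)
```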

The Numpy parameter

Note

This parameter has been deprecated as of version 1.0.0 and will raise a FutureWarning

This supports numeric data only. Index and column labels may be non-numeric, e.g. strings, dates etc

If numpy=True is passed to read_json, an attempt will be made to sniff an appropriate dtype during deserialization and to subsequently decode directly to NumPy arrays, bypassing the need for intermediate Python objects

This can provide speedups if you are deserialising a large amount of numeric data


The speedup is less noticeable for smaller datasets


Warning

Direct NumPy decoding makes a number of assumptions and may fail or produce unexpected output if these assumptions are not satisfied

  • data is numeric

  • data is uniform. The dtype is sniffed from the first value decoded. A ValueError may be raised, or incorrect output may be produced if this condition is not satisfied

  • labels are ordered. Labels are only read from the first container; it is assumed that each subsequent row / column has been encoded in the same order. This should be satisfied if the data was encoded using to_json but may not be the case if the JSON is from another source

Normalization

pandas provides a utility function to take a dict or list of dicts and normalize this semi-structured data into a flat table

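As an illustration, and assuming the utility meant here is pd.json_normalize, the sketch below flattens a hypothetical list of nested records. Nested keys are joined with a dot to form the column names:

```python
import pandas as pd

# Hypothetical semi-structured records with one level of nesting
records = [
    {"id": 1, "name": {"first": "Ada", "last": "Lovelace"}},
    {"id": 2, "name": {"first": "Alan", "last": "Turing"}},
]

# Each nested key becomes its own column, e.g. "name.first"
flat = pd.json_normalize(records)
print(flat)
```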

The max_level parameter provides more control over the level at which normalization stops. With max_level=1, normalization stops after the first nesting level of the provided dict
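A short sketch of max_level, assuming pd.json_normalize and a made-up record with two levels of nesting. Anything below the first level is left as a dict inside a single column:

```python
import pandas as pd

# Hypothetical record: "Lookup" contains a further nested dict "UserField"
nested = [
    {
        "CreatedBy": {"Name": "User001"},
        "Lookup": {
            "TextField": "Some text",
            "UserField": {"Id": "ID001", "Name": "Name001"},
        },
    }
]

# Stop flattening after the first nesting level
shallow = pd.json_normalize(nested, max_level=1)
print(shallow.columns.tolist())
```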

Line delimited json

pandas is able to read and write line-delimited json files that are common in data processing pipelines using Hadoop or Spark

For line-delimited json files, pandas can also return an iterator which reads in chunksize lines at a time. This can be useful for large files or to read from a stream

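The sketch below assumes a three-line sample held in memory. It reads the lines in one go with lines=True, then again in chunks of two lines using the chunksize iterator, and finally writes the frame back out as line-delimited JSON:

```python
from io import StringIO

import pandas as pd

# Hypothetical line-delimited JSON: one record per line
jsonl = '{"a": 1, "b": 2}\n{"a": 3, "b": 4}\n{"a": 5, "b": 6}\n'

# Read everything at once
whole = pd.read_json(StringIO(jsonl), lines=True)

# Read two lines at a time; the reader yields one DataFrame per chunk
with pd.read_json(StringIO(jsonl), lines=True, chunksize=2) as reader:
    chunk_sizes = [len(chunk) for chunk in reader]

# Write the frame back out, one record per line
written = whole.to_json(orient="records", lines=True)
print(chunk_sizes)
print(written)
```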

Table schema

Table Schema is a spec for describing tabular datasets as a JSON object. The JSON includes information on the field names, types, and other attributes. You can use the orient table to build a JSON string with two fields, schema and data

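To make the two fields concrete, here is a small sketch that serializes a hypothetical frame with orient="table" and inspects the result with the standard json module. The frame contents and the index name idx are assumptions for illustration:

```python
import json

import pandas as pd

df = pd.DataFrame(
    {"A": [1, 2, 3], "B": ["a", "b", "c"]},
    index=pd.Index(range(3), name="idx"),
)

# Serialize with the Table Schema orient and parse the text back into a dict
payload = json.loads(df.to_json(orient="table"))

print(sorted(payload))              # the two top-level fields
print(payload["schema"]["fields"])  # column name / type pairs, index included
print(payload["data"][0])           # first record of the serialized data
```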

The schema field contains the fields key, which itself contains a list of column name to type pairs, including the Index or MultiIndex (see below for a list of types). The schema field also contains a primaryKey field if the (Multi)index is unique

The second field, data, contains the serialized data with the records orient. The index is included, and any datetimes are ISO 8601 formatted, as required by the Table Schema spec

The full list of supported types is described in the Table Schema spec. This table shows the mapping from pandas types

pandas type        Table Schema type
int64              integer
float64            number
bool               boolean
datetime64[ns]     datetime
timedelta64[ns]    duration
categorical        any
object             str

A few notes on the generated table schema

  • The schema object contains a pandas_version field. This contains the version of pandas’ dialect of the schema, and will be incremented with each revision

  • All dates are converted to UTC when serializing, even timezone-naive values, which are treated as UTC with an offset of 0

  • datetimes with a timezone (before serializing) include an additional field tz with the time zone name (e.g. 'US/Central')

  • Periods are converted to timestamps before serialization, and so have the same behavior of being converted to UTC. In addition, periods will contain an additional field freq with the period’s frequency, e.g. 'A-DEC'

  • Categoricals use the any type and an enum constraint listing the set of possible values. Additionally, an ordered field is included

  • A primaryKey field, containing an array of labels, is included if the index is unique

  • The primaryKey behavior is the same with MultiIndexes, but in this case the primaryKey is an array

  • The default naming roughly follows these rules

    • For series, the object.name is used. If that’s None, then the name is values

    • For DataFrames, the stringified version of the column name is used

    • For Index (not MultiIndex), index.name is used, with a fallback to index if that is None

    • For MultiIndex, mi.names is used. If any level has no name, then level_i is used, where i is the position of the level
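Some of these notes can be checked directly. The sketch below assumes the helper build_table_schema from pandas.io.json and two made-up Series: an unnamed categorical one (to show the any type, the enum constraint, the ordered field and the default names) and one with an unnamed MultiIndex (to show the default level names):

```python
import pandas as pd
from pandas.io.json import build_table_schema

# Unnamed categorical Series: falls back to the default name "values"
unnamed = pd.Series(pd.Categorical(["a", "b", "a"]))
schema = build_table_schema(unnamed)
print(schema["fields"])
print(schema["primaryKey"])

# Unnamed MultiIndex levels receive generated default names
multi = pd.Series(
    range(4),
    index=pd.MultiIndex.from_product([("x", "y"), (0, 1)]),
)
multi_key = list(build_table_schema(multi)["primaryKey"])
print(multi_key)
```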

read_json also accepts orient='table' as an argument. This allows for the preservation of metadata such as dtypes and index names in a round-trippable manner

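A minimal round-trip sketch, assuming a small frame with an integer column, a categorical column and a named index (all names are placeholders). The categorical dtype and the index name survive the trip through JSON:

```python
from io import StringIO

import pandas as pd

df = pd.DataFrame(
    {"foo": [1, 2, 3, 4], "qux": pd.Categorical(list("abcd"))},
    index=pd.Index(range(4), name="idx"),
)

# Write with the Table Schema orient, then read back with the same orient
text = df.to_json(orient="table")
restored = pd.read_json(StringIO(text), orient="table")

print(restored.dtypes)
print(restored.index.name)
```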

Please note that the literal string 'index' as the name of an Index is not round-trippable, nor are any names beginning with 'level_' within a MultiIndex. These are used by default in DataFrame.to_json() to indicate missing values and the subsequent read cannot distinguish the intent

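The effect can be seen in a short sketch, assuming a frame whose index is deliberately named 'index'. After the round trip the name is gone, because the reader treats it as the placeholder for a missing name:

```python
from io import StringIO

import pandas as pd

df = pd.DataFrame({"foo": [1, 2]}, index=pd.Index(["a", "b"], name="index"))

text = df.to_json(orient="table")
restored = pd.read_json(StringIO(text), orient="table")

print(df.index.name)        # the name before serialization
print(restored.index.name)  # the name after the round trip
```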

When using orient='table' along with user-defined ExtensionArray, the generated schema will contain an additional extDtype key in the respective fields element. This extra key is not standard but does enable JSON roundtrips for extension types (e.g. read_json(df.to_json(orient="table"), orient="table"))

The extDtype key carries the name of the extension. If you have properly registered the ExtensionDtype, pandas will use said name to perform a lookup into the registry and re-convert the serialized data into your custom dtype
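As a rough illustration, the sketch below uses the built-in nullable Int64 dtype as a stand-in for a user-defined extension type; this substitution is an assumption made to keep the example self-contained, and it presumes a pandas version that writes the extDtype key:

```python
import json
from io import StringIO

import pandas as pd

# Int64 is a built-in extension dtype, used here in place of a custom one
df = pd.DataFrame({"a": pd.array([1, None, 3], dtype="Int64")})

text = df.to_json(orient="table")

# The second entry in "fields" describes column "a" (the first is the index)
field = json.loads(text)["schema"]["fields"][1]
print(field)

# Reading back with the same orient restores the extension dtype
restored = pd.read_json(StringIO(text), orient="table")
print(restored.dtypes)
```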

HTML

Reading HTML content

Warning

We highly encourage you to read the HTML Table Parsing gotchas below regarding the issues surrounding the BeautifulSoup4/html5lib/lxml parsers

The top-level read_html() function can accept an HTML string/file/URL and will parse HTML tables into a list of pandas DataFrames. Let’s look at a few examples

Note

read_html returns a list of DataFrame objects, even if there is only a single table contained in the HTML content

Read a URL with no options


Note

The data from the above URL changes every Monday so the resulting data above may be slightly different

Read in the content of the file from the above URL and pass it to read_html as a string


You can even pass in an instance of StringIO if you so desire

In [6]: data = "col1,col2,col3\na,b,1"

In [7]: df = pd.read_csv(StringIO(data))

In [8]: df.columns = [f"pre_{col}" for col in df.columns]

In [9]: df
Out[9]: 
  pre_col1 pre_col2  pre_col3
0        a        b         1
21
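
As a minimal sketch of the StringIO route, the snippet below parses a small hand-written table. The HTML text and the helper name read_tables are illustrative assumptions, not part of the original examples, and read_html needs an HTML parser such as lxml (or bs4 with html5lib) to be installed.

```python
from io import StringIO

import pandas as pd

# Illustrative HTML; any markup containing a <table> is handled the same way.
HTML_DOC = """
<table>
  <thead>
    <tr><th>name</th><th>score</th></tr>
  </thead>
  <tbody>
    <tr><td>alice</td><td>1</td></tr>
    <tr><td>bob</td><td>2</td></tr>
  </tbody>
</table>
"""


def read_tables(html: str) -> list:
    """Parse every <table> in an HTML string into a list of DataFrames."""
    # Wrapping the text in StringIO hands read_html a file-like object.
    return pd.read_html(StringIO(html))
```

Calling read_tables(HTML_DOC) returns a list, so the single table here is read_tables(HTML_DOC)[0].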

Note

The following examples are not run by the IPython evaluator because having so many network-accessing functions slows down the documentation build. If you spot an error or an example that doesn’t run, please do not hesitate to report it on the pandas GitHub issues page

Read a URL and match a table that contains specific text by passing match, as in pd.read_html(url, match=text), where text is a string or regular expression

Specify a header row with header, as in pd.read_html(url, header=0). By default, <th> or <td> elements located within a <thead> are used to form the column index (if multiple rows are contained within <thead> then a MultiIndex is created); if specified, the header row is taken from the data minus the parsed header elements (<th> elements)

Specify an index column with index_col, as in pd.read_html(url, index_col=0)

Specify a number of rows to skip with skiprows, as in pd.read_html(url, skiprows=0)

Specify a number of rows to skip using a list (range works as well), as in pd.read_html(url, skiprows=range(2))

Specify an HTML attribute with attrs, as in pd.read_html(url, attrs={"id": "table"})
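
The options above can be tried without network access on an in-memory document. The following is a sketch only: the two tables, their ids, and the helper names are made up for illustration, and an HTML parser such as lxml is assumed to be installed.

```python
from io import StringIO

import pandas as pd

# Two illustrative tables; the ids and cell values are invented for this sketch.
HTML_DOC = """
<table id="banks">
  <tr><th>Bank</th><th>City</th></tr>
  <tr><td>First Bank</td><td>Springfield</td></tr>
  <tr><td>Second Bank</td><td>Shelbyville</td></tr>
</table>
<table id="rates">
  <tr><th>Year</th><th>Rate</th></tr>
  <tr><td>2020</td><td>1.5</td></tr>
</table>
"""


def tables_matching(html: str, text: str) -> list:
    """Keep only tables containing `text`, using the first column as index."""
    return pd.read_html(StringIO(html), match=text, index_col=0)


def tables_with_id(html: str, table_id: str) -> list:
    """Keep only tables whose id attribute equals `table_id`."""
    return pd.read_html(StringIO(html), attrs={"id": table_id})
```

Here tables_matching(HTML_DOC, "Shelbyville") keeps only the first table, while tables_with_id(HTML_DOC, "rates") keeps only the second.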

Specify values that should be converted to NaN with na_values, which takes a list of strings to treat as missing

Specify whether to keep the default set of NaN values with keep_default_na, as in pd.read_html(url, keep_default_na=False)

Specify converters for columns. This is useful for numerical text data that has leading zeros. By default columns that are numerical are cast to numeric types and the leading zeros are lost. To avoid this, we can convert these columns to strings by passing converters, a dictionary that maps a column name to a function such as str
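
A small sketch of the leading-zero problem, assuming an HTML parser such as lxml is installed. The table, the column name code, and the helper read_codes are hypothetical and exist only to show the effect of converters.

```python
from io import StringIO

import pandas as pd

# Illustrative table whose "code" column has leading zeros.
HTML_DOC = """
<table>
  <tr><th>country</th><th>code</th></tr>
  <tr><td>A</td><td>007</td></tr>
  <tr><td>B</td><td>042</td></tr>
</table>
"""


def read_codes(html: str, keep_zeros: bool) -> pd.DataFrame:
    """Read the table, optionally keeping the code column as text."""
    # Mapping the column to str stops pandas from casting it to integers.
    converters = {"code": str} if keep_zeros else None
    return pd.read_html(StringIO(html), converters=converters)[0]
```

Without the converter the codes are parsed as integers and the zeros are lost; with it they stay as the original strings.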

Use some combination of the above, as in pd.read_html(url, match=text, index_col=0)

Read in pandas to_html output (with some loss of floating point precision)

The lxml backend will raise an error on a failed parse if that is the only parser you provide. If you only have a single parser you can provide just a string, but it is considered good practice to pass a list with one string if, for example, the function expects a sequence of strings. You may use flavor=["lxml"], as in pd.read_html(url, flavor=["lxml"])

Or you could pass flavor="lxml" without a list

However, if you have bs4 and html5lib installed and pass None or ["lxml", "bs4"] as flavor then the parse will most likely succeed. Note that as soon as a parse succeeds, the function will return

Links can be extracted from cells along with the text using extract_links="all"

New in version 1.5.0
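
The sketch below shows link extraction on a one-cell table. It uses extract_links="body" rather than "all" so that the column label stays a plain string; the HTML and the helper name are illustrative assumptions, and pandas 1.5 or later plus an HTML parser such as lxml are assumed.

```python
from io import StringIO

import pandas as pd

# Illustrative table with one hyperlinked cell.
HTML_DOC = """
<table>
  <tr><th>GitHub</th></tr>
  <tr><td><a href="https://github.com/pandas-dev/pandas">pandas</a></td></tr>
</table>
"""


def read_with_links(html: str) -> pd.DataFrame:
    """Return the first table, with body cells as (text, href) tuples."""
    return pd.read_html(StringIO(html), extract_links="body")[0]
```

Each body cell then holds a tuple of the cell text and the link target, so the target can be picked out with ordinary tuple indexing.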

Writing to HTML files

DataFrame objects have an instance method to_html which renders the contents of the DataFrame as an HTML table. The function arguments are as in the method to_string described above

Note

Not all of the possible options for DataFrame.to_html are shown here for brevity’s sake. See the DataFrame.to_html reference for the full set of options

Note

In an HTML-rendering supported environment like a Jupyter Notebook, display(HTML(...)) will render the raw HTML into the environment

The columns argument will limit the columns shown

float_format takes a Python callable to control the precision of floating point values

bold_rows will make the row labels bold by default, but you can turn that off
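
The three arguments described above can be compared side by side. This is a minimal sketch on a made-up DataFrame; the column names and values are assumptions chosen only to make the differences visible in the generated markup.

```python
import pandas as pd

# Illustrative frame with string row labels and float values.
df = pd.DataFrame(
    {"a": [1.23456, 2.5], "b": [3.0, 4.75]},
    index=["r1", "r2"],
)

# Render only column "a".
only_a = df.to_html(columns=["a"])

# Format every float with two decimal places.
two_decimals = df.to_html(float_format="{0:.2f}".format)

# Emit row labels as plain <td> cells instead of bold <th> cells.
plain_labels = df.to_html(bold_rows=False)
```

Printing each string shows the effect directly: the b column is missing from only_a, the values in two_decimals are rounded for display, and the row labels in plain_labels are no longer header cells.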

The classes argument provides the ability to give the resulting HTML table CSS classes. Note that these classes are appended to the existing 'dataframe' class

The render_links argument provides the ability to add hyperlinks to cells that contain URLs
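
A short sketch of both arguments, using a made-up one-row DataFrame. The class name awesome_table_class is an arbitrary example, not a class that pandas defines.

```python
import pandas as pd

# Illustrative frame with one cell that contains a URL.
df = pd.DataFrame({"name": ["Python"], "url": ["https://www.python.org/"]})

# Extra CSS classes are appended after the built-in "dataframe" class.
styled = df.to_html(classes=["awesome_table_class"])

# Cells that hold URLs are wrapped in <a> tags.
linked = df.to_html(render_links=True)
```

The opening table tag in styled carries both class names, and the url cell in linked becomes a clickable link.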

Finally, the escape argument allows you to control whether the “<”, “>” and “&” characters are escaped in the resulting HTML (by default it is True). So to get the HTML without escaped characters pass escape=False

Escaped (the default), as in df.to_html()

Not escaped, as in df.to_html(escape=False)
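
The difference between the two modes is easiest to see in the raw markup. The sketch below uses a made-up single-column DataFrame holding the three special characters.

```python
import pandas as pd

# One cell for each character that HTML treats specially.
df = pd.DataFrame({"a": list("&<>")})

# Default: special characters are replaced by HTML entities.
escaped = df.to_html()

# escape=False writes the characters into the markup unchanged.
raw = df.to_html(escape=False)
```

In escaped the cells contain entities such as &amp;lt;, while raw contains the bare characters, which a browser may interpret as markup.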

Note

Some browsers may not show a difference in the rendering of the previous two HTML tables

HTML Table Parsing Gotchas

There are some versioning issues surrounding the libraries that are used to parse HTML tables in the top-level pandas io function read_html

Issues with lxml

  • Benefits

    • lxml is very fast

    • lxml requires Cython to install correctly

  • Drawbacks

    • lxml does not make any guarantees about the results of its parse unless it is given strictly valid markup

    • In light of the above, we have chosen to allow you, the user, to use the lxml backend, but this backend will use html5lib if lxml fails to parse

    • It is therefore highly recommended that you install both BeautifulSoup4 and html5lib, so that you will still get a valid result (provided everything else is valid) even if lxml fails

Issues with BeautifulSoup4 using lxml as a backend

  • The above issues hold here as well since BeautifulSoup4 is essentially just a wrapper around a parser backend

Issues with BeautifulSoup4 using html5lib as a backend

  • Benefits

    • html5lib is far more lenient than lxml and consequently deals with real-life markup in a much saner way rather than just, e.g., dropping an element without notifying you

    • html5lib generates valid HTML5 markup from invalid markup automatically. This is extremely important for parsing HTML tables, since it guarantees a valid document. However, that does NOT mean that it is “correct”, since the process of fixing markup does not have a single definition

    • html5lib is pure Python and requires no additional build steps beyond its own installation

  • Drawbacks

    • The biggest drawback to using html5lib is that it is slow as molasses. However, consider the fact that many tables on the web are not big enough for the parsing algorithm runtime to matter. It is more likely that the bottleneck will be in the process of reading the raw text from the URL over the web, i.e., IO (input-output). For very large tables, this might not be true

LaTeX

New in version 1.3.0

Currently there are no methods to read from LaTeX, only output methods

Writing to LaTeX files

Note

DataFrame and Styler objects currently have a to_latex method. We recommend using the Styler.to_latex() method over DataFrame.to_latex() due to the former’s greater flexibility with conditional styling, and the latter’s possible future deprecation.

Review the documentation for Styler.to_latex, which gives examples of conditional styling and explains the operation of its keyword arguments.

For simple applications the following pattern is sufficient: df.style.to_latex()

To format values before output, chain the Styler.format method, as in df.style.format(formatter).to_latex()
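
Both patterns can be wrapped in one small helper. This is a sketch under assumptions: the DataFrame contents and the helper name to_latex_table are illustrative, and the Styler requires the optional jinja2 dependency to be installed.

```python
import pandas as pd

# Illustrative 2x2 frame with labeled rows and columns.
df = pd.DataFrame([[1, 2], [3, 4]], index=["a", "b"], columns=["c", "d"])


def to_latex_table(frame: pd.DataFrame, formatter=None) -> str:
    """Render a frame as a LaTeX tabular via its Styler."""
    styler = frame.style
    if formatter is not None:
        # Styler.format applies the format string to every displayed value.
        styler = styler.format(formatter)
    return styler.to_latex()
```

For instance, to_latex_table(df) gives the plain table, and to_latex_table(df, "{:.1f}") renders each value with one decimal place.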

XML

Reading XML

New in version 1.3.0

The top-level read_xml() function can accept an XML string/file/URL and will parse nodes and attributes into a pandas DataFrame

Note

Since there is no standard XML structure where design types can vary in many ways, read_xml works best with flatter, shallow versions. If an XML document is deeply nested, use the stylesheet feature to transform XML into a flatter version

Let’s look at a few examples

Read an XML string
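
As a minimal sketch, the snippet below reads a short XML document held in a string. The document itself is made up for illustration, the text is wrapped in StringIO, and parser="etree" is passed so that the example relies only on the standard library rather than on lxml.

```python
from io import StringIO

import pandas as pd

# Illustrative document: each <book> element becomes one row.
XML_DOC = """<bookstore>
  <book category="cooking">
    <title>Everyday Italian</title>
    <author>Giada De Laurentiis</author>
    <year>2005</year>
    <price>30.00</price>
  </book>
  <book category="web">
    <title>Learning XML</title>
    <author>Erik T. Ray</author>
    <year>2003</year>
    <price>39.95</price>
  </book>
</bookstore>"""

# Attributes and child elements of each row node become columns.
df = pd.read_xml(StringIO(XML_DOC), parser="etree")
```

The resulting frame has one row per book, with the category attribute and the four child elements as columns.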

Read a URL with no options, as in pd.read_xml(url)

Read in the content of the “books.xml” file and pass it to read_xml as a string

Read in the content of “books.xml” as an instance of StringIO or BytesIO and pass it to read_xml

Even read XML from AWS S3 buckets such as the NIH NCBI PMC Article Datasets providing Biomedical and Life Science Journals

With lxml as default parser, you access the full-featured XML library that extends Python’s ElementTree API. One powerful tool is the ability to query nodes selectively or conditionally with more expressive XPath
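
A conditional query can be sketched as follows. Note the assumption: to stay runnable without lxml, this example uses parser="etree", which supports only a limited XPath subset and needs a relative path starting with a dot, whereas lxml accepts richer expressions. The XML document is made up for illustration.

```python
from io import StringIO

import pandas as pd

XML_DOC = """<bookstore>
  <book category="cooking">
    <title>Everyday Italian</title>
    <year>2005</year>
  </book>
  <book category="web">
    <title>Learning XML</title>
    <year>2003</year>
  </book>
</bookstore>"""

# Select only <book> nodes whose <year> child has the text 2005.
df = pd.read_xml(
    StringIO(XML_DOC),
    xpath=".//book[year='2005']",
    parser="etree",
)
```

Only the matching book is parsed into the frame; the other row nodes are ignored.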

Specify only elements or only attributes to parse with elems_only=True or attrs_only=True
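
The two switches can be compared on the same input. This sketch assumes a made-up document and uses parser="etree" so that it runs with the standard library only.

```python
from io import StringIO

import pandas as pd

XML_DOC = """<bookstore>
  <book category="cooking">
    <title>Everyday Italian</title>
    <year>2005</year>
  </book>
  <book category="web">
    <title>Learning XML</title>
    <year>2003</year>
  </book>
</bookstore>"""

# Keep only the child elements of each <book>.
elems = pd.read_xml(StringIO(XML_DOC), elems_only=True, parser="etree")

# Keep only the attributes of each <book>.
attrs = pd.read_xml(StringIO(XML_DOC), attrs_only=True, parser="etree")
```

Here elems holds the title and year columns, while attrs holds only the category column.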

XML documents can have namespaces with prefixes and default namespaces without prefixes, both of which are denoted with a special attribute xmlns. In order to parse by node under a namespace context, xpath must reference a prefix

For example, consider XML that contains a namespace with prefix, doc, and URI at https://example.com. In order to parse doc:row nodes, namespaces must be used
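
A minimal sketch of parsing prefixed nodes is shown here. The document content is illustrative, and parser="etree" with a relative XPath is an assumption made so the example runs without lxml.

```python
from io import StringIO

import pandas as pd

XML_DOC = """<doc:data xmlns:doc="https://example.com">
  <doc:row>
    <doc:shape>square</doc:shape>
    <doc:degrees>360</doc:degrees>
    <doc:sides>4</doc:sides>
  </doc:row>
  <doc:row>
    <doc:shape>triangle</doc:shape>
    <doc:degrees>180</doc:degrees>
    <doc:sides>3</doc:sides>
  </doc:row>
</doc:data>"""

# The namespaces dictionary maps the prefix used in xpath to its URI.
df = pd.read_xml(
    StringIO(XML_DOC),
    xpath=".//doc:row",
    namespaces={"doc": "https://example.com"},
    parser="etree",
)
```

The namespace is dropped from the column names, so the frame has plain shape, degrees and sides columns.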

Similarly, an XML document can have a default namespace without prefix. Failing to assign a temporary prefix will return no nodes and raise a ValueError. But assigning any temporary name to the correct URI allows parsing by nodes
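
The default-namespace case can be sketched like this. The document and the temporary prefix tmp are illustrative choices, and parser="etree" is assumed so that lxml is not required.

```python
from io import StringIO

import pandas as pd

XML_DOC = """<data xmlns="https://example.com">
  <row><shape>square</shape><sides>4</sides></row>
  <row><shape>triangle</shape><sides>3</sides></row>
</data>"""

# Without a prefix the path matches nothing in the default namespace.
try:
    pd.read_xml(StringIO(XML_DOC), xpath=".//row", parser="etree")
    unprefixed_failed = False
except ValueError:
    unprefixed_failed = True

# Any temporary prefix works as long as it maps to the right URI.
df = pd.read_xml(
    StringIO(XML_DOC),
    xpath=".//tmp:row",
    namespaces={"tmp": "https://example.com"},
    parser="etree",
)
```

The first call fails because no nodes are found, while the second call parses both rows.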

However, if XPath does not reference node names such as default, /*, then namespaces is not required

With lxml as parser, you can flatten nested XML documents with an XSLT script which also can be string/file/URL types. As background, XSLT is a special-purpose language written in a special XML file that can transform original XML documents into other XML, HTML, even text (CSV, JSON, etc.) using an XSLT processor

For example, consider a somewhat nested structure of Chicago “L” Rides where station and rides elements encapsulate data in their own sections. With an XSLT stylesheet, lxml can transform the original nested document into a flatter output for easier parsing into a DataFrame

For very large XML files that can range from hundreds of megabytes to gigabytes, pandas.read_xml() supports parsing such sizeable files using lxml’s iterparse and etree’s iterparse, which are memory-efficient methods to iterate through an XML tree and extract specific elements and attributes without holding the entire tree in memory

New in version 1.5.0

To use this feature, you must pass a physical XML file path into read_xml and use the iterparse argument. Files should not be compressed or point to online sources but should be stored on local disk. Also, iterparse should be a dictionary where the key is the repeating node in the document (which becomes the rows) and the value is a list of any elements or attributes that are descendants (i.e., child, grandchild) of the repeating node. Since XPath is not used in this method, descendants do not need to share the same relationship with one another. This approach suits documents such as Wikipedia’s very large (12 GB+) latest article data dump.
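
As a minimal sketch of the iterparse dictionary described above, the snippet below writes a tiny XML file to a temporary directory and reads it back. The file name, the element names (row, shape, degrees, sides), and the choice of parser="etree" are assumptions made for illustration; a real use case would point at a large file already on local disk.

```python
import os
import tempfile

import pandas as pd

# Hypothetical, very small stand-in for a large XML file on local disk.
xml_text = """<?xml version="1.0" encoding="utf-8"?>
<data>
  <row><shape>square</shape><degrees>360</degrees><sides>4</sides></row>
  <row><shape>triangle</shape><degrees>180</degrees><sides>3</sides></row>
</data>
"""

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "shapes.xml")
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(xml_text)

    # Key: the repeating node that becomes a row.
    # Value: the descendant elements or attributes to extract.
    df = pd.read_xml(
        path,
        iterparse={"row": ["shape", "degrees", "sides"]},
        parser="etree",  # assumption: avoids requiring lxml
    )

print(df)
```

Because the tree is consumed incrementally, only the listed descendants of each repeating node are kept.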

Writing XML

New in version 1.3.0

DataFrame objects have an instance method to_xml which renders the contents of the DataFrame as an XML document.

Note

This method does not support special properties of XML including DTD, CData, XSD schemas, processing instructions, comments, and others. Only namespaces at the root level are supported. However, stylesheet allows design changes after initial output.

Let’s look at a few examples.

Write an XML without options
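
A minimal sketch of this case might look as follows. The sample frame is made up for illustration, and parser="etree" is passed only so the snippet runs without lxml installed (the default parser is lxml); apart from that, no options are set.

```python
import pandas as pd

# Hypothetical sample data for illustration.
geom_df = pd.DataFrame(
    {"shape": ["square", "triangle"], "degrees": [360, 180], "sides": [4, 3]}
)

# Defaults: root element <data>, one <row> per record, index included.
# parser="etree" is an assumption so that lxml is not required.
xml = geom_df.to_xml(parser="etree")
print(xml)
```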

Write an XML with new root and row name
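
One way this could look, assuming a made-up frame and the element names geometry and objects chosen purely for illustration (parser="etree" is again an assumption to avoid the lxml dependency):

```python
import pandas as pd

# Hypothetical sample data for illustration.
geom_df = pd.DataFrame(
    {"shape": ["square", "triangle"], "degrees": [360, 180], "sides": [4, 3]}
)

# root_name replaces the default <data>; row_name replaces the default <row>.
xml = geom_df.to_xml(root_name="geometry", row_name="objects", parser="etree")
print(xml)
```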

Write an attribute-centric XML
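
A small sketch, assuming a made-up frame: passing every column to attr_cols should render each record as attributes of the row element rather than as child elements. parser="etree" is an assumption to avoid requiring lxml.

```python
import pandas as pd

# Hypothetical sample data for illustration.
geom_df = pd.DataFrame(
    {"shape": ["square", "triangle"], "degrees": [360, 180], "sides": [4, 3]}
)

# All columns become attributes of each <row> element.
xml = geom_df.to_xml(attr_cols=geom_df.columns.tolist(), parser="etree")
print(xml)
```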

Write a mix of elements and attributes
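
A sketch of mixing the two styles, assuming a made-up frame in which shape is written as an attribute and the remaining columns as child elements (parser="etree" is an assumption to avoid requiring lxml):

```python
import pandas as pd

# Hypothetical sample data for illustration.
geom_df = pd.DataFrame(
    {"shape": ["square", "triangle"], "degrees": [360, 180], "sides": [4, 3]}
)

# attr_cols go onto the <row> tag; elem_cols become child elements.
xml = geom_df.to_xml(
    index=False,
    attr_cols=["shape"],
    elem_cols=["degrees", "sides"],
    parser="etree",
)
print(xml)
```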

Any DataFrame with hierarchical columns will be flattened for XML element names, with levels delimited by underscores.
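
To illustrate the flattening, here is a small sketch with a made-up two-level column index; the level values and parser="etree" are assumptions for the example only.

```python
import pandas as pd

# Hypothetical frame with two-level (hierarchical) columns.
cols = pd.MultiIndex.from_tuples([("degrees", "polygon"), ("sides", "polygon")])
wide_df = pd.DataFrame([[360, 4], [180, 3]], columns=cols)

# Each column tuple is joined with "_" to form the element name.
xml = wide_df.to_xml(index=False, parser="etree")
print(xml)
```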

Write an XML with default namespace
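
A possible sketch: a default namespace is given by mapping the empty prefix to a URI in namespaces. The frame and the URI https://example.com are placeholders, and parser="etree" is an assumption to avoid requiring lxml.

```python
import pandas as pd

# Hypothetical sample data for illustration.
geom_df = pd.DataFrame(
    {"shape": ["square", "triangle"], "degrees": [360, 180], "sides": [4, 3]}
)

# The empty-string key declares a default (unprefixed) namespace on the root.
xml = geom_df.to_xml(namespaces={"": "https://example.com"}, parser="etree")
print(xml)
```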

Write an XML with namespace prefix

Write an XML without declaration or pretty print
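
A short sketch, assuming a made-up frame: turning off both xml_declaration and pretty_print should yield a compact document with no XML declaration line. parser="etree" is an assumption to avoid requiring lxml.

```python
import pandas as pd

# Hypothetical sample data for illustration.
geom_df = pd.DataFrame(
    {"shape": ["square", "triangle"], "degrees": [360, 180], "sides": [4, 3]}
)

# No <?xml ...?> header and no indentation or line breaks between elements.
xml = geom_df.to_xml(xml_declaration=False, pretty_print=False, parser="etree")
print(xml)
```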

Write an XML and transform with stylesheet

XML Final Notes

  • All XML documents adhere to W3C specifications. Both etree and lxml parsers will fail to parse any markup document that is not well-formed or does not follow XML syntax rules. Do be aware HTML is not an XML document unless it follows XHTML specs. However, other popular markup types including KML, XAML, RSS, MusicML, MathML are compliant XML schemas.

  • For the above reason, if your application builds XML prior to pandas operations, use appropriate DOM libraries like etree and lxml to build the necessary document, and not string concatenation or regex adjustments. Always remember XML is a special text file with markup rules.

  • With very large XML files (several hundred MBs to GBs), XPath and XSLT can become memory-intensive operations. Be sure to have enough available RAM for reading and writing large XML files (roughly 5 times the size of the text).

  • Because XSLT is a programming language, use it with caution since such scripts can pose a security risk in your environment and can run large or infinite recursive operations. Always test scripts on small fragments before a full run.

  • The etree parser supports all functionality of both read_xml and to_xml except for complex XPath and any XSLT. Though limited in features, etree is still a reliable and capable parser and tree builder. Its performance may trail lxml to a certain degree for larger files, but the difference is relatively unnoticeable on small to medium size files.

Excel files

The read_excel method can read Excel 2007+ (.xlsx) files using the openpyxl Python module. Excel 2003 (.xls) files can be read using xlrd. Binary Excel (.xlsb) files can be read using pyxlsb. The to_excel instance method is used for saving a DataFrame to Excel. Generally the semantics are similar to working with csv data. See the cookbook for some advanced strategies.

Warning

The xlwt package for writing old-style .xls excel files is no longer maintained. The xlrd package is now only for reading old-style .xls files.

Before pandas 1.3.0, the default argument engine=None to read_excel would result in using the xlrd engine in many cases, including new Excel 2007+ (.xlsx) files. pandas will now default to using the openpyxl engine.

It is strongly encouraged to install openpyxl to read Excel 2007+ (.xlsx) files. Please do not report issues when using xlrd to read .xlsx files. This is no longer supported; switch to using openpyxl instead.

Attempting to use the xlwt engine will raise a FutureWarning unless the option io.excel.xls.writer is set to "xlwt". While this option is now deprecated and will also raise a FutureWarning, it can be globally set and the warning suppressed. Users are recommended to write .xlsx files using the openpyxl engine instead.

Reading Excel files

In the most basic use-case, read_excel takes a path to an Excel file, and the sheet_name indicating which sheet to parse.
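
As a rough sketch of this basic call, the snippet below first writes a small workbook so that there is something to read. The file name, sheet name, and data are made up, and the example assumes openpyxl is installed for reading .xlsx files.

```python
import os
import tempfile

import pandas as pd

# Hypothetical data, written out only so the read has a file to work with.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.xlsx")
    df.to_excel(path, sheet_name="Sheet1", index=False)

    # Path to the workbook plus the sheet to parse.
    result = pd.read_excel(path, sheet_name="Sheet1")

print(result)
```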

ExcelFile class

To facilitate working with multiple sheets from the same file, the ExcelFile class can be used to wrap the file and can be passed into read_excel. There will be a performance benefit for reading multiple sheets as the file is read into memory only once.

The ExcelFile class can also be used as a context manager.

The sheet_names property will generate a list of the sheet names in the file.

The primary use-case for an ExcelFile is parsing multiple sheets with different parameters.
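
A minimal sketch of that use-case, combining the context manager and the sheet_names property: the workbook, sheet names, column names, and the "missing" marker are all made up for illustration, and openpyxl is assumed to be installed.

```python
import os
import tempfile

import pandas as pd

# Hypothetical data for two sheets that need different parsing options.
df1 = pd.DataFrame({"a": [1, 2], "b": ["x", "missing"]})
df2 = pd.DataFrame({"key": ["k1", "k2"], "value": [10, 20]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "two_sheets.xlsx")
    with pd.ExcelWriter(path) as writer:
        df1.to_excel(writer, sheet_name="Sheet1", index=False)
        df2.to_excel(writer, sheet_name="Sheet2", index=False)

    # The workbook is opened once and reused for both reads.
    with pd.ExcelFile(path) as xls:
        names = xls.sheet_names
        first = pd.read_excel(xls, sheet_name="Sheet1", na_values=["missing"])
        second = pd.read_excel(xls, sheet_name="Sheet2", index_col=0)

print(names)
print(first)
print(second)
```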

Note that if the same parsing parameters are used for all sheets, a list of sheet names can simply be passed to read_excel with no loss in performance.

ExcelFile can also be called with an xlrd.book.Book object as a parameter. This allows the user to control how the excel file is read. For example, sheets can be loaded on demand by calling xlrd.open_workbook() with on_demand=True.

Specifying sheets

Note

The second argument is sheet_name, not to be confused with ExcelFile.sheet_names.

Note

An ExcelFile’s attribute sheet_names provides access to a list of sheets.

  • The argument sheet_name allows specifying the sheet or sheets to read.

  • The default value for sheet_name is 0, indicating to read the first sheet.

  • Pass a string to refer to the name of a particular sheet in the workbook.

  • Pass an integer to refer to the index of a sheet. Indices follow Python convention, beginning at 0.

  • Pass a list of either strings or integers, to return a dictionary of specified sheets.

  • Pass None to return a dictionary of all available sheets.

Using the sheet index

Using all default values

Using None to get all sheets

Using a list to get multiple sheets

read_excel can read more than one sheet, by setting sheet_name to either a list of sheet names, a list of sheet positions, or None to read all sheets. Sheets can be specified by sheet index or sheet name, using an integer or string, respectively.
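
The two multi-sheet forms can be sketched as follows. The workbook, its three sheet names, and the data are invented for the example, and openpyxl is assumed to be installed; in both cases the result is a dictionary of DataFrames.

```python
import os
import tempfile

import pandas as pd

# Hypothetical workbook with three small sheets.
sheets = {
    "Sheet1": pd.DataFrame({"a": [1, 2]}),
    "Sheet2": pd.DataFrame({"a": [3, 4]}),
    "Sheet3": pd.DataFrame({"a": [5, 6]}),
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "many_sheets.xlsx")
    with pd.ExcelWriter(path) as writer:
        for name, frame in sheets.items():
            frame.to_excel(writer, sheet_name=name, index=False)

    # A list selects specific sheets; None selects every sheet.
    some = pd.read_excel(path, sheet_name=["Sheet1", "Sheet3"])
    everything = pd.read_excel(path, sheet_name=None)

print(list(some))
print(list(everything))
```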

Reading a MultiIndex

read_excel can read a MultiIndex index, by passing a list of columns to index_col, and a MultiIndex column by passing a list of rows to header. If either the index or columns have serialized level names, those will be read in as well by specifying the rows/columns that make up the levels.

For example, to read in a MultiIndex index without names
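
A minimal round-trip sketch of this case: a frame with an unnamed two-level index is written with to_excel and read back by pointing index_col at the first two columns. The data, level values, and file name are made up, and openpyxl is assumed to be installed.

```python
import os
import tempfile

import pandas as pd

# Hypothetical frame with an unnamed two-level row index.
index = pd.MultiIndex.from_product([["a", "b"], ["c", "d"]])
df = pd.DataFrame({"x": [1, 2, 3, 4], "y": [5, 6, 7, 8]}, index=index)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "multi_index.xlsx")
    df.to_excel(path)

    # The first two spreadsheet columns hold the index levels.
    result = pd.read_excel(path, index_col=[0, 1])

print(result)
```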

If the index has level names, they will be parsed as well, using the same parameters.

If the source file has both MultiIndex index and columns, lists specifying each should be passed to index_col and header.

Missing values in columns specified in index_col will be forward filled to allow roundtripping with to_excel for merged_cells=True. To avoid forward filling the missing values, use set_index after reading the data instead of index_col.

Parsing specific columns

It is often the case that users will insert columns to do temporary computations in Excel and you may not want to read in those columns.

read_excel takes a usecols keyword to allow you to specify a subset of columns to parse.

Changed in version 1.0.0

Passing in an integer for usecols will no longer work. Please pass in a list of ints from 0 to that integer, inclusive, instead.

You can specify a comma-delimited set of Excel columns and ranges as a string, for example usecols="A,C:E".

If usecols is a list of integers, then it is assumed to be the file column indices to be parsed, for example usecols=[0, 2, 3].

Element order is ignored, so usecols=[0, 1] is the same as [1, 0].

If usecols is a list of strings, it is assumed that each string corresponds to a column name provided either by the user in names or inferred from the document header row(s). Those strings define which columns will be parsed, for example usecols=["foo", "bar"].

Element order is ignored, so usecols=['baz', 'joe'] is the same as ['joe', 'baz'].

If usecols is callable, the callable function will be evaluated against the column names, returning names where the callable function evaluates to True, for example usecols=lambda x: x.isalpha().
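The usecols forms described above can be tried without an Excel file. The sketch below is only an illustration, not the guide's own example: it uses read_csv, which accepts the same list-of-integers, list-of-strings, and callable forms of usecols, as a stand-in for read_excel, and the data and column names are made up. The Excel-letter string form (such as "A,C:E") is specific to read_excel and is not shown here.

```python
from io import StringIO

import pandas as pd

# Hypothetical data; the column names are invented for illustration.
data = "foo,bar,baz,tmp1\n1,2,3,4\n5,6,7,8"

# List of integers: positions of the columns to keep.
by_position = pd.read_csv(StringIO(data), usecols=[0, 2])

# List of strings: names of the columns to keep. Element order is ignored,
# so the result follows the column order of the file.
by_name = pd.read_csv(StringIO(data), usecols=["baz", "foo"])

# Callable: keep the columns whose name makes the function return True.
by_callable = pd.read_csv(StringIO(data), usecols=lambda name: name.isalpha())

print(list(by_position.columns))
print(list(by_name.columns))
print(list(by_callable.columns))
```

Because element order is ignored, reordering the columns afterwards (for example with df[["baz", "foo"]]) is the usual way to get a specific order.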

Parsing dates

Datetime-like values are normally automatically converted to the appropriate dtype when reading the Excel file. But if you have a column of strings that look like dates (but are not actually formatted as dates in Excel), you can use the parse_dates keyword to parse those strings to datetimes, for example parse_dates=["date_strings"].
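As a rough illustration of what parse_dates changes, the sketch below reads the same text twice, once without and once with the keyword. It assumes read_csv as a stand-in for read_excel (both accept parse_dates), and the column name date_strings and the values are invented.

```python
from io import StringIO

import pandas as pd

# Hypothetical data with a column of date-like strings.
data = "date_strings,value\n2021-01-05,1\n2021-02-10,2"

# Without parse_dates the column is left as plain strings.
plain = pd.read_csv(StringIO(data))

# With parse_dates the listed columns are converted to datetimes.
parsed = pd.read_csv(StringIO(data), parse_dates=["date_strings"])

print(plain["date_strings"].dtype)
print(parsed["date_strings"].dtype)
```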

Cell converters

It is possible to transform the contents of Excel cells via the converters option. For instance, to convert a column to boolean, pass converters={"MyBools": bool}.

This option handles missing values and treats exceptions in the converters as missing data. Transformations are applied cell by cell rather than to the column as a whole, so the array dtype is not guaranteed. For instance, a column of integers with missing values cannot be transformed to an array with integer dtype, because NaN is strictly a float. You can manually mask missing data to recover integer dtype, for example with a converter that returns int(x) if x else -1.
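The masking idea can be sketched as follows. This is an assumption-laden illustration: it uses read_csv rather than read_excel so that it runs without an Excel engine, which means the converter receives each cell as a string (an empty string for a missing cell) rather than a typed Excel value. The column names and the sentinel -1 are chosen for the example only.

```python
from io import StringIO

import pandas as pd


def cfun(x):
    # Map an empty (missing) cell to the sentinel -1, otherwise convert to int.
    return int(x) if x else -1


# Hypothetical data: the second row has no value in MyInts.
data = "MyInts,label\n1,a\n,b\n3,c"

# Default parsing: the missing value becomes NaN, so the column is float.
default = pd.read_csv(StringIO(data))

# With the converter, every cell is an integer and no NaN is introduced.
masked = pd.read_csv(StringIO(data), converters={"MyInts": cfun})

print(default["MyInts"].tolist())
print(masked["MyInts"].tolist())
```

The sentinel has to be a value that cannot occur in real data; otherwise the nullable "Int64" dtype may be a better fit than masking.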

Dtype specifications

As an alternative to converters, the type for an entire column can be specified using the dtype keyword, which takes a dictionary mapping column names to types. To interpret data with no type inference, use the type str or object, for example dtype={"MyInts": "int64", "MyText": str}.
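The effect of switching off type inference for one column might look like the sketch below. It again assumes read_csv as a stand-in for read_excel (both take a dtype mapping), with invented column names; note how the leading zeros survive only when the column is read as str.

```python
from io import StringIO

import pandas as pd

# Hypothetical data: MyText holds codes with leading zeros.
data = "MyInts,MyText\n1,001\n2,002"

# With type inference, MyText is parsed as integers and the zeros are lost.
inferred = pd.read_csv(StringIO(data))

# With an explicit dtype mapping, MyText is kept as text.
typed = pd.read_csv(StringIO(data), dtype={"MyInts": "int64", "MyText": str})

print(inferred["MyText"].tolist())
print(typed["MyText"].tolist())
```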

Writing Excel files

Writing Excel files to disk

To write a DataFrame object to a sheet of an Excel file, you can use the to_excel instance method. The arguments are largely the same as to_csv described above, the first argument being the name of the Excel file, and the optional second argument the name of the sheet to which the DataFrame should be written. For example, df.to_excel("path_to_file.xlsx", sheet_name="Sheet1").

Files with a .xls extension will be written using xlwt and those with a .xlsx extension will be written using xlsxwriter (if available) or openpyxl.

The DataFrame will be written in a way that tries to mimic the REPL output. The index_label will be placed in the second row instead of the first. You can place it in the first row by setting the merge_cells option in to_excel() to False, for example df.to_excel("path_to_file.xlsx", index_label="label", merge_cells=False).

In order to write separate DataFrames to separate sheets in a single Excel file, one can pass an ExcelWriter.
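A minimal sketch of writing two DataFrames to two sheets is shown below. It assumes the openpyxl engine is installed and does nothing otherwise; the frame contents, sheet names, and the temporary file location are invented for the example.

```python
import importlib.util
import os
import tempfile

import pandas as pd

# Assumption: openpyxl is the engine used here; skip the demo if it is missing.
HAVE_OPENPYXL = importlib.util.find_spec("openpyxl") is not None

df1 = pd.DataFrame({"a": [1, 2]})
df2 = pd.DataFrame({"b": [3.0, 4.0]})

sheets = {}
if HAVE_OPENPYXL:
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "path_to_file.xlsx")
        # The context manager saves and closes the workbook on exit.
        with pd.ExcelWriter(path, engine="openpyxl") as writer:
            df1.to_excel(writer, sheet_name="Sheet1", index=False)
            df2.to_excel(writer, sheet_name="Sheet2", index=False)
        # sheet_name=None reads every sheet into a dict keyed by sheet name.
        sheets = pd.read_excel(path, sheet_name=None, engine="openpyxl")

print(list(sheets))
```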

Writing Excel files to memory

pandas supports writing Excel files to buffer-like objects such as StringIO or BytesIO using ExcelWriter.
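Writing to memory could look roughly like the following sketch. It assumes the openpyxl engine (any installed xlsx writer engine should behave similarly) and a made-up one-column frame; a BytesIO buffer is used because an xlsx workbook is binary data.

```python
import importlib.util
from io import BytesIO

import pandas as pd

# Assumption: openpyxl is the engine used here; skip the demo if it is missing.
HAVE_OPENPYXL = importlib.util.find_spec("openpyxl") is not None

df = pd.DataFrame({"x": [1, 2, 3]})

workbook = b""
if HAVE_OPENPYXL:
    bio = BytesIO()
    # Setting the engine in the ExcelWriter constructor; the workbook is
    # saved into the buffer when the with-block exits.
    with pd.ExcelWriter(bio, engine="openpyxl") as writer:
        df.to_excel(writer, sheet_name="Sheet1", index=False)
    # Seek to the beginning and read to copy the workbook into a variable.
    bio.seek(0)
    workbook = bio.read()

print(len(workbook) > 0)
```

The resulting bytes can be sent over a network or stored in a database without touching the file system.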

Note

engine is optional but recommended. Setting the engine determines the version of workbook produced. Setting engine='xlwt' will produce an Excel 2003-format workbook (xls). Using either 'openpyxl' or 'xlsxwriter' will produce an Excel 2007-format workbook (xlsx). If omitted, an Excel 2007-formatted workbook is produced.

Excel writer engines

Deprecated since version 1.2.0: As the xlwt package is no longer maintained, the xlwt engine will be removed from a future version of pandas. This is the only engine in pandas that supports writing to .xls files.

pandas chooses an Excel writer via two methods

  1. the engine keyword argument

  2. the filename extension (via the default specified in config options)

By default, pandas uses the XlsxWriter for .xlsx, openpyxl for .xlsm, and xlwt for .xls files. If you have multiple engines installed, you can set the default engine through the config options io.excel.xlsx.writer and io.excel.xls.writer. pandas will fall back on openpyxl for .xlsx files if Xlsxwriter is not available.

To specify which writer you want to use, you can pass an engine keyword argument to to_excel and to ExcelWriter. The built-in engines are

  • openpyxl: version 2.4 or higher is required

  • xlsxwriter

  • xlwt

For example, pass engine="xlsxwriter" to df.to_excel or to the pd.ExcelWriter constructor, or set the io.excel.xlsx.writer config option to "xlsxwriter".

Style and formatting

The look and feel of Excel worksheets created from pandas can be modified using the following parameters on the DataFrame's to_excel method

  • float_format: Format string for floating point numbers (default None)

  • freeze_panes: A tuple of two integers representing the bottommost row and rightmost column to freeze. Each of these parameters is one-based, so (1, 1) will freeze the first row and first column (default None)
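The two parameters above can be checked with a small round trip. The sketch below is an illustration under assumptions: it uses the openpyxl engine both to write and to inspect the sheet (and is skipped if openpyxl is missing), and the frame contents are invented.

```python
import importlib.util
from io import BytesIO

import pandas as pd

# Assumption: openpyxl is used for writing and for inspecting the result.
HAVE_OPENPYXL = importlib.util.find_spec("openpyxl") is not None

df = pd.DataFrame({"value": [3.14159, 2.71828]})

frozen_at = None
first_value = None
if HAVE_OPENPYXL:
    from openpyxl import load_workbook

    bio = BytesIO()
    with pd.ExcelWriter(bio, engine="openpyxl") as writer:
        # Round floats to two decimals and freeze the first row and column.
        df.to_excel(
            writer,
            sheet_name="Sheet1",
            float_format="%.2f",
            freeze_panes=(1, 1),
        )
    bio.seek(0)
    sheet = load_workbook(bio)["Sheet1"]
    # openpyxl reports the top-left cell of the scrollable (unfrozen) area.
    frozen_at = sheet.freeze_panes
    # Row 1 holds the header and column A the index, so B2 is the first value.
    first_value = sheet["B2"].value

print(frozen_at, first_value)
```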

Using the Xlsxwriter engine provides many options for controlling the format of an Excel worksheet created with the to_excel method. Excellent examples can be found in the Xlsxwriter documentation here: https://xlsxwriter.readthedocs.io/working_with_pandas.html

OpenDocument Spreadsheets

New in version 0.25

The read_excel() method can also read OpenDocument spreadsheets using the odfpy module. The semantics and features for reading OpenDocument spreadsheets match what can be done for Excel files using engine='odf', for example pd.read_excel("path_to_file.ods", engine="odf"), which returns a DataFrame.

Note

Currently pandas only supports reading OpenDocument spreadsheets. Writing is not implemented.

Binary Excel (.xlsb) files

New in version 1.0.0

The read_excel() method can also read binary Excel files using the pyxlsb module. The semantics and features for reading binary Excel files mostly match what can be done for Excel files using engine='pyxlsb'. pyxlsb does not recognize datetime types in files and will return floats instead. For example, pd.read_excel("path_to_file.xlsb", engine="pyxlsb") returns a DataFrame.

Note

Currently pandas only supports reading binary Excel files. Writing is not implemented.

Clipboard

A handy way to grab data is to use the read_clipboard() method, which takes the contents of the clipboard buffer and passes them to the read_csv method. For instance, you can copy a block of tabular text to the clipboard (CTRL-C on many operating systems).

And then import the data directly to a DataFrame by calling pd.read_clipboard().

The to_clipboard method can be used to write the contents of a DataFrame to the clipboard. You can then paste the clipboard contents into other applications (CTRL-V on many operating systems). Writing a DataFrame to the clipboard and reading it back with read_clipboard returns the same content that was written.

Note

You may need to install xclip or xsel (with PyQt5, PyQt4 or qtpy) on Linux to use these methods.
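The clipboard workflow can be sketched as follows. Because a real clipboard needs a desktop backend, the first part parses a hypothetical block of copied text with read_csv, which is the parser the text says read_clipboard hands the buffer to; the sample text and the helper name clipboard_round_trip are assumptions made for illustration.

```python
from io import StringIO

import pandas as pd

# Hypothetical text, as it might be copied from a spreadsheet or a web page.
copied_text = "A B C\nx 1 4 p\ny 2 5 q\nz 3 6 r\n"

# Stand-in for pd.read_clipboard(): parse the same text with read_csv,
# splitting fields on runs of whitespace. The data rows have one more field
# than the header, so the first field becomes the index.
clipdf = pd.read_csv(StringIO(copied_text), sep=r"\s+")
print(clipdf)


def clipboard_round_trip(df):
    """Write df to the system clipboard and read it back (needs a clipboard backend)."""
    df.to_clipboard()
    return pd.read_clipboard()
```

On a machine with a working clipboard, clipboard_round_trip(clipdf) exercises the two methods described above.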

Pickling

All pandas objects are equipped with to_pickle methods, which use Python's pickle module to save data structures to disk using the pickle format.

The read_pickle function in the pandas namespace can be used to load any pickled pandas object (or any other pickled object) from file.
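A minimal round trip through to_pickle and read_pickle might look like the sketch below. The DataFrame contents and the file name foo.pkl are made up for illustration, and a temporary directory is used so that nothing is left on disk.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": ["x", "y", "z"]})

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "foo.pkl")  # hypothetical file name
    df.to_pickle(path)               # serialize the DataFrame to disk
    restored = pd.read_pickle(path)  # load it back; only do this for trusted files

print(restored.equals(df))
```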

Warning

Loading pickled data received from untrusted sources can be unsafe.

See: https://docs.python.org/3/library/pickle.html

Warning

read_pickle() is only guaranteed backwards compatible back to pandas version 0.20.3.

Compressed pickle files

read_pickle(), DataFrame.to_pickle() and Series.to_pickle() can read and write compressed pickle files. The compression types gzip, bz2, xz and zstd are supported for reading and writing. The zip file format only supports reading and must contain only one data file to be read.

The compression type can be an explicit parameter or be inferred from the file extension. If 'infer', then gzip, bz2, zip, xz or zstd is used if the filename ends in '.gz', '.bz2', '.zip', '.xz', or '.zst', respectively.

The compression parameter can also be a dict in order to pass options to the compression protocol. It must have a 'method' key set to the name of the compression protocol, which must be one of {'zip', 'gzip', 'bz2', 'xz', 'zstd'}. All other key-value pairs are passed to the underlying compression library.

The compression can therefore be chosen in several ways: by passing an explicit compression type, by inferring it from the file extension with compression='infer' (which is the default), or by passing a dict of options to the compression protocol, for example to speed up compression.
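The three ways of choosing the compression can be sketched as below. The file names and DataFrame contents are hypothetical, and the compresslevel option is simply forwarded to Python's gzip library, so treat the exact option names as an assumption tied to that library.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"A": range(5), "B": list("abcde")})

with tempfile.TemporaryDirectory() as tmpdir:
    # 1. Explicit compression type, independent of the file extension.
    explicit_path = os.path.join(tmpdir, "data.pkl.compress")
    df.to_pickle(explicit_path, compression="gzip")
    explicit = pd.read_pickle(explicit_path, compression="gzip")

    # 2. Default ('infer'): the '.gz' extension selects gzip.
    inferred_path = os.path.join(tmpdir, "data.pkl.gz")
    df.to_pickle(inferred_path)
    inferred = pd.read_pickle(inferred_path)

    # 3. Dict form: 'method' names the protocol, the other keys are passed
    #    to the compression library (a low level trades size for speed).
    fast_path = os.path.join(tmpdir, "fast.pkl.gz")
    df.to_pickle(fast_path, compression={"method": "gzip", "compresslevel": 1})
    fast = pd.read_pickle(fast_path)

print(explicit.equals(df), inferred.equals(df), fast.equals(df))
```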

msgpack

pandas support for msgpack has been removed in version 1.0.0. It is recommended to use pickle instead.

Alternatively, you can also use the Arrow IPC serialization format for on-the-wire transmission of pandas objects. For details, see the pyarrow documentation.

HDF5 (PyTables)

HDFStore is a dict-like object which reads and writes pandas objects using the high-performance HDF5 format via the PyTables library. See the cookbook for some advanced strategies.

Warning

pandas uses PyTables for reading and writing HDF5 files, which allows serializing object-dtype data with pickle. Loading pickled data received from untrusted sources can be unsafe.

See https://docs.python.org/3/library/pickle.html for more.

Objects can be written to the file just like adding key-value pairs to a dict. In a current or later Python session, you can retrieve stored objects by key, or with dotted (attribute) access for items stored under the root node. An object is deleted by specifying its key. A store can be closed explicitly, or opened with a context manager so that it is closed automatically.
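The dict-like usage described above can be sketched as follows, assuming the optional PyTables package (imported as tables) is installed. The function name hdfstore_basics, the key "df" and the sample data are hypothetical.

```python
import pandas as pd


def hdfstore_basics(path):
    """Write, read back, and delete a DataFrame in the HDF5 file at `path`."""
    df = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [4.0, 5.0, 6.0]})
    # The context manager closes the file even if an error occurs.
    with pd.HDFStore(path) as store:
        store["df"] = df            # write, like assigning into a dict
        restored = store["df"]      # read back by key
        keys_before = store.keys()  # keys are reported with a leading '/'
        del store["df"]             # delete the object stored under the key
        keys_after = store.keys()
    return restored, keys_before, keys_after
```

Calling hdfstore_basics("store.h5") returns the restored frame together with the key lists before and after the deletion.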

Read/write API

HDFStore supports a top-level API using read_hdf for reading and to_hdf for writing, similar to how read_csv and to_csv work.

HDFStore will by default not drop rows that are all missing. This behavior can be changed by setting dropna=True.
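A small sketch of the top-level API and of the dropna behavior, assuming PyTables is installed. The function name, the keys "kept" and "dropped", and the data are made up; the middle row of the frame is entirely missing so that the effect of dropna=True is visible.

```python
import numpy as np
import pandas as pd


def top_level_hdf_demo(path):
    """Write with to_hdf, read with read_hdf, and compare dropna settings."""
    df = pd.DataFrame({"col1": [0.0, np.nan, 2.0], "col2": [1.0, np.nan, np.nan]})
    # Default: rows that are entirely missing are kept.
    df.to_hdf(path, key="kept", format="table", mode="w")
    # dropna=True: rows that are entirely missing are not written.
    df.to_hdf(path, key="dropped", format="table", dropna=True)
    kept = pd.read_hdf(path, key="kept")
    dropped = pd.read_hdf(path, key="dropped")
    # Table-format data can also be filtered while reading.
    filtered = pd.read_hdf(path, key="kept", where="index > 0")
    return kept, dropped, filtered
```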

Fixed format

The examples above show storing using put, which writes the HDF5 to PyTables in a fixed array format, called the fixed format. These types of stores are not appendable once written (though you can simply remove them and rewrite). Nor are they queryable; they must be retrieved in their entirety. They also do not support dataframes with non-unique column names. The fixed format stores offer very fast writing and slightly faster reading than table stores. This format is specified by default when using put or to_hdf or by format='fixed' or format='f'.

Warning

A fixed format will raise a TypeError if you try to retrieve using a where.
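The restriction on the fixed format can be illustrated with the sketch below, which assumes PyTables is installed. The function name and data are hypothetical; the function writes a frame in the fixed format, reads it back whole, and then captures the error raised by a where query.

```python
import pandas as pd


def fixed_format_rejects_where(path):
    """Return the whole frame and the error message from querying a fixed store."""
    df = pd.DataFrame({"A": range(10), "B": range(10, 20)})
    df.to_hdf(path, key="df", format="fixed", mode="w")
    whole = pd.read_hdf(path, key="df")  # fixed stores are read in their entirety
    try:
        pd.read_hdf(path, key="df", where="index > 5")
    except TypeError as exc:
        return whole, str(exc)
    return whole, None
```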

Table format

HDFStore supports another PyTables format on disk, the table format. Conceptually a table is shaped very much like a DataFrame, with rows and columns. A table may be appended to in the same or other sessions. In addition, delete and query type operations are supported. This format is specified by passing format='table' or format='t' to append, put or to_hdf.

This format can be set as an option as well: pd.set_option('io.hdf.default_format', 'table') makes put, append and to_hdf store in the table format by default.

Note

You can also create a table by passing format='table' or format='t' to a put operation.
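As a sketch of what the table format allows, the function below appends to a table in two steps, creates a second table through put, and then runs a query. It assumes PyTables is installed; the function name, keys and data are hypothetical.

```python
import pandas as pd


def table_format_demo(path):
    """Append to a table twice, then select everything and a filtered subset."""
    first = pd.DataFrame({"A": [1.0, 2.0], "B": [3.0, 4.0]}, index=[0, 1])
    second = pd.DataFrame({"A": [5.0, 6.0], "B": [7.0, 8.0]}, index=[2, 3])
    with pd.HDFStore(path) as store:
        store.append("df", first)                    # append writes the table format
        store.append("df", second)                   # a table can be extended later
        store.put("df_put", first, format="table")   # put needs format='table' or 't'
        everything = store.select("df")
        subset = store.select("df", where="index >= 2")
    return everything, subset
```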

Hierarchical keys

Keys to a store can be specified as a string. These can be in a hierarchical path-name like format (e.g. foo/bar/bah), which will generate a hierarchy of sub-stores (or Groups in PyTables parlance). Keys can be specified without the leading '/' and are always absolute (e.g. 'foo' refers to '/foo'). Removal operations can remove everything in the sub-store and below, so be careful.

You can walk through the group hierarchy using the walk method, which will yield a tuple for each group key along with the relative keys of its contents.

Warning

Hierarchical keys cannot be retrieved with dotted (attribute) access as described above for items stored under the root node. Instead, use explicit string-based keys.
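Hierarchical keys, walking the hierarchy, and group removal can be sketched as below, assuming PyTables is installed. The function name and the key names are hypothetical, and the shape of the tuples collected from walk is left uninterpreted because the text only says that each tuple pairs a group key with the relative keys of its contents.

```python
import pandas as pd


def hierarchical_keys_demo(path):
    """Store frames under nested keys, walk the hierarchy, then remove a group."""
    df = pd.DataFrame({"A": [1, 2, 3]})
    with pd.HDFStore(path) as store:
        store.put("foo/bar/bah", df)         # the leading '/' is optional
        store.append("food/orange", df)
        store.append("food/apple", df)
        keys = sorted(store.keys())
        walked = list(store.walk())          # one tuple per group
        restored = store["foo/bar/bah"]      # string key, not store.foo.bar.bah
        store.remove("food")                 # removes everything below /food
        keys_after = sorted(store.keys())
    return keys, walked, restored, keys_after
```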

Storing types

Storing mixed types in a table

Storing mixed-dtype data is supported. Strings are stored as fixed-width, using the maximum size of the appended column. Subsequent attempts at appending longer strings will raise a ValueError.

Passing min_itemsize={'values': size} as a parameter to append will set a larger minimum for the string columns. Storing floats, strings, ints, bools and datetime64 is currently supported. For string columns, passing nan_rep='nan' to append will change the default nan representation on disk (which converts to/from np.nan); this defaults to nan.
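The effect of the string width can be sketched as follows, assuming PyTables is installed. The function name, keys, data and the width of 50 are made up for illustration; the first table is created from short strings and then rejects a longer one, while the second reserves room up front with min_itemsize.

```python
import pandas as pd


def min_itemsize_demo(path):
    """Show that a longer string fails unless min_itemsize reserved enough room."""
    short = pd.DataFrame({"name": ["ab", "cd"], "value": [1.0, 2.0]})
    longer = pd.DataFrame({"name": ["a much longer string"], "value": [3.0]})
    with pd.HDFStore(path) as store:
        store.append("narrow", short)        # width fixed by the longest string so far
        try:
            store.append("narrow", longer)
            narrow_error = None
        except ValueError as exc:
            narrow_error = str(exc)
        # Reserve a wider string column before appending anything longer.
        store.append("wide", short, min_itemsize={"values": 50})
        store.append("wide", longer)
        wide = store.select("wide")
    return narrow_error, wide
```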

Storing MultiIndex DataFrames

Storing MultiIndex DataFrames as tables is very similar to storing/selecting from homogeneous index DataFrames.

index = pd.MultiIndex(
    levels=[["foo", "bar", "baz", "qux"], ["one", "two", "three"]],
    codes=[[0, 0, 0, 1, 1, 2, 2, 3, 3, 3], [0, 1, 2, 0, 1, 1, 2, 0, 1, 2]],
    names=["foo", "bar"],
)
df_mi = pd.DataFrame(np.random.randn(10, 3), index=index, columns=["A", "B", "C"])

store.append("df_mi", df_mi)
store.select("df_mi")

# the levels are automatically included as data columns
store.select("df_mi", "foo=bar")

Note

The index keyword is reserved and cannot be used as a level name.

Querying

Querying a table

select and delete operations have an optional criterion that can be specified to select/delete only a subset of the data. This allows one to have a very large on-disk table and retrieve only a portion of the data.

A query is specified using the Term class under the hood, as a boolean expression.

  • index and columns are supported indexers of DataFrames

  • if data_columns are specified, these can be used as additional indexers

  • level name in a MultiIndex, with default name level_0, level_1, … if not provided

Valid comparison operators are: =, ==, !=, >, >=, <, <=

Valid boolean expressions are combined with

  • | : or

  • & : and

  • ( and ) : for grouping

These rules are similar to how boolean expressions are used in pandas for indexing.

Note

  • = will be automatically expanded to the comparison operator ==

  • ~ is the not operator, but can only be used in very limited circumstances

  • If a list/tuple of expressions is passed, they will be combined via &

The following are valid expressions

  • 'index >= date'

  • "columns = ['A', 'D']"

  • "columns in ['A', 'D']"

  • 'columns = A'

  • 'columns == A'

  • "~(columns = ['A', 'B'])"

  • 'index > df.index[3] & string = "bar"'

  • '(index > df.index[3] & index <= df.index[6]) | string = "bar"'

  • "ts >= Timestamp('2012-02-01')"

  • "major_axis>=20130101"

The indexers are on the left-hand side of the sub-expression: columns, major_axis, ts

The right-hand side of the sub-expression (after a comparison operator) can be

  • functions that will be evaluated, e.g. Timestamp('2012-02-01')

  • strings, e.g. "bar"

  • date-like, e.g. 20130101, or "20130101"

  • lists, e.g. "['A', 'B']"

  • variables that are defined in the local namespace, e.g. date

Note

Passing a string to a query by interpolating it into the query expression is not recommended. Simply assign the string of interest to a variable and use that variable in an expression. For example, do this

string = "HolyMoly'"
store.select("df", "index == string")

instead of this

string = "HolyMoly'"
store.select("df", f"index == {string}")

The latter will not work and will raise a SyntaxError. Note that there is a single quote followed by a double quote in the string variable.

If you must interpolate, use the '%r' format specifier

store.select("df", "index == %r" % string)

which will quote string.
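The quoting behaviour of the %r format specifier can be checked without an HDF5 store. The sketch below is plain Python rather than pandas code: it builds both query strings and uses the built-in compile() only as a rough stand-in for a query parser, to show that the naively interpolated expression is syntactically broken while the %r version is not. The names naive, quoted and is_valid_expression are illustrative and not part of pandas.

```python
# A string containing a single quote, as in the note above.
string = "HolyMoly'"

# Naive interpolation pastes the raw characters into the expression,
# leaving an unbalanced quote behind.
naive = f"index == {string}"

# %r inserts repr(string), i.e. a properly quoted Python literal.
quoted = "index == %r" % string


def is_valid_expression(text):
    """Return True if text parses as a Python expression (a rough proxy only)."""
    try:
        compile(text, "<query>", "eval")
    except SyntaxError:
        return False
    return True


print(naive)
print(quoted)
print(is_valid_expression(naive), is_valid_expression(quoted))
```

Only the quoted form keeps the string delimiters balanced, which is why it is the safer choice when interpolation cannot be avoided.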

Here are some examples

dfq = pd.DataFrame(
    np.random.randn(10, 4),
    columns=list("ABCD"),
    index=pd.date_range("20130101", periods=10),
)
store.append("dfq", dfq, format="table", data_columns=True)

Use boolean expressions, with in-line function evaluation

store.select("dfq", "index>pd.Timestamp('20130104') & columns=['A', 'B']")

Use inline column reference

store.select("dfq", where="A>0 or C>0")

The columns keyword can be supplied to select a list of columns to be returned; this is equivalent to passing 'columns=list_of_columns_to_filter'

store.select("df", "columns=['A', 'B']")

The start and stop parameters can be specified to limit the total search space. These are in terms of the total number of rows in a table.

Note

select will raise a ValueError if the query expression has an unknown variable reference. Usually this means that you are trying to select on a column that is not a data_column.

select will raise a SyntaxError if the query expression is not valid.

Query timedelta64[ns]

You can store and query using the timedelta64[ns] type. Terms can be specified in the format <float>(<unit>), where float may be signed (and fractional), and unit can be D, s, ms, us, or ns for the timedelta. Here is an example

from datetime import timedelta

dftd = pd.DataFrame(
    {
        "A": pd.Timestamp("20130101"),
        "B": [pd.Timestamp("20130101") + timedelta(days=i, seconds=10) for i in range(10)],
    }
)
dftd["C"] = dftd["A"] - dftd["B"]

store.append("dftd", dftd, data_columns=True)
store.select("dftd", "C<'-3.5D'")
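To make the <float>(<unit>) term format for timedelta queries concrete, here is a minimal sketch that converts such a term into a whole number of nanoseconds. It is an illustration written for this text, not how pandas parses query terms; the helper term_to_nanoseconds, the pattern, and the unit table are all hypothetical names.

```python
import re

# Nanoseconds per unit, for the units named in the text above.
NANOS_PER_UNIT = {
    "D": 86_400 * 10**9,
    "s": 10**9,
    "ms": 10**6,
    "us": 10**3,
    "ns": 1,
}

# <float>(<unit>): an optionally signed, optionally fractional number plus a unit.
TERM_PATTERN = re.compile(r"([+-]?\d+(?:\.\d+)?)(D|s|ms|us|ns)")


def term_to_nanoseconds(term):
    """Convert a term such as '-3.5D' into a whole number of nanoseconds."""
    match = TERM_PATTERN.fullmatch(term)
    if match is None:
        raise ValueError(f"not a <float>(<unit>) term: {term!r}")
    number, unit = match.groups()
    return round(float(number) * NANOS_PER_UNIT[unit])


print(term_to_nanoseconds("-3.5D"))
print(term_to_nanoseconds("10s"))
```

Under these assumptions, a term like '-3.5D' reads as minus three and a half days, which is how a comparison such as C < '-3.5D' should be understood.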

Query MultiIndex

Selecting from a MultiIndex can be achieved by using the name of the level

df_mi.index.names
store.select("df_mi", "foo=baz and bar=two")

If the MultiIndex level names are None, the levels are automatically made available via the level_n keyword, with n the level of the MultiIndex you want to select from

index = pd.MultiIndex(
    levels=[["foo", "bar", "baz", "qux"], ["one", "two", "three"]],
    codes=[[0, 0, 0, 1, 1, 2, 2, 3, 3, 3], [0, 1, 2, 0, 1, 1, 2, 0, 1, 2]],
)
df_mi_2 = pd.DataFrame(np.random.randn(10, 3), index=index, columns=["A", "B", "C"])
store.append("df_mi_2", df_mi_2)

# the levels are automatically included as data columns with keyword level_n
store.select("df_mi_2", "level_0=foo and level_1=two")

Indexing

You can create/modify an index for a table with create_table_index after data is already in the table (after an append/put operation). Creating a table index is highly encouraged. This will speed up your queries a great deal when you use a select with the indexed dimension as the where.

Note

Indexes are automagically created on the indexables and any data columns you specify. This behavior can be turned off by passing index=False to append.

# an index has already been created automatically (in the first section)
i = store.root.df.table.cols.index.index
i.optlevel, i.kind

# change an index by passing new parameters
store.create_table_index("df", optlevel=9, kind="full")
i = store.root.df.table.cols.index.index
i.optlevel, i.kind

Oftentimes when appending large amounts of data to a store, it is useful to turn off index creation for each append, then recreate the index at the end

df_1 = pd.DataFrame(np.random.randn(10, 2), columns=list("AB"))
df_2 = pd.DataFrame(np.random.randn(10, 2), columns=list("AB"))

st = pd.HDFStore("appends.h5", mode="w")
st.append("df", df_1, data_columns=["B"], index=False)
st.append("df", df_2, data_columns=["B"], index=False)
st.get_storer("df").table

Then create the index when finished appending

st.create_table_index("df", columns=["B"], optlevel=9, kind="full")
st.get_storer("df").table
st.close()

See here for how to create a completely-sorted-index (CSI) on an existing store

Query via data columns

You can designate (and index) certain columns that you want to be able to perform queries on (other than the indexable columns, which you can always query). For instance, say you want to perform this common operation on disk and return just the frame that matches the query. You can specify data_columns=True to force all columns to be data_columns

df_dc = df.copy()
df_dc["string"] = "foo"
df_dc["string2"] = "cool"

# on-disk operations
store.append("df_dc", df_dc, data_columns=["B", "C", "string", "string2"])
store.select("df_dc", where="B > 0")
store.select("df_dc", "B > 0 & C > 0 & string == foo")

# the in-memory version of this type of selection
df_dc[(df_dc.B > 0) & (df_dc.C > 0) & (df_dc.string == "foo")]

There is some performance degradation from making lots of columns into data columns, so it is up to the user to designate these. In addition, you cannot change data columns (nor indexables) after the first append/put operation (of course, you can simply read in the data and create a new table).

Iterator

You can pass iterator=True or chunksize=number_in_a_chunk to select and select_as_multiple to return an iterator on the results. The default is 50,000 rows returned in a chunk

for df in store.select("df", chunksize=3):
    print(df)

Note

You can also use the iterator with read_hdf, which will open, then automatically close the store when finished iterating

for df in pd.read_hdf("store.h5", "df", chunksize=3):
    print(df)

Note that the chunksize keyword applies to the source rows. So if you are doing a query, the chunksize will subdivide the total rows in the table and the query is applied to each subdivision, returning an iterator on potentially unequal-sized chunks.

Here is a recipe for generating a query and using it to create equal-sized return chunks: select the coordinates of the matching rows with select_as_coordinates, split those coordinates into pieces of the desired size, and pass each piece as the where argument of select.
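The splitting step of the equal-sized chunk recipe can be sketched in plain Python. This is a minimal illustration, not the pandas implementation: the coordinates list holds made-up values standing in for the row locations a query would return, and chunks is a hypothetical helper name.

```python
def chunks(items, size):
    """Split items into consecutive pieces holding at most size elements each."""
    return [items[i : i + size] for i in range(0, len(items), size)]


# Stand-in for the row coordinates returned by a query (hypothetical values).
coordinates = [1, 3, 5, 7, 9]

for piece in chunks(coordinates, 2):
    # With a real store, each piece would be passed as where=piece to select.
    print(piece)
```

Every piece has the requested size except possibly the last one, which holds whatever remains. Because the split happens after the query has been evaluated, the returned chunks are even, unlike chunksize applied to the source rows.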

Advanced queries

Select a single column

To retrieve a single indexable or data column, use the method select_column. This will, for example, enable you to get the index very quickly. The method returns a Series of the result, indexed by the row number. It does not currently accept the where selector.
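A small sketch of retrieving single columns, assuming PyTables is installed. The file name, key and column names are placeholders chosen for the example.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame(
    {"A": [1.0, 2.0, 3.0], "B": [10, 20, 30]},
    index=pd.date_range("2000-01-01", periods=3),
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "column_demo.h5")  # hypothetical file name
    with pd.HDFStore(path) as store:
        # B must be a data column to be retrievable on its own.
        store.append("df", df, data_columns=["B"])

        index_col = store.select_column("df", "index")  # the indexable
        b_col = store.select_column("df", "B")          # a data column

print(index_col)
print(b_col)
```

Both results are Series indexed by row number, not by the stored index.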

Selecting coordinates

Sometimes you want to get the coordinates (a.k.a. the index locations) of your query. This returns an Index of the resulting locations. These coordinates can also be passed to subsequent where operations.
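As an illustration, the sketch below fetches the coordinates of a query and feeds them back into a selection. It assumes PyTables is installed; the file name, key and cut-off date are arbitrary.

```python
import os
import tempfile

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"A": np.arange(10.0)},
    index=pd.date_range("2002-01-01", periods=10),
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "coords_demo.h5")  # hypothetical file name
    with pd.HDFStore(path) as store:
        store.append("df", df)

        # Integer row locations of the rows matching the query.
        c = store.select_as_coordinates("df", "index > pd.Timestamp('2002-01-05')")

        # The coordinates can be reused as the where argument.
        subset = store.select("df", where=c)

print(list(c))
print(subset)
```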

Selecting using a where mask

Sometimes your query can involve creating a list of rows to select. Usually this mask would be a resulting index from an indexing operation. This example selects the months of a DatetimeIndex which are 5.
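A possible version of this example, assuming PyTables is installed. The key df_mask, the column name value and the file name are illustrative only. The stored index is read back as a column, filtered in memory, and the surviving row numbers are used as the mask.

```python
import os
import tempfile

import numpy as np
import pandas as pd

df_mask = pd.DataFrame(
    {"value": np.arange(366)},
    index=pd.date_range("2000-01-01", periods=366),
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "mask_demo.h5")  # hypothetical file name
    with pd.HDFStore(path) as store:
        store.append("df_mask", df_mask)

        # Read only the index, then keep the row numbers that fall in May.
        c = store.select_column("df_mask", "index")
        where = c[pd.DatetimeIndex(c).month == 5].index

        may_rows = store.select("df_mask", where=where)

print(len(may_rows))
```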

Storer object

If you want to inspect the stored object, retrieve it via get_storer. You could use this programmatically to, say, get the number of rows in an object.
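For instance, the row count of a stored table can be read without loading the data. This is a minimal sketch assuming PyTables is installed; the file name and key are placeholders.

```python
import os
import tempfile

import numpy as np
import pandas as pd

df = pd.DataFrame({"A": np.arange(8.0)})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "storer_demo.h5")  # hypothetical file name
    with pd.HDFStore(path) as store:
        store.append("df", df)
        # The storer describes the on-disk table; nrows is its row count.
        nrows = store.get_storer("df").nrows

print(nrows)
```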

Multiple table queries

The methods append_to_multiple and select_as_multiple can perform appending/selecting from multiple tables at once. The idea is to have one table (call it the selector table) in which you index most/all of the columns, and on which you perform your queries. The other table(s) are data tables with an index matching the selector table’s index. You can then perform a very fast query on the selector table, yet get lots of data back. This method is similar to having a very wide table, but enables more efficient queries.

The append_to_multiple method splits a given single DataFrame into multiple tables according to d, a dictionary that maps the table names to a list of ‘columns’ you want in that table. If None is used in place of a list, that table will have the remaining unspecified columns of the given DataFrame. The argument selector defines which table is the selector table (which you can make queries from). The argument dropna will drop rows from the input DataFrame to ensure tables are synchronized. This means that if a row for one of the tables being written to is entirely np.nan, that row will be dropped from all tables.

If dropna is False, THE USER IS RESPONSIBLE FOR SYNCHRONIZING THE TABLES. Remember that entirely np.nan rows are not written to the HDFStore, so if you choose to call dropna=False, some tables may have more rows than others, and therefore select_as_multiple may not work or it may return unexpected results.
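The workflow might look like the sketch below, assuming PyTables is installed. The table names df1_mt and df2_mt, the file name and the data are invented for the example, and dropna=True is passed explicitly so that the row that is entirely missing in the selector table is dropped from both tables.

```python
import os
import tempfile

import numpy as np
import pandas as pd

df_mt = pd.DataFrame(
    np.arange(48, dtype="float64").reshape(8, 6) - 20,
    index=pd.date_range("2000-01-01", periods=8),
    columns=list("ABCDEF"),
)
# Make one row entirely missing in the columns of the selector table.
df_mt.loc[df_mt.index[1], ["A", "B"]] = np.nan

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "multi_demo.h5")  # hypothetical file name
    with pd.HDFStore(path) as store:
        # A and B go to the selector table, the remaining columns to df2_mt.
        store.append_to_multiple(
            {"df1_mt": ["A", "B"], "df2_mt": None},
            df_mt,
            selector="df1_mt",
            dropna=True,
        )
        rows1 = store.get_storer("df1_mt").nrows
        rows2 = store.get_storer("df2_mt").nrows

        # Query on the selector table, get columns back from both tables.
        result = store.select_as_multiple(
            ["df1_mt", "df2_mt"],
            where=["A>0", "B>0"],
            selector="df1_mt",
        )

print(rows1, rows2)
print(result)
```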

Delete from a table

You can delete from a table selectively by specifying a where. In deleting rows, it is important to understand that PyTables deletes rows by erasing the rows, then moving the following data. Thus deleting can potentially be a very expensive operation depending on the orientation of your data. To get optimal performance, it’s worthwhile to have the dimension you are deleting be the first of the indexables.

Data is ordered (on the disk) in terms of the indexables. Here’s a simple use case. You store panel-type data, with dates in the major_axis and ids in the minor_axis. The data is then interleaved like this

  • date_1
    • id_1

    • id_2

    • .

    • id_n

  • date_2
    • id_1

    • .

    • id_n

It should be clear that a delete operation on the major_axis will be fairly quick, as one chunk is removed, then the following data moved. On the other hand a delete operation on the minor_axis will be very expensive. In this case it would almost certainly be faster to rewrite the table using a where that selects all but the missing data.
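A short sketch of a selective delete, assuming PyTables is installed. The file name, key and the condition A >= 7 are arbitrary choices for the example.

```python
import os
import tempfile

import numpy as np
import pandas as pd

df = pd.DataFrame({"A": np.arange(10)})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "delete_demo.h5")  # hypothetical file name
    with pd.HDFStore(path) as store:
        store.append("df", df, data_columns=["A"])

        # Remove only the rows matching the where clause.
        n_removed = store.remove("df", where="A >= 7")
        remaining = store.select("df")

print(n_removed)
print(remaining)
```

For the expensive case described above, the alternative is to select the rows to keep and write them to a new table instead of deleting in place.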

Warning

Please note that HDF5 DOES NOT RECLAIM SPACE in the h5 files automatically. Thus, repeatedly deleting (or removing nodes) and adding again WILL TEND TO INCREASE THE FILE SIZE.

To repack and clean the file, use ptrepack.

Notes & caveats

Compression

PyTables allows the stored data to be compressed. This applies to all kinds of stores, not just tables. Two parameters are used to control compression: complevel and complib.

  • complevel specifies if and how hard data is to be compressed. complevel=0 and complevel=None disable compression, and 0<complevel<10 enables compression.

  • complib specifies which compression library to use. If nothing is specified, the default library zlib is used. A compression library usually optimizes for either good compression rates or speed, and the results will depend on the type of data. Which type of compression to choose depends on your specific needs and data. The list of supported compression libraries

    • zlib. The default compression library. A classic in terms of compression, achieves good compression rates but is somewhat slow

    • lzo. Fast compression and decompression

    • bzip2. Good compression rates

    • blosc. Fast compression and decompression

      Support for alternative blosc compressors

      • blosc:blosclz. This is the default compressor for blosc

      • blosc:lz4. A compact, very popular and fast compressor

      • blosc:lz4hc. A tweaked version of LZ4, produces better compression ratios at the expense of speed

      • blosc:snappy. A popular compressor used in many places

      • blosc:zlib. A classic; somewhat slower than the previous ones, but achieving better compression ratios

      • blosc:zstd. An extremely well balanced codec; it provides the best compression ratios among the others above, and at reasonably fast speed

    If complib is defined as something other than the listed libraries, a ValueError exception is issued.

Note

If the library specified with the complib option is missing on your platform, compression defaults to zlib without further ado.

Enable compression for all objects within the file

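One way to do this is to pass the compression settings when opening the store. The sketch below assumes PyTables is installed with zlib support; the file names and the highly compressible test data are invented so that the size difference is visible.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Constant data compresses very well, which makes the effect easy to see.
df = pd.DataFrame({"zeros": np.zeros(100_000)})

with tempfile.TemporaryDirectory() as tmp:
    plain_path = os.path.join(tmp, "plain.h5")    # hypothetical file names
    packed_path = os.path.join(tmp, "packed.h5")

    with pd.HDFStore(plain_path) as store:
        store.put("df", df)

    # complevel and complib apply to every object written to this store.
    with pd.HDFStore(packed_path, complevel=9, complib="zlib") as store:
        store.put("df", df)
        roundtrip = store["df"]

    plain_size = os.path.getsize(plain_path)
    packed_size = os.path.getsize(packed_path)

print(plain_size, packed_size)
```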

Or on-the-fly compression (this only applies to tables) in stores where compression is not enabled

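A sketch of the on-the-fly variant, assuming PyTables is installed with zlib support. The store itself is opened without compression and the settings are passed to append for one table only; the file name and key are placeholders.

```python
import os
import tempfile

import numpy as np
import pandas as pd

df = pd.DataFrame({"A": np.arange(1000.0)})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "on_the_fly.h5")  # hypothetical file name
    with pd.HDFStore(path) as store:
        # Compression is requested for this table only.
        store.append("df", df, complib="zlib", complevel=5)
        roundtrip = store.select("df")

print(roundtrip.shape)
```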

ptrepack

PyTables offers better write performance when tables are compressed after they are written, as opposed to turning on compression at the very beginning. You can use the supplied PyTables utility ptrepack. In addition, ptrepack can change compression levels after the fact.

ptrepack --chunkshape=auto --propindexes --complevel=9 --complib=blosc in.h5 out.h5

Furthermore, ptrepack in.h5 out.h5 will repack the file to allow you to reuse previously deleted space. Alternatively, one can simply remove the file and write again, or use the copy method.

Caveats

Warning

HDFStore is not threadsafe for writing. The underlying PyTables only supports concurrent reads (via threading or processes). If you need reading and writing at the same time, you need to serialize these operations in a single thread in a single process. You will corrupt your data otherwise. See GH2397 for more information.

  • If you use locks to manage write access between multiple processes, you may want to use fsync() before releasing write locks. For convenience you can use store.flush(fsync=True) to do this for you.

  • Once a table is created, its columns (DataFrame) are fixed; only exactly the same columns can be appended.

  • Be aware that timezones (e.g., pytz.timezone('US/Eastern')) are not necessarily equal across timezone versions. So if data is localized to a specific timezone in the HDFStore using one version of a timezone library and that data is updated with another version, the data will be converted to UTC since these timezones are not considered equal. Either use the same version of the timezone library or use tz_convert with the updated timezone definition.

Warning

PyTables will show a NaturalNameWarning if a column name cannot be used as an attribute selector. Natural identifiers contain only letters, numbers, and underscores, and may not begin with a number. Other identifiers cannot be used in a where clause and are generally a bad idea.

DataTypes

HDFStore will map an object dtype to the PyTables underlying dtype. This means the following types are known to work

Type                                                  Represents missing values
floating: float64, float32, float16                   np.nan
integer: int64, int32, int8, uint64, uint32, uint8
boolean
datetime64[ns]                                        NaT
timedelta64[ns]                                       NaT
categorical: see the section below
object: strings                                       np.nan

unicode columns are not supported, and WILL FAIL

Categorical data

You can write data that contains category dtypes to a HDFStore. Queries work the same as if it was an object array. However, the category dtyped data is stored in a more efficient manner.
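A minimal sketch of storing and querying categorical data, assuming PyTables is installed. The key dfcat, the column names and the file name are illustrative.

```python
import os
import tempfile

import numpy as np
import pandas as pd

dfcat = pd.DataFrame(
    {
        "A": pd.Series(list("aabbcdba")).astype("category"),
        "B": np.arange(8.0),
    }
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cat_demo.h5")  # hypothetical file name
    with pd.HDFStore(path) as store:
        # A is made a data column so that it can be queried.
        store.append("dfcat", dfcat, format="table", data_columns=["A"])

        # The query is written as if A held plain strings.
        result = store.select("dfcat", where="A in ['b', 'c']")

print(result)
print(result.dtypes)
```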

String columns

min_itemsize

The underlying implementation of HDFStore uses a fixed column width (itemsize) for string columns. A string column's itemsize is calculated as the maximum length of the data (for that column) passed to the HDFStore in the first append. If a subsequent append introduces a string larger than the column can hold, an Exception will be raised (otherwise you could have a silent truncation of these columns, leading to loss of information). In the future we may relax this and allow a user-specified truncation to occur.

Pass min_itemsize on the first table creation to a-priori specify the minimum length of a particular string column. min_itemsize can be an integer, or a dict mapping a column name to an integer. You can pass values as a key to allow all indexables or data_columns to have this min_itemsize.

Passing a min_itemsize dict will cause all passed columns to be created as data_columns automatically.

Note

If you are not passing any data_columns, then the min_itemsize will be the maximum of the length of any string passed.
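The effect can be seen in a sketch like the one below, which assumes PyTables is installed. The keys no_room and with_room, the file name and the strings are made up. The first table sizes its string column from the first append and rejects a longer string later; the second reserves room up front.

```python
import os
import tempfile

import pandas as pd

short = pd.DataFrame({"A": ["foo"] * 3})
longer = pd.DataFrame({"A": ["a much longer string"] * 2})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "itemsize_demo.h5")  # hypothetical file name
    with pd.HDFStore(path) as store:
        # Width is fixed at 3 characters by the first append.
        store.append("no_room", short)
        try:
            store.append("no_room", longer)
            overflow_rejected = False
        except ValueError:
            overflow_rejected = True

        # Reserve 30 characters when the table is created.
        store.append("with_room", short, min_itemsize=30)
        store.append("with_room", longer)
        total_rows = store.get_storer("with_room").nrows

print(overflow_rejected, total_rows)
```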

nan_rep

String columns will serialize a np.nan (a missing value) with the nan_rep string representation. This defaults to the string value nan. You could inadvertently turn an actual nan value into a missing value.
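The pitfall can be reproduced with a sketch along these lines, assuming PyTables is installed. The keys, the file name and the replacement marker _nan_ are arbitrary choices for the example.

```python
import os
import tempfile

import pandas as pd

# The last value is the literal string "nan", not a missing value.
dfss = pd.DataFrame({"A": ["foo", "bar", "nan"]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "nan_rep_demo.h5")  # hypothetical file name
    with pd.HDFStore(path) as store:
        # Default nan_rep: the string "nan" comes back as missing.
        store.append("dfss", dfss)
        default_result = store.select("dfss")

        # A different marker keeps the literal string intact.
        store.append("dfss2", dfss, nan_rep="_nan_")
        custom_result = store.select("dfss2")

print(default_result)
print(custom_result)
```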

External compatibility

HDFStore writes table format objects in specific formats suitable for producing loss-less round trips to pandas objects. For external compatibility, HDFStore can read native PyTables format tables

It is possible to write an HDFStore object that can easily be imported into R using the rhdf5 library (Package website). Create a table format store like this

In R this file can be read into a data.frame object using the rhdf5 library. The following example function reads the corresponding column names and data values from the values and assembles them into a data.frame

Now you can import the DataFrame into R

Note

The R function lists the entire HDF5 file's contents and assembles the data.frame object from all matching nodes, so use this only as a starting point if you have stored multiple DataFrame objects to a single HDF5 file

Performance

  • The tables format comes with a writing performance penalty as compared to fixed stores. The benefit is the ability to append/delete and query (potentially very large amounts of data). Write times are generally longer than with regular stores. Query times can be quite fast, especially on an indexed axis

  • You can pass chunksize=<int> to append, specifying the write chunksize (default is 50000). This will significantly lower your memory usage on writing

  • You can pass expectedrows=<int> to the first append, to set the TOTAL number of rows that PyTables will expect. This will optimize read/write performance

  • Duplicate rows can be written to tables, but are filtered out in selection (with the last items being selected; thus a table is unique on major, minor pairs)

  • A PerformanceWarning will be raised if you are attempting to store types that will be pickled by PyTables (rather than stored as endemic types). More information and some solutions are available in the pandas documentation

Feather

Feather provides binary columnar serialization for data frames. It is designed to make reading and writing data frames efficient, and to make sharing data across data analysis languages easy

Feather is designed to faithfully serialize and de-serialize DataFrames, supporting all of the pandas dtypes, including extension dtypes such as categorical and datetime with tz

Several caveats

  • The format will NOT write an Index, or MultiIndex for the DataFrame and will raise an error if a non-default one is provided. You can call .reset_index() to store the index or .reset_index(drop=True) to ignore it

  • Duplicate column names and non-string column names are not supported

  • Actual Python objects in object dtype columns are not supported. These will raise a helpful error message on an attempt at serialization
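
The first caveat can be illustrated without writing any file. The sketch below uses a made-up frame and hypothetical variable names (kept, dropped) to show the two ways of turning a non-default index into something the Feather writer should accept.

```python
import pandas as pd

# Illustrative frame with a non-default (string) index; the data is made up.
df = pd.DataFrame({"value": [1.5, 2.5, 3.5]}, index=["x", "y", "z"])
df.index.name = "label"

# Option 1: keep the index by moving it into an ordinary column.
kept = df.reset_index()

# Option 2: discard the index and fall back to the default RangeIndex.
dropped = df.reset_index(drop=True)

print(list(kept.columns))
print(list(dropped.columns))
```

Either result has a default RangeIndex, so the choice only depends on whether the index values carry information worth storing.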

See the Full Documentation

Write to a feather file

Read from a feather file

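
A minimal round trip through a Feather file might look like the sketch below. The frame, the file name example.feather, and the temporary directory are assumptions made for illustration; the calls are DataFrame.to_feather and pandas.read_feather, which need the optional pyarrow dependency, so the sketch checks for it first and leaves result as None otherwise.

```python
import importlib.util
import os
import tempfile

import pandas as pd

# Illustrative data; column names are placeholders.
df = pd.DataFrame({"a": [1, 2, 3], "b": [0.5, 1.5, 2.5], "c": ["x", "y", "z"]})

result = None  # stays None when pyarrow is not installed
if importlib.util.find_spec("pyarrow") is not None:
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "example.feather")
        df.to_feather(path)             # write to a feather file
        result = pd.read_feather(path)  # read from a feather file

print(result)
```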

Parquet

Apache Parquet provides a partitioned binary columnar serialization for data frames. It is designed to make reading and writing data frames efficient, and to make sharing data across data analysis languages easy. Parquet can use a variety of compression techniques to shrink the file size as much as possible while still maintaining good read performance

Parquet is designed to faithfully serialize and de-serialize DataFrames, supporting all of the pandas dtypes, including extension dtypes such as datetime with tz

Several caveats

  • Duplicate column names and non-string column names are not supported

  • The pyarrow engine always writes the index to the output, but fastparquet only writes non-default indexes. This extra column can cause problems for non-pandas consumers that are not expecting it. You can force including or omitting indexes with the index argument, regardless of the underlying engine

  • Index level names, if specified, must be strings

  • In the pyarrow engine, categorical dtypes for non-string types can be serialized to parquet, but will de-serialize as their primitive dtype

  • The pyarrow engine preserves the ordered flag of categorical dtypes with string types. fastparquet does not preserve the ordered flag

  • Non supported types include Interval and actual Python object types. These will raise a helpful error message on an attempt at serialization. Period type is supported with pyarrow >= 0.16.0

  • The pyarrow engine preserves extension data types such as the nullable integer and string data type (requiring pyarrow >= 0.16.0, and requiring the extension type to implement the needed protocols, see the extension types documentation)

You can specify an engine to direct the serialization. This can be one of pyarrow, or fastparquet, or auto. If the engine is NOT specified, then the pd.options.io.parquet.engine option is checked; if this is also auto, then pyarrow is tried, falling back to fastparquet

See the documentation for pyarrow and fastparquet
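
The lookup order described above can be sketched as a small function. resolve_parquet_engine is a hypothetical helper written for illustration and is not part of pandas; it only mirrors the described fallback by reading the io.parquet.engine option and checking which libraries can be found.

```python
import importlib.util

import pandas as pd


def resolve_parquet_engine(engine=None):
    """Hypothetical helper mirroring the lookup order described above."""
    if engine is None:
        # No explicit engine: consult the pandas option.
        engine = pd.get_option("io.parquet.engine")
    if engine != "auto":
        return engine
    # "auto": try pyarrow first, then fall back to fastparquet.
    for candidate in ("pyarrow", "fastparquet"):
        if importlib.util.find_spec(candidate) is not None:
            return candidate
    return None  # neither library is installed


print(pd.get_option("io.parquet.engine"))
print(resolve_parquet_engine())
print(resolve_parquet_engine("fastparquet"))
```

In real code the same choice is made by passing engine= to to_parquet or read_parquet, or by setting the option once per session.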

Note

These engines are very similar and should read/write nearly identical parquet format files. pyarrow (in sufficiently recent versions) supports timedelta data, and fastparquet (in sufficiently recent versions) supports timezone-aware datetimes. These libraries differ by having different underlying dependencies (fastparquet by using numba, while pyarrow uses a C library)

Write to a parquet file

Read from a parquet file

Read only certain columns of a parquet file

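
The three operations named above (write, read, read only certain columns) can be sketched as follows. The frame and the file name example.parquet are assumptions made for illustration, and the code only runs the round trip when pyarrow or fastparquet can be found, leaving full and subset as None otherwise.

```python
import importlib.util
import os
import tempfile

import pandas as pd

# Illustrative data; column names are placeholders.
df = pd.DataFrame({"a": [1, 2, 3], "b": [0.5, 1.5, 2.5], "c": [True, False, True]})

have_engine = any(
    importlib.util.find_spec(name) is not None for name in ("pyarrow", "fastparquet")
)

full = subset = None
if have_engine:
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "example.parquet")
        df.to_parquet(path)                                 # write to a parquet file
        full = pd.read_parquet(path)                        # read everything back
        subset = pd.read_parquet(path, columns=["a", "b"])  # read only two columns

print(subset)
```

Because Parquet is columnar, asking for a subset of columns lets the reader skip the data of the other columns entirely.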

Handling indexes

Serializing a DataFrame to parquet may include the implicit index as one or more columns in the output file. Thus, writing a DataFrame with the two columns a and b creates a parquet file with three columns if you use pyarrow for serialization: a, b, and __index_level_0__. If you're using fastparquet, the index may or may not be written to the file

This unexpected extra column causes some databases like Amazon Redshift to reject the file, because that column doesn’t exist in the target table

If you want to omit a dataframe's indexes when writing, pass index=False to to_parquet(). This creates a parquet file with just the two expected columns, a and b. If your DataFrame has a custom index, you won't get it back when you load this file into a DataFrame

Passing index=True will always write the index, even if that is not the underlying engine's default behavior
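
The effect of the index argument can be checked by looking at the column names stored in the file. The sketch below is one way to do that, assuming pyarrow is installed (it is skipped otherwise) and using pyarrow.parquet.read_schema to inspect the files; the frame and file names are made up. It uses a non-default string index on purpose, since newer pyarrow versions may record a default RangeIndex as metadata rather than as a column.

```python
import importlib.util
import os
import tempfile

import pandas as pd

# A frame with a custom (non-default) index; values are made up.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=["r1", "r2"])

stored_default = stored_no_index = reloaded = None
if importlib.util.find_spec("pyarrow") is not None:
    import pyarrow.parquet as pq

    with tempfile.TemporaryDirectory() as tmp:
        with_index = os.path.join(tmp, "with_index.parquet")
        without_index = os.path.join(tmp, "without_index.parquet")

        df.to_parquet(with_index, engine="pyarrow")                  # index kept
        df.to_parquet(without_index, engine="pyarrow", index=False)  # index omitted

        stored_default = pq.read_schema(with_index).names
        stored_no_index = pq.read_schema(without_index).names
        reloaded = pd.read_parquet(without_index, engine="pyarrow")

print(stored_default)
print(stored_no_index)
```

Reading the file written with index=False gives back a frame with a fresh default index, so the labels r1 and r2 are lost.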

Partitioning Parquet files

Parquet supports partitioning of data based on the values of one or more columns

When writing with to_parquet, the path specifies the parent directory to which data will be saved. The partition_cols are the column names by which the dataset will be partitioned. Columns are partitioned in the order they are given. The partition splits are determined by the unique values in the partition columns, so the result is a directory tree with one subdirectory per unique value of each partition column
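
A small sketch of a partitioned write, assuming pyarrow is installed (the code is skipped otherwise) and using a temporary directory as the parent path. The frame is made up; after writing, the sketch lists the subdirectories that were created for the partition column a.

```python
import importlib.util
import os
import tempfile

import pandas as pd

# Illustrative data: column "a" has two unique values, 0 and 1.
df = pd.DataFrame({"a": [0, 0, 1, 1], "b": [0, 1, 0, 1]})

partitions = None  # stays None when pyarrow is not installed
if importlib.util.find_spec("pyarrow") is not None:
    with tempfile.TemporaryDirectory() as root:
        df.to_parquet(path=root, engine="pyarrow", partition_cols=["a"], compression=None)
        # Each unique value of "a" gets its own subdirectory under the parent path.
        partitions = sorted(
            name for name in os.listdir(root) if os.path.isdir(os.path.join(root, name))
        )

print(partitions)
```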

ORC

New in version 1.0.0

Similar to the parquet format, the ORC Format is a binary columnar serialization for data frames. It is designed to make reading data frames efficient. pandas provides both the reader and the writer for the ORC format, read_orc() and to_orc(). This requires the pyarrow library

Warning

  • It is highly recommended to install pyarrow using conda because of some issues caused by pyarrow

  • to_orc() requires pyarrow>=7.0.0

  • read_orc() and to_orc() are not supported on Windows yet; you can find valid environments in the pandas installation documentation on optional dependencies

  • For supported dtypes please refer to the Arrow documentation on supported ORC features

  • Currently timezones in datetime columns are not preserved when a dataframe is converted into ORC files

Write to an orc file

Read from an orc file

Read only certain columns of an orc file
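
The ORC operations can be sketched in the same way as the Parquet ones. The frame and the file name example.orc are assumptions made for illustration. Because ORC support depends on the platform and on the installed pyarrow and pandas versions, the sketch first tries to import pyarrow.orc and checks that DataFrame.to_orc exists, and skips the round trip otherwise.

```python
import os
import tempfile

import pandas as pd

# Illustrative data limited to plain numeric and boolean columns.
df = pd.DataFrame({"a": [1, 2, 3], "b": [0.5, 1.5, 2.5], "c": [True, False, True]})

full = subset = None
try:
    import pyarrow.orc  # noqa: F401  (not available on every platform)

    orc_available = hasattr(df, "to_orc")  # the writer only exists in newer pandas
except ImportError:
    orc_available = False

if orc_available:
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "example.orc")
        df.to_orc(path)                                 # write to an orc file
        full = pd.read_orc(path)                        # read from an orc file
        subset = pd.read_orc(path, columns=["a", "b"])  # read only certain columns

print(subset)
```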

SQL queries

The pandas.io.sql module provides a collection of query wrappers to both facilitate data retrieval and to reduce dependency on DB-specific API. Database abstraction is provided by SQLAlchemy if installed. In addition you will need a driver library for your database. Examples of such drivers are psycopg2 for PostgreSQL or pymysql for MySQL. For SQLite this is included in Python's standard library by default. You can find an overview of supported drivers for each SQL dialect in the SQLAlchemy docs

Jika SQLAlchemy tidak diinstal, fallback hanya disediakan untuk sqlite (dan untuk mysql untuk kompatibilitas mundur, tetapi ini tidak digunakan lagi dan akan dihapus di versi yang akan datang). Mode ini membutuhkan adaptor database Python yang mematuhi Python DB-API

Lihat juga beberapa untuk beberapa strategi lanjutan

The key functions are:

read_sql_table(table_name, con[, schema, ...])
    Read SQL database table into a DataFrame.

read_sql_query(sql, con[, index_col, ...])
    Read SQL query into a DataFrame.

read_sql(sql, con[, index_col, ...])
    Read SQL query or database table into a DataFrame.

DataFrame.to_sql(name, con[, schema, ...])
    Write records stored in a DataFrame to a SQL database.

Note

The function read_sql() is a convenience wrapper around read_sql_table() and read_sql_query() (and for backward compatibility) and will delegate to the specific function depending on the provided input (database table name or SQL query). Table names do not need to be quoted if they have special characters.

In the following example, we use the SQLite SQL database engine. You can use a temporary SQLite database where data are stored in "memory".

To connect with SQLAlchemy you use the create_engine() function to create an engine object from a database URI. You only need to create the engine once per database you are connecting to. For more information on create_engine() and the URI formatting, see the examples below and the SQLAlchemy documentation.

If you want to manage your own connections you can pass one of those instead. A connection can be opened with a Python context manager, which automatically closes the connection after the block has completed. See the SQLAlchemy docs for an explanation of how the database connection is handled.
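As a minimal sketch of creating an engine once and managing a connection yourself, the snippet below uses an in-memory SQLite database. It assumes SQLAlchemy is installed; the table name "data" and the sample rows are illustrative only.

```python
import pandas as pd
from sqlalchemy import create_engine

# Create the engine once; "sqlite:///:memory:" is an in-memory SQLite database.
engine = create_engine("sqlite:///:memory:")

df = pd.DataFrame({"id": [26, 42, 63], "Col_1": ["X", "Y", "Z"]})

# The context manager closes the connection when the block ends, and
# conn.begin() commits the transaction if the block succeeds.
with engine.connect() as conn, conn.begin():
    df.to_sql("data", conn, index=False)
    loaded = pd.read_sql_table("data", conn)

print(loaded)
```

Both calls run on the same connection here, so the read sees the table that was just written.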

Warning

When you open a connection to a database you are also responsible for closing it. Side effects of leaving a connection open may include locking the database or other breaking behaviour.

Assuming the following data is in a DataFrame data, we can insert it into the database using to_sql().

id  Date        Col_1  Col_2  Col_3
26  2012-10-18  X       25.7  True
42  2012-10-19  Y      -12.4  False
63  2012-10-20  Z       5.73  True
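A minimal sketch of inserting a small table like this one with to_sql() is shown below. To stay dependency-free it passes a standard-library sqlite3 connection instead of a SQLAlchemy engine, which is a substitution made for this example; the table name "data" is also illustrative.

```python
import sqlite3

import pandas as pd

data = pd.DataFrame(
    {
        "id": [26, 42, 63],
        "Date": pd.to_datetime(["2012-10-18", "2012-10-19", "2012-10-20"]),
        "Col_1": ["X", "Y", "Z"],
        "Col_2": [25.7, -12.4, 5.73],
        "Col_3": [True, False, True],
    }
)

con = sqlite3.connect(":memory:")
try:
    # index=False keeps the default RangeIndex out of the table.
    data.to_sql("data", con, index=False)
    n_rows = con.execute("SELECT COUNT(*) FROM data").fetchone()[0]
    stored_ids = [row[0] for row in con.execute("SELECT id FROM data ORDER BY id")]
finally:
    con.close()

print(n_rows, stored_ids)
```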

With some databases, writing large DataFrames can result in errors due to packet size limitations being exceeded. This can be avoided by setting the chunksize parameter when calling to_sql(). For example, passing chunksize=1000 writes data to the database in batches of 1000 rows at a time.
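The batching behaviour can be sketched as follows, again with a sqlite3 connection standing in for a real server (an assumption of this example; SQLite itself has no packet size limit). The frame and the table name "numbers" are made up.

```python
import sqlite3

import pandas as pd

big = pd.DataFrame({"n": range(2500)})

con = sqlite3.connect(":memory:")
try:
    # Rows are sent in batches of at most 1000 (1000 + 1000 + 500 here).
    big.to_sql("numbers", con, index=False, chunksize=1000)
    stored = con.execute("SELECT COUNT(*) FROM numbers").fetchone()[0]
finally:
    con.close()

print(stored)
```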

SQL data types

to_sql() will try to map your data to an appropriate SQL data type based on the dtype of the data. When you have columns of dtype object, pandas will try to infer the data type.

You can always override the default type by specifying the desired SQL type of any of the columns by using the dtype argument. This argument needs a dictionary mapping column names to SQLAlchemy types (or strings for the sqlite3 fallback mode). For example, you can specify the SQLAlchemy String type instead of the default Text type for string columns.
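The override can be sketched in sqlite3 fallback mode, where the dtype values are plain SQL type strings rather than SQLAlchemy types. The table name "data_dtype" and the choice to store Col_2 as TEXT are illustrative assumptions.

```python
import sqlite3

import pandas as pd

df = pd.DataFrame({"Col_1": ["X", "Y", "Z"], "Col_2": [25.7, -12.4, 5.73]})

con = sqlite3.connect(":memory:")
try:
    # In sqlite3 fallback mode the dtype mapping uses SQL type strings.
    df.to_sql("data_dtype", con, index=False, dtype={"Col_2": "TEXT"})
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
    declared = {
        row[1]: row[2] for row in con.execute("PRAGMA table_info(data_dtype)")
    }
finally:
    con.close()

print(declared)
```

With a SQLAlchemy engine the same argument would take a SQLAlchemy type instead, for example dtype={"Col_1": sqlalchemy.types.String}.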

Note

Due to the limited support for timedeltas in the different database flavors, columns with type timedelta64 will be written as integer values as nanoseconds to the database and a warning will be raised.

Note

Columns of category dtype will be converted to the dense representation as you would get with np.asarray(categorical) (e.g. for string categories this gives an array of strings). Because of this, reading the database table back in does not generate a categorical.
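The loss of the categorical dtype on a round trip can be sketched as follows. The example uses a sqlite3 connection and made-up names ("grades", "grade"), and shows one way to restore the dtype afterwards.

```python
import sqlite3

import pandas as pd

df = pd.DataFrame({"grade": pd.Categorical(["a", "b", "a"])})

con = sqlite3.connect(":memory:")
try:
    df.to_sql("grades", con, index=False)
    back = pd.read_sql_query("SELECT grade FROM grades", con)
finally:
    con.close()

was_categorical = isinstance(df["grade"].dtype, pd.CategoricalDtype)
is_categorical = isinstance(back["grade"].dtype, pd.CategoricalDtype)
print(was_categorical, is_categorical)

# Restore the dtype explicitly if it is needed after the round trip.
restored = back["grade"].astype("category")
```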

Datetime data types

Using SQLAlchemy, to_sql() is capable of writing datetime data that is timezone naive or timezone aware. However, the resulting data stored in the database ultimately depends on the supported data type for datetime data of the database system being used.

The following table lists supported data types for datetime data for some common databases. Other database dialects may have different data types for datetime data.

Database    SQL Datetime Types                     Timezone Support
SQLite      TEXT                                   No
MySQL       TIMESTAMP or DATETIME                  No
PostgreSQL  TIMESTAMP or TIMESTAMP WITH TIME ZONE  Yes

When writing timezone aware data to databases that do not support timezones, the data will be written as timezone naive timestamps that are in local time with respect to the timezone.

read_sql_table() is also capable of reading datetime data that is timezone aware or naive. When reading TIMESTAMP WITH TIME ZONE types, pandas will convert the data to UTC.

Insertion method

The parameter method controls the SQL insertion clause used. Possible values are:

  • None: Uses the standard SQL INSERT clause (one per row).

  • 'multi': Pass multiple values in a single INSERT clause. It uses a special SQL syntax not supported by all backends. This usually provides better performance for analytic databases like Presto and Redshift, but has worse performance for traditional SQL backends if the table contains many columns. For more information, check the SQLAlchemy documentation.

  • callable with signature (pandas_table, conn, keys, data_iter): This can be used to implement a more performant insertion method based on specific backend dialect features.

Example of a callable using the PostgreSQL COPY clause
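A callable with the signature (pandas_table, conn, keys, data_iter) that uses PostgreSQL's COPY clause might look like the sketch below. It assumes a SQLAlchemy connection backed by the psycopg2 driver, whose cursors provide copy_expert(); the function name psql_insert_copy is just a label, and the snippet only defines the function without connecting to a server.

```python
import csv
from io import StringIO


def psql_insert_copy(table, conn, keys, data_iter):
    """Insert rows with PostgreSQL COPY ... FROM STDIN.

    table: the pandas table object (provides .name and .schema)
    conn: SQLAlchemy connection; conn.connection is the raw DB-API connection
    keys: list of column names
    data_iter: iterable of row tuples
    """
    dbapi_conn = conn.connection
    with dbapi_conn.cursor() as cur:
        # Serialize the rows to an in-memory CSV buffer.
        buffer = StringIO()
        csv.writer(buffer).writerows(data_iter)
        buffer.seek(0)

        columns = ", ".join('"{}"'.format(k) for k in keys)
        if table.schema:
            table_name = "{}.{}".format(table.schema, table.name)
        else:
            table_name = table.name

        sql = "COPY {} ({}) FROM STDIN WITH CSV".format(table_name, columns)
        # copy_expert is a psycopg2 cursor method (assumed driver).
        cur.copy_expert(sql=sql, file=buffer)
```

It would be passed as the method argument, for example df.to_sql("data", engine, method=psql_insert_copy), against a PostgreSQL engine.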

Reading tables

read_sql_table() will read a database table given the table name and optionally a subset of columns to read.

Note

In order to use read_sql_table(), you must have the SQLAlchemy optional dependency installed.

Note

Note that pandas infers column dtypes from query outputs, and not by looking up data types in the physical database schema. For example, assume userid is an integer column in a table. Then, intuitively, select userid ... will return an integer-valued series, while select cast(userid as text) ... will return an object-valued (str) series. Accordingly, if the query output is empty, then all resulting columns will be returned as object-valued (since they are most general). If you foresee that your query will sometimes generate an empty result, you may want to explicitly typecast afterwards to ensure dtype integrity.
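The inference from query output, and the explicit typecast for a possibly empty result, can be sketched as follows. The example uses a sqlite3 connection; the table name "users" and its rows are made up.

```python
import sqlite3

import pandas as pd

con = sqlite3.connect(":memory:")
try:
    pd.DataFrame({"userid": [1, 2, 3]}).to_sql("users", con, index=False)

    as_int = pd.read_sql_query("SELECT userid FROM users", con)
    as_text = pd.read_sql_query(
        "SELECT CAST(userid AS TEXT) AS userid FROM users", con
    )
    # No row matches, so there is no output to infer a dtype from.
    empty = pd.read_sql_query("SELECT userid FROM users WHERE userid > 100", con)
finally:
    con.close()

# Cast explicitly so an empty result still has the expected dtype.
empty_typed = empty.astype({"userid": "int64"})

print(as_int["userid"].dtype, as_text["userid"].dtype, empty_typed["userid"].dtype)
```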

You can also specify the name of the column as the DataFrame index, and specify a subset of columns to be read.

And you can explicitly force columns to be parsed as dates.

If needed you can explicitly specify a format string, or a dict of arguments to pass to pandas.to_datetime().
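Choosing an index column, reading a column subset, and parsing a date column with an explicit format can be sketched together. Because read_sql_table() requires SQLAlchemy, this example substitutes read_sql_query() with a sqlite3 connection, which accepts the same index_col and parse_dates keywords; the column subset is expressed in the SELECT list rather than through a columns argument. Table and column names are illustrative.

```python
import sqlite3

import pandas as pd

con = sqlite3.connect(":memory:")
try:
    pd.DataFrame(
        {
            "id": [26, 42, 63],
            "Date": ["2012-10-18", "2012-10-19", "2012-10-20"],  # stored as text
            "Col_1": ["X", "Y", "Z"],
        }
    ).to_sql("data", con, index=False)

    # index_col picks the index; the SELECT list picks the column subset.
    # parse_dates maps a column name to a strftime-style format string.
    parsed = pd.read_sql_query(
        "SELECT id, Date FROM data",
        con,
        index_col="id",
        parse_dates={"Date": "%Y-%m-%d"},
    )
finally:
    con.close()

print(parsed.dtypes)
```

The value in the parse_dates mapping may also be a dict of keyword arguments for pandas.to_datetime(), such as {"Date": {"format": "%Y-%m-%d"}}.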

You can check if a table exists using has_table().

Schema support

Reading from and writing to different schemas is supported through the schema keyword in the read_sql_table() and to_sql() functions. Note however that this depends on the database flavor (sqlite does not have schemas).

Querying

You can query using raw SQL in the read_sql_query() function. In this case you must use the SQL variant appropriate for your database. When using SQLAlchemy, you can also pass SQLAlchemy Expression language constructs, which are database-agnostic.

Of course, you can specify a more "complex" query.
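A minimal sketch of raw SQL queries with read_sql_query() follows, using a sqlite3 connection and an illustrative "data" table. The "?" placeholder is sqlite3's parameter style and would differ for other drivers.

```python
import sqlite3

import pandas as pd

con = sqlite3.connect(":memory:")
try:
    pd.DataFrame(
        {
            "id": [26, 42, 63],
            "Col_1": ["X", "Y", "Z"],
            "Col_2": [25.7, -12.4, 5.73],
        }
    ).to_sql("data", con, index=False)

    everything = pd.read_sql_query("SELECT * FROM data", con)
    # A more selective query; the value is bound through params.
    filtered = pd.read_sql_query(
        "SELECT id, Col_1, Col_2 FROM data WHERE id = ?", con, params=(42,)
    )
finally:
    con.close()

print(filtered)
```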

The read_sql_query() function supports a chunksize argument. Specifying this will return an iterator through chunks of the query result.
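Iterating over a query result in chunks can be sketched as follows, with a sqlite3 connection and a made-up 20-row table named "data_chunks".

```python
import sqlite3

import pandas as pd

con = sqlite3.connect(":memory:")
try:
    pd.DataFrame({"a": range(20)}).to_sql("data_chunks", con, index=False)

    chunk_sizes = []
    total = 0
    # With chunksize set, read_sql_query yields DataFrames of at most 5 rows.
    for chunk in pd.read_sql_query("SELECT * FROM data_chunks", con, chunksize=5):
        chunk_sizes.append(len(chunk))
        total += int(chunk["a"].sum())
finally:
    con.close()

print(chunk_sizes, total)
```

Processing each chunk inside the loop keeps only one chunk in memory at a time, which is the main reason to use the argument.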

You can also run a plain query without creating a DataFrame with execute(). This is useful for queries that don't return values, such as INSERT. This is functionally equivalent to calling execute on the SQLAlchemy engine or db connection object. Again, you must use the SQL syntax variant appropriate for your database.

Engine connection examples

To connect with SQLAlchemy you use the create_engine() function to create an engine object from a database URI. You only need to create the engine once per database you are connecting to.

For more information, see the examples in the SQLAlchemy documentation.

Advanced SQLAlchemy queries

You can use SQLAlchemy constructs to describe your query

Use sqlalchemy.text() to specify query parameters in a backend-neutral way.
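A backend-neutral parameterized query with sqlalchemy.text() can be sketched as follows. It assumes SQLAlchemy is installed and uses an in-memory SQLite database; the table "data", its rows, and the parameter name col1 are illustrative.

```python
import pandas as pd
import sqlalchemy as sa

engine = sa.create_engine("sqlite:///:memory:")
df = pd.DataFrame({"id": [26, 42, 63], "Col_1": ["X", "Y", "Z"]})

with engine.connect() as conn, conn.begin():
    df.to_sql("data", conn, index=False)
    # ":col1" is a named bind parameter that SQLAlchemy renders in the
    # placeholder style of whichever driver is in use.
    query = sa.text("SELECT * FROM data WHERE Col_1 = :col1")
    matches = pd.read_sql(query, conn, params={"col1": "X"})

print(matches)
```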

If you have an SQLAlchemy description of your database you can express where conditions using SQLAlchemy expressions


You can combine SQLAlchemy expressions with parameters passed to read_sql() using sqlalchemy.bindparam().

Sqlite fallback

The use of sqlite is supported without using SQLAlchemy. This mode requires a Python database adapter which respects the Python DB-API.

You can create a connection with the standard-library sqlite3 module, for example sqlite3.connect(":memory:"), and then pass it as the con argument when writing with to_sql() and querying with read_sql_query().

Google BigQuery

Warning

Starting in 0.20.0, pandas has split off Google BigQuery support into the separate package pandas-gbq. You can pip install pandas-gbq to get it.

The pandas-gbq package provides functionality to read/write from Google BigQuery.

pandas integrates with this external package. If pandas-gbq is installed, you can use the pandas methods pd.read_gbq and DataFrame.to_gbq, which will call the respective functions from pandas-gbq.

Full documentation can be found in the pandas-gbq documentation.

Stata format

The method to_stata() will write a DataFrame into a .dta file. The format version of this file is always 115 (Stata 12).
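Writing a frame to a .dta file and reading it back can be sketched as follows. The frame contents and the file name "stata.dta" are illustrative, and write_index=False is passed so the round trip returns only the original columns.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"A": [0.25, -1.5, 3.0], "B": [10.0, 20.0, 30.0]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "stata.dta")
    df.to_stata(path, write_index=False)  # write the .dta file
    back = pd.read_stata(path)  # read it back into a DataFrame

print(back)
```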

Stata data files have limited data type support. Additionally, Stata reserves certain values to represent missing data. Exporting a non-missing value that is outside of the permitted range in Stata for a particular data type will retype the variable to the next larger size. For example, int8 values are restricted to lie between -127 and 100 in Stata, and so variables with values above 100 will trigger a conversion to int16. nan values in floating point data types are stored as the basic missing data type (. in Stata).

Note

It is not possible to export missing data values for integer data types.

The Stata writer gracefully handles other data types including int64, bool, uint8, uint16 and uint32 by casting to the smallest supported type that can represent the data. For example, data with a type of uint8 will be cast to int8 if all values are less than 100 (the upper bound for non-missing int8 data in Stata), or, if values are outside of this range, the variable is cast to int16.
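The casting rule above can be checked with a small round trip. This is a minimal sketch, assuming a reasonably recent pandas; the column names and the temporary file name are hypothetical.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical sample data: two uint8 columns, one that fits Stata's
# non-missing int8 range (values up to 100) and one that does not.
df = pd.DataFrame(
    {
        "small": np.array([1, 50, 99], dtype=np.uint8),
        "large": np.array([1, 50, 200], dtype=np.uint8),
    }
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "casting.dta")
    df.to_stata(path, write_index=False)
    # read_stata keeps the on-disk Stata types by default.
    round_trip = pd.read_stata(path)

stored_dtypes = {col: str(dtype) for col, dtype in round_trip.dtypes.items()}
print(stored_dtypes)
```

The column whose values stay below 100 should come back as int8, while the column containing 200 should come back as int16.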

Warning

Conversion from int64 to float64 may result in a loss of precision if int64 values are larger than 2**53.
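The 2**53 limit comes from the 53-bit significand of a 64-bit float. A minimal sketch using only plain Python integers and floats (no pandas involved) shows the effect:

```python
# A 64-bit float has a 53-bit significand, so integers above 2**53
# cannot all be represented exactly.
big = 2**53 + 1
as_float = float(big)

# Converting back shows that the last digit was rounded away.
precision_lost = int(as_float) != big
print(big, int(as_float), precision_lost)
```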

Warning

StataWriter and to_stata() only support fixed width strings containing up to 244 characters, a limitation imposed by the version 115 dta file format. Attempting to write Stata dta files with strings longer than 244 characters raises a ValueError.
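A short sketch of the string-length limit. It assumes an installed pandas whose to_stata() accepts a version argument; version=114 is passed explicitly here so that an older fixed-width format is used, and the column and file names are hypothetical.

```python
import os
import tempfile

import pandas as pd

# One string of 245 characters, one more than the fixed-width limit.
df = pd.DataFrame({"text": pd.Series(["x" * 245], dtype=object)})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "long_strings.dta")
    try:
        df.to_stata(path, version=114, write_index=False)
        raised = False
    except ValueError as exc:
        raised = True
        print(exc)
```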

Reading from Stata format

The top-level function read_stata will read a dta file and return either a DataFrame or a StataReader that can be used to read the file incrementally. For example, pd.read_stata("stata.dta") returns the whole file as a DataFrame.

Specifying a chunksize yields a StataReader instance that can be used to read chunksize lines from the file at a time. The StataReader object can be used as an iterator.
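A minimal sketch of chunked reading, assuming a pandas version in which the reader returned by read_stata can be used as a context manager. The sample frame and file name are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample data written to a temporary .dta file.
df = pd.DataFrame({"A": range(10), "B": [float(i) for i in range(10)]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "stata.dta")
    df.to_stata(path, write_index=False)

    # With chunksize set, read_stata returns a reader that yields DataFrames.
    with pd.read_stata(path, chunksize=3) as reader:
        chunk_shapes = [chunk.shape for chunk in reader]

print(chunk_shapes)
```

With ten rows and a chunk size of three, the last chunk holds the single remaining row.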

For more fine-grained control, use iterator=True and specify chunksize with each call to read().
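A minimal sketch of the iterator=True form, where the number of rows is chosen on each read() call. The sample frame and file name are hypothetical.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"A": range(10)})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "stata.dta")
    df.to_stata(path, write_index=False)

    # iterator=True returns the reader itself instead of a DataFrame.
    with pd.read_stata(path, iterator=True) as reader:
        chunk1 = reader.read(5)  # first five rows
        chunk2 = reader.read(5)  # next five rows

print(len(chunk1), len(chunk2))
```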

Currently the index is retrieved as a column.

The parameter convert_categoricals indicates whether value labels should be read and used to create a Categorical variable from them. Value labels can also be retrieved by the function value_labels, which requires read() to be called before use.

The parameter convert_missing indicates whether missing value representations in Stata should be preserved. If False (the default), missing values are represented as np.nan. If True, missing values are represented using StataMissingValue objects, and columns containing missing values will have object data type.
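The difference between the two convert_missing settings can be sketched as follows. The sample column and file name are hypothetical, and the behaviour shown assumes a reasonably recent pandas.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# A float column with one missing value.
df = pd.DataFrame({"x": [1.5, np.nan, 3.0]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "missing.dta")
    df.to_stata(path, write_index=False)

    # Default: Stata missing values become NaN.
    as_nan = pd.read_stata(path)
    # convert_missing=True: Stata missing values are kept as objects.
    preserved = pd.read_stata(path, convert_missing=True)

print(as_nan["x"].dtype, preserved["x"].dtype)
```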

Note

read_stata() and StataReader support .dta formats 113-115 (Stata 10-12), 117 (Stata 13), and 118 (Stata 14).

Note

Setting preserve_dtypes=False will upcast to the standard pandas data types: int64 for all integer types and float64 for floating point data. By default, the Stata data types are preserved when importing.

Categorical data

Categorical data can be exported to Stata data files as value labeled data. The exported data consists of the underlying category codes as integer data values and the categories as value labels. Stata does not have an explicit equivalent to a Categorical, and information about whether the variable is ordered is lost when exporting.

Warning

Stata only supports string value labels, and so str() is called on the categories when exporting data. Exporting Categorical variables with non-string categories produces a warning, and can result in a loss of information if the str representations of the categories are not unique.

Labeled data can similarly be imported from Stata data files as Categorical variables using the keyword argument convert_categoricals (True by default). The keyword argument order_categoricals (True by default) determines whether imported Categorical variables are ordered.

Note

When importing categorical data, the values of the variables in the Stata data file are not preserved, since Categorical variables always use integer data types between -1 and n-1, where n is the number of categories. If the original values in the Stata data file are required, these can be imported by setting convert_categoricals=False, which will import the original data (but not the variable labels). The original values can be matched to the imported categorical data since there is a simple mapping between the original Stata data values and the category codes of imported Categorical variables: missing values are assigned code -1, the smallest original value is assigned 0, the second smallest is assigned 1, and so on until the largest original value is assigned the code n-1.
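A minimal round-trip sketch of labeled data, assuming a reasonably recent pandas. The column name, categories and file name are hypothetical. The same file is read once as a Categorical and once as the stored integer values.

```python
import os
import tempfile

import pandas as pd

grades = pd.Categorical(
    ["low", "high", "low", "mid"],
    categories=["low", "mid", "high"],
    ordered=True,
)
df = pd.DataFrame({"grade": grades})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "labeled.dta")
    df.to_stata(path, write_index=False)

    # Default: value labels are turned back into a Categorical.
    as_categorical = pd.read_stata(path)
    # convert_categoricals=False: the stored integer values are returned.
    as_codes = pd.read_stata(path, convert_categoricals=False)

labels = [str(value) for value in as_categorical["grade"]]
codes = [int(value) for value in as_codes["grade"]]
print(labels, codes)
```

Here the integers written to the file are the category codes, so "low" is stored as 0, "mid" as 1 and "high" as 2.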

Note

Stata supports partially labeled series. These series have value labels for some but not all data values. Importing a partially labeled series will produce a Categorical with string categories for the values that are labeled and numeric categories for values with no label.

SAS formats

The top-level function read_sas() can read (but not write) SAS XPORT (.xpt) and (since v0.18.0) SAS7BDAT (.sas7bdat) format files.

SAS files only contain two value types: ASCII text and floating point values (usually 8 bytes but sometimes truncated). For xport files, there is no automatic type conversion to integers, dates, or categoricals. For SAS7BDAT files, the format codes may allow date variables to be automatically converted to dates. By default the whole file is read and returned as a DataFrame.

Specify a chunksize or use iterator=True to obtain reader objects (XportReader or SAS7BDATReader) for incrementally reading the file. The reader objects also have attributes that contain additional information about the file and its variables.

Read a SAS7BDAT file, for example with df = pd.read_sas("sas_data.sas7bdat")

Obtain an iterator and read an XPORT file 100,000 lines at a time by passing chunksize=100000 to read_sas(), then loop over the returned reader to process each chunk

The specification for the xport file format is available from the SAS web site.

No official documentation is available for the SAS7BDAT format.

SPSS formats

New in version 0.25.0.

The top-level function read_spss() can read (but not write) SPSS SAV (.sav) and ZSAV (.zsav) format files.

SPSS files contain column names. By default the whole file is read, categorical columns are converted into pd.Categorical, and a DataFrame with all columns is returned.

Specify the usecols parameter to obtain a subset of columns. Specify convert_categoricals=False to avoid converting categorical columns into pd.Categorical.

Read an SPSS file, for example with df = pd.read_spss("spss_data.sav")

Extract a subset of columns contained in usecols from an SPSS file and avoid converting categorical columns into pd.Categorical, for example with df = pd.read_spss("spss_data.sav", usecols=["foo", "bar"], convert_categoricals=False)

More information about the SAV and ZSAV file formats is available here

Other file formats

pandas itself only supports IO with a limited set of file formats that map cleanly to its tabular data model. For reading and writing other file formats into and from pandas, we recommend these packages from the broader community.

netCDF

xarray provides data structures inspired by the pandas DataFrame for working with multi-dimensional datasets, with a focus on the netCDF file format and easy conversion to and from pandas.

Performance considerations

This is an informal comparison of various IO methods, using pandas 0.24.2. Timings are machine dependent and small differences should be ignored.

The following test functions will be used below to compare the performance of several IO methods
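As an illustration of what such timing helpers might look like, here is a minimal sketch covering only the pickle and CSV formats. Every function name, the frame size and the file names are hypothetical and are not the functions used for the original comparison; the timings it prints depend entirely on the machine.

```python
import os
import tempfile
import time

import numpy as np
import pandas as pd


def make_frame(rows=10_000, seed=123):
    """Build a hypothetical two-column frame of random floats."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame(
        {"A": rng.standard_normal(rows), "B": rng.standard_normal(rows)}
    )


def timed(func, *args):
    """Return (elapsed seconds, result) for a single call of func."""
    start = time.perf_counter()
    result = func(*args)
    return time.perf_counter() - start, result


def write_pickle(df, path):
    df.to_pickle(path)


def read_pickle(path):
    return pd.read_pickle(path)


def write_csv(df, path):
    df.to_csv(path, index=False)


def read_csv(path):
    return pd.read_csv(path)


df = make_frame()
timings = {}
with tempfile.TemporaryDirectory() as tmp:
    pkl_path = os.path.join(tmp, "test.pkl")
    csv_path = os.path.join(tmp, "test.csv")
    timings["pickle_write"], _ = timed(write_pickle, df, pkl_path)
    timings["pickle_read"], from_pickle = timed(read_pickle, pkl_path)
    timings["csv_write"], _ = timed(write_csv, df, csv_path)
    timings["csv_read"], from_csv = timed(read_csv, csv_path)

for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f} s")
```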

When writing, the top three functions in terms of speed are test_feather_write, test_hdf_fixed_write and test_hdf_fixed_write_compress

When reading, the top three functions in terms of speed are test_feather_read, test_pickle_read and test_hdf_fixed_read

How do you create a binary file in Python?

Example 1: Open a file in binary write mode, then specify the content to write as bytes. Next, use the write function to write the byte content to the binary file.

How do you handle binary files in Python?

To open a file in binary format, add 'b' to the mode parameter. The mode "rb" therefore opens a file in binary format for reading, while the mode "wb" opens a file in binary format for writing.
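A minimal sketch of both modes using only the standard library. The file name and payload are hypothetical, and a temporary directory is used so nothing is left on disk.

```python
import os
import tempfile

# Content must be bytes, not str, when a file is opened in binary mode.
payload = bytes([0, 1, 2, 253, 254, 255])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.bin")

    # "wb": create (or truncate) the file and write raw bytes.
    with open(path, "wb") as f:
        written = f.write(payload)

    # "rb": read the raw bytes back without any text decoding.
    with open(path, "rb") as f:
        restored = f.read()

print(written, restored == payload)
```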

How do you edit a binary file in Python?

Step 1: Search for the word in the binary file. Step 2: While searching the file, the variable "pos" stores the position of the file pointer for the current record, then reading continues through the records. Step 3: If the word being searched for exists, place the write pointer at the end of the previous record, i.e. at pos
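The steps above can be sketched for a file of fixed-width records. This is one possible reading of the steps, not the only one: the record width, padding byte, function name and sample words are all assumptions made for illustration.

```python
import os
import tempfile

RECORD_SIZE = 8  # hypothetical fixed record width in bytes


def replace_record(path, old_word, new_word):
    """Overwrite the first record equal to old_word; return True if found."""
    if len(new_word) > RECORD_SIZE:
        raise ValueError("replacement does not fit in one record")
    with open(path, "r+b") as f:  # read and write without truncating
        while True:
            pos = f.tell()  # start of the record about to be read
            record = f.read(RECORD_SIZE)
            if not record:
                return False  # reached end of file without a match
            if record.rstrip(b" ") == old_word:
                f.seek(pos)  # move the write pointer back to the record start
                f.write(new_word.ljust(RECORD_SIZE, b" "))
                return True


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "records.bin")
    with open(path, "wb") as f:
        for word in (b"alpha", b"beta", b"gamma"):
            f.write(word.ljust(RECORD_SIZE, b" "))

    found = replace_record(path, b"beta", b"delta")
    with open(path, "rb") as f:
        contents = f.read()

print(found, contents)
```

Opening with "r+b" matters here: "wb" would truncate the file before anything could be searched.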

What are the three types of files in Python?

There are three different categories of file objects:
Text files
Buffered binary files
Raw binary files
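The three categories correspond to base classes in the standard io module, and the mode passed to open() decides which one is returned. A minimal sketch, with a hypothetical file name:

```python
import io
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "kinds.bin")
    with open(path, "wb") as f:
        f.write(b"abc")

    # Text mode: bytes are decoded to str.
    with open(path, "r") as text_file:
        is_text = isinstance(text_file, io.TextIOBase)

    # Binary mode with the default buffering.
    with open(path, "rb") as buffered_file:
        is_buffered = isinstance(buffered_file, io.BufferedIOBase)

    # Binary mode with buffering disabled gives a raw file object.
    with open(path, "rb", buffering=0) as raw_file:
        is_raw = isinstance(raw_file, io.RawIOBase)

print(is_text, is_buffered, is_raw)
```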