
How to use Python DataFrame to list

This is a short introduction to pandas, geared mainly for new users. You can see more complex recipes in the Cookbook.

Customarily, we import as follows:

In [1]: import numpy as np

In [2]: import pandas as pd

Object creation

See the Intro to data structures section.

Creating a Series by passing a list of values, letting pandas create a default integer index:

In [3]: s = pd.Series([1, 3, 5, np.nan, 6, 8])

In [4]: s
Out[4]:
0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

Creating a DataFrame by passing a NumPy array, with a datetime index using date_range() and labeled columns:

In [5]: dates = pd.date_range("20130101", periods=6)

In [6]: dates
Out[6]:
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [7]: df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))

In [8]: df
Out[8]:
                   A         B         C         D
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
2013-01-05 -0.424972  0.567020  0.276232 -1.087401
2013-01-06 -0.673690  0.113648 -1.478427  0.524988

Creating a DataFrame by passing a dictionary of objects that can be converted into a series-like structure:

In [9]: df2 = pd.DataFrame(
   ...:     {
   ...:         "A": 1.0,
   ...:         "B": pd.Timestamp("20130102"),
   ...:         "C": pd.Series(1, index=list(range(4)), dtype="float32"),
   ...:         "D": np.array([3] * 4, dtype="int32"),
   ...:         "E": pd.Categorical(["test", "train", "test", "train"]),
   ...:         "F": "foo",
   ...:     }
   ...: )
   ...:

In [10]: df2
Out[10]:
     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo

The columns of the resulting DataFrame have different dtypes:

In [11]: df2.dtypes
Out[11]:
A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object

If you’re using IPython, tab completion for column names (as well as public attributes) is automatically enabled. Here’s a subset of the attributes that will be completed:

In [12]: df2.<TAB>  # noqa: E225, E999
df2.A                  df2.bool
df2.abs                df2.boxplot
df2.add                df2.C
df2.add_prefix         df2.clip
df2.add_suffix         df2.columns
df2.align              df2.copy
df2.all                df2.count
df2.any                df2.combine
df2.append             df2.D
df2.apply              df2.describe
df2.applymap           df2.diff
df2.B                  df2.duplicated

As you can see, the columns A, B, C, and D are automatically tab completed. E and F are there as well; the rest of the attributes have been truncated for brevity.

Viewing data

See the Basics section.

Use DataFrame.head() and DataFrame.tail() to view the top and bottom rows of the frame respectively:

In [13]: df.head()
Out[13]:
                   A         B         C         D
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
2013-01-05 -0.424972  0.567020  0.276232 -1.087401

In [14]: df.tail(3)
Out[14]:
                   A         B         C         D
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
2013-01-05 -0.424972  0.567020  0.276232 -1.087401
2013-01-06 -0.673690  0.113648 -1.478427  0.524988

Display the DataFrame.index or DataFrame.columns:

In [15]: df.index
Out[15]:
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [16]: df.columns
Out[16]: Index(['A', 'B', 'C', 'D'], dtype='object')

DataFrame.to_numpy() gives a NumPy representation of the underlying data. Note that this can be an expensive operation when your DataFrame has columns with different data types, which comes down to a fundamental difference between pandas and NumPy: NumPy arrays have one dtype for the entire array, while pandas DataFrames have one dtype per column. When you call DataFrame.to_numpy(), pandas will find the NumPy dtype that can hold all of the dtypes in the DataFrame. This may end up being object, which requires casting every value to a Python object.

For df, our DataFrame of all floating-point values, DataFrame.to_numpy() is fast and doesn’t require copying data:

In [17]: df.to_numpy()
Out[17]:
array([[ 0.4691, -0.2829, -1.5091, -1.1356],
       [ 1.2121, -0.1732,  0.1192, -1.0442],
       [-0.8618, -2.1046, -0.4949,  1.0718],
       [ 0.7216, -0.7068, -1.0396,  0.2719],
       [-0.425 ,  0.567 ,  0.2762, -1.0874],
       [-0.6737,  0.1136, -1.4784,  0.525 ]])

For df2, the DataFrame with multiple dtypes, DataFrame.to_numpy() is relatively expensive:

In [18]: df2.to_numpy()
Out[18]:
array([[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo']],
      dtype=object)

Note

DataFrame.to_numpy() does not include the index or column labels in the output.
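Since the NumPy representation drops the labels, it is also a convenient route from a DataFrame to plain Python lists. The sketch below is illustrative only: the frame named `frame` and its values are made up for this example, and it shows a few common conversions rather than a single recommended one.

```python
import pandas as pd

# Hypothetical two-row frame, used only for illustration.
frame = pd.DataFrame({"A": [1.0, 2.0], "B": ["x", "y"]})

# Rows as nested lists; index and column labels are not included.
rows = frame.to_numpy().tolist()

# A single column as a flat list.
col_a = frame["A"].tolist()

# Rows as dictionaries keyed by column name.
records = frame.to_dict("records")

# Column labels as a list.
columns = frame.columns.tolist()

print(rows)
print(records)
```

Because `frame` mixes floats and strings, `to_numpy()` falls back to the object dtype here, which is the expensive case described above.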

DataFrame.describe() shows a quick statistic summary of your data:

In [19]: df.describe()

Transposing your data:

In [20]: df.T

DataFrame.sort_index() sorts by an axis:

In [21]: df.sort_index(axis=1, ascending=False)

DataFrame.sort_values() sorts by values:

In [22]: df.sort_values(by="B")

Selection

Note

While standard Python / NumPy expressions for selecting and setting are intuitive and come in handy for interactive work, for production code, we recommend the optimized pandas data access methods, DataFrame.at(), DataFrame.iat(), DataFrame.loc() and DataFrame.iloc().

See the indexing documentation: Indexing and Selecting Data, and MultiIndex / Advanced Indexing.

Getting

Selecting a single column, which yields a Series, equivalent to df.A:

In [23]: df["A"]

Selecting via [] (__getitem__), which slices the rows:

In [24]: df[0:3]

In [25]: df["20130102":"20130104"]

Selection by label

See more in Selection by Label using DataFrame.loc() or DataFrame.at().

For getting a cross section using a label:

In [26]: df.loc[dates[0]]

Selecting on a multi-axis by label:

In [27]: df.loc[:, ["A", "B"]]

Showing label slicing, both endpoints are included:

In [28]: df.loc["20130102":"20130104", ["A", "B"]]

Reduction in the dimensions of the returned object:

In [29]: df.loc["20130102", ["A", "B"]]

For getting a scalar value:

For getting fast access to a scalar (equivalent to the prior method):
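A minimal sketch of the two scalar lookups described above. The frame is rebuilt here with deterministic values (an assumption made for this example, in place of the random data used earlier) so that the result is predictable.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
# Deterministic stand-in for the random frame: values 0.0 .. 23.0.
df = pd.DataFrame(np.arange(24.0).reshape(6, 4), index=dates, columns=list("ABCD"))

# General label-based lookup with a row label and a column label.
value_loc = df.loc[dates[0], "A"]

# Scalar-only accessor; equivalent result, intended for fast single-value access.
value_at = df.at[dates[0], "A"]

print(value_loc, value_at)
```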

Selection by position

See more in Selection by Position using DataFrame.iloc() or DataFrame.iat().

Select via the position of the passed integers:

By integer slices, acting similar to NumPy/Python:

By lists of integer position locations, similar to the NumPy/Python style:

For slicing rows explicitly:

For slicing columns explicitly:

For getting a value explicitly:

For getting fast access to a scalar (equivalent to the prior method):
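The positional selections listed above can be sketched as follows. The frame uses deterministic values (an assumption for this example) so each result is easy to check by hand.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
# Deterministic stand-in for the random frame: values 0.0 .. 23.0.
df = pd.DataFrame(np.arange(24.0).reshape(6, 4), index=dates, columns=list("ABCD"))

row = df.iloc[3]                      # fourth row, returned as a Series
block = df.iloc[3:5, 0:2]             # integer slices; the end is excluded
picked = df.iloc[[1, 2, 4], [0, 2]]   # lists of integer positions
rows_only = df.iloc[1:3, :]           # slice rows explicitly
cols_only = df.iloc[:, 1:3]           # slice columns explicitly
value = df.iloc[1, 1]                 # a single value
fast_value = df.iat[1, 1]             # scalar-only accessor, same value

print(block)
print(value, fast_value)
```

Unlike label slicing with loc, integer slices follow the usual Python convention and leave out the end position.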

Boolean indexing

Using a single column’s values to select data:

Selecting values from a DataFrame where a boolean condition is met:

Using the isin() method for filtering:
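A short sketch of the three boolean-indexing patterns above. The values and the extra column "E" are assumptions chosen for this example so the outcome is deterministic.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
# Values run from -12.0 to 11.0 so that some entries are positive and some are not.
df = pd.DataFrame(np.arange(-12.0, 12.0).reshape(6, 4), index=dates, columns=list("ABCD"))

# Rows where a single column satisfies the condition.
positive_a = df[df["A"] > 0]

# Element-wise condition: same shape as df, NaN where the condition is False.
masked = df[df > 0]

# isin() keeps rows whose value appears in the given list.
df2 = df.copy()
df2["E"] = ["one", "one", "two", "three", "four", "three"]
filtered = df2[df2["E"].isin(["two", "four"])]

print(positive_a)
print(filtered)
```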

Setting

Setting a new column automatically aligns the data by the indexes:

Setting values by label:

Setting values by position:

Setting by assigning with a NumPy array:

The result of the prior setting operations:

A where operation with setting:
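The setting operations in this section can be sketched as below. The frame and the assigned values are assumptions picked for this example; only the access patterns matter.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.arange(24.0).reshape(6, 4), index=dates, columns=list("ABCD"))

# A new column from a Series is aligned on the index. s1 starts one day
# later than df, so the first row of "F" becomes NaN.
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range("20130102", periods=6))
df["F"] = s1

df.at[dates[0], "A"] = -1.0                  # set a value by label
df.iat[0, 1] = -2.0                          # set a value by position
df.loc[:, "D"] = np.array([5.0] * len(df))   # set a column from a NumPy array

# A where-style operation with setting: flip the sign of every positive entry.
df2 = df.copy()
df2[df2 > 0] = -df2

print(df)
print(df2)
```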

Missing data

pandas primarily uses the value np.nan to represent missing data. It is by default not included in computations. See the Missing Data section.

Reindexing allows you to change/add/delete the index on a specified axis. This returns a copy of the data:

DataFrame.dropna() drops any rows that have missing data:

DataFrame.fillna() fills missing data:

isna() gets the boolean mask where values are nan:
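A minimal sketch of reindexing followed by the three missing-data helpers. The frame named df1, the new column "E", and the fill value 5 are assumptions made for this example.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.arange(24.0).reshape(6, 4), index=dates, columns=list("ABCD"))

# reindex returns a new object; the added column "E" starts out as all NaN.
df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ["E"])
df1.loc[dates[0]:dates[1], "E"] = 1.0

dropped = df1.dropna(how="any")   # keep only rows without missing values
filled = df1.fillna(value=5)      # replace missing values with 5
mask = pd.isna(df1)               # boolean mask of missing values

print(dropped)
print(filled)
print(mask)
```

None of these calls modify df1 in place; each returns a new object.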

Operations

See the Basic section on Binary Ops.

Stats

Operations in general exclude missing data.

Performing a descriptive statistic:

In [61]: df.mean()

Same operation on the other axis:

In [62]: df.mean(axis=1)

Operating with objects that have different dimensionality and need alignment. In addition, pandas automatically broadcasts along the specified dimension:

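As a sketch of alignment plus broadcasting, the example below subtracts a Series from every column of a DataFrame. The constant frame and the shifted Series are assumptions chosen so the arithmetic is easy to follow.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
# Every cell is 10.0, so the result of the subtraction is easy to read.
df = pd.DataFrame(np.full((6, 4), 10.0), index=dates, columns=list("ABCD"))

# shift(2) moves the values down two rows, leaving NaN in the first two positions.
s = pd.Series([1, 3, 5, np.nan, 6, 8], index=dates).shift(2)

# Align s with the row index and broadcast it across all columns.
result = df.sub(s, axis="index")

print(result)
```

Rows where s is NaN come out as NaN in every column, because there is nothing to subtract for those dates.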

Apply

DataFrame.apply() applies a user defined function to the data:

In [66]: df.apply(np.cumsum)

In [67]: df.apply(lambda x: x.max() - x.min())

Histogramming

See more at Histogramming and Discretization.
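A minimal histogramming sketch using Series.value_counts(). The integer values are made up for this example (fixed rather than random) so the counts are reproducible.

```python
import pandas as pd

# Hypothetical fixed sample.
s = pd.Series([4, 2, 1, 2, 6, 4, 4, 6, 4, 4])

# Count how often each distinct value occurs, most frequent first.
counts = s.value_counts()

print(counts)
```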

String Methods

Series is equipped with a set of string processing methods in the str attribute that make it easy to operate on each element of the array, as in the code snippet below. Note that pattern-matching in str generally uses regular expressions by default (and in some cases always uses them). See more at Vectorized String Methods.
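A short sketch of the str accessor, including the regular-expression default mentioned above. The sample strings are assumptions for this example.

```python
import numpy as np
import pandas as pd

s = pd.Series(["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"])

# Element-wise lower-casing; the missing value stays missing.
lowered = s.str.lower()

# contains() treats the pattern as a regular expression by default,
# so "." matches any character unless regex=False is passed.
samples = pd.Series(["a.b", "axb"])
as_regex = samples.str.contains(".")
as_literal = samples.str.contains(".", regex=False)

print(lowered)
print(as_regex.tolist(), as_literal.tolist())
```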

Merge

Concat

pandas provides various facilities for easily combining together Series and DataFrame objects with various kinds of set logic for the indexes and relational algebra functionality in the case of join / merge-type operations.

See the Merging section.

Concatenating pandas objects together along an axis with concat():

Note

Adding a column to a DataFrame is relatively fast. However, adding a row requires a copy, and may be expensive. We recommend passing a pre-built list of records to the DataFrame constructor instead of building a DataFrame by iteratively appending records to it.
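The sketch below illustrates both points: splitting a frame into pieces and concatenating them back, and building a frame from a pre-built list of records. The data and the record keys "name" and "value" are assumptions for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(40).reshape(10, 4))

# Break the frame into row-wise pieces, then glue them back together.
pieces = [df[:3], df[3:7], df[7:]]
rebuilt = pd.concat(pieces)

# Collect rows in a plain list first, then construct the frame once.
records = [{"name": "a", "value": 1}, {"name": "b", "value": 2}]
from_records = pd.DataFrame(records)

print(rebuilt.equals(df))
print(from_records)
```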

Join

merge() enables SQL style join types along specific columns. See the Database style joining section.

Another example that can be given is:
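A minimal sketch of merge() on a shared key column, first with repeated keys and then with unique keys. The frames and the column names "key", "lval", and "rval" are assumptions for this example.

```python
import pandas as pd

# Repeated keys: every left row pairs with every matching right row.
left = pd.DataFrame({"key": ["foo", "foo"], "lval": [1, 2]})
right = pd.DataFrame({"key": ["foo", "foo"], "rval": [4, 5]})
many = pd.merge(left, right, on="key")

# Unique keys: each row matches exactly one row on the other side.
left_u = pd.DataFrame({"key": ["foo", "bar"], "lval": [1, 2]})
right_u = pd.DataFrame({"key": ["foo", "bar"], "rval": [4, 5]})
one_to_one = pd.merge(left_u, right_u, on="key")

print(many)
print(one_to_one)
```

With repeated keys the result has four rows (2 x 2 combinations), which is the same behavior as an SQL inner join on a non-unique column.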

Grouping

By “group by” we are referring to a process involving one or more of the following steps:

Splitting the data into groups based on some criteria
Applying a function to each group independently
Combining the results into a data structure

See the Grouping section.

Grouping and then applying the sum() function to the resulting groups:

Grouping by multiple columns forms a hierarchical index, and again we can apply the sum() function:
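The split, apply, and combine steps can be sketched as follows. The column names and values are assumptions for this example, chosen so the sums can be checked by hand.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
        "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
        "C": [1, 2, 3, 4, 5, 6, 7, 8],
        "D": [0.5] * 8,
    }
)

# Group by one column and sum the numeric columns of each group.
by_a = df.groupby("A")[["C", "D"]].sum()

# Group by two columns: the result is indexed by a hierarchical (A, B) index.
by_a_b = df.groupby(["A", "B"]).sum()

print(by_a)
print(by_a_b)
```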

Reshaping

See the sections on Hierarchical Indexing and Reshaping.

Stack

The stack() method “compresses” a level in the DataFrame’s columns:

With a “stacked” DataFrame or Series (having a MultiIndex as the index), the inverse operation of stack() is unstack(), which by default unstacks the last level:
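A small sketch of stack() and its inverse. The two-level index and the values are assumptions for this example.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product(
    [["bar", "baz"], ["one", "two"]], names=["first", "second"]
)
df = pd.DataFrame(np.arange(8).reshape(4, 2), index=index, columns=["A", "B"])

# Move the column labels into a new innermost index level.
stacked = df.stack()

# Undo it: by default the last index level goes back to the columns.
unstacked = stacked.unstack()

print(stacked)
print(unstacked)
```

The stacked result is a Series whose index has three levels: the two original row levels plus the former column labels.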

Pivot tables

See the section on Pivot Tables.

pivot_table() pivots a DataFrame specifying the values, index and columns:
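A minimal pivot_table() sketch. The frame and its column names are assumptions for this example; the default aggregation (the mean) is used.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "A": ["one", "one", "two", "two"],
        "B": ["x", "y", "x", "y"],
        "C": ["foo", "bar", "foo", "foo"],
        "D": [1.0, 2.0, 3.0, 4.0],
    }
)

# Rows are keyed by (A, B), columns by the distinct values of C,
# and the cells hold the aggregated values of D.
table = pd.pivot_table(df, values="D", index=["A", "B"], columns=["C"])

print(table)
```

Combinations that never occur in the data, such as ("one", "x") with "bar", show up as NaN in the result.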

Time series

pandas has simple, powerful, and efficient functionality for performing resampling operations during frequency conversion (e.g., converting secondly data into 5-minutely data). This is extremely common in, but not limited to, financial applications. See the Time Series section.

Series.tz_localize() localizes a time series to a time zone:

Series.tz_convert() converts a timezone-aware time series to another time zone:

Converting between time span representations:

Converting between period and timestamp enables some convenient arithmetic functions to be used. In the following example, we convert a quarterly frequency with year ending in November to 9am of the end of the month following the quarter end:
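A hedged sketch of the time series operations in this section: resampling, time zone handling, and a round trip between timestamps and periods. The dates, values, and the use of minute and daily data are assumptions for this example; the quarterly conversion described above is not reproduced here.

```python
import pandas as pd

# Resampling: one value per minute, summed into 5-minute bins.
minutes = pd.date_range("2012-01-01", periods=10, freq="min")
per_minute = pd.Series(range(10), index=minutes)
five_min = per_minute.resample("5min").sum()

# Time zones: attach UTC to naive timestamps, then convert to another zone.
days = pd.date_range("2012-01-01", periods=5, freq="D")
ts = pd.Series(range(5), index=days)
ts_utc = ts.tz_localize("UTC")
ts_eastern = ts_utc.tz_convert("US/Eastern")

# Time span representations: timestamps to periods and back.
ps = ts.to_period()
back = ps.to_timestamp()

print(five_min)
print(ts_eastern)
print(ps)
```

Converting a time zone changes only the labels on the index; the underlying instants and the values stay the same.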

Categoricals

pandas can include categorical data in a DataFrame. For full docs, see the categorical introduction and the API documentation.

Converting the raw grades to a categorical data type:

Rename the categories to more meaningful names:

Reorder the categories and simultaneously add the missing categories (methods under Series.cat() return a new Series by default):

Sorting is per order in the categories, not lexical order:

Grouping by a categorical column also shows empty categories:
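The categorical steps above can be sketched as follows. The frame, the raw grades, and the category names are assumptions for this example.

```python
import pandas as pd

df = pd.DataFrame(
    {"id": [1, 2, 3, 4, 5, 6], "raw_grade": ["a", "b", "b", "a", "a", "e"]}
)

# Convert the raw grades to a categorical dtype (categories: a, b, e).
df["grade"] = df["raw_grade"].astype("category")

# Rename the categories; the new names map onto a, b, e in order.
df["grade"] = df["grade"].cat.rename_categories(["very good", "good", "very bad"])

# Reorder the categories and add the ones that do not occur in the data.
df["grade"] = df["grade"].cat.set_categories(
    ["very bad", "bad", "medium", "good", "very good"]
)

# Sorting follows the category order, not alphabetical order.
sorted_df = df.sort_values(by="grade")

# Grouping with observed=False also reports the empty categories.
sizes = df.groupby("grade", observed=False).size()

print(sorted_df)
print(sizes)
```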

Plotting

See the Plotting docs.

We use the standard convention for referencing the matplotlib API:

The plt.close method is used to close a figure window:

If running under Jupyter Notebook, the plot will appear on plot(). Otherwise use matplotlib.pyplot.show to show it or matplotlib.pyplot.savefig to write it to a file.

What is the difference between a Series and a DataFrame?

A Series can be thought of as a one-dimensional array, much like a NumPy array, except that it has an index and we can control the index of each element. A DataFrame, by contrast, is a two-dimensional array with rows and columns.

What is a DataFrame?

A DataFrame is a table, or tabular data, organized as a two-dimensional array of rows and columns. This data structure is the most standard way to store data. Each column of a DataFrame is a Series object, and each row consists of elements from those Series.

What structures do pandas DataFrames have?

The pandas library has two kinds of data structures: Series and DataFrame. A Series is a single column of a DataFrame table, backed by a one-dimensional NumPy array that holds its row data, and it consists of a single data type.

What is the pandas package for?

pandas is an open source Python package that is most commonly used for analyzing data and building machine learning models. pandas is built on top of another package called NumPy, which supports multi-dimensional arrays.
