Cara menggunakan crosstab order by python

A crosstab computes aggregated metrics among two or more columns in a dataset that contains categorical values.

Import Modules

In [122]:

import pandas as pd import seaborn as sns

Get Tips Dataset

Let's get the df_tips.head() 7 dataset from the df_tips.head() 8 library and assign it to the DataFrame df_tips.head() 9.

In [123]:

df_tips = sns.load_dataset('tips')

Each row represents a unique meal at a restaurant for a party of people; the dataset contains the following fields:

column namecolumn descriptionpd.crosstab(index=df_tips['day'], columns=df_tips['sex']) 0financial amount of meal in U.S. dollarspd.crosstab(index=df_tips['day'], columns=df_tips['sex']) 1financial amount of the meal's tip in U.S. dollarspd.crosstab(index=df_tips['day'], columns=df_tips['sex']) 2gender of serverpd.crosstab(index=df_tips['day'], columns=df_tips['sex']) 3boolean to represent if server smokes or notpd.crosstab(index=df_tips['day'], columns=df_tips['sex']) 4day of weekpd.crosstab(index=df_tips['day'], columns=df_tips['sex']) 5meal name (Lunch or Dinner)pd.crosstab(index=df_tips['day'], columns=df_tips['sex']) 6count of people eating meal

Preview the first 5 rows of df_tips.head() 9.

In [124]:

df_tips.head()

Out[124]:

total_billtipsexsmokerdaytimesize016.991.01FemaleNoSunDinner2110.341.66MaleNoSunDinner3221.013.50MaleNoSunDinner3323.683.31MaleNoSunDinner2424.593.61FemaleNoSunDinner4

Implement Crosstabs with Tips Dataset

Grouping By a Single Field for the Index and Column

Let's compute a simple crosstab across the pd.crosstab(index=df_tips['day'], columns=df_tips['sex']) 4 and pd.crosstab(index=df_tips['day'], columns=df_tips['sex']) 2 column. We want our returned index to be the unique values from pd.crosstab(index=df_tips['day'], columns=df_tips['sex']) 4 and our returned columns to be the unique values from pd.crosstab(index=df_tips['day'], columns=df_tips['sex']) 2. By default in pandas, the pd.crosstab(index=df_tips['day'], columns=df_tips['sex']).rename(columns={"Male": "count_meals_served_by_males", "Female": "count_meals_served_by_females"}) 2 computes an aggregated metric of a count (aka frequency).

So, each of the values inside our table represent a count across the index and column. For example, males served 30 unique groups across all Thursdays in our dataset.

In [125]:

pd.crosstab(index=df_tips['day'], columns=df_tips['sex'])

Out[125]:

sexMaleFemaledayThur3032Fri109Sat5928Sun5818

One issue with this crosstab output is the column names are nonsensical. Just saying pd.crosstab(index=df_tips['day'], columns=df_tips['sex']).rename(columns={"Male": "count_meals_served_by_males", "Female": "count_meals_served_by_females"}) 3 or pd.crosstab(index=df_tips['day'], columns=df_tips['sex']).rename(columns={"Male": "count_meals_served_by_males", "Female": "count_meals_served_by_females"}) 4 isn't very specific. They should be renamed to be clearer. We can use the rename() method and set the argument pd.crosstab(index=df_tips['day'], columns=df_tips['sex']).rename(columns={"Male": "count_meals_served_by_males", "Female": "count_meals_served_by_females"}) 5 to be a dictionary in which the keys are the current column names and the values are the respective new names to set.

In [126]:

pd.crosstab(index=df_tips['day'], columns=df_tips['sex']).rename(columns={"Male": "count_meals_served_by_males", "Female": "count_meals_served_by_females"})

Out[126]:

sexcount_meals_served_by_malescount_meals_served_by_femalesdayThur3032Fri109Sat5928Sun5818

Also, in the output above, see where it says pd.crosstab(index=df_tips['day'], columns=df_tips['sex']) 2 as the column name and pd.crosstab(index=df_tips['day'], columns=df_tips['sex']) 4 for the row name? We can modify those names using arguments in the pd.crosstab(index=df_tips['day'], columns=df_tips['sex']).rename(columns={"Male": "count_meals_served_by_males", "Female": "count_meals_served_by_females"}) 2 method. Let's set the pd.crosstab(index=df_tips['day'], columns=df_tips['sex']).rename(columns={"Male": "count_meals_served_by_males", "Female": "count_meals_served_by_females"}) 9 argument to pd.crosstab(index=df_tips['day'], columns=df_tips['sex'], colnames=['gender']).rename(columns={"Male": "count_meals_served_by_males", "Female": "count_meals_served_by_females"}) 0 since that's a more specific name than pd.crosstab(index=df_tips['day'], columns=df_tips['sex']) 2.

In [127]:

pd.crosstab(index=df_tips['day'], columns=df_tips['sex'], colnames=['gender']).rename(columns={"Male": "count_meals_served_by_males", "Female": "count_meals_served_by_females"})

Out[127]:

gendercount_meals_served_by_malescount_meals_served_by_femalesdayThur3032Fri109Sat5928Sun5818

In this example, we passed in two columns from our DataFrame. However, one nice feature of pd.crosstab(index=df_tips['day'], columns=df_tips['sex']).rename(columns={"Male": "count_meals_served_by_males", "Female": "count_meals_served_by_females"}) 2 is that you don't need the data to be in a DataFrame. For the pd.crosstab(index=df_tips['day'], columns=df_tips['sex'], colnames=['gender']).rename(columns={"Male": "count_meals_served_by_males", "Female": "count_meals_served_by_females"}) 3 and pd.crosstab(index=df_tips['day'], columns=df_tips['sex']).rename(columns={"Male": "count_meals_served_by_males", "Female": "count_meals_served_by_females"}) 5 arguments, you can pass in two numpy arrays.

Let's double check the logic from above makes sense. Let's use filtering in pandas to verify that there were 30 meals served by a male on Thursday. Our query below matches the 30 number we see above.

In [128]:

len(df_tips.query("sex=='Male' and day=='Thur'"))

Out[128]:

30

Alternatively, given the crosstab output above, you can present it in a different format that may be easier for further analysis. I won't dive into details of this operation, but in addition to the code above, you can chain the pd.crosstab(index=df_tips['day'], columns=df_tips['sex'], colnames=['gender']).rename(columns={"Male": "count_meals_served_by_males", "Female": "count_meals_served_by_females"}) 5 method and then the pd.crosstab(index=df_tips['day'], columns=df_tips['sex'], colnames=['gender']).rename(columns={"Male": "count_meals_served_by_males", "Female": "count_meals_served_by_females"}) 6 method to pivot the DataFrame so each row is a unique combination of a value from pd.crosstab(index=df_tips['day'], columns=df_tips['sex']) 2 and pd.crosstab(index=df_tips['day'], columns=df_tips['sex']) 4 with the appropriate count of meals served.

In [129]:

pd.crosstab(index=df_tips['day'], columns=df_tips['sex'], colnames=['gender']).unstack().reset_index().rename(columns={0: "count_meals_served"})

Out[129]:

genderdaycount_meals_served0MaleThur301MaleFri102MaleSat593MaleSun584FemaleThur325FemaleFri96FemaleSat287FemaleSun18

For each row and column of this previous crosstab, we can modify an argument to get the totals. Set the argument pd.crosstab(index=df_tips['day'], columns=df_tips['sex'], colnames=['gender']).rename(columns={"Male": "count_meals_served_by_males", "Female": "count_meals_served_by_females"}) 9 to len(df_tips.query("sex=='Male' and day=='Thur'")) 0 to get these totals. By default, the returned output will have a column and row name of len(df_tips.query("sex=='Male' and day=='Thur'")) 1.

In [130]:

pd.crosstab(index=df_tips['day'], columns=df_tips['sex'], colnames=['gender'], margins=True).rename(columns={"Male": "count_meals_served_by_males", "Female": "count_meals_served_by_females"})

Out[130]:

gendercount_meals_served_by_malescount_meals_served_by_femalesAlldayThur303262Fri10919Sat592887Sun581876All15787244

In the pd.crosstab(index=df_tips['day'], columns=df_tips['sex']).rename(columns={"Male": "count_meals_served_by_males", "Female": "count_meals_served_by_females"}) 2 method, we can also rename the len(df_tips.query("sex=='Male' and day=='Thur'")) 1 column. First, use all the same arguments from above. Then, set the argument len(df_tips.query("sex=='Male' and day=='Thur'")) 4 to len(df_tips.query("sex=='Male' and day=='Thur'")) 5 .

In [131]:

df_tips = sns.load_dataset('tips') 0

Out[131]:

gendercount_meals_served_by_malescount_meals_served_by_femalescount_meals_serveddayThur303262Fri10919Sat592887Sun581876count_meals_served15787244

For each cell value, we can calculate what percentage it is of the row's total. To do that, set the len(df_tips.query("sex=='Male' and day=='Thur'")) 6 argument to len(df_tips.query("sex=='Male' and day=='Thur'")) 7 (since index applies to each row).

In [132]:

df_tips = sns.load_dataset('tips') 1

Out[132]:

genderproportion_meals_served_by_malesproportion_meals_served_by_femalesdayThur0.4838710.516129Fri0.5263160.473684Sat0.6781610.321839Sun0.7631580.236842proportion_meals_served0.6434430.356557

For each cell value, we can also calculate what percentage it is of the column's total. To do that, set the len(df_tips.query("sex=='Male' and day=='Thur'")) 6 argument to len(df_tips.query("sex=='Male' and day=='Thur'")) 9.

In [133]:

df_tips = sns.load_dataset('tips') 2

Out[133]:

genderproportion_meals_served_by_malesproportion_meals_served_by_femalesproportion_meals_serveddayThur0.1910830.3678160.254098Fri0.0636940.1034480.077869Sat0.3757960.3218390.356557Sun0.3694270.2068970.311475

Given two categorical columns, the pd.crosstab(index=df_tips['day'], columns=df_tips['sex']).rename(columns={"Male": "count_meals_served_by_males", "Female": "count_meals_served_by_females"}) 2 method can additionally utilize a column with numerical values to perform an aggregate operation. That sentence may sound daunting - so let's walk through it with a simple example.

We know there exists pd.crosstab(index=df_tips['day'], columns=df_tips['sex']) 0 values in our datasets for males that served meals on Thursday. Below, I preview the first few pd.crosstab(index=df_tips['day'], columns=df_tips['sex']) 0 values that meet this criteria.

In [134]:

df_tips = sns.load_dataset('tips') 3

Out[134]:

df_tips = sns.load_dataset('tips') 4

We may want to know the average bill size that meet the criteria above. So, given that series, we can calculate the mean and we arrive at a result of 18.71.

In [135]:

df_tips = sns.load_dataset('tips') 5

Out[135]:

df_tips = sns.load_dataset('tips') 6

Now, we can perform this same operation using the pd.crosstab(index=df_tips['day'], columns=df_tips['sex']).rename(columns={"Male": "count_meals_served_by_males", "Female": "count_meals_served_by_females"}) 2 method. Same as before, we want our returned index to be the unique values from pd.crosstab(index=df_tips['day'], columns=df_tips['sex']) 4 and our returned columns to be the unique values from pd.crosstab(index=df_tips['day'], columns=df_tips['sex']) 2. Additionally, we want the values inside the table to be from our pd.crosstab(index=df_tips['day'], columns=df_tips['sex']) 0 column so we'll set the argument 307 to be 308. We also want to calculate the mean total bill for each combination of a day and gender so we'll set the 309 argument to pd.crosstab(index=df_tips['day'], columns=df_tips['sex'], colnames=['gender']).unstack().reset_index().rename(columns={0: "count_meals_served"}) 0.

In [136]:

df_tips = sns.load_dataset('tips') 7

Out[136]:

gendermean_bill_size_meals_served_by_malesmean_bill_size_meals_served_by_femalesdayThur18.71466716.715312Fri19.85700014.145556Sat20.80254219.680357Sun21.88724119.872222

This crosstab calculation outputted the same 18.71 value as expected!

We can pass in many other aggregate methods to the 309 method too such as mean and standard deviation.

You can learn more about details of using pd.crosstab(index=df_tips['day'], columns=df_tips['sex']).rename(columns={"Male": "count_meals_served_by_males", "Female": "count_meals_served_by_females"}) 2 from the official pandas documentation page.

Grouping By Multiple Fields for the Index and/or Columns

We can also use the pd.crosstab(index=df_tips['day'], columns=df_tips['sex'], colnames=['gender']).unstack().reset_index().rename(columns={0: "count_meals_served"}) 3 method to group by multiple pandas columns for the pd.crosstab(index=df_tips['day'], columns=df_tips['sex'], colnames=['gender']).rename(columns={"Male": "count_meals_served_by_males", "Female": "count_meals_served_by_females"}) 3 or pd.crosstab(index=df_tips['day'], columns=df_tips['sex']).rename(columns={"Male": "count_meals_served_by_males", "Female": "count_meals_served_by_females"}) 5 arguments.

For example, we can find out for days of the week, for each gender, what was the count of meals they served for lunch or dinner. In pandas We want our returned index to be the unique values from pd.crosstab(index=df_tips['day'], columns=df_tips['sex']) 4 and our returned columns to be the unique values from all combinations of the pd.crosstab(index=df_tips['day'], columns=df_tips['sex']) 2 (gender) and pd.crosstab(index=df_tips['day'], columns=df_tips['sex']) 5 (meal time) columns.

To interpret a value from our table below, across all Thursdays in our dataset, females served 31 lunch meals.

In [137]:

df_tips = sns.load_dataset('tips') 8

Out[137]:

genderMaleFemalemealcount_lunch_meals_servedcount_dinner_meals_servedparty_people_size123456123456daysexThurMale0242211000000Female1232302010000FriMale120000060100Female031000050000SatMale000000034131110Female0000002195200SunMale00000003291421Female000000076410

Postingan terbaru

LIHAT SEMUA