Cara menggunakan crosstab order by python

A crosstab computes aggregated metrics among two or more columns in a dataset that contains categorical values.

Import Modules

In [122]:

import pandas as pd
import seaborn as sns

Get Tips Dataset

Let's get the

df_tips.head()
7 dataset from the
df_tips.head()
8 library and assign it to the DataFrame
df_tips.head()
9.

In [123]:

df_tips = sns.load_dataset('tips')

Each row represents a unique meal at a restaurant for a party of people; the dataset contains the following fields:

column namecolumn description
pd.crosstab(index=df_tips['day'], columns=df_tips['sex'])
0financial amount of meal in U.S. dollars
pd.crosstab(index=df_tips['day'], columns=df_tips['sex'])
1financial amount of the meal's tip in U.S. dollars
pd.crosstab(index=df_tips['day'], columns=df_tips['sex'])
2gender of server
pd.crosstab(index=df_tips['day'], columns=df_tips['sex'])
3boolean to represent if server smokes or not
pd.crosstab(index=df_tips['day'], columns=df_tips['sex'])
4day of week
pd.crosstab(index=df_tips['day'], columns=df_tips['sex'])
5meal name (Lunch or Dinner)
pd.crosstab(index=df_tips['day'], columns=df_tips['sex'])
6count of people eating meal

Preview the first 5 rows of

df_tips.head()
9.

In [124]:

df_tips.head()

Out[124]:

total_billtipsexsmokerdaytimesize016.991.01FemaleNoSunDinner2110.341.66MaleNoSunDinner3221.013.50MaleNoSunDinner3323.683.31MaleNoSunDinner2424.593.61FemaleNoSunDinner4

Implement Crosstabs with Tips Dataset

Grouping By a Single Field for the Index and Column

Let's compute a simple crosstab across the

pd.crosstab(index=df_tips['day'], columns=df_tips['sex'])
4 and
pd.crosstab(index=df_tips['day'], columns=df_tips['sex'])
2 column. We want our returned index to be the unique values from
pd.crosstab(index=df_tips['day'], columns=df_tips['sex'])
4 and our returned columns to be the unique values from
pd.crosstab(index=df_tips['day'], columns=df_tips['sex'])
2. By default in pandas, the
pd.crosstab(index=df_tips['day'], columns=df_tips['sex']).rename(columns={"Male": "count_meals_served_by_males", "Female": "count_meals_served_by_females"})
2 computes an aggregated metric of a count (aka frequency).

So, each of the values inside our table represent a count across the index and column. For example, males served 30 unique groups across all Thursdays in our dataset.

In [125]:

pd.crosstab(index=df_tips['day'], columns=df_tips['sex'])

Out[125]:

sexMaleFemaledayThur3032Fri109Sat5928Sun5818

One issue with this crosstab output is the column names are nonsensical. Just saying

pd.crosstab(index=df_tips['day'], columns=df_tips['sex']).rename(columns={"Male": "count_meals_served_by_males", "Female": "count_meals_served_by_females"})
3 or
pd.crosstab(index=df_tips['day'], columns=df_tips['sex']).rename(columns={"Male": "count_meals_served_by_males", "Female": "count_meals_served_by_females"})
4 isn't very specific. They should be renamed to be clearer. We can use the rename() method and set the argument
pd.crosstab(index=df_tips['day'], columns=df_tips['sex']).rename(columns={"Male": "count_meals_served_by_males", "Female": "count_meals_served_by_females"})
5 to be a dictionary in which the keys are the current column names and the values are the respective new names to set.

In [126]:

pd.crosstab(index=df_tips['day'], columns=df_tips['sex']).rename(columns={"Male": "count_meals_served_by_males", "Female": "count_meals_served_by_females"})

Out[126]:

sexcount_meals_served_by_malescount_meals_served_by_femalesdayThur3032Fri109Sat5928Sun5818

Also, in the output above, see where it says

pd.crosstab(index=df_tips['day'], columns=df_tips['sex'])
2 as the column name and
pd.crosstab(index=df_tips['day'], columns=df_tips['sex'])
4 for the row name? We can modify those names using arguments in the
pd.crosstab(index=df_tips['day'], columns=df_tips['sex']).rename(columns={"Male": "count_meals_served_by_males", "Female": "count_meals_served_by_females"})
2 method. Let's set the
pd.crosstab(index=df_tips['day'], columns=df_tips['sex']).rename(columns={"Male": "count_meals_served_by_males", "Female": "count_meals_served_by_females"})
9 argument to
pd.crosstab(index=df_tips['day'], columns=df_tips['sex'], colnames=['gender']).rename(columns={"Male": "count_meals_served_by_males", "Female": "count_meals_served_by_females"})
0 since that's a more specific name than
pd.crosstab(index=df_tips['day'], columns=df_tips['sex'])
2.

In [127]:

pd.crosstab(index=df_tips['day'], columns=df_tips['sex'], colnames=['gender']).rename(columns={"Male": "count_meals_served_by_males", "Female": "count_meals_served_by_females"})

Out[127]:

gendercount_meals_served_by_malescount_meals_served_by_femalesdayThur3032Fri109Sat5928Sun5818

In this example, we passed in two columns from our DataFrame. However, one nice feature of

pd.crosstab(index=df_tips['day'], columns=df_tips['sex']).rename(columns={"Male": "count_meals_served_by_males", "Female": "count_meals_served_by_females"})
2 is that you don't need the data to be in a DataFrame. For the
pd.crosstab(index=df_tips['day'], columns=df_tips['sex'], colnames=['gender']).rename(columns={"Male": "count_meals_served_by_males", "Female": "count_meals_served_by_females"})
3 and
pd.crosstab(index=df_tips['day'], columns=df_tips['sex']).rename(columns={"Male": "count_meals_served_by_males", "Female": "count_meals_served_by_females"})
5 arguments, you can pass in two numpy arrays.

Let's double check the logic from above makes sense. Let's use filtering in pandas to verify that there were 30 meals served by a male on Thursday. Our query below matches the 30 number we see above.

In [128]:

len(df_tips.query("sex=='Male' and day=='Thur'"))

Out[128]:

30

Alternatively, given the crosstab output above, you can present it in a different format that may be easier for further analysis. I won't dive into details of this operation, but in addition to the code above, you can chain the

pd.crosstab(index=df_tips['day'], columns=df_tips['sex'], colnames=['gender']).rename(columns={"Male": "count_meals_served_by_males", "Female": "count_meals_served_by_females"})
5 method and then the
pd.crosstab(index=df_tips['day'], columns=df_tips['sex'], colnames=['gender']).rename(columns={"Male": "count_meals_served_by_males", "Female": "count_meals_served_by_females"})
6 method to pivot the DataFrame so each row is a unique combination of a value from
pd.crosstab(index=df_tips['day'], columns=df_tips['sex'])
2 and
pd.crosstab(index=df_tips['day'], columns=df_tips['sex'])
4 with the appropriate count of meals served.

In [129]:

pd.crosstab(index=df_tips['day'], columns=df_tips['sex'], colnames=['gender']).unstack().reset_index().rename(columns={0: "count_meals_served"})

Out[129]:

genderdaycount_meals_served0MaleThur301MaleFri102MaleSat593MaleSun584FemaleThur325FemaleFri96FemaleSat287FemaleSun18

For each row and column of this previous crosstab, we can modify an argument to get the totals. Set the argument

pd.crosstab(index=df_tips['day'], columns=df_tips['sex'], colnames=['gender']).rename(columns={"Male": "count_meals_served_by_males", "Female": "count_meals_served_by_females"})
9 to
len(df_tips.query("sex=='Male' and day=='Thur'"))
0 to get these totals. By default, the returned output will have a column and row name of
len(df_tips.query("sex=='Male' and day=='Thur'"))
1.

In [130]:

pd.crosstab(index=df_tips['day'], columns=df_tips['sex'], colnames=['gender'], margins=True).rename(columns={"Male": "count_meals_served_by_males", "Female": "count_meals_served_by_females"})

Out[130]:

gendercount_meals_served_by_malescount_meals_served_by_femalesAlldayThur303262Fri10919Sat592887Sun581876All15787244

In the

pd.crosstab(index=df_tips['day'], columns=df_tips['sex']).rename(columns={"Male": "count_meals_served_by_males", "Female": "count_meals_served_by_females"})
2 method, we can also rename the
len(df_tips.query("sex=='Male' and day=='Thur'"))
1 column. First, use all the same arguments from above. Then, set the argument
len(df_tips.query("sex=='Male' and day=='Thur'"))
4 to
len(df_tips.query("sex=='Male' and day=='Thur'"))
5 .

In [131]:

df_tips = sns.load_dataset('tips')
0

Out[131]:

gendercount_meals_served_by_malescount_meals_served_by_femalescount_meals_serveddayThur303262Fri10919Sat592887Sun581876count_meals_served15787244

For each cell value, we can calculate what percentage it is of the row's total. To do that, set the

len(df_tips.query("sex=='Male' and day=='Thur'"))
6 argument to
len(df_tips.query("sex=='Male' and day=='Thur'"))
7 (since index applies to each row).

In [132]:

df_tips = sns.load_dataset('tips')
1

Out[132]:

genderproportion_meals_served_by_malesproportion_meals_served_by_femalesdayThur0.4838710.516129Fri0.5263160.473684Sat0.6781610.321839Sun0.7631580.236842proportion_meals_served0.6434430.356557

For each cell value, we can also calculate what percentage it is of the column's total. To do that, set the

len(df_tips.query("sex=='Male' and day=='Thur'"))
6 argument to
len(df_tips.query("sex=='Male' and day=='Thur'"))
9.

In [133]:

df_tips = sns.load_dataset('tips')
2

Out[133]:

genderproportion_meals_served_by_malesproportion_meals_served_by_femalesproportion_meals_serveddayThur0.1910830.3678160.254098Fri0.0636940.1034480.077869Sat0.3757960.3218390.356557Sun0.3694270.2068970.311475

Given two categorical columns, the

pd.crosstab(index=df_tips['day'], columns=df_tips['sex']).rename(columns={"Male": "count_meals_served_by_males", "Female": "count_meals_served_by_females"})
2 method can additionally utilize a column with numerical values to perform an aggregate operation. That sentence may sound daunting - so let's walk through it with a simple example.

We know there exists

pd.crosstab(index=df_tips['day'], columns=df_tips['sex'])
0 values in our datasets for males that served meals on Thursday. Below, I preview the first few
pd.crosstab(index=df_tips['day'], columns=df_tips['sex'])
0 values that meet this criteria.

In [134]:

df_tips = sns.load_dataset('tips')
3

Out[134]:

df_tips = sns.load_dataset('tips')
4

We may want to know the average bill size that meet the criteria above. So, given that series, we can calculate the mean and we arrive at a result of 18.71.

In [135]:

df_tips = sns.load_dataset('tips')
5

Out[135]:

df_tips = sns.load_dataset('tips')
6

Now, we can perform this same operation using the

pd.crosstab(index=df_tips['day'], columns=df_tips['sex']).rename(columns={"Male": "count_meals_served_by_males", "Female": "count_meals_served_by_females"})
2 method. Same as before, we want our returned index to be the unique values from
pd.crosstab(index=df_tips['day'], columns=df_tips['sex'])
4 and our returned columns to be the unique values from
pd.crosstab(index=df_tips['day'], columns=df_tips['sex'])
2. Additionally, we want the values inside the table to be from our
pd.crosstab(index=df_tips['day'], columns=df_tips['sex'])
0 column so we'll set the argument
30
7 to be
30
8. We also want to calculate the mean total bill for each combination of a day and gender so we'll set the
30
9 argument to
pd.crosstab(index=df_tips['day'], columns=df_tips['sex'], colnames=['gender']).unstack().reset_index().rename(columns={0: "count_meals_served"})
0.

In [136]:

df_tips = sns.load_dataset('tips')
7

Out[136]:

gendermean_bill_size_meals_served_by_malesmean_bill_size_meals_served_by_femalesdayThur18.71466716.715312Fri19.85700014.145556Sat20.80254219.680357Sun21.88724119.872222

This crosstab calculation outputted the same 18.71 value as expected!

We can pass in many other aggregate methods to the

30
9 method too such as mean and standard deviation.

You can learn more about details of using

pd.crosstab(index=df_tips['day'], columns=df_tips['sex']).rename(columns={"Male": "count_meals_served_by_males", "Female": "count_meals_served_by_females"})
2 from the official pandas documentation page.

Grouping By Multiple Fields for the Index and/or Columns

We can also use the

pd.crosstab(index=df_tips['day'], columns=df_tips['sex'], colnames=['gender']).unstack().reset_index().rename(columns={0: "count_meals_served"})
3 method to group by multiple pandas columns for the
pd.crosstab(index=df_tips['day'], columns=df_tips['sex'], colnames=['gender']).rename(columns={"Male": "count_meals_served_by_males", "Female": "count_meals_served_by_females"})
3 or
pd.crosstab(index=df_tips['day'], columns=df_tips['sex']).rename(columns={"Male": "count_meals_served_by_males", "Female": "count_meals_served_by_females"})
5 arguments.

For example, we can find out for days of the week, for each gender, what was the count of meals they served for lunch or dinner. In pandas We want our returned index to be the unique values from

pd.crosstab(index=df_tips['day'], columns=df_tips['sex'])
4 and our returned columns to be the unique values from all combinations of the
pd.crosstab(index=df_tips['day'], columns=df_tips['sex'])
2 (gender) and
pd.crosstab(index=df_tips['day'], columns=df_tips['sex'])
5 (meal time) columns.

To interpret a value from our table below, across all Thursdays in our dataset, females served 31 lunch meals.

In [137]:

df_tips = sns.load_dataset('tips')
8

Out[137]:

genderMaleFemalemealcount_lunch_meals_servedcount_dinner_meals_servedparty_people_size123456123456daysexThurMale0242211000000Female1232302010000FriMale120000060100Female031000050000SatMale000000034131110Female0000002195200SunMale00000003291421Female000000076410