Cara menggunakan crosstab order by python

A crosstab computes aggregated metrics among two or more columns in a dataset that contains categorical values.

Table of Contents Show

Import Modules
Get Tips Dataset
Implement Crosstabs with Tips Dataset

Import Modules

In [122]:

import pandas as pd
import seaborn as sns

Get Tips Dataset

Let's get the

df_tips.head()

7 dataset from the

df_tips.head()

8 library and assign it to the DataFrame

df_tips.head()

In [123]:

df_tips = sns.load_dataset('tips')

Each row represents a unique meal at a restaurant for a party of people; the dataset contains the following fields:

column namecolumn description

pd.crosstab(index=df_tips['day'], columns=df_tips['sex'])

0financial amount of meal in U.S. dollars

pd.crosstab(index=df_tips['day'], columns=df_tips['sex'])

1financial amount of the meal's tip in U.S. dollars

pd.crosstab(index=df_tips['day'], columns=df_tips['sex'])

2gender of server

pd.crosstab(index=df_tips['day'], columns=df_tips['sex'])

3boolean to represent if server smokes or not

pd.crosstab(index=df_tips['day'], columns=df_tips['sex'])

4day of week

pd.crosstab(index=df_tips['day'], columns=df_tips['sex'])

5meal name (Lunch or Dinner)

pd.crosstab(index=df_tips['day'], columns=df_tips['sex'])

6count of people eating meal

Preview the first 5 rows of

df_tips.head()

In [124]:

df_tips.head()

Out[124]:

total_billtipsexsmokerdaytimesize016.991.01FemaleNoSunDinner2110.341.66MaleNoSunDinner3221.013.50MaleNoSunDinner3323.683.31MaleNoSunDinner2424.593.61FemaleNoSunDinner4

Implement Crosstabs with Tips Dataset

Grouping By a Single Field for the Index and Column

Let's compute a simple crosstab across the

pd.crosstab(index=df_tips['day'], columns=df_tips['sex'])

4 and

pd.crosstab(index=df_tips['day'], columns=df_tips['sex'])

2 column. We want our returned index to be the unique values from

pd.crosstab(index=df_tips['day'], columns=df_tips['sex'])

4 and our returned columns to be the unique values from

pd.crosstab(index=df_tips['day'], columns=df_tips['sex'])

2. By default in pandas, the

pd.crosstab(index=df_tips['day'], columns=df_tips['sex']).rename(columns={"Male": "count_meals_served_by_males", "Female": "count_meals_served_by_females"})

2 computes an aggregated metric of a count (aka frequency).

So, each of the values inside our table represent a count across the index and column. For example, males served 30 unique groups across all Thursdays in our dataset.

In [125]:

pd.crosstab(index=df_tips['day'], columns=df_tips['sex'])

Out[125]:

sexMaleFemaledayThur3032Fri109Sat5928Sun5818

One issue with this crosstab output is the column names are nonsensical. Just saying

pd.crosstab(index=df_tips['day'], columns=df_tips['sex']).rename(columns={"Male": "count_meals_served_by_males", "Female": "count_meals_served_by_females"})

3 or

pd.crosstab(index=df_tips['day'], columns=df_tips['sex']).rename(columns={"Male": "count_meals_served_by_males", "Female": "count_meals_served_by_females"})

4 isn't very specific. They should be renamed to be clearer. We can use the rename() method and set the argument

pd.crosstab(index=df_tips['day'], columns=df_tips['sex']).rename(columns={"Male": "count_meals_served_by_males", "Female": "count_meals_served_by_females"})

5 to be a dictionary in which the keys are the current column names and the values are the respective new names to set.

In [126]:

pd.crosstab(index=df_tips['day'], columns=df_tips['sex']).rename(columns={"Male": "count_meals_served_by_males", "Female": "count_meals_served_by_females"})

Out[126]:

sexcount_meals_served_by_malescount_meals_served_by_femalesdayThur3032Fri109Sat5928Sun5818

Also, in the output above, see where it says

pd.crosstab(index=df_tips['day'], columns=df_tips['sex'])

2 as the column name and

pd.crosstab(index=df_tips['day'], columns=df_tips['sex'])

4 for the row name? We can modify those names using arguments in the

pd.crosstab(index=df_tips['day'], columns=df_tips['sex']).rename(columns={"Male": "count_meals_served_by_males", "Female": "count_meals_served_by_females"})

2 method. Let's set the

pd.crosstab(index=df_tips['day'], columns=df_tips['sex']).rename(columns={"Male": "count_meals_served_by_males", "Female": "count_meals_served_by_females"})

9 argument to

pd.crosstab(index=df_tips['day'], columns=df_tips['sex'], colnames=['gender']).rename(columns={"Male": "count_meals_served_by_males", "Female": "count_meals_served_by_females"})

0 since that's a more specific name than

pd.crosstab(index=df_tips['day'], columns=df_tips['sex'])

In [127]:

pd.crosstab(index=df_tips['day'], columns=df_tips['sex'], colnames=['gender']).rename(columns={"Male": "count_meals_served_by_males", "Female": "count_meals_served_by_females"})

Out[127]:

gendercount_meals_served_by_malescount_meals_served_by_femalesdayThur3032Fri109Sat5928Sun5818

In this example, we passed in two columns from our DataFrame. However, one nice feature of

pd.crosstab(index=df_tips['day'], columns=df_tips['sex']).rename(columns={"Male": "count_meals_served_by_males", "Female": "count_meals_served_by_females"})

2 is that you don't need the data to be in a DataFrame. For the

pd.crosstab(index=df_tips['day'], columns=df_tips['sex'], colnames=['gender']).rename(columns={"Male": "count_meals_served_by_males", "Female": "count_meals_served_by_females"})

3 and

pd.crosstab(index=df_tips['day'], columns=df_tips['sex']).rename(columns={"Male": "count_meals_served_by_males", "Female": "count_meals_served_by_females"})

5 arguments, you can pass in two numpy arrays.

Let's double check the logic from above makes sense. Let's use filtering in pandas to verify that there were 30 meals served by a male on Thursday. Our query below matches the 30 number we see above.

In [128]:

len(df_tips.query("sex=='Male' and day=='Thur'"))

Out[128]:

Alternatively, given the crosstab output above, you can present it in a different format that may be easier for further analysis. I won't dive into details of this operation, but in addition to the code above, you can chain the

pd.crosstab(index=df_tips['day'], columns=df_tips['sex'], colnames=['gender']).rename(columns={"Male": "count_meals_served_by_males", "Female": "count_meals_served_by_females"})

5 method and then the

pd.crosstab(index=df_tips['day'], columns=df_tips['sex'], colnames=['gender']).rename(columns={"Male": "count_meals_served_by_males", "Female": "count_meals_served_by_females"})

6 method to pivot the DataFrame so each row is a unique combination of a value from

pd.crosstab(index=df_tips['day'], columns=df_tips['sex'])

2 and

pd.crosstab(index=df_tips['day'], columns=df_tips['sex'])

4 with the appropriate count of meals served.

In [129]:

pd.crosstab(index=df_tips['day'], columns=df_tips['sex'], colnames=['gender']).unstack().reset_index().rename(columns={0: "count_meals_served"})

Out[129]:

genderdaycount_meals_served0MaleThur301MaleFri102MaleSat593MaleSun584FemaleThur325FemaleFri96FemaleSat287FemaleSun18

For each row and column of this previous crosstab, we can modify an argument to get the totals. Set the argument

pd.crosstab(index=df_tips['day'], columns=df_tips['sex'], colnames=['gender']).rename(columns={"Male": "count_meals_served_by_males", "Female": "count_meals_served_by_females"})

9 to

len(df_tips.query("sex=='Male' and day=='Thur'"))

0 to get these totals. By default, the returned output will have a column and row name of

len(df_tips.query("sex=='Male' and day=='Thur'"))

In [130]:

pd.crosstab(index=df_tips['day'], columns=df_tips['sex'], colnames=['gender'], margins=True).rename(columns={"Male": "count_meals_served_by_males", "Female": "count_meals_served_by_females"})

Out[130]:

gendercount_meals_served_by_malescount_meals_served_by_femalesAlldayThur303262Fri10919Sat592887Sun581876All15787244

In the

pd.crosstab(index=df_tips['day'], columns=df_tips['sex']).rename(columns={"Male": "count_meals_served_by_males", "Female": "count_meals_served_by_females"})

2 method, we can also rename the

len(df_tips.query("sex=='Male' and day=='Thur'"))

1 column. First, use all the same arguments from above. Then, set the argument

len(df_tips.query("sex=='Male' and day=='Thur'"))

4 to

len(df_tips.query("sex=='Male' and day=='Thur'"))

5 .

In [131]:

df_tips = sns.load_dataset('tips')

Out[131]:

gendercount_meals_served_by_malescount_meals_served_by_femalescount_meals_serveddayThur303262Fri10919Sat592887Sun581876count_meals_served15787244

For each cell value, we can calculate what percentage it is of the row's total. To do that, set the

len(df_tips.query("sex=='Male' and day=='Thur'"))

6 argument to

len(df_tips.query("sex=='Male' and day=='Thur'"))

7 (since index applies to each row).

In [132]:

df_tips = sns.load_dataset('tips')

Out[132]:

genderproportion_meals_served_by_malesproportion_meals_served_by_femalesdayThur0.4838710.516129Fri0.5263160.473684Sat0.6781610.321839Sun0.7631580.236842proportion_meals_served0.6434430.356557

For each cell value, we can also calculate what percentage it is of the column's total. To do that, set the

len(df_tips.query("sex=='Male' and day=='Thur'"))

6 argument to

len(df_tips.query("sex=='Male' and day=='Thur'"))

In [133]:

df_tips = sns.load_dataset('tips')

Out[133]:

genderproportion_meals_served_by_malesproportion_meals_served_by_femalesproportion_meals_serveddayThur0.1910830.3678160.254098Fri0.0636940.1034480.077869Sat0.3757960.3218390.356557Sun0.3694270.2068970.311475

Given two categorical columns, the

pd.crosstab(index=df_tips['day'], columns=df_tips['sex']).rename(columns={"Male": "count_meals_served_by_males", "Female": "count_meals_served_by_females"})

2 method can additionally utilize a column with numerical values to perform an aggregate operation. That sentence may sound daunting - so let's walk through it with a simple example.

We know there exists

pd.crosstab(index=df_tips['day'], columns=df_tips['sex'])

0 values in our datasets for males that served meals on Thursday. Below, I preview the first few

pd.crosstab(index=df_tips['day'], columns=df_tips['sex'])

0 values that meet this criteria.

In [134]:

df_tips = sns.load_dataset('tips')

Out[134]:

df_tips = sns.load_dataset('tips')

We may want to know the average bill size that meet the criteria above. So, given that series, we can calculate the mean and we arrive at a result of 18.71.

In [135]:

df_tips = sns.load_dataset('tips')

Out[135]:

df_tips = sns.load_dataset('tips')

Now, we can perform this same operation using the

pd.crosstab(index=df_tips['day'], columns=df_tips['sex']).rename(columns={"Male": "count_meals_served_by_males", "Female": "count_meals_served_by_females"})

2 method. Same as before, we want our returned index to be the unique values from

pd.crosstab(index=df_tips['day'], columns=df_tips['sex'])

4 and our returned columns to be the unique values from

pd.crosstab(index=df_tips['day'], columns=df_tips['sex'])

2. Additionally, we want the values inside the table to be from our

pd.crosstab(index=df_tips['day'], columns=df_tips['sex'])

0 column so we'll set the argument

7 to be

8. We also want to calculate the mean total bill for each combination of a day and gender so we'll set the

9 argument to

pd.crosstab(index=df_tips['day'], columns=df_tips['sex'], colnames=['gender']).unstack().reset_index().rename(columns={0: "count_meals_served"})

In [136]:

df_tips = sns.load_dataset('tips')

Out[136]:

gendermean_bill_size_meals_served_by_malesmean_bill_size_meals_served_by_femalesdayThur18.71466716.715312Fri19.85700014.145556Sat20.80254219.680357Sun21.88724119.872222

This crosstab calculation outputted the same 18.71 value as expected!

We can pass in many other aggregate methods to the

9 method too such as mean and standard deviation.

You can learn more about details of using

pd.crosstab(index=df_tips['day'], columns=df_tips['sex']).rename(columns={"Male": "count_meals_served_by_males", "Female": "count_meals_served_by_females"})

2 from the official pandas documentation page.

Grouping By Multiple Fields for the Index and/or Columns

We can also use the

pd.crosstab(index=df_tips['day'], columns=df_tips['sex'], colnames=['gender']).unstack().reset_index().rename(columns={0: "count_meals_served"})

3 method to group by multiple pandas columns for the

pd.crosstab(index=df_tips['day'], columns=df_tips['sex'], colnames=['gender']).rename(columns={"Male": "count_meals_served_by_males", "Female": "count_meals_served_by_females"})

3 or

pd.crosstab(index=df_tips['day'], columns=df_tips['sex']).rename(columns={"Male": "count_meals_served_by_males", "Female": "count_meals_served_by_females"})

5 arguments.

For example, we can find out for days of the week, for each gender, what was the count of meals they served for lunch or dinner. In pandas We want our returned index to be the unique values from

pd.crosstab(index=df_tips['day'], columns=df_tips['sex'])

4 and our returned columns to be the unique values from all combinations of the

pd.crosstab(index=df_tips['day'], columns=df_tips['sex'])

2 (gender) and

pd.crosstab(index=df_tips['day'], columns=df_tips['sex'])

5 (meal time) columns.

To interpret a value from our table below, across all Thursdays in our dataset, females served 31 lunch meals.

In [137]:

df_tips = sns.load_dataset('tips')

Out[137]:

genderMaleFemalemealcount_lunch_meals_servedcount_dinner_meals_servedparty_people_size123456123456daysexThurMale0242211000000Female1232302010000FriMale120000060100Female031000050000SatMale000000034131110Female0000002195200SunMale00000003291421Female000000076410