Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in the mundane labor of collecting and preparing data, before it can be explored for useful information. - NYTimes (2014)

What are some common things you like to do with your data? Maybe remove rows or columns, do calculations and maybe add new columns? This is called data wrangling. It’s not data management or data manipulation: you keep the raw data raw and do these things programmatically in R with the tidyverse.

We are going to introduce you to data wrangling in R first with the tidyverse. The tidyverse is a suite of packages that match a philosophy of data science developed by Hadley Wickham and the RStudio team. I find it to be a more straightforward way to learn R. We will also show you by comparison what code will look like in “Base R”, which means R without any additional packages (like the “tidyverse” package) installed. I like David Robinson’s blog post on the topic of teaching the tidyverse first.

For some things, base R is more straightforward, and we’ll show you that too. Whenever we use a function that is from the tidyverse, we will prefix it with its package name (for example dplyr::filter()) so you’ll know for sure.
Today’s materials again borrow from some excellent sources.

Let’s start off discussing tidy data. Hadley Wickham, RStudio’s Chief Scientist, and his team have been building R packages for data wrangling and visualization based on the idea of tidy data. Tidy data has a simple convention: put variables in the columns and observations in the rows. The Ocean Health Index dataset we were working with this morning was an example of tidy data. When data are tidy, you are set up to work with them for your analyses, plots, etc. Right now we are going to use dplyr to wrangle a tidy dataset.

Conceptually, making data tidy first is really critical. Instead of building your analyses around whatever (likely weird) format your data are in, take deliberate steps to make your data tidy. When your data are tidy, you can use a growing assortment of powerful analytical and visualization tools instead of inventing home-grown ways to accommodate your data. This will save you time since you aren’t reinventing the wheel, and will make your work clearer and more understandable to your collaborators (most importantly, Future You).

And actually, Hadley Wickham and RStudio have created a ton of packages that help you at every step of the way here, as illustrated in one of Hadley’s recent presentations.

We’ll do this in a new RMarkdown file.
In your R Markdown file, let’s make sure we’ve got our libraries loaded. Load the tidyverse with library(), and keep the matching install.packages() call in a comment on the same line. This is becoming standard practice for how to load a library in a file: if you get an error that the library doesn’t exist, you can install the package easily by highlighting the code within the comment and running it.

In the ggplot2 chapter, we explored the Ocean Health Index data visually. Today, we’ll explore a different dataset by the numbers. We will work with some of the data from the Gapminder project.

The data are on GitHub. Navigate there by going to: github.com > ohi-science > data-science-training > data > gapminder.csv. This is data-view mode, so we can have a quick look at the data. It’s a .csv file, which you’ve probably encountered before, but GitHub has formatted it nicely so it’s easy to look at. You can see that for every country and year, there are several columns with data in them.

We can read this data into R directly from GitHub, without downloading it. But we can’t read this data in view-mode. We have to click on the Raw button on the top-right of the data. This displays it as the raw csv file, without formatting. Copy the url for raw data: https://raw.githubusercontent.com/OHI-Science/data-science-training/master/data/gapminder.csv

Now, let’s go back to RStudio. In our R Markdown, let’s read this csv file and name the variable “gapminder”. Then let’s inspect it.
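Here is a minimal sketch of the load, read, and inspect steps. To stay runnable offline it reads a tiny made-up csv from a temporary file rather than the GitHub url; the name gap_small and every value in it are illustrative assumptions, and the column names are assumed to match the real gapminder.csv.

```r
library(tidyverse) ## install.packages("tidyverse")

# Tiny made-up csv with gapminder-style columns (values are illustrative only)
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "country,year,pop,continent,lifeExp,gdpPercap",
  "Mexico,1997,95000000,Americas,73.0,9000",
  "Mexico,2002,102000000,Americas,75.0,10000",
  "Cambodia,2002,13000000,Asia,57.0,900",
  "Sweden,2002,9000000,Europe,80.0,29000"
), csv_path)

# Swapping csv_path for the raw GitHub url should read the real data the same way
gap_small <- readr::read_csv(csv_path)

head(gap_small, 2)          # first rows
tail(gap_small, 2)          # last rows
str(gap_small)              # structure: column names and types
summary(gap_small$lifeExp)  # statistical overview of one column, picked with $
```

readr::read_csv() accepts a local path or a url, so the same call should work on the raw link once you are online.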
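There are more ways to learn basic info on a data.frame. A statistical overview can be obtained with summary(). To specify a single variable from a data.frame, use the dollar sign $.

OK, so let’s start wrangling with dplyr. There are five dplyr functions that you will use for the vast majority of data manipulations: filter(), select(), mutate(), summarize(), and arrange().

These can all be used in conjunction with group_by(), which changes the scope of each function from operating on the entire dataset to operating on it group by group. All verbs work similarly: the first argument is a data frame; the subsequent arguments describe what to do with the data frame, using the variable names without quotes; and the result is a new data frame.

Together these properties make it easy to chain together multiple simple steps to achieve a complex result.

You will want to isolate bits of your data; maybe you want to only look at a single country or a few years. R calls this subsetting.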
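The shared shape of the dplyr verbs can be sketched on a small made-up data frame. The name gap_small and its values are assumptions for illustration only; they are not the real Gapminder numbers.

```r
library(dplyr)  # one of the packages attached by library(tidyverse)

# Made-up stand-in for gapminder (illustrative values only)
gap_small <- data.frame(
  country = c("Mexico", "Mexico", "Cambodia", "Sweden"),
  year    = c(1997, 2002, 2002, 2002),
  lifeExp = c(73, 75, 57, 80),
  stringsAsFactors = FALSE
)

# Data frame first, unquoted column names next, a new data frame back
recent <- dplyr::filter(gap_small, year == 2002)
sorted <- dplyr::arrange(recent, lifeExp)

nrow(gap_small)  # the original is untouched: still 4 rows
sorted$country   # "Cambodia" "Mexico" "Sweden"
```

Because each verb returns a new data frame, the output of one step can be the first argument of the next.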
To subset by row, use filter(). RStudio’s dplyr cheatsheet shows this visually.

You can say this out loud: “Filter the gapminder data for life expectancy less than 29”. Notice that when we do this, all the columns are returned, but only the rows that have a life expectancy less than 29. We’ve subsetted by row.

Let’s try another: “Filter the gapminder data for the country Mexico”.

How about if we want two country names? We can’t use the == operator for that, because it compares against one value at a time; the %in% operator checks membership in a set of values instead.

How about if we want Mexico in 2002? You can pass filter() different criteria, separated by commas, and it keeps only the rows that satisfy all of them.
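Here is a hedged sketch of those filters on a small made-up data frame. gap_small and its values are assumptions; because the toy data has no life expectancy under 29, the first filter uses 60 as its threshold.

```r
library(dplyr)

# Made-up stand-in for gapminder (illustrative values only)
gap_small <- data.frame(
  country = c("Mexico", "Mexico", "Peru", "Cambodia", "Sweden"),
  year    = c(1997, 2002, 2002, 2002, 2002),
  lifeExp = c(73, 75, 70, 57, 80),
  stringsAsFactors = FALSE
)

# Rows with life expectancy below a threshold; all columns come back
low_life <- dplyr::filter(gap_small, lifeExp < 60)

# One country
mexico <- dplyr::filter(gap_small, country == "Mexico")

# Two countries: %in% checks membership in a set of values
two_countries <- dplyr::filter(gap_small, country %in% c("Mexico", "Peru"))

# Mexico in 2002: comma-separated criteria must all be true
mex_2002 <- dplyr::filter(gap_small, country == "Mexico", year == 2002)
```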
We use select() to subset the data by column. RStudio’s cheatsheet shows this visually. We can select multiple columns with a comma, after we specify the data frame (gapminder). We can also use - to deselect columns.

Let’s filter for Cambodia and remove the continent and lifeExp columns. We’ll save this as a variable. Actually, as two temporary variables, which means that for the second one we need to operate on gap_cambodia, not gapminder. We also could have given both variables the same name and overwritten the first, but that is harder to follow. Good thing there is an awesome alternative.

Before we go any further, we should explore the new pipe operator that dplyr imports from the magrittr package. Here’s what it looks like: %>%.

Let’s demo then I’ll explain: gapminder %>% head() is equivalent to head(gapminder). Never fear, you can still specify other arguments to this function! To see the first 3 rows of gapminder, we could say head(gapminder, 3) or gapminder %>% head(3).

I’ve advised you to think “gets” whenever you see the assignment operator, <-. Similarly, think “and then” whenever you see the pipe operator, %>%.

One of the most awesome things about this is that you START with the data before you say what you’re going to DO to it. So above: “take the gapminder data, and then give me the first three entries”.

This means that the two Cambodia steps can each be rewritten with a pipe: we start with gapminder in the first line, and then gap_cambodia in the second. This makes it a bit easier to see what data we are starting with and what we are doing to it.

…But, we still have those temporary variables, so we’re not truly that much better off. But get ready to be majorly impressed: we can use the pipe to chain those two operations together.

What’s happening here? In the second step, we were able to delete the intermediate variable, because the pipe hands the result of the first step straight to the next. By using multiple lines I can actually read this like a story, and there aren’t temporary variables that get super confusing. In my head: start with the gapminder data, and then filter for Cambodia, and then drop the continent and lifeExp columns.
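The Cambodia example can be sketched both ways, first with temporary variables and then with the pipe. The data frame gap_small and its values are made up for illustration, and the variable names gap_cambodia2 and gap_cambodia_piped are assumptions.

```r
library(dplyr)

# Made-up stand-in for gapminder (illustrative values only)
gap_small <- data.frame(
  country   = c("Mexico", "Cambodia", "Cambodia", "Sweden"),
  year      = c(2002, 1997, 2002, 2002),
  pop       = c(102e6, 12e6, 13e6, 9e6),
  continent = c("Americas", "Asia", "Asia", "Europe"),
  lifeExp   = c(75, 56, 57, 80),
  stringsAsFactors = FALSE
)

# Two steps with temporary variables
gap_cambodia  <- dplyr::filter(gap_small, country == "Cambodia")
gap_cambodia2 <- dplyr::select(gap_cambodia, -continent, -lifeExp)

# The same two steps chained with the pipe, read as "and then"
gap_cambodia_piped <- gap_small %>%
  dplyr::filter(country == "Cambodia") %>%
  dplyr::select(-continent, -lifeExp)

# head(x, 3) and x %>% head(3) are the same call
same_head <- identical(head(gap_small, 3), gap_small %>% head(3))
```

Both routes should give the same result; the piped version simply never names the intermediate data frame.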
Being able to read a story out of code like this is really game-changing. We’ll continue using this syntax as we learn the other dplyr verbs.

Alright, let’s keep going. Let’s say we needed to add an index column so we know which order these data came in. Let’s not make a new variable, let’s add a column to our gapminder data frame. How do we do that? With the mutate() function. RStudio’s cheatsheet shows this visually.

Imagine we want to know each country’s annual GDP. We can multiply pop by gdpPercap to create a new column.
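A minimal sketch of both mutate() uses, on made-up toy numbers (not real statistics). The new column names index and gdp are assumptions chosen for illustration, as is the data frame gap_small.

```r
library(dplyr)

# Toy values only, chosen so the arithmetic is easy to check by eye
gap_small <- data.frame(
  country   = c("Mexico", "Cambodia", "Sweden"),
  pop       = c(100, 10, 5),
  gdpPercap = c(2, 3, 4),
  stringsAsFactors = FALSE
)

gap_gdp <- gap_small %>%
  dplyr::mutate(
    index = dplyr::row_number(),  # the order the rows arrived in
    gdp   = pop * gdpPercap       # total GDP = people x GDP per person
  )

gap_gdp$gdp  # 200 30 20
```

mutate() returns a new data frame with the extra columns; gap_small itself keeps its original three.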
What if we wanted to know the total population on each continent in 2002? Answering this question requires a grouping variable. RStudio’s cheatsheet shows this visually.

By using group_by() we can set our grouping variable and then calculate the total population within each group.

OK, this is great. But what if we don’t care about the other columns and we only want each continent and its population in 2002? Here’s the next function: we want to operate on a group, but actually collapse or distill the output from that group. The summarize() function does that for us. RStudio’s cheatsheet shows this visually too.

How cool is that! We can use more than one grouping variable. Let’s get total populations by continent and year.

This is ordered alphabetically, which is cool. But let’s say we wanted to order it in ascending order for year. The dplyr function for that is arrange().
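These grouping steps can be sketched as follows. The data frame gap_small holds toy values, and the column name cont_pop is an assumption for illustration; .groups = "drop" assumes dplyr 1.0 or later.

```r
library(dplyr)

# Toy values only (population in made-up units)
gap_small <- data.frame(
  country   = c("Mexico", "Peru", "Sweden", "France", "Mexico", "Sweden"),
  continent = c("Americas", "Americas", "Europe", "Europe", "Americas", "Europe"),
  year      = c(2002, 2002, 2002, 2002, 1997, 1997),
  pop       = c(100, 30, 9, 60, 95, 8),
  stringsAsFactors = FALSE
)

# group_by() + mutate(): every 2002 row is kept, with its continent total added
with_totals <- gap_small %>%
  dplyr::filter(year == 2002) %>%
  dplyr::group_by(continent) %>%
  dplyr::mutate(cont_pop = sum(pop)) %>%
  dplyr::ungroup()

# group_by() + summarize(): collapses to one row per continent
cont_2002 <- gap_small %>%
  dplyr::filter(year == 2002) %>%
  dplyr::group_by(continent) %>%
  dplyr::summarize(cont_pop = sum(pop))

# Two grouping variables, then ordered by year
cont_year <- gap_small %>%
  dplyr::group_by(continent, year) %>%
  dplyr::summarize(cont_pop = sum(pop), .groups = "drop") %>%
  dplyr::arrange(year)
```

The difference to notice is the row count: mutate() keeps all four 2002 rows, while summarize() returns one row per group.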
We have done a pretty incredible amount of work in a few lines. Our whole analysis is a single chain of verbs. Imagine the possibilities from here. It’s very readable: you see the data as the first thing, and it’s not nested. Then, you can read the verbs. The whole thing can also be written with explicit package calls, such as dplyr::.

I actually am borrowing this “All together now” from Tony Fischetti’s blog post How dplyr replaced my most common R idioms. With that as inspiration, let’s compare with some base R code to accomplish the same things. Base R requires subsetting with the [rows, columns] notation. If we don’t write anything after the comma, that means “all columns”. And if we don’t write anything before the comma, that means “all rows”. Also, the $ operator is how you access a specific column of a data frame. Instead of calculating the max for each country like we did with dplyr, a simple base R version works on one country at a time. Note too that the chain operator %>% lets us avoid the temporary variables that the base R version needs.

Get your RMarkdown file cleaned up and sync it for the last time today!

We’ve learned a ton in this session and we may not get to this right now. If we don’t have time, we’ll start here before getting into the next chapter.

Most of the time you will have data coming from different places or in different files, and you want to put them together so you can analyze them. Datasets you’ll be joining can be called relational data, because they have some kind of relationship between them that you’ll be acting upon. In the tidyverse, combining data that have a relationship is called “joining”. The RStudio cheatsheet illustrates this (note: this is an earlier version of the cheatsheet but I like the graphics).
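If you wanted to combine these two tables, how would you do it? There are some decisions you’d have to make about what was important to you. The cheatsheet visualizes it for us. We will only talk about this briefly here, but you can refer to this more as you have your own datasets that you want to join.

The main joins are: left_join() keeps every row of the left table and adds whatever matches from the right table; right_join() keeps every row of the right table; inner_join() keeps only the rows that have a match in both tables; and full_join() keeps all rows from both tables.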
Let’s play with some CO2 emissions data to illustrate.

That’s all we’re going to talk about today with joining, but there are more ways to think about and join your data. Check out the Relational Data Chapter in R for Data Science.
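As a quick reference for later, here is a minimal sketch of the four basic joins on two tiny made-up tables. The names gap_2002 and co2, the column emissions, and all values are hypothetical stand-ins, not the real CO2 emissions data.

```r
library(dplyr)

# Two tiny made-up tables that share a "country" key (illustrative values only)
gap_2002 <- data.frame(
  country = c("Mexico", "Cambodia", "Sweden"),
  pop     = c(100, 13, 9),
  stringsAsFactors = FALSE
)
co2 <- data.frame(
  country   = c("Mexico", "Sweden", "Peru"),
  emissions = c(400, 50, 30),
  stringsAsFactors = FALSE
)

left  <- dplyr::left_join(gap_2002, co2, by = "country")   # all rows of gap_2002
right <- dplyr::right_join(gap_2002, co2, by = "country")  # all rows of co2
inner <- dplyr::inner_join(gap_2002, co2, by = "country")  # only countries in both
full  <- dplyr::full_join(gap_2002, co2, by = "country")   # every country from either

left  # Cambodia has no match in co2, so its emissions value is NA
```

Which join to use depends on which table’s rows you cannot afford to lose.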
If you get an error such as “unexpected SPECIAL” that points at %>%, it is probably because you have a line that starts with a pipe. The pipe should be at the end of the previous line, not the start of the current line.
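On the pipe-placement error, here is a small sketch of the working and failing layouts. The failing code is kept in a string and passed to parse() only so the error can be captured without stopping the script; bad_code and msg are names made up for this example.

```r
library(dplyr)

# Works: the pipe ends the line, so R knows the expression continues
sorted <- c(3, 1, 2) %>%
  sort()

# Fails to parse: the second line starts with the pipe
bad_code <- "c(3, 1, 2)\n  %>% sort()"
msg <- tryCatch(
  {
    parse(text = bad_code)
    "parsed without error"
  },
  error = function(e) conditionMessage(e)
)

msg  # the captured parser message
```

R treats the first line as a finished expression, so a pipe at the start of the next line has nothing on its left to work with.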