Experimentation with Tidyverse in R

Aruna Singh
Artificial Intelligence in Plain English
7 min read · Jan 21, 2021

In this article, you’ll learn the intertwined processes of data manipulation and visualization using the tools dplyr and ggplot2. You’ll manipulate data by filtering, sorting, and summarizing a real dataset in order to answer exploratory questions. Along the way, you’ll get a taste of exploratory data analysis and the power of the Tidyverse tools. If you have prior experience in R, you can continue with this article. Otherwise, I recommend glancing through the Quick Tutorial on R first for a better understanding.

Data Wrangling

In this section, you’ll learn to do three things with a table: filter for particular observations, arrange the observations in the desired order and mutate to add or change a column. You’ll see how each of these steps allows you to answer questions about your data.

At every step, you’ll be analyzing a real dataset called gapminder. The gapminder dataset tracks economic and social indicators, such as GDP per capita and life expectancy, for countries over time. To use it, install the gapminder package and load it.

Structure of the gapminder Data Set
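
A minimal sketch of installing, loading, and inspecting the dataset. It assumes the gapminder and dplyr packages are available from CRAN; the install line is commented out so the snippet can be rerun safely.

```r
# One-time setup (assumes access to CRAN); uncomment if the packages are missing
# install.packages(c("gapminder", "dplyr"))

library(gapminder)
library(dplyr)

# glimpse() lists every column with its type and its first few values
glimpse(gapminder)

# Six columns: country, continent, year, lifeExp, pop, gdpPercap
dim(gapminder)  # 1704 rows, 6 columns
```

Each row is one country in one year, which is why the later examples speak of country-year pairs.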

Verbs in the dplyr package

  1. filter() : To keep only the subset of observations that satisfy a particular condition. It’s a common step in an analysis.

Every time you apply a verb, use the pipe %>%, which feeds the result on its left into the next step.

library(dplyr)
library(gapminder)

# Keep only the observations for the year 2007
gapminder %>%
  filter(year == 2007)

# Conditions separated by commas must all hold: 2007 and the United States
gapminder %>%
  filter(year == 2007, country == "United States")
Filtered Data for year “2007”
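
filter() accepts any logical condition, not only ==. As a sketch using standard R operators (the countries and the threshold of 80 years are chosen purely for illustration, as are the object names):

```r
library(dplyr)
library(gapminder)

# %in% keeps rows whose value matches any element of the vector
two_countries <- gapminder %>%
  filter(year == 2007, country %in% c("India", "China"))

# Comparison operators work as well: life expectancy above 80 in 2007
long_lived <- gapminder %>%
  filter(year == 2007, lifeExp > 80)

two_countries
long_lived
```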

2. arrange() : To sort the observations in ascending or descending order of a variable.

gapminder %>%
  filter(year == 2007) %>%
  arrange(desc(gdpPercap))
Filter for 2007, then arrange in descending order of gdp per capita
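
For the ascending case, a minimal sketch: arrange() sorts from smallest to largest by default, so desc() is simply left out. The name poorest_first is only illustrative.

```r
library(dplyr)
library(gapminder)

# arrange() sorts in ascending order unless the column is wrapped in desc()
poorest_first <- gapminder %>%
  filter(year == 2007) %>%
  arrange(gdpPercap)

head(poorest_first, 5)
```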

3. mutate() : To change an existing variable or add a new variable to the dataset. Keep the column name to a single word with no spaces.

gapminder %>%
  mutate(gdp = gdpPercap * pop)
Added gdp as the product of gdpPercap and population
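
mutate() can also change a column in place. As a sketch, reusing an existing column name on the left-hand side overwrites that column; here pop is rescaled to millions, and gapminder_millions is an illustrative name.

```r
library(dplyr)
library(gapminder)

# Reusing an existing column name overwrites that column:
# pop is converted from people to millions of people
gapminder_millions <- gapminder %>%
  mutate(pop = pop / 1000000)

head(gapminder_millions, 3)
```

The original gapminder object is untouched; only the new data frame carries the rescaled column.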

Combining Verbs: Let’s combine all three of the verbs you’ve learned just now, to find the countries with the highest life expectancy, in months, in the year 2007.

gapminder %>%
  filter(year == 2007) %>%
  mutate(lifeExpMonths = 12 * lifeExp) %>%
  arrange(desc(lifeExpMonths))

Data Visualization

In this section, you’ll learn the essential skills of data visualization using the ggplot2 package, and you’ll see how the dplyr and ggplot2 packages work closely together to create informative graphs.

In particular, you’ll see how to create a scatter plot that compares two variables on the x- and y-axes.

Variable Assignment: When you’re working with just a subset of the data, it’s useful to save the filtered data as a new data frame. To do this, you use the assignment operator <-. This is a less-than sign followed by a minus sign, like an arrow pointing to the left.

In this operation, you’re taking the gapminder dataset, filtering it for the observations from the year 2007, and then saving the result as gapminder_2007. Now if you print gapminder_2007, you can see that it’s another table, but this one has only 142 rows, all for the year 2007.

gapminder_2007 <- gapminder %>%
  filter(year == 2007)
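
A quick way to confirm what was saved, sketched with base R helpers:

```r
library(dplyr)
library(gapminder)

gapminder_2007 <- gapminder %>%
  filter(year == 2007)

nrow(gapminder_2007)          # 142, one row per country
unique(gapminder_2007$year)   # 2007
```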

Now that you’ve saved this variable, you can use it to create your visualization.

Visualizing with ggplot2: Suppose you want to examine the relationship between a country’s wealth and its life expectancy. You could do this with a scatter plot comparing two variables in the gapminder dataset: GDP per capita on the x-axis and life expectancy on the y-axis. You’ll create plots like this with the ggplot2 package. The code below follows the same recipe for the 1952 data, with population (pop) on the x-axis instead.

library(ggplot2)

gapminder_1952 <- gapminder %>%
  filter(year == 1952)

ggplot(gapminder_1952, aes(x = pop, y = lifeExp)) +
  geom_point()
Scatter plot with pop on the x-axis and lifeExp on the y-axis

A scatter plot of GDP per capita against life expectancy communicates an interesting observation: higher-income countries tend to have higher life expectancy. There is a problem, however: a lot of countries get crammed into the leftmost part of the x-axis. This is because the distribution of GDP per capita spans several orders of magnitude, with some countries in the tens of thousands of dollars and others in the hundreds. Population, used in the plot above, has the same kind of spread. When one of your axes has that kind of distribution, it’s useful to work with a logarithmic scale.
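
The wealth-versus-life-expectancy plot discussed above can be sketched as follows. This assumes the 2007 subset and the gdpPercap column; the object name p_wealth is only illustrative.

```r
library(dplyr)
library(ggplot2)
library(gapminder)

gapminder_2007 <- gapminder %>%
  filter(year == 2007)

# GDP per capita on the x-axis, life expectancy on the y-axis
p_wealth <- ggplot(gapminder_2007, aes(x = gdpPercap, y = lifeExp)) +
  geom_point()

p_wealth

# The spread behind the crowding: smallest and largest GDP per capita
range(gapminder_2007$gdpPercap)
```

On a linear axis, most points sit near the left edge because a few very high values stretch the scale.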

Log scale: Change the existing scatter plot to put the x-axis (representing population) on a log scale, that is, a scale where each fixed distance represents a multiplication of the value. The plot shows the same data, but now each unit on the x-axis represents a tenfold change in population.

ggplot(gapminder_1952, aes(x = pop, y = lifeExp)) +
  geom_point() +
  scale_x_log10()
The same plot with the x-axis on a log scale

Suppose you want to create a scatter plot with population on the x-axis and GDP per capita on the y-axis. Both population and GDP per capita are better represented with log scales, since they vary over many orders of magnitude.

ggplot(gapminder_1952, aes(x = pop, y = gdpPercap)) +
  geom_point() +
  scale_x_log10() +
  scale_y_log10()
Comparing pop and gdpPercap, with both axes on a log scale

Adding size and color to the plot: So far the plots have used only the x and y positions. Now you’ll use the color and size of the points to communicate even more: a scatter plot showing each country’s population, life expectancy, continent, and GDP per capita.

ggplot(gapminder_1952, aes(x = pop, y = lifeExp, color = continent, size = gdpPercap)) +
  geom_point() +
  scale_x_log10()
Color represents continent and size represents a country’s gdpPercap

Faceting: It is used to divide a graph into subplots based on one of its variables, such as the continent or, as in the code below, the year. Note that color and size must be placed inside aes() to be mapped to variables.

ggplot(gapminder, aes(x = gdpPercap, y = lifeExp, color = continent, size = pop)) +
  geom_point() +
  scale_x_log10() +
  facet_wrap(~ year)
Scatter plot comparing gdpPercap and lifeExp, with color representing continent and size representing population, faceted by year
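
Faceting by continent works the same way. A minimal sketch, assuming the 2007 subset so that each panel shows one point per country (p_facets is an illustrative name):

```r
library(dplyr)
library(ggplot2)
library(gapminder)

gapminder_2007 <- gapminder %>%
  filter(year == 2007)

# One panel per continent; the x-axis is on a log scale in every panel
p_facets <- ggplot(gapminder_2007, aes(x = gdpPercap, y = lifeExp, size = pop)) +
  geom_point() +
  scale_x_log10() +
  facet_wrap(~ continent)

p_facets
```

The tilde in facet_wrap(~ continent) means "by": one subplot is drawn for each distinct value of the variable that follows it.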

Grouping and Summarizing

So far you’ve been answering questions about individual country-year pairs, but you may be interested in aggregations of the data, such as the average life expectancy of all countries within each year. Here you’ll meet the “split-apply-combine” paradigm: split the data into groups, apply some analysis to each group, and then combine the results.

  1. summarize() : To collapse many rows into a single-row summary (one row per group when the data is grouped). It does this by applying an aggregating or summary function to each group.

Let’s find the median life expectancy by using the median function within the summarize function.

gapminder %>%
  summarize(medianLifeExp = median(lifeExp))
Median life expectancy

Rather than summarizing the entire dataset, you may want to find the median life expectancy for only one particular year. In this case, you can combine the summarize verb with filter. You can create multiple summaries at once with the summarize verb.

gapminder %>%
  filter(year == 1957) %>%
  summarize(medianLifeExp = median(lifeExp))
Median life expectancy for year 1957
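
Multiple summaries go in the same summarize() call, separated by commas. A sketch for the year 2007, with illustrative column names meanLifeExp and totalPop:

```r
library(dplyr)
library(gapminder)

# Several summaries in one call, separated by commas.
# pop is stored as an integer, so it is converted to numeric before
# summing to avoid integer overflow.
summary_2007 <- gapminder %>%
  filter(year == 2007) %>%
  summarize(meanLifeExp = mean(lifeExp),
            totalPop = sum(as.numeric(pop)))

summary_2007
```

The result is still a single row, now with one column per summary.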

Above, you calculated the median life expectancy for the year 1957. What if you weren’t interested in just the year 1957, but in each of the years in the dataset? You could rerun the code and change the year each time, but that’s very tedious. Instead, you can use the group_by verb, which tells dplyr to summarize within groups instead of summarizing the entire dataset.

2. group_by() : It splits the data into groups. Once the data is grouped, summarize produces one row per group.
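
As a sketch of the per-year case, replacing the filter step with group_by(year) gives one summary row for each year in the dataset:

```r
library(dplyr)
library(gapminder)

# One summary row per year instead of one row for the whole dataset
by_year <- gapminder %>%
  group_by(year) %>%
  summarize(medianLifeExp = median(lifeExp))

by_year
```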

Suppose you’re interested in the median life expectancy and the maximum GDP per capita in 1957 within each continent. You can find this by first filtering for the year 1957, grouping by continent (instead of year), and then performing your summary.

gapminder %>%
  filter(year == 1957) %>%
  group_by(continent) %>%
  summarize(medianLifeExp = median(lifeExp), maxGdpPercap = max(gdpPercap))

Visualizing Summarized Data

Instead of viewing the summarized data as a table, you can save it as an object, say by_year, and visualize it with the ggplot2 package. You construct the graph from the three parts of a ggplot2 call: the data, which is by_year; the aesthetics, which put year on the x-axis and total population on the y-axis; and the type of graph, which in this case is a scatter plot, represented by geom_point.
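
Those three parts can be sketched as follows. The column name totalPop and the object name p_by_year are illustrative choices.

```r
library(dplyr)
library(ggplot2)
library(gapminder)

# Total population per year; as.numeric() avoids integer overflow in sum()
by_year <- gapminder %>%
  group_by(year) %>%
  summarize(totalPop = sum(as.numeric(pop)))

# Data: by_year. Aesthetics: year on x, totalPop on y. Geometry: points.
p_by_year <- ggplot(by_year, aes(x = year, y = totalPop)) +
  geom_point()

p_by_year
```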

Here, you’ll examine the median GDP per capita instead, and see how the trend differs among continents.

library(ggplot2)
# Summarize medianGdpPercap within each continent within each year
by_year_continent <- gapminder %>%
  group_by(continent, year) %>%
  summarize(medianGdpPercap = median(gdpPercap))
# expand_limits() is added as its own layer so the y-axis starts at 0
ggplot(by_year_continent, aes(x = year, y = medianGdpPercap, color = continent)) +
  geom_point() +
  expand_limits(y = 0)
Change in medianGdpPercap in each continent over time

To conclude, we have covered the principles of transforming and visualizing data with R, and in the process drawn some real insights from the gapminder dataset. We have also used ggplot2 to create more informative and customized data visualizations.

Thank you for reading the article; I hope you found it helpful. Do let me know your thoughts, and follow me on LinkedIn or Twitter for more updates.

As a BIE at Amazon, I explore why we call data the new oil by interpreting it and generating meaningful insights.