Building infographics layer by layer
Data visualization for everyone
Gapminder data
We’re going to work with a different dataset for this section. It’s a cleaned-up excerpt from the Gapminder data.
- country: a categorical variable with 142 levels
- continent: a categorical variable with 5 levels
- year: going from 1952 to 2007 in increments of 5 years
- pop: population
- gdpPercap: GDP per capita
- lifeExp: life expectancy
dplyr review
The dplyr package gives you a handful of useful verbs for managing data. On their own they don’t do anything that base R can’t do. Here are some of the single-table verbs we’ll be working with in this lesson (single-table meaning that they only work on a single table, in contrast to two-table verbs used for joining data together). They all take a data.frame or tbl as their input for the first argument, and they all return a data.frame or tbl as output.
- filter(): filters rows of the data where some condition is true
- select(): selects out particular columns of interest
- mutate(): adds new columns or changes values of existing columns
- arrange(): arranges a data frame by the value of a column
- summarize(): summarizes multiple values to a single value, most useful when combined with…
- group_by(): groups a data frame by one or more variables. Most data operations are useful when done on groups defined by variables in the dataset. The group_by() function takes an existing data frame and converts it into a grouped data frame where summarize() operations are performed by group.
Additionally, the %>% operator allows you to “chain” operations together. Rather than nesting functions inside out, the %>% operator allows you to write operations left-to-right, top-to-bottom. Let’s say we wanted to get the average life expectancy and GDP (not GDP per capita) for Asian countries for each year.
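One way to write that pipeline is sketched below. It assumes the data come from the gapminder package and that total GDP can be computed as gdpPercap times pop; the names asia_by_year, gdp, mean_lifeExp and mean_gdp are made up for illustration.

```r
library(dplyr)
library(gapminder)  # assumption: this package provides the gapminder data frame

asia_by_year <- gapminder %>%
  filter(continent == "Asia") %>%         # keep Asian countries only
  mutate(gdp = gdpPercap * pop) %>%       # total GDP = GDP per capita x population
  group_by(year) %>%                      # one group per year
  summarize(mean_lifeExp = mean(lifeExp), # collapse each group to one row
            mean_gdp = mean(gdp))

asia_by_year
```

Each verb receives the result of the previous step as its first argument, so the code reads in the same order as the description.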
About ggplot2
ggplot2 is a widely used R package that extends R’s visualization capabilities. It takes the hassle out of things like creating legends, mapping other variables to scales like color, or faceting plots into small multiples. We’ll learn about what all these things mean shortly.
Specifically, ggplot2 allows you to build a plot layer-by-layer by specifying:
- a geom, which specifies how the data are represented on the plot (points, lines, bars, etc.),
- aesthetics that map variables in the data to axes on the plot or to plotting size, shape, color, etc.,
- a stat, a statistical transformation or summary of the data applied prior to plotting,
- facets, which allow the data to be divided into chunks on the basis of other categorical or continuous variables, with the same plot drawn for each chunk.
Plotting bivariate data: continuous Y by continuous X
The ggplot function has two required arguments: the data used for creating the plot, and an aesthetic mapping to describe how variables in said data are mapped to things we can see on the plot.
First, let’s load the package. It has already been loaded for you, but load it yourself anyway.
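A minimal sketch of the loading step. The text does not say where the gapminder data frame comes from, so attaching the gapminder package here is an assumption.

```r
library(ggplot2)    # plotting functions: ggplot(), aes(), geom_*()
library(gapminder)  # assumption: provides the gapminder data frame

head(gapminder)     # quick look at the first few rows
```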
Now, let’s lay out the plot. If we want to plot a continuous Y variable by a continuous X variable we’re probably most interested in a scatter plot. Here, we’re telling ggplot that we want to use the gapminder dataset, and the aesthetic mapping will map gdpPercap onto the x-axis and lifeExp onto the y-axis. Remember that the variable names are case sensitive!
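The layout described above might look like this (again assuming the gapminder package supplies the data):

```r
library(ggplot2)
library(gapminder)  # assumption: provides the gapminder data frame

# Data plus aesthetic mapping only; no geom has been added yet
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp))
```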
When we do that we get a blank canvas with no data showing (you might get an error if you’re using an old version of ggplot2). That’s because all we’ve done is laid out a two-dimensional plot specifying what goes on the x and y axes, but we haven’t told it what kind of geometric object to plot. The obvious choice here is a point, so we add geom_point().
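Adding the point layer could be sketched as follows, under the same assumption about the data source:

```r
library(ggplot2)
library(gapminder)  # assumption: provides the gapminder data frame

# Same canvas as before, now with a layer of points drawn on it
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
  geom_point()
```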
Here, we’ve built our plot in layers. First, we create a canvas for plotting layers to come using the ggplot function, specifying which data to use (here, the gapminder data frame), and an aesthetic mapping of gdpPercap to the x-axis and lifeExp to the y-axis. We next add a layer to the plot, specifying a geom, or a way of visually representing the aesthetic mapping.
Now, the typical workflow for building up a ggplot2 plot is to first construct the figure and save that to a variable (for example, p), and as you’re experimenting, you can continue to re-define the p object as you develop “keeper commands”.
First, let’s construct the graphic. Notice that we don’t have to specify x= and y= if we specify the arguments in the correct order (x is first, y is second).
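A sketch of saving the canvas to p with positional arguments (the data source is assumed to be the gapminder package):

```r
library(ggplot2)
library(gapminder)  # assumption: provides the gapminder data frame

# aes() takes x first and y second, so the names can be dropped
p <- ggplot(gapminder, aes(gdpPercap, lifeExp))
```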
The p object now contains the canvas, but nothing else. Try displaying it by just running p. Let’s experiment with adding points and a different scale to the x-axis.
- Experiment with adding points
- Experiment with a different scale (+ scale_x_log10())
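The two experiments might look like the sketch below. Re-defining p with the log scale at the end is an assumption about which command becomes the “keeper”.

```r
library(ggplot2)
library(gapminder)  # assumption: provides the gapminder data frame

p <- ggplot(gapminder, aes(gdpPercap, lifeExp))

p + geom_point()                    # points on the default linear x-axis
p + geom_point() + scale_x_log10()  # same points, log10-scaled x-axis

# Assumed keeper command: store the log scale in p for later plots
p <- p + scale_x_log10()
```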
Above we implied the aesthetic mappings for the x- and y-axes should be gdpPercap and lifeExp, but we can also add aesthetic mappings to the geoms themselves. For instance, what if we wanted to color the points by the value of another variable in the dataset, say, continent?
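A geom-level mapping could be sketched like this, assuming p already carries the log10 x scale:

```r
library(ggplot2)
library(gapminder)  # assumption: provides the gapminder data frame

p <- ggplot(gapminder, aes(gdpPercap, lifeExp)) + scale_x_log10()

# color is mapped inside aes(), so it varies with the data
p + geom_point(aes(color = continent))
```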
Notice the difference here. If I wanted the colors to be some static value, I wouldn’t wrap that in a call to aes(). I would just specify it outright. Same thing with other features of the points. For example, let’s make all the points huge (size=8), blue (color="blue"), semitransparent (alpha=(1/4)) triangles (pch=17):
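With the values from the text, the static version might read as follows (p is assumed to include the log10 x scale):

```r
library(ggplot2)
library(gapminder)  # assumption: provides the gapminder data frame

p <- ggplot(gapminder, aes(gdpPercap, lifeExp)) + scale_x_log10()

# Static values are set outside aes(), so every point looks the same
p + geom_point(color = "blue", pch = 17, size = 8, alpha = 1/4)
```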
Now, this time, let’s map the aesthetics of the point character to certain features of the data. For instance, let’s give the points different colors and character shapes according to the continent, and map the size of the point onto the life expectancy:
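A sketch of those three mappings, again assuming a log10 x scale on p:

```r
library(ggplot2)
library(gapminder)  # assumption: provides the gapminder data frame

p <- ggplot(gapminder, aes(gdpPercap, lifeExp)) + scale_x_log10()

# Everything inside aes() is driven by a column of the data
p + geom_point(aes(color = continent, shape = continent, size = lifeExp))
```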
Your turn
Re-create this same plot from scratch without saving anything to a variable. That is, start from the ggplot call.
- Start with the ggplot() function.
- Use the gapminder data.
- Map gdpPercap to the x-axis and lifeExp to the y-axis.
- Add points to the plot
- Make the points size 3
- Map continent onto the aesthetics of the point
- Use a log10 scale for the x-axis.
The %>% would allow us to do this:
Instead of this:
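As an illustration of that contrast, here is a sketch that assumes the comparison is between piping data into ggplot() and nesting the calls. The filter on year 2007 and the names piped and nested are made up for the example.

```r
library(dplyr)
library(ggplot2)
library(gapminder)  # assumption: provides the gapminder data frame

# Piped: read left-to-right, top-to-bottom
piped <- gapminder %>%
  filter(year == 2007) %>%
  ggplot(aes(gdpPercap, lifeExp)) +
  geom_point()

# Nested: the same plot, read inside out
nested <- ggplot(filter(gapminder, year == 2007), aes(gdpPercap, lifeExp)) +
  geom_point()
```

Note that dplyr steps are joined with %>% while ggplot2 layers are still added with +.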
Adding layers
Let’s add a fitted curve to the points. Recreate the plot in the p object if you need to.
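A possible version, assuming p is the canvas with a log10 x scale:

```r
library(ggplot2)
library(gapminder)  # assumption: provides the gapminder data frame

p <- ggplot(gapminder, aes(gdpPercap, lifeExp)) + scale_x_log10()

# Points first, then a smoothed curve fitted to all of the data
p + geom_point() + geom_smooth()
```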
By default geom_smooth() will use loess for data with n<1000 and a generalized additive model (GAM) otherwise. We can change that behavior by tweaking the parameters to use a thick red line, use a linear model (method="lm") instead of a GAM, and turn off the standard error stripes by adding se=FALSE.
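Those tweaks might be written as below. The width of 2 is an arbitrary choice for “thick”, and p is assumed to include the log10 x scale.

```r
library(ggplot2)
library(gapminder)  # assumption: provides the gapminder data frame

p <- ggplot(gapminder, aes(gdpPercap, lifeExp)) + scale_x_log10()

# lwd and col are the base-R style aliases that ggplot2 accepts
# for line width and colour
p + geom_point() +
  geom_smooth(lwd = 2, se = FALSE, method = "lm", col = "red")
```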
But let’s add back in our aesthetic mapping to the continents. Notice what happens here. We’re mapping continent as an aesthetic mapping to the color of the points only, so geom_smooth() still works only on the entire data.
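A sketch of that combination, under the same assumptions as the previous plots:

```r
library(ggplot2)
library(gapminder)  # assumption: provides the gapminder data frame

p <- ggplot(gapminder, aes(gdpPercap, lifeExp)) + scale_x_log10()

# continent is mapped in the point layer only, so the smoother
# still sees one undivided dataset and draws a single line
p + geom_point(aes(color = continent)) +
  geom_smooth(lwd = 2, se = FALSE, method = "lm", col = "red")
```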
Faceting
Facets display subsets of the data in different panels. There are a couple ways to do this, but facet_wrap() tries to sensibly wrap a series of facets into a 2-dimensional grid of small multiples. Just give it a formula specifying which variables to facet by. We can continue adding more layers, such as smoothing. Here, specify continent as the faceting variable.
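Faceting by continent could look like this, assuming the same p with a log10 x scale:

```r
library(ggplot2)
library(gapminder)  # assumption: provides the gapminder data frame

p <- ggplot(gapminder, aes(gdpPercap, lifeExp)) + scale_x_log10()

# The one-sided formula names the variable that defines the panels
p + geom_point() + geom_smooth() + facet_wrap(~ continent)
```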
Plotting bivariate data: continuous Y by categorical X
With the last example we examined the relationship between a continuous Y variable and a continuous X variable. A scatter plot was the obvious kind of data visualization. But what if we wanted to visualize a continuous Y variable against a categorical X variable? First, let’s set up the basic plot of x=continent vs y=lifeExp:
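The setup might be sketched as follows. Trying geom_jitter() as a first layer is an assumption, suggested by the jittered plots discussed later in this section.

```r
library(ggplot2)
library(gapminder)  # assumption: provides the gapminder data frame

p <- ggplot(gapminder, aes(continent, lifeExp))

p + geom_point()   # points stack on top of each other within a continent
p + geom_jitter()  # random horizontal noise spreads the points out
```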
Probably a more common visualization is to show a box plot using geom_boxplot():
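A minimal box plot on the same canvas (data source assumed as before):

```r
library(ggplot2)
library(gapminder)  # assumption: provides the gapminder data frame

p <- ggplot(gapminder, aes(continent, lifeExp))

# One box per continent summarizing the distribution of lifeExp
p + geom_boxplot()
```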
Let’s make the jitter layer go on top. Also, go back to just the boxplots. Notice that the outliers are represented as points. But there’s no distinction between the outlier points from the boxplot geom and all the other points from the jitter geom. Let’s change that by coloring the outliers. Notice the British spelling of the outlier.colour argument.
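One way to do this is sketched below; the red outlier colour and the 1/2 transparency are illustrative choices, not values from the text.

```r
library(ggplot2)
library(gapminder)  # assumption: provides the gapminder data frame

p <- ggplot(gapminder, aes(continent, lifeExp))

# Layers are drawn in the order they are added, so jitter lands on top.
# Colored outliers stay distinguishable from the jittered points.
p + geom_boxplot(outlier.colour = "red") +
  geom_jitter(alpha = 1/2)
```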
- There’s another geom that’s useful here, called a violin plot (geom_violin()).
- Add geom_violin() and geom_jitter()
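A sketch of the violin plot with jittered points on top (the transparency value is an illustrative choice):

```r
library(ggplot2)
library(gapminder)  # assumption: provides the gapminder data frame

p <- ggplot(gapminder, aes(continent, lifeExp))

p + geom_violin()                             # density shape per continent
p + geom_violin() + geom_jitter(alpha = 1/2)  # raw points overlaid
```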
Let’s go back to our boxplot for a moment.
This plot would be a lot more effective if the continents were shown in some sort of order other than alphabetical. To do that, we’ll have to go back to our basic build of the plot again and use the reorder function in our original aesthetic mapping. Here, reorder is taking the first variable, which is some categorical variable, and ordering it by the level of the mean of the second variable, which is a continuous variable. It looks like this:
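A sketch of the reordered build, assuming the gapminder package as the data source:

```r
library(ggplot2)
library(gapminder)  # assumption: provides the gapminder data frame

# reorder() sorts the continent levels by mean lifeExp (mean is its default)
p <- ggplot(gapminder, aes(x = reorder(continent, lifeExp), y = lifeExp))
p + geom_boxplot()

# The new level order can be inspected directly
levels(reorder(gapminder$continent, gapminder$lifeExp))
```

Because the x aesthetic is now an expression, the default axis label becomes reorder(continent, lifeExp); adding xlab("continent") restores a cleaner label.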
Your Turn
- Make a jittered strip plot of GDP per capita against continent.
- Make a box plot of GDP per capita against continent.
- Using a log10 y-axis scale, overlay semitransparent jittered points on top of box plots, where outlying points are colored.
Plotting univariate continuous data
What if we just wanted to visualize the distribution of a single continuous variable? A histogram is the usual go-to visualization. Here we only have one aesthetic mapping instead of two.
- Specify geom_histogram() in order to plot a histogram
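A minimal sketch, assuming lifeExp is the variable being plotted (the text does not name it at this point):

```r
library(ggplot2)
library(gapminder)  # assumption: provides the gapminder data frame

# Only x is mapped; the histogram computes the counts for the y-axis
p <- ggplot(gapminder, aes(lifeExp))
p + geom_histogram()
```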
When we do this ggplot lets us know that we’re automatically selecting the width of the bins, and we might want to think about this a little further.
- Change bins to 30
- Change bins to 10
- Change bins to 200
- Change bins to 60
Alternatively, we could plot a smoothed density curve instead of a histogram:
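The bin experiments and the density alternative could be sketched as below, assuming the univariate canvas maps lifeExp to x:

```r
library(ggplot2)
library(gapminder)  # assumption: provides the gapminder data frame

p <- ggplot(gapminder, aes(lifeExp))

p + geom_histogram(bins = 30)
p + geom_histogram(bins = 10)   # coarse: detail is lost
p + geom_histogram(bins = 200)  # fine: the shape becomes noisy
p + geom_histogram(bins = 60)

# Smoothed density curve instead of binned counts
p + geom_density()
```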
Back to histograms. What if we wanted to color this by continent?
That’s not what we had in mind. That’s just the outline of the bars. We want to change the fill color of the bars.
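The difference between the two mappings can be sketched like this (same assumed canvas on lifeExp):

```r
library(ggplot2)
library(gapminder)  # assumption: provides the gapminder data frame

p <- ggplot(gapminder, aes(lifeExp))

p + geom_histogram(aes(color = continent))  # colors only the bar outlines
p + geom_histogram(aes(fill = continent))   # colors the inside of the bars
```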
Let’s change the position argument.
But the problem there is that the histograms are blocking each other. What if we tried transparency?
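A sketch of both steps. Using position = "identity" and an alpha of 1/3 are assumptions about the values used.

```r
library(ggplot2)
library(gapminder)  # assumption: provides the gapminder data frame

p <- ggplot(gapminder, aes(lifeExp))

# Bars start from zero for every continent instead of being stacked,
# so they overlap and hide each other
p + geom_histogram(aes(fill = continent), position = "identity")

# Transparency lets the overlapping bars show through
p + geom_histogram(aes(fill = continent), position = "identity", alpha = 1/3)
```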
That’s somewhat helpful, and might work for two distributions, but it gets cumbersome with 5. Let’s go back and try this with density plots, first changing the color of the line:
Then by changing the color of the fill and setting the transparency to 25% (1/4):
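The two density plots might look like this, on the same assumed lifeExp canvas:

```r
library(ggplot2)
library(gapminder)  # assumption: provides the gapminder data frame

p <- ggplot(gapminder, aes(lifeExp))

# One density line per continent
p + geom_density(aes(color = continent))

# Filled densities at 25% opacity so overlapping curves stay visible
p + geom_density(aes(fill = continent), alpha = 1/4)
```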
Your turn
- Plot a histogram of GDP Per Capita.
- Do the same but use a log10 x-axis.
- Still on the log10 x-axis scale, try a density plot mapping continent to the fill of each density distribution, and reduce the opacity.
- Still on the log10 x-axis scale, make a histogram faceted by continent and filled by continent.
Good job on completing this lesson!