The Grammar of ggplot

R Bootcamp

Nate Hall

August 23, 2019

Why does a plotting package have grammar?

While it may sound strange, understanding the grammar of the ggplot2 package is fundamental to using it effectively. Grammar refers to the syntactic rules of a language, which can be combined with a variety of substantive material. Grammar, or syntax, provides the structure for how a language can be expressed.

In other words, the grammar of ggplot2 provides the rules for writing code that gives us a beautiful graph as the output. Other ways of creating graphics (including Excel and SPSS) have a syntax as well, but ggplot2’s is slightly different.

Layering is a key concept in plotting with the ggplot2 package. That is, ggplot2 attempts to move users away from thinking about point-and-click interfaces that produce one single graphic. Instead, the hope is to give users the freedom to think about the graphics they would like to create in a highly customizable framework. In the references below, you’ll see a link to the “R Graphics Cookbook”, which should reinforce the intuition that ggplot2 lets users combine modular pieces of code to create beautiful plots exactly to their specification.

Layering in ggplot

We’ll work quickly through an example, which will end with the basic layers added to make this plot:

When we talk about layers in ggplot, we’re talking about creating a graphic from scratch by stacking layers on top of one another, like so:

In other words, without understanding how one stacks layers in a ggplot object we will never get to the above plot. Instead we start from scratch with the raw data, which looks like this:

str(sat.act); head(sat.act)
## 'data.frame':    699 obs. of  6 variables:
##  $ gender   : Factor w/ 2 levels "Female","Male": 2 2 2 1 1 1 2 1 2 2 ...
##  $ education: int  3 3 3 4 2 5 5 3 4 5 ...
##  $ age      : int  19 23 20 27 33 26 30 19 23 40 ...
##  $ ACT      : int  24 35 21 26 31 28 36 22 22 35 ...
##  $ SATV     : int  500 600 480 550 600 640 610 520 400 730 ...
##  $ SATQ     : int  500 500 470 520 550 640 500 560 600 800 ...
##   gender education age ACT SATV SATQ
## 1   Male         3  19  24  500  500
## 2   Male         3  23  35  600  500
## 3   Male         3  20  21  480  470
## 4 Female         4  27  26  550  520
## 5 Female         2  33  31  600  550
## 6 Female         5  26  28  640  640

Creating a base ggplot object

If we ran blindly into creating the above graph we might want to run a command that says:

Plot my data please!!

Okay, then… here we go:

gg_object <- ggplot(data = sat.act)
plot(gg_object)

…womp.

What happened here is that we set up a ggplot object but have not yet told it what to do with the data. The data is sitting there, but R does not know how you want it displayed, so it kicks back a blank screen. We can think of this as our plotting “canvas”.
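
One way to confirm that the canvas really is empty is to inspect the object itself. The following is a minimal sketch, assuming ggplot2 is installed; `sat_head` and `canvas` are hypothetical names, and `sat_head` is a small stand-in data frame typed in from the `head(sat.act)` output above rather than the full data set.

```r
library(ggplot2)

# Stand-in data (assumption): the six rows printed by head(sat.act) above
sat_head <- data.frame(
  gender    = factor(c("Male", "Male", "Male", "Female", "Female", "Female")),
  education = c(3, 3, 3, 4, 2, 5),
  ACT       = c(24, 35, 21, 26, 31, 28),
  SATV      = c(500, 600, 480, 550, 600, 640)
)

# A ggplot object with data but nothing else
canvas <- ggplot(data = sat_head)

length(canvas$layers)   # 0: no geoms have been added yet
length(canvas$mapping)  # 0: no variables have been mapped to aesthetics yet
```

With no layers and no mappings, there is nothing for ggplot2 to draw, which is why the plot comes back blank.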

Oh right, let’s run a command that says:

Plot my data please, and this time include ACT score on the x axis and
  verbal SAT scores on the y axis!! Also, let's visually separate the 
  data for males and females.


In this case, we can simply add aesthetic mappings for the x and y axes so that the gg_object knows which variables from the data.frame to plot on x and y, and that males and females should be colored differently:

#don't worry about the difference between color and fill for now
gg_object_aes <- ggplot(data = sat.act, 
                        aes(x = ACT, y = SATV, color = gender, fill = gender))
plot(gg_object_aes)



Now we have x and y axes that correspond to the variables specified in the aes() argument. This is what many would consider the base of a ggplot object, which we can now layer on top of. We still can’t see our data, however. In order to do so, we need to add a geom_ layer to the plot.
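
The mappings passed to aes() are stored on the object, so they can be checked before any layer is added. This is a minimal sketch under the same assumptions as before: `sat_head` is a hypothetical stand-in built from the `head(sat.act)` rows shown earlier, and `base_plot` is an illustrative name.

```r
library(ggplot2)

# Stand-in data (assumption): the six rows printed by head(sat.act) above
sat_head <- data.frame(
  gender    = factor(c("Male", "Male", "Male", "Female", "Female", "Female")),
  education = c(3, 3, 3, 4, 2, 5),
  ACT       = c(24, 35, 21, 26, 31, 28),
  SATV      = c(500, 600, 480, 550, 600, 640)
)

base_plot <- ggplot(data = sat_head,
                    aes(x = ACT, y = SATV, color = gender, fill = gender))

# The aesthetics that every later layer will inherit by default.
# Note that ggplot2 stores "color" under the spelling "colour".
names(base_plot$mapping)

# Still no layers, which is why no data points are drawn yet
length(base_plot$layers)
```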

Geometric layers

To actually see our data in geometric space, we need to tell ggplot how to visualize it. You can look through the online documentation for what might work best. In basic regression, one of the best ways to look at the dependence of one variable on another is often a scatter plot with a regression line fit to the data.

We can pass this to R by adding the geom_point() and geom_smooth() commands.

# no need to worry about the viridis calls for now
gg_object_aes <- gg_object_aes +
  geom_point() +
  geom_smooth(method = lm) +
  scale_color_viridis_d(begin = .3, end = 1) +
  scale_fill_viridis_d(begin = .3)
plot(gg_object_aes)
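
To make the stacking concrete, the number of layers on the object can be counted as geoms are added. This is a minimal sketch, not the code used for the plot above: `sat_head` is a hypothetical stand-in built from the `head(sat.act)` rows shown earlier, and `base_plot` and `with_geoms` are illustrative names.

```r
library(ggplot2)

# Stand-in data (assumption): the six rows printed by head(sat.act) above
sat_head <- data.frame(
  gender    = factor(c("Male", "Male", "Male", "Female", "Female", "Female")),
  education = c(3, 3, 3, 4, 2, 5),
  ACT       = c(24, 35, 21, 26, 31, 28),
  SATV      = c(500, 600, 480, 550, 600, 640)
)

base_plot <- ggplot(sat_head, aes(x = ACT, y = SATV, color = gender, fill = gender))

# Each geom_*() call adds one layer; the scale_*() calls change how
# colors are chosen but do not add layers of their own
with_geoms <- base_plot +
  geom_point() +
  geom_smooth(method = "lm") +
  scale_color_viridis_d(begin = .3, end = 1) +
  scale_fill_viridis_d(begin = .3)

length(base_plot$layers)   # 0: the base object has no layers
length(with_geoms$layers)  # 2: points, then the fitted line
```

The order of the geoms matters for drawing: later layers are painted on top of earlier ones.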

Facets

The facet_wrap() and facet_grid() functions can split the data further into additional panels on the basis of a factor (categorical variable). This will yield a set of ‘small multiple’ plots in which each panel represents the same graphical idiom, but with data from a different level of the faceting variable.

For example, if we wanted different panels in our scatter plots to correspond to different levels of education, we could add:

gg_object_aes <- gg_object_aes + facet_wrap(~education)
plot(gg_object_aes)
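
The example above uses facet_wrap(); facet_grid() works along the same lines but lays the panels out in rows and columns defined by two variables. Here is a minimal sketch, again using a hypothetical stand-in data frame (`sat_head`) built from the `head(sat.act)` rows shown earlier; the choice of gender for rows and education for columns is only for illustration.

```r
library(ggplot2)

# Stand-in data (assumption): the six rows printed by head(sat.act) above
sat_head <- data.frame(
  gender    = factor(c("Male", "Male", "Male", "Female", "Female", "Female")),
  education = c(3, 3, 3, 4, 2, 5),
  ACT       = c(24, 35, 21, 26, 31, 28),
  SATV      = c(500, 600, 480, 550, 600, 640)
)

# Rows of panels come from the left of the ~, columns from the right
p_grid <- ggplot(sat_head, aes(x = ACT, y = SATV)) +
  geom_point() +
  facet_grid(gender ~ education)

plot(p_grid)
```

facet_wrap() tends to suit a single faceting variable with many levels, while facet_grid() suits crossing two variables.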

Themes

To add a different aesthetic touch (for example, the yellow is kind of tough to see on the white background), there are different “themes” built into ggplot2. One that I often use is theme_bw(), added as gg_object + theme_bw(), which gives a simple black-and-white background with reasonable tick marks. Given the yellow-on-white problem, though, let’s plot this using a dark background.

N.B. This is not the only way to solve this problem; it is just as easy to change the colors of your data points.

gg_object_aes <- gg_object_aes + theme_dark()
plot(gg_object_aes)
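
The alternative mentioned in the N.B., changing the colors of the points instead of the background, could look something like the sketch below. It uses scale_color_manual() with two arbitrarily chosen colors on a light theme; `sat_head` is a hypothetical stand-in built from the `head(sat.act)` rows shown earlier, and the specific colors are an assumption, not the author’s choice.

```r
library(ggplot2)

# Stand-in data (assumption): the six rows printed by head(sat.act) above
sat_head <- data.frame(
  gender    = factor(c("Male", "Male", "Male", "Female", "Female", "Female")),
  education = c(3, 3, 3, 4, 2, 5),
  ACT       = c(24, 35, 21, 26, 31, 28),
  SATV      = c(500, 600, 480, 550, 600, 640)
)

# Pick colors by hand so both groups stay readable on a white background;
# the names of the vector must match the levels of gender
p_manual <- ggplot(sat_head, aes(x = ACT, y = SATV, color = gender)) +
  geom_point() +
  scale_color_manual(values = c(Female = "darkorange", Male = "steelblue")) +
  theme_bw()

plot(p_manual)
```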

Labels

We’ll finish up by changing the labels of the x and y axes and the main title using the labs() function.

gg_object_aes <- gg_object_aes + 
  labs(x = "ACT scores", y = "SAT Verbal", 
       title = "This is a pretty plot", subtitle =  "And also I'm done talking"
  )

plot(gg_object_aes)

And we’re back to where we started, yet we’ve constructed this plot layer by layer, and hopefully we have a bit more of an understanding of what the ggplot2 package is capable of.



To bring things full circle, there is also the option to throw this all into one command from the beginning:

gg_verbal_object <- ggplot(data = sat.act, aes(x = ACT, y = SATV, color = gender, fill = gender)) + 
  geom_point() + scale_color_viridis_d(begin = .3, end = 1) + 
  scale_fill_viridis_d(begin = .3) + geom_smooth(method = lm) + 
  facet_wrap(~education) + theme_dark() + 
  labs(x = "ACT scores", y = "SAT Verbal", title = "This is a pretty plot")
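
Note that a command like the one above only stores the plot in an object; nothing is drawn until the object is printed or plotted. A minimal sketch of that distinction, using a hypothetical stand-in data frame (`sat_head`) built from the `head(sat.act)` rows shown earlier and a deliberately simplified plot (`p_stored` is an illustrative name):

```r
library(ggplot2)

# Stand-in data (assumption): the six rows printed by head(sat.act) above
sat_head <- data.frame(
  gender    = factor(c("Male", "Male", "Male", "Female", "Female", "Female")),
  education = c(3, 3, 3, 4, 2, 5),
  ACT       = c(24, 35, 21, 26, 31, 28),
  SATV      = c(500, 600, 480, 550, 600, 640)
)

# Assignment builds the plot object but does not draw anything
p_stored <- ggplot(sat_head, aes(x = ACT, y = SATV, color = gender)) +
  geom_point() +
  labs(x = "ACT scores", y = "SAT Verbal")

# Either of these draws it; typing the object name at the console also works
print(p_stored)
plot(p_stored)
```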

There is plenty more that I haven’t included, but here are some resources for you to consult as you embark on your journey into the soul of ggplot!

Useful Resources

Listed in order from the easiest to use to the most conceptual:

ggplot cheatsheet

R Graphics Cookbook (Chang)

R for Data Science Book (Wickham)

A Layered Grammar of Graphics (Wickham)