Plotting with ggplot2

Published

May 22, 2025

Modified

March 24, 2025

About

This section describes some ways to generate plots using the R package ggplot2.

Why R

R is a programming language specialized for data analysis and visualization.

Background

ggplot2 is a package¹ for making 2D plots in R. It implements a special “language” for making plots based on the grammar of graphics suggested by Wilkinson, Wills, Rope, Norton, & Dubbs (2005). That’s where the ‘gg’ in the name comes from.

In ggplot2, we create a base plot, then add layers to it using the plus (+) operator.

Set-up

To check whether ggplot2 is already installed, and to install it if it is not, run the following chunk:

Code

# require() tries to load ggplot2. It returns TRUE if the package loads
# and FALSE if it is not installed.
# The exclamation point ('!') turns FALSE into TRUE and TRUE into FALSE,
# so `!require(ggplot2)` is TRUE only when ggplot2 could not be loaded.
# In that case, install.packages("ggplot2") installs it and library() loads it.
if (!require(ggplot2)) {
  install.packages("ggplot2")  # the package name must be a quoted string
  library(ggplot2)
}

Loading required package: ggplot2
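If you only want to know whether ggplot2 is available, without attaching it or installing anything, a check along these lines is one option. This is a minimal sketch; the variable name `has_ggplot2` is just illustrative.

```r
# requireNamespace() reports whether a package can be loaded,
# without attaching it to the search path.
has_ggplot2 <- requireNamespace("ggplot2", quietly = TRUE)

if (has_ggplot2) {
  message("ggplot2 version: ", as.character(packageVersion("ggplot2")))
} else {
  message('ggplot2 is not installed; try install.packages("ggplot2")')
}
```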

Plotting one variable

Discrete/nominal

First, we make some data. We store two variables, category and value, using the data.frame() function. A data frame is a convenient, spreadsheet-like structure for storing data.

Code

data_random_discrete <- data.frame(category = c('ab', 'xy', 'mn', 
                                                'qp', 'ea', 'f2',
                                                'gg', 'h*'),
                                   value = c(4.8, 5.5, 3.5, 
                                           4.6, 6.5, 6.6, 
                                           2.6, 3.0))

Here we create a data frame and give it the name data_random_discrete. You can choose any name you want, subject to R’s restrictions on naming things. I try to use descriptive names.

The data frame name suggests that we have a random, continuous variable (value) and a discrete or categorical variable (category). The data frame command stores the variable names along with the data. We can see this by using R’s names() command:
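The names() call mentioned above would look something like the following sketch. The data frame is recreated inside the block only so that the example runs on its own, and `column_names` is an illustrative name.

```r
# Recreated here so the example is self-contained.
data_random_discrete <- data.frame(
  category = c('ab', 'xy', 'mn', 'qp', 'ea', 'f2', 'gg', 'h*'),
  value = c(4.8, 5.5, 3.5, 4.6, 6.5, 6.6, 2.6, 3.0)
)

# names() returns the column (variable) names of a data frame.
column_names <- names(data_random_discrete)
print(column_names)
# [1] "category" "value"
```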

Sometimes it’s useful to look at the structure of something using the str() command.

Code

str(data_random_discrete)

'data.frame':   8 obs. of  2 variables:
 $ category: chr  "ab" "xy" "mn" "qp" ...
 $ value   : num  4.8 5.5 3.5 4.6 6.5 6.6 2.6 3

Now that we have some data, let’s plot it.

Barplot

The following code makes a barplot. I’ll explain what’s going on after we look at the plot.

Code

data_random_discrete |>
  ggplot() +
  aes(x = category, y = value) +
  geom_col()

Figure 1: A bar plot of our random data.

This code sends the data frame (data_random_discrete) to the ggplot() function using R’s pipe operator (|>).

Many ways to say the same thing

As in most (all?) programming languages, or really all languages, there are many ways to say the same thing in R.

I like using the pipe operator |> because it makes the code a little easier for people to read.

The ggplot() function needs to know what data to plot, so this code

data_random_discrete |>
  ggplot()

tells ggplot(), “Hey, ggplot, create a plot with data from data_random_discrete.” An equivalent way to say the same thing is to assign data_random_discrete to ggplot’s data parameter, like this:

ggplot(data = data_random_discrete)

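In fact, R’s parser rewrites the piped code as ggplot(data_random_discrete), passing the data frame as the first argument, which ggplot() matches to its data parameter.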
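One way to see this rewriting for yourself is to capture the piped expression with quote(), which stores code without running it. This is a small sketch that assumes R 4.1 or later (where the native pipe exists); `piped` and `direct` are illustrative names.

```r
# quote() captures code without evaluating it, so neither ggplot2 nor
# the data frame needs to exist for this comparison.
piped <- quote(data_random_discrete |> ggplot())
direct <- quote(ggplot(data_random_discrete))

# The parser has already turned the pipe into an ordinary function call.
print(piped)
# ggplot(data_random_discrete)

identical(piped, direct)
# [1] TRUE
```

The data frame lands in the first argument position, which is where ggplot() expects its data.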

Now, running ggplot() alone like this doesn’t actually create a plot we can see. We have to tell ggplot() what variables to plot and what kind of plot to create. In the graphics grammar that ggplot() understands, we add information to the plot using the plus (+) operator.

We tell ggplot() which variables to use by using the aesthetics aes() command.

ggplot(data = data_random_discrete) +
  aes(x = category, y = value)

Then, we add a graphic object or ‘geom’ to the plot.

ggplot(data = data_random_discrete) +
  aes(x = category, y = value) +
  geom_col()

There are different types of geoms for different types of plots. You can think of each geom as a way to map data to graphic elements.

If the data frame already contains the values we want the bar heights to show, as value does here, then we use geom_col(). Its sibling, geom_bar(), instead counts the rows in each category. A column plot is just a type of barplot.
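To make the difference between counting rows and plotting stored values concrete, here is a minimal sketch. The two small data frames (`obs` and `totals`) are hypothetical and are not part of this page's data; layer_data() is used only to inspect the bar heights each layer computes.

```r
library(ggplot2)

# Hypothetical raw observations: one row per observation, no value column.
obs <- data.frame(category = c("a", "a", "b", "a", "b", "c"))

# Hypothetical pre-computed totals: one row per category.
totals <- data.frame(category = c("a", "b", "c"), value = c(3, 2, 1))

# geom_bar() counts the rows in each category to get the bar heights.
p_bar <- ggplot(obs) + aes(x = category) + geom_bar()

# geom_col() uses the y values exactly as they appear in the data.
p_col <- ggplot(totals) + aes(x = category, y = value) + geom_col()

# layer_data() returns the data a layer actually draws.
bar_heights <- layer_data(p_bar)$count
col_heights <- layer_data(p_col)$y
```

Both plots end up with bars of height 3, 2, and 1, but they get there from differently shaped data.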

Here, we have a categorical variable imaginatively named category and a continuous variable named value.

See how the plot maps our category variable to the X axis and the value variable to the Y axis?

That’s the essence of making plots using ggplot(). We give ggplot() a data frame. Tell it which variables to assign to which aesthetics (properties of the plot). Add one or more geoms to make something we can see.

To break this down even further, the following code shows a way to make a plot step-by-step. This can be useful when you’re trying to figure out how you want to plot something.

We’ll explain the code first then run it.

p <- ggplot(data = data_random_discrete)  # (1)
p                                         # (2)

1: Make a ggplot with data_random_discrete as the data.frame; give it the name p.
2: Show p.

Code

p <- ggplot(data = data_random_discrete)
p

Figure 2: A ‘bare bones’ plot with data, but no aesthetics or graphics.

Notice that the plot is blank.

The blank plot isn’t very useful, so we have to add information to it about what variables we want to map to which aesthetic properties of the plot. The following code shows how this is done.

p_aes <- p + aes(x = category, y = value)  # (1)
p_aes                                      # (2)

1: “Add” aesthetics to p that assign the values in category to the x axis and values in value to the y axis. Give the combination the name p_aes.
2: Show p_aes.

The following results when we run the code:

Code

# Add aesthetics
p_aes <- p + aes(x = category, y = value)
p_aes

Figure 3: A plot with aesthetics mapped to the X and Y axes but no other graphic elements.

Now the plot has axes, but no data points. We need to add a geom. Geoms are mappings between data and visual graphic elements.

Let’s explain how to do that, then show the result.

p_col <- p_aes + geom_col()  # (1)
p_col                        # (2)

1: Add to p_aes a geom called geom_col(), a column or bar plot. Assign the combination to a new variable called p_col.
2: Show p_col.

Code

p_col <- p_aes + geom_col()
p_col

Figure 4: A complete plot with data, aesthetics, and a geom.

That looks good.

Let’s add some color to the bars. We do this by mapping the category variable to the fill property.

Code

p_colors <- p_col + aes(fill = category)
p_colors

Figure 5: Adding an aesthetic based on ‘category’ to fill the bars with color.
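Mapping a variable to fill inside aes() is different from setting one fixed color for every bar. The sketch below contrasts the two; the data frame is recreated so the block runs on its own, and "steelblue" is just an arbitrary example color.

```r
library(ggplot2)

# Recreated here so the example is self-contained.
data_random_discrete <- data.frame(
  category = c('ab', 'xy', 'mn', 'qp', 'ea', 'f2', 'gg', 'h*'),
  value = c(4.8, 5.5, 3.5, 4.6, 6.5, 6.6, 2.6, 3.0)
)

# Mapped: the fill depends on the data, so it goes inside aes().
# Each category gets its own color and the plot gains a legend.
p_mapped <- ggplot(data_random_discrete) +
  aes(x = category, y = value, fill = category) +
  geom_col()

# Constant: one fixed color for every bar, passed as an argument to the geom.
p_constant <- ggplot(data_random_discrete) +
  aes(x = category, y = value) +
  geom_col(fill = "steelblue")

p_mapped
p_constant
```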

We can wrap all of this up in a nice little sequence of code as follows:

Code

data_random_discrete |>
  ggplot() +
  aes(x = category, y = value, fill = category) +
  geom_col()

Figure 6: The same figure generated from a short sequence of commands.

Note that the categories on the horizontal axis appear in alphabetical order. Sometimes that might be what we want. But what if we want to follow best practice and sort the bars by their values?

There are several ways to do this (see https://r-graph-gallery.com/267-reorder-a-variable-in-ggplot2.html), but the way shown below uses dplyr’s mutate() to change the ‘category’ variable into a factor whose levels the forcats function fct_reorder() has ordered by the ‘value’ variable. Note that the default is to sort in ascending order.

Code

library(forcats)
library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

Code

data_random_discrete |>
  mutate(category = fct_reorder(category, value)) |>
  ggplot() +
  aes(x = category, y = value, fill = category) +
  geom_col()

Figure 7: The same plot but with the columns ordered by ‘value’.

To reverse the order from ascending (the default) to descending, we wrap value in dplyr’s desc() function so that the bars run from the highest to the lowest value, as in the following:

Code

data_random_discrete |>
  mutate(category = fct_reorder(category, desc(value))) |>
  ggplot() +
  aes(x = category, y = value, fill = category) +
  geom_col()

Figure 8: The same plot but with the columns ordered by ‘value’ in descending order.
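The bar order on the axis follows the order of the factor levels, so it can help to inspect the levels directly. The sketch below does that with plain vectors copied from the data frame above; it also uses the `.desc = TRUE` argument of fct_reorder(), which is an alternative to desc() and not the approach used in the plots above.

```r
library(forcats)

# The same categories and values as in data_random_discrete.
category <- c('ab', 'xy', 'mn', 'qp', 'ea', 'f2', 'gg', 'h*')
value <- c(4.8, 5.5, 3.5, 4.6, 6.5, 6.6, 2.6, 3.0)

# Ascending (the default): levels run from the smallest value to the largest.
levels_ascending <- levels(fct_reorder(category, value))
print(levels_ascending)
# [1] "gg" "h*" "mn" "qp" "ab" "xy" "ea" "f2"

# Descending: the same levels in reverse order.
levels_descending <- levels(fct_reorder(category, value, .desc = TRUE))
print(levels_descending)
# [1] "f2" "ea" "xy" "ab" "qp" "mn" "h*" "gg"
```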

Your turn

1. What would we need to do to make the category variable ordinal?
2. Why do we think that value is continuous?

Continuous

First, let’s generate some random data.

Code

# Set a seed for our 'random' number generator
set.seed(19680801)
n_values <- 100000

data_random_norm <- data.frame(val = rnorm(n = n_values, mean = 0, sd = 1))

Histogram

Code

hist_1 <- data_random_norm |>
  ggplot() +
  aes(x = val) +
  geom_histogram()

What the code is saying is this: Send data_random_norm to ggplot; make the plot and its various layers; then give the plot a name (hist_1) so we can use it later. Like now, for instance:

Code

hist_1

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Figure 9: A simple histogram.

To replicate the side-by-side histograms from the matplotlib version of this example (fig-dual-histograms-matplotlib), we do the following:

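Make two random data sets that differ in mean and spread. The second (‘r’) is a shifted, rescaled copy of the first (‘l’), so it has a mean near 5 and a standard deviation near 0.4.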

Code

x1 <- rnorm(n = n_values, mean = 0, sd = 1)
data_random_norm_2 <- data.frame(side = c(rep('l', n_values), rep('r', n_values)),
                            val = c(x1,
                                    0.4 * x1 + 5))
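Before plotting, a quick numerical summary can confirm how the two sides differ. This is a sketch using base R's tapply(); it sets its own seed so that it runs on its own, so the exact numbers will not match the data plotted on this page. `side_means` and `side_sds` are illustrative names.

```r
set.seed(19680801)
n_values <- 100000

# Recreated here so the example is self-contained.
x1 <- rnorm(n = n_values, mean = 0, sd = 1)
data_random_norm_2 <- data.frame(
  side = c(rep('l', n_values), rep('r', n_values)),
  val = c(x1, 0.4 * x1 + 5)
)

# tapply() applies a function to val separately for each value of side.
side_means <- tapply(data_random_norm_2$val, data_random_norm_2$side, mean)
side_sds <- tapply(data_random_norm_2$val, data_random_norm_2$side, sd)

# Expect means near 0 and 5, and standard deviations near 1 and 0.4.
print(round(side_means, 2))
print(round(side_sds, 2))
```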

Then plot the data.

Code

hist_2 <- data_random_norm_2 |>
  ggplot() +
  aes(x = val) +
  geom_histogram() +
  facet_wrap(vars(side), ncol = 2)

hist_2

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Figure 10: Side-by-side histograms of two random (normal) sets of data.
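The `stat_bin()` message printed with the histograms suggests choosing a bin width rather than relying on the default of 30 bins. A minimal sketch of how that might look follows; the width of 0.25 and the smaller sample size are arbitrary choices for illustration, not values used elsewhere on this page.

```r
library(ggplot2)

set.seed(19680801)
# A smaller sample than above, just to keep the example quick.
data_random_norm <- data.frame(val = rnorm(n = 10000, mean = 0, sd = 1))

# binwidth sets the width of each bar in the units of the x variable.
hist_binwidth <- data_random_norm |>
  ggplot() +
  aes(x = val) +
  geom_histogram(binwidth = 0.25)

hist_binwidth
```

Setting `binwidth` (or, alternatively, `bins`) explicitly also silences the message.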

Violin

Code

violin_2 <- data_random_norm_2 |>
  ggplot() +
  aes(x = side, y = val) +
  geom_violin()

violin_2

Figure 11: Side-by-side violin plots of two random (normal) sets of data.

Boxplot

Code

boxplot_2 <- data_random_norm_2 |>
  ggplot() +
  aes(x = side, y = val) +
  geom_boxplot()

boxplot_2

Figure 12: Side-by-side box plots of two random (normal) sets of data.

Comparing distributions

Let’s see how these plots can help us see when the distributions differ in shape, not just in mean or standard deviation.

Code

# Normal "bell"-shaped like before
x_norm <- rnorm(n = n_values, mean = 0, sd = 1)

# Uniform-shaped
x_unif <- runif(n = n_values, min = -2.75, max = 2.75)

data_random_2 <- data.frame(side = c(rep('norm', n_values), rep('unif', n_values)),
                            val = c(x_norm, x_unif))

Code

hist_2_diff <- data_random_2 |>
  ggplot() +
  aes(x = val) +
  geom_histogram() +
  facet_wrap(vars(side), ncol = 2)

hist_2_diff

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Figure 13: Side-by-side histograms of two random sets of data with **different** distributions.

Code

violin_2_diff <- data_random_2 |>
  ggplot() +
  aes(x = side, y = val) +
  geom_violin()

violin_2_diff

Figure 14: Side-by-side violin plots of two random sets of data with **different** distributions.

Code

boxplot_2_diff <- data_random_2 |>
  ggplot() +
  aes(x = side, y = val) +
  geom_boxplot()

boxplot_2_diff

Figure 15: Side-by-side box plots of two random sets of data with **different** distributions.
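Another option, not used above, is to overlay the two distributions in a single panel with geom_density(), which can make differences in shape easier to compare directly. The following is a sketch under that assumption; it regenerates similar data with a smaller sample so that it runs on its own, and the transparency value of 0.4 is arbitrary.

```r
library(ggplot2)

set.seed(19680801)
n_values <- 10000

# Similar data to data_random_2 above, regenerated so the block is self-contained.
data_random_2 <- data.frame(
  side = c(rep('norm', n_values), rep('unif', n_values)),
  val = c(rnorm(n = n_values, mean = 0, sd = 1),
          runif(n = n_values, min = -2.75, max = 2.75))
)

# Mapping side to fill draws one density curve per distribution;
# alpha makes the curves semi-transparent so both stay visible.
density_2_diff <- data_random_2 |>
  ggplot() +
  aes(x = val, fill = side) +
  geom_density(alpha = 0.4)

density_2_diff
```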

Plotting two variables

Warning

This page is under construction. Many components are missing.

Scatterplot

A scatterplot is a great way to compare two continuous variables.

Code

scatter_data <- tibble::tibble(norm = x_norm,
                               unif = x_unif)

scatter_data |>
  ggplot() +
  aes(x = norm, y = unif) +
  geom_point()
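With this many points, a scatterplot can turn into a solid block of ink. One common remedy is to make the points semi-transparent with the alpha argument. This is a sketch, not part of the original page: it uses a base data.frame instead of a tibble, a smaller sample, and an arbitrary alpha of 0.1.

```r
library(ggplot2)

set.seed(19680801)
n_values <- 10000

# Columns named as in scatter_data above, regenerated so the block runs alone.
scatter_data <- data.frame(
  norm = rnorm(n = n_values, mean = 0, sd = 1),
  unif = runif(n = n_values, min = -2.75, max = 2.75)
)

# alpha = 0.1 means roughly ten overlapping points are needed
# to reach a fully opaque color.
scatter_alpha <- scatter_data |>
  ggplot() +
  aes(x = norm, y = unif) +
  geom_point(alpha = 0.1)

scatter_alpha
```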

References

Wilkinson, L., Wills, D., Rope, D., Norton, A., & Dubbs, R. (2005). The grammar of graphics (statistics and computing) (2nd edition). Springer. Retrieved from https://www.amazon.com/Grammar-Graphics-Statistics-Computing/dp/0387245448

Footnotes

1. A collection of special commands for doing specific things.
