Published

January 18, 2025

Modified

January 13, 2025

Plotting with ggplot2

About

This section describes some ways to generate plots using the R package ggplot2.

Why R

R is a programming language specialized for data analysis and visualization.

Background

‘ggplot2’ is a package1 for making 2D plots in R. It implements a special “language” for making plots based on the grammar of graphics suggested by Cleveland CITE. That’s where the ‘gg’ in the name comes from.

In ggplot, we create a base plot, then add layers to it using a plus sign + operator.

Set-up

To check whether {ggplot2} is already installed, run the following chunk:

Code
# The require() function returns TRUE if ggplot2 is installed and FALSE if it is not. The exclamation point symbol ('!') turns FALSE into TRUE and TRUE into FALSE. So, `!require(ggplot2)` will be TRUE if require(ggplot2) is FALSE. When this occurs, the `install.packages(ggplot)` commands runs and installs ggplot2.
if (!require(ggplot2)) {
  install.packages(ggplot2)
}
Loading required package: ggplot2

Plotting one variable

Discrete/nominal

First, we make some data.

Code
data_random_discrete <- data.frame(category = c('ab', 'xy', 'mn', 
                                                'qp', 'ea', 'f2',
                                                'gg', 'h*'),
                                   value = c(4.8, 5.5, 3.5, 
                                           4.6, 6.5, 6.6, 
                                           2.6, 3.0))

Barplot

If the variable we want to plot contains the values we care about plotting, then we use geom_col. A column plot is just a type of barplot.

Code
data_random_discrete |>
  ggplot() +
  aes(x = category, y = value) +
  geom_col()
Figure 1

Here, we have a categorical variable imaginatively named category and a continuous variable named value.

Your turn
  1. What would we need to do to make the category variable ordinal?
  2. Why do we think that value is continuous?

Continuous

First, let’s generate some random data.

Code
# Set a seed for our 'random' number generator
set.seed(19680801)
n_values <- 100000

data_random_norm <- data.frame(val = rnorm(n = n_values, mean = 0, sd = 1))

Histogram

Code
hist_1 <- data_random_norm |>
  ggplot() +
  aes(x = val) +
  geom_histogram()

What the code is saying is this: Send data_random_norm to ggplot; make the plot and its various layers; then give the plot a name (hist_1) so we can use it later. Like now, for instance:

Code
hist_1
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Figure 2: A simple histogram

To replicate the side-by-side histograms in ?@fig-dual-histograms-matplotlib, we do the following:

Make two random data sets that differ slightly.

Code
x1 <- rnorm(n = n_values, mean = 0, sd = 1)
data_random_norm_2 <- data.frame(side = c(rep('l', n_values), rep('r', n_values)),
                            val = c(x1,
                                    0.4 * x1 + 5))

Then plot the data.

Code
hist_2 <- data_random_norm_2 |>
  ggplot() +
  aes(x = val) +
  geom_histogram() +
  facet_wrap(vars(side), ncol = 2)

hist_2
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Figure 3: Side-by-side histograms of two random (normal) sets of data.

Violin

Code
violin_2 <- data_random_norm_2 |>
  ggplot() +
  aes(x = side, y = val) +
  geom_violin()

violin_2
Figure 4: Side-by-side violin plots of two random (normal) sets of data.

Boxplot

Code
boxplot_2 <- data_random_norm_2 |>
  ggplot() +
  aes(x = side, y = val) +
  geom_boxplot()

boxplot_2
Figure 5: Side-by-side violin plots of two random (normal) sets of data.

Comparing distributions

Let’s see how these plots can help us see when the distributions differ by more than just magnitude or standard deviation.

Code
# Normal "bell"-shaped like before
x_norm <- rnorm(n = n_values, mean = 0, sd = 1)

# Uniform-shaped
x_unif <- runif(n = n_values, min = -2.75, max = 2.75)

data_random_2 <- data.frame(side = c(rep('norm', n_values), rep('unif', n_values)),
                            val = c(x_norm, x_unif))
Code
hist_2_diff <- data_random_2 |>
  ggplot() +
  aes(x = val) +
  geom_histogram() +
  facet_wrap(vars(side), ncol = 2)

hist_2_diff
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Figure 6: Side-by-side histograms of two random sets of data with different distributions.
Code
violin_2_diff <- data_random_2 |>
  ggplot() +
  aes(x = side, y = val) +
  geom_violin()

violin_2_diff
Figure 7: Side-by-side violin plots of two random sets of data with different distributions.
Code
boxplot_2_diff <- data_random_2 |>
  ggplot() +
  aes(x = side, y = val) +
  geom_boxplot()

boxplot_2_diff
Figure 8: Side-by-side box plots of two random sets of data with different distributions.

Plotting two variables

Warning

This page is under construction. Many components are missing.

Scatterplot

A scatterplot is a great way to compare two continuous variables.

Code
scatter_data <- tibble::tibble(norm = x_norm,
                               unif = x_unif)

scatter_data |>
  ggplot() +
  aes(x = x_norm, y = x_unif) +
  geom_point() 

Footnotes

  1. A collection of special commands for doing specific things.↩︎