Goals of this markdown

This markdown provides an introduction to data visualization in R. It primarily covers ggplot2, although a few advanced options are covered here or in the supplementary materials. Questions about the code can be directed to Alicia Vallorani (auv27 at psu.edu).

What is ggplot()?

The ggplot() function sets the foundation for building any kind of plot. We need to pass ggplot() two main pieces of information: the dataset, and (inside aes()) the columns in the dataset that we want to plot.
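As a minimal sketch of that two-part call (assuming ggplot2 and psych are installed, and loading sat.act directly from psych rather than through this tutorial's loading script), the dataset is the first argument and the columns are named inside aes(). On its own, ggplot() only sets up the axes; a geom_*() layer is what draws the data.

```r
library(ggplot2)

# Assumption: sat.act is loaded straight from the psych package
data(sat.act, package = "psych")

# Dataset first, then the columns to plot inside aes()
p <- ggplot(sat.act, aes(x = age, y = ACT))

p                 # an empty canvas: axes for age and ACT, but no data drawn
p + geom_point()  # adding a geom layer draws one point per subject
```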

psych::sat.act dataset

Here, we can see that there’s one column per variable and one row per subject. This is how we want the data to be set up for plotting in ggplot(). This way, we can easily specify the variables we are interested in, and we know that we have a unique value for each subject. Here’s an example with the age and ACT variables.

# Load data and convert sex and education to factors
source(paste0(params$path_2_scripts, "load_sat_act.R"))

# Print the first five rows in the dataset
head(sat.act, n=5)
##          sex education age ACT SATV SATQ
## 29442 female         3  19  24  500  500
## 29457 female         3  23  35  600  500
## 29498 female         3  20  21  480  470
## 29503   male         4  27  26  550  520
## 29504   male         2  33  31  600  550
# Examine the structure of the dataset
str(sat.act)
## 'data.frame':    700 obs. of  6 variables:
##  $ sex      : Factor w/ 2 levels "male","female": 2 2 2 1 1 1 2 1 2 2 ...
##  $ education: Factor w/ 6 levels "0","1","2","3",..: 4 4 4 5 3 6 6 4 5 6 ...
##  $ age      : int  19 23 20 27 33 26 30 19 23 40 ...
##  $ ACT      : int  24 35 21 26 31 28 36 22 22 35 ...
##  $ SATV     : int  500 600 480 550 600 640 610 520 400 730 ...
##  $ SATQ     : int  500 500 470 520 550 640 500 560 600 800 ...
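The loading script load_sat_act.R is not shown here. A hypothetical equivalent, inferred from the str() output above and assuming psych codes sex as 1 = male and 2 = female, might look like the following. The package list is likewise an assumption based on the functions used later.

```r
# Hypothetical stand-in for load_sat_act.R (the real script is not shown)
library(ggplot2)   # plotting
library(dplyr)     # %>% and select()
library(tidyr)     # gather()

# Load the dataset that ships with the psych package
data(sat.act, package = "psych")

# Convert the numeric codes to factors
# Assumption: psych codes sex as 1 = male, 2 = female
sat.act$sex <- factor(sat.act$sex, levels = c(1, 2), labels = c("male", "female"))
sat.act$education <- factor(sat.act$education)  # levels "0" through "5"

str(sat.act)
```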

Examine univariate distributions - histograms

First, let’s examine the distributions of the variables included in the dataset. We can build histograms that look at a single variable or multiple variables simultaneously.

# Looking at a histogram for a single variable
ggplot(sat.act, aes(ACT)) +
  geom_histogram(bins = 20) # you can change the bin value to best fit your data

# Looking at histograms for all numeric variables at once
ggplot(sat.act %>% dplyr::select(-sex, -education) %>% # dropping the categorical variables
         gather(), aes(value)) + # reshaping to long format (key/value columns) for faceting
    geom_histogram(bins = 20) + 
    facet_wrap(~key, scales = "free_x") # free_x allows for differing x-axes 
## Warning: Removed 13 rows containing non-finite values (stat_bin).

# 13 rows containing non-finite values = NA values in SATQ
# warnings =/= errors; your code will still run if you get a warning, it just
# lets you know that there may be an issue that you want to consider

Examining zero-order relations - scatterplots

This section walks through how to make a simple scatterplot between two variables. Additionally, you can add a fit line and look at how scatterplots may vary across groups.

# Simple descriptive scatter plot
ggplot(sat.act, aes(age, ACT)) +
  # geom_ allows you to select the type of object you would like to comprise the graph
  geom_point() + 
  # You can add axis labels and titles
  ylab("Score") + 
  xlab("Age") +
  ggtitle("ACT scores by age") +
  # You can set different themes to alter the general appearance of your graphs (see the Setting a theme section)
  theme_classic()

We can also add more information, like how education factors into the distribution of scores.

# Descriptive scatterplot with additional element
scatter <- ggplot(sat.act, aes(age, ACT, color=education)) +
  geom_point() +
  ylab("Score") +
  xlab("Age") +
  labs(color="Education level") +
  ggtitle("ACT scores by age and education level") +
  theme_classic()

# If you have saved your graph into an object, as above, you can call the object to view
scatter

Saving your graph to an object allows you to easily add elements. For example, we can add a regression line.

scatter <- scatter + geom_smooth(method="lm", se=FALSE, color="gray50")
scatter
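
Adding a layer with + returns a new plot object and leaves the original untouched, which is why the result is assigned back to scatter. A small sketch of the same idea, using the illustrative names base_plot and with_line and loading sat.act from psych:

```r
library(ggplot2)
data(sat.act, package = "psych")
sat.act$education <- factor(sat.act$education)

base_plot <- ggplot(sat.act, aes(age, ACT, color = education)) +
  geom_point()

# `+` builds a new object; base_plot itself is not modified
with_line <- base_plot +
  geom_smooth(method = "lm", se = FALSE, color = "gray50")

length(base_plot$layers)  # layers in the original plot
length(with_line$layers)  # layers after adding the fit line
```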

In addition to examining relations two variables at a time (as above), the GGally package provides ggpairs(), which allows us to visualize multiple pairwise relations simultaneously.

ggpairs(sat.act %>% na.omit(), progress=FALSE, 
        lower = list(combo = wrap("facethist", bins=6)))

Additional advanced suggestions for scatterplots can be found in the supplemental materials.

Examining group differences - bar graphs

This section walks through a basic bar graph. If you don’t map a variable to the y axis, geom_bar() counts the rows in each category for you.

# Bar graph with counts
ggplot(sat.act, aes(education)) +
  geom_bar() +
  ylab("Number of subjects") +
  xlab("Education level") +
  ggtitle("Count of subjects at each education level") +
  theme_classic()

If you want a summary other than a count, map the column you want to summarize to the y axis and name the summary function.

# Bar graph with means
ggplot(sat.act, aes(education, ACT)) +
  geom_bar(stat="summary", fun="mean") + # fun replaces the deprecated fun.y (ggplot2 >= 3.3.0)
  ylab("Average score") +
  xlab("Education level") +
  ggtitle("Average ACT scores at each education level") +
  theme_classic()
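
The bar heights can be checked against a direct calculation. As a sketch (loading sat.act from psych and converting education to a factor), tapply() returns the mean ACT score per education level:

```r
data(sat.act, package = "psych")
sat.act$education <- factor(sat.act$education)

# Mean ACT score within each education level; these are the bar heights
act_means <- tapply(sat.act$ACT, sat.act$education, mean)
round(act_means, 2)
```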

We can also include error bars using the summarySE() function from the Rmisc package.

# Summary data for error bars
sat.act.sum <- summarySE(sat.act, measurevar="ACT", groupvars=c("education")) 
sat.act.sum
##   education   N      ACT       sd        se        ci
## 1         0  57 27.47368 5.206813 0.6896592 1.3815535
## 2         1  45 27.48889 6.055134 0.9026461 1.8191636
## 3         2  44 26.97727 5.808929 0.8757290 1.7660759
## 4         3 275 28.29455 4.846227 0.2922385 0.5753181
## 5         4 138 29.26087 4.345153 0.3698840 0.7314202
## 6         5 141 29.60284 3.954887 0.3330616 0.6584807
# We use the summary we created to plot; the means are already computed,
# so geom_col() draws them as-is
ggplot(sat.act.sum, aes(education, ACT)) + 
  geom_col() +
  ylab("Average score") +
  xlab("Education level") +
  ggtitle("Average ACT scores at each education level") +
  geom_errorbar(aes(ymin=ACT-se, ymax=ACT+se), 
                width=.2, position=position_dodge(.9)) +
  theme_classic()
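
If Rmisc is not available, a similar summary can be built with dplyr. This is a sketch rather than a drop-in replacement for summarySE(): it assumes no missing ACT values, computes the standard error as sd / sqrt(N), and takes the 95% confidence half-width from the t distribution. The names act_summary and ACT_mean are illustrative.

```r
library(dplyr)
data(sat.act, package = "psych")
sat.act$education <- factor(sat.act$education)

act_summary <- sat.act %>%
  group_by(education) %>%
  summarise(
    N        = n(),                        # group size
    ACT_mean = mean(ACT),                  # group mean
    sd       = sd(ACT),                    # standard deviation
    se       = sd / sqrt(N),               # standard error of the mean
    ci       = se * qt(0.975, df = N - 1)  # half-width of the 95% CI
  )
act_summary
```

To plot from this table, map ACT_mean to the y axis and use ACT_mean - se and ACT_mean + se for the error bars.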

Additional advanced suggestions for bar graphs can be found in the supplemental materials.

Setting a theme

You can set a default theme at any point. From then on, every graph that does not add its own theme_*() will use the theme supplied.

theme_set(theme_minimal()) # sets the theme for all graphs
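
theme_set() invisibly returns the theme that was active before the change, so the previous default can be restored later. A minimal sketch, with old_theme as an illustrative name:

```r
library(ggplot2)

# Switch the default theme and keep a copy of the previous one
old_theme <- theme_set(theme_minimal())

# ... plots created here use theme_minimal() unless they add their own theme ...

# Restore the previous default when done
theme_set(old_theme)
```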

Aesthetics (aes)

Aesthetic mappings describe how variables in the data are mapped to visual properties (aesthetics) of visualizations.

What goes inside aes() and what goes outside?
The dataset goes outside aes().
Mapping a variable in your data to an aesthetic goes inside aes() (e.g. having the points’ color vary based on a variable in the data).
Setting an aesthetic to a single fixed value goes outside aes() (e.g. making all the points red with color = "red"). Inside aes(), "red" would be treated as a data value rather than as a color.

When do aesthetic mappings go inside ggplot() vs inside geom_*()?
If you want the aesthetic mapping to apply to all the geoms, put it inside ggplot(). If you want it to apply only to a single geom, put it inside geom_*().
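
The four cases above can be sketched side by side. This assumes sat.act is loaded from psych with sex converted to a factor; the object names are illustrative, and typing any of them prints the plot.

```r
library(ggplot2)
data(sat.act, package = "psych")
sat.act$sex <- factor(sat.act$sex, levels = c(1, 2), labels = c("male", "female"))

# Mapping inside aes(): point color varies with the sex column
p_mapped <- ggplot(sat.act, aes(SATV, SATQ)) +
  geom_point(aes(color = sex))

# Setting outside aes(): every point is red
p_set <- ggplot(sat.act, aes(SATV, SATQ)) +
  geom_point(color = "red")

# Mapping in ggplot(): inherited by both geoms, so one fit line per sex
p_global <- ggplot(sat.act, aes(SATV, SATQ, color = sex)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

# Mapping in geom_point() only: colored points, but a single overall fit line
p_local <- ggplot(sat.act, aes(SATV, SATQ)) +
  geom_point(aes(color = sex)) +
  geom_smooth(method = "lm", se = FALSE)
```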

What aesthetics are there?

Color/fill

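Both change the color: color applies to points, lines, and outlines, while fill applies to the interior of shapes such as bars and density curves. Different geoms use one, the other, or both.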

This can be categorical:

ggplot(sat.act, aes(x = SATV, y = SATQ, color = sex)) + 
  geom_point()
## Warning: Removed 13 rows containing missing values (geom_point).

ggplot(sat.act, aes(x = SATV, fill = sex)) + 
  geom_density(alpha = .5)

or continuous:

ggplot(sat.act, aes(x = SATV, y = SATQ, color = age)) + 
  geom_point() +
  scale_color_continuous(low = "lightblue", high = "darkblue")
## Warning: Removed 13 rows containing missing values (geom_point).

Size

ggplot(sat.act, aes(x = SATV, y = SATQ, size = age)) + 
  geom_point() +
  scale_size_continuous(range = c(.5,3))
## Warning: Removed 13 rows containing missing values (geom_point).

Shape

ggplot(sat.act, aes(x = SATV, y = SATQ, shape = sex)) + 
  geom_point() +
  scale_shape_manual(values = c(17, 19))
## Warning: Removed 13 rows containing missing values (geom_point).

Alpha (transparency)

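This is often used to reduce overplotting: semi-transparent points reveal where observations pile up. Here alpha is set to a single value, so it goes outside aes().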

ggplot(sat.act, aes(x = SATV, y = SATQ)) + 
  geom_point(alpha = .6)
## Warning: Removed 13 rows containing missing values (geom_point).

There are other aesthetic mappings for specific use cases, but these are the most common.

Interactions

Categorical by categorical

Bar graph

ggplot(sat.act, aes(x=education, y=SATV, fill = sex)) +
  geom_bar(stat = "summary", fun = "mean", position = "dodge") + # fun replaces the deprecated fun.y
  labs(fill = "Sex",
       x = "Education",
       y = "Mean SATV")

Does the effect of sex on SATV scores differ by education level? Maybe, for people without a high school education (0), although a visual difference like this should be confirmed with a statistical test. Additional advanced suggestions for bar graphs can be found in the supplemental materials.

Dichotomous by continuous

Scatter plot with best fit lines

ggplot(sat.act, aes(x = SATV, y = SATQ, color = sex)) + 
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) # spell out FALSE; F can be reassigned
## Warning: Removed 13 rows containing non-finite values (stat_smooth).
## Warning: Removed 13 rows containing missing values (geom_point).

Does the effect of SATV on SATQ differ by sex? There is not a large visual difference between the best fit lines for each sex.
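
A visual impression like this can be checked by fitting the corresponding model. As a sketch (assuming sat.act from psych, with sex converted to a factor), the SATV:sexfemale coefficient estimates how much the SATV slope differs for females relative to males:

```r
data(sat.act, package = "psych")
sat.act$sex <- factor(sat.act$sex, levels = c(1, 2), labels = c("male", "female"))

# SATV * sex expands to SATV + sex + SATV:sex
fit_sex <- lm(SATQ ~ SATV * sex, data = sat.act)

# The SATV:sexfemale row tests whether the slopes differ by sex
summary(fit_sex)$coefficients
```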

Continuous by continuous

Scatter plot with lines for specified values of the continuous moderator (by default, the mean and +/- 1 SD). Uses the interactions package.

# First run your model
lm1 <- lm(ACT ~ SATQ * SATV, data = sat.act)

# Provide that model to the interact_plot function
interactions::interact_plot(lm1, pred = SATQ, modx = SATV,
              plot.points = TRUE,
              x.label = "SATQ",
              y.label = "ACT",
              legend.main = "SATV")

Is there an interaction between SATQ and SATV in predicting ACT? It looks like it: in the plot, the relationship between SATQ and ACT becomes stronger as SATV scores increase. Check the interaction term in summary(lm1) before drawing conclusions.
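
The lines in such a plot correspond to simple slopes: in a model with an SATQ by SATV interaction, the slope of SATQ at a given SATV value is the SATQ coefficient plus the interaction coefficient times that SATV value. A sketch of the calculation at the mean and +/- 1 SD of SATV follows. It refits the model on sat.act loaded from psych and uses complete cases, which may differ slightly from how interact_plot() picks its values.

```r
data(sat.act, package = "psych")

# Keep only the rows the model uses (SATQ has missing values)
dat <- na.omit(sat.act[, c("ACT", "SATQ", "SATV")])
lm1 <- lm(ACT ~ SATQ * SATV, data = dat)
b <- coef(lm1)

# Moderator values: mean - 1 SD, mean, mean + 1 SD
satv_values <- mean(dat$SATV) + c(-1, 0, 1) * sd(dat$SATV)
names(satv_values) <- c("-1 SD", "Mean", "+1 SD")

# Simple slope of SATQ at each moderator value
simple_slopes <- b[["SATQ"]] + b[["SATQ:SATV"]] * satv_values
simple_slopes
```

If the slopes increase from "-1 SD" to "+1 SD", that matches a relationship that strengthens as SATV increases.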

Three-way interaction

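Use faceting: each panel shows the two-way pattern (here, SATQ by sex) at one level of the third variable (education).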

plot <- ggplot(sat.act, aes(x=SATQ, y=SATV, color = sex)) +
  geom_point(alpha = .6) +
  stat_smooth(method = "lm", se = FALSE, size = 1.2) + # in ggplot2 >= 3.4.0, use linewidth instead of size for lines
  facet_wrap(vars(education))
plot
## Warning: Removed 13 rows containing non-finite values (stat_smooth).
## Warning: Removed 13 rows containing missing values (geom_point).

Finishing touches

Axis titles, labels and breaks, plot title, plot caption, font sizes, theme, gridlines, axis lines, facet spacing, legend positioning, etc… All the fiddly details.

plot +
  labs(title = "Relationship between SATQ and SATV by sex and education",
       subtitle = "Data from SAPA project",
       caption = "N = 687") + 
  scale_y_continuous(breaks = seq(200, 800, by = 100)) +
  theme_bw() +
  theme(panel.grid.minor.x = element_blank(),    # hide minor gridlines
        panel.grid.minor.y = element_blank(),
        axis.text = element_text(size = 10),
        axis.title = element_text(size = 13),
        legend.text = element_text(size = 11),
        legend.title = element_blank(),          # drop the legend title
        panel.spacing.x = unit(1, "lines"))      # horizontal space between facets
## Warning: Removed 13 rows containing non-finite values (stat_smooth).
## Warning: Removed 13 rows containing missing values (geom_point).

Exporting plots

ggsave() exports plots in a variety of file formats (inferred from the file extension), at the dimensions and resolution you specify.
You can export a saved plot object with the plot argument; if no plot is specified, it exports the last plot displayed.

ggsave("plot.png",
       width = 10, height = 6, dpi = 300) # make a 10 x 6 inch PNG file with 300 DPI
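
To export a specific saved plot rather than the last one displayed, pass it through the plot argument. This sketch writes to a temporary file (the object name p and the file name are illustrative) and uses PDF as one of the formats ggsave() infers from the file extension:

```r
library(ggplot2)
data(sat.act, package = "psych")

p <- ggplot(sat.act, aes(age, ACT)) +
  geom_point()

# Illustrative output path in the session's temporary directory
out_file <- file.path(tempdir(), "act_by_age.pdf")

# plot = names the object to export; width and height are in inches by default
ggsave(out_file, plot = p, width = 10, height = 6)

file.exists(out_file)
```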