This document provides an introduction to data visualization in R. It focuses on ggplot2, although a few advanced options are covered here or in the supplementary materials. Questions about the code can be directed to Alicia Vallorani (auv27 at psu.edu).
The ggplot() function sets the foundation for building any kind of plot. We need to pass ggplot() two main pieces of information: the name of the dataset and the names of the columns in the dataset that we want to plot.
Here, we can see that there’s one column per variable and one row per subject. This is how we want the data to be set up for plotting in ggplot(). This way, we can easily specify the variables we are interested in, and we know that we have a unique value for each subject. Here’s an example with the age and ACT variables.
# Load data and convert sex and education to factors
source(paste0(params$path_2_scripts, "load_sat_act.R"))
# Print the first five rows in the dataset
head(sat.act, n=5)
## sex education age ACT SATV SATQ
## 29442 female 3 19 24 500 500
## 29457 female 3 23 35 600 500
## 29498 female 3 20 21 480 470
## 29503 male 4 27 26 550 520
## 29504 male 2 33 31 600 550
# Examine the structure of the dataset
str(sat.act)
## 'data.frame': 700 obs. of 6 variables:
## $ sex : Factor w/ 2 levels "male","female": 2 2 2 1 1 1 2 1 2 2 ...
## $ education: Factor w/ 6 levels "0","1","2","3",..: 4 4 4 5 3 6 6 4 5 6 ...
## $ age : int 19 23 20 27 33 26 30 19 23 40 ...
## $ ACT : int 24 35 21 26 31 28 36 22 22 35 ...
## $ SATV : int 500 600 480 550 600 640 610 520 400 730 ...
## $ SATQ : int 500 500 470 520 550 640 500 560 600 800 ...
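With the data in this one-row-per-subject shape, the ggplot() call described above can be sketched as follows. This is a minimal illustration that assumes ggplot2 is installed; `toy` is a stand-in data frame built from the five age and ACT values printed by head() above, and the object names `base` and `points` are only illustrative.

```r
library(ggplot2)

# Toy stand-in for sat.act: the age and ACT values from the five rows printed above
toy <- data.frame(
  age = c(19, 23, 20, 27, 33),
  ACT = c(24, 35, 21, 26, 31)
)

# ggplot() alone records the dataset and the x/y columns but draws no marks yet
base <- ggplot(toy, aes(x = age, y = ACT))

# Adding a geom layer tells ggplot2 how to draw the data
points <- base + geom_point()
points
```

Printing `base` on its own gives an empty panel with labeled axes, which is a quick way to confirm that the dataset and columns were picked up before choosing a geom.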
First, let’s examine the distributions of the variables included in the dataset. We can build histograms that look at a single variable or multiple variables simultaneously.
# Looking at a histogram for a single variable
ggplot(sat.act, aes(ACT)) +
geom_histogram(bins = 20) # you can change the bin value to best fit your data
# Histograms for all numeric variables at once
ggplot(sat.act %>% dplyr::select(-sex, -education) %>% # removing categorical variables
gather(), aes(value)) + # reshaping to long format (key/value) for faceting
geom_histogram(bins = 20) +
facet_wrap(~key, scales = "free_x") # free_x allows for differing x-axes
## Warning: Removed 13 rows containing non-finite values (stat_bin).
# 13 rows containing non-finite values = NA values in SATQ
# warnings =/= errors; your code will still run if you get a warning, it just
# lets you know that there may be an issue that you want to consider
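If you want to confirm where a warning like this comes from, counting the missing values per column is one quick check. The sketch below uses base R only; `toy` is a made-up data frame standing in for sat.act, so the counts differ from the 13 reported above.

```r
# Toy stand-in for sat.act with two missing SATQ values (made-up numbers)
toy <- data.frame(
  SATV = c(500, 600, 480, 550, 600),
  SATQ = c(500, NA, 470, NA, 550)
)

# Count missing values in each column
na_counts <- colSums(is.na(toy))
na_counts

# Optionally keep only complete rows before plotting
toy_complete <- na.omit(toy)
nrow(toy_complete)
```

Dropping rows explicitly is a design choice: it silences the warning, but it also removes those subjects from every variable, not only the one with missing values.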
This section walks through how to make a simple scatterplot of two variables. You can also add a fit line and look at how the relationship varies across groups.
# Simple descriptive scatter plot
ggplot(sat.act, aes(age, ACT)) +
# geom_ allows you to select the type of object you would like to comprise the graph
geom_point() +
# You can add axis labels and titles
ylab("Score") +
xlab("Age") +
ggtitle("ACT scores by age") +
# You can set different themes to alter the general appearance of your graphs (more description in the aesthetics section)
theme_classic()
We can also add more information, like how education factors into the distribution of scores.
# Descriptive scatterplot with additional element
scatter <- ggplot(sat.act, aes(age, ACT, color=education)) +
geom_point() +
ylab("Score") +
xlab("Age") +
labs(color="Education level") +
ggtitle("ACT scores by age and education level") +
theme_classic()
# If you have saved your graph into an object, as above, you can call the object to view
scatter
Saving your graph to an object allows you to easily add elements. For example, we can add a regression line.
scatter <- scatter + geom_smooth(method="lm", se=FALSE, color="gray50")
scatter
In addition to considering bivariate relations two variables at a time (like above), the package GGally contains ggpairs() which allows us to visualize multiple relations simultaneously.
ggpairs(sat.act %>% na.omit(), progress=FALSE,
lower = list(combo = wrap("facethist", bins=6)))
Additional advanced suggestions for scatterplots can be found in the supplemental materials.
This section walks through a basic bar graph. If you don’t specify a variable for the y axis, geom_bar() counts the number of rows in each category for you.
# Bar graph with counts
ggplot(sat.act, aes(education)) +
geom_bar() +
ylab("Number of subjects") +
xlab("Education level") +
ggtitle("Count of subjects at each education level") +
theme_classic()
If you want to perform a different summary calculation than a count, you can include the column of data you want and the kind of calculation.
# Bar graph with means
ggplot(sat.act, aes(education, ACT)) +
geom_bar(stat="summary", fun="mean") + # fun replaces fun.y, which is deprecated as of ggplot2 3.3.0
ylab("Average score") +
xlab("Education level") +
ggtitle("Average ACT scores at each education level") +
theme_classic()
We can also include error bars using the summarySE() function from the Rmisc package.
# Summary data for error bars
sat.act.sum <- summarySE(sat.act, measurevar="ACT", groupvars=c("education"))
sat.act.sum
## education N ACT sd se ci
## 1 0 57 27.47368 5.206813 0.6896592 1.3815535
## 2 1 45 27.48889 6.055134 0.9026461 1.8191636
## 3 2 44 26.97727 5.808929 0.8757290 1.7660759
## 4 3 275 28.29455 4.846227 0.2922385 0.5753181
## 5 4 138 29.26087 4.345153 0.3698840 0.7314202
## 6 5 141 29.60284 3.954887 0.3330616 0.6584807
# we use the summary we created to plot
ggplot(sat.act.sum, aes(education, ACT)) +
geom_col() + # the means are already computed, so plot the values as they are
ylab("Average score") +
xlab("Education level") +
ggtitle("Average ACT scores at each education level") +
geom_errorbar(aes(ymin=ACT-se, ymax=ACT+se),
width=.2, position=position_dodge(.9)) +
theme_classic()
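To see what the sd, se and ci columns in the summary represent, the same quantities can be computed by hand. The sketch below uses base R and a made-up `toy` data frame; it assumes the 95% confidence level, which matches the output above (for example, 5.206813 / sqrt(57) ≈ 0.6897 for education level 0). It is an illustration of the arithmetic, not a reimplementation of summarySE().

```r
# Toy stand-in for sat.act (made-up scores for two education levels)
toy <- data.frame(
  education = factor(c("0", "0", "0", "1", "1", "1")),
  ACT = c(20, 24, 28, 30, 32, 34)
)

# Split the scores by group and compute each summary column by hand
by_group <- split(toy$ACT, toy$education)
n    <- sapply(by_group, length)
avg  <- sapply(by_group, mean)
sdev <- sapply(by_group, sd)
se   <- sdev / sqrt(n)               # standard error of the mean
ci   <- se * qt(0.975, df = n - 1)   # half-width of a 95% confidence interval

toy_sum <- data.frame(education = names(by_group), N = n, ACT = avg,
                      sd = sdev, se = se, ci = ci, row.names = NULL)
toy_sum
```

The error bars in the plot above span ACT ± se. Swapping `se` for `ci` in geom_errorbar() would draw 95% confidence intervals instead.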
Additional advanced suggestions for bar graphs can be found in the supplemental materials.
You can set a theme at any point and from that point on all graphs will use the theme supplied.
theme_set(theme_minimal()) # sets the theme for all graphs
Aesthetic mappings describe how variables in the data are mapped to visual properties (aesthetics) of visualizations.
What goes inside aes() and what goes outside?
Data goes outside aes().
Mapping a variable in your data to an aesthetic goes inside aes() (e.g. having the points’ color vary based on a variable in the data).
Setting an aesthetic to a single value can go outside aes() (e.g. making all the points red).
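The difference between mapping and setting can be sketched as follows. This assumes ggplot2 is installed; `toy` is a made-up data frame that borrows column names from sat.act, and `mapped` and `fixed` are illustrative object names.

```r
library(ggplot2)

# Toy stand-in for sat.act (made-up values)
toy <- data.frame(
  SATV = c(500, 600, 480, 550),
  SATQ = c(500, 500, 470, 520),
  sex  = factor(c("female", "female", "male", "male"))
)

# Mapping: color varies with a column in the data, so it goes inside aes()
mapped <- ggplot(toy, aes(x = SATV, y = SATQ)) +
  geom_point(aes(color = sex))

# Setting: every point gets the same fixed color, so it goes outside aes()
fixed <- ggplot(toy, aes(x = SATV, y = SATQ)) +
  geom_point(color = "red")
```

A common slip is writing aes(color = "red"): ggplot2 then treats "red" as a data value, assigns it a default palette color, and adds a legend entry for it.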
When do aesthetic mappings go inside ggplot() vs inside geom_*()?
If you want the aesthetic mapping to apply to all the geoms, put it inside ggplot(). If you want it to apply only to a single geom, put it inside geom_*().
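As a rough illustration of that rule, the two plots below differ only in where the color mapping is placed. The sketch assumes ggplot2 is installed and uses a made-up `toy` data frame; `shared` and `points_only` are illustrative names.

```r
library(ggplot2)

# Toy stand-in for sat.act (made-up values)
toy <- data.frame(
  SATV = c(500, 600, 480, 550, 610, 520),
  SATQ = c(500, 500, 470, 520, 640, 560),
  sex  = factor(c("female", "female", "female", "male", "male", "male"))
)

# Mapping in ggplot(): both the points and the fit lines are split by sex
shared <- ggplot(toy, aes(x = SATV, y = SATQ, color = sex)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

# Mapping in geom_point() only: points are colored by sex,
# but geom_smooth() sees no grouping and draws a single fit line
points_only <- ggplot(toy, aes(x = SATV, y = SATQ)) +
  geom_point(aes(color = sex)) +
  geom_smooth(method = "lm", se = FALSE)
```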
What aesthetics are there?
color and fill: both change the color, but different geoms use one or the other or both.
This can be categorical:
ggplot(sat.act, aes(x = SATV, y = SATQ, color = sex)) +
geom_point()
## Warning: Removed 13 rows containing missing values (geom_point).
ggplot(sat.act, aes(x = SATV, fill = sex)) +
geom_density(alpha = .5)
or continuous:
ggplot(sat.act, aes(x = SATV, y = SATQ, color = age)) +
geom_point() +
scale_color_continuous(low = "lightblue", high = "darkblue")
## Warning: Removed 13 rows containing missing values (geom_point).
ggplot(sat.act, aes(x = SATV, y = SATQ, size = age)) +
geom_point() +
scale_size_continuous(range = c(.5,3))
## Warning: Removed 13 rows containing missing values (geom_point).
ggplot(sat.act, aes(x = SATV, y = SATQ, shape = sex)) +
geom_point() +
scale_shape_manual(values = c(17, 19))
## Warning: Removed 13 rows containing missing values (geom_point).
Transparency (alpha) is often used to prevent overplotting.
ggplot(sat.act, aes(x = SATV, y = SATQ)) +
geom_point(alpha = .6)
## Warning: Removed 13 rows containing missing values (geom_point).
There are other aesthetic mappings for specific use cases, but these are the most common.
Bar graph
ggplot(sat.act, aes(x=education, y=SATV, fill = sex)) +
geom_bar(stat = "summary", fun = "mean", position = "dodge") + # fun replaces the deprecated fun.y
labs(fill = "Sex",
x = "Education",
y = "Mean SATV")
Does the effect of sex on SATV scores differ by education level? Maybe for people without a high school education (0). Additional advanced suggestions for bar graphs can be found in the supplemental materials.
Scatter plot with best fit lines
ggplot(sat.act, aes(x = SATV, y = SATQ, color = sex)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
## Warning: Removed 13 rows containing non-finite values (stat_smooth).
## Warning: Removed 13 rows containing missing values (geom_point).
Does the effect of SATV on SATQ differ by sex? Not a large visual difference between the best fit lines for each sex.
Scatter plot with lines for specified values of the continuous moderator (+/- 1 SD by default). Uses the interactions package.
# First run your model
lm1 <- lm(ACT~SATQ*SATV, sat.act)
# Provide that model to the interact_plot function
interactions::interact_plot(lm1, pred = SATQ, modx = SATV,
plot.points = TRUE,
x.label = "SATQ",
y.label = "ACT",
legend.main = "SATV")
Is there an interaction between SATQ and SATV in predicting ACT? Looks like it! As SATV scores increase, the relationship between SATQ and ACT becomes stronger.
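To make the "+/- 1 SD" lines more concrete, the sketch below computes them by hand with base R. It is only an approximation of the idea, not the internals of interact_plot(); the built-in mtcars data stands in for sat.act, with `wt` as the predictor and `hp` as the continuous moderator, both chosen purely for illustration.

```r
# mtcars ships with R and stands in for sat.act here
fit <- lm(mpg ~ wt * hp, data = mtcars)

# Moderator values: mean - 1 SD, mean, mean + 1 SD
mod_levels <- mean(mtcars$hp) + c(-1, 0, 1) * sd(mtcars$hp)

# Grid of predictor values crossed with the three moderator values
grid <- expand.grid(
  wt = seq(min(mtcars$wt), max(mtcars$wt), length.out = 50),
  hp = mod_levels
)

# Predicted outcome along each of the three lines
grid$mpg_hat <- predict(fit, newdata = grid)

# Slope of the predictor at each moderator value:
# b_predictor + b_interaction * moderator
slopes <- unname(coef(fit)["wt"] + coef(fit)["wt:hp"] * mod_levels)
slopes
```

If the three slopes differ noticeably, the lines fan out in the plot, which is the visual signature of an interaction.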
Use faceting
plot <- ggplot(sat.act, aes(x=SATQ, y=SATV, color = sex)) +
geom_point(alpha = .6) +
stat_smooth(method = lm, se = FALSE, size = 1.2) +
facet_wrap(vars(education))
plot
## Warning: Removed 13 rows containing non-finite values (stat_smooth).
## Warning: Removed 13 rows containing missing values (geom_point).
Axis titles, labels and breaks, plot title, plot caption, font sizes, theme, gridlines, axis lines, facet spacing, legend positioning, etc… All the fiddly details.
plot +
labs(title = "Relationship between SATQ and SATV by sex and education",
subtitle = "Data from SAPA project",
caption = "N = 687") +
scale_y_continuous(breaks = seq(200, 800, by = 100)) +
theme_bw() +
theme(panel.grid.minor.x = element_blank(),
panel.grid.minor.y = element_blank()) +
theme(axis.text = element_text(size = 10),
axis.title = element_text(size = 13),
legend.text = element_text(size = 11),
legend.title = element_text(size = 13)) +
theme(panel.spacing.x = unit(1, "lines")) +
theme(legend.title = element_blank())
## Warning: Removed 13 rows containing non-finite values (stat_smooth).
## Warning: Removed 13 rows containing missing values (geom_point).
ggsave() easily exports plots in a variety of file formats, to your specified dimensions and resolution.
You can export a saved plot object, or if no plot is specified, it will export the last plot you produced.
ggsave("plot.png",
width = 10, height = 6, dpi = 300) # make a 10 x 6 inch PNG file with 300 DPI
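The call above relies on the last plot displayed. To export a specific saved object instead, pass it through the `plot` argument. The sketch below assumes ggplot2 is installed and that a PNG graphics device is available; `toy`, `p` and the file name are illustrative, and the file is written to a temporary directory.

```r
library(ggplot2)

# Toy stand-in for sat.act (age and ACT values from the rows printed earlier)
toy <- data.frame(age = c(19, 23, 20, 27, 33), ACT = c(24, 35, 21, 26, 31))
p <- ggplot(toy, aes(age, ACT)) + geom_point()

# Write to a temporary directory so the sketch leaves nothing in your project
out_file <- file.path(tempdir(), "scatter.png")

# plot = p exports that specific object rather than the last plot displayed
ggsave(out_file, plot = p, width = 10, height = 6, dpi = 300)
file.exists(out_file)
```

ggsave() picks the file format from the extension, so changing the name to end in ".pdf" would produce a PDF with the same dimensions.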