Published: January 18, 2025

Modified: January 9, 2025

Making data

About

This tutorial shows how we can construct data of different types.

library(ggplot2)
library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

Nominal data

Let’s focus on nominal or categorical data, specifically the favorite colors of some imaginary set of people.

colors <- c("red", "orange", "yellow", "green", "blue", "violet", "white", "black", "gray", "brown")

There are n=10 colors in this set.
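
One way to make the categorical nature of these values explicit in R is to store them as a factor. The sketch below is an illustration rather than part of the original analysis; the name `color_factor` is an assumption introduced here.

```r
colors <- c("red", "orange", "yellow", "green", "blue", "violet", "white", "black", "gray", "brown")

# Store the categories as a factor; by default the factor is unordered
color_factor <- factor(colors)

nlevels(color_factor)     # number of distinct categories
is.ordered(color_factor)  # FALSE: R attaches no ranking to the levels
levels(color_factor)      # levels are listed alphabetically, not in the order typed
```

The alphabetical listing of the levels is a display convention, not a ranking.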

Your turn

Why are these data nominal or nominally scaled? Or rather, why aren’t they ordinal, interval, or ratio?

Generating

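For demonstration purposes, we want to draw a random sample from these colors. Let’s pick n=200 and sample with replacement (replace=TRUE in the code below), meaning that each color can be drawn more than once. That is necessary here because we want 200 responses from only 10 colors, and it means any given color could appear many times, a few times, or not at all in our sample of 200 imaginary people.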

our_color_sample <- sample(colors, size=200, replace=TRUE)
our_color_sample

  [1] "yellow" "gray"   "red"    "orange" "brown"  "green"  "brown"  "black" 
  [9] "orange" "black"  "orange" "white"  "white"  "blue"   "violet" "yellow"
 [17] "white"  "orange" "brown"  "violet" "orange" "brown"  "yellow" "gray"  
 [25] "white"  "black"  "gray"   "orange" "blue"   "gray"   "gray"   "gray"  
 [33] "red"    "gray"   "gray"   "yellow" "black"  "blue"   "green"  "black" 
 [41] "orange" "black"  "yellow" "green"  "black"  "gray"   "brown"  "green" 
 [49] "violet" "black"  "orange" "red"    "red"    "violet" "green"  "orange"
 [57] "orange" "blue"   "black"  "violet" "blue"   "gray"   "brown"  "white" 
 [65] "white"  "green"  "red"    "gray"   "brown"  "brown"  "black"  "orange"
 [73] "violet" "orange" "blue"   "brown"  "brown"  "gray"   "yellow" "blue"  
 [81] "white"  "gray"   "white"  "orange" "yellow" "yellow" "green"  "violet"
 [89] "blue"   "violet" "violet" "gray"   "red"    "green"  "green"  "brown" 
 [97] "yellow" "green"  "orange" "brown"  "green"  "red"    "violet" "green" 
[105] "white"  "violet" "black"  "brown"  "blue"   "orange" "orange" "yellow"
[113] "green"  "black"  "red"    "violet" "white"  "gray"   "orange" "blue"  
[121] "yellow" "white"  "yellow" "gray"   "gray"   "gray"   "white"  "green" 
[129] "brown"  "white"  "brown"  "green"  "black"  "green"  "blue"   "white" 
[137] "white"  "orange" "green"  "white"  "white"  "orange" "brown"  "gray"  
[145] "gray"   "white"  "orange" "brown"  "orange" "green"  "green"  "green" 
[153] "violet" "blue"   "black"  "green"  "white"  "gray"   "blue"   "yellow"
[161] "white"  "blue"   "violet" "red"    "blue"   "black"  "white"  "white" 
[169] "white"  "orange" "green"  "blue"   "black"  "blue"   "brown"  "blue"  
[177] "green"  "red"    "black"  "yellow" "gray"   "gray"   "gray"   "white" 
[185] "red"    "green"  "gray"   "blue"   "yellow" "blue"   "orange" "blue"  
[193] "brown"  "brown"  "brown"  "black"  "black"  "gray"   "red"    "gray"
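
Two practical points about `sample()` can be checked directly. First, without replacement we cannot draw more values than the set contains, which is why `replace=TRUE` is needed for 200 draws from 10 colors. Second, each run of `sample()` gives a different result unless a random seed is fixed first. The sketch below assumes we want a repeatable draw; the seed value 2025 and the names `too_many`, `first_draw`, and `second_draw` are arbitrary choices made for this illustration, not part of the original code.

```r
colors <- c("red", "orange", "yellow", "green", "blue", "violet", "white", "black", "gray", "brown")

# Without replacement, 200 draws from 10 values is impossible, so sample() raises an error.
# tryCatch() captures the error message instead of stopping the script.
too_many <- tryCatch(
  sample(colors, size = 200, replace = FALSE),
  error = function(e) conditionMessage(e)
)
too_many

# Fixing the same seed before each call makes the draw repeatable
set.seed(2025)
first_draw <- sample(colors, size = 200, replace = TRUE)
set.seed(2025)
second_draw <- sample(colors, size = 200, replace = TRUE)

identical(first_draw, second_draw)  # TRUE
```

With a different seed, or none at all, the sample and every count derived from it will differ from the output shown in this tutorial.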

On its own, that’s not especially easy to visualize. What if we sorted it?

sort(our_color_sample)

  [1] "black"  "black"  "black"  "black"  "black"  "black"  "black"  "black" 
  [9] "black"  "black"  "black"  "black"  "black"  "black"  "black"  "black" 
 [17] "black"  "black"  "black"  "blue"   "blue"   "blue"   "blue"   "blue"  
 [25] "blue"   "blue"   "blue"   "blue"   "blue"   "blue"   "blue"   "blue"  
 [33] "blue"   "blue"   "blue"   "blue"   "blue"   "blue"   "blue"   "blue"  
 [41] "brown"  "brown"  "brown"  "brown"  "brown"  "brown"  "brown"  "brown" 
 [49] "brown"  "brown"  "brown"  "brown"  "brown"  "brown"  "brown"  "brown" 
 [57] "brown"  "brown"  "brown"  "brown"  "brown"  "gray"   "gray"   "gray"  
 [65] "gray"   "gray"   "gray"   "gray"   "gray"   "gray"   "gray"   "gray"  
 [73] "gray"   "gray"   "gray"   "gray"   "gray"   "gray"   "gray"   "gray"  
 [81] "gray"   "gray"   "gray"   "gray"   "gray"   "gray"   "gray"   "gray"  
 [89] "green"  "green"  "green"  "green"  "green"  "green"  "green"  "green" 
 [97] "green"  "green"  "green"  "green"  "green"  "green"  "green"  "green" 
[105] "green"  "green"  "green"  "green"  "green"  "green"  "green"  "green" 
[113] "orange" "orange" "orange" "orange" "orange" "orange" "orange" "orange"
[121] "orange" "orange" "orange" "orange" "orange" "orange" "orange" "orange"
[129] "orange" "orange" "orange" "orange" "orange" "orange" "orange" "red"   
[137] "red"    "red"    "red"    "red"    "red"    "red"    "red"    "red"   
[145] "red"    "red"    "red"    "violet" "violet" "violet" "violet" "violet"
[153] "violet" "violet" "violet" "violet" "violet" "violet" "violet" "violet"
[161] "violet" "white"  "white"  "white"  "white"  "white"  "white"  "white" 
[169] "white"  "white"  "white"  "white"  "white"  "white"  "white"  "white" 
[177] "white"  "white"  "white"  "white"  "white"  "white"  "white"  "white" 
[185] "white"  "yellow" "yellow" "yellow" "yellow" "yellow" "yellow" "yellow"
[193] "yellow" "yellow" "yellow" "yellow" "yellow" "yellow" "yellow" "yellow"

Still not all that helpful.

Summarizing

What can we say to summarize categorical data?

We can report the total number of responses.

length(our_color_sample)

[1] 200

We can report the number of unique categories.

length(unique(our_color_sample))

[1] 10

We can report the number of responses per category.

colors_df <- data.frame(favorite_color = our_color_sample)

xtabs(formula = ~favorite_color, data = colors_df)

favorite_color
 black   blue  brown   gray  green orange    red violet  white yellow 
    19     21     21     27     24     23     12     14     24     15
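
Counts are easier to compare across samples of different sizes when expressed as proportions, and the most frequent category (the mode) is a summary that still makes sense for nominal data. The sketch below is one possible extension, not part of the original tutorial. To stay self-contained it rebuilds the data from the counts printed above; the names `observed_counts`, `color_counts`, `color_props`, and `modal_color` are assumptions introduced here.

```r
# Counts copied from the xtabs() output above
observed_counts <- c(black = 19, blue = 21, brown = 21, gray = 27, green = 24,
                     orange = 23, red = 12, violet = 14, white = 24, yellow = 15)

# Rebuild a data frame with one row per response
colors_df <- data.frame(
  favorite_color = rep(names(observed_counts), times = observed_counts)
)

# Counts per category, then the share of all responses in each category
color_counts <- table(colors_df$favorite_color)
color_props <- prop.table(color_counts)
round(color_props, 3)

# The mode: the category with the largest count
modal_color <- names(color_counts)[which.max(color_counts)]
modal_color  # "gray"
```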

Visualizing

A single nominal variable doesn’t offer us many options for visualization. A bar plot showing the number of observations in each category is the usual choice.

colors_df |>
  ggplot() +
  aes(x = favorite_color) +
  geom_bar()

Figure 1: An (ugly) barplot of the random favorite color data
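
The plot can be made easier to read with a few optional tweaks: ordering the bars by frequency, filling each bar with the color it names, and labeling the axes. The following is a sketch of one way to do that, not the tutorial's own figure. It starts from the counts reported earlier and uses `geom_col()` because the counts are already computed; the names `observed_counts`, `counts_df`, `fill_values`, and `tidier_plot` are assumptions introduced here.

```r
library(ggplot2)

# Counts copied from the xtabs() output in the Summarizing section
observed_counts <- c(black = 19, blue = 21, brown = 21, gray = 27, green = 24,
                     orange = 23, red = 12, violet = 14, white = 24, yellow = 15)

counts_df <- data.frame(
  favorite_color = names(observed_counts),
  n = as.integer(observed_counts)
)

# Order the bars from most to least frequent
counts_df$favorite_color <- reorder(counts_df$favorite_color, -counts_df$n)

# Map each category to the R color of the same name
fill_values <- setNames(names(observed_counts), names(observed_counts))

tidier_plot <- ggplot(counts_df) +
  aes(x = favorite_color, y = n, fill = favorite_color) +
  geom_col(color = "black", show.legend = FALSE) +  # black outline keeps the white bar visible
  scale_fill_manual(values = fill_values) +
  labs(x = "Favorite color", y = "Number of responses") +
  theme_minimal()

tidier_plot
```

Ordering by frequency is a presentation choice only; the categories themselves still have no inherent order.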

Is one of these colors a particular favorite? Maybe. The observed data show some differences. But how can we know they aren’t just due to random chance?

Stop the presses!

These particular data are entirely random. We made them up!

Keep that in mind while we work through what we would do to answer the question if these were not fake, random data.

To answer that question, we’d have to compare our results with some other pattern, a pattern where there is no favorite. In other words, we want a pattern where the counts are all equal. So, we’re back to summarizing.

In fact, data analysis in the real world usually alternates between summarizing/analyzing and visualizing.

flowchart LR
  A[Summarize] --> B[Visualize]
  B ---> A

Figure 2: Illustration of the back and forth between summarizing and visualizing that characterizes real-life data analysis.

Summarizing II

Let’s make data in which every color has exactly the same count.

There are several ways to do this, but here is one that works.

no_favorite <- rep(colors, length(our_color_sample)/length(colors))

compare_df <- data.frame(favorite_color = no_favorite)

xtabs(formula = ~favorite_color, data = compare_df)

favorite_color
 black   blue  brown   gray  green orange    red violet  white yellow 
    20     20     20     20     20     20     20     20     20     20
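
The `rep()` call above works cleanly because 200 divides evenly by 10. If it did not, `rep()` would truncate a fractional repeat count and the comparison data would come up short. A slightly more defensive sketch, assuming the same 10 colors and 200 responses, checks divisibility first; the name `n_responses` is introduced here for illustration.

```r
colors <- c("red", "orange", "yellow", "green", "blue", "violet", "white", "black", "gray", "brown")
n_responses <- 200

# Stop early if the responses cannot be split evenly across the categories
stopifnot(n_responses %% length(colors) == 0)

# Repeat the whole set of colors the required number of times
no_favorite <- rep(colors, times = n_responses %/% length(colors))

length(no_favorite)  # 200
table(no_favorite)   # every color appears 20 times
```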

How can we compare the data where there are no differences by category to the data where there are differences?
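
As a starting point, the two sets of counts can be placed side by side and subtracted. The sketch below does that using the counts reported above. It then runs a chi-squared goodness-of-fit test against equal proportions, which is one common formal option and an assumption on our part, not necessarily the approach this tutorial takes next. The names `observed_counts`, `expected_counts`, `comparison`, and `gof` are introduced here for illustration.

```r
# Observed counts copied from the first xtabs() output
observed_counts <- c(black = 19, blue = 21, brown = 21, gray = 27, green = 24,
                     orange = 23, red = 12, violet = 14, white = 24, yellow = 15)

# Under "no favorite", every color gets an equal share of the responses
expected_counts <- rep(sum(observed_counts) / length(observed_counts),
                       length(observed_counts))

comparison <- data.frame(
  favorite_color = names(observed_counts),
  observed = as.integer(observed_counts),
  expected = expected_counts
)
comparison$difference <- comparison$observed - comparison$expected
comparison

# Goodness-of-fit test; with a single vector of counts, chisq.test()
# compares against equal proportions by default
gof <- chisq.test(observed_counts)
gof$statistic  # X-squared = 10.9
gof$p.value
```

The differences sum to zero by construction, so the test statistic is built from squared differences scaled by the expected count.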