library(ggplot2)
library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
January 18, 2025
January 9, 2025
This tutorial shows how we can construct data of different types.
Let’s focus on nominal or categorical data, specifically the favorite colors of some imaginary set of people.
There are n=10 colors in this set.
Why are these data nominal or nominally scaled? Or rather, why aren’t they ordinal, interval, or ratio?
For demonstration purposes, we want to draw a random sample from these colors. Let’s pick n=200 and sample with replacement (replace=TRUE in the code below), meaning that each color can be drawn any number of times across our sample of 200 imaginary people.
[1] "yellow" "gray" "red" "orange" "brown" "green" "brown" "black"
[9] "orange" "black" "orange" "white" "white" "blue" "violet" "yellow"
[17] "white" "orange" "brown" "violet" "orange" "brown" "yellow" "gray"
[25] "white" "black" "gray" "orange" "blue" "gray" "gray" "gray"
[33] "red" "gray" "gray" "yellow" "black" "blue" "green" "black"
[41] "orange" "black" "yellow" "green" "black" "gray" "brown" "green"
[49] "violet" "black" "orange" "red" "red" "violet" "green" "orange"
[57] "orange" "blue" "black" "violet" "blue" "gray" "brown" "white"
[65] "white" "green" "red" "gray" "brown" "brown" "black" "orange"
[73] "violet" "orange" "blue" "brown" "brown" "gray" "yellow" "blue"
[81] "white" "gray" "white" "orange" "yellow" "yellow" "green" "violet"
[89] "blue" "violet" "violet" "gray" "red" "green" "green" "brown"
[97] "yellow" "green" "orange" "brown" "green" "red" "violet" "green"
[105] "white" "violet" "black" "brown" "blue" "orange" "orange" "yellow"
[113] "green" "black" "red" "violet" "white" "gray" "orange" "blue"
[121] "yellow" "white" "yellow" "gray" "gray" "gray" "white" "green"
[129] "brown" "white" "brown" "green" "black" "green" "blue" "white"
[137] "white" "orange" "green" "white" "white" "orange" "brown" "gray"
[145] "gray" "white" "orange" "brown" "orange" "green" "green" "green"
[153] "violet" "blue" "black" "green" "white" "gray" "blue" "yellow"
[161] "white" "blue" "violet" "red" "blue" "black" "white" "white"
[169] "white" "orange" "green" "blue" "black" "blue" "brown" "blue"
[177] "green" "red" "black" "yellow" "gray" "gray" "gray" "white"
[185] "red" "green" "gray" "blue" "yellow" "blue" "orange" "blue"
[193] "brown" "brown" "brown" "black" "black" "gray" "red" "gray"
On its own, that’s not especially easy to visualize. What if we sorted it?
[1] "black" "black" "black" "black" "black" "black" "black" "black"
[9] "black" "black" "black" "black" "black" "black" "black" "black"
[17] "black" "black" "black" "blue" "blue" "blue" "blue" "blue"
[25] "blue" "blue" "blue" "blue" "blue" "blue" "blue" "blue"
[33] "blue" "blue" "blue" "blue" "blue" "blue" "blue" "blue"
[41] "brown" "brown" "brown" "brown" "brown" "brown" "brown" "brown"
[49] "brown" "brown" "brown" "brown" "brown" "brown" "brown" "brown"
[57] "brown" "brown" "brown" "brown" "brown" "gray" "gray" "gray"
[65] "gray" "gray" "gray" "gray" "gray" "gray" "gray" "gray"
[73] "gray" "gray" "gray" "gray" "gray" "gray" "gray" "gray"
[81] "gray" "gray" "gray" "gray" "gray" "gray" "gray" "gray"
[89] "green" "green" "green" "green" "green" "green" "green" "green"
[97] "green" "green" "green" "green" "green" "green" "green" "green"
[105] "green" "green" "green" "green" "green" "green" "green" "green"
[113] "orange" "orange" "orange" "orange" "orange" "orange" "orange" "orange"
[121] "orange" "orange" "orange" "orange" "orange" "orange" "orange" "orange"
[129] "orange" "orange" "orange" "orange" "orange" "orange" "orange" "red"
[137] "red" "red" "red" "red" "red" "red" "red" "red"
[145] "red" "red" "red" "violet" "violet" "violet" "violet" "violet"
[153] "violet" "violet" "violet" "violet" "violet" "violet" "violet" "violet"
[161] "violet" "white" "white" "white" "white" "white" "white" "white"
[169] "white" "white" "white" "white" "white" "white" "white" "white"
[177] "white" "white" "white" "white" "white" "white" "white" "white"
[185] "white" "yellow" "yellow" "yellow" "yellow" "yellow" "yellow" "yellow"
[193] "yellow" "yellow" "yellow" "yellow" "yellow" "yellow" "yellow" "yellow"
Still not all that helpful.
What can we say to summarize categorical data?
We can report the number of total responses.
We can report the number of unique categories.
We can report the number of responses per category.
Single nominal variables don’t offer us many options for visualization. A bar plot showing the number of observations in each category seems to be it.
Is one of these colors a particular favorite? Maybe. The observed data show some differences. But how can we know they aren’t just due to random chance?
These particular data are entirely random. We made them up!
Keep that in mind while we work through what we would do to answer the question if these were not fake, random data.
To answer that question, we’d have to compare our results with some other pattern, a pattern where there is no favorite. In other words, we want a pattern where the counts are all equal. So, we’re back to summarizing.
Actually, data analysis in the real world usually involves a back-and-forth alternation between summarizing/analyzing and visualizing.
Let’s make data that have an identical pattern across all colors.
There are several ways to do this, but here is one that works.
favorite_color
black blue brown gray green orange red violet white yellow
20 20 20 20 20 20 20 20 20 20
How can we compare the data where there are no differences by category to the data where there are differences?
# Making data {-}
## About {-}
This tutorial shows how we can construct data of different types.
```{r}
library(ggplot2)
library(dplyr)
```
## Nominal data {-}
Let's focus on nominal or categorical data, specifically the favorite colors of some imaginary set of people.
```{r}
colors <- c("red", "orange", "yellow", "green", "blue", "violet", "white", "black", "gray", "brown")
```
There are *n*=`{r} length(colors)` colors in this set.
::: {.callout-note}
## Your turn
Why are these data nominal or nominally scaled?
Or rather, why *aren't* they ordinal, interval, or ratio?
:::
### Generating {-}
For demonstration purposes, we want to draw a random sample from these colors.
Let's pick *n*=200 and sample *with replacement* (`replace=TRUE` in the code below), meaning that each color can be drawn any number of times across our sample of 200 imaginary people.
```{r}
our_color_sample <- sample(colors, size=200, replace=TRUE)
our_color_sample
```
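Because `sample()` draws at random, the values printed above change every time the document is rendered.
Here is a minimal sketch of one way to make the draws repeatable with `set.seed()`.
The seed value (2025) and the names `demo_colors`, `first_draw`, and `second_draw` are our own choices for illustration, not part of the tutorial's data.

```{r}
# The seed value 2025 is an arbitrary choice for illustration
demo_colors <- c("red", "orange", "yellow", "green", "blue",
                 "violet", "white", "black", "gray", "brown")

set.seed(2025)
first_draw <- sample(demo_colors, size = 200, replace = TRUE)

set.seed(2025)  # resetting the seed replays the same sequence of draws
second_draw <- sample(demo_colors, size = 200, replace = TRUE)

# Both draws used the same seed, so they should match element for element
identical(first_draw, second_draw)
```

Setting the seed once, near the top of a document, is usually enough to make every later random draw reproducible.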
On its own, that's not especially easy to visualize.
What if we sorted it?
```{r}
sort(our_color_sample)
```
Still not all that helpful.
### Summarizing {-}
What can we say to summarize categorical data?
We can report the number of total responses.
```{r}
length(our_color_sample)
```
We can report the number of unique categories.
```{r}
length(unique(our_color_sample))
```
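The number of unique categories we *observe* can be smaller than the number of categories we *could have* observed.
With 200 draws from 10 colors that is unlikely, so this sketch uses a deliberately tiny sample of 5 to make the point.
The names `demo_colors`, `small_sample`, `n_observed`, and `never_chosen` are hypothetical.

```{r}
demo_colors <- c("red", "orange", "yellow", "green", "blue",
                 "violet", "white", "black", "gray", "brown")

# Five draws can contain at most five distinct colors
small_sample <- sample(demo_colors, size = 5, replace = TRUE)

# How many distinct colors showed up?
n_observed <- length(unique(small_sample))
n_observed

# Which available colors were never chosen?
never_chosen <- setdiff(demo_colors, small_sample)
never_chosen
```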
We can report the number of responses per category.
```{r}
colors_df <- data.frame(favorite_color = our_color_sample)
xtabs(formula = ~favorite_color, data = colors_df)
```
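Counts are easier to compare across samples of different sizes when we also express them as proportions.
Here is a small sketch using base R's `table()` and `prop.table()` on a made-up vector of six responses; `demo_sample` and the other names are hypothetical.

```{r}
# A tiny, made-up set of responses (not the tutorial's data)
demo_sample <- c("red", "blue", "blue", "green", "red", "blue")

# Number of responses per category
demo_counts <- table(demo_sample)
demo_counts

# Share of all responses falling in each category
demo_props <- prop.table(demo_counts)
demo_props

# The most frequently chosen category (the mode)
most_common <- names(demo_counts)[which.max(demo_counts)]
most_common
```

For nominal data, the mode is the only measure of "typical value" that makes sense, because the categories have no order or distance.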
### Visualizing {-}
Single nominal variables don't offer us many options for visualization.
A bar plot showing the number of observations in each category seems to be it.
```{r}
#| label: fig-nominal-barplot
#| fig-cap: "An (ugly) barplot of the random favorite color data"
colors_df |>
  ggplot() +
  aes(x = favorite_color) +
  geom_bar()
```
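One possible way to make a bar plot like this less ugly is to fill each bar with the color it represents.
The sketch below does that on a small made-up data frame (`demo_df` and `demo_plot` are hypothetical names), assuming every category name is also a valid R color name.

```{r}
library(ggplot2)

# A tiny, made-up set of responses (not the tutorial's data)
demo_df <- data.frame(
  favorite_color = c("red", "blue", "blue", "green", "red", "blue", "white")
)

demo_plot <- demo_df |>
  ggplot() +
  aes(x = favorite_color, fill = favorite_color) +
  geom_bar(color = "black") +  # black outline keeps the white bar visible
  scale_fill_identity() +      # treat the category names as literal fill colors
  labs(x = "Favorite color", y = "Number of responses")

demo_plot
```

The same layers could be added to the plot of the full random sample.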
Is one of these colors a particular favorite?
Maybe.
The observed data show some differences.
But how can we know they aren't just due to random chance?
::: {.callout-warning}
## Stop the presses!
These *particular* data are entirely random.
We made them up!
Keep that in mind while we work through what we would do to answer the question if these were *not* fake, random data.
:::
To answer that question, we'd have to compare our results with some other pattern, a pattern where there is **no** favorite.
In other words, we want a pattern where the counts are all equal.
So, we're back to summarizing.
Actually, data analysis in the real world usually involves a back-and-forth alternation between summarizing/analyzing and visualizing.
```{mermaid}
%%| label: fig-data-dialetic
%%| fig-cap: "Illustration of the back and forth between summarizing and visualizing that characterizes real-life data analysis."
flowchart LR
A[Summarize] --> B[Visualize]
B ---> A
```
### Summarizing II {-}
Let's make data that have an identical pattern across all colors.
There are several ways to do this, but here is one that works.
```{r}
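# Repeat the full set of colors so that each one appears equally often (200 / 10 = 20 times)
no_favorite <- rep(colors, times = length(our_color_sample) %/% length(colors))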
compare_df <- data.frame(favorite_color = no_favorite)
xtabs(formula = ~favorite_color, data = compare_df)
```
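To illustrate the "several ways" mentioned above, here is a sketch of two `rep()` calls that produce the same equal counts.
It uses a made-up set of three colors and a total of 12 responses; all names are hypothetical.

```{r}
demo_colors <- c("red", "blue", "green")
n_total <- 12

# Integer division gives the number of responses per color
per_color <- n_total %/% length(demo_colors)

# times = repeats the whole vector: red, blue, green, red, blue, green, ...
cycled <- rep(demo_colors, times = per_color)

# each = repeats every element in a block: red, red, red, red, blue, ...
blocked <- rep(demo_colors, each = per_color)

# The order of responses differs, but the counts per category do not
table(cycled)
table(blocked)
all(table(cycled) == table(blocked))
```

Because nominal summaries ignore the order of the responses, either version works as a "no favorite" reference pattern.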
How can we compare the data where there are no differences by category to the data where there *are* differences?
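One simple starting point is to put the two sets of counts side by side and look at the differences.
This is only a sketch: it draws its own random sample with an arbitrary seed, so the numbers will not match the output shown earlier, and names such as `comparison` are hypothetical.

```{r}
set.seed(1)  # arbitrary seed, chosen only so this sketch is repeatable

demo_colors <- c("red", "orange", "yellow", "green", "blue",
                 "violet", "white", "black", "gray", "brown")
demo_draws <- sample(demo_colors, size = 200, replace = TRUE)

# Fixing the factor levels keeps a category in the table even if its count is zero
observed <- table(factor(demo_draws, levels = sort(demo_colors)))

# With no favorite, every color would get an equal share of the responses
expected <- rep(length(demo_draws) / length(demo_colors), length(demo_colors))

comparison <- data.frame(
  favorite_color = names(observed),
  observed = as.vector(observed),
  expected = expected
)

# Positive values mean a color was chosen more often than "no favorite" predicts
comparison$difference <- comparison$observed - comparison$expected
comparison
```

The differences always sum to zero, because both columns add up to the same total number of responses.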