We’re not slow; we’re not fast. We’re half-fast.
(Joke my Dad didn’t think I got as a kid.)
This tutorial is intended as a very slow introduction to using R. If we’re going too slow for you, that’s ok. But if we’re going too fast, please say so! We think that by going slow at the start, it will be easier to speed up later.
We talk about programming a computer or coding, but most of the time we’re having a conversation. We say something, and the computer responds. We say something else, and the computer responds. And so on.
In this section, we’ll learn how to talk to the computer in a language it understands. That language is R.
It’s fun. It’s free. You can amaze your friends and dazzle your rivals. It’s powerful, especially for manipulating, plotting, and analyzing data.
It will make you a more productive researcher. That’s the bottom line.
What if I want to learn another programming language?
Awesome! Good for you. Learn a bunch of languages. They’re a bit like human languages: It’s easier or more poetic to say some things in some languages than it is in others. But make sure you develop some mastery over one computer language before learning another one. Other useful languages for behavioral scientists to learn include the following: Python, Matlab, *nix shell programming, HTML/CSS/JavaScript, SQL, C/C++, and Java, for starters.
RStudio is an integrated development environment (IDE) for R. RStudio brings together a number of useful tools for talking to the computer in R. You don’t have to use RStudio to use R, but you should use it for this bootcamp, and we strongly recommend using it in the future. It’s suitable for beginners and experts.
I’m going to log in to a version of RStudio that Penn State hosts so that Penn Staters can use RStudio from a web browser. Detailed instructions can be found at http://psu-psychology.github.io/r-bootcamp-2018/rstudio-tlt.html.
In brief, I enter https://lxclusterapps.tlt.psu.edu:8787 in my browser, enter my PSU Access ID (rog1) and password, then click on the Sign In
button with my mouse or press return on my keyboard. Then I see an RStudio window that looks very much like this one:
This is the default view. It has several different “windows” or panels. They each provide us with helpful information. You can rearrange them or customize RStudio to your heart’s content. But do that later. For now, let’s concentrate on the panel on the left side called the Console
.
The console is where you do most of your talking to R. Notice that there is some text, and then a greater-than sign (>). Let’s read the text.
Besides the version of R and some other details, it tells us how to start a conversation with R. It says Type 'license()' or 'licence()' for distribution details.
Let’s try that. Type ‘license()’ right after the greater-than >
sign. Press the return
or enter
key on your keyboard to tell R you’ve finished saying something.
"license()"
## [1] "license()"
Well, that wasn’t very interesting. The computer responded by repeating what we’d typed, changing the single quotation marks for double quotation marks, but that’s about it.
Painful lesson #1: Computers are super-literal. They are anally literal. You’re not going to change them. Just deal.
Try typing license()
without the single quotation marks (and hit return/enter).
license()
##
## This software is distributed under the terms of the GNU General
## Public License, either Version 2, June 1991 or Version 3, June 2007.
## The terms of version 2 of the license are in a file called COPYING
## which you should have received with
## this software and which can be displayed by RShowDoc("COPYING").
## Version 3 of the license can be displayed by RShowDoc("GPL-3").
##
## Copies of both versions 2 and 3 of the license can be found
## at https://www.R-project.org/Licenses/.
##
## A small number of files (the API header files listed in
## R_DOC_DIR/COPYRIGHTS) are distributed under the
## LESSER GNU GENERAL PUBLIC LICENSE, version 2.1 or later.
## This can be displayed by RShowDoc("LGPL-2.1"),
## or obtained at the URI given.
## Version 3 of the license can be displayed by RShowDoc("LGPL-3").
##
## 'Share and Enjoy.'
Much better! So, this was our first ‘conversation’ with the computer. We said something in R, and the computer responded.
Why did typing license()
work but typing 'license()'
not work? The single quotation marks. Typing license()
without them gave R a command; surrounding the same characters with single quotation marks told R that all of those characters were a single unit, a sequence of characters (called a string), and definitely not a command. As I said, computers are very literal. Many of the errors and frustrations you will encounter in your R journey will come down to your not telling the computer what to do in EXACTLY the way it needs to be told.
The console refers to the whole window or panel. Notice that as we type text or R does, that text scrolls up so we can see the recent history of our conversation. You can scroll (with your mouse or arrow keys) up and down in the console.
The greater-than (>
) character is called the ‘prompt’, and the vertical line or pipe character (|
) is called the ‘cursor’. You already knew about the cursor from your experiences in other computer programs. It’s where characters we type will be entered. The prompt is just a character to ‘prompt’ or remind us that R is waiting for us to say something.
Typing in the console is just one way to talk to the computer. It’s interactive, meaning we type, it responds. Or really, we command, it responds (if it can). This way of talking to computers is very old school. It goes back to the 60s. It might seem less powerful than, say, clicking buttons or menu items or talking to Siri or Alexa. But just wait and see. The console is our window into the computer’s brain.
What’s happening under the hood here? When the console displays the prompt it means that R is waiting for you to do something. That something is to type something and hit the return
key. When you hit return
, R tries to ‘understand’ what you typed and do something sensible in response.
Complex programs are just long sequences of commands entered into something like the computer’s console and the computer’s responses to those commands.
So, what can you say to R? You can give R commands and ask it simple questions.
When you type things that end with parentheses like license()
, that commands R to do something, in this case, to ‘print the license information’. Why doesn’t R require you to say print_license()
? To save typing.
Here’s another command: sum(1, 4, 7)
sum(1, 4, 7)
## [1] 12
This command says ‘calculate and print the sum of the numbers 1, 4, and 7’. R responds with the answer: 12. Notice that there are two parts to the command: the ‘what to do’ part (sum
) and the ‘what to do it with or on’ part inside the parentheses, here (1,4,7)
. We’ll return to this later, but many, many things we want to do or say in life (and R life) have two parts, the ‘verb’ or action we want to do and the ‘noun’ or objects/people we want to involve in that action. The parentheses tell R which is which when it reads the command from the console.
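Here is a small sketch of the same verb-and-noun pattern using a few other commands that come with R. The numbers are made up for illustration.

```r
# Each command has a 'verb' (the name before the parentheses) and
# 'nouns' (the things inside the parentheses).
max(3, 9, 2)                 # the largest of the three numbers: 9
round(3.14159, digits = 2)   # some nouns have labels, like 'digits': 3.14
sum(1, 4, 7) / 3             # a command's answer can be used in a calculation: 4
```

Try changing the numbers inside the parentheses and see how R's answers change.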
Before we talk about other things you can say to the computer, let’s talk about where and how computers store data and instructions.
Computers store data and instructions (programs, scripts, etc.) in two types of memory: long-term memory (files) and short-term memory.
We’ll talk about files first.
Files are organised in directories or folders. Directories/folders often have a hierarchy. The sequence of directories that leads to a file is called the path.
I can list the files in a directory or folder using the list.files()
command or list the directories using the list.dirs()
command.
# List the files in my current directory
list.files()
## [1] "bib"
## [2] "bootcamp-day-1-intro_cache"
## [3] "bootcamp-day-1-intro_files"
## [4] "bootcamp-day-1-intro.html"
## [5] "bootcamp-day-1-intro.Rmd"
## [6] "bootcamp-survey_cache"
## [7] "bootcamp-survey_files"
## [8] "bootcamp-survey.html"
## [9] "bootcamp-survey.Rmd"
## [10] "codebook_survey.pdf"
## [11] "codebook_survey.Rmd"
## [12] "css"
## [13] "ggplot2_tutorial_vallorani.html"
## [14] "ggplot2_tutorial_vallorani.Rmd"
## [15] "gilmore-hallquist-bootcamp-2018-papaja_files"
## [16] "gilmore-hallquist-bootcamp-2018-papaja.fff"
## [17] "gilmore-hallquist-bootcamp-2018-papaja.pdf"
## [18] "gilmore-hallquist-bootcamp-2018-papaja.Rmd"
## [19] "gilmore-hallquist-bootcamp-2018-papaja.tex"
## [20] "gilmore-hallquist-bootcamp-2018-papaja.ttt"
## [21] "IntroBasicEFA_2018_0815.html"
## [22] "IntroBasicEFA_2018_0815.Rmd"
## [23] "parallel_r.html"
## [24] "parallel_r.Rmd"
## [25] "r-eproducible-science.html"
## [26] "r-eproducible-science.Rmd"
## [27] "r-references.bib"
## [28] "slow-r.html"
## [29] "slow-r.md"
## [30] "slow-r.Rmd"
Notice that files have names (e.g., slow-r
) and extensions that start with a period (e.g., .Rmd
). The extension is a way for the computer to communicate with itself and others what type of file is being stored. You’re probably familiar with files that have one of these extensions: .docx, .xlsx, .html, and .pdf.
In addition to knowing ‘what’ files are already stored, we often want to know ‘where’ they are stored. In computer terms, this means that we want to know where we are working in this hierarchy of directories or folders. The get working directory or getwd()
command shows us this information.
# Get and print the current working directory
getwd()
## [1] "/Users/rick/github/psu-psychology/r-bootcamp-2018/talks"
As you continue your jouRney with R, you’ll want to practice a form of ‘situational awareness’: Where am I now (in my computer’s file system)? Don’t worry, this will eventually become second nature to you.
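If you want to take a path apart, R has commands for that. This is just a sketch: the path below is made up for illustration and does not need to exist on your computer.

```r
# Build a (hypothetical) path from its parts
my_path <- file.path("projects", "bootcamp", "slow-r.Rmd")
my_path

dirname(my_path)           # the directories that lead to the file
basename(my_path)          # the file name, including its extension
tools::file_ext(my_path)   # just the extension, without the period
```

The file.path() command puts the separators between the directories for you, so you don't have to type them yourself.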
In the next section, we’ll see how we can store information in R’s short-term memory.
Here’s another useful command: my_age <- 55
.
my_age <- 55 # Rick's age
# Text that is preceded by the # character is called a comment. These are
# 'notes' you can keep about your code so you remember why you did
# something. Here, R ignores the part after the #: ' Rick's age'
This command tells R to ‘assign the name “my_age” to the number 55’. The ‘assign the name’ command is that leftward arrow (<-
) symbol. You can type it the easy way by typing option
and the minus -
keys at the same time (Mac OS) or the alt
and -
keys on Windows. You can also type it the hard way by typing a <
and then -
. But train your fingers to type it the easy way.
When we use the assignment command, we create a new object in R’s short-term memory with that object’s name. We can list the objects in R’s current short-term memory with the list objects or ls()
command:
# List the objects in R's current environment
ls()
## [1] "my_age"
The ls()
command doesn’t show me the contents of the object I just created, but it does show me that R has stored something. Unless I do something special to save it, this object will disappear when I quit R. But don’t worry. We’ll show you how to save information you care about for futuRe work sessions.
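Here is a short sketch of how objects come and go in R's short-term memory. The name scratch_value is made up for this example, so nothing you care about gets removed.

```r
scratch_value <- 99        # create a throwaway object
exists("scratch_value")    # TRUE: R has something stored under this name
rm(scratch_value)          # remove it from R's short-term memory
exists("scratch_value")    # FALSE: the name is gone
```

The rm() command is handy for cleaning up, but there is no undo, so use it with care.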
Why does R have two different ways of accepting commands, one that uses parentheses like do_something()
and one specific for the ‘assign the name’ command? Because the ‘assign the name’ command is very similar to what you learned in algebra class when you were told you could give names to numbers, like \(a=1\). R’s syntax assigns to the value (on the right side) the name (on the left). So, really, R is doing something like assign('my_age', 55)
when you type my_age <- 55
. In fact, they’re the same thing. Try it.
assign("my_age", 55)
my_age # Tells R to print the current value assigned to 'my_age'
## [1] 55
your_age <- 25
your_age # Print the value assigned to 'your_age'
## [1] 25
So, if you want to be consistent in commanding R, you can always use the command()
syntax. Even stranger, this works, too:
(`<-`("your_age", 27)) # Wonky way to show that the assignment operator is a command
## [1] 27
# Note that we have to surround the `<-` with back ticks.
your_age
## [1] 27
By the way, R accepts the equal sign =
character to assign names (on the left) to values (on the right), just like the convention in math.
Don’t use =. You can, but don’t. Use <-.
Why? It’s a recommendation about style, not substance. But style matters. It’s like saying ‘like’ all the time. Like people will like understand you, but they’ll wonder why you like say like all the time when you really need not. It’s also a topic that will get you in a ‘flame war’. Avoid flame wars. Ok, there are more substantive reasons we’ll talk about later.
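Here is a sketch of one substantive difference, using a made-up object name y. Inside a command's parentheses, the equal sign and the arrow do different jobs.

```r
mean(x = c(2, 4, 6))    # '=' just labels the argument that mean() calls x
mean(y <- c(2, 4, 6))   # '<-' creates an object named y, then averages it
y                       # y is now stored in R's short-term memory
```

Both commands report the same mean, but only the second one leaves a new object behind. That is rarely what you want inside a command, which is one more reason to keep the two symbols for separate jobs.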
We can also ask R questions. By typing comparison operators (==
, !=
, >
, <
, etc.) we can ask R true/false questions.
1 == 0
## [1] FALSE
sqrt(9) < 4
## [1] TRUE
"rick" == "rick"
## [1] TRUE
"richard" > "rick"
## [1] FALSE
"richard" != "rick"
## [1] TRUE
R will respond with TRUE
or FALSE
to these questions. Yes, TRUE
and FALSE
must be in all caps.
"true" == TRUE
## [1] FALSE
"False" == FALSE
## [1] FALSE
This is because TRUE
and FALSE
are not just character strings to R. They have special meaning as Boolean (logical) variables. We’ll say more about them later.
Notice that ==
asks if the two things are equal and !=
asks if they are unequal. This is different from what you learned in math class, where \(a=1\) could be either a statement (or assignment command) or a question. If computers had invented math, rather than the reverse, they would have separated making statements or commands from asking questions–for clarity.
If you want to think of this in a “commanding” way, you could say that you are commanding R to ‘compare these two things and print TRUE or FALSE depending on the outcome of the comparison’.
R has rules for names. You’ll be fine if you start each name with a letter, use only letters, numbers, and the underscore character (_), and avoid spaces.
So, bigly, good_name, a_longer_good_name, Good_name1, and even thisIsCamelCaseNoUnderscores work, but not !good, bad name, or 1_very_bad_name. There are other rules and exceptions, but this is a good place to start.
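If you are not sure about a name, you can ask R. This sketch relies on the make.names() command, which rewrites a string into a legal name. The assumption here is simple: if make.names() leaves a string unchanged, the string was already a legal name.

```r
candidates <- c("bigly", "good_name", "Good_name1", "!good", "bad name",
                "1_very_bad_name")

# Compare each candidate with its 'repaired' version
is_legal <- make.names(candidates) == candidates
names(is_legal) <- candidates   # label each answer with the candidate name
is_legal
```

The first three candidates come back TRUE and the last three come back FALSE.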
You may be unimpressed with our conversations with R, at least so far. But let’s recap.
We can talk about numbers: 75
or 4^2
(4 to the 2nd power), 3.14159
, or 1.5e-3
(0.0015). Notice that we type numbers ‘in the nude’ or without surrounding them with quotation marks. We can surround numbers with parentheses, though: (75) == 75
. Here, parentheses function just like they do in math, so \((10-8)+1\) is equal to three:
(10 - 8) + 1 == 3
## [1] TRUE
So, R is a very big pocket calculator.
You can use R for other calculations: subtraction (-
), multiplication (*
), division (/
), or exponentiation (^
or **
, i.e. \(4^2\) is written 4^2
or 4**2
in R).
We can also talk about strings, or sets of letters, numbers, and characters: 'Fourscore and seven years ago'
or 'RStudio'
or 'R 3.5'
. Unlike names, strings can have spaces (and special characters like $ or !) in them or start with numbers–as long as the string starts and ends with quotation marks. The quotation marks tell R where the string starts and where it ends.
my_name <- "Rick"
my_quest <- "The Holy Grail"
favorite_color <- "Blue, no green"
R treats strings, numbers, and logical values (TRUE
and FALSE
) as different beasts most of the time.
"one" == "1"
## [1] FALSE
# More predictable way of checking whether two things are identical
identical(12, "12")
## [1] FALSE
We can also talk about commands, so help('sum')
commands R to give us helpful information about the sum()
command.
help("sum")
Notice that we put the name of the command in quotation marks. That tells R that we are asking it to ‘help us learn about the command “sum”’, not giving R the command to sum something.
You can use either single or double quotation marks for strings, but I recommend using double ones. And don’t mix and match:
'R will hate this"
These are the basic building blocks: numbers, strings, logical values, names, and commands.
And very soon, we’ll see how we can talk about collections of numbers and strings (e.g., data) and sequences of commands (scripts or programs).
Let’s get our fingers working on a tutorial using the swirl
package. Install swirl
.
install.packages('swirl')
And load it.
library(swirl)
Then, start the program.
swirl()
Tell swirl your name, if you like. Then choose 1: R Programming
by entering the number 1
and pressing return
. Then choose 1: Basic Building Blocks
by entering the number 1
and pressing return
.
Your console should look something like this:
Work through the lesson and have fun!
swirl
The following commands may be useful when you are running swirl
:
| When you are at the R prompt (>):
| -- Typing skip() allows you to skip the current question.
| -- Typing play() lets you experiment with R on your own;
| swirl will ignore what you do...
| -- UNTIL you type nxt() which will regain swirl's
| attention.
| -- Typing bye() causes swirl to exit. Your progress will
| be saved.
| -- Typing main() returns you to swirl's main menu.
| -- Typing info() displays these options again.
Let’s pick up the pace. In this section, we’ll talk more about the various types of objects R can store and how to manipulate them.
At the deepest level, everything, and I mean everything, in a computer is represented by sequences of 1’s and 0’s: data, programs, images, sounds, videos, everything. So, the computer needs to know what type of thing a given sequence of 1’s and 0’s is defined to be in order to know how to do computations on it. Does the sequence of binary digits (bits) “00110000” mean the number 48 (in binary or base 2), or the ASCII character “0”, or the “address” in the computer’s memory where a given piece of data is stored, or a sum() function or command? It could be any of these. The computer doesn’t know, but working with R and your operating system (Mac OS, Windows, *nix), the computer can figure this out.
Why should you care? Remember we said that computers are super-literal? That’s why the computer returns FALSE when we say identical('1', 1), and why it reports an error when we try to add things that can’t be added.
"1" + 1
The character '1'
is a completely different thing (different type) to the computer than the number 1
. So, when humans interact with the computer, we sometimes forget this, and that can cause errors and (to humans) grief.
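When you know that a string really holds a number, you can tell R to convert it. A minimal sketch, using a made-up object name:

```r
one_as_text <- "1"
as.numeric(one_as_text) + 1      # convert the string to a number first: 2
identical(as.numeric("1"), 1)    # now the two really are the same: TRUE
as.character(1)                  # and we can go the other way: "1"
```

R won't guess that you meant the number. You have to say so.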
Why 1’s and 0’s? One reason is pragmatic: It’s relatively easy to make electronic circuits that are either “on” or “off”. If we decide that “on” means 1, and “off” means 0, then we can do computations with these states.
What sorts of computations? Algebraic computations (addition, subtraction, multiplication, division, etc.) and Boolean algebra. Boolean algebra (or Boolean logic) takes as input the logical values of TRUE or FALSE and combines them using combinations of three basic operations: AND, OR, and NOT. R has operators–simple programs or functions (&, |, and !)–that implement Boolean algebra.
TRUE & TRUE # Ampersand '&' is the AND operator
## [1] TRUE
TRUE & FALSE
## [1] FALSE
TRUE | FALSE # The pipe '|' is the OR operator
## [1] TRUE
FALSE | FALSE
## [1] FALSE
!TRUE # the exclamation point '!' is the NOT operator
## [1] FALSE
It turns out that these simple operations are incredibly powerful when combined together. Basically, you can create simple arithmetic operations (addition, multiplication) from Boolean (AND, OR, NOT) elements. And from there, we can make any ‘computable’ function. What’s computable and what’s not? That is a deep and unresolved question we won’t touch, but one that computer scientists wrestle with all the time.
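To make the jump from logic to arithmetic a little more concrete, here is a sketch of a ‘half adder’, which adds two one-bit numbers using only AND, OR, and NOT. The name half_adder is made up, and the function() syntax is something we haven't covered yet, so treat this as a preview.

```r
# Add two one-bit numbers (TRUE = 1, FALSE = 0) with Boolean operators only
half_adder <- function(a, b) {
  sum_bit   <- (a | b) & !(a & b)   # 1 when exactly one input is 1
  carry_bit <- a & b                # 1 when both inputs are 1
  c(carry = carry_bit, sum = sum_bit)
}

half_adder(TRUE, TRUE)    # 1 + 1 is binary 10: carry TRUE, sum FALSE
half_adder(TRUE, FALSE)   # 1 + 0 is binary 01: carry FALSE, sum TRUE
```

Chain enough of these together and you can add numbers of any size.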
For more practice with these concepts, try this swirl
lesson: 8: Logic
.
There could be a lot more to say about types, but we won’t here. Instead, we’ll say that R does a lot of checking ‘under the hood’ to make sure we don’t try to do things that don’t make sense, like adding a character and a number. R does this by creating “classes” of things that obey certain rules.
is.numeric(99)
## [1] TRUE
is.numeric("99")
## [1] FALSE
is.character("99")
## [1] TRUE
ten_number <- 10
ten_character <- "10"
is.numeric(ten_number)
## [1] TRUE
is.numeric(ten_character)
## [1] FALSE
is.character(ten_number)
## [1] FALSE
is.character(ten_character)
## [1] TRUE
The is.numeric() command tells R to tell us whether the input is ‘numeric’, and the is.character() command tells it to report whether the input contains a set of characters.
The is.logical()
command reports whether its input is a Boolean (TRUE
/FALSE
) variable, and the is.na()
command reports whether the item is a special type R reserves for missing values. Yes, R has a special value called NA
it assigns to data elements that are missing for one reason or another:
my_data_are_missing <- NA
is.na(my_data_are_missing)
## [1] TRUE
Note that NA
is a ‘naked’ value like TRUE
or FALSE
, that is, it is not enclosed in quotation marks. This will be extremely useful later, as we can tell R exactly how to handle ‘missingness’ in our data.
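Here is a small sketch of what handling ‘missingness’ can look like. The heights are made-up values, and the second one is missing.

```r
heights <- c(150, NA, 170)
is.na(heights)                # which elements are missing?
mean(heights)                 # NA: R will not guess the missing value
mean(heights, na.rm = TRUE)   # remove the NA first, then average: 160
```

Many commands accept na.rm = TRUE, but not all of them do, so check the help page for the command you are using.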
Beyond these simple ones, R implements many different object classes. To see some of them, type is.
then pause or type the ‘tab’ character. This will show you a scrollable list of all of the is.
commands you can try out on a given object.
The first item on the list tests to see if the input is an ‘array’. That is what we turn to next.
You’ll often want to make groups of things or compute over them.
The c() command
The ‘combine’ or c() command is just for this purpose.
my_numbers <- c(1, 2, 3, 4, 5)
my_numbers
## [1] 1 2 3 4 5
You can also combine characters.
(my_initials <- c("R", "O", "G"))
## [1] "R" "O" "G"
# Surround an expression with parentheses to print it
(my_name <- c("Rick", "Owen", "Gilmore"))
## [1] "Rick" "Owen" "Gilmore"
But if you mix numbers and characters, the c()
command will force the outputs to be characters.
(lyrics <- c("It's easy as", 123))
## [1] "It's easy as" "123"
We’ll show you how to make a list that mixes numbers and characters very soon.
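You can check what c() decided to do by asking for the class of the result. A quick sketch:

```r
class(c(1, 2, 3))     # "numeric": all numbers, so they stay numbers
class(c("a", 1))      # "character": the number was turned into a string
class(c(TRUE, 2))     # "numeric": TRUE was turned into the number 1
c(TRUE, 2)
```

When the pieces don't match, c() picks a single type that can hold all of them.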
But for now, if you just want to make one big character string from separate parts, use the paste0() command, as in paste0("My ", "country ", "tis ", "of thee...")
paste0("My ", "country ", "tis ", "of thee...")
## [1] "My country tis of thee..."
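The paste0() command has a sibling, paste(), that puts a separator between the parts for you. A short sketch with made-up strings:

```r
paste("My", "country", "tis", "of", "thee")   # a space is the default separator
paste("2018", "08", "15", sep = "-")          # choose a different separator
paste(c("R", "O", "G"), collapse = "")        # glue the items of one vector together
```

Use sep when you hand paste() several separate strings, and collapse when the strings are already combined in one vector.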
Notice that the combine (c()
) command puts things in a long row. How long is that row?
my_name
## [1] "Rick" "Owen" "Gilmore"
length(my_name)
## [1] 3
Notice that the length()
command reports the number of individual components.
# The colon ':' operator (a:b) tells R to generate a sequence from a to b
(more_numbers <- c(-6:15))
## [1] -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
length(more_numbers)
## [1] 22
And it works for combined sets of character strings.
length(c("a", "bc", "def", "ghij"))
## [1] 4
The sequence seq() and replicate rep() commands
There is a command for creating orderly sequences.
# The sequence 'seq()' command can also do this
(more_numbers_alt <- seq(from = -6, to = 15))
## [1] -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
(just_evens <- seq(from = -6, to = 15, by = 2))
## [1] -6 -4 -2 0 2 4 6 8 10 12 14
And you can go backward.
seq(from = 5, to = -5, by = -1)
## [1] 5 4 3 2 1 0 -1 -2 -3 -4 -5
As you might predict, the replicate rep()
command makes copies of things.
rep(7, times = 7)
## [1] 7 7 7 7 7 7 7
all_you_need_is <- rep("Love", times = 3)
These commands create a vector or 1 dimensional set of items–they have a single length. But what about other types of structures with more than one dimension?
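Both commands have a few more options worth knowing about. A brief sketch:

```r
rep(c("a", "b"), times = 2)             # repeat the whole set: "a" "b" "a" "b"
rep(c("a", "b"), each = 2)              # repeat each item in turn: "a" "a" "b" "b"
seq(from = 0, to = 1, length.out = 5)   # ask for 5 evenly spaced numbers
```

With length.out, you tell seq() how many numbers you want and let R work out the step size.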
Take a few minutes to complete the swirl
lesson related to this topic. If you’ve exited swirl, start it again.
swirl()
Enter your name (again), but this time in response to | Would you like to continue with one of these lessons?
select 2: No. Let me start something new.
.
Choose 1: R Programming
and then 3: Sequences of Numbers
.
If you wish, you may also try completing 4: Vectors
.
You can create a 2-dimensional array of numbers or matrix.
(square_matrix <- matrix(1:16, nrow = 4))
## [,1] [,2] [,3] [,4]
## [1,] 1 5 9 13
## [2,] 2 6 10 14
## [3,] 3 7 11 15
## [4,] 4 8 12 16
(not_square_matrix <- matrix(1:16, ncol = 2))
## [,1] [,2]
## [1,] 1 9
## [2,] 2 10
## [3,] 3 11
## [4,] 4 12
## [5,] 5 13
## [6,] 6 14
## [7,] 7 15
## [8,] 8 16
Notice that the nrow and ncol values tell R what shape the matrix should have. What happens if nrow * ncol is not equal to the length of the input?
(bad_matrix <- matrix(1:25, nrow = 4))
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 1 5 9 13 17 21 25
## [2,] 2 6 10 14 18 22 1
## [3,] 3 7 11 15 19 23 2
## [4,] 4 8 12 16 20 24 3
R warns you that there is a problem, but it tries its best to create a matrix of the shape you want by recycling old values. You probably won’t make many matrices on your own.
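If you do make one, notice that R fills a matrix down the columns unless you ask otherwise. A small sketch with made-up names:

```r
(by_col <- matrix(1:6, nrow = 2))                  # default: fill down each column
(by_row <- matrix(1:6, nrow = 2, byrow = TRUE))    # fill across each row instead
```

Same numbers, same shape, different arrangement. It pays to print a matrix and look at it before you use it.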
A ‘matrix-like’ object with more than 2 dimensions is called an array.
(my_array <- array(1:24, dim = c(2, 3, 4)))
## , , 1
##
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
##
## , , 2
##
## [,1] [,2] [,3]
## [1,] 7 9 11
## [2,] 8 10 12
##
## , , 3
##
## [,1] [,2] [,3]
## [1,] 13 15 17
## [2,] 14 16 18
##
## , , 4
##
## [,1] [,2] [,3]
## [1,] 19 21 23
## [2,] 20 22 24
Notice that the dim parameter tells R how to build the array. Here it asks for 4 matrices (‘slices’), each with 2 rows and 3 columns.
By the way, if your data is a matrix, the length()
command may not always work the way you want it to.
length(my_array)
## [1] 24
But the dimension command dim()
does.
dim(my_array) # Usually what you actually want
## [1] 2 3 4
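The same holds for plain matrices, where the nrow() and ncol() commands give you one dimension at a time. A sketch with a made-up matrix:

```r
small_matrix <- matrix(1:6, nrow = 2)
length(small_matrix)   # 6: the total number of elements
dim(small_matrix)      # 2 3: rows first, then columns
nrow(small_matrix)     # 2
ncol(small_matrix)     # 3
```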
You may find yourself with matrix or array data like this, but I find it easy to get confused about what the different dimensions mean and in what order they are produced.
It’s possible to give the dimensions some plausible meanings using the dimnames
parameter.
# dimnames must be a list with each component a vector of labels that has
# the same length as the dimensions
my_named_array <- array(1:24, dim = c(2, 3, 4), dimnames = list(c("M", "F"),
c("Mon", "Wed", "Fri"), c("ht", "wt", "shoe", "IQ")))
my_named_array
## , , ht
##
## Mon Wed Fri
## M 1 3 5
## F 2 4 6
##
## , , wt
##
## Mon Wed Fri
## M 7 9 11
## F 8 10 12
##
## , , shoe
##
## Mon Wed Fri
## M 13 15 17
## F 14 16 18
##
## , , IQ
##
## Mon Wed Fri
## M 19 21 23
## F 20 22 24
This is ‘old school’ though. These days, most R data analysts work with data frames. Let’s talk about them now.
R has a powerful way of organizing data that makes it both human- and machine-friendly: data frames. Data frames are like arrays and matrices, but much more useful for humans because they can contain mixtures of numbers and character strings. Under the hood, a data frame is a two-dimensional table, but its columns can encode as many dimensions of the data as you like, just as arrays do.
# Create the data frame
my_df <- data.frame(data = 1:24, gender = c("M", "F"), day = c("Mon", "Wed",
"Fri"), measure = c("ht", "wt", "shoe", "IQ"))
# Print the data frame
my_df
## data gender day measure
## 1 1 M Mon ht
## 2 2 F Wed wt
## 3 3 M Fri shoe
## 4 4 F Mon IQ
## 5 5 M Wed ht
## 6 6 F Fri wt
## 7 7 M Mon shoe
## 8 8 F Wed IQ
## 9 9 M Fri ht
## 10 10 F Mon wt
## 11 11 M Wed shoe
## 12 12 F Fri IQ
## 13 13 M Mon ht
## 14 14 F Wed wt
## 15 15 M Fri shoe
## 16 16 F Mon IQ
## 17 17 M Wed ht
## 18 18 F Fri wt
## 19 19 M Mon shoe
## 20 20 F Wed IQ
## 21 21 M Fri ht
## 22 22 F Mon wt
## 23 23 M Wed shoe
## 24 24 F Fri IQ
Notice that R has turned our 3 dimensional data (gender, day, measure) into a 2 dimensional table. The rows contain complete observations; the columns the variables. This ‘rectangular’ data shape is called ‘tidy’. It may seem wasteful to repeat ‘M’ and ‘F’ in each row, but it’s crystal clear what we mean. We’ll talk more about the virtues of tidy data tomorrow, but keeping your data tidy (observations in rows, variables in columns, longer than wider) is a best practice you’ll want to adopt.
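One caution: the data.frame() call above recycles each short vector in step, so not every combination of gender, day, and measure shows up (there is no row with F, Mon, and ht, for example). If you already have a labeled array, one way to get exactly one row per cell is to convert it. This is a sketch, not the only way to do it. I've rebuilt the array here with named dimensions (gender, day, measure), which is my addition, so the new columns get sensible names.

```r
my_named_array <- array(1:24, dim = c(2, 3, 4),
                        dimnames = list(gender = c("M", "F"),
                                        day = c("Mon", "Wed", "Fri"),
                                        measure = c("ht", "wt", "shoe", "IQ")))

# as.table() plus as.data.frame() makes one row for every cell in the array
long_df <- as.data.frame(as.table(my_named_array), responseName = "data")
head(long_df)   # the first few rows
nrow(long_df)   # 24 rows, one for each gender/day/measure combination
```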
Data frames are one way to combine numbers and character strings. They work great for data, but not so much for other, less well-structured, sorts of information. Lists are just what you’d expect: flexible aggregations of things.
(my_list <- list("donald", 72, "1600 Pennsylvania Ave"))
## [[1]]
## [1] "donald"
##
## [[2]]
## [1] 72
##
## [[3]]
## [1] "1600 Pennsylvania Ave"
# Can give names to the components
(my_list_wnames <- list(first_name = "donald", age = 72, address = "1600 Pennsylvania Ave"))
## $first_name
## [1] "donald"
##
## $age
## [1] 72
##
## $address
## [1] "1600 Pennsylvania Ave"
R uses lists extensively, especially when reporting the results of various sorts of statistical tests.
# Take a random (normally distributed) sample of numbers with a mean = 3
r_100 <- rnorm(n = 100, mean = 3, sd = 1)
(t_test_r_100 <- t.test(r_100))
##
## One Sample t-test
##
## data: r_100
## t = 27.409, df = 99, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 2.942701 3.402006
## sample estimates:
## mean of x
## 3.172353
is.list(t_test_r_100)
## [1] TRUE
names(t_test_r_100)
## [1] "statistic" "parameter" "p.value" "conf.int" "estimate"
## [6] "null.value" "alternative" "method" "data.name"
So, it turns out that that nice print out is made up of a list with different named components. That will come in handy soon.
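Here is a sketch of why that is handy: you can pull out just the piece you need by name. The set.seed() call is my addition; it uses an arbitrary number so the ‘random’ sample is the same each time you run the code. Your numbers will differ from the printout above.

```r
set.seed(2018)   # arbitrary seed, only so the results repeat
r_100 <- rnorm(n = 100, mean = 3, sd = 1)
t_test_r_100 <- t.test(r_100)

t_test_r_100$p.value         # pull out one named component with $
t_test_r_100[["estimate"]]   # double square brackets work, too
t_test_r_100$conf.int[1]     # the lower end of the confidence interval
```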
Before we break for lunch, let’s get our hands dirty again by taking the relevant swirl
lesson: 7: Matrices and Data Frames
.
We’ve talked about how to put things into vectors, arrays, matrices, data frames, and lists. How do we get things out? It’s easy.
We just tell R the index (or ‘address’) of the item.
(one_to_ten <- 1:10)
## [1] 1 2 3 4 5 6 7 8 9 10
one_to_ten[5]
## [1] 5
Notice that we put the index in square brackets []
, not parentheses. Again, square brackets are for indexing. Parentheses are for commands and functions. You’ll make this mistake a lot. I still do. Don’t worry. R will tell you.
# See what happens when you type this
(one_to_ten(5))
What about matrices or arrays? They have more than one dimension. Yes, you’ll need more than one index.
dim(my_array)
## [1] 2 3 4
my_array[2, 3, 4] # The last item in each dimension
## [1] 24
my_array[1, 1, 1] # The first item in each dimension
## [1] 1
Of course, you have to keep track of how to map the indices to the row/column/table values. In fact, Dr. Hallquist tells me that he strongly discourages the use of numeric indices to extract information from arrays. It’s too easy to make a mistake. So, learn from our mistakes. Make your own new ones.
If you label your matrices and arrays, you can do something slick.
my_named_array["M", "Fri", "IQ"]
## [1] 23
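You can also leave an index empty to keep everything along that dimension. A sketch, rebuilding the labeled array so the example runs on its own:

```r
my_named_array <- array(1:24, dim = c(2, 3, 4), dimnames = list(c("M", "F"),
    c("Mon", "Wed", "Fri"), c("ht", "wt", "shoe", "IQ")))

my_named_array["F", , "wt"]   # every day, but only for F and wt: 8 10 12
my_named_array[, "Mon", ]     # a 2 x 4 matrix: both genders, every measure
```

The commas still matter. R counts them to figure out which dimension each label belongs to.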
You can pull data out of data frames using numeric indices, but you have to remember that your data frame is 2 dimensional, so you need to use something like this: my_df[row_indices, col_indices].
my_df
## data gender day measure
## 1 1 M Mon ht
## 2 2 F Wed wt
## 3 3 M Fri shoe
## 4 4 F Mon IQ
## 5 5 M Wed ht
## 6 6 F Fri wt
## 7 7 M Mon shoe
## 8 8 F Wed IQ
## 9 9 M Fri ht
## 10 10 F Mon wt
## 11 11 M Wed shoe
## 12 12 F Fri IQ
## 13 13 M Mon ht
## 14 14 F Wed wt
## 15 15 M Fri shoe
## 16 16 F Mon IQ
## 17 17 M Wed ht
## 18 18 F Fri wt
## 19 19 M Mon shoe
## 20 20 F Wed IQ
## 21 21 M Fri ht
## 22 22 F Mon wt
## 23 23 M Wed shoe
## 24 24 F Fri IQ
my_df[1, 4] # row 1, column 4
## [1] ht
## Levels: ht IQ shoe wt
my_df[1:3, 3:4] # rows 1-3, and columns 3-4
## day measure
## 1 Mon ht
## 2 Wed wt
## 3 Fri shoe
my_df[, 2:3] # All of the rows from cols 2 and 3
## gender day
## 1 M Mon
## 2 F Wed
## 3 M Fri
## 4 F Mon
## 5 M Wed
## 6 F Fri
## 7 M Mon
## 8 F Wed
## 9 M Fri
## 10 F Mon
## 11 M Wed
## 12 F Fri
## 13 M Mon
## 14 F Wed
## 15 M Fri
## 16 F Mon
## 17 M Wed
## 18 F Fri
## 19 M Mon
## 20 F Wed
## 21 M Fri
## 22 F Mon
## 23 M Wed
## 24 F Fri
my_df[1, ] # All of the columns from row 1
## data gender day measure
## 1 1 M Mon ht
An often more useful way to index data in data frames is to use the names of the columns. We can extract entire columns using the dollar sign $
or ‘extract’ operator.
names(my_df) # What are the column names?
## [1] "data" "gender" "day" "measure"
my_df$data
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
## [24] 24
my_df$gender
## [1] M F M F M F M F M F M F M F M F M F M F M F M F
## Levels: F M
If you read the help for this operator (help('$')) you’ll note that you can also use it to create new values or replace old ones.
# Make new lower case gender variable
(my_df$new_gender <- tolower(my_df$gender))
## [1] "m" "f" "m" "f" "m" "f" "m" "f" "m" "f" "m" "f" "m" "f" "m" "f" "m"
## [18] "f" "m" "f" "m" "f" "m" "f"
Knowing this, we can use Boolean (true/false) expressions to create a vector of TRUE/FALSE values that correspond to some condition.
(only_males <- (my_df$gender == "M"))
## [1] TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE
## [12] FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE
## [23] TRUE FALSE
Let’s unpack this. We’re selecting all of the gender values from my_df with $, so my_df$gender returns only the gender column from the data frame. Then we ask R to test (==) which of those are males ('M'). R returns a vector of TRUE/FALSE values that correspond to whether each item in my_df$gender is equal to M.
With this in hand, we can then select from my_df all of the data for the males.
my_df[only_males, ] # All rows with TRUE values for only_males
## data gender day measure new_gender
## 1 1 M Mon ht m
## 3 3 M Fri shoe m
## 5 5 M Wed ht m
## 7 7 M Mon shoe m
## 9 9 M Fri ht m
## 11 11 M Wed shoe m
## 13 13 M Mon ht m
## 15 15 M Fri shoe m
## 17 17 M Wed ht m
## 19 19 M Mon shoe m
## 21 21 M Fri ht m
## 23 23 M Wed shoe m
Notice that we put only_males as the first index in my_df, before the comma: my_df[only_males, ]. This means that R will return the rows of the data frame where only_males has TRUE values. Recall that data frames are two dimensional. The syntax for indexing them is of this form: df[row_index, column_index]. We didn’t specify which columns to return (the column index is empty), so R returns all of the columns.
What about the females?
# M and F are mutually exclusive, so F is just the opposite of M
only_females <- !(only_males)
my_df[only_females, ]
## data gender day measure new_gender
## 2 2 F Wed wt f
## 4 4 F Mon IQ f
## 6 6 F Fri wt f
## 8 8 F Wed IQ f
## 10 10 F Mon wt f
## 12 12 F Fri IQ f
## 14 14 F Wed wt f
## 16 16 F Mon IQ f
## 18 18 F Fri wt f
## 20 20 F Wed IQ f
## 22 22 F Mon wt f
## 24 24 F Fri IQ f
my_df[!(only_males), ] # Equivalent
## data gender day measure new_gender
## 2 2 F Wed wt f
## 4 4 F Mon IQ f
## 6 6 F Fri wt f
## 8 8 F Wed IQ f
## 10 10 F Mon wt f
## 12 12 F Fri IQ f
## 14 14 F Wed wt f
## 16 16 F Mon IQ f
## 18 18 F Fri wt f
## 20 20 F Wed IQ f
## 22 22 F Mon wt f
## 24 24 F Fri IQ f
We can mix and match logical indices.
fav_day_mon <- (my_df$day == "Mon")
my_df[fav_day_mon & only_females, ]
## data gender day measure new_gender
## 4 4 F Mon IQ f
## 10 10 F Mon wt f
## 16 16 F Mon IQ f
## 22 22 F Mon wt f
And we can select one or more specific columns by name.
my_df[fav_day_mon, c("measure", "gender")]
## measure gender
## 1 ht M
## 4 IQ F
## 7 shoe M
## 10 wt F
## 13 ht M
## 16 IQ F
## 19 shoe M
## 22 wt F
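To practice combining logical indices without scrolling through 24 rows, here is a minimal sketch on a tiny data frame. The name small_df and its four rows are made up for illustration; only the column names are borrowed from my_df.

```r
# Hypothetical mini data frame with the same column names as my_df.
small_df <- data.frame(gender  = c("M", "F", "M", "F"),
                       day     = c("Mon", "Mon", "Wed", "Fri"),
                       measure = c("ht", "wt", "shoe", "IQ"))

is_female <- small_df$gender == "F"
is_monday <- small_df$day == "Mon"

both   <- small_df[is_female & is_monday, ]   # rows where both are TRUE
either <- small_df[is_female | is_monday, ]   # rows where at least one is TRUE
some   <- small_df[is_monday, c("measure", "gender")]  # chosen rows, two named columns

which(is_female)                              # the row numbers where is_female is TRUE
```

The & operator means 'and', | means 'or', and which() turns a TRUE/FALSE vector back into row numbers if you ever need them.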
It can be confusing when to surround a variable name with quotations and when not to. When you use the extract ($) operator, don’t surround the variable you’re extracting with quotations: my_df$gender not my_df$'gender'. When you’re indexing a variable by name, surround it with quotation marks: my_df[, 'gender'] not my_df[, gender]. Why? If you aren’t using $ or quotations '', R won’t look for gender inside my_df, but will look for it, and fail to find it, among the objects it has already stored.
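The quoting rules can be sketched side by side. The data frame and the variable col_wanted below are hypothetical names chosen for this example.

```r
small_df <- data.frame(gender = c("M", "F"), score = c(10, 12))

via_dollar  <- small_df$gender        # $ with a bare column name
via_bracket <- small_df[, "gender"]   # [ ] with a quoted column name
via_double  <- small_df[["gender"]]   # [[ ]] with a quoted name also works

# A column name stored in another variable goes inside [ ] without quotes,
# because R looks up the variable and finds the string "gender" in it.
col_wanted   <- "gender"
via_variable <- small_df[, col_wanted]

# An unquoted name that is not a stored object makes R complain.
# tryCatch() keeps the error from stopping the script and saves the message.
oops <- tryCatch(small_df[, not_a_stored_object],
                 error = function(e) conditionMessage(e))
oops
```

All four successful versions return the same column; the last call fails because R goes looking for an object called not_a_stored_object and there isn’t one.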
By the way, you don’t have to create a separate logical variable.
my_df[my_df$gender == "M", ]
## data gender day measure new_gender
## 1 1 M Mon ht m
## 3 3 M Fri shoe m
## 5 5 M Wed ht m
## 7 7 M Mon shoe m
## 9 9 M Fri ht m
## 11 11 M Wed shoe m
## 13 13 M Mon ht m
## 15 15 M Fri shoe m
## 17 17 M Wed ht m
## 19 19 M Mon shoe m
## 21 21 M Fri ht m
## 23 23 M Wed shoe m
But it may be easier for you (or someone else) to read your code if you make a separate logical variable. Be kind to your future self.
Remember our \(t\) test from a bit ago?
(t_test_r_100)
##
## One Sample t-test
##
## data: r_100
## t = 27.409, df = 99, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 2.942701 3.402006
## sample estimates:
## mean of x
## 3.172353
What if we want to use these data without having to copy and paste? Can we pull the values out?
(names(t_test_r_100))
## [1] "statistic" "parameter" "p.value" "conf.int" "estimate"
## [6] "null.value" "alternative" "method" "data.name"
t_test_r_100$statistic # t value
## t
## 27.40939
t_test_r_100$parameter # df value
## df
## 99
t_test_r_100$p.value # p value
## [1] 5.013097e-48
So, using this information along with some syntax we’ll learn later, we can retrieve the information we need from R without error-prone copying and pasting:
Our randomly generated set of numbers has a mean value of 3.1723531, \(t\)(99) = 27.4093864, \(p\) < \(5.0130968 \times 10^{-48}\).
We’ll talk about how to use R Markdown to implement this technique later this afternoon.
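Until then, here is one way to sketch the same idea in plain R using sprintf(). This is not the R Markdown technique itself, just an illustration of pulling values out of the result by name. The data are regenerated here with an arbitrary seed and an assumed mean of 3, so the numbers will not match the output above.

```r
set.seed(1)                                  # arbitrary seed so the numbers repeat
r_100 <- rnorm(n = 100, mean = 3, sd = 1)    # assumed stand-in for the earlier data
t_res <- t.test(r_100)

# Build a sentence fragment from the named pieces of the result.
report <- sprintf("mean = %.2f, t(%d) = %.2f, p = %.3g",
                  t_res$estimate,
                  as.integer(t_res$parameter),
                  t_res$statistic,
                  t_res$p.value)
report
```

Because the values come straight from t_res, re-running the analysis updates the sentence automatically, and rounding is controlled in one place.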
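Here are some rules of thumb about indexing different objects:
- Vectors, matrices, arrays, and data frames can be indexed by position, by name, or by a logical vector using [].
- Named parts of data frames and lists, such as the columns of my_df or the pieces of a \(t\) test result, can be extracted with the $ operator.
If you want additional practice with some of these ideas, try the swirl lesson 6: Subsetting Vectors.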
We’ve learned how groups of things can be made and how to extract the parts of aggregate objects. Now let’s see what happens when we combine groups of commands together.
Base R can do quite a bit, but the real power of R is that there is a huge community of developers creating sets of powerful and useful functions that you can install and use. Functions are specialized commands to R that take inputs and generate outputs. Sets of functions that relate to some topic, technique, or theme are called ‘packages’. Some packages also include datasets. To view the packages that are already installed, press ctrl-7 to view the ‘Packages’ panel. The packages with checkmarks are currently loaded in R.
As an illustration, let’s look at the help for R’s ‘Base’ Package, loaded by default when you start R.
library(help = "base")
We see that there are all sorts of basic functions here, including the sum() and seq() commands we used earlier and mathematical functions that can come in handy in other contexts.
You might want to look at the help for the ‘stats’ package, which is also loaded by default:
library(help = "stats")
You’ll see a number of functions that look like they do familiar things, the t.test() function, for example. Unless you customize your R environment, R will automatically load the following packages at start-up: stats, graphics, grDevices, utils, datasets, methods, and base. When there is a new version of R released, it involves changes to these core packages.
Packages often depend on other packages; these are called dependencies. A well designed package specifies what dependencies it has, and the default behavior of R is to download and install those, too.
The ‘Install Packages…’ item in the ‘Tools’ menu opens a window where you can specify the package or packages you want R to install. To install a package from the Console, use the install.packages() command. You may note that R assumes you want to download packages from the Comprehensive R Archive Network (CRAN) website. This is the ‘official’ repository for R packages that have passed a certain level of automated checking and review by the CRAN maintainers. Some very useful packages have not yet made it into CRAN and are hosted elsewhere, such as on GitHub.
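If you put install.packages() in a script, it will download the package every time the script runs. A common pattern is to install only when the package is missing. The helper below is a hypothetical convenience function (install_if_missing is my name for it, not part of R); it relies on base R’s requireNamespace(), which reports whether a package is available.

```r
# Hypothetical helper: install a package only when it is not already available.
install_if_missing <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)   # downloads from CRAN by default
  }
  invisible(requireNamespace(pkg, quietly = TRUE))
}

ok <- install_if_missing("stats")   # stats ships with R, so nothing is downloaded
ok
```

Calling install_if_missing("ggplot2") would work the same way, downloading ggplot2 only on a machine that doesn’t have it yet.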
Once you’ve installed a package, you’ll need to tell R that you want to use the commands in the package. The clearest way to do this is to use the package name when you call a particular function. So, if we’ve already installed the ggplot2 package via install.packages('ggplot2'), we can do this:
ggplot2::qplot(rnorm(100))
This tells R to use the qplot() function from the ggplot2 package to plot a set of 100 random normal numbers generated by rnorm().
Try typing ggplot2:: and pausing or hitting the tab key. You’ll see a pop-up list of functions in the ggplot2 package. Again, this is the best way to call a function from a package because it is unambiguous about what function we want.
An older, alternative way is to load the package into memory using the library() command. If we enter library('ggplot2') at the Console, then we can type the simpler expression qplot(rnorm(100)) to create the plot. This seems like less typing.
So, why do I recommend the other way? Because some packages use the same names for very similar functions. If you load both into memory, R won’t ask which one you want; it will use the function from the package you loaded most recently. R will warn you when this happens with a message saying that an object is ‘masked’ from the earlier package. Usually, the developers of packages don’t create functions with identical names that do incompatible things, so it’s not a huge source of worry. The problem is magnified when you start to share your code with other researchers who don’t have all of the same packages installed as you do. So, that’s why it’s better practice to be specific about the package(s) you’re using when you use particular functions. In fact, CRAN requires this when you create a package for review.
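You can see masking in action without installing anything. In this sketch I deliberately create my own function with the same name as stats::median(); this is an analogy to what happens when two packages share a function name, not an example involving two real packages.

```r
# Define our own (deliberately silly) function that shares a name with stats::median().
median <- function(x) "not the real median"

ours   <- median(c(1, 5, 9))          # R finds our version first
theirs <- stats::median(c(1, 5, 9))   # the package prefix always picks the stats version

rm(median)                            # remove our version
restored <- median(c(1, 5, 9))        # back to the stats version

ours
theirs
restored
```

The version with the package:: prefix gives the same answer no matter what else has been defined or loaded, which is the whole argument for typing the extra characters.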
The real power of R, or really any programming language, stems from the combination of single steps to form more complex and useful wholes. Not only can we use functions other R experts have provided via packages, but we can create our own customized sequences of work and save them for later reuse in the form of scripts and functions.
While it can be fun and instructive to work solely in the Console, for most real work, you’ll want and need to put commands in a place you can easily reuse them. That place is called the ‘Source’ pane. It’s ctrl-1 on your keyboard.
Why is it called source? It’s the source of inspiration, the source of your soon-to-be-realized statistical genius. And the term comes from the computer world where every project starts out as ‘source code’.
Let’s make some source code. Press shift-(command or control)-n (that’s 3 keys at one time) to create a new R script file called Untitled1.
Press ctrl-1 to make sure that you are typing in the Source pane in the Untitled1 window.
You can type in this window now, but I urge you to get in the habit of doing two things:
- Start the file with a comment that says what it is. Comments begin with # in the first column. # R Bootcamp work 2018-08-16 might be a good name.
- Save the file somewhere sensible right away. For example, make a directory such as courses/2018_r_bootcamp/ and save the script as 2018_08_16_script.R under that directory.
Later today, I’m going to sharpen and modify this recommendation slightly by adding a step that takes advantage of RStudio’s project management functionality. For now, this will be okay.
Now you can type R code and comments in the Source pane. Let’s try something (modestly) useful.
# Generate data
data_mean_zero <- rnorm(n = 100, mean = 0, sd = 1)
data_mean_nonzero <- rnorm(n = 100, mean = 3, sd = 1)
When you’ve typed this, press (command/control)-s to save it.
What do these commands do? The rnorm() command creates a random sample of normally distributed numbers. The n, mean, and sd parameters specify the size of our sample, its mean, and its standard deviation. Notice that the two samples have different means.
Now, type in two additional rows of commands.
# Print histograms
hist(data_mean_zero)
hist(data_mean_nonzero)
Hit (command/control)-s to save.
What is this new hist() command? Let’s ask R to tell us. Hit ctrl-2 to go to the Console. Type help('hist') and hit enter. You can also type ??hist to run a broader search of R’s help pages for entries that mention hist.
So, this script is going to generate some data and plot histograms. Let’s run the hist() commands. Use the up or down arrow keys to move to the line with the hist(data_mean_zero) command. Hit the Run button at the top of the Source panel. Notice that R moves the cursor down by one line each time you press the Run button.
What if you don’t want to take your hands off the keyboard to do this? Good for you! You can run the current line by pressing (command/control)+enter. Try that.
Now, let’s add two more lines to our script.
# Run t-tests
(t.test(data_mean_zero))
(t.test(data_mean_nonzero))
You can predict what these will do. Confirm it by running the lines.
If you didn’t save the file, the name will be highlighted in red. Save it now.
Now that you’ve saved a sequence of commands, you can run the whole (saved) sequence from the console. Switch to the console with ctrl-2.
Enter source('2018_08_16_script.R') (or whatever name you have given your script) but do not press enter yet. Let’s see what the source() command does. Press esc to clear the console. Type help('source') to view the documentation about source(). Now, we know that it will take the source file we provide and send it to the R console. Make sure you’re still in the console by pressing ctrl-2. Type source(file = " and hit tab to see a list of files in your local directory. If you see the script file you created and saved, hit return to select it. Then press the right arrow key to move to the end of the line, type ) to close the command, and press enter to run it.
You may also run the script from the Source panel. Press ctrl-1 to switch to the Source. Press ctrl-s to make sure the file is saved. Press the Source button in the right corner of the Source panel.
Why do the plot shapes change each time?
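The short answer is that rnorm() draws a fresh set of random numbers every time it runs. If you want the same ‘random’ numbers each time, you can set the seed of R’s random number generator first. Here is a minimal sketch; the seed value 2018 is an arbitrary choice.

```r
set.seed(2018)                 # any integer works; 2018 is an arbitrary choice
first_draw <- rnorm(n = 5)

set.seed(2018)                 # same seed again
second_draw <- rnorm(n = 5)

third_draw <- rnorm(n = 5)     # no reset, so the generator just keeps going

identical(first_draw, second_draw)   # TRUE
identical(first_draw, third_draw)
```

Putting a set.seed() line at the top of a script is a simple way to make simulated results reproducible.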
So, this script is fine, but it could be better. What if we want to try different values for the means or the standard deviations or the number of samples?
Add this text to the top of your script file.
# Parameters
mean_1 <- 0
mean_2 <- 3
sd_1 <- 1
sd_2 <- 1
samples_1 <- 100
samples_2 <- 100
Then, edit the next few lines as follows:
(data_mean_zero <- rnorm(n = samples_1,
                         mean = mean_1,
                         sd = sd_1))
(data_mean_nonzero <- rnorm(n = samples_2,
                            mean = mean_2,
                            sd = sd_2))
See how we’ve substituted the new parameters for specific values? Save the script and source it.
If you want to see different plots, switch to the Plots panel (ctrl-6) and press the left or right arrow buttons.
The virtue of this approach is that we can edit the parameters in the script and see the effects of the changes pretty quickly. The downside of this approach is that we still have a lot of duplicate typing.
Scripts are great. Functions are even better. Functions further reduce typing and other sorts of errors.
Let’s make our own function to see why. It can live (for now) in the same script file. The basic function will look like this:
my_hist_t <- function() {}
So, we’ll assign the function a name, here my_hist_t. Why this particular name? Well, it’s my function, not R’s, and it plots a histogram and runs a \(t\) test. It’s boring, but helpful to your future self to name things in ways that are as clear as possible. Some even suggest that functions should have verbs in their names, as in print_hist_t(), so we can tell them apart from other kinds of objects.
Our function will have some input parameters (not specified yet) entered inside the parentheses (). And then there are the squiggly braces {}. Those are new. That’s where the meat of the function goes, the commands we want to execute.
What input parameters will we want to enter? The number of samples, mean, and standard deviation, of course.
my_hist_t <- function(my_samples, my_mean, my_sd) {}
Notice that I called the parameters my_samples, etc. This is to make it easier to see what’s going on inside the function. It’s often good practice to give some default values for the parameters so that your function runs even if you forget to give it input.
my_hist_t <- function(my_samples = 100, my_mean = 0, my_sd = 1) {}
If we forget (or choose not) to enter a value for one of the parameters, our function will use the defaults. Now we type the code.
my_hist_t <- function(my_samples = 100, my_mean = 0, my_sd = 1) {
  my_data <- rnorm(n = my_samples, mean = my_mean, sd = my_sd)
  hist(my_data)
  t.test(my_data)
}
Let’s make sure we understand what’s going on. Inside the function (inside the curly brackets), we generate a random normal data set and assign it to my_data. Then we plot the histogram and run the \(t\) test. R will ‘return’ to us the result of the last command in the function, in this case, the results of t.test(my_data).
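Because the function returns the \(t\) test result, we can catch it in a variable and pick it apart later. The sketch below uses a hypothetical variant called sim_t_test() that skips the histogram so we can focus on the return value.

```r
# Hypothetical variant of my_hist_t() without the plot, to focus on the return value.
sim_t_test <- function(my_samples = 100, my_mean = 0, my_sd = 1) {
  my_data <- rnorm(n = my_samples, mean = my_mean, sd = my_sd)
  t.test(my_data)    # last expression evaluated, so this is what comes back
}

result <- sim_t_test(my_samples = 50, my_mean = 10)

class(result)        # "htest", the class R uses for hypothesis test results
result$parameter     # degrees of freedom: my_samples - 1
```

If you prefer to be explicit, you can wrap the last line in return(), as in return(t.test(my_data)); the function behaves the same either way.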
Let’s save the script and source it. We have to do this every time we edit the script file.
Now switch to the Environment pane (ctrl-8). See how there is now a function called my_hist_t listed? We can now go back to the console (ctrl-2) and run our function with all sorts of parameter combinations:
# Run these in the Console
my_hist_t()
##
## One Sample t-test
##
## data: my_data
## t = -1.0562, df = 99, p-value = 0.2934
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## -0.29578788 0.09028256
## sample estimates:
## mean of x
## -0.1027527
my_hist_t(my_mean = 10)
##
## One Sample t-test
##
## data: my_data
## t = 97.013, df = 99, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 9.929035 10.343675
## sample estimates:
## mean of x
## 10.13635
my_hist_t(my_samples = 200, my_sd = 5)
##
## One Sample t-test
##
## data: my_data
## t = -0.82963, df = 199, p-value = 0.4077
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## -0.9155860 0.3733232
## sample estimates:
## mean of x
## -0.2711314
Because we’ve specified defaults and are naming our parameters when we call the function, R will just use the defaults when we don’t specify a parameter.
There are many subtleties about writing functions that we haven’t covered here. I recommend the free DataCamp course by Hadley Wickham on “Writing Functions in R” to learn more.
The careful observer will note that I used the equal sign to give my function input when I called it from the console:
my_hist_t(my_samples = 200, my_sd = 5)
and even in the script where I created the function:
my_hist_t <- function(my_samples = 100, my_mean = 0, my_sd = 1) {
  my_data <- rnorm(n = my_samples, mean = my_mean, sd = my_sd)
  hist(my_data)
  t.test(my_data)
}
What gives?
So, remember when I said that R stores objects in volatile memory? In fact, R stores objects in separate compartments called ‘environments’. When I create a function, R stores any variable names (and values) I assign inside it, like my_data <- rnorm(...), in an environment specific to the function. This means that the value my function assigns to my_data is unique to that function.
When I define the my_hist_t() function to have inputs called my_samples, my_mean, and my_sd, I’m telling R to add these names to the environment for my function. That way R is ready to use these names when their values are defined by the user when she calls my function.
Do not use the assignment <- operator when defining a function’s parameters. Use =. That is, use my_function <- function(my_parameter = 2) not my_function <- function(my_parameter <- 2).
Think of it this way. When you used source() to load your function into memory, you told R the names of the variables your function will use in that function’s environment. You don’t need to assign those names again. R will remember them. You just need to give some specific values to the names R already has set aside for them. The equal sign = does that.
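Here is a small sketch of both points: names assigned inside a function stay inside it, and = versus <- behave differently when you call a function. The function make_data() and the name n_values are hypothetical, chosen just for this demonstration.

```r
make_data <- function(n = 3) {
  inside_only <- rnorm(n)   # lives in the function's own environment
  length(inside_only)
}

out1 <- make_data(n = 4)                  # = matches the value to the parameter n
leaked_inside <- exists("inside_only")    # FALSE: the name never reached our workspace

before <- exists("n_values")              # FALSE: nothing by this name yet
out2 <- make_data(n_values <- 4)          # runs, but <- also creates n_values outside
after <- exists("n_values")               # TRUE: a side effect we did not want

c(before = before, after = after)
```

So <- inside a call still works, but it quietly leaves an extra object in your workspace, which is one more reason to stick with = for function inputs.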
We’ve learned how to install packages via the install.packages() command, load packages into R’s ‘working memory’ via library(), and view documentation about packages using the library(help = "package_name") command.
We’ve also learned how to type sequences of assignments and commands into the Source pane to create scripts, and we’ve learned how to execute those commands using the source() command. We’ve also seen how to make a function that encapsulates these sequences into easily reproducible units, using the function_name <- function(...arguments...) {...function commands...} syntax.
There is much more to say and learn about functions (e.g., see this Software Carpentry lesson), but this is a good start on why they’re so important and useful.
In a short first course in R, we can never cover everything, but we can hope to give you enough momentum to get started. In this short section, we’ll talk about some skills that will help accelerate your progress in learning R and RStudio.
The absent-minded maestro was racing up New York’s Seventh Avenue to a rehearsal, when a stranger stopped him. “Pardon me,” he said, “can you tell me how to get to Carnegie Hall?” “Yes,” answered the maestro breathlessly. “Practice!”
https://quoteinvestigator.com/2010/05/06/how-do-you-get-to-carnegie-hall/
Using R and RStudio is a skill. To gain expertise, you need to practice. I recommend practicing at least 3-5 times a week.
Train your fingers to limit mousing. Press ctrl-2 to switch to the console. Press up arrow to see the recent history of your console commands. Use the left and right arrow keys to move within/edit a prior command. Press ctrl-a to go to the beginning of a line in the console; ctrl-e to go to the end. Type the first few characters of a variable or command and then i) wait a bit or ii) press tab to see a list of possible completions. Hit up and down arrows to scroll through the list. Hit enter to select an item. Hit escape to clear the current line of the console.
If you want to see a window of your recent commands, go to the History pane by pressing ctrl-4. You can use the up and down arrow keys to scroll through your previous commands. Hitting enter immediately copies that command to the console and executes it. I find it somewhat less useful than just the up/down arrow functions in the console, but YMMV.
There will be other keyboard commands. For example, if you’re not already using (command/ctrl)+c for copying, (command/ctrl)+v for pasting, and (command/ctrl)+x for cutting, start now. Also, I use (command/ctrl)+ (enlarge text) and (command/ctrl)- (shrink text).
It’s well known that the best writers of prose, poetry, or music, are often voracious consumers. The more comfortable you get reading others’ code, the better your own writing will become. Yes, code can be dense at times, and some coders write cleaner and more comprehensible code than others. But in this workshop, we’re trying to help speed you along the learning curve by helping you learn to read R code from the get-go. Remember, it’s critically important that code be human-readable. Computers want 1’s and 0’s and don’t really care about things like comments or elegant syntax or any of that.
Press ctrl-3 to go to the help pane. RStudio has extensive documentation and help resources. Google is your friend, at least in this regard. There is also an active R community on Stack Overflow.
I often find it useful to read the documentation about a function to see what the other parameters do. For example, typing help('plot') shows us that the plot() function takes (typically) two inputs, x and y. The ellipsis ... says that it also takes other optional parameters. So, this code
plot(x = rnorm(10), y = rnorm(10), main = "Random scatterplot", ylab = "Random Y",
     xlab = "Random X")
creates a scatter plot with a title and labels for the x and y axes.
You’ll want to practice ‘situational awareness’. That means knowing what’s going on in your local R environment.
Press ctrl-8 to go to the ‘Environment’ pane. This shows us all of the Data, Variables, and Functions we have defined in the current session. Some of the items can be expanded by clicking on the small arrow, the data frame for example.
The Environment pane is one way to keep track of what you’ve assigned, but there are some other ways to do this from the console.
- ls() tells you what objects and functions are in R’s active memory ready for your use.
- Typing the name of an object and pressing enter will give you the current value.
If you’re exploring and putting your commands in a script, you may find it useful to ‘clean the slate’ or start with a fresh environment before you source() your script. To do this, you can remove individual objects using the rm() command, or remove all of them using rm(list = ls()).
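A quick sketch of these commands working together follows. The object names scratch_value and scratch_text are made up for the demonstration.

```r
scratch_value <- 42          # hypothetical objects, just for the demo
scratch_text  <- "hello"

was_there <- "scratch_value" %in% ls()   # TRUE: it is listed in the workspace
rm(scratch_value)                        # remove one object by name
still_there <- "scratch_value" %in% ls() # FALSE now
exists("scratch_text")                   # TRUE: we left this one alone

# rm(list = ls())            # would remove everything; uncomment with care
```

I left the clean-the-slate line commented out on purpose, since it removes every object in your workspace, including ones you may still want.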
We’ve mostly ignored it thus far, but when you start RStudio, all of your work is in a default directory (or folder) on your computer. You can change this default in the ‘Tools’ menu, ‘Global Options…’ item. Later this afternoon, I’m going to recommend that you let RStudio help you organize your work by creating a project with its own directory for each new project you take on. But for now, let’s see how we can use console commands to find out where we are.
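A minimal sketch of the console commands for finding out where you are, using base R’s getwd() and list.files(). The variable name here is my own choice, and what you see will depend on your own computer.

```r
here <- getwd()          # the current working directory, as a character string
here

head(list.files(here))   # the first few files and folders in that directory
dir.exists(here)         # confirms that the directory really exists
```

There is also a setwd() command for changing the working directory, although letting RStudio projects manage this for you tends to be less error-prone.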
One of the reasons people gravitate toward careers in independent research is that they relish their independence. So, you’ll find lots of advice about how to structure the information in your projects. Rather than tell you what structure to use, I’m going to show you some structures that I find sensible and workable for many research projects:
my_awesome_project/
  README.md
  data/
    raw/
      sub_001.xlsx # or ...
      sub_002.xlsx
    csv/
      sub_001.csv
      sub_002.csv
    aggregate/
  R/ # Helper scripts
  figs/
  pubs/
    papers/
    talks/
    posters/
This means that the my_awesome_project/ directory is the ‘home’ directory for this project.
Some people prefer to organize their data by measure:
my_awesome_project/
  README.md
  data/
    behavior/
    mri_structural/
    mri_functional/
    ...
And others prefer to organize it by participant:
my_awesome_project/
  README.md
  data/
    sub_001/
      sub_001_behavior.csv
      sub_001_mri_structural/
      sub_001_mri_functional/
    ...
Choose a directory structure that works for you and try to be as consistent as possible about it. This will make writing reproducible scripts (within and across projects) much easier.
By the way, you can automate the creation of directories using the dir.create() command, but I usually use the RStudio Files panel and its associated buttons.
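If you do want to automate it, here is a sketch that builds the first directory layout shown above. I build it inside a temporary directory (via tempdir()) so that nothing on your real disk is touched; swap project_root for a real path when you mean it.

```r
# Build the skeleton inside a temporary directory so nothing real is touched.
project_root <- file.path(tempdir(), "my_awesome_project")

sub_dirs <- c("data/raw", "data/csv", "data/aggregate",
              "R", "figs", "pubs/papers", "pubs/talks", "pubs/posters")

for (d in sub_dirs) {
  # recursive = TRUE also creates parent folders such as data/ and pubs/
  dir.create(file.path(project_root, d), recursive = TRUE, showWarnings = FALSE)
}

list.dirs(project_root, full.names = FALSE, recursive = TRUE)
```

Keeping a snippet like this in a helper script makes it easy to start every new project with the same structure.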
You can customize your RStudio environment to your heart’s content, including rearranging how the different panels appear. Go for it. You can modify the settings using the ‘Tools’ menu ‘Project Options’ item. The Project Options window opens. Here’s mine.
I have strong feelings about two settings:
- Restore .RData into workspace at startup should be unchecked.
- Save workspace to .RData on exit should be never.
These settings ensure that you have a ‘clean’ R session when you start up. That is, that old values of variables and functions aren’t still lurking about. This may seem counter-intuitive, but it is considered a best practice for reproducible research. You’ll want to be able to re-generate your results from scratch without having old (stale, possibly outdated) values interfering with your current analyses.
There are three main sources of data: R packages, the internet, and files on local computers or servers you have access to.
Base R comes with a number of datasets that can be good for learning your way around. Type library(help="datasets") to see a list of them.
library(help = "datasets")
Type help('mtcars') for information about the Motor Trend cars dataset, for example. Type data('mtcars') to load that dataset into memory.
There is a lot of publicly available data on the internet. But each dataset and site has its own way of downloading.
Just to illustrate how this might work, try downloading a simple heart rate time series using the read.csv() command:
hr_mit <- read.csv(file = "http://ecg.mit.edu/time-series/hr.11839", header = FALSE,
                   col.names = "HR", skip = 0)
The read.csv() command creates a data frame. Note that we specified a URL or web address for the file and some other parameters. This file has no header with the column names, so we said header = FALSE; we provided a name for the single column, col.names = 'HR'; and we said not to skip any rows at the top of the file, skip = 0.
class(hr_mit)
## [1] "data.frame"
Note that we don’t have to surround ‘hr_mit’ with quotation marks because it is a name that R should know because we just defined it.
Then we can plot it using plot(hr_mit$HR):
plot(hr_mit$HR)
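If you are offline, or the web address has moved, you can rehearse the same read.csv() options on a few lines of text. The numbers below are made up; the text = argument tells read.csv() to read from a character string instead of a file or URL.

```r
# Stand-in for a header-less file: one heart-rate value per line (made-up numbers).
raw_text <- "72.5\n73.1\n74.0\n71.8"

hr_demo <- read.csv(text = raw_text, header = FALSE, col.names = "HR")

class(hr_demo)     # "data.frame"
nrow(hr_demo)      # one row per line of text
mean(hr_demo$HR)   # the column behaves like any numeric vector
```

Trying header = TRUE on the same text is instructive: R would treat the first value as a column name and you would quietly lose a data point.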
Samuel Mehr et al. (2018) have provided some data files using the Open Science Framework:
Mehr, S. A., Singh, M., York, H., Glowacki, L., & Krasnow, M. (2018, January 9) Datasets. https://doi.org/10.17605/OSF.IO/M9RXV
Let’s download the survey data.
mehr_etal_survey <- read.csv(file = "https://osf.io/etg8z/download")
names(mehr_etal_survey)
## [1] "ethno" "theory" "musoth" "psych" "tenure" "sex" "age" "naiv1"
## [9] "naiv2"
This will produce a cross tabulated table about the sex and tenure status of the respondents:
xtabs(formula = ~sex + tenure, data = mehr_etal_survey)
## tenure
## sex 0 1
## Female 203 187
## I prefer not to answer 46 62
## Male 150 289
## Other 2 1
Here is the same table formatted in a nice way using the knitr::kable() function:
knitr::kable(xtabs(formula = ~sex + tenure, data = mehr_etal_survey))
sex | tenure = 0 | tenure = 1 |
---|---|---|
Female | 203 | 187 |
I prefer not to answer | 46 | 62 |
Male | 150 | 289 |
Other | 2 | 1 |
Spreadsheets are incredibly powerful computational tools. But they are terrible ways of storing data you want to use in R.
Imagine a future for yourself where you rarely do any manipulation or visualization of data using a spreadsheet. Work to realize that future as soon as possible.
Why? Because you want to produce open, transparent, and reproducible scientific workflows using data and processing pipelines that are findable, accessible, interoperable, and reusable (FAIR) 2.
Comma-separated value (CSV) files are text files with items separated by commas, and rows separated by new line or return characters. CSV files have the file extension .csv. All spreadsheet programs can export and import CSV files.
CSV files are the lingua franca of the data science world, followed closely by tab-delimited files (where tab characters replace the commas), JavaScript Object Notation (JSON), and eXtensible Markup Language (XML). Here is a tutorial about CSV files: https://swcarpentry.github.io/r-novice-inflammation/11-supp-read-write-csv/
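To see what a CSV file really is, here is a small round trip: write a data frame to a CSV file, peek at the raw text, and read it back. The data frame is made up for illustration, and the file goes to a temporary location (via tempfile()) so it won’t clutter your project.

```r
demo_df <- data.frame(id = 1:3, score = c(4.5, 3.2, 5.0))

csv_path <- tempfile(fileext = ".csv")       # throwaway file location
write.csv(demo_df, file = csv_path, row.names = FALSE)

readLines(csv_path)                          # the raw text: a header row, then one row per case
round_trip <- read.csv(csv_path)

all.equal(demo_df, round_trip)               # did we get back what we wrote?
```

Setting row.names = FALSE keeps R from adding an extra unnamed column of row numbers, which is a common surprise when the file is read back in.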
Notwithstanding the fact that we urge you to store data in comma-separated text files, you will often want and need to work with data files stored in other formats.
The ‘foreign’ package (install.packages('foreign')) contains commands to read SPSS files (foreign::read.spss()), Stata files (foreign::read.dta()), Systat files (foreign::read.systat()), and Minitab files (foreign::read.mtp()).
SAS files can be imported using the sas7bdat package (install.packages('sas7bdat')).
The xlsx package (install.packages('xlsx')) contains commands to read MS Excel (.xlsx) spreadsheets.
Most of these functions create an R data frame, though a few need to be asked: foreign::read.spss(), for example, returns a list unless you set to.data.frame = TRUE.
Much or most of the time, you’ll want to load data into R that is stored on your computer or on a server you have access to.
It is a best practice to read-in your data file each time you return to work on your project.
Once we have loaded a dataset, we can poke around. We’ll use the mtcars dataset since it’s part of base R.
data(mtcars)
class(mtcars)
## [1] "data.frame"
str(mtcars) # What is the structure of mtcars?
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
names(mtcars)
## [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
## [11] "carb"
These commands load the dataset, show that it’s a data frame, display its structure, and show us the names of the columns or variables. We can also look at the top (head) and bottom (tail) of the dataset.
head(mtcars) # display first 6 cases
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
tail(mtcars) # display last 6 cases
## mpg cyl disp hp drat wt qsec vs am gear carb
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.7 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.5 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.5 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.6 0 1 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.6 1 1 4 2
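If six cases is too many or too few, head() and tail() both accept an n argument. The short sketch below also uses dim(), nrow(), and ncol(), which are not shown elsewhere in this section, to report the size of the data frame.

```r
data(mtcars)

head(mtcars, n = 3)   # first 3 cases instead of the default 6
tail(mtcars, n = 2)   # last 2 cases

dim(mtcars)           # rows (cases), then columns (variables)
nrow(mtcars)          # number of cases
ncol(mtcars)          # number of variables
```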
We can also summarize the data with the summary() command.
summary(mtcars)
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
## drat wt qsec vs
## Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
## Median :3.695 Median :3.325 Median :17.71 Median :0.0000
## Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
## Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
## am gear carb
## Min. :0.0000 Min. :3.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
## Median :0.0000 Median :4.000 Median :2.000
## Mean :0.4062 Mean :3.688 Mean :2.812
## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :1.0000 Max. :5.000 Max. :8.000
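The same idea works for a single variable. As a small sketch, the code below summarizes only mpg and then uses tapply(), one of several ways to do this in base R, to compute the mean mpg separately for each number of cylinders. The name mpg_by_cyl is made up for illustration.

```r
data(mtcars)

summary(mtcars$mpg)   # the same six statistics, for mpg only
mean(mtcars$mpg)      # about 20.09, as in the summary table

# Mean mpg separately for 4-, 6-, and 8-cylinder cars
mpg_by_cyl <- tapply(mtcars$mpg, mtcars$cyl, mean)
mpg_by_cyl
```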
And we can even plot all of the data. Calling plot() on a data frame draws a matrix of scatterplots, one for each pair of variables.
plot(mtcars)
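With 11 variables the full plot matrix gets crowded. One possible way to focus, sketched below, is to plot only a few columns or a single pair of variables. The choice of mpg, wt, and hp and the name three_vars are just for illustration.

```r
data(mtcars)

# Scatterplot matrix for three variables only
three_vars <- mtcars[, c("mpg", "wt", "hp")]
plot(three_vars)

# A single scatterplot: mpg (y) against weight (x), using formula notation
plot(mpg ~ wt, data = mtcars,
     xlab = "Weight (1000 lbs)", ylab = "Miles Per Gallon (mpg)")
```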
We can choose specific subsets of the data like before, and even customize the way a histogram looks:
guzzlers <- mtcars$mpg < 20
hist(mtcars$mpg[guzzlers], main = "Gas Guzzling Cars in the mtcars dataset",
     xlab = "Miles Per Gallon (mpg)")
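To see what the guzzlers object actually holds, here is a short sketch that inspects it and then uses it to keep whole rows rather than just the mpg values. The name guzzler_cars is made up for illustration.

```r
data(mtcars)

guzzlers <- mtcars$mpg < 20   # TRUE where mpg is below 20, FALSE otherwise
class(guzzlers)               # "logical"
sum(guzzlers)                 # TRUE counts as 1, so this counts the guzzlers

# Use the logical vector to keep only the matching rows (all columns)
guzzler_cars <- mtcars[guzzlers, ]
max(guzzler_cars$mpg)         # always below 20
```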
read.csv() reads comma-separated value text files
str() shows the structure of a data frame
names() shows column names
summary() produces summary statistics
plot() creates a matrix of plots, pitting each variable against the others
This document was produced on 2018-08-16 09:47:12 in RStudio version 1.1.453 using R Markdown. The code and materials used to generate the slides may be found at https://github.com/psu-psychology/r-bootcamp-2018/. Information about the R Session that produced the slides is as follows:
sessionInfo()
## R version 3.5.1 (2018-07-02)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Sierra 10.12.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.18 bindr_0.1.1 knitr_1.20 magrittr_1.5
## [5] tidyselect_0.2.4 munsell_0.5.0 colorspace_1.3-2 R6_2.2.2
## [9] rlang_0.2.1 highr_0.7 stringr_1.3.1 plyr_1.8.4
## [13] dplyr_0.7.6 tools_3.5.1 grid_3.5.1 gtable_0.2.0
## [17] htmltools_0.3.6 assertthat_0.2.0 yaml_2.1.19 lazyeval_0.2.1
## [21] rprojroot_1.3-2 digest_0.6.15 tibble_1.4.2 crayon_1.3.4
## [25] bindrcpp_0.2.2 purrr_0.2.5 ggplot2_3.0.0 formatR_1.5
## [29] glue_1.3.0 evaluate_0.11 rmarkdown_1.10 labeling_0.3
## [33] stringi_1.2.4 compiler_3.5.1 pillar_1.3.0 scales_0.5.0
## [37] backports_1.1.2 pkgconfig_2.0.1
Wickham, H. (2014). Tidy Data. Journal of Statistical Software. Retrieved November 19, 2016, from https://www.jstatsoft.org/article/view/v059i10
Wilkinson, M. D., Dumontier, M., Aalbersberg, I. J. J., Appleton, G., Axton, M., Baak, A., Blomberg, N., et al. (2016). The FAIR guiding principles for scientific data management and stewardship. Scientific Data, 3, 160018. Retrieved from http://dx.doi.org/10.1038/sdata.2016.18