Last week, you learned about basic objects and data structures in R. Recall:
For this class, think of every value in R as belonging to one of four data types:
character
logical
numeric (i.e., “doubles” and “integers”)
factor
All values can be stored in objects. The main types of objects we will be working with are:
vectors - sequences of values of the same type
data frames - two-dimensional objects containing multiple vectors of the same length as columns
A helpful analogy for remembering these two objects: the data frame is R’s version of a spreadsheet, and vectors are the columns in that spreadsheet. In R, a vector can also stand alone outside of a data frame, which is equivalent to a single column sitting on its own.
Today’s Objectives
This week, you will refer to these objects in your code to conduct your first statistical analyses. Questions and instructions in the first section will serve as a refresher on how to read in, create, and manipulate objects in R. In the second section, you will learn how to compute summary statistics for two different types of variables: numeric variables (interval and ratio levels), and categorical variables (nominal and ordinal levels). In the third section, you will learn how to perform basic data visualization for these two broad types of data.
Playing with Vectors
Recall that you created vectors last class manually using the c() function or automatically using the runif() function. The example below illustrates what you did in lab:
# creating a numeric vector with 5 values (AKA "length 5") by entering each value manually
num_vec_1 <- c(1, 3, 5, 7, 9)
num_vec_1
[1] 1 3 5 7 9
# creating a numeric vector, length 5, drawn from a random uniform distribution
# (the values are random, so yours will differ each time you run this line)
num_vec_2 <- runif(n = 5, min = 0, max = 1)
num_vec_2
# creating a character vector, length 5, by entering each value manually
char_vec_1 <- c("A", "B", "J", "N", "Q") # note the ""
char_vec_1
[1] "A" "B" "J" "N" "Q"
# creating a logical vector, length 5, by entering each value manually
logi_vec_1 <- c(TRUE, FALSE, TRUE, FALSE, TRUE)
logi_vec_1
[1] TRUE FALSE TRUE FALSE TRUE
In the logical vector creation above, note how the words changed color when I typed them out - R recognizes the words TRUE and FALSE as logical values - not character values - when you type them in all-caps and do not wrap them in quotes. In the background, R treats logical TRUE values as 1’s and logical FALSE values as 0’s. The “data type” is still logical - not numeric. But because of these background 1’s and 0’s, you can easily count the number of TRUE values in a logical vector using the sum() function:
sum(logi_vec_1)
[1] 3
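Because TRUE counts as 1 and FALSE counts as 0, the same idea works with other numeric functions too. The short sketch below re-creates logi_vec_1 so that it runs on its own; the object names n_true, prop_true, and as_numbers are just illustrative choices.

```r
# Re-create the logical vector from above so this snippet is self-contained
logi_vec_1 <- c(TRUE, FALSE, TRUE, FALSE, TRUE)

# Count of TRUE values (TRUE = 1, FALSE = 0)
n_true <- sum(logi_vec_1)

# Proportion of TRUE values: the mean of the background 1's and 0's
prop_true <- mean(logi_vec_1)

# Convert to numeric to see the background 1's and 0's directly
as_numbers <- as.numeric(logi_vec_1)

n_true      # 3
prop_true   # 0.6
as_numbers  # 1 0 1 0 1

# The original vector is still logical, not numeric
class(logi_vec_1)  # "logical"
```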
We will ignore factor vectors for now. They are useful, but trickier than the other vector types because they can look exactly like numeric or character vectors until you check them with a function like class() or glimpse(). (typeof() is less helpful here: it reports how a factor is stored internally, which is as integers.) We will introduce them gradually as you progress in your programming skills.
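For the curious, here is a minimal sketch of why factors can be deceptive. The vectors char_vec and fct_vec are made up for illustration and are not part of the lab. The checking functions report different things about the same factor:

```r
# A character vector and a factor built from the same values
char_vec <- c("low", "high", "low", "medium")
fct_vec  <- factor(char_vec)

# class() tells the two apart
class(char_vec)  # "character"
class(fct_vec)   # "factor"

# typeof() reports how the values are stored; a factor is stored as integer codes
typeof(fct_vec)  # "integer"

# is.factor() is a direct yes/no check
is.factor(fct_vec)   # TRUE
is.factor(char_vec)  # FALSE
```

There is no need to memorize this yet; it only shows why it pays to check an object rather than trust how it prints.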
Revisiting Data Frames
Recall that data frames are two-dimensional objects that contain multiple vectors of the same length. They are essentially groups of vectors (think of the spreadsheet metaphor from the first section of these instructions). The code below combines the vectors I created above (which are all the same length) to make a data frame.
# Use the data.frame() function to create a data frame from a set of vectors.
# Each argument is one vector, and the vectors should all be the same length.
# If the lengths do not fit together, you will get an error. (If a shorter
# vector's length divides evenly into the longest one, R silently repeats the
# shorter vector instead of stopping, so check your lengths.)
df <- data.frame(num_vec_1, num_vec_2, char_vec_1, logi_vec_1)

# ALWAYS inspect objects like data frames after you create them or read them
# into R, to make sure the data look like you expected them to and to catch
# mistakes. Use the head() and summary() functions to do so. We will learn
# other ways to do this later.
head(df)
  num_vec_1 num_vec_2 char_vec_1 logi_vec_1
1         1 0.2185457          A       TRUE
2         3 0.7096525          B      FALSE
3         5 0.2349540          J       TRUE
4         7 0.8886341          N      FALSE
5         9 0.6953423          Q       TRUE
summary(df)
   num_vec_1   num_vec_2       char_vec_1        logi_vec_1
 Min.   :1   Min.   :0.2185   Length:5           Mode :logical
 1st Qu.:3   1st Qu.:0.2350   Class :character   FALSE:2
 Median :5   Median :0.6953   Mode  :character   TRUE :3
 Mean   :5   Mean   :0.5494
 3rd Qu.:7   3rd Qu.:0.7097
 Max.   :9   Max.   :0.8886
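To connect the data frame back to the spreadsheet analogy, the sketch below shows that each column really is one of the original vectors, and what happens when vector lengths do not fit together. It re-creates two of the vectors so that it runs on its own. The $ operator is base R, though it may not have been covered in class yet, and short_vec and err_msg are made-up names for illustration.

```r
# Re-create two of the vectors so this snippet is self-contained
num_vec_1  <- c(1, 3, 5, 7, 9)
char_vec_1 <- c("A", "B", "J", "N", "Q")
df <- data.frame(num_vec_1, char_vec_1)

# Rows and columns: 5 rows (the vector length) and 2 columns (the vectors)
dim(df)  # 5 2

# The $ operator pulls one column back out as a stand-alone vector
df$num_vec_1
identical(df$num_vec_1, num_vec_1)  # TRUE

# Vectors whose lengths do not fit together (here 5 and 3) cause an error.
# tryCatch() is used only so the snippet keeps running and stores the message.
short_vec <- c(10, 20, 30)
err_msg <- tryCatch(
  data.frame(num_vec_1, short_vec),
  error = function(e) conditionMessage(e)
)
err_msg
```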
Today we will use one or two functions from a package called dplyr. In R, “packages” are sets of functions that you can install and load that expand your capabilities. They are usually not developed by the same programmers who write and maintain “base R” (i.e., R without any packages installed), but they can be very useful.
To install a package for the first time, you use the function install.packages() and include the name of the package in quotes.
install.packages("dplyr") #quotes are necessary here
R will print a lot of text in the console while the package installs. Read through it to make sure it contains no messages such as ERROR or WARNING.
Once you install the package, it stays on your computer. There is no need to run install.packages() every time you want to use dplyr; run it again only if R tells you that the package needs to be updated or reinstalled (this can happen, for example, after you install a new version of R).
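If you are ever unsure whether a package is already installed, you can ask R instead of reinstalling. The sketch below is one way to do that with the base R function requireNamespace(); the object name has_dplyr is just an illustrative choice. It only checks and prints a message, and it never installs anything on its own.

```r
# requireNamespace() returns TRUE if the package is installed and FALSE if not.
# It does not attach the package, so library() is still needed before use.
has_dplyr <- requireNamespace("dplyr", quietly = TRUE)

if (has_dplyr) {
  message("dplyr is already installed; library(dplyr) will load it.")
} else {
  message("dplyr is not installed yet; run install.packages(\"dplyr\") once.")
}
```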
However, installing a package is not the same thing as having the package loaded and ready to use during your work session. To use the functions inside dplyr, you have to load the package with the library() function every time you open R:
library(dplyr) #no quotes are necessary here
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
    filter, lag
The following objects are masked from 'package:base':
    intersect, setdiff, setequal, union
You need to load dplyr every time you restart RStudio. Think of opening R as restarting your phone, and loading packages as opening apps on your phone after the restart. Packages like dplyr are never open in the R system until you tell them to be open during your current work session.
Note that, when you run library(), you might see messages about how the package you loaded interacts with the rest of R’s functions and other packages. Be mindful of messages about “masking”: these tell you that a function in the package you just loaded has exactly the same name as a function in base R or in another loaded package. When that happens, typing the bare function name runs the version from the most recently loaded package. It is unlikely that masking will cause any problems in this class, but always look for these messages when you load a package.
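If masking ever does get in your way, R lets you say exactly which package's function you mean by writing package::function(). The sketch below uses only the stats package that ships with R, so it runs whether or not dplyr is installed; the objects x and smoothed are made up for illustration.

```r
x <- c(1, 3, 5, 7, 9)

# stats::filter() always means the filter() from the stats package
# (a moving-average style filter), even if dplyr has masked the bare name
smoothed <- stats::filter(x, rep(1/3, 3))
smoothed

# dplyr::filter() would name dplyr's version (which keeps rows of a data frame)
# explicitly; it is left commented out here in case dplyr is not installed:
# dplyr::filter(data.frame(x), x > 4)
```

You will not need this in today's lab, but it is a handy way to confirm which function you are actually calling.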
Summarizing Data
Your lab script will first ask you to work through a variation on the steps I walked through above. It will then ask you to generate summary statistics for one vector containing nominal-level data and for other vectors containing interval- and ratio-level data.
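As a rough preview of what summarizing these two kinds of data can look like, here is a minimal sketch using base R functions. The vectors fav_color and age are made-up data for illustration only, and your lab script may ask for different functions or a different approach.

```r
# Hypothetical nominal-level data: favorite colors of six people
fav_color <- c("red", "blue", "blue", "green", "red", "blue")

# table() counts how often each category appears
color_counts <- table(fav_color)
color_counts

# prop.table() turns those counts into proportions
prop.table(color_counts)

# Hypothetical ratio-level data: ages in years
age <- c(19, 22, 20, 35, 21, 24)

mean(age)    # arithmetic average
median(age)  # middle value
sd(age)      # standard deviation
range(age)   # smallest and largest values
```

Counts and proportions suit categorical data because the categories have no numeric meaning, while the mean, median, and standard deviation only make sense for numeric data.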
Go to Blackboard, download your lab, and get started!