Welcome to Week 2 of Harp 130! This week, we will learn some
additional concepts and tools in R that will help you to better work
with, manage, and clean data for statistical analysis. Some of this
tutorial will follow along with the chapter in Learning Statistics with
R; in addition, I will introduce you to some techniques using the dplyr
package, which came out after the initial version of our textbook. Not
only is this package extremely useful for data management tasks, but it
is also very popular in the R community and will be a very valuable tool
in your arsenal. Let’s get started!
#Let's start by continuing last week's discussion of variables
#When you work with a variable, there are some special values that you might see
#For example, do you remember what you get when you divide a positive number by 0? Try it!
1/0
[1] Inf
#The answer is infinity, or "Inf"
#Next, what happens if I try to perform an operation that results in an invalid number? For example:
0/0
[1] NaN
#Here, the result is "NaN", which stands for "Not a Number". This means that there is no meaningful numeric answer, or the answer is undefined. This is distinct from the next two values: with NaN, R did carry out the calculation, but the result just isn't a meaningful number
#By contrast, "NA" denotes missing data - we don't know what's supposed to be stored in this element. "NULL" on the other hand indicates that there is no value at all.
#In practice, you are more likely to see NaN or NA - it is very important to pay attention to these values. If they pop up, they can invalidate the results of your analysis or cause R to return an incorrect answer. For example, let's say we have this vector and there's accidentally a missing value in it:
obj <- c(1, 3, 5, NA, 2, 7)
#What if we want to take the sum of these values? What does R give us?
sum(obj)
[1] NA
#We get an NA! There are a number of ways to troubleshoot this; the most common way is to tell R to simply ignore missing values, by setting na.rm (short for NA Remove) to TRUE:
sum(obj, na.rm = TRUE)
[1] 18
#As a reminder, you can always check the documentation for a function to see what arguments it takes; many functions will include some variation of the na.rm argument to allow you to ignore missing values. This may not always be the best option, but we don't need to worry about that right now.
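Before removing missing values, it can help to see where they are. The sketch below uses the base R functions is.na(), anyNA(), and which() on the same vector; the mean() call at the end is just one more illustration of na.rm, and the object names n_missing, missing_at, and mean_obs are only placeholders.

```r
# Same vector as above, with one missing value
obj <- c(1, 3, 5, NA, 2, 7)

# TRUE wherever a value is missing
is.na(obj)

# Is there at least one missing value?
anyNA(obj)

# How many values are missing, and at which positions?
n_missing <- sum(is.na(obj))
missing_at <- which(is.na(obj))

# Like sum(), mean() returns NA unless we set na.rm = TRUE
mean(obj)
mean_obs <- mean(obj, na.rm = TRUE)
mean_obs
```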
#What if we want to add names to that vector that we just created? We can easily do that like so:
names(obj) <- c("E1", "E2", "E3", "E4", "E5", "E6")
#Here I've just used "E-" to generically name each element, but you can use any name that you want
#inspect our vector
obj
E1 E2 E3 E4 E5 E6
1 3 5 NA 2 7
#If I change my mind and don't want the names, I can always do the following to delete them:
names(obj) <- NULL
#We use "NULL" in this case to tell R that there is no value for the names - if we used NA instead, what would happen?
names(obj) <- NA
obj
<NA> <NA> <NA> <NA> <NA> <NA>
1 3 5 NA 2 7
#Finally, names are useful because we can index vectors using element names:
names(obj) <- c("E1", "E2", "E3", "E4", "E5", "E6")
obj["E2"]
E2
3
#Next, we'll talk about a data type that is specific to statistics. So far, we've learned about numeric and character data. The next data type is a Factor
#This becomes important when we want to distinguish between nominal, ordinal, interval and ratio scale data
#Numeric data makes a lot of sense for ratio data, and some sense for interval data (although note that this data does not have a natural zero value, so some operations wouldn't make sense to use)
#What about data types where performing numeric operations doesn't make a lot of sense? For example, let's think about nominal data. Say I've assigned you all to three groups: groups 1, 2, and 3. We can't say anything about these numbers! It wouldn't make sense to say that group 2 is 2x group 1, or that group 3 minus group 2 equals group 1, would it?
#But, if I store this as numeric data, what happens? Let's say I have this group data:
group <- c(1,2,3,2,1,3,2)
#Realistically, I shouldn't be able to perform any operations on this data, but R will let me:
group*3
[1] 3 6 9 6 3 9 6
group+2
[1] 3 4 5 4 3 5 4
#In order to ensure that our data is properly stored, we can transform this data into a different data type: Factor data.
group <- as.factor(group)
#did it work?
class(group)
[1] "factor"
#Can I perform any computations on this data?
group+1
Warning: ‘+’ not meaningful for factors
[1] NA NA NA NA NA NA NA
#No! This is what we wanted to see.
#What happens if we print out this data?
group
[1] 1 2 3 2 1 3 2
Levels: 1 2 3
#Levels will often refer to the categories in the data, while the data stores the values in our vector. We can also make the levels more meaningful here:
levels(group) <- c("group_1", "group_2", "group_3")
group
[1] group_1 group_2 group_3 group_2 group_1 group_3 group_2
Levels: group_1 group_2 group_3
#It's not necessarily pretty, but it's more meaningful than the previous version
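If you already know what the labels should be, you can set them when you create the factor. This is a minimal sketch using the base R function factor(); the object names group_raw, group2, and group_counts are just placeholders for illustration.

```r
# The same group assignments as before, stored as plain numbers
group_raw <- c(1, 2, 3, 2, 1, 3, 2)

# Create the factor and label its levels in one step
group2 <- factor(group_raw,
                 levels = c(1, 2, 3),
                 labels = c("group_1", "group_2", "group_3"))

# Counting is an operation that does make sense for nominal data
group_counts <- table(group2)
group_counts

# levels() lists the categories; nlevels() counts them
levels(group2)
nlevels(group2)
```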
So, you have now seen numeric, character, and factor data, and we’ll
use factors throughout the course to store nominal data. You have also
worked with different types of objects, including values and vectors.
Next, we will introduce a very important type of object: the data frame.
You may be familiar with this data type if you have ever worked in
Microsoft Excel before - a spreadsheet is one way to view a data frame.
At its simplest, a data frame is a way of organizing vectors of data
into a table.
#To start out, let's say we have some data on the students in this class
#We know everyone's major, their home state, and age
major <- c("math", "sociology", "psychology", "geography", "physics", "english")
state <- c("New York", "New York", "Pennsylvania", "New Jersey", "New York", "New York")
age <- c(21,18,19,20,18,19)
#We can view and work with each vector individually, but this doesn't tell us how the different variables relate to each other! Additionally, we know that each vector is the same size, and each element represents the same student in each vector (e.g., element 1 is student 1 in each vector)
#In order to visualize this data in a more organized way, we can put it into a data frame!
students <- data.frame(age, major, state)
students
That is so much nicer! Notice now that, in the environment pane, the
“students” data frame is now listed under Data, while our vectors are
listed as Values - this is how R organizes this kind of data. When we
manipulate data, it's also important to pay attention to how it is being
categorized - if you convert a dataframe to a vector or several vectors
to a data frame, always remember to check that it is being stored
correctly.
#What if we want to just inspect one variable in this data frame? We can do this using the '$' sign like this:
students$state
[1] "New York" "New York" "Pennsylvania" "New Jersey" "New York"
[6] "New York"
#This tells R to print the variable "state" within the data frame "students"
#What if I have a very large data frame, and I need to know what the variable names are? There are a few ways to do this:
names(students)
[1] "age" "major" "state"
colnames(students)
[1] "age" "major" "state"
#Sometimes, you will find that you have a very large data frame, and it's time-consuming to print out the whole thing. You can view just a few lines using the function head()
head(students)
#Let's look under the hood of this function:
?head
#Here we see that we can set the number of rows that are displayed using the n= argument like this:
head(students, n = 3)
#It can also be helpful to check the type of data that is stored in our data frame:
#Try this first - what type of data is it?
str(students)
'data.frame': 6 obs. of 3 variables:
$ age : num 21 18 19 20 18 19
$ major: chr "math" "sociology" "psychology" "geography" ...
$ state: chr "New York" "New York" "Pennsylvania" "New Jersey" ...
#The str() command gives us a lot of useful information about our data frame! We'll discuss data frames in more detail later; for now, it's just important for you to be familiar with this type of object
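A few more base R tools are handy when you first meet a data frame. The sketch below rebuilds the small students data frame so it can run on its own, then checks its size and pulls out pieces with square brackets, where the first position is the row and the second is the column.

```r
# Rebuild the example data frame from above
major <- c("math", "sociology", "psychology", "geography", "physics", "english")
state <- c("New York", "New York", "Pennsylvania", "New Jersey", "New York", "New York")
age <- c(21, 18, 19, 20, 18, 19)
students <- data.frame(age, major, state)

# Number of rows (students) and columns (variables)
nrow(students)
ncol(students)

# Row 3, all columns
students[3, ]

# All rows, just the "major" column (same as students$major)
students[, "major"]

# A single cell: the age of student 1
students[1, "age"]
```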
Finally, we will introduce one more type of object, the list. Lists
will not be used frequently throughout this course, but they are a
fundamental data structure in R, and it's helpful for you to at least be
familiar with them now. You will likely use lists quite a bit more as
you learn more advanced programming (I use lists constantly in my own
research). It is also helpful to know about lists in case you ever
accidentally transform a vector or a data frame into a list (which may
occasionally happen if you use a function incorrectly, although
hopefully we will avoid this!). What is a list? A list is a collection
of variables, similar to a data frame. A list, however, is not neat and
structured in the same way that a data frame is. Let’s look at an
example!
#I'll create a list containing some data on a hypothetical student:
student_a <- list( age = 21,
state = "New York",
major = c("Economics","Math")
)
#inspect the list:
student_a
$age
[1] 21
$state
[1] "New York"
$major
[1] "Economics" "Math"
#in our data frame, each student could only have exactly one major - in this list, the element "major" can contain as many values as we want! While all of the columns must be the same length in any data frame, a list is a nice way of storing data of different lengths. We can select an element from a list similarly to a data frame:
student_a$major
[1] "Economics" "Math"
When might you want to use a list? Again, in this class, we likely
won’t use them very often. But, think about a time when you might use a
large amount of data for a research problem. If I’m working with a bunch
of data frames and I need to use a function on each one, it might be
easier for me to create a list of dataframes and run the function on the
list instead - lists end up being very useful at helping you to save
time. But no need to think much more about that now! The purpose of this
section was just to introduce you to the concept.
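To make that idea concrete, here is a minimal sketch of running one function over a list of data frames. The two small data frames are made up for illustration, and lapply() is a base R function that applies a function to each element of a list and returns a list of results.

```r
# Two small, made-up data frames
class_a <- data.frame(age = c(21, 18, 19))
class_b <- data.frame(age = c(20, 18, 19, 22))

# Store them together in a named list
classes <- list(a = class_a, b = class_b)

# Count the rows of every data frame in the list with one line
row_counts <- lapply(classes, nrow)
row_counts

# Double square brackets pull out a single element, just like $
classes[["b"]]
row_counts$a
```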
In this next section, we’ll introduce a useful new R package: dplyr!
This package is part of a larger family of packages that are known as
the tidyverse. I use this package constantly in my own work; it’s very
helpful for working with data, and hopefully you’ll agree!
Before I introduce some functions from the package, let’s first a)
download our first package and b) load in some data.
To download a package in RStudio, go to Tools > Install Packages,
then type dplyr in the box. Alternatively, you can run
install.packages("dplyr") in the console.
#We have to load in packages every time we start a new session in R
#When you start a new R document, it's often easiest to put all of the packages you'll be
#using at the top of the document
#Before we do that, usually I start by setting the working directory for my R session
#This tells R what folder I'll be using to access any files or data - I frequently work with Excel files, so this is very important
#We set the working directory like this:
setwd("C:/Users/melha/OneDrive/Documents/Binghamton/Harp130")
#How can I find my file path? On a Windows machine, go to your file explorer
#It'll show the file path at the top; usually it says something like "> This PC > Documents >...." - if you click on this line, the full file path is shown, and you can copy and paste it right into R
#Note, however, that R is very picky about slashes if you're using a Windows machine - while Windows uses a backslash, you will need to change it to a forward slash manually (I'll show you what I mean)
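Here is a small sketch of the slash issue. The folder name is a made-up placeholder, not a real path on your machine, and the last few lines use tempdir() (a temporary folder that R always has) so the example can run anywhere.

```r
# In an R string a single backslash starts an escape sequence, so a pasted
# Windows path has to be written with doubled backslashes...
win_path <- "C:\\Users\\yourname\\Documents\\Harp130"  # hypothetical folder

# ...or, more simply, with forward slashes. gsub() can do the swap for us.
r_path <- gsub("\\", "/", win_path, fixed = TRUE)
r_path

# Set the working directory, check it with getwd(), then switch back
old_dir <- getwd()
setwd(tempdir())
getwd()
setwd(old_dir)
```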
#After you set the working directory, let's load in our first package: dplyr
library(dplyr)
#Usually it'll show some text in red - this is generally not an issue, unless you get an error message that says the package failed to load
Now, we’ll load in our first data frame, and then I will show you how
to clean it up using dplyr! We’re going to play around with some sample
data. The data in the folder is American Community Survey data from
2010-12 that was compiled by FiveThirtyEight. It has a breakdown of
survey respondents’ college majors, their gender, and the number of
respondents that were employed after graduation. Note that this data is
somewhat outdated (e.g., it includes only two genders, so not every
student is represented), but it is good example data for the exercises
that follow. We will talk more about designing inclusive surveys later.
Make sure you’ve put the .csv file in your working directory!
Source: https://github.com/fivethirtyeight/data/blob/master/college-majors
#Let's read in the data using the read.csv command - this command is how we will load most data in the course
majors <- read.csv("college_majors.csv")
#How many observations are in this data? How many variables?
#Let's inspect the first few rows of the data to make sure it looks right
head(majors)
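One way to answer the questions above is with nrow(), ncol(), and dim(). Since this sketch cannot assume you have the course file in place, it writes a tiny made-up CSV to a temporary file and reads it back; with the real data you would simply call these functions on majors.

```r
# Write a tiny, made-up CSV to a temporary file (a stand-in for the course data)
toy <- data.frame(Major = c("A", "B", "C"), Total = c(100, 250, 75))
toy_file <- tempfile(fileext = ".csv")
write.csv(toy, toy_file, row.names = FALSE)

# Read it back in with read.csv()
toy_majors <- read.csv(toy_file)

# Observations (rows) and variables (columns)
nrow(toy_majors)
ncol(toy_majors)
dim(toy_majors)  # rows first, then columns
```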
#Let's start learning how to use dplyr!
#First things first: the pipe symbol, or %>%
#The pipe is extremely useful and flexible, and it makes code very easy to follow
#Essentially, it allows us to perform multiple operations or functions in a logical order
#For example, I can write the following code to repeat the previous step:
majors %>% head()
#Here, I am telling R, "take the majors dataframe, then print the first six lines"
#The pipe is the "then" in the sentence - it tells R what to do next
#We can practice using it with other simple functions:
majors %>% str()
'data.frame': 173 obs. of 8 variables:
$ Major_code : int 2419 2416 2415 2417 2405 2418 6202 5001 2414 2408 ...
$ Major : chr "PETROLEUM ENGINEERING" "MINING AND MINERAL ENGINEERING" "METALLURGICAL ENGINEERING" "NAVAL ARCHITECTURE AND MARINE ENGINEERING" ...
$ Total : int 2339 756 856 1258 32260 2573 3777 1792 91227 81527 ...
$ Men : int 2057 679 725 1123 21239 2200 2110 832 80320 65511 ...
$ Women : int 282 77 131 135 11021 373 1667 960 10907 16016 ...
$ Major_category: chr "Engineering" "Engineering" "Engineering" "Engineering" ...
$ ShareWomen : num 0.121 0.102 0.153 0.107 0.342 ...
$ Employed : int 1976 640 648 758 25694 1857 2912 1526 76442 61928 ...
#Now we know the type of each variable in the data, which is very handy information
#Obviously, we could have just written str(majors) - right now, this doesn't look all that useful!
#Let's introduce some more functions from the dplyr package, and you'll see why the pipe is so versatile
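To see why the pipe helps readability, compare nested function calls with a piped version of the same steps. This is a small sketch on a made-up numeric vector; x %>% f(y) is just another way of writing f(x, y).

```r
library(dplyr)  # loading dplyr makes the %>% pipe available

scores <- c(4, 9, 16)

# Nested calls are read from the inside out
nested <- round(mean(sqrt(scores)), 1)

# The piped version reads left to right:
# take scores, then take square roots, then average, then round
piped <- scores %>% sqrt() %>% mean() %>% round(1)

identical(nested, piped)
```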
#The first dplyr function we will learn is filter!
#Let's say I want to only look at engineering majors. I can use "filter" to select rows of a dataframe based on a logical condition like this:
majors %>% filter(Major_category == "Engineering") %>% head()
What did I do there? Let’s think through the steps. First, I told R
to use the dataframe, majors. Next, I told R to filter the dataframe and
only include rows where Major_category is equal to Engineering. Then, I
told R that I wanted to see the first six rows of this filtered data.
Note that, when using filter, strings are case sensitive. If I had typed
“engineering” the new data would have contained zero rows. In addition,
note that filtering is a logical operation - that is why I used two
equal signs. When I work with characters using “filter”, I can tell R that I
want to select rows where a variable is equal (==) or not equal (!=) to
a specified value. If I were filtering based on a numeric value, I could
also tell R to filter rows that are less than (<) or greater than
(>) a specified value. Let’s look at some additional examples:
#select majors that have more than 2,000 students:
majors %>% filter(Total > 2000) %>% head()
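Conditions can also be combined inside filter(). The sketch below uses a small made-up data frame that borrows the column names from majors; & means "and", | means "or", and separating conditions with a comma is the same as using &.

```r
library(dplyr)

# Made-up rows that mimic the structure of the majors data
toy <- data.frame(
  Major = c("A", "B", "C", "D"),
  Major_category = c("Engineering", "Engineering", "Health", "Arts"),
  Total = c(1500, 3200, 2500, 900)
)

# Engineering majors with more than 2,000 students (both must be true)
big_eng <- toy %>% filter(Major_category == "Engineering" & Total > 2000)

# The same filter, written with a comma instead of &
big_eng2 <- toy %>% filter(Major_category == "Engineering", Total > 2000)

# Majors that are in Arts or have at least 2,500 students (either can be true)
arts_or_big <- toy %>% filter(Major_category == "Arts" | Total >= 2500)
arts_or_big
```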
#What if I want to filter by multiple categories?
#First, how do I find out what all of the categories are without scrolling through the entire dataframe? We can use the base R function "unique()" to view the unique values of a variable like this:
unique(majors$Major_category)
[1] "Engineering" "Business"
[3] "Physical Sciences" "Law & Public Policy"
[5] "Computers & Mathematics" "Agriculture & Natural Resources"
[7] "Industrial Arts & Consumer Services" "Arts"
[9] "Health" "Social Science"
[11] "Biology & Life Science" "Education"
[13] "Humanities & Liberal Arts" "Psychology & Social Work"
[15] "Communications & Journalism" "Interdisciplinary"
#What if we want to know which category is the largest? We can use another base R function, table(), to look at this:
table(majors$Major_category)
Agriculture & Natural Resources Arts
10 8
Biology & Life Science Business
14 13
Communications & Journalism Computers & Mathematics
4 11
Education Engineering
16 29
Health Humanities & Liberal Arts
12 15
Industrial Arts & Consumer Services Interdisciplinary
7 1
Law & Public Policy Physical Sciences
5 10
Psychology & Social Work Social Science
9 9
#Which is the largest?
#What if we wanted to filter by the Health and Biology & Life Science categories? We can do that using the %in% operator like so:
majors %>% filter(Major_category %in% c("Health", "Biology & Life Science")) %>% head()
What did we actually just do? Let’s break it down. First, we told R
to use the majors data. Next, we used filter() to select rows that
contain either Health or Biology & Life Science in the variable
Major_category. Then we asked R to show us the first six rows of the
result. We use the c() function that we learned before to list the
values we want, and %in% to keep any row whose category matches one of
them. Note that == does not work here: it compares the two vectors
element by element, recycling the shorter one, so it silently misses
matching rows (and may produce a "longer object length" warning).
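The difference between == and %in% is easiest to see on a small made-up data frame. With ==, R compares the two vectors element by element and recycles the shorter one, so matching rows can be silently dropped; %in% checks each row against the whole set. The names toy and keep are placeholders.

```r
library(dplyr)

# Made-up data: five of these six rows are Health or Biology & Life Science
toy <- data.frame(
  Major_category = c("Biology & Life Science", "Health", "Health",
                     "Engineering", "Biology & Life Science", "Health")
)
keep <- c("Health", "Biology & Life Science")

# == recycles keep down the column, so only rows that happen to line up match
with_equals <- toy %>% filter(Major_category == keep)

# %in% keeps every row whose category is in the set
with_in <- toy %>% filter(Major_category %in% keep)

nrow(with_equals)
nrow(with_in)
```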
#Before we move on, it's important to note that we can save the filtered data to a new object! This comes in handy if we want to perform any analysis on a smaller version of the data. An example:
engineers <- majors %>% filter(Major_category == "Engineering")
#Now we have a new object, engineers, that we can easily work with
Let’s look at another function from dplyr. What if, instead of rows,
I want to select specific columns in a dataframe? This is also very
simple if we use the “select()” function. Let’s look at an example, and
then I’ll explain it in a bit more detail.
#What if we just want to look at enrollment in each major?
enroll <- majors %>% select(Major, Total)
#Using select, I just need to tell R the names of the columns that I am interested in selecting. If you look at my new dataframe, enroll, you'll see that I only have two columns.
#If I'm working with this data, I might want to know which majors have the largest enrollment. Here, we'll use another new dplyr function, arrange() to order the data
enroll <- majors %>% select(Major, Total) %>% arrange(desc(Total))
#When using arrange(), it's important to note that, by default, this function orders numeric data from smallest to largest (or in ascending order). If you wish to order your data from largest to smallest (in descending order), you must tell R this by putting the variable that you wish to order the dataframe by inside of the function desc(). If you check the help documentation for arrange(), it shows examples using desc() - if you ever forget, just check the help page! It will often have the information you're looking for. I'll show more complicated ways to use arrange() later in the course.
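Here is arrange() with and without desc() on a small made-up data frame, so you can see both orderings side by side.

```r
library(dplyr)

# Made-up enrollment numbers
toy <- data.frame(Major = c("A", "B", "C"), Total = c(250, 75, 100))

# Default: smallest to largest
ascending <- toy %>% arrange(Total)

# Wrapped in desc(): largest to smallest
descending <- toy %>% arrange(desc(Total))

ascending$Major
descending$Major
```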
Let’s talk about another important function from the dplyr package:
mutate(). This function allows you to easily create a new variable in a
dataframe. It works like this: mutate(variable_name = x). X can be any
number of things - often, you will create a new variable based on a
formula or other variables in the dataframe. Let’s go back to the larger
majors dataframe and look at some examples of this in practice!
#The simplest possible variable would just be a number or a letter for every row. I can create a variable like this as follows (I'll show you why this is useful in a bit):
majors1 <- majors %>% mutate(one = 1)
#Take a look at the dataframe to see what this looks like! I've stored this as a new dataframe, majors1.
#As I said earlier, often you will use a formula to create a new variable. Let's say, for instance, I want to create a new variable that shows the fraction of students in each major who are women. I can do this using two existing variables, Total and Women, as follows:
majors1 <- majors1 %>% mutate(women_frac = Women/Total)
#If we want to know which major has the largest percentage of women, we can add on arrange() at the end to find out:
majors1 <- majors1 %>% mutate(women_frac = Women/Total) %>% arrange(desc(women_frac))
#Which major has the largest percentage of women?
#What if I don't want these new variables anymore? I can easily remove them using select():
majors2 <- majors1 %>% select(-c(one, women_frac))
#Deleting two variables requires two things: a) I need to use a "-" to tell R that I am removing two variables rather than selecting them. b) I need to tell R that I am removing both, so I use c() to communicate that to R. What happens if I don't do that?
majors3 <- majors1 %>% select(-one, women_frac)
#alternatively, I can put a "-" in front of each variable to accomplish the same goal - using c() becomes very convenient, however, if you are removing a large number of variables at once.
majors1 <- majors1 %>% select(-one, -women_frac)
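To check what each version of select() actually keeps, you can compare the column names of the results. This sketch uses a small made-up data frame with the same two extra variables.

```r
library(dplyr)

# Made-up data with the two extra variables added by mutate()
toy <- data.frame(Major = c("A", "B"), Women = c(30, 60), Total = c(100, 80)) %>%
  mutate(one = 1, women_frac = Women / Total)

# Both variables are removed by either of these
toy %>% select(-c(one, women_frac)) %>% names()
toy %>% select(-one, -women_frac) %>% names()

# Only "one" is removed here: women_frac has no "-", so it is kept
only_one_dropped <- toy %>% select(-one, women_frac)
names(only_one_dropped)
```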
I have two more dplyr functions to introduce! The two are summarise()
and group_by(). Often, rather than work with raw data, we will want to
know information about groups within our data. In the majors data, we
might want to know more about each Major_category - for example, the
number of students in each category. We don’t want to count up each one
by hand, that would be very tedious! Instead, we can tell R to do this
for us, using the two new functions.
#If I want R to perform an operation or function by group (in this case, by Major_category), I have to tell it to do so using group_by(). This function doesn't change the values in the dataframe - it just instructs R to use the specified groups in any subsequent operations. Using it is as simple as:
majors %>% group_by(Major_category)
#Now, I want to create a new dataframe with the total number of students in each category. We can do this by summing the variable Total within each group. summarise() allows us to do just that - this function creates a new dataframe for a summary statistic of our choosing. Because it is a summary of the data, this function typically collapses the larger dataframe into one that has one value for each group. For example, if we use summarise to calculate the mean of the variable Total (without grouping it), this is what we get:
mean_total <- majors %>% summarise(mean = mean(Total, na.rm = TRUE))
#In this line of code, the whole data frame is treated as a single group, so the output of summarise is one value. Let's see what happens when we group the data by category. Here, we'll get the total number of students in each category, so we'll use the sum() function as our summary statistic.
cat_total <- majors %>% group_by(Major_category) %>% summarise(total_cat = sum(Total))
#let's look at this new dataframe:
head(cat_total)
#Uh oh, one of the values is NA! Recall that, when we calculate any summary statistic, R doesn't know what to do with NA values - when it encounters an NA value within a group, it therefore returns NA as the final value. We need to tell it to ignore the NA values to avoid this.
cat_total <- majors %>%
  group_by(Major_category) %>%
  summarise(total_cat = sum(Total, na.rm = TRUE))
#That should work now! Note that, when I write longer or more complicated code using dplyr, I like to put each additional function on its own line - this just makes the code cleaner and easier to follow.
#So, this does a really nice job producing one summary statistic for us! But what if we want to create a dataframe with a number of summary statistics about the major categories? This is easy to do with summarise() too! If you look up the help page for summarise(), it lists a number of suggested summary statistics that you can calculate using this function - if you are stuck and need ideas, it's helpful to return to the help page. Let's say I want to know the total students in each category, average enrollment for each department in the categories, as well as the largest and smallest enrollment values for the departments in each category. Here's how I do it, using the sum(), mean(), min(), and max() functions that are built into R:
summary_cat <- majors %>%
  group_by(Major_category) %>%
  summarise(total = sum(Total, na.rm = TRUE), avg = mean(Total, na.rm = TRUE),
            smallest = min(Total, na.rm = TRUE), largest = max(Total, na.rm = TRUE))
#Now we have a nice summary table! Let's arrange the table in an order that makes sense, for example, alphabetically:
summary_cat <- summary_cat %>% arrange(Major_category)
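summarise() can also count rows. As a sketch, the dplyr helper n() returns the number of rows in each group, which gives the same kind of counts that table() produced earlier. The data frame and the column name n_majors are made up for illustration.

```r
library(dplyr)

# Made-up data with one missing enrollment value
toy <- data.frame(
  Major_category = c("Health", "Arts", "Health", "Arts", "Health"),
  Total = c(100, 50, NA, 70, 200)
)

# n() counts the rows in each group; sum() adds up the students,
# skipping the missing value because na.rm = TRUE
toy_summary <- toy %>%
  group_by(Major_category) %>%
  summarise(n_majors = n(),
            total = sum(Total, na.rm = TRUE)) %>%
  arrange(Major_category)

toy_summary
```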
That’s it for Week 2 - we’ve covered a lot of ground! In this week’s
tutorial, we’ve discussed missing data, how to work with data frames and
lists, and the basics of the dplyr package. Using this package, you will
be able to easily clean and manipulate data, create new variables, and
produce summary statistics. We learned about the pipe operator (%>%)
and some of the most common dplyr functions: filter(),
select(), mutate(), arrange(), group_by(), and summarise(). We also
learned some base R functions that we will use to produce summary
statistics: mean(), sum(), min(), and max(). We will learn more summary
statistics as we go, but these are important basic functions to start
with. Next week, we will use our data manipulation skills to learn how
to produce basic visuals and graphs using the ggplot2 package. The
skills we are learning in the first three weeks of this course are
foundational R concepts that we will utilize throughout the course and
that will help you if you choose to expand your programming knowledge in
the future.
Resources
Navarro, D. (2019). Learning Statistics with R. Retrieved from: https://learningstatisticswithr.com/book/index.html.
Wickham, H., François, R., Henry, L. & Müller, K. (2021). dplyr:
A grammar of data manipulation. R package version 1.0.6. Retrieved from
https://CRAN.R-project.org/package=dplyr.
