Welcome to Week 2 of Geog 380A! This week, we will learn some additional concepts and tools in R that will help you to better work with, manage, and clean data for statistical analysis. Some of this tutorial will follow along with the chapter in Learning Statistics with R; in addition, I will introduce you to some techniques using the dplyr package, which came out after the initial version of our textbook. Not only is this package extremely useful for data management tasks, but it is also very popular in the R community and will be a very valuable tool in your arsenal. Let’s get started!


#Let's start by continuing last week's discussion of variables
#When you work with a variable, there are some special values that you might see

#For example, do you remember what you get when you divide a positive number by 0? Try it!

1/0
[1] Inf
#The answer is infinity, or "Inf"

#Next, what happens if I try to perform an operation that results in an invalid number? For example:

0/0
[1] NaN
#Here, the result is "NaN", which stands for "Not a Number". This means that the result is undefined: there is no meaningful numeric answer. This is distinct from the next two values: with NaN, R carried out the calculation and the result simply isn't a meaningful number, whereas the next two values mean that a value is missing or absent.

#By contrast, "NA" denotes missing data - we don't know what's supposed to be stored in this element. "NULL" on the other hand indicates that there is no value at all. 
#In practice, you are more likely to see NaN or NA - it is very important to pay attention to these values. If they pop up, they can invalidate the results of your analysis or cause R to return an incorrect answer. For example, let's say we have this vector and there's accidentally a missing value in it:

obj <- c(1, 3, 5, NA, 2, 7)

#What if we want to take the sum of these values? What does R give us?

sum(obj)
[1] NA
#We get an NA! There are a number of ways to troubleshoot this; the most common way is to tell R to simply ignore missing values, by setting na.rm (short for NA Remove) to TRUE:

sum(obj, na.rm = TRUE)
[1] 18
#As a reminder, you can always check the documentation for a function to see what arguments it takes; many functions will include some variation of the na.rm argument to allow you to ignore missing values. This may not always be the best option, but we don't need to worry about that right now. 
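
To make the difference between these special values more concrete, here is a small sketch using the base R helper functions is.na(), is.nan(), and is.null(). It reuses the vector from above and assumes nothing else.

```r
# Same vector as above, with one missing value
obj <- c(1, 3, 5, NA, 2, 7)

is.na(obj)          # TRUE only for the missing element
which(is.na(obj))   # position of the missing value
sum(is.na(obj))     # how many values are missing

# Many summary functions accept na.rm, not just sum()
mean(obj, na.rm = TRUE)

# NaN counts as missing for is.na(), but NA is not NaN
is.na(0/0)
is.nan(NA)

# NULL has no elements at all, so its length is 0
is.null(NULL)
length(NULL)
```

Counting missing values with sum(is.na(...)) before using na.rm is a quick way to see how much data you would be ignoring.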
#What if we want to add names to that vector that we just created? We can easily do that like so:

names(obj) <- c("E1", "E2", "E3", "E4", "E5", "E6")

#Here I've just used "E-" to generically name each element, but you can use any name that you want
#inspect our vector
obj
E1 E2 E3 E4 E5 E6 
 1  3  5 NA  2  7 
#If I change my mind and don't want the names, I can always do the following to delete them:
names(obj) <- NULL

#We use "NULL" in this case to tell R that there is no value for the names - if we used NA instead, what would happen?

names(obj) <- NA

obj
<NA> <NA> <NA> <NA> <NA> <NA> 
   1    3    5   NA    2    7 
#Finally, names are useful because we can index vectors using element names:

names(obj) <- c("E1", "E2", "E3", "E4", "E5", "E6")
obj["E2"]
E2 
 3 
#Next, we'll talk about a data type that is specific to statistics. So far, we've learned about numeric and character data. The next data type is a Factor
#This becomes important when we want to distinguish between nominal, ordinal, interval and ratio scale data
#Numeric data makes a lot of sense for ratio data, and some sense for interval data (although note that interval data does not have a natural zero value, so some operations, such as multiplication and division, wouldn't make sense to use)
#What about data where performing numeric operations doesn't make a lot of sense? For example, let's think about nominal data. Say I've assigned you all to three groups: groups 1, 2, and 3. These numbers are just labels! It wouldn't make sense to say that group 2 is 2x group 1, or that group 3 minus group 2 equals group 1, would it? 

#But, if I store this as numeric data, what happens? Let's say I have this group data:

group <- c(1,2,3,2,1,3,2)

#Realistically, I shouldn't be able to perform any operations on this data, but R will let me:

group*3
[1] 3 6 9 6 3 9 6
group+2
[1] 3 4 5 4 3 5 4
#In order to ensure that our data is properly stored, we can transform this data into a different data type: Factor data. 

group <- as.factor(group)

#did it work?

class(group)
[1] "factor"
#Can I perform any computations on this data?

group+1
Warning: ‘+’ not meaningful for factors
[1] NA NA NA NA NA NA NA
#No! This is what we wanted to see. 
#What happens if we print out this data?

group
[1] 1 2 3 2 1 3 2
Levels: 1 2 3
#The "Levels" line lists the distinct categories in the factor, while the first line shows which category each element of our vector belongs to. We can also make the levels more meaningful here:

levels(group) <- c("group_1", "group_2", "group_3")

group
[1] group_1 group_2 group_3 group_2 group_1 group_3 group_2
Levels: group_1 group_2 group_3
#It's not necessarily pretty, but it's more meaningful than the previous version
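
As a sketch of an alternative route, the base R function factor() can set the levels and their labels in a single step, and table() counts how many elements fall in each category. The data and label names are the ones used above.

```r
# Build the factor and label its levels in one call
group <- factor(c(1, 2, 3, 2, 1, 3, 2),
                levels = c(1, 2, 3),
                labels = c("group_1", "group_2", "group_3"))

levels(group)      # the categories
nlevels(group)     # number of categories
table(group)       # how many elements fall in each category
as.integer(group)  # the integer codes R stores behind the labels
```

The last line shows that a factor is stored as integer codes plus a set of labels, which is why arithmetic on it is not meaningful.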

So, you have now seen numeric, character, and factor data, and we’ll use factors throughout the course to store nominal data. You have also worked with different types of objects, including values and vectors. Next, we will introduce a very important type of object: the data frame. You may be familiar with this data type if you have ever worked in Microsoft Excel before - a spreadsheet is one way to view a data frame. At its simplest, a data frame is a way of organizing vectors of data into a table.

#To start out, let's say we have some data on the students in this class
#We know everyone's major, their home state, and age

major <- c("math", "sociology", "psychology", "geography", "physics", "english")
state <- c("New York", "New York", "Pennsylvania", "New Jersey", "New York", "New York")
age <- c(21,18,19,20,18,19)

#We can view and work with each vector individually, but this doesn't tell us how the different variables relate to each other! Additionally, we know that each vector is the same size, and each element represents the same student in each vector (e.g., element 1 is student 1 in each vector)
#In order to visualize this data in a more organized way, we can put it into a data frame!

students <- data.frame(age, major, state)

students

That is so much nicer! Notice that, in the environment pane, the “students” data frame is now listed under Data, while our vectors are listed as Values - this is how R organizes this kind of data. When we manipulate data, it’s also important to pay attention to how it is being categorized - if you convert a data frame to a vector or several vectors to a data frame, always remember to check that it is being stored correctly.

#What if we want to just inspect one variable in this data frame? We can do this using the '$' sign like this:

students$state
[1] "New York"     "New York"     "Pennsylvania" "New Jersey"   "New York"    
[6] "New York"    
#This tells R to print the variable "state" within the data frame "students"

#What if I have a very large data frame, and I need to know what the variable names are? There are a few ways to do this:

names(students)
[1] "age"   "major" "state"
colnames(students)
[1] "age"   "major" "state"
#Sometimes, you will find that you have a very large data frame, and it's time-consuming to print out the whole thing. You can view just a few lines using the function head()

head(students)

#Let's look under the hood of this function:

?head

#Here we see that we can set the number of rows that are displayed using the n= argument like this:

head(students, n = 3)
#It can also be helpful to check the type of data that is stored in our data frame:
#Try this first - what type of data is it?
str(students)
'data.frame':   6 obs. of  3 variables:
 $ age  : num  21 18 19 20 18 19
 $ major: chr  "math" "sociology" "psychology" "geography" ...
 $ state: chr  "New York" "New York" "Pennsylvania" "New Jersey" ...
#The str() command gives us a lot of useful information about our data frame! We'll discuss data frames in more detail later; for now, it's just important for you to be familiar with this type of object
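
Here is a short sketch of a few other base R ways to inspect and index a data frame. It rebuilds the students data frame from above so that it runs on its own; square brackets take a row position first and a column second.

```r
major <- c("math", "sociology", "psychology", "geography", "physics", "english")
state <- c("New York", "New York", "Pennsylvania", "New Jersey", "New York", "New York")
age <- c(21, 18, 19, 20, 18, 19)
students <- data.frame(age, major, state)

dim(students)                    # number of rows, then number of columns
nrow(students)                   # number of rows only
students[2, "major"]             # row 2, column "major"
students[students$age > 19, ]    # all rows where age is over 19
mean(students$age)               # a summary of a single column
```

Leaving the column position empty, as in the fourth line, keeps every column for the selected rows.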

Finally, we will introduce one more type of object, the list. Lists will not be used frequently throughout this course, but they are a fundamental data structure in R and it’s helpful for you to at least be familiar with them now. You will likely use lists quite a bit more as you learn more advanced programming (I use lists constantly in my own research). It is also helpful to know about lists in case you ever accidentally transform a vector or a data frame into a list (which may occasionally happen if you use a function incorrectly, although hopefully we will avoid this!). What is a list? A list is a collection of variables, similar to a data frame. A list, however, is not neat and structured in the same way that a data frame is. Let’s look at an example!

#I'll create a list containing some data on a hypothetical student:
student_a <- list( age = 21,
                   state = "New York",
                   major = c("Economics","Math") 
)

#inspect the list:
student_a
$age
[1] 21

$state
[1] "New York"

$major
[1] "Economics" "Math"     
#in our data frame, each student could only have exactly one major - in this list, the element "major" can contain as many values as we want! While all of the columns must be the same length in any data frame, a list is a nice way of storing data of different lengths. We can select an element from a list similarly to a data frame:

student_a$major
[1] "Economics" "Math"     

When might you want to use a list? Again, in this class, we likely won’t use them very often. But, think about a time when you might use a large amount of data for a research problem. If I’m working with a bunch of data frames and I need to use a function on each one, it might be easier for me to create a list of dataframes and run the function on the list instead - lists end up being very useful at helping you to save time. But no need to think much more about that now! The purpose of this section was just to introduce you to the concept.
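
As a minimal sketch of that idea, the example below puts two small, made-up data frames into a list and applies the same function to each with the base R functions lapply() and sapply(). The object names are hypothetical and only for illustration.

```r
# Two small, made-up data frames standing in for separate data sets
class_a <- data.frame(age = c(21, 18, 19))
class_b <- data.frame(age = c(20, 18, 19, 22))

# Store both data frames in one named list
classes <- list(a = class_a, b = class_b)

# lapply() runs the same function on every element and returns a list
lapply(classes, nrow)

# sapply() does the same but simplifies the result to a named vector
sapply(classes, function(df) mean(df$age))
```

With two data frames this saves little, but the same two lines would work unchanged on a list of fifty.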

In this next section, we’ll introduce a useful new R package: dplyr! This package is part of a larger family of packages that are known as the tidyverse. I use this package constantly in my own work; it’s very helpful for working with data, and hopefully you’ll agree!

Before I introduce some functions from the package, let’s first a) download our first package and b) load in some data.

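To download a package in RStudio, go to Tools > Install Packages, then type dplyr in the box. Alternatively, you can run install.packages("dplyr") in the console. You only need to install a package once, but you need to load it with library() in every new R session: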

library(dplyr)

Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

Now, we’ll load in our first data frame, and then I will show you how to clean it up using dplyr! We’re going to play around with some sample data. The data in the folder is American Community Survey data from 2010-12 that was compiled by FiveThirtyEight. It has a breakdown of survey respondents’ college majors, their gender, and the number of respondents that were employed after graduation. Note that this data is somewhat outdated (e.g., by including only two genders, it does not represent every student), but it is good example data for the exercises that follow. We will talk more about designing inclusive surveys later. Make sure you’ve put the .csv file in your working directory!

Source: https://github.com/fivethirtyeight/data/blob/master/college-majors
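
The file name is not given here, so the sketch below uses a hypothetical name, majors.csv, and writes a tiny stand-in file to a temporary folder so that it can run on its own. In practice, you would point read.csv() at the course file in your working directory instead.

```r
# Create a tiny stand-in CSV in a temporary folder (for illustration only)
csv_path <- file.path(tempdir(), "majors.csv")   # hypothetical file name
writeLines(c("Major,Total,Major_category",
             "PETROLEUM ENGINEERING,2339,Engineering",
             "MINING AND MINERAL ENGINEERING,756,Engineering"),
           csv_path)

# Read the file into a data frame
majors_demo <- read.csv(csv_path)

# With the real file in your working directory, this would look like:
# majors <- read.csv("majors.csv")   # replace with the actual file name

getwd()            # shows the current working directory
str(majors_demo)   # check that the columns were read as expected
```

If read.csv() reports that it cannot open the file, comparing getwd() with the folder that holds the .csv is usually the first thing to check.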

#Let's start learning how to use dplyr!
#First things first: the pipe symbol, or %>% 
#The pipe is extremely useful and flexible, and it makes code very easy to follow 
#Essentially, it allows us to perform multiple operations or functions in a logical order
#For example, I can write the following code to print the first few rows of the data, just like head(majors):

majors %>% head()

#Here, I am telling R, "take the majors dataframe, then print the first six lines"
#The pipe is the "then" in the sentence - it tells R what to do next
#Note that in RStudio you can press Ctrl+Shift+M (Cmd+Shift+M on a Mac) as a shortcut to insert the pipe symbol
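
To see why the pipe makes code easier to follow, here is a small sketch comparing nested function calls with the piped version. The ages vector is just example data; both versions compute the same thing.

```r
library(dplyr)

ages <- c(21, 18, 19, 20, 18, 19)

# Nested calls are read from the inside out
nested <- round(mean(ages), 1)

# The pipe version reads left to right: take ages, then mean, then round
piped <- ages %>% mean() %>% round(1)

# Both approaches give the same result
identical(nested, piped)
```

The pipe passes whatever is on its left as the first argument of the function on its right, so round(1) here means round(the mean, 1).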
majors %>% str()
'data.frame':   173 obs. of  8 variables:
 $ Major_code    : int  2419 2416 2415 2417 2405 2418 6202 5001 2414 2408 ...
 $ Major         : chr  "PETROLEUM ENGINEERING" "MINING AND MINERAL ENGINEERING" "METALLURGICAL ENGINEERING" "NAVAL ARCHITECTURE AND MARINE ENGINEERING" ...
 $ Total         : int  2339 756 856 1258 32260 2573 3777 1792 91227 81527 ...
 $ Men           : int  2057 679 725 1123 21239 2200 2110 832 80320 65511 ...
 $ Women         : int  282 77 131 135 11021 373 1667 960 10907 16016 ...
 $ Major_category: chr  "Engineering" "Engineering" "Engineering" "Engineering" ...
 $ ShareWomen    : num  0.121 0.102 0.153 0.107 0.342 ...
 $ Employed      : int  1976 640 648 758 25694 1857 2912 1526 76442 61928 ...

#Now let's try our first dplyr function, filter(), which keeps only the rows that meet a condition:

majors %>% filter(Major_category == "Engineering") %>% head()

What did I do there? Let’s think through the steps. First, I told R to use the dataframe, majors. Next, I told R to filter the dataframe and only include rows where Major_category is equal to Engineering. Then, I told R that I wanted to see the first six rows of this filtered data. Note that, when using filter(), strings are case sensitive. If I had typed “engineering”, the new data would have contained zero rows. In addition, note that filtering is a logical operation - that is why I used two equal signs. When I work with characters using filter(), I can tell R that I want to select rows where a variable is equal (==) or not equal (!=) to a specified value. If I were filtering based on a numeric value, I could also tell R to keep rows that are less than (<) or greater than (>) a specified value. Let’s look at some additional examples:
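
The sketch below tries each of these comparisons on a small stand-in data frame, majors_demo (a hypothetical name). Its first four rows use values from the str() output above, and the last row is made up so that there is something outside Engineering.

```r
library(dplyr)

# First four rows come from the str() output above; the last row is made up
majors_demo <- data.frame(
  Major = c("PETROLEUM ENGINEERING", "MINING AND MINERAL ENGINEERING",
            "METALLURGICAL ENGINEERING",
            "NAVAL ARCHITECTURE AND MARINE ENGINEERING", "EXAMPLE MAJOR"),
  Total = c(2339, 756, 856, 1258, 500),
  Major_category = c(rep("Engineering", 4), "Example Category")
)

# Numeric comparison: majors with more than 1000 students
majors_demo %>% filter(Total > 1000)

# Not equal: everything outside Engineering
majors_demo %>% filter(Major_category != "Engineering")

# Conditions separated by commas must all be true
majors_demo %>% filter(Major_category == "Engineering", Total < 1000)

# Strings are case sensitive, so this returns zero rows
majors_demo %>% filter(Major_category == "engineering")
```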

table(majors$Major_category)

    Agriculture & Natural Resources                                Arts 
                                 10                                   8 
             Biology & Life Science                            Business 
                                 14                                  13 
        Communications & Journalism             Computers & Mathematics 
                                  4                                  11 
                          Education                         Engineering 
                                 16                                  29 
                             Health           Humanities & Liberal Arts 
                                 12                                  15 
Industrial Arts & Consumer Services                   Interdisciplinary 
                                  7                                   1 
                Law & Public Policy                   Physical Sciences 
                                  5                                  10 
           Psychology & Social Work                      Social Science 
                                  9                                   9 
majors %>% filter(Major_category %in% c("Health", "Biology & Life Science")) %>% head()

What did we actually just do? Let’s break it down. First, we told R to use the majors data. Next, we used filter() to select rows that contain either Health or Biology & Life Science in the variable Major_category. Then we asked R to show us the first six rows of the result. To filter by multiple character values, combine them with the c() function that we learned before and use the %in% operator rather than ==. With ==, R compares the two vectors element by element, recycling the shorter one, so it silently misses rows and warns that the “longer object length is not a multiple of shorter object length”.
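
One thing to be careful with when matching several values is the difference between == and %in%. The sketch below uses a made-up four-row data frame (toy and wanted are hypothetical names) to show that == compares position by position, while %in% checks each row against the whole set.

```r
library(dplyr)

# Made-up rows, for illustration only
toy <- data.frame(
  Major_category = c("Biology & Life Science", "Health", "Arts", "Health")
)

wanted <- c("Health", "Biology & Life Science")

# == compares position by position, recycling the shorter vector,
# so here every comparison happens to come out FALSE
toy$Major_category == wanted

# %in% asks, for each row, whether its value appears anywhere in wanted
toy$Major_category %in% wanted

# Used inside filter(), %in% keeps the three matching rows
toy %>% filter(Major_category %in% wanted)
```

Because four is a multiple of two, the == version here does not even produce a warning, which makes this kind of mistake easy to miss.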

#Before we move on, it's important to note that we can save the filtered data to a new object! This comes in handy if we want to perform any analysis on a smaller version of the data. An example:

engineers <- majors %>% filter(Major_category == "Engineering")

#Now we have a new object, engineers, that we can easily work with

Let’s look at another function from dplyr. What if, instead of rows, I want to select specific columns in a dataframe? This is also very simple if we use the select() function. Let’s look at an example, and then I’ll explain it in a bit more detail.

#What if we just want to look at enrollment in each major?

enroll <- majors %>% select(Major, Total)

#Using select, I just need to tell R the names of the columns that I am interested in selecting. If you look at my new dataframe, enroll, you'll see that I only have two columns. 

#If I'm working with this data, I might want to know which majors have the largest enrollment. Here, we'll use another new dplyr function, arrange() to order the data

enroll <- majors %>% select(Major, Total) %>% arrange(desc(Total))

#When using arrange(), it's important to note that, by default, this function orders numeric data from smallest to largest (or in ascending order). If you wish to order your data from largest to smallest (in descending order), you must tell R this by putting the variable that you wish to order the dataframe by inside of the function desc(). If you check the help documentation for arrange(), it shows examples using desc() - if you ever forget, just check the help page! It will often have the information you're looking for. I'll show more complicated ways to use arrange() later in the course. 
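
Here is a small sketch of select() and arrange() working together on a stand-in data frame, majors_demo (a hypothetical name), built from the first four rows of the str() output above.

```r
library(dplyr)

# Values taken from the first four rows of the str() output above
majors_demo <- data.frame(
  Major = c("PETROLEUM ENGINEERING", "MINING AND MINERAL ENGINEERING",
            "METALLURGICAL ENGINEERING",
            "NAVAL ARCHITECTURE AND MARINE ENGINEERING"),
  Total = c(2339, 756, 856, 1258),
  Men = c(2057, 679, 725, 1123)
)

# Keep two columns, then sort from smallest to largest (the default)
smallest_first <- majors_demo %>% select(Major, Total) %>% arrange(Total)
smallest_first

# Wrap the column in desc() to sort from largest to smallest
largest_first <- majors_demo %>% select(Major, Total) %>% arrange(desc(Total))
largest_first$Major[1]   # the major with the largest enrollment
```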

Let’s talk about another important function from the dplyr package: mutate(). This function allows you to easily create a new variable in a dataframe. It works like this: mutate(variable_name = x). Here, x can be any number of things - often, you will create a new variable based on a formula or other variables in the dataframe. Let’s go back to the larger majors dataframe and look at some examples of this in practice!

#The simplest possible variable would just be a number or a letter for every row. I can create a variable like this as follows (I'll show you why this is useful in a bit):

majors1 <- majors %>% mutate(one = 1)

#Take a look at the dataframe to see what this looks like! I've stored this as a new dataframe, majors1. 
#As I said earlier, often you will use a formula to create a new variable. Let's say, for instance, I want to create a new variable that shows the fraction of students in each major who are women. I can do this using two existing variables, Total and Women, as follows:

majors1 <- majors1 %>% mutate(women_frac = Women/Total)

#If we want to know which major has the largest percentage of women, we can add on arrange() at the end to find out:

majors1 <- majors1 %>% mutate(women_frac = Women/Total) %>% arrange(desc(women_frac))
#Which major has the largest percentage of women?

#What if I don't want these new variables anymore? I can easily remove them using select():

majors2 <- majors1 %>% select(-c(one, women_frac))

#Deleting two variables requires two things: a) I need to use a "-" to tell R that I am removing two variables rather than selecting them. b) I need to tell R that I am removing both, so I use c() to communicate that to R. What happens if I don't do that? 

majors3 <- majors1 %>% select(-one, women_frac)

#In this case, the "-" only applies to one, so majors3 still contains women_frac

#alternatively, I can put a "-" in front of each variable to accomplish the same goal - using c() becomes very convenient, however, if you are removing a large number of variables at once. 

majors1 <- majors1 %>% select(-one, -women_frac)
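
To check what each version of select() keeps, the sketch below repeats the three calls on a small stand-in data frame (majors_demo and with_new are hypothetical names; the values come from the str() output above) and prints the resulting column names.

```r
library(dplyr)

# Values taken from the first three rows of the str() output above
majors_demo <- data.frame(
  Major = c("PETROLEUM ENGINEERING", "MINING AND MINERAL ENGINEERING",
            "METALLURGICAL ENGINEERING"),
  Total = c(2339, 756, 856),
  Women = c(282, 77, 131)
)

# mutate() can create several new variables in one call
with_new <- majors_demo %>% mutate(one = 1, women_frac = Women / Total)

names(with_new %>% select(-c(one, women_frac)))   # both new columns removed
names(with_new %>% select(-one, women_frac))      # only "one" is removed
names(with_new %>% select(-one, -women_frac))     # both new columns removed
```

Checking names() after a select() is a cheap way to confirm that you dropped exactly the columns you meant to.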

I have two more dplyr functions to introduce! The two are summarise() and group_by(). Often, rather than work with raw data, we will want to know information about groups within our data. In the majors data, we might want to know more about each Major_category, for example, the number of students in each category. We don’t want to count up each one by hand - that would be very tedious! Instead, we can tell R to do this for us, using the two new functions.

#If I want R to perform an operation or function by group (in this case, by Major_category), I have to tell it to do so using group_by(). This function doesn't change the values in the dataframe - it just marks the groups so that R uses them in any subsequent operations. Using it is as simple as:

majors %>% group_by(Major_category)

#Now, I want to create a new dataframe with the total number of students in each category. We can do this by summing the variable Total within each group. summarise() allows us to do just that - this function creates a new dataframe for a summary statistic of our choosing. Because it is a summary of the data, this function typically collapses the larger dataframe into one that has one value for each group. For example, if we use summarise to calculate the mean of the variable Total (without grouping it), this is what we get:

mean_total <- majors %>% summarise(mean = mean(Total, na.rm = TRUE))
head(mean_total)
#In this line of code, the whole data frame is treated as a single group, so the output of summarise is one value. Let's see what happens when we group the data by category. Here, we'll get the total number of students in each category, so we'll use the sum() function as our summary statistic. 

cat_total <- majors %>% group_by(Major_category) %>% summarise(total_cat = sum(Total))

#let's look at this new dataframe:
head(cat_total)

#Uh oh, one of the values is NA! Recall that, when we calculate any summary statistic, R doesn't know what to do with NA values - when it encounters an NA value within a group, it therefore returns NA as the final value. We need to tell it to ignore the NA values to avoid this. 

cat_total <- majors %>% 
  group_by(Major_category) %>% 
  summarise(total_cat = sum(Total, na.rm = TRUE))

#That should work now! Note that, when I write longer or more complicated code using dplyr, I like to put each additional function on its own line - this just makes the code cleaner and easier to follow. 
#So, this does a really nice job producing one summary statistic for us! But what if we want to create a dataframe with a number of summary statistics about the major categories? This is easy to do with summarise() too! If you look up the help page for summarise(), it lists a number of suggested summary statistics that you can calculate using this function - if you are stuck and need ideas, it's helpful to return to the help page. Let's say I want to know the total students in each category, average enrollment for each department in the categories, as well as the largest and smallest enrollment values for the departments in each category. Here's how I do it, using the sum(), mean(), min(), and max() functions that are built into R:

summary_cat <- majors %>% 
  group_by(Major_category) %>% 
  summarise(total = sum(Total, na.rm = TRUE), avg = mean(Total, na.rm = TRUE), 
            smallest = min(Total, na.rm = TRUE), largest = max(Total, na.rm = TRUE))

#Now we have a nice summary table! Let's arrange the table in an order that makes sense, for example, alphabetically:

summary_cat <- summary_cat %>% arrange(Major_category)

head(summary_cat)
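
As a final sketch, the example below runs group_by() and summarise() on a made-up toy data frame (toy and toy_summary are hypothetical names) that is small enough to check by hand. It also uses the dplyr function n() to count the rows in each group, and sum(is.na(...)) to report how many values na.rm is skipping.

```r
library(dplyr)

# Made-up toy data: two categories, one missing Total
toy <- data.frame(
  Major_category = c("A", "A", "B", "B", "B"),
  Total = c(100, 300, 50, NA, 150)
)

toy_summary <- toy %>%
  group_by(Major_category) %>%
  summarise(n_majors = n(),                        # rows in each group
            total = sum(Total, na.rm = TRUE),      # sum, ignoring NA
            n_missing = sum(is.na(Total)))         # how many NAs were ignored

toy_summary
```

Reporting the number of missing values next to each total makes it clear when a group's sum is based on incomplete data.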

That’s it for Week 2 - we’ve covered a lot of ground! In this week’s tutorial, we’ve discussed missing data, how to work with data frames and lists, and the basics of the dplyr package. Using this package, you will be able to easily clean and manipulate data, create new variables, and produce summary statistics. We learned about the pipe operator (%>%), and some of the most common dplyr functions: filter(), select(), mutate(), arrange(), group_by(), and summarise(). We also learned some new base R functions that we will use to produce summary statistics: mean(), sum(), min(), and max(). We will learn more summary statistics as we go, but these are important basic functions to start with. Next week, we will use our data manipulation skills to learn how to produce basic visuals and graphs using the ggplot2 package. The skills we are learning in the first three weeks of this course are foundational R concepts that we will utilize throughout the course and that will help you if you choose to expand your programming knowledge in the future.

Resources

Navarro, D. (2019). Learning Statistics with R. Retrieved from: https://learningstatisticswithr.com/book/index.html.

Wickham, H., François, R., Henry, L. & Müller, K. (2021). dplyr: A grammar of data manipulation. R package version 1.0.6. Retrieved from https://CRAN.R-project.org/package=dplyr.

