Lab 1: R Basics and the Tidyverse

R Basics

Objects

Everything in R is an object, and we work with objects by calling functions on them. We create an object by assigning a value to a name.

x <- 5
y = 10

Notice that both objects are saved in the global environment. Now we can use these objects to do other things. To print an object, just type its name and run the line of code.

x

## [1] 5

Mathematical operations

We can use mathematical operations on our objects and we can create new objects.

a <- x + y
a

## [1] 15

b <- x*y
b

## [1] 50

c <- y^x
c

## [1] 1e+05

d <- y/x
d

## [1] 2

There are many different types of objects and we will learn about them throughout this course. One we will use frequently is a vector. We can create a vector object using:

vector1 <- c(1:10)
vector1

##  [1]  1  2  3  4  5  6  7  8  9 10

vector2 <- c(a, b, c, d) #notice that this one will give you a vector of the objects we just made above, not the letters!
vector2

## [1]     15     50 100000      2

We can do mathematical operations with vectors too!

vector1^2

##  [1]   1   4   9  16  25  36  49  64  81 100
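As a small sketch of what "operations on vectors" means: arithmetic is applied to each element in turn. The name `v` below is only for this illustration and holds the same values as vector1.

```r
v <- 1:10    # same values as vector1

v * 2        # every element is doubled
v + v        # element-by-element addition
sum(v^2)     # squares each element, then adds the squares together
```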

All of the objects we have made so far have been numbers but objects don’t have to be just numbers.

vector3 <- c(40, "banana", "carrot", NULL)
vector3

## [1] "40"     "banana" "carrot"
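To see why the output shows "40" in quotes and only three items, here is a short sketch. A vector can hold only one type, so mixing numbers with text turns everything into text, and NULL adds no element. The name `mixed` is just for this example.

```r
# A vector holds one type, so the number 40 is stored as the text "40"
mixed <- c(40, "banana", "carrot", NULL)

length(mixed)        # NULL contributes nothing, so there are 3 elements
is.numeric(mixed)    # FALSE
is.character(mixed)  # TRUE
```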

Functions

We have actually already used a function! c() is a function that combines values into a vector. Functions can transform your data in many ways. We are going to use functions today to get a snapshot and summary of our data.

The head() and tail() functions give us the first few items and last few items in the data. We can specify how many items we want to see by using head(object, number).

head(vector1, 3) #gives the first 3 items in vector1

## [1] 1 2 3

tail(vector1, 3) #gives the last 3 items in vector1

## [1]  8  9 10

Self check: Try creating a vector with 5 items in it and view the first 2 of them.

sample_vector <- c(1,3,5,7,9)     
head(sample_vector, 2)

## [1] 1 3

We can also use functions to find summary statistics of our data.

mean(vector1)

## [1] 5.5

sd(vector1)

## [1] 3.02765

median(vector1)

## [1] 5.5

summary(vector1)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    3.25    5.50    5.50    7.75   10.00

Self check: How would we find the variance of vector1?

sd(vector1)^2

## [1] 9.166667
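Base R also has a var() function that computes the sample variance directly. The sketch below checks that it agrees with squaring the standard deviation; `v` is an illustrative name holding the same values as vector1.

```r
v <- 1:10                    # same values as vector1

var(v)                       # sample variance, computed directly
all.equal(var(v), sd(v)^2)   # TRUE: sd() is the square root of var()
```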

Self check: What is the maximum of your 5 item vector?

max(sample_vector)

## [1] 9

summary(sample_vector)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1       3       5       5       7       9

If you don’t know what a function does, you can get help from R by putting a question mark in front of the function name. This brings up an R help page. Example: ?mean

Classes

Each object in R has a class. The class can be logical, numeric, character, etc. We can check the class of something using the class() function.

class(a)

## [1] "numeric"

class(vector1)

## [1] "integer"

class(2>3)

## [1] "logical"

What about our vector3 that has words and numbers? What do we think this class should be?

class(vector3)

## [1] "character"

vector3

## [1] "40"     "banana" "carrot"

vector3 has class character, so we can’t do mathematical operations on it! Notice that even though we put a number in the vector, R has converted it to a character!
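If we do need the number back, one option is as.numeric(). This is a minimal sketch, with `words_and_numbers` and `converted` as made-up names; text that is not a number becomes NA.

```r
words_and_numbers <- c("40", "banana", "carrot")  # same contents as vector3

# as.numeric() converts what it can; text that is not a number becomes NA.
# suppressWarnings() hides the "NAs introduced by coercion" warning.
converted <- suppressWarnings(as.numeric(words_and_numbers))

class(converted)   # "numeric"
converted[1] + 1   # 41
is.na(converted)   # FALSE TRUE TRUE
```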

Tidyverse

R is really useful because of its ability to use packages. Pacman is a package for “package management” that helps us load multiple packages at once. After installing pacman, we need to load it before we can use it. Next, we use the p_load() function to load the other packages we want to use. Let’s load the tidyverse.

install.packages("pacman", repos = "http://cran.us.r-project.org")

## 
## The downloaded binary packages are in
##  /var/folders/30/xcfr30vj55q2bbs6dlm27fqw0000gn/T//RtmpEWQtDm/downloaded_packages

library(pacman)
p_load(tidyverse)
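Running install.packages() every time the script runs is slow. One possible pattern, sketched here with a hypothetical helper called missing_packages() (it is not part of pacman or base R), is to check which packages are missing and install only those.

```r
# Hypothetical helper: return the packages in `pkgs` that are not installed yet
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("pacman", "tidyverse"))
to_install  # empty when everything is already installed

# Uncomment to install only what is missing:
# install.packages(to_install, repos = "https://cran.r-project.org")
```

The install line is left commented out so that the sketch never downloads anything by accident.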

The tidyverse is used for data wrangling. It lets you manipulate data frames in a fairly intuitive way. The tidyverse is a large collection of packages, so today we will focus on functions from the dplyr package (which comes with the tidyverse). The main functions we are using in this class are:

- select(): subset columns
- filter(): subset rows on conditions
- arrange(): sort results
- mutate(): create new columns by using information from other columns
- group_by() and summarize(): create summary statistics on grouped data
- count(): count discrete values

We are going to use a dataset that is built into dplyr, which is loaded with the tidyverse. Let’s give it a name so we can work with it.

our_data <- starwars

We can view the data frame by typing view(our_data) or by clicking its name in the global environment. To look at only the names of the variables, we can use names().

names(our_data)

##  [1] "name"       "height"     "mass"       "hair_color" "skin_color"
##  [6] "eye_color"  "birth_year" "gender"     "homeworld"  "species"   
## [11] "films"      "vehicles"   "starships"
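A few other base R functions give a quick overview of a data frame. The sketch below uses a tiny made-up data frame called `toy` (not the starwars data) so the results are easy to check by eye.

```r
# Hypothetical toy data, not the starwars data
toy <- data.frame(
  name   = c("Ann", "Bo", "Cy"),
  height = c(160, 175, NA),
  mass   = c(55, 80, 70)
)

names(toy)  # column names
dim(toy)    # number of rows, then number of columns
str(toy)    # type and first values of every column
```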

Select and Filter

Let’s select only the name, gender, and homeworld variables

select(our_data, c(name, gender, homeworld))

Notice that this didn’t save anything in our global environment! If you want to save this new data frame, you have to give it a name! To select all columns except certain ones, put a minus sign in front of each column to drop.

select(our_data, c(-starships, -vehicles))

Filter the data frame to include only droids

filter(our_data, species == "Droid")

Filter the data frame to include droids OR humans

filter(our_data, species == "Droid" | species == "Human")

Filter the data frame to include characters taller than 100 cm with a mass over 100 kg

filter(our_data, height > 100 & mass > 100)
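Here is a small sketch of three filter() details, using a made-up data frame `toy` rather than starwars: %in% as shorthand for several OR conditions, a comma as shorthand for &, and the fact that rows where the condition evaluates to NA are dropped.

```r
library(dplyr)

# Hypothetical toy data, not the starwars data
toy <- data.frame(
  name    = c("Ann", "Bo", "Cy", "Di"),
  species = c("Human", "Droid", "Wookiee", NA),
  height  = c(160, 97, 228, 150)
)

# %in% is a compact way to write several == conditions joined by |
droids_or_humans <- filter(toy, species %in% c("Droid", "Human"))

# Separating conditions with a comma works the same as joining them with &
tall_humans <- filter(toy, species == "Human", height > 100)

# Rows where the condition is NA (Di has no species) are dropped
not_droids <- filter(toy, species != "Droid")
```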

Piping

What if we want to do those things all in one step? The tidyverse allows us to chain functions together with %>%. The pipe passes the result on its left-hand side (LHS) to the function on its right-hand side (RHS), so the code reads left to right, like a book. Let’s make a new data frame where we select the name, height, and mass, and filter out those who are shorter than 100 cm.

new_df <- our_data %>% select(name, height, mass) %>% filter(height >= 100)
new_df
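To make the pipe less mysterious, the sketch below writes the same steps twice on a made-up data frame `toy`: once as nested function calls and once with %>%. Both give the same result.

```r
library(dplyr)

# Hypothetical toy data, not the starwars data
toy <- data.frame(
  name   = c("Ann", "Bo", "Cy"),
  height = c(160, 97, 228),
  mass   = c(55, 32, 112)
)

# Nested calls are read from the inside out
nested <- filter(select(toy, name, height), height >= 100)

# The pipe passes the left-hand side as the first argument of the next function
piped <- toy %>% select(name, height) %>% filter(height >= 100)

identical(nested, piped)  # TRUE
```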

Self check: Make a new data frame that keeps all columns except gender and keeps only the characters that appear ONLY in the film “A New Hope”

example_df <- our_data %>% select(-gender) %>% filter(films == "A New Hope")

Arrange

Let’s arrange all of the characters by their height

our_data %>% arrange(height)

Notice that this sorts from lowest to highest. We can sort the other way too, using desc()

our_data %>% arrange(desc(height))
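arrange() can also sort by more than one column, and it places missing values at the end. A short sketch on a made-up data frame `toy`:

```r
library(dplyr)

# Hypothetical toy data, not the starwars data
toy <- data.frame(
  name    = c("Ann", "Bo", "Cy", "Di"),
  species = c("Human", "Droid", "Human", "Droid"),
  height  = c(160, 97, NA, 120)
)

# Sort by species first, then by height within each species
by_species_height <- toy %>% arrange(species, height)

# Missing heights go to the end, even when sorting with desc()
tallest_first <- toy %>% arrange(desc(height))
```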

Self check: Arrange the characters’ names in alphabetical order

our_data %>% arrange(name)

Mutate

mutate() creates new variables. Let’s create a new variable that measures height in inches instead of centimeters (2.54 cm per inch).

our_data %>% mutate(height_inches = height/2.54)
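One mutate() call can create several columns, and a later column can use one created just before it. The sketch below uses a made-up data frame `toy`; the columns height_m and bmi are illustrative and not part of the lab.

```r
library(dplyr)

# Hypothetical toy data, not the starwars data
toy <- data.frame(
  name   = c("Ann", "Bo", "Cy"),
  height = c(160, 97, 228),   # centimeters
  mass   = c(55, 32, 112)     # kilograms
)

converted <- toy %>%
  mutate(
    height_m = height / 100,     # centimeters to meters
    bmi      = mass / height_m^2 # uses the column created one line above
  )
```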

Self check: Create a new variable that is the sum of a character’s mass and height

our_data %>% mutate(total = height + mass)

Group_by and Summarize

Using these two functions together will group data together and can make summary statistics. Let’s find the average height for each species.

our_data %>% group_by(species) %>% summarize(avg_height = mean(height))

# Notice we have NAs! We can get rid of those. Note that na.omit() drops every row that has an NA in any column
our_data %>% na.omit() %>% group_by(species) %>% summarize(avg_height = mean(height))
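An alternative to na.omit() is to ignore missing values only in the column being averaged, using the na.rm = TRUE argument of mean(). The sketch below compares the two approaches on a made-up data frame `toy`; the results differ because na.omit() also removes rows that are missing an unrelated column.

```r
library(dplyr)

# Hypothetical toy data, not the starwars data
toy <- data.frame(
  species   = c("Human", "Human", "Droid", "Droid"),
  height    = c(160, 180, 97, NA),
  eye_color = c("blue", NA, "red", "yellow")
)

# na.rm = TRUE ignores missing heights only
with_na_rm <- toy %>%
  group_by(species) %>%
  summarize(avg_height = mean(height, na.rm = TRUE))

# na.omit() first drops every row with an NA in any column,
# so the second Human (missing eye_color) is lost as well
with_na_omit <- toy %>%
  na.omit() %>%
  group_by(species) %>%
  summarize(avg_height = mean(height))
```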

Count

Count the number of each species

our_data %>% count(species)
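count() is shorthand for grouping and then counting rows. The sketch below, on a made-up data frame `toy`, shows the sort = TRUE option and the longer group_by() plus summarize() version that gives the same counts.

```r
library(dplyr)

# Hypothetical toy data, not the starwars data
toy <- data.frame(
  species = c("Human", "Droid", "Human", "Human", "Droid", "Wookiee")
)

# sort = TRUE puts the most common value first
counts <- toy %>% count(species, sort = TRUE)

# The same counts written out with group_by() and summarize()
long_form <- toy %>% group_by(species) %>% summarize(n = n())
```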

OLS Regression

To do a regression in R, we use lm(). The basic setup is: name <- lm(y ~ x, data = name_of_df). Let’s regress height on mass.

reg1 <- lm(height ~ mass, data = our_data)
summary(reg1)

## 
## Call:
## lm(formula = height ~ mass, data = our_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -105.763   -5.610    6.385   18.202   58.897 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 171.28536    5.34340   32.05   <2e-16 ***
## mass          0.02807    0.02752    1.02    0.312    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 35.52 on 57 degrees of freedom
##   (28 observations deleted due to missingness)
## Multiple R-squared:  0.01792,    Adjusted R-squared:  0.0006956 
## F-statistic:  1.04 on 1 and 57 DF,  p-value: 0.312

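Let’s filter out Jabba the Hutt because he is an outlier. We can filter using pipes inside our lm function. Note that filter(species != "Hutt") also drops characters whose species is NA, because the comparison evaluates to NA for them.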

reg2 <- lm(height ~ mass, data = our_data %>% filter(species != "Hutt"))
summary(reg2)

## 
## Call:
## lm(formula = height ~ mass, data = our_data %>% filter(species != 
##     "Hutt"))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -51.821  -6.273   2.327  14.078  45.728 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 101.6706     8.6593  11.741  < 2e-16 ***
## mass          0.9500     0.1064   8.931 2.72e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.3 on 55 degrees of freedom
##   (24 observations deleted due to missingness)
## Multiple R-squared:  0.5919, Adjusted R-squared:  0.5845 
## F-statistic: 79.77 on 1 and 55 DF,  p-value: 2.72e-12
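Besides reading the printed summary, we can pull numbers out of a fitted model. This is a minimal sketch on made-up data (`toy` and `fit` are illustrative names, not the lab's regression) using coef() and predict() from base R.

```r
# Hypothetical toy data, not the starwars data
toy <- data.frame(
  mass   = c(50, 60, 70, 80, 90),
  height = c(150, 158, 171, 178, 192)
)

fit <- lm(height ~ mass, data = toy)

coef(fit)           # named vector with "(Intercept)" and "mass"
coef(summary(fit))  # matrix of estimates, standard errors, t values, p-values

# Predicted height for a mass of 75: intercept + slope * 75
pred <- predict(fit, newdata = data.frame(mass = 75))
```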

Self check: Can you interpret the coefficient? Interpret the intercept. What are the null and alternative hypotheses? Is the coefficient significant at the 5% level?

# answer: H0: beta_1 = 0, Ha: beta_1 != 0
# answer: For a 1 kg increase in mass, predicted height increases by about 0.95 cm. A character with a mass of 0 kg would be predicted to be about 101.7 cm tall
# answer: Since p = 2.72e-12 < .05, we reject the null hypothesis at the 5% level
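The test behind that answer can be reproduced by hand from the reg2 output. The sketch below copies the rounded estimate, standard error, and degrees of freedom from the table, so the t statistic matches the printed 8.931 only approximately.

```r
estimate  <- 0.9500  # coefficient on mass in reg2
std_error <- 0.1064  # its standard error
df_resid  <- 55      # residual degrees of freedom

t_stat   <- estimate / std_error            # about 8.93
p_value  <- 2 * pt(-abs(t_stat), df_resid)  # two-sided p-value
critical <- qt(0.975, df_resid)             # 5% two-sided critical value, about 2.00

abs(t_stat) > critical  # TRUE, so we reject H0 at the 5% level
```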

Test Your Learning

The following questions use content learned in Lab 1. All questions refer to the starwars dataset that we used in lab.

Use pipes to make a new data frame to include characters with blue eyes and retain only the columns of name, gender, and homeworld.

new_df <- starwars %>% filter(eye_color == "blue") %>% select(name, gender, homeworld)

Create a new data frame from the starwars data that meets the following criteria: contains only the mass column and a new column called mass_half containing values that are half the mass values. In this mass_half column, there are no NAs and all values are less than 50. Hint: to filter out NA values use !is.na()

new_df <- starwars %>% select(mass) %>% filter(!is.na(mass)) %>% mutate(mass_half = mass/2) %>% filter(mass_half < 50)

Use group_by() and summarize() to find the mean, min, and max mass for each homeworld.

# Drop missing masses first so each statistic is computed on observed values only
mass_by_homeworld <- starwars %>% filter(!is.na(mass)) %>% group_by(homeworld) %>% summarize(mean_mass = mean(mass), min_mass = min(mass), max_mass = max(mass))

How many characters are female?

gender_df <- starwars %>% count(gender)

Run a regression of height on mass and gender. Filter out Jabba the Hutt and filter out the NAs in gender. What are the null and alternative hypotheses for the coefficient on gendermale (labeled factor(gender)male in the output)? Interpret the coefficient on gendermale. Is this significant at the 1% level? What about the 5% level?

reg <- lm(height ~ mass + factor(gender), data = starwars %>% filter(species != "Hutt" & !is.na(gender)))
summary(reg)

## 
## Call:
## lm(formula = height ~ mass + factor(gender), data = starwars %>% 
##     filter(species != "Hutt" & !is.na(gender)))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -47.431  -3.403   1.715   8.053  48.526 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        113.8554     9.6656  11.779 4.91e-16 ***
## mass                 1.0144     0.1167   8.694 1.43e-11 ***
## factor(gender)male -18.0741     8.5390  -2.117   0.0393 *  
## factor(gender)none -55.8753    25.0229  -2.233   0.0301 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.78 on 50 degrees of freedom
##   (24 observations deleted due to missingness)
## Multiple R-squared:  0.6091, Adjusted R-squared:  0.5856 
## F-statistic: 25.97 on 3 and 50 DF,  p-value: 2.868e-10
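The output has rows for male and none but no row for female. A likely reason, assuming the filtered gender column contains the values female, male, and none, is that factor() orders its levels alphabetically and lm() treats the first level as the baseline that the other coefficients are compared against. The sketch below shows that ordering on a made-up vector, and how relevel() would pick a different baseline.

```r
gender <- c("male", "female", "none", "male")  # hypothetical values

f <- factor(gender)
levels(f)   # "female" "male" "none": alphabetical, so "female" comes first

# relevel() moves a chosen level to the front, making it the baseline
f_male_base <- relevel(f, ref = "male")
levels(f_male_base)  # "male" "female" "none"
```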