Lab 1

##R Basics

First, use # to write a comment. Commenting your code is important. To run the current line or the selected code, use cmd+return on Mac or control+enter on Windows. You can also hit the Run button in the top right.

Everything in R is an object and every object has a name! We use functions on the objects.

Making an object: an assignment between a name and a value

x <- 5
y = 10

Notice that this saves x and y in the global environment. Now we can use these objects to do other things. To print an object, just type its name:

x

## [1] 5

Mathematical operations

a <- x + y
a

## [1] 15

b <- x*y
b

## [1] 50

c <- y^x
c

## [1] 1e+05

d <- y/x
d

## [1] 2

There are many different types of objects and we will learn about them throughout this course. We can create vectors as well:

vector1 <- c(1:10)
vector1

##  [1]  1  2  3  4  5  6  7  8  9 10

vector2 <- c(a, b, c, d) #notice that this one will give you a vector of the objects we just made above, not the letters!
vector2

## [1]     15     50 100000      2

We can do mathematical operations with vectors too!

vector1^2

##  [1]   1   4   9  16  25  36  49  64  81 100
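Operations between two vectors work element by element. Here is a minimal sketch of that idea; the names `v` and `w` are made up for this illustration and are not part of the lab:

```r
# Illustrative vectors (not part of the lab's data)
v <- c(1, 2, 3)
w <- c(10, 20, 30)

v + w   # adds matching positions: 11 22 33
v * 2   # the single number 2 is reused for every element: 2 4 6
```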

But objects don’t have to be just numbers:

vector3 <- c(40, "banana", "carrot", NULL)
vector3

## [1] "40"     "banana" "carrot"

##Functions

We have actually already used a function! c() is a function that combines values into a vector. Functions can transform your data in many ways. For now, we will focus on functions that give a snapshot of the data and summary statistics:

head(vector1, 3) #gives the first 3 items in vector1

## [1] 1 2 3

tail(vector1, 3) #gives the last 3 items in vector1

## [1]  8  9 10

Self check: Try creating a vector with 5 items in it and view the first 2 of them.

sample_vector <- c(1,3,5,7,9)     
head(sample_vector, 2)

## [1] 1 3

Here are some basic summary functions that may be helpful to you!

mean(vector1)

## [1] 5.5

sd(vector1)

## [1] 3.02765

median(vector1)

## [1] 5.5

summary(vector1)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    3.25    5.50    5.50    7.75   10.00

Self check: How would we find the variance of vector1?

sd(vector1)^2

## [1] 9.166667
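Squaring the standard deviation works, and base R also has a var() function that should give the same number directly. A short check, using the same vector1 as above:

```r
# Same vector as in the lab
vector1 <- c(1:10)

var(vector1)                              # built-in sample variance
all.equal(var(vector1), sd(vector1)^2)    # TRUE: both approaches agree
```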

If you don’t know what a function does, you can get help from R

?mean

Self check: What is the maximum of your 5 item vector?

max(sample_vector)

## [1] 9

summary(sample_vector)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1       3       5       5       7       9

##Classes

Each object in R has a class. Classes can be logical, numeric, character, etc. We can check the class of something using the class() function.

class(a)

## [1] "numeric"

class(vector1)

## [1] "integer"

class(2>3)

## [1] "logical"

What about our vector3 that has words and numbers?

class(vector3)

## [1] "character"

It’s a character so we can’t do mathematical operations on it! Notice that even though we have a number in the vector, R has converted it to a character!
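If a character vector contains only text that looks like numbers, as.numeric() can convert it back so that math works again. This is a small sketch; `num_text` and `nums` are made-up names for illustration:

```r
# Illustrative character vector that holds digits
num_text <- c("40", "2.5", "7")
class(num_text)            # "character"

# Convert the text to numbers
nums <- as.numeric(num_text)
class(nums)                # "numeric"
sum(nums)                  # 49.5
```

Text that is not a number (such as "banana") cannot be converted; as.numeric() returns NA for it and prints a warning.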

##Using Packages

R is really useful because of its ability to use packages. Pacman is a package for “package management” - it helps us load multiple packages at once. You only need to install a package on your computer once. Run install.packages("pacman") to get the pacman package on your computer.

We need to load the pacman package after installing it to use it

library(pacman)

Now we use the p_load function to load other packages we want to use.

p_load(tidyverse)

Tidyverse is used for data wrangling. It allows you to manipulate data frames in a rather intuitive way. Tidyverse is a large collection of packages, so today we will focus on functions from the dplyr package (which is loaded with tidyverse):

select(): subset columns
filter(): subset rows on conditions
arrange(): sort results
mutate(): create new columns by using information from other columns
group_by() and summarize(): create summary statistics on grouped data
count(): count discrete values

We are going to use the starwars dataset, which is built into dplyr (part of the tidyverse). Let’s give it a name so we can work with it:

our_data <- starwars

We can view the data frame by typing view(our_data) or by clicking its name in the global environment:

view(our_data)

We can also look at names of variables:

names(our_data)

##  [1] "name"       "height"     "mass"       "hair_color" "skin_color"
##  [6] "eye_color"  "birth_year" "gender"     "homeworld"  "species"   
## [11] "films"      "vehicles"   "starships"

###Select and Filter

Let’s select only the name, gender, and homeworld variables:

select(our_data, c(name, gender, homeworld))

## # A tibble: 87 x 3
##    name               gender homeworld
##    <chr>              <chr>  <chr>    
##  1 Luke Skywalker     male   Tatooine 
##  2 C-3PO              <NA>   Tatooine 
##  3 R2-D2              <NA>   Naboo    
##  4 Darth Vader        male   Tatooine 
##  5 Leia Organa        female Alderaan 
##  6 Owen Lars          male   Tatooine 
##  7 Beru Whitesun lars female Tatooine 
##  8 R5-D4              <NA>   Tatooine 
##  9 Biggs Darklighter  male   Tatooine 
## 10 Obi-Wan Kenobi     male   Stewjon  
## # … with 77 more rows

Notice that this didn’t save anything in our global environment! If you want to save this new dataframe, you have to give it a name!
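To keep the result of select(), assign it to a new name. The sketch below uses a tiny made-up data frame called `toy` (hypothetical, not the starwars data) so the result is easy to check:

```r
library(dplyr)

# Hypothetical toy data frame, used only for this illustration
toy <- tibble(name = c("A", "B"), height = c(150, 180), mass = c(50, 80))

# Assigning the result stores it as a new object in the global environment
toy_small <- select(toy, name, height)
names(toy_small)   # "name" "height"
```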

To select all columns except certain ones, put a minus sign in front of their names:

select(our_data, c(-starships, -vehicles))

## # A tibble: 87 x 11
##    name  height  mass hair_color skin_color eye_color birth_year gender
##    <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> 
##  1 Luke…    172    77 blond      fair       blue            19   male  
##  2 C-3PO    167    75 <NA>       gold       yellow         112   <NA>  
##  3 R2-D2     96    32 <NA>       white, bl… red             33   <NA>  
##  4 Dart…    202   136 none       white      yellow          41.9 male  
##  5 Leia…    150    49 brown      light      brown           19   female
##  6 Owen…    178   120 brown, gr… light      blue            52   male  
##  7 Beru…    165    75 brown      light      blue            47   female
##  8 R5-D4     97    32 <NA>       white, red red             NA   <NA>  
##  9 Bigg…    183    84 black      light      brown           24   male  
## 10 Obi-…    182    77 auburn, w… fair       blue-gray       57   male  
## # … with 77 more rows, and 3 more variables: homeworld <chr>, species <chr>,
## #   films <list>

Filter the data frame to include only droids

filter(our_data, species == "Droid")

## # A tibble: 5 x 13
##   name  height  mass hair_color skin_color eye_color birth_year gender homeworld
##   <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>  <chr>    
## 1 C-3PO    167    75 <NA>       gold       yellow           112 <NA>   Tatooine 
## 2 R2-D2     96    32 <NA>       white, bl… red               33 <NA>   Naboo    
## 3 R5-D4     97    32 <NA>       white, red red               NA <NA>   Tatooine 
## 4 IG-88    200   140 none       metal      red               15 none   <NA>     
## 5 BB8       NA    NA none       none       black             NA none   <NA>     
## # … with 4 more variables: species <chr>, films <list>, vehicles <list>,
## #   starships <list>

Filter the data frame to include droids OR humans

filter(our_data, species == "Droid" | species == "Human")

## # A tibble: 40 x 13
##    name  height  mass hair_color skin_color eye_color birth_year gender
##    <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> 
##  1 Luke…    172    77 blond      fair       blue            19   male  
##  2 C-3PO    167    75 <NA>       gold       yellow         112   <NA>  
##  3 R2-D2     96    32 <NA>       white, bl… red             33   <NA>  
##  4 Dart…    202   136 none       white      yellow          41.9 male  
##  5 Leia…    150    49 brown      light      brown           19   female
##  6 Owen…    178   120 brown, gr… light      blue            52   male  
##  7 Beru…    165    75 brown      light      blue            47   female
##  8 R5-D4     97    32 <NA>       white, red red             NA   <NA>  
##  9 Bigg…    183    84 black      light      brown           24   male  
## 10 Obi-…    182    77 auburn, w… fair       blue-gray       57   male  
## # … with 30 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## #   films <list>, vehicles <list>, starships <list>

Filter the data frame to include characters taller than 100 cm with a mass over 100 kg:

filter(our_data, height > 100 & mass > 100)

## # A tibble: 10 x 13
##    name  height  mass hair_color skin_color eye_color birth_year gender
##    <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> 
##  1 Dart…    202   136 none       white      yellow          41.9 male  
##  2 Owen…    178   120 brown, gr… light      blue            52   male  
##  3 Chew…    228   112 brown      unknown    blue           200   male  
##  4 Jabb…    175  1358 <NA>       green-tan… orange         600   herma…
##  5 Jek …    180   110 brown      fair       blue            NA   male  
##  6 IG-88    200   140 none       metal      red             15   none  
##  7 Bossk    190   113 none       green      red             53   male  
##  8 Dext…    198   102 none       brown      yellow          NA   male  
##  9 Grie…    216   159 none       brown, wh… green, y…       NA   male  
## 10 Tarf…    234   136 brown      brown      blue            NA   male  
## # … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>
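When an OR condition compares the same column against several values, the %in% operator is a shorter way to write it. A minimal sketch on a made-up data frame (`toy` and `keep` are hypothetical names):

```r
library(dplyr)

# Hypothetical toy data frame, used only for this illustration
toy <- tibble(
  name    = c("A", "B", "C", "D"),
  species = c("Droid", "Human", "Wookiee", NA)
)

# Keeps the same rows as species == "Droid" | species == "Human"
keep <- filter(toy, species %in% c("Droid", "Human"))
keep$name   # "A" "B"
```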

###Piping!!

What if we want to do those things all in one step? We can chain functions together with %>%. The pipe passes the result on its left-hand side (LHS) to the function on its right-hand side (RHS) as the first argument, so the code reads left to right, like a book. Let’s make a new data frame where we select the name, height, and mass, then filter out those who are shorter than 100 cm:

new_df <- our_data %>% select(name, height, mass) %>% filter(height >= 100)
new_df

## # A tibble: 74 x 3
##    name               height  mass
##    <chr>               <int> <dbl>
##  1 Luke Skywalker        172    77
##  2 C-3PO                 167    75
##  3 Darth Vader           202   136
##  4 Leia Organa           150    49
##  5 Owen Lars             178   120
##  6 Beru Whitesun lars    165    75
##  7 Biggs Darklighter     183    84
##  8 Obi-Wan Kenobi        182    77
##  9 Anakin Skywalker      188    84
## 10 Wilhuff Tarkin        180    NA
## # … with 64 more rows
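To see what the pipe is doing, it can help to compare it with the same steps written as nested function calls. This sketch uses a made-up data frame (`toy`, `nested`, and `piped` are hypothetical names):

```r
library(dplyr)

# Hypothetical toy data frame, used only for this illustration
toy <- tibble(
  name   = c("A", "B", "C"),
  height = c(90, 150, 180),
  mass   = c(30, 50, 80)
)

# Nested calls: read from the inside out
nested <- filter(select(toy, name, height), height >= 100)

# Piped calls: read left to right, one step at a time
piped <- toy %>% select(name, height) %>% filter(height >= 100)

piped$name   # "B" "C", the same rows as in nested
```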

Self check: Make a new data frame where you select all columns except gender and keep only characters that appear ONLY in the film “A New Hope”:

example_df <- our_data %>% select(-gender) %>% filter(films == "A New Hope")
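The films column is a list column (the printed output shows films <list>), so each row holds a whole vector of film titles. One way to make the intent explicit is to check each row's vector one at a time. The sketch below is an illustration on a made-up data frame, not the lab's official solution; `toy`, `in_film`, and `only_film` are hypothetical names:

```r
library(dplyr)

# Hypothetical toy data frame with a list column, used only for this illustration
toy <- tibble(
  name  = c("A", "B", "C"),
  films = list(
    "A New Hope",
    c("A New Hope", "Return of the Jedi"),
    "Return of the Jedi"
  )
)

# Appears in "A New Hope" at all (possibly among other films)
in_film <- toy %>%
  filter(sapply(films, function(f) "A New Hope" %in% f))

# Appears ONLY in "A New Hope"
only_film <- toy %>%
  filter(sapply(films, function(f) identical(f, "A New Hope")))

in_film$name     # "A" "B"
only_film$name   # "A"
```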

Arrange: Let’s arrange all of the characters by their height:

our_data %>% arrange(height)

## # A tibble: 87 x 13
##    name  height  mass hair_color skin_color eye_color birth_year gender
##    <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> 
##  1 Yoda      66    17 white      green      brown            896 male  
##  2 Ratt…     79    15 none       grey, blue unknown           NA male  
##  3 Wick…     88    20 brown      brown      brown              8 male  
##  4 Dud …     94    45 none       blue, grey yellow            NA male  
##  5 R2-D2     96    32 <NA>       white, bl… red               33 <NA>  
##  6 R4-P…     96    NA none       silver, r… red, blue         NA female
##  7 R5-D4     97    32 <NA>       white, red red               NA <NA>  
##  8 Sebu…    112    40 none       grey, red  orange            NA male  
##  9 Gasg…    122    NA none       white, bl… black             NA male  
## 10 Watto    137    NA black      blue, grey yellow            NA male  
## # … with 77 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## #   films <list>, vehicles <list>, starships <list>

Notice that this sorts from lowest to highest. We can sort the other way too:

our_data %>% arrange(desc(height))

## # A tibble: 87 x 13
##    name  height  mass hair_color skin_color eye_color birth_year gender
##    <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> 
##  1 Yara…    264    NA none       white      yellow          NA   male  
##  2 Tarf…    234   136 brown      brown      blue            NA   male  
##  3 Lama…    229    88 none       grey       black           NA   male  
##  4 Chew…    228   112 brown      unknown    blue           200   male  
##  5 Roos…    224    82 none       grey       orange          NA   male  
##  6 Grie…    216   159 none       brown, wh… green, y…       NA   male  
##  7 Taun…    213    NA none       grey       black           NA   female
##  8 Rugo…    206    NA none       green      orange          NA   male  
##  9 Tion…    206    80 none       grey       black           NA   male  
## 10 Dart…    202   136 none       white      yellow          41.9 male  
## # … with 77 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## #   films <list>, vehicles <list>, starships <list>

Self check: Arrange the characters’ names in alphabetical order

our_data %>% arrange(name)

## # A tibble: 87 x 13
##    name  height  mass hair_color skin_color eye_color birth_year gender
##    <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> 
##  1 Ackb…    180    83 none       brown mot… orange          41   male  
##  2 Adi …    184    50 none       dark       blue            NA   female
##  3 Anak…    188    84 blond      fair       blue            41.9 male  
##  4 Arve…     NA    NA brown      fair       brown           NA   male  
##  5 Ayla…    178    55 none       blue       hazel           48   female
##  6 Bail…    191    NA black      tan        brown           67   male  
##  7 Barr…    166    50 black      yellow     blue            40   female
##  8 BB8       NA    NA none       none       black           NA   none  
##  9 Ben …    163    65 none       grey, gre… orange          NA   male  
## 10 Beru…    165    75 brown      light      blue            47   female
## # … with 77 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## #   films <list>, vehicles <list>, starships <list>

Mutate: This creates new variables. Let’s create a new variable that measures height in inches instead of centimeters (2.54cm per inch):

our_data %>% mutate(height_inches = height/2.54)

## # A tibble: 87 x 14
##    name  height  mass hair_color skin_color eye_color birth_year gender
##    <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> 
##  1 Luke…    172    77 blond      fair       blue            19   male  
##  2 C-3PO    167    75 <NA>       gold       yellow         112   <NA>  
##  3 R2-D2     96    32 <NA>       white, bl… red             33   <NA>  
##  4 Dart…    202   136 none       white      yellow          41.9 male  
##  5 Leia…    150    49 brown      light      brown           19   female
##  6 Owen…    178   120 brown, gr… light      blue            52   male  
##  7 Beru…    165    75 brown      light      blue            47   female
##  8 R5-D4     97    32 <NA>       white, red red             NA   <NA>  
##  9 Bigg…    183    84 black      light      brown           24   male  
## 10 Obi-…    182    77 auburn, w… fair       blue-gray       57   male  
## # … with 77 more rows, and 6 more variables: homeworld <chr>, species <chr>,
## #   films <list>, vehicles <list>, starships <list>, height_inches <dbl>

Self check: Create a new variable that is the sum of a character’s mass and height:

our_data %>% mutate(total = height + mass)

## # A tibble: 87 x 14
##    name  height  mass hair_color skin_color eye_color birth_year gender
##    <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> 
##  1 Luke…    172    77 blond      fair       blue            19   male  
##  2 C-3PO    167    75 <NA>       gold       yellow         112   <NA>  
##  3 R2-D2     96    32 <NA>       white, bl… red             33   <NA>  
##  4 Dart…    202   136 none       white      yellow          41.9 male  
##  5 Leia…    150    49 brown      light      brown           19   female
##  6 Owen…    178   120 brown, gr… light      blue            52   male  
##  7 Beru…    165    75 brown      light      blue            47   female
##  8 R5-D4     97    32 <NA>       white, red red             NA   <NA>  
##  9 Bigg…    183    84 black      light      brown           24   male  
## 10 Obi-…    182    77 auburn, w… fair       blue-gray       57   male  
## # … with 77 more rows, and 6 more variables: homeworld <chr>, species <chr>,
## #   films <list>, vehicles <list>, starships <list>, total <dbl>

Group_by and Summarize: This will group data together and can make summary statistics. Let’s find the average height for each species:

our_data %>% group_by(species) %>% summarize(avg_height = mean(height))

## # A tibble: 38 x 2
##    species   avg_height
##    <chr>          <dbl>
##  1 Aleena           79 
##  2 Besalisk        198 
##  3 Cerean          198 
##  4 Chagrian        196 
##  5 Clawdite        168 
##  6 Droid            NA 
##  7 Dug             112 
##  8 Ewok             88 
##  9 Geonosian       183 
## 10 Gungan          209.
## # … with 28 more rows

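Notice we have NAs! mean() returns NA whenever a group contains a missing height. One way to get rid of them is na.omit(), which drops every row that has an NA in any column (this is why only 11 species remain in the output that follows):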

our_data %>% na.omit() %>% group_by(species) %>% summarize(avg_height = mean(height))

## # A tibble: 11 x 2
##    species      avg_height
##    <chr>             <dbl>
##  1 Cerean              198
##  2 Ewok                 88
##  3 Gungan              196
##  4 Human               178
##  5 Kel Dor             188
##  6 Mirialan            168
##  7 Mon Calamari        180
##  8 Trandoshan          190
##  9 Twi'lek             178
## 10 Wookiee             228
## 11 Zabrak              175
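Because na.omit() removes a row when any column is missing, it can throw away rows whose height is perfectly fine. An alternative is to tell mean() to skip missing values with na.rm = TRUE. A minimal sketch on made-up data (`toy` and `by_species` are hypothetical names):

```r
library(dplyr)

# Hypothetical toy data frame, used only for this illustration
toy <- tibble(
  species = c("Droid", "Droid", "Human", "Human"),
  height  = c(96, NA, 172, 150),
  mass    = c(32, 32, NA, 49)
)

# na.rm = TRUE skips missing heights only; rows with a missing mass are still used
by_species <- toy %>%
  group_by(species) %>%
  summarize(avg_height = mean(height, na.rm = TRUE))

by_species$avg_height   # 96 for Droid, 161 for Human
```

With na.omit() the Human with a missing mass would be dropped first, so the Human average would be 150 instead of 161.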

Count: Count the number of each species

our_data %>% count(species)

## # A tibble: 38 x 2
##    species       n
##    <chr>     <int>
##  1 Aleena        1
##  2 Besalisk      1
##  3 Cerean        1
##  4 Chagrian      1
##  5 Clawdite      1
##  6 Droid         5
##  7 Dug           1
##  8 Ewok          1
##  9 Geonosian     1
## 10 Gungan        3
## # … with 28 more rows

##OLS Regression

To do a regression in R, we use lm(). The basic setup: name <- lm(y ~ x, data = name_of_df)

reg1 <- lm(height ~ mass, data = our_data)

How do we look at the regression output?

summary(reg1)

## 
## Call:
## lm(formula = height ~ mass, data = our_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -105.763   -5.610    6.385   18.202   58.897 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 171.28536    5.34340   32.05   <2e-16 ***
## mass          0.02807    0.02752    1.02    0.312    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 35.52 on 57 degrees of freedom
##   (28 observations deleted due to missingness)
## Multiple R-squared:  0.01792,    Adjusted R-squared:  0.0006956 
## F-statistic:  1.04 on 1 and 57 DF,  p-value: 0.312
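Besides reading the printed summary, you can pull individual pieces out of a fitted model. The sketch below uses made-up data that roughly follows y = 2 + 3x (`toy` and `fit` are hypothetical names, not the lab's regression):

```r
# Made-up data that follows y = 2 + 3 * x plus small fixed deviations
toy <- data.frame(x = c(1, 2, 3, 4, 5))
toy$y <- 2 + 3 * toy$x + c(0.1, -0.1, 0, 0.1, -0.1)

fit <- lm(y ~ x, data = toy)

coef(fit)                    # named vector: (Intercept) and x
summary(fit)$coefficients    # matrix of estimates, std. errors, t values, p-values
summary(fit)$r.squared       # R-squared as a single number
```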

Let’s filter out Jabba the Hutt because he is a large boi (his mass of 1,358 kg is an extreme outlier):

reg2 <- lm(height ~ mass, data = our_data %>% filter(species != "Hutt"))
summary(reg2)

## 
## Call:
## lm(formula = height ~ mass, data = our_data %>% filter(species != 
##     "Hutt"))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -51.821  -6.273   2.327  14.078  45.728 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 101.6706     8.6593  11.741  < 2e-16 ***
## mass          0.9500     0.1064   8.931 2.72e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.3 on 55 degrees of freedom
##   (24 observations deleted due to missingness)
## Multiple R-squared:  0.5919, Adjusted R-squared:  0.5845 
## F-statistic: 79.77 on 1 and 55 DF,  p-value: 2.72e-12
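Comparing the two outputs, the first regression accounts for 87 rows (59 used plus 28 deleted) while the second accounts for 81 (57 used plus 24 deleted), so the filter removed six rows even though only one character is a Hutt. The others are presumably characters with a missing species: filter() keeps a row only when the condition is TRUE, and comparing NA to anything gives NA. A small sketch on made-up data (`toy`, `dropped_na`, and `no_hutt` are hypothetical names):

```r
library(dplyr)

# Hypothetical toy data frame, used only for this illustration
toy <- tibble(
  name    = c("A", "B", "C"),
  species = c("Human", "Hutt", NA)
)

NA != "Hutt"   # NA, not TRUE

# Drops the Hutt, but also drops C because its species is NA
dropped_na <- toy %>% filter(species != "Hutt")

# Keeps rows with a missing species and drops only the Hutt
no_hutt <- toy %>% filter(is.na(species) | species != "Hutt")

dropped_na$name   # "A"
no_hutt$name      # "A" "C"
```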

Self check: Can you interpret the coefficient? Interpret the intercept. What are the null and alternative hypotheses? Is the coefficient significant at the 5% level?

* answer: H0: beta_1 = 0, Ha: beta_1 != 0
* answer: A 1 kg increase in mass is associated with a 0.95 cm increase in height, on average. The intercept says that a character with a mass of 0 kg would be predicted to be about 101.7 cm tall
* answer: Since p < .05 (p = 2.72e-12), we reject the null hypothesis at the 5% level

##Challenge Questions

All of these use the starwars data:

Use pipes to make a new data frame to include characters with blue eyes and retain only the columns of name, gender, and homeworld.
Create a new data frame from the starwars data that meets the following criteria: contains only the mass column and a new column called mass_half containing values that are half the mass values. In this mass_half column, there are no NAs and all values are less than 50. Hint: to filter out NA values use !is.na()
Use group_by() and summarize() to find the mean, min, and max mass for each homeworld.
How many characters are female?
Run a regression of height on mass and gender. Filter out Jabba the Hutt and filter out the NAs in gender. What are the null and alternative hypotheses for the coefficient on gendermale? Interpret the coefficient on gendermale. Is this significant at the 1% level? What about the 5% level?