##R Basics
First, use # to write a comment. Commenting your code is important. To run the current line or the selected code, press Cmd+Return on a Mac or Ctrl+Enter on Windows. You can also hit the Run button in the top right.
Everything in R is an object and every object has a name! We use functions on the objects.
Making an object: an assignment between a name and a value
x <- 5
y = 10
Notice that this saves in the global environment. Now we can use these objects to do other things. To print the object, just type the name
x
## [1] 5
a <- x + y
a
## [1] 15
b <- x*y
b
## [1] 50
c <- y^x
c
## [1] 1e+05
d <- y/x
d
## [1] 2
There are many different types of objects and we will learn about them throughout this course. We can create vectors as well:
vector1 <- c(1:10)
vector1
## [1] 1 2 3 4 5 6 7 8 9 10
vector2 <- c(a, b, c, d) #notice that this one will give you a vector of the objects we just made above, not the letters!
vector2
## [1] 15 50 100000 2
We can do mathematical operations with vectors too!
vector1^2
## [1] 1 4 9 16 25 36 49 64 81 100
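Operations between two vectors of the same length also work element by element. A small sketch, where the names `v` and `w` are made up for illustration:

```r
# Two illustrative vectors of the same length (hypothetical names)
v <- c(1, 2, 3)
w <- c(10, 20, 30)

# Arithmetic is applied element by element
v + w   # 11 22 33
v * w   # 10 40 90

# A single number is reused for every element
v * 2   # 2 4 6
```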
But objects don’t have to be just numbers:
vector3 <- c(40, "banana", "carrot", NULL)
vector3
## [1] "40" "banana" "carrot"
##Functions
We have actually already used a function! c() is a function that combines its arguments into a vector. Functions can transform your data in many ways. Here we focus on functions that give a snapshot of the data and summary statistics:
head(vector1, 3) #gives the first 3 items in vector1
## [1] 1 2 3
tail(vector1, 3) #gives the last 3 items in vector1
## [1] 8 9 10
Self check: Try creating a vector with 5 items in it and view the first 2 of them.
sample_vector <- c(1,3,5,7,9)
head(sample_vector, 2)
## [1] 1 3
Here are some basic summary statistics functions that may be helpful to you!
mean(vector1)
## [1] 5.5
sd(vector1)
## [1] 3.02765
median(vector1)
## [1] 5.5
summary(vector1)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    1.00    3.25    5.50    5.50    7.75   10.00
Self check: How would we find the variance of vector1?
sd(vector1)^2
## [1] 9.166667
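Base R also has a `var()` function that returns the variance directly. A quick check that the two routes agree (the vector is rebuilt here so the snippet runs on its own):

```r
vector1 <- c(1:10)

# var() computes the sample variance in one step
var(vector1)

# all.equal() compares numbers while tolerating tiny rounding differences
all.equal(var(vector1), sd(vector1)^2)
```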
If you don’t know what a function does, you can get help from R
?mean
Self check: What is the maximum of your 5 item vector?
max(sample_vector)
## [1] 9
summary(sample_vector)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##       1       3       5       5       7       9
##Classes
Each object in R has a class. Classes can be logical, numeric, character, etc. We can check the class of something using the class() function.
class(a)
## [1] "numeric"
class(vector1)
## [1] "integer"
class(2>3)
## [1] "logical"
What about our vector3 that has words and numbers?
class(vector3)
## [1] "character"
It’s a character so we can’t do mathematical operations on it! Notice that even though we have a number in the vector, R has converted it to a character!
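If the numbers are needed back, `as.numeric()` converts the entries that look like numbers and turns the rest into `NA`. A short sketch that rebuilds `vector3` so it runs on its own (`converted` is a made-up name):

```r
# Rebuild the mixed vector from above; NULL is dropped by c()
vector3 <- c(40, "banana", "carrot", NULL)
length(vector3)        # 3, because NULL adds nothing

# as.numeric() warns about entries it cannot convert, so the
# warning is silenced here for the demonstration
converted <- suppressWarnings(as.numeric(vector3))
converted              # 40 NA NA
class(converted)       # "numeric"
```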
##Using Packages
R is really useful because of its ability to use packages. Pacman is a package for “package management” - it helps us load multiple packages at once. You only need to install a package on your computer once. Run install.packages("pacman") to get the pacman package on your computer.
We need to load the pacman package after installing it to use it
library(pacman)
Now we use the p_load function to load other packages we want to use.
p_load(tidyverse)
Tidyverse is used for data wrangling. It allows you to manipulate data frames in a rather intuitive way. Tidyverse is a large collection of packages, so today we will be focusing on functions from the dplyr package (comes with tidyverse):
* select(): subset columns
* filter(): subset rows on conditions
* arrange(): sort results
* mutate(): create new columns by using information from other columns
* group_by() and summarize(): create summary statistics on grouped data
* count(): count discrete values
We are going to use a dataset that is built into the tidyverse (starwars, from the dplyr package). Let’s give it a name so we can work with it:
our_data <- starwars
We can view the data frame by typing view(our_data) or by clicking its name in the global environment:
view(our_data)
We can also look at names of variables:
names(our_data)
## [1] "name" "height" "mass" "hair_color" "skin_color"
## [6] "eye_color" "birth_year" "gender" "homeworld" "species"
## [11] "films" "vehicles" "starships"
###Select and Filter
Let’s select only the name, gender, and homeworld variables:
select(our_data, c(name, gender, homeworld))
## # A tibble: 87 x 3
##    name               gender homeworld
##    <chr>              <chr>  <chr>
##  1 Luke Skywalker     male   Tatooine
##  2 C-3PO              <NA>   Tatooine
##  3 R2-D2              <NA>   Naboo
##  4 Darth Vader        male   Tatooine
##  5 Leia Organa        female Alderaan
##  6 Owen Lars          male   Tatooine
##  7 Beru Whitesun lars female Tatooine
##  8 R5-D4              <NA>   Tatooine
##  9 Biggs Darklighter  male   Tatooine
## 10 Obi-Wan Kenobi     male   Stewjon
## # … with 77 more rows
Notice that this didn’t save anything in our global environment! If you want to save this new dataframe, you have to give it a name!
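Saving the result might look like the sketch below. It loads dplyr directly instead of going through p_load() so that it runs on its own, and `small_data` is a made-up name:

```r
library(dplyr)   # assumption: dplyr loaded directly instead of through p_load()

our_data <- starwars

# Assign the result to a name so it shows up in the global environment
# (small_data is a hypothetical name)
small_data <- select(our_data, name, gender, homeworld)

names(small_data)
```

The column names can also be listed without wrapping them in c(), as done here.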
To select all columns except certain ones, put a minus sign in front of their names:
select(our_data, c(-starships, -vehicles))
## # A tibble: 87 x 11
## name height mass hair_color skin_color eye_color birth_year gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 Luke… 172 77 blond fair blue 19 male
## 2 C-3PO 167 75 <NA> gold yellow 112 <NA>
## 3 R2-D2 96 32 <NA> white, bl… red 33 <NA>
## 4 Dart… 202 136 none white yellow 41.9 male
## 5 Leia… 150 49 brown light brown 19 female
## 6 Owen… 178 120 brown, gr… light blue 52 male
## 7 Beru… 165 75 brown light blue 47 female
## 8 R5-D4 97 32 <NA> white, red red NA <NA>
## 9 Bigg… 183 84 black light brown 24 male
## 10 Obi-… 182 77 auburn, w… fair blue-gray 57 male
## # … with 77 more rows, and 3 more variables: homeworld <chr>, species <chr>,
## # films <list>
Filter the data frame to include only droids
filter(our_data, species == "Droid")
## # A tibble: 5 x 13
## name height mass hair_color skin_color eye_color birth_year gender homeworld
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 C-3PO 167 75 <NA> gold yellow 112 <NA> Tatooine
## 2 R2-D2 96 32 <NA> white, bl… red 33 <NA> Naboo
## 3 R5-D4 97 32 <NA> white, red red NA <NA> Tatooine
## 4 IG-88 200 140 none metal red 15 none <NA>
## 5 BB8 NA NA none none black NA none <NA>
## # … with 4 more variables: species <chr>, films <list>, vehicles <list>,
## # starships <list>
Filter the data frame to include droids OR humans
filter(our_data, species == "Droid" | species == "Human")
## # A tibble: 40 x 13
## name height mass hair_color skin_color eye_color birth_year gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 Luke… 172 77 blond fair blue 19 male
## 2 C-3PO 167 75 <NA> gold yellow 112 <NA>
## 3 R2-D2 96 32 <NA> white, bl… red 33 <NA>
## 4 Dart… 202 136 none white yellow 41.9 male
## 5 Leia… 150 49 brown light brown 19 female
## 6 Owen… 178 120 brown, gr… light blue 52 male
## 7 Beru… 165 75 brown light blue 47 female
## 8 R5-D4 97 32 <NA> white, red red NA <NA>
## 9 Bigg… 183 84 black light brown 24 male
## 10 Obi-… 182 77 auburn, w… fair blue-gray 57 male
## # … with 30 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## # films <list>, vehicles <list>, starships <list>
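When the same column is compared against several values, `%in%` is a common alternative to chaining `|`. A minimal sketch, with dplyr loaded directly so it runs on its own (`droids_or_humans` and `same_rows` are made-up names):

```r
library(dplyr)
our_data <- starwars

# %in% checks membership in a set of values, which avoids repeating "species =="
droids_or_humans <- filter(our_data, species %in% c("Droid", "Human"))

# Same rows as the version written with |
same_rows <- filter(our_data, species == "Droid" | species == "Human")
identical(droids_or_humans, same_rows)
```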
Filter the data frame to include characters taller than 100 cm with a mass over 100 kg:
filter(our_data, height > 100 & mass > 100)
## # A tibble: 10 x 13
## name height mass hair_color skin_color eye_color birth_year gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 Dart… 202 136 none white yellow 41.9 male
## 2 Owen… 178 120 brown, gr… light blue 52 male
## 3 Chew… 228 112 brown unknown blue 200 male
## 4 Jabb… 175 1358 <NA> green-tan… orange 600 herma…
## 5 Jek … 180 110 brown fair blue NA male
## 6 IG-88 200 140 none metal red 15 none
## 7 Bossk 190 113 none green red 53 male
## 8 Dext… 198 102 none brown yellow NA male
## 9 Grie… 216 159 none brown, wh… green, y… NA male
## 10 Tarf… 234 136 brown brown blue NA male
## # … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
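In filter(), separating conditions with a comma means the same thing as `&`. The sketch below checks that, and also shows that rows with a missing height or mass do not survive the filter (`with_comma` and `with_amp` are made-up names):

```r
library(dplyr)
our_data <- starwars

# Separating conditions with a comma means "and", just like &
with_comma <- filter(our_data, height > 100, mass > 100)
with_amp   <- filter(our_data, height > 100 & mass > 100)
identical(with_comma, with_amp)

# Rows where height or mass is NA are dropped, because the condition
# is not TRUE for them
anyNA(with_comma$height)
anyNA(with_comma$mass)
```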
###Piping!!
What if we want to do those things all in one step? We can chain functions together with %>%. The pipe passes the result on its left-hand side (LHS) to the function on its right-hand side (RHS), so the code reads left to right, like a book. Let’s make a new data frame where we select the name, height, and mass, then filter out those who are shorter than 100 cm:
new_df <- our_data %>% select(name, height, mass) %>% filter(height >= 100)
new_df
## # A tibble: 74 x 3
##    name               height  mass
##    <chr>               <int> <dbl>
##  1 Luke Skywalker        172    77
##  2 C-3PO                 167    75
##  3 Darth Vader           202   136
##  4 Leia Organa           150    49
##  5 Owen Lars             178   120
##  6 Beru Whitesun lars    165    75
##  7 Biggs Darklighter     183    84
##  8 Obi-Wan Kenobi        182    77
##  9 Anakin Skywalker      188    84
## 10 Wilhuff Tarkin        180    NA
## # … with 64 more rows
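To see what the pipe is doing, it can help to compare it with the same steps written as nested function calls. A small sketch (`nested` and `piped` are made-up names):

```r
library(dplyr)
our_data <- starwars

# Nested version: read from the inside out
nested <- filter(select(our_data, name, height, mass), height >= 100)

# Piped version: read from left to right, top to bottom
piped <- our_data %>%
  select(name, height, mass) %>%
  filter(height >= 100)

identical(nested, piped)
```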
Self check: Make a new data frame that drops the gender column and keeps only characters who appear ONLY in the film “A New Hope”:
example_df <- our_data %>% select(-gender) %>% filter(films == "A New Hope")
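The films column is a list column (the printed output shows `films <list>`), so each row holds a whole vector of titles. A more explicit way to write the same check is sketched below; `only_new_hope` is a hypothetical helper written for this example, not a dplyr function:

```r
library(dplyr)
our_data <- starwars

# films is a list column: each row holds a character vector of film titles
class(our_data$films)

# only_new_hope is a hypothetical helper: TRUE when a character's film
# vector is exactly the single title "A New Hope"
only_new_hope <- function(x) identical(x, "A New Hope")

example_df <- our_data %>%
  select(-gender) %>%
  filter(vapply(films, only_new_hope, logical(1)))
```

vapply() applies the helper to every row's film vector and returns one TRUE or FALSE per character, which is the shape filter() expects.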
Arrange: Let’s arrange all of the characters by their height:
our_data %>% arrange(height)
## # A tibble: 87 x 13
## name height mass hair_color skin_color eye_color birth_year gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 Yoda 66 17 white green brown 896 male
## 2 Ratt… 79 15 none grey, blue unknown NA male
## 3 Wick… 88 20 brown brown brown 8 male
## 4 Dud … 94 45 none blue, grey yellow NA male
## 5 R2-D2 96 32 <NA> white, bl… red 33 <NA>
## 6 R4-P… 96 NA none silver, r… red, blue NA female
## 7 R5-D4 97 32 <NA> white, red red NA <NA>
## 8 Sebu… 112 40 none grey, red orange NA male
## 9 Gasg… 122 NA none white, bl… black NA male
## 10 Watto 137 NA black blue, grey yellow NA male
## # … with 77 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## # films <list>, vehicles <list>, starships <list>
Notice that this sorts from lowest to highest. We can sort the other way too:
our_data %>% arrange(desc(height))
## # A tibble: 87 x 13
## name height mass hair_color skin_color eye_color birth_year gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 Yara… 264 NA none white yellow NA male
## 2 Tarf… 234 136 brown brown blue NA male
## 3 Lama… 229 88 none grey black NA male
## 4 Chew… 228 112 brown unknown blue 200 male
## 5 Roos… 224 82 none grey orange NA male
## 6 Grie… 216 159 none brown, wh… green, y… NA male
## 7 Taun… 213 NA none grey black NA female
## 8 Rugo… 206 NA none green orange NA male
## 9 Tion… 206 80 none grey black NA male
## 10 Dart… 202 136 none white yellow 41.9 male
## # … with 77 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## # films <list>, vehicles <list>, starships <list>
Self check: Arrange the characters’ names in alphabetical order
our_data %>% arrange(name)
## # A tibble: 87 x 13
## name height mass hair_color skin_color eye_color birth_year gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 Ackb… 180 83 none brown mot… orange 41 male
## 2 Adi … 184 50 none dark blue NA female
## 3 Anak… 188 84 blond fair blue 41.9 male
## 4 Arve… NA NA brown fair brown NA male
## 5 Ayla… 178 55 none blue hazel 48 female
## 6 Bail… 191 NA black tan brown 67 male
## 7 Barr… 166 50 black yellow blue 40 female
## 8 BB8 NA NA none none black NA none
## 9 Ben … 163 65 none grey, gre… orange NA male
## 10 Beru… 165 75 brown light blue 47 female
## # … with 77 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## # films <list>, vehicles <list>, starships <list>
Mutate: This creates new variables. Let’s create a new variable that measures height in inches instead of centimeters (2.54cm per inch):
our_data %>% mutate(height_inches = height/2.54)
## # A tibble: 87 x 14
## name height mass hair_color skin_color eye_color birth_year gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 Luke… 172 77 blond fair blue 19 male
## 2 C-3PO 167 75 <NA> gold yellow 112 <NA>
## 3 R2-D2 96 32 <NA> white, bl… red 33 <NA>
## 4 Dart… 202 136 none white yellow 41.9 male
## 5 Leia… 150 49 brown light brown 19 female
## 6 Owen… 178 120 brown, gr… light blue 52 male
## 7 Beru… 165 75 brown light blue 47 female
## 8 R5-D4 97 32 <NA> white, red red NA <NA>
## 9 Bigg… 183 84 black light brown 24 male
## 10 Obi-… 182 77 auburn, w… fair blue-gray 57 male
## # … with 77 more rows, and 6 more variables: homeworld <chr>, species <chr>,
## # films <list>, vehicles <list>, starships <list>, height_inches <dbl>
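As with select(), mutate() does not change `our_data` unless the result is saved. A sketch that saves the result and keeps only a few columns so the new one is easy to see (`data_inches` is a made-up name):

```r
library(dplyr)
our_data <- starwars

# mutate() returns a new data frame; assign it to keep the new column
# (data_inches is a hypothetical name)
data_inches <- our_data %>%
  mutate(height_inches = height / 2.54) %>%
  select(name, height, height_inches)

head(data_inches, 3)
```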
Self check: Create a new variable that is the sum of a character’s mass and height:
our_data %>% mutate(total = height + mass)
## # A tibble: 87 x 14
## name height mass hair_color skin_color eye_color birth_year gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 Luke… 172 77 blond fair blue 19 male
## 2 C-3PO 167 75 <NA> gold yellow 112 <NA>
## 3 R2-D2 96 32 <NA> white, bl… red 33 <NA>
## 4 Dart… 202 136 none white yellow 41.9 male
## 5 Leia… 150 49 brown light brown 19 female
## 6 Owen… 178 120 brown, gr… light blue 52 male
## 7 Beru… 165 75 brown light blue 47 female
## 8 R5-D4 97 32 <NA> white, red red NA <NA>
## 9 Bigg… 183 84 black light brown 24 male
## 10 Obi-… 182 77 auburn, w… fair blue-gray 57 male
## # … with 77 more rows, and 6 more variables: homeworld <chr>, species <chr>,
## # films <list>, vehicles <list>, starships <list>, total <dbl>
Group_by and Summarize: This will group data together and can make summary statistics. Let’s find the average height for each species:
our_data %>% group_by(species) %>% summarize(avg_height = mean(height))
## # A tibble: 38 x 2
##    species   avg_height
##    <chr>          <dbl>
##  1 Aleena           79
##  2 Besalisk        198
##  3 Cerean          198
##  4 Chagrian        196
##  5 Clawdite        168
##  6 Droid            NA
##  7 Dug             112
##  8 Ewok             88
##  9 Geonosian       183
## 10 Gungan          209.
## # … with 28 more rows
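Notice we have NA’s! One way to get rid of those is na.omit(), which drops every row that has an NA in any column. That is why only 11 species are left below: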
our_data %>% na.omit() %>% group_by(species) %>% summarize(avg_height = mean(height))
## # A tibble: 11 x 2
##    species      avg_height
##    <chr>             <dbl>
##  1 Cerean              198
##  2 Ewok                 88
##  3 Gungan              196
##  4 Human               178
##  5 Kel Dor             188
##  6 Mirialan            168
##  7 Mon Calamari        180
##  8 Trandoshan          190
##  9 Twi'lek             178
## 10 Wookiee             228
## 11 Zabrak              175
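A gentler option is to leave the rows alone and tell mean() to skip missing values with `na.rm = TRUE`. A minimal sketch (`avg_by_species` is a made-up name):

```r
library(dplyr)
our_data <- starwars

# na.rm = TRUE skips missing heights inside each group,
# so rows with NAs in other columns are still used
avg_by_species <- our_data %>%
  group_by(species) %>%
  summarize(avg_height = mean(height, na.rm = TRUE))

# The Droid group now has an average instead of NA
filter(avg_by_species, species == "Droid")
```

This keeps every species in the result, because a character is only left out of the average when their own height is missing.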
Count: Count the number of each species
our_data %>% count(species)
## # A tibble: 38 x 2
##    species       n
##    <chr>     <int>
##  1 Aleena        1
##  2 Besalisk      1
##  3 Cerean        1
##  4 Chagrian      1
##  5 Clawdite      1
##  6 Droid         5
##  7 Dug           1
##  8 Ewok          1
##  9 Geonosian     1
## 10 Gungan        3
## # … with 28 more rows
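count() can also order the result so the most common values come first. A short sketch using its `sort` argument (`species_counts` is a made-up name):

```r
library(dplyr)
our_data <- starwars

# sort = TRUE puts the most common species first
species_counts <- our_data %>% count(species, sort = TRUE)
head(species_counts, 3)
```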
To do a regression in R, we use lm(). The basic setup: name <- lm(y ~ x, data = name_of_df)
reg1 <- lm(height ~ mass, data = our_data)
How do we look at the regression output?
summary(reg1)
##
## Call:
## lm(formula = height ~ mass, data = our_data)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -105.763   -5.610    6.385   18.202   58.897
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 171.28536    5.34340   32.05   <2e-16 ***
## mass          0.02807    0.02752    1.02    0.312
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 35.52 on 57 degrees of freedom
## (28 observations deleted due to missingness)
## Multiple R-squared: 0.01792, Adjusted R-squared: 0.0006956
## F-statistic: 1.04 on 1 and 57 DF, p-value: 0.312
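The pieces of this output can also be pulled out individually, which is handy when only the estimates are needed. A sketch using base R functions, rebuilding `reg1` so it runs on its own:

```r
library(dplyr)
our_data <- starwars

reg1 <- lm(height ~ mass, data = our_data)

# coef() returns just the estimated intercept and slope
coef(reg1)

# The full coefficient table (estimate, std. error, t value, p-value) as a matrix
summary(reg1)$coefficients

# Number of observations actually used after dropping rows with missing values
nobs(reg1)
```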
Let’s filter out Jabba the Hutt because he is a large boi (his mass is 1358 kg, far above everyone else’s):
reg2 <- lm(height ~ mass, data = our_data %>% filter(species != "Hutt"))
summary(reg2)
##
## Call:
## lm(formula = height ~ mass, data = our_data %>% filter(species !=
## "Hutt"))
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -51.821  -6.273   2.327  14.078  45.728
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 101.6706     8.6593  11.741  < 2e-16 ***
## mass          0.9500     0.1064   8.931 2.72e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23.3 on 55 degrees of freedom
## (24 observations deleted due to missingness)
## Multiple R-squared: 0.5919, Adjusted R-squared: 0.5845
## F-statistic: 79.77 on 1 and 55 DF, p-value: 2.72e-12
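One thing to watch with `species != "Hutt"`: the comparison is NA for characters whose species is missing, and filter() drops those rows as well. If those characters should stay in, a sketch of one way to keep them is below (`no_hutt` and `no_hutt_keep_na` are made-up names):

```r
library(dplyr)
our_data <- starwars

# species != "Hutt" is NA when species is missing, and filter() drops those rows too
no_hutt <- our_data %>% filter(species != "Hutt")

# Keeps characters with unknown species while still removing Hutts
no_hutt_keep_na <- our_data %>% filter(is.na(species) | species != "Hutt")

# The difference is the number of characters with a missing species
nrow(no_hutt_keep_na) - nrow(no_hutt)
```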
Self check: Can you interpret the coefficient? Interpret the intercept. What are the null and alternative hypotheses? Is the coefficient significant at the 5% level?
* answer: H0: beta_1 = 0, Ha: beta_1 ≠ 0
* answer: For a 1 kg increase in mass, height increases by 0.95 cm on average. If a character weighed 0 kg, the model predicts they would be about 101.7 cm tall
* answer: Since p < .05 (p = 2.72e-12), we reject the null hypothesis at the 5% level
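These interpretations can be checked with predict(). The sketch below asks the model for predicted heights at 0 kg and 1 kg, and `new_masses` and `predicted` are made-up names:

```r
library(dplyr)
our_data <- starwars

reg2 <- lm(height ~ mass, data = our_data %>% filter(species != "Hutt"))

# Predicted heights at 0 kg and 1 kg (new_masses is a hypothetical name)
new_masses <- data.frame(mass = c(0, 1))
predicted <- predict(reg2, newdata = new_masses)

predicted[1]                  # the intercept
predicted[2] - predicted[1]   # the slope on mass

# 95% confidence intervals for both coefficients
confint(reg2, level = 0.95)
```

The prediction at 0 kg matches the intercept, and the gap between the two predictions matches the coefficient on mass.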
All of these use the starwars data:
Use !is.na(), group_by(), and summarize() to find the mean, min, and max mass for each homeworld.