#GOALS for today
Review Math Operations
Review dplyr
Reading Data
Regressions and Interpretation
##Math Operations
vector1<- c(1:10)
mean(vector1)
## [1] 5.5
sd(vector1)
## [1] 3.02765
median(vector1)
## [1] 5.5
summary(vector1)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 3.25 5.50 5.50 7.75 10.00
Self Check: create a vector and find its variance!
vector2<- c(1,3,5,7,9)
sd(vector2)^2
## [1] 10
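Base R also has a var() function that computes the sample variance directly. The short check below reuses vector2 from above and is only a sketch to show that the two approaches should agree; the names var_from_sd and var_direct are made up for illustration.

```r
# Sample variance two ways: squaring sd() and calling var() directly
vector2 <- c(1, 3, 5, 7, 9)

var_from_sd <- sd(vector2)^2   # the approach used above
var_direct  <- var(vector2)    # base R's built-in sample variance

# Compare the two results (all.equal tolerates tiny floating-point differences)
all.equal(var_from_sd, var_direct)
```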
##The dplyr package:
Recall our workflow from last time:
* Load the ‘pacman’ package
* Load the ‘tidyverse’ package
library(pacman)
p_load(tidyverse)
Alternatively, we can simply use:
library(tidyverse)
The p_load function installs a package if it is missing and then loads it, so it becomes more helpful the more packages you want to use. It’s good practice to get used to this workflow sooner rather than later.
There are a ton of useful functions in dplyr, but the following are a good place to start:
* select(): subset columns
* filter(): subset rows on conditions
* arrange(): sort results
* mutate(): create new columns by using information from other columns
* group_by() and summarize(): create summary statistics on grouped data
* count(): count discrete values

We are going to use a dataset that is built into dplyr (part of the tidyverse). The dataset is called starwars. Let’s give it a name so we can work with it:
our_data <- starwars
We can view the data frame by typing view(our_data) or by clicking its name in the global environment. Note that the tidyverse function (from the tibble package) is lowercase view(), while the base R function is uppercase View().
view(our_data)
We can also look at names of variables without looking at the entire dataset:
names(our_data)
## [1] "name" "height" "mass" "hair_color" "skin_color"
## [6] "eye_color" "birth_year" "gender" "homeworld" "species"
## [11] "films" "vehicles" "starships"
To select all columns except certain ones, use a minus sign:
select(our_data, c(-starships, -vehicles))
## # A tibble: 87 x 11
## name height mass hair_color skin_color eye_color birth_year gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 Luke… 172 77 blond fair blue 19 male
## 2 C-3PO 167 75 <NA> gold yellow 112 <NA>
## 3 R2-D2 96 32 <NA> white, bl… red 33 <NA>
## 4 Dart… 202 136 none white yellow 41.9 male
## 5 Leia… 150 49 brown light brown 19 female
## 6 Owen… 178 120 brown, gr… light blue 52 male
## 7 Beru… 165 75 brown light blue 47 female
## 8 R5-D4 97 32 <NA> white, red red NA <NA>
## 9 Bigg… 183 84 black light brown 24 male
## 10 Obi-… 182 77 auburn, w… fair blue-gray 57 male
## # … with 77 more rows, and 3 more variables: homeworld <chr>, species <chr>,
## # films <list>
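For comparison, select() can also keep columns by naming them rather than dropping them. The sketch below assumes the starwars data that ships with dplyr; the object names kept_three and kept_range are only illustrative.

```r
library(dplyr)

our_data <- starwars

# Keep only the columns we name
kept_three <- select(our_data, name, height, mass)

# Keep a contiguous range of columns, from name through eye_color
kept_range <- select(our_data, name:eye_color)

names(kept_three)
names(kept_range)
```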
Let’s do a few filtering examples. To filter the data frame to include only droids:
filter(our_data, species == "Droid")
## # A tibble: 5 x 13
## name height mass hair_color skin_color eye_color birth_year gender homeworld
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 C-3PO 167 75 <NA> gold yellow 112 <NA> Tatooine
## 2 R2-D2 96 32 <NA> white, bl… red 33 <NA> Naboo
## 3 R5-D4 97 32 <NA> white, red red NA <NA> Tatooine
## 4 IG-88 200 140 none metal red 15 none <NA>
## 5 BB8 NA NA none none black NA none <NA>
## # … with 4 more variables: species <chr>, films <list>, vehicles <list>,
## # starships <list>
Filter the data frame to include droids OR humans:
filter(our_data, species == "Droid" | species == "Human")
## # A tibble: 40 x 13
## name height mass hair_color skin_color eye_color birth_year gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 Luke… 172 77 blond fair blue 19 male
## 2 C-3PO 167 75 <NA> gold yellow 112 <NA>
## 3 R2-D2 96 32 <NA> white, bl… red 33 <NA>
## 4 Dart… 202 136 none white yellow 41.9 male
## 5 Leia… 150 49 brown light brown 19 female
## 6 Owen… 178 120 brown, gr… light blue 52 male
## 7 Beru… 165 75 brown light blue 47 female
## 8 R5-D4 97 32 <NA> white, red red NA <NA>
## 9 Bigg… 183 84 black light brown 24 male
## 10 Obi-… 182 77 auburn, w… fair blue-gray 57 male
## # … with 30 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## # films <list>, vehicles <list>, starships <list>
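When the OR conditions all test the same column, the %in% operator is a common shorthand. The following sketch (object names with_or and with_in are made up here) suggests that both forms should return the same characters.

```r
library(dplyr)

our_data <- starwars

# Two ways to keep droids or humans
with_or <- filter(our_data, species == "Droid" | species == "Human")
with_in <- filter(our_data, species %in% c("Droid", "Human"))

# Both versions should contain the same characters in the same order
identical(with_or$name, with_in$name)
```

This becomes more convenient as the list of values to match grows.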
Filter the data frame to include characters taller than 100 cm with a mass over 100:
filter(our_data, height > 100 & mass > 100)
## # A tibble: 10 x 13
## name height mass hair_color skin_color eye_color birth_year gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 Dart… 202 136 none white yellow 41.9 male
## 2 Owen… 178 120 brown, gr… light blue 52 male
## 3 Chew… 228 112 brown unknown blue 200 male
## 4 Jabb… 175 1358 <NA> green-tan… orange 600 herma…
## 5 Jek … 180 110 brown fair blue NA male
## 6 IG-88 200 140 none metal red 15 none
## 7 Bossk 190 113 none green red 53 male
## 8 Dext… 198 102 none brown yellow NA male
## 9 Grie… 216 159 none brown, wh… green, y… NA male
## 10 Tarf… 234 136 brown brown blue NA male
## # … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
How about some piping! What if we want to do those things all in one step? We can chain functions together with %>%. The pipe passes the result on its left-hand side (LHS) to the function on its right-hand side (RHS) as that function’s first argument, so the code reads left to right, like a book. Let’s make a new data frame where we select the name, height, and mass columns, then filter out characters who are shorter than 100 cm:
new_df <- our_data %>% select(name, height, mass) %>% filter(height >= 100)
new_df
## # A tibble: 74 x 3
## name height mass
## <chr> <int> <dbl>
## 1 Luke Skywalker 172 77
## 2 C-3PO 167 75
## 3 Darth Vader 202 136
## 4 Leia Organa 150 49
## 5 Owen Lars 178 120
## 6 Beru Whitesun lars 165 75
## 7 Biggs Darklighter 183 84
## 8 Obi-Wan Kenobi 182 77
## 9 Anakin Skywalker 188 84
## 10 Wilhuff Tarkin 180 NA
## # … with 64 more rows
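To see what the pipe is doing, it can help to write the same steps without it. This sketch (with the illustrative names nested and piped) compares the nested-call version with the piped version; they should give the same rows.

```r
library(dplyr)

our_data <- starwars

# Without the pipe: read from the inside out
nested <- filter(select(our_data, name, height, mass), height >= 100)

# With the pipe: read from left to right
piped <- our_data %>% select(name, height, mass) %>% filter(height >= 100)

identical(nested$name, piped$name)
```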
Self check: make a new data frame that keeps all columns except gender and contains characters who appear ONLY in the film “A New Hope”:
example_df <- our_data %>% select(-gender) %>% filter(films == "A New Hope")
view(example_df)
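One thing to keep in mind is that films is a list column (each character has a vector of film titles), so comparing it with == relies on how R coerces list elements. A more explicit version, offered only as a sketch with made-up object names, checks each character’s film vector directly:

```r
library(dplyr)

our_data <- starwars

# Characters whose ONLY film is "A New Hope"
only_new_hope <- our_data %>%
  select(-gender) %>%
  filter(vapply(films, function(f) identical(f, "A New Hope"), logical(1)))

# Characters who appear in "A New Hope", possibly among other films
in_new_hope <- our_data %>%
  select(-gender) %>%
  filter(vapply(films, function(f) "A New Hope" %in% f, logical(1)))

nrow(only_new_hope)
nrow(in_new_hope)
```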
Let’s do some work with arrange(), starting by sorting all of the characters by their height:
our_data %>% arrange(height)
## # A tibble: 87 x 13
## name height mass hair_color skin_color eye_color birth_year gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 Yoda 66 17 white green brown 896 male
## 2 Ratt… 79 15 none grey, blue unknown NA male
## 3 Wick… 88 20 brown brown brown 8 male
## 4 Dud … 94 45 none blue, grey yellow NA male
## 5 R2-D2 96 32 <NA> white, bl… red 33 <NA>
## 6 R4-P… 96 NA none silver, r… red, blue NA female
## 7 R5-D4 97 32 <NA> white, red red NA <NA>
## 8 Sebu… 112 40 none grey, red orange NA male
## 9 Gasg… 122 NA none white, bl… black NA male
## 10 Watto 137 NA black blue, grey yellow NA male
## # … with 77 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## # films <list>, vehicles <list>, starships <list>
Notice that this sorts from lowest to highest. We can sort the other way too, using desc():
our_data %>% arrange(desc(height))
## # A tibble: 87 x 13
## name height mass hair_color skin_color eye_color birth_year gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 Yara… 264 NA none white yellow NA male
## 2 Tarf… 234 136 brown brown blue NA male
## 3 Lama… 229 88 none grey black NA male
## 4 Chew… 228 112 brown unknown blue 200 male
## 5 Roos… 224 82 none grey orange NA male
## 6 Grie… 216 159 none brown, wh… green, y… NA male
## 7 Taun… 213 NA none grey black NA female
## 8 Rugo… 206 NA none green orange NA male
## 9 Tion… 206 80 none grey black NA male
## 10 Dart… 202 136 none white yellow 41.9 male
## # … with 77 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## # films <list>, vehicles <list>, starships <list>
Self check: Arrange the characters’ names in alphabetical order:
our_data %>% arrange(name)
## # A tibble: 87 x 13
## name height mass hair_color skin_color eye_color birth_year gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 Ackb… 180 83 none brown mot… orange 41 male
## 2 Adi … 184 50 none dark blue NA female
## 3 Anak… 188 84 blond fair blue 41.9 male
## 4 Arve… NA NA brown fair brown NA male
## 5 Ayla… 178 55 none blue hazel 48 female
## 6 Bail… 191 NA black tan brown 67 male
## 7 Barr… 166 50 black yellow blue 40 female
## 8 BB8 NA NA none none black NA none
## 9 Ben … 163 65 none grey, gre… orange NA male
## 10 Beru… 165 75 brown light blue 47 female
## # … with 77 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## # films <list>, vehicles <list>, starships <list>
mutate() creates new variables. Let’s create a new variable that measures age in dog years. First we need to create an age variable. I am going to assume it’s year 400 in the Star Wars universe (correct me if I am wrong):
our_data<-our_data %>% mutate(age = 400-birth_year)
Next I will find the characters’ ages in dog years (scale by 7–it’s SCIENCE):
our_data<-our_data%>%mutate(dog_years_age=age*7)
view(our_data)
We could also do this in one step:
our_data<-our_data %>% mutate(age = 400-birth_year)%>%
mutate(dog_years_age=age*7)
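A single mutate() call can also create several columns at once, and later columns may refer to ones defined earlier in the same call. The sketch below keeps the same year-400 assumption and stores the result under a made-up name, with_ages, so it does not overwrite our_data.

```r
library(dplyr)

with_ages <- starwars %>%
  mutate(
    age = 400 - birth_year,   # assumes the current year is 400, as above
    dog_years_age = age * 7   # can use age because it is defined just above
  )

with_ages %>% select(name, birth_year, age, dog_years_age)
```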
Self check: Create a new variable that is the sum of a character’s mass and height:
our_data %>% mutate(total = height + mass)
## # A tibble: 87 x 16
## name height mass hair_color skin_color eye_color birth_year gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 Luke… 172 77 blond fair blue 19 male
## 2 C-3PO 167 75 <NA> gold yellow 112 <NA>
## 3 R2-D2 96 32 <NA> white, bl… red 33 <NA>
## 4 Dart… 202 136 none white yellow 41.9 male
## 5 Leia… 150 49 brown light brown 19 female
## 6 Owen… 178 120 brown, gr… light blue 52 male
## 7 Beru… 165 75 brown light blue 47 female
## 8 R5-D4 97 32 <NA> white, red red NA <NA>
## 9 Bigg… 183 84 black light brown 24 male
## 10 Obi-… 182 77 auburn, w… fair blue-gray 57 male
## # … with 77 more rows, and 8 more variables: homeworld <chr>, species <chr>,
## # films <list>, vehicles <list>, starships <list>, age <dbl>,
## # dog_years_age <dbl>, total <dbl>
group_by() and summarize() group data together and compute summary statistics. Let’s find the average height for each species:
our_data %>% group_by(species) %>% summarize(avg_height = mean(height))
## # A tibble: 38 x 2
## species avg_height
## <chr> <dbl>
## 1 Aleena 79
## 2 Besalisk 198
## 3 Cerean 198
## 4 Chagrian 196
## 5 Clawdite 168
## 6 Droid NA
## 7 Dug 112
## 8 Ewok 88
## 9 Geonosian 183
## 10 Gungan 209.
## # … with 28 more rows
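Notice we have NA’s! That happens because mean() returns NA whenever a group contains a missing height. We can get rid of those with na.omit(), but note that it drops every row with a missing value in any column, which is why only 11 species remain: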
our_data %>% na.omit() %>% group_by(species) %>% summarize(avg_height = mean(height))
## # A tibble: 11 x 2
## species avg_height
## <chr> <dbl>
## 1 Cerean 198
## 2 Ewok 88
## 3 Gungan 196
## 4 Human 178
## 5 Kel Dor 188
## 6 Mirialan 168
## 7 Mon Calamari 180
## 8 Trandoshan 190
## 9 Twi'lek 178
## 10 Wookiee 228
## 11 Zabrak 175
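A gentler option is to leave the rows alone and tell mean() to skip missing values with na.rm = TRUE. This is a sketch using the built-in starwars data; avg_by_species is just an illustrative name.

```r
library(dplyr)

# Ignore missing heights inside each group instead of dropping whole rows
avg_by_species <- starwars %>%
  group_by(species) %>%
  summarize(avg_height = mean(height, na.rm = TRUE))

avg_by_species
```

With this approach a species keeps its row as long as at least one of its characters has a recorded height.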
Count…well it counts:
our_data %>% count(species)
## # A tibble: 38 x 2
## species n
## <chr> <int>
## 1 Aleena 1
## 2 Besalisk 1
## 3 Cerean 1
## 4 Chagrian 1
## 5 Clawdite 1
## 6 Droid 5
## 7 Dug 1
## 8 Ewok 1
## 9 Geonosian 1
## 10 Gungan 3
## # … with 28 more rows
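count() also accepts sort = TRUE, which puts the most common values first. A small sketch, with species_counts as an illustrative name:

```r
library(dplyr)

# Count characters per species, largest groups first
species_counts <- starwars %>% count(species, sort = TRUE)

head(species_counts)
```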
Remember: If you want more info on a function type ?name_of_function!
##OLS Regression
To do a regression in R, we use lm(). The basic setup: name <- lm(y ~ x, data = name_of_df). Let’s regress height on mass:
reg1 <- lm(height ~ mass, data = our_data)
How do we look at the regression output?
summary(reg1)
##
## Call:
## lm(formula = height ~ mass, data = our_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -105.763 -5.610 6.385 18.202 58.897
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 171.28536 5.34340 32.05 <2e-16 ***
## mass 0.02807 0.02752 1.02 0.312
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 35.52 on 57 degrees of freedom
## (28 observations deleted due to missingness)
## Multiple R-squared: 0.01792, Adjusted R-squared: 0.0006956
## F-statistic: 1.04 on 1 and 57 DF, p-value: 0.312
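Hmm, it looks like there isn’t a significant relationship between mass and height. Let’s filter out Jabba the Hutt because he is a large boi (mass 1358). Note that filter(species != "Hutt") also drops characters whose species is NA, which is why the number of observations used changes by more than one: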
reg2 <- lm(height ~ mass, data = our_data %>% filter(species != "Hutt"))
summary(reg2)
##
## Call:
## lm(formula = height ~ mass, data = our_data %>% filter(species !=
## "Hutt"))
##
## Residuals:
## Min 1Q Median 3Q Max
## -51.821 -6.273 2.327 14.078 45.728
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 101.6706 8.6593 11.741 < 2e-16 ***
## mass 0.9500 0.1064 8.931 2.72e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23.3 on 55 degrees of freedom
## (24 observations deleted due to missingness)
## Multiple R-squared: 0.5919, Adjusted R-squared: 0.5845
## F-statistic: 79.77 on 1 and 55 DF, p-value: 2.72e-12
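If we only want the numbers rather than the full printout, the pieces of a fitted model can be pulled out with base R helpers. The sketch below refits the second regression under the made-up name reg_no_hutt; exact estimates may differ slightly depending on the dplyr version that supplies starwars.

```r
library(dplyr)

# Same model as above: drop Hutts, then regress height on mass
no_hutt <- starwars %>% filter(species != "Hutt")
reg_no_hutt <- lm(height ~ mass, data = no_hutt)

coef(reg_no_hutt)            # named vector with the intercept and slope
coef(summary(reg_no_hutt))   # matrix of estimates, std. errors, t values, p-values
nobs(reg_no_hutt)            # number of observations actually used in the fit
```

The slope on mass is read as the expected change in height (cm) for a one-unit increase in mass.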
##Before Reading in data
Before we learn to read data into R, it would be helpful to know how to tell R where it is. This is important because later in the class we need to load files from our local machine. Eventually, we want to start using our own data instead of data contained in a package.
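- R separates folders in a file path with /. Windows can also use \\ (must be double).
- To see where R is currently looking, use getwd(). “wd” stands for working directory.
- To change the working directory, use setwd().
- To list the files in the current directory, use dir().
- File paths are character strings, so wrap them in quotes ("") when you pass them to setwd().
- You can also save a path as an object: my_dir <- "Home/Folder1/Folder2" then setwd(my_dir).
- To move up one level, use setwd("..").

R can read in data from just about any source/format. Today we’re going to cover reading data saved in CSVs (comma-separated values).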
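To make the working-directory commands concrete, here is a minimal sketch that moves R into a folder, looks around, and then returns to where it started. It uses the session’s temporary folder (tempdir()) as a stand-in for a real project folder; start_dir and my_dir are illustrative names.

```r
# Remember where we started so we can return at the end
start_dir <- getwd()

# Stand-in folder for illustration: the temporary directory R creates for this session
my_dir <- tempdir()

setwd(my_dir)      # move into that folder
getwd()            # confirm where R is now looking
dir()              # list the files in the current folder

setwd("..")        # move up one level
setwd(start_dir)   # go back to where we started
```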
First, we’ll load the tidyverse package, which will actually load several packages (we want readr). The base (basic) installation of R already has a function for reading CSVs, but the function in tidyverse (readr) is a bit nicer.
Remember, you can always get to the help files in R/RStudio using ?. Let’s check out the help file for read_csv.
?read_csv
##Getting data into R
Step 1: Download the data
- Download the csv
  - Search Marijuana Data Vincentarelbundock on Google -> “Data Set”
  - Ctrl + F or Command + F to search: Arrests for Marijuana Possession -> csv (the DOC option gives us a description of the data)
  - or search for “Arrests for Marijuana Possession” at https://vincentarelbundock.github.io/Rdatasets/datasets.html
- Make sure your downloaded file is in a reasonable directory
- Navigate R to the reasonable directory
- Read the data, read_csv("../data/Arrests.csv")
Step 2: Read the data
It may be helpful to save the path as an object and then read the data in using that object:
my_path<-"/Users/garrettstanford/Downloads/Arrests.csv"
read_csv(my_path)
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
## X1 = col_double(),
## released = col_character(),
## colour = col_character(),
## year = col_double(),
## age = col_double(),
## sex = col_character(),
## employed = col_character(),
## citizen = col_character(),
## checks = col_double()
## )
## # A tibble: 5,226 x 9
## X1 released colour year age sex employed citizen checks
## <dbl> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr> <dbl>
## 1 1 Yes White 2002 21 Male Yes Yes 3
## 2 2 No Black 1999 17 Male Yes Yes 3
## 3 3 Yes White 2000 24 Male Yes Yes 3
## 4 4 No Black 2000 46 Male Yes Yes 1
## 5 5 Yes Black 1999 27 Female Yes Yes 1
## 6 6 Yes Black 1998 16 Female Yes Yes 0
## 7 7 Yes White 1999 40 Male No Yes 0
## 8 8 Yes White 1998 34 Female Yes Yes 1
## 9 9 Yes Black 2000 23 Male Yes Yes 4
## 10 10 Yes White 2001 30 Male Yes Yes 3
## # … with 5,216 more rows
Alternatively, we could just pass the path to read_csv() directly:
read_csv("/Users/garrettstanford/Downloads/Arrests.csv")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
## X1 = col_double(),
## released = col_character(),
## colour = col_character(),
## year = col_double(),
## age = col_double(),
## sex = col_character(),
## employed = col_character(),
## citizen = col_character(),
## checks = col_double()
## )
## # A tibble: 5,226 x 9
## X1 released colour year age sex employed citizen checks
## <dbl> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr> <dbl>
## 1 1 Yes White 2002 21 Male Yes Yes 3
## 2 2 No Black 1999 17 Male Yes Yes 3
## 3 3 Yes White 2000 24 Male Yes Yes 3
## 4 4 No Black 2000 46 Male Yes Yes 1
## 5 5 Yes Black 1999 27 Female Yes Yes 1
## 6 6 Yes Black 1998 16 Female Yes Yes 0
## 7 7 Yes White 1999 40 Male No Yes 0
## 8 8 Yes White 1998 34 Female Yes Yes 1
## 9 9 Yes Black 2000 23 Male Yes Yes 4
## 10 10 Yes White 2001 30 Male Yes Yes 3
## # … with 5,216 more rows
Notice that we read the data, but it just printed to screen. We want to assign the data to an object (give it a name).
arrest_data<-read_csv(my_path)
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
## X1 = col_double(),
## released = col_character(),
## colour = col_character(),
## year = col_double(),
## age = col_double(),
## sex = col_character(),
## employed = col_character(),
## citizen = col_character(),
## checks = col_double()
## )
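If you do not have the Arrests file handy, the same read-and-assign pattern can be tried on a tiny CSV written to a temporary file. Everything in this sketch (toy_path, toy_data, and the three rows of values) is made up for illustration and is not the Arrests data.

```r
library(readr)

# Write a small made-up CSV to a temporary file
toy_path <- tempfile(fileext = ".csv")
writeLines(c("colour,age", "Black,21", "White,30", "White,17"), toy_path)

# Read the file and assign the result to a name so we can reuse it
toy_data <- read_csv(toy_path)

toy_data
```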
Here are some commands for getting a snapshot of a dataset: head, tail, summary, table, plot
#head(arrest_data,15);
#tail(arrest_data, 10);
#head(arrest_data, 25)%>% tail(10)
summary(arrest_data)
## X1 released colour year
## Min. : 1 Length:5226 Length:5226 Min. :1997
## 1st Qu.:1307 Class :character Class :character 1st Qu.:1998
## Median :2614 Mode :character Mode :character Median :2000
## Mean :2614 Mean :2000
## 3rd Qu.:3920 3rd Qu.:2001
## Max. :5226 Max. :2002
## age sex employed citizen
## Min. :12.00 Length:5226 Length:5226 Length:5226
## 1st Qu.:18.00 Class :character Class :character Class :character
## Median :21.00 Mode :character Mode :character Mode :character
## Mean :23.85
## 3rd Qu.:27.00
## Max. :66.00
## checks
## Min. :0.000
## 1st Qu.:0.000
## Median :1.000
## Mean :1.636
## 3rd Qu.:3.000
## Max. :6.000
What is the mean age of Black arrestees? Of White arrestees?
arrest_data%>%group_by(colour) %>%
summarize(avg_age = mean(age))
## # A tibble: 2 x 2
## colour avg_age
## <chr> <dbl>
## 1 Black 24.8
## 2 White 23.5
If we look at our dataset, some of the data is impossible to use in its current format. How does one regress the value “Yes” or “Black”? Let’s use the ifelse function to make some numerical representations of these columns:
# 1 if the condition holds, 0 otherwise
arrest_data <- arrest_data %>% mutate(gender_dummy = ifelse(sex == "Male", 1, 0))
arrest_data <- arrest_data %>% mutate(colour_dummy = ifelse(colour == "Black", 1, 0))
arrest_data <- arrest_data %>% mutate(released_dummy = ifelse(released == "Yes", 1, 0))
This function is super helpful, but if it’s over your head don’t worry too much about it. Alternatively you could try ?ifelse to learn more.
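As a tiny illustration of how ifelse works element by element, the sketch below applies it to a made-up vector standing in for the sex column. Note that a missing value stays missing.

```r
# Made-up toy vector, standing in for the sex column
sex <- c("Male", "Female", "Male", NA)

# ifelse(test, value_if_true, value_if_false), applied to each element
gender_dummy <- ifelse(sex == "Male", 1, 0)

gender_dummy   # 1 0 1 NA
```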
The “DOC” file on the website, right next to the “CSV” that you downloaded, gives a description of each variable. I see that “checks: Number of police data bases (of previous arrests, previous convictions, parole status, etc. – 6 in all) on which the arrestee’s name appeared; a numeric vector.”
So let’s see if age has an effect on how many checks someone has:
reg3<-lm(checks ~ age, data = arrest_data)
summary(reg3)
##
## Call:
## lm(formula = checks ~ age, data = arrest_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.6403 -1.4903 -0.4403 1.3597 4.4847
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.040227 0.064080 16.233 <2e-16 ***
## age 0.025002 0.002537 9.853 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.525 on 5224 degrees of freedom
## Multiple R-squared: 0.01825, Adjusted R-squared: 0.01806
## F-statistic: 97.09 on 1 and 5224 DF, p-value: < 2.2e-16
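So being older makes you more of a criminal? Maybe, or maybe something else is going on… For example, older arrestees have simply had more years in which to show up in police databases.
Let’s look at whether the year predicts the probability that the arrested individual is Black rather than White: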
reg4<-lm(colour_dummy ~ year, data = arrest_data)
summary(reg4)
##
## Call:
## lm(formula = colour_dummy ~ year, data = arrest_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.2499 -0.2471 -0.2458 -0.2430 0.7570
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.511065 8.577352 -0.293 0.770
## year 0.001379 0.004290 0.321 0.748
##
## Residual standard error: 0.431 on 5224 degrees of freedom
## Multiple R-squared: 1.978e-05, Adjusted R-squared: -0.0001716
## F-statistic: 0.1034 on 1 and 5224 DF, p-value: 0.7479
Doesn’t look like there are significant findings!
Does race have an effect on whether someone is released?
reg5<- lm(released_dummy~colour_dummy, data = arrest_data)
summary(reg5)
##
## Call:
## lm(formula = released_dummy ~ colour_dummy, data = arrest_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.8580 0.1419 0.1419 0.1419 0.2585
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.858050 0.005943 144.38 <2e-16 ***
## colour_dummy -0.116590 0.011971 -9.74 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3729 on 5224 degrees of freedom
## Multiple R-squared: 0.01783, Adjusted R-squared: 0.01765
## F-statistic: 94.86 on 1 and 5224 DF, p-value: < 2.2e-16
Looks like there IS a significant relationship. Remember, when the outcome is a binary variable, the coefficient is a change in probability. So an arrestee who is Black is about 11.7 percentage points less likely to be released with a summons (roughly 74% versus 86% for White arrestees).
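To see why the coefficient reads as a difference in probabilities, it can help to check a toy case by hand. In the sketch below the data are made up (not the Arrests data): with a 0/1 outcome and a single 0/1 regressor, the intercept should equal the share of 1s in the group coded 0, and the slope should equal the difference in shares between the two groups.

```r
# Made-up toy data: a 0/1 outcome and a 0/1 group indicator
toy <- data.frame(
  released_dummy = c(1, 1, 1, 0, 1, 0, 0, 1),
  colour_dummy   = c(0, 0, 0, 0, 1, 1, 1, 1)
)

toy_reg <- lm(released_dummy ~ colour_dummy, data = toy)

# Share released within each group, computed directly
share_group0 <- mean(toy$released_dummy[toy$colour_dummy == 0])
share_group1 <- mean(toy$released_dummy[toy$colour_dummy == 1])

coef(toy_reg)                                  # intercept and slope from lm()
c(share_group0, share_group1 - share_group0)   # the same two numbers by hand
```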