421: Lab 1

#GOALS for today

Review Math Operations
Review dplyr
Reading Data
Regressions and Interpretation

##Math Operations

vector1<- c(1:10)
mean(vector1)

## [1] 5.5

sd(vector1)

## [1] 3.02765

median(vector1)

## [1] 5.5

summary(vector1)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    3.25    5.50    5.50    7.75   10.00

Self Check: create a vector and find its variance!

vector2<- c(1,3,5,7,9)
sd(vector2)^2

## [1] 10
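Base R also has a built-in var() function, so squaring the standard deviation is not the only route. A quick sketch of the check, reusing vector2 from above:

```r
vector2 <- c(1, 3, 5, 7, 9)

# var() computes the sample variance (it divides by n - 1), the same quantity as sd()^2
var(vector2)

# Confirm the two approaches agree (all.equal() allows for tiny rounding differences)
all.equal(var(vector2), sd(vector2)^2)
```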

##The dplyr package:

Recall our workflow from last time:

* Load the ‘pacman’ package
* Load the ‘tidyverse’ package

library(pacman)
p_load(tidyverse)

Alternatively, we can simply use:

library(tidyverse)

The p_load function becomes more helpful the more packages you want to use, so it’s good practice to get used to this workflow sooner rather than later.

There are a ton of useful functions in dplyr, but the following are a good place to start:

select(): subset columns
filter(): subset rows on conditions
arrange(): sort results
mutate(): create new columns by using information from other columns
group_by() and summarize(): create summary statistics on grouped data
count(): count discrete values

We are going to use a dataset that is built into the tidyverse (it ships with dplyr). The dataset is called starwars. Let’s give it a name so we can work with it:

our_data <- starwars

We can view the data frame by typing view(our_data) or by clicking its name in the global environment. Note that the tidyverse function is lowercase view(), while the base R function is uppercase View().

view(our_data)

We can also look at names of variables without looking at the entire dataset:

names(our_data)

##  [1] "name"       "height"     "mass"       "hair_color" "skin_color"
##  [6] "eye_color"  "birth_year" "gender"     "homeworld"  "species"   
## [11] "films"      "vehicles"   "starships"

To select all columns except certain ones, put a minus sign in front of each column you want to drop:

select(our_data, c(-starships, -vehicles))

## # A tibble: 87 x 11
##    name  height  mass hair_color skin_color eye_color birth_year gender
##    <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> 
##  1 Luke…    172    77 blond      fair       blue            19   male  
##  2 C-3PO    167    75 <NA>       gold       yellow         112   <NA>  
##  3 R2-D2     96    32 <NA>       white, bl… red             33   <NA>  
##  4 Dart…    202   136 none       white      yellow          41.9 male  
##  5 Leia…    150    49 brown      light      brown           19   female
##  6 Owen…    178   120 brown, gr… light      blue            52   male  
##  7 Beru…    165    75 brown      light      blue            47   female
##  8 R5-D4     97    32 <NA>       white, red red             NA   <NA>  
##  9 Bigg…    183    84 black      light      brown           24   male  
## 10 Obi-…    182    77 auburn, w… fair       blue-gray       57   male  
## # … with 77 more rows, and 3 more variables: homeworld <chr>, species <chr>,
## #   films <list>

Let’s do a few filtering examples. To filter the data frame to include only droids:

filter(our_data, species == "Droid")

## # A tibble: 5 x 13
##   name  height  mass hair_color skin_color eye_color birth_year gender homeworld
##   <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>  <chr>    
## 1 C-3PO    167    75 <NA>       gold       yellow           112 <NA>   Tatooine 
## 2 R2-D2     96    32 <NA>       white, bl… red               33 <NA>   Naboo    
## 3 R5-D4     97    32 <NA>       white, red red               NA <NA>   Tatooine 
## 4 IG-88    200   140 none       metal      red               15 none   <NA>     
## 5 BB8       NA    NA none       none       black             NA none   <NA>     
## # … with 4 more variables: species <chr>, films <list>, vehicles <list>,
## #   starships <list>

Filter the data frame to include droids OR humans:

filter(our_data, species == "Droid" | species == "Human")

## # A tibble: 40 x 13
##    name  height  mass hair_color skin_color eye_color birth_year gender
##    <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> 
##  1 Luke…    172    77 blond      fair       blue            19   male  
##  2 C-3PO    167    75 <NA>       gold       yellow         112   <NA>  
##  3 R2-D2     96    32 <NA>       white, bl… red             33   <NA>  
##  4 Dart…    202   136 none       white      yellow          41.9 male  
##  5 Leia…    150    49 brown      light      brown           19   female
##  6 Owen…    178   120 brown, gr… light      blue            52   male  
##  7 Beru…    165    75 brown      light      blue            47   female
##  8 R5-D4     97    32 <NA>       white, red red             NA   <NA>  
##  9 Bigg…    183    84 black      light      brown           24   male  
## 10 Obi-…    182    77 auburn, w… fair       blue-gray       57   male  
## # … with 30 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## #   films <list>, vehicles <list>, starships <list>

Filter the data frame to include characters taller than 100 cm AND with a mass over 100:

filter(our_data, height > 100 & mass > 100)

## # A tibble: 10 x 13
##    name  height  mass hair_color skin_color eye_color birth_year gender
##    <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> 
##  1 Dart…    202   136 none       white      yellow          41.9 male  
##  2 Owen…    178   120 brown, gr… light      blue            52   male  
##  3 Chew…    228   112 brown      unknown    blue           200   male  
##  4 Jabb…    175  1358 <NA>       green-tan… orange         600   herma…
##  5 Jek …    180   110 brown      fair       blue            NA   male  
##  6 IG-88    200   140 none       metal      red             15   none  
##  7 Bossk    190   113 none       green      red             53   male  
##  8 Dext…    198   102 none       brown      yellow          NA   male  
##  9 Grie…    216   159 none       brown, wh… green, y…       NA   male  
## 10 Tarf…    234   136 brown      brown      blue            NA   male  
## # … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

How about some piping! What if we want to do those things all in one step? We can chain functions together with %>%. The pipe passes the result on its left-hand side (LHS) to the function on its right-hand side (RHS), so the code reads left to right, like a book. Let’s make a new data frame where we select the name, height, and mass, then filter out those who are shorter than 100 cm:

new_df <- our_data %>% select(name, height, mass) %>% filter(height >= 100)
new_df

## # A tibble: 74 x 3
##    name               height  mass
##    <chr>               <int> <dbl>
##  1 Luke Skywalker        172    77
##  2 C-3PO                 167    75
##  3 Darth Vader           202   136
##  4 Leia Organa           150    49
##  5 Owen Lars             178   120
##  6 Beru Whitesun lars    165    75
##  7 Biggs Darklighter     183    84
##  8 Obi-Wan Kenobi        182    77
##  9 Anakin Skywalker      188    84
## 10 Wilhuff Tarkin        180    NA
## # … with 64 more rows
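To see what the pipe buys us, here is a small sketch comparing the same operation written as nested function calls and as a pipeline. The object names nested and piped are made up for this illustration, and the sketch uses starwars directly so it runs on its own:

```r
library(dplyr)

# Nested version: has to be read from the inside out
nested <- filter(select(starwars, name, height, mass), height >= 100)

# Piped version: reads left to right, one step at a time
piped <- starwars %>%
  select(name, height, mass) %>%
  filter(height >= 100)

# Check that both approaches return the same data frame
identical(nested, piped)
```

Both versions do the same work; the pipe just makes the order of the steps easier to follow.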

Self check: make a new data frame that keeps all columns except gender and includes only characters whose ONLY film is “A New Hope”:

example_df <- our_data %>% select(-gender) %>% filter(films == "A New Hope")

view(example_df)
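One thing to keep in mind: films is a list column, so each character has a whole vector of film titles rather than a single value. The comparison above leans on how R compares a list to a string, which is easy to misread. A more explicit sketch, assuming each entry of films is a character vector of titles (example_df2 is a made-up name for this illustration):

```r
library(dplyr)

example_df2 <- starwars %>%
  select(-gender) %>%
  filter(
    lengths(films) == 1,  # the character appears in exactly one film
    vapply(films, function(f) "A New Hope" %in% f, logical(1))  # and that film is A New Hope
  )

example_df2$name
```

Dropping the first condition would instead keep every character who appears in “A New Hope”, whether or not they appear in other films too.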

Let’s do some work with arrange(). Let’s arrange all of the characters by their height

our_data %>% arrange(height)

## # A tibble: 87 x 13
##    name  height  mass hair_color skin_color eye_color birth_year gender
##    <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> 
##  1 Yoda      66    17 white      green      brown            896 male  
##  2 Ratt…     79    15 none       grey, blue unknown           NA male  
##  3 Wick…     88    20 brown      brown      brown              8 male  
##  4 Dud …     94    45 none       blue, grey yellow            NA male  
##  5 R2-D2     96    32 <NA>       white, bl… red               33 <NA>  
##  6 R4-P…     96    NA none       silver, r… red, blue         NA female
##  7 R5-D4     97    32 <NA>       white, red red               NA <NA>  
##  8 Sebu…    112    40 none       grey, red  orange            NA male  
##  9 Gasg…    122    NA none       white, bl… black             NA male  
## 10 Watto    137    NA black      blue, grey yellow            NA male  
## # … with 77 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## #   films <list>, vehicles <list>, starships <list>

Notice that arrange() sorts from lowest to highest by default. We can sort the other way by wrapping the column in desc():

our_data %>% arrange(desc(height))

## # A tibble: 87 x 13
##    name  height  mass hair_color skin_color eye_color birth_year gender
##    <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> 
##  1 Yara…    264    NA none       white      yellow          NA   male  
##  2 Tarf…    234   136 brown      brown      blue            NA   male  
##  3 Lama…    229    88 none       grey       black           NA   male  
##  4 Chew…    228   112 brown      unknown    blue           200   male  
##  5 Roos…    224    82 none       grey       orange          NA   male  
##  6 Grie…    216   159 none       brown, wh… green, y…       NA   male  
##  7 Taun…    213    NA none       grey       black           NA   female
##  8 Rugo…    206    NA none       green      orange          NA   male  
##  9 Tion…    206    80 none       grey       black           NA   male  
## 10 Dart…    202   136 none       white      yellow          41.9 male  
## # … with 77 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## #   films <list>, vehicles <list>, starships <list>

Self check: Arrange the characters’ names in alphabetical order

our_data %>% arrange(name)

## # A tibble: 87 x 13
##    name  height  mass hair_color skin_color eye_color birth_year gender
##    <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> 
##  1 Ackb…    180    83 none       brown mot… orange          41   male  
##  2 Adi …    184    50 none       dark       blue            NA   female
##  3 Anak…    188    84 blond      fair       blue            41.9 male  
##  4 Arve…     NA    NA brown      fair       brown           NA   male  
##  5 Ayla…    178    55 none       blue       hazel           48   female
##  6 Bail…    191    NA black      tan        brown           67   male  
##  7 Barr…    166    50 black      yellow     blue            40   female
##  8 BB8       NA    NA none       none       black           NA   none  
##  9 Ben …    163    65 none       grey, gre… orange          NA   male  
## 10 Beru…    165    75 brown      light      blue            47   female
## # … with 77 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## #   films <list>, vehicles <list>, starships <list>

mutate() creates new variables. Let’s create a new variable that measures age in dog years. First we need to create an age variable. I am going to assume it’s year 400 in the Star Wars universe (correct me if I am wrong):

our_data<-our_data %>% mutate(age = 400-birth_year)

Next I will find the characters’ ages in dog years (scale by 7–it’s SCIENCE):

our_data<-our_data%>%mutate(dog_years_age=age*7)

view(our_data)

We could also do this in one step:

our_data<-our_data %>% mutate(age = 400-birth_year)%>%
  mutate(dog_years_age=age*7)
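A single mutate() call can also create both columns, because later expressions inside the call can use columns defined earlier in the same call. A minimal sketch, starting from a fresh copy of starwars so it runs on its own:

```r
library(dplyr)

our_data <- starwars %>%
  mutate(
    age = 400 - birth_year,   # same assumed "current year" of 400 as above
    dog_years_age = age * 7   # can use age because it was defined on the line above
  )

# Peek at the columns involved
our_data %>%
  select(name, birth_year, age, dog_years_age) %>%
  head()
```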

Self check: Create a new variable that is the sum of a character’s mass and height

our_data %>% mutate(total = height + mass)

## # A tibble: 87 x 16
##    name  height  mass hair_color skin_color eye_color birth_year gender
##    <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> 
##  1 Luke…    172    77 blond      fair       blue            19   male  
##  2 C-3PO    167    75 <NA>       gold       yellow         112   <NA>  
##  3 R2-D2     96    32 <NA>       white, bl… red             33   <NA>  
##  4 Dart…    202   136 none       white      yellow          41.9 male  
##  5 Leia…    150    49 brown      light      brown           19   female
##  6 Owen…    178   120 brown, gr… light      blue            52   male  
##  7 Beru…    165    75 brown      light      blue            47   female
##  8 R5-D4     97    32 <NA>       white, red red             NA   <NA>  
##  9 Bigg…    183    84 black      light      brown           24   male  
## 10 Obi-…    182    77 auburn, w… fair       blue-gray       57   male  
## # … with 77 more rows, and 8 more variables: homeworld <chr>, species <chr>,
## #   films <list>, vehicles <list>, starships <list>, age <dbl>,
## #   dog_years_age <dbl>, total <dbl>

group_by() and summarize() group the data and compute summary statistics for each group. Let’s find the average height for each species:

our_data %>% group_by(species) %>% summarize(avg_height = mean(height))

## # A tibble: 38 x 2
##    species   avg_height
##    <chr>          <dbl>
##  1 Aleena           79 
##  2 Besalisk        198 
##  3 Cerean          198 
##  4 Chagrian        196 
##  5 Clawdite        168 
##  6 Droid            NA 
##  7 Dug             112 
##  8 Ewok             88 
##  9 Geonosian       183 
## 10 Gungan          209.
## # … with 28 more rows

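Notice we have NA’s! mean() returns NA whenever a group contains a missing height. One way to get rid of them is na.omit(), but be aware that it drops every row that has an NA in ANY column, which is why only 11 species remain: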

our_data %>% na.omit() %>% group_by(species) %>% summarize(avg_height = mean(height))

## # A tibble: 11 x 2
##    species      avg_height
##    <chr>             <dbl>
##  1 Cerean              198
##  2 Ewok                 88
##  3 Gungan              196
##  4 Human               178
##  5 Kel Dor             188
##  6 Mirialan            168
##  7 Mon Calamari        180
##  8 Trandoshan          190
##  9 Twi'lek             178
## 10 Wookiee             228
## 11 Zabrak              175
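If the goal is only to ignore missing heights, a gentler option is the na.rm argument of mean(), which skips NA values in height without throwing away rows that are missing something else. A sketch of that alternative (avg_by_species is a made-up name for this illustration):

```r
library(dplyr)

avg_by_species <- starwars %>%
  group_by(species) %>%
  summarize(avg_height = mean(height, na.rm = TRUE))  # ignore missing heights only

avg_by_species
```

With this version every species stays in the table. A species with no recorded heights at all would show NaN, since there is nothing to average.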

Count…well it counts:

our_data %>% count(species)

## # A tibble: 38 x 2
##    species       n
##    <chr>     <int>
##  1 Aleena        1
##  2 Besalisk      1
##  3 Cerean        1
##  4 Chagrian      1
##  5 Clawdite      1
##  6 Droid         5
##  7 Dug           1
##  8 Ewok          1
##  9 Geonosian     1
## 10 Gungan        3
## # … with 28 more rows

Remember: If you want more info on a function type ?name_of_function!

##OLS Regression

To do a regression in R, we use lm(). The basic setup: name <- lm(y ~ x, data = name_of_df). Let’s regress height on mass:

reg1 <- lm(height ~ mass, data = our_data)

How do we look at the regression output?

summary(reg1)

## 
## Call:
## lm(formula = height ~ mass, data = our_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -105.763   -5.610    6.385   18.202   58.897 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 171.28536    5.34340   32.05   <2e-16 ***
## mass          0.02807    0.02752    1.02    0.312    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 35.52 on 57 degrees of freedom
##   (28 observations deleted due to missingness)
## Multiple R-squared:  0.01792,    Adjusted R-squared:  0.0006956 
## F-statistic:  1.04 on 1 and 57 DF,  p-value: 0.312

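Hmm, it looks like there isn’t a significant relationship between mass and height. Let’s filter out Jabba the Hutt because he is a large boi (his mass of 1358 is an extreme outlier). Note that filter(species != "Hutt") also drops characters whose species is NA, because the comparison evaluates to NA for them: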

reg2 <- lm(height ~ mass, data = our_data %>% filter(species != "Hutt"))

summary(reg2)

## 
## Call:
## lm(formula = height ~ mass, data = our_data %>% filter(species != 
##     "Hutt"))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -51.821  -6.273   2.327  14.078  45.728 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 101.6706     8.6593  11.741  < 2e-16 ***
## mass          0.9500     0.1064   8.931 2.72e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.3 on 55 degrees of freedom
##   (24 observations deleted due to missingness)
## Multiple R-squared:  0.5919, Adjusted R-squared:  0.5845 
## F-statistic: 79.77 on 1 and 55 DF,  p-value: 2.72e-12
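To interpret this output: the slope says each additional unit of mass (kg) is associated with about 0.95 cm more height, and the intercept is the predicted height at a mass of zero. For a character weighing 80 kg, the fitted line above gives 101.6706 + 0.9500 × 80 ≈ 177.67 cm. The sketch below pulls the coefficients out with coef() and checks the by-hand prediction against predict(). The 80 kg value and the names no_hutt, by_hand, and by_predict are made up for illustration, and the exact numbers may differ slightly if your version of the starwars data differs:

```r
library(dplyr)

no_hutt <- starwars %>% filter(species != "Hutt")
reg2 <- lm(height ~ mass, data = no_hutt)

# Named vector with two entries: "(Intercept)" and "mass"
coef(reg2)

# Predicted height (cm) for a hypothetical 80 kg character, computed two ways
by_hand <- coef(reg2)[["(Intercept)"]] + coef(reg2)[["mass"]] * 80
by_predict <- predict(reg2, newdata = data.frame(mass = 80))

# The two predictions should agree
all.equal(by_hand, unname(by_predict))
```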

##Before Reading in data

Before we learn to read data into R, it would be helpful to know how to tell R where it is. This is important because later in the class we need to load files from our local machine. Eventually, we want to start using our own data instead of data contained in a package.

Note: On Windows, R accepts / as the directory separator. You can also use a backslash, but inside an R string it must be doubled (\\).
Find your current directory: getwd(). “wd” stands for working directory.
Change directories: setwd()
Show files in the current directory: dir()
Notice that directories are characters (surrounded by "").
RStudio will help you complete file paths when you hit tab while typing in the console.
You can save an object with the directory location and then return to that directory using setwd().
Example: my_dir <- "Home/Folder1/Folder2", then setwd(my_dir)
Move up a level in the directory: setwd("..")
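
The commands above can be strung together as in the following sketch. It uses tempdir() only because that folder is guaranteed to exist on any machine; in practice my_dir would be a folder of your own, and start_dir is a made-up name for this illustration:

```r
# Remember where we started so we can come back at the end
start_dir <- getwd()

# Stand-in for your own folder path, e.g. "Home/Folder1/Folder2"
my_dir <- tempdir()

setwd(my_dir)     # move to my_dir
getwd()           # confirm where we are now
dir()             # list the files in the current directory

setwd("..")       # move up one level
setwd(start_dir)  # return to the starting directory
```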

R can read in data from just about any source/format. Today we’re going to cover reading data saved in CSVs (comma-separated values).

First, we’ll load the tidyverse package, which will actually load several packages (we want readr). The base (basic) installation of R already has a function for reading CSVs, but the function in tidyverse (readr) is a bit nicer.

Remember, you can always get to the help files in R/RStudio using ?. Let’s check out the help file for read_csv.

?read_csv

##Getting data into R

Step 1: Download the data

- Download the csv
- Search Marijuana Data Vincentarelbundock on Google –> “Data Set”
- Ctrl + F or Command + F to search: Arrests for Marijuana Possession –> csv (the DOC option gives us a description of the data)
- or search for “Arrests for Marijuana Possession” at https://vincentarelbundock.github.io/Rdatasets/datasets.html
- Make sure your downloaded file is in a reasonable directory
- Navigate R to the reasonable directory
- Read the data, read_csv("../data/Arrests.csv")

Step 2: Read the data

Make sure your downloaded file is in a reasonable directory.
You can find the file’s filepath by either (Windows) right clicking on the file and looking in Properties, under Location
Or (Mac) right click on the file, hold down alt (Option), and select the copy as pathname option
Navigate R to the folder location using the setwd(my_dir) command, setting my_dir to the folder’s filepath
Read the data into R using read_csv("Arrests.csv")

It may be helpful to save the path as an object and then read the data in using that object:

my_path<-"/Users/garrettstanford/Downloads/Arrests.csv"

read_csv(my_path)

## Warning: Missing column names filled in: 'X1' [1]

## Parsed with column specification:
## cols(
##   X1 = col_double(),
##   released = col_character(),
##   colour = col_character(),
##   year = col_double(),
##   age = col_double(),
##   sex = col_character(),
##   employed = col_character(),
##   citizen = col_character(),
##   checks = col_double()
## )

## # A tibble: 5,226 x 9
##       X1 released colour  year   age sex    employed citizen checks
##    <dbl> <chr>    <chr>  <dbl> <dbl> <chr>  <chr>    <chr>    <dbl>
##  1     1 Yes      White   2002    21 Male   Yes      Yes          3
##  2     2 No       Black   1999    17 Male   Yes      Yes          3
##  3     3 Yes      White   2000    24 Male   Yes      Yes          3
##  4     4 No       Black   2000    46 Male   Yes      Yes          1
##  5     5 Yes      Black   1999    27 Female Yes      Yes          1
##  6     6 Yes      Black   1998    16 Female Yes      Yes          0
##  7     7 Yes      White   1999    40 Male   No       Yes          0
##  8     8 Yes      White   1998    34 Female Yes      Yes          1
##  9     9 Yes      Black   2000    23 Male   Yes      Yes          4
## 10    10 Yes      White   2001    30 Male   Yes      Yes          3
## # … with 5,216 more rows

Alternatively, we could pass the path directly to read_csv():

read_csv("/Users/garrettstanford/Downloads/Arrests.csv")

## Warning: Missing column names filled in: 'X1' [1]

## Parsed with column specification:
## cols(
##   X1 = col_double(),
##   released = col_character(),
##   colour = col_character(),
##   year = col_double(),
##   age = col_double(),
##   sex = col_character(),
##   employed = col_character(),
##   citizen = col_character(),
##   checks = col_double()
## )

## # A tibble: 5,226 x 9
##       X1 released colour  year   age sex    employed citizen checks
##    <dbl> <chr>    <chr>  <dbl> <dbl> <chr>  <chr>    <chr>    <dbl>
##  1     1 Yes      White   2002    21 Male   Yes      Yes          3
##  2     2 No       Black   1999    17 Male   Yes      Yes          3
##  3     3 Yes      White   2000    24 Male   Yes      Yes          3
##  4     4 No       Black   2000    46 Male   Yes      Yes          1
##  5     5 Yes      Black   1999    27 Female Yes      Yes          1
##  6     6 Yes      Black   1998    16 Female Yes      Yes          0
##  7     7 Yes      White   1999    40 Male   No       Yes          0
##  8     8 Yes      White   1998    34 Female Yes      Yes          1
##  9     9 Yes      Black   2000    23 Male   Yes      Yes          4
## 10    10 Yes      White   2001    30 Male   Yes      Yes          3
## # … with 5,216 more rows

Notice that we read the data, but it just printed to screen. We want to assign the data to an object (give it a name).

arrest_data<-read_csv(my_path)

## Warning: Missing column names filled in: 'X1' [1]

## Parsed with column specification:
## cols(
##   X1 = col_double(),
##   released = col_character(),
##   colour = col_character(),
##   year = col_double(),
##   age = col_double(),
##   sex = col_character(),
##   employed = col_character(),
##   citizen = col_character(),
##   checks = col_double()
## )
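About that warning: the first column in the CSV has no header, so readr fills in a placeholder name. Judging by the values (1, 2, 3, ...), it is just a row number. The sketch below reproduces the situation with a tiny made-up file and then drops the column by position. All names here (toy, toy_path, toy_in, toy_clean) are hypothetical, and depending on your readr version the placeholder name may not be X1 (newer versions may use something like ...1):

```r
library(readr)
library(dplyr)

# A tiny made-up data frame, written to a temporary CSV
toy <- data.frame(age = c(21, 17, 24), checks = c(3, 3, 3))
toy_path <- tempfile(fileext = ".csv")
write.csv(toy, toy_path)  # row.names = TRUE by default, which creates an unnamed first column

# readr gives the unnamed column a placeholder name
# (the warning/message is silenced here to keep the output short)
toy_in <- suppressWarnings(suppressMessages(read_csv(toy_path)))
names(toy_in)

# Drop the first column by position, so the code doesn't depend on the placeholder name
toy_clean <- toy_in %>% select(-1)
toy_clean
```

The same select(-1) step could be applied to arrest_data if the row-number column gets in the way.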

Here are some commands for getting a snapshot of a dataset: head, tail, summary, table, plot

#head(arrest_data,15);
#tail(arrest_data, 10);
#head(arrest_data, 25)%>% tail(10)
summary(arrest_data)

##        X1         released            colour               year     
##  Min.   :   1   Length:5226        Length:5226        Min.   :1997  
##  1st Qu.:1307   Class :character   Class :character   1st Qu.:1998  
##  Median :2614   Mode  :character   Mode  :character   Median :2000  
##  Mean   :2614                                         Mean   :2000  
##  3rd Qu.:3920                                         3rd Qu.:2001  
##  Max.   :5226                                         Max.   :2002  
##       age            sex              employed           citizen         
##  Min.   :12.00   Length:5226        Length:5226        Length:5226       
##  1st Qu.:18.00   Class :character   Class :character   Class :character  
##  Median :21.00   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :23.85                                                           
##  3rd Qu.:27.00                                                           
##  Max.   :66.00                                                           
##      checks     
##  Min.   :0.000  
##  1st Qu.:0.000  
##  Median :1.000  
##  Mean   :1.636  
##  3rd Qu.:3.000  
##  Max.   :6.000

What is the mean age of Black arrestees? Of White arrestees?

arrest_data%>%group_by(colour) %>% 
  summarize(avg_age = mean(age))

## # A tibble: 2 x 2
##   colour avg_age
##   <chr>    <dbl>
## 1 Black     24.8
## 2 White     23.5

If we look at our dataset, some of the data is impossible to use in a regression in its current format. How does one regress on the value “Yes” or “Black”? Let’s use the ifelse function to make some numerical representations of these columns:

arrest_data<-arrest_data%>%mutate(gender_dummy=ifelse(sex=="Male", 1, 0)) 

arrest_data<-arrest_data%>%mutate(colour_dummy=ifelse(colour=="Black", 1, 0))

arrest_data<-arrest_data%>%mutate(released_dummy=ifelse(released=="Yes", 1, 0))

This function is super helpful, but if it’s over your head don’t worry too much about it. Alternatively you could try ?ifelse to learn more.
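
If it helps, here is ifelse() on a tiny made-up vector, outside of any data frame. It takes a test, the value to use where the test is TRUE, and the value to use where it is FALSE, and it works element by element:

```r
sex <- c("Male", "Female", "Male")

# ifelse(test, value_if_true, value_if_false)
gender_dummy <- ifelse(sex == "Male", 1, 0)
gender_dummy
## [1] 1 0 1

# Converting the TRUE/FALSE comparison to numbers gives the same 1/0 values
as.numeric(sex == "Male")
```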

Looking at the “DOC” file on the website which was right next to the “CSV” that you downloaded we can get a description of what each variable is. I see that “checks: Number of police data bases (of previous arrests, previous convictions, parole status, etc. – 6 in all) on which the arrestee’s name appeared; a numeric vector.”

So let’s see if age has an effect on how many checks someone has:

reg3<-lm(checks ~ age, data = arrest_data)
summary(reg3)

## 
## Call:
## lm(formula = checks ~ age, data = arrest_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.6403 -1.4903 -0.4403  1.3597  4.4847 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.040227   0.064080  16.233   <2e-16 ***
## age         0.025002   0.002537   9.853   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.525 on 5224 degrees of freedom
## Multiple R-squared:  0.01825,    Adjusted R-squared:  0.01806 
## F-statistic: 97.09 on 1 and 5224 DF,  p-value: < 2.2e-16

So being older makes you more of a criminal? Maybe, or maybe something else is going on…

Let’s look at whether the year predicts the probability that the arrestee is Black rather than White:

reg4<-lm(colour_dummy ~ year, data = arrest_data)

summary(reg4)

## 
## Call:
## lm(formula = colour_dummy ~ year, data = arrest_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.2499 -0.2471 -0.2458 -0.2430  0.7570 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.511065   8.577352  -0.293    0.770
## year         0.001379   0.004290   0.321    0.748
## 
## Residual standard error: 0.431 on 5224 degrees of freedom
## Multiple R-squared:  1.978e-05,  Adjusted R-squared:  -0.0001716 
## F-statistic: 0.1034 on 1 and 5224 DF,  p-value: 0.7479

Doesn’t look like there are significant findings!

Does race have an effect on whether someone is released?

reg5<- lm(released_dummy~colour_dummy, data = arrest_data)

summary(reg5)

## 
## Call:
## lm(formula = released_dummy ~ colour_dummy, data = arrest_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.8580  0.1419  0.1419  0.1419  0.2585 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.858050   0.005943  144.38   <2e-16 ***
## colour_dummy -0.116590   0.011971   -9.74   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3729 on 5224 degrees of freedom
## Multiple R-squared:  0.01783,    Adjusted R-squared:  0.01765 
## F-statistic: 94.86 on 1 and 5224 DF,  p-value: < 2.2e-16

Looks like there IS a significant relationship. Remember that when the outcome is a binary variable, the coefficient is a change in probability. So Black arrestees are about 11.7 percentage points less likely to be released with a summons (roughly 74% versus 86% for White arrestees).
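
Why can we read the coefficients this way? When a 0/1 outcome is regressed on a single 0/1 dummy, the intercept is the share of 1s in the dummy = 0 group and the slope is the difference in shares between the two groups. In reg5 that means 0.858050 for White arrestees and 0.858050 − 0.116590 = 0.741460 for Black arrestees. The sketch below checks the idea on a small made-up dataset (toy, toy_reg, and group_dummy are hypothetical names, and the numbers are invented for illustration):

```r
# Made-up data: 4 people with dummy = 0 and 4 people with dummy = 1
toy <- data.frame(
  group_dummy    = c(0, 0, 0, 0, 1, 1, 1, 1),
  released_dummy = c(1, 1, 1, 0, 1, 0, 0, 0)
)

toy_reg <- lm(released_dummy ~ group_dummy, data = toy)
coef(toy_reg)
# intercept = 0.75 (share released when dummy = 0), slope = -0.50

# The same numbers computed directly from the group means
share_0 <- mean(toy$released_dummy[toy$group_dummy == 0])  # 0.75
share_1 <- mean(toy$released_dummy[toy$group_dummy == 1])  # 0.25
share_1 - share_0                                          # -0.5
```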