—————I Put all my text answers in Italics—————

This problem set is due Friday, September 15th at midnight. You may submit answers in any reasonable format, but knitting the rmarkdown (.rmd) file is probably the easiest method. You should show both output and code where appropriate, and make sure to answer ‘essay’ style questions as well. . Problem sets will involve a number of questions unrelated to coding and are worth a large number of points, so if you can’t get things to run properly you should still attempt these questions and submit a partial assignment at least. If working off of the .Rmd file you’ll want to change eval=FALSE to eval=TRUE for these code blocks to calculate everything in the document

Introduction

This is an rmarkdown document, which combines text with r code. Text can be formatted with markdown syntax, e.g. double asterisks for bold or hashtags as above for titles. When the document is knit (towards the top of RStudio) it’ll render into a PDF file which you may be reading now (in which case you won’t see any of the markup syntax). When submitting answers it’ll likely be convenient to use your own rmarkdown file so that you can interlace code with output, but you can also just run the code in and attach output separately (e.g. from screenshots). Also note that you need a TeX installation to knit to PDF (see the R walkthrough for installing this), but you should be able to knit to HTML by clicking the arrow next to the knit button.

You can write formulas/latex using dollar signs directly in markdown, e.g. \(lim_{n \to \infty}(\sum_i^n (x-\mu_x)^n)^{\frac{1}{n}}=max(x-\mu_x)\) This should have all of the major syntax you need for any math (the formula itself is unrelated to anything in this class, but it’s a cool result from topology).

R code is written in blocks using triple backticks. The code and the output can be either shown in the knit document or hidden based on your preference. There are a large number of formatting options available, but you can just use the default ones for assignments

1+1
## [1] 2

Question 0

{Run the following code and put the output somewhere in your answer (whether by rendering in rmarkdown and setting eval=TRUE, taking a screenshot, or pasting the result)}. If using stata or python you can skip this question.

set.seed(Sys.time())
runif(1)
## [1] 0.1284636

Question 1: Summary statistics

Students generally have issues initially with reading in data due to confusion surrounding working directories, so we’ll start with a built-in dataset in R. USArrests provides statistics on violent arrests by state in 1973. We’re interested in the relationship between urban population (as a percent of total population) and the assault rate (arrests per 100,000). While not necessary, we can first assign the dataset to a variable which makes it visible to the environment

df <- USArrests

You can see which variables we have to work with by using the names function

names(df)
## [1] "Murder"   "Assault"  "UrbanPop" "Rape"

To access a specific column, you can use the $ operator:

df$Assault
##  [1] 236 263 294 190 276 204 110 238 335 211  46 120 249 113  56 115 109 249  83
## [20] 300 149 255  72 259 178 109 102 252  57 159 285 254 337  45 120 151 159 106
## [39] 174 279  86 188 201 120  48 156 145  81  53 161

We can also apply functions to a column, e.g.

summary(df$Assault)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    45.0   109.0   159.0   170.8   249.0   337.0

{Using the functions mean and sd, calculate the mean and standard deviation of the assault rate and percent of population that is urban (this will look just like the code for “summary” above, but using mean and sd). Can you meaningfully compare these two columns using mean and sd?}

#Put your code answers here
mean(df$Assault)
## [1] 170.76
mean(df$UrbanPop)
## [1] 65.54
sd(df$Assault)
## [1] 83.33766
sd(df$UrbanPop)
## [1] 14.47476

You are able to calculate the mean and standard deviation of each column, however comparing them to each other using just these commands won’t produce many meaningful results.

Question 2 : Basic Conditionals

Sometimes we are interested in averages for only a certain subpopulation. Filtering in base R can be a bit cumbersome, so we’ll make use of a package called dplyr which tends to be quite intuitive, along with magrittr which includes the pipe operator which makes our code ‘flat’ instead of nested. If you have not installed dplyr before, you’ll need to run install.packages to download and install it (this only ever needs to be run once)

install.packages('dplyr')
install.packages('magrittr')

Once installed, every time you load up an R session you will need to load the library in order to call functions from the package (or you can use the scoping operator ::)

library(dplyr)
library(magrittr)

Suppose we are interested in the average assault rate for states with below average urban populations. We can run

df %>%
  filter(UrbanPop < mean(UrbanPop)) %>%
  summarise(mean(Assault))

We can think of this code as a list of commands that are done in sequence, i.e. we start with our full dataset, filter to only those with below average urban populations, and then summarize by finding the mean assault rate (there’s also a group_by function that is incredibly useful that will be used in the next problem set when we have a much larger dataset)

We have a lot of flexibility in how we make comparisons, for instance we can also look at the just the state with the lowest urban population rate:

df %>%
  filter(UrbanPop == min(UrbanPop))

Some other functions that can come in handy include quantile (i.e. percentile), sort, %in% (e.g. state %in% c(“IL”,“IN”)), and match and findInterval (use ?findInterval to check out the documentation)

{Modify the above code and place below to calculate the average assault rate for states with above average urban population, and indicate which state has the highest urban population rate}

#put your answer here
df %>%
  filter(UrbanPop > mean(UrbanPop)) %>%
  summarise(mean(Assault))
df %>%
  filter(UrbanPop == max(UrbanPop)) %>%
  summarise(mean(UrbanPop))

The state with the highest urban population rate is California with 91, their Assault rate is 276

Question 3 : Basic Distributions (univariate)

{Using the hist function create a histogram of assault rate. Does it look normally distributed?}

It does not look normally distributed

hist(df$Assault)

Question 4 : Basic Scatterplot

{Using the plot command, display a basic scatterplot of Urban population vs assault}

plot(df$UrbanPop,df$Assault)

Question 5 : Running a Regression

Functions can also accept two or more arguments. Use the following code to calculate the correlation between the assault rate and urban population

cor(df$Assault,df$UrbanPop)
## [1] 0.2588717

{What is the relationship between assault and urban population? How strong is the relationship?}

The relationship between assault and urban population is slightly positive, meaning it is fairly weakly related.

We can also run a regression to obtain the association between assault and urban population. The code to use is below, which will give the regression for \(Assault_i = \beta_0 + \beta_1 UrbanPop_i + \varepsilon_i\), where i indexes state. The estimate column contains the coefficients of interest, while the Std Error, t value, and p value columns contain information related to hypothesis testing that we won’t use for this problem set.

options(scipen=999) #this effectively turns off scientific notation
model <- lm(data=df, Assault ~ UrbanPop)
summary(model)
## 
## Call:
## lm(formula = Assault ~ UrbanPop, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -150.78  -61.85  -18.68   58.05  196.85 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  73.0766    53.8508   1.357   0.1811  
## UrbanPop      1.4904     0.8027   1.857   0.0695 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 81.33 on 48 degrees of freedom
## Multiple R-squared:  0.06701,    Adjusted R-squared:  0.04758 
## F-statistic: 3.448 on 1 and 48 DF,  p-value: 0.06948

{What is \(\hat\beta_0,\hat\beta_1\) for this model? How would you interpret these results (i.e. what does it mean for UrbanPop to increase by 1)? Do these results agree with the correlation that you calculated earlier?}

\(\hat\beta_0\) is equal to 73.0766 and \(\hat\beta_1\) is equal to 1.4904, an 1 unit increase in Urban Population increases Assault by an average of 1.49, starting at 73. This seemingly matches the results from earlier since they are both positive.

Question 6 : Scoring a Model

You want to see how close of a prediction your linear model gives. Let’s calculate the predicted value and residual for the 4th observation in the data, ie \(\hat{Assault_4}, \hat{\varepsilon_4}\). You can view the fourth row of your data using by subsetting with [brackets] (normally you use df[row,column], but to retrieve all columns you can use df[row,])

df[4,]

{Calculate these residuals using the estimating equation and obtained coefficients from part 5 (just type the rounded values into the regression equation manually and calculate). Then, use the builtin predict function to obtain these same variables to verify that they give the same results}. The code for this is below:

\(\hat{Assault_4}\) = 73 + 1.49(50) + \(\hat{\varepsilon_4}\) = 147.5 + \(\hat{\varepsilon_4}\)

Since Actual assault is 190,

190 - 147.5 = \(\hat{\varepsilon_4}\)

\(\hat{\varepsilon_4}\) = 42.5

\(\hat{Assault_4}\) = 147.5

df <- df %>%
  mutate(predicted=predict(model,df)) %>%
  mutate(residual=Assault-predicted)
df[4,]

Question 7: Error Term

Your predicted assault rate is generally very different from your actual assault rate. {Name 2 factors in the error term of this model. (i.e. specific things that affect assault rates that aren’t included in your model)}

A few things that might affect assault rates that are not included in this model include average or mean household salary, demographics, education levels, or state/municipal political party.

Question 8: Bias

Your regression predicts that higher rates of urban population are associated with higher rates of assault convictions. {Can you interpret these results causally? Why or why not?}

You are not able to interpret these results causally because the relationship is not strong enough to conduct any meaningful result. On top of this, there might be other reasons (such as the ones listed above) for the correlation that are not included in the model.

This data was taken in the 1970s, and violent crime rates peaked in the 80’s before steadily declining over time, where it now sits at a level below 1973. Over the same period, rates of urbanization have continued to increase in the United States. {Explain why the regression results deviate from the actual trends that have developed since the 1970s}

Since there are many other factors contributing to violent crime rates outside of just Urban Population, it is easy to see how the trends in this data do not hold up over time.

Extra credit 1

{Redo the scatterplot from question 4, but add in your line of best fit from the regression and make it look nice using ggplot2. You can find ggplot code in the the R walkthrough, or you can find code on StackExchange}

options(scipen=999) 
model <- lm(data=df, Assault ~ UrbanPop)
summary(model)
## 
## Call:
## lm(formula = Assault ~ UrbanPop, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -150.78  -61.85  -18.68   58.05  196.85 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  73.0766    53.8508   1.357   0.1811  
## UrbanPop      1.4904     0.8027   1.857   0.0695 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 81.33 on 48 degrees of freedom
## Multiple R-squared:  0.06701,    Adjusted R-squared:  0.04758 
## F-statistic: 3.448 on 1 and 48 DF,  p-value: 0.06948
plot(df$UrbanPop,df$Assault)
library(ggplot2)

ggplot() + geom_point(aes(df$UrbanPop,df$Assault)) + 
  geom_smooth(aes(df$UrbanPop,df$Assault), method="lm", se=F)

Extra Credit 2

{Read in the SIPPsmall.csv dataset located on Blackboard into R using the following code} (unless your file structure is exactly identical to mine you’ll need to change the argument to read.csv or set the working directory to the proper location). Also make sure eval is set to TRUE if rendering in rmarkdown: you should display the names of the columns and the number of rows in the dataset

df <- read.csv("SIPPsmall.csv")
names(df)
## [1] "hhsize"          "nchildren"       "EarnedIncome"    "FoodStampAmount"
## [5] "state"
nrow(df)
## [1] 181986