Discussion 2

I’ve chosen two datasets: “USArrests” and “statepov”. USArrests is a cross-sectional dataset that reports, for each US state in 1973, arrests per 100,000 residents for murder, assault and rape, along with the percentage of the population living in urban areas. I want to look at the relationship between urban population and crime to understand: are cities more dangerous (at least in 1973, when these data were collected)? I also found some poverty data by state that I aligned with the arrest data. That dataset is also cross-sectional and, although it is from 2014, I thought it would still be interesting to view alongside the crime and urban data.

##Load necessary packages for analysis

library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(usmap)
library(psych)
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
arrests <- USArrests
poverty <- statepov
mapdata <- map_data("state")
## Make the row names into a column so left_join will work correctly, and make the join keys the same case (lower)
arrests$region <- tolower(rownames(arrests))
poverty$region <- tolower(poverty$full)
##Join datasets 
mapdata_us_cleaned <- left_join(mapdata,arrests, by = "region")
mapdata_us_cleaned <- right_join(mapdata_us_cleaned,poverty, by = "region")

##eliminate unnecessary columns
mapdata_us_cleaned <- mapdata_us_cleaned[,-10:-12]

## scatterplot with murder and urban population
## (built with ggplot so the plot can be stored and printed later; base plot() returns NULL)
Murder_Urban_scatter <- ggplot(arrests, aes(x = UrbanPop, y = Murder)) +
  geom_point() +
  labs(x = "% Urban Population", y = "Murder arrests per 100,000")

##Create state maps of murder rate, urban population rate and poverty rate
murdermap <- ggplot(mapdata_us_cleaned, aes( x = long, y = lat, group = group)) + 
  geom_polygon(aes(fill = Murder), color = "white") +
  labs(title = "Murder arrests per 100,000, 1973")

urbanmap <- ggplot(mapdata_us_cleaned, aes( x = long, y = lat, group = group)) + 
  geom_polygon(aes(fill = UrbanPop), color = "white") +
  labs(title = "% Urban Population by US state, 1973")

povertymap <- ggplot(mapdata_us_cleaned, aes( x = long, y = lat, group = group)) + 
  geom_polygon(aes(fill = pct_pov_2014), color = "white") +
  labs(title = "% of Population living in poverty by US state, 2014")

#Look at each visualization
Murder_Urban_scatter
murdermap

urbanmap

povertymap

#Maps/charts don't look very promising. Any correlation?
cor(arrests$Murder,arrests$UrbanPop)
## [1] 0.06957262

Looking just at the maps, there does appear to be some relationship between murder arrests and denser, more urban states. However, the scatterplot and the correlation coefficient (about 0.07) do not support that relationship, at least at the state level.
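One way to put a number on "do not support" is a significance test on the correlation. The sketch below uses cor.test() from base R's stats package. It is an illustration added for this discussion, not part of the original analysis, and the object name ct is just a placeholder.

```r
# Pearson correlation test between murder arrests and % urban population.
# USArrests ships with base R (datasets package).
arrests <- USArrests
ct <- cor.test(arrests$Murder, arrests$UrbanPop)

ct$estimate   # sample correlation, the same value cor() returns
ct$conf.int   # 95% confidence interval for the true correlation
ct$p.value    # p-value for the null hypothesis that the true correlation is 0
```

With a correlation this close to zero across only 50 states, the confidence interval should comfortably include zero, which is consistent with what the scatterplot suggests.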

Variance, covariance and simple linear regressions

Covariance is a measure of how much two variables change in unison. It shows the degree to which a change in one variable is related to a change in the other. If the covariance is positive, the variables tend to move in the same direction. If it is negative, they tend to move in opposite directions.

Variance is a measure of spread in the observations of a single variable. It is the average squared deviation of the observations from their mean, so it indicates how far the variable typically strays from its mean value.
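To make the two definitions concrete, here is a minimal sketch that computes the sample covariance and sample variance by hand and compares them with R's built-in cov() and var(). The names x, y, cov_manual and var_manual are introduced only for this illustration.

```r
x <- USArrests$UrbanPop
y <- USArrests$Murder
n <- length(x)

# Sample covariance: sum of the products of deviations from each mean,
# divided by n - 1
cov_manual <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)

# Sample variance: the same formula applied to x with itself
var_manual <- sum((x - mean(x))^2) / (n - 1)

all.equal(cov_manual, cov(x, y))   # TRUE
all.equal(var_manual, var(x))      # TRUE
```

Seen this way, variance is simply the covariance of a variable with itself.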

When fitting a line of best fit, it makes sense to factor in how much the two variables move together, which is what the covariance captures. So why also include the variance of X? Covariance is expressed in the product of the two variables' units, so its size depends on the scale of each. For example, if we wanted to measure how well unemployment predicts GDP historically in the US, we could look at the covariance of the two variables, but one is measured in percentage points while the other is in the trillions of dollars. Dividing cov(x, y) by var(x) cancels one factor of x's units and leaves a slope expressed in units of y per one-unit change in x.
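The scale argument can be checked directly. The sketch below re-expresses urban population as a proportion (0 to 1) instead of a percentage (0 to 100). The variable x_prop is hypothetical and exists only for this illustration.

```r
x <- USArrests$UrbanPop   # percent of population that is urban, 0-100
y <- USArrests$Murder     # murder arrests per 100,000
x_prop <- x / 100         # the same information as a proportion, 0-1

# Covariance depends on the units of x: it shrinks by a factor of 100
cov(x, y)
cov(x_prop, y)

# The slope cov/var is always "change in y per one unit of x", so it
# rescales consistently: one unit of x_prop equals 100 percentage points
slope_pct  <- cov(x, y) / var(x)
slope_prop <- cov(x_prop, y) / var(x_prop)
all.equal(slope_prop, 100 * slope_pct)   # TRUE
```

The covariance alone changes with an arbitrary choice of units, while the slope changes only in the way its interpretation requires.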

## use names that do not shadow the base functions cov() and var()
cov_xy <- cov(arrests$UrbanPop,arrests$Murder)
var_x <- var(arrests$UrbanPop)
beta1_manual <- cov_xy/var_x

beta1_formula <- lm(Murder ~ UrbanPop, data = arrests)

beta1_formula
## 
## Call:
## lm(formula = Murder ~ UrbanPop, data = arrests)
## 
## Coefficients:
## (Intercept)     UrbanPop  
##     6.41594      0.02093
beta1_manual
## [1] 0.02093466
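The intercept can be recovered by hand in the same spirit. This is a sketch under the standard simple-regression result that the fitted line passes through the point of means. The names beta0, beta1 and fit are placeholders for this example.

```r
x <- USArrests$UrbanPop
y <- USArrests$Murder

# Slope from covariance and variance, as above
beta1 <- cov(x, y) / var(x)

# Intercept: the fitted line passes through (mean(x), mean(y))
beta0 <- mean(y) - beta1 * mean(x)

# Compare both coefficients with lm()
fit <- lm(Murder ~ UrbanPop, data = USArrests)
all.equal(unname(coef(fit)), c(beta0, beta1))   # TRUE
```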
