CUNY SPS - Master of Science in Data Science

Discussion 11

Using R, build a regression model for data that interests you. Conduct residual analysis. Was the linear model appropriate? Why or why not?

For this week’s discussion, I used a dataset that I previously worked with in my DATA606 course last semester. The data describe college majors and include, for each major, information such as unemployment rate and median salary.

I will first load the data from FiveThirtyEight’s GitHub repository:

# load libraries
library(tidyverse)

## ── Attaching packages ────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──

## ✔ ggplot2 3.2.1     ✔ purrr   0.3.3
## ✔ tibble  2.1.3     ✔ dplyr   0.8.3
## ✔ tidyr   1.0.0     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0

## ── Conflicts ───────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

# load data
all_ages <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/all-ages.csv", header = TRUE)

# keep only the two columns used in the analysis
all <- select(all_ages, Unemployment_rate, Median)

We will take a look at the relationship between unemployment rate and median salary, with unemployment rate as the response and median salary as the predictor.

m_Median <- lm(Unemployment_rate ~ Median, data = all)
summary(m_Median)

## 
## Call:
## lm(formula = Unemployment_rate ~ Median, data = all)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.056889 -0.009985 -0.000074  0.009971  0.094140 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.973e-02  5.578e-03  14.292  < 2e-16 ***
## Median      -3.937e-07  9.507e-08  -4.142 5.41e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.01834 on 171 degrees of freedom
## Multiple R-squared:  0.09117,    Adjusted R-squared:  0.08586 
## F-statistic: 17.15 on 1 and 171 DF,  p-value: 5.406e-05

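# plot the response (unemployment rate) against the predictor (median salary),
# matching the direction of the fitted model
ggplot(all, aes(x = Median, y = Unemployment_rate)) +
  geom_point(color = 'blue') +
  geom_smooth(method = "lm", formula = y ~ x)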

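Median salary is a statistically significant predictor of unemployment rate: the p-value for its coefficient (5.41e-05) is well below 0.05. The negative slope indicates that majors with higher median salaries tend to have lower unemployment rates, and although it is not very apparent, the graph above shows the same negative relationship between these two variables. The fit is weak, however: the multiple R-squared of 0.09117 means the model explains only about 9% of the variation in unemployment rate.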
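To make the size of the slope easier to read, the arithmetic can be sketched directly from the printed estimates. This is only an illustration: it assumes `Median` is recorded in dollars and `Unemployment_rate` as a proportion, and the $10,000 step and $50,000 salary are hypothetical values chosen for the example.

```r
# Estimates copied from the summary(m_Median) output above
intercept <- 7.973e-02
slope     <- -3.937e-07
r_squared <- 0.09117

# Change in unemployment rate for a $10,000 increase in median salary
# (assumes Median is in dollars and Unemployment_rate is a proportion)
delta_per_10k <- slope * 10000
cat("Change per $10,000:", delta_per_10k * 100, "percentage points\n")

# Predicted unemployment rate at a hypothetical median salary of $50,000
pred_50k <- intercept + slope * 50000
cat("Predicted rate at $50,000:", pred_50k, "\n")

# In simple linear regression, the correlation is the signed
# square root of R-squared
r <- sign(slope) * sqrt(r_squared)
cat("Implied correlation:", round(r, 3), "\n")
```

Under these assumptions, a $10,000 difference in median salary corresponds to a drop of roughly 0.4 percentage points in unemployment rate, and the implied correlation of about -0.30 is consistent with the weak fit.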

Now let’s conduct a residual analysis to assess whether the linear model is reliable. We will need to check for linearity, nearly normal residuals and constant variability.

plot(m_Median$residuals ~ all$Median)
abline(h = 0, lty = 3)

hist(m_Median$residuals)

qqnorm(m_Median$residuals)
qqline(m_Median$residuals)

There seems to be a pattern in the residual plot: the residuals are more spread out at lower median salaries, so the constant variability condition is questionable. Additionally, the distribution of the residuals has some outliers to the right, which is also evident in the normal probability plot, where the points deviate from the line. Because the nearly normal residuals condition has not been met, the linear model does not seem appropriate according to our residual analysis.
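The visual checks could be complemented with formal tests. The sketch below is one possible approach, not part of the original analysis: `check_residuals` is a hypothetical helper that runs a Shapiro-Wilk test on the residuals (`shapiro.test` from base R) and computes a studentized Breusch-Pagan statistic by hand, as n times the R-squared from regressing the squared residuals on the predictors. The demonstration uses simulated data so that the block runs on its own.

```r
# Hypothetical helper: formal checks to complement the residual plots
check_residuals <- function(model) {
  res <- residuals(model)
  n   <- length(res)

  # Shapiro-Wilk test: a small p-value suggests non-normal residuals
  sw <- shapiro.test(res)

  # Studentized Breusch-Pagan statistic computed by hand:
  # regress squared residuals on the predictors and use n * R^2,
  # compared against a chi-square with one df per predictor
  Z   <- model.matrix(model)[, -1, drop = FALSE]  # drop intercept column
  sq  <- res^2
  aux <- lm(sq ~ Z)
  bp_stat <- n * summary(aux)$r.squared
  bp_p    <- pchisq(bp_stat, df = ncol(Z), lower.tail = FALSE)

  list(shapiro_p = sw$p.value, bp_stat = bp_stat, bp_p = bp_p)
}

# Demonstration on simulated data (values are made up for illustration)
set.seed(1)
x <- runif(200, 30000, 120000)                   # hypothetical median salaries
y <- 0.08 - 4e-07 * x + rnorm(200, sd = 0.01)    # hypothetical unemployment rates
sim_model <- lm(y ~ x)

result <- check_residuals(sim_model)
print(result)
```

With the model from this discussion, the call would be `check_residuals(m_Median)`. A small Shapiro-Wilk p-value would support the conclusion that the residuals are not nearly normal, and a small Breusch-Pagan p-value would support the impression of non-constant variability.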

CUNY SPS - Master of Science in Data Science - DATA605

Mario Pena

April 15, 2020
