MATH 239: Project Milestone #3: Fitting a Simple Linear Model 100 points

Goal: In this milestone we will start fitting our first simple model. Please report all your findings for this assignment in a R Markdown document.

Step 0:

Identify a numeric response variable in your data set and a numeric explanatory variable. (10 points)

Numeric response variable: number of EV charging stations by state

Numeric explanatory variable: median income by state

# median state income vs num units by state
ggplot(stations_income_df2, aes(med_income18_20, state_num_units.y))+
  geom_point()+
  ggtitle("Scatterplot of Median Income vs EV Stations by State")+
  xlab("Median Income")+
  ylab("EV Stations")+
  theme_bw()
## Warning: Removed 3 rows containing missing values (geom_point).

## remove num units outlier
no_outlier <- stations_income_df2 %>%
  filter(state_num_units.y < 40000)
ggplot(no_outlier, aes(med_income18_20, state_num_units.y))+
  geom_point()+
  ggtitle("Scatterplot of Median Income vs EV Stations by State, CA removed")+
  xlab("Median Income")+
  ylab("EV Stations")+
  theme_bw()
## Warning: Removed 3 rows containing missing values (geom_point).

## modify dataframe to be same length as m1$residuals due to na values
#full dataset
stations_income_df2 <- stations_income_df2[!is.na(stations_income_df2$med_income18_20), ]
no_outlier <- no_outlier[!is.na(no_outlier$med_income18_20), ]

Perform a simple linear regression analysis

Step 1: Determine the fitted model (10 points)

# full data set
m1 <- lm(state_num_units.y~med_income18_20, data = stations_income_df2)
summary(m1)
## 
## Call:
## lm(formula = state_num_units.y ~ med_income18_20, data = stations_income_df2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -3259  -1907  -1242      0  38843 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)
## (Intercept)     -2.304e+03  5.104e+03  -0.451    0.654
## med_income18_20  7.024e-02  7.366e-02   0.954    0.345
## 
## Residual standard error: 5854 on 50 degrees of freedom
## Multiple R-squared:  0.01787,    Adjusted R-squared:  -0.001778 
## F-statistic: 0.9095 on 1 and 50 DF,  p-value: 0.3448
#scatterplot with fitted line

ggplot(stations_income_df2, aes(med_income18_20, state_num_units.y))+
  geom_point()+
  ggtitle("Scatterplot of Median Income vs EV Stations by State")+
  xlab("Median Income")+
  ylab("EV Stations")+
  theme_bw()+
  geom_abline(slope=m1$coefficients[2], intercept=m1$coefficients[1],
              color="blue", lty=2, lwd=1)

# no outlier

m2 <- lm(state_num_units.y~med_income18_20, data = no_outlier)
summary(m2)
## 
## Call:
## lm(formula = state_num_units.y ~ med_income18_20, data = no_outlier)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1768.2 -1229.6  -714.7   921.4  6265.1 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)
## (Intercept)     3.746e+02  1.575e+03   0.238    0.813
## med_income18_20 1.985e-02  2.278e-02   0.871    0.388
## 
## Residual standard error: 1801 on 49 degrees of freedom
## Multiple R-squared:  0.01526,    Adjusted R-squared:  -0.004841 
## F-statistic: 0.7591 on 1 and 49 DF,  p-value: 0.3879
#scatterplot with fitted line, outlier removed

ggplot(no_outlier, aes(med_income18_20, state_num_units.y))+
  geom_point()+
  ggtitle("Scatterplot of median income by state vs number of EV charging units by state, CA removed")+
  xlab("median income by state")+
  ylab("number of EV charging units")+
  theme_bw()+
  geom_abline(slope=m2$coefficients[2], intercept=m2$coefficients[1],
              color="blue", lty=2, lwd=1)

Step 2: Perform a test for the slope.

From the linear model above:

Null hypothesis: slope = 0

Alternative hypothesis: slope != 0

State the reference distribution, degrees of freedom, the test statistic, and p-value. (10 points)

Full data set: t-distribution with 50 degrees of freedom, the test statistic is 0.954, and p-value is 0.345

Data set with CA removed: t-distribution with 49 degrees of freedom, the test statistic is 0.871, and p-value is 0.388

Write a five-part conclusion incorporating the hypotheses in the context of the problem. (10 points)

Full data set: We fail to reject the null hypothesis with a p-value of 0.345 at the alpha = 0.05 significance level. There does not seem to be much evidence that median income of a state tends to be associated with the number of EV charging stations.

Step 3: Create an ANOVA table and produce the F-statistic and discuss the R-squared value. (20 points)

anova(m1)
## Analysis of Variance Table
## 
## Response: state_num_units.y
##                 Df     Sum Sq  Mean Sq F value Pr(>F)
## med_income18_20  1   31163688 31163688  0.9095 0.3448
## Residuals       50 1713228705 34264574
anova(m2)
## Analysis of Variance Table
## 
## Response: state_num_units.y
##                 Df    Sum Sq Mean Sq F value Pr(>F)
## med_income18_20  1   2462683 2462683  0.7591 0.3879
## Residuals       49 158965949 3244203
# r^2 = 1 - (SSE/SST)
1 - (1713228705/(31163688+1713228705))
## [1] 0.01786507

With a very small R-squared value of 0.08 for the full data set, it appears that very little of the variation in the number of EV stations can be explained by the median income. It would be difficult to predict the number of EV stations based on the median income of a State.

Step 4: Create diagnostic plots to assess model assumptions. (20 points)

Residual plot

# residual plot to check model assumptions of center and variance of residuals


#full dataset
m1_df <- augment(m1, stations_income_df2)
ggplot(m1_df, aes(x = .fitted, y = .resid))+
  geom_point()+
  geom_hline(yintercept=0, lty=2, lwd=1, color="blue")+
  ggtitle("Residual Plot")+
  theme_bw()

ggplot(m1_df, aes(x = .fitted, y = .resid))+
  geom_point()+
  geom_hline(yintercept=0, lty=2, lwd=1, color="blue")+
  geom_smooth(se=FALSE)+
  ggtitle("Residual Plot with locally weighted smoothing line")+
  theme_bw()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

# CA removed
m2_df <- augment(m2, no_outlier)
ggplot(m2_df, aes(x = .fitted, y = .resid))+
  geom_point()+
  geom_hline(yintercept=0, lty=2, lwd=1, color="blue")+
  ggtitle("Residual Plot with CA removed")+
  theme_bw()

ggplot(m2_df, aes(x = .fitted, y = .resid))+
  geom_point()+
  geom_hline(yintercept=0, lty=2, lwd=1, color="blue")+
  geom_smooth(se=FALSE)+
  ggtitle("Residual Plot with CA removed and locally weighted smoothing line")+
  theme_bw()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

QQ Plot

# check for normal distribution of errors assumption

qqnorm(m1$residuals)
qqline(m1$residuals)

qqnorm(m2$residuals)
qqline(m2$residuals)

Leverage vs Residual Plot

# residuals vs cooks distance

ggplot(m1_df, aes(x = .cooksd, y = .resid, label=State))+
  geom_point()+
  geom_text(hjust=-1, vjust=1)+
  ggtitle("Leverage vs Residual Plot")+
  theme_bw()
## Warning: Removed 2 rows containing missing values (geom_text).

ggplot(m2_df, aes(x = .cooksd, y = .resid, label=State))+
  geom_point()+
  geom_text(hjust=-1, vjust=1)+
  ggtitle("Leverage vs Residual Plot with CA removed")+
  theme_bw()
## Warning: Removed 2 rows containing missing values (geom_text).

Step 5: Summarize your findings. (20 points)

The two numeric variables we looked at are the median income by state and the number of EV charging stations by state. We hypothesized that the median income (explanatory) would be associated with the number of charging stations (response), however the scatter plot shows one apparent outlier, CA with a much higher number of stations, while the rest don’t seem to show any clear relationship.

We performed a hypothesis test for the slope with a t-distribution of 50 degrees of freedom (with all data) or 49 degrees of freedom (with CA removed). The full dataset had a t value of 0.954 and a p value of 0.345, and the outlierless dataset had a t value of 0.871 and a p value of 0.388. For both data sets we fail to reject the null hypothesis with p values of 0.345 (full) and 0.388 (CA removed) at a significance level of alpha = 0.05. There does not seem to be evidence that the median income of a state tends to be associated with the number of EV charging stations it has. R-squared = 0.017, which shows that the proportion of the variation in the number of stations is not really able to be predicted by the state median income.

To test our model assumptions, we created three diagnostic plots: residual plot, qq plot, and leverage vs. residuals plot. In the residual plot of the full data set, the residuals seem to be fairly centered around 0, except for California. With California removed, it becomes clear that there are more points overestimated by the model (points below 0 on the plot) than there are underestimated, so the data is not evenly spread around 0. There is not an obvious fan shape in the data, but there does appear to be a somewhat weak parabolic pattern, showing how there is not really constant spread either. In the qq plot of the full data set, the points appear to follow the qq line fairly well except at the far right hand side of the plot. This would indicate a right skewed distribution. With California removed, the points do not appear to follow the qq line as well, but the tails are bent upwards, indicating a right skewed distribution. In the leverage vs. residual plot of the full data set, California is the only data point with a large enough Cook’s distance (>0.08) to be influential in the model. However, in the data set with California removed, New York, Florida, Massachusetts, and Texas all have considerable leverage.

Our diagnostic plots to test for model assumptions show that a linear model was not a good way to gain insight into these data. These data inherently do not follow the model assumptions for normal distribution and even and constant spread of residuals around zero, so we cannot really use simple linear regression to draw conclusions about the relationship between EV stations and median income. We may be able to transform the data to better fit the model assumptions, but in it’s current state it is not suited for SLR.