library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lmtest)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
library(MASS)
## 
## Attaching package: 'MASS'
## 
## The following object is masked from 'package:dplyr':
## 
##     select
library(readxl)
library(pastecs)
## 
## Attaching package: 'pastecs'
## 
## The following objects are masked from 'package:dplyr':
## 
##     first, last
## 
## The following object is masked from 'package:tidyr':
## 
##     extract
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some
1. Load your preferred dataset into RStudio.
dataset <- read.csv("state_data_combo.csv")
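
Before fitting anything, a quick descriptive pass over the two variables can catch obvious problems. This is a sketch only (pastecs is already loaded above, and the column names are the ones used in the model below); its output is omitted because it was not part of the original run.

# Sketch: quick descriptive summary of the outcome and the predictor
stat.desc(dataset[, c("Library.Visits.Per.Capita", "Poverty.Rate....")])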
2. Create a linear model with lm() from the variables, using a continuous dependent variable as the outcome.
linear_model <- lm(Library.Visits.Per.Capita ~ Poverty.Rate...., data = dataset)
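
Before checking assumptions, summary() shows the estimated slope, its p-value, and the R-squared of the fit. Output is omitted here because it was not part of the original run.

summary(linear_model)  # coefficients, standard errors, and R-squared for the fit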
3. Check the following assumptions:
   a. Linearity (plot and raintest)
plot(linear_model, which = 1)  # residuals vs. fitted: look for a flat, patternless spread

raintest(linear_model)
## 
##  Rainbow test
## 
## data:  linear_model
## Rain = 2.636, df1 = 26, df2 = 23, p-value = 0.01082
   b. Independence of errors (Durbin-Watson)
durbinWatsonTest(linear_model)
##  lag Autocorrelation D-W Statistic p-value
##    1     -0.09462595      2.119612   0.652
##  Alternative hypothesis: rho != 0
   c. Homoscedasticity (plot, bptest)
plot(linear_model, which = 3)  # scale-location plot: look for a roughly horizontal trend

bptest(linear_model)
## 
##  studentized Breusch-Pagan test
## 
## data:  linear_model
## BP = 0.43045, df = 1, p-value = 0.5118
   d. Normality of residuals (QQ plot, Shapiro-Wilk test)
plot(linear_model, which = 2)  # normal QQ plot of the residuals

shapiro.test(linear_model$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  linear_model$residuals
## W = 0.96794, p-value = 0.1815
   e. No multicollinearity (VIF, cor)
cor(dataset$Library.Visits.Per.Capita, dataset$Poverty.Rate....)
## [1] -0.365319
4. Does your model meet those assumptions? You don’t have to be perfectly right, just make a good case.
5. If your model violates an assumption, which one?

My linear model does meet some of the assumptions required of linear regression. Running a Durbin-Watson test on the model for independence of errors returns a p-value of 0.652. Because that p-value is greater than .05, we fail to reject the null hypothesis of no autocorrelation, which is consistent with independent errors.

The model also passes the Breusch-Pagan test for homoscedasticity, returning a p-value of 0.5118. That non-significant result means the null hypothesis of homoscedasticity is not rejected, so there is no evidence of heteroscedasticity.

The model also meets the assumption of normally distributed residuals: most of the points on the QQ plot fall along the reference line, and a Shapiro-Wilk test of the residuals returns a p-value of 0.1815. A Shapiro-Wilk p-value above .05 suggests the residuals are consistent with a normal distribution.

Finally, library visits per capita and poverty rate are not strongly correlated with each other. The cor() function returns a correlation of -0.365, which suggests a moderate negative relationship rather than a strong one.
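
The assignment also lists VIF, but vif() from car needs at least two predictors and errors out on a one-predictor model, which is why only cor() is reported above. For illustration only, here is a hypothetical sketch of how it would be called; Median.Income is an invented column name, not a variable in state_data_combo.csv.

# Hypothetical sketch: Median.Income is NOT a real column in this dataset;
# vif() needs two or more predictors, so this is illustration only.
multi_model <- lm(Library.Visits.Per.Capita ~ Poverty.Rate.... + Median.Income,
                  data = dataset)
vif(multi_model)  # rule of thumb: values above ~5 suggest problematic collinearity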

On the other hand, my model does not meet the assumption of linearity. The Rainbow test returns a p-value of 0.01082, so the null hypothesis of linearity is rejected at the .05 level, a result also visible in the residuals-vs-fitted plot of the model.

6. What would you do to mitigate this assumption? Show your work.

To mitigate the failure of my linear model to meet the assumption of linearity, I tried transforming the data with log and square-root transformations. However, these failed to improve linearity; the Rainbow test p-values actually decreased (to 0.001251 and 0.004977, respectively), indicating even stronger evidence against linearity.

transformed_model1 <- lm(log(Library.Visits.Per.Capita) ~ log(Poverty.Rate....),
                         data = dataset)

transformed_model2 <- lm(sqrt(Library.Visits.Per.Capita) ~ sqrt(Poverty.Rate....),
                         data = dataset)
plot(transformed_model1, which = 1)  # residuals vs. fitted for the log-log model

plot(transformed_model2, which = 1)  # residuals vs. fitted for the square-root model

raintest(transformed_model1)
## 
##  Rainbow test
## 
## data:  transformed_model1
## Rain = 3.6533, df1 = 26, df2 = 23, p-value = 0.001251
raintest(transformed_model2)
## 
##  Rainbow test
## 
## data:  transformed_model2
## Rain = 2.9854, df1 = 26, df2 = 23, p-value = 0.004977
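
A further option I did not run would be to keep the original scales but add a quadratic term in poverty rate, which keeps the model linear in its coefficients while allowing a curved fit. A minimal sketch, assuming the same dataset; no output is shown because it was not run.

# Sketch (not run): quadratic term in poverty rate as an alternative mitigation
quadratic_model <- lm(Library.Visits.Per.Capita ~ poly(Poverty.Rate...., 2),
                      data = dataset)
plot(quadratic_model, which = 1)  # residuals vs. fitted for the quadratic fit
raintest(quadratic_model)         # re-check the linearity assumption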