1. Load your preferred dataset into RStudio
2. Create a linear model with “lm()” from the variables, with a continuous dependent variable as the outcome
3. Check the following assumptions:
   1. Linearity (plot and raintest)
   2. Independence of errors (Durbin-Watson)
   3. Homoscedasticity (plot, bptest)
   4. Normality of residuals (QQ plot, Shapiro test)
   5. No multicollinearity (VIF, cor)
4. Does your model meet those assumptions? You don’t have to be perfectly right, just make a good case.
5. If your model violates an assumption, which one?
6. What would you do to mitigate this violation? Show your work.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.2.0     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.2     ✔ tibble    3.3.1
## ✔ lubridate 1.9.5     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(pastecs)
## 
## Attaching package: 'pastecs'
## 
## The following objects are masked from 'package:dplyr':
## 
##     first, last
## 
## The following object is masked from 'package:tidyr':
## 
##     extract
library(lmtest)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
library(MASS)
## 
## Attaching package: 'MASS'
## 
## The following object is masked from 'package:dplyr':
## 
##     select
Animal_Control<-read_csv("Animal_Care_and_Control_Division_Annual_Statistics.csv")
## Rows: 22 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (17): Year, Number of Employees, Number of Division Vehicles, Annual Bud...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Adoptions_RTO_Foster <- lm(Euthanized~Adoptions+`Fostered Animals`+`Return to Owner`, data=Animal_Control)
plot(Adoptions_RTO_Foster,which=1)

raintest(Adoptions_RTO_Foster)
## 
##  Rainbow test
## 
## data:  Adoptions_RTO_Foster
## Rain = 8.1043, df1 = 11, df2 = 6, p-value = 0.009022
Adoptions_RTO_Foster<-lm(Euthanized~Adoptions+`Fostered Animals`+`Return to Owner`, data=Animal_Control)
Adoptions_RTO_Foster_log<-lm(log(Euthanized)~log(Adoptions+`Fostered Animals`+`Return to Owner`),data=Animal_Control)
plot(Adoptions_RTO_Foster_log,which=1)

raintest(Adoptions_RTO_Foster_log)
## 
##  Rainbow test
## 
## data:  Adoptions_RTO_Foster_log
## Rain = 4.5353, df1 = 11, df2 = 8, p-value = 0.02055
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:purrr':
## 
##     some
durbinWatsonTest(Adoptions_RTO_Foster)
##  lag Autocorrelation D-W Statistic p-value
##    1       0.1800719      1.569375   0.144
##  Alternative hypothesis: rho != 0
plot(Adoptions_RTO_Foster,which = 3)

bptest(Adoptions_RTO_Foster)
## 
##  studentized Breusch-Pagan test
## 
## data:  Adoptions_RTO_Foster
## BP = 1.2035, df = 3, p-value = 0.7522
plot(Adoptions_RTO_Foster,which=2)

shapiro.test(Adoptions_RTO_Foster$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  Adoptions_RTO_Foster$residuals
## W = 0.94478, p-value = 0.2704
plot(Adoptions_RTO_Foster_log,which=2)

Adoptions_RTO_Foster <- lm(Euthanized~Adoptions+`Fostered Animals`+`Return to Owner`, data=Animal_Control)
vif(Adoptions_RTO_Foster)
##          Adoptions `Fostered Animals`  `Return to Owner` 
##           1.470658           1.661375           1.226752

#My model does not seem to meet the assumption of linearity by eye. The residuals-vs-fitted plot has a curve in it, which suggests the data could be transformed to become more linear. The rainbow test agrees: its p-value of 0.009022 is well below 0.05, so I reject the null hypothesis of linearity and treat this model as non-linear.

#The Durbin-Watson test returned a D-W statistic of 1.569375 with a p-value of 0.144. Because the p-value is greater than 0.05, I fail to reject the null hypothesis of no autocorrelation. The residuals are not significantly autocorrelated, so the model satisfies the independence of errors assumption required for linear modeling.
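
#As a rough check on what this test measures, the D-W statistic can be reproduced by hand from the residuals. This is a minimal sketch on simulated data, not the Animal_Control data; the names `sim`, `x`, and `y` are hypothetical.

```r
# Simulated stand-in data; `sim`, `x`, and `y` are hypothetical names.
set.seed(1)
sim <- data.frame(x = 1:30)
sim$y <- 2 + 0.5 * sim$x + rnorm(30)

fit <- lm(y ~ x, data = sim)
e <- residuals(fit)

# Durbin-Watson statistic: sum of squared successive differences of the
# residuals divided by the residual sum of squares.
dw <- sum(diff(e)^2) / sum(e^2)

# Lag-1 residual autocorrelation using the same denominator.
r1 <- sum(e[-1] * e[-length(e)]) / sum(e^2)

dw            # values near 2 suggest little lag-1 autocorrelation
2 * (1 - r1)  # roughly equal to dw; the gap comes from the two end residuals
```

#By default, durbinWatsonTest() in car estimates its p-value by simulation, so the p-value can differ slightly from one run to the next. Calling set.seed() before the test should make it reproducible.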

#For the assumption of homoscedasticity I examined the scale-location plot and ran bptest. The Breusch-Pagan p-value was 0.7522, which is much greater than 0.05, so I fail to reject the null hypothesis of constant variance. My model is homoscedastic and passes this assumption.
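
#To show where the BP statistic and its degrees of freedom come from, here is a minimal sketch of the studentized version computed by hand. It uses the built-in `mtcars` data as a stand-in (an assumption for illustration, not the Animal_Control data), and the result should match what bptest() reports with its default settings.

```r
# Stand-in model on built-in data; three predictors, like the model above.
fit <- lm(mpg ~ wt + hp + disp, data = mtcars)

# Auxiliary regression: squared residuals on the same predictors.
e2 <- residuals(fit)^2
aux <- lm(e2 ~ wt + hp + disp, data = mtcars)

n <- nobs(fit)
k <- length(coef(fit)) - 1          # number of predictors = test df

# Studentized (Koenker) Breusch-Pagan statistic: n times the auxiliary R^2.
bp <- n * summary(aux)$r.squared
p_value <- pchisq(bp, df = k, lower.tail = FALSE)

c(BP = bp, df = k, p.value = p_value)
```

#A large p-value here means the squared residuals are not explained by the predictors, which is what constant variance looks like.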

#I have met the assumption of normality of residuals. The Shapiro-Wilk test gave W = 0.94478 with a p-value of 0.2704; because the p-value is greater than 0.05, I fail to reject the null hypothesis that the residuals are normally distributed. W being close to 1.0 points the same way, and the QQ plot agrees. I also ran plot(Adoptions_RTO_Foster_log,which=2) to see how the QQ plot changed under the log transformation, for educational purposes.
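
#The decision in the Shapiro-Wilk test rests on the p-value rather than on W alone. The sketch below illustrates this with stand-in data: `mtcars` and the simulated `skewed` sample are assumptions for illustration, not the Animal_Control data.

```r
# Part 1: decision rule applied to model residuals (stand-in model).
fit <- lm(mpg ~ wt + hp, data = mtcars)
sw <- shapiro.test(residuals(fit))
normal_ok <- sw$p.value > 0.05   # TRUE means "fail to reject normality"

# Part 2: a large, mildly skewed sample. W can stay fairly close to 1
# while the p-value still rejects normality, so W alone is not enough.
set.seed(123)
skewed <- rchisq(3000, df = 20)
sw_big <- shapiro.test(skewed)

unname(sw_big$statistic)
sw_big$p.value
```

#With only about 20 observations the test has limited power, so pairing it with the QQ plot, as done above, is a reasonable habit.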

#When I ran the VIF test, my variables returned values far below the common cutoffs of 5 and 10. The three independent variables (Adoptions, Return to Owner, and Fostered Animals) are not strongly correlated with one another. The values were Adoptions = 1.470658, Return to Owner = 1.226752, and Fostered Animals = 1.661375. My variables passed the no multicollinearity assumption.
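
#For reference, each VIF is 1 / (1 - R^2), where R^2 comes from regressing one predictor on the others. A minimal sketch by hand, using three `mtcars` columns as stand-in predictors (an assumption for illustration, not the Animal_Control data):

```r
# Stand-in predictors from built-in data.
predictors <- mtcars[, c("wt", "hp", "disp")]

# For each predictor, regress it on the remaining predictors and
# convert the resulting R^2 into a variance inflation factor.
vif_by_hand <- sapply(names(predictors), function(v) {
  others <- setdiff(names(predictors), v)
  aux <- lm(reformulate(others, response = v), data = predictors)
  1 / (1 - summary(aux)$r.squared)
})

vif_by_hand

# Equivalent shortcut: the diagonal of the inverse correlation matrix.
diag(solve(cor(predictors)))
```

#A VIF of 1 means a predictor is uncorrelated with the others, so values around 1.2 to 1.7 indicate only mild overlap.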

#My model violated the assumption of linearity.

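#I performed a log transformation on the data: Adoptions_RTO_Foster_log<-lm(log(Euthanized)~log(Adoptions+`Fostered Animals`+`Return to Owner`),data=Animal_Control), followed by plot(Adoptions_RTO_Foster_log,which=1) and raintest(Adoptions_RTO_Foster_log). Note that this formula takes the log of the sum of the three predictors, so the transformed model has a single combined predictor. The rainbow test p-value was 0.02055, which is still below 0.05 and therefore still indicates non-linearity, but it was a move in the right direction.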
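
#One alternative worth trying is to log each predictor separately instead of logging their sum. The sketch below contrasts the two formulas on simulated data; `sim` and its column names are hypothetical, and the coefficients used to generate the data are arbitrary.

```r
# Simulated stand-in data; all names and coefficients are hypothetical.
set.seed(42)
n <- 40
sim <- data.frame(
  adoptions = runif(n, 500, 3000),
  fostered  = runif(n, 50, 800),
  returned  = runif(n, 100, 900)
)
sim$euthanized <- exp(9 - 0.4 * log(sim$adoptions) -
                        0.2 * log(sim$fostered) -
                        0.1 * log(sim$returned) + rnorm(n, sd = 0.1))

# log() of a sum: the three predictors collapse into one combined total.
fit_sum <- lm(log(euthanized) ~ log(adoptions + fostered + returned),
              data = sim)

# log() of each predictor: every predictor keeps its own slope.
fit_each <- lm(log(euthanized) ~ log(adoptions) + log(fostered) +
                 log(returned), data = sim)

length(coef(fit_sum))   # 2: intercept plus one combined slope
length(coef(fit_each))  # 4: intercept plus one slope per predictor
```

#If any predictor contains zeros, log() returns -Inf and lm() will stop with an error; log1p() is one common workaround in that case.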

#I am not sure whether I should keep trying to mitigate the linearity violation or leave the model alone. At first glance the log-transformed residual plot appears less curved than the original, but on closer inspection I think the first model is actually more linear: the transformed model has two bumps in it, whereas the original has only one slope.