1) Load your preferred dataset into RStudio

library(readxl)  # provides read_xlsx()
Project=read_xlsx("QUANT_Project.xlsx")

2) Create a linear model “lm()” from the variables, with a continuous dependent variable as the outcome

Project_NoZeroCost = Project[Project$Cost != 0, ]
Model=lm(Cost~IMPPCI+Calls, data=Project_NoZeroCost)

3) Check the following assumptions: a) Linearity (plot and raintest)

plot(Model,which=1)

The rainbow test rejects the null hypothesis of linearity (Rain = 9.84, p < 2.2e-16), so the relationship does not appear to be linear.

library(lmtest)  # provides raintest() and bptest()
raintest(Model)
## 
##  Rainbow test
## 
## data:  Model
## Rain = 9.8422, df1 = 4301, df2 = 4298, p-value < 2.2e-16

b) Independence of errors (durbin-watson)

library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:purrr':
## 
##     some
durbinWatsonTest(Model)
##  lag Autocorrelation D-W Statistic p-value
##    1       0.1614318      1.676109       0
##  Alternative hypothesis: rho != 0

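The p-value prints as 0 because durbinWatsonTest() estimates it by bootstrap simulation (1,000 replicates by default), so 0 means that none of the simulated statistics was as extreme as the observed one, not that the probability is exactly zero. The DW statistic is 1.68, which is fairly close to 2, but with more than 8,000 observations a lag-1 autocorrelation of 0.16 is still statistically significant, so the errors show mild positive autocorrelation. Two caveats apply: the test is only meaningful if the rows have a natural order (such as time), and the heteroscedasticity checked in the next step can also distort the result.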
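
To make the statistic less of a black box, here is a minimal base-R sketch on simulated data (the names `x`, `y`, `fit`, `dw`, and `r1` are hypothetical and not from the project file). It computes the Durbin-Watson statistic directly from the residuals and checks the approximation DW ≈ 2(1 - r), where r is the lag-1 autocorrelation. Applied to the output above, 2 × (1 - 0.1614) ≈ 1.677, which agrees with the reported 1.676.

```r
# Simulated data with AR(1) errors; all names here are illustrative only.
set.seed(1)
n <- 500
x <- rnorm(n)
e <- as.numeric(arima.sim(list(ar = 0.3), n = n))  # autocorrelated errors
y <- 1 + 2 * x + e
fit <- lm(y ~ x)

res <- residuals(fit)

# Durbin-Watson statistic: sum of squared successive differences
# divided by the sum of squared residuals.
dw <- sum(diff(res)^2) / sum(res^2)

# Lag-1 autocorrelation of the residuals.
r1 <- sum(res[-1] * res[-n]) / sum(res^2)

dw            # below 2 when errors are positively autocorrelated
2 * (1 - r1)  # close to dw
```

Values near 2 indicate little lag-1 autocorrelation; how far below 2 counts as a problem depends on the sample size, which is why the test's p-value matters more than the raw statistic here.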

c) Homoscedasticity (plot, bptest)

plot(Model,which=3)

The Breusch-Pagan test is highly significant (BP = 778.14, p < 2.2e-16), so the residuals are heteroscedastic, which probably also undermines the reliability of the Durbin-Watson result in the previous step.

bptest(Model)
## 
##  studentized Breusch-Pagan test
## 
## data:  Model
## BP = 778.14, df = 2, p-value < 2.2e-16
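
The idea behind the studentized Breusch-Pagan test can be sketched in base R: regress the squared residuals on the predictors and compare n times the R-squared of that regression with a chi-squared distribution. The data below are simulated and the names (`x1`, `x2`, `fit`, `aux`) are hypothetical; this is meant to illustrate the logic, not to replace bptest().

```r
set.seed(2)
n  <- 1000
x1 <- runif(n, 1, 10)
x2 <- rnorm(n)
# Error spread grows with x1, so the data are heteroscedastic by construction.
y  <- 3 + 2 * x1 + x2 + rnorm(n, sd = x1)
fit <- lm(y ~ x1 + x2)

# Auxiliary regression of the squared residuals on the same predictors.
aux <- lm(residuals(fit)^2 ~ x1 + x2)

bp_stat <- n * summary(aux)$r.squared   # test statistic
bp_df   <- 2                            # number of predictors
bp_p    <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

c(BP = bp_stat, df = bp_df, p.value = bp_p)
```

A small p-value means the squared residuals can be predicted from the regressors, which is exactly what non-constant variance looks like.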

d) Normality of residuals (QQ plot, shapiro test)

plot(Model,which=2)

shapiro.test() accepts at most 5,000 observations, so it could not be run on a dataset of this size (8615 observations). The Q-Q plot nonetheless shows a substantial portion of the residuals diverging from the line.
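
One possible workaround, sketched here on simulated data, is to run shapiro.test() on a random subsample of the residuals, since the function accepts between 3 and 5,000 values. The names `fit` and `res_sample` and the subsample size of 5,000 are assumptions for illustration.

```r
set.seed(3)
n <- 8000
x <- rnorm(n)
y <- 1 + x + rexp(n)   # right-skewed errors, so the residuals are non-normal
fit <- lm(y ~ x)

res <- residuals(fit)
# shapiro.test(res) would stop with an error because length(res) > 5000.
res_sample <- sample(res, size = 5000)

sw <- shapiro.test(res_sample)
sw$p.value             # very small for skewed residuals
```

With samples this large the test flags even small departures from normality, so the Q-Q plot remains the more informative check.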

e) No multicollinearity (VIF, cor)

vif(Model)
##   IMPPCI    Calls 
## 1.025128 1.025128

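cor() returns NA for every pair involving IMPPCI, almost certainly because that column contains missing values; passing use = "complete.obs" would restrict the calculation to complete rows. The VIF is close to 1 for both predictors, so IMPPCI and Calls are not closely related: with two predictors VIF = 1/(1 - r^2), so a VIF of 1.025 corresponds to a correlation of only about 0.16 in absolute value.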

ModelCor<-Project_NoZeroCost %>% dplyr::select(Cost,IMPPCI,Calls)

cor(ModelCor)
##             Cost IMPPCI     Calls
## Cost   1.0000000     NA 0.4391854
## IMPPCI        NA      1        NA
## Calls  0.4391854     NA 1.0000000
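
A minimal sketch with a made-up data frame (`toy` and its values are hypothetical) shows how a single missing value produces this pattern, how the `use = "complete.obs"` argument avoids it, and how the VIF of a two-predictor model follows from the correlation between the predictors:

```r
toy <- data.frame(
  Cost   = c(10, 20, 15, 30, 25, 40),
  IMPPCI = c(1.2, NA, 0.8, 1.5, 1.1, 0.9),  # one missing value
  Calls  = c(3, 5, 4, 8, 6, 9)
)

cor(toy)                               # off-diagonal IMPPCI entries are NA
cm <- cor(toy, use = "complete.obs")   # drops rows with any NA first
cm

# With exactly two predictors, VIF = 1 / (1 - r^2),
# where r is the correlation between them.
r   <- cm["IMPPCI", "Calls"]
vif <- 1 / (1 - r^2)
vif
```

An alternative is `use = "pairwise.complete.obs"`, which keeps as many rows as possible for each pair of columns at the cost of each correlation being based on a different subset.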

4) Does your model meet those assumptions? You don’t have to be perfectly right, just make a good case.

The model fails most of the assumptions. The one exception is multicollinearity: Calls and IMPPCI are not closely correlated with each other (VIF of about 1.03 for both).

5) If your model violates an assumption, which one?

It seems to violate almost all of the assumptions. The relationship is not linear, the residuals are not homoscedastic, the Durbin-Watson test does not support independent errors (although the heteroscedasticity makes me unsure how far to trust it), and the residuals are not normal (although a large stretch of them does follow the Q-Q line).

6) What would you do to mitigate this assumption? Show your work.

I would apply a log transformation to the dependent variable. That could improve most of the issues at once: a log transformation often improves the normality of the residuals, reduces heteroscedasticity, makes the relationship more linear, and would make the Durbin-Watson result easier to trust and interpret.
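
The refit can be sketched as follows on simulated data, since the project file is not reproduced here. The data frame `sim`, the models `RawModel` and `LogModel`, and the helper `skew()` are hypothetical names introduced for illustration; the simulated Cost is built with multiplicative errors so that it is right-skewed, which is an assumption about the kind of data where a log transformation helps.

```r
set.seed(4)
n <- 1000
IMPPCI <- runif(n, 0.5, 2)
Calls  <- rpois(n, 5)
# Multiplicative errors: Cost is right-skewed and its spread grows with its mean.
Cost   <- exp(1 + 0.5 * IMPPCI + 0.2 * Calls + rnorm(n, sd = 0.5))
sim <- data.frame(Cost, IMPPCI, Calls)

RawModel <- lm(Cost ~ IMPPCI + Calls, data = sim)
LogModel <- lm(log(Cost) ~ IMPPCI + Calls, data = sim)

# Compare the skewness of the residuals before and after the transformation.
skew <- function(e) mean((e - mean(e))^3) / sd(e)^3
c(raw = skew(residuals(RawModel)), log = skew(residuals(LogModel)))

# Re-run the same diagnostic plots on the transformed model.
par(mfrow = c(1, 3))
plot(LogModel, which = 1)  # linearity
plot(LogModel, which = 3)  # homoscedasticity
plot(LogModel, which = 2)  # normality of residuals
```

For the project data the equivalent call would be lm(log(Cost) ~ IMPPCI + Calls, data = Project_NoZeroCost), followed by the same tests as in step 3. This assumes every remaining Cost value is positive: the filter in step 2 only removes zeros, so negative values would still produce NaN under log().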