#Instructions
#Evaluate your model using diagnostics and explore influential observations. For each test you conduct:
#1. State the rationale for conducting the test/visualization.
#2. Discuss your model’s performance in the test and changes that may need to be made to the model.
#3. Modify regression models as needed.
#4. Present a final model, corrected for all assumptions and influential observations.
The template for this work is provided to you. Your text will discuss your thought process for evaluating your model and walk through the changes you make. This is intentionally verbose on your end to ensure robust understanding. Use the materials shared and your book to guide your decisions.
For the most part, code is provided to you. You will only need to adapt code to your models, not create code yourself.
# Install only the packages that are missing. In the original run, RStudio reported
# "Error in install.packages : Updating loaded packages" for tidyverse, effects and mvtnorm,
# which were already loaded; lmtest (0.9-37, with zoo 1.8-7) and car (3.0-7) installed on R 3.6.
pkgs <- c("tidyverse", "effects", "mvtnorm", "lmtest", "car")
missing_pkgs <- setdiff(pkgs, rownames(installed.packages()))
if (length(missing_pkgs) > 0) install.packages(missing_pkgs)
library(tidyverse)
── Attaching packages ───────── tidyverse 1.3.0 ──
✓ ggplot2 3.3.0 ✓ purrr 0.3.3
✓ tibble 3.0.0 ✓ dplyr 0.8.5
✓ tidyr 1.0.2 ✓ stringr 1.4.0
✓ readr 1.3.1 ✓ forcats 0.5.0
── Conflicts ──────────── tidyverse_conflicts() ──
x dplyr::filter() masks stats::filter()
x dplyr::lag() masks stats::lag()
library(effects)
Loading required package: carData
lattice theme set by effectsTheme()
See ?effectsTheme for details.
library(mvtnorm)
library(lmtest)
Loading required package: zoo
Attaching package: ‘zoo’
The following objects are masked from ‘package:base’:
as.Date, as.Date.numeric
library(zoo)
library(car)
Registered S3 method overwritten by 'data.table':
method from
print.data.table
Registered S3 methods overwritten by 'car':
method from
influence.merMod lme4
cooks.distance.influence.merMod lme4
dfbeta.influence.merMod lme4
dfbetas.influence.merMod lme4
Attaching package: ‘car’
The following object is masked from ‘package:dplyr’:
recode
The following object is masked from ‘package:purrr’:
some
library(readxl)  # read_excel() lives in readxl, which library(tidyverse) installs but does not attach
HR_Data <- read_excel("HR Onboarding and Performance.xlsx")
emails<-HR_Data$`Number of e-mails sent or received within the first 90 days`
soc_events<-HR_Data$`Onboarding social events attended (out of 10)`
mentor<-HR_Data$`Time with mentor/new hire buddy (hours)`
knowledge<-HR_Data$`Early employee knowledge attainment scores (out of 10)`
sales<-HR_Data$`Sales within the first 90 days`
Regression Assumptions (LINE)
matrix1<-tibble(emails=emails, soc_events=soc_events, mentor=mentor, knowledge=knowledge, sales=sales)
scatterplotMatrix(matrix1)
*The relationship between sales and emails appears to be linear. The same can be said for sales and early knowledge. There does appear to be heteroskedasticity in the relationship between time with a mentor and sales in the first 90 days. The scatterplot matrix above also suggests that the predictors are not strongly related to one another, which is worth noting when considering multicollinearity.
model1<-lm(sales~emails+ knowledge+ mentor+soc_events, data=HR_Data)
summary(model1)
Call:
lm(formula = sales ~ emails + knowledge + mentor + soc_events,
data = HR_Data)
Residuals:
Min 1Q Median 3Q Max
-4.3003 -0.8707 0.3104 0.7461 2.7616
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.43533 1.39185 -1.750 0.0924 .
emails 0.09275 0.01167 7.945 2.67e-08 ***
knowledge -0.05829 0.16494 -0.353 0.7268
mentor 0.99801 0.20464 4.877 5.12e-05 ***
soc_events 0.07644 0.13195 0.579 0.5676
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.373 on 25 degrees of freedom
Multiple R-squared: 0.813, Adjusted R-squared: 0.7831
F-statistic: 27.17 on 4 and 25 DF, p-value: 8.831e-09
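As a quick check on the output above, the adjusted R-squared can be recomputed from the multiple R-squared and the degrees of freedom. This is only a sketch: it uses the rounded R-squared of 0.813 and assumes the sample size implied by the output (25 residual degrees of freedom plus 5 estimated coefficients), not a count taken from the data itself.

```r
# Values read off the summary(model1) output
r2 <- 0.813        # Multiple R-squared (rounded in the output)
df_resid <- 25     # residual degrees of freedom
p <- 4             # predictors: emails, knowledge, mentor, soc_events

# Sample size implied by the output (an assumption, not read from the data)
n <- df_resid + p + 1

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
round(adj_r2, 3)
```

The result agrees with the reported 0.7831 to three decimals, which is where the "approximately 78%" figure comes from.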
Our output shows that the model as a whole explains approximately 78% of the variance in sales within the first 90 days (adjusted R-squared = 0.7831). Of the four predictors, only the number of emails and time with a mentor are significant.
Linearity
residualPlots(model1,plot=F)
Test stat Pr(>|Test stat|)
emails -0.4493 0.657280
knowledge -1.4504 0.159901
mentor -1.0858 0.288338
soc_events -2.2674 0.032656 *
Tukey test -2.9645 0.003031 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
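*If we see a significant p-value (less than or equal to .05) for a predictor, we treat its relationship with the response as non-linear. With that in mind, emails, early knowledge, and time with mentors have linear relationships with sales within the first 90 days, but social events attended does not (p = 0.033). The Tukey test is also significant (p = 0.003), which points to curvature in the model as a whole.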
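One common remedy when a predictor fails this check is to add its squared term and test whether the fit improves. The sketch below uses simulated data, not the HR data; `x`, `y`, `fit_linear` and `fit_quad` are hypothetical names, and the curvature is built in on purpose so the test has something to find.

```r
set.seed(1)
n <- 30
x <- runif(n, 0, 10)                                   # hypothetical predictor
y <- 2 + 1.5 * x - 0.12 * x^2 + rnorm(n, sd = 0.5)     # curved relationship by construction

fit_linear <- lm(y ~ x)
fit_quad   <- lm(y ~ x + I(x^2))   # I() keeps ^ as arithmetic inside the formula

# Nested-model F test: does the squared term improve the fit?
cmp <- anova(fit_linear, fit_quad)
cmp
p_quad <- cmp[["Pr(>F)"]][2]       # p-value for adding the squared term
```

For the model in this notebook, the analogous change would be adding `I(soc_events^2)` to the formula and re-running the residual plots, though whether that is the right correction depends on the data.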
qqPlot(model1)
[1] 13 17
*qqPlot labels observations 13 and 17 as having the most extreme residuals, so these are the candidate outliers for this model.
shapiro.test(resid(model1))
Shapiro-Wilk normality test
data: resid(model1)
W = 0.90619, p-value = 0.01194
The significant p-value of 0.01 suggests that the residuals are not normally distributed.
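To see what the Shapiro-Wilk test is responding to, it can help to run it on residuals whose error distribution is known. This is a minimal sketch on simulated data; the variable names are hypothetical, and the p-values are not quoted here because they depend on the random draw.

```r
set.seed(42)
n <- 30
x <- rnorm(n)

y_norm <- 1 + 2 * x + rnorm(n)          # normally distributed errors
y_skew <- 1 + 2 * x + (rexp(n) - 1)     # right-skewed errors, centred at zero

# The test is applied to the model residuals, not to the raw response
p_norm <- shapiro.test(resid(lm(y_norm ~ x)))$p.value
p_skew <- shapiro.test(resid(lm(y_skew ~ x)))$p.value
c(normal_errors = p_norm, skewed_errors = p_skew)
```

With only 30 observations the test has limited power, so a non-significant result does not prove normality; the qqPlot remains a useful companion.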
bptest(model1)
studentized Breusch-Pagan test
data: model1
BP = 3.744, df = 4, p-value = 0.4418
The Breusch-Pagan test checks the assumption of constant error variance (homoscedasticity) by testing whether the variance of the residuals depends on the predictors. The null hypothesis is constant variance, so a significant p-value (less than 0.05) indicates heteroscedasticity. In this case the p-value of 0.4418 is not significant, so we do not reject homoscedasticity for our model.
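The studentized Breusch-Pagan statistic can be reproduced by hand, which makes the logic of the test easier to follow: regress the squared residuals on the predictors and multiply the R-squared of that regression by the sample size. The sketch below uses simulated data with constant variance built in; all names are hypothetical.

```r
set.seed(7)
n  <- 30
x1 <- runif(n, 1, 10)
x2 <- runif(n, 1, 10)
y  <- 3 + 0.5 * x1 + 0.8 * x2 + rnorm(n)   # constant error variance by construction

fit <- lm(y ~ x1 + x2)

# Auxiliary regression: squared residuals on the same predictors
aux <- lm(resid(fit)^2 ~ x1 + x2)

bp_stat <- n * summary(aux)$r.squared      # studentized (Koenker) form: n * R^2
bp_df   <- length(coef(aux)) - 1           # one degree of freedom per predictor
bp_p    <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)
c(BP = bp_stat, df = bp_df, p.value = bp_p)
```

This explains why the output above reports df = 4: the model has four predictors.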
residualPlots(model1,plot=T,tests=F)
*If we can draw a box or rectangle around the points in each plot, then we can assume roughly constant error variance. With only a few data points, I don’t find these plots as clear as the ones in the homework assigned; however, I do believe that a rectangle could be drawn around each one. Let’s review the bptest to double check. See below:
bptest(model1)
studentized Breusch-Pagan test
data: model1
BP = 3.744, df = 4, p-value = 0.4418
*The BP test indicates that the model passes the assumption of constant error variance because the p-value of .4418 is not significant.
summary(model1)
Call:
lm(formula = sales ~ emails + knowledge + mentor + soc_events,
data = HR_Data)
Residuals:
Min 1Q Median 3Q Max
-4.3003 -0.8707 0.3104 0.7461 2.7616
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.43533 1.39185 -1.750 0.0924 .
emails 0.09275 0.01167 7.945 2.67e-08 ***
knowledge -0.05829 0.16494 -0.353 0.7268
mentor 0.99801 0.20464 4.877 5.12e-05 ***
soc_events 0.07644 0.13195 0.579 0.5676
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.373 on 25 degrees of freedom
Multiple R-squared: 0.813, Adjusted R-squared: 0.7831
F-statistic: 27.17 on 4 and 25 DF, p-value: 8.831e-09
I am not exactly sure what is asked of us in this part. I know that the number of emails and time with mentor both have significant estimates and small standard errors.
vif(model1)
emails knowledge mentor soc_events
1.304644 2.493205 1.100212 2.040471
*No variance inflation factors (VIF) are 4 or higher, so multicollinearity should not be a problem with this dataset.
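For intuition about what these numbers mean, each VIF equals 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. The sketch below computes this by hand on simulated data; `x1`, `x2`, `x3` and `vif_manual` are hypothetical names, and `x2` is made to correlate with `x1` on purpose.

```r
set.seed(3)
n  <- 30
x1 <- rnorm(n)
x2 <- 0.7 * x1 + rnorm(n, sd = 0.7)   # deliberately correlated with x1
x3 <- rnorm(n)
X  <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R^2_j), with R^2_j from regressing predictor j on the others
vif_manual <- sapply(names(X), function(v) {
  f  <- reformulate(setdiff(names(X), v), response = v)
  r2 <- summary(lm(f, data = X))$r.squared
  1 / (1 - r2)
})
round(vif_manual, 2)
```

Read the other way, the VIF of 2.49 for knowledge in the output above corresponds to an R² of about 0.60 when knowledge is regressed on the other three predictors.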
Identifying Unusual and Influential Data Points
influenceIndexPlot(model=model1)
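Observations 13 and 17 are the only points flagged repeatedly across the panels. The value of about -4 for observation 13 is most likely its studentized residual rather than a Bonferroni p-value, since p-values lie between 0 and 1. Either way, observation 13 is the strongest outlier candidate, so I am going to refit the model without it to see whether it should be removed.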
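The Bonferroni p-value in that plot can be sketched by hand: take each studentized residual, compute its two-sided p-value, and multiply by the number of observations to account for testing every point. The example below uses simulated data with one planted outlier; the names and the size of the shift are hypothetical.

```r
set.seed(11)
n <- 30
x <- runif(n, 0, 10)
y <- 1 + 2 * x + rnorm(n)
y[13] <- y[13] - 8                     # plant one unusually low response

fit <- lm(y ~ x)
t_i <- rstudent(fit)                   # externally studentized residuals

# Two-sided p-value per residual, then Bonferroni: multiply by n and cap at 1
p_unadj <- 2 * pt(abs(t_i), df = df.residual(fit) - 1, lower.tail = FALSE)
p_bonf  <- pmin(1, n * p_unadj)

worst <- unname(which.max(abs(t_i)))   # index of the most extreme residual
c(obs = worst, rstudent = unname(t_i[worst]), bonferroni_p = unname(p_bonf[worst]))
```

A large negative studentized residual paired with a small Bonferroni p-value is what marks a point as an outlier; the two quantities are related but sit on different scales.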
model_no_outlier<-lm(sales~emails+ knowledge+ mentor+soc_events, data=HR_Data[-c(13),])
compareCoefs(model1,model_no_outlier)
Calls:
1: lm(formula = sales ~ emails + knowledge + mentor
+ soc_events, data = HR_Data)
2: lm(formula = sales ~ emails + knowledge + mentor
+ soc_events, data = HR_Data[-c(13), ])
Model 1 Model 2
(Intercept) -2.44 -2.44
SE 1.39 1.39
emails 0.0927 0.0927
SE 0.0117 0.0117
knowledge -0.0583 -0.0583
SE 0.1649 0.1649
mentor 0.998 0.998
SE 0.205 0.205
soc_events 0.0764 0.0764
SE 0.1320 0.1320
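You can see that our coefficients did not change at all. That is a sign that observation 13 was never actually removed: emails, knowledge, mentor, soc_events and sales are free-standing vectors rather than columns of HR_Data, so lm() uses all of their values no matter which rows of HR_Data are passed to data. Fitting on matrix1[-13, ] instead would drop the row. As a further check, we will run an Anova test.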
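Identical coefficients are what R produces when the model variables live outside the data frame passed to `data`. The sketch below reproduces the behaviour on simulated data; `d`, `d2` and the column names are hypothetical stand-ins for the notebook's objects.

```r
set.seed(5)
# Column names with spaces, so they differ from the short variable names used in the formula
d <- data.frame(`x value` = runif(30), check.names = FALSE)
d$`y value` <- 1 + 2 * d$`x value` + rnorm(30, sd = 0.2)

x <- d$`x value`   # free-standing copies, as in the notebook
y <- d$`y value`

# lm() cannot find x and y inside d, so it falls back to the full free-standing vectors
fit_all     <- lm(y ~ x, data = d)
fit_dropped <- lm(y ~ x, data = d[-13, ])
nobs(fit_dropped)                        # row 13 is still in the fit

# Putting the analysis variables in their own data frame makes the subset take effect
d2 <- data.frame(x = x, y = y)
fit_fixed <- lm(y ~ x, data = d2[-13, ])
nobs(fit_fixed)                          # one observation fewer
```

Comparing `nobs()` or the residual degrees of freedom before and after is a quick way to confirm that a row was really dropped.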
Anova(model1, model_no_outlier)
Anova Table (Type II tests)
Response: sales
Sum Sq Df F value Pr(>F)
emails 118.916 1 63.1172 2.666e-08 ***
knowledge 0.235 1 0.1249 0.7268
mentor 44.812 1 23.7851 5.125e-05 ***
soc_events 0.632 1 0.3356 0.5676
Residuals 47.101 25
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
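Looking at the Anova table above, it reports Type II tests for each term in model1, and its p-values match those in summary(model1). It is therefore not a comparison of the models with and without observation 13. The significant p-values (the ones smaller than 0.05) show that the number of emails and time with mentors are significant predictors of sales, not that removing observation 13 changed anything. Because the two models also had identical coefficients, the decision about observation 13 cannot be made from this output and needs a refit in which the row is actually dropped. I am still not exactly sure what Bonferroni’s p-value is telling me about that point.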
Hey Dr. Outland, I am not really sure what is going on with this portion of class. I promise I took a lot of time for this information however, I do not learn well from text books and we have no videos that we can watch. The lecture during class would be more helpful with visuals (for the visual learners in the class, aka me). I think that maybe I needed to change the non-linear relationship with the social events however, I did not understand that section of the pdf and would love to review this in class. The pdf is a lot more helpful than the textbook, so thank you for creating this for us! However, I need to learn it in class as well (the power of hearing things 3 times). I hope that you and your family are safe in Canada. Please help me learn stats before it’s too late for me. All the best, Mary Mac