Most data in R are stored as a data frame or a tibble. Working with these formats simplifies data analysis and makes many R packages easier to use.
library(readr) # read csv files
library(readxl) # read xls/xlsx files
library(dplyr) # data manipulation
library(knitr) # dynamic reports and tables (kable)
library(tidyverse) # data manipulation and visualization
library(data.table) # fast file reading
library(kableExtra) # nice HTML table formatting
library(gridExtra) # arranging ggplots in a grid
library(caTools) # train/test splitting
library(plotrix) # additional plotting functions
library(MASS) # statistical functions and datasets
We are going to work with a dataset of 9 variables (the dataset is not published).
library(readxl)
E_data <- read_excel("Energy.data.xlsx")
head(E_data,20)
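Since read_excel() returns a tibble, we can quickly confirm the format of the data we just loaded (a minimal check, assuming the file loaded as above):
class(E_data) # should include "tbl_df" (tibble) and "data.frame"
dplyr::glimpse(E_data) # compact overview of every column and its type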
Correlation plots are used to better assess the relationships between the variables.
pairs(E_data,col="red",main="Scatterplot for many variables")
# Calculate correlations between two variables
library(psych)
Attaching package: 'psych'
The following object is masked from 'package:car':
    logit
The following object is masked from 'package:plotrix':
    rescale
The following objects are masked from 'package:ggplot2':
    %+%, alpha
pairs.panels(E_data)
library(corrplot)
library("PerformanceAnalytics")
cor.mat=cor(E_data)# correlation matrix
cor.mat
X1 X2 X3 X4 X5 X6 X7
X1 1.0000000 0.8378028 0.64893581 0.3440950 0.5194892 0.42394055 NA
X2 0.8378028 1.0000000 0.68302585 0.3507297 0.4917172 0.39354441 NA
X3 0.6489358 0.6830259 1.00000000 0.1010862 0.2076625 0.02821228 NA
X4 0.3440950 0.3507297 0.10108615 1.0000000 0.5904182 0.61688107 NA
X5 0.5194892 0.4917172 0.20766253 0.5904182 1.0000000 0.83410613 NA
X6 0.4239406 0.3935444 0.02821228 0.6168811 0.8341061 1.00000000 NA
X7 NA NA NA NA NA NA 1
X8 0.8390428 0.8621822 0.72012935 0.3574015 0.5138736 0.41555722 NA
X9 0.8623118 0.8319266 0.64424337 0.4040246 0.5503853 0.45276020 NA
X8 X9
X1 0.8390428 0.8623118
X2 0.8621822 0.8319266
X3 0.7201293 0.6442434
X4 0.3574015 0.4040246
X5 0.5138736 0.5503853
X6 0.4155572 0.4527602
X7 NA NA
X8 1.0000000 0.8891950
X9 0.8891950 1.0000000
Pay attention to the X7 variable: it contains NA values. Let's check for missing values.
# we observe a problem with the X7 variable
table(is.na(E_data$X7))
FALSE TRUE
2870 1
# if you don't want to print the result in R, wrap the call in invisible()
E_data[!complete.cases(E_data),]# list rows of data that have missing values
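A per-column count of missing values gives a quick overview before dropping rows (a small base-R sketch):
colSums(is.na(E_data)) # number of NA values in each column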
E_data=na.omit(E_data)
table(is.na(E_data$X7))
FALSE
2870
chart.Correlation(E_data,histogram=TRUE, pch=19)
Observe the argument order = "hclust", which orders the variables according to a hierarchical clustering of the correlations.
library(corrplot)
library(RColorBrewer)
M <- cor(E_data) # correlation matrix (all variables are numeric)
M
X1 X2 X3 X4 X5 X6 X7
X1 1.0000000 0.8377607 0.64978651 0.3440001 0.5195171 0.42370461 0.67575871
X2 0.8377607 1.0000000 0.68377292 0.3506462 0.4918643 0.39342335 0.69248074
X3 0.6497865 0.6837729 1.00000000 0.1013726 0.2100569 0.03004593 0.90435406
X4 0.3440001 0.3506462 0.10137261 1.0000000 0.5911886 0.61762445 0.16052836
X5 0.5195171 0.4918643 0.21005694 0.5911886 1.0000000 0.83337441 0.26438586
X6 0.4237046 0.3934233 0.03004593 0.6176245 0.8333744 1.00000000 0.08002069
X7 0.6757587 0.6924807 0.90435406 0.1605284 0.2643859 0.08002069 1.00000000
X8 0.8389888 0.8621470 0.72100067 0.3573106 0.5139133 0.41532732 0.73406616
X9 0.8622618 0.8318836 0.64514506 0.4039408 0.5504120 0.45251378 0.67630348
X8 X9
X1 0.8389888 0.8622618
X2 0.8621470 0.8318836
X3 0.7210007 0.6451451
X4 0.3573106 0.4039408
X5 0.5139133 0.5504120
X6 0.4153273 0.4525138
X7 0.7340662 0.6763035
X8 1.0000000 0.8891565
X9 0.8891565 1.0000000
corrplot(M, method="pie")
corrplot(M, type="upper", order="hclust",col=brewer.pal(n=8, name="RdYlBu"))
library(ggplot2)
library(GGally)
# display a pair plot of all columns of the data
GGally::ggpairs(E_data)
We can use a chi-square test of independence between two variables with the function chisq.test(var1, var2). A p-value < 0.05 suggests a significant association between the variables.
print(chisq.test(E_data$X3,E_data$X7))
Chi-squared approximation may be incorrect
Pearson's Chi-squared test
data: E_data$X3 and E_data$X7
X-squared = 5514821, df = 5153414, p-value < 2.2e-16
print(chisq.test(E_data$X3,E_data$X5))
Chi-squared approximation may be incorrect
Pearson's Chi-squared test
data: E_data$X3 and E_data$X5
X-squared = 6849914, df = 6864368, p-value = 1
print(chisq.test(E_data$X2,E_data$X5))
Chi-squared approximation may be incorrect
Pearson's Chi-squared test
data: E_data$X2 and E_data$X5
X-squared = 7510207, df = 7492288, p-value = 1.869e-06
print(chisq.test(E_data$X1,E_data$X9))
Chi-squared approximation may be incorrect
Pearson's Chi-squared test
data: E_data$X1 and E_data$X9
X-squared = 6812184, df = 6830437, p-value = 1
We will try to build a linear model that explains X3 from X7.
model.1<-lm(X3~X7, data=E_data)
model.1
Call:
lm(formula = X3 ~ X7, data = E_data)
Coefficients:
(Intercept) X7
29.54503 0.03647
The model summary (shown below) provides some of the most important metrics for the model. In particular, its last line reports the overall model confidence or 'goodness of fit': how confident we can be that the model fits the outcome better than a random model. The F-statistic evaluates the model as a whole, with a larger F-statistic indicating a stronger likelihood that the model fits the data better than a random model. More intuitively, perhaps, we also get the p-value for the F-statistic. In this case it is extremely small, so we can conclude that our model has some explanatory/predictive power over and above a random model (to be confirmed with additional tests).
Be careful not to confuse model goodness of fit with R^2. Depending on your sample, it is entirely possible for a model with a low R^2 to have high goodness of fit and vice versa. In such cases additional tests are needed.
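The F-statistic and its degrees of freedom can be extracted from the summary object, and the corresponding p-value computed with pf(); a small sketch (fs is just an illustrative name):
fs <- summary(model.1)$fstatistic # value, numerator df, denominator df
fs
pf(fs[1], fs[2], fs[3], lower.tail = FALSE) # p-value of the overall F-test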
names(model.1)
[1] "coefficients" "residuals" "effects" "rank"
[5] "fitted.values" "assign" "qr" "df.residual"
[9] "xlevels" "call" "terms" "model"
summary(model.1)
Call:
lm(formula = X3 ~ X7, data = E_data)
Residuals:
Min 1Q Median 3Q Max
-241.929 -29.545 -1.919 24.332 258.840
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.955e+01 1.561e+00 18.92 <2e-16 ***
X7 3.647e-02 3.213e-04 113.48 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 48.42 on 2868 degrees of freedom
Multiple R-squared: 0.8179, Adjusted R-squared: 0.8178
F-statistic: 1.288e+04 on 1 and 2868 DF, p-value: < 2.2e-16
model.1$coefficients
(Intercept) X7
29.54503071 0.03646693
## summary of the model
summary(model.1)$coefficients
Estimate Std. Error t value Pr(>|t|)
(Intercept) 29.54503071 1.5614265516 18.92182 2.433762e-75
X7 0.03646693 0.0003213499 113.48043 0.000000e+00
## confidence intervals for coefficients
confint(model.1)# confidence intervals for the coefficients
2.5 % 97.5 %
(Intercept) 26.48339884 32.60666259
X7 0.03583683 0.03709703
## view r-squared
summary(model.1)$r.squared
[1] 0.8178563
## use new data to predict
predict(model.1, list(X7=c(3,4,1,0)))
1 2 3 4
29.65443 29.69090 29.58150 29.54503
# predict with confidence intervals
predict(model.1, list(X7=c(3,4,1,0)),interval = "confidence")
fit lwr upr
1 29.65443 26.59434 32.71452
2 29.69090 26.63132 32.75048
3 29.58150 26.52038 32.64262
4 29.54503 26.48340 32.60666
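Prediction intervals for individual new observations are wider than confidence intervals for the mean response; they are requested in the same way (a brief sketch, output omitted):
# predict with prediction intervals
predict(model.1, list(X7=c(3,4,1,0)), interval = "prediction")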
We can obtain a graphical presentation of the linear model.
predicted_values <- model.1$fitted.values
length(predicted_values)
[1] 2870
true_values <- E_data$X3
length(true_values)
[1] 2870
# plot true values against predicted values
plot(predicted_values, true_values,col="blue")
# or the same if you use
residuals.1 <- model.1$residuals
# plot residuals against predicted values
plot(predicted_values, residuals.1,col="red")
# the residual diagnostic plots can also be obtained by applying plot() to the model
plot(model.1)
The Ljung-Box test, accompanied by the ACF plot (autocorrelation plot), is used to analyze the autocorrelation of the residuals. The null hypothesis is H0: there is no autocorrelation in the residuals (forecast errors) vs. H1: there is autocorrelation. If the p-value of the test is less than 0.05 we reject the null hypothesis and conclude that the residuals are autocorrelated (dependent). In R we can use the Box.test() function to check for autocorrelation in the residuals. For a well-specified model we would like the p-value to be greater than 0.05, so that H0 is not rejected.
library(forecast)
accuracy(model.1) # training-set error measures (ME, RMSE, MAE, MPE, MAPE, MASE)
ME RMSE MAE MPE MAPE MASE
Training set 8.124151e-16 48.40478 34.64952 -Inf Inf 0.3832438
checkresiduals(model.1)
Breusch-Godfrey test for serial correlation of order up to 10
data: Residuals
LM test = 417.21, df = 10, p-value < 2.2e-16
Box.test(x = model.1$residuals)
Box-Pierce test
data: model.1$residuals
X-squared = 128.53, df = 1, p-value < 2.2e-16
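By default Box.test() uses the Box-Pierce statistic; the Ljung-Box variant mentioned above, together with an ACF plot of the residuals, can be obtained as follows (a minimal sketch):
Box.test(model.1$residuals, type = "Ljung-Box") # Ljung-Box test on the residuals
acf(model.1$residuals, main = "ACF of model.1 residuals") # autocorrelation plot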
Next we fit a model with two explanatory variables and plot the residuals against each of them.
model.2<-lm(X3~X7+X8, data=E_data)
model.2
Call:
lm(formula = X3 ~ X7 + X8, data = E_data)
Coefficients:
(Intercept) X7 X8
13.422214 0.032799 0.005905
summary(model.2)
Call:
lm(formula = X3 ~ X7 + X8, data = E_data)
Residuals:
Min 1Q Median 3Q Max
-227.208 -26.357 -0.252 24.146 263.476
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.342e+01 2.141e+00 6.268 4.21e-10 ***
X7 3.280e-02 4.640e-04 70.686 < 2e-16 ***
X8 5.905e-03 5.483e-04 10.769 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 47.48 on 2867 degrees of freedom
Multiple R-squared: 0.8249, Adjusted R-squared: 0.8248
F-statistic: 6755 on 2 and 2867 DF, p-value: < 2.2e-16
plot(model.2)
plot(model.2$fitted.values,residuals(model.2))
# plot residuals vs each variable
plot(E_data$X7,residuals(model.2))
plot(E_data$X8,residuals(model.2))
accuracy(model.2) # training-set error measures (ME, RMSE, MAE, MPE, MAPE, MASE)
ME RMSE MAE MPE MAPE MASE
Training set 1.201009e-17 47.45452 34.23929 -Inf Inf 0.3787064
checkresiduals(model.2)
Breusch-Godfrey test for serial correlation of order up to 10
data: Residuals
LM test = 437.81, df = 10, p-value < 2.2e-16
Box.test(x = model.2$residuals)
Box-Pierce test
data: model.2$residuals
X-squared = 173.98, df = 1, p-value < 2.2e-16
## use new data to predict
model.2
Call:
lm(formula = X3 ~ X7 + X8, data = E_data)
Coefficients:
(Intercept) X7 X8
13.422214 0.032799 0.005905
predict(model.2, list(X7=c(3,4,1,0),X8=c(20,35,13,45)))
1 2 3 4
13.63871 13.76009 13.53178 13.68794
# predict manually using the fitted coefficients
X7.p=c(3,4,1,0)
X8.p=c(20,35,13,45)
model.2.p<-13.422214+0.032799*X7.p+0.005905*X8.p
model.2.p
[1] 13.63871 13.76009 13.53178 13.68794
# predicting with confidence intervals
predict(model.2, list(X7=c(3,4,1,0),X8=c(20,35,13,45)),interval = "confidence")
fit lwr upr
1 13.63871 9.454591 17.82283
2 13.76009 9.587089 17.93309
3 13.53178 9.342619 17.72094
4 13.68794 9.522812 17.85308
Next we test for normality of the residuals.
library(ggpubr)
Attaching package: 'ggpubr'
The following object is masked from 'package:forecast':
    gghistogram
# Density plot
ggdensity(model.1$residuals, fill = "lightgray")
# QQ-plot
ggqqplot(model.1$residuals)
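A formal test can complement these plots; a small sketch using the Shapiro-Wilk test from base R (applicable here because the sample size is below the 5000-observation limit of shapiro.test()):
# H0: the residuals are normally distributed
shapiro.test(model.1$residuals)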
We can also fit a model that includes an interaction between X7 and X5:
interaction_model <- lm(X3 ~ X7 + X5 + X7*X5, data = E_data)
summary(interaction_model)
Call:
lm(formula = X3 ~ X7 + X5 + X7 * X5, data = E_data)
Residuals:
Min 1Q Median 3Q Max
-241.573 -27.418 -2.336 24.310 265.466
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.551e+01 2.426e+00 14.635 < 2e-16 ***
X7 3.588e-02 5.650e-04 63.510 < 2e-16 ***
X5 -3.484e-02 8.836e-03 -3.943 8.23e-05 ***
X7:X5 3.131e-06 1.556e-06 2.012 0.0443 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 48.28 on 2866 degrees of freedom
Multiple R-squared: 0.819, Adjusted R-squared: 0.8188
F-statistic: 4323 on 3 and 2866 DF, p-value: < 2.2e-16
anova(model.1, model.2)
Analysis of Variance Table
Model 1: X3 ~ X7
Model 2: X3 ~ X7 + X8
Res.Df RSS Df Sum of Sq F Pr(>F)
1 2868 6724476
2 2867 6463043 1 261433 115.97 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
We see that the value of the F-statistic is 115.97 and the p-value is extremely low, so we reject the null hypothesis at any reasonable α and conclude that adding X8 significantly improves the model: X7 and X8 together have a useful linear relationship with X3.
anova(model.1, interaction_model)
Analysis of Variance Table
Model 1: X3 ~ X7
Model 2: X3 ~ X7 + X5 + X7 * X5
Res.Df RSS Df Sum of Sq F Pr(>F)
1 2868 6724476
2 2866 6681557 2 42919 9.2048 0.0001036 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
This procedure starts with all possible predictors in the model and then considers how deleting a single predictor affects a chosen metric. Let's try this on E_data. We will use the step() function in R, which by default uses AIC as its metric.
library(faraway)
full.mod = lm(X3 ~ ., data = E_data)
coef(full.mod)
(Intercept) X1 X2 X4 X5
8.265079279 0.012094377 0.091584454 -0.103425252 0.002644293
X6 X7 X8 X9
-0.115752557 0.029941880 0.008496030 -0.003909354
step(full.mod, direction = "backward")# we can give a name to the final model obtained by backward selection
Start: AIC=21940.35
X3 ~ X1 + X2 + X4 + X5 + X6 + X7 + X8 + X9
Df Sum of Sq RSS AIC
- X5 1 146 5959774 21938
- X1 1 1426 5961054 21939
<none> 5959627 21940
- X9 1 11957 5971584 21944
- X4 1 31681 5991308 21954
- X2 1 69145 6028772 21972
- X6 1 101272 6060899 21987
- X8 1 167829 6127456 22018
- X7 1 7379759 13339386 24251
Step: AIC=21938.43
X3 ~ X1 + X2 + X4 + X6 + X7 + X8 + X9
Df Sum of Sq RSS AIC
- X1 1 1462 5961236 21937
<none> 5959774 21938
- X9 1 11833 5971606 21942
- X4 1 31633 5991407 21952
- X2 1 69208 6028981 21970
- X8 1 167721 6127495 22016
- X6 1 190653 6150427 22027
- X7 1 7525732 13485505 24280
Step: AIC=21937.13
X3 ~ X2 + X4 + X6 + X7 + X8 + X9
Df Sum of Sq RSS AIC
<none> 5961236 21937
- X9 1 10381 5971618 21940
- X4 1 32762 5993999 21951
- X2 1 83817 6045054 21975
- X8 1 171400 6132636 22017
- X6 1 189582 6150818 22025
- X7 1 7697434 13658670 24315
Call:
lm(formula = X3 ~ X2 + X4 + X6 + X7 + X8 + X9, data = E_data)
Coefficients:
(Intercept) X2 X4 X6 X7
8.395693 0.095784 -0.104114 -0.111343 0.030017
X8 X9
0.008557 -0.003361
So we start with the full model, which contains all predictor variables from the dataset, and R repeatedly attempts to delete a predictor until it stops or reaches the intercept-only model X3 ~ 1, which contains no predictors.
At each “step”, R reports the current model, its AIC, and the possible steps with their RSS and more importantly AIC.
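By default step() uses AIC (penalty k = 2); BIC-based selection can be obtained by setting the penalty to log(n) (a hedged sketch, reusing full.mod from above):
# backward selection with a BIC penalty instead of AIC
step(full.mod, direction = "backward", k = log(nrow(E_data)))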
This is the exact opposite of backward selection. Here we tell R to start with a model using no predictors, that is X3 ~ 1, and at each step R attempts to add a predictor until it finds a good model or reaches the full model X3 ~ X1 + X2 + X4 + X5 + X6 + X7 + X8 + X9.
forw.mod = step(lm(X3~1,data=E_data),scope = X3 ~ X1 + X2 + X4 + X5 + X6 + X7 + X8 + X9, direction = "forward")
Start: AIC=27158.39
X3 ~ 1
Df Sum of Sq RSS AIC
+ X7 1 30194040 6724476 22273
+ X8 1 19191794 17726722 25055
+ X2 1 17261083 19657433 25352
+ X1 1 15587829 21330688 25586
+ X9 1 15365935 21552581 25616
+ X5 1 1628990 35289527 27031
+ X4 1 379390 36539127 27131
+ X6 1 33328 36885188 27158
<none> 36918516 27158
Step: AIC=22272.9
X3 ~ X7
Df Sum of Sq RSS AIC
+ X8 1 261433 6463043 22161
+ X2 1 234727 6489749 22173
+ X1 1 101559 6622917 22231
+ X9 1 76480 6647996 22242
+ X4 1 72706 6651771 22244
+ X6 1 66550 6657926 22246
+ X5 1 33477 6690999 22261
<none> 6724476 22273
Step: AIC=22161.09
X3 ~ X7 + X8
Df Sum of Sq RSS AIC
+ X6 1 384881 6078162 21987
+ X5 1 245837 6217206 22052
+ X4 1 234527 6228515 22057
+ X9 1 40628 6422415 22145
+ X2 1 27747 6435296 22151
<none> 6463043 22161
+ X1 1 1883 6461160 22162
Step: AIC=21986.88
X3 ~ X7 + X8 + X6
Df Sum of Sq RSS AIC
+ X2 1 69792 6008370 21956
+ X4 1 32052 6046110 21974
+ X1 1 10354 6067808 21984
<none> 6078162 21987
+ X9 1 2611 6075551 21988
+ X5 1 50 6078112 21989
Step: AIC=21955.73
X3 ~ X7 + X8 + X6 + X2
Df Sum of Sq RSS AIC
+ X4 1 36752 5971618 21940
+ X9 1 14371 5993999 21951
<none> 6008370 21956
+ X5 1 415 6007955 21958
+ X1 1 5 6008365 21958
Step: AIC=21940.12
X3 ~ X7 + X8 + X6 + X2 + X4
Df Sum of Sq RSS AIC
+ X9 1 10381.5 5961236 21937
<none> 5971618 21940
+ X5 1 20.0 5971598 21942
+ X1 1 11.4 5971606 21942
Step: AIC=21937.13
X3 ~ X7 + X8 + X6 + X2 + X4 + X9
Df Sum of Sq RSS AIC
<none> 5961236 21937
+ X1 1 1462.44 5959774 21938
+ X5 1 182.53 5961054 21939
Stepwise search checks moves in both directions at every step: it considers the addition of any variable not currently in the model, as well as the removal of any variable currently in the model. Here we perform a stepwise search using AIC as our metric. We start with the model X3 ~ 1 and search up to the full model with all predictors.
step.mod = step(lm(X3~1,data=E_data),scope = X3 ~ X1 + X2 + X4 + X5 + X6 + X7 + X8 + X9, direction = "both")
Start: AIC=27158.39
X3 ~ 1
Df Sum of Sq RSS AIC
+ X7 1 30194040 6724476 22273
+ X8 1 19191794 17726722 25055
+ X2 1 17261083 19657433 25352
+ X1 1 15587829 21330688 25586
+ X9 1 15365935 21552581 25616
+ X5 1 1628990 35289527 27031
+ X4 1 379390 36539127 27131
+ X6 1 33328 36885188 27158
<none> 36918516 27158
Step: AIC=22272.9
X3 ~ X7
Df Sum of Sq RSS AIC
+ X8 1 261433 6463043 22161
+ X2 1 234727 6489749 22173
+ X1 1 101559 6622917 22231
+ X9 1 76480 6647996 22242
+ X4 1 72706 6651771 22244
+ X6 1 66550 6657926 22246
+ X5 1 33477 6690999 22261
<none> 6724476 22273
- X7 1 30194040 36918516 27158
Step: AIC=22161.09
X3 ~ X7 + X8
Df Sum of Sq RSS AIC
+ X6 1 384881 6078162 21987
+ X5 1 245837 6217206 22052
+ X4 1 234527 6228515 22057
+ X9 1 40628 6422415 22145
+ X2 1 27747 6435296 22151
<none> 6463043 22161
+ X1 1 1883 6461160 22162
- X8 1 261433 6724476 22273
- X7 1 11263679 17726722 25055
Step: AIC=21986.88
X3 ~ X7 + X8 + X6
Df Sum of Sq RSS AIC
+ X2 1 69792 6008370 21956
+ X4 1 32052 6046110 21974
+ X1 1 10354 6067808 21984
<none> 6078162 21987
+ X9 1 2611 6075551 21988
+ X5 1 50 6078112 21989
- X6 1 384881 6463043 22161
- X8 1 579764 6657926 22246
- X7 1 8410485 14488647 24478
Step: AIC=21955.73
X3 ~ X7 + X8 + X6 + X2
Df Sum of Sq RSS AIC
+ X4 1 36752 5971618 21940
+ X9 1 14371 5993999 21951
<none> 6008370 21956
+ X5 1 415 6007955 21958
+ X1 1 5 6008365 21958
- X2 1 69792 6078162 21987
- X8 1 178580 6186950 22038
- X6 1 426926 6435296 22151
- X7 1 7695052 13703422 24320
Step: AIC=21940.12
X3 ~ X7 + X8 + X6 + X2 + X4
Df Sum of Sq RSS AIC
+ X9 1 10381 5961236 21937
<none> 5971618 21940
+ X5 1 20 5971598 21942
+ X1 1 11 5971606 21942
- X4 1 36752 6008370 21956
- X2 1 74492 6046110 21974
- X8 1 180004 6151622 22023
- X6 1 205399 6177017 22035
- X7 1 7725700 13697318 24321
Step: AIC=21937.13
X3 ~ X7 + X8 + X6 + X2 + X4 + X9
Df Sum of Sq RSS AIC
<none> 5961236 21937
+ X1 1 1462 5959774 21938
+ X5 1 183 5961054 21939
- X9 1 10381 5971618 21940
- X4 1 32762 5993999 21951
- X2 1 83817 6045054 21975
- X8 1 171400 6132636 22017
- X6 1 189582 6150818 22025
- X7 1 7697434 13658670 24315
Backward, forward, and stepwise search are all useful, but they share an obvious issue: because they do not check every possible model, they can miss the best one. With an extremely large number of predictors this compromise is necessary, since checking every possible model would be time consuming even on current computers.
However, with a reasonably sized dataset, it isn’t too difficult to check all possible models. The regsubsets() function in the R package leaps will help us do it.
library(leaps)
all_mod = summary(regsubsets(X3~ ., data = E_data))
all_mod$which
(Intercept) X1 X2 X4 X5 X6 X7 X8 X9
1 TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
2 TRUE FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE
3 TRUE FALSE FALSE FALSE FALSE TRUE TRUE TRUE FALSE
4 TRUE FALSE TRUE FALSE FALSE TRUE TRUE TRUE FALSE
5 TRUE FALSE TRUE TRUE FALSE TRUE TRUE TRUE FALSE
6 TRUE FALSE TRUE TRUE FALSE TRUE TRUE TRUE TRUE
7 TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE
8 TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
Using $which gives us the best model, according to RSS, for each possible model size, in this case ranging from one to eight predictors. For example, the best model with four predictors would use X2, X6, X7 and X8.
We can obtain the RSS (residual sum of squares) for each of these models using $rss. Notice that the values decrease as the models grow from few to many predictors. We can also look at the adjusted R^2 of the models using $adjr2.
all_mod$rss
[1] 6724476 6463043 6078162 6008370 5971618 5961236 5959774 5959627
all_mod$adjr2
[1] 0.8177928 0.8248155 0.8351904 0.8370260 0.8379663 0.8381915 0.8381747
[8] 0.8381221
which.max(all_mod$adjr2)# model 6 has the largest value of adj-R^2
[1] 6
To calculate AIC and BIC for each of the models with the best RSS, we need both n and p for the largest possible model.
p = length(coef(full.mod))
p
[1] 9
n = length(resid(full.mod))
n
[1] 2870
all_mod_aic = n * log(all_mod$rss / n) + 2 * (2:p)
all_mod_aic
[1] 22272.90 22161.09 21986.88 21955.73 21940.12 21937.13 21938.43 21940.35
# We can then extract the predictors of the model with the best AIC.
best_aic_ind = which.min(all_mod_aic)
# then print the model with best value of AIC
all_mod$which[best_aic_ind,]
(Intercept) X1 X2 X4 X5 X6
TRUE FALSE TRUE TRUE FALSE TRUE
X7 X8 X9
TRUE TRUE TRUE
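BIC can be computed in the same way, replacing the penalty 2 with log(n) (a minimal sketch mirroring the AIC calculation above):
all_mod_bic = n * log(all_mod$rss / n) + log(n) * (2:p)
all_mod_bic
# extract the predictors of the model with the best (lowest) BIC
all_mod$which[which.min(all_mod_bic),]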
library(mctest)
# diagnose the possible overall presence of multicollinearity
multicollinearity<-mctest::omcdiag(model.2)
multicollinearity
Call:
mctest::omcdiag(mod = model.2)
Overall Multicollinearity Diagnostics
MC Results detection
Determinant |X'X|: 0.4611 0
Farrar Chi-Square: 2219.5559 1
Red Indicator: 0.7341 1
Sum of Lambda Inverse: 4.3370 0
Theil's Method: 0.2528 0
Condition Number: 7.2491 0
1 --> COLLINEARITY is detected by the test
0 --> COLLINEARITY is not detected by the test
To evaluate multicollinearity in a multiple regression model, we calculate the variance inflation factor (VIF) from the result of lm(). A VIF greater than 10 strongly suggests multicollinearity.
library(fmsb)
VIF(model.2)
[1] 5.71225
VIF(full.mod)
[1] 6.194769
Another package that computes VIFs is car:
library(car)
#calculate the VIF for each predictor variable in the model
vif(model.2)
X7 X8
2.168507 2.168507
vif(full.mod)
X1 X2 X4 X5 X6 X7 X8 X9
5.002920 4.853446 1.707475 3.901515 4.078889 2.758283 6.992487 6.448949
We can see that the VIFs for X1, X8 and X9 are greater than 5, which is potentially concerning. We can also visualize the VIF values.
#create vector of VIF values
vif_values <- vif(full.mod)
#create horizontal bar chart to display each VIF value
barplot(vif_values, main = "VIF Values", horiz = TRUE, col = "steelblue")
#add vertical line at 5
abline(v = 5, lwd = 3, lty = 2)
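For intuition, the VIF of a predictor equals 1/(1 - R^2) from regressing that predictor on all the other predictors; a small sketch checking this by hand for X8 (r2_x8 is just an illustrative name):
# regress X8 on the remaining predictors (excluding the response X3)
r2_x8 <- summary(lm(X8 ~ . - X3, data = E_data))$r.squared
1/(1 - r2_x8) # should match vif(full.mod)["X8"]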
We can analyze model accuracy and performance based on AIC and the number of parameters.
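For example, base R's AIC() compares fitted models directly, reporting the number of estimated parameters and the AIC of each (a quick sketch using the models fitted above; lower AIC is better):
AIC(model.1, model.2, interaction_model)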
Exercise a) Try to add a quadratic term to your model and analyze it.
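As a starting point, a quadratic term can be specified with I() (a minimal sketch; the choice of X7 is only illustrative):
quad.mod <- lm(X3 ~ X7 + I(X7^2), data = E_data)
summary(quad.mod)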
https://daviddalpiaz.github.io/appliedstats/multiple-linear-regression.html
https://www.r-graph-gallery.com/ggplot2-package.html
https://cran.r-project.org/web/packages/Rcmdr/Rcmdr.pdf
https://www.tutorialspoint.com/r/r_pie_charts.htm
https://www.tutorialspoint.com/r/r_chi_square_tests.htm
https://www.tutorialspoint.com/r/r_analysis_of_covariance.htm
https://www.tutorialspoint.com/r/r_logistic_regression.htm
https://gregor-mathes.netlify.app/2021/01/01/rethinking-chapter-4/
https://www.r-bloggers.com/2013/02/collinearity-and-stepwise-vif-selection/