Loading of R libraries
library(MASS)
library(broom)
library(tidyverse)
library(boot)
library(car)
library(corrplot)
library(magrittr)
The data concern student achievement in secondary education at two Portuguese schools. The attributes include student grades and demographic, social and school-related features, and were collected from school reports and questionnaires. The dataset used here regards the performance of students in Mathematics; a similar dataset, not considered here, covers the subject of the Portuguese language.
The dataset consists of 395 observations of 33 variables, including 30 explanatory variables and 3 dependent variables (G1, G2, G3). G3 is the final year grade (issued at the 3rd period), while G1 and G2 correspond to the 1st and 2nd period grades.
The dataset is available at the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Student+Performance); the description of the data was partially taken from that site.
The main objective of the modelling process is to identify factors that influence students' performance at school. The emphasis will be placed on family relations and on the mother's and father's education and jobs as explanatory variables.
As for the bootstrap method, the goal is to check whether it can reduce the uncertainty of the confidence intervals for the regression coefficients.
The full list of attributes for the student-mat.csv dataset is given at the UCI page linked above; the grades G1-G3 relate to the course subject (here, Mathematics).
The dataset contained in the “student-mat.csv” file is loaded.
dataset<-read.csv("student-mat.csv", header=TRUE, sep=";", stringsAsFactors = TRUE)
For illustrative purposes, and in order to simplify the model, I will focus solely on the final grade for the whole year (G3). I also reduced the number of independent variables from 30 to 4 in order to analyse the impact of parents' education and family relations on student performance. The choice of independent variables was arbitrary. The 4 independent variables selected were: address (home address type: urban or rural), Medu (mother's education, 0-4 scale), Fedu (father's education, 0-4 scale) and famrel (quality of family relationships, 1-5 scale).
dt<-dataset[,c("address","Medu","Fedu","famrel","G3")]
Let us first determine if there are any missing values.
ifelse (sum(is.na(dt))==0, ("There are no missing values"),("The dataset contains missing values"))
## [1] "There are no missing values"
Structure of the data and summary
str(dt)
## 'data.frame': 395 obs. of 5 variables:
## $ address: Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
## $ Medu : int 4 1 1 4 3 4 2 4 3 3 ...
## $ Fedu : int 4 1 1 2 3 3 2 4 2 4 ...
## $ famrel : int 4 5 4 3 4 5 4 4 4 5 ...
## $ G3 : int 6 6 10 15 10 15 11 6 19 15 ...
summary(dt)
## address Medu Fedu famrel G3
## R: 88 Min. :0.000 Min. :0.000 Min. :1.000 Min. : 0.00
## U:307 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:4.000 1st Qu.: 8.00
## Median :3.000 Median :2.000 Median :4.000 Median :11.00
## Mean :2.749 Mean :2.522 Mean :3.944 Mean :10.42
## 3rd Qu.:4.000 3rd Qu.:3.000 3rd Qu.:5.000 3rd Qu.:14.00
## Max. :4.000 Max. :4.000 Max. :5.000 Max. :20.00
Let us look at the correlations between the numerical variables (this is not entirely rigorous, since these are ordinal rather than continuous variables).
corrplot::corrplot(cor(dt[,c(2,3,4,5)]), method="number")
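Since Medu, Fedu, famrel and G3 are ordinal, a rank-based (Spearman) correlation is arguably a more defensible choice; a quick sketch of the same plot with Spearman correlations:
# Spearman (rank) correlation is better suited to ordered variables
corrplot::corrplot(cor(dt[, c("Medu", "Fedu", "famrel", "G3")], method = "spearman"),
                   method = "number")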
Let us transform the ordered variables from integers into factors.
dt$Medu<-as.factor(dt$Medu)
dt$Fedu<-as.factor(dt$Fedu)
dt$famrel<-as.factor(dt$famrel)
Let us eliminate observations belonging to factor levels with only a few occurrences, as such sparse levels can negatively affect the model's performance.
apply(dt,2, table)
## $address
##
## R U
## 88 307
##
## $Medu
##
## 0 1 2 3 4
## 3 59 103 99 131
##
## $Fedu
##
## 0 1 2 3 4
## 2 82 115 100 96
##
## $famrel
##
## 1 2 3 4 5
## 8 18 68 195 106
##
## $G3
##
## 0 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## 38 1 7 15 9 32 28 56 47 31 31 27 33 16 6 12 5 1
Somewhat arbitrarily, I decided to delete all observations belonging to a factor level of an explanatory variable with fewer than 8 occurrences (i.e. level 0 of Medu and Fedu, with 3 and 2 occurrences respectively).
dt<-dt[-c(which(dt$Medu==0),which(dt$Fedu==0)),]
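For reference, the same filtering could be written more generally instead of hard-coding the sparse levels (a sketch, not run here; min_n, sparse, keep and dt_filtered are illustrative names, and the 8-occurrence threshold is the one chosen above):
# drop rows whose Medu, Fedu or famrel level occurs fewer than 8 times,
# then remove the now-empty factor levels
min_n  <- 8
sparse <- lapply(dt[, c("Medu", "Fedu", "famrel")],
                 function(x) names(which(table(x) < min_n)))
keep   <- !(dt$Medu %in% sparse$Medu |
            dt$Fedu %in% sparse$Fedu |
            dt$famrel %in% sparse$famrel)
dt_filtered <- droplevels(dt[keep, ])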
model_lm<-lm(G3~address+Medu+Fedu+famrel, data=dt)
summary(model_lm)
##
## Call:
## lm(formula = G3 ~ address + Medu + Fedu + famrel, data = dt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.1821 -1.9282 0.6018 2.9783 9.0653
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.80430 1.75598 4.444 1.16e-05 ***
## addressU 0.69868 0.55507 1.259 0.20891
## Medu2 0.94520 0.77939 1.213 0.22599
## Medu3 1.51162 0.82875 1.824 0.06895 .
## Medu4 2.89466 0.89629 3.230 0.00135 **
## Fedu2 0.32706 0.69914 0.468 0.64019
## Fedu3 0.13100 0.77815 0.168 0.86640
## Fedu4 0.33853 0.85784 0.395 0.69334
## famrel2 -0.49859 1.91363 -0.261 0.79458
## famrel3 -0.04003 1.69075 -0.024 0.98112
## famrel4 0.15942 1.62751 0.098 0.92202
## famrel5 0.65342 1.65418 0.395 0.69306
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.492 on 378 degrees of freedom
## Multiple R-squared: 0.06896, Adjusted R-squared: 0.04187
## F-statistic: 2.545 on 11 and 378 DF, p-value: 0.004097
plot(model_lm)
plot(model_lm, 4)
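The Q-Q plot in the diagnostics above can be complemented with a formal normality test of the residuals (a quick sketch; shapiro.test comes from the base stats package):
# Shapiro-Wilk test of residual normality; a small p-value indicates a departure from normality
shapiro.test(residuals(model_lm))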
Overall, the model seems to be in line with the assumptions of linear regression. The only problem is the lack of full compliance with the assumption of normality of residuals. For now, I decided to keep the model as it is, especially since that assumption is not critical for the present modelling purposes. Additionally, I decided to remove outliers from the dataset, identified by high Cook's distance values (any observation with a Cook's distance above an arbitrarily set threshold of 0.02).
model_lm_aug<-broom::augment(model_lm)
model_lm_aug<-cbind(nr=c(1:390),model_lm_aug)
head(dplyr::select(dplyr::arrange(model_lm_aug,desc(.cooksd)),nr,.rownames,.cooksd),10)
## nr .rownames .cooksd
## 1 385 390 0.05978726
## 2 139 141 0.04004013
## 3 293 297 0.03781713
## 4 220 223 0.02426498
## 5 149 151 0.02377933
## 6 169 171 0.01860045
## 7 133 135 0.01724869
## 8 135 137 0.01724869
## 9 290 294 0.01691917
## 10 387 392 0.01603881
dt2<-dt[-c(385,139,293,220,149),]
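The same rows could also be selected programmatically from the Cook's distances instead of hard-coding their positions (a sketch; dt2_alt is an illustrative name, and with the 0.02 threshold it picks out the same five observations):
# keep only observations whose Cook's distance does not exceed the 0.02 threshold
dt2_alt <- dt[cooks.distance(model_lm) <= 0.02, ]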
model_lm2<-lm(G3~address+Medu+Fedu+famrel, data=dt2)
summary(model_lm2)
##
## Call:
## lm(formula = G3 ~ address + Medu + Fedu + famrel, data = dt2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.2928 -1.7949 0.6266 2.7950 9.2051
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.273956 1.837329 5.048 7.01e-07 ***
## addressU 0.860879 0.542187 1.588 0.113
## Medu2 0.591538 0.765026 0.773 0.440
## Medu3 1.182193 0.814294 1.452 0.147
## Medu4 2.739709 0.880411 3.112 0.002 **
## Fedu2 0.153090 0.682518 0.224 0.823
## Fedu3 0.001853 0.762872 0.002 0.998
## Fedu4 0.180384 0.837858 0.215 0.830
## famrel2 -0.116684 2.030272 -0.057 0.954
## famrel3 -1.292460 1.751083 -0.738 0.461
## famrel4 -1.084602 1.692516 -0.641 0.522
## famrel5 -0.583551 1.716357 -0.340 0.734
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.371 on 373 degrees of freedom
## Multiple R-squared: 0.07647, Adjusted R-squared: 0.04924
## F-statistic: 2.808 on 11 and 373 DF, p-value: 0.001568
plot(model_lm2)
plot(model_lm2, 4)
Removing the outliers with high Cook's distance values increased the R-squared from 6.9% to 7.6%, which is a noticeable improvement. The Medu4 coefficient remains significant at the 0.01 level.
Now let us take a look at the 95% confidence intervals for the regression coefficients based on the simple linear regression model.
confint_lm<-confint(model_lm2, level = 0.95)
confint_lm
## 2.5 % 97.5 %
## (Intercept) 5.6611343 12.886777
## addressU -0.2052468 1.927004
## Medu2 -0.9127669 2.095843
## Medu3 -0.4189888 2.783375
## Medu4 1.0085188 4.470900
## Fedu2 -1.1889750 1.495155
## Fedu3 -1.4982163 1.501922
## Fedu4 -1.4671335 1.827902
## famrel2 -4.1088979 3.875530
## famrel3 -4.7356911 2.150771
## famrel4 -4.4126711 2.243467
## famrel5 -3.9585001 2.791399
I will now try to improve these 95% confidence intervals with the bootstrap method. The goal of the bootstrapping is to reduce the uncertainty around the regression coefficients and consequently narrow their confidence intervals.
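car::Boot() with method="case" resamples whole observations (rows) with replacement and refits the regression on each resample. A minimal sketch of the same idea written directly with the boot package (illustrative only; coef_fun and boot_manual are made-up names, and all results below come from car::Boot):
# statistic for boot::boot(): refit the model on the resampled rows and return its coefficients;
# if a rare factor level happens to be missing from a resample, the result is padded with NAs
# so that every replicate has the same length (incomplete replicates also occur with car::Boot,
# which is why R = 1997 rather than 1999 appears below and why na.omit() is used later on)
coef_fun <- function(data, indices) {
  fit <- lm(G3 ~ address + Medu + Fedu + famrel, data = data[indices, ])
  out <- rep(NA_real_, length(coef(model_lm2)))
  names(out) <- names(coef(model_lm2))
  out[names(coef(fit))] <- coef(fit)
  out
}
boot_manual <- boot::boot(data = dt2, statistic = coef_fun, R = 1999)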
model_lm_boot<-car::Boot(model_lm2, R=1999, method="case")
## Loading required namespace: boot
summary(model_lm_boot)
##
## Number of bootstrap replications R = 1997
## original bootBias bootSE bootMed
## (Intercept) 9.273956 -0.00637955 1.09096 9.255222
## addressU 0.860879 -0.00346421 0.54456 0.842666
## Medu2 0.591538 0.00826300 0.77262 0.608297
## Medu3 1.182193 -0.00665120 0.83903 1.175890
## Medu4 2.739709 0.00880206 0.86211 2.744813
## Fedu2 0.153090 0.01514570 0.72188 0.191842
## Fedu3 0.001853 0.00231289 0.74672 -0.023786
## Fedu4 0.180384 -0.01018358 0.85819 0.180738
## famrel2 -0.116684 -0.00471488 1.19694 -0.119610
## famrel3 -1.292460 0.01075080 0.97817 -1.257268
## famrel4 -1.084602 -0.00012461 0.82503 -1.063558
## famrel5 -0.583551 0.01586231 0.88412 -0.562816
b<-model_lm$coefficients # coefficients of the initial model (fitted before outlier removal)
b_boot<-apply(na.omit(model_lm_boot$t), 2, mean) # mean of the complete bootstrap coefficient estimates
Let us now compare the mean bootstrap coefficient estimates with the coefficients of the initial model (fitted before outlier removal), by looking at the relative differences in absolute terms.
abs((b_boot-b)/b)
## (Intercept) addressU Medu2 Medu3 Medu4 Fedu2
## 0.1874962 0.2271937 0.3654233 0.2223273 0.0504899 0.4856140
## Fedu3 Fedu4 famrel2 famrel3 famrel4 famrel5
## 0.9681991 0.4972346 0.7565156 31.0194865 7.8042914 1.8687929
Now let us look at the 95% confidence intervals after applying the bootstrap method.
confint_lm_boot<-confint(model_lm_boot, level = 0.95, type="norm")
confint_lm_boot
## Bootstrap normal confidence intervals
##
## 2.5 % 97.5 %
## (Intercept) 7.1420972 11.4185734
## addressU -0.2029667 1.9316527
## Medu2 -0.9310266 2.0975765
## Medu3 -0.4556318 2.8333206
## Medu4 1.0411949 4.4206196
## Fedu2 -1.2769119 1.5528001
## Fedu3 -1.4640087 1.4630889
## Fedu4 -1.4914626 1.8725982
## famrel2 -2.4579226 2.2339846
## famrel3 -3.2203879 0.6139665
## famrel4 -2.7015011 0.5325464
## famrel5 -2.3322591 1.1334332
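The confint() method that car provides for Boot objects also supports other interval types, for instance percentile and bias-corrected accelerated (BCa) intervals (a sketch, not evaluated here):
# percentile and BCa bootstrap intervals computed from the same replicates
confint(model_lm_boot, level = 0.95, type = "perc")
confint(model_lm_boot, level = 0.95, type = "bca")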
ci_lm<-as.data.frame(confint_lm)
ci_lm_boot<-as.data.frame(confint_lm_boot)
colnames(ci_lm)<-colnames(ci_lm_boot)<-c("lower", "upper")
ci_lm$variables<-ci_lm_boot$variables<-rownames(ci_lm)
ggplot2::ggplot() +
  ggplot2::geom_errorbar(data = ci_lm_boot,
                         mapping = ggplot2::aes(x = variables, ymin = lower, ymax = upper),
                         col = "red", size = 1.5) +
  ggplot2::geom_errorbar(data = ci_lm,
                         mapping = ggplot2::aes(x = variables, ymin = lower, ymax = upper),
                         col = "blue", size = 1.5) +
  ggplot2::theme_minimal() +
  ggplot2::labs(title = "95% confidence intervals in simple linear (blue) & bootstrap regressions (red)",
                x = "explanatory variables", y = "coefficient level")
Judging from the plot above, the bootstrap did produce noticeably narrower 95% confidence intervals for some regression coefficients (the intercept and famrel2, famrel3, famrel4 and famrel5). Unfortunately, all of these intervals except the intercept's still cross zero, so the corresponding coefficients remain non-significant. For the other variables (Medu, Fedu, address) the red lines practically coincide with the blue lines, which indicates that bootstrapping did not lead to more precise confidence intervals for them.
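To back the visual impression with numbers, the interval widths can be compared directly (a short sketch using the ci_lm and ci_lm_boot data frames created above; width_ratio is an illustrative name):
# ratio of bootstrap CI width to classical CI width; values below 1 mean the bootstrap interval is narrower
width_ratio <- (ci_lm_boot$upper - ci_lm_boot$lower) / (ci_lm$upper - ci_lm$lower)
round(setNames(width_ratio, ci_lm$variables), 2)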
hist(model_lm_boot, legend="separate")
It looks like all the regression coefficients are (more or less) normally distributed.
The bootstrap method did slightly improve the precision of the estimated regression coefficients. In particular, for famrel2, famrel3, famrel4 and famrel5 the red bars are much shorter than the blue bars obtained from the simple linear regression. However, the only coefficient significant at the 95% level remains Medu4; for all the other variables the 95% confidence interval crosses zero, so their estimates are not significant. Unfortunately, for Medu4 the bootstrap did not narrow the 95% confidence interval at all.