Loading of R libraries
library(MASS)
library(broom)
library(tidyverse)
library(boot)
library(car)
library(corrplot)
library(magrittr)
The data concern student achievement in secondary education at two Portuguese schools. The attributes include student grades and demographic, social and school-related features, and were collected from school reports and questionnaires. The dataset used here regards the performance of students in Mathematics; a similar dataset, not considered here, covers the subject of the Portuguese language.
The dataset consists of 395 observations of 33 variables, including 30 explanatory variables and 3 dependent variables (G1, G2, G3). G3 is the final year grade (issued at the 3rd period), while G1 and G2 correspond to the 1st and 2nd period grades.
The dataset is available at the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Student+Performance); the description of the data was partially taken from that site.
The main objective of the modelling process is to identify factors that influence students' performance at school. The emphasis will be placed on family relations and on the mother's and father's education and jobs as explanatory variables.
As for the bootstrap method, the goal is to check whether it can reduce the uncertainty of the confidence intervals for the regression coefficients.
The full list of attributes for the student-mat.csv dataset is given at the UCI page linked above; the grades G1-G3 relate to the course subject (here, Mathematics).
The dataset contained in the “student-mat.csv” file is loaded.
dataset<-read.csv("student-mat.csv", header=TRUE, sep=";", stringsAsFactors = TRUE)
For illustrative purposes, and in order to simplify the model, I will focus solely on the final grade for the whole year (G3). I also reduced the number of independent variables from 30 to 4 in order to analyse the impact of parents' education and family relations on student performance. The choice of independent variables was arbitrary. The 4 independent variables selected were: address (home address type: urban or rural), Medu (mother's education, 0-4 scale), Fedu (father's education, 0-4 scale) and famrel (quality of family relationships, 1-5 scale).
dt<-dataset[,c("address","Medu","Fedu","famrel","G3")]
Let us first determine if there are any missing values.
ifelse (sum(is.na(dt))==0, ("There are no missing values"),("The dataset contains missing values"))
## [1] "There are no missing values"
Structure of the data and summary
str(dt)
## 'data.frame': 395 obs. of 5 variables:
## $ address: Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
## $ Medu : int 4 1 1 4 3 4 2 4 3 3 ...
## $ Fedu : int 4 1 1 2 3 3 2 4 2 4 ...
## $ famrel : int 4 5 4 3 4 5 4 4 4 5 ...
## $ G3 : int 6 6 10 15 10 15 11 6 19 15 ...
summary(dt)
## address Medu Fedu famrel G3
## R: 88 Min. :0.000 Min. :0.000 Min. :1.000 Min. : 0.00
## U:307 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:4.000 1st Qu.: 8.00
## Median :3.000 Median :2.000 Median :4.000 Median :11.00
## Mean :2.749 Mean :2.522 Mean :3.944 Mean :10.42
## 3rd Qu.:4.000 3rd Qu.:3.000 3rd Qu.:5.000 3rd Qu.:14.00
## Max. :4.000 Max. :4.000 Max. :5.000 Max. :20.00
Let us look at the correlations between the numerical variables (this is not entirely rigorous, since these are ordinal rather than continuous variables).
corrplot::corrplot(cor(dt[,c(2,3,4,5)]), method="number")
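Since Medu, Fedu, famrel and G3 are ordinal, a rank-based (Spearman) correlation is arguably a more defensible choice; a quick sketch of the same plot with Spearman correlations:
# Spearman (rank) correlation is better suited to ordered variables
corrplot::corrplot(cor(dt[, c("Medu", "Fedu", "famrel", "G3")], method = "spearman"),
                   method = "number")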
Let us transform the ordered variables from integers into factors.
dt$Medu<-as.factor(dt$Medu)
dt$Fedu<-as.factor(dt$Fedu)
dt$famrel<-as.factor(dt$famrel)
Let us eliminate observations belonging to factor levels with only a few occurrences, as such sparse levels can negatively affect the model's performance.
apply(dt,2, table)
## $address
##
## R U
## 88 307
##
## $Medu
##
## 0 1 2 3 4
## 3 59 103 99 131
##
## $Fedu
##
## 0 1 2 3 4
## 2 82 115 100 96
##
## $famrel
##
## 1 2 3 4 5
## 8 18 68 195 106
##
## $G3
##
## 0 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## 38 1 7 15 9 32 28 56 47 31 31 27 33 16 6 12 5 1
Somewhat arbitrarily, I decided to delete all observations belonging to a factor level of an explanatory variable with fewer than 8 occurrences (i.e. level 0 of Medu and Fedu, with 3 and 2 occurrences respectively).
dt<-dt[-c(which(dt$Medu==0),which(dt$Fedu==0)),]
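For reference, the same filtering could be written more generally instead of hard-coding the sparse levels (a sketch, not run here; min_n, sparse, keep and dt_filtered are illustrative names, and the 8-occurrence threshold is the one chosen above):
# drop rows whose Medu, Fedu or famrel level occurs fewer than 8 times,
# then remove the now-empty factor levels
min_n  <- 8
sparse <- lapply(dt[, c("Medu", "Fedu", "famrel")],
                 function(x) names(which(table(x) < min_n)))
keep   <- !(dt$Medu %in% sparse$Medu |
            dt$Fedu %in% sparse$Fedu |
            dt$famrel %in% sparse$famrel)
dt_filtered <- droplevels(dt[keep, ])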
model_lm<-lm(G3~address+Medu+Fedu+famrel, data=dt)
summary(model_lm)
##
## Call:
## lm(formula = G3 ~ address + Medu + Fedu + famrel, data = dt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.1821 -1.9282 0.6018 2.9783 9.0653
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.80430 1.75598 4.444 1.16e-05 ***
## addressU 0.69868 0.55507 1.259 0.20891
## Medu2 0.94520 0.77939 1.213 0.22599
## Medu3 1.51162 0.82875 1.824 0.06895 .
## Medu4 2.89466 0.89629 3.230 0.00135 **
## Fedu2 0.32706 0.69914 0.468 0.64019
## Fedu3 0.13100 0.77815 0.168 0.86640
## Fedu4 0.33853 0.85784 0.395 0.69334
## famrel2 -0.49859 1.91363 -0.261 0.79458
## famrel3 -0.04003 1.69075 -0.024 0.98112
## famrel4 0.15942 1.62751 0.098 0.92202
## famrel5 0.65342 1.65418 0.395 0.69306
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.492 on 378 degrees of freedom
## Multiple R-squared: 0.06896, Adjusted R-squared: 0.04187
## F-statistic: 2.545 on 11 and 378 DF, p-value: 0.004097
plot(model_lm)
plot(model_lm, 4)
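The Q-Q plot in the diagnostics above can be complemented with a formal normality test of the residuals (a quick sketch; shapiro.test comes from the base stats package):
# Shapiro-Wilk test of residual normality; a small p-value indicates a departure from normality
shapiro.test(residuals(model_lm))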
Overall, the model seems to be in line with the assumptions of linear regression. The only problem is the lack of full compliance with the assumption of normality of residuals. For now, I decided to keep the model as it is, especially since that assumption is not critical for the present modelling purposes. Additionally, I decided to remove outliers from the dataset, identified by high Cook's distance values (any observation with a Cook's distance above an arbitrarily set threshold of 0.02).
model_lm_aug<-broom::augment(model_lm)
model_lm_aug<-cbind(nr=c(1:390),model_lm_aug)
head(dplyr::select(dplyr::arrange(model_lm_aug,desc(.cooksd)),nr,.rownames,.cooksd),10)
## nr .rownames .cooksd
## 1 385 390 0.05978726
## 2 139 141 0.04004013
## 3 293 297 0.03781713
## 4 220 223 0.02426498
## 5 149 151 0.02377933
## 6 169 171 0.01860045
## 7 133 135 0.01724869
## 8 135 137 0.01724869
## 9 290 294 0.01691917
## 10 387 392 0.01603881
dt2<-dt[-c(385,139,293,220,149),]
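The same rows could also be selected programmatically from the Cook's distances instead of hard-coding their positions (a sketch; dt2_alt is an illustrative name, and with the 0.02 threshold it picks out the same five observations):
# keep only observations whose Cook's distance does not exceed the 0.02 threshold
dt2_alt <- dt[cooks.distance(model_lm) <= 0.02, ]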
model_lm2<-lm(G3~address+Medu+Fedu+famrel, data=dt2)
summary(model_lm2)
##
## Call:
## lm(formula = G3 ~ address + Medu + Fedu + famrel, data = dt2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.2928 -1.7949 0.6266 2.7950 9.2051
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.273956 1.837329 5.048 7.01e-07 ***
## addressU 0.860879 0.542187 1.588 0.113
## Medu2 0.591538 0.765026 0.773 0.440
## Medu3 1.182193 0.814294 1.452 0.147
## Medu4 2.739709 0.880411 3.112 0.002 **
## Fedu2 0.153090 0.682518 0.224 0.823
## Fedu3 0.001853 0.762872 0.002 0.998
## Fedu4 0.180384 0.837858 0.215 0.830
## famrel2 -0.116684 2.030272 -0.057 0.954
## famrel3 -1.292460 1.751083 -0.738 0.461
## famrel4 -1.084602 1.692516 -0.641 0.522
## famrel5 -0.583551 1.716357 -0.340 0.734
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.371 on 373 degrees of freedom
## Multiple R-squared: 0.07647, Adjusted R-squared: 0.04924
## F-statistic: 2.808 on 11 and 373 DF, p-value: 0.001568
plot(model_lm2)
plot(model_lm2, 4)
Removing the outliers with high Cook's distance values increased the R-squared from 6.9% to 7.6%, which is a noticeable improvement. The Medu4 coefficient remains significant at the 0.01 level.
Now let us take a look at the 95% confidence intervals for the regression coefficients based on the simple linear regression model.
confint_lm<-confint(model_lm2, level = 0.95)
confint_lm
## 2.5 % 97.5 %
## (Intercept) 5.6611343 12.886777
## addressU -0.2052468 1.927004
## Medu2 -0.9127669 2.095843
## Medu3 -0.4189888 2.783375
## Medu4 1.0085188 4.470900
## Fedu2 -1.1889750 1.495155
## Fedu3 -1.4982163 1.501922
## Fedu4 -1.4671335 1.827902
## famrel2 -4.1088979 3.875530
## famrel3 -4.7356911 2.150771
## famrel4 -4.4126711 2.243467
## famrel5 -3.9585001 2.791399
I will now try to improve these 95% confidence intervals with the bootstrap method. The goal of the bootstrapping is to reduce the uncertainty around the regression coefficients and consequently narrow their confidence intervals.
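car::Boot() with method="case" resamples whole observations (rows) with replacement and refits the regression on each resample. A minimal sketch of the same idea written directly with the boot package (illustrative only; coef_fun and boot_manual are made-up names, and all results below come from car::Boot):
# statistic for boot::boot(): refit the model on the resampled rows and return its coefficients;
# if a rare factor level happens to be missing from a resample, the result is padded with NAs
# so that every replicate has the same length (incomplete replicates also occur with car::Boot,
# which is why R = 1997 rather than 1999 appears below and why na.omit() is used later on)
coef_fun <- function(data, indices) {
  fit <- lm(G3 ~ address + Medu + Fedu + famrel, data = data[indices, ])
  out <- rep(NA_real_, length(coef(model_lm2)))
  names(out) <- names(coef(model_lm2))
  out[names(coef(fit))] <- coef(fit)
  out
}
boot_manual <- boot::boot(data = dt2, statistic = coef_fun, R = 1999)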
model_lm_boot<-car::Boot(model_lm2, R=1999, method="case")
## Loading required namespace: boot
summary(model_lm_boot)
##
## Number of bootstrap replications R = 1997
## original bootBias bootSE bootMed
## (Intercept) 9.273956 -0.00637955 1.09096 9.255222
## addressU 0.860879 -0.00346421 0.54456 0.842666
## Medu2 0.591538 0.00826300 0.77262 0.608297
## Medu3 1.182193 -0.00665120 0.83903 1.175890
## Medu4 2.739709 0.00880206 0.86211 2.744813
## Fedu2 0.153090 0.01514570 0.72188 0.191842
## Fedu3 0.001853 0.00231289 0.74672 -0.023786
## Fedu4 0.180384 -0.01018358 0.85819 0.180738
## famrel2 -0.116684 -0.00471488 1.19694 -0.119610
## famrel3 -1.292460 0.01075080 0.97817 -1.257268
## famrel4 -1.084602 -0.00012461 0.82503 -1.063558
## famrel5 -0.583551 0.01586231 0.88412 -0.562816
b<-model_lm$coefficients # coefficients of the initial model (fitted before outlier removal)
b_boot<-apply(na.omit(model_lm_boot$t), 2, mean) # mean of the complete bootstrap coefficient estimates
Let us now compare the mean bootstrap coefficient estimates with the coefficients of the initial model (fitted before outlier removal), by looking at the relative differences in absolute terms.
abs((b_boot-b)/b)
## (Intercept) addressU Medu2 Medu3 Medu4 Fedu2
## 0.1874962 0.2271937 0.3654233 0.2223273 0.0504899 0.4856140
## Fedu3 Fedu4 famrel2 famrel3 famrel4 famrel5
## 0.9681991 0.4972346 0.7565156 31.0194865 7.8042914 1.8687929
Now let us look at the 95% confidence intervals after applying the bootstrap method.
confint_lm_boot<-confint(model_lm_boot, level = 0.95, type="norm")
confint_lm_boot
## Bootstrap normal confidence intervals
##
## 2.5 % 97.5 %
## (Intercept) 7.1420972 11.4185734
## addressU -0.2029667 1.9316527
## Medu2 -0.9310266 2.0975765
## Medu3 -0.4556318 2.8333206
## Medu4 1.0411949 4.4206196
## Fedu2 -1.2769119 1.5528001
## Fedu3 -1.4640087 1.4630889
## Fedu4 -1.4914626 1.8725982
## famrel2 -2.4579226 2.2339846
## famrel3 -3.2203879 0.6139665
## famrel4 -2.7015011 0.5325464
## famrel5 -2.3322591 1.1334332
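The confint() method that car provides for Boot objects also supports other interval types, for instance percentile and bias-corrected accelerated (BCa) intervals (a sketch, not evaluated here):
# percentile and BCa bootstrap intervals computed from the same replicates
confint(model_lm_boot, level = 0.95, type = "perc")
confint(model_lm_boot, level = 0.95, type = "bca")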
ci_lm<-as.data.frame(confint_lm)
ci_lm_boot<-as.data.frame(confint_lm_boot)
colnames(ci_lm)<-colnames(ci_lm_boot)<-c("lower", "upper")
ci_lm$variables<-ci_lm_boot$variables<-rownames(ci_lm)
ggplot2::ggplot() +
  ggplot2::geom_errorbar(data = ci_lm_boot,
                         mapping = ggplot2::aes(x = variables, ymin = lower, ymax = upper),
                         col = "red", size = 1.5) +
  ggplot2::geom_errorbar(data = ci_lm,
                         mapping = ggplot2::aes(x = variables, ymin = lower, ymax = upper),
                         col = "blue", size = 1.5) +
  ggplot2::theme_minimal() +
  ggplot2::labs(title = "95% confidence intervals in simple linear (blue) & bootstrap regressions (red)",
                x = "explanatory variables", y = "coefficient level")
Judging from the plot above, the bootstrap did produce noticeably narrower 95% confidence intervals for some regression coefficients (the intercept and famrel2, famrel3, famrel4 and famrel5). Unfortunately, all of these intervals except the intercept's still cross zero, so the corresponding coefficients remain non-significant. For the other variables (Medu, Fedu, address) the red lines practically coincide with the blue lines, which indicates that bootstrapping did not lead to more precise confidence intervals for them.
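To back the visual impression with numbers, the interval widths can be compared directly (a short sketch using the ci_lm and ci_lm_boot data frames created above; width_ratio is an illustrative name):
# ratio of bootstrap CI width to classical CI width; values below 1 mean the bootstrap interval is narrower
width_ratio <- (ci_lm_boot$upper - ci_lm_boot$lower) / (ci_lm$upper - ci_lm$lower)
round(setNames(width_ratio, ci_lm$variables), 2)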
hist(model_lm_boot, legend="separate")
It looks like all the regression coefficients are (more or less) normally distributed.
The bootstrap method did slightly improve the precision of the estimated regression coefficients. In particular, for famrel2, famrel3, famrel4 and famrel5 the red bars are much shorter than the blue bars obtained from the simple linear regression. However, the only coefficient significant at the 95% level remains Medu4; for all the other variables the 95% confidence interval crosses zero, so their estimates are not significant. Unfortunately, for Medu4 the bootstrap did not narrow the 95% confidence interval at all.