setwd("C:/Users/avaan/OneDrive/Desktop")
library(tidyverse)
library(psych)
library(pander)
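# Read the semicolon-delimited student data
d1 <- read.table("student.csv", sep = ";", header = TRUE)

# Keep the variables used in this analysis; G3 is the final math grade
data <- select(d1, "age", "famsize", "Medu", "Fedu", "traveltime", "studytime", "failures", "famrel", "freetime", "goout", "Dalc", "Walc", "health", "absences", "G3")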

Data Description

This data set was obtained from https://data.world.com. It contains information on students from two schools in Portugal, covering their habits and lives outside of school, and is used here to see what impact these external factors might have on final grades in mathematics. The data were collected through school surveys.

Questions Regarding Data

Pairwise Scatterplots

# Columns 1, 5 and 6 (age, traveltime, studytime) are dropped, leaving 12 variables
pairs.panels(data[, -c(1,5,6)], pch=21, main="Pairwise scatter plot of the selected variables")

Absences <- data$absences
Grade <- data$G3
plot(Absences, Grade, pch = 21, col ="navy",
     main = "Relationship between Absences and Final Math Grade")

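The pairwise scatter plot is used to look for an explanatory variable for final grade (the response variable). The number of school absences is the explanatory variable examined in the rest of the assignment. Any linear association between the two is weak: the fitted slope reported in the regression table later in this report is slightly positive (0.0196).

Simple Linear Regression (SLR)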

plot(Absences, Grade, pch = 21, col ="navy",
     main = "Relationship between Absences and Final Math Grade")

parametric.model <- lm(Grade ~ Absences)
par(mfrow = c(2,2))
plot(parametric.model)

The scatter plot shows only a weak linear relationship between the number of absences a student has and their final math grade; the fitted slope is slightly positive (0.0196). In the diagnostic plots, the observations do not form distinct clusters, but most are clumped toward the left. The normal Q-Q plot (top right) shows that the residuals violate the normality assumption. The Residuals vs Leverage plot (bottom right) shows one extreme point at the far right, which is a high-leverage observation. The Residuals vs Fitted plot (top left) also has a slight negative linear trend.
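The visual checks above can be backed up with numbers. The sketch below is only an illustration on simulated stand-in data (the vectors `absences_sim` and `grade_sim` are assumptions, not the student data): it runs a Shapiro-Wilk test on the residuals and uses hat values and Cook's distance to locate high-leverage or influential observations.

```r
# Simulated stand-in data (an assumption, not the student data):
# right-skewed absence counts and grades on a 0-20 scale
set.seed(123)
n <- 200
absences_sim <- rpois(n, lambda = 5)
absences_sim[1] <- 75   # one extreme absence count to mimic a high-leverage point
grade_sim <- pmin(pmax(round(10 + 0.02 * absences_sim + rnorm(n, sd = 4)), 0), 20)

fit_sim <- lm(grade_sim ~ absences_sim)

# Shapiro-Wilk test of residual normality; a small p-value suggests non-normal residuals
sw <- shapiro.test(residuals(fit_sim))
print(sw$p.value)

# Leverage (hat values) measures how unusual each x value is;
# Cook's distance measures how much each observation moves the fit
lev <- hatvalues(fit_sim)
cooks <- cooks.distance(fit_sim)
print(which.max(lev))                            # index of the highest-leverage point
print(head(sort(cooks, decreasing = TRUE), 3))   # three most influential observations
```

With the real data, the same calls could be applied to `parametric.model` in place of `fit_sim`.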

Bootstrap Confidence Intervals

# One bootstrap resample: draw row indices with replacement so that
# each student's (Absences, Grade) pair stays together
vec.id <- 1:length(Grade)
boot.id <- sample(vec.id, length(Grade), replace = TRUE)
boot.Grade <- Grade[boot.id]
boot.absence <- Absences[boot.id]

# Repeat the resampling B times, refitting the regression each time
B <- 1000
boot.beta0 <- numeric(B)   # bootstrap intercepts
boot.beta1 <- numeric(B)   # bootstrap slopes

for (i in 1:B) {
  boot.id <- sample(vec.id, length(Grade), replace = TRUE)
  boot.Grade <- Grade[boot.id]
  boot.absence <- Absences[boot.id]

  boot.reg <- lm(boot.Grade ~ boot.absence)
  boot.beta0[i] <- coef(boot.reg)[1]
  boot.beta1[i] <- coef(boot.reg)[2]
}

boot.beta0.ci <- quantile(boot.beta0, c(0.025, 0.975), type = 2)
boot.beta1.ci <- quantile(boot.beta1, c(0.025, 0.975), type = 2)
boot.coef <- data.frame(rbind(boot.beta0.ci, boot.beta1.ci)) 
names(boot.coef) <- c("2.5%", "97.5%")
pander(boot.coef, caption="Bootstrap confidence intervals of regression coefficients.")
Bootstrap confidence intervals of regression coefficients.

                   2.5%      97.5%
boot.beta0.ci     9.653      10.88
boot.beta1.ci   -0.0204    0.08121
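The resampling steps above could also be wrapped in a small function so that the interval is reproducible and reusable. The following is a minimal sketch, not the code used for the table: `boot_lm_ci` is a hypothetical helper name, the data are simulated, and `set.seed()` is added only so that repeated runs give the same interval.

```r
# Hypothetical helper: percentile bootstrap CIs for the coefficients of y ~ x
boot_lm_ci <- function(x, y, B = 1000, level = 0.95) {
  n <- length(y)
  coefs <- matrix(NA_real_, nrow = B, ncol = 2,
                  dimnames = list(NULL, c("intercept", "slope")))
  for (b in seq_len(B)) {
    idx <- sample.int(n, n, replace = TRUE)   # resample (x, y) pairs together
    coefs[b, ] <- coef(lm(y[idx] ~ x[idx]))
  }
  alpha <- 1 - level
  # Rows are the lower and upper percentiles; columns are the two coefficients
  apply(coefs, 2, quantile, probs = c(alpha / 2, 1 - alpha / 2), type = 2)
}

# Simulated data (an assumption) with a true slope of 0.5
set.seed(2024)
x_sim <- rpois(150, lambda = 5)
y_sim <- 10 + 0.5 * x_sim + rnorm(150, sd = 2)

ci_sim <- boot_lm_ci(x_sim, y_sim, B = 500)
print(t(ci_sim))   # one row per coefficient, as in the table above
```

With the real data, the call would be `boot_lm_ci(Absences, Grade)`. Because the resampling is random, the limits change slightly from run to run unless a seed is fixed.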

Comparing the Models

reg.table <- coef(summary(parametric.model))
pander(reg.table, caption = "Inferential statistics for the parametric linear
      regression model: Final Math Grade and Number of School Absences")
Inferential statistics for the parametric linear regression model: Final Math Grade and Number of School Absences

               Estimate   Std. Error   t value     Pr(>|t|)
(Intercept)        10.3       0.2835     36.35   9.356e-128
Absences        0.01961      0.02886    0.6793       0.4973

Because the bootstrap 95% confidence interval for the slope (-0.0204 to 0.0812) includes 0, the data do not show a statistically significant linear association between absences and a student's final math grade. The parametric model agrees: the estimated slope (0.0196) is only slightly above 0, and its p-value (0.4973) is far above 0.05. Because the residual diagnostics showed violations of the model assumptions, the bootstrap confidence interval is the more reliable of the two results.
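To put the two approaches on the same scale, the parametric result can also be written as an interval. The sketch below uses only the slope estimate and standard error from the regression table. Since the residual degrees of freedom are not shown in this report, it assumes a large-sample normal quantile instead of the exact t quantile, so the limits are approximate.

```r
# Values copied from the regression table above
est <- 0.01961   # slope estimate for Absences
se  <- 0.02886   # its standard error

# Approximate 95% interval using the normal quantile (an assumption:
# the exact t quantile would need the residual degrees of freedom)
z <- qnorm(0.975)
wald_ci <- est + c(-1, 1) * z * se
print(round(wald_ci, 4))

# Bootstrap percentile interval for the slope, copied from the bootstrap table
boot_ci <- c(-0.0204, 0.08121)

# Both intervals contain 0, so neither method finds a significant slope
print(c(parametric = wald_ci[1] < 0 & wald_ci[2] > 0,
        bootstrap  = boot_ci[1] < 0 & boot_ci[2] > 0))
```

The approximate parametric interval sits slightly to the left of the bootstrap interval and is a little wider, but the two cover similar ranges and lead to the same conclusion.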