setwd("C:/Users/avaan/OneDrive/Desktop")
library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.3.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(psych)
## Warning: package 'psych' was built under R version 4.3.3
##
## Attaching package: 'psych'
##
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
library(pander)
## Warning: package 'pander' was built under R version 4.3.3
d1=read.table("student.csv",sep=";",header=TRUE)
data <- select(d1, "age", "famsize", "Medu", "Fedu", "traveltime", "studytime", "failures", "famrel", "freetime", "goout", "Dalc", "Walc", "health", "absences", "G3")
This data set was obtained from https://data.world.com. It contains information on students from two different schools in Portugal, covering their habits and lives outside of school, and was gathered to see what impact these external factors might have on their final grade in mathematics. The data was collected through school surveys.
pairs.panels(data[, -c(1,5,6)], pch=21, main="Pair-wise Scatter Plot of 12 Selected Variables")
Absences <- data$absences
Grade <- data$G3
plot(Absences, Grade, pch = 21, col ="navy",
main = "Relationship between Absences and Final Math Grade")
From the pairwise scatter plot, final grade (the response variable) and
the number of school absences appear to have only a weak linear
relationship. Absences is the explanatory variable that will be
focused on for the rest of the assignment.

# SLR
plot(Absences, Grade, pch = 21, col ="navy",
main = "Relationship between Absences and Final Math Grade")
parametric.model <- lm(Grade ~ Absences)
par(mfrow = c(2,2))
plot(parametric.model)
The scatter plot shows at most a weak linear relationship between the
number of absences a student has and their final math grade. From the
residual plots we can see that there are not many clusters, but the
majority of the observations are clumped towards the left. The top right
plot (normal Q-Q) reveals that the residuals violate the normality
assumption. From the bottom right plot (residuals vs. leverage) we can
see that there is one serious outlier to the far right. The top left
plot (residuals vs. fitted) also has a slight negative linear trend.
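The visual checks above can be backed up with numeric diagnostics. The following is a minimal sketch only: it runs on simulated stand-in data rather than the student data set, and the names `absences_sim`, `grade_sim`, and `fit_sim` are hypothetical. It applies a Shapiro-Wilk test to the residuals and uses Cook's distance with the common 4/n rule of thumb to flag influential observations.

```r
# Simulated stand-in data (NOT the student data set): count-valued absences
# and grades on a 0-20 scale that do not depend on absences.
set.seed(1)
n <- 200
absences_sim <- rpois(n, lambda = 5)
grade_sim <- round(pmin(pmax(rnorm(n, mean = 10, sd = 4.5), 0), 20))
fit_sim <- lm(grade_sim ~ absences_sim)

# Shapiro-Wilk test on the residuals: a small p-value suggests that the
# residuals are not normally distributed.
sw <- shapiro.test(residuals(fit_sim))

# Cook's distance measures each observation's influence on the fit;
# 4/n is a rule-of-thumb cutoff, not a formal test.
cook <- cooks.distance(fit_sim)
influential <- which(cook > 4 / n)

print(sw$p.value)
print(influential)
```

With the real data, `fit_sim` would be replaced by `parametric.model`; the cutoff for Cook's distance is a judgment call and other thresholds are also in use.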
vec.id <- 1:length(Grade)
boot.id <- sample(vec.id, length(Grade), replace = TRUE)
boot.Grade <- Grade[boot.id]
boot.absence <- Absences[boot.id]
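B <- 1000                    # number of bootstrap replicates
boot.beta0 <- numeric(B)     # preallocated storage for bootstrap intercepts
boot.beta1 <- numeric(B)     # preallocated storage for bootstrap slopes
vec.id <- 1:length(Grade)
for(i in 1:B){
  # resample observation indices with replacement
  boot.id <- sample(vec.id, length(Grade), replace = TRUE)
  boot.Grade <- Grade[boot.id]
  boot.absence <- Absences[boot.id]
  # refit the regression on the bootstrap sample
  boot.reg <- lm(boot.Grade ~ boot.absence)
  boot.beta0[i] <- coef(boot.reg)[1]
  boot.beta1[i] <- coef(boot.reg)[2]
}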
boot.beta0.ci <- quantile(boot.beta0, c(0.025, 0.975), type = 2)
boot.beta1.ci <- quantile(boot.beta1, c(0.025, 0.975), type = 2)
boot.coef <- data.frame(rbind(boot.beta0.ci, boot.beta1.ci))
names(boot.coef) <- c("2.5%", "97.5%")
pander(boot.coef, caption="Bootstrap confidence intervals of regression coefficients.")
|   | 2.5% | 97.5% |
|---|---|---|
| boot.beta0.ci | 9.653 | 10.88 |
| boot.beta1.ci | -0.0204 | 0.08121 |
reg.table <- coef(summary(parametric.model))
pander(reg.table, caption = "Inferential statistics for the parametric linear
regression model: Final Math Grade and Number of School Absences")
|   | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 10.3 | 0.2835 | 36.35 | 9.356e-128 |
| Absences | 0.01961 | 0.02886 | 0.6793 | 0.4973 |
Since the bootstrap confidence interval for the slope (-0.0204, 0.0812) includes the value 0, there is no statistically significant linear association between absences and a student's final math grade. The parametric model agrees: the estimated slope is only slightly above 0 (0.0196), and its p-value of 0.4973 is far from significance. Because the residual diagnostics indicated a violation of the model assumptions, the bootstrap confidence interval is the more reliable of the two.
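To make the comparison of the two intervals concrete, here is a minimal sketch that computes a percentile bootstrap interval and a parametric interval for a slope side by side. It is an illustration under assumptions, not the analysis above: the data are simulated, and `boot_slope_ci`, `x_sim`, and `y_sim` are hypothetical names introduced for this example.

```r
# Hypothetical helper: percentile bootstrap CI for the slope of y ~ x,
# resampling (x, y) pairs with replacement.
boot_slope_ci <- function(x, y, B = 1000, level = 0.95) {
  n <- length(y)
  slopes <- numeric(B)
  for (b in seq_len(B)) {
    id <- sample.int(n, n, replace = TRUE)
    slopes[b] <- coef(lm(y[id] ~ x[id]))[2]
  }
  alpha <- 1 - level
  quantile(slopes, c(alpha / 2, 1 - alpha / 2), names = FALSE)
}

# Simulated data (NOT the student data set): y has no true dependence on x.
set.seed(123)
x_sim <- rpois(150, lambda = 5)
y_sim <- 10 + rnorm(150, sd = 4)

boot_ci  <- boot_slope_ci(x_sim, y_sim, B = 500)
param_ci <- confint(lm(y_sim ~ x_sim))[2, ]   # t-based interval for the slope

# An interval that contains 0 is consistent with no linear association.
print(rbind(bootstrap = boot_ci, parametric = param_ci))
```

When the model assumptions hold, the two intervals tend to be similar; a noticeable difference between them is one informal sign that the parametric interval may not be trustworthy.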