setwd("C:/Users/avaan/OneDrive/Desktop")
library(tidyverse)

## Warning: package 'ggplot2' was built under R version 4.3.2

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(psych)

## Warning: package 'psych' was built under R version 4.3.3

## 
## Attaching package: 'psych'
## 
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha

library(pander)

## Warning: package 'pander' was built under R version 4.3.3

# Read the semicolon-delimited student data
d1 <- read.table("student.csv", sep = ";", header = TRUE)

# Keep the variables used in this analysis
data <- select(d1, "age", "famsize", "Medu", "Fedu", "traveltime", "studytime", "failures", "famrel", "freetime", "goout", "Dalc", "Walc", "health", "absences", "G3")

Data Description

This data set was obtained from https://data.world.com. It contains information on students from two schools in Portugal, covering their habits and lives outside of school, and was gathered to see what impact these external factors might have on their final grade in mathematics. The data were collected through school surveys.

school - student’s school (binary: ‘GP’ - Gabriel Pereira or ‘MS’ - Mousinho da Silveira)
sex - student’s sex (binary: ‘F’ - female or ‘M’ - male)
age - student’s age
address - student’s home address type (binary: ‘U’ - urban or ‘R’ - rural)
famsize - family size (binary: ‘LE3’ - less or equal to 3 or ‘GT3’ - greater than 3)
Pstatus - parent’s cohabitation status (binary: ‘T’ - living together or ‘A’ - apart)
Medu - mother’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
Fedu - father’s education
Mjob - mother’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
Fjob - father’s job
reason - reason to choose this school (nominal: close to ‘home’, school ‘reputation’, ‘course’ preference or ‘other’)
guardian - student’s guardian (nominal: ‘mother’, ‘father’ or ‘other’)
traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
failures - number of past class failures (numeric: n if 1<=n<3, else 4)
schoolsup - extra educational support
famsup - family educational support
paid - extra paid classes within the course subject
activities - extra-curricular activities
nursery - attended nursery school
higher - wants to take higher education
internet - Internet access at home
romantic - with a romantic relationship
famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
freetime - free time after school (numeric: from 1 - very low to 5 - very high)
goout - going out with friends (numeric: from 1 - very low to 5 - very high)
Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
health - current health status (numeric: from 1 - very bad to 5 - very good)
absences - number of school absences
G1 - first period grade (numeric: from 0 to 20)
G2 - second period grade (numeric: from 0 to 20)
G3 - final grade (numeric: from 0 to 20, output target)
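
Several of the variables above are coded as text rather than numbers (famsize, for example, takes the values ‘LE3’ and ‘GT3’), which matters for plots and correlations that expect numeric input. The sketch below shows one way to check column types and keep only the numeric columns. The data frame `toy` and its values are hypothetical stand-ins, not rows from student.csv.

```r
# Toy data frame with hypothetical values, mirroring a few column types
toy <- data.frame(
  age      = c(15, 16, 17),
  famsize  = c("GT3", "LE3", "GT3"),
  absences = c(6, 4, 10),
  G3       = c(6, 6, 10),
  stringsAsFactors = FALSE
)

# Flag which columns are stored as numbers
is_num <- sapply(toy, is.numeric)
is_num

# Keep only the numeric columns
toy_num <- toy[, is_num, drop = FALSE]
names(toy_num)
```

Running the same check on the real data frame (for instance `sapply(data, is.numeric)`) would show which selected variables are actually numeric.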

Questions Regarding Data

Pairwise Scatterplots

# Drop columns 1, 5 and 6 (age, traveltime, studytime); 12 variables remain
pairs.panels(data[, -c(1,5,6)], pch=21, main="Pair-wise Scatter Plot of 12 variables")

Absences <- data$absences
Grade <- data$G3
plot(Absences, Grade, pch = 21, col ="navy",
     main = "Relationship between Absences and Final Grade")

From the pairwise scatter plot, final grade (the response variable) and the number of school absences appear to have a weak linear relationship. The number of absences is the explanatory variable used for the rest of the assignment.

Simple Linear Regression
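
A visual impression of a linear relationship can be checked numerically with a correlation test. The following is a minimal sketch using base R's `cor.test()`. The vectors `absences_sim` and `grade_sim` are simulated stand-ins (an assumption made so the example runs on its own), not the student data.

```r
# Simulated stand-in data (hypothetical; not the student data set)
set.seed(1)
absences_sim <- rpois(200, lambda = 5)
grade_sim <- pmin(20, pmax(0, round(10 + rnorm(200, sd = 4))))

# Pearson correlation with a test of H0: correlation = 0
ct <- cor.test(absences_sim, grade_sim, method = "pearson")
ct$estimate   # sample correlation r
ct$conf.int   # 95% confidence interval for the correlation
ct$p.value    # p-value for H0: correlation = 0
```

With the real data, the equivalent call would be `cor.test(Absences, Grade)`. The sign of the estimate matches the sign of the fitted regression slope.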

plot(Absences, Grade, pch = 21, col ="navy",
     main = "Relationship between Absences and Final Grade")

parametric.model <- lm(Grade ~ Absences)
par(mfrow = c(2,2))
plot(parametric.model)

The scatter plot suggests at most a weak linear relationship between the number of absences a student has and their final math grade. In the residual plots there are few distinct clusters, but most of the observations are clumped towards the left. The normal Q-Q plot (top right) shows that the residuals violate the normality assumption. The residuals vs. leverage plot (bottom right) shows one serious outlier at the far right. The residuals vs. fitted plot (top left) also shows a slight negative linear trend.
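
The visual diagnostics can be backed up with numerical checks. The sketch below applies a Shapiro-Wilk test to the residuals and computes Cook's distance to flag influential points. The data are simulated stand-ins, and the cutoff `4 / n` is a common rule of thumb assumed here for illustration, not a threshold taken from this assignment.

```r
# Simulated stand-in data (hypothetical; not the student data set)
set.seed(2)
x_sim <- rpois(200, lambda = 5)
y_sim <- 10 + rnorm(200, sd = 4)
fit_sim <- lm(y_sim ~ x_sim)

# Shapiro-Wilk test: H0 is that the residuals are normally distributed
sw <- shapiro.test(residuals(fit_sim))
sw$p.value

# Cook's distance: one value per observation; large values are influential
cd <- cooks.distance(fit_sim)
influential <- which(cd > 4 / length(cd))   # rule-of-thumb cutoff (assumption)
influential
```

Applied to `parametric.model`, a small Shapiro-Wilk p-value would agree with the departure from normality seen in the Q-Q plot.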

Bootstrap Confidence Intervals

# One bootstrap resample: draw row indices with replacement so that each
# (Absences, Grade) pair stays together
vec.id <- seq_along(Grade)
boot.id <- sample(vec.id, length(Grade), replace = TRUE)
boot.Grade <- Grade[boot.id]
boot.absence <- Absences[boot.id]

B <- 1000                   # number of bootstrap replicates
boot.beta0 <- numeric(B)    # bootstrap intercepts
boot.beta1 <- numeric(B)    # bootstrap slopes

for(i in 1:B){
  boot.id <- sample(vec.id, length(Grade), replace = TRUE)
  boot.Grade <- Grade[boot.id]
  boot.absence <- Absences[boot.id]

  # Refit the regression on the resampled pairs
  boot.reg <- lm(boot.Grade ~ boot.absence)
  boot.beta0[i] <- coef(boot.reg)[1]
  boot.beta1[i] <- coef(boot.reg)[2]
}

boot.beta0.ci <- quantile(boot.beta0, c(0.025, 0.975), type = 2)
boot.beta1.ci <- quantile(boot.beta1, c(0.025, 0.975), type = 2)
boot.coef <- data.frame(rbind(boot.beta0.ci, boot.beta1.ci)) 
names(boot.coef) <- c("2.5%", "97.5%")
pander(boot.coef, caption="Bootstrap confidence intervals of regression coefficients.")

Bootstrap confidence intervals of regression coefficients.
	2.5%	97.5%
boot.beta0.ci	9.653	10.88
boot.beta1.ci	-0.0204	0.08121
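
Because the resampling is random, the interval endpoints change slightly from run to run unless a seed is fixed. The sketch below wraps the case-resampling bootstrap in a small function with an optional seed. The function name `boot_slope_ci` and its arguments are hypothetical, and the data are simulated stand-ins rather than the student data.

```r
# Percentile bootstrap CI for the slope of y ~ x (hypothetical helper)
boot_slope_ci <- function(x, y, B = 1000, level = 0.95, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)   # fix the seed for reproducibility
  n <- length(y)
  slopes <- numeric(B)
  for (b in seq_len(B)) {
    idx <- sample.int(n, n, replace = TRUE)      # resample (x, y) pairs
    slopes[b] <- coef(lm(y[idx] ~ x[idx]))[2]    # slope of the refit
  }
  alpha <- 1 - level
  quantile(slopes, c(alpha / 2, 1 - alpha / 2), type = 2)
}

# Simulated data with a true slope of 0.5
set.seed(3)
x_sim <- rpois(100, lambda = 5)
y_sim <- 2 + 0.5 * x_sim + rnorm(100)

ci <- boot_slope_ci(x_sim, y_sim, B = 500, seed = 42)
ci
```

Calling the function twice with the same seed returns the same interval, which makes a reported table reproducible.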

Comparing the Models

reg.table <- coef(summary(parametric.model))
pander(reg.table, caption = "Inferential statistics for the parametric linear regression model: Final Math Grade and Number of School Absences")

Inferential statistics for the parametric linear regression model: Final Math Grade and Number of School Absences
	Estimate	Std. Error	t value	Pr(>|t|)
(Intercept)	10.3	0.2835	36.35	9.356e-128
Absences	0.01961	0.02886	0.6793	0.4973

The 95% bootstrap confidence interval for the slope (-0.0204 to 0.0812) includes 0, so the data do not show a statistically significant linear association between absences and a student’s final math grade. The parametric model agrees: the estimated slope (0.0196) is only slightly above 0, and its p-value (0.4973) is far above 0.05. Because the residual plots showed a violation of the model assumptions, the bootstrap confidence interval is the more reliable of the two.
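
As a rough cross-check, the parametric slope estimate and standard error reported in the table can be turned into an approximate 95% interval and set beside the bootstrap interval. The sketch below uses a normal approximation (`qnorm`) instead of the exact t quantile, an assumption made because the sample size is not stated in this section. The variable names are illustrative.

```r
# Values copied from the regression table above
est <- 0.01961   # slope estimate for Absences
se  <- 0.02886   # its standard error

# t statistic: estimate divided by its standard error
t_stat <- est / se
t_stat

# Approximate 95% CI using the normal quantile (assumption: large sample)
ci_approx <- est + c(-1, 1) * qnorm(0.975) * se
ci_approx

# Does the interval contain 0?
ci_approx[1] < 0 && ci_approx[2] > 0
```

With the fitted model available, `confint(parametric.model)` would give the exact t-based interval.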

Assignment 2

Ava Destefano

2024-09-10
