Student Performance Dataset (from Kaggle)

Source: https://www.kaggle.com/datasets/nikhil7280/student-performance-multiple-linear-regression?resource=download

Description: The Student Performance Dataset is designed to examine the factors that influence students' academic performance. It consists of 10,000 student records, each containing several predictors and a performance index.

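Variables: the predictors are Hours Studied, Previous Scores, Extracurricular Activities (Yes/No), Sleep Hours, and Sample Question Papers Practiced; the response is Performance Index.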

data <- read.csv("https://raw.githubusercontent.com/johnnydrodriguez/data605/main/Student_Performance.csv", header = TRUE, sep = ',', na.strings="", fill = TRUE)
head(data)
##   Hours.Studied Previous.Scores Extracurricular.Activities Sleep.Hours
## 1             7              99                        Yes           9
## 2             4              82                         No           4
## 3             8              51                        Yes           7
## 4             5              52                        Yes           5
## 5             7              75                         No           8
## 6             3              78                         No           9
##   Sample.Question.Papers.Practiced Performance.Index
## 1                                1                91
## 2                                2                65
## 3                                2                45
## 4                                2                36
## 5                                5                66
## 6                                6                61
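Before plotting, it may help to confirm the column types and check for missing values. The sketch below is not part of the original analysis. It rebuilds only the six rows printed above as a stand-in data frame (the name `preview` is hypothetical) so that it runs without downloading the file; the same calls should apply to the full `data` object.

```r
# Stand-in data: the six rows shown by head(data) above (not the full dataset)
preview <- data.frame(
  Hours.Studied = c(7, 4, 8, 5, 7, 3),
  Previous.Scores = c(99, 82, 51, 52, 75, 78),
  Extracurricular.Activities = c("Yes", "No", "Yes", "Yes", "No", "No"),
  Sleep.Hours = c(9, 4, 7, 5, 8, 9),
  Sample.Question.Papers.Practiced = c(1, 2, 2, 2, 5, 6),
  Performance.Index = c(91, 65, 45, 36, 66, 61)
)

# Show the type of each column
str(preview)

# Count missing values per column
na_counts <- colSums(is.na(preview))
print(na_counts)

# Frequency of each level of the categorical predictor
activity_counts <- table(preview$Extracurricular.Activities)
print(activity_counts)
```

Because the file is read with na.strings = "", empty cells would show up in the per-column missing-value counts.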


Simple Linear Regression Plots

For these plots, I dropped the Extracurricular Activities column because it is categorical and would need to be converted to a dummy variable.

# Arrange the four plots in a 2 x 2 grid
par(mfrow=c(2, 2))

# Plot each variable against Performance.Index with a regression line
plot(data$Hours.Studied, data$Performance.Index, main = "Performance vs Hours Studied",
     xlab = "Hours Studied", ylab = "Performance Index")
abline(lm(Performance.Index ~ Hours.Studied, data=data), col = "red")

plot(data$Previous.Scores, data$Performance.Index, main = "Performance vs Previous Scores",
     xlab = "Previous Scores", ylab = "Performance Index")
abline(lm(Performance.Index ~ Previous.Scores, data=data), col = "red")

plot(data$Sleep.Hours, data$Performance.Index, main = "Performance vs Sleep Hours",
     xlab = "Sleep Hours", ylab = "Performance Index")
abline(lm(Performance.Index ~ Sleep.Hours, data=data), col = "red")

plot(data$Sample.Question.Papers.Practiced, data$Performance.Index, main = "Performance vs Sample Question Papers",
     xlab = "Sample Question Papers Practiced", ylab = "Performance Index")
abline(lm(Performance.Index ~ Sample.Question.Papers.Practiced, data=data), col = "red")
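The four fits above repeat the same pattern, so the slopes behind the red lines could also be collected in one pass. This is a minimal sketch, assuming a stand-in data frame (hypothetical name `preview`) built from the six rows printed by head(data); with so few rows the slopes are only illustrative and will not match those of the full dataset.

```r
# Stand-in data: numeric columns from the six rows shown by head(data)
preview <- data.frame(
  Hours.Studied = c(7, 4, 8, 5, 7, 3),
  Previous.Scores = c(99, 82, 51, 52, 75, 78),
  Sleep.Hours = c(9, 4, 7, 5, 8, 9),
  Sample.Question.Papers.Practiced = c(1, 2, 2, 2, 5, 6),
  Performance.Index = c(91, 65, 45, 36, 66, 61)
)

predictors <- c("Hours.Studied", "Previous.Scores", "Sleep.Hours",
                "Sample.Question.Papers.Practiced")

# Fit one simple regression per predictor and keep only its slope
slopes <- sapply(predictors, function(p) {
  fit <- lm(reformulate(p, response = "Performance.Index"), data = preview)
  unname(coef(fit)[2])
})

# Named vector: one slope per predictor
print(slopes)
```

Replacing `preview` with the full `data` object should give the slopes of the lines drawn in the plots.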


Multiple Regression Model Using Hours Studied and Previous Score

Since the simple linear regressions showed a positive association between Performance Index and both Hours Studied and Previous Scores, we’ll use those variables in the multiple linear regression. To include a dichotomous term, I converted Extracurricular Activities into a dummy variable and included it in the model.

# Recode 'Extracurricular.Activities' from 'Yes'/'No' to a 1/0 dummy variable
data$Extracurricular.Activities <- ifelse(data$Extracurricular.Activities == "Yes", 1, 0)

# Fit the multiple linear regression model
model <- lm(Performance.Index ~ Hours.Studied + Previous.Scores + Extracurricular.Activities, data = data)

# Summary
summary(model)
## 
## Call:
## lm(formula = Performance.Index ~ Hours.Studied + Previous.Scores + 
##     Extracurricular.Activities, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.1188 -1.4981 -0.0046  1.5016  9.2138 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                -30.096168   0.105862 -284.30   <2e-16 ***
## Hours.Studied                2.857185   0.008748  326.61   <2e-16 ***
## Previous.Scores              1.018980   0.001306  780.16   <2e-16 ***
## Extracurricular.Activities   0.589267   0.045301   13.01   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.265 on 9996 degrees of freedom
## Multiple R-squared:  0.9861, Adjusted R-squared:  0.9861 
## F-statistic: 2.365e+05 on 3 and 9996 DF,  p-value: < 2.2e-16
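As an aside, the manual 1/0 recoding is not the only option: lm() builds the same dummy column when the predictor is a factor. The sketch below compares the two approaches on a stand-in data frame (hypothetical name `preview`, built from the six rows printed by head(data)); the column names `Activity.Dummy` and `Activity.Factor` are introduced here for illustration only.

```r
# Stand-in data: the six rows shown by head(data), with the original Yes/No coding
preview <- data.frame(
  Hours.Studied = c(7, 4, 8, 5, 7, 3),
  Previous.Scores = c(99, 82, 51, 52, 75, 78),
  Extracurricular.Activities = c("Yes", "No", "Yes", "Yes", "No", "No"),
  Performance.Index = c(91, 65, 45, 36, 66, 61)
)

# Manual recoding, as in the analysis above
preview$Activity.Dummy <- ifelse(preview$Extracurricular.Activities == "Yes", 1, 0)
fit_dummy <- lm(Performance.Index ~ Hours.Studied + Previous.Scores + Activity.Dummy,
                data = preview)

# Alternative: a factor with "No" as the reference level
preview$Activity.Factor <- factor(preview$Extracurricular.Activities,
                                  levels = c("No", "Yes"))
fit_factor <- lm(Performance.Index ~ Hours.Studied + Previous.Scores + Activity.Factor,
                 data = preview)

# The estimates agree; only the coefficient name differs
same_estimates <- all.equal(unname(coef(fit_dummy)), unname(coef(fit_factor)))
print(same_estimates)
print(names(coef(fit_factor)))
```

The factor version labels the coefficient by level (Activity.FactorYes), which makes the reference category explicit in the summary table.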


Interpretation

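Coefficients: Holding the other predictors constant, each additional hour studied is associated with an increase of about 2.86 points in the Performance Index, and each additional point in Previous Scores with an increase of about 1.02 points. Students who take part in extracurricular activities score about 0.59 points higher on average. The intercept (-30.10) is the fitted value when all three predictors equal zero.

Standard Error: The standard errors are small relative to the estimates (0.0087 for Hours Studied, 0.0013 for Previous Scores, and 0.0453 for Extracurricular Activities), so the coefficients are estimated precisely.

P values: All p-values are below 2e-16, so each predictor is statistically significant at any conventional level.

Model: The multiple and adjusted R-squared are both 0.9861, so the model explains about 98.6% of the variance in the Performance Index. The residual standard error is 2.265 on 9996 degrees of freedom, and the F-statistic of 2.365e+05 (p-value < 2.2e-16) indicates that the model as a whole is significant.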
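To make the fitted equation concrete, the estimates from the summary(model) output can be applied by hand to the first student shown by head(data). This is a sketch that copies the printed values rather than refitting the model; the vector names `b` and `x` are introduced here for illustration.

```r
# Coefficient estimates copied from the summary(model) output above
b <- c(intercept = -30.096168, hours = 2.857185,
       previous = 1.018980, activities = 0.589267)

# First student from head(data): 7 hours studied, previous score 99,
# extracurricular activities = Yes (coded 1); the leading 1 is for the intercept
x <- c(1, 7, 99, 1)

# Predicted Performance Index: about 91.37
predicted <- sum(b * x)
print(predicted)

# The observed Performance.Index for this student was 91
residual <- 91 - predicted
print(residual)

# Each t value is the estimate divided by its standard error,
# e.g. for Extracurricular.Activities: about 13.01
t_activities <- 0.589267 / 0.045301
print(t_activities)
```

With the fitted object available, predict(model, newdata = ...) would give the same prediction up to the rounding of the printed coefficients.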