Discussion Week 12

Student Performance Dataset (from Kaggle)

Source: https://www.kaggle.com/datasets/nikhil7280/student-performance-multiple-linear-regression?resource=download

Description: The Student Performance Dataset is designed to examine the factors that influence students’ academic performance. It consists of 10,000 student records, each containing several predictors and a performance index.

Variables:

Hours Studied: The total number of hours spent studying by each student.
Previous Scores: The scores obtained by students in previous tests.
Extracurricular Activities: Whether the student participates in extracurricular activities (Yes or No).
Sleep Hours: The average number of hours of sleep the student had per day.
Sample Question Papers Practiced: The number of sample question papers the student practiced.
Target Variable: Performance Index: A measure of the overall performance of each student. The performance index represents the student’s academic performance and has been rounded to the nearest integer. The index ranges from 10 to 100, with higher values indicating better performance.

data <- read.csv("https://raw.githubusercontent.com/johnnydrodriguez/data605/main/Student_Performance.csv", header = TRUE, sep = ',', na.strings="", fill = TRUE)
head(data)

##   Hours.Studied Previous.Scores Extracurricular.Activities Sleep.Hours
## 1             7              99                        Yes           9
## 2             4              82                         No           4
## 3             8              51                        Yes           7
## 4             5              52                        Yes           5
## 5             7              75                         No           8
## 6             3              78                         No           9
##   Sample.Question.Papers.Practiced Performance.Index
## 1                                1                91
## 2                                2                65
## 3                                2                45
## 4                                2                36
## 5                                5                66
## 6                                6                61

Simple Linear Regression Plots

For these plots, I dropped the Extracurricular Activities column because it is categorical and would require conversion to a dummy variable.

# Arrange the four plots in a 2x2 grid
par(mfrow=c(2, 2))

# Plot each variable against Performance.Index with a regression line
plot(data$Hours.Studied, data$Performance.Index, main = "Performance vs Hours Studied",
     xlab = "Hours Studied", ylab = "Performance Index")
abline(lm(Performance.Index ~ Hours.Studied, data=data), col = "red")

plot(data$Previous.Scores, data$Performance.Index, main = "Performance vs Previous Scores",
     xlab = "Previous Scores", ylab = "Performance Index")
abline(lm(Performance.Index ~ Previous.Scores, data=data), col = "red")

plot(data$Sleep.Hours, data$Performance.Index, main = "Performance vs Sleep Hours",
     xlab = "Sleep Hours", ylab = "Performance Index")
abline(lm(Performance.Index ~ Sleep.Hours, data=data), col = "red")

plot(data$Sample.Question.Papers.Practiced, data$Performance.Index, main = "Performance vs Sample Question Papers",
     xlab = "Sample Question Papers Practiced", ylab = "Performance Index")
abline(lm(Performance.Index ~ Sample.Question.Papers.Practiced, data=data), col = "red")
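The direction of each red line can also be checked numerically by extracting the slope of each simple regression. The sketch below is only illustrative: `simple_slopes` is a hypothetical helper that is not part of the original analysis, and the `toy` data frame holds just the six rows printed by head(data) so the snippet runs without a download. Slopes from six rows are not representative of the full 10,000 records.

```r
# Return the slope of a simple regression of `response` on each predictor.
# `simple_slopes` is a hypothetical helper, not part of the original analysis.
simple_slopes <- function(df, response, predictors) {
  sapply(predictors, function(p) {
    fit <- lm(reformulate(p, response = response), data = df)
    unname(coef(fit)[2])  # second coefficient is the slope
  })
}

# The six rows shown by head(data), used only so the snippet runs offline.
toy <- data.frame(
  Hours.Studied = c(7, 4, 8, 5, 7, 3),
  Previous.Scores = c(99, 82, 51, 52, 75, 78),
  Sleep.Hours = c(9, 4, 7, 5, 8, 9),
  Sample.Question.Papers.Practiced = c(1, 2, 2, 2, 5, 6),
  Performance.Index = c(91, 65, 45, 36, 66, 61)
)

predictors <- setdiff(names(toy), "Performance.Index")
slopes <- simple_slopes(toy, "Performance.Index", predictors)
print(slopes)
```

On the full dataset, the same call would be simple_slopes(data, "Performance.Index", predictors), with a positive slope indicating a positive association.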

Multiple Regression Model Using Hours Studied and Previous Score

Since the simple linear regressions showed a positive association between Performance Index and both Hours Studied and Previous Scores, we’ll use those variables in the multiple linear regression. To include a dichotomous term, I converted Extracurricular Activities into a dummy variable and included it in the model.

# Recode 'Extracurricular.Activities' from 'Yes'/'No' to a 1/0 dummy variable
data$Extracurricular.Activities <- ifelse(data$Extracurricular.Activities == "Yes", 1, 0)

# Fit the multiple linear regression model
model <- lm(Performance.Index ~ Hours.Studied + Previous.Scores + Extracurricular.Activities, data = data)

# Summary
summary(model)

## 
## Call:
## lm(formula = Performance.Index ~ Hours.Studied + Previous.Scores + 
##     Extracurricular.Activities, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.1188 -1.4981 -0.0046  1.5016  9.2138 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                -30.096168   0.105862 -284.30   <2e-16 ***
## Hours.Studied                2.857185   0.008748  326.61   <2e-16 ***
## Previous.Scores              1.018980   0.001306  780.16   <2e-16 ***
## Extracurricular.Activities   0.589267   0.045301   13.01   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.265 on 9996 degrees of freedom
## Multiple R-squared:  0.9861, Adjusted R-squared:  0.9861 
## F-statistic: 2.365e+05 on 3 and 9996 DF,  p-value: < 2.2e-16
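The manual 0/1 recoding used for the model is one way to include a dichotomous term. As a sketch of an alternative, lm() builds the dummy column itself when the variable is a factor. The example below uses only the six rows printed by head(data) and a reduced formula so it runs offline; the column names Extra.Dummy and Extra.Factor are made up for this illustration.

```r
# The six rows shown by head(data), used only so the snippet runs offline.
toy <- data.frame(
  Hours.Studied = c(7, 4, 8, 5, 7, 3),
  Extracurricular.Activities = c("Yes", "No", "Yes", "Yes", "No", "No"),
  Performance.Index = c(91, 65, 45, 36, 66, 61)
)

# Option 1: manual 0/1 recoding, as in the model above.
toy$Extra.Dummy <- ifelse(toy$Extracurricular.Activities == "Yes", 1, 0)
fit_dummy <- lm(Performance.Index ~ Hours.Studied + Extra.Dummy, data = toy)

# Option 2: a factor with "No" as the baseline level; lm() creates the dummy.
toy$Extra.Factor <- factor(toy$Extracurricular.Activities, levels = c("No", "Yes"))
fit_factor <- lm(Performance.Index ~ Hours.Studied + Extra.Factor, data = toy)

# Both fits give the same estimates; only the coefficient label differs.
print(coef(fit_dummy))
print(coef(fit_factor))
```

The factor version labels the coefficient with the non-baseline level (here Extra.FactorYes), which makes the comparison group explicit in the summary output.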

Interpretation

Coefficients:

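Intercept (-30.096168): The intercept is the predicted Performance.Index for a student with zero hours studied, a previous score of zero, and no extracurricular activities. Such a student is not realistic, so the intercept has no practical interpretation in this context.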
Hours.Studied (2.857185): For each additional hour studied, the Performance.Index is expected to increase by approximately 2.857 points, holding all other variables constant.
Previous.Scores (1.018980): For each additional point in Previous.Scores, the Performance.Index increases by about 1.019 points, assuming other variables are held constant.
Extracurricular.Activities (0.589267): Having extracurricular activities (value = 1) is associated with an average increase of about 0.589 points in Performance.Index compared to not having extracurricular activities (value = 0), assuming other variables are held constant.
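To make the coefficients concrete, the fitted equation can be applied by hand to a single student. The sketch below copies the estimates from the summary output and uses the first row of head(data) (7 hours studied, previous score 99, extracurricular activities = Yes); the variable names are chosen for this illustration only.

```r
# Coefficients copied from the summary(model) output.
b <- c(intercept = -30.096168, hours = 2.857185,
       previous = 1.018980, extra = 0.589267)

# Row 1 of head(data): 7 hours studied, previous score 99, extracurricular = 1.
# The leading 1 multiplies the intercept.
x <- c(1, 7, 99, 1)

predicted <- sum(b * x)
observed <- 91  # Performance.Index recorded for row 1

print(round(predicted, 2))             # 91.37
print(round(observed - predicted, 2))  # residual: -0.37
```

The prediction lands within half a point of the observed value of 91, which is in line with the small residuals reported for the model.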

Standard Error

Hours.Studied (0.008748), Previous.Scores (0.001306), and Extracurricular.Activities (0.045301) all have standard errors that are small relative to their estimates, indicating that the coefficients are estimated precisely.
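One way to see what these standard errors imply is to turn them into 95% confidence intervals, computed as estimate ± t critical value × standard error. The sketch below copies the estimates, standard errors, and residual degrees of freedom (9996) from the summary output; the variable names are illustrative.

```r
# Estimates and standard errors copied from the summary(model) output.
est <- c(Hours.Studied = 2.857185,
         Previous.Scores = 1.018980,
         Extracurricular.Activities = 0.589267)
se <- c(0.008748, 0.001306, 0.045301)

df_resid <- 9996                # residual degrees of freedom
t_crit <- qt(0.975, df_resid)   # two-sided 95% critical value, about 1.96

ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)
print(round(ci, 3))
```

For Hours.Studied the interval runs from roughly 2.84 to 2.87, and none of the three intervals includes zero. With the fitted model object available, confint(model) should give the same intervals up to rounding.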

P values

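All terms, including the intercept, have p-values < 2e-16, so for each one we reject the null hypothesis that the coefficient is zero. With 10,000 observations even small effects can reach significance, so the p-values speak to statistical significance rather than to the size of each effect.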
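The p-values follow from the t values in the coefficient table: each is a two-sided tail probability of a t distribution with the residual degrees of freedom. A minimal sketch, with the t values and degrees of freedom copied from the summary output:

```r
# t values copied from the summary(model) output.
t_values <- c(Hours.Studied = 326.61,
              Previous.Scores = 780.16,
              Extracurricular.Activities = 13.01)
df_resid <- 9996

# Two-sided p-value: probability of a t statistic at least this extreme.
p_values <- 2 * pt(-abs(t_values), df = df_resid)

print(p_values)
print(p_values < 2e-16)
```

All three comparisons come out TRUE. The "<2e-16" shown in the summary is a printing threshold close to machine precision, so the underlying values are smaller than what is displayed.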

Model

Residual Standard Error (2.265): Predictions are typically off by about 2.3 points. Relative to the 10 to 100 range of Performance.Index, this error is small and suggests a good fit.
Multiple R-squared (0.9861): This indicates that 98.61% of the variability in Performance.Index is explained by the model. Its high value suggests a very good fit of the model to the data. The equally high Adjusted R-squared (0.9861) suggests the same.
F-statistic (2.365e+05): The very high F-statistic and its extremely small p-value (< 2.2e-16) indicate that the model as a whole is statistically significant.
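These fit statistics are tied together by two standard identities, which gives a quick consistency check on the output. The sketch below starts from the printed (rounded) R-squared, the 10,000 observations, and the three predictors in the model; the variable names are illustrative.

```r
r2 <- 0.9861   # Multiple R-squared, rounded as printed in the summary
n <- 10000     # number of observations
p <- 3         # predictors, excluding the intercept

# Adjusted R-squared penalizes R-squared for the number of predictors.
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Overall F-statistic on p and n - p - 1 degrees of freedom.
f_stat <- (r2 / p) / ((1 - r2) / (n - p - 1))

print(round(adj_r2, 4))
print(signif(f_stat, 4))
```

The adjusted value rounds to 0.9861, matching the summary, because three predictors are negligible next to 10,000 observations. The recomputed F-statistic comes out close to the reported 2.365e+05; the small gap arises from starting with the rounded R-squared.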

2024-04-15
