Description: The Student Performance Dataset is designed to examine the factors that influence students’ academic performance. It consists of 10,000 student records, each containing several predictor variables and a performance index.
Variables:
Hours Studied: The total number of hours spent studying by each student.
Previous Scores: The scores obtained by students in previous tests.
Extracurricular Activities: Whether the student participates in extracurricular activities (Yes or No).
Sleep Hours: The average number of hours of sleep the student had per day.
Sample Question Papers Practiced: The number of sample question papers the student practiced.
Target Variable: Performance Index: A measure of the overall performance of each student. The performance index represents the student’s academic performance and has been rounded to the nearest integer. The index ranges from 10 to 100, with higher values indicating better performance.
data <- read.csv("https://raw.githubusercontent.com/johnnydrodriguez/data605/main/Student_Performance.csv", header = TRUE, sep = ',', na.strings="", fill = TRUE)
head(data)
## Hours.Studied Previous.Scores Extracurricular.Activities Sleep.Hours
## 1 7 99 Yes 9
## 2 4 82 No 4
## 3 8 51 Yes 7
## 4 5 52 Yes 5
## 5 7 75 No 8
## 6 3 78 No 9
## Sample.Question.Papers.Practiced Performance.Index
## 1 1 91
## 2 2 65
## 3 2 45
## 4 2 36
## 5 5 66
## 6 6 61
For this plot, I dropped the Extracurricular Activities column because it is categorical and would need to be converted to a dummy variable first.
# Plot
par(mfrow=c(2, 2))
# Plot each variable against Performance.Index with a regression line
plot(data$Hours.Studied, data$Performance.Index, main = "Performance vs Hours Studied",
xlab = "Hours Studied", ylab = "Performance Index")
abline(lm(Performance.Index ~ Hours.Studied, data=data), col = "red")
plot(data$Previous.Scores, data$Performance.Index, main = "Performance vs Previous Scores",
xlab = "Previous Scores", ylab = "Performance Index")
abline(lm(Performance.Index ~ Previous.Scores, data=data), col = "red")
plot(data$Sleep.Hours, data$Performance.Index, main = "Performance vs Sleep Hours",
xlab = "Sleep Hours", ylab = "Performance Index")
abline(lm(Performance.Index ~ Sleep.Hours, data=data), col = "red")
plot(data$Sample.Question.Papers.Practiced, data$Performance.Index, main = "Performance vs Sample Question Papers",
xlab = "Sample Question Papers Practiced", ylab = "Performance Index")
abline(lm(Performance.Index ~ Sample.Question.Papers.Practiced, data=data), col = "red")
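To put a number on the trend in each panel, one option is the Pearson correlation between every numeric predictor and the Performance Index. The sketch below is only an illustration: `demo` is a hypothetical stand-in with made-up values (not the real dataset) so the block runs on its own. With the real data, the same `sapply()` call could be pointed at `data` instead.

```r
# Hypothetical stand-in for `data`: made-up values, NOT the real dataset
set.seed(1)
demo <- data.frame(
  Hours.Studied = sample(1:9, 200, replace = TRUE),
  Previous.Scores = sample(40:99, 200, replace = TRUE),
  Sleep.Hours = sample(4:9, 200, replace = TRUE),
  Sample.Question.Papers.Practiced = sample(0:9, 200, replace = TRUE)
)
# Assumed relationship, used only to give the demo something to detect
demo$Performance.Index <- round(
  -30 + 3 * demo$Hours.Studied + demo$Previous.Scores + rnorm(200, sd = 2)
)

# Correlation of each predictor with the target, strongest first
predictors <- setdiff(names(demo), "Performance.Index")
cors <- sapply(demo[predictors], function(x) cor(x, demo$Performance.Index))
print(round(sort(cors, decreasing = TRUE), 3))
```

A correlation near zero corresponds to a nearly flat red line in the matching panel.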
Since the simple linear regressions showed a positive association between Performance Index and both Hours Studied and Previous Scores, we’ll use those variables in the multiple linear regression. To include a dichotomous term, I converted Extracurricular Activities into a dummy variable and included it in the model.
# Dummy variable 'Extracurricular.Activities' from 'Yes'/'No' to 1/0
data$Extracurricular.Activities <- ifelse(data$Extracurricular.Activities == "Yes", 1, 0)
# multiple linear regression model
model <- lm(Performance.Index ~ Hours.Studied + Previous.Scores + Extracurricular.Activities, data = data)
# Summary
summary(model)
##
## Call:
## lm(formula = Performance.Index ~ Hours.Studied + Previous.Scores +
## Extracurricular.Activities, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.1188 -1.4981 -0.0046 1.5016 9.2138
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -30.096168 0.105862 -284.30 <2e-16 ***
## Hours.Studied 2.857185 0.008748 326.61 <2e-16 ***
## Previous.Scores 1.018980 0.001306 780.16 <2e-16 ***
## Extracurricular.Activities 0.589267 0.045301 13.01 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.265 on 9996 degrees of freedom
## Multiple R-squared: 0.9861, Adjusted R-squared: 0.9861
## F-statistic: 2.365e+05 on 3 and 9996 DF, p-value: < 2.2e-16
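As a side note on the dummy coding, `lm()` can also build the 0/1 indicator itself when the column is left as a factor. The sketch below uses made-up data (the names `demo`, `Activity`, and `Score` are hypothetical) to show that the manual `ifelse()` coding and the factor coding give the same coefficient, provided "No" is the reference level.

```r
# Made-up data for illustration only
set.seed(42)
n <- 100
demo <- data.frame(
  Hours.Studied = sample(1:9, n, replace = TRUE),
  Activity = sample(c("Yes", "No"), n, replace = TRUE)
)
demo$Score <- 20 + 3 * demo$Hours.Studied +
  0.6 * (demo$Activity == "Yes") + rnorm(n)

# Option 1: manual 1/0 coding, as in the model above
demo$Activity01 <- ifelse(demo$Activity == "Yes", 1, 0)
fit_manual <- lm(Score ~ Hours.Studied + Activity01, data = demo)

# Option 2: factor with "No" as the reference level; lm() creates the dummy
demo$ActivityF <- factor(demo$Activity, levels = c("No", "Yes"))
fit_factor <- lm(Score ~ Hours.Studied + ActivityF, data = demo)

# The two coefficients should match
coef(fit_manual)["Activity01"]
coef(fit_factor)["ActivityFYes"]
```

The factor form can be more convenient when a categorical variable has more than two levels, since R then generates one indicator per non-reference level.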
Coefficients:
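Intercept (-30.096168): The intercept is the predicted Performance.Index for a student with zero hours studied, a previous score of zero, and no extracurricular activities. That combination is not a realistic case, so the intercept has no practical interpretation in this context.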
Hours.Studied (2.857185): For each additional hour studied, the Performance.Index is expected to increase by approximately 2.857 points, holding all other variables constant.
Previous.Scores (1.018980): For each additional point in Previous.Scores, the Performance.Index increases by about 1.019 points, assuming other variables are held constant.
Extracurricular.Activities (0.589267): Having extracurricular activities (value = 1) is associated with an average increase of about 0.589 points in Performance.Index compared to not having extracurricular activities (value = 0), assuming other variables are held constant.
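As a worked example of how these coefficients combine, the sketch below plugs the first row of `head(data)` (7 hours studied, previous score 99, extracurricular activities = Yes) into the fitted equation. The coefficient values are copied by hand from the summary output; the vector name `b` is just an illustrative choice.

```r
# Coefficients copied from the summary output above
b <- c(intercept = -30.096168, hours = 2.857185,
       previous = 1.018980, extra = 0.589267)

# First row of head(data): 7 hours, previous score 99, extracurricular = 1
pred <- b["intercept"] + b["hours"] * 7 + b["previous"] * 99 + b["extra"] * 1
unname(pred)
```

The prediction comes out at about 91.37, against an observed Performance.Index of 91 for that student. With the fitted `model` object available, `predict(model, newdata = ...)` should give the same value without retyping the coefficients.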
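Standard Error
The standard errors are small relative to the estimates (0.008748 for Hours.Studied, 0.001306 for Previous.Scores, and 0.045301 for Extracurricular.Activities), so each coefficient is estimated precisely.
P values
All three predictors have p-values below 2e-16, so each is statistically significant at any conventional level.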
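The t values and p-values in the coefficient table follow directly from the estimates and their standard errors: each t value is the estimate divided by its standard error, and the p-value is the two-sided tail probability of a t distribution with 9996 residual degrees of freedom. A minimal sketch, using the numbers as printed in the summary (so small differences from the reported t values are expected because the printed standard errors are rounded):

```r
# Estimates and standard errors copied from the summary output
est <- c(Hours.Studied = 2.857185,
         Previous.Scores = 1.018980,
         Extracurricular.Activities = 0.589267)
se <- c(0.008748, 0.001306, 0.045301)

# t value = estimate / standard error
t_val <- est / se
round(t_val, 2)

# Two-sided p-values on 9996 residual degrees of freedom
p_val <- 2 * pt(abs(t_val), df = 9996, lower.tail = FALSE)
p_val
```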
Model
Residual Standard Error (2.265): The model’s predictions are typically off by about 2.3 points. Since Performance.Index ranges from 10 to 100, this error is small and suggests a good fit.
Multiple R-squared (0.9861): This indicates that 98.61% of the variability in Performance.Index is explained by the model. Its high value suggests a very good fit of the model to the data. The equally high Adjusted R-squared (0.9861) suggests the same.
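For reference, R-squared is one minus the ratio of the residual sum of squares to the total sum of squares, and the adjusted version penalizes it by the number of predictors. The sketch below checks the first identity on a small made-up regression (the variables `x` and `y` are hypothetical), then applies the adjustment formula to the reported values, with n = 10000 records and k = 3 predictors.

```r
# Made-up regression, only to check the R-squared identity
set.seed(7)
x <- runif(50, 1, 9)
y <- 5 + 3 * x + rnorm(50)
fit <- lm(y ~ x)

rss <- sum(resid(fit)^2)        # residual sum of squares
tss <- sum((y - mean(y))^2)     # total sum of squares
r2_manual <- 1 - rss / tss
all.equal(r2_manual, summary(fit)$r.squared)

# Adjusted R-squared from the values reported in the summary above
r2 <- 0.9861; n <- 10000; k <- 3
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - k - 1)
round(adj_r2, 4)
```

With 10,000 observations and only three predictors the penalty is tiny, which is why the two values agree to four decimal places.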
F-statistic (2.365e+05): The very high F-statistic and its extremely small p-value (< 2.2e-16) indicate that the model as a whole is statistically significant.
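The F-statistic can be recovered, approximately, from R-squared and the degrees of freedom. The sketch below uses the rounded values printed in the summary (R-squared = 0.9861, 3 predictors, 9996 residual degrees of freedom), so the result is expected to be close to, not identical to, the reported 2.365e+05.

```r
# Values copied from the summary output (R-squared is rounded there)
r2 <- 0.9861
k <- 3            # number of predictors
df_resid <- 9996  # residual degrees of freedom

# F = (explained variance per predictor) / (unexplained variance per residual df)
f_stat <- (r2 / k) / ((1 - r2) / df_resid)
f_stat

# Upper-tail p-value of the F distribution
p_value <- pf(f_stat, df1 = k, df2 = df_resid, lower.tail = FALSE)
p_value
```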