The dataset contains information about students, such as study hours per day, attendance percentage, subject scores (Math, Science, English), and the final percentage (Final_Percentage).
Goal: predict Final_Percentage.
data <- read.csv("Student_Performance_Dataset.csv")
head(data)
## Student_ID Age Gender Class Study_Hours_Per_Day Attendance_Percentage
## 1 S0001 15 Male 12 1.0 65
## 2 S0002 19 Female 9 1.6 58
## 3 S0003 14 Female 12 3.6 64
## 4 S0004 18 Female 9 5.5 68
## 5 S0005 14 Male 10 5.0 80
## 6 S0006 19 Male 12 5.2 82
## Parental_Education Internet_Access Extracurricular_Activities Math_Score
## 1 Postgraduate No No 40
## 2 Graduate No Yes 80
## 3 High School Yes Yes 83
## 4 Postgraduate Yes No 68
## 5 High School Yes No 41
## 6 High School No Yes 88
## Science_Score English_Score Previous_Year_Score Final_Percentage
## 1 39 72 81 50.33
## 2 44 35 47 53.00
## 3 73 59 58 71.67
## 4 48 77 54 64.33
## 5 46 36 68 41.00
## 6 70 46 60 68.00
## Performance_Level Pass_Fail
## 1 Average Pass
## 2 Average Pass
## 3 Good Pass
## 4 Average Pass
## 5 Poor Fail
## 6 Good Pass
summary(data)
## Student_ID Age Gender Class
## Length:5000 Min. :14.00 Length:5000 Min. : 9.0
## Class :character 1st Qu.:15.00 Class :character 1st Qu.:10.0
## Mode :character Median :17.00 Mode :character Median :10.0
## Mean :16.51 Mean :10.5
## 3rd Qu.:18.00 3rd Qu.:11.0
## Max. :19.00 Max. :12.0
## Study_Hours_Per_Day Attendance_Percentage Parental_Education
## Min. :0.500 Min. : 50.00 Length:5000
## 1st Qu.:1.900 1st Qu.: 62.00 Class :character
## Median :3.300 Median : 75.00 Mode :character
## Mean :3.287 Mean : 74.92
## 3rd Qu.:4.700 3rd Qu.: 88.00
## Max. :6.000 Max. :100.00
## Internet_Access Extracurricular_Activities Math_Score Science_Score
## Length:5000 Length:5000 Min. : 35.00 Min. : 35.0
## Class :character Class :character 1st Qu.: 52.00 1st Qu.: 50.0
## Mode :character Mode :character Median : 68.00 Median : 67.0
## Mean : 67.75 Mean : 66.9
## 3rd Qu.: 84.00 3rd Qu.: 83.0
## Max. :100.00 Max. :100.0
## English_Score Previous_Year_Score Final_Percentage Performance_Level
## Min. : 35.00 Min. :40.00 Min. :36.33 Length:5000
## 1st Qu.: 51.00 1st Qu.:53.00 1st Qu.:59.67 Class :character
## Median : 68.00 Median :67.00 Median :67.33 Mode :character
## Mean : 67.78 Mean :67.28 Mean :67.48
## 3rd Qu.: 85.00 3rd Qu.:81.00 3rd Qu.:75.33
## Max. :100.00 Max. :95.00 Max. :98.33
## Pass_Fail
## Length:5000
## Class :character
## Mode :character
##
##
##
plot(data$Study_Hours_Per_Day, data$Final_Percentage,
main="Study Hours vs Final Percentage",
xlab="Study Hours",
ylab="Final %",
pch=19)
Boxplot
boxplot(data$Final_Percentage,
main="Distribution of Final Percentage")
3. Simple Linear Regression
We try to predict Final_Percentage from the number of study hours per day.
model1 <- lm(Final_Percentage ~ Study_Hours_Per_Day, data=data)
summary(model1)
##
## Call:
## lm(formula = Final_Percentage ~ Study_Hours_Per_Day, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -30.9855 -7.8244 -0.0123 7.7841 30.8088
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 67.77175 0.35650 190.101 <2e-16 ***
## Study_Hours_Per_Day -0.08946 0.09765 -0.916 0.36
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.96 on 4998 degrees of freedom
## Multiple R-squared: 0.0001679, Adjusted R-squared: -3.215e-05
## F-statistic: 0.8393 on 1 and 4998 DF, p-value: 0.3596
Plot of the fitted line
plot(data$Study_Hours_Per_Day, data$Final_Percentage)
abline(model1, col="red", lwd=2)
4. Multiple Linear Regression
We use more predictor variables:
model2 <- lm(Final_Percentage ~ Study_Hours_Per_Day +
Attendance_Percentage +
Previous_Year_Score,
data=data)
summary(model2)
##
## Call:
## lm(formula = Final_Percentage ~ Study_Hours_Per_Day + Attendance_Percentage +
## Previous_Year_Score, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31.056 -7.813 -0.003 7.793 31.020
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 68.4879726 1.0780987 63.527 <2e-16 ***
## Study_Hours_Per_Day -0.0873326 0.0977150 -0.894 0.371
## Attendance_Percentage 0.0007018 0.0105732 0.066 0.947
## Previous_Year_Score -0.0115307 0.0096007 -1.201 0.230
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.96 on 4996 degrees of freedom
## Multiple R-squared: 0.0004578, Adjusted R-squared: -0.0001424
## F-statistic: 0.7627 on 3 and 4996 DF, p-value: 0.5148
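The F-statistic in the summary above tests model2 against an intercept-only model. To ask instead whether the two extra predictors improve on model1, the two nested fits can be compared with `anova()`. The sketch below is only an illustration: it uses a toy data frame (`toy`, a name introduced here) holding the six rows printed by head(data), so that it runs without the CSV file. With the full data, `anova(model1, model2)` would be the corresponding call.

```r
# Toy data: the six rows shown by head(data), NOT the full 5000-row dataset.
toy <- data.frame(
  Study_Hours_Per_Day   = c(1.0, 1.6, 3.6, 5.5, 5.0, 5.2),
  Attendance_Percentage = c(65, 58, 64, 68, 80, 82),
  Previous_Year_Score   = c(81, 47, 58, 54, 68, 60),
  Final_Percentage      = c(50.33, 53.00, 71.67, 64.33, 41.00, 68.00)
)

# Smaller model: one predictor (same formula as model1).
small <- lm(Final_Percentage ~ Study_Hours_Per_Day, data = toy)

# Larger model: adds two predictors (same formula as model2).
large <- lm(Final_Percentage ~ Study_Hours_Per_Day +
              Attendance_Percentage +
              Previous_Year_Score,
            data = toy)

# Partial F-test: do the added predictors reduce the residual sum of squares
# by more than chance alone would suggest?
cmp <- anova(small, large)
print(cmp)
```

The `RSS` column of the result holds the SSE of each model, and `Pr(>F)` is the p-value for the added terms. With only six rows the toy result has no interpretive value; it shows the mechanics only.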
# R-squared
summary(model1)$r.squared
## [1] 0.0001678996
summary(model2)$r.squared
## [1] 0.0004578019
# SSE
SSE1 <- sum(residuals(model1)^2)
SSE2 <- sum(residuals(model2)^2)
SSE1
## [1] 600834.3
SSE2
## [1] 600660.1
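We expect:
model2 to have a higher R² and a lower SSE (a better fit).
The output confirms this, but only marginally: R² rises from 0.000168 to 0.000458 and SSE falls from 600834.3 to 600660.1. Adding predictors to a least-squares model can never lower R², so this improvement alone says little. The adjusted R² actually drops (from -3.215e-05 to -0.0001424), and neither model explains a meaningful share of the variance in Final_Percentage.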
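R² and SSE are two views of the same quantity: R² = 1 - SSE / SST, where SST is the total sum of squares of the response around its mean. A minimal sketch of that relationship follows. The helper `r2_from_sse` and the data frame `toy` are names introduced here for illustration, and `toy` contains only the six rows printed by head(data), not the full dataset.

```r
# Toy data: the six rows shown by head(data), NOT the full 5000-row dataset.
toy <- data.frame(
  Study_Hours_Per_Day = c(1.0, 1.6, 3.6, 5.5, 5.0, 5.2),
  Final_Percentage    = c(50.33, 53.00, 71.67, 64.33, 41.00, 68.00)
)

# Hypothetical helper: R-squared computed by hand as 1 - SSE / SST.
r2_from_sse <- function(model) {
  y   <- model.response(model.frame(model))  # observed response values
  sse <- sum(residuals(model)^2)             # variation left unexplained
  sst <- sum((y - mean(y))^2)                # total variation around the mean
  1 - sse / sst
}

fit <- lm(Final_Percentage ~ Study_Hours_Per_Day, data = toy)

r2_manual  <- r2_from_sse(fit)
r2_builtin <- summary(fit)$r.squared

# The two values should agree up to floating-point error.
print(c(manual = r2_manual, builtin = r2_builtin))
```

Because SST depends only on the response, two models fitted to the same data share it, so a lower SSE and a higher R² always go together.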
cor(data[, c("Study_Hours_Per_Day",
"Attendance_Percentage",
"Previous_Year_Score",
"Final_Percentage")])
## Study_Hours_Per_Day Attendance_Percentage
## Study_Hours_Per_Day 1.00000000 0.0269157371
## Attendance_Percentage 0.02691574 1.0000000000
## Previous_Year_Score 0.01966172 -0.0116111435
## Final_Percentage -0.01295761 0.0007960933
## Previous_Year_Score Final_Percentage
## Study_Hours_Per_Day 0.01966172 -0.0129576088
## Attendance_Percentage -0.01161114 0.0007960933
## Previous_Year_Score 1.00000000 -0.0172520949
## Final_Percentage -0.01725209 1.0000000000
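All three predictors correlate with Final_Percentage at essentially zero, which is consistent with the very low R² values above. One thing that may be worth checking, as an assumption rather than something the dataset documentation states, is whether Final_Percentage is derived directly from the three subject scores. The sketch below tests this on the six rows printed by head(data) only (`toy` is a name introduced here); whether it holds for all 5000 rows is not verified.

```r
# Toy data: subject scores and final percentage from the six head(data) rows.
toy <- data.frame(
  Math_Score       = c(40, 80, 83, 68, 41, 88),
  Science_Score    = c(39, 44, 73, 48, 46, 70),
  English_Score    = c(72, 35, 59, 77, 36, 46),
  Final_Percentage = c(50.33, 53.00, 71.67, 64.33, 41.00, 68.00)
)

# Row-wise mean of the three subject scores.
subject_mean <- rowMeans(toy[, c("Math_Score", "Science_Score", "English_Score")])

# Largest absolute gap between that mean and the reported final percentage.
max_gap <- max(abs(subject_mean - toy$Final_Percentage))
print(max_gap)
```

In these six rows the gap stays below 0.01, that is, within rounding to two decimals. If the same pattern holds across the full data, the subject scores would determine the target almost exactly, while study hours, attendance, and the previous-year score would carry little information about it.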