————————————————————————
Project description and details:
Goal: Predict whether students pass, and which grade band they fall into, from selected variables, and identify which variable(s) predict best.
The dataset is the Student Performance dataset (student-mat.csv, Math course), downloaded from the UCI Repository.
The G3 variable (final grade) is used to classify students as Pass vs. Fail and into the grade bands Fail, Sufficient, Satisfactory, Good and Excellent. These classes are then predicted from a set of independent variables.
Predictors: 1) ParentStatus (living together or apart) 2) MotherEducation (levels: none, up to 4th grade, 5th to 9th grade, secondary education, higher education) 3) TravelTime to school 4) Romantic status of the student 5) G1 - first period grade 6) G2 - second period grade
Methods used: 1) Linear regression 2) Decision tree 3) Naive Bayes
Linear regression is applied to G1 and G2 individually to predict G3.
A decision tree is fitted on G1 and G2 together to predict pass-ability and Grade.
Naive Bayes is applied to the categorical variables ParentStatus, MotherEducation, TravelTime and Romantic status to predict pass-ability and Grade.
————————————————————————
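Load libraries and students data
library(rvest)    # read_html, html_nodes, html_text
library(ggplot2)  # plots
library(party)    # ctree
library(e1071)    # naiveBayes
library(tidyr)    # gather
library(dplyr)    # %>%, group_by, summarise
students <- read.table("C:/Senthil/MSDataAnalytics/Semester1/Projects/IS607/FinalProject/student/student-mat.csv", header=TRUE,sep=";")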
Original attribute information, scraped from the UCI Repository website
url <- "http://archive.ics.uci.edu/ml/datasets/Student+Performance#"
studentsdspage <- read_html(url)  # html() is deprecated in rvest in favour of read_html()
scrapedhtml <- studentsdspage%>% html_nodes("p") %>% html_text()
dsattributeinfo <- scrapedhtml[27]
stringcollection <- strsplit(dsattributeinfo, split="\r")
stringcollection[1]
## [[1]]
## [1] "# Attributes for both student-mat.csv (Math course) and student-por.csv (Portuguese language course) datasets:"
## [2] "1 school - student's school (binary: 'GP' - Gabriel Pereira or 'MS' - Mousinho da Silveira)"
## [3] "2 sex - student's sex (binary: 'F' - female or 'M' - male)"
## [4] "3 age - student's age (numeric: from 15 to 22)"
## [5] "4 address - student's home address type (binary: 'U' - urban or 'R' - rural)"
## [6] "5 famsize - family size (binary: 'LE3' - less or equal to 3 or 'GT3' - greater than 3)"
## [7] "6 Pstatus - parent's cohabitation status (binary: 'T' - living together or 'A' - apart)"
## [8] "7 Medu - mother's education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)"
## [9] "8 Fedu - father's education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)"
## [10] "9 Mjob - mother's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other')"
## [11] "10 Fjob - father's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other')"
## [12] "11 reason - reason to choose this school (nominal: close to 'home', school 'reputation', 'course' preference or 'other')"
## [13] "12 guardian - student's guardian (nominal: 'mother', 'father' or 'other')"
## [14] "13 traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)"
## [15] "14 studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)"
## [16] "15 failures - number of past class failures (numeric: n if 1<=n<3, else 4) 16 schoolsup - extra educational support (binary: yes or no)"
## [17] "17 famsup - family educational support (binary: yes or no)"
## [18] "18 paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)"
## [19] "19 activities - extra-curricular activities (binary: yes or no)"
## [20] "20 nursery - attended nursery school (binary: yes or no)"
## [21] "21 higher - wants to take higher education (binary: yes or no)"
## [22] "22 internet - Internet access at home (binary: yes or no)"
## [23] "23 romantic - with a romantic relationship (binary: yes or no)"
## [24] "24 famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)"
## [25] "25 freetime - free time after school (numeric: from 1 - very low to 5 - very high)"
## [26] "26 goout - going out with friends (numeric: from 1 - very low to 5 - very high)"
## [27] "27 Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)"
## [28] "28 Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)"
## [29] "29 health - current health status (numeric: from 1 - very bad to 5 - very good)"
## [30] "30 absences - number of school absences (numeric: from 0 to 93)"
## [31] "# these grades are related with the course subject, Math or Portuguese:"
## [32] "31 G1 - first period grade (numeric: from 0 to 20)"
## [33] "31 G2 - second period grade (numeric: from 0 to 20)"
## [34] "32 G3 - final grade (numeric: from 0 to 20, output target)"
Keep only the columns used in the analysis (Pstatus, Medu, traveltime, romantic, G1, G2, G3)
students <- students[,c(6,7,13,23,31,32,33)]
Calculate Pass or Fail variable
Pass <- ifelse(students$G3>9,'PASS','FAIL')
students <- data.frame(students,Pass)
Calculate Grade variable
Grade <- ifelse(students$G3<=9,'FAIL','Pass')
Grade <- ifelse(students$G3>=10 & students$G3<=11,'Sufficient',Grade)
Grade <- ifelse(students$G3>=12 & students$G3<=13,'Satisfactory',Grade)
Grade <- ifelse(students$G3>=14 & students$G3<=15,'Good',Grade)
Grade <- ifelse(students$G3>=16 & students$G3<=20,'Excellent',Grade)
students <- data.frame(students,Grade)
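The banding above can also be written in one step. The following is a minimal sketch, not the method used in this report: it assumes integer grades on the 0 to 20 scale and uses a made-up vector `g3` in place of `students$G3`. `cut()` returns a factor whose levels are already in grade order.

```r
# Hypothetical sample of final grades (illustrative, not taken from the dataset)
g3 <- c(0, 9, 10, 11, 12, 13, 14, 15, 16, 20)

# Right-closed bins: (-Inf,9], (9,11], (11,13], (13,15], (15,20]
grade <- cut(g3,
             breaks = c(-Inf, 9, 11, 13, 15, 20),
             labels = c("FAIL", "Sufficient", "Satisfactory", "Good", "Excellent"))

# Same pass rule as above: G3 greater than 9 is a pass
pass <- factor(ifelse(g3 > 9, "PASS", "FAIL"))

print(data.frame(g3, pass, grade))
```

Because the levels come out ordered from FAIL to Excellent, plots and tables built on `grade` would not need a separate reordering step.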
————————————————————————
exploration of data
————————————————————————
dimensions
dim(students)
## [1] 349 9
nrow(students)
## [1] 349
ncol(students)
## [1] 9
structure
str(students)
## 'data.frame': 349 obs. of 9 variables:
## $ Pstatus : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
## $ Medu : int 4 1 1 4 3 4 2 4 3 3 ...
## $ traveltime: int 2 1 1 1 1 1 1 2 1 1 ...
## $ romantic : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ G1 : int 5 5 7 15 6 15 12 6 16 14 ...
## $ G2 : int 6 5 8 14 10 15 12 5 18 15 ...
## $ G3 : int 6 6 10 15 10 15 11 6 19 15 ...
## $ Pass : Factor w/ 2 levels "FAIL","PASS": 1 1 2 2 2 2 2 1 2 2 ...
## $ Grade : Factor w/ 5 levels "Excellent","FAIL",..: 2 2 5 3 5 3 5 2 1 3 ...
variable or column names
names(students)
## [1] "Pstatus" "Medu" "traveltime" "romantic" "G1"
## [6] "G2" "G3" "Pass" "Grade"
Attributes
attributes(students)
## $names
## [1] "Pstatus" "Medu" "traveltime" "romantic" "G1"
## [6] "G2" "G3" "Pass" "Grade"
##
## $row.names
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
## [18] 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
## [35] 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51
## [52] 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68
## [69] 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85
## [86] 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102
## [103] 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119
## [120] 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136
## [137] 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153
## [154] 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170
## [171] 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187
## [188] 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204
## [205] 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221
## [222] 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238
## [239] 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255
## [256] 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272
## [273] 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289
## [290] 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306
## [307] 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323
## [324] 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340
## [341] 341 342 343 344 345 346 347 348 349
##
## $class
## [1] "data.frame"
First ten rows
students[1:10,]
## Pstatus Medu traveltime romantic G1 G2 G3 Pass Grade
## 1 A 4 2 no 5 6 6 FAIL FAIL
## 2 T 1 1 no 5 5 6 FAIL FAIL
## 3 T 1 1 no 7 8 10 PASS Sufficient
## 4 T 4 1 yes 15 14 15 PASS Good
## 5 T 3 1 no 6 10 10 PASS Sufficient
## 6 T 4 1 no 15 15 15 PASS Good
## 7 T 2 1 no 12 12 11 PASS Sufficient
## 8 A 4 2 no 6 5 6 FAIL FAIL
## 9 A 3 1 no 16 18 19 PASS Excellent
## 10 T 3 1 no 14 15 15 PASS Good
Variable distribution before necessary factorization
summary(students)
## Pstatus Medu traveltime romantic G1
## A: 38 Min. :0.000 Min. :1.000 no :236 Min. : 3.00
## T:311 1st Qu.:2.000 1st Qu.:1.000 yes:113 1st Qu.: 8.00
## Median :3.000 Median :1.000 Median :11.00
## Mean :2.802 Mean :1.387 Mean :10.94
## 3rd Qu.:4.000 3rd Qu.:2.000 3rd Qu.:13.00
## Max. :4.000 Max. :4.000 Max. :19.00
## G2 G3 Pass Grade
## Min. : 0.00 Min. : 0.00 FAIL:113 Excellent : 37
## 1st Qu.: 9.00 1st Qu.: 8.00 PASS:236 FAIL :113
## Median :11.00 Median :11.00 Good : 56
## Mean :10.78 Mean :10.49 Satisfactory: 54
## 3rd Qu.:13.00 3rd Qu.:14.00 Sufficient : 89
## Max. :19.00 Max. :20.00
Convert numerically coded categorical predictors (Medu, traveltime) to factors
students$Medu <- factor(students$Medu)
students$traveltime <- factor(students$traveltime)
Variable distribution after necessary factorization
summary(students)
## Pstatus Medu traveltime romantic G1 G2
## A: 38 0: 3 1:243 no :236 Min. : 3.00 Min. : 0.00
## T:311 1: 42 2: 84 yes:113 1st Qu.: 8.00 1st Qu.: 9.00
## 2: 96 3: 15 Median :11.00 Median :11.00
## 3: 88 4: 7 Mean :10.94 Mean :10.78
## 4:120 3rd Qu.:13.00 3rd Qu.:13.00
## Max. :19.00 Max. :19.00
## G3 Pass Grade
## Min. : 0.00 FAIL:113 Excellent : 37
## 1st Qu.: 8.00 PASS:236 FAIL :113
## Median :11.00 Good : 56
## Mean :10.49 Satisfactory: 54
## 3rd Qu.:14.00 Sufficient : 89
## Max. :20.00
Pie chart for Pass
pie(with(students, table(Pass)))

pie chart for Grade
pie(with(students, table(Grade)))

bar graph for Pass
ggplot(students, aes(x = Pass)) + geom_bar(fill="#FF9999")

bar graph for Grade
reordervect <- rep(0,nrow(students))
reordervect[with(students, Grade == "FAIL")] = 1  # level is "FAIL", not "Fail"
reordervect[with(students, Grade == "Sufficient")] = 2
reordervect[with(students, Grade == "Satisfactory")] = 3
reordervect[with(students, Grade == "Good")] = 4
reordervect[with(students, Grade == "Excellent")] = 5
students$Grade = with(students, reorder(Grade, reordervect))
rm(reordervect)
ggplot(students, aes(x = Grade)) + geom_bar(fill="#FF9999")

Summary statistics of G3
summary(with(students,G3))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 8.00 11.00 10.49 14.00 20.00
sprintf('variance is %f',var(with(students,G3)))
## [1] "variance is 21.394296"
sprintf('standard deviation is %f',sd(with(students,G3)))
## [1] "standard deviation is 4.625397"
Histogram of G3
#hist(with(students,G3), main='', xlab='Score', breaks=20)
ggplot(students, aes(x=G3)) + geom_histogram(fill="#FF9999")

————————————————————————
Predicting G3 using G1
————————————————————————
Correlation between G1 and G3
r <- cor(with(students,G1), with(students,G3))
sprintf('Correlation between G1 and G3 is %f and coefficient of determination is %f',r, r^2)
## [1] "Correlation between G1 and G3 is 0.794055 and coefficient of determination is 0.630523"
scatterplot of G1, G3
#plot(with(students,G1), with(students,G3), xlab='G1', ylab='G3')
ggplot(students, aes(x=G1,y=G3)) + geom_point() + geom_smooth(method="lm", se=FALSE)

sprintf("G3 shows a positive correlation with G1")
## [1] "G3 shows a positive correlation with G1"
————————————————————————
Predicting G3 using G2
————————————————————————
Correlation between G2 and G3
r <- cor(with(students,G2), with(students,G3))
sprintf('Correlation between G2 and G3 is %f and the coefficient of determination is %f',r, r^2)
## [1] "Correlation between G2 and G3 is 0.902783 and the coefficient of determination is 0.815016"
scatterplot of G2, G3
#plot(with(students,G2), with(students,G3), xlab='G2', ylab='G3')
ggplot(students, aes(x=G2,y=G3)) + geom_point() + geom_smooth(method="lm", se=FALSE)

sprintf("G3 shows a strong positive correlation with G2")
## [1] "G3 shows a strong positive correlation with G2"
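The two sections above report correlations and draw a fitted line with `geom_smooth(method="lm")`. As a sketch of fitting the same single-predictor regressions explicitly with `lm()`, the example below uses only the first ten rows printed earlier in this report, so its numbers will differ from the full-data values. The data frame name `first10` is introduced here for illustration.

```r
# First ten rows of G1, G2, G3 as printed under "First ten rows"
first10 <- data.frame(
  G1 = c(5, 5, 7, 15, 6, 15, 12, 6, 16, 14),
  G2 = c(6, 5, 8, 14, 10, 15, 12, 5, 18, 15),
  G3 = c(6, 6, 10, 15, 10, 15, 11, 6, 19, 15)
)

# One simple linear regression per predictor
fit_g1 <- lm(G3 ~ G1, data = first10)
fit_g2 <- lm(G3 ~ G2, data = first10)

# For a single predictor, R-squared equals the squared correlation
r2 <- c(G1 = summary(fit_g1)$r.squared,
        G2 = summary(fit_g2)$r.squared)
print(round(r2, 3))

# Intercept and slope of each fitted line
print(coef(fit_g1))
print(coef(fit_g2))
```

On the full dataset the same calls would take `data = students`, and the R-squared values should match the coefficients of determination printed above.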
————————————————————————
Predicting Pass and Fail from G1 + G2 using a decision tree
————————————————————————
formula <- Pass ~ G1 + G2
tree <- ctree(formula, data=students)
print(tree)
##
## Conditional inference tree with 4 terminal nodes
##
## Response: Pass
## Inputs: G1, G2
## Number of observations: 349
##
## 1) G2 <= 9; criterion = 1, statistic = 179.794
## 2) G2 <= 8; criterion = 0.999, statistic = 11.579
## 3)* weights = 82
## 2) G2 > 8
## 4)* weights = 44
## 1) G2 > 9
## 5) G1 <= 10; criterion = 0.999, statistic = 12.194
## 6)* weights = 51
## 5) G1 > 10
## 7)* weights = 172
plot(tree)

plot(tree, type="simple")

sprintf('Errors-on-predictions Matrix')
## [1] "Errors-on-predictions Matrix"
table(predict(tree, newdata=students), students$Pass,dnn=c('Predicted','Actual'))
## Actual
## Predicted FAIL PASS
## FAIL 105 21
## PASS 8 215
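To put a number on the matrix above, the sketch below re-enters its four counts by hand and computes overall accuracy and per-class recall. The object names are illustrative. Because the tree was evaluated on the same rows it was fitted to, these are in-sample figures.

```r
# Counts copied from the matrix above (rows = Predicted, columns = Actual)
cm <- matrix(c(105,  21,
                 8, 215),
             nrow = 2, byrow = TRUE,
             dimnames = list(Predicted = c("FAIL", "PASS"),
                             Actual    = c("FAIL", "PASS")))

# Share of all students on the diagonal (correctly classified)
accuracy <- sum(diag(cm)) / sum(cm)

# Recall per actual class: correct predictions divided by the column total
recall <- diag(cm) / colSums(cm)

print(round(accuracy, 3))
print(round(recall, 3))
```

With these counts, 320 of the 349 students sit on the diagonal.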
df.confmatrix <- data.frame(table(predict(tree, newdata=students), students$Pass,dnn=c('Predicted','Actual')))
data_long <- gather(df.confmatrix, Type, Status, Predicted:Actual)
data_long <- data_long %>% group_by(Status,Type) %>% summarise(Frequency=sum(Freq))
ggplot(data_long, aes(x=Status,y=Frequency,fill=Type)) + geom_bar(stat='identity', position='dodge')

————————————————————————
Predicting Grades from G1 + G2 using a decision tree
————————————————————————
formula <- Grade ~ G1 + G2
tree <- ctree(formula, data=students)
print(tree)
##
## Conditional inference tree with 8 terminal nodes
##
## Response: Grade
## Inputs: G1, G2
## Number of observations: 349
##
## 1) G2 <= 9; criterion = 1, statistic = 267.241
## 2) G2 <= 8; criterion = 0.999, statistic = 11.579
## 3)* weights = 82
## 2) G2 > 8
## 4)* weights = 44
## 1) G2 > 9
## 5) G2 <= 15; criterion = 1, statistic = 185.971
## 6) G2 <= 11; criterion = 1, statistic = 151.247
## 7)* weights = 72
## 6) G2 > 11
## 8) G2 <= 13; criterion = 1, statistic = 77.643
## 9) G2 <= 12; criterion = 0.998, statistic = 14.374
## 10)* weights = 36
## 9) G2 > 12
## 11)* weights = 32
## 8) G2 > 13
## 12) G2 <= 14; criterion = 0.993, statistic = 11.398
## 13)* weights = 21
## 12) G2 > 14
## 14)* weights = 32
## 5) G2 > 15
## 15)* weights = 30
plot(tree)

plot(tree, type="simple")

sprintf('Errors-on-predictions Matrix')
## [1] "Errors-on-predictions Matrix"
table(predict(tree, newdata=students), students$Grade,dnn=c('Predicted','Actual'))
## Actual
## Predicted FAIL Sufficient Satisfactory Good Excellent
## FAIL 105 21 0 0 0
## Sufficient 8 56 8 0 0
## Satisfactory 0 12 43 13 0
## Good 0 0 3 40 10
## Excellent 0 0 0 3 27
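For an ordered outcome such as Grade, it is useful to know how far off the wrong predictions are as well as how many there are. The sketch below re-enters the matrix above by hand and computes exact accuracy alongside the share of predictions that land within one grade band of the actual band. The names `lv`, `band_gap` and `within_one` are illustrative.

```r
lv <- c("FAIL", "Sufficient", "Satisfactory", "Good", "Excellent")

# Counts copied from the matrix above (rows = Predicted, columns = Actual)
cm <- matrix(c(105, 21,  0,  0,  0,
                 8, 56,  8,  0,  0,
                 0, 12, 43, 13,  0,
                 0,  0,  3, 40, 10,
                 0,  0,  0,  3, 27),
             nrow = 5, byrow = TRUE,
             dimnames = list(Predicted = lv, Actual = lv))

# Exact-match accuracy: diagonal over total
exact <- sum(diag(cm)) / sum(cm)

# Number of grade bands between the predicted (row) and actual (column) class
band_gap <- abs(row(cm) - col(cm))

# Share of predictions that are exact or off by one band
within_one <- sum(cm[band_gap <= 1]) / sum(cm)

print(round(c(exact = exact, within_one = within_one), 3))
```

In this matrix every non-zero off-diagonal cell is next to the diagonal, so all misclassifications are off by exactly one band.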
df.confmatrix <- data.frame(table(predict(tree, newdata=students), students$Grade,dnn=c('Predicted','Actual')))
data_long <- gather(df.confmatrix, Type, Status, Predicted:Actual)
data_long <- data_long %>% group_by(Status,Type) %>% summarise(Frequency=sum(Freq))
ggplot(data_long, aes(x=Status,y=Frequency,fill=Type)) + geom_bar(stat='identity', position='dodge')

————————————————————————
Predicting Pass and Fail using Naive Bayes Prediction
————————————————————————
classifier<-naiveBayes(students[,1:4], students[,8])
sprintf('Errors-on-predictions Matrix')
## [1] "Errors-on-predictions Matrix"
table(predict(classifier, students[,1:4]), students[,8], dnn = c('Predicted','Actual'))
## Actual
## Predicted FAIL PASS
## FAIL 8 9
## PASS 105 227
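A quick check on the matrix above is to compare the classifier's accuracy with the majority-class baseline, that is, always predicting the more common class. The sketch below re-enters the counts by hand. The names are illustrative.

```r
# Counts copied from the matrix above (rows = Predicted, columns = Actual)
cm <- matrix(c(  8,   9,
               105, 227),
             nrow = 2, byrow = TRUE,
             dimnames = list(Predicted = c("FAIL", "PASS"),
                             Actual    = c("FAIL", "PASS")))

# Accuracy of the Naive Bayes predictions
accuracy <- sum(diag(cm)) / sum(cm)

# Accuracy of always predicting the most frequent actual class
baseline <- max(colSums(cm)) / sum(cm)

print(round(c(naive_bayes = accuracy, majority_baseline = baseline), 3))
```

Here the classifier gets 235 of 349 right, while always predicting PASS would get 236 of 349 right, so the four categorical predictors add essentially nothing for Pass vs. Fail.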
df.confmatrix <- data.frame(table(predict(classifier, students[,1:4]), students[,8], dnn = c('Predicted','Actual')))
data_long <- gather(df.confmatrix, Type, Status, Predicted:Actual)
data_long <- data_long %>% group_by(Status,Type) %>% summarise(Frequency=sum(Freq))
ggplot(data_long, aes(x=Status,y=Frequency,fill=Type)) + geom_bar(stat='identity', position='dodge')

————————————————————————
Predicting Grades using Naive Bayes Prediction
————————————————————————
classifier<-naiveBayes(students[,1:4], students[,9])
sprintf('Errors-on-predictions Matrix')
## [1] "Errors-on-predictions Matrix"
table(predict(classifier, students[,1:4]), students[,9], dnn = c('Predicted','Actual'))
## Actual
## Predicted FAIL Sufficient Satisfactory Good Excellent
## FAIL 87 49 39 33 20
## Sufficient 14 28 9 4 5
## Satisfactory 0 0 0 0 0
## Good 11 11 6 17 11
## Excellent 1 1 0 2 1
df.confmatrix <- data.frame(table(predict(classifier, students[,1:4]), students[,9], dnn = c('Predicted','Actual')))
data_long <- gather(df.confmatrix, Type, Status, Predicted:Actual)
data_long <- data_long %>% group_by(Status,Type) %>% summarise(Frequency=sum(Freq))
ggplot(data_long, aes(x=Status,y=Frequency,fill=Type)) + geom_bar(stat='identity', position='dodge')

————————————————————————
Conclusion
————————————————————————
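Linear regression showed a strong relationship between G3 and G2 (r = 0.90, r^2 = 0.82). G1 also showed a positive relationship with G3, but a weaker one (r = 0.79, r^2 = 0.63).
The decision trees on G1 and G2 made few errors on the data they were fitted to: 320 of 349 students (91.7%) were classified correctly for Pass/Fail and 271 of 349 (77.7%) for Grade, with every Grade error falling in an adjacent band. This makes G1 and G2 suitable for predicting students' grades and pass-ability, although these are in-sample figures.
Naive Bayes showed large errors on predictions: 235 of 349 (67.3%) correct for Pass/Fail, no better than always predicting PASS (236 of 349, 67.6%), and 133 of 349 (38.1%) correct for Grade. So either those four variables are not good predictors or Naive Bayes is not a good model for this dataset.
Based on all of the analysis, G2 is the strongest predictor of G3 and, in turn, of pass-ability and grades.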