————————————————————————

Project description and details:

Goal: Predict students' ability to pass and their final grades from a set of variables, and determine which variable(s) predict best.

The dataset is the Student Performance dataset, downloaded from the UCI Machine Learning Repository.

The Student dataset includes a G3 variable (final grade), which is used both to classify students as Pass vs. Fail and to assign a grade of FAIL, Sufficient, Satisfactory, Good or Excellent. These classifications are then predicted from several independent variables.

Predictors: 1) ParentStatus (parents living together or apart) 2) MotherEducation (none, up to 4th grade, 5th to 9th grade, secondary education, or higher education) 3) TravelTime to school 4) Romantic status of the student 5) G1 - first period grade 6) G2 - second period grade

Methods used: 1) Linear regression 2) Decision tree 3) Naive Bayes

Linear regression is fit on G1 and G2 individually to predict G3.

A decision tree is fit on G1 and G2 together to predict pass/fail status and grade.

Naive Bayes is fit on the categorical variables ParentStatus, MotherEducation, TravelTime and Romantic status to predict pass/fail status and grade.

————————————————————————
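
The code chunks below rely on several packages. The original setup chunk is not shown here, so the following is an assumed minimal setup (a sketch, not part of the original analysis):

library(rvest)    # html()/read_html(), html_nodes(), html_text() for scraping the attribute list
library(ggplot2)  # bar charts, scatterplots, histograms
library(tidyr)    # gather()
library(dplyr)    # group_by(), summarise()
library(party)    # ctree() conditional inference trees
library(e1071)    # naiveBayes()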

Load students data

students <- read.table("C:/Senthil/MSDataAnalytics/Semester1/Projects/IS607/FinalProject/student/student-mat.csv", header=TRUE,sep=";")

Original attribute information, scraped from the UCI Repository website with R

url <- "http://archive.ics.uci.edu/ml/datasets/Student+Performance#"
studentsdspage <- html(url)   # html() is the older rvest name; current rvest uses read_html()
scrapedhtml <- studentsdspage %>% html_nodes("p") %>% html_text()
dsattributeinfo <- scrapedhtml[27]

stringcollection <- strsplit(dsattributeinfo, split="\r")

stringcollection[1]
## [[1]]
##  [1] "# Attributes for both student-mat.csv (Math course) and student-por.csv (Portuguese language course) datasets:"                                                  
##  [2] "1 school - student's school (binary: 'GP' - Gabriel Pereira or 'MS' - Mousinho da Silveira)"                                                                     
##  [3] "2 sex - student's sex (binary: 'F' - female or 'M' - male)"                                                                                                      
##  [4] "3 age - student's age (numeric: from 15 to 22)"                                                                                                                  
##  [5] "4 address - student's home address type (binary: 'U' - urban or 'R' - rural)"                                                                                    
##  [6] "5 famsize - family size (binary: 'LE3' - less or equal to 3 or 'GT3' - greater than 3)"                                                                          
##  [7] "6 Pstatus - parent's cohabitation status (binary: 'T' - living together or 'A' - apart)"                                                                         
##  [8] "7 Medu - mother's education (numeric: 0 - none,  1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)"
##  [9] "8 Fedu - father's education (numeric: 0 - none,  1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)"
## [10] "9 Mjob - mother's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other')"                       
## [11] "10 Fjob - father's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other')"                      
## [12] "11 reason - reason to choose this school (nominal: close to 'home', school 'reputation', 'course' preference or 'other')"                                        
## [13] "12 guardian - student's guardian (nominal: 'mother', 'father' or 'other')"                                                                                       
## [14] "13 traveltime - home to school travel time (numeric: 1 - 1 hour)"                                                                                                
## [15] "14 studytime - weekly study time (numeric: 1 - 10 hours)"                                                                                                        
## [16] "15 failures - number of past class failures (numeric: n if 116 schoolsup - extra educational support (binary: yes or no)"                                        
## [17] "17 famsup - family educational support (binary: yes or no)"                                                                                                      
## [18] "18 paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)"                                                                 
## [19] "19 activities - extra-curricular activities (binary: yes or no)"                                                                                                 
## [20] "20 nursery - attended nursery school (binary: yes or no)"                                                                                                        
## [21] "21 higher - wants to take higher education (binary: yes or no)"                                                                                                  
## [22] "22 internet - Internet access at home (binary: yes or no)"                                                                                                       
## [23] "23 romantic - with a romantic relationship (binary: yes or no)"                                                                                                  
## [24] "24 famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)"                                                                       
## [25] "25 freetime - free time after school (numeric: from 1 - very low to 5 - very high)"                                                                              
## [26] "26 goout - going out with friends (numeric: from 1 - very low to 5 - very high)"                                                                                 
## [27] "27 Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)"                                                                             
## [28] "28 Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)"                                                                             
## [29] "29 health - current health status (numeric: from 1 - very bad to 5 - very good)"                                                                                 
## [30] "30 absences - number of school absences (numeric: from 0 to 93)"                                                                                                 
## [31] "# these grades are related with the course subject, Math or Portuguese:"                                                                                         
## [32] "31 G1 - first period grade (numeric: from 0 to 20)"                                                                                                              
## [33] "31 G2 - second period grade (numeric: from 0 to 20)"                                                                                                             
## [34] "32 G3 - final grade (numeric: from 0 to 20, output target)"

Keep only the columns needed for this analysis

students <- students[,c(6,7,13,23,31,32,33)]   # Pstatus, Medu, traveltime, romantic, G1, G2, G3

Calculate the Pass or Fail variable

Pass <- ifelse(students$G3>9,'PASS','FAIL')
students <- data.frame(students,Pass)

Calculate the Grade variable

# Grade bands on G3: 0-9 FAIL, 10-11 Sufficient, 12-13 Satisfactory, 14-15 Good, 16-20 Excellent
Grade <- ifelse(students$G3<=9,'FAIL','Pass')
Grade <- ifelse(students$G3>=10 & students$G3<=11,'Sufficient',Grade)
Grade <- ifelse(students$G3>=12 & students$G3<=13,'Satisfactory',Grade)
Grade <- ifelse(students$G3>=14 & students$G3<=15,'Good',Grade)
Grade <- ifelse(students$G3>=16 & students$G3<=20,'Excellent',Grade)
students <- data.frame(students,Grade)
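
An equivalent, more compact way to derive the same bands is with cut(); this is only an illustrative sketch (grade_cut is a new, hypothetical name) using the same boundaries as the ifelse() chain above:

grade_cut <- cut(students$G3,
                 breaks = c(-Inf, 9, 11, 13, 15, 20),
                 labels = c('FAIL', 'Sufficient', 'Satisfactory', 'Good', 'Excellent'))
table(grade_cut, students$Grade)   # every student should fall in the same band under both derivations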

————————————————————————

Exploration of the data

————————————————————————

Dimensions

dim(students)
## [1] 349   9
nrow(students)
## [1] 349
ncol(students)
## [1] 9

Structure

str(students)
## 'data.frame':    349 obs. of  9 variables:
##  $ Pstatus   : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
##  $ Medu      : int  4 1 1 4 3 4 2 4 3 3 ...
##  $ traveltime: int  2 1 1 1 1 1 1 2 1 1 ...
##  $ romantic  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ G1        : int  5 5 7 15 6 15 12 6 16 14 ...
##  $ G2        : int  6 5 8 14 10 15 12 5 18 15 ...
##  $ G3        : int  6 6 10 15 10 15 11 6 19 15 ...
##  $ Pass      : Factor w/ 2 levels "FAIL","PASS": 1 1 2 2 2 2 2 1 2 2 ...
##  $ Grade     : Factor w/ 5 levels "Excellent","FAIL",..: 2 2 5 3 5 3 5 2 1 3 ...

Variable (column) names

names(students)
## [1] "Pstatus"    "Medu"       "traveltime" "romantic"   "G1"        
## [6] "G2"         "G3"         "Pass"       "Grade"

Attributes

attributes(students)
## $names
## [1] "Pstatus"    "Medu"       "traveltime" "romantic"   "G1"        
## [6] "G2"         "G3"         "Pass"       "Grade"     
## 
## $row.names
##   [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
##  [18]  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34
##  [35]  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51
##  [52]  52  53  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68
##  [69]  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84  85
##  [86]  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100 101 102
## [103] 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119
## [120] 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136
## [137] 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153
## [154] 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170
## [171] 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187
## [188] 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204
## [205] 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221
## [222] 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238
## [239] 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255
## [256] 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272
## [273] 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289
## [290] 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306
## [307] 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323
## [324] 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340
## [341] 341 342 343 344 345 346 347 348 349
## 
## $class
## [1] "data.frame"

First ten rows

students[1:10,]
##    Pstatus Medu traveltime romantic G1 G2 G3 Pass      Grade
## 1        A    4          2       no  5  6  6 FAIL       FAIL
## 2        T    1          1       no  5  5  6 FAIL       FAIL
## 3        T    1          1       no  7  8 10 PASS Sufficient
## 4        T    4          1      yes 15 14 15 PASS       Good
## 5        T    3          1       no  6 10 10 PASS Sufficient
## 6        T    4          1       no 15 15 15 PASS       Good
## 7        T    2          1       no 12 12 11 PASS Sufficient
## 8        A    4          2       no  6  5  6 FAIL       FAIL
## 9        A    3          1       no 16 18 19 PASS  Excellent
## 10       T    3          1       no 14 15 15 PASS       Good

Variable distributions before factorization

summary(students)
##  Pstatus      Medu         traveltime    romantic        G1       
##  A: 38   Min.   :0.000   Min.   :1.000   no :236   Min.   : 3.00  
##  T:311   1st Qu.:2.000   1st Qu.:1.000   yes:113   1st Qu.: 8.00  
##          Median :3.000   Median :1.000             Median :11.00  
##          Mean   :2.802   Mean   :1.387             Mean   :10.94  
##          3rd Qu.:4.000   3rd Qu.:2.000             3rd Qu.:13.00  
##          Max.   :4.000   Max.   :4.000             Max.   :19.00  
##        G2              G3          Pass              Grade    
##  Min.   : 0.00   Min.   : 0.00   FAIL:113   Excellent   : 37  
##  1st Qu.: 9.00   1st Qu.: 8.00   PASS:236   FAIL        :113  
##  Median :11.00   Median :11.00              Good        : 56  
##  Mean   :10.78   Mean   :10.49              Satisfactory: 54  
##  3rd Qu.:13.00   3rd Qu.:14.00              Sufficient  : 89  
##  Max.   :19.00   Max.   :20.00

Convert the integer-coded categorical predictors (Medu, traveltime) to factors

students$Medu <- factor(students$Medu)
students$traveltime <- factor(students$traveltime)
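
A quick way to confirm the conversions is to look at each column's class (a small check, not in the original analysis):

sapply(students, class)   # Medu and traveltime should now be "factor"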

Variable distributions after factorization

summary(students)
##  Pstatus Medu    traveltime romantic        G1              G2       
##  A: 38   0:  3   1:243      no :236   Min.   : 3.00   Min.   : 0.00  
##  T:311   1: 42   2: 84      yes:113   1st Qu.: 8.00   1st Qu.: 9.00  
##          2: 96   3: 15                Median :11.00   Median :11.00  
##          3: 88   4:  7                Mean   :10.94   Mean   :10.78  
##          4:120                        3rd Qu.:13.00   3rd Qu.:13.00  
##                                       Max.   :19.00   Max.   :19.00  
##        G3          Pass              Grade    
##  Min.   : 0.00   FAIL:113   Excellent   : 37  
##  1st Qu.: 8.00   PASS:236   FAIL        :113  
##  Median :11.00              Good        : 56  
##  Mean   :10.49              Satisfactory: 54  
##  3rd Qu.:14.00              Sufficient  : 89  
##  Max.   :20.00

Pie chart for Pass

pie(with(students, table(Pass)))

Pie chart for Grade

pie(with(students, table(Grade)))

Bar graph for Pass

ggplot(students, aes(x = Pass)) + geom_bar(fill="#FF9999")

Bar graph for Grade

# Reorder the Grade factor levels from worst to best for plotting
reordervect <- rep(0,nrow(students))
reordervect[with(students, Grade == "FAIL")] = 1
reordervect[with(students, Grade == "Sufficient")] = 2
reordervect[with(students, Grade == "Satisfactory")] = 3
reordervect[with(students, Grade == "Good")] = 4
reordervect[with(students, Grade == "Excellent")] = 5

students$Grade = with(students, reorder(Grade, reordervect))
rm(reordervect)

ggplot(students, aes(x = Grade)) + geom_bar(fill="#FF9999") 

Summary statistics of G3

summary(with(students,G3))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    8.00   11.00   10.49   14.00   20.00
sprintf('variance is %f',var(with(students,G3)))
## [1] "variance is 21.394296"
sprintf('standard deviation is %f',sd(with(students,G3)))
## [1] "standard deviation is 4.625397"

Histogram of G3

#hist(with(students,G3), main='', xlab='Score', breaks=20)
ggplot(students, aes(x=G3)) + geom_histogram(fill="#FF9999")

————————————————————————

Predicting G3 using G1

————————————————————————

Correlation between G1 and G3

r <- cor(with(students,G1), with(students,G3))
sprintf('Correlation between G1 and G3 is %f and coefficient of determination is %f',r, r^2)
## [1] "Correlation between G1 and G3 is 0.794055 and coefficient of determination is 0.630523"

Scatterplot of G1 vs. G3

#plot(with(students,G1), with(students,G3), xlab='G1', ylab='G3')
ggplot(students, aes(x=G1,y=G3)) + geom_point() + geom_smooth(method="lm", se=FALSE)

sprintf("G3 shows a positive correllation with G1")
## [1] "G3 shows a positive correllation with G1"

Fit a linear regression using G1 as the predictor of G3

fit <- with(students,lm(G3 ~ G1))
fit
## 
## Call:
## lm(formula = G3 ~ G1)
## 
## Coefficients:
## (Intercept)           G1  
##      -1.616        1.107
attributes(fit)
## $names
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "xlevels"       "call"          "terms"         "model"        
## 
## $class
## [1] "lm"
summary(fit)
## 
## Call:
## lm(formula = G3 ~ G1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.6631  -0.8763   0.4434   1.6566   4.9763 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.61569    0.51980  -3.108  0.00204 ** 
## G1           1.10657    0.04547  24.334  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.816 on 347 degrees of freedom
## Multiple R-squared:  0.6305, Adjusted R-squared:  0.6295 
## F-statistic: 592.2 on 1 and 347 DF,  p-value: < 2.2e-16
plot(fit)

sprintf("Residual graph is random in nature suggesting linear regression is not a bad choice for this data")
## [1] "Residual graph is random in nature suggesting linear regression is not a bad choice for this data"

————————————————————————

Predicting G3 using G2

————————————————————————

Correlation between G2 and G3

r <- cor(with(students,G2), with(students,G3))
sprintf('Correlation between G2 and G3 is %f and the coefficient of determination is %f',r, r^2)
## [1] "Correlation between G2 and G3 is 0.902783 and the coefficient of determination is 0.815016"

Scatterplot of G2 vs. G3

#plot(with(students,G2), with(students,G3), xlab='G2', ylab='G3')
ggplot(students, aes(x=G2,y=G3)) + geom_point() + geom_smooth(method="lm", se=FALSE)

sprintf("G3 shows strong positive correllation with G2")
## [1] "G3 shows strong positive correllation with G2"

Fit a linear regression using G2 as the predictor of G3

fit <- with(students,lm(G3 ~ G2))
fit
## 
## Call:
## lm(formula = G3 ~ G2)
## 
## Coefficients:
## (Intercept)           G2  
##      -1.332        1.096
attributes(fit)
## $names
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "xlevels"       "call"          "terms"         "model"        
## 
## $class
## [1] "lm"
summary(fit)
## 
## Call:
## lm(formula = G3 ~ G2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.6323 -0.3430  0.2713  1.0784  3.5606 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.33213    0.32061  -4.155  4.1e-05 ***
## G2           1.09644    0.02804  39.100  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.992 on 347 degrees of freedom
## Multiple R-squared:  0.815,  Adjusted R-squared:  0.8145 
## F-statistic:  1529 on 1 and 347 DF,  p-value: < 2.2e-16
plot(fit)

sprintf("Residual graph is random in nature suggesting linear regression is not a bad choice for this data")
## [1] "Residual graph is random in nature suggesting linear regression is not a bad choice for this data"

————————————————————————

Predicting Pass/Fail from G1 and G2 using a decision tree

————————————————————————

formula <- Pass ~ G1 + G2
tree <- ctree(formula, data=students)   # conditional inference tree from the party package

print(tree)
## 
##   Conditional inference tree with 4 terminal nodes
## 
## Response:  Pass 
## Inputs:  G1, G2 
## Number of observations:  349 
## 
## 1) G2 <= 9; criterion = 1, statistic = 179.794
##   2) G2 <= 8; criterion = 0.999, statistic = 11.579
##     3)*  weights = 82 
##   2) G2 > 8
##     4)*  weights = 44 
## 1) G2 > 9
##   5) G1 <= 10; criterion = 0.999, statistic = 12.194
##     6)*  weights = 51 
##   5) G1 > 10
##     7)*  weights = 172
plot(tree)

plot(tree, type="simple")

sprintf('Confusion matrix (Predicted vs. Actual)')
## [1] "Confusion matrix (Predicted vs. Actual)"
table(predict(tree, newdata=students), students$Pass,dnn=c('Predicted','Actual'))
##          Actual
## Predicted FAIL PASS
##      FAIL  105   21
##      PASS    8  215
df.confmatrix <- data.frame(table(predict(tree, newdata=students), students$Pass,dnn=c('Predicted','Actual')))

# Reshape to long format and plot predicted vs. actual class counts side by side
data_long <- gather(df.confmatrix, Type, Status, Predicted:Actual)
data_long <- data_long %>% group_by(Status,Type) %>% summarise(Frequency=sum(Freq))
ggplot(data_long, aes(x=Status,y=Frequency,fill=Type)) + geom_bar(stat='identity', position='dodge') 
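
A quick numeric summary of the confusion matrix above is the overall accuracy on the training data (roughly 0.92 given the table shown); a sketch, with confmat as a new name:

confmat <- table(predict(tree, newdata=students), students$Pass)
sum(diag(confmat)) / sum(confmat)   # proportion of students classified correctly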

————————————————————————

Predicting Grades from G1 and G2 using a decision tree

————————————————————————

formula <- Grade ~ G1 + G2
tree <- ctree(formula, data=students)

print(tree)
## 
##   Conditional inference tree with 8 terminal nodes
## 
## Response:  Grade 
## Inputs:  G1, G2 
## Number of observations:  349 
## 
## 1) G2 <= 9; criterion = 1, statistic = 267.241
##   2) G2 <= 8; criterion = 0.999, statistic = 11.579
##     3)*  weights = 82 
##   2) G2 > 8
##     4)*  weights = 44 
## 1) G2 > 9
##   5) G2 <= 15; criterion = 1, statistic = 185.971
##     6) G2 <= 11; criterion = 1, statistic = 151.247
##       7)*  weights = 72 
##     6) G2 > 11
##       8) G2 <= 13; criterion = 1, statistic = 77.643
##         9) G2 <= 12; criterion = 0.998, statistic = 14.374
##           10)*  weights = 36 
##         9) G2 > 12
##           11)*  weights = 32 
##       8) G2 > 13
##         12) G2 <= 14; criterion = 0.993, statistic = 11.398
##           13)*  weights = 21 
##         12) G2 > 14
##           14)*  weights = 32 
##   5) G2 > 15
##     15)*  weights = 30
plot(tree)

plot(tree, type="simple")

sprintf('Confusion matrix (Predicted vs. Actual)')
## [1] "Confusion matrix (Predicted vs. Actual)"
table(predict(tree, newdata=students), students$Grade,dnn=c('Predicted','Actual'))
##               Actual
## Predicted      FAIL Sufficient Satisfactory Good Excellent
##   FAIL          105         21            0    0         0
##   Sufficient      8         56            8    0         0
##   Satisfactory    0         12           43   13         0
##   Good            0          0            3   40        10
##   Excellent       0          0            0    3        27
df.confmatrix <- data.frame(table(predict(tree, newdata=students), students$Grade,dnn=c('Predicted','Actual')))


data_long <- gather(df.confmatrix, Type, Status, Predicted:Actual)
data_long <- data_long %>% group_by(Status,Type) %>% summarise(Frequency=sum(Freq))
ggplot(data_long, aes(x=Status,y=Frequency,fill=Type)) + geom_bar(stat='identity', position='dodge') 
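
The grade-level matrix can be summarized per class as well; a sketch computing each grade's recall (the fraction of students in that grade who were predicted correctly):

confmat <- table(predict(tree, newdata=students), students$Grade)
diag(confmat) / colSums(confmat)   # per-grade recall on the training data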

————————————————————————

Predicting Pass/Fail using Naive Bayes

————————————————————————

classifier<-naiveBayes(students[,1:4], students[,8])   # predictors: Pstatus, Medu, traveltime, romantic; response: Pass



sprintf('Confusion matrix (Predicted vs. Actual)')
## [1] "Confusion matrix (Predicted vs. Actual)"
table(predict(classifier, students[,1:4]), students[,8], dnn = c('Predicted','Actual'))
##          Actual
## Predicted FAIL PASS
##      FAIL    8    9
##      PASS  105  227
df.confmatrix <- data.frame(table(predict(classifier, students[,1:4]), students[,8], dnn = c('Predicted','Actual')))

data_long <- gather(df.confmatrix, Type, Status, Predicted:Actual)
data_long <- data_long %>% group_by(Status,Type) %>% summarise(Frequency=sum(Freq))
ggplot(data_long, aes(x=Status,y=Frequency,fill=Type)) + geom_bar(stat='identity', position='dodge') 
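
e1071's predict() can also return posterior class probabilities with type = "raw", which helps show why the classifier predicts PASS for almost everyone; a brief sketch:

head(predict(classifier, students[,1:4], type = "raw"))   # P(FAIL) and P(PASS) for the first few students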

————————————————————————

Predicting Grades using Naive Bayes

————————————————————————

classifier<-naiveBayes(students[,1:4], students[,9])   # same categorical predictors; response: Grade



sprintf('Confusion matrix (Predicted vs. Actual)')
## [1] "Confusion matrix (Predicted vs. Actual)"
table(predict(classifier, students[,1:4]), students[,9], dnn = c('Predicted','Actual'))
##               Actual
## Predicted      FAIL Sufficient Satisfactory Good Excellent
##   FAIL           87         49           39   33        20
##   Sufficient     14         28            9    4         5
##   Satisfactory    0          0            0    0         0
##   Good           11         11            6   17        11
##   Excellent       1          1            0    2         1
df.confmatrix <- data.frame(table(predict(classifier, students[,1:4]), students[,9], dnn = c('Predicted','Actual')))

data_long <- gather(df.confmatrix, Type, Status, Predicted:Actual)
data_long <- data_long %>% group_by(Status,Type) %>% summarise(Frequency=sum(Freq))
ggplot(data_long, aes(x=Status,y=Frequency,fill=Type)) + geom_bar(stat='identity', position='dodge') 
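
To put the grade-level confusion matrix in context, it can be compared against the no-information baseline of always predicting the most frequent grade; a sketch (confmat is a new name):

max(table(students$Grade)) / nrow(students)   # baseline: always predict the most common grade
confmat <- table(predict(classifier, students[,1:4]), students[,9])
sum(diag(confmat)) / sum(confmat)             # Naive Bayes accuracy on the training data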

————————————————————————

Conclusion

————————————————————————

Linear regression showed a strong relationship between G3 and G2 (R-squared of about 0.82). G1 also showed a positive relationship with G3, but not as strong (R-squared of about 0.63).

The decision trees built on G1 and G2, evaluated on the same data they were fit to, produced relatively few misclassifications, suggesting G1 and G2 are suitable for predicting both pass/fail status and grade.

The Naive Bayes models produced many misclassifications, so either those four categorical variables are weak predictors or Naive Bayes is not a suitable model for this dataset.

Based on all of the analysis, G2 is the strongest predictor of G3 and, by extension, of pass/fail status and grade.