As a student, I sometimes wonder whether getting a bad grade on a quiz or exam has any effect on my passing the class. Or are there factors other than quizzes that contribute to students failing in school?

In this paper, I use data from the following source to answer the question above:

Paulo Cortez, University of Minho, Guimarães, Portugal, http://www3.dsi.uminho.pt/pcortez

Data Set Information: These data concern student achievement in secondary education at two Portuguese schools. The data attributes include student grades and demographic, social and school-related features, and they were collected using school reports and questionnaires. Two datasets are provided regarding the performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In [Cortez and Silva, 2008], the two datasets were modeled under binary/five-level classification and regression tasks. Important note: the target attribute G3 has a strong correlation with attributes G2 and G1. This occurs because G3 is the final year grade (issued at the 3rd period), while G1 and G2 correspond to the 1st and 2nd period grades. It is more difficult to predict G3 without G2 and G1, but such a prediction is much more useful (see paper source for more details).

Attribute Information: Attributes for both student-mat.csv (Math course) and student-por.csv (Portuguese language course) datasets:

1. school - student’s school (binary: ‘GP’ - Gabriel Pereira or ‘MS’ - Mousinho da Silveira)
2. sex - student’s sex (binary: ‘F’ - female or ‘M’ - male)
3. age - student’s age (numeric: from 15 to 22)
4. address - student’s home address type (binary: ‘U’ - urban or ‘R’ - rural)
5. famsize - family size (binary: ‘LE3’ - less or equal to 3 or ‘GT3’ - greater than 3)
6. Pstatus - parent’s cohabitation status (binary: ‘T’ - living together or ‘A’ - apart)
7. Medu - mother’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
8. Fedu - father’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
9. Mjob - mother’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
10. Fjob - father’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
11. reason - reason to choose this school (nominal: close to ‘home’, school ‘reputation’, ‘course’ preference or ‘other’)
12. guardian - student’s guardian (nominal: ‘mother’, ‘father’ or ‘other’)
13. traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
14. studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
15. failures - number of past class failures (numeric: n if 1<=n<3, else 4)
16. schoolsup - extra educational support (binary: yes or no)
17. famsup - family educational support (binary: yes or no)
18. paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
19. activities - extra-curricular activities (binary: yes or no)
20. nursery - attended nursery school (binary: yes or no)
21. higher - wants to take higher education (binary: yes or no)
22. internet - Internet access at home (binary: yes or no)
23. romantic - with a romantic relationship (binary: yes or no)
24. famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
25. freetime - free time after school (numeric: from 1 - very low to 5 - very high)
26. goout - going out with friends (numeric: from 1 - very low to 5 - very high)
27. Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
28. Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
29. health - current health status (numeric: from 1 - very bad to 5 - very good)
30. absences - number of school absences (numeric: from 0 to 93)

These grades are related to the course subject, Math or Portuguese:

31. G1 - first period grade (numeric: from 0 to 20)
32. G2 - second period grade (numeric: from 0 to 20)
33. G3 - final grade (numeric: from 0 to 20, output target)

Method

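# Load the data and assign it to the variable `grade`
# Note: quote = "\\" makes the backslash the quote character, so the double quotes in the file
# stay inside the values (see the output below) and G1, G2 and G3 are read as text, not numbers
grade <- read.csv("student-mat.csv", header = TRUE, sep = ";", quote = "\\", fill = TRUE, comment.char = '')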
head(grade)
##   school   sex age address famsize Pstatus Medu Fedu         Mjob
## 1    "GP ""F""  18   ""U"" ""GT3""   ""A""    4    4  ""at_home""
## 2    "GP ""F""  17   ""U"" ""GT3""   ""T""    1    1  ""at_home""
## 3    "GP ""F""  15   ""U"" ""LE3""   ""T""    1    1  ""at_home""
## 4    "GP ""F""  15   ""U"" ""GT3""   ""T""    4    2   ""health""
## 5    "GP ""F""  16   ""U"" ""GT3""   ""T""    3    3    ""other""
## 6    "GP ""M""  16   ""U"" ""LE3""   ""T""    4    3 ""services""
##           Fjob         reason   guardian traveltime studytime failures
## 1  ""teacher""     ""course"" ""mother""          2         2        0
## 2    ""other""     ""course"" ""father""          1         2        0
## 3    ""other""      ""other"" ""mother""          1         2        3
## 4 ""services""       ""home"" ""mother""          1         3        0
## 5    ""other""       ""home"" ""father""          1         2        0
## 6    ""other"" ""reputation"" ""mother""          1         2        0
##   schoolsup  famsup    paid activities nursery  higher internet romantic
## 1   ""yes""  ""no""  ""no""     ""no"" ""yes"" ""yes""   ""no""   ""no""
## 2    ""no"" ""yes""  ""no""     ""no""  ""no"" ""yes""  ""yes""   ""no""
## 3   ""yes""  ""no"" ""yes""     ""no"" ""yes"" ""yes""  ""yes""   ""no""
## 4    ""no"" ""yes"" ""yes""    ""yes"" ""yes"" ""yes""  ""yes""  ""yes""
## 5    ""no"" ""yes"" ""yes""     ""no"" ""yes"" ""yes""   ""no""   ""no""
## 6    ""no"" ""yes"" ""yes""    ""yes"" ""yes"" ""yes""  ""yes""   ""no""
##   famrel freetime goout Dalc Walc health absences     G1     G2  G3
## 1      4        3     4    1    1      3        6  ""5""  ""6""  6"
## 2      5        3     3    1    1      3        4  ""5""  ""5""  6"
## 3      4        3     2    2    3      3       10  ""7""  ""8"" 10"
## 4      3        2     2    1    1      5        2 ""15"" ""14"" 15"
## 5      4        3     2    1    2      5        4  ""6"" ""10"" 10"
## 6      5        4     2    1    2      5       10 ""15"" ""15"" 15"
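The stray quote marks in the output above come from the `quote` argument of `read.csv`. The following is a minimal sketch of that effect, assuming two made-up rows (held in the hypothetical variable `txt`) that imitate the layout of the file; it is not the author's cleaning step, and it does not reproduce the results reported in this paper.

```r
# Two made-up rows in the same layout as the data: ';' separators, double-quoted fields
txt <- paste(
  'school;sex;age;G1;G2;G3',
  '"GP";"F";18;"5";"6";6',
  '"GP";"M";16;"15";"15";15',
  sep = "\n"
)

# Backslash as the quote character: the double quotes stay inside the values
kept <- read.csv(text = txt, sep = ";", quote = "\\", stringsAsFactors = FALSE)

# Double quote as the quote character (the read.csv default): the quotes are stripped
stripped <- read.csv(text = txt, sep = ";", quote = "\"", stringsAsFactors = FALSE)

str(kept$G1)      # text values that still contain the quote marks
str(stripped$G1)  # numeric values
```

With the quotes stripped, G1 and G2 would be numeric and would also appear in a correlation matrix of the numeric columns.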
# Check whether there are any NA values in the data
# Compute the correlations among the numeric variables
# Visualize the correlations among the variables

any(is.na(grade))
## [1] FALSE
num.col<-sapply(grade, is.numeric)
cor.data<-cor(grade[,num.col])
cor.data
##                     age         Medu         Fedu   traveltime
## age         1.000000000 -0.163658419 -0.163438069  0.070640721
## Medu       -0.163658419  1.000000000  0.623455112 -0.171639305
## Fedu       -0.163438069  0.623455112  1.000000000 -0.158194054
## traveltime  0.070640721 -0.171639305 -0.158194054  1.000000000
## studytime  -0.004140037  0.064944137 -0.009174639 -0.100909119
## failures    0.243665377 -0.236679963 -0.250408444  0.092238746
## famrel      0.053940096 -0.003914458 -0.001369727 -0.016807986
## freetime    0.016434389  0.030890867 -0.012845528 -0.017024944
## goout       0.126963880  0.064094438  0.043104668  0.028539674
## Dalc        0.131124605  0.019834099  0.002386429  0.138325309
## Walc        0.117276052 -0.047123460 -0.012631018  0.134115752
## health     -0.062187369 -0.046877829  0.014741537  0.007500606
## absences    0.175230079  0.100284818  0.024472887 -0.012943775
##               studytime    failures       famrel    freetime        goout
## age        -0.004140037  0.24366538  0.053940096  0.01643439  0.126963880
## Medu        0.064944137 -0.23667996 -0.003914458  0.03089087  0.064094438
## Fedu       -0.009174639 -0.25040844 -0.001369727 -0.01284553  0.043104668
## traveltime -0.100909119  0.09223875 -0.016807986 -0.01702494  0.028539674
## studytime   1.000000000 -0.17356303  0.039730704 -0.14319841 -0.063903675
## failures   -0.173563031  1.00000000 -0.044336626  0.09198747  0.124560922
## famrel      0.039730704 -0.04433663  1.000000000  0.15070144  0.064568411
## freetime   -0.143198407  0.09198747  0.150701444  1.00000000  0.285018715
## goout      -0.063903675  0.12456092  0.064568411  0.28501871  1.000000000
## Dalc       -0.196019263  0.13604693 -0.077594357  0.20900085  0.266993848
## Walc       -0.253784731  0.14196203 -0.113397308  0.14782181  0.420385745
## health     -0.075615863  0.06582728  0.094055728  0.07573336 -0.009577254
## absences   -0.062700175  0.06372583 -0.044354095 -0.05807792  0.044302220
##                    Dalc        Walc       health    absences
## age         0.131124605  0.11727605 -0.062187369  0.17523008
## Medu        0.019834099 -0.04712346 -0.046877829  0.10028482
## Fedu        0.002386429 -0.01263102  0.014741537  0.02447289
## traveltime  0.138325309  0.13411575  0.007500606 -0.01294378
## studytime  -0.196019263 -0.25378473 -0.075615863 -0.06270018
## failures    0.136046931  0.14196203  0.065827282  0.06372583
## famrel     -0.077594357 -0.11339731  0.094055728 -0.04435409
## freetime    0.209000848  0.14782181  0.075733357 -0.05807792
## goout       0.266993848  0.42038575 -0.009577254  0.04430222
## Dalc        1.000000000  0.64754423  0.077179582  0.11190803
## Walc        0.647544230  1.00000000  0.092476317  0.13629110
## health      0.077179582  0.09247632  1.000000000 -0.02993671
## absences    0.111908026  0.13629110 -0.029936711  1.00000000
library(corrgram)
corrgram(grade)

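From the above, there are no NA values in the data. There are negative correlations between age and mother's education (-0.16) and between study time and failures (-0.17), among others. The strongest positive correlations are between workday and weekend alcohol consumption (0.65) and between mother's and father's education (0.62).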
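Reading a large correlation matrix by eye is error-prone, so it can help to list only the stronger pairs. The sketch below assumes a hypothetical helper, `strong_pairs`, and a small matrix filled with four of the correlations printed above (rounded to three decimals); the 0.2 threshold is an arbitrary choice for illustration.

```r
# A 4 x 4 excerpt of the correlation matrix above, rounded to three decimals
vars <- c("studytime", "failures", "Dalc", "Walc")
cor.mat <- matrix(c(
   1.000, -0.174, -0.196, -0.254,
  -0.174,  1.000,  0.136,  0.142,
  -0.196,  0.136,  1.000,  0.648,
  -0.254,  0.142,  0.648,  1.000
), nrow = 4, byrow = TRUE, dimnames = list(vars, vars))

# Hypothetical helper: list each pair once (upper triangle) when |r| >= threshold
strong_pairs <- function(cor.mat, threshold = 0.2) {
  idx <- which(upper.tri(cor.mat) & abs(cor.mat) >= threshold, arr.ind = TRUE)
  out <- data.frame(
    var1 = rownames(cor.mat)[idx[, "row"]],
    var2 = colnames(cor.mat)[idx[, "col"]],
    r = cor.mat[idx],
    row.names = NULL,
    stringsAsFactors = FALSE
  )
  # Strongest correlations first, regardless of sign
  out[order(-abs(out$r)), ]
}

pairs <- strong_pairs(cor.mat, threshold = 0.2)
print(pairs)
```

The same helper could be applied to `cor.data` directly, since it only relies on the row and column names of the matrix.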

The next step is to split the data into training and test sets, fit a model on the training set, and use it to make predictions.

# Use the caTools package to split the data into training and test sets
# Rows are selected at random for the training set that is used to fit the model
library(caTools)

set.seed(101)

#Split the sample data into train and test

sample<-sample.split(grade$failures,SplitRatio = 0.7)
train<-subset(grade, sample==TRUE)
test<-subset(grade, sample==FALSE)

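In the code above, 70% of the rows are assigned to train and the remaining 30% to test. Because the split is made on grade$failures, sample.split keeps the proportions of the different failure counts similar in both sets.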
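To make the idea behind the split concrete, here is a base-R sketch of a 70/30 split that samples within each value of the outcome. It is an illustration under assumptions, not the caTools implementation: `stratified_split` and the `toy` data frame are made up for this example.

```r
set.seed(101)

# Made-up data: 14 students with 0 past failures and 6 students with 1
toy <- data.frame(id = 1:20, failures = rep(c(0, 1), times = c(14, 6)))

# Hypothetical helper: returns TRUE for rows that go to the training set,
# sampling the requested share separately within each value of y
stratified_split <- function(y, ratio = 0.7) {
  in_train <- logical(length(y))
  for (level in unique(y)) {
    rows <- which(y == level)
    n_train <- round(length(rows) * ratio)
    chosen <- rows[sample.int(length(rows), n_train)]
    in_train[chosen] <- TRUE
  }
  in_train
}

mask <- stratified_split(toy$failures, ratio = 0.7)
toy_train <- subset(toy, mask)
toy_test <- subset(toy, !mask)

table(toy_train$failures)  # counts per failure value in the training part
table(toy_test$failures)   # counts per failure value in the test part
```

Sampling within each group keeps rare outcome values from ending up entirely in one of the two sets.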

# Fit a linear regression model of failures on all the other variables, using the training data

model<-lm(failures~.,data = train)
print(summary(model))
## 
## Call:
## lm(formula = failures ~ ., data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.8479 -0.2832 -0.0285  0.1730  2.0704 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           2.0303116  0.8526206   2.381 0.018237 *  
## school"MS            -0.0244233  0.1466465  -0.167 0.867904    
## sex""M""             -0.0541844  0.0949276  -0.571 0.568808    
## age                  -0.0548572  0.0434604  -1.262 0.208404    
## address""U""         -0.2112857  0.1054914  -2.003 0.046605 *  
## famsize""LE3""        0.0720256  0.0939860   0.766 0.444418    
## Pstatus""T""         -0.0053380  0.1340535  -0.040 0.968278    
## Medu                 -0.0162330  0.0609554  -0.266 0.790289    
## Fedu                 -0.1056766  0.0531168  -1.990 0.048072 *  
## Mjob""health""        0.2863075  0.2135445   1.341 0.181598    
## Mjob""other""         0.1404051  0.1322927   1.061 0.289884    
## Mjob""services""      0.3431986  0.1556583   2.205 0.028661 *  
## Mjob""teacher""       0.1108670  0.1919391   0.578 0.564204    
## Fjob""health""        0.1728342  0.2534577   0.682 0.496125    
## Fjob""other""        -0.1388350  0.1798093  -0.772 0.440996    
## Fjob""services""     -0.0264459  0.1813379  -0.146 0.884203    
## Fjob""teacher""       0.1019187  0.2255834   0.452 0.651925    
## reason""home""       -0.0296447  0.1077146  -0.275 0.783448    
## reason""other""      -0.4116089  0.1628733  -2.527 0.012309 *  
## reason""reputation"" -0.0763820  0.1139298  -0.670 0.503394    
## guardian""mother""   -0.0115298  0.1007873  -0.114 0.909043    
## guardian""other""     0.9376822  0.1836033   5.107 7.89e-07 ***
## traveltime            0.0187496  0.0612864   0.306 0.759988    
## studytime            -0.1136639  0.0574435  -1.979 0.049286 *  
## schoolsup""yes""      0.0001293  0.1341382   0.001 0.999232    
## famsup""yes""         0.0402378  0.0900640   0.447 0.655548    
## paid""yes""          -0.1745930  0.0922289  -1.893 0.059865 .  
## activities""yes""     0.0393789  0.0859178   0.458 0.647235    
## nursery""yes""        0.0467528  0.0977367   0.478 0.632945    
## higher""yes""        -0.3218682  0.2015848  -1.597 0.111988    
## internet""yes""       0.0866330  0.1160460   0.747 0.456259    
## romantic""yes""       0.1039588  0.0920372   1.130 0.260091    
## famrel               -0.0594080  0.0457809  -1.298 0.195969    
## freetime             -0.0325151  0.0433191  -0.751 0.453820    
## goout                 0.0749607  0.0429148   1.747 0.082291 .  
## Dalc                 -0.0190772  0.0592075  -0.322 0.747647    
## Walc                  0.0248503  0.0487925   0.509 0.611126    
## health                0.0334150  0.0319141   1.047 0.296408    
## absences             -0.0030493  0.0054971  -0.555 0.579738    
## G1""11""              0.1986151  0.1800038   1.103 0.271244    
## G1""12""              0.2205169  0.1958535   1.126 0.261609    
## G1""13""              0.1674221  0.2126322   0.787 0.432036    
## G1""14""              0.2034868  0.2261711   0.900 0.369411    
## G1""15""              0.4229882  0.2664555   1.587 0.114062    
## G1""16""              0.1858228  0.2951501   0.630 0.529717    
## G1""17""              0.2506314  0.3963755   0.632 0.527942    
## G1""18""              0.7636873  0.4974032   1.535 0.126353    
## G1""19""              0.3755865  0.6193214   0.606 0.544937    
## G1""4""               0.6253973  0.6889023   0.908 0.365119    
## G1""5""               1.2449871  0.3307396   3.764 0.000222 ***
## G1""6""               0.6688907  0.2760386   2.423 0.016318 *  
## G1""7""              -0.0201662  0.2219260  -0.091 0.927692    
## G1""8""               0.1787301  0.1900000   0.941 0.348055    
## G1""9""               0.3564424  0.1770123   2.014 0.045449 *  
## G2""10""              0.0326292  0.3499702   0.093 0.925815    
## G2""11""              0.1438811  0.3881185   0.371 0.711261    
## G2""12""             -0.0325447  0.3935729  -0.083 0.934184    
## G2""13""              0.0484363  0.4468689   0.108 0.913800    
## G2""14""             -0.1514523  0.4922920  -0.308 0.758686    
## G2""15""             -0.3377089  0.5477131  -0.617 0.538246    
## G2""16""             -0.7229262  0.6143692  -1.177 0.240781    
## G2""17""             -0.5807257  0.7983680  -0.727 0.467877    
## G2""18""             -1.0684933  0.8912852  -1.199 0.232082    
## G2""19""             -1.8260797  1.1402770  -1.601 0.110933    
## G2""5""               0.4179956  0.3757578   1.112 0.267361    
## G2""6""               0.3296118  0.3612912   0.912 0.362752    
## G2""7""               0.9199268  0.3160201   2.911 0.004031 ** 
## G2""8""               0.2066787  0.3174420   0.651 0.515780    
## G2""9""               0.2234922  0.3293463   0.679 0.498217    
## G310"                -0.3323976  0.2778899  -1.196 0.233121    
## G311"                -0.4686460  0.3036713  -1.543 0.124421    
## G312"                -0.4405074  0.3380741  -1.303 0.194147    
## G313"                -0.3202569  0.3837553  -0.835 0.405023    
## G314"                -0.2881165  0.4199378  -0.686 0.493486    
## G315"                -0.0933081  0.4857176  -0.192 0.847865    
## G316"                -0.1563175  0.5392728  -0.290 0.772232    
## G317"                 0.2192431  0.7571216   0.290 0.772456    
## G318"                 0.0912391  0.7450178   0.122 0.902659    
## G319"                 0.7303166  0.8878686   0.823 0.411789    
## G34"                  0.9622841  0.7026146   1.370 0.172427    
## G35"                 -0.1924942  0.3576683  -0.538 0.591071    
## G36"                 -1.3170034  0.3071352  -4.288 2.86e-05 ***
## G37"                 -0.1195442  0.3052900  -0.392 0.695808    
## G38"                 -0.5785927  0.2438521  -2.373 0.018651 *  
## G39"                 -0.4929259  0.2631764  -1.873 0.062598 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5708 on 191 degrees of freedom
## Multiple R-squared:  0.5888, Adjusted R-squared:  0.408 
## F-statistic: 3.256 on 84 and 191 DF,  p-value: 9.148e-12

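From the above, having a guardian other than a parent (estimate 0.94, p = 7.89e-07) and some specific grade values on tests 1-3 (for example G1 = 5, G2 = 7 and G3 = 6) are highly significant in explaining a student's failures. Note that G1, G2 and G3 enter the model as categorical levels, with one coefficient per grade value. The overall F-test has an extremely low p-value (9.148e-12), and the adjusted R^2 of 0.408 means the model explains roughly 41% of the variation in failures, which is a moderate fit. I am also interested in seeing the residuals visually.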
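With this many coefficients, it can be easier to filter the summary table than to scan it. The sketch below shows one way to do that; it assumes the built-in `mtcars` data and a small stand-in model rather than the student data, and the 0.05 cut-off is only the conventional choice.

```r
# Stand-in model on a built-in dataset (not the student data)
fit <- lm(mpg ~ wt + hp, data = mtcars)

# coef(summary(fit)) is a matrix with one row per term and the columns
# "Estimate", "Std. Error", "t value" and "Pr(>|t|)"
coefs <- coef(summary(fit))

# Keep the terms with p < 0.05 and sort them by p-value
sig <- coefs[coefs[, "Pr(>|t|)"] < 0.05, , drop = FALSE]
sig <- sig[order(sig[, "Pr(>|t|)"]), , drop = FALSE]
print(sig)

# Adjusted R-squared, the same quantity reported at the bottom of summary()
adj_r2 <- summary(fit)$adj.r.squared
print(adj_r2)
```

Replacing `fit` with `model` would apply the same filter to the regression above.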

res<-residuals(model)
res<-as.data.frame(res)
head(res)
##            res
## 1 -0.279515882
## 2 -0.486892849
## 5 -0.293488719
## 6 -0.078265050
## 7 -0.030824907
## 8  0.004141234
library(ggplot2)
ggplot(res, aes(res)) + geom_histogram(fill = 'blue', alpha = 0.5)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
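A residual is the observed value minus the fitted value, so a histogram of residuals normally has mass on both sides of zero. The sketch below illustrates this with a stand-in model on the built-in `mtcars` data (an assumption made for the example, not the student data).

```r
# Stand-in model with an intercept
fit <- lm(mpg ~ wt, data = mtcars)
res <- residuals(fit)

# Residuals equal observed minus fitted values
same <- all.equal(unname(res), mtcars$mpg - unname(fitted(fit)))
print(same)

# With an intercept in the model, least-squares residuals sum to (numerically) zero,
# so some of them have to be negative and some positive
print(sum(res))
print(range(res))
```

Negative residuals therefore say nothing about negative values in the data; they only mark observations that the model over-predicts.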

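# Negative residuals are expected, because a residual is the observed value minus the fitted value.
# Predicted failure counts, however, should not be negative, since there are no negative values in the data.
# The function below replaces a negative prediction with 0.
zero <- function(x) {
  if (x < 0) {
    return(0)
  } else {
    return(x)
  }
}

# Assumption: the original does not show how `results` was built; here it holds the test-set predictions
results <- data.frame(prediction = predict(model, test), actual = test$failures)
results$prediction <- sapply(results$prediction, zero)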
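The element-by-element `zero` function above can also be written as a single vectorized call. The sketch below uses base R's `pmax` on a made-up vector of predictions (the values are illustrative only).

```r
# Made-up predictions, two of them negative
predictions <- c(-0.28, 0.41, 1.73, -0.03)

# pmax compares each element with 0 and keeps the larger value
clamped <- pmax(predictions, 0)
print(clamped)
```

This gives the same result as `sapply(predictions, zero)` without an explicit loop over the elements.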

In conclusion, in this model the number of failures is associated with test scores and with having a guardian who is not a parent.

PS: You can use the model fitted on the train data to make further predictions on the test data.