Synopsis

Problem Statement
This project aims to find out the influence of various variables and parameters on the grade of a secondary school student.
The dataset is extensive with atleast 33 different variables which are then filtered for the most obvious predictors and analysis is performed to understand the influence of these predictors on the final grades of the students.

Methodology Employed
1. Simple Exploratory Analysis
2. Interesting Visualizations
3. Decision tree algorithm

Interesting Insights
Some interesting insights were discovered on the following parameters:
1. Gender Based
2. Alcohol consumption
3. Goals of the student
4. Health and attendance in class
5. Access to internet

Packages Used

library(prettydoc) #template
library(readr) #for reading in the data
library(dplyr) #exploratory data analysis
library(ggplot2)#visualizations
library(tidyverse)#tidying up the data and exploratory data analysis
library(gsheet)#Reading from google sheets
library(readxl) #Reading in Excel Files
library(knitr) #Kable function for printing tables
library(DT) #Printing dataframes concisely
library(caret) #Dummy Variables for Correlation Matrix
library(gridExtra) #Multiple plots 
library(rpart) #Decision Tree

Data Import

We import the data saved as a zip file from the UCI archives using download.file command which is then unzipped using unzip into the csv file for math scores.
The file is read in to R using read.table and merged to produce the data which will be used for the project.

temp <- tempfile()
download.file("http://archive.ics.uci.edu/ml/machine-learning-databases/00356/student.zip",temp, mode="wb")
unzip(temp, "student-mat.csv")
data <- read.table("student-mat.csv",sep= ";", header= T)
unlink(temp)
(data <- as_tibble(data))

# A tibble: 395 x 33
   school    sex   age address famsize Pstatus  Medu  Fedu     Mjob
   <fctr> <fctr> <int>  <fctr>  <fctr>  <fctr> <int> <int>   <fctr>
1      GP      F    18       U     GT3       A     4     4  at_home
2      GP      F    17       U     GT3       T     1     1  at_home
3      GP      F    15       U     LE3       T     1     1  at_home
4      GP      F    15       U     GT3       T     4     2   health
5      GP      F    16       U     GT3       T     3     3    other
6      GP      M    16       U     LE3       T     4     3 services
7      GP      M    16       U     LE3       T     2     2    other
8      GP      F    17       U     GT3       A     4     4    other
9      GP      M    15       U     LE3       A     3     2 services
10     GP      M    15       U     GT3       T     3     4    other
# ... with 385 more rows, and 24 more variables: Fjob <fctr>,
#   reason <fctr>, guardian <fctr>, traveltime <int>, studytime <int>,
#   failures <int>, schoolsup <fctr>, famsup <fctr>, paid <fctr>,
#   activities <fctr>, nursery <fctr>, higher <fctr>, internet <fctr>,
#   romantic <fctr>, famrel <int>, freetime <int>, goout <int>,
#   Dalc <int>, Walc <int>, health <int>, absences <int>, G1 <int>,
#   G2 <int>, G3 <int>

Source Code

Description of the data

The dataset is composed of data from two Portuguese Schools namely, Gabriel Pereira HS and Mousinho da Silveira HS and measures the performance in Math.
The dataset is composed of detailed observation for each student across multiple variables like family status and size, performance, alcohol consumption, student’s work situation, parents job situation etc.
The dataset was obtained as part of the study for using Data Mining to Predict Secondary School Student Performance by P.Cortez and A.Silva from two public schools in the school year 2005-2006 from the Alentejo Region of Portugal. Citation:
Using Data Mining To Predict Secondary School Student Alcohol Consumption.
Fabio Pagnotta, Hossain Mohammad Amran
Department of Computer Science,University of Camerino
Two different sources were used:
1. Mark Reports
2. Questionnaires
The first source had scarce information of grades and attendance details and is complemented with the second source which provides the other variables to the data.

Sampling Information

The questionaires were answered by 788 students in total and 111 answers were discarded due to lack of identification to merge with the mark reports.
The data was then integrated into two different datasets : Mathematics(395 observations) and Portuguese(649 observations).
For the purpose of this project, we will be using the math dataset only.

Structure and Details of the data

The codebook and data description is provided in the UCI Machine Learning Database
The variables and their classifications are as given below:

variablelist <- gsheet2tbl('https://docs.google.com/spreadsheets/d/1mDsF0aMNgODx7063l2mxV1_zP32fAe_P55SBmtG72G8')
kable(variablelist)

Attribute.Name	Description	Format.Or.Data.Type	Sample.Values
School	Student’s School	Factor	‘GP’ - Gabriel Pereira or ‘MS’ - Mousinho da Silveira
Sex	Student’s Sex	Factor	‘F’ - female or ‘M’ - male
Age	Student’s Age	Integer	numeric: from 15 to 22
Address	Student’s Home Address Type	String	‘U’ - urban or ‘R’ - rural
FamSize	Family Size	Factor	‘LE3’ - less or equal to 3 or ‘GT3’ - greater than 3
Pstatus	Parent’s Cohabitation Status	String	‘T’ - living together or ‘A’ - apart
Medu	Mother’s Education	Factor	0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education
Fedu	Father’s Education	Factor	0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education
Mjob	Mother’s Job	Factor	‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’
Fjob	Father’s Job	Factor	‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’
Reason	Reason To Choose This School	Factor	close to ‘home’, school ‘reputation’, ‘course’ preference or ‘other’
Guardian	Student’s Guardian	String	‘mother’, ‘father’ or ‘other’
TravelTime	Home To School Travel Time	Integer	1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour
StudyTime	Weekly Study Time	Integer	1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours
Failures	Number Of Past Class Failures	Integer	n if 1<=n<3, else 4
Schoolsup	Extra Educational Support	Boolean	yes or no
Famsup	Family Educational Support	Boolean	yes or no
Paid	Extra Paid Classes Within The Course Subject (Math Or Portuguese)	Boolean	yes or no
Activities	Extra-Curricular Activities	Boolean	yes or no
Nursery	Attended Nursery School	Boolean	yes or no
Higher	Wants To Take Higher Education	Boolean	yes or no
Internet	Internet Access At Home	Boolean	yes or no
Romantic	With A Romantic Relationship	Boolean	yes or no
Famrel	Quality Of Family Relationships	Factor	from 1 - very bad to 5 - excellent
Freetime	Free Time After School	Factor	from 1 - very low to 5 - very high
Goout	Going Out With Friends	Factor	from 1 - very low to 5 - very high
Dalc	Workday Alcohol Consumption	Factor	from 1 - very low to 5 - very high
Walc	Weekend Alcohol Consumption	Factor	from 1 - very low to 5 - very high
Health	Current Health Status	Factor	from 1 - very bad to 5 - very good
Absences	Number Of School Absences	Integer	from 0 to 93
G1	First Period Grade	Integer	from 0 to 20
G2	Second Period Grade	Integer	from 0 to 20
G3	Final Grade	Integer	from 0 to 20, output target

Missing data: There is no missing data in the dataset.

sum(is.na(data))

[1] 0

Data Cleaning

The data is already cleaned and in a tidy format and does not require much cleaning for the data analysis. The majority of the cleaning is done for selecting the necessary variables.

data <- data %>%
  as_tibble()%>%
  select(sex, age, address,Pstatus, Medu, Fedu, Mjob, Fjob,studytime,traveltime,failures,higher,internet, 
         goout, Dalc,Walc,health, absences,G1,G2,G3)
DT::datatable(data)

Effectively, the number of variables that will be used for the analysis has reduced to 21 from 33.

Data Description

Summary Statistics:

We have chosen 18 different variables as predictors or independant variables which are presented as:
7 categorical variables and 11 numerical variables,

We have also chosen 3 variables as predicted or dependant variables which are presented as:
3 numerical variables- numerical grades of students for exploratory data analysis.

glimpse(data)

Observations: 395
Variables: 21
$ sex        <fctr> F, F, F, F, F, M, M, F, M, M, F, F, M, M, M, F, F,...
$ age        <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15,...
$ address    <fctr> U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U,...
$ Pstatus    <fctr> A, T, T, T, T, T, T, A, A, T, T, T, T, T, A, T, T,...
$ Medu       <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, ...
$ Fedu       <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, ...
$ Mjob       <fctr> at_home, at_home, at_home, health, other, services...
$ Fjob       <fctr> teacher, other, other, services, other, other, oth...
$ studytime  <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, ...
$ traveltime <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, ...
$ failures   <int> 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ higher     <fctr> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, ...
$ internet   <fctr> no, yes, yes, yes, no, yes, yes, no, yes, yes, yes...
$ goout      <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, ...
$ Dalc       <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
$ Walc       <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, ...
$ health     <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, ...
$ absences   <int> 6, 4, 10, 2, 4, 10, 0, 6, 0, 0, 0, 4, 2, 2, 0, 4, 6...
$ G1         <int> 5, 5, 7, 15, 6, 15, 12, 6, 16, 14, 10, 10, 14, 10, ...
$ G2         <int> 6, 5, 8, 14, 10, 15, 12, 5, 18, 15, 8, 12, 14, 10, ...
$ G3         <int> 6, 6, 10, 15, 10, 15, 11, 6, 19, 15, 9, 12, 14, 11,...

Summary:

summary(data)

 sex          age       address Pstatus      Medu            Fedu      
 F:208   Min.   :15.0   R: 88   A: 41   Min.   :0.000   Min.   :0.000  
 M:187   1st Qu.:16.0   U:307   T:354   1st Qu.:2.000   1st Qu.:2.000  
         Median :17.0                   Median :3.000   Median :2.000  
         Mean   :16.7                   Mean   :2.749   Mean   :2.522  
         3rd Qu.:18.0                   3rd Qu.:4.000   3rd Qu.:3.000  
         Max.   :22.0                   Max.   :4.000   Max.   :4.000  
       Mjob           Fjob       studytime       traveltime   
 at_home : 59   at_home : 20   Min.   :1.000   Min.   :1.000  
 health  : 34   health  : 18   1st Qu.:1.000   1st Qu.:1.000  
 other   :141   other   :217   Median :2.000   Median :1.000  
 services:103   services:111   Mean   :2.035   Mean   :1.448  
 teacher : 58   teacher : 29   3rd Qu.:2.000   3rd Qu.:2.000  
                               Max.   :4.000   Max.   :4.000  
    failures      higher    internet      goout            Dalc      
 Min.   :0.0000   no : 20   no : 66   Min.   :1.000   Min.   :1.000  
 1st Qu.:0.0000   yes:375   yes:329   1st Qu.:2.000   1st Qu.:1.000  
 Median :0.0000                       Median :3.000   Median :1.000  
 Mean   :0.3342                       Mean   :3.109   Mean   :1.481  
 3rd Qu.:0.0000                       3rd Qu.:4.000   3rd Qu.:2.000  
 Max.   :3.0000                       Max.   :5.000   Max.   :5.000  
      Walc           health         absences            G1       
 Min.   :1.000   Min.   :1.000   Min.   : 0.000   Min.   : 3.00  
 1st Qu.:1.000   1st Qu.:3.000   1st Qu.: 0.000   1st Qu.: 8.00  
 Median :2.000   Median :4.000   Median : 4.000   Median :11.00  
 Mean   :2.291   Mean   :3.554   Mean   : 5.709   Mean   :10.91  
 3rd Qu.:3.000   3rd Qu.:5.000   3rd Qu.: 8.000   3rd Qu.:13.00  
 Max.   :5.000   Max.   :5.000   Max.   :75.000   Max.   :19.00  
       G2              G3       
 Min.   : 0.00   Min.   : 0.00  
 1st Qu.: 9.00   1st Qu.: 8.00  
 Median :11.00   Median :11.00  
 Mean   :10.71   Mean   :10.42  
 3rd Qu.:13.00   3rd Qu.:14.00  
 Max.   :19.00   Max.   :20.00

Simple Exploratory Data Analysis

Number of Male vs Female Students who failed in the final exam

(data2<-data%>%
 mutate(pass=ifelse(G3>=10,1,0), fail= ifelse(G3<10,1,0))%>%
  filter(sex=="F"|sex=="M")%>%
  group_by(sex)%>%
  summarise(Pass=sum(pass), 
            Fail=sum(fail)))

# A tibble: 2 x 3
     sex  Pass  Fail
  <fctr> <dbl> <dbl>
1      F   133    75
2      M   132    55

data2%>%
  ggplot(aes(x=sex,y=Fail))+
  geom_bar(stat="identity")

We find that the number of female students who have failed is higher than the number of male students who have failed.

Average Final Score across different weekend and weekday alcohol consumption levels

data%>%
   group_by(Walc)%>%
  aggregate(G3~Walc, data=., mean)%>%
  arrange(desc(G3))

  Walc        G3
1    1 10.735099
2    3 10.725000
3    5 10.142857
4    2 10.082353
5    4  9.686275

data%>%
   group_by(Dalc)%>%
  aggregate(G3~Dalc, data=., mean)%>%
  arrange(desc(G3))

  Dalc        G3
1    1 10.731884
2    5 10.666667
3    3 10.500000
4    4  9.888889
5    2  9.253333

We find that the average grade decreases as the alcohol consumption in weekend increases whereas a very clear trend cannot be established for Weekday.

Relationship between going out and average grade of the students

data$goout <- as.factor(data$goout)
data%>%
   group_by(goout)%>%
   summarise(AverageScore= mean(G3,na.rm=TRUE))%>%
  arrange(desc(AverageScore))

# A tibble: 5 x 2
   goout AverageScore
  <fctr>        <dbl>
1      2    11.194175
2      3    10.961538
3      1     9.869565
4      4     9.651163
5      5     9.037736

We see that Going Out does have an impact on the average grades in the final exam as the number of days of going out increases beyond 3, the average grades of students are also decreasing.

Data Analysis Through Visualizations

This project will concentrate on several aspects which will affect a students performance and correlation between these factors.

Correlation between alcohol consumption and performance of the students with differences in gender

dmy <- dummyVars("~.", data=data)
newdata <- data.frame(predict(dmy, newdata=data))
correl1 <-cor(newdata[,c("G3","sex.F","sex.M","Walc","Dalc")])
source("https://raw.githubusercontent.com/briatte/ggcorr/master/ggcorr.R")
correl1 %>%
  ggcorr(label = TRUE)+ ggtitle("Correlation between Alcohol Consumption,Gender and Performance")

We find that Gender has a lower correlation with respect to performance as compared to Alcohol Consumption. We find that both Weekend and Workday Alchol Consumption is highly negatively correlated with the grades.

data$Dalc <- as.factor(data$Dalc)
data$Walc <- as.factor(data$Walc)
g1a<-data %>%
  ggplot(aes(x=Dalc, y=G3, fill= Dalc))+
  geom_boxplot()+
  coord_flip()+
      xlab("Work Day Alcohol consumption")+
      ylab("Grade")+
  facet_grid(~sex)
g1b<-data %>%
  ggplot(aes(x=Walc, y=G3, fill= Walc))+
  geom_boxplot()+
  coord_flip()+
      xlab("Week End Alcohol consumption")+
      ylab("Grade")+
  facet_grid(~sex)
grid.arrange(g1a,g1b,ncol=2)

Relationship between the consumption of alcohol by students and the set up in which they live :urban vs rural and the relationship to performance in school.

 g2a<-data %>% 
  group_by(address)%>%
  ggplot(aes(x=factor(Dalc), y= G3))+
           geom_jitter(alpha=0.6)+
    scale_x_discrete("WorkDay Alcohol")+
    scale_y_continuous("Grade")+
    facet_grid(~address)
 g2b<-data %>% 
   group_by(address)%>%
   ggplot(aes(x=factor(Walc), y= G3))+
           geom_jitter(alpha=0.6)+
    scale_x_discrete("Weekday Alcohol")+
    scale_y_continuous("Grade")+
    facet_grid(~address)
grid.arrange(g2a,g2b,ncol=2)

We see that students from Urban Areas have more alcohol consumption than students from Rural Areas. We also find that the final grades start decreasing as the level of alcohol consumption increases.

The current performance Vs future goals of the student based on gender of the student

data%>%
  ggplot(aes(x=higher, y=G3))+
  geom_boxplot()+
  facet_grid(~sex)

We find that on an average, students who are planning on a higher education perform better than those who do not. We also find that Male students outperform female students when they plan to go for higher education and viceversa if not.

The relationship between health and attendance in class and the gender of the student.

data%>%
  group_by(sex)%>%
  ggplot(aes(x=factor(health), y=absences, color=sex))+
  geom_smooth(aes(group=sex), method="lm", se=FALSE)

We find that female students have lower attendance on average and as the health scale increases, the absence decreases as expected for both male and female students.

Relationship between access to internet and the performance of the students.

data%>%
  group_by(internet)%>%
  ggplot(aes(x=G3, fill=internet))+
  geom_density( alpha=0.5)

We find that the presence of internet is actually detrimental to the average grades of the student.

Decision Tree Analysis

We use R to create Decision Trees to predict the final performance grade using all the variables in the dataset.

Methodology Used

Classification methodology was used for this particular data and the the response variable grade is modeled as a binary variable.

Advantages of using Decision Trees for this Analysis

Simplicity of Interpretation: The entire data is represented in a visual format which enables us to understand the data quickly and make a prediction.
Ability to use both numeric and categorical variables as predictors: Our data is a mix of categorical and numeric variables and other commonly used methods such as linear or logistic regressions can use either of these categories but not both in a single algorithm.

Algorithm behind

The package used to construct the tree here is Rpart which uses recursive partitioning of the data with the default criterion being used for each split being Gini Coefficient.

Variables used

We create the tree using all the variables and rank the variables in terms of importance to figure out the variables used by the decision tree algorithm to predict the income.

library(caret)
newdata<- data%>%select(sex, age, address,Pstatus, Medu, Fedu, Mjob, Fjob,studytime,traveltime,failures,higher,internet, 
         goout, Dalc,Walc,health, absences,G1, G2, G3)
tree <- rpart(G3 ~ .,
              data = newdata,
              method = "class")
imp <- varImp(tree)
rownames(imp)[order(imp$Overall, decreasing=TRUE)]

 [1] "G2"         "G1"         "absences"   "Walc"       "failures"  
 [6] "Fjob"       "Mjob"       "Dalc"       "Medu"       "health"    
[11] "studytime"  "goout"      "traveltime" "sex"        "age"       
[16] "address"    "Pstatus"    "Fedu"       "higher"     "internet"

We find that Grades in 1st and 2nd exam are key predictors followed by attendance levls,alcohol consumption and jobs of parents.
The tree logic is as below where only “attendance, Father’s job, Grades in 1st and 2nd Exam” are used as variables by the tree based on correlation and collinearity between some of the other variables.

printcp(tree)


Classification tree:
rpart(formula = G3 ~ ., data = newdata, method = "class")

Variables actually used in tree construction:
[1] absences Fjob     G1       G2      

Root node error: 339/395 = 0.85823

n= 395 

        CP nsplit rel error  xerror     xstd
1 0.094395      0   1.00000 1.00000 0.020450
2 0.067847      1   0.90560 0.90265 0.024494
3 0.060472      2   0.83776 0.83481 0.026424
4 0.032448      4   0.71681 0.76696 0.027807
5 0.026549      7   0.61947 0.74926 0.028088
6 0.022124      9   0.56637 0.69617 0.028751
7 0.011799     11   0.52212 0.59587 0.029306
8 0.010000     13   0.49853 0.60767 0.029286

plotcp(tree)

Summary

Problem Statement
We find that some variables are highly correlated with others and hence do not make a significant impact in predicting the final grades. Different tabular and visual interpretations were made to analyze the relatioship between the different variables towards predicting or influencing the grade of a student.

Methodology Used
1. Simple Exploratory Analysis is first done to slice and dice the data for different predictors and results are presented in tables.
2. Interesting Visualizations are employed to find out the inter relationship between different varaiables and hence their effect on the grades and other factors.
3. Decision tree algorithm is used to predict the grades based on certain important variables as chosen by the algorithm due to the correlation and collinearity exhibited by the variables.

Insights Provided
1.Gender Based: We find that more female students tend to fail in the final exam as compared to male students but the gender does not seem to have as much as a correlation to predict the grades as alcohol consumption. We also find the attendance level of female students is lesser than male students and male students with a plan of higher education outperform female students.
2.Alcohol consumption: We find a clear detrimental impact on grades due to the consumption of alcohol during weekend but a clear trend cannot be established for weekdays. The alcohol consumption also increases in urban settings when compared to rural settings.
3.Goals of the student: We find that on an average, students who are planning on a higher education perform better than those who do not.
4.Health and attendance in class: As expected, as health condition detoriates, we find that the number of absences increases.
5. Access to internet: Access to internet has a detroriating effect on the performance of the students as their average grades are lesser than those without access to internet.

We find that the variables which actually impact the prediction of final grades are Absences, Fathers’ job and Grades in Exam 1 and Exam 2 as found in decision trees algorithm.

Predicting the performance of Students

Dhivya Rajprasad