Required R libraries
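
No library calls are listed under this heading. As a minimal sketch, assuming the only add-on packages needed are the two named later in this document (data.table for setnames, stringr for str_c and str_replace_all), loading them could look like this. The variable names required_packages and missing_packages are illustrative.

```r
# Packages referenced later in this document (assumed to be the full list).
required_packages <- c("data.table", "stringr")

# Check which of them are not installed yet.
is_installed <- vapply(required_packages, requireNamespace, logical(1), quietly = TRUE)
missing_packages <- required_packages[!is_installed]

if (length(missing_packages) > 0) {
  # Nothing is installed automatically; the reader decides how to install.
  message("Install first: ", paste(missing_packages, collapse = ", "))
} else {
  library(data.table)
  library(stringr)
}
```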

About the dataset

Dataset Name : “Student Performance Data Set”.

Source : UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Student+Performance)

Information : This data approach student achievement in secondary education of two Portuguese schools. The data attributes include student grades, demographic, social and school related features) and it was collected by using school reports and questionnaires. Two datasets are provided regarding the performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In [Cortez and Silva, 2008], the two datasets were modeled under binary/five-level classification and regression tasks. Important note: the target attribute G3 has a strong correlation with attributes G2 and G1. This occurs because G3 is the final year grade (issued at the 3rd period), while G1 and G2 correspond to the 1st and 2nd period grades. It is more difficult to predict G3 without G2 and G1, but such prediction is much more useful (see paper source for more details).

The description above is taken from the source cited above.

Attribute Information:

Attributes for both student-mat.csv (Math course) and student-por.csv (Portuguese language course) datasets:
1 school - student’s school (binary: ‘GP’ - Gabriel Pereira or ‘MS’ - Mousinho da Silveira)
2 sex - student’s sex (binary: ‘F’ - female or ‘M’ - male)
3 age - student’s age (numeric: from 15 to 22)
4 address - student’s home address type (binary: ‘U’ - urban or ‘R’ - rural)
5 famsize - family size (binary: ‘LE3’ - less or equal to 3 or ‘GT3’ - greater than 3)
6 Pstatus - parent’s cohabitation status (binary: ‘T’ - living together or ‘A’ - apart)
7 Medu - mother’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
8 Fedu - father’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
9 Mjob - mother’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
10 Fjob - father’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
11 reason - reason to choose this school (nominal: close to ‘home’, school ‘reputation’, ‘course’ preference or ‘other’)
12 guardian - student’s guardian (nominal: ‘mother’, ‘father’ or ‘other’)
13 traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
14 studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
15 failures - number of past class failures (numeric: n if 1<=n<3, else 4)
16 schoolsup - extra educational support (binary: yes or no)
17 famsup - family educational support (binary: yes or no)
18 paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
19 activities - extra-curricular activities (binary: yes or no)
20 nursery - attended nursery school (binary: yes or no)
21 higher - wants to take higher education (binary: yes or no)
22 internet - Internet access at home (binary: yes or no)
23 romantic - with a romantic relationship (binary: yes or no)
24 famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
25 freetime - free time after school (numeric: from 1 - very low to 5 - very high)
26 goout - going out with friends (numeric: from 1 - very low to 5 - very high)
27 Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
28 Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
29 health - current health status (numeric: from 1 - very bad to 5 - very good)
30 absences - number of school absences (numeric: from 0 to 93)

These grades relate to the course subject, Math or Portuguese:
31 G1 - first period grade (numeric: from 0 to 20)
32 G2 - second period grade (numeric: from 0 to 20)
33 G3 - final grade (numeric: from 0 to 20, output target)

Obtaining the dataset

Both datasets, “student-mat.csv” and “student-por.csv”, were downloaded from the source data folder and uploaded to GitHub.
Below are the GitHub URLs:
1. https://raw.githubusercontent.com/arunk13/MSDA-Assignments/master/IS607Fall2015/Assignment3/student-mat.csv
2. https://raw.githubusercontent.com/arunk13/MSDA-Assignments/master/IS607Fall2015/Assignment3/student-por.csv

Loading the data from GitHub into the R environment

# stringsAsFactors = TRUE keeps text columns as factors, as in the str() output below
# (since R 4.0.0 the default is FALSE, which would load them as character).
math_student_data_url <- "https://raw.githubusercontent.com/arunk13/MSDA-Assignments/master/IS607Fall2015/Assignment3/student-mat.csv";
math_student_data <- read.table(file = math_student_data_url, header = TRUE, sep = ";", stringsAsFactors = TRUE);
por_student_data_url <- "https://raw.githubusercontent.com/arunk13/MSDA-Assignments/master/IS607Fall2015/Assignment3/student-por.csv";
por_student_data <- read.table(file = por_student_data_url, header = TRUE, sep = ";", stringsAsFactors = TRUE);

Knowing our data

Size of the dataset

Number of observations (Math dataset): 395
Number of variables: 33
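
These two figures can be read off with nrow and ncol. The sketch below is self-contained, so it parses a three-row, four-column stand-in (values copied from the first rows shown in the snapshot that follows) rather than the full file. On math_student_data the same calls should return 395 and 33, matching the str() output. The names sample_text and sample_data are illustrative.

```r
# Stand-in: first three rows, four of the 33 columns, semicolon-separated like the CSV.
sample_text <- "school;age;absences;G3
GP;18;6;6
GP;17;4;6
GP;15;10;10"

sample_data <- read.table(text = sample_text, header = TRUE, sep = ";",
                          stringsAsFactors = TRUE)

# Number of observations and number of variables.
print(nrow(sample_data))
print(ncol(sample_data))
```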

A snapshot of the data
head(math_student_data, n = 3);
##   school sex age address famsize Pstatus Medu Fedu    Mjob    Fjob reason
## 1     GP   F  18       U     GT3       A    4    4 at_home teacher course
## 2     GP   F  17       U     GT3       T    1    1 at_home   other course
## 3     GP   F  15       U     LE3       T    1    1 at_home   other  other
##   guardian traveltime studytime failures schoolsup famsup paid activities
## 1   mother          2         2        0       yes     no   no         no
## 2   father          1         2        0        no    yes   no         no
## 3   mother          1         2        3       yes     no  yes         no
##   nursery higher internet romantic famrel freetime goout Dalc Walc health
## 1     yes    yes       no       no      4        3     4    1    1      3
## 2      no    yes      yes       no      5        3     3    1    1      3
## 3     yes    yes      yes       no      4        3     2    2    3      3
##   absences G1 G2 G3
## 1        6  5  6  6
## 2        4  5  5  6
## 3       10  7  8 10
str(math_student_data);
## 'data.frame':    395 obs. of  33 variables:
##  $ school    : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sex       : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
##  $ age       : int  18 17 15 15 16 16 16 17 15 15 ...
##  $ address   : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
##  $ famsize   : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
##  $ Pstatus   : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
##  $ Medu      : int  4 1 1 4 3 4 2 4 3 3 ...
##  $ Fedu      : int  4 1 1 2 3 3 2 4 2 4 ...
##  $ Mjob      : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
##  $ Fjob      : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
##  $ reason    : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
##  $ guardian  : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
##  $ traveltime: int  2 1 1 1 1 1 1 2 1 1 ...
##  $ studytime : int  2 2 2 3 2 2 2 2 2 2 ...
##  $ failures  : int  0 0 3 0 0 0 0 0 0 0 ...
##  $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
##  $ famsup    : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
##  $ paid      : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 2 ...
##  $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
##  $ nursery   : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
##  $ higher    : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ internet  : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
##  $ romantic  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ famrel    : int  4 5 4 3 4 5 4 4 4 5 ...
##  $ freetime  : int  3 3 3 2 3 4 4 1 2 5 ...
##  $ goout     : int  4 3 2 2 2 2 4 4 2 1 ...
##  $ Dalc      : int  1 1 2 1 1 1 1 1 1 1 ...
##  $ Walc      : int  1 1 3 1 2 2 1 1 1 1 ...
##  $ health    : int  3 3 3 5 5 5 3 1 1 5 ...
##  $ absences  : int  6 4 10 2 4 10 0 6 0 0 ...
##  $ G1        : int  5 5 7 15 6 15 12 6 16 14 ...
##  $ G2        : int  6 5 8 14 10 15 12 5 18 15 ...
##  $ G3        : int  6 6 10 15 10 15 11 6 19 15 ...

Data transformation

Subsetting the dataset

For convenience, I will keep only a few columns that I believe are the most important variables in the dataset.

Variables - #1, #2, #3, #6, #7, #8, #13, #14, #15, #21, #22, #30, #31, #32, #33

working_student_dataset <- math_student_data[c(1, 2, 3, 6, 7, 8, 13, 14, 15, 21, 22, 30, 31, 32, 33)];
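
Selecting by position breaks silently if the column order ever changes, so selecting by name can be a safer variant. The sketch below uses a stand-in data frame that only carries the 33 column names in the order given by the str() output above (the zeros are placeholders, not data) and checks that both selections pick the same columns. The names stand_in, keep, by_name and by_index are illustrative.

```r
# Column names in file order, as listed by str(math_student_data).
all_columns <- c("school", "sex", "age", "address", "famsize", "Pstatus", "Medu",
                 "Fedu", "Mjob", "Fjob", "reason", "guardian", "traveltime",
                 "studytime", "failures", "schoolsup", "famsup", "paid",
                 "activities", "nursery", "higher", "internet", "romantic",
                 "famrel", "freetime", "goout", "Dalc", "Walc", "health",
                 "absences", "G1", "G2", "G3")

# Placeholder data frame: two rows of zeros, real column names.
stand_in <- as.data.frame(matrix(0, nrow = 2, ncol = length(all_columns)))
names(stand_in) <- all_columns

# The same 15 variables, selected by name instead of by index.
keep <- c("school", "sex", "age", "Pstatus", "Medu", "Fedu", "traveltime",
          "studytime", "failures", "higher", "internet", "absences",
          "G1", "G2", "G3")
by_name  <- stand_in[keep]
by_index <- stand_in[c(1, 2, 3, 6, 7, 8, 13, 14, 15, 21, 22, 30, 31, 32, 33)]

# Both approaches should select identical columns.
print(identical(names(by_name), names(by_index)))
```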

We will be working on the new dataset “working_student_dataset”.

head(working_student_dataset, n = 3);
##   school sex age Pstatus Medu Fedu traveltime studytime failures higher
## 1     GP   F  18       A    4    4          2         2        0    yes
## 2     GP   F  17       T    1    1          1         2        0    yes
## 3     GP   F  15       T    1    1          1         2        3    yes
##   internet absences G1 G2 G3
## 1       no        6  5  6  6
## 2      yes        4  5  5  6
## 3      yes       10  7  8 10
Renaming a few variables

I would like to rename the variables below to make them easier to understand. The variables and their descriptions are:
Pstatus -> parent’s cohabitation status
Medu -> mother’s education
Fedu -> father’s education
G1 -> first period grade
G2 -> second period grade
G3 -> final grade

# Use setnames() from the data.table package to rename several columns at once by header name instead of column index.
setnames(working_student_dataset, old = c("Pstatus", "Medu", "Fedu", "G1", "G2", "G3"), new = c("parent_cohabitation_status", "mother_edu", "father_edu", "first_period_grade", "second_period_grade", "final_grade"));
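
For readers without data.table, the same rename-by-name idea can be sketched in base R with a lookup vector. This is an alternative, not the method used above. The data frame grades is a small stand-in with three of the renamed columns, and new_names and hits are illustrative names.

```r
# Stand-in with three of the columns that get renamed.
grades <- data.frame(Pstatus = c("A", "T"), Medu = c(4, 1), G3 = c(6, 6))

# Lookup vector: old name -> new name.
new_names <- c(Pstatus = "parent_cohabitation_status",
               Medu    = "mother_edu",
               G3      = "final_grade")

# Rename only the columns that appear in the lookup; others are left alone.
hits <- names(grades) %in% names(new_names)
names(grades)[hits] <- unname(new_names[names(grades)[hits]])

print(names(grades))
```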
Finding missing values
summary(working_student_dataset);
##  school   sex          age       parent_cohabitation_status
##  GP:349   F:208   Min.   :15.0   A: 41                     
##  MS: 46   M:187   1st Qu.:16.0   T:354                     
##                   Median :17.0                             
##                   Mean   :16.7                             
##                   3rd Qu.:18.0                             
##                   Max.   :22.0                             
##    mother_edu      father_edu      traveltime      studytime    
##  Min.   :0.000   Min.   :0.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:2.000   1st Qu.:2.000   1st Qu.:1.000   1st Qu.:1.000  
##  Median :3.000   Median :2.000   Median :1.000   Median :2.000  
##  Mean   :2.749   Mean   :2.522   Mean   :1.448   Mean   :2.035  
##  3rd Qu.:4.000   3rd Qu.:3.000   3rd Qu.:2.000   3rd Qu.:2.000  
##  Max.   :4.000   Max.   :4.000   Max.   :4.000   Max.   :4.000  
##     failures      higher    internet     absences      first_period_grade
##  Min.   :0.0000   no : 20   no : 66   Min.   : 0.000   Min.   : 3.00     
##  1st Qu.:0.0000   yes:375   yes:329   1st Qu.: 0.000   1st Qu.: 8.00     
##  Median :0.0000                       Median : 4.000   Median :11.00     
##  Mean   :0.3342                       Mean   : 5.709   Mean   :10.91     
##  3rd Qu.:0.0000                       3rd Qu.: 8.000   3rd Qu.:13.00     
##  Max.   :3.0000                       Max.   :75.000   Max.   :19.00     
##  second_period_grade  final_grade   
##  Min.   : 0.00       Min.   : 0.00  
##  1st Qu.: 9.00       1st Qu.: 8.00  
##  Median :11.00       Median :11.00  
##  Mean   :10.71       Mean   :10.42  
##  3rd Qu.:13.00       3rd Qu.:14.00  
##  Max.   :19.00       Max.   :20.00

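Observation: I don’t see any missing values in any of the columns. summary() reports an NA's count for any column that contains missing values, and no such count appears above.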
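
A more direct check than reading the summary is to count NA values per column. The sketch below uses a tiny stand-in (two columns, first three rows from the snapshot earlier) plus a hypothetical copy with one value deliberately blanked out, only to show what a positive result looks like. The names sample_rows and with_gap are illustrative.

```r
# Stand-in: absences and final grade for the first three rows shown earlier.
sample_rows <- data.frame(
  absences    = c(6, 4, 10),
  final_grade = c(6, 6, 10)
)

# Missing values per column; all zeros means nothing is missing.
na_counts <- colSums(is.na(sample_rows))
print(na_counts)

# Hypothetical copy with one value removed, to show a non-zero count.
with_gap <- sample_rows
with_gap$absences[2] <- NA
print(colSums(is.na(with_gap)))
print(anyNA(with_gap))
```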

Renaming values

The school variable has two values: GP and MS.
GP stands for “Gabriel Pereira”.
MS stands for “Mousinho da Silveira”.

Let’s use the stringr package for these transformations.

Let’s first check the class of the school variable.

class(working_student_dataset$school);
## [1] "factor"
# Convert the factor to character, then replace each code with the full school name.
working_student_dataset$school <- str_replace_all(as.character(working_student_dataset$school), c("^GP$" = "Gabriel Pereira", "^MS$" = "Mousinho da Silveira"));
head(working_student_dataset, n = 1);
##            school sex age parent_cohabitation_status mother_edu father_edu
## 1 Gabriel Pereira   F  18                          A          4          4
##   traveltime studytime failures higher internet absences
## 1          2         2        0    yes       no        6
##   first_period_grade second_period_grade final_grade
## 1                  5                   6           6
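
Because school starts out as a factor, the codes can also be expanded by relabelling the factor levels, which keeps the column a factor instead of turning it into character. This is an alternative sketch in base R, not the stringr approach used above. The vector school holds made-up stand-in values, and school_names is an illustrative name.

```r
# Stand-in factor with the two school codes (values are made up).
school <- factor(c("GP", "GP", "MS"))

# Lookup from code to full name.
school_names <- c(GP = "Gabriel Pereira", MS = "Mousinho da Silveira")

# Relabel the levels; the underlying observations are untouched.
levels(school) <- unname(school_names[levels(school)])

print(school)
print(class(school))
```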
Subsetting the data

Let’s subset the data to students whose parents’ cohabitation status is “Apart”. I want to see whether the parents’ cohabitation status has any impact on the student’s grade.

subset_student_data <- subset(working_student_dataset, parent_cohabitation_status == 'A');

# Compare the mean final grade of the subset with the mean final grade of the overall data.
mean(subset_student_data$final_grade) < mean(working_student_dataset$final_grade);
## [1] FALSE
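
The FALSE above says that the mean final grade of students whose parents live apart is not below the overall mean. Looking at the mean per status group shows the same comparison more directly. The sketch below uses only the first ten rows (Pstatus decoded from the factor codes in the str() output, G3 copied as is), so it illustrates the call and need not agree with the full-data result. The names first_ten and group_means are illustrative.

```r
# First ten rows of parent cohabitation status and final grade.
first_ten <- data.frame(
  parent_cohabitation_status = c("A", "T", "T", "T", "T", "T", "T", "A", "A", "T"),
  final_grade                = c(6, 6, 10, 15, 10, 15, 11, 6, 19, 15)
)

# Mean final grade for each status group.
group_means <- tapply(first_ten$final_grade,
                      first_ten$parent_cohabitation_status,
                      mean)
print(group_means)
```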

A few analyses

Compare the mean final grade of the two schools
aggregate(working_student_dataset[,15], working_student_dataset["school"], mean);
##                 school         x
## 1      Gabriel Pereira 10.489971
## 2 Mousinho da Silveira  9.847826
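
The result column is labelled x because an unnamed vector was passed to aggregate. The formula interface keeps the original column name, which may read better. The sketch below runs on a made-up four-row data frame (the grades are hypothetical and chosen only to make the means easy to verify). The names toy and school_means are illustrative.

```r
# Hypothetical toy data: two students per school.
toy <- data.frame(
  school      = c("Gabriel Pereira", "Gabriel Pereira",
                  "Mousinho da Silveira", "Mousinho da Silveira"),
  final_grade = c(10, 12, 8, 10)
)

# Mean final grade per school; the output column keeps the name final_grade.
school_means <- aggregate(final_grade ~ school, data = toy, FUN = mean)
print(school_means)
```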
Compare the final grade distribution of boys versus girls.
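
No code or output is shown for this comparison. One possible way to do it, offered as a sketch rather than the original analysis, is a box plot of final grade by sex. The example uses the first ten rows (sex decoded from the factor codes in the str() output, G3 copied as is) and asks boxplot for the numbers without drawing. The names first_ten and box_stats are illustrative.

```r
# First ten rows of sex and final grade.
first_ten <- data.frame(
  sex         = c("F", "F", "F", "F", "F", "M", "M", "F", "M", "M"),
  final_grade = c(6, 6, 10, 15, 10, 15, 11, 6, 19, 15)
)

# Whisker, hinge and median values per group; plot = FALSE returns them
# without drawing. Drop plot = FALSE to draw the side-by-side box plot.
box_stats <- boxplot(final_grade ~ sex, data = first_ten, plot = FALSE)
print(box_stats$names)
print(box_stats$stats)

# Numeric summary of the same distributions.
print(tapply(first_ten$final_grade, first_ten$sex, summary))
```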

Correlation between the first period grade and the second period grade.

Observation: There seems to be a strong correlation between the first period and second period grades.
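
The strength of this relationship can be quantified with the Pearson correlation coefficient. The sketch below applies cor to the first ten G1 and G2 values from the str() output. On the full data the call would presumably be cor(working_student_dataset$first_period_grade, working_student_dataset$second_period_grade). The names first_period, second_period and r are illustrative.

```r
# First ten first-period (G1) and second-period (G2) grades.
first_period  <- c(5, 5, 7, 15, 6, 15, 12, 6, 16, 14)
second_period <- c(6, 5, 8, 14, 10, 15, 12, 5, 18, 15)

# Pearson correlation; values near 1 indicate a strong positive linear relationship.
r <- cor(first_period, second_period)
print(r)

# A scatter plot of the same pairs could be drawn with:
# plot(first_period, second_period)
```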