Dataset Name : “Student Performance Data Set”.
Source : UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Student+Performance)
Information : This data approaches student achievement in secondary education of two Portuguese schools. The data attributes include student grades, demographic, social and school related features, and it was collected by using school reports and questionnaires. Two datasets are provided regarding the performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In [Cortez and Silva, 2008], the two datasets were modeled under binary/five-level classification and regression tasks. Important note: the target attribute G3 has a strong correlation with attributes G2 and G1. This occurs because G3 is the final year grade (issued at the 3rd period), while G1 and G2 correspond to the 1st and 2nd period grades. It is more difficult to predict G3 without G2 and G1, but such prediction is much more useful (see paper source for more details).
The above description is taken from the source mentioned above.
Attribute Information:
Attributes for both student-mat.csv (Math course) and student-por.csv (Portuguese language course) datasets:
1 school - student’s school (binary: ‘GP’ - Gabriel Pereira or ‘MS’ - Mousinho da Silveira)
2 sex - student’s sex (binary: ‘F’ - female or ‘M’ - male)
3 age - student’s age (numeric: from 15 to 22)
4 address - student’s home address type (binary: ‘U’ - urban or ‘R’ - rural)
5 famsize - family size (binary: ‘LE3’ - less or equal to 3 or ‘GT3’ - greater than 3)
6 Pstatus - parent’s cohabitation status (binary: ‘T’ - living together or ‘A’ - apart)
7 Medu - mother’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
8 Fedu - father’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
9 Mjob - mother’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
10 Fjob - father’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
11 reason - reason to choose this school (nominal: close to ‘home’, school ‘reputation’, ‘course’ preference or ‘other’)
12 guardian - student’s guardian (nominal: ‘mother’, ‘father’ or ‘other’)
13 traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
14 studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
15 failures - number of past class failures (numeric: n if 1<=n<3, else 4)
16 schoolsup - extra educational support (binary: yes or no)
17 famsup - family educational support (binary: yes or no)
18 paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
19 activities - extra-curricular activities (binary: yes or no)
20 nursery - attended nursery school (binary: yes or no)
21 higher - wants to take higher education (binary: yes or no)
22 internet - Internet access at home (binary: yes or no)
23 romantic - with a romantic relationship (binary: yes or no)
24 famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
25 freetime - free time after school (numeric: from 1 - very low to 5 - very high)
26 goout - going out with friends (numeric: from 1 - very low to 5 - very high)
27 Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
28 Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
29 health - current health status (numeric: from 1 - very bad to 5 - very good)
30 absences - number of school absences (numeric: from 0 to 93)
These grades are related with the course subject, Math or Portuguese:
31 G1 - first period grade (numeric: from 0 to 20)
32 G2 - second period grade (numeric: from 0 to 20)
33 G3 - final grade (numeric: from 0 to 20, output target)
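Several attributes in the list above (Medu, Fedu, traveltime, studytime) are stored as integer codes for ordered categories. As a minimal sketch, assuming we want readable labels, such a column can be turned into an ordered factor. The vector `medu` holds the first ten Medu values of the math dataset, and the shortened label texts in `edu_labels` are my own wording, not part of the dataset.

```r
# First ten Medu codes from the math dataset (0 = none ... 4 = higher education).
medu <- c(4, 1, 1, 4, 3, 4, 2, 4, 3, 3)

# Shortened labels for codes 0 to 4 (illustrative wording, an assumption).
edu_labels <- c("none", "primary (4th grade)", "5th to 9th grade",
                "secondary", "higher")

# Map each code to its label and keep the natural ordering of the levels.
medu_factor <- factor(medu, levels = 0:4, labels = edu_labels, ordered = TRUE)

# Count how often each education level occurs in this small sample.
table(medu_factor)
```

Keeping the factor ordered preserves the ranking of the levels, so comparisons and sorted summaries still make sense.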
Both datasets, “student-mat.csv” and “student-por.csv”, were downloaded from the source data folder and uploaded to GitHub.
Below are the GitHub URLs:
1. https://raw.githubusercontent.com/arunk13/MSDA-Assignments/master/IS607Fall2015/Assignment3/student-mat.csv
2. https://raw.githubusercontent.com/arunk13/MSDA-Assignments/master/IS607Fall2015/Assignment3/student-por.csv
Downloading the data from GitHub into the R environment:
# stringsAsFactors = TRUE reproduces the factor columns shown below (the default became FALSE in R 4.0.0).
math_student_data_url <- "https://raw.githubusercontent.com/arunk13/MSDA-Assignments/master/IS607Fall2015/Assignment3/student-mat.csv";
math_student_data <- read.table(file = math_student_data_url, header = TRUE, sep = ";", stringsAsFactors = TRUE);
por_student_data_url <- "https://raw.githubusercontent.com/arunk13/MSDA-Assignments/master/IS607Fall2015/Assignment3/student-por.csv";
por_student_data <- read.table(file = por_student_data_url, header = TRUE, sep = ";", stringsAsFactors = TRUE);
Number of observations (math dataset): 395
Number of variables (math dataset): 33
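The counts above can be read directly from a data frame with `dim()`, `nrow()` and `ncol()`. The following is a small offline sketch, not the real download: `sample_text` is a three-row, three-column excerpt that I typed in from the first rows of the math data, so the dimensions it reports are those of the excerpt only.

```r
# A tiny semicolon-separated excerpt standing in for the real CSV file.
sample_text <- "school;sex;age
GP;F;18
GP;F;17
GP;F;15"

# Same arguments as the real read.table() call, but reading from a string.
sample_data <- read.table(text = sample_text, header = TRUE, sep = ";",
                          stringsAsFactors = TRUE)

# Rows (observations) and columns (variables) of the excerpt.
dim(sample_data)
nrow(sample_data)
ncol(sample_data)
```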
head(math_student_data, n = 3);
## school sex age address famsize Pstatus Medu Fedu Mjob Fjob reason
## 1 GP F 18 U GT3 A 4 4 at_home teacher course
## 2 GP F 17 U GT3 T 1 1 at_home other course
## 3 GP F 15 U LE3 T 1 1 at_home other other
## guardian traveltime studytime failures schoolsup famsup paid activities
## 1 mother 2 2 0 yes no no no
## 2 father 1 2 0 no yes no no
## 3 mother 1 2 3 yes no yes no
## nursery higher internet romantic famrel freetime goout Dalc Walc health
## 1 yes yes no no 4 3 4 1 1 3
## 2 no yes yes no 5 3 3 1 1 3
## 3 yes yes yes no 4 3 2 2 3 3
## absences G1 G2 G3
## 1 6 5 6 6
## 2 4 5 5 6
## 3 10 7 8 10
str(math_student_data);
## 'data.frame': 395 obs. of 33 variables:
## $ school : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
## $ sex : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
## $ age : int 18 17 15 15 16 16 16 17 15 15 ...
## $ address : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
## $ famsize : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
## $ Pstatus : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
## $ Medu : int 4 1 1 4 3 4 2 4 3 3 ...
## $ Fedu : int 4 1 1 2 3 3 2 4 2 4 ...
## $ Mjob : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
## $ Fjob : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
## $ reason : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
## $ guardian : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
## $ traveltime: int 2 1 1 1 1 1 1 2 1 1 ...
## $ studytime : int 2 2 2 3 2 2 2 2 2 2 ...
## $ failures : int 0 0 3 0 0 0 0 0 0 0 ...
## $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
## $ famsup : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
## $ paid : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 2 ...
## $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
## $ nursery : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
## $ higher : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ internet : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
## $ romantic : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ famrel : int 4 5 4 3 4 5 4 4 4 5 ...
## $ freetime : int 3 3 3 2 3 4 4 1 2 5 ...
## $ goout : int 4 3 2 2 2 2 4 4 2 1 ...
## $ Dalc : int 1 1 2 1 1 1 1 1 1 1 ...
## $ Walc : int 1 1 3 1 2 2 1 1 1 1 ...
## $ health : int 3 3 3 5 5 5 3 1 1 5 ...
## $ absences : int 6 4 10 2 4 10 0 6 0 0 ...
## $ G1 : int 5 5 7 15 6 15 12 6 16 14 ...
## $ G2 : int 6 5 8 14 10 15 12 5 18 15 ...
## $ G3 : int 6 6 10 15 10 15 11 6 19 15 ...
For simplicity, I will pick a few columns that I believe are the most important variables in the dataset.
Variables - #1, #2, #3, #6, #7, #8, #13, #14, #15, #21, #22, #30, #31, #32, #33
working_student_dataset <- math_student_data[c(1, 2, 3, 6, 7, 8, 13, 14, 15, 21, 22, 30, 31, 32, 33)];
From here on, we will work with the new dataset “working_student_dataset”.
head(working_student_dataset, n = 3);
## school sex age Pstatus Medu Fedu traveltime studytime failures higher
## 1 GP F 18 A 4 4 2 2 0 yes
## 2 GP F 17 T 1 1 1 2 0 yes
## 3 GP F 15 T 1 1 1 2 3 yes
## internet absences G1 G2 G3
## 1 no 6 5 6 6
## 2 yes 4 5 5 6
## 3 yes 10 7 8 10
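The selection above picks columns by position, which breaks silently if the column order ever changes. As a hedged alternative sketch, columns can be selected by name instead. The small `sample_data` frame below copies a few values from the first three rows of the math data, and the `keep` vector is just an example choice.

```r
# A few columns from the first three rows of the math data.
sample_data <- data.frame(
  school  = c("GP", "GP", "GP"),
  sex     = c("F", "F", "F"),
  address = c("U", "U", "U"),
  Medu    = c(4, 1, 1),
  G3      = c(6, 6, 10)
)

# Select by name: readable and independent of column order.
keep <- c("school", "sex", "Medu", "G3")
by_name <- sample_data[keep]

# Select by position: equivalent here, but harder to read.
by_position <- sample_data[c(1, 2, 4, 5)]

identical(by_name, by_position)
```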
I would like to rename the variables below to make them easier to understand. Below are the variables and their descriptions.
Pstatus -> parent’s cohabitation status
Medu -> mother’s education
Fedu -> father’s education
G1 -> first period grade
G2 -> second period grade
G3 -> final grade
# Using the setnames() function from the data.table package to rename several columns by header name instead of column index.
library(data.table);
setnames(working_student_dataset, old = c("Pstatus", "Medu", "Fedu", "G1", "G2", "G3"), new = c("parent_cohabitation_status", "mother_edu", "father_edu", "first_period_grade", "second_period_grade", "final_grade"));
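If data.table is not available, the same rename by header name can be sketched in base R with `match()`. This is an alternative illustration rather than the approach used in this report, and `sample_data` is a two-row stand-in holding only three of the columns.

```r
# Two-row stand-in with three of the original column names.
sample_data <- data.frame(Pstatus = c("A", "T"), Medu = c(4, 1), G3 = c(6, 6))

old_names <- c("Pstatus", "Medu", "G3")
new_names <- c("parent_cohabitation_status", "mother_edu", "final_grade")

# match() finds the position of each old name, so the column order is irrelevant.
names(sample_data)[match(old_names, names(sample_data))] <- new_names

names(sample_data)
```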
summary(working_student_dataset);
## school sex age parent_cohabitation_status
## GP:349 F:208 Min. :15.0 A: 41
## MS: 46 M:187 1st Qu.:16.0 T:354
## Median :17.0
## Mean :16.7
## 3rd Qu.:18.0
## Max. :22.0
## mother_edu father_edu traveltime studytime
## Min. :0.000 Min. :0.000 Min. :1.000 Min. :1.000
## 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:1.000
## Median :3.000 Median :2.000 Median :1.000 Median :2.000
## Mean :2.749 Mean :2.522 Mean :1.448 Mean :2.035
## 3rd Qu.:4.000 3rd Qu.:3.000 3rd Qu.:2.000 3rd Qu.:2.000
## Max. :4.000 Max. :4.000 Max. :4.000 Max. :4.000
## failures higher internet absences first_period_grade
## Min. :0.0000 no : 20 no : 66 Min. : 0.000 Min. : 3.00
## 1st Qu.:0.0000 yes:375 yes:329 1st Qu.: 0.000 1st Qu.: 8.00
## Median :0.0000 Median : 4.000 Median :11.00
## Mean :0.3342 Mean : 5.709 Mean :10.91
## 3rd Qu.:0.0000 3rd Qu.: 8.000 3rd Qu.:13.00
## Max. :3.0000 Max. :75.000 Max. :19.00
## second_period_grade final_grade
## Min. : 0.00 Min. : 0.00
## 1st Qu.: 9.00 1st Qu.: 8.00
## Median :11.00 Median :11.00
## Mean :10.71 Mean :10.42
## 3rd Qu.:13.00 3rd Qu.:14.00
## Max. :19.00 Max. :20.00
Observation: I don't see any missing values in any of the columns.
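Reading the summary works, but missing values can also be counted explicitly. A minimal sketch, assuming a small stand-in frame: `sample_data` copies three rows of the math data, while `with_gap` introduces a hypothetical missing value (not present in the real data) to show what a positive result looks like.

```r
# Three rows of the math data (age, absences, final grade).
sample_data <- data.frame(age = c(18, 17, 15),
                          absences = c(6, 4, 10),
                          final_grade = c(6, 6, 10))

# Number of NA values per column; all zeros means nothing is missing.
na_per_column <- colSums(is.na(sample_data))
na_per_column

# Hypothetical example with one missing value, for comparison only.
with_gap <- sample_data
with_gap$absences[2] <- NA
colSums(is.na(with_gap))
```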
The school variable has two values: GP and MS.
GP stands for “Gabriel Pereira”.
MS stands for “Mousinho da Silveira”.
Let's use the stringr package to replace these abbreviations with the full school names.
Let's first check the class of the school variable.
class(working_student_dataset$school);
## [1] "factor"
library(stringr);
working_student_dataset$school <- str_replace_all(str_c(working_student_dataset$school), c("^GP$" = "Gabriel Pereira", "^MS$" = "Mousinho da Silveira"));
head(working_student_dataset, n = 1);
## school sex age parent_cohabitation_status mother_edu father_edu
## 1 Gabriel Pereira F 18 A 4 4
## traveltime studytime failures higher internet absences
## 1 2 2 0 yes no 6
## first_period_grade second_period_grade final_grade
## 1 5 6 6
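The same replacement can be sketched without regular expressions by using a named lookup vector. This is an alternative illustration, not what the report runs. The `school` vector holds made-up example values, and `full_names` is a hypothetical helper name.

```r
# Illustrative factor with both school codes (example values only).
school <- factor(c("GP", "GP", "MS"))

# Named lookup vector: code -> full school name.
full_names <- c(GP = "Gabriel Pereira", MS = "Mousinho da Silveira")

# Convert to character first; indexing with a factor would use its integer codes.
school_full <- unname(full_names[as.character(school)])
school_full
```

As with the stringr version, the result is a character vector, so the column is no longer a factor afterwards.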
Let's subset the data to students whose parents' cohabitation status is “Apart”. I want to see whether the parents' cohabitation status has any impact on the student's grade.
subset_student_data <- subset(working_student_dataset, parent_cohabitation_status == "A");
# Comparing the mean final grade of the subset with the mean final grade of the overall data.
mean(subset_student_data$final_grade) < mean(working_student_dataset$final_grade);
## [1] FALSE
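The comparison above returns only TRUE or FALSE. To see the actual group means side by side, `tapply()` can be used. The sketch below uses only the first ten rows of the math data (Pstatus and G3), which is far too few to draw any conclusion and need not agree with the full-data result. It is meant to show the shape of the call.

```r
# First ten rows of the math data: cohabitation status and final grade.
first_ten <- data.frame(
  Pstatus = c("A", "T", "T", "T", "T", "T", "T", "A", "A", "T"),
  G3      = c(6, 6, 10, 15, 10, 15, 11, 6, 19, 15)
)

# Mean final grade within each cohabitation group.
group_means <- tapply(first_ten$G3, first_ten$Pstatus, mean)

# Mean final grade over all ten rows, for reference.
overall_mean <- mean(first_ten$G3)

# Difference of each group mean from the overall mean.
group_means - overall_mean
```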
aggregate(working_student_dataset[,15], working_student_dataset["school"], mean);
## school x
## 1 Gabriel Pereira 10.489971
## 2 Mousinho da Silveira 9.847826
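Observation: Mean final grades differ only slightly between the two schools (about 10.49 at Gabriel Pereira versus 9.85 at Mousinho da Silveira). The first-period and second-period grades also seem to be strongly correlated.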
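A correlation between the period grades can be checked directly with `cor()`. As a hedged sketch, the frame below contains only the first ten rows of G1, G2 and G3 from the math data, so the coefficients it produces describe that small sample and not the full dataset.

```r
# First ten rows of the three grade columns from the math data.
first_ten <- data.frame(
  G1 = c(5, 5, 7, 15, 6, 15, 12, 6, 16, 14),
  G2 = c(6, 5, 8, 14, 10, 15, 12, 5, 18, 15),
  G3 = c(6, 6, 10, 15, 10, 15, 11, 6, 19, 15)
)

# Pairwise Pearson correlation matrix of the grades.
grade_cor <- cor(first_ten)
round(grade_cor, 2)
```

On the full data, the same call would be applied to the three renamed grade columns of `working_student_dataset`.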