Literature review and Guiding Questions

Several factors may affect students’ academic performance at high school. Academic performance matters because the whole education system revolves around it. Singh, Malik and Singh (2016) also argued that there is a direct link between students’ performance and the social and economic development of a country, which brings more attention to students’ academic performance. Another study noted that, because graduate students are the future leaders of society, their performance plays a vital role in society (Ali et al., 2009). Several studies have investigated students’ academic performance (Applegate and Daly, 2006; Hedjazi and Omidi, 2008; Ramadan and Quraan, 1994; Al-Rofo, 2010; Naser and Peel, 1998; Abdullah, 2005). Since students, teachers, institutions and parents all play a part in the learning process, it is important to recognize the relative importance of these variables. For instance, Jencks (1972) argued that the family plays an important role in formal and informal education. However, the role of the family depends on which lens we are looking through. Family characteristics cover a number of variables, such as education, income, beliefs, jobs and the number of siblings, which also have an impact on student performance (Khan et al., 2015). Well-established research has found that the social and economic status of parents is the best predictor of student academic achievement (Coleman et al., 1966). Although parental background is significant, we should not ignore other factors, such as a student’s study time and travel time. This study therefore aims to answer three questions. First, which of the variables has the most impact on students’ performance? Second, do factors tied to family affiliation (e.g. parents’ jobs, parents’ education, etc.) affect final grades more than factors outside family affiliation (e.g. going out with friends, workday alcohol consumption, etc.)? Third, do other factors have a more direct impact on students’ final performance?

Data source

These data address student achievement in secondary education at two Portuguese schools. The attributes include student grades and demographic, social and school-related features, and they were collected using school reports and questionnaires. Two datasets are provided regarding performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). The data were extracted from P. Cortez and A. Silva, Using Data Mining to Predict Secondary School Student Performance. In A. Brito and J. Teixeira (Eds.), Proceedings of 5th FUture BUsiness TEChnology Conference (FUBUTEC 2008), pp. 5-12, Porto, Portugal, April 2008, EUROSIS, ISBN 978-9077381-39-7.

Attributes for both student-mat.csv (Math course) and student-por.csv (Portuguese language course) datasets:

1 school - student’s school (binary: ‘GP’ - Gabriel Pereira or ‘MS’ - Mousinho da Silveira)
2 sex - student’s sex (binary: ‘F’ - female or ‘M’ - male)
3 age - student’s age (numeric: from 15 to 22)
4 address - student’s home address type (binary: ‘U’ - urban or ‘R’ - rural)
5 famsize - family size (binary: ‘LE3’ - less or equal to 3 or ‘GT3’ - greater than 3)
6 Pstatus - parent’s cohabitation status (binary: ‘T’ - living together or ‘A’ - apart)
7 Medu - mother’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
8 Fedu - father’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
9 Mjob - mother’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
10 Fjob - father’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
11 reason - reason to choose this school (nominal: close to ‘home’, school ‘reputation’, ‘course’ preference or ‘other’)
12 guardian - student’s guardian (nominal: ‘mother’, ‘father’ or ‘other’)
13 traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
14 studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
15 failures - number of past class failures (numeric: n if 1<=n<3, else 4)
16 schoolsup - extra educational support (binary: yes or no)
17 famsup - family educational support (binary: yes or no)
18 paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
19 activities - extra-curricular activities (binary: yes or no)
20 nursery - attended nursery school (binary: yes or no)
21 higher - wants to take higher education (binary: yes or no)
22 internet - Internet access at home (binary: yes or no)
23 romantic - with a romantic relationship (binary: yes or no)
24 famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
25 freetime - free time after school (numeric: from 1 - very low to 5 - very high)
26 goout - going out with friends (numeric: from 1 - very low to 5 - very high)
27 Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
28 Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
29 health - current health status (numeric: from 1 - very bad to 5 - very good)
30 absences - number of school absences (numeric: from 0 to 93)

These grades are related to the course subject, Math or Portuguese:

31 G1 - first period grade (numeric: from 0 to 20)
32 G2 - second period grade (numeric: from 0 to 20)
33 G3 - final grade (numeric: from 0 to 20, output target)

Target Audience

I hope the results of this study can help parents, teachers and students understand the internal and external variables that may affect students’ success, and help future research identify students at risk. The results can also help schools and universities interpret students’ performance in light of the variables affecting it. In addition, parental involvement and engagement in education matter now more than ever because they are in decline. In 2016, research showed a drop in the share of parents who believe that close parent-teacher communication is effective. Therefore, by showing the importance of parental influence, I hope we can encourage parents to take action in this regard.

Data wrangling

First, I load the packages used in this project. The main ones, tidyverse, tidymodels, workflows and rsample, are installed and used for data inspection, summary statistics, cleaning, transformation and modelling for the final predictions.

install.packages("workflows")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.1'
## (as 'lib' is unspecified)
install.packages("rsample")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.1'
## (as 'lib' is unspecified)
install.packages("tidymodels")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.1'
## (as 'lib' is unspecified)
install.packages("tidyverse")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.1'
## (as 'lib' is unspecified)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.6     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.4     ✓ stringr 1.4.0
## ✓ readr   2.0.1     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(skimr)
grade<- read.csv('data/grade.csv')
glimpse(grade)
## Rows: 649
## Columns: 33
## $ school     <chr> "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP",…
## $ sex        <chr> "F", "F", "F", "F", "F", "M", "M", "F", "M", "M", "F", "F",…
## $ age        <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15, 15, 15,…
## $ address    <chr> "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "U",…
## $ famsize    <chr> "GT3", "GT3", "LE3", "GT3", "GT3", "LE3", "LE3", "GT3", "LE…
## $ Pstatus    <chr> "A", "T", "T", "T", "T", "T", "T", "A", "A", "T", "T", "T",…
## $ Medu       <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, 3, 3, 4,…
## $ Fedu       <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, 3, 2, 3,…
## $ Mjob       <chr> "at_home", "at_home", "at_home", "health", "other", "servic…
## $ Fjob       <chr> "teacher", "other", "other", "services", "other", "other", …
## $ reason     <chr> "course", "course", "other", "home", "home", "reputation", …
## $ guardian   <chr> "mother", "father", "mother", "mother", "father", "mother",…
## $ traveltime <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, 3, 1, 1,…
## $ studytime  <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, 2, 1, 1,…
## $ failures   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0,…
## $ schoolsup  <chr> "yes", "no", "yes", "no", "no", "no", "no", "yes", "no", "n…
## $ famsup     <chr> "no", "yes", "no", "yes", "yes", "yes", "no", "yes", "yes",…
## $ paid       <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no",…
## $ activities <chr> "no", "no", "no", "yes", "no", "yes", "no", "no", "no", "ye…
## $ nursery    <chr> "yes", "no", "yes", "yes", "yes", "yes", "yes", "yes", "yes…
## $ higher     <chr> "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "ye…
## $ internet   <chr> "no", "yes", "yes", "yes", "no", "yes", "yes", "no", "yes",…
## $ romantic   <chr> "no", "no", "no", "yes", "no", "no", "no", "no", "no", "no"…
## $ famrel     <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, 5, 5, 3,…
## $ freetime   <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, 3, 5, 1,…
## $ goout      <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, 2, 5, 3,…
## $ Dalc       <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1,…
## $ Walc       <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, 1, 4, 3,…
## $ health     <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, 4, 5, 5,…
## $ absences   <int> 4, 2, 6, 0, 0, 6, 0, 2, 0, 0, 2, 0, 0, 0, 0, 6, 10, 2, 2, 6…
## $ G1         <int> 0, 9, 12, 14, 11, 12, 13, 10, 15, 12, 14, 10, 12, 12, 14, 1…
## $ G2         <int> 11, 11, 13, 14, 13, 12, 12, 13, 16, 12, 14, 12, 13, 12, 14,…
## $ G3         <int> 11, 11, 12, 14, 13, 13, 13, 13, 17, 13, 14, 13, 12, 13, 15,…
names(grade)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "guardian"   "traveltime" "studytime"  "failures"  
## [16] "schoolsup"  "famsup"     "paid"       "activities" "nursery"   
## [21] "higher"     "internet"   "romantic"   "famrel"     "freetime"  
## [26] "goout"      "Dalc"       "Walc"       "health"     "absences"  
## [31] "G1"         "G2"         "G3"

There are 33 variables in total, of numeric and character types. Some are binary, and some take values in numeric ranges.
Since not all variables are relevant, some will be removed. I will keep the variables that I believe have the most potential to affect students’ performance.

grade_vars<- grade %>%
select(Pstatus,Medu,Fedu,studytime,schoolsup,
famsup,paid,activities,higher,internet,famrel,freetime,goout,
health,absences,G1,G2,G3)

Some correlation analysis and visualization

cor(grade_vars$famrel, grade_vars$G3)
## [1] 0.06336113
cor(grade_vars$absences, grade_vars$G3)
## [1] -0.09137906
cor(grade_vars$G1, grade_vars$G3)
## [1] 0.8263871
cor(grade_vars$G2, grade_vars$G3)
## [1] 0.918548
ggplot(grade_vars, aes(G1, G3)) + geom_point() +
geom_smooth(method = lm)
## `geom_smooth()` using formula 'y ~ x'

ggplot(grade_vars, aes(G2, G3)) + geom_point() +
geom_smooth(method = lm)
## `geom_smooth()` using formula 'y ~ x'

skim(grade_vars)
Data summary
Name grade_vars
Number of rows 649
Number of columns 18
_______________________
Column type frequency:
character 7
numeric 11
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Pstatus 0 1 1 1 0 2 0
schoolsup 0 1 2 3 0 2 0
famsup 0 1 2 3 0 2 0
paid 0 1 2 3 0 2 0
activities 0 1 2 3 0 2 0
higher 0 1 2 3 0 2 0
internet 0 1 2 3 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Medu 0 1 2.51 1.13 0 2 2 4 4 ▁▆▇▆▇
Fedu 0 1 2.31 1.10 0 1 2 3 4 ▁▆▇▅▅
studytime 0 1 1.93 0.83 1 1 2 2 4 ▆▇▁▂▁
famrel 0 1 3.93 0.96 1 4 4 5 5 ▁▁▂▇▅
freetime 0 1 3.18 1.05 1 3 3 4 5 ▂▃▇▆▂
goout 0 1 3.18 1.18 1 2 3 4 5 ▂▆▇▆▅
health 0 1 3.54 1.45 1 2 4 5 5 ▃▂▃▃▇
absences 0 1 3.66 4.64 0 0 2 6 32 ▇▂▁▁▁
G1 0 1 11.40 2.75 0 10 11 13 19 ▁▂▇▇▁
G2 0 1 11.57 2.91 0 10 11 13 19 ▁▁▇▇▂
G3 0 1 11.91 3.23 0 10 12 14 19 ▁▁▇▇▂
summary(grade_vars)
##    Pstatus               Medu            Fedu         studytime    
##  Length:649         Min.   :0.000   Min.   :0.000   Min.   :1.000  
##  Class :character   1st Qu.:2.000   1st Qu.:1.000   1st Qu.:1.000  
##  Mode  :character   Median :2.000   Median :2.000   Median :2.000  
##                     Mean   :2.515   Mean   :2.307   Mean   :1.931  
##                     3rd Qu.:4.000   3rd Qu.:3.000   3rd Qu.:2.000  
##                     Max.   :4.000   Max.   :4.000   Max.   :4.000  
##   schoolsup            famsup              paid            activities       
##  Length:649         Length:649         Length:649         Length:649        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##     higher            internet             famrel         freetime   
##  Length:649         Length:649         Min.   :1.000   Min.   :1.00  
##  Class :character   Class :character   1st Qu.:4.000   1st Qu.:3.00  
##  Mode  :character   Mode  :character   Median :4.000   Median :3.00  
##                                        Mean   :3.931   Mean   :3.18  
##                                        3rd Qu.:5.000   3rd Qu.:4.00  
##                                        Max.   :5.000   Max.   :5.00  
##      goout           health         absences            G1      
##  Min.   :1.000   Min.   :1.000   Min.   : 0.000   Min.   : 0.0  
##  1st Qu.:2.000   1st Qu.:2.000   1st Qu.: 0.000   1st Qu.:10.0  
##  Median :3.000   Median :4.000   Median : 2.000   Median :11.0  
##  Mean   :3.185   Mean   :3.536   Mean   : 3.659   Mean   :11.4  
##  3rd Qu.:4.000   3rd Qu.:5.000   3rd Qu.: 6.000   3rd Qu.:13.0  
##  Max.   :5.000   Max.   :5.000   Max.   :32.000   Max.   :19.0  
##        G2              G3       
##  Min.   : 0.00   Min.   : 0.00  
##  1st Qu.:10.00   1st Qu.:10.00  
##  Median :11.00   Median :12.00  
##  Mean   :11.57   Mean   :11.91  
##  3rd Qu.:13.00   3rd Qu.:14.00  
##  Max.   :19.00   Max.   :19.00
glimpse(grade_vars)
## Rows: 649
## Columns: 18
## $ Pstatus    <chr> "A", "T", "T", "T", "T", "T", "T", "A", "A", "T", "T", "T",…
## $ Medu       <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, 3, 3, 4,…
## $ Fedu       <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, 3, 2, 3,…
## $ studytime  <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, 2, 1, 1,…
## $ schoolsup  <chr> "yes", "no", "yes", "no", "no", "no", "no", "yes", "no", "n…
## $ famsup     <chr> "no", "yes", "no", "yes", "yes", "yes", "no", "yes", "yes",…
## $ paid       <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no",…
## $ activities <chr> "no", "no", "no", "yes", "no", "yes", "no", "no", "no", "ye…
## $ higher     <chr> "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "ye…
## $ internet   <chr> "no", "yes", "yes", "yes", "no", "yes", "yes", "no", "yes",…
## $ famrel     <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, 5, 5, 3,…
## $ freetime   <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, 3, 5, 1,…
## $ goout      <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, 2, 5, 3,…
## $ health     <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, 4, 5, 5,…
## $ absences   <int> 4, 2, 6, 0, 0, 6, 0, 2, 0, 0, 2, 0, 0, 0, 0, 6, 10, 2, 2, 6…
## $ G1         <int> 0, 9, 12, 14, 11, 12, 13, 10, 15, 12, 14, 10, 12, 12, 14, 1…
## $ G2         <int> 11, 11, 13, 14, 13, 12, 12, 13, 16, 12, 14, 12, 13, 12, 14,…
## $ G3         <int> 11, 11, 12, 14, 13, 13, 13, 13, 17, 13, 14, 13, 12, 13, 15,…

G3 is the final grade and the dependent variable to be predicted. The correlations above show that the first-period grade G1 and the second-period grade G2 are highly correlated with G3 (about 0.83 and 0.92), while the other variables checked, famrel and absences, are only weakly correlated with it (about 0.06 and -0.09). For the sake of predicting final performance, however, we will take all variables into account and see how they contribute to the final grade in light of the three questions set out above.
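
The pairwise cor() calls above cover only four variables. A minimal sketch of how the same check could be run for every numeric column at once is given below. The small toy_grades data frame is a made-up stand-in for grade_vars so that the snippet runs on its own; its values are illustrative only and are not results from the study.

```r
# Hypothetical stand-in for grade_vars; the values are invented.
toy_grades <- data.frame(
  famsup   = c("no", "yes", "no", "yes", "yes", "no"),
  absences = c(4, 2, 6, 0, 0, 10),
  G1       = c(8, 9, 12, 14, 11, 7),
  G2       = c(9, 11, 13, 14, 13, 8),
  G3       = c(9, 11, 12, 14, 13, 8)
)

# Keep only the numeric columns, since cor() cannot handle character columns.
numeric_cols <- toy_grades[sapply(toy_grades, is.numeric)]

# Correlation of every numeric column with the final grade G3.
cor_with_G3 <- cor(numeric_cols)[, "G3"]

# Drop G3 itself (always 1) and sort by absolute strength.
cor_with_G3 <- cor_with_G3[names(cor_with_G3) != "G3"]
sorted_cor <- cor_with_G3[order(abs(cor_with_G3), decreasing = TRUE)]
print(round(sorted_cor, 2))
```

Applied to grade_vars, the same three steps would rank all eleven numeric columns rather than the four inspected by hand.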

In this part, I define the outcome variable for the prediction. I create a new variable, performance, that sorts students into two categories: ‘good’ (G3 of 11 or above) and ‘poor’ (G3 below 11). My aim is to see how many students perform poorly despite all kinds of support, in which case the responsible authorities may need to take other measures to tackle the poor performance of some students.

grade_vars<- grade_vars %>%
mutate(performance = if_else(G3 >= 11, 'good', 'poor')) 
glimpse(grade_vars)
## Rows: 649
## Columns: 19
## $ Pstatus     <chr> "A", "T", "T", "T", "T", "T", "T", "A", "A", "T", "T", "T"…
## $ Medu        <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, 3, 3, 4…
## $ Fedu        <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, 3, 2, 3…
## $ studytime   <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, 2, 1, 1…
## $ schoolsup   <chr> "yes", "no", "yes", "no", "no", "no", "no", "yes", "no", "…
## $ famsup      <chr> "no", "yes", "no", "yes", "yes", "yes", "no", "yes", "yes"…
## $ paid        <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no"…
## $ activities  <chr> "no", "no", "no", "yes", "no", "yes", "no", "no", "no", "y…
## $ higher      <chr> "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "y…
## $ internet    <chr> "no", "yes", "yes", "yes", "no", "yes", "yes", "no", "yes"…
## $ famrel      <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, 5, 5, 3…
## $ freetime    <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, 3, 5, 1…
## $ goout       <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, 2, 5, 3…
## $ health      <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, 4, 5, 5…
## $ absences    <int> 4, 2, 6, 0, 0, 6, 0, 2, 0, 0, 2, 0, 0, 0, 0, 6, 10, 2, 2, …
## $ G1          <int> 0, 9, 12, 14, 11, 12, 13, 10, 15, 12, 14, 10, 12, 12, 14, …
## $ G2          <int> 11, 11, 13, 14, 13, 12, 12, 13, 16, 12, 14, 12, 13, 12, 14…
## $ G3          <int> 11, 11, 12, 14, 13, 13, 13, 13, 17, 13, 14, 13, 12, 13, 15…
## $ performance <chr> "good", "good", "good", "good", "good", "good", "good", "g…

To tidy the data set further, I convert the character variables to factors, because categorical variables are modelled differently from continuous ones. I then look at the data set using the function glimpse.

grade_vars<- grade_vars %>%
mutate_if(is.character, as.factor)


glimpse(grade_vars)
## Rows: 649
## Columns: 19
## $ Pstatus     <fct> A, T, T, T, T, T, T, A, A, T, T, T, T, T, A, T, T, T, T, T…
## $ Medu        <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, 3, 3, 4…
## $ Fedu        <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, 3, 2, 3…
## $ studytime   <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, 2, 1, 1…
## $ schoolsup   <fct> yes, no, yes, no, no, no, no, yes, no, no, no, no, no, no,…
## $ famsup      <fct> no, yes, no, yes, yes, yes, no, yes, yes, yes, yes, yes, y…
## $ paid        <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, no…
## $ activities  <fct> no, no, no, yes, no, yes, no, no, no, yes, no, yes, yes, n…
## $ higher      <fct> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes…
## $ internet    <fct> no, yes, yes, yes, no, yes, yes, no, yes, yes, yes, yes, y…
## $ famrel      <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, 5, 5, 3…
## $ freetime    <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, 3, 5, 1…
## $ goout       <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, 2, 5, 3…
## $ health      <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, 4, 5, 5…
## $ absences    <int> 4, 2, 6, 0, 0, 6, 0, 2, 0, 0, 2, 0, 0, 0, 0, 6, 10, 2, 2, …
## $ G1          <int> 0, 9, 12, 14, 11, 12, 13, 10, 15, 12, 14, 10, 12, 12, 14, …
## $ G2          <int> 11, 11, 13, 14, 13, 12, 12, 13, 16, 12, 14, 12, 13, 12, 14…
## $ G3          <int> 11, 11, 12, 14, 13, 13, 13, 13, 17, 13, 14, 13, 12, 13, 15…
## $ performance <fct> good, good, good, good, good, good, good, good, good, good…
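
The order of the factor levels matters later on, because modelling functions refer to the classes by position (first level, second level). As a small sketch, the snippet below performs the same conversion with across(), which has superseded mutate_if() since dplyr 1.0.0, and prints the resulting level order. The toy tibble is invented for the example.

```r
library(dplyr)

# Invented miniature of grade_vars with two character columns.
toy <- tibble(
  famsup      = c("no", "yes", "no", "yes"),
  G3          = c(9L, 12L, 14L, 10L),
  performance = c("poor", "good", "good", "poor")
)

# Convert every character column to a factor; numeric columns are untouched.
toy_fct <- toy %>%
  mutate(across(where(is.character), as.factor))

# as.factor() sorts levels alphabetically, so "good" comes before "poor".
print(levels(toy_fct$performance))
```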

The next step is to compute the proportion of students in each performance category.

grade_vars %>%
count(performance) %>%
mutate(performance_proportion = n/sum(n))
##   performance   n performance_proportion
## 1        good 452              0.6964561
## 2        poor 197              0.3035439

The output above shows that around 30 percent of the students performed poorly and around 70 percent performed well. With this in mind, models built on different sets of variables will be compared on how accurately they predict performance.
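
These proportions also give a useful reference point: a rule that always predicts the larger class would already be right about 70 percent of the time, so a model needs to beat that figure to be informative. The short sketch below recomputes this baseline from the counts printed above; the variable names are chosen for the example only.

```r
# Class counts copied from the count() output above.
class_counts <- c(good = 452, poor = 197)

# Share of each class; this matches the performance_proportion column.
class_share <- class_counts / sum(class_counts)

# Accuracy of a trivial rule that always predicts the majority class.
baseline_accuracy <- max(class_share)
majority_class <- names(which.max(class_share))

cat("Majority class:", majority_class, "\n")
cat("Baseline accuracy:", round(baseline_accuracy, 3), "\n")
```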

Data Analysis

It is recommended practice in machine learning to split the cleaned dataset into training and test sets, so that model performance is estimated on data the model has not seen. The dataset below is therefore split by random sampling, stratified on performance, where by default around 75% of the full dataset goes to the training set and 25% to the test set.

library(rsample)
set.seed(42)
grade_split<- initial_split(grade_vars, strata = performance)
grade_split
## <Analysis/Assess/Total>
## <486/163/649>
train_data<- training(grade_split)
test_data<- testing(grade_split)
glimpse(train_data)
## Rows: 486
## Columns: 19
## $ Pstatus     <fct> A, T, T, T, T, T, A, T, T, T, T, T, T, T, T, T, T, T, T, T…
## $ Medu        <int> 4, 1, 1, 4, 3, 4, 3, 3, 4, 2, 4, 4, 3, 4, 4, 4, 2, 4, 4, 4…
## $ Fedu        <int> 4, 1, 1, 2, 3, 3, 2, 4, 4, 1, 3, 4, 3, 3, 3, 2, 2, 2, 4, 4…
## $ studytime   <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 3, 2, 1, 2, 1, 2, 2, 1, 1, 2, 2…
## $ schoolsup   <fct> yes, no, yes, no, no, no, no, no, no, no, no, no, yes, no,…
## $ famsup      <fct> no, yes, no, yes, yes, yes, yes, yes, yes, yes, yes, yes, …
## $ paid        <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, no…
## $ activities  <fct> no, no, no, yes, no, yes, no, yes, no, yes, no, no, yes, y…
## $ higher      <fct> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes…
## $ internet    <fct> no, yes, yes, yes, no, yes, yes, yes, yes, yes, yes, yes, …
## $ famrel      <int> 4, 5, 4, 3, 4, 5, 4, 5, 3, 5, 5, 4, 5, 3, 4, 4, 4, 2, 4, 5…
## $ freetime    <int> 3, 3, 3, 2, 3, 4, 2, 5, 3, 2, 4, 4, 3, 1, 4, 5, 2, 2, 4, 4…
## $ goout       <int> 4, 3, 2, 2, 2, 2, 2, 1, 3, 2, 3, 4, 2, 3, 1, 1, 2, 4, 5, 2…
## $ health      <int> 3, 3, 3, 5, 5, 5, 1, 5, 2, 4, 3, 2, 4, 5, 1, 5, 5, 1, 5, 5…
## $ absences    <int> 4, 2, 6, 0, 0, 6, 0, 0, 2, 0, 0, 6, 2, 6, 0, 0, 8, 0, 4, 0…
## $ G1          <int> 0, 9, 12, 14, 11, 12, 15, 12, 14, 10, 12, 17, 13, 12, 12, …
## $ G2          <int> 11, 11, 13, 14, 13, 12, 16, 12, 14, 12, 12, 17, 14, 12, 13…
## $ G3          <int> 11, 11, 12, 14, 13, 13, 17, 13, 14, 13, 13, 17, 14, 12, 14…
## $ performance <fct> good, good, good, good, good, good, good, good, good, good…
glimpse(test_data)
## Rows: 163
## Columns: 19
## $ Pstatus     <fct> T, A, T, A, T, T, T, A, T, T, T, T, T, T, A, T, T, T, T, T…
## $ Medu        <int> 2, 4, 4, 2, 4, 4, 2, 3, 4, 2, 4, 2, 4, 4, 2, 4, 4, 1, 3, 4…
## $ Fedu        <int> 2, 4, 4, 2, 4, 4, 2, 4, 4, 2, 4, 2, 2, 4, 1, 4, 2, 1, 1, 3…
## $ studytime   <int> 2, 2, 1, 3, 3, 1, 1, 2, 2, 2, 2, 1, 2, 1, 2, 2, 4, 2, 1, 2…
## $ schoolsup   <fct> no, yes, no, no, no, no, no, yes, no, no, no, yes, no, yes…
## $ famsup      <fct> no, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes,…
## $ paid        <fct> no, no, no, no, no, yes, no, yes, no, no, no, no, no, no, …
## $ activities  <fct> no, no, yes, no, yes, no, no, yes, yes, yes, yes, no, no, …
## $ higher      <fct> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes…
## $ internet    <fct> yes, no, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes,…
## $ famrel      <int> 4, 4, 4, 4, 3, 5, 1, 5, 4, 3, 4, 5, 4, 3, 5, 3, 3, 3, 5, 4…
## $ freetime    <int> 4, 1, 3, 5, 2, 4, 2, 3, 3, 3, 3, 4, 3, 3, 3, 2, 3, 3, 3, 3…
## $ goout       <int> 4, 4, 3, 2, 3, 2, 2, 3, 1, 3, 3, 1, 3, 4, 4, 2, 3, 4, 2, 3…
## $ health      <int> 3, 1, 5, 3, 2, 5, 5, 5, 5, 3, 5, 1, 5, 5, 2, 5, 3, 5, 5, 5…
## $ absences    <int> 0, 2, 0, 0, 10, 0, 6, 2, 2, 16, 0, 0, 4, 0, 2, 8, 0, 2, 0,…
## $ G1          <int> 13, 10, 12, 14, 13, 11, 10, 12, 15, 11, 14, 9, 11, 13, 12,…
## $ G2          <int> 12, 13, 13, 14, 13, 12, 11, 12, 15, 11, 15, 10, 12, 12, 13…
## $ G3          <int> 13, 13, 12, 15, 14, 12, 12, 13, 15, 10, 15, 10, 13, 12, 12…
## $ performance <fct> good, good, good, good, good, good, good, good, good, poor…
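
To make the idea of a stratified split concrete, the sketch below reproduces the logic in base R: it samples 75% of the row indices within each performance class separately, so both parts keep roughly the same class balance. This is only an illustration of the principle and not rsample's actual implementation; the function name stratified_indices and the toy outcome vector are assumptions made for the example.

```r
# Illustrative only: a hand-rolled stratified split, not rsample's algorithm.
stratified_indices <- function(outcome, prop = 0.75) {
  # Split row indices by class, then sample `prop` of each class.
  idx_by_class <- split(seq_along(outcome), outcome)
  train_idx <- lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = floor(prop * length(idx)))]
  })
  sort(unlist(train_idx, use.names = FALSE))
}

# Toy outcome with the same class counts as the data (452 good, 197 poor).
outcome <- factor(rep(c("good", "poor"), times = c(452, 197)))

set.seed(42)
train_idx <- stratified_indices(outcome)
test_idx  <- setdiff(seq_along(outcome), train_idx)

# Class shares stay close to 70/30 in both parts.
print(round(prop.table(table(outcome[train_idx])), 3))
print(round(prop.table(table(outcome[test_idx])), 3))
```

With these class counts the sketch happens to produce 486 training and 163 test rows, the same sizes as the split above, although the particular rows selected will differ.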

To prepare the data for modeling, I create recipes with the packages loaded above. The correlation analysis showed that some variables are highly correlated with the outcome while others are not, so I prepare four recipes for four models: one with the full dataset, one with only the first- and second-period grades, one with the variables related to family affiliation, and one with only the variables outside family affiliation, such as school support.

library(workflows)
library(tidymodels)
## Registered S3 method overwritten by 'tune':
##   method                   from   
##   required_pkgs.model_spec parsnip
## ── Attaching packages ────────────────────────────────────── tidymodels 0.1.4 ──
## ✓ broom        0.7.9      ✓ recipes      0.1.17
## ✓ dials        0.0.10     ✓ tune         0.1.6 
## ✓ infer        1.0.0      ✓ workflowsets 0.1.0 
## ✓ modeldata    0.1.1      ✓ yardstick    0.0.9 
## ✓ parsnip      0.1.7
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## x scales::discard() masks purrr::discard()
## x dplyr::filter()   masks stats::filter()
## x recipes::fixed()  masks stringr::fixed()
## x dplyr::lag()      masks stats::lag()
## x yardstick::spec() masks readr::spec()
## x recipes::step()   masks stats::step()
## • Use tidymodels_prefer() to resolve common conflicts.
recipe_full_data<- recipe(performance ~ ., data = train_data) %>%
step_dummy(all_nominal_predictors())
recipe_G1_G2_data <- recipe(performance ~ G1+G2, data = train_data) %>%
step_dummy(all_nominal_predictors())
recipe_family_affiliation<- recipe(performance ~ Pstatus+Medu+Fedu+
famsup+internet+famrel,data = train_data) %>%
step_dummy(all_nominal_predictors())
recipe_external_affiliation<- recipe(performance ~ studytime+schoolsup+
paid+activities+higher+freetime+
goout+health+absences,
data = train_data) %>%
step_dummy(all_nominal_predictors())

Modelling

Setting up engine

A general model specification is set up so that it can be combined with the different recipes through workflows.

install.packages("workflows")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.1'
## (as 'lib' is unspecified)
library(parsnip)
lr_model<- logistic_reg() %>%
set_engine('glm')

lr_workflow_full<- workflow() %>%
add_model(lr_model) %>%
add_recipe(recipe_full_data)
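
Because the model specification is kept separate from the recipes, the same lr_model can be paired with each recipe in turn. The following is a minimal sketch of that pattern, assuming a small invented data frame and two simplified recipes; the names toy, toy_recipes and toy_fits are hypothetical and the random outcome carries no meaning.

```r
library(tidymodels)

# Invented data so the example runs on its own; the outcome is random.
set.seed(1)
toy <- data.frame(
  G1          = sample(5:18, 60, replace = TRUE),
  G2          = sample(5:18, 60, replace = TRUE),
  famsup      = factor(sample(c("no", "yes"), 60, replace = TRUE)),
  performance = factor(sample(c("good", "poor"), 60, replace = TRUE))
)

# One model specification, shared by all workflows.
lr_model <- logistic_reg() %>%
  set_engine("glm")

# A named list of recipes, one per set of predictors.
toy_recipes <- list(
  grades = recipe(performance ~ G1 + G2, data = toy),
  family = recipe(performance ~ famsup, data = toy) %>%
    step_dummy(all_nominal_predictors())
)

# Build and fit one workflow per recipe.
toy_fits <- lapply(toy_recipes, function(rec) {
  workflow() %>%
    add_model(lr_model) %>%
    add_recipe(rec) %>%
    fit(data = toy)
})

print(names(toy_fits))
```

The same loop could be run over the four recipes defined earlier, which keeps the fitting code identical across models and leaves only the predictor sets to vary.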

Model fitting to training data

The model is fitted to the training data.

set.seed(42)
lr_fit_full_data<- lr_workflow_full %>%
fit(data = train_data)
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
lr_fit_full_data %>%
extract_fit_parsnip() %>%
tidy()
## # A tibble: 19 × 5
##    term            estimate std.error   statistic p.value
##    <chr>              <dbl>     <dbl>       <dbl>   <dbl>
##  1 (Intercept)    471.         85575.  0.00551      0.996
##  2 Medu            -0.0517      4766. -0.0000108    1.00 
##  3 Fedu            -0.104       4823. -0.0000215    1.00 
##  4 studytime        0.220       5255.  0.0000418    1.00 
##  5 famrel           0.0855      4545.  0.0000188    1.00 
##  6 freetime         0.0503      4249.  0.0000118    1.00 
##  7 goout           -0.0897      3675. -0.0000244    1.00 
##  8 health          -0.0564      3077. -0.0000183    1.00 
##  9 absences        -0.00602      889. -0.00000678   1.00 
## 10 G1               0.0251      3104.  0.00000807   1.00 
## 11 G2               0.120       5331.  0.0000225    1.00 
## 12 G3             -45.0         9150. -0.00492      0.996
## 13 Pstatus_T       -0.0322     12502. -0.00000257   1.00 
## 14 schoolsup_yes   -0.202      11978. -0.0000169    1.00 
## 15 famsup_yes      -0.134       8460. -0.0000158    1.00 
## 16 paid_yes        -0.431      17439. -0.0000247    1.00 
## 17 activities_yes   0.500       8134.  0.0000614    1.00 
## 18 higher_yes      -0.114      11415. -0.0000100    1.00 
## 19 internet_yes     0.114       9134.  0.0000125    1.00
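
The two glm.fit warnings and the very large standard errors are what one would expect under complete separation: performance is defined directly from G3 (G3 >= 11), and the formula performance ~ . keeps G3 among the predictors, so the model can reproduce the label exactly. A hedged sketch of one way to keep G3 out of the predictors is shown below, using step_rm() from recipes. The toy data frame is invented so that the snippet runs on its own, and whether G1 and G2 should also be dropped is a modelling choice that this sketch does not settle.

```r
library(recipes)

# Invented stand-in for train_data; only the column roles matter here.
toy_train <- data.frame(
  studytime = c(2, 3, 1, 2, 4, 1, 2, 3),
  famsup    = factor(c("no", "yes", "no", "yes", "yes", "no", "yes", "no")),
  G3        = c(9, 12, 8, 14, 15, 10, 11, 13)
)
# Same rule as in the report: 11 or above counts as good.
toy_train$performance <- factor(ifelse(toy_train$G3 >= 11, "good", "poor"))

# step_rm() drops G3 before modelling, so the label no longer leaks in.
recipe_no_G3 <- recipe(performance ~ ., data = toy_train) %>%
  step_rm(G3) %>%
  step_dummy(all_nominal_predictors())

# prep() estimates the steps; bake(new_data = NULL) returns the processed
# training data.
baked <- bake(prep(recipe_no_G3), new_data = NULL)
print(names(baked))
```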

Testing the model with test data

The fitted model is then applied to the test data to make predictions.

predictions<- augment(lr_fit_full_data, test_data)

glimpse(predictions)
## Rows: 163
## Columns: 22
## $ Pstatus     <fct> T, A, T, A, T, T, T, A, T, T, T, T, T, T, A, T, T, T, T, T…
## $ Medu        <int> 2, 4, 4, 2, 4, 4, 2, 3, 4, 2, 4, 2, 4, 4, 2, 4, 4, 1, 3, 4…
## $ Fedu        <int> 2, 4, 4, 2, 4, 4, 2, 4, 4, 2, 4, 2, 2, 4, 1, 4, 2, 1, 1, 3…
## $ studytime   <int> 2, 2, 1, 3, 3, 1, 1, 2, 2, 2, 2, 1, 2, 1, 2, 2, 4, 2, 1, 2…
## $ schoolsup   <fct> no, yes, no, no, no, no, no, yes, no, no, no, yes, no, yes…
## $ famsup      <fct> no, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes,…
## $ paid        <fct> no, no, no, no, no, yes, no, yes, no, no, no, no, no, no, …
## $ activities  <fct> no, no, yes, no, yes, no, no, yes, yes, yes, yes, no, no, …
## $ higher      <fct> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes…
## $ internet    <fct> yes, no, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes,…
## $ famrel      <int> 4, 4, 4, 4, 3, 5, 1, 5, 4, 3, 4, 5, 4, 3, 5, 3, 3, 3, 5, 4…
## $ freetime    <int> 4, 1, 3, 5, 2, 4, 2, 3, 3, 3, 3, 4, 3, 3, 3, 2, 3, 3, 3, 3…
## $ goout       <int> 4, 4, 3, 2, 3, 2, 2, 3, 1, 3, 3, 1, 3, 4, 4, 2, 3, 4, 2, 3…
## $ health      <int> 3, 1, 5, 3, 2, 5, 5, 5, 5, 3, 5, 1, 5, 5, 2, 5, 3, 5, 5, 5…
## $ absences    <int> 0, 2, 0, 0, 10, 0, 6, 2, 2, 16, 0, 0, 4, 0, 2, 8, 0, 2, 0,…
## $ G1          <int> 13, 10, 12, 14, 13, 11, 10, 12, 15, 11, 14, 9, 11, 13, 12,…
## $ G2          <int> 12, 13, 13, 14, 13, 12, 11, 12, 15, 11, 15, 10, 12, 12, 13…
## $ G3          <int> 13, 13, 12, 15, 14, 12, 12, 13, 15, 10, 15, 10, 13, 12, 12…
## $ performance <fct> good, good, good, good, good, good, good, good, good, poor…
## $ .pred_class <fct> good, good, good, good, good, good, good, good, good, poor…
## $ .pred_good  <dbl> 1.000000e+00, 1.000000e+00, 1.000000e+00, 1.000000e+00, 1.…
## $ .pred_poor  <dbl> 2.220446e-16, 2.220446e-16, 2.220446e-16, 2.220446e-16, 2.…

Checking model accuracy by confusion matrix

It is important to see how the models we designed have performed. For this purpose, we can use a confusion matrix or the overall accuracy.

predictions %>%
  conf_mat(performance, .pred_class)
##           Truth
## Prediction good poor
##       good  113    0
##       poor    0   50
# or, as a single summary metric
predictions %>%
  select(performance, .pred_class) %>%
  accuracy(truth = performance, .pred_class)
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary             1
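
A perfect score, together with the very large standard errors and p-values close to 1 in the coefficient table above, is the pattern a logistic regression usually shows under complete separation. A minimal sketch of that effect follows, assuming that `performance` was derived from `G3` by a cutoff. The toy data and the cutoff of 10 are hypothetical and are not taken from the dataset.

```r
# Toy data (hypothetical): the outcome is a deterministic function of G3
toy <- data.frame(G3 = 0:20)
toy$performance <- factor(ifelse(toy$G3 > 10, "good", "poor"),
                          levels = c("good", "poor"))

# glm() models the second factor level ("poor"); warnings about fitted
# probabilities of 0 or 1 are expected here and are suppressed on purpose
toy_fit <- suppressWarnings(
  glm(performance ~ G3, data = toy, family = binomial)
)

# The coefficient table shows the inflated standard errors
print(summary(toy_fit)$coefficients)

# Classify at 0.5 and compute accuracy on the same data
toy_pred <- ifelse(predict(toy_fit, type = "response") > 0.5, "poor", "good")
toy_accuracy <- mean(toy_pred == toy$performance)
print(toy_accuracy)
```

If this is what happens in the full model, its accuracy mainly reflects that `G3` is among the predictors and says little about the other variables.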

Fitting and predicting with only G1 and G2 variables

lr_workflow_G1_G2 <- workflow() %>%
  add_model(lr_model) %>%
  add_recipe(recipe_G1_G2_data)

set.seed(42)
lr_fit_G1_G2 <- lr_workflow_G1_G2 %>%
  fit(data = train_data)

# Coefficient table of the fitted model
lr_fit_G1_G2 %>%
  extract_fit_parsnip() %>%
  tidy()
## # A tibble: 3 × 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   19.0       2.05       9.26 2.05e-20
## 2 G1            -0.444     0.137     -3.23 1.24e- 3
## 3 G2            -1.44      0.196     -7.35 2.00e-13
predictions_G1_G2 <- augment(lr_fit_G1_G2, test_data) 
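
To make the coefficients above easier to read, they can be converted to odds ratios and to predicted probabilities. The sketch below uses the rounded estimates from the table and assumes that the modelled event is `poor` (the second factor level), which is consistent with the negative signs on `G1` and `G2`. The helper `p_poor()` is a hypothetical name introduced only for this illustration.

```r
# Rounded estimates copied from the coefficient table above
coefs <- c("(Intercept)" = 19.0, G1 = -0.444, G2 = -1.44)

# Odds ratio per one-point increase in each grade (assumed event: "poor")
odds_ratios <- exp(coefs[c("G1", "G2")])
print(round(odds_ratios, 3))

# Hypothetical helper: predicted probability of "poor" for given grades
p_poor <- function(G1, G2) {
  plogis(coefs[["(Intercept)"]] + coefs[["G1"]] * G1 + coefs[["G2"]] * G2)
}

print(p_poor(10, 10))  # a student with 10 in both terms
print(p_poor(12, 12))  # a student with 12 in both terms
```

Both odds ratios are below 1, so each additional point in the first or second term lowers the odds of a poor final result, with `G2` having the stronger effect.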

Checking model accuracy by confusion matrix

predictions_G1_G2 %>%
  conf_mat(performance, .pred_class)
##           Truth
## Prediction good poor
##       good  105    4
##       poor    8   46
predictions_G1_G2 %>%
  select(performance, .pred_class) %>%
  accuracy(truth = performance, .pred_class)
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.926

Fitting and predicting with family affiliation variables

lr_workflow_family <- workflow() %>%
  add_model(lr_model) %>%
  add_recipe(recipe_family_affiliation)

set.seed(42)
lr_fit_family_data <- lr_workflow_family %>%
  fit(data = train_data)

# Coefficient table of the fitted model
lr_fit_family_data %>%
  extract_fit_parsnip() %>%
  tidy()
## # A tibble: 7 × 5
##   term         estimate std.error statistic p.value
##   <chr>           <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)   1.31        0.591    2.22    0.0262
## 2 Medu         -0.295       0.121   -2.44    0.0148
## 3 Fedu         -0.284       0.126   -2.26    0.0236
## 4 famrel       -0.142       0.112   -1.27    0.204 
## 5 Pstatus_T    -0.124       0.316   -0.393   0.694 
## 6 famsup_yes   -0.00302     0.211   -0.0143  0.989 
## 7 internet_yes -0.196       0.245   -0.799   0.424
predictions_family_vars <- augment(lr_fit_family_data, test_data)

Checking model accuracy by confusion matrix

predictions_family_vars %>%
  conf_mat(performance, .pred_class)
##           Truth
## Prediction good poor
##       good  103   43
##       poor   10    7
predictions_family_vars %>%
  select(performance, .pred_class) %>%
  accuracy(truth = performance, .pred_class)
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.675
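
Accuracy alone hides how differently this model treats the two classes. As a rough illustration, the per-class figures can be recomputed in base R from the confusion matrix printed above. The counts are copied from that output, and the object names are chosen only for this sketch.

```r
# Confusion matrix of the family model (rows = prediction, columns = truth)
cm_family <- matrix(
  c(103, 10,   # truth = good: predicted good, predicted poor
     43,  7),  # truth = poor: predicted good, predicted poor
  nrow = 2,
  dimnames = list(Prediction = c("good", "poor"),
                  Truth = c("good", "poor"))
)

# Share of all test observations classified correctly
accuracy_family <- sum(diag(cm_family)) / sum(cm_family)

# Share of each true class that the model recovers (recall per class)
recall_family <- diag(cm_family) / colSums(cm_family)

# Share of each predicted class that is correct (precision per class)
precision_family <- diag(cm_family) / rowSums(cm_family)

print(round(accuracy_family, 3))
print(round(recall_family, 3))
print(round(precision_family, 3))
```

The model identifies only 7 of the 50 poorly performing students in the test data, so most of its accuracy comes from predicting `good` for nearly everyone.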

Fitting and predicting with out of family affiliation variables

lr_workflow_external <- workflow() %>%
  add_model(lr_model) %>%
  add_recipe(recipe_external_affiliation)

set.seed(42)
lr_fit_external <- lr_workflow_external %>%
  fit(data = train_data)

# Coefficient table of the fitted model
lr_fit_external %>%
  extract_fit_parsnip() %>%
  tidy()
## # A tibble: 10 × 5
##    term           estimate std.error statistic       p.value
##    <chr>             <dbl>     <dbl>     <dbl>         <dbl>
##  1 (Intercept)     0.560      0.641     0.873  0.382        
##  2 studytime      -0.376      0.143    -2.63   0.00859      
##  3 freetime        0.152      0.113     1.35   0.178        
##  4 goout           0.00267    0.0985    0.0271 0.978        
##  5 health          0.116      0.0791    1.47   0.142        
##  6 absences        0.0678     0.0240    2.82   0.00477      
##  7 schoolsup_yes   0.479      0.327     1.46   0.143        
##  8 paid_yes        0.254      0.445     0.572  0.568        
##  9 activities_yes -0.183      0.222    -0.822  0.411        
## 10 higher_yes     -2.13       0.360    -5.91   0.00000000352
predictions_external_vars <- augment(lr_fit_external, test_data)

Checking model accuracy by confusion matrix

predictions_external_vars %>%
  conf_mat(performance, .pred_class)
##           Truth
## Prediction good poor
##       good  107   39
##       poor    6   11
predictions_external_vars %>%
  select(performance, .pred_class) %>%
  accuracy(truth = performance, .pred_class)
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.724

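After making predictions with all four models, we get a mixed set of results. The full model classified every test observation correctly (accuracy of 1), but it includes G3 among its predictors and its coefficients have very large standard errors with p-values close to 1, so this score should be read with caution. The model using only G1 and G2 reached an accuracy of about 93%, while the models based only on family affiliation or only on out-of-family affiliation variables performed poorly, at about 68% and 72% respectively.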
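
To put these accuracies in context, they can be compared with a simple majority-class baseline that always predicts `good`. The sketch below rebuilds the four confusion matrices from the outputs above. The baseline and the object names are our own additions for illustration and were not part of the modelling workflow.

```r
# Confusion matrices copied from the outputs above
# (filled column by column: truth = good first, then truth = poor)
conf_mats <- list(
  full           = matrix(c(113, 0, 0, 50), nrow = 2),
  G1_G2          = matrix(c(105, 8, 4, 46), nrow = 2),
  family         = matrix(c(103, 10, 43, 7), nrow = 2),
  outside_family = matrix(c(107, 6, 39, 11), nrow = 2)
)

# Accuracy = correctly classified observations / all observations
model_accuracy <- sapply(conf_mats, function(m) sum(diag(m)) / sum(m))

# Baseline: always predict "good" (113 of the 163 test students are good)
baseline_accuracy <- 113 / 163

print(round(model_accuracy, 3))
print(round(baseline_accuracy, 3))
print(round(model_accuracy - baseline_accuracy, 3))
```

Under this comparison the family model falls slightly below the baseline and the out-of-family model is only a few points above it, which supports reading both as weak predictors on their own.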

Communication

Findings and insights

To answer our first question, predictions were made with all the variables to see how they jointly relate to the final grade. This model reached an accuracy of 1 on the test data, and the first- and second-term grades alone reached about 93%, so these grades account for most of the final grade outcome. The predictions also show that family affiliation or out-of-family affiliation variables alone cannot predict students' final grade performance well. More specifically, the first- and second-term grade predictors contributed most to predicting the final grade. Hence, we assume from the analysis that if family support, school support and family orientation are properly aligned and ensured for students in the first and second terms, they will have an impact on final grade performance. All kinds of support and a suitable environment should therefore be ensured in the first and second terms to help the group of poorly performing students.

Future research

  1. What other factors, possibly still unidentified, are responsible for the poor performance of a large group of students even when they receive seemingly good support and a study-friendly environment?
  2. Given that all students have equal access to study resources and support in schools, is there a way to investigate whether students of a certain race systematically lag behind and make up much of the large group of poorly performing students?

Limitations of the study

  1. Some variables are still not documented clearly enough; better documentation would help in exploring the dataset.
  2. The dataset is limited to two groups of Portuguese schools, so it cannot depict the full picture of students' final grade performance in other subject areas and regions.

Citations

Singh, S. P., Malik, S., & Singh, P. (2016). Research paper factors affecting academic performance of students. Indian Journal of Research, 5(4), 176-178.

Ali, N., Jusof, K., Ali, S., Mokhtar, N., & Salamat, A. S. A. (2009). The factors influencing students' performance at Universiti Teknologi MARA Kedah, Malaysia. Management Science and Engineering, 3(4), 81-90.

Abdullah, A. M. (2011). Factors affecting business students' performance in Arab Open University: The case of Kuwait. International Journal of Business and Management, 6(5), 146.

Applegate, C., & Daly, A. (2006). The impact of paid work on the academic performance of students: A case study from the University of Canberra. Australian Journal of Education, 50(2), 155-166.

Hedjazi, Y., & Omidi, M. (2008). Factors affecting the academic success of agricultural students at University of Tehran, Iran. Journal of Agricultural Science and Technology, 10(3), 205-214.

Ramadan, S., & Quraan, A. (1994). Determinants of students’ performance in introductory accounting courses. Journal of King Saud University, 6(2), 65-80.

Al-Rofo, M. A. (2010). The dimensions that affect the students’ low accumulative average in Tafila Technical University. Journal of Social Sciences, 22(1), 53-59.

Naser, K., & Peel, M. J. (1998). An exploratory study of the impact of intervening variables on student performance in a principles of accounting course. Accounting Education, 7(3), 209-223.

Khan, R. M. A., Iqbal, N., & Tasneem, S. (2015). The Influence of Parents Educational Level on Secondary School Students Academic Achievements in District Rajanpur. Journal of Education and Practice, 6(16), 76-79.

Konstantopoulos, S., & Borman, G. (2011). Family background and school effects on student achievement: A multilevel analysis of the Coleman data. Teachers College Record, 113(1), 97-132.