Several factors may affect students’ academic performance at high school. Academic performance matters because the whole education system revolves around it. Singh, Malik and Singh (2016) also argued that there is a direct link between students’ performance and the social and economic development of a country, which draws further attention to students’ academic performance. Another study noted that, because graduate students are the future leaders of society, their performance plays a vital role in society (Ali et al., 2009). Several studies have examined students’ academic performance (Applegate and Daly, 2006; Hedjazi and Omidi, 2008; Ramadan and Quraan, 1994; Al-Rofo, 2010; Naser and Peel, 1998; Abdullah, 2005). Since students, teachers, institutions and parents all play a part in the learning process, it is important to recognize the relative importance of these variables. For instance, Jencks (1972) argued that the family plays an important role in formal and informal education. However, the role of the family depends on the lens through which we look. Family characteristics cover a number of variables, such as education, income, beliefs, jobs and the number of siblings, which also have an impact on student performance (Khan et al., 2015). Influential research has reported that the social and economic status of parents is the best predictor of student academic achievement (Coleman et al., 1966). Although parental background is significant, we should not ignore other factors, such as a student’s study time, travel time, etc.
Therefore, this study aims to answer three questions. First, which of the variables has the most impact on students’ performance? Second, do variables related to family affiliation (e.g. parents’ jobs, parents’ education, etc.) affect final grades more than variables outside family affiliation (e.g. going out with friends, workday alcohol consumption, etc.)? Third, do other factors have a more direct impact on students’ final performance?
This data set addresses student achievement in secondary education at two Portuguese schools. The data attributes include student grades and demographic, social and school-related features, and the data were collected using school reports and questionnaires. Two datasets are provided regarding the performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). The data were extracted from P. Cortez and A. Silva. Using Data Mining to Predict Secondary School Student Performance. In A. Brito and J. Teixeira Eds., Proceedings of 5th FUtureBUsinessTEChnology Conference (FUBUTEC 2008) pp. 5-12, Porto, Portugal, April, 2008, EUROSIS, ISBN 978-9077381-39-7.
Attributes for both student-mat.csv (Math course) and student-por.csv (Portuguese language course) datasets:
1 school - student’s school (binary: ‘GP’ - Gabriel Pereira or ‘MS’ - Mousinho da Silveira)
2 sex - student’s sex (binary: ‘F’ - female or ‘M’ - male)
3 age - student’s age (numeric: from 15 to 22)
4 address - student’s home address type (binary: ‘U’ - urban or ‘R’ - rural)
5 famsize - family size (binary: ‘LE3’ - less or equal to 3 or ‘GT3’ - greater than 3)
6 Pstatus - parent’s cohabitation status (binary: ‘T’ - living together or ‘A’ - apart)
7 Medu - mother’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
8 Fedu - father’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
9 Mjob - mother’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
10 Fjob - father’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
11 reason - reason to choose this school (nominal: close to ‘home’, school ‘reputation’, ‘course’ preference or ‘other’)
12 guardian - student’s guardian (nominal: ‘mother’, ‘father’ or ‘other’)
13 traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
14 studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
15 failures - number of past class failures (numeric: n if 1<=n<3, else 4)
16 schoolsup - extra educational support (binary: yes or no)
17 famsup - family educational support (binary: yes or no)
18 paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
19 activities - extra-curricular activities (binary: yes or no)
20 nursery - attended nursery school (binary: yes or no)
21 higher - wants to take higher education (binary: yes or no)
22 internet - Internet access at home (binary: yes or no)
23 romantic - with a romantic relationship (binary: yes or no)
24 famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
25 freetime - free time after school (numeric: from 1 - very low to 5 - very high)
26 goout - going out with friends (numeric: from 1 - very low to 5 - very high)
27 Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
28 Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
29 health - current health status (numeric: from 1 - very bad to 5 - very good)
30 absences - number of school absences (numeric: from 0 to 93)
These grades are related to the course subject, Math or Portuguese:
31 G1 - first period grade (numeric: from 0 to 20)
32 G2 - second period grade (numeric: from 0 to 20)
33 G3 - final grade (numeric: from 0 to 20, output target)
I hope the results of this study can help parents, teachers and students understand the external and internal variables that may affect students’ success, and that they can even help recognize students at risk in future research. They can also help schools and universities assess students’ performance in light of the variables that affect it. In addition, parental involvement and engagement in education matter now more than ever because they are in decline. In 2016, research showed a drop in the share of parents who believe that close parent-teacher communication is effective. Therefore, by showing the importance of the parental effect, I hope we can encourage parents to take action in this regard.
First, I load the packages used in this project. Packages such as tidyverse, tidymodels, workflows and rsample are installed and used for data inspection, summary statistics, cleaning, transformation and modelling for the final predictions.
install.packages("workflows")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.1'
## (as 'lib' is unspecified)
install.packages("rsample")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.1'
## (as 'lib' is unspecified)
install.packages("tidymodels")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.1'
## (as 'lib' is unspecified)
install.packages("tidyverse")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.1'
## (as 'lib' is unspecified)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.6 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.4 ✓ stringr 1.4.0
## ✓ readr 2.0.1 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(skimr)
grade<- read.csv('data/grade.csv')
glimpse(grade)
## Rows: 649
## Columns: 33
## $ school <chr> "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP",…
## $ sex <chr> "F", "F", "F", "F", "F", "M", "M", "F", "M", "M", "F", "F",…
## $ age <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15, 15, 15,…
## $ address <chr> "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "U",…
## $ famsize <chr> "GT3", "GT3", "LE3", "GT3", "GT3", "LE3", "LE3", "GT3", "LE…
## $ Pstatus <chr> "A", "T", "T", "T", "T", "T", "T", "A", "A", "T", "T", "T",…
## $ Medu <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, 3, 3, 4,…
## $ Fedu <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, 3, 2, 3,…
## $ Mjob <chr> "at_home", "at_home", "at_home", "health", "other", "servic…
## $ Fjob <chr> "teacher", "other", "other", "services", "other", "other", …
## $ reason <chr> "course", "course", "other", "home", "home", "reputation", …
## $ guardian <chr> "mother", "father", "mother", "mother", "father", "mother",…
## $ traveltime <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, 3, 1, 1,…
## $ studytime <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, 2, 1, 1,…
## $ failures <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0,…
## $ schoolsup <chr> "yes", "no", "yes", "no", "no", "no", "no", "yes", "no", "n…
## $ famsup <chr> "no", "yes", "no", "yes", "yes", "yes", "no", "yes", "yes",…
## $ paid <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no",…
## $ activities <chr> "no", "no", "no", "yes", "no", "yes", "no", "no", "no", "ye…
## $ nursery <chr> "yes", "no", "yes", "yes", "yes", "yes", "yes", "yes", "yes…
## $ higher <chr> "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "ye…
## $ internet <chr> "no", "yes", "yes", "yes", "no", "yes", "yes", "no", "yes",…
## $ romantic <chr> "no", "no", "no", "yes", "no", "no", "no", "no", "no", "no"…
## $ famrel <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, 5, 5, 3,…
## $ freetime <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, 3, 5, 1,…
## $ goout <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, 2, 5, 3,…
## $ Dalc <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1,…
## $ Walc <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, 1, 4, 3,…
## $ health <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, 4, 5, 5,…
## $ absences <int> 4, 2, 6, 0, 0, 6, 0, 2, 0, 0, 2, 0, 0, 0, 0, 6, 10, 2, 2, 6…
## $ G1 <int> 0, 9, 12, 14, 11, 12, 13, 10, 15, 12, 14, 10, 12, 12, 14, 1…
## $ G2 <int> 11, 11, 13, 14, 13, 12, 12, 13, 16, 12, 14, 12, 13, 12, 14,…
## $ G3 <int> 11, 11, 12, 14, 13, 13, 13, 13, 17, 13, 14, 13, 12, 13, 15,…
names(grade)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "guardian" "traveltime" "studytime" "failures"
## [16] "schoolsup" "famsup" "paid" "activities" "nursery"
## [21] "higher" "internet" "romantic" "famrel" "freetime"
## [26] "goout" "Dalc" "Walc" "health" "absences"
## [31] "G1" "G2" "G3"
There are 33 variables in total, of numeric and character types. Some are binary, and some take values in numeric ranges.
Since not all variables are relevant, some will be removed. I keep the variables that I believe have the most potential to affect students’ performance.
grade_vars<- grade %>%
select(Pstatus,Medu,Fedu,studytime,schoolsup,
famsup,paid,activities,higher,internet,famrel,freetime,goout,
health,absences,G1,G2,G3)
cor(grade_vars$famrel, grade_vars$G3)
## [1] 0.06336113
cor(grade_vars$absences, grade_vars$G3)
## [1] -0.09137906
cor(grade_vars$G1, grade_vars$G3)
## [1] 0.8263871
cor(grade_vars$G2, grade_vars$G3)
## [1] 0.918548
ggplot(grade_vars, aes(G1, G3)) + geom_point() +
geom_smooth(method = lm)
## `geom_smooth()` using formula 'y ~ x'
ggplot(grade_vars, aes(G2, G3)) + geom_point() +
geom_smooth(method = lm)
## `geom_smooth()` using formula 'y ~ x'
skim(grade_vars)
| Name | grade_vars |
| Number of rows | 649 |
| Number of columns | 18 |
| _______________________ | |
| Column type frequency: | |
| character | 7 |
| numeric | 11 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Pstatus | 0 | 1 | 1 | 1 | 0 | 2 | 0 |
| schoolsup | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
| famsup | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
| paid | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
| activities | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
| higher | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
| internet | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Medu | 0 | 1 | 2.51 | 1.13 | 0 | 2 | 2 | 4 | 4 | ▁▆▇▆▇ |
| Fedu | 0 | 1 | 2.31 | 1.10 | 0 | 1 | 2 | 3 | 4 | ▁▆▇▅▅ |
| studytime | 0 | 1 | 1.93 | 0.83 | 1 | 1 | 2 | 2 | 4 | ▆▇▁▂▁ |
| famrel | 0 | 1 | 3.93 | 0.96 | 1 | 4 | 4 | 5 | 5 | ▁▁▂▇▅ |
| freetime | 0 | 1 | 3.18 | 1.05 | 1 | 3 | 3 | 4 | 5 | ▂▃▇▆▂ |
| goout | 0 | 1 | 3.18 | 1.18 | 1 | 2 | 3 | 4 | 5 | ▂▆▇▆▅ |
| health | 0 | 1 | 3.54 | 1.45 | 1 | 2 | 4 | 5 | 5 | ▃▂▃▃▇ |
| absences | 0 | 1 | 3.66 | 4.64 | 0 | 0 | 2 | 6 | 32 | ▇▂▁▁▁ |
| G1 | 0 | 1 | 11.40 | 2.75 | 0 | 10 | 11 | 13 | 19 | ▁▂▇▇▁ |
| G2 | 0 | 1 | 11.57 | 2.91 | 0 | 10 | 11 | 13 | 19 | ▁▁▇▇▂ |
| G3 | 0 | 1 | 11.91 | 3.23 | 0 | 10 | 12 | 14 | 19 | ▁▁▇▇▂ |
summary(grade_vars)
## Pstatus Medu Fedu studytime
## Length:649 Min. :0.000 Min. :0.000 Min. :1.000
## Class :character 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:1.000
## Mode :character Median :2.000 Median :2.000 Median :2.000
## Mean :2.515 Mean :2.307 Mean :1.931
## 3rd Qu.:4.000 3rd Qu.:3.000 3rd Qu.:2.000
## Max. :4.000 Max. :4.000 Max. :4.000
## schoolsup famsup paid activities
## Length:649 Length:649 Length:649 Length:649
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## higher internet famrel freetime
## Length:649 Length:649 Min. :1.000 Min. :1.00
## Class :character Class :character 1st Qu.:4.000 1st Qu.:3.00
## Mode :character Mode :character Median :4.000 Median :3.00
## Mean :3.931 Mean :3.18
## 3rd Qu.:5.000 3rd Qu.:4.00
## Max. :5.000 Max. :5.00
## goout health absences G1
## Min. :1.000 Min. :1.000 Min. : 0.000 Min. : 0.0
## 1st Qu.:2.000 1st Qu.:2.000 1st Qu.: 0.000 1st Qu.:10.0
## Median :3.000 Median :4.000 Median : 2.000 Median :11.0
## Mean :3.185 Mean :3.536 Mean : 3.659 Mean :11.4
## 3rd Qu.:4.000 3rd Qu.:5.000 3rd Qu.: 6.000 3rd Qu.:13.0
## Max. :5.000 Max. :5.000 Max. :32.000 Max. :19.0
## G2 G3
## Min. : 0.00 Min. : 0.00
## 1st Qu.:10.00 1st Qu.:10.00
## Median :11.00 Median :12.00
## Mean :11.57 Mean :11.91
## 3rd Qu.:13.00 3rd Qu.:14.00
## Max. :19.00 Max. :19.00
glimpse(grade_vars)
## Rows: 649
## Columns: 18
## $ Pstatus <chr> "A", "T", "T", "T", "T", "T", "T", "A", "A", "T", "T", "T",…
## $ Medu <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, 3, 3, 4,…
## $ Fedu <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, 3, 2, 3,…
## $ studytime <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, 2, 1, 1,…
## $ schoolsup <chr> "yes", "no", "yes", "no", "no", "no", "no", "yes", "no", "n…
## $ famsup <chr> "no", "yes", "no", "yes", "yes", "yes", "no", "yes", "yes",…
## $ paid <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no",…
## $ activities <chr> "no", "no", "no", "yes", "no", "yes", "no", "no", "no", "ye…
## $ higher <chr> "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "ye…
## $ internet <chr> "no", "yes", "yes", "yes", "no", "yes", "yes", "no", "yes",…
## $ famrel <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, 5, 5, 3,…
## $ freetime <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, 3, 5, 1,…
## $ goout <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, 2, 5, 3,…
## $ health <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, 4, 5, 5,…
## $ absences <int> 4, 2, 6, 0, 0, 6, 0, 2, 0, 0, 2, 0, 0, 0, 0, 6, 10, 2, 2, 6…
## $ G1 <int> 0, 9, 12, 14, 11, 12, 13, 10, 15, 12, 14, 10, 12, 12, 14, 1…
## $ G2 <int> 11, 11, 13, 14, 13, 12, 12, 13, 16, 12, 14, 12, 13, 12, 14,…
## $ G3 <int> 11, 11, 12, 14, 13, 13, 13, 13, 17, 13, 14, 13, 12, 13, 15,…
G3 is the final grade and the potential dependent variable to be predicted. The correlations computed above show that the first-period grade G1 and the second-period grade G2 are highly correlated with the outcome variable G3 (about 0.83 and 0.92), while the other variables checked, famrel and absences, are only weakly correlated with it (about 0.06 and -0.09). Nevertheless, for the prediction of final grade performance, we will take all variables into account and see how they contribute to the final grade in light of the three questions to be answered.
In this part, I define the outcome variable for the prediction. I create a new variable, performance, that classifies students into two categories: ‘good’ (G3 of 11 or higher) and ‘poor’ (G3 below 11). My aim is to see how many students perform poorly despite receiving all kinds of support, in which case the responsible authorities may need to take other measures to tackle the poor performance of some students.
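The pairwise `cor()` calls above can be generalised to every numeric column at once. The following is a minimal sketch on a small made-up data frame (`toy_grades` and its values are hypothetical and are not taken from the study data); applied to `grade_vars`, the same few lines would rank all numeric predictors by the strength of their correlation with G3.

```r
# Toy data standing in for grade_vars (values are illustrative only).
toy_grades <- data.frame(
  studytime = c(1, 2, 2, 3, 4, 1),
  absences  = c(10, 4, 6, 2, 0, 8),
  G1        = c(8, 11, 10, 14, 16, 9),
  G3        = c(9, 12, 11, 15, 17, 8),
  schoolsup = c("yes", "no", "no", "no", "no", "yes")
)

# Keep the numeric columns only, then correlate each one with G3.
numeric_cols <- toy_grades[sapply(toy_grades, is.numeric)]
predictors   <- numeric_cols[names(numeric_cols) != "G3"]
cor_with_G3  <- sapply(predictors, function(x) cor(x, numeric_cols$G3))

# Sort by absolute correlation, strongest first.
cor_with_G3 <- cor_with_G3[order(abs(cor_with_G3), decreasing = TRUE)]
print(round(cor_with_G3, 2))
```

Character columns such as `schoolsup` are skipped here; they would need to be encoded numerically before a correlation could be computed for them.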
grade_vars<- grade_vars %>%
mutate(performance = if_else(G3 >= 11, 'good', 'poor'))
glimpse(grade_vars)
## Rows: 649
## Columns: 19
## $ Pstatus <chr> "A", "T", "T", "T", "T", "T", "T", "A", "A", "T", "T", "T"…
## $ Medu <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, 3, 3, 4…
## $ Fedu <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, 3, 2, 3…
## $ studytime <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, 2, 1, 1…
## $ schoolsup <chr> "yes", "no", "yes", "no", "no", "no", "no", "yes", "no", "…
## $ famsup <chr> "no", "yes", "no", "yes", "yes", "yes", "no", "yes", "yes"…
## $ paid <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no"…
## $ activities <chr> "no", "no", "no", "yes", "no", "yes", "no", "no", "no", "y…
## $ higher <chr> "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "y…
## $ internet <chr> "no", "yes", "yes", "yes", "no", "yes", "yes", "no", "yes"…
## $ famrel <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, 5, 5, 3…
## $ freetime <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, 3, 5, 1…
## $ goout <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, 2, 5, 3…
## $ health <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, 4, 5, 5…
## $ absences <int> 4, 2, 6, 0, 0, 6, 0, 2, 0, 0, 2, 0, 0, 0, 0, 6, 10, 2, 2, …
## $ G1 <int> 0, 9, 12, 14, 11, 12, 13, 10, 15, 12, 14, 10, 12, 12, 14, …
## $ G2 <int> 11, 11, 13, 14, 13, 12, 12, 13, 16, 12, 14, 12, 13, 12, 14…
## $ G3 <int> 11, 11, 12, 14, 13, 13, 13, 13, 17, 13, 14, 13, 12, 13, 15…
## $ performance <chr> "good", "good", "good", "good", "good", "good", "good", "g…
To tidy the data set further, I convert the character variables to factors, because categorical variables are modelled differently from continuous ones. I then inspect the data set with the glimpse function.
grade_vars<- grade_vars %>%
mutate_if(is.character, as.factor)
glimpse(grade_vars)
## Rows: 649
## Columns: 19
## $ Pstatus <fct> A, T, T, T, T, T, T, A, A, T, T, T, T, T, A, T, T, T, T, T…
## $ Medu <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, 3, 3, 4…
## $ Fedu <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, 3, 2, 3…
## $ studytime <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, 2, 1, 1…
## $ schoolsup <fct> yes, no, yes, no, no, no, no, yes, no, no, no, no, no, no,…
## $ famsup <fct> no, yes, no, yes, yes, yes, no, yes, yes, yes, yes, yes, y…
## $ paid <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, no…
## $ activities <fct> no, no, no, yes, no, yes, no, no, no, yes, no, yes, yes, n…
## $ higher <fct> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes…
## $ internet <fct> no, yes, yes, yes, no, yes, yes, no, yes, yes, yes, yes, y…
## $ famrel <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, 5, 5, 3…
## $ freetime <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, 3, 5, 1…
## $ goout <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, 2, 5, 3…
## $ health <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, 4, 5, 5…
## $ absences <int> 4, 2, 6, 0, 0, 6, 0, 2, 0, 0, 2, 0, 0, 0, 0, 6, 10, 2, 2, …
## $ G1 <int> 0, 9, 12, 14, 11, 12, 13, 10, 15, 12, 14, 10, 12, 12, 14, …
## $ G2 <int> 11, 11, 13, 14, 13, 12, 12, 13, 16, 12, 14, 12, 13, 12, 14…
## $ G3 <int> 11, 11, 12, 14, 13, 13, 13, 13, 17, 13, 14, 13, 12, 13, 15…
## $ performance <fct> good, good, good, good, good, good, good, good, good, good…
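As a side note, from dplyr 1.0.0 onwards `mutate_if()` is marked as superseded in favour of `across()`. The sketch below shows the equivalent call on a hypothetical mini data set (`toy` is not the study data). It also fixes the level order of `performance` explicitly, which is an optional extra step: base R's `glm()` treats the first level of a factor response as the reference class, so the order determines which class the coefficients describe.

```r
library(dplyr)

# Hypothetical mini data set (not the study data).
toy <- tibble(
  higher = c("yes", "no", "yes", "yes"),
  G3     = c(14L, 8L, 11L, 10L)
)

toy <- toy %>%
  # Same cut-off as in the text: 11 or more counts as 'good'.
  mutate(performance = if_else(G3 >= 11, "good", "poor")) %>%
  # across() + where() is the current replacement for mutate_if().
  mutate(across(where(is.character), as.factor)) %>%
  # Make the level order explicit instead of relying on alphabetical order.
  mutate(performance = factor(performance, levels = c("good", "poor")))

levels(toy$performance)
```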
The next step is to compute the proportion of students in each performance category.
grade_vars %>%
count(performance) %>%
mutate(performance_proportion = n/sum(n))
## performance n performance_proportion
## 1 good 452 0.6964561
## 2 poor 197 0.3035439
The output above shows that around 30 percent of the students performed poorly and around 70 percent performed well. With this in mind, models with different sets of variables will be fitted to see how accurately they predict performance.
It is recommended practice in machine learning to split the cleaned dataset into training and test sets so that model performance can be assessed without bias. The dataset is therefore split by random sampling, stratified by performance; by default, around 75% of the full dataset goes to the training set and 25% to the test set.
library(rsample)
set.seed(42)
grade_split<- initial_split(grade_vars, strata = performance)
grade_split
## <Analysis/Assess/Total>
## <486/163/649>
train_data<- training(grade_split)
test_data<- testing(grade_split)
glimpse(train_data)
## Rows: 486
## Columns: 19
## $ Pstatus <fct> A, T, T, T, T, T, A, T, T, T, T, T, T, T, T, T, T, T, T, T…
## $ Medu <int> 4, 1, 1, 4, 3, 4, 3, 3, 4, 2, 4, 4, 3, 4, 4, 4, 2, 4, 4, 4…
## $ Fedu <int> 4, 1, 1, 2, 3, 3, 2, 4, 4, 1, 3, 4, 3, 3, 3, 2, 2, 2, 4, 4…
## $ studytime <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 3, 2, 1, 2, 1, 2, 2, 1, 1, 2, 2…
## $ schoolsup <fct> yes, no, yes, no, no, no, no, no, no, no, no, no, yes, no,…
## $ famsup <fct> no, yes, no, yes, yes, yes, yes, yes, yes, yes, yes, yes, …
## $ paid <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, no…
## $ activities <fct> no, no, no, yes, no, yes, no, yes, no, yes, no, no, yes, y…
## $ higher <fct> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes…
## $ internet <fct> no, yes, yes, yes, no, yes, yes, yes, yes, yes, yes, yes, …
## $ famrel <int> 4, 5, 4, 3, 4, 5, 4, 5, 3, 5, 5, 4, 5, 3, 4, 4, 4, 2, 4, 5…
## $ freetime <int> 3, 3, 3, 2, 3, 4, 2, 5, 3, 2, 4, 4, 3, 1, 4, 5, 2, 2, 4, 4…
## $ goout <int> 4, 3, 2, 2, 2, 2, 2, 1, 3, 2, 3, 4, 2, 3, 1, 1, 2, 4, 5, 2…
## $ health <int> 3, 3, 3, 5, 5, 5, 1, 5, 2, 4, 3, 2, 4, 5, 1, 5, 5, 1, 5, 5…
## $ absences <int> 4, 2, 6, 0, 0, 6, 0, 0, 2, 0, 0, 6, 2, 6, 0, 0, 8, 0, 4, 0…
## $ G1 <int> 0, 9, 12, 14, 11, 12, 15, 12, 14, 10, 12, 17, 13, 12, 12, …
## $ G2 <int> 11, 11, 13, 14, 13, 12, 16, 12, 14, 12, 12, 17, 14, 12, 13…
## $ G3 <int> 11, 11, 12, 14, 13, 13, 17, 13, 14, 13, 13, 17, 14, 12, 14…
## $ performance <fct> good, good, good, good, good, good, good, good, good, good…
glimpse(test_data)
## Rows: 163
## Columns: 19
## $ Pstatus <fct> T, A, T, A, T, T, T, A, T, T, T, T, T, T, A, T, T, T, T, T…
## $ Medu <int> 2, 4, 4, 2, 4, 4, 2, 3, 4, 2, 4, 2, 4, 4, 2, 4, 4, 1, 3, 4…
## $ Fedu <int> 2, 4, 4, 2, 4, 4, 2, 4, 4, 2, 4, 2, 2, 4, 1, 4, 2, 1, 1, 3…
## $ studytime <int> 2, 2, 1, 3, 3, 1, 1, 2, 2, 2, 2, 1, 2, 1, 2, 2, 4, 2, 1, 2…
## $ schoolsup <fct> no, yes, no, no, no, no, no, yes, no, no, no, yes, no, yes…
## $ famsup <fct> no, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes,…
## $ paid <fct> no, no, no, no, no, yes, no, yes, no, no, no, no, no, no, …
## $ activities <fct> no, no, yes, no, yes, no, no, yes, yes, yes, yes, no, no, …
## $ higher <fct> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes…
## $ internet <fct> yes, no, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes,…
## $ famrel <int> 4, 4, 4, 4, 3, 5, 1, 5, 4, 3, 4, 5, 4, 3, 5, 3, 3, 3, 5, 4…
## $ freetime <int> 4, 1, 3, 5, 2, 4, 2, 3, 3, 3, 3, 4, 3, 3, 3, 2, 3, 3, 3, 3…
## $ goout <int> 4, 4, 3, 2, 3, 2, 2, 3, 1, 3, 3, 1, 3, 4, 4, 2, 3, 4, 2, 3…
## $ health <int> 3, 1, 5, 3, 2, 5, 5, 5, 5, 3, 5, 1, 5, 5, 2, 5, 3, 5, 5, 5…
## $ absences <int> 0, 2, 0, 0, 10, 0, 6, 2, 2, 16, 0, 0, 4, 0, 2, 8, 0, 2, 0,…
## $ G1 <int> 13, 10, 12, 14, 13, 11, 10, 12, 15, 11, 14, 9, 11, 13, 12,…
## $ G2 <int> 12, 13, 13, 14, 13, 12, 11, 12, 15, 11, 15, 10, 12, 12, 13…
## $ G3 <int> 13, 13, 12, 15, 14, 12, 12, 13, 15, 10, 15, 10, 13, 12, 12…
## $ performance <fct> good, good, good, good, good, good, good, good, good, poor…
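To illustrate what the `strata` argument does, here is a minimal sketch on made-up data (`toy` and its 70/30 class mix are assumptions chosen to resemble the proportions above). It checks that the training and test parts do not overlap and that both keep roughly the original class shares.

```r
library(rsample)

# Hypothetical data: 70 'good' and 30 'poor' students.
toy <- data.frame(
  id          = 1:100,
  performance = factor(rep(c("good", "poor"), times = c(70, 30)))
)

set.seed(42)
# prop = 3/4 is the default; it is written out here for clarity.
toy_split <- initial_split(toy, prop = 3/4, strata = performance)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

# Class shares should be close to 0.70 / 0.30 in both parts.
prop.table(table(toy_train$performance))
prop.table(table(toy_test$performance))
```

Without `strata`, the split is still random, but the share of 'poor' students in a small test set can drift away from the share in the full data.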
To prepare the data for modeling, I create recipes using the packages introduced earlier. Four recipes are prepared for four models. The correlation analysis showed that some variables are highly correlated with the outcome, while others are only weakly correlated; I therefore build four models: one with the full dataset, one with only the first- and second-period grades, a third to see how variables related to family affiliation affect the final grade, and a fourth that predicts with only variables outside family affiliation, such as school support.
library(workflows)
library(tidymodels)
## Registered S3 method overwritten by 'tune':
## method from
## required_pkgs.model_spec parsnip
## ── Attaching packages ────────────────────────────────────── tidymodels 0.1.4 ──
## ✓ broom 0.7.9 ✓ recipes 0.1.17
## ✓ dials 0.0.10 ✓ tune 0.1.6
## ✓ infer 1.0.0 ✓ workflowsets 0.1.0
## ✓ modeldata 0.1.1 ✓ yardstick 0.0.9
## ✓ parsnip 0.1.7
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## x scales::discard() masks purrr::discard()
## x dplyr::filter() masks stats::filter()
## x recipes::fixed() masks stringr::fixed()
## x dplyr::lag() masks stats::lag()
## x yardstick::spec() masks readr::spec()
## x recipes::step() masks stats::step()
## • Use tidymodels_prefer() to resolve common conflicts.
recipe_full_data<- recipe(performance ~ ., data = train_data) %>%
step_dummy(all_nominal_predictors())
recipe_G1_G2_data <- recipe(performance ~ G1+G2, data = train_data) %>%
step_dummy(all_nominal_predictors())
recipe_family_affiliation<- recipe(performance ~ Pstatus+Medu+Fedu+
famsup+internet+famrel,data = train_data) %>%
step_dummy(all_nominal_predictors())
recipe_external_affiliation<- recipe(performance ~ studytime+schoolsup+
paid+activities+higher+freetime+
goout+health+absences,
data = train_data) %>%
step_dummy(all_nominal_predictors())
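Because `performance` was derived directly from G3, a formula of the form `performance ~ .` also hands G3 itself to the model as a predictor. One possible way to keep it out is `step_rm()`; the sketch below shows this on hypothetical data (`toy_train` and `recipe_no_G3` are illustrative names, not part of the original analysis).

```r
library(recipes)

# Hypothetical training data with the same kind of columns.
toy_train <- data.frame(
  G1          = c(10, 12, 8, 15, 9, 13),
  G2          = c(11, 12, 9, 15, 8, 14),
  G3          = c(11, 13, 9, 16, 8, 14),
  higher      = factor(c("yes", "yes", "no", "yes", "no", "yes")),
  performance = factor(c("good", "good", "poor", "good", "poor", "good"))
)

recipe_no_G3 <- recipe(performance ~ ., data = toy_train) %>%
  step_rm(G3) %>%                        # drop the column the outcome was built from
  step_dummy(all_nominal_predictors())   # encode the remaining factor predictors

# prep() estimates the steps; bake(new_data = NULL) returns the processed training data.
baked <- bake(prep(recipe_no_G3), new_data = NULL)
names(baked)
```

Whether G1 and G2 should stay in such a model is a separate design choice, since they are earlier grades rather than the final grade itself.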
A general logistic regression model specification is set up so that it can be reused across the different models, each combined with its recipe in a workflow.
library(parsnip)
lr_model<- logistic_reg() %>%
set_engine('glm')
lr_workflow_full<- workflow() %>%
add_model(lr_model) %>%
add_recipe(recipe_full_data)
The model is then fitted to the training data.
set.seed(42)
lr_fit_full_data<- lr_workflow_full %>%
fit(data = train_data)
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
lr_fit_full_data %>%
extract_fit_parsnip() %>%
tidy()
## # A tibble: 19 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 471. 85575. 0.00551 0.996
## 2 Medu -0.0517 4766. -0.0000108 1.00
## 3 Fedu -0.104 4823. -0.0000215 1.00
## 4 studytime 0.220 5255. 0.0000418 1.00
## 5 famrel 0.0855 4545. 0.0000188 1.00
## 6 freetime 0.0503 4249. 0.0000118 1.00
## 7 goout -0.0897 3675. -0.0000244 1.00
## 8 health -0.0564 3077. -0.0000183 1.00
## 9 absences -0.00602 889. -0.00000678 1.00
## 10 G1 0.0251 3104. 0.00000807 1.00
## 11 G2 0.120 5331. 0.0000225 1.00
## 12 G3 -45.0 9150. -0.00492 0.996
## 13 Pstatus_T -0.0322 12502. -0.00000257 1.00
## 14 schoolsup_yes -0.202 11978. -0.0000169 1.00
## 15 famsup_yes -0.134 8460. -0.0000158 1.00
## 16 paid_yes -0.431 17439. -0.0000247 1.00
## 17 activities_yes 0.500 8134. 0.0000614 1.00
## 18 higher_yes -0.114 11415. -0.0000100 1.00
## 19 internet_yes 0.114 9134. 0.0000125 1.00
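The fitted model is then applied to the test set to make predictions. Note that the convergence warnings and the very large standard errors above are what one would expect when G3 is among the predictors: performance was defined directly from G3 (G3 >= 11), so the two classes can be separated perfectly.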
predictions<- augment(lr_fit_full_data, test_data)
glimpse(predictions)
## Rows: 163
## Columns: 22
## $ Pstatus <fct> T, A, T, A, T, T, T, A, T, T, T, T, T, T, A, T, T, T, T, T…
## $ Medu <int> 2, 4, 4, 2, 4, 4, 2, 3, 4, 2, 4, 2, 4, 4, 2, 4, 4, 1, 3, 4…
## $ Fedu <int> 2, 4, 4, 2, 4, 4, 2, 4, 4, 2, 4, 2, 2, 4, 1, 4, 2, 1, 1, 3…
## $ studytime <int> 2, 2, 1, 3, 3, 1, 1, 2, 2, 2, 2, 1, 2, 1, 2, 2, 4, 2, 1, 2…
## $ schoolsup <fct> no, yes, no, no, no, no, no, yes, no, no, no, yes, no, yes…
## $ famsup <fct> no, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes,…
## $ paid <fct> no, no, no, no, no, yes, no, yes, no, no, no, no, no, no, …
## $ activities <fct> no, no, yes, no, yes, no, no, yes, yes, yes, yes, no, no, …
## $ higher <fct> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes…
## $ internet <fct> yes, no, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes,…
## $ famrel <int> 4, 4, 4, 4, 3, 5, 1, 5, 4, 3, 4, 5, 4, 3, 5, 3, 3, 3, 5, 4…
## $ freetime <int> 4, 1, 3, 5, 2, 4, 2, 3, 3, 3, 3, 4, 3, 3, 3, 2, 3, 3, 3, 3…
## $ goout <int> 4, 4, 3, 2, 3, 2, 2, 3, 1, 3, 3, 1, 3, 4, 4, 2, 3, 4, 2, 3…
## $ health <int> 3, 1, 5, 3, 2, 5, 5, 5, 5, 3, 5, 1, 5, 5, 2, 5, 3, 5, 5, 5…
## $ absences <int> 0, 2, 0, 0, 10, 0, 6, 2, 2, 16, 0, 0, 4, 0, 2, 8, 0, 2, 0,…
## $ G1 <int> 13, 10, 12, 14, 13, 11, 10, 12, 15, 11, 14, 9, 11, 13, 12,…
## $ G2 <int> 12, 13, 13, 14, 13, 12, 11, 12, 15, 11, 15, 10, 12, 12, 13…
## $ G3 <int> 13, 13, 12, 15, 14, 12, 12, 13, 15, 10, 15, 10, 13, 12, 12…
## $ performance <fct> good, good, good, good, good, good, good, good, good, poor…
## $ .pred_class <fct> good, good, good, good, good, good, good, good, good, poor…
## $ .pred_good <dbl> 1.000000e+00, 1.000000e+00, 1.000000e+00, 1.000000e+00, 1.…
## $ .pred_poor <dbl> 2.220446e-16, 2.220446e-16, 2.220446e-16, 2.220446e-16, 2.…
It is important to see how the models perform. For this purpose, we can use a confusion matrix or the accuracy metric.
predictions %>%
conf_mat(performance, .pred_class)
## Truth
## Prediction good poor
## good 113 0
## poor 0 50
#or
predictions %>%
select(performance, .pred_class) %>%
accuracy(truth = performance, .pred_class)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 1
lr_workflow_G1_G2 <- workflow() %>%
add_model(lr_model) %>%
add_recipe(recipe_G1_G2_data)
set.seed(42)
lr_fit_G1_G2 <- lr_workflow_G1_G2 %>%
fit(data = train_data)
lr_fit_G1_G2 %>%
extract_fit_parsnip() %>%
tidy()
## # A tibble: 3 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 19.0 2.05 9.26 2.05e-20
## 2 G1 -0.444 0.137 -3.23 1.24e- 3
## 3 G2 -1.44 0.196 -7.35 2.00e-13
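A note on reading these coefficients: with the `glm` engine, the first factor level (here ‘good’, since the levels are in alphabetical order) is the reference class, so the estimates describe the log-odds of ‘poor’. Exponentiating them gives odds ratios. The sketch below simply reuses the two estimates printed above; the variable names are illustrative.

```r
# Estimates copied from the coefficient table above.
estimates <- c(G1 = -0.444, G2 = -1.44)

# exp() converts log-odds coefficients into odds ratios.
odds_ratios <- exp(estimates)
print(round(odds_ratios, 2))

# Percentage change in the odds of 'poor' per additional grade point.
percent_change <- (odds_ratios - 1) * 100
print(round(percent_change, 0))
```

On this reading, each additional point in G2 multiplies the odds of a ‘poor’ final result by roughly 0.24 with G1 held fixed, which is why both estimates are negative.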
predictions_G1_G2 <- augment(lr_fit_G1_G2, test_data)
predictions_G1_G2 %>% conf_mat(performance, .pred_class)
## Truth
## Prediction good poor
## good 105 4
## poor 8 46
predictions_G1_G2 %>%
select(performance, .pred_class) %>%
accuracy(truth = performance, .pred_class)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.926
lr_workflow_family <- workflow() %>%
  add_model(lr_model) %>%
  add_recipe(recipe_family_affiliation)
set.seed(42)
lr_fit_family_data <- lr_workflow_family %>%
  fit(data = train_data)
lr_fit_family_data %>%
  extract_fit_parsnip() %>%
  tidy()
## # A tibble: 7 × 5
##   term         estimate std.error statistic p.value
##   <chr>           <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)   1.31        0.591    2.22    0.0262
## 2 Medu         -0.295       0.121   -2.44    0.0148
## 3 Fedu         -0.284       0.126   -2.26    0.0236
## 4 famrel       -0.142       0.112   -1.27    0.204
## 5 Pstatus_T    -0.124       0.316   -0.393   0.694
## 6 famsup_yes   -0.00302     0.211   -0.0143  0.989
## 7 internet_yes -0.196       0.245   -0.799   0.424
predictions_family_vars <- augment(lr_fit_family_data, test_data)
predictions_family_vars %>%
  conf_mat(performance, .pred_class)
##           Truth
## Prediction good poor
##       good  103   43
##       poor   10    7
predictions_family_vars %>%
  select(performance, .pred_class) %>%
  accuracy(truth = performance, .pred_class)
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.675
lr_workflow_external <- workflow() %>%
  add_model(lr_model) %>%
  add_recipe(recipe_external_affiliation)
set.seed(42)
lr_fit_external <- lr_workflow_external %>%
  fit(data = train_data)
lr_fit_external %>%
  extract_fit_parsnip() %>%
  tidy()
## # A tibble: 10 × 5
##    term           estimate std.error statistic p.value
##    <chr>             <dbl>     <dbl>     <dbl>   <dbl>
##  1 (Intercept)     0.560      0.641     0.873  0.382
##  2 studytime      -0.376      0.143    -2.63   0.00859
##  3 freetime        0.152      0.113     1.35   0.178
##  4 goout           0.00267    0.0985    0.0271 0.978
##  5 health          0.116      0.0791    1.47   0.142
##  6 absences        0.0678     0.0240    2.82   0.00477
##  7 schoolsup_yes   0.479      0.327     1.46   0.143
##  8 paid_yes        0.254      0.445     0.572  0.568
##  9 activities_yes -0.183      0.222    -0.822  0.411
## 10 higher_yes     -2.13       0.360    -5.91   0.00000000352
predictions_external_vars <- augment(lr_fit_external, test_data)
predictions_external_vars %>%
  conf_mat(performance, .pred_class)
##           Truth
## Prediction good poor
##       good  107   39
##       poor    6   11
predictions_external_vars %>%
  select(performance, .pred_class) %>%
  accuracy(truth = performance, .pred_class)
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.724
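To put the four accuracies in context, they can be compared with a simple baseline that always predicts the majority class. The sketch below uses only the counts from the confusion matrices above; the model labels (`all_variables`, `G1_G2`, `family`, `external`) are names chosen here for illustration.

```r
# Correct predictions (diagonal of each confusion matrix above)
# out of the 163 students in the test set.
n_test <- 163
correct <- c(all_variables = 113 + 50,
             G1_G2         = 105 + 46,
             family        = 103 + 7,
             external      = 107 + 11)
accuracy <- correct / n_test

# Baseline: always predict "good", the majority class (113 of 163 students).
baseline <- 113 / n_test

print(round(c(accuracy, baseline = baseline), 3))
print(accuracy > baseline)  # which models beat the baseline
```

On these counts the family model falls slightly below the majority-class baseline and the out-of-family model is only a little above it, so neither adds much beyond always predicting `good`.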
After making predictions with all four models, we get a mixed set of results: the models differ considerably in accuracy. The model based on the first and second term grades reached an accuracy of around 93%, while the models built separately on family and out-of-family affiliation variables performed poorly, with accuracies below 73%.
To answer our first question, we made predictions with all the variables to see how they jointly explain the final grade. The output above reports an accuracy of 1 on the test set for this model, with the first and second term grades accounting for most of the final grade outcome. The predictions also show that family affiliation alone, or out-of-family affiliation alone, does not contribute much to predicting students' final grade performance; the first and second term grades contribute the most. Hence, we assume from the analysis that if family support, school support and family orientation are properly aligned and ensured for students during the first and second terms, they will have an impact on final grade performance. All kinds of support, and a supportive environment, therefore have to be ensured for students in the first and second terms in order to help the group of poor-performing students.
Singh, S. P., Malik, S., & Singh, P. (2016). Research paper factors affecting academic performance of students. Indian Journal of Research, 5(4), 176-178.
Ali, N., Jusof, K., Ali, S., Mokhtar, N., & Salamat, A. S. A. (2009). The factors influencing students’ performance at Universiti Teknologi MARA Kedah, Malaysia. Management Science and Engineering, 3(4), 81-90.
Abdullah, A. M. (2011). Factors affecting business students' performance in Arab Open University: The case of Kuwait. International Journal of Business and Management, 6(5), 146.
Applegate, C., & Daly, A. (2006). The impact of paid work on the academic performance of students: A case study from the University of Canberra. Australian Journal of Education, 50(2), 155-166.
Hedjazi, Y., & Omidi, M. (2008). Factors affecting the academic success of agricultural students at University of Tehran, Iran. Journal of Agricultural Science and Technology, 10(3), 205-214.
Ramadan, S., & Quraan, A. (1994). Determinants of students’ performance in introductory accounting courses. Journal of King Saud University, 6(2), 65-80.
Al-Rofo, M. A. (2010). The dimensions that affect the students’ low accumulative average in Tafila Technical University. Journal of Social Sciences, 22(1), 53-59.
Naser, K., & Peel, M. J. (1998). An exploratory study of the impact of intervening variables on student performance in a principles of accounting course. Accounting Education, 7(3), 209-223.
Khan, R. M. A., Iqbal, N., & Tasneem, S. (2015). The Influence of Parents Educational Level on Secondary School Students Academic Achievements in District Rajanpur. Journal of Education and Practice, 6(16), 76-79.
Konstantopoulos, S., & Borman, G. (2011). Family background and school effects on student achievement: A multilevel analysis of the Coleman data. Teachers College Record, 113(1), 97-132.