Machine Learning-Based Prediction of Secondary School Students' Education

Introduction

The dataset was obtained from Kaggle. It contains a survey of secondary school students enrolled in Mathematics and Portuguese language courses, covering valuable social, gender, and study-related information. In total, 30 variables in this dataset meet our project requirements.

About the Data Set

The data set we will be working with surveys secondary school students enrolled in Mathematics and Portuguese language courses. It records social, gender, and study-related information about each student.

Variables Description
school (bin) Student’s school (E.g: ‘GP’ - Gabriel Pereira or ‘MS’ - Mousinho da Silveira)
sex (bin) Student’s sex (E.g: ‘F’- female, ‘M’-male)
age (num) Student’s age from 15 to 22
address (bin) Student’s home address (E.g: ‘U’ - urban, ‘R’ - rural)
famsize (bin) Family size (E.g: ‘LE3’ - less or equal to 3 or ‘GT3’ - greater than 3)
Pstatus (bin) Parent’s cohabitation status (E.g: ‘T’ - living together or ‘A’ - apart)
Medu (num) Mother’s education (E.g: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
Fedu (num) Father’s education (E.g: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
Mjob (nom) Mother’s job (E.g: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
Fjob (nom) Father’s job (E.g: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
reason (nom) Reason to choose this school (E.g: close to ‘home’, school ‘reputation’, ‘course’ preference or ‘other’)
guardian (nom) Student’s guardian (E.g: ‘mother’, ‘father’ or ‘other’)
traveltime (num) Home to school travel time (E.g: 1 - less than 15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, 4 - more than 1 hour)
studytime (num) Weekly study time (E.g: 1 - less than 2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, 4 - more than 10 hours)
failures (num) Number of past class failures (E.g: n if 1<=n<3, else 4)
schoolsup (bin) Extra educational support (E.g: yes or no)
famsup (bin) Family educational support (E.g: yes or no)
paid (bin) Extra paid classes within the course subject (Math or Portuguese) (E.g: yes or no)
activities (bin) Extra-curricular activities (E.g: yes or no)
nursery (bin) Attended nursery school (E.g: yes or no)
higher (bin) Wants to take higher education (E.g: yes or no)
internet (bin) Internet access at home (E.g: yes or no)
romantic (bin) With a romantic relationship (E.g: yes or no)
famrel (num) Quality of family relationships (E.g: from 1 - very bad to 5 - excellent)
freetime (num) Free time after school (E.g: from 1 - very low to 5 - very high)
goout (num) Going out with friends (E.g: from 1 - very low to 5 - very high)
Dalc (num) Workday alcohol consumption (E.g: from 1 - very low to 5 - very high)
Walc (num) Weekend alcohol consumption (E.g: from 1 - very low to 5 - very high)
health (num) Current health status (E.g: from 1 - very bad to 5 - very good)
absences (num) Number of school absences (E.g: from 0 to 93)

In addition to these 30 attributes, the dataset also records three grades: G1 (first period grade), G2 (second period grade) and G3 (final grade), each scored from 0 to 20.

Objectives

There are three main objectives that we want to achieve in this project:

  • To predict the final grade (G3) of a student based on his/her previous period grades (G1 and G2).
  • To determine whether a student wants to pursue higher education later in life.
  • To determine whether a student will join extra-curricular activities based on the location of their home (urban/rural).

Packages Information

1. Load packages

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.3     v purrr   0.3.4
## v tibble  3.1.0     v dplyr   1.0.5
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
# forcats, stringr, ggplot2 and dplyr are already attached by tidyverse above
library(caTools)
library(ggthemes)
library(DT)
library(data.table)
## 
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
## The following object is masked from 'package:purrr':
## 
##     transpose
library(pander)
library(scales)
## 
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
## 
##     discard
## The following object is masked from 'package:readr':
## 
##     col_factor
library(grid)
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
library(corrplot)
## corrplot 0.89 loaded
library(VIM) 
## Loading required package: colorspace
## VIM is ready to use.
## Suggestions and bug-reports can be submitted at: https://github.com/statistikat/VIM/issues
## 
## Attaching package: 'VIM'
## The following object is masked from 'package:datasets':
## 
##     sleep
library(knitr)
library(vcd)
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift

Data Preparation

Data Preprocessing

1. Importing dataset

mat <- read.csv('student-mat.csv')
por <- read.csv('student-por.csv')
df <- bind_rows(mat, por)

dim(df) #dimension of the dataset
## [1] 1044   33
str(df) #general overview/summary of the dataset
## 'data.frame':    1044 obs. of  33 variables:
##  $ school    : chr  "GP" "GP" "GP" "GP" ...
##  $ sex       : chr  "F" "F" "F" "F" ...
##  $ age       : int  18 17 15 15 16 16 16 17 15 15 ...
##  $ address   : chr  "U" "U" "U" "U" ...
##  $ famsize   : chr  "GT3" "GT3" "LE3" "GT3" ...
##  $ Pstatus   : chr  "A" "T" "T" "T" ...
##  $ Medu      : int  4 1 1 4 3 4 2 4 3 3 ...
##  $ Fedu      : int  4 1 1 2 3 3 2 4 2 4 ...
##  $ Mjob      : chr  "at_home" "at_home" "at_home" "health" ...
##  $ Fjob      : chr  "teacher" "other" "other" "services" ...
##  $ reason    : chr  "course" "course" "other" "home" ...
##  $ guardian  : chr  "mother" "father" "mother" "mother" ...
##  $ traveltime: int  2 1 1 1 1 1 1 2 1 1 ...
##  $ studytime : int  2 2 2 3 2 2 2 2 2 2 ...
##  $ failures  : int  0 0 3 0 0 0 0 0 0 0 ...
##  $ schoolsup : chr  "yes" "no" "yes" "no" ...
##  $ famsup    : chr  "no" "yes" "no" "yes" ...
##  $ paid      : chr  "no" "no" "yes" "yes" ...
##  $ activities: chr  "no" "no" "no" "yes" ...
##  $ nursery   : chr  "yes" "no" "yes" "yes" ...
##  $ higher    : chr  "yes" "yes" "yes" "yes" ...
##  $ internet  : chr  "no" "yes" "yes" "yes" ...
##  $ romantic  : chr  "no" "no" "no" "yes" ...
##  $ famrel    : int  4 5 4 3 4 5 4 4 4 5 ...
##  $ freetime  : int  3 3 3 2 3 4 4 1 2 5 ...
##  $ goout     : int  4 3 2 2 2 2 4 4 2 1 ...
##  $ Dalc      : int  1 1 2 1 1 1 1 1 1 1 ...
##  $ Walc      : int  1 1 3 1 2 2 1 1 1 1 ...
##  $ health    : int  3 3 3 5 5 5 3 1 1 5 ...
##  $ absences  : int  6 4 10 2 4 10 0 6 0 0 ...
##  $ G1        : int  5 5 7 15 6 15 12 6 16 14 ...
##  $ G2        : int  6 5 8 14 10 15 12 5 18 15 ...
##  $ G3        : int  6 6 10 15 10 15 11 6 19 15 ...
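Since df simply stacks the two course files, it can be useful to keep track of which course each row came from. A minimal sketch of our own (the subject column and df_tagged name are illustrative, not part of the original analysis):

# tag each row with its course before binding, for later per-course analysis
mat$subject <- "math"
por$subject <- "portuguese"
df_tagged <- bind_rows(mat, por)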

2. Cleaning the dataset

  • Missing values
missing <- df %>% summarize_all(~ sum(is.na(.)) / n())
missing <- gather(missing, key = "feature", value = "missing_pct")

missing %>% 
  ggplot(aes(x = reorder(feature, -missing_pct), y = missing_pct)) +
  geom_bar(stat = "identity", fill = "red") +
  coord_flip() +
  theme_bw()
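The bar chart shows the per-feature share of missing values; every bar sits at zero, as this dataset contains no missing entries. An equivalent tabular check (our own quick addition):

# count missing values per column; all zeros means no imputation is needed
colSums(is.na(df))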

  • Duplicates
anyDuplicated(df)
## [1] 0
  • Outliers
#numeric df

df_num <- df[ ,sapply(df, is.numeric)]
head(df_num)
##   age Medu Fedu traveltime studytime failures famrel freetime goout Dalc Walc
## 1  18    4    4          2         2        0      4        3     4    1    1
## 2  17    1    1          1         2        0      5        3     3    1    1
## 3  15    1    1          1         2        3      4        3     2    2    3
## 4  15    4    2          1         3        0      3        2     2    1    1
## 5  16    3    3          1         2        0      4        3     2    1    2
## 6  16    4    3          1         2        0      5        4     2    1    2
##   health absences G1 G2 G3
## 1      3        6  5  6  6
## 2      3        4  5  5  6
## 3      3       10  7  8 10
## 4      5        2 15 14 15
## 5      5        4  6 10 10
## 6      5       10 15 15 15
#categorical df

df_cat <- df[ ,!sapply(df, is.numeric)]
head(df_cat)
##   school sex address famsize Pstatus     Mjob     Fjob     reason guardian
## 1     GP   F       U     GT3       A  at_home  teacher     course   mother
## 2     GP   F       U     GT3       T  at_home    other     course   father
## 3     GP   F       U     LE3       T  at_home    other      other   mother
## 4     GP   F       U     GT3       T   health services       home   mother
## 5     GP   F       U     GT3       T    other    other       home   father
## 6     GP   M       U     LE3       T services    other reputation   mother
##   schoolsup famsup paid activities nursery higher internet romantic
## 1       yes     no   no         no     yes    yes       no       no
## 2        no    yes   no         no      no    yes      yes       no
## 3       yes     no  yes         no     yes    yes      yes       no
## 4        no    yes  yes        yes     yes    yes      yes      yes
## 5        no    yes  yes         no     yes    yes       no       no
## 6        no    yes  yes        yes     yes    yes      yes       no
summary(df_num)
##       age             Medu            Fedu         traveltime      studytime   
##  Min.   :15.00   Min.   :0.000   Min.   :0.000   Min.   :1.000   Min.   :1.00  
##  1st Qu.:16.00   1st Qu.:2.000   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:1.00  
##  Median :17.00   Median :3.000   Median :2.000   Median :1.000   Median :2.00  
##  Mean   :16.73   Mean   :2.603   Mean   :2.388   Mean   :1.523   Mean   :1.97  
##  3rd Qu.:18.00   3rd Qu.:4.000   3rd Qu.:3.000   3rd Qu.:2.000   3rd Qu.:2.00  
##  Max.   :22.00   Max.   :4.000   Max.   :4.000   Max.   :4.000   Max.   :4.00  
##     failures          famrel         freetime         goout      
##  Min.   :0.0000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:4.000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :3.000   Median :3.000  
##  Mean   :0.2644   Mean   :3.936   Mean   :3.201   Mean   :3.156  
##  3rd Qu.:0.0000   3rd Qu.:5.000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :3.0000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
##       Dalc            Walc           health         absences     
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   : 0.000  
##  1st Qu.:1.000   1st Qu.:1.000   1st Qu.:3.000   1st Qu.: 0.000  
##  Median :1.000   Median :2.000   Median :4.000   Median : 2.000  
##  Mean   :1.494   Mean   :2.284   Mean   :3.543   Mean   : 4.435  
##  3rd Qu.:2.000   3rd Qu.:3.000   3rd Qu.:5.000   3rd Qu.: 6.000  
##  Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :75.000  
##        G1              G2              G3       
##  Min.   : 0.00   Min.   : 0.00   Min.   : 0.00  
##  1st Qu.: 9.00   1st Qu.: 9.00   1st Qu.:10.00  
##  Median :11.00   Median :11.00   Median :11.00  
##  Mean   :11.21   Mean   :11.25   Mean   :11.34  
##  3rd Qu.:13.00   3rd Qu.:13.00   3rd Qu.:14.00  
##  Max.   :19.00   Max.   :19.00   Max.   :20.00
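The summaries above hint at outliers in a few variables (for instance, absences has a third quartile of 6 but a maximum of 75). A minimal sketch of our own for inspecting them visually, assuming the default 1.5 x IQR whisker rule is acceptable:

# reshape the numeric columns to long format and draw one boxplot per feature;
# points beyond the whiskers are flagged in red as potential outliers
# note: features share one y-axis here; facet_wrap(~ feature, scales = "free")
# is an alternative when the scales differ widely
df_num %>%
  pivot_longer(everything(), names_to = "feature", values_to = "value") %>%
  ggplot(aes(x = feature, y = value)) +
  geom_boxplot(outlier.colour = "red") +
  coord_flip() +
  theme_bw()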

Exploratory Data Analysis

# correlation matrix of the numeric variables
cor_num_var <- cor(df_num, use="pairwise.complete.obs")
corrplot.mixed(cor_num_var, tl.col="black", tl.pos = "lt", tl.cex = 0.7,cl.cex = .7, number.cex=.7)

#sex vs higher

ggplot(df, aes(sex, fill = factor(higher))) + 
  geom_bar(stat = "count", position = 'dodge')+
  xlab("Sex") +
  ylab("Count") +
  scale_fill_discrete(name = "Higher") + 
  ggtitle("Sex vs Higher")

#parents cohabitation status vs higher

ggplot(df, aes(Pstatus, fill = factor(higher))) + 
  geom_bar(stat = "count", position = 'dodge')+
  xlab("PStatus") +
  ylab("Count") +
  scale_fill_discrete(name = "Higher") + 
  ggtitle("PStatus vs Higher")

#extra educational support vs higher

ggplot(df, aes(schoolsup, fill = factor(higher))) + 
  geom_bar(stat = "count", position = 'dodge')+
  xlab("School Support") +
  ylab("Count") +
  scale_fill_discrete(name = "Higher") + 
  ggtitle("School Support vs Higher")

#family educational support vs higher

ggplot(df, aes(famsup, fill = factor(higher))) + 
  geom_bar(stat = "count", position = 'dodge')+
  xlab("Family Support") +
  ylab("Count") +
  scale_fill_discrete(name = "Higher") + 
  ggtitle("Family Support vs Higher")

#extra paid classes vs higher

ggplot(df, aes(paid, fill = factor(higher))) + 
  geom_bar(stat = "count", position = 'dodge')+
  xlab("Extra classes") +
  ylab("Count") +
  scale_fill_discrete(name = "Higher") + 
  ggtitle("Extra classes vs Higher")

#extra curricular activities vs higher

ggplot(df, aes(activities, fill = factor(higher))) + 
  geom_bar(stat = "count", position = 'dodge')+
  xlab("Extra curricular") +
  ylab("Count") +
  scale_fill_discrete(name = "Higher") + 
  ggtitle("Extra curricular vs Higher")

#nursery vs higher

ggplot(df, aes(nursery, fill = factor(higher))) + 
  geom_bar(stat = "count", position = 'dodge')+
  xlab("Nursery") +
  ylab("Count") +
  scale_fill_discrete(name = "Higher") + 
  ggtitle("Nursery vs Higher")

#internet vs higher

ggplot(df, aes(internet, fill = factor(higher))) + 
  geom_bar(stat = "count", position = 'dodge')+
  xlab("Internet") +
  ylab("Count") +
  scale_fill_discrete(name = "Higher") + 
  ggtitle("Internet vs Higher")

#presence of relationship vs higher

ggplot(df, aes(romantic, fill = factor(higher))) + 
  geom_bar(stat = "count", position = 'dodge')+
  xlab("Relationship") +
  ylab("Count") +
  scale_fill_discrete(name = "Higher") + 
  ggtitle("Relationship vs Higher")

Results & Discussion

Multiple Regression

Predicting students’ final exam grade (G3) using multiple regression

#importing package for linear regression model
library(forecast)
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
#split the dataset into training and test sets, 80:20

trainingset <- df[1:(round(0.8*nrow(df))),]
testset <- df[(round(0.8*nrow(df))+1):nrow(df),]
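Note that df stacks all Mathematics rows before the Portuguese ones, so this ordered split places only Portuguese students in the test set. A fairer alternative is to shuffle the rows first; a sketch of our own (the seed and the _shuffled names are illustrative, and the results below still use the ordered split):

# shuffle the row order before splitting so both courses appear in both sets
set.seed(123)  # assumption: any fixed seed keeps the split reproducible
shuffled <- df[sample(nrow(df)), ]
trainingset_shuffled <- shuffled[1:round(0.8 * nrow(shuffled)), ]
testset_shuffled <- shuffled[(round(0.8 * nrow(shuffled)) + 1):nrow(shuffled), ]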

# creating a model to predict the dependent variable (G3) using the identified independent variables (G1, G2)

lin_reg_model <- lm(G3 ~ G1 + G2, data = trainingset)

# using testset to predict exam grade G3

predicted_G3 <- predict(lin_reg_model, testset)

# obtaining the difference between actual and predicted values, error = actual (G3) - prediction (predicted_G3)

error <- testset$G3 - predicted_G3

accuracy(predicted_G3, testset$G3)
##                 ME     RMSE      MAE MPE MAPE
## Test set 0.1592894 1.569933 1.001139 NaN  Inf
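The RMSE of about 1.57 grade points (on a 0 to 20 scale) looks strong. As a sanity check of our own, it can be compared against a naive baseline that simply reuses the second-period grade as the prediction:

# baseline: use G2 itself as the forecast of G3; the regression should beat this RMSE
accuracy(testset$G2, testset$G3)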

Random Forest Classifier

Determining whether a student will pursue higher education using classification machine learning techniques

# feature engineering: select the categorical columns to be label-encoded

df_cat <- df[, c("sex", "Mjob", "Fjob", "reason", "guardian", "schoolsup", "famsup", "paid", "activities", "nursery", "higher", "internet", "romantic")]

library(CatEncoders)
## 
## Attaching package: 'CatEncoders'
## The following object is masked from 'package:base':
## 
##     transform
df_new <- df

# label-encode each selected categorical column in place; school, address,
# famsize and Pstatus are left as character columns
for (col in colnames(df_cat)) {
  df_new[, col] <- transform(LabelEncoder.fit(df_new[, col]), df_new[, col])
}

#df_new$higher <- ifelse(df_new$higher == 2, 1, 0)
str(df_new)
## 'data.frame':    1044 obs. of  33 variables:
##  $ school    : chr  "GP" "GP" "GP" "GP" ...
##  $ sex       : int  1 1 1 1 1 2 2 1 2 2 ...
##  $ age       : int  18 17 15 15 16 16 16 17 15 15 ...
##  $ address   : chr  "U" "U" "U" "U" ...
##  $ famsize   : chr  "GT3" "GT3" "LE3" "GT3" ...
##  $ Pstatus   : chr  "A" "T" "T" "T" ...
##  $ Medu      : int  4 1 1 4 3 4 2 4 3 3 ...
##  $ Fedu      : int  4 1 1 2 3 3 2 4 2 4 ...
##  $ Mjob      : int  1 1 1 2 3 4 3 3 4 3 ...
##  $ Fjob      : int  5 3 3 4 3 3 3 5 3 3 ...
##  $ reason    : int  1 1 3 2 2 4 2 2 2 2 ...
##  $ guardian  : int  2 1 2 2 1 2 2 2 2 2 ...
##  $ traveltime: int  2 1 1 1 1 1 1 2 1 1 ...
##  $ studytime : int  2 2 2 3 2 2 2 2 2 2 ...
##  $ failures  : int  0 0 3 0 0 0 0 0 0 0 ...
##  $ schoolsup : int  2 1 2 1 1 1 1 2 1 1 ...
##  $ famsup    : int  1 2 1 2 2 2 1 2 2 2 ...
##  $ paid      : int  1 1 2 2 2 2 1 1 2 2 ...
##  $ activities: int  1 1 1 2 1 2 1 1 1 2 ...
##  $ nursery   : int  2 1 2 2 2 2 2 2 2 2 ...
##  $ higher    : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ internet  : int  1 2 2 2 1 2 2 1 2 2 ...
##  $ romantic  : int  1 1 1 2 1 1 1 1 1 1 ...
##  $ famrel    : int  4 5 4 3 4 5 4 4 4 5 ...
##  $ freetime  : int  3 3 3 2 3 4 4 1 2 5 ...
##  $ goout     : int  4 3 2 2 2 2 4 4 2 1 ...
##  $ Dalc      : int  1 1 2 1 1 1 1 1 1 1 ...
##  $ Walc      : int  1 1 3 1 2 2 1 1 1 1 ...
##  $ health    : int  3 3 3 5 5 5 3 1 1 5 ...
##  $ absences  : int  6 4 10 2 4 10 0 6 0 0 ...
##  $ G1        : int  5 5 7 15 6 15 12 6 16 14 ...
##  $ G2        : int  6 5 8 14 10 15 12 5 18 15 ...
##  $ G3        : int  6 6 10 15 10 15 11 6 19 15 ...
#split the dataset into training and test sets, 80:20

train <- df_new[1:(round(0.8*nrow(df_new))),]
test <- df_new[(round(0.8*nrow(df_new))+1):nrow(df_new),]


#using random forest to determine variable importance

library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:gridExtra':
## 
##     combine
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
# Set a random seed
set.seed(1000)

# Build the model, using all remaining variables as predictors
rf_model <- randomForest(factor(higher) ~ ., data = train)

#rf_model <- glm(formula = higher ~., family = "binomial", data = train)

# Get importance
importance    <- importance(rf_model)
varImportance <- data.frame(Variables = row.names(importance), 
                            Importance = round(importance[ ,'MeanDecreaseGini'],2))

# Create a rank variable based on importance
rankImportance <- varImportance %>%
  mutate(Rank = paste0('#',dense_rank(desc(Importance))))

# Use ggplot2 to visualize the relative importance of variables
ggplot(rankImportance, aes(x = reorder(Variables, Importance), 
    y = Importance, fill = Importance)) +
  geom_bar(stat='identity') + 
  labs(x = 'Variables') +
  coord_flip() + 
  theme_few()

predicted <- predict(rf_model, test)

# after label encoding, class 1 = "no" and class 2 = "yes"
confusionMatrix(as.factor(predicted), as.factor(test$higher), positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   1   2
##          1   0   0
##          2  33 176
##                                           
##                Accuracy : 0.8421          
##                  95% CI : (0.7855, 0.8888)
##     No Information Rate : 0.8421          
##     P-Value [Acc > NIR] : 0.5463          
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : 2.54e-08        
##                                           
##             Sensitivity : 0.0000          
##             Specificity : 1.0000          
##          Pos Pred Value :    NaN          
##          Neg Pred Value : 0.8421          
##              Prevalence : 0.1579          
##          Detection Rate : 0.0000          
##    Detection Prevalence : 0.0000          
##       Balanced Accuracy : 0.5000          
##                                           
##        'Positive' Class : 1               
## 
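Although accuracy is 84%, the forest achieves it by predicting "yes" for every student: sensitivity for the minority class ("no", encoded as 1) is zero and Kappa is 0, so the model is no better than the no-information rate. One common remedy, sketched here under our own assumptions (down-sampling rather than, say, class weights), is to balance the training classes before refitting:

# down-sample the majority class so both classes are equally represented;
# caret::downSample appends the response as a factor column named "higher"
set.seed(1000)
balanced <- downSample(x = train[, setdiff(names(train), "higher")],
                       y = factor(train$higher), yname = "higher")
rf_balanced <- randomForest(higher ~ ., data = balanced)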

Logistic Regression

The objective was to determine whether a student will join extra-curricular activities based on the location of their home (urban/rural). The code below, however, models extra educational support (schoolsup) against home address; a sketch for the activities variant appears at the end of this section.

library(caTools)

df_lr <- df[, c('schoolsup', 'address')]
df_lr$schoolsup <- ifelse(df_lr$schoolsup == "yes", 1, 0)
df_lr$address <- ifelse(df_lr$address == "U", 1, 0)
str(df_lr)
## 'data.frame':    1044 obs. of  2 variables:
##  $ schoolsup: num  1 0 1 0 0 0 0 1 0 0 ...
##  $ address  : num  1 1 1 1 1 1 1 1 1 1 ...
# Split the data (sample.split expects the label vector, not the whole data frame)
split <- sample.split(df_lr$schoolsup, SplitRatio = 0.8)
trainingset <- subset(df_lr, split == TRUE)
testset <- subset(df_lr, split == FALSE)

# Convert to factors (used only by the correlation test below)
df_lr$schoolsup <-  as.factor(df_lr$schoolsup)
df_lr$address <-  as.factor(df_lr$address)

# Train the model using the training data
lr_model <- glm(schoolsup ~ address, data = trainingset, family = 'binomial')
summary(lr_model)
## 
## Call:
## glm(formula = schoolsup ~ address, family = "binomial", data = trainingset)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.5032  -0.5032  -0.5032  -0.4396   2.1839  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -2.2882     0.2804  -8.159 3.37e-16 ***
## address       0.2855     0.3232   0.883    0.377    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 364.18  on 521  degrees of freedom
## Residual deviance: 363.37  on 520  degrees of freedom
## AIC: 367.37
## 
## Number of Fisher Scoring iterations: 4
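The address coefficient is not statistically significant (p ≈ 0.38), so home location alone says little about school support. Expressed as an odds ratio (a quick check of our own):

# odds of receiving extra school support for urban vs rural students
exp(coef(lr_model)["address"])  # ≈ 1.33, with a confidence interval spanning 1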
# Run the test data through the model
res <- predict(lr_model, testset, type = "response")
# with one binary predictor the fitted probabilities take only two values:
# ~0.119 for urban students and ~0.092 for rural students
head(res)
##          1          3          5          7          9         11 
## 0.11891892 0.11891892 0.11891892 0.11891892 0.11891892 0.11891892
# Validate the model with a confusion matrix; res was computed on the test
# set, so it must be compared against the test labels (not the training ones)
confmatrix <- table(Actual_Value = testset$schoolsup, Predicted_Value = res > 0.5)
confmatrix
##             Predicted_Value
## Actual_Value FALSE
##            0   464
##            1    58
# Check Correlation
cor.test(as.numeric(df_lr$schoolsup), as.numeric(df_lr$address), method = "pearson")
## 
##  Pearson's product-moment correlation
## 
## data:  as.numeric(df_lr$schoolsup) and as.numeric(df_lr$address)
## t = 0.76146, df = 1042, p-value = 0.4466
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.03714249  0.08413446
## sample estimates:
##        cor 
## 0.02358275
# Accuracy: no predicted probability exceeds 0.5, so the model predicts
# "no support" for everyone and accuracy equals the majority-class rate
(confmatrix[[1,1]]) / sum(confmatrix)
## [1] 0.8888889
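Finally, Objective 3 referred to extra-curricular activities rather than school support. A minimal sketch of our own for that variant, reusing the recipe above (the variable names and seed are illustrative):

# model extra-curricular participation as a function of home address
df_act <- df[, c("activities", "address")]
df_act$activities <- ifelse(df_act$activities == "yes", 1, 0)
df_act$address <- ifelse(df_act$address == "U", 1, 0)

set.seed(42)  # assumption: any fixed seed keeps the split reproducible
split_act <- sample.split(df_act$activities, SplitRatio = 0.8)
act_model <- glm(activities ~ address, data = subset(df_act, split_act),
                 family = "binomial")
summary(act_model)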