The source of dataset has been obtained via Kaggle. This dataset includes a survey of students of mathematics subject as well as Portuguese language courses in secondary school. The survey includes many valuable data especially the social, gender, and students’ study information. Not to mention there are a total 30 suitable variables found in this dataset that has met our target project requirement.
The data set that we’ll be working on includes a survey of students math and portuguese language courses in secondary school. The survey include many data about social, gender, and study information about students.
| Variables | Description |
|---|---|
| school | (bin) Student’s school (E.g: ‘GP’: Gabriel Pereira) |
| sex | (bin) Student’s sex (E.g: ‘F’- female, ‘M’-male) |
| age | (num) Student’s age from 15 to 22 |
| address | (bin) Student’s home address (E.g: ‘U’ - urban, ‘R’ - rural) |
| famsize | (bin) Family size (E.g: ‘LE3’ - less or equal to 3 or ‘GT3’ - greater than 3) |
| Pstatus | (bin) Parent’s cohabitation status (E.g: ‘T’ - living together or ‘A’ - apart) |
| Medu | (num) Mother’s education (E.g: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education) |
| Fedu | (num) Father’s education (E.g: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education) |
| Mjob | (nom) Mother’s job (E.g: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’) |
| Fjob | (nom) Father’s job (E.g: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’) |
| reason | (nom) Reason to choose this school (E.g: close to ‘home’, school ‘reputation’, ‘course’ preference or ‘other’) |
| guardian | (nom) Student’s guardian (E.g: ‘mother’, ‘father’ or ‘other’) |
| traveltime | (num) Home to school travel time (E.g: 1 - 1 hour) |
| studytime | (num) Weekly study time (E.g: 1 - 10 hours) |
| failures | (num) Number of past class failures (E.g: n if 1<=n<3, else 4) |
| schoolsup | (bin) Extra educational support (E.g: yes or no) |
| famsup | (bin) Family educational support (E.g: yes or no) |
| paid | (bin) Extra paid classes within the course subject (Math or Portuguese) (E.g: yes or no) |
| activities | (bin) Extra-curricular activities (E.g: yes or no) |
| nursery | (bin) Attended nursery school (E.g: yes or no) |
| higher | (bin) Wants to take higher education (E.g: yes or no) |
| internet | (bin) Internet access at home (E.g: yes or no) |
| romantic | (bin) With a romantic relationship (E.g: yes or no) |
| famrel | (num) Quality of family relationships (E.g: from 1 - very bad to 5 - excellent) |
| freetime | (num) Free time after school (E.g: from 1 - very low to 5 - very high) |
| goout | (num) Going out with friends (E.g: from 1 - very low to 5 - very high) |
| Dalc | (num) Workday alcohol consumption (E.g: from 1 - very low to 5 - very high) |
| Walc | (num) Weekend alcohol consumption (E.g: from 1 - very low to 5 - very high) |
| health | (num) Current health status (E.g: from 1 - very bad to 5 - very good) |
| absences | (num) Number of school absences (E.g: from 0 to 93) |
There are 3 main objectives that we want to achieve from this project. They are as the following:
library(tidyverse)## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.3 v purrr 0.3.4
## v tibble 3.1.0 v dplyr 1.0.5
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(forcats)
library(stringr)
library(caTools)
library(ggplot2)
library(ggthemes)
library(dplyr)
library(DT)
library(data.table)##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
## The following object is masked from 'package:purrr':
##
## transpose
library(pander)
library(scales)##
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
library(grid)
library(gridExtra)##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
library(corrplot)## corrplot 0.89 loaded
library(VIM) ## Loading required package: colorspace
## VIM is ready to use.
## Suggestions and bug-reports can be submitted at: https://github.com/statistikat/VIM/issues
##
## Attaching package: 'VIM'
## The following object is masked from 'package:datasets':
##
## sleep
library(knitr)
library(vcd)
library(caret)## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
mat <- read.csv('student-mat.csv')
por <- read.csv('student-por.csv')
df <- bind_rows(mat, por)
dim(df) #dimension of the dataset## [1] 1044 33
str(df) #general oveview/summary on the dataset## 'data.frame': 1044 obs. of 33 variables:
## $ school : chr "GP" "GP" "GP" "GP" ...
## $ sex : chr "F" "F" "F" "F" ...
## $ age : int 18 17 15 15 16 16 16 17 15 15 ...
## $ address : chr "U" "U" "U" "U" ...
## $ famsize : chr "GT3" "GT3" "LE3" "GT3" ...
## $ Pstatus : chr "A" "T" "T" "T" ...
## $ Medu : int 4 1 1 4 3 4 2 4 3 3 ...
## $ Fedu : int 4 1 1 2 3 3 2 4 2 4 ...
## $ Mjob : chr "at_home" "at_home" "at_home" "health" ...
## $ Fjob : chr "teacher" "other" "other" "services" ...
## $ reason : chr "course" "course" "other" "home" ...
## $ guardian : chr "mother" "father" "mother" "mother" ...
## $ traveltime: int 2 1 1 1 1 1 1 2 1 1 ...
## $ studytime : int 2 2 2 3 2 2 2 2 2 2 ...
## $ failures : int 0 0 3 0 0 0 0 0 0 0 ...
## $ schoolsup : chr "yes" "no" "yes" "no" ...
## $ famsup : chr "no" "yes" "no" "yes" ...
## $ paid : chr "no" "no" "yes" "yes" ...
## $ activities: chr "no" "no" "no" "yes" ...
## $ nursery : chr "yes" "no" "yes" "yes" ...
## $ higher : chr "yes" "yes" "yes" "yes" ...
## $ internet : chr "no" "yes" "yes" "yes" ...
## $ romantic : chr "no" "no" "no" "yes" ...
## $ famrel : int 4 5 4 3 4 5 4 4 4 5 ...
## $ freetime : int 3 3 3 2 3 4 4 1 2 5 ...
## $ goout : int 4 3 2 2 2 2 4 4 2 1 ...
## $ Dalc : int 1 1 2 1 1 1 1 1 1 1 ...
## $ Walc : int 1 1 3 1 2 2 1 1 1 1 ...
## $ health : int 3 3 3 5 5 5 3 1 1 5 ...
## $ absences : int 6 4 10 2 4 10 0 6 0 0 ...
## $ G1 : int 5 5 7 15 6 15 12 6 16 14 ...
## $ G2 : int 6 5 8 14 10 15 12 5 18 15 ...
## $ G3 : int 6 6 10 15 10 15 11 6 19 15 ...
missing <- df %>% summarize_all(funs(sum(is.na(.))/n()))## Warning: `funs()` was deprecated in dplyr 0.8.0.
## Please use a list of either functions or lambdas:
##
## # Simple named list:
## list(mean = mean, median = median)
##
## # Auto named with `tibble::lst()`:
## tibble::lst(mean, median)
##
## # Using lambdas
## list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
missing <- gather(missing, key="feature", value="missing_pct")
missing %>%
ggplot(aes(x=reorder(feature,-missing_pct),y=missing_pct)) +
geom_bar(stat="identity",fill="red")+
coord_flip()+theme_bw()anyDuplicated(df)## [1] 0
#numeric df
df_num <- df[ ,sapply(df, is.numeric)]
head(df_num)## age Medu Fedu traveltime studytime failures famrel freetime goout Dalc Walc
## 1 18 4 4 2 2 0 4 3 4 1 1
## 2 17 1 1 1 2 0 5 3 3 1 1
## 3 15 1 1 1 2 3 4 3 2 2 3
## 4 15 4 2 1 3 0 3 2 2 1 1
## 5 16 3 3 1 2 0 4 3 2 1 2
## 6 16 4 3 1 2 0 5 4 2 1 2
## health absences G1 G2 G3
## 1 3 6 5 6 6
## 2 3 4 5 5 6
## 3 3 10 7 8 10
## 4 5 2 15 14 15
## 5 5 4 6 10 10
## 6 5 10 15 15 15
#categorical df
df_cat <- df[ ,!sapply(df, is.numeric)]
head(df_cat)## school sex address famsize Pstatus Mjob Fjob reason guardian
## 1 GP F U GT3 A at_home teacher course mother
## 2 GP F U GT3 T at_home other course father
## 3 GP F U LE3 T at_home other other mother
## 4 GP F U GT3 T health services home mother
## 5 GP F U GT3 T other other home father
## 6 GP M U LE3 T services other reputation mother
## schoolsup famsup paid activities nursery higher internet romantic
## 1 yes no no no yes yes no no
## 2 no yes no no no yes yes no
## 3 yes no yes no yes yes yes no
## 4 no yes yes yes yes yes yes yes
## 5 no yes yes no yes yes no no
## 6 no yes yes yes yes yes yes no
summary(df_num)## age Medu Fedu traveltime studytime
## Min. :15.00 Min. :0.000 Min. :0.000 Min. :1.000 Min. :1.00
## 1st Qu.:16.00 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.00
## Median :17.00 Median :3.000 Median :2.000 Median :1.000 Median :2.00
## Mean :16.73 Mean :2.603 Mean :2.388 Mean :1.523 Mean :1.97
## 3rd Qu.:18.00 3rd Qu.:4.000 3rd Qu.:3.000 3rd Qu.:2.000 3rd Qu.:2.00
## Max. :22.00 Max. :4.000 Max. :4.000 Max. :4.000 Max. :4.00
## failures famrel freetime goout
## Min. :0.0000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:4.000 1st Qu.:3.000 1st Qu.:2.000
## Median :0.0000 Median :4.000 Median :3.000 Median :3.000
## Mean :0.2644 Mean :3.936 Mean :3.201 Mean :3.156
## 3rd Qu.:0.0000 3rd Qu.:5.000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :3.0000 Max. :5.000 Max. :5.000 Max. :5.000
## Dalc Walc health absences
## Min. :1.000 Min. :1.000 Min. :1.000 Min. : 0.000
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:3.000 1st Qu.: 0.000
## Median :1.000 Median :2.000 Median :4.000 Median : 2.000
## Mean :1.494 Mean :2.284 Mean :3.543 Mean : 4.435
## 3rd Qu.:2.000 3rd Qu.:3.000 3rd Qu.:5.000 3rd Qu.: 6.000
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :75.000
## G1 G2 G3
## Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 9.00 1st Qu.: 9.00 1st Qu.:10.00
## Median :11.00 Median :11.00 Median :11.00
## Mean :11.21 Mean :11.25 Mean :11.34
## 3rd Qu.:13.00 3rd Qu.:13.00 3rd Qu.:14.00
## Max. :19.00 Max. :19.00 Max. :20.00
cor_num_var <- cor(df_num, use="pairwise.complete.obs")
corrplot.mixed(cor_num_var, tl.col="black", tl.pos = "lt", tl.cex = 0.7,cl.cex = .7, number.cex=.7)#sex vs higher
ggplot(df, aes(sex, fill = factor(higher))) +
geom_bar(stat = "count", position = 'dodge')+
xlab("Sex") +
ylab("Count") +
scale_fill_discrete(name = "Higher") +
ggtitle("Sex vs Higher")#parents cohabitation status vs higher
ggplot(df, aes(Pstatus, fill = factor(higher))) +
geom_bar(stat = "count", position = 'dodge')+
xlab("PStatus") +
ylab("Count") +
scale_fill_discrete(name = "Higher") +
ggtitle("PStatus vs Higher")#extra educational support vs higher
ggplot(df, aes(schoolsup, fill = factor(higher))) +
geom_bar(stat = "count", position = 'dodge')+
xlab("School Support") +
ylab("Count") +
scale_fill_discrete(name = "Higher") +
ggtitle("School Support vs Higher")#family educational support vs higher
ggplot(df, aes(famsup, fill = factor(higher))) +
geom_bar(stat = "count", position = 'dodge')+
xlab("Family Support") +
ylab("Count") +
scale_fill_discrete(name = "Higher") +
ggtitle("Family Support vs Higher")#extra paid classes vs higher
ggplot(df, aes(paid, fill = factor(higher))) +
geom_bar(stat = "count", position = 'dodge')+
xlab("Extra classes") +
ylab("Count") +
scale_fill_discrete(name = "Higher") +
ggtitle("Extra classes vs Higher")#extra curricular activities vs higher
ggplot(df, aes(activities, fill = factor(higher))) +
geom_bar(stat = "count", position = 'dodge')+
xlab("Extra curricular") +
ylab("Count") +
scale_fill_discrete(name = "Higher") +
ggtitle("Extra curricular vs Higher")#nursery vs higher
ggplot(df, aes(nursery, fill = factor(higher))) +
geom_bar(stat = "count", position = 'dodge')+
xlab("Nursery") +
ylab("Count") +
scale_fill_discrete(name = "Higher") +
ggtitle("Nursery vs Higher")#internet vs higher
ggplot(df, aes(internet, fill = factor(higher))) +
geom_bar(stat = "count", position = 'dodge')+
xlab("Internet") +
ylab("Count") +
scale_fill_discrete(name = "Higher") +
ggtitle("Internet vs Higher")#presence of relationship vs higher
ggplot(df, aes(romantic, fill = factor(higher))) +
geom_bar(stat = "count", position = 'dodge')+
xlab("Relationship") +
ylab("Count") +
scale_fill_discrete(name = "Higher") +
ggtitle("Relationship vs Higher")Predicting exam grade (G3) of students using multiple regression
#importing package for linear regression model
library(forecast)## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
#split the dataset into training and test sets, 80:20
trainingset <- df[1:(round(0.8*nrow(df))),]
testset <- df[(round(0.8*nrow(df))+1):nrow(df),]
# creating a model to predict dependent variable (G3) using few identified dependent variables (G1, G2)
lin_reg_model <- lm(G3 ~ G1 + G2, data = trainingset)
# using testset to predict exam grade G3
predicted_G3 <- predict(lin_reg_model, testset)
# obtaining difference between actual and predicted values, error = prediction (predicted_G3) - actual (G3)
error <- testset$G3 - predicted_G3
accuracy(predicted_G3, testset$G3)## ME RMSE MAE MPE MAPE
## Test set 0.1592894 1.569933 1.001139 NaN Inf
Determining whether a student will pursue higher education using classification machine learning techniques
#feature engineering
df_cat <- df[, c("sex", "Mjob", "Fjob", "reason", "guardian", "schoolsup", "famsup", "paid", "activities", "nursery", "higher", "internet", "romantic")]
library(CatEncoders)##
## Attaching package: 'CatEncoders'
## The following object is masked from 'package:base':
##
## transform
df_new <- df
df_new[, "sex"] <- transform(LabelEncoder.fit(df_new[, "sex"]), df_new[, "sex"])
df_new[, "Mjob"] <- transform(LabelEncoder.fit(df_new[, "Mjob"]), df_new[, "Mjob"])
df_new[, "Fjob"] <- transform(LabelEncoder.fit(df_new[, "Fjob"]), df_new[, "Fjob"])
df_new[, "reason"] <- transform(LabelEncoder.fit(df_new[, "reason"]), df_new[, "reason"])
df_new[, "guardian"] <- transform(LabelEncoder.fit(df_new[, "guardian"]), df_new[, "guardian"])
df_new[, "schoolsup"] <- transform(LabelEncoder.fit(df_new[, "schoolsup"]), df_new[, "schoolsup"])
df_new[, "famsup"] <- transform(LabelEncoder.fit(df_new[, "famsup"]), df_new[, "famsup"])
df_new[, "paid"] <- transform(LabelEncoder.fit(df_new[, "paid"]), df_new[, "paid"])
df_new[, "activities"] <- transform(LabelEncoder.fit(df_new[, "activities"]), df_new[, "activities"])
df_new[, "nursery"] <- transform(LabelEncoder.fit(df_new[, "nursery"]), df_new[, "nursery"])
df_new[, "higher"] <- transform(LabelEncoder.fit(df_new[, "higher"]), df_new[, "higher"])
df_new[, "internet"] <- transform(LabelEncoder.fit(df_new[, "internet"]), df_new[, "internet"])
df_new[, "romantic"] <- transform(LabelEncoder.fit(df_new[, "romantic"]), df_new[, "romantic"])
#df_new$higher <- ifelse(df_new$higher == 2, 1, 0)
str(df_new)## 'data.frame': 1044 obs. of 33 variables:
## $ school : chr "GP" "GP" "GP" "GP" ...
## $ sex : int 1 1 1 1 1 2 2 1 2 2 ...
## $ age : int 18 17 15 15 16 16 16 17 15 15 ...
## $ address : chr "U" "U" "U" "U" ...
## $ famsize : chr "GT3" "GT3" "LE3" "GT3" ...
## $ Pstatus : chr "A" "T" "T" "T" ...
## $ Medu : int 4 1 1 4 3 4 2 4 3 3 ...
## $ Fedu : int 4 1 1 2 3 3 2 4 2 4 ...
## $ Mjob : int 1 1 1 2 3 4 3 3 4 3 ...
## $ Fjob : int 5 3 3 4 3 3 3 5 3 3 ...
## $ reason : int 1 1 3 2 2 4 2 2 2 2 ...
## $ guardian : int 2 1 2 2 1 2 2 2 2 2 ...
## $ traveltime: int 2 1 1 1 1 1 1 2 1 1 ...
## $ studytime : int 2 2 2 3 2 2 2 2 2 2 ...
## $ failures : int 0 0 3 0 0 0 0 0 0 0 ...
## $ schoolsup : int 2 1 2 1 1 1 1 2 1 1 ...
## $ famsup : int 1 2 1 2 2 2 1 2 2 2 ...
## $ paid : int 1 1 2 2 2 2 1 1 2 2 ...
## $ activities: int 1 1 1 2 1 2 1 1 1 2 ...
## $ nursery : int 2 1 2 2 2 2 2 2 2 2 ...
## $ higher : int 2 2 2 2 2 2 2 2 2 2 ...
## $ internet : int 1 2 2 2 1 2 2 1 2 2 ...
## $ romantic : int 1 1 1 2 1 1 1 1 1 1 ...
## $ famrel : int 4 5 4 3 4 5 4 4 4 5 ...
## $ freetime : int 3 3 3 2 3 4 4 1 2 5 ...
## $ goout : int 4 3 2 2 2 2 4 4 2 1 ...
## $ Dalc : int 1 1 2 1 1 1 1 1 1 1 ...
## $ Walc : int 1 1 3 1 2 2 1 1 1 1 ...
## $ health : int 3 3 3 5 5 5 3 1 1 5 ...
## $ absences : int 6 4 10 2 4 10 0 6 0 0 ...
## $ G1 : int 5 5 7 15 6 15 12 6 16 14 ...
## $ G2 : int 6 5 8 14 10 15 12 5 18 15 ...
## $ G3 : int 6 6 10 15 10 15 11 6 19 15 ...
#split the dataset into training and test sets, 80:20
train <- df_new[1:(round(0.8*nrow(df_new))),]
test <- df_new[(round(0.8*nrow(df_new))+1):nrow(df_new),]
#using random forest to determine variable importance
library(randomForest)## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:gridExtra':
##
## combine
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
# Set a random seed
set.seed(1000)
# Build the model (note: not all possible variables are used)
rf_model <- randomForest(factor(higher) ~ ., data = train)
#rf_model <- glm(formula = higher ~., family = "binomial", data = train)
# Get importance
importance <- importance(rf_model)
varImportance <- data.frame(Variables = row.names(importance),
Importance = round(importance[ ,'MeanDecreaseGini'],2))
# Create a rank variable based on importance
rankImportance <- varImportance %>%
mutate(Rank = paste0('#',dense_rank(desc(Importance))))
# Use ggplot2 to visualize the relative importance of variables
ggplot(rankImportance, aes(x = reorder(Variables, Importance),
y = Importance, fill = Importance)) +
geom_bar(stat='identity') +
labs(x = 'Variables') +
coord_flip() +
theme_few()predicted <- predict(rf_model, test)
confusionMatrix(as.factor(predicted), as.factor(test$higher), positive = "1")## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2
## 1 0 0
## 2 33 176
##
## Accuracy : 0.8421
## 95% CI : (0.7855, 0.8888)
## No Information Rate : 0.8421
## P-Value [Acc > NIR] : 0.5463
##
## Kappa : 0
##
## Mcnemar's Test P-Value : 2.54e-08
##
## Sensitivity : 0.0000
## Specificity : 1.0000
## Pos Pred Value : NaN
## Neg Pred Value : 0.8421
## Prevalence : 0.1579
## Detection Rate : 0.0000
## Detection Prevalence : 0.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : 1
##
To determine whether a student will join extra-curricular activities or not based on: 1) the location of their home (urban/rural)
library(caTools)
df_lr <- df[, c('schoolsup', 'address')]
df_lr$schoolsup <- ifelse(df_lr$schoolsup == "yes", 1, 0)
df_lr$address <- ifelse(df_lr$address == "U", 1, 0)
str(df_lr)## 'data.frame': 1044 obs. of 2 variables:
## $ schoolsup: num 1 0 1 0 0 0 0 1 0 0 ...
## $ address : num 1 1 1 1 1 1 1 1 1 1 ...
# Split the data
split <- sample.split(df_lr, SplitRatio = 0.8)
trainingset <- subset(df_lr, split == "TRUE")
testset <- subset(df_lr, split == "FALSE")
# Munge the data
df_lr$schoolsup <- as.factor(df_lr$schoolsup)
df_lr$address <- as.factor(df_lr$address)
# Train the model using the training data
lr_model <- glm(schoolsup ~ address, data = trainingset, family = 'binomial')
summary(lr_model)##
## Call:
## glm(formula = schoolsup ~ address, family = "binomial", data = trainingset)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.5032 -0.5032 -0.5032 -0.4396 2.1839
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.2882 0.2804 -8.159 3.37e-16 ***
## address 0.2855 0.3232 0.883 0.377
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 364.18 on 521 degrees of freedom
## Residual deviance: 363.37 on 520 degrees of freedom
## AIC: 367.37
##
## Number of Fisher Scoring iterations: 4
# Run the test data through the model
res <- predict(lr_model, testset, type = "response")
res## 1 3 5 7 9 11 13
## 0.11891892 0.11891892 0.11891892 0.11891892 0.11891892 0.11891892 0.11891892
## 15 17 19 21 23 25 27
## 0.11891892 0.11891892 0.11891892 0.11891892 0.11891892 0.09210527 0.11891892
## 29 31 33 35 37 39 41
## 0.11891892 0.11891892 0.09210527 0.11891892 0.11891892 0.09210527 0.11891892
## 43 45 47 49 51 53 55
## 0.11891892 0.11891892 0.11891892 0.11891892 0.11891892 0.11891892 0.11891892
## 57 59 61 63 65 67 69
## 0.11891892 0.11891892 0.09210527 0.11891892 0.11891892 0.11891892 0.09210527
## 71 73 75 77 79 81 83
## 0.11891892 0.09210527 0.11891892 0.11891892 0.11891892 0.11891892 0.11891892
## 85 87 89 91 93 95 97
## 0.11891892 0.11891892 0.11891892 0.11891892 0.11891892 0.11891892 0.09210527
## 99 101 103 105 107 109 111
## 0.11891892 0.11891892 0.11891892 0.11891892 0.11891892 0.09210527 0.11891892
## 113 115 117 119 121 123 125
## 0.11891892 0.09210527 0.11891892 0.09210527 0.11891892 0.11891892 0.11891892
## 127 129 131 133 135 137 139
## 0.11891892 0.09210527 0.09210527 0.11891892 0.09210527 0.09210527 0.11891892
## 141 143 145 147 149 151 153
## 0.11891892 0.11891892 0.11891892 0.11891892 0.11891892 0.11891892 0.09210527
## 155 157 159 161 163 165 167
## 0.11891892 0.09210527 0.09210527 0.09210527 0.11891892 0.09210527 0.11891892
## 169 171 173 175 177 179 181
## 0.11891892 0.11891892 0.11891892 0.11891892 0.11891892 0.09210527 0.11891892
## 183 185 187 189 191 193 195
## 0.11891892 0.11891892 0.11891892 0.11891892 0.11891892 0.11891892 0.11891892
## 197 199 201 203 205 207 209
## 0.11891892 0.11891892 0.11891892 0.11891892 0.09210527 0.11891892 0.11891892
## 211 213 215 217 219 221 223
## 0.11891892 0.11891892 0.09210527 0.11891892 0.11891892 0.09210527 0.11891892
## 225 227 229 231 233 235 237
## 0.11891892 0.11891892 0.11891892 0.11891892 0.11891892 0.11891892 0.11891892
## 239 241 243 245 247 249 251
## 0.09210527 0.11891892 0.11891892 0.11891892 0.11891892 0.09210527 0.11891892
## 253 255 257 259 261 263 265
## 0.11891892 0.09210527 0.11891892 0.11891892 0.11891892 0.09210527 0.11891892
## 267 269 271 273 275 277 279
## 0.11891892 0.11891892 0.11891892 0.11891892 0.11891892 0.09210527 0.11891892
## 281 283 285 287 289 291 293
## 0.11891892 0.09210527 0.11891892 0.11891892 0.11891892 0.11891892 0.11891892
## 295 297 299 301 303 305 307
## 0.09210527 0.11891892 0.11891892 0.11891892 0.11891892 0.11891892 0.11891892
## 309 311 313 315 317 319 321
## 0.09210527 0.11891892 0.11891892 0.11891892 0.11891892 0.09210527 0.11891892
## 323 325 327 329 331 333 335
## 0.09210527 0.11891892 0.11891892 0.11891892 0.11891892 0.11891892 0.09210527
## 337 339 341 343 345 347 349
## 0.09210527 0.11891892 0.11891892 0.11891892 0.11891892 0.09210527 0.11891892
## 351 353 355 357 359 361 363
## 0.09210527 0.11891892 0.09210527 0.09210527 0.11891892 0.09210527 0.11891892
## 365 367 369 371 373 375 377
## 0.09210527 0.11891892 0.11891892 0.11891892 0.11891892 0.09210527 0.11891892
## 379 381 383 385 387 389 391
## 0.11891892 0.11891892 0.11891892 0.09210527 0.09210527 0.11891892 0.11891892
## 393 395 397 399 401 403 405
## 0.09210527 0.11891892 0.11891892 0.11891892 0.11891892 0.11891892 0.11891892
## 407 409 411 413 415 417 419
## 0.11891892 0.11891892 0.11891892 0.11891892 0.11891892 0.11891892 0.11891892
## 421 423 425 427 429 431 433
## 0.11891892 0.11891892 0.11891892 0.11891892 0.11891892 0.11891892 0.09210527
## 435 437 439 441 443 445 447
## 0.09210527 0.11891892 0.11891892 0.11891892 0.11891892 0.11891892 0.11891892
## 449 451 453 455 457 459 461
## 0.11891892 0.11891892 0.11891892 0.11891892 0.11891892 0.11891892 0.11891892
## 463 465 467 469 471 473 475
## 0.11891892 0.09210527 0.11891892 0.11891892 0.11891892 0.11891892 0.11891892
## 477 479 481 483 485 487 489
## 0.11891892 0.11891892 0.11891892 0.11891892 0.11891892 0.11891892 0.11891892
## 491 493 495 497 499 501 503
## 0.09210527 0.11891892 0.11891892 0.11891892 0.11891892 0.11891892 0.11891892
## 505 507 509 511 513 515 517
## 0.11891892 0.09210527 0.11891892 0.11891892 0.11891892 0.11891892 0.11891892
## 519 521 523 525 527 529 531
## 0.11891892 0.11891892 0.11891892 0.11891892 0.11891892 0.11891892 0.11891892
## 533 535 537 539 541 543 545
## 0.11891892 0.11891892 0.09210527 0.11891892 0.11891892 0.11891892 0.11891892
## 547 549 551 553 555 557 559
## 0.11891892 0.11891892 0.11891892 0.11891892 0.11891892 0.11891892 0.11891892
## 561 563 565 567 569 571 573
## 0.09210527 0.11891892 0.09210527 0.09210527 0.11891892 0.09210527 0.11891892
## 575 577 579 581 583 585 587
## 0.11891892 0.11891892 0.11891892 0.11891892 0.11891892 0.11891892 0.11891892
## 589 591 593 595 597 599 601
## 0.11891892 0.11891892 0.11891892 0.11891892 0.11891892 0.09210527 0.11891892
## 603 605 607 609 611 613 615
## 0.09210527 0.11891892 0.09210527 0.11891892 0.11891892 0.09210527 0.11891892
## 617 619 621 623 625 627 629
## 0.11891892 0.09210527 0.11891892 0.11891892 0.11891892 0.11891892 0.11891892
## 631 633 635 637 639 641 643
## 0.11891892 0.11891892 0.09210527 0.11891892 0.11891892 0.09210527 0.11891892
## 645 647 649 651 653 655 657
## 0.11891892 0.11891892 0.11891892 0.11891892 0.11891892 0.11891892 0.09210527
## 659 661 663 665 667 669 671
## 0.11891892 0.09210527 0.11891892 0.11891892 0.11891892 0.11891892 0.11891892
## 673 675 677 679 681 683 685
## 0.11891892 0.11891892 0.11891892 0.11891892 0.09210527 0.11891892 0.11891892
## 687 689 691 693 695 697 699
## 0.11891892 0.11891892 0.11891892 0.09210527 0.09210527 0.11891892 0.11891892
## 701 703 705 707 709 711 713
## 0.11891892 0.11891892 0.11891892 0.11891892 0.11891892 0.11891892 0.11891892
## 715 717 719 721 723 725 727
## 0.11891892 0.11891892 0.11891892 0.11891892 0.09210527 0.11891892 0.11891892
## 729 731 733 735 737 739 741
## 0.11891892 0.11891892 0.11891892 0.09210527 0.11891892 0.11891892 0.11891892
## 743 745 747 749 751 753 755
## 0.09210527 0.11891892 0.11891892 0.11891892 0.11891892 0.11891892 0.09210527
## 757 759 761 763 765 767 769
## 0.11891892 0.11891892 0.11891892 0.11891892 0.11891892 0.11891892 0.09210527
## 771 773 775 777 779 781 783
## 0.11891892 0.11891892 0.09210527 0.11891892 0.11891892 0.11891892 0.11891892
## 785 787 789 791 793 795 797
## 0.11891892 0.11891892 0.11891892 0.11891892 0.11891892 0.11891892 0.11891892
## 799 801 803 805 807 809 811
## 0.11891892 0.11891892 0.11891892 0.11891892 0.11891892 0.09210527 0.11891892
## 813 815 817 819 821 823 825
## 0.11891892 0.11891892 0.11891892 0.11891892 0.09210527 0.09210527 0.09210527
## 827 829 831 833 835 837 839
## 0.09210527 0.09210527 0.09210527 0.09210527 0.09210527 0.11891892 0.09210527
## 841 843 845 847 849 851 853
## 0.09210527 0.09210527 0.09210527 0.09210527 0.11891892 0.11891892 0.09210527
## 855 857 859 861 863 865 867
## 0.09210527 0.09210527 0.09210527 0.11891892 0.09210527 0.11891892 0.09210527
## 869 871 873 875 877 879 881
## 0.11891892 0.11891892 0.11891892 0.11891892 0.09210527 0.09210527 0.09210527
## 883 885 887 889 891 893 895
## 0.09210527 0.09210527 0.11891892 0.11891892 0.11891892 0.11891892 0.11891892
## 897 899 901 903 905 907 909
## 0.11891892 0.11891892 0.09210527 0.11891892 0.09210527 0.11891892 0.11891892
## 911 913 915 917 919 921 923
## 0.11891892 0.09210527 0.09210527 0.09210527 0.11891892 0.11891892 0.11891892
## 925 927 929 931 933 935 937
## 0.11891892 0.11891892 0.11891892 0.09210527 0.11891892 0.11891892 0.09210527
## 939 941 943 945 947 949 951
## 0.09210527 0.11891892 0.09210527 0.09210527 0.11891892 0.11891892 0.09210527
## 953 955 957 959 961 963 965
## 0.09210527 0.09210527 0.11891892 0.11891892 0.09210527 0.09210527 0.11891892
## 967 969 971 973 975 977 979
## 0.11891892 0.11891892 0.09210527 0.09210527 0.11891892 0.11891892 0.09210527
## 981 983 985 987 989 991 993
## 0.09210527 0.09210527 0.11891892 0.11891892 0.11891892 0.11891892 0.09210527
## 995 997 999 1001 1003 1005 1007
## 0.09210527 0.11891892 0.09210527 0.11891892 0.09210527 0.11891892 0.09210527
## 1009 1011 1013 1015 1017 1019 1021
## 0.11891892 0.11891892 0.11891892 0.11891892 0.09210527 0.11891892 0.11891892
## 1023 1025 1027 1029 1031 1033 1035
## 0.09210527 0.09210527 0.09210527 0.09210527 0.09210527 0.09210527 0.09210527
## 1037 1039 1041 1043
## 0.09210527 0.09210527 0.11891892 0.11891892
# Validate the model using Confusion Matrix
confmatrix <- table(Actual_Value = trainingset$schoolsup, Predicted_Value = res > 0.5)
confmatrix## Predicted_Value
## Actual_Value FALSE
## 0 464
## 1 58
# Check Correlation
cor.test(as.numeric(df_lr$schoolsup), as.numeric(df_lr$address), method = "pearson")##
## Pearson's product-moment correlation
##
## data: as.numeric(df_lr$schoolsup) and as.numeric(df_lr$address)
## t = 0.76146, df = 1042, p-value = 0.4466
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.03714249 0.08413446
## sample estimates:
## cor
## 0.02358275
# Accuracy
(confmatrix[[1,1]]) / sum(confmatrix)## [1] 0.8888889