Group 6
Lim Wei Yeong (WIE190021 17206011/1)
Gan Jia Soon (WIE190012 17206343/1)
Ng Zi Xiang (WIE190035 17205360/1)
Ng Phoon Ken (WIE190034 17204748/1)
Academic performance of students is always a concern not only to themselves, but also to their parents, teacher and society once they enter working environment. There is a lot of factors that are actually affecting the students in concentrating to their academic subjects, cause a bad result overally. This project is done to investigate on them.
Our product is created on the purpose helping predicting the grade of students under conditions and show for various factor affecting academic performance of students. We believe that this product could be useful to users like teachers, parents and even students.
In completing this assignment, R being used. We focus on data extraction, cleaning, analysis and finally visualisation with prediction using Shiny application mainly. We successfully obtain our data set through online where there 2 data sets inside, where one recording for some variables with result of the students for 3 semester in Mathematics, while another with Portuguese. We selected the data set with Mathematics only, where we had a total of 395 rows and 33 columns at first. Some variables included inside the data set are age, health status, family relationship and more.
First, we set the working directory, and load the data, check for dimension, existence of null value and such for the data set chosen. “student-mat.csv” is the raw data set that is unprocessed where we extract it from the source https://archive.ics.uci.edu/ml/datasets/student+performance
setwd("D:/WIE2003 Intro to DS/Assignment/Group/student/Grade")
d1=read.table("student-mat.csv",sep=";",header=TRUE)
# Display for number of rows and columns
dim(d1)
## [1] 395 33
# Check for class of the data
class(d1)
## [1] "data.frame"
# Check for names of columns with its respective type
str(d1)
## 'data.frame': 395 obs. of 33 variables:
## $ school : chr "GP" "GP" "GP" "GP" ...
## $ sex : chr "F" "F" "F" "F" ...
## $ age : int 18 17 15 15 16 16 16 17 15 15 ...
## $ address : chr "U" "U" "U" "U" ...
## $ famsize : chr "GT3" "GT3" "LE3" "GT3" ...
## $ Pstatus : chr "A" "T" "T" "T" ...
## $ Medu : int 4 1 1 4 3 4 2 4 3 3 ...
## $ Fedu : int 4 1 1 2 3 3 2 4 2 4 ...
## $ Mjob : chr "at_home" "at_home" "at_home" "health" ...
## $ Fjob : chr "teacher" "other" "other" "services" ...
## $ reason : chr "course" "course" "other" "home" ...
## $ guardian : chr "mother" "father" "mother" "mother" ...
## $ traveltime: int 2 1 1 1 1 1 1 2 1 1 ...
## $ studytime : int 2 2 2 3 2 2 2 2 2 2 ...
## $ failures : int 0 0 3 0 0 0 0 0 0 0 ...
## $ schoolsup : chr "yes" "no" "yes" "no" ...
## $ famsup : chr "no" "yes" "no" "yes" ...
## $ paid : chr "no" "no" "yes" "yes" ...
## $ activities: chr "no" "no" "no" "yes" ...
## $ nursery : chr "yes" "no" "yes" "yes" ...
## $ higher : chr "yes" "yes" "yes" "yes" ...
## $ internet : chr "no" "yes" "yes" "yes" ...
## $ romantic : chr "no" "no" "no" "yes" ...
## $ famrel : int 4 5 4 3 4 5 4 4 4 5 ...
## $ freetime : int 3 3 3 2 3 4 4 1 2 5 ...
## $ goout : int 4 3 2 2 2 2 4 4 2 1 ...
## $ Dalc : int 1 1 2 1 1 1 1 1 1 1 ...
## $ Walc : int 1 1 3 1 2 2 1 1 1 1 ...
## $ health : int 3 3 3 5 5 5 3 1 1 5 ...
## $ absences : int 6 4 10 2 4 10 0 6 0 0 ...
## $ G1 : int 5 5 7 15 6 15 12 6 16 14 ...
## $ G2 : int 6 5 8 14 10 15 12 5 18 15 ...
## $ G3 : int 6 6 10 15 10 15 11 6 19 15 ...
# Check for existence of null value inside the data
sum(is.na(d1))
## [1] 0
# Remove duplicate data
d1 <- unique(d1)
# Check for summary for whole data
summary(d1)
## school sex age address
## Length:395 Length:395 Min. :15.0 Length:395
## Class :character Class :character 1st Qu.:16.0 Class :character
## Mode :character Mode :character Median :17.0 Mode :character
## Mean :16.7
## 3rd Qu.:18.0
## Max. :22.0
## famsize Pstatus Medu Fedu
## Length:395 Length:395 Min. :0.000 Min. :0.000
## Class :character Class :character 1st Qu.:2.000 1st Qu.:2.000
## Mode :character Mode :character Median :3.000 Median :2.000
## Mean :2.749 Mean :2.522
## 3rd Qu.:4.000 3rd Qu.:3.000
## Max. :4.000 Max. :4.000
## Mjob Fjob reason guardian
## Length:395 Length:395 Length:395 Length:395
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## traveltime studytime failures schoolsup
## Min. :1.000 Min. :1.000 Min. :0.0000 Length:395
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:0.0000 Class :character
## Median :1.000 Median :2.000 Median :0.0000 Mode :character
## Mean :1.448 Mean :2.035 Mean :0.3342
## 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:0.0000
## Max. :4.000 Max. :4.000 Max. :3.0000
## famsup paid activities nursery
## Length:395 Length:395 Length:395 Length:395
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## higher internet romantic famrel
## Length:395 Length:395 Length:395 Min. :1.000
## Class :character Class :character Class :character 1st Qu.:4.000
## Mode :character Mode :character Mode :character Median :4.000
## Mean :3.944
## 3rd Qu.:5.000
## Max. :5.000
## freetime goout Dalc Walc
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:3.000 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:1.000
## Median :3.000 Median :3.000 Median :1.000 Median :2.000
## Mean :3.235 Mean :3.109 Mean :1.481 Mean :2.291
## 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:2.000 3rd Qu.:3.000
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
## health absences G1 G2
## Min. :1.000 Min. : 0.000 Min. : 3.00 Min. : 0.00
## 1st Qu.:3.000 1st Qu.: 0.000 1st Qu.: 8.00 1st Qu.: 9.00
## Median :4.000 Median : 4.000 Median :11.00 Median :11.00
## Mean :3.554 Mean : 5.709 Mean :10.91 Mean :10.71
## 3rd Qu.:5.000 3rd Qu.: 8.000 3rd Qu.:13.00 3rd Qu.:13.00
## Max. :5.000 Max. :75.000 Max. :19.00 Max. :19.00
## G3
## Min. : 0.00
## 1st Qu.: 8.00
## Median :11.00
## Mean :10.42
## 3rd Qu.:14.00
## Max. :20.00
Let us take a look for preview of the data that is not yet processed.
head(d1)
## school sex age address famsize Pstatus Medu Fedu Mjob Fjob reason
## 1 GP F 18 U GT3 A 4 4 at_home teacher course
## 2 GP F 17 U GT3 T 1 1 at_home other course
## 3 GP F 15 U LE3 T 1 1 at_home other other
## 4 GP F 15 U GT3 T 4 2 health services home
## 5 GP F 16 U GT3 T 3 3 other other home
## 6 GP M 16 U LE3 T 4 3 services other reputation
## guardian traveltime studytime failures schoolsup famsup paid activities
## 1 mother 2 2 0 yes no no no
## 2 father 1 2 0 no yes no no
## 3 mother 1 2 3 yes no yes no
## 4 mother 1 3 0 no yes yes yes
## 5 father 1 2 0 no yes yes no
## 6 mother 1 2 0 no yes yes yes
## nursery higher internet romantic famrel freetime goout Dalc Walc health
## 1 yes yes no no 4 3 4 1 1 3
## 2 no yes yes no 5 3 3 1 1 3
## 3 yes yes yes no 4 3 2 2 3 3
## 4 yes yes yes yes 3 2 2 1 1 5
## 5 yes yes no no 4 3 2 1 2 5
## 6 yes yes yes no 5 4 2 1 2 5
## absences G1 G2 G3
## 1 6 5 6 6
## 2 4 5 5 6
## 3 10 7 8 10
## 4 2 15 14 15
## 5 4 6 10 10
## 6 10 15 15 15
Some variables such as school where the student come from, reason choosing the school being removed since these are not applicable as these are not suitable factors to be considered in affecting grade since we treat this to be a more universal problem.
Columns like daily and weekly alcohol consumption being removed as well since proportion of alcohol consumption among students is not significant enough in some countries.
d1 <- subset(d1, select=-c(Dalc, Walc, school, reason))
Remove the grade for the 3 semester where they would be sum up and processed to form a new column “Average.result”
total <- d1$G1 + d1$G2 + d1$G3
Result <- total / 3
d1$Average.result <- Result
d1 <- subset(d1, select=-c(G1, G2, G3))
Set yes/no to TRUE/FALSE (levels for factor)
d1[d1 == "no"] <- "FALSE"
d1[d1 == "yes"] <- "TRUE"
Renaming columns to be a more appropriate names
names(d1)[names(d1) == "sex"] <- "Sex"
names(d1)[names(d1) == "age"] <- "Age"
names(d1)[names(d1) == "address"] <- "Address"
names(d1)[names(d1) == "famsize"] <- "Family.size"
names(d1)[names(d1) == "Pstatus"] <- "Parent.s.cohabition.status"
names(d1)[names(d1) == "Medu"] <- "Mother.s.education"
names(d1)[names(d1) == "Fedu"] <- "Father.s.education"
names(d1)[names(d1) == "Mjob"] <- "Mother.s.job"
names(d1)[names(d1) == "Fjob"] <- "Father.s.job"
names(d1)[names(d1) == "guardian"] <- "Guardian"
names(d1)[names(d1) == "traveltime"] <- "Time.from.home.to.school"
names(d1)[names(d1) == "studytime"] <- "Weekly.study.time"
names(d1)[names(d1) == "failures"] <- "Number.of.past.class.failures"
names(d1)[names(d1) == "schoolsup"] <- "Extra.educational.support"
names(d1)[names(d1) == "famsup"] <- "Family.educational.support"
names(d1)[names(d1) == "paid"] <- "Extra.paid.class"
names(d1)[names(d1) == "activities"] <- "Extra.curricular.activity"
names(d1)[names(d1) == "nursery"] <- "Attended.nursery.school"
names(d1)[names(d1) == "higher"] <- "Aims.for.higher.education"
names(d1)[names(d1) == "internet"] <- "Internet.access"
names(d1)[names(d1) == "romantic"] <- "In.romantic.relationship"
names(d1)[names(d1) == "famrel"] <- "Evaluation.for.family.relationship"
names(d1)[names(d1) == "freetime"] <- "Evaluation.for.free.time.after.school"
names(d1)[names(d1) == "goout"] <- "Frequent.going.out.with.friends"
names(d1)[names(d1) == "health"] <- "Current.health.status"
names(d1)[names(d1) == "absences"] <- "Number.of.school.absences"
Make columns except Age, Number of school absences and “Average result” to be factor
col_names <- names(d1)
col_names <- col_names [-c(2, 26, 27)]
d1[col_names] <- lapply(d1[col_names], factor)
Some graphs would be plotted and then remove for some columns based on reasons. But first, load the suitable library for the visualization
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.0.5
First graph showing for Sex and “Average result”
data1 <- data.frame(Sex = d1$Sex, Average.result = d1$Average.result)
p1 <- ggplot(data1) + geom_col(aes(x=Sex, y=Average.result), position="dodge", fill ="blue")
p1 <- p1 + labs(title = "Plot of Sex against Average result", x = "Sex", y = "Average result")
p1
∴ There is insignificant difference for the two Sex attributes with the “Average result”, remove Sex column is suggested.
To plot following graphs, details below need to be obtained, count for the factors of the selected columns need to be known
summary(d1$Guardian)
## father mother other
## 90 273 32
summary(d1$Mother.s.job)
## at_home health other services teacher
## 59 34 141 103 58
summary(d1$Father.s.job)
## at_home health other services teacher
## 20 18 217 111 29
Pie Chart of Guardian
countG <- c(90, 273, 32)
labG <- c("father", "mother", "other")
percentG <- round(countG / sum(countG)*100)
labG <- paste(labG, percentG)
labG <- paste(labG, "%", sep = " ")
pG <- pie(countG, labels = labG, col=rainbow(length(labG)), main = "Pie Chart of Guardian")
Pie Chart of Mother’s job
countM <- c(59, 24, 141, 103, 58)
labM <- c("at_home", "health", "other", "services", "teacher")
percentM <- round(countM / sum(countM)*100)
labM <- paste(labM, percentM)
labM <- paste(labM, "%", sep = " ")
pM <- pie(countM, labels = labM, col=rainbow(length(labM)), main = "Pie Chart of Mother's job")
Pie Chart of Father’s job
countF <- c(20, 18, 217, 111, 29)
labF <- c("at_home", "health", "other", "services", "teacher")
percentF <- round(countF / sum(countF)*100)
labF <- paste(labF, percentF)
labF <- paste(labF, "%", sep = " ")
pF <- pie(countF, labels = labF, col=rainbow(length(labF)), main = "Pie Chart of Father's job")
∴ Proportion for “Other and services” are too high for column Mother’s job and Father’s job, causing the factor to be considered with parent’s job is meaningless. Hence the both columns should be removed.
Make same graph for Guardian which is same what did with Sex column before
data2 <- data.frame(Guardian = d1$Guardian, Average.result = d1$Average.result)
p2 <- ggplot(data2) + geom_col(aes(x=Guardian, y=Average.result), position="dodge", fill = "red")
p2 <- p2 + labs(title="Plot of Guardian against Average result", x="Guardian", y="Average result")
p2
∴ There is insignificant difference for the three Guardian attributes with the “Average result”, removing Guardian column is required
Removing columns Sex, Guardian, Mother’s job, and Father’s job
d1 <- d1[-c(1, 8, 9, 10)]
Final checking on data processed
summary(d1)
## Age Address Family.size Parent.s.cohabition.status
## Min. :15.0 R: 88 GT3:281 A: 41
## 1st Qu.:16.0 U:307 LE3:114 T:354
## Median :17.0
## Mean :16.7
## 3rd Qu.:18.0
## Max. :22.0
## Mother.s.education Father.s.education Time.from.home.to.school
## 0: 3 0: 2 1:257
## 1: 59 1: 82 2:107
## 2:103 2:115 3: 23
## 3: 99 3:100 4: 8
## 4:131 4: 96
##
## Weekly.study.time Number.of.past.class.failures Extra.educational.support
## 1:105 0:312 FALSE:344
## 2:198 1: 50 TRUE : 51
## 3: 65 2: 17
## 4: 27 3: 16
##
##
## Family.educational.support Extra.paid.class Extra.curricular.activity
## FALSE:153 FALSE:214 FALSE:194
## TRUE :242 TRUE :181 TRUE :201
##
##
##
##
## Attended.nursery.school Aims.for.higher.education Internet.access
## FALSE: 81 FALSE: 20 FALSE: 66
## TRUE :314 TRUE :375 TRUE :329
##
##
##
##
## In.romantic.relationship Evaluation.for.family.relationship
## FALSE:263 1: 8
## TRUE :132 2: 18
## 3: 68
## 4:195
## 5:106
##
## Evaluation.for.free.time.after.school Frequent.going.out.with.friends
## 1: 19 1: 23
## 2: 64 2:103
## 3:157 3:130
## 4:115 4: 86
## 5: 40 5: 53
##
## Current.health.status Number.of.school.absences Average.result
## 1: 47 Min. : 0.000 Min. : 1.333
## 2: 45 1st Qu.: 0.000 1st Qu.: 8.333
## 3: 91 Median : 4.000 Median :10.667
## 4: 66 Mean : 5.709 Mean :10.679
## 5:146 3rd Qu.: 8.000 3rd Qu.:13.333
## Max. :75.000 Max. :19.333
Write the data set where index row won’t be recorded to be stored as another csv, “Cleaned.csv”
write.csv(d1, "Cleaned.csv", row.names = FALSE)
Obtain the features that is needed to train the Random Forest Model. as.factor() is used to convert the variables to be categorical/ factor variables which is the same as the data set processed before
ds <- read.csv("Cleaned.csv", header=TRUE)
ds.model <- data.frame(
Age = ds$Age,
Time.from.home.to.school = as.factor(ds$Time.from.home.to.school),
Number.of.past.class.failures = as.factor(ds$Number.of.past.class.failures),
Extra.educational.support = as.factor(ds$Extra.educational.support),
Family.educational.support = as.factor(ds$Family.educational.support),
Extra.paid.class = as.factor(ds$Extra.paid.class),
Extra.curricular.activity = as.factor(ds$Extra.curricular.activity),
Attended.nursery.school = as.factor(ds$Attended.nursery.school),
Aims.for.higher.education = as.factor(ds$Aims.for.higher.education),
Internet.access = as.factor(ds$Internet.access),
In.romantic.relationship = as.factor(ds$In.romantic.relationship),
Average.result = ds$Average.result)
Since the average result is a type of continuous variable, therefore it is clearly to use supervised learning regression model. There are several regression model can be used for our prediction. We choose Random Forest Regression model to predict the average result. Random Forest Regression model is an ensemble learning with multiple decision tree. It has the advantages on dealing with non-linearity.
The data set obtained from the feature extraction is split into two, which used for training (70%) and testing (30%). Based on our testing set result, the prediction model has a root mean square error (rmse) between 3 to 4, which is relatively moderate.
# Load dplyr library
library(dplyr)
# Load randomForest library to use the model
library(randomForest)
# Split the ds.model to train data frame and test data frame
dt <- sort(sample(nrow(ds.model), nrow(ds.model)*0.7))
train <- ds.model[dt,]
test <- ds.model[-dt,]
# Remove the average result from the test data frame
test <- select(test, -Average.result)
# Train the Random Forest model
model <- randomForest(formula=Average.result ~., data=train)
# Finding root mean square error
actual <- select(ds.model[-dt,], Average.result)
names(actual) <- "actual"
predicted <- predict(model, test)
cal_df <- cbind(actual, predicted)
cal_df$error <- cal_df$actual - cal_df$predicted
rmse <- sqrt(mean(cal_df$error^2))
rmse
## [1] 3.517115
# Plot to see the difference between the actual and predicted based on test set
value <- select(cal_df, actual)
names(value) <- "value"
index_value <- c(1:nrow(value))
group <- "Actual"
actual_df <- cbind(value, index_value, group)
value <- select(cal_df, predicted)
names(value) <- "value"
index_value <- c(1:nrow(value))
group <- "Predicted"
predicted_df <- cbind(value, index_value, group)
plot_df <- rbind(actual_df, predicted_df)
ggplot(plot_df, aes(index_value, value, colour=group)) + geom_point() + ggtitle("Comparison between Actual and Predicted Values based on Test Dataset")
app.R is split into 2 parts, which are ui, in charge of visualizing the dashboard, and server, in charge of manipulation of input and displaying output. There is a total of 3 tabs for the shiny app we have created, first to be “About”, where descriptions of the variable stored and questions proposed would be listed there. Second tab is “Graph”, where a jitter/ scatter graph would be shown, based on input by users on X-axis, Y-axis, facet row and also facet column. The last tab is “Prediction”, where users can manipulate for conditions according to their likes to predict average result of the student under the selected conditions. Root mean square error for the prediction and a graph showing the difference between prediction and actual results would be shown as well. Shinydashboard library is used in beautify the visualization of the product.
Link to the web application: https://limweiyeong99.shinyapps.io/grade/
Below show for the source code of app.R
library(dplyr)
library(ggplot2)
library(randomForest)
library(rsconnect)
library(shiny)
library(shinydashboard)
# Read data set that is processed and cleaned
ds <- read.csv("Cleaned.csv", header=TRUE)
# Data dictionary
Feature <- names(ds)
Description <- c(
"student's age (numeric: from 15 to 22)",
"student's home address type (binary: U - urban or R - rural)",
"family size (binary: LE3 - less or equal to 3 or GT3 - greater than 3)",
"parent's cohabitation status (binary: T - living together or A - apart)",
"mother's education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)",
"father's education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)",
"home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)",
"weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)",
"number of past class failures (numeric: n for 0<= n <=3)",
"extra educational support (binary: yes or no)",
"family educational support (binary: yes or no)",
"extra paid classes within the subject (binary: yes or no)",
"extra-curricular activities (binary: yes or no)",
"attended nursery school (binary: yes or no)",
"wants to take higher education (binary: yes or no)",
"Internet access at home (binary: yes or no)",
"in a romantic relationship (binary: yes or no)",
"quality of family relationships (numeric: from 1 - very bad to 5 - excellent)",
"free time after school (numeric: from 1 - very low to 5 - very high)",
"going out with friends (numeric: from 1 - very low to 5 - very high)",
"current health status (numeric: from 1 - very bad to 5 - very good)",
"number of school absences (numeric: from 0 to 93)",
"average result for the subject within 3 period (continuos: 0 to 20)"
)
dict <- cbind(Feature, Description)
# Obtain the features that is needed to train the Random Forest Model
ds.model <- data.frame(
Age = ds$Age,
Time.from.home.to.school = as.factor(ds$Time.from.home.to.school),
Number.of.past.class.failures = as.factor(ds$Number.of.past.class.failures),
Extra.educational.support = as.factor(ds$Extra.educational.support),
Family.educational.support = as.factor(ds$Family.educational.support),
Extra.paid.class = as.factor(ds$Extra.paid.class),
Extra.curricular.activity = as.factor(ds$Extra.curricular.activity),
Attended.nursery.school = as.factor(ds$Attended.nursery.school),
Aims.for.higher.education = as.factor(ds$Aims.for.higher.education),
Internet.access = as.factor(ds$Internet.access),
In.romantic.relationship = as.factor(ds$In.romantic.relationship),
Average.result = ds$Average.result)
# Split the ds.model to train dataframe and test dataframe
dt <- sort(sample(nrow(ds.model), nrow(ds.model)*0.7))
train <- ds.model[dt,]
test <- ds.model[-dt,]
# Remove the average result from the test data frame
test <- select(test, -Average.result)
# Train the Random Forest model
model <- randomForest(formula=Average.result ~., data=train)
# Finding root mean square error
actual <- select(ds.model[-dt,], Average.result)
names(actual) <- "actual"
predicted <- predict(model, test)
cal_df <- cbind(actual, predicted)
cal_df$error <- cal_df$actual - cal_df$predicted
rmse <- sqrt(mean(cal_df$error^2))
# Plot to see the difference between the actual and predicted based on test set
value <- select(cal_df, actual)
names(value) <- "value"
index_value <- c(1:nrow(value))
group <- "Actual"
actual_df <- cbind(value, index_value, group)
value <- select(cal_df, predicted)
names(value) <- "value"
index_value <- c(1:nrow(value))
group <- "Predicted"
predicted_df <- cbind(value, index_value, group)
plot_df <- rbind(actual_df, predicted_df)
ui <- dashboardPage(
title = "Predict your grade",
skin = "purple",
dashboardHeader(title = "Predict Your Grade"),
dashboardSidebar(
sidebarMenu(
menuItem("About", tabName = "about", icon = icon("info")),
menuItem("Graph", tabName = "graph", icon = icon("chart-bar")),
menuItem("Prediction", tabName = "prediction", icon = icon("smile-wink"))
)
),
dashboardBody(
tabItems(
# First tab content
tabItem(tabName = "about",
fluidRow(
box(
width = 12,
title = "Questions",
h4("i. What are the factors that can affect grade of students?"),
h4("ii. How much do these factors affect grade of the students?"),
h4("iii. What is the relationship between these factors?")
)
),
fluidRow(
box(
width = 12,
title = "Data Dictionary",
tableOutput("details")
)
),
fluidRow(
box(
width = 12,
title = "Want to understand more?",
a("Click here to Github", href="https://github.com/YeongLIM99/grade")
)
)
),
tabItem(tabName = "graph",
fluidRow(
box(
title = "Parameter",
sliderInput("Population", "Choose number of students", min = 30, max = 395, value = 200),
# Below are inputs for graph
selectInput("x_axis", "Choose for X-axis", choices=names(ds), selected=names(ds)[1]),
selectInput("y_axis", "Choose for Y-axis", choices=names(ds), selected=names(ds)[2]),
# None would be represented with "." in syntax, where there will be no chosen variable
selectInput('facet_row', 'Choose the facet row', choices=c(None='.', names(ds))),
selectInput('facet_col', 'Choose the facet column', choices=c(None='.', names(ds)))
),
box(
title = h1("Plot for variables"),
plotOutput("graph"),
helpText("Please refer to About for more information.")
)
)
),
tabItem(tabName = "prediction",
fluidRow(
box(
title = "Parameter",
# Below are inputs for prediction
checkboxGroupInput('check_box', 'Select the variable(s) to be TRUE',
choiceNames=c('Extra.educational.support','Family.educational.support','Extra.paid.class',
'Extra.curricular.activity', 'Attended.nursery.school', 'Aims.for.higher.education',
'Internet.access', 'In.romantic.relationship'),
choiceValues=c('Extra.educational.support','Family.educational.support','Extra.paid.class',
'Extra.curricular.activity', 'Attended.nursery.school', 'Aims.for.higher.education',
'Internet.access', 'In.romantic.relationship')
),
selectInput('age', 'Age', choices = as.integer(levels(factor(ds$Age)))),
selectInput('time', 'Time.from.home.to.school', choices = c("< 15min","15 - 30min","30min - 1hour", "> 1hour")),
selectInput('fail', 'Number.of.past.class.failures', choices = levels(factor(ds$Number.of.past.class.failures)))
),
box(
title = h1("Predict the grade"),
h4("Root mean square error for this prediction model"), verbatimTextOutput("rmse"),
h4("Plot showing difference for actual and prediction based on test dataset"), plotOutput("difference"),
h4("Prediction for average result of the student according to the condition chosen (0-20)"), verbatimTextOutput("predict")
)
)
)
)
)
)
server <- function(input, output){
data_shown <- reactive({
ds[1:input$Population,]
})
# Prevent facet row and column having same variables chosen
facet_row <- reactive({
validate(
need((input$facet_row != input$facet_col) || input$facet_row == '.',
"Please choose another variable for facet row")
)
input$facet_row
})
facet_column <- reactive({
validate(
need((input$facet_row != input$facet_col) || input$facet_col == '.',
"Please choose another variable for facet row")
)
input$facet_col
})
# For graph tab
output$graph <- renderPlot({
# Jitter graph or scatter graph is drawn to show distribution for the variables
p <- ggplot(data_shown(), aes_string(x=input$x_axis, y=input$y_axis)) +
geom_jitter(width = 0.25, height = 0.25)
# Declare for facet row and facet column to be shown in graph in case not a null value
facets <- paste(facet_row(), '~', facet_column())
if (facets != '. ~ .')
p <- p + facet_grid(facets)
# Output the graph
p
})
# For prediction tab menu
output$predict <- renderText({
# Code the user input
if (input$time == "< 15min") {
time = 1
}
else if (input$time == "15 - 30min") {
time = 2
}
else if (input$time == "30min - 1hour") {
time = 3
}
else if (input$time == "> 1hour"){
time = 4
}
# Let those boolean variable to be in form of factor and FALSE at neutral
input_data <- data.frame(
Age = as.integer(input$age),
Time.from.home.to.school = factor(time, levels=levels(ds.model$Time.from.home.to.school)),
Number.of.past.class.failures = factor(input$fail, levels=levels(ds.model$Number.of.past.class.failures)),
Extra.educational.support = factor(FALSE, levels=levels(ds.model$Extra.educational.support)),
Family.educational.support = factor(FALSE, levels=levels(ds.model$Family.educational.support)),
Extra.paid.class = factor(FALSE, levels=levels(ds.model$Extra.paid.class)),
Extra.curricular.activity = factor(FALSE, levels=levels(ds.model$Extra.curricular.activity)),
Attended.nursery.school = factor(FALSE, levels=levels(ds.model$Attended.nursery.school)),
Aims.for.higher.education = factor(FALSE, levels=levels(ds.model$Aims.for.higher.education)),
Internet.access = factor(FALSE, levels = levels(ds.model$Internet.access)),
In.romantic.relationship = factor(FALSE, levels=levels(ds.model$In.romantic.relationship))
)
# If checkbox tick for the boolean variable, set as true for those applicable
if(length(input$check_box) > 0){
if("Extra.educational.support" %in% input$check_box){
input_data$Extra.educational.support = TRUE
}
if("Family.educational.support" %in% input$check_box){
input_data$Family.educational.support = TRUE
}
if("Extra.paid.class" %in% input$check_box){
input_data$Extra.paid.class = TRUE
}
if("Extra.curricular.activity" %in% input$check_box){
input_data$Extra.curricular.activity = TRUE
}
if("Attended.nursery.school" %in% input$check_box){
input_data$Attended.nursery.school = TRUE
}
if("Aims.for.higher.education" %in% input$check_box){
input_data$Aims.for.higher.education = TRUE
}
if("Internet.access" %in% input$check_box){
input_data$Internet.access = TRUE
}
if("In.romantic.relationship" %in% input$check_box){
input_data$In.romantic.relationship = TRUE
}
}
test <- rbind(test, input_data)
testrow <- test[nrow(test), ]
test <- test[1:nrow(test) - 1, ]
predict(model, testrow)
})
# Render rsme
output$rmse <- renderText({
paste(sqrt(mean(cal_df$error^2)))
})
# Render prediction plot to visualise the accuracy
output$difference <- renderPlot({
ggplot(plot_df, aes(index_value, value, colour=group)) + geom_point() + ggtitle("Comparison between Actual and Predicted Values based on Test Dataset")
})
# Render the data dictionary table
output$details <- renderTable({
dict
})
}
# Run the application
shinyApp(ui = ui, server = server)
Below show for images taken based on our product.
We have listed the factors that affect the grade of students in “About” tab, making some analysis and visualization for determining how factors affecting the grade as well as relationship between the factors in “Graph” tab. Further checking on extend factors affecting grade of students can be determined at “Predict” tab.
We learned lot on developing the data product, yet there are still some challenges faced by us in completing the assignment, and they are listed as shown:
Defining a good question for the project and finding and appropriate data set regarding the question
Inexperience in operation related to prediction and choosing appropriate prediction model
Communication with teammates is halted because of the online learning environment
Lack of experience in members in R programming language
There are need for ensuring academic performance of the student to be improved, where this required more effort in determining factors affecting the students and solving them one by one. Suitable helps such as financial support or meeting with parents for further discussion on student’s daily life should be determined. This require attention from not only students, but also parents, school, government and even society since these students who will be leading the world in future and academic performance stands for capability of them.