Before you run the code in this coursebook, it is a good idea to go through some initial setup. Under the Libraries and Setup tab you’ll see code that initializes our workspace and loads the libraries we’ll be using for the projects. Make sure those libraries are installed beforehand by referring to the packages listed here.
This coursebook is my response to Algoritma’s Learn By Building (LBB) Data Visualization (DV) Workshop. Its purpose is to evaluate whether I, as a participant, understand the subject well enough, and it is a great way to get hands-on experience with R, specifically with ggplot2 as the package that implements the Grammar of Graphics.
We’ll set up caching for this notebook, given how computationally expensive some of the code we will write can get.
options(scipen = 9999)
rm(list=ls())
You will need to use install.packages(), or you can open the ‘Packages’ tab at the bottom right of RStudio, as shown in the following screenshots.
Install Package
Then search the packages required
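If you prefer the console over the Packages tab, a check along the following lines could report which packages still need installing. This is only a sketch: the helper name missing_packages() is made up for illustration, and the package list is assumed from the library() calls used in this coursebook.

```r
# Hypothetical helper (not part of any package): returns the entries of
# `required` whose namespace cannot be loaded on this machine.
missing_packages <- function(required) {
  installed <- vapply(required, requireNamespace, logical(1), quietly = TRUE)
  required[!installed]
}

# Package names assumed from the library() calls in this coursebook.
required <- c("ggplot2", "GGally", "ggthemes", "ggpubr", "lubridate", "dplyr")
to_install <- missing_packages(required)
to_install

# Uncomment to install whatever is missing:
# if (length(to_install) > 0) install.packages(to_install)
```

The install step is left commented out so that running the sketch never changes your library without you asking for it.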
For the purpose of this coursebook, we are going to install the packages that appear in the library() calls below.
Use install.packages() to install any packages that are not already downloaded onto your machine. You then load each package into your workspace using the library() function:
library(ggplot2)
library(GGally)
library(ggthemes)
library(ggpubr)
library(lubridate)
library(dplyr)
The dataset consists of the marks secured by students in various subjects, and is accessible from Kaggle Student Performance in Exams.
The inspiration is to understand the influence of the parents’ background, test preparation, and so on, on students’ performance. It comprises 1,000 rows and 8 columns.
With this coursebook, we are going to put some urban myths about students’ exam performance to the test.
Now, let’s get our hands dirty. Before that, make sure that you are working in the right directory by calling the getwd() function.
getwd()
## [1] "D:/Development/MachineLearning/Algoritma Workshop/[Algoritma] Data Visualization/Hendri_DV_Assignment"
You will see three directories, and the CSV file is located in data_input/. You can verify that the file exists by running the list.files() function.
list.files("data_input/")
## [1] "StudentsPerformance.csv"
Next, we load the data into the global environment.
studentPerformance <- read.csv("data_input/StudentsPerformance.csv", stringsAsFactors = FALSE)
Then, we examine the structure of studentPerformance.
str(studentPerformance)
## 'data.frame': 1000 obs. of 8 variables:
## $ gender : chr "female" "female" "female" "male" ...
## $ race.ethnicity : chr "group B" "group C" "group B" "group A" ...
## $ parental.level.of.education: chr "bachelor's degree" "some college" "master's degree" "associate's degree" ...
## $ lunch : chr "standard" "standard" "standard" "free/reduced" ...
## $ test.preparation.course : chr "none" "completed" "none" "none" ...
## $ math.score : int 72 69 90 47 76 71 88 40 64 38 ...
## $ reading.score : int 72 90 95 57 78 83 95 43 64 60 ...
## $ writing.score : int 74 88 93 44 75 78 92 39 67 50 ...
As you can see, all the non-numeric columns are read as character. In R versions before 4.0.0, read.csv() would implicitly coerce them to factor by default; here that does not happen because the ‘stringsAsFactors’ parameter is set to FALSE (which is also the default from R 4.0.0 onwards). This is done on purpose, so that we can go through the step of deciding which variables need to be converted to factor.
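The effect of stringsAsFactors can be seen on a tiny in-memory CSV. This is a minimal sketch: the two-row text below is made up for illustration and only borrows two column names from the dataset.

```r
# Made-up two-row CSV, passed to read.csv() through its `text` argument.
csv_text <- "gender,math.score\nfemale,72\nmale,47\n"

as_chr <- read.csv(text = csv_text, stringsAsFactors = FALSE)
as_fct <- read.csv(text = csv_text, stringsAsFactors = TRUE)

class(as_chr$gender)  # character: strings are left as they are
class(as_fct$gender)  # factor: strings are coerced on import
```

Reading as character first keeps the decision about which columns become factors explicit.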
A factor is normally used to represent categorical data, such as gender (male, female) or type of payment (cash, credit, transfer).
Fortunately, on inspection all of these ‘character’ variables are categorical and can be used as factors. So, let’s convert them to factor.
studentPerformance$gender <- as.factor(studentPerformance$gender)
studentPerformance$race.ethnicity <- as.factor(studentPerformance$race.ethnicity)
studentPerformance$parental.level.of.education <- as.factor(studentPerformance$parental.level.of.education)
studentPerformance$lunch <- as.factor(studentPerformance$lunch)
studentPerformance$test.preparation.course <- as.factor(studentPerformance$test.preparation.course)
Now, check the structure again to make sure that all the required variables are factors.
str(studentPerformance)
## 'data.frame': 1000 obs. of 8 variables:
## $ gender : Factor w/ 2 levels "female","male": 1 1 1 2 2 1 1 2 2 1 ...
## $ race.ethnicity : Factor w/ 5 levels "group A","group B",..: 2 3 2 1 3 2 2 2 4 2 ...
## $ parental.level.of.education: Factor w/ 6 levels "associate's degree",..: 2 5 4 1 5 1 5 5 3 3 ...
## $ lunch : Factor w/ 2 levels "free/reduced",..: 2 2 2 1 2 2 2 1 1 1 ...
## $ test.preparation.course : Factor w/ 2 levels "completed","none": 2 1 2 2 2 2 1 2 1 2 ...
## $ math.score : int 72 69 90 47 76 71 88 40 64 38 ...
## $ reading.score : int 72 90 95 57 78 83 95 43 64 60 ...
## $ writing.score : int 74 88 93 44 75 78 92 39 67 50 ...
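The five as.factor() calls above could also be written as a single conversion over a vector of column names. The sketch below uses a made-up two-row data frame (toy) rather than the real data, and the name factor_cols is only illustrative.

```r
# Toy stand-in for studentPerformance; values are made up for illustration.
toy <- data.frame(
  gender     = c("female", "male"),
  lunch      = c("standard", "free/reduced"),
  math.score = c(72L, 47L),
  stringsAsFactors = FALSE
)

# Convert only the listed columns; numeric columns are left untouched.
factor_cols <- c("gender", "lunch")
toy[factor_cols] <- lapply(toy[factor_cols], as.factor)

str(toy)
```

Listing the columns by name keeps the choice of which variables become factors as deliberate as in the one-by-one version.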
Then we remove any meaningless variables that are not required for our analysis. First, peek at the data using the head() function.
head(studentPerformance)
##   gender race.ethnicity parental.level.of.education        lunch
## 1 female        group B           bachelor's degree     standard
## 2 female        group C                some college     standard
## 3 female        group B             master's degree     standard
## 4   male        group A          associate's degree free/reduced
## 5   male        group C                some college     standard
## 6 female        group B          associate's degree     standard
##   test.preparation.course math.score reading.score writing.score
## 1                    none         72            72            74
## 2               completed         69            90            88
## 3                    none         90            95            93
## 4                    none         47            57            44
## 5                    none         76            78            75
## 6                    none         71            83            78
Fortunately again, the data are good to go. The meaningless variables I referred to would be things like the student’s name, the student ID, or even the parents’ names. It makes no sense to use them for analysis.
Moving forward, we transform each score into a grade (i.e. A, A-, B+, etc.) by adding a ‘grade’ variable, and add a weight that is later used to determine the Grade Point Average (GPA).
To convert a score to a grade, we define the getGrade() function, as below:
getGrade <- function(score) {
  if(score >= 90 & score <= 100) {
    return("A")
  } else {
    if(score >= 85 & score <= 89) {
      return("A-")
    } else {
      if(score >= 80 & score <= 84) {
        return("B+")
      } else {
        if(score >= 75 & score <= 79) {
          return("B")
        } else {
          if(score >= 70 & score <= 74) {
            return("B-")
          } else {
            if(score >= 65 & score <= 69) {
              return("C")
            } else {
              if(score >= 50 & score <= 64) {
                return("D")
              } else {
                if(score > 0 & score <= 49) {
                  return("E")
                } else {
                  return("F")
                }
              }
            }
          }
        }
      }
    }
  }
}
Looks complicated? Especially with all those brackets {} hanging there. We will come back with a ‘more readable and easier to follow’ version later on.
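For comparison, the same bands can be looked up for a whole vector at once with base R’s findInterval(). This is a different technique from the one used in this coursebook, shown only as a sketch: the name getGradeVec() is made up, and it assumes whole-number scores with no missing values.

```r
# Hypothetical vectorised counterpart of getGrade(); assumes integer scores
# and no NA values.
getGradeVec <- function(score) {
  lower  <- c(1, 50, 65, 70, 75, 80, 85, 90)        # lower bound of each band
  labels <- c("E", "D", "C", "B-", "B", "B+", "A-", "A")
  idx <- findInterval(score, lower)                 # 0 when score is below 1
  out <- rep("F", length(score))                    # default, as in getGrade()
  in_range <- idx > 0 & score <= 100
  out[in_range] <- labels[idx[in_range]]
  out
}

getGradeVec(c(72, 69, 90, 47, 0))
```

Because the function takes a vector, it can be applied to a whole score column directly, without sapply().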
The getWeight() function returns the weight point for a given grade, using the switch() function.
getWeight <- function(grade) {
  weight <- switch(grade,
                   "A"  = 4.00,
                   "A-" = 3.67,
                   "B+" = 3.33,
                   "B"  = 3.00,
                   "B-" = 2.50,
                   "C"  = 2.0,
                   "D"  = 1.0,
                   "E"  = 0.00,
                   "F"  = 0.00)
  return(weight)
}
Then, we add the ‘math.grade’, ‘reading.grade’ and ‘writing.grade’ variables to the data, each containing a grade based on the corresponding score.
studentPerformance$math.grade <- sapply(studentPerformance$math.score, FUN=getGrade)
studentPerformance$reading.grade <- sapply(studentPerformance$reading.score, FUN=getGrade)
studentPerformance$writing.grade <- sapply(studentPerformance$writing.score, FUN=getGrade)
And add a weight based on each grade.
studentPerformance$math.weight <- sapply(studentPerformance$math.grade, FUN=getWeight)
studentPerformance$reading.weight <- sapply(studentPerformance$reading.grade, FUN=getWeight)
studentPerformance$writing.weight <- sapply(studentPerformance$writing.grade, FUN=getWeight)
As promised, below is the more readable alternative version using the with() and ifelse() functions. Don’t get it wrong: more readable does not mean shorter.
studentPerformance$math.grade <- with(studentPerformance,
  ifelse(math.score >= 90 & math.score <= 100, "A",
  ifelse(math.score >= 85 & math.score <= 89, "A-",
  ifelse(math.score >= 80 & math.score <= 84, "B+",
  ifelse(math.score >= 75 & math.score <= 79, "B",
  ifelse(math.score >= 70 & math.score <= 74, "B-",
  ifelse(math.score >= 65 & math.score <= 69, "C",
  ifelse(math.score >= 50 & math.score <= 64, "D",
  ifelse(math.score >= 0 & math.score <= 49, "E", "F"  # 0 to 49 is "E", as in getGrade()
  )))))))))
Note that the outputs printed later in this coursebook were produced while this last band was still labelled "D", which is why math.grade appears there with seven levels and no "E".
Now that we have the grades and weights, we can determine the GPA and add it as the GPA variable.
studentPerformance$GPA <- apply(cbind(studentPerformance$math.weight, studentPerformance$reading.weight, studentPerformance$writing.weight), 1, FUN=mean)
Let’s improvise a bit by adding another variable that determines whether the student is eligible to graduate, with a minimum passing GPA of 2.00.
studentPerformance$eligible <- with(studentPerformance, GPA >= 2)
We have grown studentPerformance from 8 to 16 variables for our observation. Personally, when I examined the data again, I found the ‘lunch’ variable meaningless for this analysis, so I decided to remove it using the subset() function.
studentPerformance <- subset(studentPerformance, select=-c(lunch))
str(studentPerformance)
## 'data.frame': 1000 obs. of 15 variables:
## $ gender : Factor w/ 2 levels "female","male": 1 1 1 2 2 1 1 2 2 1 ...
## $ race.ethnicity : Factor w/ 5 levels "group A","group B",..: 2 3 2 1 3 2 2 2 4 2 ...
## $ parental.level.of.education: Factor w/ 6 levels "associate's degree",..: 2 5 4 1 5 1 5 5 3 3 ...
## $ test.preparation.course : Factor w/ 2 levels "completed","none": 2 1 2 2 2 2 1 2 1 2 ...
## $ math.score : int 72 69 90 47 76 71 88 40 64 38 ...
## $ reading.score : int 72 90 95 57 78 83 95 43 64 60 ...
## $ writing.score : int 74 88 93 44 75 78 92 39 67 50 ...
## $ math.grade : chr "B-" "C" "A" "D" ...
## $ reading.grade : chr "B-" "A" "A" "D" ...
## $ writing.grade : chr "B-" "A-" "A" "E" ...
## $ math.weight : num 2.5 2 4 0 3 2.5 3.67 0 1 0 ...
## $ reading.weight : num 2.5 4 4 1 3 3.33 4 0 1 1 ...
## $ writing.weight : num 2.5 3.67 4 0 3 3 4 0 2 1 ...
## $ GPA : num 2.5 3.223 4 0.333 3 ...
## $ eligible : logi TRUE TRUE TRUE FALSE TRUE TRUE ...
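The weight and GPA columns shown in the structure above could also be computed without sapply() and apply(), using a named vector as a lookup table and rowMeans() for the average. This is a sketch on a made-up three-row data frame (toy); the name grade_weights is illustrative, while the weights themselves are copied from getWeight().

```r
# Lookup table: names are grades, values are the weights from getWeight().
grade_weights <- c("A" = 4.00, "A-" = 3.67, "B+" = 3.33, "B" = 3.00,
                   "B-" = 2.50, "C" = 2.00, "D" = 1.00, "E" = 0.00, "F" = 0.00)

# Toy data; grades chosen to mirror the first three rows printed above.
toy <- data.frame(
  math.grade    = c("B-", "C", "A"),
  reading.grade = c("B-", "A", "A"),
  writing.grade = c("B-", "A-", "A"),
  stringsAsFactors = FALSE
)

# as.character() guards against factor columns being used as integer indices.
toy$math.weight    <- unname(grade_weights[as.character(toy$math.grade)])
toy$reading.weight <- unname(grade_weights[as.character(toy$reading.grade)])
toy$writing.weight <- unname(grade_weights[as.character(toy$writing.grade)])

# GPA is the row-wise mean of the three weights.
weight_cols <- c("math.weight", "reading.weight", "writing.weight")
toy$GPA <- unname(rowMeans(toy[weight_cols]))
toy$eligible <- toy$GPA >= 2

toy
```

The second row reproduces the GPA of roughly 3.223 seen in the str() output: (2 + 4 + 3.67) / 3.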
We find that math.grade, reading.grade and writing.grade are all of character type, so we need to convert them into factors for our analysis.
studentPerformance$math.grade <- as.factor(studentPerformance$math.grade)
studentPerformance$reading.grade <- as.factor(studentPerformance$reading.grade)
studentPerformance$writing.grade <- as.factor(studentPerformance$writing.grade)
Check again to verify.
str(studentPerformance)
## 'data.frame': 1000 obs. of 15 variables:
## $ gender : Factor w/ 2 levels "female","male": 1 1 1 2 2 1 1 2 2 1 ...
## $ race.ethnicity : Factor w/ 5 levels "group A","group B",..: 2 3 2 1 3 2 2 2 4 2 ...
## $ parental.level.of.education: Factor w/ 6 levels "associate's degree",..: 2 5 4 1 5 1 5 5 3 3 ...
## $ test.preparation.course : Factor w/ 2 levels "completed","none": 2 1 2 2 2 2 1 2 1 2 ...
## $ math.score : int 72 69 90 47 76 71 88 40 64 38 ...
## $ reading.score : int 72 90 95 57 78 83 95 43 64 60 ...
## $ writing.score : int 74 88 93 44 75 78 92 39 67 50 ...
## $ math.grade : Factor w/ 7 levels "A","A-","B","B-",..: 4 6 1 7 3 4 2 7 7 7 ...
## $ reading.grade : Factor w/ 8 levels "A","A-","B","B-",..: 4 1 1 7 3 5 1 8 7 7 ...
## $ writing.grade : Factor w/ 8 levels "A","A-","B","B-",..: 4 2 1 8 3 3 1 8 6 7 ...
## $ math.weight : num 2.5 2 4 0 3 2.5 3.67 0 1 0 ...
## $ reading.weight : num 2.5 4 4 1 3 3.33 4 0 1 1 ...
## $ writing.weight : num 2.5 3.67 4 0 3 3 4 0 2 1 ...
## $ GPA : num 2.5 3.223 4 0.333 3 ...
## $ eligible : logi TRUE TRUE TRUE FALSE TRUE TRUE ...
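as.factor() sorts the levels alphabetically, which is why the grades above run "A", "A-", "B", "B-" rather than from best to worst. If that order matters for tables and plots, the levels can be given explicitly. A small sketch, where grade_order and the sample grades are made up for illustration:

```r
# Desired order, best grade first (assumed from the grading scheme above).
grade_order <- c("A", "A-", "B+", "B", "B-", "C", "D", "E", "F")

grades <- c("B-", "C", "A", "D", "B+")

# factor() with explicit levels keeps the order we specify.
grade_factor <- factor(grades, levels = grade_order)

levels(grade_factor)
table(grade_factor)   # rows follow grade_order, unused grades show as 0
```

Unused levels can be dropped again with droplevels() if the zero rows are not wanted.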
Myth no. 1: does a particular race really excel at math? I am sure you have heard this number one myth, that a particular race or two excel in mathematics.
First, check the distribution of students across the groups to get a feel for the sample sizes.
table(studentPerformance$race.ethnicity)
## 
## group A group B group C group D group E 
##      89     190     319     262     140
Display the math.grade distribution among the groups.
table(studentPerformance$math.grade, studentPerformance$race.ethnicity)
##     
##      group A group B group C group D group E
##   A        4       8      16       8      22
##   A-       2       9      13      20      15
##   B        9      15      28      35      15
##   B-       7      21      34      32      20
##   B+       3      12      20      23      18
##   C       11      27      51      40      15
##   D       53      98     157     104      35
Then the proportions of the overall total, using the prop.table() function.
prop.table(table(studentPerformance$math.grade, studentPerformance$race.ethnicity))
##     
##      group A group B group C group D group E
##   A    0.004   0.008   0.016   0.008   0.022
##   A-   0.002   0.009   0.013   0.020   0.015
##   B    0.009   0.015   0.028   0.035   0.015
##   B-   0.007   0.021   0.034   0.032   0.020
##   B+   0.003   0.012   0.020   0.023   0.018
##   C    0.011   0.027   0.051   0.040   0.015
##   D    0.053   0.098   0.157   0.104   0.035
From the table above, group E has the most A grades (22) and the fewest D grades (35), even though it is not the largest group.
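Called without a margin, prop.table() divides every cell by the grand total, so larger groups weigh more. To compare groups of different sizes, the shares can be computed within each group instead. A sketch, with the counts copied from the table above into a matrix (the object names counts and shares are illustrative):

```r
# Counts copied from the math.grade by race.ethnicity table above.
counts <- matrix(
  c( 4,  8,  16,   8, 22,
     2,  9,  13,  20, 15,
     9, 15,  28,  35, 15,
     7, 21,  34,  32, 20,
     3, 12,  20,  23, 18,
    11, 27,  51,  40, 15,
    53, 98, 157, 104, 35),
  nrow = 7, byrow = TRUE,
  dimnames = list(grade = c("A", "A-", "B", "B-", "B+", "C", "D"),
                  group = paste("group", LETTERS[1:5]))
)

# margin = 2 divides each column by its own total, so every column sums to 1.
shares <- prop.table(counts, margin = 2)
round(shares, 3)
```

Read column by column, the result shows what fraction of each group landed in each grade, which is the comparison the myth is really about.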
We use a bar chart to visualize this.
A brief word on the ggplot2 library: as you may already know, the gg in ggplot2 stands for Grammar of Graphics. In layman’s terms, you need to know the structure of the grammar before you draw a graphic. To draw a graphic, think of it as a canvas on which you add layer upon layer in order to create a beautiful scene.
According to Hadley Wickham [1], the components that make up a plot are: data and aesthetic mappings, geometric objects, scales, and facet specification.
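The four components can be mapped onto ggplot2 calls one at a time. The sketch below uses R’s built-in mtcars data rather than the student data, purely for illustration, and assumes ggplot2 is installed.

```r
library(ggplot2)

# 1. Data and aesthetic mappings
p <- ggplot(mtcars, aes(x = factor(cyl), fill = factor(gear)))

# 2. Geometric object
p <- p + geom_bar()

# 3. Scale
p <- p + scale_y_continuous(breaks = seq(0, 14, 2))

# 4. Facet specification
p <- p + facet_wrap(~ am)

p
```

Each + adds one component to the same plot object, which is why the plot can be built up and inspected step by step.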
We use studentPerformance as the data, map its variables with aesthetics, add a geometric object (a bar, in our case), and then add the scale, the labels and, lastly, the theme.
ggplot(studentPerformance, mapping = aes(x = math.grade, fill = race.ethnicity)) +
  geom_bar() +
  scale_y_continuous(limits = c(0, 500), breaks = seq(0, 500, 50)) +
  # The y axis shows counts; the race group is carried by the fill colour
  labs(x = "Math Grade",
       y = "Number of Students",
       fill = "Race Group",
       title = "The Urban Myth #1",
       subtitle = "Does a particular race excel at math?") +
  theme_bw()
Notice that I use the + sign to add one layer on top of another.
ggplot(studentPerformance, aes(x = race.ethnicity, fill = math.grade)) +
  theme_bw() +
  geom_bar() +
  # Here the race group is on the x axis and the grade is the fill colour
  labs(x = "Race Group",
       y = "Number of Students",
       fill = "Math Grade",
       title = "The Urban Myth #1",
       subtitle = "Does a particular race excel at math?")
However, I am still not convinced that one particular race outperforms the others, and the bar chart may not be the best way to view this. Let’s look at the average math score per group and then use a boxplot instead.
xtabs(formula = math.score ~ race.ethnicity,
      aggregate(math.score ~ race.ethnicity,
                data = studentPerformance, mean))
## race.ethnicity
##  group A  group B  group C  group D  group E 
## 61.62921 63.45263 64.46395 67.36260 73.82143
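To put those averages in relative terms, the gap between group E and each other group can be computed directly. A sketch, with the means copied from the output above (the names group_means and pct_higher are illustrative):

```r
# Mean math scores copied from the xtabs() output above.
group_means <- c("group A" = 61.62921, "group B" = 63.45263,
                 "group C" = 64.46395, "group D" = 67.36260,
                 "group E" = 73.82143)

others <- group_means[names(group_means) != "group E"]

# How much higher group E's mean is than each other group's mean, in percent.
pct_higher <- (group_means[["group E"]] / others - 1) * 100
round(pct_higher, 1)
```

The gap is widest against group A and narrowest against group D.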
As the numbers suggest, group E’s average math score (73.8) is roughly 10% to 20% higher than the averages of the other groups (61.6 to 67.4). But let’s look at the visual to get a clearer view.
ggplot(data = studentPerformance, mapping = aes(x = race.ethnicity, y = math.score)) +
  geom_boxplot()
Looking at the graph, group E does appear to outperform the rest, and Myth #1 may be true. Now, for the sake of a publishable report, let’s add a scale, title, caption, and so on.
ggplot(data = studentPerformance, mapping = aes(x = race.ethnicity, y = math.score, col = race.ethnicity)) +
  theme_bw() +
  geom_boxplot() +
  scale_y_continuous(limits = c(0, 110), breaks = seq(0, 110, 10)) +
  labs(title = "The Urban Myth #1", subtitle = "Does a particular race excel at math?", x = "Race Group", y = "Math Score", caption = "Source: https://www.kaggle.com/spscientist/students-performance-in-exams") +
  theme(panel.grid.minor = element_blank())
ggplot(data = studentPerformance, mapping = aes(x = race.ethnicity, y = math.weight, col = race.ethnicity)) +
  theme_bw() +
  geom_boxplot() +
  scale_y_continuous(limits = c(0, 4), breaks = seq(0, 4, 0.5)) +
  # This plot shows the grade weight (0 to 4), so the y label says so
  labs(title = "The Urban Myth #1", subtitle = "Does a particular race excel at math?", x = "Race Group", y = "Math Weight", caption = "Source: https://www.kaggle.com/spscientist/students-performance-in-exams") +
  theme(panel.grid.minor = element_blank())
Okay, now I’m convinced, based on this sample, that one particular group outperforms the others at math. So, unfortunately, I have to agree that Myth #1 is legit.
We now move on to the next urban myth: does practice really make perfect? This is very straightforward; we just need to check whether the test preparation variable is in line with the results.
table(studentPerformance$test.preparation.course, studentPerformance$eligible)
##            
##             FALSE TRUE
##   completed   113  245
##   none        350  292
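Since far more students skipped the course than completed it, row-wise proportions make the two rows easier to compare. A sketch, with the counts copied from the table above (the names prep_counts and rates are illustrative):

```r
# Counts copied from the test.preparation.course by eligible table above.
prep_counts <- matrix(
  c(113, 245,
    350, 292),
  nrow = 2, byrow = TRUE,
  dimnames = list(prep = c("completed", "none"),
                  eligible = c("FALSE", "TRUE"))
)

# margin = 1 divides each row by its own total, so every row sums to 1.
rates <- prop.table(prep_counts, margin = 1)
round(rates, 3)
```

The TRUE column then gives the share of eligible students within each preparation group.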
As you can see, those who completed the preparation course have a better chance of being eligible to graduate: 245 of 358 (about 68%) versus 292 of 642 (about 45%). Wait... what if group E has done the preparation more than the other groups?
# Columns are referenced by name inside aes(); the stacking order is handled by position_stack()
ggplot(studentPerformance, aes(x = race.ethnicity, fill = test.preparation.course)) +
  theme_bw() +
  geom_bar(position = position_stack(reverse = TRUE)) +
  labs(title = "The Urban Myth #2", subtitle = "Does practice really make perfect?", x = "Race Group", y = "Number of samples", fill = "Test Preparation", caption = "Source: https://www.kaggle.com/spscientist/students-performance-in-exams") +
  theme(panel.grid.minor = element_blank(), axis.text.x = element_text(angle = 45, hjust = 1))
Now that we have plenty of charts, let’s try something fancier and add a touch of interactivity on top of ggplot2, using plotly.
library(plotly)
# Build the same stacked bar chart as a ggplot object first
p <- ggplot(studentPerformance, aes(x = race.ethnicity, fill = test.preparation.course)) +
  theme_bw() +
  geom_bar(position = position_stack(reverse = TRUE)) +
  labs(title = "The Urban Myth #2", subtitle = "Does practice really make perfect?", x = "Race Group", y = "Number of samples", fill = "Test Preparation", caption = "Source: https://www.kaggle.com/spscientist/students-performance-in-exams") +
  theme(panel.grid.minor = element_blank(), axis.text.x = element_text(angle = 45, hjust = 1))
# Convert the static ggplot into an interactive plotly widget
p <- ggplotly(p)
p