Before you run the code in this coursebook, it’s a good idea to go through some initial setup. Under the Libraries and Setup tab you’ll find the code that initializes our workspace and loads the libraries we’ll be using for the project. Make sure the libraries are installed beforehand by referring to the packages listed there.

1 Background

1.1 Introduction

This coursebook is a response to Algoritma’s Learn By Building (LBB) assignment for the Data Visualization (DV) Workshop. Its purpose is to evaluate whether I, as a participant, understand the subject well enough. It is also a great way to get hands-on experience with R, specifically with ggplot2 as the package for the Grammar of Graphics.

1.2 Libraries and Setup

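We’ll set up caching for this notebook, given how computationally expensive some of the code we will write can get. The setup code below also turns off scientific notation for printed numbers (scipen) and clears the global environment.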

options(scipen = 9999)
rm(list=ls())

You will need to use install.packages(), or you can open the ‘Packages’ tab in the bottom-right pane of RStudio, click Install, and then search for the required packages.

For the purpose of this coursebook, we are going to install the following packages:

ggplot2 - to create data visualisations using the Grammar of Graphics
GGally - extension to ggplot2
ggthemes - extra Themes, Scales and Geoms for ggplot2
ggpubr - ggplot2 based publication ready plots
lubridate - dealing with dates
dplyr - a grammar of data manipulation

Use install.packages() to install any packages that are not already on your machine. You then load each package into your workspace using the library() function:

library(ggplot2)
library(GGally)
library(ggthemes)
library(ggpubr)
library(lubridate)
library(dplyr)
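The install-then-load routine above can be scripted so that only the packages still missing get installed. This is a minimal sketch and not part of the original workshop material; the helper name missing_packages() is made up for illustration, and the install step is deliberately limited to interactive sessions.

```r
# Packages used in this coursebook (taken from the list above).
required <- c("ggplot2", "GGally", "ggthemes", "ggpubr", "lubridate", "dplyr")

# Hypothetical helper: returns the names in `pkgs` that are not installed yet.
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

to_install <- missing_packages(required)

# Install only what is missing, and only when the code is run by hand,
# so that knitting the notebook never triggers a download.
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)
}
```

After this, the library() calls above should load without errors, provided the installation succeeded.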

2 Understanding the Data

The dataset consists of the marks secured by students in various subjects. It is available from Kaggle as Student Performance in Exams.

The inspiration is to understand the influence of the parents’ background, test preparation, and so on, on students’ performance. It comprises 1,000 rows and 8 columns:

gender
race / ethnicity
parental level of education
lunch
test preparation course
math score
reading score
writing score

3 The Urban Myths

With this coursebook, we are going to demystify the following urban myths about students’ performance in exams:

Does a particular race excel at math?
Does practice really make perfect?
Does one particular gender really outperform the other?
Are students at a higher degree level more mature, and do they excel in exams?
Does the parents’ educational background influence students’ performance in exams?
Are students who are good at math bad at writing?
Does a particular race care more about higher education than the others?

4 Data Pre-processing

Now, let’s get our hands dirty. Before that, make sure you are working in the right directory by calling the getwd() function.

getwd()

## [1] "D:/Development/MachineLearning/Algoritma Workshop/[Algoritma] Data Visualization/Hendri_DV_Assignment"

You will see three directories, and the CSV is located in data_input/. You can verify that the CSV file exists by running the list.files() function.

list.files("data_input/")

## [1] "StudentsPerformance.csv"

Next, we load the data into the global environment.

studentPerformance <- read.csv("data_input/StudentsPerformance.csv", stringsAsFactors = FALSE)

4.1 Examine the structure

Then, we examine the structure of studentPerformance

str(studentPerformance)

## 'data.frame':    1000 obs. of  8 variables:
##  $ gender                     : chr  "female" "female" "female" "male" ...
##  $ race.ethnicity             : chr  "group B" "group C" "group B" "group A" ...
##  $ parental.level.of.education: chr  "bachelor's degree" "some college" "master's degree" "associate's degree" ...
##  $ lunch                      : chr  "standard" "standard" "standard" "free/reduced" ...
##  $ test.preparation.course    : chr  "none" "completed" "none" "none" ...
##  $ math.score                 : int  72 69 90 47 76 71 88 40 64 38 ...
##  $ reading.score              : int  72 90 95 57 78 83 95 43 64 60 ...
##  $ writing.score              : int  74 88 93 44 75 78 92 39 67 50 ...

4.2 Factorize necessary variables

As you can see, all values other than the numbers are read as character. In R versions before 4.0.0, read.csv() would have implicitly coerced them to factor by default; that did not happen here because we set the ‘stringsAsFactors’ parameter to FALSE. This is done on purpose, so that we can decide step by step which variables need to be converted to factor.

A factor is normally used for categorical data, such as gender (male, female) or type of payment (cash, credit, transfer).

Fortunately, if you look at these ‘character’ variables, all of them can be used as factors. So, let’s convert them to factor.

studentPerformance$gender <- as.factor(studentPerformance$gender)
studentPerformance$race.ethnicity <- as.factor(studentPerformance$race.ethnicity)
studentPerformance$parental.level.of.education <- as.factor(studentPerformance$parental.level.of.education)
studentPerformance$lunch <- as.factor(studentPerformance$lunch)
studentPerformance$test.preparation.course <- as.factor(studentPerformance$test.preparation.course)
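When every character column should become a factor, as is the case here, the five assignments above can be collapsed into one step. The sketch below uses a small made-up data frame named toy in place of studentPerformance, so it can be run on its own.

```r
# Toy data frame standing in for studentPerformance (values are made up).
toy <- data.frame(
  gender     = c("female", "male", "female"),
  lunch      = c("standard", "free/reduced", "standard"),
  math.score = c(72L, 47L, 90L),
  stringsAsFactors = FALSE
)

# Find the character columns, then convert each of them to a factor.
is_chr <- vapply(toy, is.character, logical(1))
toy[is_chr] <- lapply(toy[is_chr], as.factor)

str(toy)
```

The numeric column is left untouched because only the columns flagged in is_chr are reassigned.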

Now, check the structure again to make sure that all the required variables are factors.

str(studentPerformance)

## 'data.frame':    1000 obs. of  8 variables:
##  $ gender                     : Factor w/ 2 levels "female","male": 1 1 1 2 2 1 1 2 2 1 ...
##  $ race.ethnicity             : Factor w/ 5 levels "group A","group B",..: 2 3 2 1 3 2 2 2 4 2 ...
##  $ parental.level.of.education: Factor w/ 6 levels "associate's degree",..: 2 5 4 1 5 1 5 5 3 3 ...
##  $ lunch                      : Factor w/ 2 levels "free/reduced",..: 2 2 2 1 2 2 2 1 1 1 ...
##  $ test.preparation.course    : Factor w/ 2 levels "completed","none": 2 1 2 2 2 2 1 2 1 2 ...
##  $ math.score                 : int  72 69 90 47 76 71 88 40 64 38 ...
##  $ reading.score              : int  72 90 95 57 78 83 95 43 64 60 ...
##  $ writing.score              : int  74 88 93 44 75 78 92 39 67 50 ...

4.3 Examine meaningless variables

Next, we check for meaningless variables that are not required for our analysis. Peek at the data using the head() function:

head(studentPerformance)

##   gender race.ethnicity parental.level.of.education        lunch
## 1 female        group B           bachelor's degree     standard
## 2 female        group C                some college     standard
## 3 female        group B             master's degree     standard
## 4   male        group A          associate's degree free/reduced
## 5   male        group C                some college     standard
## 6 female        group B          associate's degree     standard
##   test.preparation.course math.score reading.score writing.score
## 1                    none         72            72            74
## 2               completed         69            90            88
## 3                    none         90            95            93
## 4                    none         47            57            44
## 5                    none         76            78            75
## 6                    none         71            83            78

Fortunately again, the data are good to go. Examples of the meaningless data I am referring to would be the student’s name, the student ID, or even the parents’ names. It makes no sense to use them for analysis.

4.4 Adding grade

Moving forward, we transform each score into a grade (i.e. A, A-, B+, etc.) by adding a ‘grade’ variable, and we add a weight that is later used to determine the Grade Point Average (GPA), as follows:

A with weight of 4.00 is between 90 and 100, or Score >= 90 and Score <= 100
A- with weight of 3.67 is between 85 and 89, or Score >= 85 and Score <= 89
B+ with weight of 3.33 is between 80 and 84, or Score >= 80 and Score <= 84
B with weight of 3.00 is between 75 and 79, or Score >= 75 and Score <= 79
B- with weight of 2.50 is between 70 and 74, or Score >= 70 and Score <= 74
C with weight of 2.00 is between 65 and 69, or Score >= 65 and Score <= 69
D with weight of 1.00 is between 50 and 64, or Score >= 50 and Score <= 64
E with weight of 0.00 is between 1 and 49, or Score > 0 and Score <= 49
F with weight of 0.00 is a score of 0, or Score = 0

To convert score to grade, we define getGrade() function, as below:

getGrade <- function(score) {
  if(score >= 90 & score <= 100) {
      return("A")
    } else {
        if(score >= 85 & score <= 89) {
          return("A-")
        } else {
            if(score >= 80 & score <= 84) {
              return("B+")
            } else {
                if(score >= 75 & score <= 79) {
                  return("B")
                } else {
                    if(score >= 70 & score <= 74) {
                      return("B-")
                    } else {
                        if(score >= 65 & score <= 69) {
                          return ("C")
                        } else {
                          if(score >= 50 & score <= 64) {
                            return ("D")
                          } else {
                              if(score > 0 & score <= 49) {
                                return ("E")
                              } else {
                                  return ("F")
                              }
                            }
                        }
                    }
                  }
             }
        }
      }
}

Looks complicated, especially with all the brackets {} hanging there? We will come back with a ‘more readable and easier to follow’ version later on.

4.5 Adding weight

The getWeight() function returns the weight for a given grade, using switch():

getWeight<-function(grade){
  weight <- switch(grade, 
                   "A"=4.00, 
                   "A-"=3.67, 
                   "B+"=3.33, 
                   "B"=3.00, 
                   "B-"=2.50, 
                   "C"=2.0, 
                   "D"=1.0, 
                   "E"=0.00, 
                   "F"=0.00)
  return(weight)
}

Then, we add the ‘math.grade’, ‘reading.grade’ and ‘writing.grade’ variables to the data; each contains a grade based on the corresponding score.

studentPerformance$math.grade <- sapply(studentPerformance$math.score, FUN=getGrade)
studentPerformance$reading.grade <- sapply(studentPerformance$reading.score, FUN=getGrade)
studentPerformance$writing.grade <- sapply(studentPerformance$writing.score, FUN=getGrade)

And add a weight for each subject, based on its grade:

studentPerformance$math.weight <- sapply(studentPerformance$math.grade, FUN=getWeight)
studentPerformance$reading.weight <- sapply(studentPerformance$reading.grade, FUN=getWeight)
studentPerformance$writing.weight <- sapply(studentPerformance$writing.grade, FUN=getWeight)
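As an alternative to calling getWeight() once per row through sapply(), the same mapping can be stored in a named vector and looked up for a whole column at once. This is only a sketch: the name grade_weights and the example grades are made up for illustration, while the weights themselves are the ones from getWeight() above.

```r
# Grade-to-weight table, same values as in getWeight() above.
grade_weights <- c("A" = 4.00, "A-" = 3.67, "B+" = 3.33, "B" = 3.00,
                   "B-" = 2.50, "C" = 2.00, "D" = 1.00, "E" = 0.00, "F" = 0.00)

# Example grades (made up for illustration).
grades <- c("B-", "C", "A", "E")

# Index the named vector by grade; unname() drops the grade names from the result.
weights <- unname(grade_weights[grades])
weights
```

The lookup relies on grades being a character vector. If the grades have already been converted to factors, wrap them in as.character() first, because indexing with a factor uses its integer codes rather than its labels.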

As promised, below is the more readable alternative using the with() and ifelse() functions. Don’t get it wrong: more readable does not mean shorter. The outputs shown in the rest of this coursebook were produced with the lowest band of this version labelled "D", which is why math.grade has no E level in them.

studentPerformance$math.grade <- with(studentPerformance, 
    ifelse(math.score >= 90 & math.score <= 100, "A",
    ifelse(math.score >= 85 & math.score <= 89, "A-",
    ifelse(math.score >= 80 & math.score <= 84, "B+",
    ifelse(math.score >= 75 & math.score <= 79, "B",
    ifelse(math.score >= 70 & math.score <= 74, "B-",
    ifelse(math.score >= 65 & math.score <= 69, "C",
    ifelse(math.score >= 50 & math.score <= 64, "D",
    ifelse(math.score > 0 & math.score <= 49, "E", "F"  # same bands as getGrade()
    )))))))))
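Another way to express the same grade bands is base R’s cut(), which assigns each score to an interval in a single vectorised call. The sketch below is an illustration rather than the method used in this coursebook; score_to_grade() is a made-up name, and it assumes whole-number scores between 0 and 100.

```r
# Upper edge of each band. cut() builds intervals that are closed on the right,
# so (49, 64] is D, (64, 69] is C, and so on. (-Inf, 0] catches a score of 0.
breaks <- c(-Inf, 0, 49, 64, 69, 74, 79, 84, 89, 100)
labels <- c("F", "E", "D", "C", "B-", "B", "B+", "A-", "A")

# Hypothetical helper: vectorised score-to-grade conversion.
score_to_grade <- function(score) {
  as.character(cut(score, breaks = breaks, labels = labels, right = TRUE))
}

score_to_grade(c(0, 47, 64, 69, 72, 88, 90, 100))
```

One difference from getGrade() is worth keeping in mind: a score above 100 falls outside every interval and comes back as NA here, whereas getGrade() would return "F".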

4.6 Adding GPA

Now that we have the grades and weights, we can determine the GPA as the mean of the three subject weights and add it as the GPA variable.

studentPerformance$GPA <- apply(cbind(studentPerformance$math.weight, studentPerformance$reading.weight, studentPerformance$writing.weight),1, FUN=mean)
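To make the GPA calculation concrete, here is a small worked sketch using rowMeans(), which computes the same per-row mean as the apply() call above. The weights are those of the first four students as they appear in the str() output in section 4.8; the data frame name weights is made up for illustration.

```r
# Weights of the first four students, copied from the str() output in section 4.8.
weights <- data.frame(
  math.weight    = c(2.5, 2.00, 4, 0),
  reading.weight = c(2.5, 4.00, 4, 1),
  writing.weight = c(2.5, 3.67, 4, 0)
)

# Mean of the three weights for each student (row).
gpa <- rowMeans(weights)
round(gpa, 3)
```

For the second student this is (2.00 + 4.00 + 3.67) / 3, which rounds to 3.223 and matches the GPA column in that output.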

4.7 Adding eligibility to graduate

Let’s improvise a bit by adding another variable that determines whether the student is eligible to graduate, with a minimum passing GPA of 2.00.

studentPerformance$eligible <- studentPerformance$GPA >= 2

We have grown studentPerformance from 8 to 16 variables. Personally, when I examined the data again, I found the ‘lunch’ variable to be meaningless for this analysis, so I decided to remove it using the subset() function.

studentPerformance <- subset(studentPerformance, select=-c(lunch))

4.8 Check and Re-check

str(studentPerformance)

## 'data.frame':    1000 obs. of  15 variables:
##  $ gender                     : Factor w/ 2 levels "female","male": 1 1 1 2 2 1 1 2 2 1 ...
##  $ race.ethnicity             : Factor w/ 5 levels "group A","group B",..: 2 3 2 1 3 2 2 2 4 2 ...
##  $ parental.level.of.education: Factor w/ 6 levels "associate's degree",..: 2 5 4 1 5 1 5 5 3 3 ...
##  $ test.preparation.course    : Factor w/ 2 levels "completed","none": 2 1 2 2 2 2 1 2 1 2 ...
##  $ math.score                 : int  72 69 90 47 76 71 88 40 64 38 ...
##  $ reading.score              : int  72 90 95 57 78 83 95 43 64 60 ...
##  $ writing.score              : int  74 88 93 44 75 78 92 39 67 50 ...
##  $ math.grade                 : chr  "B-" "C" "A" "D" ...
##  $ reading.grade              : chr  "B-" "A" "A" "D" ...
##  $ writing.grade              : chr  "B-" "A-" "A" "E" ...
##  $ math.weight                : num  2.5 2 4 0 3 2.5 3.67 0 1 0 ...
##  $ reading.weight             : num  2.5 4 4 1 3 3.33 4 0 1 1 ...
##  $ writing.weight             : num  2.5 3.67 4 0 3 3 4 0 2 1 ...
##  $ GPA                        : num  2.5 3.223 4 0.333 3 ...
##  $ eligible                   : logi  TRUE TRUE TRUE FALSE TRUE TRUE ...

We find that math.grade, reading.grade and writing.grade are all of character type, so we need to convert them into factors for our analysis.

studentPerformance$math.grade <- as.factor(studentPerformance$math.grade)
studentPerformance$reading.grade <- as.factor(studentPerformance$reading.grade)
studentPerformance$writing.grade <- as.factor(studentPerformance$writing.grade)

Check again to verify.

str(studentPerformance)

## 'data.frame':    1000 obs. of  15 variables:
##  $ gender                     : Factor w/ 2 levels "female","male": 1 1 1 2 2 1 1 2 2 1 ...
##  $ race.ethnicity             : Factor w/ 5 levels "group A","group B",..: 2 3 2 1 3 2 2 2 4 2 ...
##  $ parental.level.of.education: Factor w/ 6 levels "associate's degree",..: 2 5 4 1 5 1 5 5 3 3 ...
##  $ test.preparation.course    : Factor w/ 2 levels "completed","none": 2 1 2 2 2 2 1 2 1 2 ...
##  $ math.score                 : int  72 69 90 47 76 71 88 40 64 38 ...
##  $ reading.score              : int  72 90 95 57 78 83 95 43 64 60 ...
##  $ writing.score              : int  74 88 93 44 75 78 92 39 67 50 ...
##  $ math.grade                 : Factor w/ 7 levels "A","A-","B","B-",..: 4 6 1 7 3 4 2 7 7 7 ...
##  $ reading.grade              : Factor w/ 8 levels "A","A-","B","B-",..: 4 1 1 7 3 5 1 8 7 7 ...
##  $ writing.grade              : Factor w/ 8 levels "A","A-","B","B-",..: 4 2 1 8 3 3 1 8 6 7 ...
##  $ math.weight                : num  2.5 2 4 0 3 2.5 3.67 0 1 0 ...
##  $ reading.weight             : num  2.5 4 4 1 3 3.33 4 0 1 1 ...
##  $ writing.weight             : num  2.5 3.67 4 0 3 3 4 0 2 1 ...
##  $ GPA                        : num  2.5 3.223 4 0.333 3 ...
##  $ eligible                   : logi  TRUE TRUE TRUE FALSE TRUE TRUE ...

5 Myth #1 - Does a particular race excel at math?

Myth no. 1: does a particular race really excel at math? I am sure you have heard this number-one myth, that a particular race or two excel in mathematics.

First, check the distribution of students across the groups to get a feel for the sample sizes.

table(studentPerformance$race.ethnicity)

## 
## group A group B group C group D group E 
##      89     190     319     262     140

Display the math.grade distribution across the groups:

table(studentPerformance$math.grade, studentPerformance$race.ethnicity)

##     
##      group A group B group C group D group E
##   A        4       8      16       8      22
##   A-       2       9      13      20      15
##   B        9      15      28      35      15
##   B-       7      21      34      32      20
##   B+       3      12      20      23      18
##   C       11      27      51      40      15
##   D       53      98     157     104      35

Then the proportions, using the prop.table() function. By default each cell is divided by the grand total of 1,000 students:

prop.table(table(studentPerformance$math.grade, studentPerformance$race.ethnicity))

##     
##      group A group B group C group D group E
##   A    0.004   0.008   0.016   0.008   0.022
##   A-   0.002   0.009   0.013   0.020   0.015
##   B    0.009   0.015   0.028   0.035   0.015
##   B-   0.007   0.021   0.034   0.032   0.020
##   B+   0.003   0.012   0.020   0.023   0.018
##   C    0.011   0.027   0.051   0.040   0.015
##   D    0.053   0.098   0.157   0.104   0.035

From the table above, group E has the most A grades (22) and the fewest D grades (35), even though it is only the second-smallest group.
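Because the groups differ in size, the comparison is easier to read as a share within each group rather than as a share of all students. The sketch below rebuilds the count table shown above as a matrix (the numbers are copied from that output) and passes margin = 2 to prop.table() so that every column sums to 1. The object names counts and share_within_group are made up for illustration.

```r
# Counts copied from the table(math.grade, race.ethnicity) output above.
counts <- matrix(
  c( 4,  8,  16,   8, 22,
     2,  9,  13,  20, 15,
     9, 15,  28,  35, 15,
     7, 21,  34,  32, 20,
     3, 12,  20,  23, 18,
    11, 27,  51,  40, 15,
    53, 98, 157, 104, 35),
  nrow = 7, byrow = TRUE,
  dimnames = list(grade = c("A", "A-", "B", "B-", "B+", "C", "D"),
                  group = paste("group", LETTERS[1:5]))
)

# margin = 2 divides every cell by its column total (the group size).
share_within_group <- prop.table(counts, margin = 2)
round(share_within_group, 3)
```

With the real data, the equivalent call would be prop.table(table(studentPerformance$math.grade, studentPerformance$race.ethnicity), margin = 2).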

We use a bar chart to visualise this.

A brief word on the ggplot2 library: as you may already know, the gg in ggplot2 stands for Grammar of Graphics. In layman’s terms, you need to know the structure of the grammar before you draw a graphic. Think of a graphic as a canvas to which you add layer upon layer in order to create a beautiful scene.

According to Hadley Wickham ¹, the components that make up a plot are:

data and aesthetic mappings
geometric objects
scales
facet specification

5.1 Bar Plot - Grade as X and Race as Fill

We use studentPerformance as the data, map its variables with aesthetics, add a bar as the geometric object, and then add the scale, the labels and, lastly, the theme.

ggplot(studentPerformance, mapping = aes(x = math.grade, fill = race.ethnicity)) + 
  geom_bar() +
  scale_y_continuous(limits = c(0, 500), breaks = seq(0, 500, 50)) +
  # the y axis of geom_bar() is the number of students; the fill colour is the race group
  labs(x = "Math Grade",
       y = "Number of students",
       fill = "Race Group",
       title = "The Urban Myth #1",
       subtitle = "Does a particular race excel at math?") +
  theme_bw()

Notice that I use + sign to add a layer over another layer.

5.2 Bar Plot - Race as X and Grade as Fill

ggplot(studentPerformance, aes(x = race.ethnicity, fill = math.grade)) + 
  theme_bw() +
  geom_bar() +
  # here the race group is on the x axis and the y axis is the number of students
  labs(x = "Race Group",
       y = "Number of students",
       fill = "Math Grade",
       title = "The Urban Myth #1",
       subtitle = "Does a particular race excel at math?")

However, I am still not convinced that one particular race outperforms the others, and the bar chart may not be the best way to view this. Let’s compare the mean math score per group first, and then use a boxplot.

xtabs(formula=math.score~race.ethnicity,
      aggregate(math.score~race.ethnicity, 
                data=studentPerformance,mean))

## race.ethnicity
##  group A  group B  group C  group D  group E 
## 61.62921 63.45263 64.46395 67.36260 73.82143

As the numbers suggest, group E’s average math score (73.8) is roughly 10% to 20% higher than the averages of the other groups (61.6 to 67.4). But let’s look at the visual to get a clearer view.
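The size of the gap can be checked directly from the group means printed above. This is a small sketch; the vector group_means simply re-enters those printed values, and the other names are made up for illustration.

```r
# Mean math score per group, copied from the xtabs() output above.
group_means <- c("group A" = 61.62921, "group B" = 63.45263,
                 "group C" = 64.46395, "group D" = 67.36260,
                 "group E" = 73.82143)

# How much higher (in percent) group E's mean is than each other group's mean.
others <- group_means[names(group_means) != "group E"]
pct_higher <- (group_means[["group E"]] / others - 1) * 100
round(pct_higher, 1)
```

The gap is widest against group A and narrowest against group D, so a single figure does not describe all four comparisons.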

5.3 Box Plot #1 - based on score

ggplot(data=studentPerformance, mapping=aes(x=race.ethnicity, y=math.score))+
  geom_boxplot()

Looking at the graph, group E does seem to outperform the rest, so Myth #1 may be true. Now, for the sake of a publishable report, let’s add a scale, title, caption, etc.

5.4 Box Plot #2 - based on score with color

ggplot(data=studentPerformance, mapping=aes(x=race.ethnicity, y=math.score, col=race.ethnicity ))+
  theme_bw() +
  geom_boxplot()+
  scale_y_continuous(limits=c(0,110),breaks = seq(0,110,10))+
  labs(title = "The Urban Myth #1",
       subtitle = "Does a particular race excel at math?",
       x = "Race Group", y = "Math Score",
       caption = "Source: https://www.kaggle.com/spscientist/students-performance-in-exams") +
  theme(panel.grid.minor = element_blank())

5.5 Box Plot #3 - based on grade

ggplot(data = studentPerformance, mapping = aes(x = race.ethnicity, y = math.weight, col = race.ethnicity)) +
  theme_bw() +
  geom_boxplot() +
  scale_y_continuous(limits = c(0, 4), breaks = seq(0, 4, 0.5)) +
  # y is math.weight (the grade weight from 0 to 4), not the raw score
  labs(title = "The Urban Myth #1",
       subtitle = "Does a particular race excel at math?",
       x = "Race Group", y = "Math Grade Weight",
       caption = "Source: https://www.kaggle.com/spscientist/students-performance-in-exams") +
  theme(panel.grid.minor = element_blank())

Okay, now I’m convinced that, in this sample, one particular group outperforms the others at math. So, unfortunately, I need to agree that Myth #1 is legit.

6 Myth #2 - Does Practice Really Make Perfect?

We now move on to the next urban myth. This one is very straightforward: we just need to check whether the test preparation variable is in line with the results.

table(studentPerformance$test.preparation.course, studentPerformance$eligible)

##            
##             FALSE TRUE
##   completed   113  245
##   none        350  292
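The counts in the table above are easier to compare as row-wise rates, since far fewer students completed the course than skipped it. The sketch below re-enters those four counts as a matrix and uses prop.table() with margin = 1 so that each row sums to 1; the object names are made up for illustration.

```r
# Counts copied from the table(test.preparation.course, eligible) output above.
counts <- matrix(c(113, 245,
                   350, 292),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(course = c("completed", "none"),
                                 eligible = c("FALSE", "TRUE")))

# margin = 1 divides every cell by its row total.
rates <- prop.table(counts, margin = 1)
round(rates, 3)
```

The TRUE column then gives the share of eligible students within each preparation group.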

As you can see, students who completed the preparation course have a better chance of being eligible to graduate: 245 of 358 (about 68%) versus 292 of 642 (about 45%) for those who did not. But wait, what if group E simply took the preparation course more often than the other groups?

# columns are referenced by bare name inside aes(); the stacking order is set by position_stack(reverse = TRUE)
ggplot(studentPerformance, aes(x = race.ethnicity, fill = test.preparation.course)) + 
  theme_bw() +
  geom_bar(position = position_stack(reverse = TRUE)) +
  labs(title = "The Urban Myth #2",
       subtitle = "Does practice really make perfect?",
       x = "Race Group", y = "Number of students", fill = "Test preparation course",
       caption = "Source: https://www.kaggle.com/spscientist/students-performance-in-exams") +
  theme(panel.grid.minor = element_blank(), axis.text.x = element_text(angle = 45, hjust = 1))

7 Does one particular gender really outperform the other?

8 Are students at a higher degree level more mature, and do they excel in exams?

9 Does the parents’ educational background influence students’ performance in exams?

10 Are students who are good at math bad at writing?

11 Does a particular race care more about higher education than the others?

12 Using Plotly

Now that we have plenty of charts, let’s try something fancier and add a touch of interactivity on top of ggplot2 with plotly.

library(plotly)
# build the ggplot2 chart first, referring to columns by bare name inside aes()
p <- ggplot(studentPerformance, aes(x = race.ethnicity, fill = test.preparation.course)) + 
  theme_bw() +
  geom_bar(position = position_stack(reverse = TRUE)) +
  labs(title = "The Urban Myth #2",
       subtitle = "Does practice really make perfect?",
       x = "Race Group", y = "Number of students", fill = "Test preparation course",
       caption = "Source: https://www.kaggle.com/spscientist/students-performance-in-exams") +
  theme(panel.grid.minor = element_blank(), axis.text.x = element_text(angle = 45, hjust = 1))

# convert the static chart into an interactive plotly widget
p <- ggplotly(p)
p

13 References

¹ Hadley Wickham, A Layered Grammar of Graphics

Data Visualization with R using ggplot2 on Student Performance in Exam

Hendri Arifin

January 24, 2019