#packages
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
options(scipen = 9999)

1 Introduction

Student dropout in higher education is a serious problem that affects individuals, higher education institutions, and society as a whole (Guzmán, Barragán & Vitery, 2021). For the individual, dropping out cuts students off from education, a fundamental human right, and negatively affects their economic well-being. For higher education institutions, dropout can reduce the quality and efficiency of the institution’s system. On a larger scale, high dropout rates limit the diversity of the human capital available in the labor market. Investigating the factors that influence student dropout is therefore urgent.

In this LBB project on Data Visualization with R, we investigate which factors are associated with student dropout, so that academic institutions can provide tailored prevention measures for at-risk students. The dataset can be accessed from Kaggle: Predict students’ dropout and academic success. It combines several disjoint databases and includes information available at the time of enrollment, such as application mode, marital status, and course chosen. From this data, we investigate demographic factors in relation to student dropout.

2 Importing Data

academic <- read.csv("dataset.csv")

2.1 Data Inspection

head(academic)
glimpse(academic)
## Rows: 4,424
## Columns: 35
## $ Marital.status                                 <int> 1, 1, 1, 1, 2, 2, 1, 1,…
## $ Application.mode                               <int> 8, 6, 1, 8, 12, 12, 1, …
## $ Application.order                              <int> 5, 1, 5, 2, 1, 1, 1, 4,…
## $ Course                                         <int> 2, 11, 5, 15, 3, 17, 12…
## $ Daytime.evening.attendance                     <int> 1, 1, 1, 1, 0, 0, 1, 1,…
## $ Previous.qualification                         <int> 1, 1, 1, 1, 1, 12, 1, 1…
## $ Nacionality                                    <int> 1, 1, 1, 1, 1, 1, 1, 1,…
## $ Mother.s.qualification                         <int> 13, 1, 22, 23, 22, 22, …
## $ Father.s.qualification                         <int> 10, 3, 27, 27, 28, 27, …
## $ Mother.s.occupation                            <int> 6, 4, 10, 6, 10, 10, 8,…
## $ Father.s.occupation                            <int> 10, 4, 10, 4, 10, 8, 11…
## $ Displaced                                      <int> 1, 1, 1, 1, 0, 0, 1, 1,…
## $ Educational.special.needs                      <int> 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Debtor                                         <int> 0, 0, 0, 0, 0, 1, 0, 0,…
## $ Tuition.fees.up.to.date                        <int> 1, 0, 0, 1, 1, 1, 1, 0,…
## $ Gender                                         <int> 1, 1, 1, 0, 0, 1, 0, 1,…
## $ Scholarship.holder                             <int> 0, 0, 0, 0, 0, 0, 1, 0,…
## $ Age.at.enrollment                              <int> 20, 19, 19, 20, 45, 50,…
## $ International                                  <int> 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Curricular.units.1st.sem..credited.            <int> 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Curricular.units.1st.sem..enrolled.            <int> 0, 6, 6, 6, 6, 5, 7, 5,…
## $ Curricular.units.1st.sem..evaluations.         <int> 0, 6, 0, 8, 9, 10, 9, 5…
## $ Curricular.units.1st.sem..approved.            <int> 0, 6, 0, 6, 5, 5, 7, 0,…
## $ Curricular.units.1st.sem..grade.               <dbl> 0.00000, 14.00000, 0.00…
## $ Curricular.units.1st.sem..without.evaluations. <int> 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Curricular.units.2nd.sem..credited.            <int> 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Curricular.units.2nd.sem..enrolled.            <int> 0, 6, 6, 6, 6, 5, 8, 5,…
## $ Curricular.units.2nd.sem..evaluations.         <int> 0, 6, 0, 10, 6, 17, 8, …
## $ Curricular.units.2nd.sem..approved.            <int> 0, 6, 0, 5, 6, 5, 8, 0,…
## $ Curricular.units.2nd.sem..grade.               <dbl> 0.00000, 13.66667, 0.00…
## $ Curricular.units.2nd.sem..without.evaluations. <int> 0, 0, 0, 0, 0, 5, 0, 0,…
## $ Unemployment.rate                              <dbl> 10.8, 13.9, 10.8, 9.4, …
## $ Inflation.rate                                 <dbl> 1.4, -0.3, 1.4, -0.8, -…
## $ GDP                                            <dbl> 1.74, 0.79, 1.74, -3.12…
## $ Target                                         <chr> "Dropout", "Graduate", …
dim(academic)
## [1] 4424   35

3 Data Cleansing

3.1 Change Data Type

The data inspection shows that several categorical variables are stored as integers (and Target as character), so they need to be coerced to factors.

#change variables into factor
factor_cols <- c("Marital.status", "Application.mode", "Course",
                 "Daytime.evening.attendance", "Previous.qualification",
                 "Nacionality", "Mother.s.qualification", "Father.s.qualification",
                 "Mother.s.occupation", "Father.s.occupation", "Displaced",
                 "Educational.special.needs", "Debtor", "Tuition.fees.up.to.date",
                 "Gender", "Scholarship.holder", "International", "Target")
academic_clean <- academic %>%
  mutate(across(all_of(factor_cols), as.factor))
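
As a small illustration of why this coercion matters, the sketch below uses a made-up three-row data frame (hypothetical values, not rows from the dataset). While a category is stored as integer codes, R treats it as a measurement; once it is a factor, each code becomes a level that can be counted.

```r
# Toy data (hypothetical values) mimicking an integer-coded category
toy <- data.frame(Gender = c(1L, 0L, 1L),
                  Age.at.enrollment = c(20L, 19L, 45L))

# As integers, the codes are averaged like measurements, which is meaningless
mean_code <- mean(toy$Gender)
print(mean_code)

# After coercion, each code is a level and can be tabulated
toy$Gender <- as.factor(toy$Gender)
level_counts <- table(toy$Gender)
print(level_counts)
```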

3.2 Rename and Label Variables

#change variable name to English
academic_clean <- academic_clean %>% 
  rename(Nationality = Nacionality)
#label Gender into Male and Female
head(academic_clean$Gender) 
## [1] 1 1 1 0 0 1
## Levels: 0 1
academic_clean$Gender <- recode(academic_clean$Gender, 
                                "1" = "Male",
                                "0" = "Female")
head(academic_clean$Gender)
## [1] Male   Male   Male   Female Female Male  
## Levels: Female Male

3.3 Missing Values and Duplicates

#check columns for missing values 
colSums(is.na(academic_clean))
##                                 Marital.status 
##                                              0 
##                               Application.mode 
##                                              0 
##                              Application.order 
##                                              0 
##                                         Course 
##                                              0 
##                     Daytime.evening.attendance 
##                                              0 
##                         Previous.qualification 
##                                              0 
##                                    Nationality 
##                                              0 
##                         Mother.s.qualification 
##                                              0 
##                         Father.s.qualification 
##                                              0 
##                            Mother.s.occupation 
##                                              0 
##                            Father.s.occupation 
##                                              0 
##                                      Displaced 
##                                              0 
##                      Educational.special.needs 
##                                              0 
##                                         Debtor 
##                                              0 
##                        Tuition.fees.up.to.date 
##                                              0 
##                                         Gender 
##                                              0 
##                             Scholarship.holder 
##                                              0 
##                              Age.at.enrollment 
##                                              0 
##                                  International 
##                                              0 
##            Curricular.units.1st.sem..credited. 
##                                              0 
##            Curricular.units.1st.sem..enrolled. 
##                                              0 
##         Curricular.units.1st.sem..evaluations. 
##                                              0 
##            Curricular.units.1st.sem..approved. 
##                                              0 
##               Curricular.units.1st.sem..grade. 
##                                              0 
## Curricular.units.1st.sem..without.evaluations. 
##                                              0 
##            Curricular.units.2nd.sem..credited. 
##                                              0 
##            Curricular.units.2nd.sem..enrolled. 
##                                              0 
##         Curricular.units.2nd.sem..evaluations. 
##                                              0 
##            Curricular.units.2nd.sem..approved. 
##                                              0 
##               Curricular.units.2nd.sem..grade. 
##                                              0 
## Curricular.units.2nd.sem..without.evaluations. 
##                                              0 
##                              Unemployment.rate 
##                                              0 
##                                 Inflation.rate 
##                                              0 
##                                            GDP 
##                                              0 
##                                         Target 
##                                              0

No missing values are found in the data!

#check for duplicated data
sum(duplicated(academic_clean))
## [1] 0

No duplicated values are found in the data!

3.4 Creating New Dataframe

In order to organize the predictor variables further into demographic factors, we will create a new data frame.

demographic <- academic_clean %>% 
  select(Age.at.enrollment, Gender, Marital.status,Nationality, International, Target)

4 Data Explanation

4.1 Variable Information

Variables in the demographic data frame

  • Marital.status: The marital status of the student (coded 1-6).
  • Nationality: The nationality of the student (integer-coded).
  • Gender: The gender of the student (Female, Male).
  • Age.at.enrollment: The age of the student at the time of enrollment (numeric).
  • International: Whether the student is an international student (0 = no, 1 = yes).
  • Target: The study status (Dropout, Enrolled, Graduate).
summary(demographic)
##  Age.at.enrollment    Gender     Marital.status  Nationality   International
##  Min.   :17.00     Female:2868   1:3919         1      :4314   0:4314       
##  1st Qu.:19.00     Male  :1556   2: 379         14     :  38   1: 110       
##  Median :20.00                   3:   4         12     :  14                
##  Mean   :23.27                   4:  91         3      :  13                
##  3rd Qu.:25.00                   5:  25         9      :  13                
##  Max.   :70.00                   6:   6         10     :   5                
##                                                 (Other):  27                
##       Target    
##  Dropout :1421  
##  Enrolled: 794  
##  Graduate:2209  
##                 
##                 
##                 
## 

Insights from the demographic data frame:

  • The range of enrollment age is wide, from 17 to 70 years old, although half of the students enrolled at age 20 or younger.
  • There are more female students (2,868) than male students (1,556) in the data set.
  • The majority of students are single (3,919 of 4,424), but there are a few who are married or have been married.
  • Most of the students are Portuguese.
  • International students are a minority (110 of 4,424).

5 Data Processing and Visualization

library(vcd) #library for visualizing categorical variables
## Loading required package: grid
library(ggplot2)
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

Let’s visualize the distribution of study status using a bar plot.

ggplot(demographic) +
  geom_bar(aes(x = Target), fill =  c("maroon", "dodgerblue", "dodgerblue3")) +
  ggtitle("Distribution of Student Study Status in University")+
  xlab("Study Status")+ 
  theme(legend.position="none")

This plot shows that the number of students who dropped out is relatively large: 1,421, or about 64% of the number of graduates (2,209).
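
The comparison can be reproduced from the counts in the summary output. The following is a minimal sketch that re-enters those counts by hand; the object names are illustrative and not part of the original analysis.

```r
# Counts taken from the summary(demographic) output
status_counts <- c(Dropout = 1421, Enrolled = 794, Graduate = 2209)

# Share of each study status among all students, in percent
status_share <- round(100 * status_counts / sum(status_counts), 1)
print(status_share)

# Number of dropouts per graduate
dropout_per_graduate <- status_counts[["Dropout"]] / status_counts[["Graduate"]]
print(round(dropout_per_graduate, 2))
```

The shares come to roughly 32% dropout, 18% enrolled, and 50% graduate, so about one student in three left without graduating.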

5.1 Demographic Analysis

Make a contingency table for Gender and Study Status

tablegender <- xtabs(~Gender+Target, data = demographic)
tablegender
##         Target
## Gender   Dropout Enrolled Graduate
##   Female     720      487     1661
##   Male       701      307      548

Make a mosaic plot for Gender and Study Status

mosaic(tablegender, shade = TRUE, legend = TRUE, 
       main = "Gender and Study Status", 
       labeling_args=list(set_varnames = 
                            c(Target = "Study Status"))) 

The mosaic plot above shows the relationship between gender and study status for the 4,424 students in the data. There are more female than male students, and more students graduated than dropped out. There is no notable association between the ‘Enrolled’ status and gender. The ‘Dropout’ and ‘Graduate’ statuses, on the other hand, are strongly associated with gender: being female is negatively associated with dropping out, while being male is positively associated with it. In the contingency table, about 45% of male students dropped out (701 of 1,556), compared with about 25% of female students (720 of 2,868). The mosaic plot therefore suggests that gender and study status are not independent. Specifically, it suggests that male students are more at risk of dropping out than female students.
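
The shading in a mosaic plot is based on Pearson residuals, (observed - expected) / sqrt(expected), computed under the assumption that the two variables are independent. The sketch below rebuilds the gender table from the printed counts and extracts those residuals with base R's `chisq.test()`. To my understanding, vcd's default shading uses cut-offs at absolute residuals of 2 and 4; treat those thresholds as an assumption and check the vcd documentation before relying on them.

```r
# Contingency table rebuilt by hand from the counts printed above
tablegender <- matrix(c(720, 487, 1661,
                        701, 307,  548),
                      nrow = 2, byrow = TRUE,
                      dimnames = list(Gender = c("Female", "Male"),
                                      Target = c("Dropout", "Enrolled", "Graduate")))

# Pearson residuals under independence of Gender and Target
gender_test <- chisq.test(tablegender)
pearson_resid <- round(gender_test$residuals, 2)
print(pearson_resid)

# Share of each study status within each gender (each row sums to 100)
row_pct <- round(100 * prop.table(tablegender, margin = 1), 1)
print(row_pct)
```

Large positive residuals mark cells with more students than independence would predict, large negative residuals mark cells with fewer. Residuals close to zero, as in the Enrolled column, correspond to the tiles that show no notable association.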

Recategorize Enrollment Age into range levels

head(demographic$Age.at.enrollment)
## [1] 20 19 19 20 45 50
# Bin enrollment age into left-closed ranges: [17,20), [20,23), [23,26), [26,Inf).
# Inf as the top break keeps the oldest student (age 70) in the ">25" group;
# an upper break of 70 with right = FALSE would turn that age into NA.
demographic <- demographic %>%
  mutate(Age = cut(Age.at.enrollment,
                   breaks = c(17, 20, 23, 26, Inf),
                   right = FALSE,
                   labels = c("17-19", "20-22", "23-25", ">25")))

Make a contingency table for Enrollment Age and Study Status

tableage <- xtabs(~Age+Target, data = demographic)
tableage
##        Target
## Age     Dropout Enrolled Graduate
##   17-19     409      331     1212
##   20-22     284      247      564
##   23-25     144       75      113
##   >25       584      141      320
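
The interval boundaries used when binning ages are easy to get wrong, so it helps to check them on a few boundary values. The sketch below uses illustrative ages (not rows from the dataset) to compare an upper break of 70 with an upper break of `Inf` when `right = FALSE`.

```r
# A few boundary ages (illustrative values, not rows from the dataset)
ages <- c(17, 19, 20, 25, 26, 70)

# Upper break of 70 with right = FALSE: the top bin is [26,70), so 70 is left out
closed_top <- cut(ages, breaks = c(17, 20, 23, 26, 70), right = FALSE)
print(closed_top)

# Upper break of Inf: the top bin is [26,Inf), so every age from 26 up is kept
open_top <- cut(ages, breaks = c(17, 20, 23, 26, Inf), right = FALSE,
                labels = c("17-19", "20-22", "23-25", ">25"))
print(open_top)
```

With the closed upper break, the age of 70 becomes `NA` and would silently drop out of any table built with `xtabs()`. With `Inf`, it falls into the ">25" group.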

Make a mosaic plot for Enrollment Age and Study Status

mosaic(tableage, shade = TRUE, legend = TRUE, 
       main = "Enrollment Age and Study Status", 
       labeling_args=list(set_varnames = 
                            c(Age = "Enrollment Age", Target = "Study Status")))

The mosaic plot above shows the relationship between the age at which students enrolled in higher education and their study status. The 17-19 age range contains the most students and the 23-25 range the fewest. The strongest positive association with the ‘Graduate’ status is in the 17-19 range, where about 62% of students graduated (1,212 of 1,952), which suggests that students who enroll earlier are more likely to graduate. Conversely, there is a strong positive association between dropping out and enrolling at over 25 years old, indicated by the deep blue color; about 56% of that group dropped out. A weaker positive association, indicated by the light blue color, can be seen between dropping out and enrolling at 23-25 years old (144 of 332, about 43%). Together, these results suggest that students who enroll at a later age are more likely to drop out.

Recode Marital Status

demographic <- demographic %>%  
  mutate(marital = recode(Marital.status, .default = "Other", "1" = "Single", "2" = "Married"))
head(demographic)

Make a contingency table for Marital Status and Study Status

tablemarital <- xtabs(~marital+Target, data = demographic)
tablemarital
##          Target
## marital   Dropout Enrolled Graduate
##   Single     1184      720     2015
##   Married     179       52      148
##   Other        58       22       46

Make a mosaic plot for Marital Status and Study Status

mosaic(tablemarital, shade = TRUE, legend = TRUE, 
       main = "Marital and Study Status",
       labeling_args=list(set_varnames = 
                            c(marital = "Marital Status", Target = "Study Status")))

The mosaic plot above shows the relationship between students’ marital status and their study status. The strongest association is the positive one between being married and dropping out: about 47% of married students dropped out (179 of 379), compared with about 30% of single students (1,184 of 3,919). This indicates a higher risk of dropping out for married students than for single students. For single students, there is a weak negative association with dropping out, which suggests that single students are somewhat less likely to drop out than students in the other groups.
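
The dropout rates behind this comparison can be computed directly from the contingency table. The following is a minimal sketch that re-enters the printed counts by hand; the ratio to single students is an illustrative summary and not part of the original analysis.

```r
# Contingency table rebuilt by hand from the counts printed above
tablemarital <- matrix(c(1184, 720, 2015,
                          179,  52,  148,
                           58,  22,   46),
                       nrow = 3, byrow = TRUE,
                       dimnames = list(marital = c("Single", "Married", "Other"),
                                       Target = c("Dropout", "Enrolled", "Graduate")))

# Dropout rate within each marital-status group, in percent
dropout_rate <- 100 * tablemarital[, "Dropout"] / rowSums(tablemarital)
print(round(dropout_rate, 1))

# Dropout rate of each group relative to single students (ratio of rates)
relative_to_single <- dropout_rate / dropout_rate[["Single"]]
print(round(relative_to_single, 2))
```

The "Other" group contains only 126 students, so its rate is estimated less precisely than the rates for single and married students.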

6 Conclusion

In terms of student demographics, those who are at higher risk of dropping out include:

  • Male students
  • Students who enrolled at university at age 23 or older
  • Married students and students with other marital statuses (divorced, widowed, etc.)

From this analysis, we cannot determine how strong these associations are, so further investigation is needed.
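
One possible starting point for that follow-up is Cramér's V, an effect-size measure for contingency tables that ranges from 0 (no association) to 1 (perfect association). This is a suggestion rather than part of the original analysis. The sketch below defines a hypothetical helper, `cramers_v()`, and applies it to the gender and marital status tables rebuilt from the printed counts.

```r
# Cramér's V = sqrt(chi-squared / (n * (min(rows, cols) - 1)))
# cramers_v() is a hypothetical helper written for this sketch
cramers_v <- function(tab) {
  chi2 <- unname(chisq.test(tab, correct = FALSE)$statistic)
  n <- sum(tab)
  sqrt(chi2 / (n * (min(dim(tab)) - 1)))
}

status <- c("Dropout", "Enrolled", "Graduate")

# Tables rebuilt by hand from the contingency tables printed in section 5.1
tablegender <- matrix(c(720, 487, 1661,
                        701, 307,  548),
                      nrow = 2, byrow = TRUE,
                      dimnames = list(Gender = c("Female", "Male"),
                                      Target = status))
tablemarital <- matrix(c(1184, 720, 2015,
                          179,  52,  148,
                           58,  22,   46),
                       nrow = 3, byrow = TRUE,
                       dimnames = list(marital = c("Single", "Married", "Other"),
                                       Target = status))

v_gender <- cramers_v(tablegender)
v_marital <- cramers_v(tablemarital)
print(round(c(Gender = v_gender, Marital = v_marital), 3))
```

On these counts the value for gender is larger than the value for marital status, which would suggest that gender is the stronger of the two associations. Cramér's V only describes pairwise association, so it does not show whether one factor explains another (for example, age and marital status).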