Student dropout in higher education is a serious problem that affects individuals, higher education institutions, and society as a whole (Guzmán, Barragán & Vitery, 2021). For the individual, dropping out cuts short access to education, a fundamental human right, and negatively affects economic well-being. For higher education institutions, students dropping out can mean a reduction in the quality and efficiency of the institution’s system. On a larger scale, high dropout rates ultimately limit the availability of a diverse pool of human capital in the labor market. Investigating the factors that influence student dropout is therefore increasingly urgent.

In this LBB project on Data Visualization with R, we investigate which factors are associated with student dropout, so that academic institutions can provide tailored prevention measures for at-risk students. The dataset can be accessed from Kaggle: Predict students’ dropout and academic success. It combines multiple disjoint databases and consists of relevant information available at the time of enrollment, such as application mode, marital status, course chosen, and more. From this data, we will investigate demographic factors in relation to student dropout.
## Rows: 4,424
## Columns: 35
## $ Marital.status <int> 1, 1, 1, 1, 2, 2, 1, 1,…
## $ Application.mode <int> 8, 6, 1, 8, 12, 12, 1, …
## $ Application.order <int> 5, 1, 5, 2, 1, 1, 1, 4,…
## $ Course <int> 2, 11, 5, 15, 3, 17, 12…
## $ Daytime.evening.attendance <int> 1, 1, 1, 1, 0, 0, 1, 1,…
## $ Previous.qualification <int> 1, 1, 1, 1, 1, 12, 1, 1…
## $ Nacionality <int> 1, 1, 1, 1, 1, 1, 1, 1,…
## $ Mother.s.qualification <int> 13, 1, 22, 23, 22, 22, …
## $ Father.s.qualification <int> 10, 3, 27, 27, 28, 27, …
## $ Mother.s.occupation <int> 6, 4, 10, 6, 10, 10, 8,…
## $ Father.s.occupation <int> 10, 4, 10, 4, 10, 8, 11…
## $ Displaced <int> 1, 1, 1, 1, 0, 0, 1, 1,…
## $ Educational.special.needs <int> 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Debtor <int> 0, 0, 0, 0, 0, 1, 0, 0,…
## $ Tuition.fees.up.to.date <int> 1, 0, 0, 1, 1, 1, 1, 0,…
## $ Gender <int> 1, 1, 1, 0, 0, 1, 0, 1,…
## $ Scholarship.holder <int> 0, 0, 0, 0, 0, 0, 1, 0,…
## $ Age.at.enrollment <int> 20, 19, 19, 20, 45, 50,…
## $ International <int> 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Curricular.units.1st.sem..credited. <int> 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Curricular.units.1st.sem..enrolled. <int> 0, 6, 6, 6, 6, 5, 7, 5,…
## $ Curricular.units.1st.sem..evaluations. <int> 0, 6, 0, 8, 9, 10, 9, 5…
## $ Curricular.units.1st.sem..approved. <int> 0, 6, 0, 6, 5, 5, 7, 0,…
## $ Curricular.units.1st.sem..grade. <dbl> 0.00000, 14.00000, 0.00…
## $ Curricular.units.1st.sem..without.evaluations. <int> 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Curricular.units.2nd.sem..credited. <int> 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Curricular.units.2nd.sem..enrolled. <int> 0, 6, 6, 6, 6, 5, 8, 5,…
## $ Curricular.units.2nd.sem..evaluations. <int> 0, 6, 0, 10, 6, 17, 8, …
## $ Curricular.units.2nd.sem..approved. <int> 0, 6, 0, 5, 6, 5, 8, 0,…
## $ Curricular.units.2nd.sem..grade. <dbl> 0.00000, 13.66667, 0.00…
## $ Curricular.units.2nd.sem..without.evaluations. <int> 0, 0, 0, 0, 0, 5, 0, 0,…
## $ Unemployment.rate <dbl> 10.8, 13.9, 10.8, 9.4, …
## $ Inflation.rate <dbl> 1.4, -0.3, 1.4, -0.8, -…
## $ GDP <dbl> 1.74, 0.79, 1.74, -3.12…
## $ Target <chr> "Dropout", "Graduate", …
## [1] 4424 35
The inspection shows that several variables do not have the correct data type: the categorical variables are stored as integers and Target is stored as character, so they need to be coerced to factors.
# change categorical variables into factors
academic_clean <- academic %>%
  mutate_at(.vars = c("Marital.status", "Application.mode", "Course",
                      "Daytime.evening.attendance", "Previous.qualification",
                      "Nacionality", "Mother.s.qualification",
                      "Father.s.qualification", "Mother.s.occupation",
                      "Father.s.occupation", "Displaced",
                      "Educational.special.needs", "Debtor",
                      "Tuition.fees.up.to.date", "Gender",
                      "Scholarship.holder", "International", "Target"),
            as.factor)
# change variable name to English
academic_clean <- academic_clean %>%
  rename(Nationality = Nacionality)
## [1] 1 1 1 0 0 1
## Levels: 0 1
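The coercion step above relies on `mutate_at()`, which dplyr now marks as superseded in favor of `across()`. As a minimal sketch on a made-up three-row data frame (the `toy` object, its values, and the `factor_cols` name are assumptions for illustration, not part of the dataset), the same kind of conversion could be written as:

```r
library(dplyr)

# Hypothetical toy data standing in for `academic`; the values are invented.
toy <- data.frame(
  Gender = c(1L, 0L, 1L),
  Debtor = c(0L, 0L, 1L),
  Age.at.enrollment = c(20L, 19L, 45L)
)

# Columns that hold category codes rather than quantities.
factor_cols <- c("Gender", "Debtor")

# across() applies as.factor to each listed column and leaves the rest untouched.
toy_clean <- toy %>%
  mutate(across(all_of(factor_cols), as.factor))

str(toy_clean)
```

Listing the columns in a character vector and passing it through `all_of()` keeps the selection explicit and fails loudly if a column name is misspelled.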
# relabel the Gender codes (1 = Male, 0 = Female)
academic_clean$Gender <- recode(academic_clean$Gender,
                                "1" = "Male",
                                "0" = "Female")
head(academic_clean$Gender)
## [1] Male Male Male Female Female Male
## Levels: Female Male
## Marital.status
## 0
## Application.mode
## 0
## Application.order
## 0
## Course
## 0
## Daytime.evening.attendance
## 0
## Previous.qualification
## 0
## Nationality
## 0
## Mother.s.qualification
## 0
## Father.s.qualification
## 0
## Mother.s.occupation
## 0
## Father.s.occupation
## 0
## Displaced
## 0
## Educational.special.needs
## 0
## Debtor
## 0
## Tuition.fees.up.to.date
## 0
## Gender
## 0
## Scholarship.holder
## 0
## Age.at.enrollment
## 0
## International
## 0
## Curricular.units.1st.sem..credited.
## 0
## Curricular.units.1st.sem..enrolled.
## 0
## Curricular.units.1st.sem..evaluations.
## 0
## Curricular.units.1st.sem..approved.
## 0
## Curricular.units.1st.sem..grade.
## 0
## Curricular.units.1st.sem..without.evaluations.
## 0
## Curricular.units.2nd.sem..credited.
## 0
## Curricular.units.2nd.sem..enrolled.
## 0
## Curricular.units.2nd.sem..evaluations.
## 0
## Curricular.units.2nd.sem..approved.
## 0
## Curricular.units.2nd.sem..grade.
## 0
## Curricular.units.2nd.sem..without.evaluations.
## 0
## Unemployment.rate
## 0
## Inflation.rate
## 0
## GDP
## 0
## Target
## 0
No missing values are found in the data!
## [1] 0
No duplicated rows are found in the data!
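The calls that produced the two checks above are not shown in the report. A minimal sketch of how such checks are commonly written in base R, using an invented data frame (`toy` and its values are assumptions for illustration):

```r
# Hypothetical toy data: one missing grade and one repeated row.
toy <- data.frame(
  id = c(1, 2, 2, 4),
  grade = c(14.0, 11.0, 11.0, NA)
)

# Number of missing values in each column.
na_per_column <- colSums(is.na(toy))

# Number of rows that exactly repeat an earlier row.
n_duplicated <- sum(duplicated(toy))

na_per_column
n_duplicated
```

A column-wise count of zero everywhere and a duplicate count of zero would correspond to the two conclusions stated above.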
Info on demographic variables
Marital.status: The marital status of the student.
Nationality: The nationality of the student.
Gender: The gender of the student.
Age.at.enrollment: The age of the student at the time of enrollment.
International: Whether the student is an international student.
Target: Study status (Graduate, Enrolled, Dropout).
##  Age.at.enrollment    Gender     Marital.status  Nationality   International
##  Min.   :17.00     Female:2868   1:3919         1      :4314   0:4314
##  1st Qu.:19.00     Male  :1556   2: 379         14     :  38   1: 110
##  Median :20.00                   3:   4         12     :  14
##  Mean   :23.27                   4:  91         3      :  13
##  3rd Qu.:25.00                   5:  25         9      :  13
##  Max.   :70.00                   6:   6         10     :   5
##                                                 (Other):  27
##       Target
##  Dropout :1421
##  Enrolled: 794
##  Graduate:2209
Insight from demographic dataframe:
Let’s visualize the distribution of study status using a bar plot.
ggplot(demographic) +
  geom_bar(aes(x = Target), fill = c("maroon", "dodgerblue", "dodgerblue3")) +
  ggtitle("Distribution of Student Study Status in University") +
  xlab("Study Status") +
  theme(legend.position = "none")
This plot shows that the number of dropouts is relatively large: 1,421 students, about 64% of the number of graduates (2,209).
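To put numbers on the bar heights, the counts printed in the Target summary can be turned into shares. This sketch re-enters those three counts by hand rather than reading the dataset (the names `status_counts`, `status_share`, and `dropout_vs_graduate` are assumptions for illustration):

```r
# Counts copied from the printed summary of Target.
status_counts <- c(Dropout = 1421, Enrolled = 794, Graduate = 2209)

# Share of each study status among all students.
status_share <- prop.table(status_counts)
round(100 * status_share, 1)

# Number of dropouts relative to the number of graduates.
dropout_vs_graduate <- status_counts[["Dropout"]] / status_counts[["Graduate"]]
round(dropout_vs_graduate, 2)
```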
Make a contingency table for Gender and Study Status
## Target
## Gender Dropout Enrolled Graduate
## Female 720 487 1661
## Male 701 307 548
Make a mosaic plot for Gender and Study Status
mosaic(tablegender, shade = TRUE, legend = TRUE,
       main = "Gender and Study Status",
       labeling_args = list(set_varnames = c(Target = "Study Status")))
The mosaic plot above shows the relationship between gender and study status for a sample of 4,424 students. There are more female than male students in the sample, and a higher proportion of students graduated than dropped out. As can be seen above, no association is found between the ‘enrolled’ status and gender. The ‘dropout’ and ‘graduate’ statuses, on the other hand, are strongly associated with gender: being female is strongly negatively associated with dropping out, while being male is strongly positively associated with it. The mosaic plot therefore suggests that gender and study status are not independent. Specifically, it suggests that male students are at greater risk of dropping out than female students.
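The shading in a vcd mosaic plot is based on Pearson residuals from the independence model. As a rough check on the reading above, the sketch below re-enters the printed gender table by hand (the object names `gender_tab`, `gender_test`, and `gender_resid` are assumptions) and computes those residuals with base R's `chisq.test()`:

```r
# Counts copied from the printed Gender x Target contingency table.
gender_tab <- matrix(
  c(720, 487, 1661,
    701, 307,  548),
  nrow = 2, byrow = TRUE,
  dimnames = list(Gender = c("Female", "Male"),
                  Target = c("Dropout", "Enrolled", "Graduate"))
)

# Chi-squared test of independence; $residuals holds the Pearson residuals,
# (observed - expected) / sqrt(expected), one per cell.
gender_test <- chisq.test(gender_tab)
gender_resid <- gender_test$residuals

round(gender_resid, 2)
```

Negative residuals mark cells with fewer students than independence would predict and positive residuals mark cells with more. Residuals close to zero, as in the Enrolled column here, correspond to the unshaded tiles.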
Recategorize Enrollment Age into range levels
## [1] 20 19 19 20 45 50
age <- as.numeric(as.character(demographic$Age.at.enrollment))
# right = FALSE gives left-closed bins; the last bin is [26,70), so an age of
# exactly 70 is not binned (NA) and is left out of the table below
Age_range <- cut(age, breaks = c(17, 20, 23, 26, 70), right = FALSE)
demographic <- demographic %>% mutate(Age = Age_range)
demographic$Age <- recode(demographic$Age,
                          "[17,20)" = "17-19",
                          "[20,23)" = "20-22",
                          "[23,26)" = "23-25",
                          "[26,70)" = ">25")
Make a contingency table for Enrollment Age and Study Status
## Target
## Age Dropout Enrolled Graduate
## 17-19 409 331 1212
## 20-22 284 247 564
## 23-25 144 75 113
## >25 583 141 320
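The four rows above add up to 4,423 rather than 4,424. With `right = FALSE` the last interval is `[26,70)`, so a student aged exactly 70 (the maximum in the summary) falls outside every bin and becomes `NA`. A minimal sketch of one way to avoid this, using invented ages that sit on the bin edges (the names `ages`, `original_bins`, and `fixed_bins` are assumptions for illustration):

```r
# Invented ages covering every bin edge, including the maximum of 70.
ages <- c(17, 19, 20, 25, 26, 70)

# Breaks as in the report: 70 is excluded by the half-open interval [26,70).
original_bins <- cut(ages, breaks = c(17, 20, 23, 26, 70), right = FALSE)

# Alternative: an open-ended last bin, with the labels set directly in cut().
fixed_bins <- cut(
  ages,
  breaks = c(17, 20, 23, 26, Inf),
  right = FALSE,
  labels = c("17-19", "20-22", "23-25", ">25")
)

data.frame(ages, original_bins, fixed_bins)
```

Setting the labels inside `cut()` also removes the need for a separate `recode()` step.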
Make a mosaic plot for Enrollment Age and Study Status
mosaic(tableage, shade = TRUE, legend = TRUE,
       main = "Enrollment Age and Study Status",
       labeling_args = list(set_varnames = c(Age = "Enrollment Age",
                                             Target = "Study Status")))
The mosaic plot above shows the relationship between the age at which students enrolled in higher education (university) and their study status. The 17-19 age range contains the most students and the 23-25 range the fewest. The strongest positive association with the ‘graduate’ status is in the 17-19 age range, which suggests that the earlier a student enrolls, the more likely they are to graduate. Conversely, there is a strong positive association between dropping out and enrolling at over 25 years old, indicated by the deep blue color. A weaker positive association between dropping out and enrolling at 23-25 years old, indicated by the light blue color, can also be seen. This suggests that older students (those who enrolled at a later age) are more likely to drop out.
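Row-wise proportions make the age pattern easier to compare across groups of different sizes. The sketch below re-enters the printed age table by hand (the names `age_tab` and `age_share` are assumptions) and computes the share of each study status within each age group:

```r
# Counts copied from the printed Age x Target contingency table.
age_tab <- matrix(
  c(409, 331, 1212,
    284, 247,  564,
    144,  75,  113,
    583, 141,  320),
  nrow = 4, byrow = TRUE,
  dimnames = list(Age = c("17-19", "20-22", "23-25", ">25"),
                  Target = c("Dropout", "Enrolled", "Graduate"))
)

# margin = 1 divides each row by its own total, so every row sums to 1.
age_share <- prop.table(age_tab, margin = 1)

round(100 * age_share, 1)
```

In these counts the dropout share rises with each successive age group, which is consistent with the reading of the mosaic plot.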
Recode Marital Status
# keep Single and Married, collapse all remaining codes into "Other"
demographic <- demographic %>%
  mutate(marital = recode(Marital.status,
                          "1" = "Single",
                          "2" = "Married",
                          .default = "Other"))
head(demographic)
Make a contingency table for Marital Status and Study Status
## Target
## marital Dropout Enrolled Graduate
## Single 1184 720 2015
## Married 179 52 148
## Other 58 22 46
Make a mosaic plot for Marital Status and Study Status
mosaic(tablemarital, shade = TRUE, legend = TRUE,
       main = "Marital and Study Status",
       labeling_args = list(set_varnames = c(marital = "Marital Status",
                                             Target = "Study Status")))
The mosaic plot above shows the relationship between the students’ marital status and their study status. The strongest association is between married students and the Dropout status. It is positive, which indicates that married students are at higher risk of dropping out than their counterparts. For single students, there is a weak negative association with dropping out, which suggests a relationship between being single and not dropping out.
Conclusion
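In terms of student demographics, those who are at higher risk of dropping out include male students, students who enrolled at an older age (23 and above, particularly those over 25), and married students.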
This analysis does not quantify the strength of these associations, so further investigation is needed.
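One common way to summarize the strength of association in a contingency table is Cramér's V, which rescales the chi-squared statistic to the range 0 to 1. It is not part of the analysis above. The sketch below only illustrates the idea on the gender table, re-entered by hand from the printed output (`cramers_v` is a hypothetical helper name, and `gender_tab` and `v_gender` are assumptions for illustration):

```r
# Hypothetical helper: Cramér's V for a two-way table of counts.
cramers_v <- function(tab) {
  # Chi-squared statistic without continuity correction.
  chi2 <- unname(chisq.test(tab, correct = FALSE)$statistic)
  n <- sum(tab)
  # Smaller table dimension minus one.
  k <- min(nrow(tab), ncol(tab)) - 1
  sqrt(chi2 / (n * k))
}

# Counts copied from the printed Gender x Target contingency table.
gender_tab <- matrix(
  c(720, 487, 1661,
    701, 307,  548),
  nrow = 2, byrow = TRUE,
  dimnames = list(Gender = c("Female", "Male"),
                  Target = c("Dropout", "Enrolled", "Graduate"))
)

v_gender <- cramers_v(gender_tab)
round(v_gender, 3)
```

Applying the same helper to the age and marital status tables would allow the three demographic factors to be compared on a common scale.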