# what is the term of this research?
term <- "Fall 2021"
In this document, I analyzed three data sets provided by the University: “enrollment”, “grades” and “programs”. The enrollment data set includes information about registration for a single DU term Fall 2021, covering student registration status, their demographics, and program. The grades data set provides final grades for each student for all courses in that same term. And finally, the program data set offers a mapping of college, degree, and academic unit information. Using these three data sets, I attempted to find answers to eight challenging questions in three major sections.
In Part One, I proceeded to both clean the data sets and perform a few sanity checks on them. These checks are especially vital for automation processes. Next, I followed through with the final data product generation process, generating a student-level data set. In the third part, I tried my hand at finding the best methods of answering the questions through data manipulation and statistics. Last, I made minor changes to the student-level data set created in the previous sections and completed some final preparation tasks.Before getting to the questions, it is necessary to make sure the
provided data sets are reliable. In addition, considering that all
future data sets will have the same characteristics, I aimed to design
an automated process. Automation needs a few extra steps of data
preparation. All major data structure factors, including attribute types
and names of any future data sets should be inline with the automation
code. Therefore, aside from NAs and duplicated value identification
steps, formatting unification and verifying student ID (see table 2), I
have also created a demonstration of attribute types for each data set
(see table 1), and a table of data set dimensions (see table 3).
#---- LIBRARIES ----
library(tidyverse)
library(formattable)
library(tibble)
library(formattable)
library(knitr)
library(kableExtra)
library(plotrix)
library(gridExtra)
library(viridis)
library(gridExtra)
library(cowplot)
library(gmodels)
library(ggpubr)
library(janitor)
library(pander)
#---- INPUT PATHS ----
## this is the location of all project files
DU_path_input <- "/Users/saina/Desktop/DU/Data_Scien_Tech_Assess/"
## names of all necessary input data file
DU_file_Enrollment <- "fall_enrollment.csv"
DU_file_Grades <- "grades.csv"
DU_file_Program <- "program_data.csv"
#---- READ IN DATA ----
Enrollment <- read.csv(paste0(DU_path_input, DU_file_Enrollment),stringsAsFactors=FALSE)
Grades <- read.csv(paste0(DU_path_input, DU_file_Grades),stringsAsFactors=FALSE)
Program <- read.csv(paste0(DU_path_input, DU_file_Program),stringsAsFactors=FALSE)
#---- OUTPUT PATHS ----
DU_path_output <- "/Users/saina/Desktop/DU/Outputs/"
Product_name <- "Student_Lavel.csv"
Attr_names <- "Attr_names.csv" # this file is produced in this script, but will be input for automation script
#print("Done with reading data and calling packages")
# SOME REFINEMENTS AND DATA CHECKS
## unified column name formats
names(Enrollment) <- tolower(names(Enrollment))
names(Grades) <- tolower(names(Grades))
names(Program) <- tolower(names(Program))
## drop nonsensical duplicated records
### not dropping duplicated records of Grades data because they are meaningful (different courses same term).
Program_duplicated <- (nrow(Program) == nrow(distinct(Program)))
### Program data does not include any duplicated records
Enrollment_duplicated <- (nrow(Enrollment) == nrow(distinct(Enrollment)))
### Enrollment does
Enrollment <- distinct(Enrollment) # 15 duplicates
## check variable types
TypeEnrollment <- rownames_to_column(as.data.frame(sapply(Enrollment, typeof)), "Enrollment Attributes")
names(TypeEnrollment)[2] <- "Types"
TypeGrades <- rownames_to_column(as.data.frame(sapply(Grades, typeof)), "Grades Attributes")
names(TypeGrades)[2] <- "Types"
TypeProgram <- rownames_to_column(as.data.frame(sapply(Program, typeof)), "Program Attributes")
names(TypeProgram)[2] <- "Types"
### a table announcing these numbers
knitr::kable(
list(TypeGrades, TypeEnrollment, TypeProgram),
caption = 'Attribute types', booktabs = TRUE) %>%
kable_styling(latex_options = "HOLD_position")
|
|
|
## creating a data set of column names to use in automation pre-checks. Need to be ran only once
### to flag new columns in next data sets
colsNames <- c(names(Enrollment), names(Grades), names(Program))
#write.csv(as.data.frame(colsNames), paste0(DU_path_output, Attr_names), col.names = c("Names"),sep = ",", row.names = F)
## creating a list of all student ids; this will be used for student level final product.
student_ids <- unique(c(unique(Enrollment$id), unique(Grades$id))) # 9746
Missing_ids <- setdiff(unique(Grades$id), unique(Enrollment$id)) # 0, mistakenly not named in enrollment data set
DroppedOut_ids <- setdiff(unique(Enrollment$id), unique(Grades$id)) # 1278
### a table announcing these numbers
Number <- c(length(student_ids), length(Missing_ids), length(DroppedOut_ids))
Students <- c("All", "Error in Enrollment", "Not Enrolled")
kable(data.frame(Students, Number), caption = "Students Information") %>%
kable_styling(latex_options = "HOLD_position", full_width = F)
Students | Number |
---|---|
All | 9746 |
Error in Enrollment | 0 |
Not Enrolled | 1278 |
## what does this data look like?
### creating a table of dim()
Row <- c(dim(Enrollment)[1], dim(Grades)[1], dim(Program)[1])
Column <- c(dim(Enrollment)[2], dim(Grades)[2], dim(Program)[2])
Data <- c("Enrollment", "Grades", "Program")
kable(data.frame(Data, Row, Column), caption = "Data Sets Dimension") %>% # dimension
kable_styling(latex_options = "HOLD_position", full_width = F)
Data | Row | Column |
---|---|---|
Enrollment | 18214 | 11 |
Grades | 27520 | 3 |
Program | 70 | 4 |
# SANITY CHECK FLAGS
## are they any NAs in any of the three data?
## do the columns have the same names as I want them to have in the data sets? (for automation)
Attr_names_dt <- read.csv(paste0(DU_path_output, Attr_names)) # created in last chunk
dts <- list(Enrollment,Grades, Program)
appended_dt <- NULL
for (i in 1:3){
# NAs
A <- dts[[i]] %>%
summarise(across(everything(), ~ sum(is.na(.x))))
appended_dt <- as.data.frame(append(appended_dt, A))
if (i == 3 & (sum(colSums(appended_dt) != 0))){
print("We have NAs in a dataset") # flagging the need of attention
print(appended_dt)
}
# col names
if (sum(!(names(dts[[i]]) %in% Attr_names_dt$colsNames) != 0)) {
print("Not the same column names") # flagging the need of attention
}
}
#
Now I have clean data sets. The code will flag anything that needs
attention.
# MAKING ONE SINGLE DATA SET - COMBINING ALL DATA SETS WITH GRADES AND FINAL MAJOR
names(Enrollment) # 18214
names(Grades) # 27520
names(Program) # 70
## Enrollment + grade
Enrollment_grade <- left_join(Enrollment, Grades[,c(1,3)], by = "id") #56318
Enrollment_grade <- distinct(Enrollment_grade) # 45820
sum(is.na(Enrollment_grade$final_course_grade)) # 1278 - students who have not stayed enrolled do not have a grade
summary(Enrollment_grade) # checking for other possible NAs produced
## Enrollment_grade + Program
## to do this, to ensure a valid join, I join the Enrollment_grade and program data on three shared columns
Enrollment_full <- left_join(Enrollment_grade, Program, by = c("college" = "college", "degr" = "degree", "majr" = "major")) #45820
Since the data sets’ volume is not large, consolidating all of them is
plausible. In the process, I ended up with a single data set with 2.5
times the volume of the initial enrollment data. This is mainly due to
the grades of various courses that are presented in the Grades data set.
A few simple sanity checks were also performed to ensure that the final
product of the merges was reliable. The output is used as the input for
the rest of this analysis, with 45820 records and 13 attributes
describing the students, their performance and degree information.
To calculate the persistence rate, I used this formula:
\[\text{\% persistence rate} = \frac {\text {students who stayed enrolled till the end of the term}}{\text{number of enrolled students at the begining of the term} - \text{students who finished school}} \times 100\] Since the time period is just one term, I can safely drop the second statement in the denominator and move forward. I created a binary variable to flag the students that had both census status on their record (1 for staying enrolled, 0 for not staying enrolled), and calculated the percentage of those who stayed enrolled in class. I decided to keep this variable in the final data product, since it provides a straight forward way of recognizing the two categories.
# CREATING PERSISTENCE RATE
length(unique(Enrollment_full$id)) #9746
table(Enrollment_full$census) #EOT WK3 22271 23549
table(Enrollment_full$term_code) #202170 45820
## exclude enrolled students at week 3 - keeping only student id and census
data_WK3 <- Enrollment_full %>%
select(id, census, legal_sex_desc) %>% # keeping gender variable for the next question
filter(census == "WK3") %>%
distinct(.) # 9746
## exclude enrolled students at the end of the term and create a dummy - keeping only student id and census
data_EOT <- Enrollment_full %>%
select(id, census) %>%
filter(census == "EOT") %>%
dplyr::select(id) %>%
mutate(persist = 1) %>%
distinct(.) # 8468
## Joining the two and calculating of persistence rate
Enrollment_merged <- data_WK3 %>%
left_join(data_EOT, by = 'id') %>% #9746
mutate(Persistence_rate = sum(persist, na.rm = T)/n())
## adding the two persistence related new measures to the full student level data set
Enrollment_full <- left_join(Enrollment_full, Enrollment_merged[,c(1,4,5)], by = "id") #45820
The persistence rate of the Fall 2021 term is 0.869, according to the Enrollment information provided to us. Let’s move on to the next question.
To investigate the possible difference in persistence rates between genders, I decided to perform a Chi-square test to give additional context for the observed results. Chi-square test is a helpful tool in investigating the relationship between two categorical variables. First, I constructed a categorical attribute to flag the students that have stayed enrolled. Next, I created a contingency table and ran the Chi-Square test of independence. My null hypothesis is as followed: \(\begin{aligned} {H_o} = \text{ The persistence rate is independent of gender in this cohort.} \ \end{aligned}\) The results are as follows:
# CHI-SQUARE TEST
## let's create a categorical value of persistence
### using the data set from last section to avoid repetition of id frequencies caused by joining
Enrollment_merged$persist_cat <- ifelse(is.na(Enrollment_merged$persist), "Attrition", "Persistence")
Enrollment_test <- Enrollment_merged[, c("legal_sex_desc", "persist_cat")]
## creating a contingency table
Enrollment_test <- table(Enrollment_test$legal_sex_desc, Enrollment_test$persist_cat)
test <- chisq.test(Enrollment_test)
pander(test)
Test statistic | df | P value |
---|---|---|
0.9418 | 1 | 0.3318 |
chi <- round(test$statistic, 3)
pval <- round(test$p.value, 3)
#test$expected
critical_val <- round(qchisq(0.05, df = 1, lower.tail=FALSE), 3)
# if statement for p value
if (pval > 0.05){
sig <- "not significant"
result_p <- "the test found no"
} else {
sig <- "significant"
result_p <- "the test found"
}
# if statement for chi-square value
if (chi > critical_val){
result_c <- "larger"
} else {
result_c <- "smaller"
}
The chi-square value is 0.942, smaller than the critical value of the degree of freedom of 1 (3.841 for p = 0.05; based on the chi-square table). In addition, the p-value is not significant. As a result, the test found no evidence of a statistically significant difference in persistence rates between genders.
As a first step to answering this question, I constructed one unified measure of race/ethnicity using the three attributes of visa types, ethnicity, and race, following the guide provided in the question. In this process, I created a binary variable of internationality as well. All visa types will be flagged as international except for ‘PR’, ‘RF’, and ‘AS’, and of course, the undisclosed statuses (showing blank in the data).
# CONSTRUCTING A RACE/ETHNICITY VARIABLE
## replacing the blank values of visa_desc with NAs
Enrollment_full$visa_desc <- ifelse(Enrollment_full$visa_desc == "", NA, Enrollment_full$visa_desc)
## internationals
Enrollment_full$internationality <- ifelse(!(Enrollment_full$visa_desc %in% c("PR", "RF", "AS")) & !is.na(Enrollment_full$visa_desc), 1, 0)
## filling in NAs of internationality with zeros
Enrollment_full$internationality <- ifelse(is.na(Enrollment_full$internationality), 0, Enrollment_full$internationality)
## combining all into race/ethnicity variable
Enrollment_full$Race_Ethnicity <- ifelse(Enrollment_full$internationality == 1, "International", ifelse(Enrollment_full$ethn_desc == "Hipanic or Latino", "Hipanic or Latino", Enrollment_full$race_desc))
female <- round(sum(Enrollment_full$legal_sex_desc == "Female")/nrow(Enrollment_full), 3)*100
male <- round(sum(Enrollment_full$legal_sex_desc == "Male")/nrow(Enrollment_full), 3)*100
# gender distribution results
if (female == male) {
gender_stat <- "The proportion of legally male and female stuents are the same"
} else {
if (female > 50) {
gender_stat <- "More than half of our class"
} else {
gender_stat <- "Less than half of our class"
}
}
Having access to the gender attribute of the enrollment data set, I can demonstrate gender composition in the cohort as presented in figure 1. More than half of our class (57.1%) are legally female students while 42.9% are male.
# PIE CHART 1 FOR GENDER
pie(c((sum(Enrollment_full$legal_sex_desc == "Female"))/nrow(Enrollment_full),(sum(Enrollment_full$legal_sex_desc == "Male"))/nrow(Enrollment_full)), labels = c('57%','43%'),
col = c('red','lightgreen'), border="white", init.angle = 90, cex = .7)
legend(-1.75,1,legend =c('Female','Male'),
col = c('red','lightgreen'),
pch = 22, pt.bg = c('red','lightgreen'),
cex = .7)
title(main = 'Figure 1: Breakdown of Gender', cex.main = 0.7)
# PIE CHART 2 FOR RACE/ETHNICITY
## data prep for race plot
race_graph <- Enrollment_full %>%
select(id, Race_Ethnicity) %>%
distinct(.) %>% # dropping duplicates of students since census and grade create many
group_by(Race_Ethnicity) %>%
summarise(race = n()) %>%
ungroup()
race_graph$prop <- (race_graph$race)/sum(race_graph$race)*100
## pie chart 2 for race
### ordering the data
race_graph_descend <- race_graph[order(-race_graph$prop), ]
labl <- c("White", "Hispanic", "International", "Black","Asian", "Multiple", "Native American", "Unknown", "Pacific Islander")
pie(c((race_graph_descend$prop)), labels = (paste0(c(round(race_graph_descend$prop)),"%")), col = hcl.colors(length(labl), "Spectral"),
border="white", init.angle = 90, cex = .7)
legend(-2,1,legend =labl,
col = hcl.colors(length(labl), "Spectral"),
pch = 22, pt.bg = hcl.colors(length(labl), "Spectral"),
cex = .75)
title(main = 'Figure 2: Breakdown of Race', cex.main = 0.7)
## automation variables
if (race_graph_descend$Race_Ethnicity[race_graph_descend$prop == max(race_graph_descend$prop)]
%in% c("White", "white", "WHITE")){
race_sentence1 <- "Considering the demographics of the state of Colorado, it is not surprising to see the percentage of white students in the lead"
} else {
race_sentence1 <- paste0("The percentage of", race_graph_descend$Race_Ethnicity[race_graph_descend$prop == max(race_graph_descend$prop)], "students is the highest in this cohort")
}
### having internationals in top 3 deserves a sentence to itself
if (race_graph_descend$Race_Ethnicity[1] %in% c("International", "international", "INTERNATIONAL")){
race_sentence2 <- paste0("Having international as the highest demographic in class followed by ", race_graph_descend$Race_Ethnicity[2], " and ", race_graph_descend$Race_Ethnicity[3], "proportions", " is an intriguing emerging pattern!")
} else {
if (race_graph_descend$Race_Ethnicity[2] %in% c("International", "international", "INTERNATIONAL")){
race_sentence2 <- paste0("Having international as the second highest demographic in class followed by ", race_graph_descend$Race_Ethnicity[3], "proportions", " is an intriguing emerging pattern!")
} else {
if (race_graph_descend$Race_Ethnicity[3] %in% c("International", "international", "INTERNATIONAL")){
race_sentence2 <- paste0("The high proportion of internationals in third place, closely following ", race_graph_descend$Race_Ethnicity[1], " and ", race_graph_descend$Race_Ethnicity[2], "proportions", " is an intriguing emerging pattern!")
}
}
}
race_sentence3 <- paste0(round(race_graph_descend$prop[race_graph_descend$Race_Ethnicity %in% c("Unknown", "Unknown", "UNKNOWN")],2), "%")
race_sentence4 <- race_graph_descend$Race_Ethnicity[race_graph_descend$prop == min(race_graph_descend$prop) & !(race_graph_descend$Race_Ethnicity %in% c("Unknown", "Unknown", "UNKNOWN"))]
race_sentence4_prop <- paste0(round(race_graph_descend$prop[race_graph_descend$Race_Ethnicity == race_sentence4], 2), "%")
race_sentence5 <- race_graph_descend$Race_Ethnicity[race_graph_descend$prop == min(race_graph_descend$prop[!(race_graph_descend$Race_Ethnicity %in% c("Unknown", "Unknown", "UNKNOWN", race_sentence4))])]
race_sentence5_prop <- paste0(round(race_graph_descend$prop[race_graph_descend$Race_Ethnicity == race_sentence5], 2), "%")
race_sentence6 <- race_graph_descend$Race_Ethnicity[race_graph_descend$prop == min(race_graph_descend$prop[!(race_graph_descend$Race_Ethnicity %in% c("Unknown", "Unknown", "UNKNOWN", race_sentence4, race_sentence5))])]
race_sentence6_prop <- paste0(round(race_graph_descend$prop[race_graph_descend$Race_Ethnicity == race_sentence6], 2), "%")
race_sentence7 <- race_graph_descend$Race_Ethnicity[race_graph_descend$prop == min(race_graph_descend$prop[!(race_graph_descend$Race_Ethnicity %in% c("Unknown", "Unknown", "UNKNOWN", race_sentence4, race_sentence5, race_sentence6))])]
race_sentence7_prop <- paste0(round(race_graph_descend$prop[race_graph_descend$Race_Ethnicity == race_sentence7], 2), "%")
race_sentence8 <- race_graph_descend$Race_Ethnicity[race_graph_descend$prop == min(race_graph_descend$prop[!(race_graph_descend$Race_Ethnicity %in% c("Unknown", "Unknown", "UNKNOWN", race_sentence4, race_sentence5, race_sentence6, race_sentence7))])]
race_sentence8_prop <- paste0(round(race_graph_descend$prop[race_graph_descend$Race_Ethnicity == race_sentence8], 2), "%")
Leveraging the constructed race/ethnicity measure of the enrollment data set, the race/ethnicity make up of the cohort is presented as figure 2. Considering the demographics of the state of Colorado, it is not surprising to see the percentage of white students in the lead. The high proportion of internationals in third place, closely following White and Hipanic or Latinoproportions is an intriguing emerging pattern! Setting aside the unknown group (3.44%), the smallest racial minorities in the class are the Native Hawaiian or Other Pacific Islander (1.92%), American Indian or Alaska Native (4.32%), and Multiple (two or more races) (7.4%) students, with the Asian (9.87%) and Black or African American (11.71%) races hovering in the middle.
One other interesting piece of information can be what the composition of the two major demographics (gender and race/ethnicity) looks like in relation to each other. Figure 3 shows both the race/ethnicity and gender of DU’s students in Fall 2021.
# RACE/GENDER PYRAMID
## data prep for plot
Enrollment_bar <- Enrollment_full %>%
select(id, Race_Ethnicity, legal_sex_desc) %>%
distinct(.) %>% # dropping duplicates of students since census and grade create many
group_by(Race_Ethnicity, legal_sex_desc) %>%
tally() %>%
mutate( n = ifelse(legal_sex_desc=="Female", n*(-1), n*1)) # manipulating data - negative values for one category
ggplot(Enrollment_bar, aes(y = n, x=reorder(Race_Ethnicity, -n))) +
geom_col(aes(fill=legal_sex_desc)) +
scale_fill_manual("", values = c("#66FFFF", "#993300")) +
scale_y_continuous(limits = c(-1800, 1800), breaks = scales::pretty_breaks(n = 6),
labels = function(br) ifelse(abs(br)>=1000,paste0(abs(br)/1000, "k"), abs(br))) +
labs(caption = "Figure 3: proportion of female and male students in race/ethnicity groups") +
coord_flip() +
xlab("") +
ylab("") +
theme_bw() +
theme(text = element_text(size=9, vjust = 0.5, hjust=1, face = "bold"), axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1, face = "bold"), plot.title = element_text(size=9, face = "bold", hjust = 0.5), legend.position = 'right')
## automation variables
Enrollment_automation <- Enrollment_full %>%
select(id, Race_Ethnicity, legal_sex_desc) %>%
distinct(.) %>% # dropping duplicates of students since census and grade create many
group_by(Race_Ethnicity, legal_sex_desc) %>%
tally() %>%
ungroup() %>%
spread(legal_sex_desc, n) %>%
group_by(Race_Ethnicity) %>%
summarise(female_race_rate = round((Female/(Female+Male)*100),2),
male_race_rate = round((Male/(Female+Male)*100),2),
closest = Female - Male,
gender_race_flag = ifelse((Female - Male) > 0, 1, 0))
## a gender is dominant on all races
if (sum(Enrollment_automation$gender_race_flag) == 9){
sentence1 <- "Female students are the leading gender in this cohort across all races"
} else {
if (sum(Enrollment_automation$gender_race_flag) == 0){
sentence1 <- "Male students are the leading gender in this cohort across all races"
}
}
## what races has the closest difference
sentence2 <- Enrollment_automation$Race_Ethnicity[Enrollment_automation$closest == min(abs(Enrollment_automation$closest))]
sentence2_rate <- paste0(Enrollment_automation$male_race_rate[Enrollment_automation$Race_Ethnicity == sentence2], "%")
Enrollment_automation <- arrange(Enrollment_automation, -Enrollment_automation$male_race_rate)
races <- unique(Enrollment_automation$Race_Ethnicity[!(Enrollment_automation$Race_Ethnicity %in% c(sentence2))])
sentence3_rate <- list()
sentence3 <- list()
for (i in 1:length(races)){
sentence3_rate <- append(sentence3_rate, paste0(Enrollment_automation$male_race_rate[Enrollment_automation$Race_Ethnicity == races[i]], "%"))
sentence3 <- append(sentence3, races[i])
}
Female students are the leading gender in this cohort across all races. Native Hawaiian or Other Pacific Islander are the race group that has the closest proportion of males and females, with 48.13% of males. This percentage is 44.36% for Hipanic or Latino, 43.97% for Multiple (two or more races), White for White, 42.39% for Unknown, 42.04% for American Indian or Alaska Native, 41.89% for Asian, 41.45% for Black or African American and 41.36% for International populations.
# AGE DISTRIBUTION ILLUSTRATIONS
## convert the birth date column to date type
Enrollment_full$birth_date <- as.Date(Enrollment_full$birth_date, format = "%m/%d/%Y")
## calculate age from birth date
Enrollment_full$age <- round(as.numeric(difftime(Sys.Date(), Enrollment_full$birth_date, units = "weeks"))/52.25)
## data prep for plotting age distribution
Enrollment_age <- Enrollment_full %>%
select(id, age) %>%
distinct(.) %>%
group_by(age) %>%
tally()
## numeric age plot
age <- ggplot(Enrollment_age, aes(x=age, y=n)) +
geom_line(color = "#BF87B3", size = 1.3) +
geom_point(color = "#3F2D91") +
theme_bw() +
scale_x_continuous(limits = c(16, 70)) +
xlab("") +
ylab("") +
theme(text = element_text(size=9, vjust = 0.5, hjust=1, face = "bold"), axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1, face = "bold"), plot.title = element_text(size=8, face = "bold", hjust = 0.5))
## defining categorical age ranges
AgeUnder19 <- round((sum(Enrollment_age$n[Enrollment_age$age < 19])/sum(Enrollment_age$n))*100,1)
Age19 <- round((sum(Enrollment_age$n[Enrollment_age$age == 19])/sum(Enrollment_age$n))*100,1)
Age2022 <- round((sum(Enrollment_age$n[Enrollment_age$age %in% c(20,21,22)])/sum(Enrollment_age$n))*100,1)
AgeOver22 <- round((sum(Enrollment_age$n[Enrollment_age$age > 22 ])/sum(Enrollment_age$n))*100,1)
## creating a table of age ranges with their distributions
Percentage <- c(AgeUnder19, Age19, Age2022, AgeOver22)
Age <- c("AgeUnder19", "Age19", "Age2022", "AgeOver22")
table_age <- qplot(1:10, 1:10, geom = "blank") +
theme_void() +
annotation_custom(grob = tableGrob(data.frame(Age, Percentage), rows = NULL),
xmin = -5, xmax = 10, ymin = 1, ymax = 10)
## attaching two demonstrations together
title <- ggdraw() +
draw_label("Figure 4: Age Distribution: A) Barplot B) Percentage table", fontface='bold', size = 9)
fg4 <- plot_grid(age, table_age, align = "v", labels = c('A', 'B'))
plot_grid(title, fg4, ncol = 1, rel_heights=c(0.1, 1))
## automation variables
Age <- c("under 19", "19", "between 20 and 22", "over 22")
age_auto <- data.frame(Age, Percentage)
sentence1 <- age_auto$Age[age_auto$Percentage == max(age_auto$Percentage)]
sentence1_rate <- paste0(age_auto$Percentage[age_auto$Age == sentence1], "%")
sentence2 <- age_auto$Age[age_auto$Percentage == max(age_auto$Percentage[!(age_auto$Age %in% c(sentence1))])]
sentence2_rate <- paste0(age_auto$Percentage[age_auto$Age == sentence2], "%")
sentence3 <- age_auto$Age[age_auto$Percentage == max(age_auto$Percentage[!(age_auto$Age %in% c(sentence1, sentence2))])]
sentence3_rate <- paste0(age_auto$Percentage[age_auto$Age == sentence3], "%")
sentence4 <- age_auto$Age[age_auto$Percentage == max(age_auto$Percentage[!(age_auto$Age %in% c(sentence1, sentence2, sentence3))])]
sentence4_rate <- paste0(age_auto$Percentage[age_auto$Age == sentence4], "%")
sentence5 <- max(Enrollment_age$age)
sentence6 <- min(Enrollment_age$age)
As presented in figure 4, the majority of the students are 19 (50.1%). 45.6% of the class is between 20 and 22. The remainder belongs to ages of under 19 (0.7%) and over 22 (3.6%). The oldest reported age is 67, and the youngest is 16.
#GRADE DISTRIBUTION
## choosing to use Grades data directly for this question
## function to convert letter grades to numeric version
grade_scores <- function(x) {
A <- factor(x, levels=c("A", "A-", "B+", "B", "B-", "C+", "C", "C-", "D+", "D", "D-", "F")) ## added score for D-
values <- c(4, 3.7, 3.3, 3, 2.7, 2.3, 2, 1.7, 1.3, 1, 0.7, 0)
values[A]
}
Grades$grade_scores <- lapply(Grades$final_course_grade, grade_scores)
Enrollment_full$grade_scores <- lapply(Enrollment_full$final_course_grade, grade_scores) # adding to student level data
Grades$grade_scores <- as.numeric(Grades$grade_scores)
Enrollment_full$grade_scores <- as.numeric(Enrollment_full$grade_scores)
Grades_byID <- Grades %>%
group_by(id) %>%
summarise(per_id = mean(grade_scores))
## plot
par(mfrow=c(1,2))
boxplot(Grades_byID$per_id, horizontal=TRUE, ylim=c(0,4), col=rgb(0.8,0.8,0,0.5), frame=T, main = "Figure 5:Final Exam Scores Distribution", cex.main = 0.7)
hist(Grades_byID$per_id,main = "Figure 6: Histogram of Final Exam Scores", cex.main = 0.7,
breaks=c(0,0.5,1,1.5,2,2.5,3,3.5,4), xaxp = c(0,4,8),
xlab = "", ylab = "",xlim = c(0,4), col="darkmagenta")
abline(v = median(Grades_byID$per_id),col="darkslategray4", lwd = 5) # median line
sentence1 <- paste0(round(((sum(Grades_byID$per_id > mean(Grades_byID$per_id))/nrow(Grades_byID))*100),2), "%")
mediangrade <- round(median(Grades_byID$per_id),2)
min_score <- min(Grades_byID$per_id)
#sum(Grades_byID$per_id > 3 & Grades_byID$per_id < 3.5)/nrow(Grades_byID)
#sum(Grades_byID$per_id >= 3.5)
After converting the letter grades to numbers, the distribution of grade scores was presented (figures 5 and 6). 54.89% of the class have scored higher than the average. Both figures 5 and 6 also present the median grade (3.33 or B+) as a vertical line. There are a couple outliers on the lower side of the grade distribution, demonstrated by the empty circles in figure 5. The minimum score in this class is 0.85. Both quarter one and quarter three are the sides of higher scores. According to figure 6, scores between 3 and 3.5 are the most frequent scores (38% of the class, in fact). The histogram is quite left-skewed. So, in summary, the students have done a great job on finals!
#GRADES ACROSS PROGRAMS
## Let's aggregate the grade data on program level
GPA_program <- Enrollment_full %>%
select(id, program, grade_scores) %>%
distinct(.)
GPA_program <- GPA_program %>% #17 rows
group_by(program) %>%
summarise(GPA = mean(grade_scores, na.rm = T))
## cross tab
GPA_program %>%
arrange(.,desc(GPA)) %>% # order
rename(Program = program) %>%
kable(caption = "Average GPA across Programs") %>%
kable_styling(latex_options = "HOLD_position", font_size = 7,
bootstrap_options = c("condensed", "striped", "bordered")) %>%
row_spec(0, font_size=10)
Program | GPA |
---|---|
BM | 3.248480 |
BFA | 3.242114 |
BSCPE-ECS | 3.233535 |
BSME-ECS | 3.222105 |
BAUNDE | 3.216374 |
BS-ENGR | 3.215658 |
BA-INTS | 3.215259 |
BS-NAT SCI | 3.213962 |
BA-ECS | 3.201580 |
BA-ART/HUMAN | 3.196689 |
BS-ECS | 3.196011 |
BSBA | 3.193920 |
BSACC | 3.183791 |
BA-NAT SCI | 3.180742 |
BSEE-ECS | 3.179546 |
BA-SOC SCI | 3.177805 |
BS-SOC SCI | 3.132791 |
sentence1 <- GPA_program$program[GPA_program$GPA == max(GPA_program$GPA)]
I already have the information about GPA and the programs in one merged data set. One question that came to my mind was whether there is a conceptual difference between GPA and final scores, but since I did not have access to course credit information, I decided to assume that they represent the same thing. As demonstrated in table 4, BM, or “Business Management”, has the highest average scores across all 17 programs.
I assigned a tag to each record indicating which category of degree it belongs to in order to investigate the distribution of grades across the broad degree level. Assisted by figure 7, we can see that BA is the degree with the best grades, followed by BS, BM, and finally BFA. 25% of all grades in BA are A, followed by a 20% rate of A- for the Fall 2021.
# GRADES ACROSS DEGREE
## constructing a measure for broad degree level
Enrollment_full$degree_level <- ifelse(str_detect(Enrollment_full$degr, "BA"), "BA", ifelse(str_detect(Enrollment_full$degr, "BS"), "BS", ifelse(str_detect(Enrollment_full$degr, "BM"), "BM", ifelse(str_detect(Enrollment_full$degr, "BFA"), "BFA", NA))))
## data prep
Grade_degree <- Enrollment_full %>% # filtering out for the broad degree level
select(id, degree_level, final_course_grade) %>%
distinct(.) %>%
group_by(degree_level, final_course_grade) %>%
tally() %>%
filter(!is.na(final_course_grade)) # NAs happen because not everyone on th enrollment data has a grade
## 47 records
## plot
ggballoonplot(Grade_degree, fill = "value", shape = 21) +
gradient_fill(c("blue", "white", "red")) + ggtitle("Figure 7: Grade distribution across the broad degree levels") +
theme(plot.title = element_text(size = 9, face = "bold"), axis.text.y = element_text(angle = -90, vjust = 0.5, hjust=1, face = "bold"), axis.text.x = element_text(vjust = 0.5, hjust=1, face = "bold"))
# UNDECLARED STUDENTS WHO DECLARED MAJOR
## choosing to use the original enrollment data
## getting a data set of all undeclared students, getting their ids and extracting major info of those ids
Enrollment_UN <- Enrollment[str_detect(Enrollment$majr, "UN"),] # 929
Enrollment_both_dec <- Enrollment[Enrollment$id %in% Enrollment_UN$id,] #1161
## filtering for those UN students who have two records under their names
Enrollment_both_dec <- Enrollment_both_dec %>%
select(-census) %>%
distinct(.) %>% # this drops whoever stayed undeclared
group_by(id) %>%
mutate(un_major = n()) %>%
filter(un_major == 2) #488
prop_dec_per_cohort <-paste0(round((length(unique(Enrollment_both_dec$id))/length(unique(Enrollment$id)))*100,2), "%") #2.503591
prop_dec_per_UN <- paste0(round((length(unique(Enrollment_both_dec$id))/nrow(Enrollment_UN))*100,2), "%") #26.2648
As the last step of this analysis, I produced a student level data set. Some of the new measures I constructed are included in the final product, and some of the original attributes were dropped.
# FINAL TOUCH UPS OF FINAL PRODUCT
names(Enrollment_full) #45820 records
## reorder columns
Enrollment_final <- Enrollment_full[,c(1:3,14,8:10,20,13,11,18,12,19,15,17)]
names(Enrollment_final) <- tolower(names(Enrollment_final))
summary(Enrollment_final)
Enrollment_final$final_course_grade <- ifelse(is.na(Enrollment_final$final_course_grade), "No Grade", Enrollment_final$final_course_grade)
Enrollment_final$grade_scores <- ifelse(is.na(Enrollment_final$grade_scores), 0, Enrollment_final$grade_scores)
Enrollment_final$persist <- ifelse(is.na(Enrollment_final$persist), 0, Enrollment_final$persist)
# OUTPUTTING
write.csv(Enrollment_final, paste0(DU_path_output, Product_name), row.names = F)