In this document, I analyzed three data sets provided by the University: “enrollment”, “grades”, and “programs”. The enrollment data set covers registration for a single DU term, Fall 2021, including each student's registration status, demographics, and program. The grades data set provides each student's final grades for all courses in that same term. Finally, the programs data set maps college, degree, and academic unit information. Using these three data sets, I attempted to answer eight challenging questions in three major sections.
In Part One, I cleaned the data sets and performed a few sanity checks on them. These checks are especially vital for automated processes. Next, I generated the final data product, a student-level data set. In the third part, I looked for the best methods of answering the questions through data manipulation and statistics. Last, I made minor changes to the student-level data set created in the previous sections and completed some final preparation tasks.
Before getting to the questions, it is necessary to make sure the provided data sets are reliable. In addition, assuming that all future data sets will have the same characteristics, I aimed to design an automated process. Automation requires a few extra data preparation steps: the major structural features of any future data set, including attribute names and types, must be in line with the automation code. Therefore, aside from identifying NAs and duplicated values, unifying formats, and verifying student IDs (see table 2), I also tabulated the attribute types of each data set (see table 1) and the dimensions of each data set (see table 3).
Table 2: Student ID verification

Students | Number |
---|---|
All | 9746 |
Error in Enrollment | 0 |
Not Enrolled | 1278 |

Table 3: Data set dimensions

Data set | Rows | Columns |
---|---|---|
Enrollment | 18214 | 11 |
Grades | 27520 | 3 |
Program | 70 | 4 |
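
The checks described above (missing values, duplicated records, expected attribute names, and data set dimensions) can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the code used for this report; the function name `sanity_report`, the set of missing-value markers, the column names, and the sample rows are all hypothetical.

```python
from collections import Counter

# Values treated as missing; this set is an assumption, adjust to the real files.
MISSING_MARKERS = {None, "", "NA"}


def sanity_report(rows, expected_columns, id_column):
    """Summarise basic quality checks for a list of dict rows
    (for example, rows read with csv.DictReader)."""
    columns = list(rows[0].keys()) if rows else []
    # Count missing values per column.
    missing = {c: sum(1 for r in rows if r.get(c) in MISSING_MARKERS)
               for c in columns}
    # Count rows that repeat an earlier row exactly.
    row_counts = Counter(tuple(r.get(c) for c in columns) for r in rows)
    duplicates = sum(n - 1 for n in row_counts.values())
    return {
        "dimensions": (len(rows), len(columns)),
        "schema_ok": columns == list(expected_columns),
        "missing": missing,
        "duplicate_rows": duplicates,
        "unique_ids": len({r[id_column] for r in rows}),
    }


# Hypothetical sample, not taken from the University data.
sample = [
    {"student_id": "1", "gender": "F"},
    {"student_id": "2", "gender": ""},
    {"student_id": "2", "gender": ""},
]
report = sanity_report(sample, ["student_id", "gender"], "student_id")
print(report)
```

Running the same function on every incoming file, and stopping the pipeline when `schema_ok` is false, is one simple way to keep future data sets in line with the automation code.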
To calculate the persistence rate, I used this formula:
\[\text{\% persistence rate} = \frac{\text{students who stayed enrolled until the end of the term}}{\text{students enrolled at the beginning of the term} - \text{students who finished school}} \times 100\]
Since the time period is just one term, I can safely drop the second term in the denominator and move forward. I created a binary variable to flag the students who had both census statuses on their record (1 for staying enrolled, 0 for not staying enrolled), and calculated the percentage of those who stayed enrolled. I decided to keep this variable in the final data product, since it provides a straightforward way of distinguishing the two categories.
The persistence rate for the Fall 2021 term is 0.869 (86.9%), according to the enrollment information provided to us. Let’s move on to the next question.
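
The flag and the rate described above can be sketched as follows. This is an illustrative reconstruction, not the original code: the census labels `"start"` and `"end"`, the function names, and the sample rows are hypothetical. The last line assumes that the 1,278 “Not Enrolled” students in table 2 are the students who did not persist.

```python
from collections import defaultdict


def persistence_flags(census_rows, start="start", end="end"):
    """Return {student_id: 1 or 0} for students present at the first census."""
    seen = defaultdict(set)
    for student_id, census in census_rows:
        seen[student_id].add(census)
    # Only students enrolled at the beginning of the term enter the denominator.
    return {sid: int(end in labels)
            for sid, labels in seen.items() if start in labels}


def persistence_rate(flags):
    """Percentage of flagged students who stayed enrolled."""
    return 100 * sum(flags.values()) / len(flags)


# Hypothetical rows: (student_id, census label).
rows = [(1, "start"), (1, "end"), (2, "start"),
        (3, "start"), (3, "end"), (4, "start"), (4, "end")]
flags = persistence_flags(rows)
print(persistence_rate(flags))  # 3 of the 4 sample students persisted

# Cross-check with table 2, assuming "Not Enrolled" counts the non-persisters.
print(round(100 * (9746 - 1278) / 9746, 1))
```

Under that assumption, the cross-check gives 86.9%, which agrees with the reported rate of 0.869.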
To investigate a possible difference in persistence rates between genders, I performed a chi-square test to give additional context for the observed results. The chi-square test is a helpful tool for investigating the relationship between two categorical variables. First, I constructed a categorical attribute to flag the students who stayed enrolled. Next, I created a contingency table and ran the chi-square test of independence. My null hypothesis is as follows: \(H_0: \text{The persistence rate is independent of gender in this cohort.}\) The results are as follows:
Test statistic | df | P value |
---|---|---|
0.9418 | 1 | 0.3318 |
The chi-square statistic is 0.942, which is smaller than the critical value for 1 degree of freedom (3.841 at a significance level of 0.05, based on the chi-square table). In addition, the p-value (0.332) is well above 0.05. As a result, I fail to reject the null hypothesis: the test found no evidence of a statistically significant difference in persistence rates between genders.
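
For readers who want to see the mechanics, here is a minimal sketch of the Pearson chi-square test of independence using only the Python standard library. It applies no continuity correction; the report does not state whether one was applied, so this is an assumption. The contingency table below is hypothetical, and `p_value_df1` is valid only for 1 degree of freedom, where the chi-square tail probability equals `erfc(sqrt(x / 2))`.

```python
import math


def chi_square_independence(table):
    """Return (statistic, degrees of freedom) for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df


def p_value_df1(statistic):
    """Upper-tail p-value of a chi-square statistic with exactly 1 df."""
    return math.erfc(math.sqrt(statistic / 2))


# Hypothetical 2x2 table: rows = gender, columns = (stayed, did not stay).
stat, df = chi_square_independence([[10, 20], [20, 10]])
print(stat, df, p_value_df1(stat))

# p-value implied by the statistic reported in the results table.
print(p_value_df1(0.9418))
```

The last line evaluates to roughly 0.33, in line with the p-value reported in the results table.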
As a first step to answering this question, I constructed one unified measure of race/ethnicity from the three attributes of visa type, ethnicity, and race, following the guide provided in the question. In this process, I also created a binary variable for international status. All visa types are flagged as international except for ‘PR’, ‘RF’, and ‘AS’, and, of course, the undisclosed statuses (shown as blank in the data).
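
The international flag described above reduces to a small rule, sketched below. The function and constant names are hypothetical, and the visa code `"F1"` in the usage lines is only an example value that may not appear in the data. The unified race/ethnicity measure is not sketched, because it follows a guide that is not reproduced in this report.

```python
# Visa codes that are NOT flagged as international, as listed in the text.
NON_INTERNATIONAL_VISA_CODES = {"PR", "RF", "AS"}


def is_international(visa_type):
    """Return 1 for an international student, 0 otherwise."""
    code = (visa_type or "").strip().upper()
    if not code:
        # Blank means the status is undisclosed, so it is not flagged.
        return 0
    return 0 if code in NON_INTERNATIONAL_VISA_CODES else 1


# Example usage with hypothetical values.
print([is_international(v) for v in ["F1", "PR", "", None]])
```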
Using the gender attribute of the enrollment data set, I present the gender composition of the cohort in figure 1. More than half of the class (57.1%) is recorded as female, while 42.9% is recorded as male.
Using the constructed race/ethnicity measure, figure 2 presents the race/ethnicity makeup of the cohort. Considering the demographics of the state of Colorado, it is not surprising to see White students in the lead. International students form the third-largest group, closely following the White and Hispanic or Latino groups, which is an intriguing pattern. Setting aside the Unknown group (3.44%), the smallest groups in the class are Native Hawaiian or Other Pacific Islander (1.92%), American Indian or Alaska Native (4.32%), and Multiple (two or more races) (7.4%) students, with Asian (9.87%) and Black or African American (11.71%) students in the middle.
It is also informative to see how the two major demographics (gender and race/ethnicity) relate to each other. Figure 3 shows both the race/ethnicity and gender of DU’s students in Fall 2021.
Female students outnumber male students in every race/ethnicity group of this cohort. Native Hawaiian or Other Pacific Islander students have the most balanced split, with 48.13% male. The male share is 44.36% for Hispanic or Latino, 43.97% for Multiple (two or more races), [value missing] for White, 42.39% for Unknown, 42.04% for American Indian or Alaska Native, 41.89% for Asian, 41.45% for Black or African American, and 41.36% for International students.
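
The within-group percentages above come from a simple cross-tabulation. The following sketch shows one way to compute the male share per race/ethnicity group; the function name, the gender codes `"M"` and `"F"`, and the sample records are hypothetical.

```python
from collections import Counter


def percent_male_by_group(students):
    """students: iterable of (race_ethnicity, gender) pairs.
    Returns {group: percentage of that group recorded as male}."""
    totals, males = Counter(), Counter()
    for group, gender in students:
        totals[group] += 1
        if gender == "M":
            males[group] += 1
    return {g: round(100 * males[g] / totals[g], 2) for g in totals}


# Hypothetical sample, not taken from the University data.
sample = [("Asian", "F"), ("Asian", "M"), ("Asian", "F"),
          ("White", "M"), ("White", "F")]
print(percent_male_by_group(sample))
```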
Program | GPA |
---|---|
BM | 3.248480 |
BFA | 3.242114 |
BSCPE-ECS | 3.233535 |
BSME-ECS | 3.222105 |
BAUNDE | 3.216374 |
BS-ENGR | 3.215658 |
BA-INTS | 3.215259 |
BS-NAT SCI | 3.213962 |
BA-ECS | 3.201580 |
BA-ART/HUMAN | 3.196689 |
BS-ECS | 3.196011 |
BSBA | 3.193920 |
BSACC | 3.183791 |
BA-NAT SCI | 3.180742 |
BSEE-ECS | 3.179546 |
BA-SOC SCI | 3.177805 |
BS-SOC SCI | 3.132791 |
I already have the GPA and program information in one merged data set. One question that came to mind was whether GPA differs conceptually from final scores; since I did not have access to course credit information, I assumed that they represent the same thing. As table 4 shows, BM, or “Business Management”, has the highest average GPA (3.248) of the 17 programs.
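
A ranking like the one in table 4 is a group-by average followed by a sort. The sketch below assumes each record already carries a numeric grade-point value and, consistent with the lack of credit information noted above, uses an unweighted mean. The function name and the sample values are hypothetical.

```python
from collections import defaultdict


def mean_gpa_by_program(records):
    """records: iterable of (program, grade_points) pairs.
    Returns [(program, mean)] sorted from highest to lowest mean."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for program, grade_points in records:
        totals[program] += grade_points
        counts[program] += 1
    means = {p: totals[p] / counts[p] for p in totals}
    return sorted(means.items(), key=lambda item: item[1], reverse=True)


# Hypothetical sample, not taken from the University data.
ranking = mean_gpa_by_program([("BM", 4.0), ("BM", 3.0), ("BFA", 3.0)])
print(ranking)
```

If credit hours were available, a credit-weighted mean would be the more conventional definition of GPA.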
To investigate the distribution of grades across the broad degree levels, I assigned a tag to each record indicating which degree category it belongs to. Figure 7 shows that BA is the degree with the best grades, followed by BS, BM, and finally BFA. In Fall 2021, 25% of all BA grades are an A, followed by A- at 20%.
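
One way to build such a tag is to read the degree level off the program code. The sketch below assumes the degree level is the leading part of each code in table 4 (for example, `BSBA` maps to BS and `BAUNDE` maps to BA); this prefix rule is an assumption, and the original tagging may have used the programs mapping instead. The function names and sample grades are hypothetical.

```python
from collections import Counter, defaultdict

# Longer prefixes come first so that "BFA" is matched before any shorter code.
DEGREE_PREFIXES = ("BFA", "BM", "BA", "BS")


def degree_level(program_code):
    """Map a program code to its broad degree level (assumed prefix rule)."""
    for prefix in DEGREE_PREFIXES:
        if program_code.startswith(prefix):
            return prefix
    return "Other"


def grade_shares(records):
    """records: iterable of (program_code, letter_grade) pairs.
    Returns {degree: {grade: percentage of that degree's grades}}."""
    counts = defaultdict(Counter)
    for program, grade in records:
        counts[degree_level(program)][grade] += 1
    return {degree: {g: round(100 * n / sum(c.values()), 1)
                     for g, n in c.items()}
            for degree, c in counts.items()}


# Hypothetical sample, not taken from the University data.
records = [("BA-INTS", "A"), ("BAUNDE", "A-"), ("BA-ECS", "A"),
           ("BA-SOC SCI", "B"), ("BSBA", "A"), ("BM", "B+")]
shares = grade_shares(records)
print(shares)
```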
As the last step of this analysis, I produced a student-level data set. Some of the new measures I constructed are included in the final product, and some of the original attributes were dropped.