In this document, I analyzed three data sets provided by the University: “enrollment”, “grades”, and “programs”. The enrollment data set covers registration for a single DU term, Fall 2021, including each student's registration status, demographics, and program. The grades data set provides each student's final grades for all courses in that same term. Finally, the program data set maps college, degree, and academic unit information. Using these three data sets, I set out to answer eight challenging questions in three major sections.

In Part One, I cleaned the data sets and performed a few sanity checks on them. These checks are especially important for automated processes. Next, I generated the final data product, a student-level data set. In the third part, I looked for the best way to answer each question through data manipulation and statistics. Last, I made minor changes to the student-level data set created in the previous sections and completed some final preparation tasks.

Before getting to the questions, it is necessary to make sure the provided data sets are reliable. In addition, assuming that all future data sets will have the same characteristics, I aimed to design an automated process. Automation needs a few extra data preparation steps: all major structural features of any future data set, including attribute names and types, must be in line with the automation code. Therefore, in addition to identifying NAs and duplicated values, unifying formats, and verifying student IDs (see table 2), I also listed the attribute types of each data set (see table 1) and the dimensions of each data set (see table 3).

Table 1: Attribute types

Grades

| Attribute | Type |
| --- | --- |
| id | integer |
| term_code | integer |
| final_course_grade | character |

Enrollment

| Attribute | Type |
| --- | --- |
| id | integer |
| term_code | integer |
| census | character |
| race_desc | character |
| legal_sex_desc | character |
| ethn_desc | character |
| visa_desc | character |
| college | character |
| degr | character |
| majr | character |
| birth_date | character |

Program

| Attribute | Type |
| --- | --- |
| college | character |
| degree | character |
| major | character |
| program | character |

Table 2: Students information

| Students | Number |
| --- | --- |
| All | 9746 |
| Error in Enrollment | 0 |
| Not Enrolled | 1278 |

Table 3: Data set dimensions

| Data | Rows | Columns |
| --- | --- | --- |
| Enrollment | 18214 | 11 |
| Grades | 27520 | 3 |
| Program | 70 | 4 |

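The checks described above (attribute names and types, missing values, duplicates) could be sketched as follows. This is a minimal illustration, not the code used for this report: the function name `check_rows`, the mapping of "integer" and "character" to Python's `int` and `str`, and the sample rows (including the term code 202170) are all assumptions made for the example.

```python
from collections import Counter

# Expected attribute types for the grades data set, taken from Table 1.
# Mapping integer -> int and character -> str is an assumption of this sketch.
EXPECTED_GRADES_SCHEMA = {"id": int, "term_code": int, "final_course_grade": str}


def check_rows(rows, schema):
    """Return a list of human-readable problems found in `rows`."""
    problems = []
    for i, row in enumerate(rows):
        # Column names must match the schema exactly.
        if set(row) != set(schema):
            problems.append(f"row {i}: unexpected columns {sorted(row)}")
            continue
        for name, expected_type in schema.items():
            value = row[name]
            if value is None or value == "":
                problems.append(f"row {i}: missing value in '{name}'")
            elif not isinstance(value, expected_type):
                problems.append(f"row {i}: '{name}' is not {expected_type.__name__}")
    # Flag records that appear more than once.
    counts = Counter(tuple(sorted(row.items())) for row in rows)
    for record, n in counts.items():
        if n > 1:
            problems.append(f"duplicate record appears {n} times: {dict(record)}")
    return problems


# Made-up sample rows for illustration only.
sample = [
    {"id": 1, "term_code": 202170, "final_course_grade": "A"},
    {"id": 1, "term_code": 202170, "final_course_grade": "A"},     # duplicate
    {"id": 2, "term_code": "202170", "final_course_grade": "B+"},  # wrong type
    {"id": 3, "term_code": 202170, "final_course_grade": ""},      # missing value
]
problems = check_rows(sample, EXPECTED_GRADES_SCHEMA)
```

An empty `problems` list would mean the incoming file matches the expected structure; anything else is flagged for attention before the automated steps run.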
With the data sets cleaned, the code flags anything that needs attention. Since the data sets are not large, consolidating all of them into one is feasible. The merge produced a single data set with about 2.5 times as many rows as the initial enrollment data, mainly because the Grades data set contributes a row for each course grade. I also performed a few simple sanity checks to ensure that the merged output was reliable. This output, with 45820 records and 13 attributes describing the students, their performance, and their degree information, is the input for the rest of this analysis.
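To illustrate why the merged data set grows, here is a small sketch of the consolidation as two left joins. The join keys (`id` and `term_code` for grades; `college`, `degr`, `majr` matched to `college`, `degree`, `major` for programs) are my assumption based on the attribute names in table 1, and the sample values are made up.

```python
def merge_student_records(enrollment, grades, programs):
    """Left-join enrollment rows to grades and to the program mapping."""
    # Index grades by (id, term_code); one student can have many course grades.
    grades_by_key = {}
    for g in grades:
        grades_by_key.setdefault((g["id"], g["term_code"]), []).append(
            g["final_course_grade"]
        )
    # Index programs by (college, degree, major).
    program_by_key = {
        (p["college"], p["degree"], p["major"]): p["program"] for p in programs
    }

    merged = []
    for e in enrollment:
        program = program_by_key.get((e["college"], e["degr"], e["majr"]))
        # A student with no grades still yields one row, with grade None.
        for grade in grades_by_key.get((e["id"], e["term_code"]), [None]):
            merged.append({**e, "final_course_grade": grade, "program": program})
    return merged


# Made-up rows, trimmed to the columns the joins need.
enrollment = [
    {"id": 1, "term_code": 202170, "college": "AH", "degr": "BA", "majr": "HIST"},
    {"id": 2, "term_code": 202170, "college": "NS", "degr": "BS", "majr": "BIOL"},
]
grades = [
    {"id": 1, "term_code": 202170, "final_course_grade": "A"},
    {"id": 1, "term_code": 202170, "final_course_grade": "B+"},
]
programs = [
    {"college": "AH", "degree": "BA", "major": "HIST", "program": "BA-ART/HUMAN"},
]
merged = merge_student_records(enrollment, grades, programs)
```

Each enrollment row is repeated once per matching course grade, which is what inflates the row count. Adding the grade and program columns to the 11 enrollment attributes gives the 13 attributes of the merged data set.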

Q1: What is the persistence rate from week three to the end of the term?

To calculate the persistence rate, I used this formula:

\[\text{persistence rate (\%)} = \frac{\text{students who stayed enrolled until the end of the term}}{\text{students enrolled at the beginning of the term} - \text{students who finished school}} \times 100\]

Since the time period is just one term, I treat the number of students who finished school as zero and drop the second term in the denominator. I created a binary variable to flag the students who had both census statuses on their record (1 for staying enrolled, 0 for not staying enrolled) and calculated the percentage of those who stayed enrolled. I decided to keep this variable in the final data product, since it provides a straightforward way of recognizing the two categories.

The persistence rate for the Fall 2021 term is 86.9% (a proportion of 0.869), according to the enrollment information provided to us.
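The flag and the rate described above could be computed along these lines. This is a sketch under assumptions: the census labels `"week3"` and `"end_of_term"` are hypothetical stand-ins for whatever values the `census` attribute actually holds, and the rows are made up.

```python
def persistence_rate(enrollment_rows, start="week3", end="end_of_term"):
    """Return (rate, flags) where flags maps student id -> 1 (stayed) or 0."""
    # Collect the set of census statuses seen for each student.
    statuses = {}
    for row in enrollment_rows:
        statuses.setdefault(row["id"], set()).add(row["census"])

    # Only students present at the first census belong in the denominator.
    starters = [sid for sid, seen in statuses.items() if start in seen]
    if not starters:
        raise ValueError("no students found at the starting census")

    # 1 if the student has both census statuses, 0 otherwise.
    flags = {sid: int(end in statuses[sid]) for sid in starters}
    rate = sum(flags.values()) / len(flags)
    return rate, flags


# Made-up rows: student 2 is missing at the end of the term.
rows = [
    {"id": 1, "census": "week3"}, {"id": 1, "census": "end_of_term"},
    {"id": 2, "census": "week3"},
    {"id": 3, "census": "week3"}, {"id": 3, "census": "end_of_term"},
    {"id": 4, "census": "week3"}, {"id": 4, "census": "end_of_term"},
]
rate, flags = persistence_rate(rows)
```

The returned `flags` dictionary corresponds to the binary variable kept in the final data product.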

Q2: Is there a statistically significant difference in persistence rate between males and females?

To investigate the possible difference in persistence rates between genders, I performed a Chi-square test to give additional context for the observed results. The Chi-square test is a helpful tool for investigating the relationship between two categorical variables. First, I constructed a categorical attribute to flag the students who stayed enrolled. Next, I created a contingency table and ran the Chi-square test of independence. My null hypothesis is as follows: \(H_0: \text{The persistence rate is independent of gender in this cohort.}\) The results are as follows:

Pearson’s Chi-squared test with Yates’ continuity correction: Enrollment_test

| Test statistic | df | P value |
| --- | --- | --- |
| 0.9418 | 1 | 0.3318 |

The chi-square value is 0.942, smaller than the critical value for 1 degree of freedom (3.841 at a significance level of 0.05, based on the chi-square table). Equivalently, the p-value of 0.332 is well above 0.05. As a result, the test found no evidence of a statistically significant difference in persistence rates between genders.
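For readers who want to see the arithmetic behind the test, the sketch below computes a Yates-corrected chi-square statistic for a 2x2 table using only the standard library. The function name and the example counts are hypothetical (the actual contingency table is not reproduced in this report). The p-value uses the identity that, for 1 degree of freedom, the upper tail of the chi-square distribution equals erfc(sqrt(x / 2)).

```python
import math


def chi_square_yates_2x2(table):
    """Chi-square test of independence for a 2x2 table with Yates' correction.

    `table` is ((a, b), (c, d)); returns (statistic, p_value) for df = 1.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            # Yates' correction: shrink |O - E| by 0.5, but never below zero.
            diff = max(abs(observed - expected) - 0.5, 0.0)
            statistic += diff ** 2 / expected

    # Upper-tail probability of chi-square with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Hypothetical counts: rows are genders, columns are (stayed, left).
example_stat, example_p = chi_square_yates_2x2(((10, 20), (20, 10)))

# Plugging in the reported statistic gives a p-value to compare with 0.3318.
reported_p = math.erfc(math.sqrt(0.9418 / 2))
```

The null hypothesis is rejected at the 0.05 level only when the statistic exceeds 3.841, which is the same condition as the p-value falling below 0.05.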

Q3: Describe the makeup of the class in terms of race/ethnicity and gender.

As a first step to answering this question, I constructed one unified measure of race/ethnicity from the three attributes of visa type, ethnicity, and race, following the guide provided in the question. In this process, I also created a binary variable for international status. All visa types are flagged as international except for ‘PR’, ‘RF’, and ‘AS’, and the undisclosed statuses (blank in the data).
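The international flag described above reduces to a small rule. The sketch below assumes that `visa_desc` holds short codes such as ‘PR’ and that blanks may arrive as empty strings or missing values; the code `"F1"` in the usage lines is only a hypothetical example of another visa type. The full race/ethnicity measure follows the guide in the question and is not reproduced here.

```python
# Visa codes that are NOT treated as international, as listed in the text.
NON_INTERNATIONAL_VISAS = {"PR", "RF", "AS"}


def is_international(visa_desc):
    """Return 1 for an international student, 0 otherwise."""
    code = (visa_desc or "").strip().upper()
    if code == "":
        # Undisclosed status (blank in the data) is not flagged.
        return 0
    return 0 if code in NON_INTERNATIONAL_VISAS else 1


# Usage on a few illustrative values.
flags = [is_international(v) for v in ["F1", "PR", "", None, " rf "]]
```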

Using the gender attribute of the enrollment data set, I present the gender composition of the cohort in figure 1. More than half of our class (57.1%) is legally female, while 42.9% is male.

Using the constructed race/ethnicity measure, the race/ethnicity makeup of the cohort is presented in figure 2. Considering the demographics of the state of Colorado, it is not surprising to see the percentage of White students in the lead. The high proportion of international students in third place, closely following the White and Hispanic or Latino proportions, is an intriguing emerging pattern! Setting aside the unknown group (3.44%), the smallest racial minorities in the class are the Native Hawaiian or Other Pacific Islander (1.92%), American Indian or Alaska Native (4.32%), and Multiple (two or more races) (7.4%) students, with the Asian (9.87%) and Black or African American (11.71%) groups in the middle.

It is also interesting to see how the two major demographics (gender and race/ethnicity) relate to each other. Figure 3 shows both the race/ethnicity and gender of DU’s students in Fall 2021.

Female students outnumber male students in every race/ethnicity group of this cohort. Native Hawaiian or Other Pacific Islander students have the closest split between males and females, with 48.13% male. The male share is 44.36% for Hispanic or Latino, 43.97% for Multiple (two or more races), [value missing] for White, 42.39% for Unknown, 42.04% for American Indian or Alaska Native, 41.89% for Asian, 41.45% for Black or African American, and 41.36% for International students.

Q4: How is age distributed in this class?

As presented in figure 4, the majority of the students are 19 (68.4%). 26.6% of the class is between 20 and 22. The remainder is under 19 (1.6%) or over 22 (3.4%). The oldest reported age is 67, and the youngest is 16.

Q5: Describe or show the distribution of grades across all students.

After converting the letter grades to numbers, I plotted the distribution of grade scores (figures 5 and 6). 54.89% of the class scored higher than the average. Both figures mark the median grade (3.33, or B+) with a vertical line. There are a couple of outliers on the lower side of the grade distribution, shown as empty circles in figure 5. The minimum score in this class is 0.85. Both the first and third quartiles sit toward the higher end of the scale. According to figure 6, scores between 3 and 3.5 are the most frequent (38% of the class, in fact). The histogram is clearly left-skewed. So, in summary, the students have done a great job on finals!
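The letter-to-number conversion and the summary figures quoted above could be produced as sketched below. The grade-point scale is an assumption: it is a common 4.0 scale chosen to agree with the B+ = 3.33 value mentioned in the text, and the institution's official scale may differ. The sample letters are made up.

```python
import statistics

# Assumed 4.0 scale; only B+ = 3.33 is confirmed by the text above.
GRADE_POINTS = {
    "A": 4.0, "A-": 3.67, "B+": 3.33, "B": 3.0, "B-": 2.67,
    "C+": 2.33, "C": 2.0, "C-": 1.67, "D+": 1.33, "D": 1.0,
    "D-": 0.67, "F": 0.0,
}


def to_points(letter):
    """Convert a letter grade to grade points; unknown codes map to None."""
    return GRADE_POINTS.get(letter.strip().upper())


def summarize(letters):
    """Return the mean, median, and share of scores above the mean."""
    points = [p for p in map(to_points, letters) if p is not None]
    mean = statistics.mean(points)
    return {
        "mean": mean,
        "median": statistics.median(points),
        "share_above_mean": sum(p > mean for p in points) / len(points),
    }


# Made-up grades; "W" is an example of a non-graded code that is skipped.
summary = summarize(["A", "B+", "B+", "C", "F", "W"])
```

In a left-skewed distribution like the one described, the median sits above the mean, which is why more than half of the scores land above the average.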

Table 4: Average GPA across programs

| Program | GPA |
| --- | --- |
| BM | 3.248480 |
| BFA | 3.242114 |
| BSCPE-ECS | 3.233535 |
| BSME-ECS | 3.222105 |
| BAUNDE | 3.216374 |
| BS-ENGR | 3.215658 |
| BA-INTS | 3.215259 |
| BS-NAT SCI | 3.213962 |
| BA-ECS | 3.201580 |
| BA-ART/HUMAN | 3.196689 |
| BS-ECS | 3.196011 |
| BSBA | 3.193920 |
| BSACC | 3.183791 |
| BA-NAT SCI | 3.180742 |
| BSEE-ECS | 3.179546 |
| BA-SOC SCI | 3.177805 |
| BS-SOC SCI | 3.132791 |

Q6: Present a cross tab of average GPA by program.

I already have the information about GPA and the programs in one merged data set. One question that came to my mind was whether GPA and final course scores are conceptually different, since GPA is normally weighted by course credits. Because I did not have access to course credit information, I treated the unweighted average of final course grades as GPA. As shown in table 4, BM (Bachelor of Music) has the highest average score across all 17 programs.
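A cross tab like table 4 is a grouped mean sorted in descending order. The sketch below assumes each merged record carries a `program` code and a numeric `grade_points` value (a hypothetical column name for the converted grade); the sample records are made up.

```python
from collections import defaultdict


def average_gpa_by_program(records):
    """Return [(program, mean grade points)] sorted from highest to lowest."""
    totals = defaultdict(lambda: [0.0, 0])  # program -> [sum, count]
    for r in records:
        if r["grade_points"] is None:
            continue  # skip records without a numeric grade
        entry = totals[r["program"]]
        entry[0] += r["grade_points"]
        entry[1] += 1
    averages = {program: s / n for program, (s, n) in totals.items()}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)


# Made-up records for illustration.
records = [
    {"program": "BFA", "grade_points": 3.0},
    {"program": "BM", "grade_points": 4.0},
    {"program": "BM", "grade_points": 3.0},
    {"program": "BFA", "grade_points": None},
]
crosstab = average_gpa_by_program(records)
```

Because no credit information is available, every course grade counts equally in these means.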

Q7: Present a visual representation of course grade distributions broken out by broad degree level (BA, BS, BM, BFA).

To investigate the distribution of grades across the broad degree levels, I tagged each record with the degree category it belongs to. Figure 7 shows that BA is the degree with the best grades, followed by BS, BM, and finally BFA. 25% of all grades in BA are A, followed by A- at 20%, for Fall 2021.
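One way to derive such a tag is from the prefix of the program code, as sketched below. This prefix rule is my assumption about how the codes in table 4 map to the four broad levels (for example, BSCPE-ECS to BS and BAUNDE to BA); it is not necessarily the rule used to build figure 7. The sample records are made up.

```python
from collections import Counter


def broad_degree_level(program):
    """Map a program code to BA, BS, BM, or BFA by its prefix (assumed rule)."""
    code = program.strip().upper()
    for level in ("BFA", "BM", "BA", "BS"):
        if code.startswith(level):
            return level
    return None


def grade_shares_by_level(records):
    """Return {level: {letter grade: share of that level's grades}}."""
    counts = {}
    for r in records:
        level = broad_degree_level(r["program"])
        if level is None:
            continue
        counts.setdefault(level, Counter())[r["final_course_grade"]] += 1
    return {
        level: {grade: n / sum(c.values()) for grade, n in c.items()}
        for level, c in counts.items()
    }


# Made-up records for illustration.
records = [
    {"program": "BA-INTS", "final_course_grade": "A"},
    {"program": "BA-SOC SCI", "final_course_grade": "A"},
    {"program": "BAUNDE", "final_course_grade": "A-"},
    {"program": "BA-ECS", "final_course_grade": "B"},
    {"program": "BSBA", "final_course_grade": "B+"},
]
shares = grade_shares_by_level(records)
```

Shares within each level, rather than raw counts, keep the comparison fair when the levels differ in size.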

Q8: What is the proportion of week three undeclared students with a declared major at the end of term?

I identified all the students with undeclared majors at week three, and then flagged the ones who had declared a major by the end of the term. These students make up 2.5% of the cohort and 26.26% of all students who were undeclared at week three.
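The two percentages above differ only in their denominator, which the following sketch makes explicit. The census labels (`"week3"`, `"end_of_term"`) and the undeclared major code `"UNDC"` are hypothetical placeholders; the real values in the `census` and `majr` attributes may differ. The rows are made up.

```python
def undeclared_to_declared(rows, undeclared_codes=frozenset({"UNDC"}),
                           start="week3", end="end_of_term"):
    """Return (share of undeclared students, share of the whole cohort)."""
    # Record each student's major at each census point.
    major_at = {}
    for row in rows:
        major_at.setdefault(row["id"], {})[row["census"]] = row["majr"]

    cohort = [sid for sid, m in major_at.items() if start in m]
    undeclared = [sid for sid in cohort if major_at[sid][start] in undeclared_codes]
    if not undeclared:
        raise ValueError("no undeclared students at the starting census")

    # Declared later: present at the end with a major that is not undeclared.
    declared_later = [
        sid for sid in undeclared
        if end in major_at[sid] and major_at[sid][end] not in undeclared_codes
    ]
    return len(declared_later) / len(undeclared), len(declared_later) / len(cohort)


# Made-up rows: student 1 declares, 2 stays undeclared, 4 leaves before the end.
rows = [
    {"id": 1, "census": "week3", "majr": "UNDC"},
    {"id": 1, "census": "end_of_term", "majr": "HIST"},
    {"id": 2, "census": "week3", "majr": "UNDC"},
    {"id": 2, "census": "end_of_term", "majr": "UNDC"},
    {"id": 3, "census": "week3", "majr": "BIOL"},
    {"id": 3, "census": "end_of_term", "majr": "BIOL"},
    {"id": 4, "census": "week3", "majr": "UNDC"},
]
share_of_undeclared, share_of_cohort = undeclared_to_declared(rows)
```

Students who were undeclared at week three but have no end-of-term record stay in the denominator and count as not having declared.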

As the last step of this analysis, I produced a student-level data set. Some of the new measures I constructed are included in the final product, and some of the original attributes were dropped.