This stakeholder is in the business of providing higher learning at scale. Although the goal and mission of student success are at the forefront, it is also imperative that the business component is running at peak performance to ensure the ultimate goal can be accomplished.
The goal of my analysis is to determine if the data shows us areas where we can potentially maximize our efforts related to student success and maintaining enrollment.
The first phase of this analysis will focus on cleaning the data, understanding what it contains, and assessing it both quantitatively and qualitatively.
We will use correlation and correspondence analysis to provide insight into inputs we can focus on to ensure we are meeting our objective.
To keep this report clear and focused on the business logic rather than the underlying programming, I will show output only and leave a link below to the code for those who want to review it.
Let’s begin by looking at our data to judge its quality and to assess whether it needs cleaning or reshaping.
We will plot out the data for a visual inspection.
With this plot, we get a quick feel for the type of data we have, and the overall completeness. The glaring observation here is that we have some missing values.
## rows columns discrete_columns continuous_columns all_missing_columns
## 1 78608 19 11 8 0
## total_missing_values complete_rows total_observations memory_usage
## 1 7814 70799 1493552 11893208
Let’s take a closer look by column.
Normally, I would reach out to stakeholders to gather additional information about the missing data; however, since that is not an option for this demonstration, I will remove the NAs. We could consider imputing a mean for “HasPersistenceIntoNextTerm”, but since it is a binary variable and we have ample data to work with, the best option here is to remove the NAs.
## Student_SK Region
## 0 0
## StateOrProvinceCode GenderCode
## 0 0
## Gender Ethnicity
## 0 0
## CurrentAge CurrentLearnerSegment
## 0 0
## CohortTerm CohortTermAxis
## 0 0
## Term TermAxis
## 0 0
## Course CourseSection
## 0 0
## HasMaintainedEnrollment Success
## 0 0
## Grade CourseAverageGrade
## 0 5
## HasPersistenceIntoNextTerm
## 7809
Ok, we are back on track.
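The removal step can be sketched as follows. This is a minimal Python illustration, not the R code behind this report: the four records are hypothetical, and `None` stands in for R's `NA`. It counts missing values per column and then keeps only complete rows.

```python
# Hypothetical miniature of the dataset; None marks a missing value (NA in R).
rows = [
    {"Student_SK": 1, "CourseAverageGrade": 0.91, "HasPersistenceIntoNextTerm": 1},
    {"Student_SK": 2, "CourseAverageGrade": 0.75, "HasPersistenceIntoNextTerm": None},
    {"Student_SK": 3, "CourseAverageGrade": None, "HasPersistenceIntoNextTerm": None},
    {"Student_SK": 4, "CourseAverageGrade": 0.62, "HasPersistenceIntoNextTerm": 0},
]


def missing_by_column(records):
    """Count missing values in each column."""
    counts = {key: 0 for key in records[0]}
    for record in records:
        for key, value in record.items():
            if value is None:
                counts[key] += 1
    return counts


def complete_cases(records):
    """Keep only the records that have no missing value in any column."""
    return [r for r in records if all(v is not None for v in r.values())]


missing = missing_by_column(rows)
clean = complete_cases(rows)
print(missing)
print(len(rows), "rows ->", len(clean), "complete rows")
```

In the report's own figures, 78,608 rows reduce to 70,799 complete rows, a drop of 7,809. That equals the number of missing “HasPersistenceIntoNextTerm” values, so the 5 rows missing “CourseAverageGrade” must be among the rows already missing the persistence flag.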
Let’s take another look at the data. For this analysis, the data looks adequate to work with; however, before moving on, we will use a trimmed mean for “CourseAverageGrade.” I noticed a large frequency spike in scores at 100%, and I would want to verify that this information is accurate. For this analysis, we will trim the extremes at both ends so they do not skew the mean, and move forward.
## vars n mean sd median
## Student_SK 1 70799 6402780.91 922282.56 6794028.00
## Region* 2 70799 7.17 3.81 6.00
## StateOrProvinceCode* 3 70799 34.88 16.61 36.00
## GenderCode* 4 70799 2.57 1.07 2.00
## Gender* 5 70799 2.57 1.07 2.00
## Ethnicity* 6 70799 7.35 2.28 8.00
## CurrentAge 7 70799 32.15 9.25 30.00
## CurrentLearnerSegment* 8 70799 1.35 0.90 1.00
## CohortTerm* 9 70799 1.00 0.00 1.00
## CohortTermAxis 10 70799 -7.00 0.00 -7.00
## Term* 11 70799 3.56 1.56 4.00
## TermAxis 12 70799 -4.94 1.71 -5.00
## Course* 13 70799 298.33 179.34 294.00
## CourseSection* 14 70799 11391.29 6566.45 11317.00
## HasMaintainedEnrollment 15 70799 1.00 0.00 1.00
## Success 16 70799 0.76 0.42 1.00
## Grade* 17 70799 4.49 4.51 2.00
## CourseAverageGrade 18 70799 0.79 0.27 0.91
## HasPersistenceIntoNextTerm 19 70799 0.81 0.39 1.00
## trimmed mad min max range skew
## Student_SK 6630153.70 187720.88 1921700 7023853 5102153 -2.48
## Region* 7.10 4.45 1 13 12 0.12
## StateOrProvinceCode* 34.99 19.27 1 64 63 -0.05
## GenderCode* 2.59 0.00 1 5 4 0.38
## Gender* 2.59 0.00 1 6 5 0.38
## Ethnicity* 7.73 1.48 1 9 8 -1.17
## CurrentAge 31.14 8.90 17 83 66 0.98
## CurrentLearnerSegment* 1.09 0.00 1 5 4 2.37
## CohortTerm* 1.00 0.00 1 1 0 NaN
## CohortTermAxis -7.00 0.00 -7 -7 0 NaN
## Term* 3.58 1.48 1 6 5 -0.06
## TermAxis -5.05 1.48 -7 -2 5 0.34
## Course* 295.76 213.49 1 621 620 0.18
## CourseSection* 11378.60 8391.52 1 22387 22386 0.03
## HasMaintainedEnrollment 1.00 0.00 1 1 0 NaN
## Success 0.83 0.00 0 1 1 -1.24
## Grade* 3.73 1.48 1 15 14 1.08
## CourseAverageGrade 0.84 0.11 0 1 1 -1.58
## HasPersistenceIntoNextTerm 0.89 0.00 0 1 1 -1.61
## kurtosis se
## Student_SK 6.07 3466.17
## Region* -1.32 0.01
## StateOrProvinceCode* -1.23 0.06
## GenderCode* -1.36 0.00
## Gender* -1.36 0.00
## Ethnicity* -0.18 0.01
## CurrentAge 0.74 0.03
## CurrentLearnerSegment* 4.04 0.00
## CohortTerm* NaN 0.00
## CohortTermAxis NaN 0.00
## Term* -0.97 0.01
## TermAxis -1.17 0.01
## Course* -1.04 0.67
## CourseSection* -1.19 24.68
## HasMaintainedEnrollment NaN 0.00
## Success -0.46 0.00
## Grade* -0.26 0.02
## CourseAverageGrade 1.22 0.00
## HasPersistenceIntoNextTerm 0.60 0.00
## [1] 0.8442784
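A trimmed mean drops a fixed fraction of the lowest and highest observations before averaging. The sketch below is a Python illustration rather than the report's R code. The grade values are hypothetical, and the 10% trim is an assumption, chosen because it is the default in R's `psych::describe` and the value above agrees with the `trimmed` column (0.84) in the summary table.

```python
def trimmed_mean(values, trim=0.1):
    """Mean after dropping a fraction `trim` of observations from each end."""
    if not 0 <= trim < 0.5:
        raise ValueError("trim must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * trim)  # observations cut from each tail
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


# Hypothetical grades with zeros at the low end and a spike at 1.0
grades = [0.0, 0.0, 0.55, 0.70, 0.80, 0.85, 0.90, 0.95, 1.0, 1.0]

plain = sum(grades) / len(grades)
trimmed = trimmed_mean(grades, trim=0.1)
print(plain, trimmed)
```

As in the report, where the mean is 0.79 and the trimmed mean is 0.84, the trimmed value sits above the plain mean when the distribution has a long tail of low scores.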
To test my theory that these variables are related, I will create a simple contingency table and run a chi-square test.
The chi-square test rejects independence between these variables: with X-squared = 13764 on 9 degrees of freedom and a p-value below 2.2e-16, the observed counts differ from the counts we would expect if the variables were unrelated by far more than chance would explain.
## Success 0 1
## CurrentLearnerSegment HasPersistenceIntoNextTerm
## Business to Consumer (Retail) 0 7100 4145
## 1 7565 42208
## Business to Consumer (Retail) Course Work Only 0 63 235
## 1 5 110
## Corporate Partner 0 252 362
## 1 320 3271
## Guild 0 707 260
## 1 643 3260
## Latin America 0 42 15
## 1 31 205
##
## Pearson's Chi-squared test
##
## data: multi_table2
## X-squared = 13764, df = 9, p-value < 2.2e-16
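To make the statistic concrete, here is a Python sketch of Pearson's chi-square computed by hand. It is not the report's R code. The counts are copied from the table above, and treating the table as 10 rows (segment by persistence) by 2 columns (Success) is an assumption inferred from the reported 9 degrees of freedom.

```python
# Counts from the contingency table above, flattened to 10 rows
# (segment x persistence) by 2 columns (Success = 0, Success = 1).
# The flattening is an assumption inferred from df = 9.
observed = [
    [7100, 4145], [7565, 42208],  # Business to Consumer (Retail)
    [63, 235], [5, 110],          # ... Course Work Only
    [252, 362], [320, 3271],      # Corporate Partner
    [707, 260], [643, 3260],      # Guild
    [42, 15], [31, 205],          # Latin America
]


def pearson_chi_square(table):
    """Return (statistic, degrees of freedom) for a two-way table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            # Count expected in this cell if rows and columns were independent
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (obs - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


statistic, dof = pearson_chi_square(observed)
print(round(statistic, 1), dof)
```

Under that assumption the result should land close to the reported X-squared of 13764 with 9 degrees of freedom. A statistic this far above its degrees of freedom is what produces the very small p-value.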
Continuing with our analysis, let’s drill down on correlations.
Ok, my focus here is on the “HasPersistenceIntoNextTerm” variable. Ideally, a high volume of students continuing through the program term after term will be a litmus test for the overall success of the system.
In general, I see what I would expect here, with positive grades, success, etc. correlating with a positive persistence score.
My attention, however, is drawn to the negative correlation that the “Other” segment of “CurrentLearnerSegment” has with our response variable.
While it is not a large negative correlation, it does stand out from the other segments and will be worth exploring.
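For two binary variables, the Pearson correlation reduces to the phi coefficient, which can be read straight from a 2x2 table. The Python sketch below is illustrative and is not the report's R code. It uses counts from the contingency table shown earlier, and it assumes the segment of interest is “Course Work Only”, which the later sections point to. The report's own correlation figures may have been computed differently.

```python
from math import sqrt


def phi_coefficient(a, b, c, d):
    """Pearson correlation of two binary variables from 2x2 counts [[a, b], [c, d]]."""
    return (a * d - b * c) / sqrt((a + b) * (c + d) * (a + c) * (b + d))


# Counts derived from the contingency table earlier in this report.
# Assumption: "Course Work Only" is the segment of interest.
segment_persist = 5 + 110  # in segment, persisted into next term
segment_drop = 63 + 235    # in segment, did not persist
total_persist = 7565 + 42208 + 5 + 110 + 320 + 3271 + 643 + 3260 + 31 + 205
total_drop = 7100 + 4145 + 63 + 235 + 252 + 362 + 707 + 260 + 42 + 15

phi = phi_coefficient(
    segment_persist,                  # in segment, persisted
    segment_drop,                     # in segment, did not persist
    total_persist - segment_persist,  # other segments, persisted
    total_drop - segment_drop,        # other segments, did not persist
)
segment_rate = segment_persist / (segment_persist + segment_drop)
overall_rate = total_persist / (total_persist + total_drop)
print(round(phi, 3), round(segment_rate, 3), round(overall_rate, 3))
```

The coefficient comes out small and negative because the segment is tiny relative to the full dataset, even though its persistence rate is far below the overall rate. That is why a modest correlation can still flag a segment worth a closer look.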
Next, we will run a simple correspondence analysis to drill further into any patterns or relationships among our variables.
Interesting. “Business to Consumer(Retail) Course Work Only” and “Latin America” show a separation from the rest of our data. “Persistence_No” and “Unknown” trend towards these variables as well.
## **Results of the Correspondence Analysis (CA)**
## The row variable has 5 categories; the column variable has 11 categories
## The chi square of independence between the two variables is equal to 1684.285 (p-value = 0 ).
## *The results are available in the following objects:
##
## name description
## 1 "$eig" "eigenvalues"
## 2 "$col" "results for the columns"
## 3 "$col$coord" "coord. for the columns"
## 4 "$col$cos2" "cos2 for the columns"
## 5 "$col$contrib" "contributions of the columns"
## 6 "$row" "results for the rows"
## 7 "$row$coord" "coord. for the rows"
## 8 "$row$cos2" "cos2 for the rows"
## 9 "$row$contrib" "contributions of the rows"
## 10 "$call" "summary called parameters"
## 11 "$call$marge.col" "weights of the columns"
## 12 "$call$marge.row" "weights of the rows"
We can also see that “Business to Consumer(Retail) Course Work Only” contributes by far the most to Dimension 1, accounting for roughly 80% of that dimension's inertia.
## Dim 1 Dim 2 Dim 3
## Business to Consumer(Retail) 2.310292 7.072583 4.490019
## Business to Consumer(Retail) Course Work Only 80.240882 12.105978 1.922970
## Corporate Partner 2.025607 12.694836 35.502659
## Guild 9.356110 33.458795 10.373110
## Latin America 6.067109 34.667808 47.711242
## Dim 4
## Business to Consumer(Retail) 0.1358307
## Business to Consumer(Retail) Course Work Only 4.9630833
## Corporate Partner 43.9516430
## Guild 39.8136816
## Latin America 11.1357614
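One common rule of thumb for reading this table treats a row as important to a dimension when its contribution exceeds what it would be if all rows contributed equally, here 100 / 5 = 20%. The Python sketch below applies that threshold to the contributions above. It is an illustration only: the helper name is hypothetical, and this is not the report's R code.

```python
# Row contributions (%) to Dim 1..4, copied from the table above.
row_contrib = {
    "Business to Consumer(Retail)": [2.310292, 7.072583, 4.490019, 0.1358307],
    "Business to Consumer(Retail) Course Work Only": [80.240882, 12.105978, 1.922970, 4.9630833],
    "Corporate Partner": [2.025607, 12.694836, 35.502659, 43.9516430],
    "Guild": [9.356110, 33.458795, 10.373110, 39.8136816],
    "Latin America": [6.067109, 34.667808, 47.711242, 11.1357614],
}


def above_average_rows(contrib, dim_index):
    """Rows whose contribution to a dimension exceeds the uniform share."""
    threshold = 100 / len(contrib)  # 20% when there are 5 rows
    return sorted(
        name for name, values in contrib.items() if values[dim_index] > threshold
    )


for dim in range(4):
    print("Dim", dim + 1, above_average_rows(row_contrib, dim))
```

By this rule only “Course Work Only” drives Dimension 1, while “Guild” and “Latin America” drive Dimension 2.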
Finally, we will draw an asymmetric biplot to examine how strongly these categories relate to one another.
And there we have it: the theory we began to develop about the “Other” consumer retail segment also shows up in our biplot, which indicates that “Business to Consumer(Retail) Course Work Only” and “Persistence_No” are closely associated with each other.
Our objective was to determine if the data would show us areas where we can potentially maximize our efforts related to student success and maintaining enrollment.
We discovered that there was a negative correlation with a segment of the “CurrentLearnerSegment.”
Using correspondence analysis, we were able to show a justification for continuing analysis into a subset of the “CurrentLearnerSegment” - “Business to Consumer(Retail) Course Work Only”, and “Latin America.”
Exploring this business channel segment would be my first objective, to determine whether its outcomes can be improved.