BI EDA

OVERVIEW

This stakeholder is in the business of providing higher learning at scale. Although the goal and mission of student success are at the forefront, it is also imperative that the business component is running at peak performance to ensure the ultimate goal can be accomplished.

Objective

The goal of my analysis is to determine if the data shows us areas where we can potentially maximize our efforts related to student success and maintaining enrollment.

Method

The first phase of this analysis will focus on cleaning the data, understanding it, and quantifying and qualifying it.

We will use correlation and correspondence analysis to provide insight into inputs we can focus on to ensure we are meeting our objective.

To keep this report simple and focused on the business logic rather than the underlying programming, I will show output only and leave a link below to the code for those who want to review it.

https://github.com/JesseNormand/snhued/blob/main/BI_EDA.Rmd

EDA

Let’s begin by looking at our data to determine its quality and to assess whether it needs cleaning or reshaping.

We will plot out the data for a visual inspection.

With this plot, we get a quick feel for the type of data we have and its overall completeness. The glaring observation here is that we have some missing values: 7,814 in total, leaving 70,799 complete rows out of 78,608.

##    rows columns discrete_columns continuous_columns all_missing_columns
## 1 78608      19               11                  8                   0
##   total_missing_values complete_rows total_observations memory_usage
## 1                 7814         70799            1493552     11893208

Let’s take a closer look by column.

Normally, I would reach out to stakeholders to gather additional information about the missing data; however, since that is not an option for this demonstration, I will remove the NAs. We could look at imputing a mean for “HasPersistenceIntoNextTerm”, but since that variable is binary and we have ample data to work with, the best option here is to remove the NAs.

##                 Student_SK                     Region 
##                          0                          0 
##        StateOrProvinceCode                 GenderCode 
##                          0                          0 
##                     Gender                  Ethnicity 
##                          0                          0 
##                 CurrentAge      CurrentLearnerSegment 
##                          0                          0 
##                 CohortTerm             CohortTermAxis 
##                          0                          0 
##                       Term                   TermAxis 
##                          0                          0 
##                     Course              CourseSection 
##                          0                          0 
##    HasMaintainedEnrollment                    Success 
##                          0                          0 
##                      Grade         CourseAverageGrade 
##                          0                          5 
## HasPersistenceIntoNextTerm 
##                       7809
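For readers who want to see the removal step spelled out, here is a minimal sketch of counting NAs per column and dropping incomplete rows (listwise deletion). It is written in Python rather than the R used for this report, the four records are made-up stand-ins for the real data, and the helper names `missing_by_column` and `drop_incomplete` are hypothetical. Only the row counts at the end come from the output above.

```python
# Toy records standing in for the real data set (assumption, not real rows).
# None marks a missing value (NA).
records = [
    {"Success": 1, "CourseAverageGrade": 0.91, "HasPersistenceIntoNextTerm": 1},
    {"Success": 0, "CourseAverageGrade": 0.42, "HasPersistenceIntoNextTerm": None},
    {"Success": 1, "CourseAverageGrade": None, "HasPersistenceIntoNextTerm": None},
    {"Success": 1, "CourseAverageGrade": 0.88, "HasPersistenceIntoNextTerm": 0},
]


def missing_by_column(rows):
    """Count missing values per column."""
    counts = {key: 0 for key in rows[0]}
    for row in rows:
        for key, value in row.items():
            if value is None:
                counts[key] += 1
    return counts


def drop_incomplete(rows):
    """Listwise deletion: keep only rows with no missing value."""
    return [row for row in rows if all(v is not None for v in row.values())]


na_counts = missing_by_column(records)
complete = drop_incomplete(records)

# Row counts reported in the summary output of this report.
total_rows, complete_rows = 78608, 70799
dropped_share = (total_rows - complete_rows) / total_rows

print(na_counts)
print(len(complete), "complete toy rows")
print(f"share of real rows dropped: {dropped_share:.1%}")
```

On the reported figures, removing the NAs costs just under 10% of the rows, which supports the call that there is ample data left to work with.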

Ok, we are back on track.

Let’s take another look at the data. For this analysis, the data looks adequate to work with; however, before moving on, we will use a trimmed mean for “CourseAverageGrade.” I noticed a large frequency spike in scores at 100%, and I would want to verify that this information is accurate. For this analysis, we will trim the highest and lowest values to balance out the mean and move forward.

##                            vars     n       mean        sd     median
## Student_SK                    1 70799 6402780.91 922282.56 6794028.00
## Region*                       2 70799       7.17      3.81       6.00
## StateOrProvinceCode*          3 70799      34.88     16.61      36.00
## GenderCode*                   4 70799       2.57      1.07       2.00
## Gender*                       5 70799       2.57      1.07       2.00
## Ethnicity*                    6 70799       7.35      2.28       8.00
## CurrentAge                    7 70799      32.15      9.25      30.00
## CurrentLearnerSegment*        8 70799       1.35      0.90       1.00
## CohortTerm*                   9 70799       1.00      0.00       1.00
## CohortTermAxis               10 70799      -7.00      0.00      -7.00
## Term*                        11 70799       3.56      1.56       4.00
## TermAxis                     12 70799      -4.94      1.71      -5.00
## Course*                      13 70799     298.33    179.34     294.00
## CourseSection*               14 70799   11391.29   6566.45   11317.00
## HasMaintainedEnrollment      15 70799       1.00      0.00       1.00
## Success                      16 70799       0.76      0.42       1.00
## Grade*                       17 70799       4.49      4.51       2.00
## CourseAverageGrade           18 70799       0.79      0.27       0.91
## HasPersistenceIntoNextTerm   19 70799       0.81      0.39       1.00
##                               trimmed       mad     min     max   range  skew
## Student_SK                 6630153.70 187720.88 1921700 7023853 5102153 -2.48
## Region*                          7.10      4.45       1      13      12  0.12
## StateOrProvinceCode*            34.99     19.27       1      64      63 -0.05
## GenderCode*                      2.59      0.00       1       5       4  0.38
## Gender*                          2.59      0.00       1       6       5  0.38
## Ethnicity*                       7.73      1.48       1       9       8 -1.17
## CurrentAge                      31.14      8.90      17      83      66  0.98
## CurrentLearnerSegment*           1.09      0.00       1       5       4  2.37
## CohortTerm*                      1.00      0.00       1       1       0   NaN
## CohortTermAxis                  -7.00      0.00      -7      -7       0   NaN
## Term*                            3.58      1.48       1       6       5 -0.06
## TermAxis                        -5.05      1.48      -7      -2       5  0.34
## Course*                        295.76    213.49       1     621     620  0.18
## CourseSection*               11378.60   8391.52       1   22387   22386  0.03
## HasMaintainedEnrollment          1.00      0.00       1       1       0   NaN
## Success                          0.83      0.00       0       1       1 -1.24
## Grade*                           3.73      1.48       1      15      14  1.08
## CourseAverageGrade               0.84      0.11       0       1       1 -1.58
## HasPersistenceIntoNextTerm       0.89      0.00       0       1       1 -1.61
##                            kurtosis      se
## Student_SK                     6.07 3466.17
## Region*                       -1.32    0.01
## StateOrProvinceCode*          -1.23    0.06
## GenderCode*                   -1.36    0.00
## Gender*                       -1.36    0.00
## Ethnicity*                    -0.18    0.01
## CurrentAge                     0.74    0.03
## CurrentLearnerSegment*         4.04    0.00
## CohortTerm*                     NaN    0.00
## CohortTermAxis                  NaN    0.00
## Term*                         -0.97    0.01
## TermAxis                      -1.17    0.01
## Course*                       -1.04    0.67
## CourseSection*                -1.19   24.68
## HasMaintainedEnrollment         NaN    0.00
## Success                       -0.46    0.00
## Grade*                        -0.26    0.02
## CourseAverageGrade             1.22    0.00
## HasPersistenceIntoNextTerm     0.60    0.00

## [1] 0.8442784
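A trimmed mean can be sketched as follows. This is an illustration in Python, not the code behind the figure above: the trim fraction of 0.1 per tail is an assumption (the report does not state the value used), the grades are made up, and `trimmed_mean` is a hypothetical helper. It follows the same symmetric scheme as R’s `mean(x, trim = ...)`, dropping a fraction of the sorted observations from each end before averaging.

```python
import math


def trimmed_mean(values, trim=0.1):
    """Mean after dropping a fraction `trim` of observations from each tail."""
    if not 0 <= trim < 0.5:
        raise ValueError("trim must be in [0, 0.5)")
    ordered = sorted(values)
    k = math.floor(len(ordered) * trim)  # observations cut from each end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


# Hypothetical grades (as proportions) with zeros at one end and a spike at 1.0.
grades = [0.0, 0.0, 0.55, 0.70, 0.80, 0.85, 0.90, 0.95, 1.0, 1.0]

plain = sum(grades) / len(grades)
trimmed = trimmed_mean(grades, trim=0.1)

print(f"plain mean:   {plain:.3f}")
print(f"trimmed mean: {trimmed:.3f}")
```

Because the zeros sit further from the bulk of the scores than the 100% values do, trimming both tails pulls the mean upward, the same direction as the move from 0.79 to 0.84 in the summary table.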

Contingency Table

To test my theory of association among the variables, I will create a simple contingency table and run a chi-square test.

Great, the chi-square test rejects independence between these variables (p-value < 2.2e-16). The observed counts deviate from the counts expected under independence far more than chance alone would explain, so the variables are associated.

##                                                                           Success     0     1
## CurrentLearnerSegment                          HasPersistenceIntoNextTerm                    
## Business to Consumer (Retail)                  0                                   7100  4145
##                                                1                                   7565 42208
## Business to Consumer (Retail) Course Work Only 0                                     63   235
##                                                1                                      5   110
## Corporate Partner                              0                                    252   362
##                                                1                                    320  3271
## Guild                                          0                                    707   260
##                                                1                                    643  3260
## Latin America                                  0                                     42    15
##                                                1                                     31   205

## 
##  Pearson's Chi-squared test
## 
## data:  multi_table2
## X-squared = 13764, df = 9, p-value < 2.2e-16
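The statistic can be checked by hand from the printed counts. The sketch below is in Python rather than R and assumes the test was run on the flattened table, with one row per segment and persistence pair and one column per “Success” value; that layout is inferred from the 9 degrees of freedom, not confirmed by the source. The function name `chi_square` is hypothetical.

```python
# Counts copied from the contingency table in this report.
# Each row is one (CurrentLearnerSegment, HasPersistenceIntoNextTerm) pair;
# the two columns are Success = 0 and Success = 1.
observed = [
    [7100, 4145], [7565, 42208],  # Business to Consumer (Retail): persistence 0, 1
    [63, 235], [5, 110],          # Business to Consumer (Retail) Course Work Only
    [252, 362], [320, 3271],      # Corporate Partner
    [707, 260], [643, 3260],      # Guild
    [42, 15], [31, 205],          # Latin America
]


def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a 2-D table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            # Count expected in this cell if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / grand
            stat += (obs - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


stat, df = chi_square(observed)

# For 9 degrees of freedom the 0.1% critical value is about 27.88,
# so a statistic in the thousands is far beyond it.
print(f"X-squared = {stat:.1f}, df = {df}")
```

A statistic this far above the critical value is what a p-value below 2.2e-16 reflects: the hypothesis of independence is rejected.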

Correlation Plot

Continuing with our analysis, let’s drill down on correlations.

Ok, my focus here is on the “HasPersistenceIntoNextTerm” variable. Ideally, a high volume of students continuing through the program term after term will be a litmus test for the overall success of the system.

In general, I see what I would expect here, with positive grades, success, etc. correlating with a positive persistence score.

My attention, however, is drawn to the negative correlation that the “Other” segment of “CurrentLearnerSegment” has with our response variable.

While it is not a large negative correlation, it does stand out from the other segments and will be worth exploring.
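One way to put a number on a segment’s correlation with a binary response is the phi coefficient, which is the Pearson correlation between two 0/1 variables. The Python sketch below rests on two assumptions: that each segment is dummy-coded as a 0/1 indicator, and that the counts from the contingency table in this report (summed over “Success”) are the right inputs. It is not the computation behind the correlation plot, and `phi_coefficient` and `segment_phi` are hypothetical helpers.

```python
import math


def phi_coefficient(a, b, c, d):
    """Phi for the 2x2 table [[a, b], [c, d]].

    Rows: x = 1, x = 0. Columns: y = 1, y = 0.
    """
    denom = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return (a * d - b * c) / denom


# (persistence = 0, persistence = 1) counts per segment, summed over Success.
segments = {
    "Business to Consumer (Retail)": (7100 + 4145, 7565 + 42208),
    "Business to Consumer (Retail) Course Work Only": (63 + 235, 5 + 110),
    "Corporate Partner": (252 + 362, 320 + 3271),
    "Guild": (707 + 260, 643 + 3260),
    "Latin America": (42 + 15, 31 + 205),
}
total_no = sum(no for no, _ in segments.values())
total_yes = sum(yes for _, yes in segments.values())


def segment_phi(name):
    """Phi between 'student is in this segment' and 'student persisted'."""
    no, yes = segments[name]
    return phi_coefficient(yes, no, total_yes - yes, total_no - no)


for name in segments:
    print(f"{name}: {segment_phi(name):+.3f}")
```

Under these assumptions, “Business to Consumer (Retail) Course Work Only” has the most negative coefficient of the five segments. The value is small in absolute terms because the segment holds only 413 of the 70,799 rows.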

Correspondence Analysis

Next, we will run a simple correspondence analysis to further drill down into any patterns or relationships in our variables.

Interesting. “Business to Consumer(Retail) Course Work Only” and “Latin America” show a separation from the rest of our data. “Persistence_No” and “Unknown” trend towards these segments as well.

## **Results of the Correspondence Analysis (CA)**
## The row variable has  5  categories; the column variable has 11 categories
## The chi square of independence between the two variables is equal to 1684.285 (p-value =  0 ).
## *The results are available in the following objects:
## 
##    name              description                   
## 1  "$eig"            "eigenvalues"                 
## 2  "$col"            "results for the columns"     
## 3  "$col$coord"      "coord. for the columns"      
## 4  "$col$cos2"       "cos2 for the columns"        
## 5  "$col$contrib"    "contributions of the columns"
## 6  "$row"            "results for the rows"        
## 7  "$row$coord"      "coord. for the rows"         
## 8  "$row$cos2"       "cos2 for the rows"           
## 9  "$row$contrib"    "contributions of the rows"   
## 10 "$call"           "summary called parameters"   
## 11 "$call$marge.col" "weights of the columns"      
## 12 "$call$marge.row" "weights of the rows"

We can also see that “Business to Consumer(Retail) Course Work Only” contributes the most to Dimension 1, at roughly 80% of that dimension’s inertia.

##                                                   Dim 1     Dim 2     Dim 3
## Business to Consumer(Retail)                   2.310292  7.072583  4.490019
## Business to Consumer(Retail) Course Work Only 80.240882 12.105978  1.922970
## Corporate Partner                              2.025607 12.694836 35.502659
## Guild                                          9.356110 33.458795 10.373110
## Latin America                                  6.067109 34.667808 47.711242
##                                                    Dim 4
## Business to Consumer(Retail)                   0.1358307
## Business to Consumer(Retail) Course Work Only  4.9630833
## Corporate Partner                             43.9516430
## Guild                                         39.8136816
## Latin America                                 11.1357614
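The contribution table can be read programmatically. The Python sketch below copies the printed values, checks that each dimension’s contributions add up to 100%, and flags rows above the uniform share of 100 / 5 = 20%. Treating that share as the cutoff is a common rule of thumb and an assumption here, not something the report applies; `summarize` is a hypothetical helper.

```python
# Row contributions (in percent) copied from the table in this report.
contrib = {
    "Business to Consumer(Retail)": [2.310292, 7.072583, 4.490019, 0.1358307],
    "Business to Consumer(Retail) Course Work Only": [80.240882, 12.105978, 1.922970, 4.9630833],
    "Corporate Partner": [2.025607, 12.694836, 35.502659, 43.9516430],
    "Guild": [9.356110, 33.458795, 10.373110, 39.8136816],
    "Latin America": [6.067109, 34.667808, 47.711242, 11.1357614],
}

# Share each row would have if all rows contributed equally (assumed cutoff).
baseline = 100 / len(contrib)


def summarize(dim):
    """Return (total, top row, rows above baseline) for a 0-based dimension."""
    column = {name: values[dim] for name, values in contrib.items()}
    total = sum(column.values())
    top = max(column, key=column.get)
    above = [name for name, value in column.items() if value > baseline]
    return total, top, above


for dim in range(4):
    total, top, above = summarize(dim)
    print(f"Dim {dim + 1}: total = {total:.2f}%, top = {top}, above {baseline:.0f}% = {above}")
```

By this reading, Dimension 1 is driven almost entirely by the Course Work Only segment, while “Latin America” and “Guild” carry Dimension 2.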

Finally, we will draw an asymmetric biplot to examine how closely these variables relate to one another.

And there we have it: the theory we began to develop about the “Other” retail segment also shows up in our biplot, which indicates that “Business to Consumer(Retail) Course Work Only” and “Persistence_No” are closely associated with each other.

Summary

Our objective was to determine if the data would show us areas where we can potentially maximize our efforts related to student success and maintaining enrollment.

We discovered a negative correlation between persistence and one segment of “CurrentLearnerSegment.”

Using correspondence analysis, we were able to justify continued analysis of a subset of “CurrentLearnerSegment”: “Business to Consumer(Retail) Course Work Only” and “Latin America.”

Exploring this business channel segment would be my first objective, to determine whether it can be improved.