Overview

In this demonstration, I will outline the reasoning behind my approach to quantifying and verifying data that can be sourced from a university’s learning management system (LMS).

Specifically, we are looking at student retention and how we can derive business-related decisions from our data.

My objective is to conduct a correspondence analysis to characterize relationships among categorical variables.

Begin Analysis

In this analysis, we will use a contrived data set that I manufactured. The data set is a contingency table that summarizes some descriptive attributes of students. The goal of this correspondence analysis (CA) is to discover whether any of these attributes has a statistically significant association with student retention.

Put another way, the null hypothesis is that there is no meaningful relationship between our independent variables (the student attributes) and our dependent variable (retention).

Let’s find out.

We will begin by looking at the table to provide context for this report. For the purpose of this demonstration, I will not go into the details of the data wrangling and other exploratory data analysis (EDA) techniques used to shape and center the data, or the R functions and computations behind the analysis. I will leave a link to my GitHub repository for anyone who wants to view the code.

##                                    x retention_stay active retention_not_stay undecided
## 1                           gender_f            398    378                365       265
## 2                           gender_m            328    350                459       278
## 3                       age_under_21            340    310                350       250
## 4                        age_over_21            157    295                299       240
## 5                     family_closeby            392    390                325       220
## 6 admission_before_term_over_90_days            480    340                245       200
## 7        traveltime_miles_greater_20            210    370                317       200
## 8                      sports_active            399    331                309       280
## 9            sat_scores_greater_1100            285    325                244       380
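For readers who want to follow along, the table can be rebuilt by hand in base R. This is a minimal sketch: the counts are transcribed from the printed table, and the object name `students` is my own placeholder, not necessarily what the original code uses.

```r
# Contingency table transcribed from the printed output.
# `students` is an illustrative name, not taken from the original code.
students <- matrix(
  c(398, 378, 365, 265,
    328, 350, 459, 278,
    340, 310, 350, 250,
    157, 295, 299, 240,
    392, 390, 325, 220,
    480, 340, 245, 200,
    210, 370, 317, 200,
    399, 331, 309, 280,
    285, 325, 244, 380),
  ncol = 4, byrow = TRUE,
  dimnames = list(
    c("gender_f", "gender_m", "age_under_21", "age_over_21",
      "family_closeby", "admission_before_term_over_90_days",
      "traveltime_miles_greater_20", "sports_active",
      "sat_scores_greater_1100"),
    c("retention_stay", "active", "retention_not_stay", "undecided")
  )
)

# Row and column totals give a quick feel for how the counts are spread.
rowSums(students)
colSums(students)
```

A plain matrix with row and column names is enough for the chi-squared test and for CA, since both work on the cell counts.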

Correspondence Analysis

CA gives us a powerful tool for gaining insight into how the categories of categorical variables group together. Perhaps even more significant, CA can extract the information held in multiple variables and compress it into coordinates on a two-dimensional plot for analysis.

Our goal here is to build a framework for analyzing associations, if any, between the retention outcomes (retention_stay and retention_not_stay in the table above) and the independent variables.

Our first step is to conduct a chi-squared test of independence to check for statistical significance.

Our test results indicate that a statistically significant association is present (p-value < 2.2e-16), so we reject the null hypothesis and can continue with our analysis.

## 
##  Pearson's Chi-squared test
## 
## data:  dt2
## X-squared = 321.95, df = 16, p-value < 2.2e-16
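The test can be sketched in base R with `chisq.test()`. One assumption to flag: the reported df = 16 implies a 9 x 3 table, so one of the four count columns must have been left out. Here I assume `dt2` (the name shown in the test output) holds retention_stay, retention_not_stay and undecided; that choice is my reconstruction, although by my own arithmetic it gives a statistic close to the reported 321.95.

```r
# Assumed contents of dt2: the printed table without the `active` column.
dt2 <- matrix(
  c(398, 365, 265,  328, 459, 278,  340, 350, 250,
    157, 299, 240,  392, 325, 220,  480, 245, 200,
    210, 317, 200,  399, 309, 280,  285, 244, 380),
  ncol = 3, byrow = TRUE,
  dimnames = list(
    c("gender_f", "gender_m", "age_under_21", "age_over_21",
      "family_closeby", "admission_before_term_over_90_days",
      "traveltime_miles_greater_20", "sports_active",
      "sat_scores_greater_1100"),
    c("retention_stay", "retention_not_stay", "undecided")
  )
)

# Pearson's chi-squared test of independence between rows and columns.
test <- chisq.test(dt2)
test$statistic  # X-squared
test$parameter  # degrees of freedom: (9 - 1) * (3 - 1)
test$p.value

# Expected counts under independence, useful for seeing where the gaps are.
round(test$expected, 1)
```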

Our next step is to compute the correspondence analysis. Thankfully, R provides wonderful packages for CA: FactoMineR supplies the CA() function that computes the analysis, and factoextra supplies functions to extract and visualize the results. However, for context, the math behind CA is quite complex, involving the projection of multidimensional chi-squared distances onto a two-dimensional map. The process is fascinating and definitely worth reading up on if you are interested.
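For the curious, the core of that math fits in a few lines of base R. This is a minimal sketch of the standard CA recipe (standardized residuals followed by a singular value decomposition), not the packages' internal code. It assumes `dt2` is the three-column table implied by df = 16, and all other names are my own.

```r
# Assumed contents of dt2: the printed table without the `active` column.
dt2 <- matrix(
  c(398, 365, 265,  328, 459, 278,  340, 350, 250,
    157, 299, 240,  392, 325, 220,  480, 245, 200,
    210, 317, 200,  399, 309, 280,  285, 244, 380),
  ncol = 3, byrow = TRUE,
  dimnames = list(
    c("gender_f", "gender_m", "age_under_21", "age_over_21",
      "family_closeby", "admission_before_term_over_90_days",
      "traveltime_miles_greater_20", "sports_active",
      "sat_scores_greater_1100"),
    c("retention_stay", "retention_not_stay", "undecided")
  )
)

n     <- sum(dt2)
P     <- dt2 / n              # correspondence matrix (cell proportions)
rmass <- rowSums(P)           # row masses
cmass <- colSums(P)           # column masses
E     <- rmass %o% cmass      # expected proportions under independence
S     <- (P - E) / sqrt(E)    # standardized residuals

dec <- svd(S)                 # singular value decomposition
eig <- dec$d^2                # principal inertias (eigenvalues)
sum(eig) * n                  # total inertia times n equals the chi-squared statistic

# Principal coordinates on the first two dimensions.
k <- 1:2
row_coord <- (dec$u[, k] / sqrt(rmass)) %*% diag(dec$d[k])
col_coord <- (dec$v[, k] / sqrt(cmass)) %*% diag(dec$d[k])
dimnames(row_coord) <- list(rownames(dt2), c("Dim1", "Dim2"))
dimnames(col_coord) <- list(colnames(dt2), c("Dim1", "Dim2"))
round(row_coord, 3)
round(col_coord, 3)
```

One detail worth noting: a table with three columns has at most two CA dimensions, so under this assumption the two-dimensional map carries all of the inertia.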

My demonstration here will extract some of the computational results of the CA measurements so that we can conduct our analysis.

After running the CA function, we have the objects shown below, which will be useful as we continue our analysis.

## **Results of the Correspondence Analysis (CA)**
## The row variable has  9  categories; the column variable has 3 categories
## The chi square of independence between the two variables is equal to 321.9533 (p-value =  7.125495e-59 ).
## *The results are available in the following objects:
## 
##    name              description                   
## 1  "$eig"            "eigenvalues"                 
## 2  "$col"            "results for the columns"     
## 3  "$col$coord"      "coord. for the columns"      
## 4  "$col$cos2"       "cos2 for the columns"        
## 5  "$col$contrib"    "contributions of the columns"
## 6  "$row"            "results for the rows"        
## 7  "$row$coord"      "coord. for the rows"         
## 8  "$row$cos2"       "cos2 for the rows"           
## 9  "$row$contrib"    "contributions of the rows"   
## 10 "$call"           "summary called parameters"   
## 11 "$call$marge.col" "weights of the columns"      
## 12 "$call$marge.row" "weights of the rows"
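A call along the following lines would produce a result object like the one listed above. Treat it as a sketch under assumptions: `dt2` is my reconstruction of the three-column table, `res` is a placeholder name, and the code only runs the package calls if FactoMineR and factoextra are installed.

```r
# Assumed contents of dt2: the printed table without the `active` column.
dt2 <- matrix(
  c(398, 365, 265,  328, 459, 278,  340, 350, 250,
    157, 299, 240,  392, 325, 220,  480, 245, 200,
    210, 317, 200,  399, 309, 280,  285, 244, 380),
  ncol = 3, byrow = TRUE,
  dimnames = list(
    c("gender_f", "gender_m", "age_under_21", "age_over_21",
      "family_closeby", "admission_before_term_over_90_days",
      "traveltime_miles_greater_20", "sports_active",
      "sat_scores_greater_1100"),
    c("retention_stay", "retention_not_stay", "undecided")
  )
)

if (requireNamespace("FactoMineR", quietly = TRUE)) {
  # graph = FALSE suppresses the default plot so we can inspect the numbers first.
  res <- FactoMineR::CA(as.data.frame(dt2), graph = FALSE)

  print(res$eig)          # eigenvalues and percentage of variance per dimension
  print(res$row$contrib)  # contributions of the rows
  print(res$col$coord)    # coordinates of the columns

  if (requireNamespace("factoextra", quietly = TRUE)) {
    # Row contributions to the two-dimensional map, then the symmetric biplot.
    print(factoextra::fviz_contrib(res, choice = "row", axes = 1:2))
    print(factoextra::fviz_ca_biplot(res, repel = TRUE))
  }
}
```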

Inertia

We will not go through every one of these results, but let’s extract and plot the row contributions so that we can understand what information is being retained and which rows contribute most to our two-dimensional map.

This graph shows us the essence of what CA is doing for us. We retain the dimensions that account for the greatest share of the variance (called inertia in CA), and that information is compressed into two dimensions, as we will see in the biplots below.
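The row contributions can also be computed by hand, which makes it clearer what the graph is summarizing. This sketch uses base R only and assumes `dt2` is the three-column table implied by df = 16. The names `contrib`, `pct` and `map_contrib` are my own.

```r
# Assumed contents of dt2: the printed table without the `active` column.
dt2 <- matrix(
  c(398, 365, 265,  328, 459, 278,  340, 350, 250,
    157, 299, 240,  392, 325, 220,  480, 245, 200,
    210, 317, 200,  399, 309, 280,  285, 244, 380),
  ncol = 3, byrow = TRUE,
  dimnames = list(
    c("gender_f", "gender_m", "age_under_21", "age_over_21",
      "family_closeby", "admission_before_term_over_90_days",
      "traveltime_miles_greater_20", "sports_active",
      "sat_scores_greater_1100"),
    c("retention_stay", "retention_not_stay", "undecided")
  )
)

P   <- dt2 / sum(dt2)
E   <- rowSums(P) %o% colSums(P)
dec <- svd((P - E) / sqrt(E))   # SVD of the standardized residuals

# Share of total inertia carried by each dimension, in percent.
pct <- 100 * dec$d^2 / sum(dec$d^2)
round(pct, 2)

# Contribution of each row to each dimension, in percent.
# With this decomposition it is the squared left singular vector.
contrib <- 100 * dec$u[, 1:2]^2
dimnames(contrib) <- list(rownames(dt2), c("Dim1", "Dim2"))

# Contribution of each row to the two-dimensional map as a whole:
# per-dimension contributions weighted by each dimension's inertia.
map_contrib <- (contrib %*% dec$d[1:2]^2) / sum(dec$d[1:2]^2)
round(sort(map_contrib[, 1], decreasing = TRUE), 1)
```

Rows near the top of the sorted list are the ones that shape the map the most.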

Biplots

Here is where the real magic begins. Through a process of essentially plotting the chi-square distances, rotating the coordinates, and dropping any dimensions beyond the second while retaining as much of the information as possible, we now have a visual of the relationships in our data.

We can now see our data in a clustered format that reveals those relationships. How cool is that!

The symmetric plot shows us the association from row to row (blue) and from column to column (red). We can infer a row-to-column association between variables; however, keep in mind that this plot does not accurately depict the strength of that row-to-column relationship.

To analyze the strength of the association, we use the asymmetric plot. In the asymmetric plot, each point is drawn as a line from the origin, and we use the angle between a row line and a column line to judge their relationship. Lines that point in a similar direction, with an angle of less than 90 degrees between them, indicate an association: the smaller the angle, the stronger the association, and the wider the angle, the weaker it is.

Lines that are separated by more than 90 degrees, pointing away from each other, indicate no positive association.
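The angles can be computed directly rather than read off the plot. The sketch below assumes a row-principal asymmetric map (rows in principal coordinates, columns in standard coordinates) and the three-column `dt2` implied by df = 16. The helper `unit` and the other names are mine.

```r
# Assumed contents of dt2: the printed table without the `active` column.
dt2 <- matrix(
  c(398, 365, 265,  328, 459, 278,  340, 350, 250,
    157, 299, 240,  392, 325, 220,  480, 245, 200,
    210, 317, 200,  399, 309, 280,  285, 244, 380),
  ncol = 3, byrow = TRUE,
  dimnames = list(
    c("gender_f", "gender_m", "age_under_21", "age_over_21",
      "family_closeby", "admission_before_term_over_90_days",
      "traveltime_miles_greater_20", "sports_active",
      "sat_scores_greater_1100"),
    c("retention_stay", "retention_not_stay", "undecided")
  )
)

P     <- dt2 / sum(dt2)
rmass <- rowSums(P)
cmass <- colSums(P)
E     <- rmass %o% cmass
dec   <- svd((P - E) / sqrt(E))

# Rows in principal coordinates, columns in standard coordinates.
rows_p <- (dec$u[, 1:2] / sqrt(rmass)) %*% diag(dec$d[1:2])
cols_s <- dec$v[, 1:2] / sqrt(cmass)

# Scale every point to unit length so the dot product is the cosine of the angle.
unit <- function(m) m / sqrt(rowSums(m^2))
cosines <- unit(rows_p) %*% t(unit(cols_s))

# Clamp to [-1, 1] to guard against rounding, then convert to degrees.
angles <- acos(pmin(pmax(cosines, -1), 1)) * 180 / pi
dimnames(angles) <- list(rownames(dt2), colnames(dt2))
round(angles)
```

Because this table has only two dimensions, an angle under 90 degrees here corresponds exactly to a cell whose observed count is above its expected count, and an angle over 90 degrees to a cell below it.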

Lastly, the farther a data point is from the center of the graph, the more its profile deviates from the average and, in general, the stronger its impact on the analysis output.

Here we are able to visualize the row variables that have the greatest information impact in our data.

Analysis

The most important part of an analysis is to gain meaningful insights that will help produce better results related to our business question.

In this demonstration, the answer to that question is yes! We wanted to discover a profile of the type of student who is likely to stay or leave, and our analysis identified one.

Students who are most likely to stay have the following characteristics:

  • They completed their admissions 90 days or more before the term started.

  • They have family living close to campus.

  • They are involved in a sports program.

Students who are most likely to leave:

  • Males over the age of 21 who travel more than 20 miles to get to campus.
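As a sanity check on these profiles, we can look at the row profiles directly, meaning the share of each outcome within each row. This sketch again assumes `dt2` is the three-column table implied by df = 16, and `profiles` is my own name.

```r
# Assumed contents of dt2: the printed table without the `active` column.
dt2 <- matrix(
  c(398, 365, 265,  328, 459, 278,  340, 350, 250,
    157, 299, 240,  392, 325, 220,  480, 245, 200,
    210, 317, 200,  399, 309, 280,  285, 244, 380),
  ncol = 3, byrow = TRUE,
  dimnames = list(
    c("gender_f", "gender_m", "age_under_21", "age_over_21",
      "family_closeby", "admission_before_term_over_90_days",
      "traveltime_miles_greater_20", "sports_active",
      "sat_scores_greater_1100"),
    c("retention_stay", "retention_not_stay", "undecided")
  )
)

# Row profiles: each row rescaled to sum to 1.
profiles <- prop.table(dt2, margin = 1)

# Rows ordered by their share of students who stay, in percent.
round(100 * profiles[order(-profiles[, "retention_stay"]), ], 1)

# The three rows with the highest share staying and not staying.
names(sort(profiles[, "retention_stay"], decreasing = TRUE))[1:3]
names(sort(profiles[, "retention_not_stay"], decreasing = TRUE))[1:3]
```

Under that assumption, the three rows with the highest share staying are early admission, family close by, and sports, and the three with the highest share not staying are travel over 20 miles, male, and over 21, which lines up with the profiles read from the biplots.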