In this demonstration, I outline my approach to quantifying and verifying data sourced from a university’s learning management system (LMS).
Specifically, we are looking at student retention and how we can derive business-related decisions from our data.
My objective is to conduct a correspondence analysis (CA) to examine relationships among categorical variables.
In this analysis, we will use a contrived data set that I manufactured. The data set is a contingency table of descriptive student characteristics (rows) against retention outcomes (columns). The goal of the analysis is to discover whether any of these characteristics has a statistically significant association with student retention.
Put another way, the null hypothesis is that there is no meaningful relationship between our independent variables (the student characteristics) and our dependent variable (the retention outcome).
Let’s find out.
We will begin by looking at the table to provide context for this report. For the purpose of this demonstration, I will not go into the details of the data wrangling and other EDA techniques used to shape and center the data, or the R functions and computations behind the analysis. I will leave a link to my GitHub repository for anyone who wants to view the code.
##                                    x retention_stay active retention_not_stay undecided
## 1                           gender_f            398    378                365       265
## 2                           gender_m            328    350                459       278
## 3                       age_under_21            340    310                350       250
## 4                        age_over_21            157    295                299       240
## 5                     family_closeby            392    390                325       220
## 6 admission_before_term_over_90_days            480    340                245       200
## 7        traveltime_miles_greater_20            210    370                317       200
## 8                      sports_active            399    331                309       280
## 9            sat_scores_greater_1100            285    325                244       380
CA gives us a powerful tool for deriving insight into how the categories of categorical variables group together. Perhaps even more useful, CA compresses the information in a table with many categories into coordinates that can be plotted on a two-dimensional map for analysis.
Our goal here is to formulate a framework for analyzing associations, if any, between the retention outcomes (retention_stay and retention_not_stay) and the independent variables.
Our first step is to conduct a chi-squared test of independence to check for statistical significance.
Our test results indicate that a statistically significant association is present (p-value < 2.2e-16); therefore, we can continue with our analysis.
##
## Pearson's Chi-squared test
##
## data: dt2
## X-squared = 321.95, df = 16, p-value < 2.2e-16
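As a cross-check of the test above, here is a minimal Python sketch (not the R code used for this post) that computes Pearson's statistic by hand. It assumes the test was run on three outcome columns with `active` left out; that is my assumption, chosen because it matches the reported df = 16, and with it the statistic should land close to the reported 321.95. The name `chi_squared` is illustrative, and the p-value is not computed here.

```python
# Counts copied from the table above, restricted to three outcome columns.
# Assumption: `active` is excluded, which is consistent with df = 16.
COLUMNS = ["retention_stay", "retention_not_stay", "undecided"]
TABLE = {
    "gender_f": [398, 365, 265],
    "gender_m": [328, 459, 278],
    "age_under_21": [340, 350, 250],
    "age_over_21": [157, 299, 240],
    "family_closeby": [392, 325, 220],
    "admission_before_term_over_90_days": [480, 245, 200],
    "traveltime_miles_greater_20": [210, 317, 200],
    "sports_active": [399, 309, 280],
    "sat_scores_greater_1100": [285, 244, 380],
}


def chi_squared(table):
    """Return (statistic, degrees of freedom) for Pearson's test of independence."""
    rows = list(table.values())
    row_totals = [sum(r) for r in rows]
    col_totals = [sum(c) for c in zip(*rows)]
    n = sum(row_totals)
    stat = 0.0
    for counts, row_total in zip(rows, row_totals):
        for observed, col_total in zip(counts, col_totals):
            # Expected count if rows and columns were independent.
            expected = row_total * col_total / n
            stat += (observed - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df


stat, df = chi_squared(TABLE)
print(f"X-squared = {stat:.2f}, df = {df}")
```

Each cell contributes its squared deviation from the expected count, scaled by that expected count, so a few cells far from expectation can account for most of the statistic.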
Our next step is to compute the correspondence analysis. Thankfully, R has wonderful packages for CA: the results printed below are in the format produced by the CA() function from FactoMineR, and “factoextra” provides functions to extract and visualize them. For context, the math behind CA is quite involved: it projects the chi-squared distances between categories from many dimensions onto a 2D map. The process is fascinating and definitely worth reading up on if you are interested.
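For readers who want an outline of that math, the standard textbook formulation of CA can be written as follows. The symbols are introduced here for illustration and do not come from the code behind this post: N is the contingency table, n its grand total, r and c the row and column masses, and D_r and D_c the diagonal matrices built from them.

```latex
\begin{aligned}
P &= \tfrac{1}{n} N, \qquad r = P\mathbf{1}, \qquad c = P^{\top}\mathbf{1} \\
S &= D_r^{-1/2}\,\bigl(P - r c^{\top}\bigr)\,D_c^{-1/2} = U \Sigma V^{\top} \\
F &= D_r^{-1/2}\, U \Sigma, \qquad G = D_c^{-1/2}\, V \Sigma \\
\sum_k \sigma_k^2 &= \frac{\chi^2}{n}
\end{aligned}
```

S holds the standardized residuals, its singular value decomposition gives the axes of the map, and F and G are the principal coordinates of the rows and columns. The squared singular values are the eigenvalues, and they add up to the total inertia, which is the chi-squared statistic divided by n. A table with I rows and J columns has at most min(I - 1, J - 1) non-zero dimensions, so with 9 rows and 3 columns a two-dimensional map can hold all of the inertia.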
My demonstration here will extract some of the computational results of the CA measurements so that we can conduct our analysis.
After running the CA function, we have objects shown below that will be useful in continuing our analysis.
## **Results of the Correspondence Analysis (CA)**
## The row variable has 9 categories; the column variable has 3 categories
## The chi square of independence between the two variables is equal to 321.9533 (p-value = 7.125495e-59 ).
## *The results are available in the following objects:
##
## name description
## 1 "$eig" "eigenvalues"
## 2 "$col" "results for the columns"
## 3 "$col$coord" "coord. for the columns"
## 4 "$col$cos2" "cos2 for the columns"
## 5 "$col$contrib" "contributions of the columns"
## 6 "$row" "results for the rows"
## 7 "$row$coord" "coord. for the rows"
## 8 "$row$cos2" "cos2 for the rows"
## 9 "$row$contrib" "contributions of the rows"
## 10 "$call" "summary called parameters"
## 11 "$call$marge.col" "weights of the columns"
## 12 "$call$marge.row" "weights of the rows"
We will not go through every one of these results, but let’s extract and plot the row contributions so that we can understand which rows retain the most information and contribute most to our two-dimensional map.
This graph shows us the essence of what CA is doing for us. We retain the information that accounts for the greatest share of the variance (called inertia in CA), which is compressed into two dimensions, as we will see in the biplots below.
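To make the idea of a row's contribution concrete, here is a rough Python sketch (again, not the R code behind this post) that computes each row's share of the total inertia. It assumes the analysis used the three outcome columns shown in the code, with `active` left out, which is my assumption based on the "3 categories" reported in the CA output. With three columns the map has at most two dimensions, so these shares should correspond to the combined contributions to dimensions 1 and 2; they may differ from a plot of a single dimension. The name `row_inertia_shares` is illustrative.

```python
# Counts copied from the table above, restricted to three outcome columns.
# Assumption: `active` is excluded, matching the 3 column categories reported.
COLUMNS = ["retention_stay", "retention_not_stay", "undecided"]
TABLE = {
    "gender_f": [398, 365, 265],
    "gender_m": [328, 459, 278],
    "age_under_21": [340, 350, 250],
    "age_over_21": [157, 299, 240],
    "family_closeby": [392, 325, 220],
    "admission_before_term_over_90_days": [480, 245, 200],
    "traveltime_miles_greater_20": [210, 317, 200],
    "sports_active": [399, 309, 280],
    "sat_scores_greater_1100": [285, 244, 380],
}


def row_inertia_shares(table):
    """Return each row's percentage share of the total inertia."""
    rows = list(table.values())
    col_totals = [sum(c) for c in zip(*rows)]
    n = sum(col_totals)
    col_mass = [t / n for t in col_totals]  # average row profile
    inertia = {}
    for name, counts in table.items():
        row_total = sum(counts)
        mass = row_total / n
        # Squared chi-squared distance from this row's profile to the average profile.
        dist2 = sum((x / row_total - c) ** 2 / c for x, c in zip(counts, col_mass))
        inertia[name] = mass * dist2
    total = sum(inertia.values())
    return {name: 100 * value / total for name, value in inertia.items()}


shares = row_inertia_shares(TABLE)
for name, pct in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<36} {pct:5.1f}%")
```

A row's inertia is its mass times its squared distance from the average profile, so a row ranks high either because it holds many students or because its mix of outcomes is unusual.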
Here is where the real magic begins. Through a process of essentially plotting the chi-squared distances, rotating the coordinates, and dropping any dimensions beyond the second while retaining as much of the information as possible, we now have a visual of the relationships in our data.
We can now see our data in a clustered format that reveals those relationships. How cool is that!
The symmetric plot shows us the relationships from row to row (blue) and from column to column (red). We can infer a general row-to-column relationship; however, keep in mind that this plot does not accurately depict the strength of that row-to-column relationship.
To analyze the strength of the association, we use the asymmetric plot. Here each category is drawn as a line (arrow) from the origin, and we read the angle between a row line and a column line. Lines that travel in the same direction, with an angle well under 90 degrees, indicate a strong association. The closer the angle gets to 90 degrees, the weaker the association.
Lines separated by more than 90 degrees travel away from each other, which indicates a negative association: that row category occurs with that column category less often than expected.
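The angle between two lines on the plot can be computed from their coordinates. The Python sketch below shows the calculation; the coordinates are hypothetical values made up for illustration, not results from this analysis, and `angle_degrees` is an illustrative name.

```python
import math


def angle_degrees(u, v):
    """Angle in degrees between two 2-D vectors drawn from the origin."""
    dot = u[0] * v[0] + u[1] * v[1]
    norm = math.hypot(*u) * math.hypot(*v)
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos))


# Hypothetical coordinates, for illustration only.
row_point = (0.30, 0.10)
col_close = (0.90, 0.20)       # points in roughly the same direction
col_opposite = (-0.80, -0.10)  # points in roughly the opposite direction

print(angle_degrees(row_point, col_close))     # acute angle
print(angle_degrees(row_point, col_opposite))  # obtuse angle
```

The sign of the dot product is what separates the two cases: positive for an angle under 90 degrees, negative for an angle over 90 degrees.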
Lastly, the farther a data point is from the center of the graph, the more its profile differs from the average and the stronger its impact on the analysis output.
Here we are able to visualize the row variables that have the greatest information impact in our data.
The most important part of an analysis is to gain meaningful insights that will help produce better results related to our business question: can we profile which students are likely to stay or leave?
The answer to that question in this demonstration is yes! We wanted to discover a profile for which type of student is likely to stay or leave, and our analysis identified one.
Students who are most likely to stay have the following characteristics:
They completed their admissions more than 90 days before the term started.
They have family that lives close to campus.
They are involved in a sports program.
Students who are most likely to leave: