Dimension Reduction for nominal and qualitative data

In this article we take a closer look at the problem of analyzing survey data with dimension reduction methods. Depending on the type of the input data we can distinguish several methods.

PCA - suitable for continuous, quantitative variables
CA - suitable for discrete, qualitative variables, but only for two variables arranged in a contingency table
MCA - an extension of CA, suitable for datasets with more than two discrete variables

We will analyze a dataset describing primary school students downloaded from Kaggle (link). For the sake of simplicity, surveys tend to ask questions whose responses are recorded as discrete variables, which is why it is important to know how to deal with this kind of data.

Application of MCA

The packages ‘factoextra’ and ‘FactoMineR’ will be necessary for our analysis. First we analyze the data in its basic format, with each individual as a row and each variable as a column.

library(factoextra)
library(FactoMineR)
library(dplyr)

First we need to import the data and select only the qualitative variables. We therefore do not analyze the exam results themselves, but only the differences between the students, for example the relation between gender and test preparation.
The data summary tells us whether some levels of the variables are too rare, which could obstruct the analysis; such levels should be treated like outliers and removed or transformed. In this case everything is fine.

dt<- read.csv('StudentsPerformance.csv')

dt.active<- dt[,1:5]

head(dt.active, 4)
##   gender race.ethnicity parental.level.of.education        lunch
## 1 female        group B           bachelor's degree     standard
## 2 female        group C                some college     standard
## 3 female        group B             master's degree     standard
## 4   male        group A          associate's degree free/reduced
##   test.preparation.course
## 1                    none
## 2               completed
## 3                    none
## 4                    none
summary(dt.active)
##     gender    race.ethnicity     parental.level.of.education
##  female:518   group A: 89    associate's degree:222         
##  male  :482   group B:190    bachelor's degree :118         
##               group C:319    high school       :196         
##               group D:262    master's degree   : 59         
##               group E:140    some college      :226         
##                              some high school  :179         
##           lunch     test.preparation.course
##  free/reduced:355   completed:358          
##  standard    :645   none     :642          
##                                            
##                                            
##                                            
## 
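As a complement to summary(), the level frequencies can also be checked programmatically. A minimal sketch (the 5% threshold is an arbitrary choice, not something the dataset dictates):

#proportion of each level within each qualitative variable
level_props <- lapply(dt.active, function(x) round(prop.table(table(x)), 3))
level_props

#flag levels that fall below an arbitrary 5% threshold (none do here)
lapply(level_props, function(p) names(p)[p < 0.05])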

After preparing the data we use the function MCA() to create an object of class MCA, which will be used for the further analysis.
The first graph shows the percentage of variance explained by each dimension.

dt.mca<-MCA(dt.active, graph = FALSE)

#optimal number of dimensions
fviz_screeplot(dt.mca, addlabels = TRUE)
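If we want the exact figures behind the scree plot, factoextra also provides get_eigenvalue(), which returns the eigenvalue and the percentage of variance per dimension as a table:

#eigenvalues and (cumulative) percentage of variance per dimension
get_eigenvalue(dt.mca)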

The chart below shows which variables contribute the most to the first two dimensions of the plot.

#Contribution to dim 1-2
fviz_contrib(dt.mca, choice = "var", axes = 1:2, top = 15)
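The numbers behind these bars can also be extracted directly from the MCA object, for instance:

#contribution (in %) of each category to the dimensions
mca.var <- get_mca_var(dt.mca)
head(round(mca.var$contrib, 2))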

The plot below shows the relationships between the points:
- Similar variable categories are grouped together.
- Negatively correlated categories are plotted on opposite sides of the origin.
- The distance from the origin reflects the quality of a category's representation on the factor map.

Since two dimensions are not enough to perfectly retain the variation between the variables, some of them are represented better than others. The quality of representation is measured by the squared cosine (cos2), which tells us how well a category is represented on the two-dimensional plot.

The higher the cos2, the better. Categories with a low cos2 should be interpreted with caution.
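The cos2 values can be plotted or inspected on their own before reading the map; for example:

#cos2 of the categories on the first two dimensions
fviz_cos2(dt.mca, choice = "var", axes = 1:2)

#raw cos2 values for the first few categories
head(get_mca_var(dt.mca)$cos2)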

#Variable categories - MCA
fviz_mca_var(dt.mca, col.var = "contrib",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"), 
             repel = TRUE, # avoid text overlapping (slow)
             ggtheme = theme_minimal()
)

So far we have analyzed only the variables. The function get_mca_ind gives us an object from which we can extract information such as coordinates, cos2 and contributions for the individuals. The graph below has exactly the same interpretation as the one above, but this time it represents the individuals. The number of individuals in the dataset is 1000, but many of them share the same characteristics, which is why there are fewer points on the graph.

#get MCA of individuals
ind <- get_mca_ind(dt.mca)

#head(ind$coord)
#head(ind$cos2)
#head(ind$contrib)

fviz_mca_ind(dt.mca, col.ind = "cos2", 
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             ggtheme = theme_minimal(),
             label = 'none')
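A common follow-up, not shown in the original walkthrough, is to colour the individuals by one of the qualitative variables and add concentration ellipses, which makes it easier to see whether a given variable separates the individuals:

#individuals coloured by gender, with a concentration ellipse per group
fviz_mca_ind(dt.mca,
             habillage = "gender",   # grouping variable from dt.active
             addEllipses = TRUE,
             label = "none",
             ggtheme = theme_minimal())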

Application of CA

The data from our survey can also be analyzed in a different way. The CA method is suitable for a contingency table, which is why we first need to transform the data.

library(factoextra)
library(FactoMineR)
library(dplyr)
library(tidyr)
data<-dt

#approximate tertiles (0%, 33%, 66%, 99% quantiles) of the average score,
#used below to choose the grade-level cut-offs
quantile((data$math.score + data$reading.score + data$writing.score)/3, probs = seq(0, 1, 0.33))
##       0%      33%      66%      99% 
##  9.00000 61.89000 74.33333 97.66667
#average of the three exam scores per student
data <- mutate(data, mean_grade = (math.score + reading.score + writing.score)/3)

#bin the average score into three grade levels; the 60 and 75 cut-offs roughly
#correspond to the 33% and 66% quantiles computed above
f_grade <- function(x){
  if(x < 60) return(1)
  if(x >= 60 & x < 75) return(2)
  if(x >= 75) return(3)
}

data$grade_level <- sapply(data$mean_grade, f_grade)


#for each qualitative variable: counts of each level within each grade level,
#spread to wide format (one column per level)
data %>% select(grade_level, gender) %>% group_by(grade_level) %>% count(gender) %>% spread(gender, n) -> a
data %>% select(grade_level, lunch) %>% group_by(grade_level) %>% count(lunch) %>% spread(lunch, n) -> b
data %>% select(grade_level, parental.level.of.education) %>% group_by(grade_level) %>% count(parental.level.of.education) %>% spread(parental.level.of.education, n) -> c
data %>% select(grade_level, test.preparation.course) %>% group_by(grade_level) %>% count(test.preparation.course) %>% spread(test.preparation.course, n) -> d


#stack the per-variable count tables (keeping only two parental-education levels),
#transpose so the variable levels become rows, and use the grade levels as column names
data_pv <- bind_cols(a[,1:3], b[,2:3], c[,2:3], d[,2:3])
data_pv <- t(data_pv)
colnames(data_pv) <- data_pv[1,]
data_pv <- data_pv[-1,]
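For comparison, an arguably more direct way to build the same stacked table is base R's table(); a minimal sketch (data_pv_alt is just an illustrative name, and parental education is cut to its first two levels to match the selection above):

#one count table per variable: rows = variable levels, columns = grade levels
data_pv_alt <- rbind(
  table(data$gender, data$grade_level),
  table(data$lunch, data$grade_level),
  table(data$parental.level.of.education, data$grade_level)[1:2, ],
  table(data$test.preparation.course, data$grade_level)
)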

After the data transformation we should obtain the following matrix (contingency table).

data_pv
##                      1   2   3
## female             124 204 190
## male               161 187 134
## free/reduced       155 125  75
## standard           130 266 249
## associate's degree  59  78  85
## bachelor's degree   21  50  47
## completed           60 137 161
## none               225 254 163

Once we have prepared the data we can proceed with the CA method: the CA() function returns an object of class CA for further interpretation.
The first graph shows the variance explained by each dimension.

res.ca <- CA(data_pv, graph = FALSE)

#Optimal number of dimensions
fviz_screeplot(res.ca, addlabels = TRUE)
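Since the contingency table has only three columns, CA can extract at most two dimensions here, so together they carry all of the inertia; the exact values can be checked the same way as for MCA:

#eigenvalues and percentage of variance for the CA solution
get_eigenvalue(res.ca)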

The graph below shows the relationships between the row points (the variable levels). It has the same interpretation as the corresponding plot in the MCA part.

fviz_ca_row(res.ca, col.row = "contrib",
            gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"), 
            repel = TRUE)

The function fviz_ca_biplot returns an asymmetric biplot that lets us interpret the contribution of rows or columns to the axes.
The closer an arrow is to an axis, the more it contributes to the corresponding dimension.
It shows that the female level of the gender variable contributes the most to the second dimension, while the free/reduced level (students with free or reduced-price lunch) contributes roughly equally to both dimensions.

fviz_ca_biplot(res.ca, map ="colgreen", arrow = c(TRUE, FALSE),
               repel = TRUE)
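To complete the picture, the column points (the three grade levels) could also be plotted on their own, in the same way as the row points above; a possible extra step, not part of the original analysis:

#column points (grade levels 1-3) coloured by their quality of representation
fviz_ca_col(res.ca, col.col = "cos2",
            gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
            repel = TRUE)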