How someone’s subjective ideas about his or her health connected with the frequency of the excersises during a week? The answer will be found with the help of correspondent analysis.

Libraries:

library(haven)
library(stringr)
library(FactoMineR)
library(factoextra)
library(ade4)
library(ca)
library(gplots)
library(gridExtra)
library(corrplot)
library(pander)
library(ggsci)

Data Preparation:

data <- read_spss("ESS6RU.sav")
data2 <- data[,c("physact", "health")]
#Variable 1: Physically active for 20 minutes or longer last 7 days/ROWS
data2$physact<-as.factor(data2$physact)
levels(data2$physact)<-c("No days", "One", "Two", "Three", "Four", "Five", "Six","Seven")
pander(table(data2$physact))
No days One Two Three Four Five Six Seven
351 133 193 230 187 268 128 827
#Variable 2: Subjective general health/COLUMNS
data2$health<-as.factor(data2$health)
levels(data2$health)<-c("Very good", "Good", "Fair", "Bad", "Very bad")
#Categories "Bad" and "Very bad" are combined into one "Bad", because the number of #observations at the intersection of "Very bad" (health) and different levels of #physical activity is too small
data2$health <- str_replace(data2$health, "Very bad", "Bad")
data2$health<-as.factor(data2$health)

pander(table(data2$health))
Bad Fair Good Very good
365 1225 748 128
ctable<-table(data2)
#now the distribution of row/column values are quite reliable for the analysis
pander(ctable)
  Bad Fair Good Very good
No days 93 156 81 19
One 17 65 41 10
Two 29 87 68 9
Three 34 99 81 15
Four 22 92 57 15
Five 27 159 65 17
Six 9 72 41 6
Seven 102 421 267 32

Are the variables related?

chisqresults <- chisq.test(ctable)
pander(chisqresults) #definitely related (significantly, according to p_value) related
Pearson’s Chi-squared test: ctable Let’s look at the relations with the help of baloonplot:
Test statistic df P value
81.31 21 4.864e-09 * * *
dt <- as.table(as.matrix(ctable))
balloonplot(t(dt), main = "Sport and Subjective Health", 
            xlab = "Health evaluation", ylab = "Physical Activity",
            dotcolor="firebrick3", label = F)

The diagram shows that fair health is the most popular answer, regardless of the amount of sport per week. Bad health and regular exercises (7 times) are frequently associated, maybe it is those who needs obligatory physical activity because of some diseases. The connection between absence of sport and bad health is quite obvious. There is also the strong relation between the subjective idea that someone’s health is good and everyday workout.

Let’s look at the relations with the help of mosaicplot:

mosaicplot(dt, shade = TRUE, las=2, main = "Sport and Subjective Health", 
           ylab = "Health evaluation", xlab = "Physical Activity") 

According to the graph, perception of the health as bad is much more, than it would be expected. The situation with having a sport 5 times a week and fair health is the same, probably it is about people who is ill or some perfectionists. Association of a good health and no sport at all is extremely lower, than expected, as well as for the six trainings a week and bad feelengs about the health.

Performing the corresponding analysis:

ca <- CA(ctable, graph = F)
summary(ca)
## 
## Call:
## CA(X = ctable, graph = F) 
## 
## The chi square of independence between the two variables is equal to 81.31412 (p-value =  4.863833e-09 ).
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3
## Variance               0.025   0.007   0.004
## % of var.             70.026  19.571  10.403
## Cumulative % of var.  70.026  89.597 100.000
## 
## Rows
##             Iner*1000    Dim.1    ctr   cos2    Dim.2    ctr   cos2  
## No days   |    19.018 |  0.351 75.614  0.981 | -0.046  4.617  0.017 |
## One       |     0.642 | -0.035  0.280  0.108 | -0.008  0.058  0.006 |
## Two       |     1.126 |  0.010  0.037  0.008 |  0.115 16.176  0.991 |
## Three     |     1.983 |  0.016  0.101  0.013 |  0.114 18.798  0.654 |
## Four      |     1.514 | -0.059  1.119  0.182 | -0.022  0.571  0.026 |
## Five      |     5.264 | -0.114  6.138  0.288 | -0.179 54.246  0.711 |
## Six       |     2.649 | -0.214 10.321  0.961 | -0.031  0.796  0.021 |
## Seven     |     3.036 | -0.067  6.390  0.519 |  0.030  4.737  0.108 |
##            Dim.3    ctr   cos2  
## No days   -0.017  1.236  0.002 |
## One        0.099 15.521  0.886 |
## Two       -0.004  0.039  0.001 |
## Three      0.082 18.050  0.334 |
## Four       0.122 32.691  0.792 |
## Five       0.009  0.247  0.002 |
## Six       -0.029  1.299  0.018 |
## Seven     -0.056 30.918  0.373 |
## 
## Columns
##             Iner*1000    Dim.1    ctr   cos2    Dim.2    ctr   cos2  
## Bad       |    20.803 |  0.379 84.064  0.997 |  0.011  0.247  0.001 |
## Fair      |     4.752 | -0.066  8.856  0.460 | -0.067 32.644  0.474 |
## Good      |     6.031 | -0.075  6.976  0.285 |  0.119 62.327  0.713 |
## Very good |     3.646 |  0.022  0.104  0.007 | -0.079  4.783  0.090 |
##            Dim.3    ctr   cos2  
## Bad       -0.018  1.261  0.002 |
## Fair      -0.025  8.631  0.067 |
## Good       0.006  0.324  0.002 |
## Very good  0.248 89.784  0.903 |

Look at the dimentions obtained in details:

#eigenvalues/variances
eigenvalues <- get_eigenvalue(ca)
pander(round(eigenvalues, 2))
  eigenvalue variance.percent cumulative.variance.percent
Dim.1 0.02 70.03 70.03
Dim.2 0.01 19.57 89.6
Dim.3 0 10.4 100
#significance
trace <- sum(eigenvalues$eigenvalue) 
cor.coef <- sqrt(trace)
pander(cor.coef)

0.1877

#the value does not cross the threshold of 0.2, but it is very close. 
#Real data, let it be OK for the analysis here

Considering eigenvalues, corresponding to the amount of information retained by each axis, dimensions 1 and 2 explain approximately 70.03% and 19.57% of the total inertia respectively. Therefore, about 90.6% of the variation is explained by the first two dimensions, which is an acceptably large percentage. It means that two dimentions is enough for the further analysis and interpretation.

Alternatively, Scree plot to consider CA dimentions visually:

fviz_screeplot(ca) +
  geom_hline(yintercept=33.33, linetype=2, color="red", size = 2)+
  geom_hline(yintercept=14.28, linetype=2, color="green", size = 2)+
  ggtitle("Scree plot")+
  theme(plot.title = element_text(size = 20))

Our data contains 8 rows and 4 columns. If the data were random, the expected value of the eigenvalue for each axis would be 1/7 = 14.28% (green) in terms of rows. At the same time, the average axis should account for 1/3 = 33.33% (red) in terms of the 4 columns.“Any axis with a contribution larger than the maximum of these two percentages should be considered as important and included in the solution for the interpretation of the data,”- http://www.sthda.com/english/articles/31-principal-component-methods-in-r-practical-guide/113-ca-correspondence-analysis-in-r-essentials/. So the Dimentions 1 and 2 was the reasonable choice.

CA Perception map (Symmetric biplot):

fviz_ca_biplot(ca, title = "Symmetric Plot", repel = T)+
theme(plot.title = element_text(size = 20))  

Rows are represented by blue points and columns by red triangles. According to the graph, only Dim1, the horizontal axis could be interpreted clearly. So, there is the variety of workout frequencies on the left side of the graph and the absence of any physical activity on the right. Probably, the horizontal axis dispays the level of respondent health objectively. Maybe the vertical dimention could be interpreted as common VS specific (the top and the bottom respectively) patterns of physical activity: Two and Three times a week trainings are located close to each other, together with the Good health evaluation, while Seven, One, Four someway related to Fair and Very good - such practices looks specific.

Assymetric biplot

fviz_ca_biplot(ca, arrows = c(F,T), repel = T, title = "Asymmetric Plot")+
theme(plot.title = element_text(size = 20))

Here the contribution of rows to the axes will be interpreted, because there are 4 rows for analysis and 4 “directons” on the coordinate plane. Firstly, “Bad health” arrow position indicates great contribution to the positive pole of horizontal dimention. Dimension 2 is mainly defined by the column category “Very Good”. Fair health category contributes to the two axes to the same extent.

Contributions of rows to dimensions:

row <- get_ca_row(ca)
pander(row$contrib)#row variables with larger value contribute most to dimensions.
  Dim 1 Dim 2 Dim 3
No days 75.61 4.617 1.236
One 0.2804 0.05774 15.52
Two 0.03666 16.18 0.03927
Three 0.1009 18.8 18.05
Four 1.119 0.5714 32.69
Five 6.138 54.25 0.2466
Six 10.32 0.7957 1.299
Seven 6.39 4.737 30.92

Looking at the data, it is easy to conclude that the Dimention 1 is basically defined by the row “No day”. “Six” day level provides some contribution to the first dimention, but it is much less. Such levels as “Five” (mostly), “Three” and “Two” days of training a week importantly contribute to the Dimention 2. “Four” or “Seven” workouts per week define the variance, related to the Dimention 3.

How much each of rows put into each of dimentions, visually:

corrplot(row$contrib, is.corr=FALSE, tl.col = "black", cl.cex = 1,cl.ratio = 0.5, cl.pos = "r")

Another way to evaluate row contributions: Dim1, Dim2, both Dims and top rows for both Dims

a <- fviz_contrib(ca, choice = "row", axes = 1)
b <- fviz_contrib(ca, choice = "row", axes = 2) 
c <- fviz_contrib(ca, choice = "row", axes = 1:2)
d <- fviz_contrib(ca, choice = "row", axes = 1:2, top = 4)
grid.arrange(a, b, c, d, ncol = 2)

Next, what about these pictures. The red dashed line on the graph above indicates the expected average contribution of the rows, 12.5 percent. The fist one, top left, illustrate the colossal contribution of “No days” to the variance, explained by the Dimention 2; contribution of the rest rows is lower than averagely expected. Top right diagram shows that the variance, corresponding to the Dimention 2, mostly defined by rows “Five”, “Three” and “Two” days. Bottom left graph illustrates the most important rows for the both dimentions, displayed on the perception map, they are “No days” and “Five”, the same as for the first graph. Bottom right picture tells about top-4 row contributions to both dimensions.

Row contributions on the map

fviz_ca_row(ca, col.row = "contrib", repel = T)+
  scale_color_gradient2(low="pink", mid="hotpink", 
                        high="hotpink3", midpoint=20) + theme_minimal()

Contributions of columns to dimensions:

col <- get_ca_col(ca)
pander(col$contrib)#row variables with larger value contribute most to dimensions.
  Dim 1 Dim 2 Dim 3
Bad 84.06 0.2469 1.261
Fair 8.856 32.64 8.631
Good 6.976 62.33 0.3242
Very good 0.1042 4.783 89.78

Let us describe the contribution of columns to the dimentions or to the variability of a data set. The contribution of the column “Bad” incomparably greater than for the rest rows. Dimension 2 is mostly contributed by the columns “Fair” and “Good”. The third dimention is defined by the column “Very good”.

How much each of columns put into each of dimentions:

corrplot(col$contrib, is.corr=FALSE, tl.col = "black", cl.cex = 1, cl.ratio = 0.5, cl.pos = "r")

Another way to evaluate column contributions: Dim1, Dim2, both Dims and top columns for both Dims

e <- fviz_contrib(ca, choice = "col", axes = 1)
f <- fviz_contrib(ca, choice = "col", axes = 2) 
g <- fviz_contrib(ca, choice = "col", axes = 1:2) 
h <- fviz_contrib(ca, choice = "col", axes = 1:2, top = 3)
grid.arrange(e, f, g, h, ncol = 2)

Plain description of these grid. This reference line corresponds to the expected value if the contribution where uniform (25 percent of contribution from each column). Top left barplot displays the important meaning of “Bad” column in matter of the explaining Dimention 1 variance. “Good” and “Fair” significant for the Dimention 2, top right. Contribution of “Bad” column is the most important for both dimentions, next we have “Good” and “Fair”, however, they contribute less than averagely expected, bottom left picture. Bottom right shows the same as third picture does, just without “Very good” level, because its contribution seems meaningless.

Column contributions on the map

fviz_ca_col(ca, col.col ="contrib", repel = T) +
  scale_color_gradient2(low="pink", mid="hotpink",high="hotpink3", midpoint=15) + theme_minimal()