Data Analysis

How someone’s subjective ideas about his or her health connected with the frequency of the excersises during a week? The answer will be found with the help of correspondent analysis.

Libraries:

library(haven)
library(stringr)
library(FactoMineR)
library(factoextra)
library(ade4)
library(ca)
library(gplots)
library(gridExtra)
library(corrplot)
library(pander)
library(ggsci)

Data Preparation:

data <- read_spss("ESS6RU.sav")
data2 <- data[,c("physact", "health")]
#Variable 1: Physically active for 20 minutes or longer last 7 days/ROWS
data2$physact<-as.factor(data2$physact)
levels(data2$physact)<-c("No days", "One", "Two", "Three", "Four", "Five", "Six","Seven")
pander(table(data2$physact))

No days	One	Two	Three	Four	Five	Six	Seven
351	133	193	230	187	268	128	827

#Variable 2: Subjective general health/COLUMNS
data2$health<-as.factor(data2$health)
levels(data2$health)<-c("Very good", "Good", "Fair", "Bad", "Very bad")
#Categories "Bad" and "Very bad" are combined into one "Bad", because the number of #observations at the intersection of "Very bad" (health) and different levels of #physical activity is too small
data2$health <- str_replace(data2$health, "Very bad", "Bad")
data2$health<-as.factor(data2$health)

pander(table(data2$health))

Bad	Fair	Good	Very good
365	1225	748	128

ctable<-table(data2)
#now the distribution of row/column values are quite reliable for the analysis
pander(ctable)

	Bad	Fair	Good	Very good
No days	93	156	81	19
One	17	65	41	10
Two	29	87	68	9
Three	34	99	81	15
Four	22	92	57	15
Five	27	159	65	17
Six	9	72	41	6
Seven	102	421	267	32

Are the variables related?

chisqresults <- chisq.test(ctable)
pander(chisqresults) #definitely related (significantly, according to p_value) related

Pearson’s Chi-squared test: `ctable` Let’s look at the relations with the help of baloonplot:
Test statistic	df	P value
81.31	21	4.864e-09 * * *

dt <- as.table(as.matrix(ctable))
balloonplot(t(dt), main = "Sport and Subjective Health", 
            xlab = "Health evaluation", ylab = "Physical Activity",
            dotcolor="firebrick3", label = F)

The diagram shows that fair health is the most popular answer, regardless of the amount of sport per week. Bad health and regular exercises (7 times) are frequently associated, maybe it is those who needs obligatory physical activity because of some diseases. The connection between absence of sport and bad health is quite obvious. There is also the strong relation between the subjective idea that someone’s health is good and everyday workout.

Let’s look at the relations with the help of mosaicplot:

mosaicplot(dt, shade = TRUE, las=2, main = "Sport and Subjective Health", 
           ylab = "Health evaluation", xlab = "Physical Activity")

According to the graph, perception of the health as bad is much more, than it would be expected. The situation with having a sport 5 times a week and fair health is the same, probably it is about people who is ill or some perfectionists. Association of a good health and no sport at all is extremely lower, than expected, as well as for the six trainings a week and bad feelengs about the health.

Performing the corresponding analysis:

ca <- CA(ctable, graph = F)
summary(ca)

## 
## Call:
## CA(X = ctable, graph = F) 
## 
## The chi square of independence between the two variables is equal to 81.31412 (p-value =  4.863833e-09 ).
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3
## Variance               0.025   0.007   0.004
## % of var.             70.026  19.571  10.403
## Cumulative % of var.  70.026  89.597 100.000
## 
## Rows
##             Iner*1000    Dim.1    ctr   cos2    Dim.2    ctr   cos2  
## No days   |    19.018 |  0.351 75.614  0.981 | -0.046  4.617  0.017 |
## One       |     0.642 | -0.035  0.280  0.108 | -0.008  0.058  0.006 |
## Two       |     1.126 |  0.010  0.037  0.008 |  0.115 16.176  0.991 |
## Three     |     1.983 |  0.016  0.101  0.013 |  0.114 18.798  0.654 |
## Four      |     1.514 | -0.059  1.119  0.182 | -0.022  0.571  0.026 |
## Five      |     5.264 | -0.114  6.138  0.288 | -0.179 54.246  0.711 |
## Six       |     2.649 | -0.214 10.321  0.961 | -0.031  0.796  0.021 |
## Seven     |     3.036 | -0.067  6.390  0.519 |  0.030  4.737  0.108 |
##            Dim.3    ctr   cos2  
## No days   -0.017  1.236  0.002 |
## One        0.099 15.521  0.886 |
## Two       -0.004  0.039  0.001 |
## Three      0.082 18.050  0.334 |
## Four       0.122 32.691  0.792 |
## Five       0.009  0.247  0.002 |
## Six       -0.029  1.299  0.018 |
## Seven     -0.056 30.918  0.373 |
## 
## Columns
##             Iner*1000    Dim.1    ctr   cos2    Dim.2    ctr   cos2  
## Bad       |    20.803 |  0.379 84.064  0.997 |  0.011  0.247  0.001 |
## Fair      |     4.752 | -0.066  8.856  0.460 | -0.067 32.644  0.474 |
## Good      |     6.031 | -0.075  6.976  0.285 |  0.119 62.327  0.713 |
## Very good |     3.646 |  0.022  0.104  0.007 | -0.079  4.783  0.090 |
##            Dim.3    ctr   cos2  
## Bad       -0.018  1.261  0.002 |
## Fair      -0.025  8.631  0.067 |
## Good       0.006  0.324  0.002 |
## Very good  0.248 89.784  0.903 |

Look at the dimentions obtained in details:

#eigenvalues/variances
eigenvalues <- get_eigenvalue(ca)
pander(round(eigenvalues, 2))

	eigenvalue	variance.percent	cumulative.variance.percent
Dim.1	0.02	70.03	70.03
Dim.2	0.01	19.57	89.6
Dim.3	0	10.4	100

#significance
trace <- sum(eigenvalues$eigenvalue) 
cor.coef <- sqrt(trace)
pander(cor.coef)

0.1877

#the value does not cross the threshold of 0.2, but it is very close. 
#Real data, let it be OK for the analysis here

Considering eigenvalues, corresponding to the amount of information retained by each axis, dimensions 1 and 2 explain approximately 70.03% and 19.57% of the total inertia respectively. Therefore, about 90.6% of the variation is explained by the first two dimensions, which is an acceptably large percentage. It means that two dimentions is enough for the further analysis and interpretation.

Alternatively, Scree plot to consider CA dimentions visually:

fviz_screeplot(ca) +
  geom_hline(yintercept=33.33, linetype=2, color="red", size = 2)+
  geom_hline(yintercept=14.28, linetype=2, color="green", size = 2)+
  ggtitle("Scree plot")+
  theme(plot.title = element_text(size = 20))

Our data contains 8 rows and 4 columns. If the data were random, the expected value of the eigenvalue for each axis would be 1/7 = 14.28% (green) in terms of rows. At the same time, the average axis should account for 1/3 = 33.33% (red) in terms of the 4 columns.“Any axis with a contribution larger than the maximum of these two percentages should be considered as important and included in the solution for the interpretation of the data,”- http://www.sthda.com/english/articles/31-principal-component-methods-in-r-practical-guide/113-ca-correspondence-analysis-in-r-essentials/. So the Dimentions 1 and 2 was the reasonable choice.

CA Perception map (Symmetric biplot):

fviz_ca_biplot(ca, title = "Symmetric Plot", repel = T)+
theme(plot.title = element_text(size = 20))

Rows are represented by blue points and columns by red triangles. According to the graph, only Dim1, the horizontal axis could be interpreted clearly. So, there is the variety of workout frequencies on the left side of the graph and the absence of any physical activity on the right. Probably, the horizontal axis dispays the level of respondent health objectively. Maybe the vertical dimention could be interpreted as common VS specific (the top and the bottom respectively) patterns of physical activity: Two and Three times a week trainings are located close to each other, together with the Good health evaluation, while Seven, One, Four someway related to Fair and Very good - such practices looks specific.

Assymetric biplot

fviz_ca_biplot(ca, arrows = c(F,T), repel = T, title = "Asymmetric Plot")+
theme(plot.title = element_text(size = 20))

Here the contribution of rows to the axes will be interpreted, because there are 4 rows for analysis and 4 “directons” on the coordinate plane. Firstly, “Bad health” arrow position indicates great contribution to the positive pole of horizontal dimention. Dimension 2 is mainly defined by the column category “Very Good”. Fair health category contributes to the two axes to the same extent.

Contributions of rows to dimensions:

row <- get_ca_row(ca)
pander(row$contrib)#row variables with larger value contribute most to dimensions.

	Dim 1	Dim 2	Dim 3
No days	75.61	4.617	1.236
One	0.2804	0.05774	15.52
Two	0.03666	16.18	0.03927
Three	0.1009	18.8	18.05
Four	1.119	0.5714	32.69
Five	6.138	54.25	0.2466
Six	10.32	0.7957	1.299
Seven	6.39	4.737	30.92

Looking at the data, it is easy to conclude that the Dimention 1 is basically defined by the row “No day”. “Six” day level provides some contribution to the first dimention, but it is much less. Such levels as “Five” (mostly), “Three” and “Two” days of training a week importantly contribute to the Dimention 2. “Four” or “Seven” workouts per week define the variance, related to the Dimention 3.

How much each of rows put into each of dimentions, visually:

corrplot(row$contrib, is.corr=FALSE, tl.col = "black", cl.cex = 1,cl.ratio = 0.5, cl.pos = "r")

Another way to evaluate row contributions: Dim1, Dim2, both Dims and top rows for both Dims

a <- fviz_contrib(ca, choice = "row", axes = 1)
b <- fviz_contrib(ca, choice = "row", axes = 2) 
c <- fviz_contrib(ca, choice = "row", axes = 1:2)
d <- fviz_contrib(ca, choice = "row", axes = 1:2, top = 4)
grid.arrange(a, b, c, d, ncol = 2)

Next, what about these pictures. The red dashed line on the graph above indicates the expected average contribution of the rows, 12.5 percent. The fist one, top left, illustrate the colossal contribution of “No days” to the variance, explained by the Dimention 2; contribution of the rest rows is lower than averagely expected. Top right diagram shows that the variance, corresponding to the Dimention 2, mostly defined by rows “Five”, “Three” and “Two” days. Bottom left graph illustrates the most important rows for the both dimentions, displayed on the perception map, they are “No days” and “Five”, the same as for the first graph. Bottom right picture tells about top-4 row contributions to both dimensions.

Row contributions on the map

fviz_ca_row(ca, col.row = "contrib", repel = T)+
  scale_color_gradient2(low="pink", mid="hotpink", 
                        high="hotpink3", midpoint=20) + theme_minimal()

Contributions of columns to dimensions:

col <- get_ca_col(ca)
pander(col$contrib)#row variables with larger value contribute most to dimensions.

	Dim 1	Dim 2	Dim 3
Bad	84.06	0.2469	1.261
Fair	8.856	32.64	8.631
Good	6.976	62.33	0.3242
Very good	0.1042	4.783	89.78

Let us describe the contribution of columns to the dimentions or to the variability of a data set. The contribution of the column “Bad” incomparably greater than for the rest rows. Dimension 2 is mostly contributed by the columns “Fair” and “Good”. The third dimention is defined by the column “Very good”.

How much each of columns put into each of dimentions:

corrplot(col$contrib, is.corr=FALSE, tl.col = "black", cl.cex = 1, cl.ratio = 0.5, cl.pos = "r")

Another way to evaluate column contributions: Dim1, Dim2, both Dims and top columns for both Dims

e <- fviz_contrib(ca, choice = "col", axes = 1)
f <- fviz_contrib(ca, choice = "col", axes = 2) 
g <- fviz_contrib(ca, choice = "col", axes = 1:2) 
h <- fviz_contrib(ca, choice = "col", axes = 1:2, top = 3)
grid.arrange(e, f, g, h, ncol = 2)

Plain description of these grid. This reference line corresponds to the expected value if the contribution where uniform (25 percent of contribution from each column). Top left barplot displays the important meaning of “Bad” column in matter of the explaining Dimention 1 variance. “Good” and “Fair” significant for the Dimention 2, top right. Contribution of “Bad” column is the most important for both dimentions, next we have “Good” and “Fair”, however, they contribute less than averagely expected, bottom left picture. Bottom right shows the same as third picture does, just without “Very good” level, because its contribution seems meaningless.

Column contributions on the map

fviz_ca_col(ca, col.col ="contrib", repel = T) +
  scale_color_gradient2(low="pink", mid="hotpink",high="hotpink3", midpoint=15) + theme_minimal()

Data Analysis

An Menshikova

9/29/2017