Data analysis project 2

Theoretical framework

The COVID-19 pandemic affected many countries and governments, and Czechia is no exception. As Andoh (2020) and Havlík & Kluknavská (2022) show, the pandemic affected the Czech labour market and political sphere in a number of ways that I explore in this project. One interesting contradiction in the data is that Czechia had a very low rate of people losing their jobs because of COVID-19, yet COVID-19 policies in particular contributed to the downfall of the populist party ANO 2011 in the 2021 elections. I would like to explore this topic more thoroughly.

Research question

How did the COVID-19 pandemic affect the political and labour market situation in the Czech Republic?

Libraries

library(foreign)
library(dplyr)
library(ggplot2)
library(tidyr)
library(stringr)
library(RColorBrewer)
library(knitr)
library(sjPlot)
library(rcompanion)
library(sjstats)
library(pwr)
library(rstatix)
library(psych)
library(car)
library(DescTools)
library(viridis)

Cleaning the dataset

I cleaned the dataset and created separate data frames for the different tests: czc for the chi-square test, czt for the t-test and cza for ANOVA.

# Load the ESS10 subset, keeping value labels as factor levels
setwd("C:/Users/user/Downloads")
czechia <- read.spss("ESS10-subset.sav", use.value.labels = TRUE, to.data.frame = TRUE)

# Chi-square data: COVID-19 illness and job loss
czc <- czechia %>%
  select(respc19, hapljc19)

czc <- drop_na(czc)

# Collapse the two "Yes" answers into a single category
czc$respc19 <- as.character(czc$respc19)
czc$respc19[czc$respc19 == "Yes, I tested positive for COVID-19"] <- "Yes"
czc$respc19[czc$respc19 == "Yes, I think I had COVID-19 but was not tested/did not test positive "] <- "Yes"
czc$respc19[czc$respc19 == "No, I have not had COVID-19"] <- "No"

czc$respc19 <- as.factor(czc$respc19)

# Drop factor levels that are no longer used
czc <- droplevels(czc)

# ANOVA data: party the respondent feels closer to and age
cza <- czechia %>%
  select(prtclecz, agea)

cza <- drop_na(cza)

# Merge individual parties into the coalitions and blocs used in the analysis
cza$prtclecz <- cza$prtclecz %>% str_replace_all("ODS", "SPOLU")
cza$prtclecz <- cza$prtclecz %>% str_replace_all("TOP 09", "SPOLU")
cza$prtclecz <- cza$prtclecz %>% str_replace_all("KDU-ČSL", "SPOLU")
cza$prtclecz <- cza$prtclecz %>% str_replace_all("Svoboda a přímá demokracie", "SPD")
cza$prtclecz <- cza$prtclecz %>% str_replace_all("ČSSD", "Left")
cza$prtclecz <- cza$prtclecz %>% str_replace_all("KSČM", "Left")
cza$prtclecz <- cza$prtclecz %>% str_replace_all("Starostové a nezávislí", "PirSTAN")
cza$prtclecz <- cza$prtclecz %>% str_replace_all("Česká pirátská strana", "PirSTAN")

cza$prtclecz <- as.factor(cza$prtclecz)
cza$agea <- as.numeric(cza$agea)

# Keep adult respondents and drop the residual "Other" category
cza <- filter(cza, agea > 17)
cza[cza == "Other"] <- NA
cza <- drop_na(cza)
cza <- droplevels(cza)

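# T-test data: satisfaction with the government's handling of COVID-19 and illness
czt <- czechia %>%
  select(gvhanc19, respc19)
czt <- drop_na(czt)

# Replace the labelled end points of the scale with their numeric values
czt$gvhanc19 <- czt$gvhanc19 %>% str_replace_all("Extremely dissatisfied", "0")
czt$gvhanc19 <- czt$gvhanc19 %>% str_replace_all("Extremely satisfied", "10")
# as.factor() sorts the labels alphabetically, so as.numeric() returns
# level codes 1-11 rather than the 0-10 scores
czt$gvhanc19 <- as.factor(czt$gvhanc19)
czt$gvhanc19 <- as.numeric(czt$gvhanc19)

# Collapse the two "Yes" answers into a single category
czt$respc19 <- as.character(czt$respc19)
czt$respc19[czt$respc19 == "Yes, I tested positive for COVID-19"] <- "Yes"
czt$respc19[czt$respc19 == "Yes, I think I had COVID-19 but was not tested/did not test positive "] <- "Yes"
czt$respc19[czt$respc19 == "No, I have not had COVID-19"] <- "No"
czt$respc19 <- as.factor(czt$respc19)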
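
One detail of the gvhanc19 conversion is worth illustrating. When a character vector goes through as.factor(), the levels are sorted alphabetically, so as.numeric() then returns level codes instead of the original scores. The sketch below uses a made-up vector (the names scores, codes and values are illustrative and not part of the dataset) and shows a direct conversion that keeps the 0-10 values:

```r
# Hypothetical answers on a 0-10 scale, stored as text after relabelling
scores <- c("0", "2", "10", "7")

# Going through a factor returns alphabetical level codes, not the scores
codes <- as.numeric(as.factor(scores))

# Converting the character vector directly keeps the original values
values <- as.numeric(scores)

print(codes)
print(values)
```

With the direct conversion, the descriptive statistics would be reported on the original 0-10 scale.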

Analysis

Chi-squared Test

Variables

For the chi-square test I chose two variables: respc19 (whether the respondent was ill with COVID-19; categorical nominal, two levels) and hapljc19 (things that happened since the start of COVID-19: was made redundant/lost job; categorical nominal, two levels). I am interested in these variables because COVID-19 significantly changed the labour market in the Czech Republic: after the outbreak, total employment decreased by 87.5 thousand year-on-year (Hedvičáková & Kozubíková, 2021). The categories of each variable are mutually exclusive, each respondent can be assigned to only one group, and the studied groups are independent (trusting the data collectors).

I created a contingency table and a plot that show how the two variables are distributed against each other.

table(czc$respc19, czc$hapljc19)

##      
##       Not marked Marked
##   No        1629     39
##   Yes        717     37

plot_xtab(czc$respc19, czc$hapljc19,
          margin = "row", 
          bar.pos = "stack", 
          show.summary = T)

The plot suggests that the groups differ, so I test this formally with a chi-square test. The majority of respondents did not lose their job; however, the share of those who lost their job is larger among people who had COVID-19 (37 of 754, about 4.9%) than among those who did not (39 of 1668, about 2.3%).
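
The row percentages behind this comparison can be reproduced from the printed table. A minimal sketch, assuming the counts are typed in by hand from the output above (the object names counts and row_pct are illustrative):

```r
# Counts copied from the contingency table in the text
counts <- matrix(c(1629, 39,
                   717, 37),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(respc19 = c("No", "Yes"),
                                 hapljc19 = c("Not marked", "Marked")))

# Row percentages: share of each job-loss category within each COVID-19 group
row_pct <- round(100 * prop.table(counts, margin = 1), 1)
print(row_pct)
```

Using margin = 1 matches the margin = "row" setting of the plot, so each row sums to 100%.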

Chi-squared test

H0: There is no relationship between losing a job and COVID-19 illness.

H1: There is a relationship between losing a job and COVID-19 illness.

chisq.test(czc$respc19, czc$hapljc19)

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  czc$respc19 and czc$hapljc19
## X-squared = 10.446, df = 1, p-value = 0.001229

The p-value (0.0012) is below 0.05, so I can reject H0 and conclude that there is a statistically significant association between the variables: whether a respondent had COVID-19 is related to whether they lost their job. The test shows an association, not the direction of any causal effect. Next, I check the residuals.
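
Before turning to the residuals, it may help to gauge how strong the association is, since with more than 2,400 respondents even a weak association can be significant. The sketch below computes the phi coefficient for the 2x2 table from the uncorrected chi-square statistic. This is an addition to the analysis rather than part of it, and the counts are again typed in from the printed table:

```r
# Counts copied from the contingency table in the text
counts <- matrix(c(1629, 39, 717, 37), nrow = 2, byrow = TRUE)

# Phi is defined on the chi-square statistic without Yates' correction
test_uncorrected <- chisq.test(counts, correct = FALSE)

n <- sum(counts)
phi <- sqrt(unname(test_uncorrected$statistic) / n)
print(round(phi, 3))
```

By Cohen's conventional benchmarks (0.1 small, 0.3 medium, 0.5 large), a value below 0.1 points to a weak association.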

Residuals

chisq.test(czc$respc19, czc$hapljc19)$expected

##            czc$hapljc19
## czc$respc19 Not marked   Marked
##         No   1615.6598 52.34021
##         Yes   730.3402 23.65979

chisq.test(czc$respc19, czc$hapljc19)$stdres

##            czc$hapljc19
## czc$respc19 Not marked    Marked
##         No    3.357914 -3.357914
##         Yes  -3.357914  3.357914

Looking at the expected values, I can confirm that the assumption of at least 5 expected observations in each cell holds (the smallest expected count is 23.66). This means that the chi-square test was appropriate.

Then I inspect the standardized residuals. All four are larger than 1.96 in absolute value, so every cell makes a significant contribution to the result. Comparing observed with expected counts:

- there are more people than expected who didn’t have covid and didn’t lose their job;
- there are more people than expected who had covid and lost their job;
- there are fewer people than expected who didn’t have covid and lost their job;
- there are fewer people than expected who had covid and didn’t lose their job.

All in all, this confirms the result of the test: job loss was reported more often than expected by those who had COVID-19.
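
The expected counts and standardized residuals reported by chisq.test() can also be reproduced by hand, which makes their meaning more concrete. A minimal sketch, assuming the counts from the printed table (object names are illustrative):

```r
# Counts copied from the contingency table in the text
counts <- matrix(c(1629, 39, 717, 37), nrow = 2, byrow = TRUE,
                 dimnames = list(c("No", "Yes"), c("Not marked", "Marked")))

n <- sum(counts)
row_tot <- rowSums(counts)
col_tot <- colSums(counts)

# Expected count under independence: row total * column total / n
expected <- outer(row_tot, col_tot) / n

# Standardized residual: (observed - expected) divided by its standard error
std_res <- (counts - expected) /
  sqrt(expected * outer(1 - row_tot / n, 1 - col_tot / n))

print(round(expected, 2))
print(round(std_res, 2))
```

A positive residual means that the cell holds more respondents than independence would predict, a negative one fewer.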

ANOVA

Theoretical framework

Continuing the research on people’s political preferences started in the previous project, I found an interesting work linking age to political preferences and voting behaviour. Holland (2013) discusses three main theories of the relationship between age and political preferences: first, that voters prefer candidates who are close to their own age; second, that young people are more liberal than the older generation, which influences their preferences; and third, that older people prefer moderate or traditional candidates, whereas younger people prefer those who oppose the traditional party system. The study was conducted in an American context, so it will be interesting to look at the situation in the Czech Republic, focusing on which party people feel closer to and their age.

Research hypothesis (RH): mean age will be higher among those close to leftist and populist parties and lower among those close to centrist and liberal parties.

Variables

For the analysis I chose the age variable agea (continuous) and the variable prtclecz, which records the party the respondent feels closer to (categorical). As shown above, I merged some parties based on the existing coalitions in the Czech Republic described by Havlík & Kluknavská (2022).

describeBy(cza$agea, cza$prtclecz, mat = TRUE) %>% 
  select(prtclecz = group1, N=n, Mean=mean, SD=sd, Median=median, Min=min, Max=max, 
                Skew=skew, Kurtosis=kurtosis, st.error = se)

##     prtclecz   N     Mean       SD Median Min Max        Skew   Kurtosis
## X11 ANO 2011 216 46.33333 12.45830     47  18  72 -0.35826131 -0.6022835
## X12     Left 117 49.09402 12.75362     52  18  71 -0.56949942 -0.6352386
## X13  PirSTAN  79 36.11392 13.87766     32  18  65  0.60890996 -0.8674853
## X14      SPD  79 38.78481 12.05082     39  18  62  0.06188888 -1.1693110
## X15    SPOLU 119 40.81513 13.36512     40  18  72  0.25647278 -0.8841561
##      st.error
## X11 0.8476801
## X12 1.1790728
## X13 1.5613586
## X14 1.3558235
## X15 1.2251789

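From the table we can see that group sizes range from 79 to 216, skew is close to 0 in all groups (between -0.57 and 0.61), and the standard deviations are similar across groups (about 12 to 14 years), although fairly large relative to the differences in means.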

Boxplot

ggplot(cza) +
  # alpha is a fixed setting, so it belongs outside aes()
  geom_boxplot(aes(x = prtclecz, y = agea, fill = prtclecz), alpha = 0.5) +
  theme_minimal() +
  labs(title = "The age of people based on which party/coalition they feel closer to", x = "Which party/coalition feel closer to", y = "Age") +
  theme(legend.position = "none", plot.title = element_text(face = "bold", hjust = 0.5)) +
  scale_fill_viridis(discrete = TRUE)

From the plot, we see that age is distributed roughly normally within the groups: the medians are close to the centre of the boxes and there are no outliers. The age range looks similar across groups (slightly narrower for PirSTAN and SPD). Those who feel closer to ANO 2011 and the Left are older than people in the other groups.

F-test hypotheses

H0: the means of all groups are equal.

Ha: at least one of the groups has a different mean.

res_aov <- aov(data = cza, agea~prtclecz)
summary(res_aov)

##              Df Sum Sq Mean Sq F value   Pr(>F)    
## prtclecz      4  12319  3079.8    18.7 1.68e-14 ***
## Residuals   605  99665   164.7                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Results show that p-value < 0.05, therefore, I can reject the null hypothesis in favor of the hypothesis that there is a difference in means between the groups. The difference in the age across groups of political parties that people feel closer to is statistically significant.
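
The F statistic in the ANOVA table is the ratio of the between-group to the within-group mean square. The sketch below reproduces it from the sums of squares printed in the anova_stats() output later in this report (12319.152 and 99665.215), so no access to the data is assumed; the object names are illustrative:

```r
# Sums of squares and degrees of freedom taken from the printed output
ss_between <- 12319.152
df_between <- 4
ss_within <- 99665.215
df_within <- 605

# Mean squares are sums of squares divided by their degrees of freedom
ms_between <- ss_between / df_between
ms_within <- ss_within / df_within

# F is the ratio of the two mean squares; the p-value is the upper tail
f_value <- ms_between / ms_within
p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)

print(round(f_value, 2))
print(p_value)
```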

Assumptions

Now I will check the assumptions for ANOVA.

Equal variances: the Levene test gives a p-value above 0.05 (p = 0.543), so I can’t reject the null hypothesis that the variances are equal.

leveneTest(agea ~ prtclecz, data = cza)

## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value Pr(>F)
## group   4   0.773  0.543
##       605

Independence: the survey is based on random sampling and there is no “before-after” design. The groups are also unrelated, because each respondent could choose only one party that they felt closer to.

Normality of residuals: first, I check the skew and kurtosis. Both are below 2 in absolute value, which is compatible with a normal distribution.

describe(res_aov$residuals)[11:12]

##     skew kurtosis
## X1 -0.06    -0.73

Next, I perform a Shapiro-Wilk test. The p-value is below 0.05, so formally I should reject the null hypothesis that the residuals are normal. However, with a sample of this size (n = 610) the test is very sensitive and flags even small deviations, so I visualize the residuals to make a better judgement about their normality.

shapiro.test(res_aov$residuals)

## 
##  Shapiro-Wilk normality test
## 
## data:  res_aov$residuals
## W = 0.98874, p-value = 0.0001233

Now I visualize the residuals with a histogram and a Q-Q plot.

plotNormalHistogram(res_aov$residuals, prob = FALSE, linecol="purple")

qqPlot(res_aov$residuals,
  id = FALSE
  )

On the histogram we see that the distribution is fairly close to normal. The qqplot also suggests that the data is close to a normal distribution, despite some deviations here and there. Therefore, we can proceed.

Post hoc

Since the first test is parametric and significant and the variances are equal, I choose the Tukey test for the post hoc analysis. It shows that most of the pairwise comparisons are significant, except for these:

SPD-PirSTAN, SPOLU-PirSTAN, SPOLU-SPD and Left-ANO 2011.

A possible explanation is that the first three pairs involve parties with similar ideologies (e.g. right-wing), while for the last pair ANO 2011 still has a large amount of support in all age groups.

TukeyHSD(res_aov)

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = agea ~ prtclecz, data = cza)
## 
## $prtclecz
##                        diff         lwr       upr     p adj
## Left-ANO 2011      2.760684  -1.2703466  6.791714 0.3326364
## PirSTAN-ANO 2011 -10.219409 -14.8366694 -5.602149 0.0000000
## SPD-ANO 2011      -7.548523 -12.1657833 -2.931263 0.0000899
## SPOLU-ANO 2011    -5.518207  -9.5272050 -1.509210 0.0017011
## PirSTAN-Left     -12.980093 -18.0937939 -7.866392 0.0000000
## SPD-Left         -10.309207 -15.4229078 -5.195506 0.0000005
## SPOLU-Left        -8.278891 -12.8508608 -3.706921 0.0000093
## SPD-PirSTAN        2.670886  -2.9165840  8.258356 0.6864433
## SPOLU-PirSTAN      4.701202  -0.3951489  9.797553 0.0867080
## SPOLU-SPD          2.030316  -3.0660350  7.126667 0.8117327

Tukey <- TukeyHSD(res_aov)
par(mar = c(5, 13, 3, 1))
plot(Tukey, las = 2)

Then I calculate the effect size for this test using omega-squared. It is 0.104, which is a moderate effect size. This means that age is only moderately associated with the party that the respondent feels closer to.

anova_stats(res_aov)

## etasq | partial.etasq | omegasq | partial.omegasq | epsilonsq | cohens.f |      term |     sumsq |  df |   meansq | statistic | p.value | power
## -----------------------------------------------------------------------------------------------------------------------------------------------
## 0.110 |         0.110 |   0.104 |           0.104 |     0.104 |    0.352 |  prtclecz | 12319.152 |   4 | 3079.788 |    18.695 |  < .001 |     1
##       |               |         |                 |           |          | Residuals | 99665.215 | 605 |  164.736 |           |         |
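
The effect sizes in this table follow directly from the sums of squares. A minimal sketch, assuming only the values printed above and the standard formulas for eta-squared, omega-squared and Cohen's f in a one-way design (object names are illustrative):

```r
# Values taken from the anova_stats() output
ss_between <- 12319.152
ss_within <- 99665.215
df_between <- 4
ms_within <- 164.736

ss_total <- ss_between + ss_within

# Eta-squared: share of total variation explained by the grouping
eta_sq <- ss_between / ss_total

# Omega-squared: a less biased version that corrects for the error variance
omega_sq <- (ss_between - df_between * ms_within) / (ss_total + ms_within)

# Cohen's f, derived from eta-squared
cohens_f <- sqrt(eta_sq / (1 - eta_sq))

print(round(c(eta_sq = eta_sq, omega_sq = omega_sq, cohens_f = cohens_f), 3))
```

Omega-squared is slightly smaller than eta-squared because it subtracts the variation that would be expected by chance alone.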

Non-parametric test

Then I perform a non-parametric test (Kruskal-Wallis), which turns out to be significant, and calculate its effect size, which is moderate (0.103).

kruskal.test(agea~prtclecz, data = cza)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  agea by prtclecz
## Kruskal-Wallis chi-squared = 66.552, df = 4, p-value = 1.212e-13

kruskal_effsize(cza, agea~prtclecz)

## # A tibble: 1 × 5
##   .y.       n effsize method  magnitude
## * <chr> <int>   <dbl> <chr>   <ord>    
## 1 agea    610   0.103 eta2[H] moderate
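
The eta2[H] value can be checked by hand from the Kruskal-Wallis statistic. The sketch below uses the formula (H - k + 1) / (n - k) given in the rstatix documentation for kruskal_effsize(), with H, the number of groups k and the sample size n taken from the output above:

```r
# Values taken from the Kruskal-Wallis output and the effect size table
h_stat <- 66.552  # Kruskal-Wallis chi-squared
k <- 5            # number of party groups
n <- 610          # number of respondents

# Eta-squared based on the H statistic
eta_sq_h <- (h_stat - k + 1) / (n - k)
print(round(eta_sq_h, 3))
```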

Then I perform a post hoc analysis for the non-parametric test (Dunn’s test with Holm correction). The pattern of results is the same as in the parametric test: the same four pairs (Left-ANO 2011, SPD-PirSTAN, SPOLU-PirSTAN and SPOLU-SPD) are not significant, so the conclusions do not depend on the normality assumption.

DunnTest(agea ~ prtclecz, data = cza,
         method = "holm")

## 
##  Dunn's test of multiple comparisons using rank sums : holm  
## 
##                  mean.rank.diff    pval    
## Left-ANO 2011          38.62055 0.16856    
## PirSTAN-ANO 2011     -129.03167 2.3e-07 ***
## SPD-ANO 2011          -97.70889 0.00015 ***
## SPOLU-ANO 2011        -71.49753 0.00189 ** 
## PirSTAN-Left         -167.65222 6.4e-10 ***
## SPD-Left             -136.32944 8.6e-07 ***
## SPOLU-Left           -110.11808 1.1e-05 ***
## SPD-PirSTAN            31.32278 0.52770    
## SPOLU-PirSTAN          57.53415 0.09777 .  
## SPOLU-SPD              26.21136 0.52770    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Conclusion

Thus, the difference in mean age between groups defined by the party people feel closer to is statistically significant for most pairs (six of ten). For the pairs without a significant difference, I assume that similar ideologies or a broad appeal across age groups play a role. Even though ANOVA is non-directional and I can’t make definitive judgements from it alone, the descriptive statistics and plots suggest that our hypothesis is at least partly supported: the means and medians of the Left and ANO 2011 are higher than those of the other groups.

T-test

Theoretical framework

I am interested in whether satisfaction with the government’s actions during COVID-19 differs depending on whether the respondent was ill with COVID-19, since the health of the population and the perception that others are healthy are among the most important factors for satisfaction with government actions (Chen et al., 2021). I assume that there will be a difference between those who were ill and those who were not, since those who were ill experienced treatment for coronavirus and contact with other patients, that is, they lived the sick role (Parsons, 1951).

Variables

For the analysis, I chose respc19 as the grouping variable (whether the respondent has had COVID-19; two levels). The dependent variable is gvhanc19 (satisfaction with the government’s handling of COVID-19), which is quasi-interval. Let’s look at the variables with a boxplot.

ggplot(czt, aes(x = respc19, y = gvhanc19)) + 
  geom_boxplot() +
  # fun replaces the fun.y argument deprecated in ggplot2 3.3.0
  stat_summary(fun = mean, geom = "point", size = 2, col = "red") +
  theme_classic()

The boxplot shows that the spread looks the same in both groups and there are no outliers, but there is a small difference in the means (red dots).

Normality

Since the data is quasi-interval, a histogram is not very convenient for checking the distribution, so I check the normality of the data with a Q-Q plot.

# Default axis limits keep the plot at the scale of the data
qqnorm(czt$gvhanc19)
qqline(czt$gvhanc19, col = 2)

The Q-Q plot shows that the data stay close to the reference line. Because the variable takes only 11 distinct values, the points form steps, so for verification it is worth looking at skew and kurtosis as well.

describeBy(czt, group = czt$respc19)

## 
##  Descriptive statistics by group 
## group: No
##          vars    n mean   sd median trimmed  mad min max range  skew kurtosis
## gvhanc19    1 1646 6.83 2.59      7    6.99 2.97   1  11    10 -0.48    -0.51
## respc19     2 1646 1.00 0.00      1    1.00 0.00   1   1     0   NaN      NaN
##            se
## gvhanc19 0.06
## respc19  0.00
## ------------------------------------------------------------ 
## group: Yes
##          vars   n mean   sd median trimmed  mad min max range  skew kurtosis
## gvhanc19    1 743 6.67 2.52      7    6.88 2.97   1  11    10 -0.64     -0.2
## respc19     2 743 2.00 0.00      2    2.00 0.00   2   2     0   NaN      NaN
##            se
## gvhanc19 0.09
## respc19  0.00

Skew and kurtosis both fall within the range from -2 to 2 (indeed from -1 to 1), so I can treat the data as approximately normal and conduct a t-test.

Variances

Let’s check the equality of variances with Levene’s test.

H0: the variances of satisfaction with the government’s actions during COVID-19 are equal among those who were sick with COVID-19 and those who were not.

H1: the variances of satisfaction with the government’s actions during COVID-19 are not equal among those who were sick with COVID-19 and those who were not.

leveneTest(czt$gvhanc19 ~ czt$respc19)

## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value Pr(>F)
## group    1  2.3891 0.1223
##       2387

The p-value (0.122) is above 0.05, so I cannot reject the null hypothesis and treat the variances as equal. Let’s double-check this with Bartlett’s test.

bartlett.test(czt$gvhanc19 ~ czt$respc19)

## 
##  Bartlett test of homogeneity of variances
## 
## data:  czt$gvhanc19 by czt$respc19
## Bartlett's K-squared = 0.78052, df = 1, p-value = 0.377

The p-value (0.377) is again above 0.05, so I cannot reject the null hypothesis and treat the variances as equal.

T-test

H0: the mean satisfaction with the government’s actions during covid-19 among those who were sick with covid-19 and those who were not sick is equal

H1: the mean satisfaction with the government’s actions during covid-19 among those who were sick with covid-19 and those who were not sick is not equal

t.test(czt$gvhanc19 ~ czt$respc19, var.equal = T)

## 
##  Two Sample t-test
## 
## data:  czt$gvhanc19 by czt$respc19
## t = 1.4277, df = 2387, p-value = 0.1535
## alternative hypothesis: true difference in means between group No and group Yes is not equal to 0
## 95 percent confidence interval:
##  -0.06063575  0.38528918
## sample estimates:
##  mean in group No mean in group Yes 
##          6.829891          6.667564

The p-value (0.1535) is above 0.05, so I cannot reject the null hypothesis: there is no evidence that the means differ. On average, those who were ill and those who were not report similar satisfaction with the government’s actions during COVID-19. I also check this with a non-parametric test.
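
The t statistic can be reproduced from the group summaries, which also gives a standardized effect size. A minimal sketch, assuming the means from the t-test output and the rounded standard deviations from the descriptive table, so the result matches the reported t only approximately (Cohen's d is an addition, not part of the original analysis):

```r
# Group summaries taken from the printed output (SDs are rounded)
mean_no <- 6.829891
mean_yes <- 6.667564
sd_no <- 2.59
sd_yes <- 2.52
n_no <- 1646
n_yes <- 743

# Pooled standard deviation, as used by the equal-variance t-test
sd_pooled <- sqrt(((n_no - 1) * sd_no^2 + (n_yes - 1) * sd_yes^2) /
                    (n_no + n_yes - 2))

# t statistic: difference in means divided by its standard error
t_value <- (mean_no - mean_yes) / (sd_pooled * sqrt(1 / n_no + 1 / n_yes))

# Cohen's d: difference in means in units of the pooled SD
cohens_d <- (mean_no - mean_yes) / sd_pooled

print(round(c(t = t_value, d = cohens_d), 3))
```

A d well below the conventional "small" benchmark of 0.2 is in line with the non-significant test.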

Non-parametric test

The outcome is ordinal rather than continuous and the samples are independent, so I use the Mann-Whitney test.

wilcox.test(czt$gvhanc19 ~ czt$respc19)

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  czt$gvhanc19 by czt$respc19
## W = 633591, p-value = 0.1534
## alternative hypothesis: true location shift is not equal to 0

Based on the analysis, I can conclude that in the Czech case the difference in satisfaction with the government’s actions during COVID-19 between those who were ill and those who were not is not statistically significant.
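
The W statistic reported by wilcox.test() can be turned into an interpretable effect size. Dividing W by the number of possible pairs gives the probability that a randomly chosen respondent from the "No" group reports a higher score than one from the "Yes" group (ties count as one half). The sketch below uses W and the group sizes from the printed output; it is an addition to the analysis and the object names are illustrative:

```r
# Values taken from the Wilcoxon output and the descriptive table
w_stat <- 633591
n_no <- 1646
n_yes <- 743

# Probability of superiority: share of pairs in which "No" scores higher
prob_superiority <- w_stat / (n_no * n_yes)

# Rank-biserial correlation rescales this to the range from -1 to 1
rank_biserial <- 2 * prob_superiority - 1

print(round(c(ps = prob_superiority, r = rank_biserial), 3))
```

A probability close to 0.5 means that the two groups are almost indistinguishable.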

Conclusion

Based on the analysis, I can draw the following conclusions:

- There is a statistically significant association between having had COVID-19 and job loss among Czech respondents.
- People of different ages feel closer to different parties: older respondents tend to prefer leftist and populist parties (the Left and ANO 2011).
- Having had COVID-19 is not significantly related to satisfaction with the government’s actions during the pandemic.

Thus, I identified some of the factors related to the political and labour market situation in the Czech Republic during the coronavirus pandemic.

References

Chen, C. W. S., Lee, S., Dong, M. C., & Taniguchi, M. (2021). What factors drive the satisfaction of citizens with governments’ responses to COVID-19? International Journal of Infectious Diseases, 102, 327–331. doi:10.1016/j.ijid.2020.10.050

Havlík, V., & Kluknavská, A. (2022). The populist vs anti‐populist divide in the time of pandemic: The 2021 Czech national election and its consequences for European politics. JCMS: Journal of Common Market Studies, 60, 76-87.

Hedvičáková, M., & Kozubíková, Z. (2021). Impacts of COVID-19 on the labour market - evidence from the Czech Republic. Hradec Economic Days. https://doi.org/10.36689/uhk/hed/2021-01-023

Holland, J. L. (2013). Age gap? The influence of age on voting behavior and political preferences in the American electorate. Washington State University.

Ekaterina Prohorova

2024-05-23
