library(haven)
ESS10 <- read_sav("/Users/DP/OneDrive/Документы/ESS10.sav")
Step 0 - clean the dataset and keep only the variables we need
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
greece <- filter(ESS10, cntry == "GR")
Let’s describe relationships between the following pairs of variables:
polintr and vote. I expect to find out whether there is a relationship between the level of political interest (polintr) and whether individuals voted in the last national elections (vote).
greece_2 <- greece %>% select(polintr, vote)
greece_2 <- na.omit(greece_2)
greece_2
## # A tibble: 2,752 × 2
## polintr vote
## <dbl+lbl> <dbl+lbl>
## 1 3 [Hardly interested] 1 [Yes]
## 2 3 [Hardly interested] 1 [Yes]
## 3 2 [Quite interested] 1 [Yes]
## 4 3 [Hardly interested] 3 [Not eligible to vote]
## 5 3 [Hardly interested] 1 [Yes]
## 6 4 [Not at all interested] 1 [Yes]
## 7 3 [Hardly interested] 1 [Yes]
## 8 4 [Not at all interested] 1 [Yes]
## 9 3 [Hardly interested] 1 [Yes]
## 10 4 [Not at all interested] 1 [Yes]
## # … with 2,742 more rows
greece_2$polintr <- as.numeric(greece_2$polintr)
greece_2$vote <- as.numeric(greece_2$vote)
greece_2
## # A tibble: 2,752 × 2
## polintr vote
## <dbl> <dbl>
## 1 3 1
## 2 3 1
## 3 2 1
## 4 3 3
## 5 3 1
## 6 4 1
## 7 3 1
## 8 4 1
## 9 3 1
## 10 4 1
## # … with 2,742 more rows
Part I. Chi-squared Test (2 points) 1) The correct choice of variables. A plot with the two variables involved (0.5 points)
library(ggplot2)
# Convert the variables to factors
greece_2$polintr <- factor(greece_2$polintr)
greece_2$vote <- factor(greece_2$vote)
# Create a table with the counts for each combination of polintr and vote
table_data <- table(greece_2$polintr, greece_2$vote)
# Convert the table to a data frame
table_df <- as.data.frame.matrix(table_data)
# Add row names as a variable
table_df$polintr <- rownames(table_df)
# Reshape the data from wide to long format
table_long <- reshape2::melt(table_df, id.vars = "polintr")
# Create the stacked bar chart
ggplot(table_long, aes(x = polintr, y = value, fill = variable)) +
geom_bar(stat = "identity") +
labs(x = "Political Interest", y = "Count", fill = "Vote")
2) The null hypothesis is spelled out, and you make conclusions as to how the results relate to it (0.5)
Vote categories: 1 = Yes, 2 = No, 3 = Not eligible to vote
polintr categories: 1 = Very interested, 2 = Quite interested, 3 = Hardly interested, 4 = Not at all interested
We are interested in examining the association between political interest (polintr) and voting behavior (vote) among Greek citizens. Specifically, we want to test whether there is a significant difference in the distribution of votes among individuals with different levels of political interest. We can formulate the following null and alternative hypotheses:
Null hypothesis (H0): The distribution of votes is the same across all levels of political interest. Alternative hypothesis (HA): The distribution of votes is different across at least one pair of levels of political interest.
# Perform chi-squared test of independence
chisq_res <- chisq.test(table_data)
## Warning in chisq.test(table_data): Chi-squared approximation may be
## incorrect
# Print the test result
chisq_res
##
## Pearson's Chi-squared test
##
## data: table_data
## X-squared = 76.519, df = 6, p-value = 1.867e-14
We can reject the null hypothesis (p-value = 1.867e-14 < 0.05) and conclude that there is a significant difference in the distribution of votes across at least one pair of levels of political interest.
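The warning above indicates that the chi-squared approximation may be unreliable, most likely because at least one expected cell count is small. As a quick cross-check (a sketch, not part of the graded output), we can inspect the expected counts and rerun the test with a simulated p-value, which does not rely on the asymptotic approximation:
# Expected counts under independence; the approximation is doubtful when any are below 5
chisq_res$expected
# Monte Carlo version of the chi-squared test based on B simulated tables
chisq.test(table_data, simulate.p.value = TRUE, B = 10000)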
library(gmodels)
CrossTable(table_data, expected=T)
## Warning in chisq.test(t, correct = FALSE, ...): Chi-squared approximation
## may be incorrect
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Expected N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 2752
##
##
## |
## | 1 | 2 | 3 | Row Total |
## -------------|-----------|-----------|-----------|-----------|
## 1 | 127 | 5 | 1 | 133 |
## | 112.605 | 17.253 | 3.141 | |
## | 1.840 | 8.702 | 1.460 | |
## | 0.955 | 0.038 | 0.008 | 0.048 |
## | 0.055 | 0.014 | 0.015 | |
## | 0.046 | 0.002 | 0.000 | |
## -------------|-----------|-----------|-----------|-----------|
## 2 | 654 | 59 | 4 | 717 |
## | 607.053 | 93.012 | 16.935 | |
## | 3.631 | 12.437 | 9.880 | |
## | 0.912 | 0.082 | 0.006 | 0.261 |
## | 0.281 | 0.165 | 0.062 | |
## | 0.238 | 0.021 | 0.001 | |
## -------------|-----------|-----------|-----------|-----------|
## 3 | 748 | 102 | 23 | 873 |
## | 739.132 | 113.249 | 20.620 | |
## | 0.106 | 1.117 | 0.275 | |
## | 0.857 | 0.117 | 0.026 | 0.317 |
## | 0.321 | 0.286 | 0.354 | |
## | 0.272 | 0.037 | 0.008 | |
## -------------|-----------|-----------|-----------|-----------|
## 4 | 801 | 191 | 37 | 1029 |
## | 871.210 | 133.486 | 24.304 | |
## | 5.658 | 24.781 | 6.632 | |
## | 0.778 | 0.186 | 0.036 | 0.374 |
## | 0.344 | 0.535 | 0.569 | |
## | 0.291 | 0.069 | 0.013 | |
## -------------|-----------|-----------|-----------|-----------|
## Column Total | 2330 | 357 | 65 | 2752 |
## | 0.847 | 0.130 | 0.024 | |
## -------------|-----------|-----------|-----------|-----------|
##
##
## Statistics for All Table Factors
##
##
## Pearson's Chi-squared test
## ------------------------------------------------------------
## Chi^2 = 76.51923 d.f. = 6 p = 1.867254e-14
##
##
##
The result of the chi-square test is a p-value of 1.867254e-14, which is less than the standard significance level of 0.05. Therefore, we reject the null hypothesis and conclude that there is a significant association between political interest and vote.
In this case, the cells with the largest chi-square contributions are in the second column (vote = 2, did not vote). Specifically, for political interest = 1 and 2 the observed counts of non-voters are lower than expected under independence, while for political interest = 4 they are higher than expected. In other words, people who are interested in politics abstain from voting less often than the independence model predicts, while people who are not at all interested in politics abstain more often than predicted.
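A quick way to confirm this reading (a sketch using the chisq_res object stored earlier) is to print the standardized residuals; cells with values beyond roughly ±2 deviate notably from what independence would predict:
# Standardized residuals from the stored chi-squared test object
round(chisq_res$stdres, 2)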
Part II. The t-test (2.5 points) 4) The correct choice of variables, a plot with these variables (0.5)
greece_3 <- greece %>% select(netustm, vote)
greece_3 <- na.omit(greece_3)
greece_3$netustm <- as.numeric(greece_3$netustm)
greece_3$vote <- as.numeric(greece_3$vote)
greece_3
## # A tibble: 1,991 × 2
## netustm vote
## <dbl> <dbl>
## 1 60 1
## 2 240 1
## 3 120 3
## 4 60 1
## 5 120 1
## 6 120 1
## 7 120 1
## 8 480 1
## 9 190 1
## 10 300 2
## # … with 1,981 more rows
netustm - internet use in minutes
vote - whether the respondent voted in the last national elections
# Create a vector for each vote category (1, 2, 3)
vote_1 <- subset(greece_3, vote == 1)$netustm
vote_2 <- subset(greece_3, vote == 2)$netustm
vote_3 <- subset(greece_3, vote == 3)$netustm #We will not use it
ggplot() + labs (title = "Internet use in minutes vs Voting", x="Did people vote on last national elections?", y="") +
geom_boxplot(aes(x='Yes', y=vote_1), fill="tomato1") +
geom_boxplot(aes(x='No', y=vote_2), fill="purple4") +
theme_bw()
5) You have checked the normality assumption for the t-test in 2
different ways (QQ plots / histogram / skew and kurtosis) (0.5)
shapiro.test(vote_1)
##
## Shapiro-Wilk normality test
##
## data: vote_1
## W = 0.83022, p-value < 2.2e-16
shapiro.test(vote_2)
##
## Shapiro-Wilk normality test
##
## data: vote_2
## W = 0.8899, p-value = 7.288e-13
var.test(vote_1, vote_2)
##
## F test to compare two variances
##
## data: vote_1 and vote_2
## F = 1.0064, num df = 1670, denom df = 261, p-value = 0.9645
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.8312979 1.2030520
## sample estimates:
## ratio of variances
## 1.006375
# QQ plot for vote_1
qqnorm(vote_1)
qqline(vote_1)
# Histogram for vote_1
hist(vote_1)
# QQ plot for vote_2
qqnorm(vote_2)
qqline(vote_2)
# Histogram for vote_2
hist(vote_2)
library(moments)
# Calculate skewness and kurtosis for vote_1
skewness(vote_1)
## [1] 1.87265
kurtosis(vote_1)
## [1] 8.054205
# Calculate skewness and kurtosis for vote_2
skewness(vote_2)
## [1] 1.2126
kurtosis(vote_2)
## [1] 4.260232
For vote_1, the skewness value is 1.87265 which is greater than zero, indicating a positively skewed distribution. The kurtosis value of 8.054205 indicates that the distribution is highly leptokurtic, meaning it has a sharp peak and heavy tails. These results suggest that the distribution of vote_1 may not be normal.
For vote_2, the skewness value is 1.2126 which is also greater than zero, indicating a positively skewed distribution. The kurtosis value of 4.260232 indicates that the distribution is moderately leptokurtic, meaning it has a relatively sharp peak and moderately heavy tails. These results also suggest that the distribution of vote_2 may not be normal.
Overall, neither vote_1 nor vote_2 appears to have a normal distribution based on their skewness and kurtosis values.
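Given how strongly right-skewed netustm is, one optional check (a sketch, not required by the assignment) is whether a log transform brings the two groups closer to normality before deciding between the t-test and a rank-based test:
# Log-transform (log1p handles possible zeros) and recompute skewness
skewness(log1p(vote_1))
skewness(log1p(vote_2))
# Visual check on the transformed scale
hist(log1p(vote_1), main = "Voters: log(1 + minutes online)")
hist(log1p(vote_2), main = "Non-voters: log(1 + minutes online)")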
Null hypothesis: There is no significant difference in the mean time spent on the internet between people who voted and those who didn’t vote in the last national elections.
Alternative hypothesis: There is a significant difference in the mean time spent on the internet between people who voted and those who didn’t vote in the last national elections.
t.test(vote_1, vote_2, var.equal = TRUE)
##
## Two Sample t-test
##
## data: vote_1 and vote_2
## t = -3.2426, df = 1931, p-value = 0.001204
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -47.33311 -11.65582
## sample estimates:
## mean of x mean of y
## 190.8414 220.3359
The t-test results show that there is a statistically significant difference between the mean number of minutes spent on the internet by people who voted in the last national elections (vote_1) and those who did not vote (vote_2). The t-statistic value is -3.2426 with a p-value of 0.001204, which is less than the typical threshold of 0.05, indicating strong evidence against the null hypothesis.
The 95% confidence interval for the difference in means is between -47.33311 and -11.65582, which does not include zero, indicating that the difference between the two means is statistically significant. The negative sign of the confidence interval suggests that people who did not vote tend to spend more minutes on the internet compared to those who did vote.
Therefore, we can reject the null hypothesis and conclude that there is a significant difference in the mean number of minutes spent on the internet between people who voted and those who did not vote.
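The p-value tells us the difference is unlikely to be due to chance, but not how large it is. As a rough effect-size sketch (computed by hand from the two groups, not part of the original output), Cohen's d with a pooled standard deviation is:
# Pooled-SD Cohen's d for the voter vs non-voter difference in minutes online
n1 <- length(vote_1); n2 <- length(vote_2)
s_pooled <- sqrt(((n1 - 1) * var(vote_1) + (n2 - 1) * var(vote_2)) / (n1 + n2 - 2))
(mean(vote_1) - mean(vote_2)) / s_pooled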
wilcox.test(vote_1, vote_2, alternative = "two.sided")
##
## Wilcoxon rank sum test with continuity correction
##
## data: vote_1 and vote_2
## W = 183930, p-value = 2.664e-05
## alternative hypothesis: true location shift is not equal to 0
The test resulted in a W value of 183930 and a p-value of 2.664e-05. The p-value is less than the significance level of 0.05, indicating that we can reject the null hypothesis of no difference between the two groups. Therefore, we can conclude that there is a significant difference in the internet use in minutes between people who voted on the last national election and those who did not vote.
We can conclude that people who do not vote spend more time on the internet, on average, than those who do vote.
Part III. ANOVA (3.5 points) 8) The correct choice of variables, a boxplot with these variables (0.5)
greece_4 <- greece %>% select(polintr, netustm)
greece_4 <- na.omit(greece_4)
greece_4$polintr <- as.factor(greece_4$polintr)
greece_4$netustm <- as.numeric(as.character(greece_4$netustm))
greece_4
## # A tibble: 2,019 × 2
## polintr netustm
## <fct> <dbl>
## 1 3 60
## 2 2 240
## 3 3 120
## 4 3 60
## 5 3 120
## 6 3 120
## 7 1 120
## 8 2 480
## 9 3 190
## 10 3 300
## # … with 2,009 more rows
polintr - interest in politics: 1 = Very interested, 2 = Quite interested, 3 = Hardly interested, 4 = Not at all interested
netustm - internet use in minutes
# Create a vector for each polit. interest category (1, 2, 3,4)
pol_1 <- subset(greece_4, polintr == 1)$netustm
pol_2 <- subset(greece_4, polintr == 2)$netustm
pol_3 <- subset(greece_4, polintr == 3)$netustm
pol_4 <- subset(greece_4, polintr == 4)$netustm
ggplot()+
geom_boxplot(data = greece_4, aes(x = polintr, y = netustm), fill="green3", col="purple", alpha = 0.5) +
ylim(c(0,1000)) +
xlab("How interested are people in politics?") +
ylab("Internet use in minutes ") +
ggtitle("Internet use in minutes vs Political interest")
## Warning: Removed 1 rows containing non-finite values (`stat_boxplot()`).
Null hypothesis: the mean amount of time spent on the Internet is the same across all levels of political interest.
library(kableExtra)
##
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
##
## group_rows
library(psych)
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
describeBy(greece_4$netustm, greece_4$polintr, mat = TRUE) %>% #create dataframe
select(polintr = group1, N=n, Mean=mean, SD=sd, Median=median, Min=min, Max=max,
Skew=skew, Kurtosis=kurtosis, st.error = se) %>%
kable(align=c("lrrrrrrrr"), digits=2, row.names = FALSE,
caption="Political interests") %>%
kable_styling(bootstrap_options=c("bordered", "responsive","striped"), full_width = FALSE)
| polintr | N | Mean | SD | Median | Min | Max | Skew | Kurtosis | st.error |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 96 | 211.70 | 117.44 | 180 | 30 | 600 | 1.03 | 0.95 | 11.99 |
| 2 | 502 | 193.35 | 131.56 | 150 | 20 | 1200 | 1.94 | 7.40 | 5.87 |
| 3 | 660 | 181.22 | 127.12 | 150 | 1 | 960 | 1.94 | 5.55 | 4.95 |
| 4 | 761 | 207.33 | 149.58 | 180 | 0 | 900 | 1.53 | 2.34 | 5.42 |
par(mar = c(3,10,0,3))
barplot(table(greece_4$polintr)/nrow(greece_4)*100, horiz = T, xlim = c(0,60), las = 2)
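For reference, the same group shares can be printed as percentages (a one-line sketch mirroring the bar plot above):
# Percentage of respondents in each political-interest category
round(prop.table(table(greece_4$polintr)) * 100, 1)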
Based on these descriptive statistics, the group sizes are large enough for the samples to be compared, although the "Very interested" group (N = 96) is noticeably smaller than the other three.
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
## The following object is masked from 'package:dplyr':
##
## recode
leveneTest(greece_4$netustm ~ greece_4$polintr)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 3 3.3025 0.01956 *
## 2015
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The variances are not equal (Pr < 0.05), so we can reject the null hypothesis of equality of variances and should therefore use a test that does not assume equal variances.
oneway.test(greece_4$netustm ~ greece_4$polintr, var.equal = F)
##
## One-way analysis of means (not assuming equal variances)
##
## data: greece_4$netustm and greece_4$polintr
## F = 4.9501, num df = 3.00, denom df = 432.13, p-value = 0.002174
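To complement the Welch test with an effect-size estimate, eta squared can be computed from the sums of squares of a plain aov fit. This is a sketch that ignores the unequal-variance issue and is not part of the original output:
# Eta squared = between-group sum of squares / total sum of squares
fit <- aov(netustm ~ polintr, data = greece_4)
ss <- summary(fit)[[1]][["Sum Sq"]]
ss[1] / sum(ss)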
# Note: the correction is chosen via `p.adjust.method`; the original `adjust` argument
# is not recognized, so the default Holm correction was applied (as the output shows).
pairwise.t.test(greece_4$netustm, greece_4$polintr,
p.adjust.method = "holm")
##
## Pairwise comparisons using t tests with pooled SD
##
## data: greece_4$netustm and greece_4$polintr
##
## 1 2 3
## 2 0.457 - -
## 3 0.207 0.402 -
## 4 0.768 0.302 0.002
##
## P value adjustment method: holm
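Because Levene's test pointed to unequal variances, a variant of the pairwise comparisons that does not pool the standard deviation may be more consistent with the Welch ANOVA above. A sketch, not part of the original output:
# Pairwise Welch-style t-tests (no pooled SD), with the Bonferroni correction originally intended
pairwise.t.test(greece_4$netustm, greece_4$polintr,
p.adjust.method = "bonferroni", pool.sd = FALSE)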
library(sjPlot)
## Install package "strengejacke" from GitHub (`devtools::install_github("strengejacke/strengejacke")`) to load all sj-packages at once!
plot_grpfrq(greece_4$netustm, greece_4$polintr, type = "box")
## Warning: The `fun.y` argument of `stat_summary()` is deprecated as of ggplot2 3.3.0.
## ℹ Please use the `fun` argument instead.
## ℹ The deprecated feature was likely used in the sjPlot package.
## Please report the issue at <https://github.com/strengejacke/sjPlot/issues>.
## Warning in rq.fit.br(wx, wy, tau = tau, ...): Solution may be nonunique
## Warning in rq.fit.br(wx, wy, tau = tau, ...): Solution may be nonunique
## Warning in rq.fit.br(wx, wy, tau = tau, ...): Solution may be nonunique
## Warning in rq.fit.br(wx, wy, tau = tau, ...): Solution may be nonunique
## Warning in rq.fit.br(wx, wy, tau = tau, ...): Solution may be nonunique
## Warning in rq.fit.br(wx, wy, tau = tau, ...): Solution may be nonunique
## Warning in rq.fit.br(wx, wy, tau = tau, ...): Solution may be nonunique
It can be concluded that those who are Very interested, Quite interested, and Not at all interested spend the most time on the Internet, while those who are Hardly interested in politics spend the least. According to the pairwise comparisons, however, only the difference between the "Hardly interested" and "Not at all interested" groups is statistically significant (p = 0.002).
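Finally, since netustm is strongly skewed, a nonparametric cross-check analogous to the Wilcoxon test used in Part II (a sketch, not part of the graded output) is the Kruskal-Wallis test:
# Rank-based comparison of internet use across the four political-interest groups
kruskal.test(netustm ~ polintr, data = greece_4)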