library(haven)
ESS10 <- read_sav("/Users/DP/OneDrive/Документы/ESS10.sav")
Step 0 - clean the dataset and keep only the variables we need
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
greece <- filter(ESS10, cntry == "GR")
Let’s describe relationships between the following pairs of variables:
polintr and vote. I expect to find out whether there is a relationship between the level of political interest (polintr) and whether individuals voted in the last national elections (vote).
greece_2 <- greece %>% select(polintr, vote)
greece_2 <- na.omit(greece_2)
greece_2
## # A tibble: 2,752 × 2
## polintr vote
## <dbl+lbl> <dbl+lbl>
## 1 3 [Hardly interested] 1 [Yes]
## 2 3 [Hardly interested] 1 [Yes]
## 3 2 [Quite interested] 1 [Yes]
## 4 3 [Hardly interested] 3 [Not eligible to vote]
## 5 3 [Hardly interested] 1 [Yes]
## 6 4 [Not at all interested] 1 [Yes]
## 7 3 [Hardly interested] 1 [Yes]
## 8 4 [Not at all interested] 1 [Yes]
## 9 3 [Hardly interested] 1 [Yes]
## 10 4 [Not at all interested] 1 [Yes]
## # … with 2,742 more rows
greece_2$polintr <- as.numeric(greece_2$polintr)
greece_2$vote <- as.numeric(greece_2$vote)
greece_2
## # A tibble: 2,752 × 2
## polintr vote
## <dbl> <dbl>
## 1 3 1
## 2 3 1
## 3 2 1
## 4 3 3
## 5 3 1
## 6 4 1
## 7 3 1
## 8 4 1
## 9 3 1
## 10 4 1
## # … with 2,742 more rows
Part I. Chi-squared Test (2 points) 1) The correct choice of variables. A plot with the two variables involved (0.5 points)
library(ggplot2)
# Convert the variables to factors
greece_2$polintr <- factor(greece_2$polintr)
greece_2$vote <- factor(greece_2$vote)
# Create a table with the counts for each combination of polintr and vote
table_data <- table(greece_2$polintr, greece_2$vote)
# Convert the table to a data frame
table_df <- as.data.frame.matrix(table_data)
# Add row names as a variable
table_df$polintr <- rownames(table_df)
# Reshape the data from wide to long format
table_long <- reshape2::melt(table_df, id.vars = "polintr")
# Create the stacked bar chart
ggplot(table_long, aes(x = polintr, y = value, fill = variable)) +
geom_bar(stat = "identity") +
labs(x = "Political Interest", y = "Count", fill = "Vote")
2) The null hypothesis is spelled out, and you make conclusions as to how the results relate to it (0.5)
Vote categories: 1 = Yes, 2 = No, 3 = Not eligible to vote
polintr categories: 1 = Very interested, 2 = Quite interested, 3 = Hardly interested, 4 = Not at all interested
We are interested in examining the association between political interest (polintr) and voting behavior (vote) among Greek citizens. Specifically, we want to test whether there is a significant difference in the distribution of votes among individuals with different levels of political interest. We can formulate the following null and alternative hypotheses:
Null hypothesis (H0): The distribution of votes is the same across all levels of political interest. Alternative hypothesis (HA): The distribution of votes is different across at least one pair of levels of political interest.
# Perform chi-squared test of independence
chisq_res <- chisq.test(table_data)
## Warning in chisq.test(table_data): Chi-squared approximation may be
## incorrect
# Print the test result
chisq_res
##
## Pearson's Chi-squared test
##
## data: table_data
## X-squared = 76.519, df = 6, p-value = 1.867e-14
We can reject the null hypothesis (p-value = 1.867e-14 < 0.05) and conclude that there is a significant difference in the distribution of votes across at least one pair of levels of political interest.
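The warning above indicates that the chi-squared approximation may be unreliable, most likely because at least one expected cell count is small. As a quick cross-check (a sketch, not part of the graded output), we can inspect the expected counts and rerun the test with a simulated p-value, which does not rely on the asymptotic approximation:
# Expected counts under independence; the approximation is doubtful when any are below 5
chisq_res$expected
# Monte Carlo version of the chi-squared test based on B simulated tables
chisq.test(table_data, simulate.p.value = TRUE, B = 10000)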
library(gmodels)
CrossTable(table_data, expected=T)
## Warning in chisq.test(t, correct = FALSE, ...): Chi-squared approximation
## may be incorrect
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Expected N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 2752
##
##
## |
## | 1 | 2 | 3 | Row Total |
## -------------|-----------|-----------|-----------|-----------|
## 1 | 127 | 5 | 1 | 133 |
## | 112.605 | 17.253 | 3.141 | |
## | 1.840 | 8.702 | 1.460 | |
## | 0.955 | 0.038 | 0.008 | 0.048 |
## | 0.055 | 0.014 | 0.015 | |
## | 0.046 | 0.002 | 0.000 | |
## -------------|-----------|-----------|-----------|-----------|
## 2 | 654 | 59 | 4 | 717 |
## | 607.053 | 93.012 | 16.935 | |
## | 3.631 | 12.437 | 9.880 | |
## | 0.912 | 0.082 | 0.006 | 0.261 |
## | 0.281 | 0.165 | 0.062 | |
## | 0.238 | 0.021 | 0.001 | |
## -------------|-----------|-----------|-----------|-----------|
## 3 | 748 | 102 | 23 | 873 |
## | 739.132 | 113.249 | 20.620 | |
## | 0.106 | 1.117 | 0.275 | |
## | 0.857 | 0.117 | 0.026 | 0.317 |
## | 0.321 | 0.286 | 0.354 | |
## | 0.272 | 0.037 | 0.008 | |
## -------------|-----------|-----------|-----------|-----------|
## 4 | 801 | 191 | 37 | 1029 |
## | 871.210 | 133.486 | 24.304 | |
## | 5.658 | 24.781 | 6.632 | |
## | 0.778 | 0.186 | 0.036 | 0.374 |
## | 0.344 | 0.535 | 0.569 | |
## | 0.291 | 0.069 | 0.013 | |
## -------------|-----------|-----------|-----------|-----------|
## Column Total | 2330 | 357 | 65 | 2752 |
## | 0.847 | 0.130 | 0.024 | |
## -------------|-----------|-----------|-----------|-----------|
##
##
## Statistics for All Table Factors
##
##
## Pearson's Chi-squared test
## ------------------------------------------------------------
## Chi^2 = 76.51923 d.f. = 6 p = 1.867254e-14
##
##
##
The result of the chi-square test is a p-value of 1.867254e-14, which is less than the standard significance level of 0.05. Therefore, we reject the null hypothesis and conclude that there is a significant association between political interest and vote.
In this case, the cells with the largest chi-square contributions are in the second column (vote = 2, did not vote). Specifically, for political interest = 1 and 2 the observed counts of non-voters are lower than expected under independence, while for political interest = 4 they are higher than expected. In other words, people who are interested in politics abstain from voting less often than the independence model predicts, while people who are not at all interested in politics abstain more often than predicted.
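A quick way to confirm this reading (a sketch using the chisq_res object stored earlier) is to print the standardized residuals; cells with values beyond roughly ±2 deviate notably from what independence would predict:
# Standardized residuals from the stored chi-squared test object
round(chisq_res$stdres, 2)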
Part II. The t-test (2.5 points) 4) The correct choice of variables, a plot with these variables (0.5)
greece_3 <- greece %>% select(netustm, vote)
greece_3 <- na.omit(greece_3)
greece_3$netustm <- as.numeric(greece_3$netustm)
greece_3$vote <- as.numeric(greece_3$vote)
greece_3
## # A tibble: 1,991 × 2
## netustm vote
## <dbl> <dbl>
## 1 60 1
## 2 240 1
## 3 120 3
## 4 60 1
## 5 120 1
## 6 120 1
## 7 120 1
## 8 480 1
## 9 190 1
## 10 300 2
## # … with 1,981 more rows
netustm - internet use in minutes
vote - whether the respondent voted in the last national elections
# Create a vector for each vote category (1, 2, 3)
vote_1 <- subset(greece_3, vote == 1)$netustm
vote_2 <- subset(greece_3, vote == 2)$netustm
vote_3 <- subset(greece_3, vote == 3)$netustm #We will not use it
ggplot() + labs (title = "Internet use in minutes vs Voting", x="Did people vote on last national elections?", y="") +
geom_boxplot(aes(x='Yes', y=vote_1), fill="tomato1") +
geom_boxplot(aes(x='No', y=vote_2), fill="purple4") +
theme_bw()
5) You have checked the normality assumption for the t-test in 2
different ways (QQ plots / histogram / skew and kurtosis) (0.5)
shapiro.test(vote_1)
##
## Shapiro-Wilk normality test
##
## data: vote_1
## W = 0.83022, p-value < 2.2e-16
shapiro.test(vote_2)
##
## Shapiro-Wilk normality test
##
## data: vote_2
## W = 0.8899, p-value = 7.288e-13
var.test(vote_1, vote_2)
##
## F test to compare two variances
##
## data: vote_1 and vote_2
## F = 1.0064, num df = 1670, denom df = 261, p-value = 0.9645
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.8312979 1.2030520
## sample estimates:
## ratio of variances
## 1.006375
# QQ plot for vote_1
qqnorm(vote_1)
qqline(vote_1)
# Histogram for vote_1
hist(vote_1)
# QQ plot for vote_2
qqnorm(vote_2)
qqline(vote_2)
# Histogram for vote_2
hist(vote_2)
library(moments)
# Calculate skewness and kurtosis for vote_1
skewness(vote_1)
## [1] 1.87265
kurtosis(vote_1)
## [1] 8.054205
# Calculate skewness and kurtosis for vote_2
skewness(vote_2)
## [1] 1.2126
kurtosis(vote_2)
## [1] 4.260232
For vote_1, the skewness value is 1.87265 which is greater than zero, indicating a positively skewed distribution. The kurtosis value of 8.054205 indicates that the distribution is highly leptokurtic, meaning it has a sharp peak and heavy tails. These results suggest that the distribution of vote_1 may not be normal.
For vote_2, the skewness value is 1.2126 which is also greater than zero, indicating a positively skewed distribution. The kurtosis value of 4.260232 indicates that the distribution is moderately leptokurtic, meaning it has a relatively sharp peak and moderately heavy tails. These results also suggest that the distribution of vote_2 may not be normal.
Overall, neither vote_1 nor vote_2 appears to have a normal distribution based on their skewness and kurtosis values.
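Given how strongly right-skewed netustm is, one optional check (a sketch, not required by the assignment) is whether a log transform brings the two groups closer to normality before deciding between the t-test and a rank-based test:
# Log-transform (log1p handles possible zeros) and recompute skewness
skewness(log1p(vote_1))
skewness(log1p(vote_2))
# Visual check on the transformed scale
hist(log1p(vote_1), main = "Voters: log(1 + minutes online)")
hist(log1p(vote_2), main = "Non-voters: log(1 + minutes online)")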
Null hypothesis: There is no significant difference in the mean time spent on the internet between people who voted and those who didn’t vote in the last national elections.
Alternative hypothesis: There is a significant difference in the mean time spent on the internet between people who voted and those who didn’t vote in the last national elections.
t.test(vote_1, vote_2, var.equal = TRUE)
##
## Two Sample t-test
##
## data: vote_1 and vote_2
## t = -3.2426, df = 1931, p-value = 0.001204
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -47.33311 -11.65582
## sample estimates:
## mean of x mean of y
## 190.8414 220.3359
The t-test results show that there is a statistically significant difference between the mean number of minutes spent on the internet by people who voted in the last national elections (vote_1) and those who did not vote (vote_2). The t-statistic value is -3.2426 with a p-value of 0.001204, which is less than the typical threshold of 0.05, indicating strong evidence against the null hypothesis.
The 95% confidence interval for the difference in means is between -47.33311 and -11.65582, which does not include zero, indicating that the difference between the two means is statistically significant. The negative sign of the confidence interval suggests that people who did not vote tend to spend more minutes on the internet compared to those who did vote.
Therefore, we can reject the null hypothesis and conclude that there is a significant difference in the mean number of minutes spent on the internet between people who voted and those who did not vote.
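The p-value tells us the difference is unlikely to be due to chance, but not how large it is. As a rough effect-size sketch (computed by hand from the two groups, not part of the original output), Cohen's d with a pooled standard deviation is:
# Pooled-SD Cohen's d for the voter vs non-voter difference in minutes online
n1 <- length(vote_1); n2 <- length(vote_2)
s_pooled <- sqrt(((n1 - 1) * var(vote_1) + (n2 - 1) * var(vote_2)) / (n1 + n2 - 2))
(mean(vote_1) - mean(vote_2)) / s_pooled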
wilcox.test(vote_1, vote_2, alternative = "two.sided")
##
## Wilcoxon rank sum test with continuity correction
##
## data: vote_1 and vote_2
## W = 183930, p-value = 2.664e-05
## alternative hypothesis: true location shift is not equal to 0
The test resulted in a W value of 183930 and a p-value of 2.664e-05. The p-value is less than the significance level of 0.05, indicating that we can reject the null hypothesis of no difference between the two groups. Therefore, we can conclude that there is a significant difference in the internet use in minutes between people who voted on the last national election and those who did not vote.
We can conclude that people who do not vote spend more time on the internet, on average, than those who do vote.
Part III. ANOVA (3.5 points) 8) The correct choice of variables, a boxplot with these variables (0.5)
greece_4 <- greece %>% select(polintr, netustm)
greece_4 <- na.omit(greece_4)
greece_4$polintr <- as.factor(greece_4$polintr)
greece_4$netustm <- as.numeric(as.character(greece_4$netustm))
greece_4
## # A tibble: 2,019 × 2
## polintr netustm
## <fct> <dbl>
## 1 3 60
## 2 2 240
## 3 3 120
## 4 3 60
## 5 3 120
## 6 3 120
## 7 1 120
## 8 2 480
## 9 3 190
## 10 3 300
## # … with 2,009 more rows
polintr - interest in politics: 1 = Very interested, 2 = Quite interested, 3 = Hardly interested, 4 = Not at all interested
netustm - internet use in minutes
# Create a vector for each polit. interest category (1, 2, 3,4)
pol_1 <- subset(greece_4, polintr == 1)$netustm
pol_2 <- subset(greece_4, polintr == 2)$netustm
pol_3 <- subset(greece_4, polintr == 3)$netustm
pol_4 <- subset(greece_4, polintr == 4)$netustm
ggplot()+
geom_boxplot(data = greece_4, aes(x = polintr, y = netustm), fill="green3", col="purple", alpha = 0.5) +
ylim(c(0,1000)) +
xlab("How interested are people in politics?") +
ylab("Internet use in minutes ") +
ggtitle("Internet use in minutes vs Political interest")
## Warning: Removed 1 rows containing non-finite values (`stat_boxplot()`).
Null hypothesis: the mean amount of time spent on the Internet is the same across all levels of political interest.
library(kableExtra)
##
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
##
## group_rows
library(psych)
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
describeBy(greece_4$netustm, greece_4$polintr, mat = TRUE) %>% #create dataframe
select(polintr = group1, N=n, Mean=mean, SD=sd, Median=median, Min=min, Max=max,
Skew=skew, Kurtosis=kurtosis, st.error = se) %>%
kable(align=c("lrrrrrrrr"), digits=2, row.names = FALSE,
caption="Political interests") %>%
kable_styling(bootstrap_options=c("bordered", "responsive","striped"), full_width = FALSE)
| polintr | N | Mean | SD | Median | Min | Max | Skew | Kurtosis | st.error |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 96 | 211.70 | 117.44 | 180 | 30 | 600 | 1.03 | 0.95 | 11.99 |
| 2 | 502 | 193.35 | 131.56 | 150 | 20 | 1200 | 1.94 | 7.40 | 5.87 |
| 3 | 660 | 181.22 | 127.12 | 150 | 1 | 960 | 1.94 | 5.55 | 4.95 |
| 4 | 761 | 207.33 | 149.58 | 180 | 0 | 900 | 1.53 | 2.34 | 5.42 |
par(mar = c(3,10,0,3))
barplot(table(greece_4$polintr)/nrow(greece_4)*100, horiz = T, xlim = c(0,60), las = 2)
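For reference, the same group shares can be printed as percentages (a one-line sketch mirroring the bar plot above):
# Percentage of respondents in each political-interest category
round(prop.table(table(greece_4$polintr)) * 100, 1)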
Based on these descriptive statistics, the group sizes are large enough for the samples to be compared, although the "Very interested" group (N = 96) is noticeably smaller than the other three.
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
## The following object is masked from 'package:dplyr':
##
## recode
leveneTest(greece_4$netustm ~ greece_4$polintr)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 3 3.3025 0.01956 *
## 2015
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The variances are not equal (Pr < 0.05), so we can reject the null hypothesis of equality of variances and should therefore use a test that does not assume equal variances.
oneway.test(greece_4$netustm ~ greece_4$polintr, var.equal = F)
##
## One-way analysis of means (not assuming equal variances)
##
## data: greece_4$netustm and greece_4$polintr
## F = 4.9501, num df = 3.00, denom df = 432.13, p-value = 0.002174
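To complement the Welch test with an effect-size estimate, eta squared can be computed from the sums of squares of a plain aov fit. This is a sketch that ignores the unequal-variance issue and is not part of the original output:
# Eta squared = between-group sum of squares / total sum of squares
fit <- aov(netustm ~ polintr, data = greece_4)
ss <- summary(fit)[[1]][["Sum Sq"]]
ss[1] / sum(ss)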
# Note: the correction is chosen via `p.adjust.method`; the original `adjust` argument
# is not recognized, so the default Holm correction was applied (as the output shows).
pairwise.t.test(greece_4$netustm, greece_4$polintr,
p.adjust.method = "holm")
##
## Pairwise comparisons using t tests with pooled SD
##
## data: greece_4$netustm and greece_4$polintr
##
## 1 2 3
## 2 0.457 - -
## 3 0.207 0.402 -
## 4 0.768 0.302 0.002
##
## P value adjustment method: holm
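Because Levene's test pointed to unequal variances, a variant of the pairwise comparisons that does not pool the standard deviation may be more consistent with the Welch ANOVA above. A sketch, not part of the original output:
# Pairwise Welch-style t-tests (no pooled SD), with the Bonferroni correction originally intended
pairwise.t.test(greece_4$netustm, greece_4$polintr,
p.adjust.method = "bonferroni", pool.sd = FALSE)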
library(sjPlot)
## Install package "strengejacke" from GitHub (`devtools::install_github("strengejacke/strengejacke")`) to load all sj-packages at once!
plot_grpfrq(greece_4$netustm, greece_4$polintr, type = "box")
## Warning: The `fun.y` argument of `stat_summary()` is deprecated as of ggplot2 3.3.0.
## ℹ Please use the `fun` argument instead.
## ℹ The deprecated feature was likely used in the sjPlot package.
## Please report the issue at <https://github.com/strengejacke/sjPlot/issues>.
## Warning in rq.fit.br(wx, wy, tau = tau, ...): Solution may be nonunique
## Warning in rq.fit.br(wx, wy, tau = tau, ...): Solution may be nonunique
## Warning in rq.fit.br(wx, wy, tau = tau, ...): Solution may be nonunique
## Warning in rq.fit.br(wx, wy, tau = tau, ...): Solution may be nonunique
## Warning in rq.fit.br(wx, wy, tau = tau, ...): Solution may be nonunique
## Warning in rq.fit.br(wx, wy, tau = tau, ...): Solution may be nonunique
## Warning in rq.fit.br(wx, wy, tau = tau, ...): Solution may be nonunique
It can be concluded that those who are Very interested, Quite interested, and Not at all interested spend the most time on the Internet, while those who are Hardly interested in politics spend the least. According to the pairwise comparisons, however, only the difference between the "Hardly interested" and "Not at all interested" groups is statistically significant (p = 0.002).
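Finally, since netustm is strongly skewed, a nonparametric cross-check analogous to the Wilcoxon test used in Part II (a sketch, not part of the graded output) is the Kruskal-Wallis test:
# Rank-based comparison of internet use across the four political-interest groups
kruskal.test(netustm ~ polintr, data = greece_4)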