Let’s load our data and keep only the variables we need

ESS1 <- select(ESS, c("nwspol", "prtvtcil", "psppipla", "ppltrst", "polintr", "stfgov", "pbldmn", "vote", "contplt", "gndr", "icpart1"))

Constructing chi-square test

For the chi-square test we chose gender (var. gndr) and participation in public demonstrations (var. pbldmn). Both variables are nominal.

H0 - there is no relationship between the gender of respondents and participation in public demonstrations
H1 - a relationship exists

table(ESS1$gndr, ESS1$pbldmn)

##         
##           Yes   No
##   Male    151 1074
##   Female   89 1239

Table <- matrix(c(151, 89, 1074, 1239), nrow=2)

row.names(Table) <- c("Male","Female")

colnames(Table) <- c("Yes", "No")

Table

##        Yes   No
## Male   151 1074
## Female  89 1239

chisq.test(Table)

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  Table
## X-squared = 23.014, df = 1, p-value = 1.608e-06

knitr::kable(Table)

	Yes	No
Male	151	1074
Female	89	1239

The probability of obtaining a chi-square statistic of 23.01 or larger, if there were no association between the variables in the population, is 1.608e-06. Since the p-value (1.608e-06) is below the significance level (0.05), we reject H0 and conclude that there is a relationship between the gender of respondents and their participation in demonstrations.
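To make the test statistic less of a black box, here is a minimal sketch that recomputes it by hand from the four counts in the table. The object names (observed, expected, x2_yates) are ours and do not come from the analysis itself; the only inputs are the counts printed above.

```r
# Observed counts copied from the contingency table above
observed <- matrix(c(151, 89, 1074, 1239), nrow = 2,
                   dimnames = list(c("Male", "Female"), c("Yes", "No")))

# Expected counts under independence: row total * column total / N
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Yates' continuity correction subtracts 0.5 from each |O - E| before squaring
x2_yates <- sum((abs(observed - expected) - 0.5)^2 / expected)

# Degrees of freedom and upper-tail p-value
df <- (nrow(observed) - 1) * (ncol(observed) - 1)
p_value <- pchisq(x2_yates, df = df, lower.tail = FALSE)

round(x2_yates, 3)  # should agree with the X-squared reported by chisq.test()
p_value
```

The result should reproduce the X-squared and p-value printed by chisq.test(Table) up to rounding.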

set_theme(
  geom.outline.size = 0.3, 
  geom.label.size = 4,
  geom.label.color = "black",
  axis.angle.x = 45, 
  base = theme_bw()
)

sjp.xtab(ESS1$gndr, ESS1$pbldmn, margin = "row", bar.pos = "stack", show.summary = TRUE, coord.flip = TRUE, geom.colors = c("#CD423F", "#F5C6AC"))

Specifically, most Israelis do not participate in demonstrations; however, men participate more often than women. Only 6.7% of women participate in demonstrations, compared with 12.3% of men. Correspondingly, 93.3% of women do not participate, against 87.7% of men.
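The row percentages shown in the plot can also be checked directly from the counts. This is a small sketch using base R only; the name row_pct is ours.

```r
# Observed counts copied from the contingency table above
observed <- matrix(c(151, 89, 1074, 1239), nrow = 2,
                   dimnames = list(c("Male", "Female"), c("Yes", "No")))

# margin = 1 gives proportions within each row (i.e. within each gender)
row_pct <- round(100 * prop.table(observed, margin = 1), 1)
row_pct
```

Each row sums to 100, so the "Yes" column gives the participation rate for each gender.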

firstchi <- chisq.test(Table)

knitr::kable(firstchi$stdres)

	Yes	No
Male	4.865195	-4.865195
Female	-4.865195	4.865195

A standardized residual below -2 in the table above means that:

- the cell contains fewer observations than expected under independence of the variables.

A standardized residual above 2 means that:

- the cell contains more observations than expected under independence.

More men participate in demonstrations than expected under independence (std. res. = 4.87), and more women do not participate than expected (std. res. = 4.87).
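For readers who want to see where these numbers come from, the sketch below recomputes the standardized residuals from the counts alone, using the usual formula (O - E) / sqrt(E * (1 - row share) * (1 - column share)). The object names are ours; note that no continuity correction enters the residuals.

```r
# Observed counts copied from the contingency table above
observed <- matrix(c(151, 89, 1074, 1239), nrow = 2,
                   dimnames = list(c("Male", "Female"), c("Yes", "No")))

n <- sum(observed)
expected <- outer(rowSums(observed), colSums(observed)) / n

# Row and column shares of the grand total
row_prop <- rowSums(observed) / n
col_prop <- colSums(observed) / n

# Standardized residual: raw residual divided by its estimated standard error
std_res <- (observed - expected) /
  sqrt(expected * outer(1 - row_prop, 1 - col_prop))

round(std_res, 3)
```

In a 2x2 table all four residuals share the same absolute value, which is why the table shows only 4.865195 with alternating signs.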

Now we need to plot the results:

assocplot(t(Table), main="Residuals and number of observations")

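The assocplot() function shows the deviations from independence: the height of each bar is proportional to the Pearson residual of the cell, and its width to the square root of the expected count.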

corrplot(firstchi$stdres, is.cor = FALSE)

Positive residuals are in blue
Negative residuals are in red

Let’s draw a stacked barplot with two variables!

counts = table(ESS1$pbldmn, ESS1$gndr)

barplot(counts, col=brewer.pal(n = 3, name = "PuRd"), legend = rownames(counts), las = 2, main = "Do you participate in demonstrations?")

We can conclude that there is a relationship between sex and participation in demonstrations. People in Israel tend not to attend demonstrations, but men do so more often than women.

Now we are going to perform an independent-samples (unpaired) t-test.

H0: males and females spend an equal amount of time (in minutes) watching news
H1: there is a difference in time

Let us inspect our data and perform descriptive statistics by groups.

ESS1$nwspol <- as.numeric(as.character(ESS1$nwspol))

describeBy(ESS1, ESS1$gndr)

## 
##  Descriptive statistics by group 
## group: Male
##           vars    n  mean    sd median trimmed   mad min max range  skew
## nwspol       1 1223 86.22 94.65     60   70.24 74.13   0 600   600  2.31
## prtvtcil*    2  853  5.51  3.81      5    5.31  4.45   1  14    13  0.38
## psppipla*    3 1189  2.02  1.01      2    1.90  1.48   1   5     4  0.77
## ppltrst*     4 1208  6.42  2.43      7    6.59  1.48   1  11    10 -0.51
## polintr*     5 1224  2.43  1.07      2    2.42  1.48   1   4     3  0.21
## stfgov*      6 1197  5.18  2.58      5    5.17  2.97   1  11    10  0.00
## pbldmn*      7 1225  1.88  0.33      2    1.97  0.00   1   2     1 -2.29
## vote*        8 1217  1.23  0.50      1    1.11  0.00   1   3     2  2.16
## contplt*     9 1224  1.83  0.37      2    1.92  0.00   1   2     1 -1.80
## gndr*       10 1227  1.00  0.00      1    1.00  0.00   1   1     0   NaN
## icpart1*    11 1220  1.33  0.47      1    1.28  0.00   1   2     1  0.73
##           kurtosis   se
## nwspol        7.84 2.71
## prtvtcil*    -1.32 0.13
## psppipla*    -0.06 0.03
## ppltrst*     -0.20 0.07
## polintr*     -1.22 0.03
## stfgov*      -0.91 0.07
## pbldmn*       3.24 0.01
## vote*         3.81 0.01
## contplt*      1.25 0.01
## gndr*          NaN 0.00
## icpart1*     -1.46 0.01
## -------------------------------------------------------- 
## group: Female
##           vars    n  mean    sd median trimmed   mad min max range  skew
## nwspol       1 1324 76.09 95.74     60   57.90 66.72   0 960   960  3.06
## prtvtcil*    2  935  5.19  3.68      4    4.92  2.97   1  15    14  0.54
## psppipla*    3 1283  1.88  0.94      2    1.76  1.48   1   5     4  0.88
## ppltrst*     4 1325  6.49  2.26      6    6.63  2.97   1  11    10 -0.45
## polintr*     5 1326  2.68  1.00      3    2.72  1.48   1   4     3 -0.08
## stfgov*      6 1270  4.94  2.55      5    4.89  2.97   1  11    10  0.11
## pbldmn*      7 1328  1.93  0.25      2    2.00  0.00   1   2     1 -3.46
## vote*        8 1320  1.21  0.48      1    1.10  0.00   1   3     2  2.22
## contplt*     9 1330  1.89  0.31      2    1.99  0.00   1   2     1 -2.48
## gndr*       10 1330  2.00  0.00      2    2.00  0.00   2   2     0   NaN
## icpart1*    11 1325  1.38  0.48      1    1.35  0.00   1   2     1  0.51
##           kurtosis   se
## nwspol       14.34 2.63
## prtvtcil*    -1.11 0.12
## psppipla*     0.12 0.03
## ppltrst*     -0.03 0.06
## polintr*     -1.13 0.03
## stfgov*      -0.83 0.07
## pbldmn*       9.97 0.01
## vote*         4.20 0.01
## contplt*      4.16 0.01
## gndr*          NaN 0.00
## icpart1*     -1.74 0.01

Now it’s time to create boxplots of the continuous variable “nwspol” (time spent watching news), split into 2 groups: males and females.

ggplot(ESS1, aes(x = gndr, y = nwspol)) +
  geom_boxplot() +
  stat_summary(fun = mean, geom = "point", shape = 4, size = 4) +  # fun replaces the deprecated fun.y
  theme_classic() +
  ggtitle("Time of News Watching by Sex of Respondent")

## Warning: Removed 10 rows containing non-finite values (stat_boxplot).

## Warning: Removed 10 rows containing non-finite values (stat_summary).

Let’s check the normality assumption for the t-test, using a Q-Q plot, a histogram, skewness and kurtosis.

Q-Q plot

qqnorm(ESS1$nwspol); qqline(ESS1$nwspol, col= 2)

Histogram

ESS1$gndr<- relevel(ESS1$gndr, ref = "Female")

ggplot(ESS1, aes(x = nwspol, col = gndr, fill = gndr)) +
  geom_histogram(aes(y = ..density..), alpha = 0.5) +
  facet_grid(. ~ gndr) +
  ggtitle("Watching news by Sex")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 10 rows containing non-finite values (stat_bin).

As can be seen from the two graphs, our data are far from normally distributed. However, we have more than a thousand observations in each group, so by the central limit theorem the sampling distribution of the group means is approximately normal, and non-normality is not much of a concern.

Looking at our descriptive statistics, the distributions of “nwspol” are leptokurtic for both men and women (the female group shows higher kurtosis), as K > 0 for both.

Both distributions are also positively skewed (2.31 for males and 3.06 for females).
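The large-sample argument can be illustrated with a small simulation. This is only a sketch under stated assumptions: the exponential distribution with a mean of 80 minutes is a hypothetical stand-in for a skewed variable like nwspol, not a model fitted to the ESS data, and the seed and number of replications are arbitrary.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

# Sample skewness (third standardized moment)
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

# Hypothetical skewed "population": exponential with mean 80 (assumption)
one_sample <- rexp(1200, rate = 1 / 80)

# Means of 2000 simulated samples, each of size 1200
sample_means <- replicate(2000, mean(rexp(1200, rate = 1 / 80)))

skew_raw <- skewness(one_sample)      # strongly skewed raw values
skew_means <- skewness(sample_means)  # sample means: close to symmetric
c(raw = skew_raw, means = skew_means)
```

The raw values stay heavily skewed, while the simulated means are nearly symmetric, which is what the t-test relies on.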

As we have 2 unpaired samples, we should check the equality of variances across groups. However, t.test() in R uses var.equal = FALSE by default, that is, it runs the Welch test, which does not assume equal variances.

Therefore, we can go straight to the t-test.

t.test(ESS1$nwspol ~ as.factor(ESS1$gndr))

## 
##  Welch Two Sample t-test
## 
## data:  ESS1$nwspol by as.factor(ESS1$gndr)
## t = -2.6835, df = 2533.3, p-value = 0.007333
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -17.531258  -2.727624
## sample estimates:
## mean in group Female   mean in group Male 
##             76.09215             86.22159

Conclusion: the p-value is less than 0.05, so H0 should be rejected. There is a statistically significant difference in time spent watching news between the two sexes: men spend about 86 minutes on average, women about 76 minutes.
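As a cross-check, the Welch statistic can be approximately reproduced from the group summaries alone. The sketch below uses the rounded n, mean and sd values from the describeBy() output, so small differences in the last digits compared with t.test() are expected; all variable names are ours.

```r
# Group summaries for nwspol, copied (rounded) from the describeBy() output
n_m <- 1223; mean_m <- 86.22; sd_m <- 94.65   # Male
n_f <- 1324; mean_f <- 76.09; sd_f <- 95.74   # Female

# Squared standard error of each group mean
v_m <- sd_m^2 / n_m
v_f <- sd_f^2 / n_f

# Welch t statistic (Female minus Male, matching the t.test() output order)
se_diff <- sqrt(v_f + v_m)
t_stat <- (mean_f - mean_m) / se_diff

# Welch-Satterthwaite degrees of freedom
df_welch <- (v_f + v_m)^2 / (v_f^2 / (n_f - 1) + v_m^2 / (n_m - 1))

# Two-sided p-value and 95% confidence interval for the difference
p_value <- 2 * pt(-abs(t_stat), df = df_welch)
ci <- (mean_f - mean_m) + c(-1, 1) * qt(0.975, df_welch) * se_diff

c(t = t_stat, df = df_welch, p = p_value)
ci
```

Because the variances are estimated separately for each group, no pooled variance appears anywhere in the calculation.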

Double checking our results with a non-parametric test: Mann-Whitney-Wilcoxon test for 2 independent samples

wilcox.test(nwspol ~ gndr, data = ESS1)

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  nwspol by gndr
## W = 739570, p-value = 0.0001377
## alternative hypothesis: true location shift is not equal to 0

Conclusion: the Wilcoxon rank sum statistic is W = 739570 (p < 0.001), which means that males and females spend different amounts of time watching news, and this difference is statistically significant.
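One detail worth keeping in mind: the W reported by wilcox.test() depends on which level of gndr comes first, so releveling the factor (as was done before the histogram) changes W but not the p-value. The sketch below shows this on hypothetical toy data without ties; the vectors a and b are made up and are not taken from the ESS sample.

```r
# Hypothetical toy data (no ties), not taken from the ESS sample
a <- c(10, 25, 40, 55, 90)
b <- c(5, 15, 20, 30)

# W counts the pairs in which the first group's value exceeds the second's
w_ab <- unname(wilcox.test(a, b)$statistic)  # a as the first group
w_ba <- unname(wilcox.test(b, a)$statistic)  # b as the first group

# The two statistics are complementary: they sum to n_a * n_b
c(w_ab = w_ab, w_ba = w_ba, n_product = length(a) * length(b))

# The p-value is the same for both orderings
p_ab <- wilcox.test(a, b)$p.value
p_ba <- wilcox.test(b, a)$p.value
```

So two analysts can report different W values for the same data and still reach an identical conclusion.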

Project 2

AD riders
