knitr::opts_chunk$set(warning = FALSE, message = FALSE)

~ Introduction

Hi!

Short about us.

Our team-members and their responsibilities:

Anna Moroz - boxplots, checking normality and T-test;
Anna Podolskaya - barplots, collecting all the scripts in one and arranging the final layout;
Anastasia Prokudina - recheck for t-test with non-parametriс ones;
Olga Chemodanova - chi-square test.

Our data:

country, we are interested in is Netherlands;
the data was collected in 2016;
the theme is “Subjective well-being”.

First of all, let’s open all the packages (or install and open, if you do not have some), that we will need:

library(foreign)
library(dplyr)
library(magrittr)
library(ggplot2)
library(lubridate)
library(psych)
library(lsr)
library(vcd)
library(sjPlot)
library(corrplot)
library(knitr)

Our work can be divided on two parts: chi-square test and T-test. For each we’ve created datasets: NL_chi and NL_ttest, and also reduced NA in them.

ESS <- read.spss("ESS8NL.sav", use.value.labels=T, to.data.frame=T)
NL_chi <- ESS %>% select(blgetmg, dscrgrp)
NL_chi = na.omit(NL_chi)
NL_ttest<- ESS %>% select (eduyrs, wkhtot, gndr, yrbrn, happy, crmvct, health)
NL_ttest = na.omit(NL_ttest)

Chi-square test

Stacked barplot with two variables

To start with let’s look at the variables “blgetmg”, which is belonging to minority ethnic group in Netherlands and “dscrgrp” - being a member of a group discriminated against in Netherlands. We are interested in defferences and correlation that may be noticed.

ggplot(NL_chi, aes(x = blgetmg, fill = dscrgrp)) +
  geom_bar(position = "stack") +
  xlab("Belonging to minority ethnic groups") + ylab("Number of observations") +
  labs(fill="Being a member of a group\ndiscriminated against in Netherlands") +
  ggtitle("Number of respondents, who belong to minotity ethnic group and if they belong\nor not to a group discriminated against in Netherlands") +
  theme_bw()

It is clear that number of respondents, who belong to minority ethnic group, is significantly lower than number of those, who do not. If we look closer to red parts - amount of those, who are members of a discriminated group, we may notice, that it is about one third of respondents, who belong to minority ethnic group. In the second bar this propostion is much less. Let’s check it with a chi-square test.

Checking all the assumptions of the chi-square

For that we firstly we should check assumptions of the chi-square. To do it, we construct a table with our variables. Then visualize it.

table <- table(NL_chi)
table

##        dscrgrp
## blgetmg  Yes   No
##     Yes   35   57
##     No    94 1487

#visualize a contingency table
assoc(head(table, 5), shade = TRUE, las=3, xlab = "Being a member of a group discriminated against in Netherlands", ylab = "Belonging to minority ethnic groups")

From the table it is seen that in each сell of contingency table there are at least 5 observations (in fact, there are much more than 5). Each observation contribute to one group only (so the observations are independent) and there are no empty cells. So, the assumptions are met.

Chi-square test and analysing residuals

Let’s observe the difference in proportion of people from Netherlands who belong to minority ethnic group and who is a member of a group discriminated against. It can be seen that the percentage of those who doesn’t belong to minority and is discriminated is small (6%) while the percentage of those who belongs to ethnic minority and is discriminated is higher (38%).

sjp.xtab(NL_chi$blgetmg, NL_chi$dscrgrp, margin = "row", bar.pos = "stack", title = "The proportion of respondents who belong to minotity ethnic group and their belongness to a group discriminated against in Netherlands", axis.titles = "Belonging to minority ethnic groups", legend.title = "Being a member of a group discriminated against in Netherlands", 
         show.summary = TRUE, coord.flip = TRUE)

The hypothesis for the test are:

H0: the variables are not associated;
H1: the variables are associated

We perform chi-square test and look how many observed and expected observations we have. From the test we can see that variables are statistically significantly associated (p-value ~ 0). Also, we can see observed and expected values.

#chi square
chisq.test(table)

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table
## X-squared = 121.4, df = 1, p-value < 2.2e-16

chi <- chisq.test(table)
str(chi)

## List of 9
##  $ statistic: Named num 121
##   ..- attr(*, "names")= chr "X-squared"
##  $ parameter: Named int 1
##   ..- attr(*, "names")= chr "df"
##  $ p.value  : num 3.12e-28
##  $ method   : chr "Pearson's Chi-squared test with Yates' continuity correction"
##  $ data.name: chr "table"
##  $ observed : 'table' int [1:2, 1:2] 35 94 57 1487
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ blgetmg: chr [1:2] "Yes" "No"
##   .. ..$ dscrgrp: chr [1:2] "Yes" "No"
##  $ expected : num [1:2, 1:2] 7.09 121.91 84.91 1459.09
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ blgetmg: chr [1:2] "Yes" "No"
##   .. ..$ dscrgrp: chr [1:2] "Yes" "No"
##  $ residuals: 'table' num [1:2, 1:2] 10.478 -2.527 -3.029 0.731
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ blgetmg: chr [1:2] "Yes" "No"
##   .. ..$ dscrgrp: chr [1:2] "Yes" "No"
##  $ stdres   : 'table' num [1:2, 1:2] 11.2 -11.2 -11.2 11.2
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ blgetmg: chr [1:2] "Yes" "No"
##   .. ..$ dscrgrp: chr [1:2] "Yes" "No"
##  - attr(*, "class")= chr "htest"

chi$observed #observed

##        dscrgrp
## blgetmg  Yes   No
##     Yes   35   57
##     No    94 1487

round(chi$expected,2) #expected

##        dscrgrp
## blgetmg    Yes      No
##     Yes   7.09   84.91
##     No  121.91 1459.09

From chi-square test we have that p-value < 2.2e-16, so if the value of standardized residual is lower than -3.29 it means that the cell contains fewer observations that it was expected (the case of variables independence). If the value of standardized residual is higher than 3,29 it means that the cell contains more observations that it was expected. The standardized residuals and plots with the results are shown below:

chi$stdres #standardized residuals

##        dscrgrp
## blgetmg      Yes       No
##     Yes  11.2193 -11.2193
##     No  -11.2193  11.2193

assocplot(t(table), main="Residuals and number of observations", xlab = "Being a member of a group discriminated against in Netherlands", ylab = "Belonging to minority ethnic groups")

#Let’s visualize Pearson residuals using the package corrplot
corrplot(chi$stdres, is.cor = FALSE)

#we can see that there is association between variables

The probability of independence of variables (or that H0 is true) is 0. The function can be seen below:

1 - pchisq(121.4, 0.05)

## [1] 0

So, having the p-value < 2.2e-16 and looking at the plots above, we can conclude that H0 is false. That means that there is association between belonging to minority ethnic group in Netherlands and being a member of a group discriminated against in Netherlands.

T-test

Box-plots and checking the normality assumption

While studying the topic of subjective well-being in Netherlands, we pose a question of what age do people typically become a victim of a burglary or an assault. To find this out, we can start with constructing a boxplot which would show us minimum, maximum, and mean values of respondents’ age, and find out if these values tend to differ across those who had a traumatizing experience in last 5 years and those who did not. Information about the traumatizing experience is contained in variable “crmvct” the answer on which denotes if a respondent or household member was a victim of burglary or assault in last 5 years or not. Thus, we have two groups on which we can check the distribution of age parameter, that we calculated on the base of “yrbrn” variable and coded as “estAge” (‘est’ due to the number being approximate). Having constructed a boxplot, we got the infromation that mean, maximum, and minimum values indeed vary across the two groups, and the data in both groups don’t produce outliers. The summary of the mentioned values can be seen as the result of applying function psych::describeBy().

estAge = 2019 - as.numeric(as.character(NL_ttest$yrbrn)) 
ggplot(NL_ttest, aes(x = NL_ttest$crmvct, y = estAge, fill = NL_ttest$crmvct, na.rm = T)) +
  scale_x_discrete(name = "Having had traumatizing experience") +
  scale_y_continuous(name = "Estimated age", limits = c(0, 100)) +
  ggtitle("Boxplot of Age Distribution by Presence of Traumatizing Experience") +
  geom_boxplot() +
  theme_bw() +
  labs(fill = "Traumatizing experience")

describeBy(estAge, NL_ttest$crmvct, mat = T)

Now, to be absolutely confident about the result given by the boxplot and its correctness, we want to suggest a research hypothesis and apply a t-test to verify it. Yet, before doing so, we need to check the required assumptions of this test. Precisely, the normality of our data on the estAge variable and homogeneity of the sample groups. Previously printed table on psych::describeBy() function has provided some paramaters that are applicable in the current task.

These parameters are skews and kurtosis. The first group, that consists of people who were victims of vurglary or assault in last 5 years, shows a skew of value 0.19 meaning that the distribution of data in this group is slightly skewed to the right, while kurtosis is of value -0.83 which confirms the absence of outliers visualized in the boxplots. The second group also shows a negative kurtosis, this time of value -0.91, confriming no outlies in the data. Yet, its skew is of negative value meaning that the data is insignificantly, as the value is -0.12, skewed to the left. Besides, sample groups are homogeneous as their variances are equal.

The next step would be to draw a Q-Q plot or a histogram of our two variables, in order to double check normality. Well, let’s do both.

ggplot(NL_ttest, aes(x = estAge)) +
  geom_histogram(col = "orange", fill = "wheat1", binwidth = 2.5, na.rm = T) +
  labs(title = "Estimated Age Distribution", x = "Distribution of age", y = "Frequency") +
  scale_x_continuous(limits = c(0, 100)) +
  geom_vline(aes(xintercept = mean(estAge)), color="darkblue", size = 0.8) +
  geom_vline(aes(xintercept = median(estAge)), color = "red4", size = 0.8) +
  facet_grid(NL_ttest$crmvct ~ .) +
  theme_bw()

qqnorm(estAge)
qqline(estAge, col = 2)

The histogram visualizes the data distribution and shows that the median and the mean, denoted by vertical lines of different color, appear to almost coincide indicating that the distribution is close to symmetric. The QQ-plot lets us assume that our sample of “estAge” variable comes from a population that is approximately normally distributed with some signiffiant deviations from normality in the beginning and in the end of it. That is not far from the perfect normality required to conduct a t-test, yet, first of all, for large samples (n > 300) this characteristic is not that important, and, secondly, “estAge”" is the most normally distributed variable among all dataset on responses from Netherlands 2016.

T-test itself

Now, let us apply a t-test. As our null hypothesis(H0) we suggest that the means in two populations, from which the sample groups are taken, are equal, and as an alternative hypothesis (H1) we suggest that these means are not equal.

t.test(estAge ~ NL_ttest$crmvct, na.rm = T, var.equal = F)

## 
##  Welch Two Sample t-test
## 
## data:  estAge by NL_ttest$crmvct
## t = -3.8893, df = 381.69, p-value = 0.0001186
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -6.728339 -2.209783
## sample estimates:
## mean in group Yes  mean in group No 
##          50.37945          54.84851

The results of the t-test show that the means in two populations are not equal, that allows us to reject the null hypothesis. P-value, which is less than 0.05 is confirms it. This way, we conducted the analysis that has given us some insights for further exploration of our initial question (“what age do people in Netherlands typically become a victim of a burglary or an assault, a.k.a. come to have a traumatizing experience”).

Double-checking the results with a non-parametric test

Firstly, let’s check results from t-test with unpaired Wilcoxon or Mann-Whitney test.

The hypothesis are:

H0: the distribution of scores for the two groups are equal;
HA: the distribution of scores for the two groups are not equal

wilcox.test(estAge ~ NL_ttest$crmvct)

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  estAge by NL_ttest$crmvct
## W = 144210, p-value = 0.0001627
## alternative hypothesis: true location shift is not equal to 0

So, we can see that the p-value of the test is 0.0001627, which is less than the significance level alpha = 0.05. The null hypothesis can be rejected. We can conclude that expirienced burglary/assalt people’s median age is significantly different from non-expirienced burglary/assalt people’s median age with a p-value = 0.0001627

Let’s aso check the results with Cohen’s D test! It will show how strong the effect size and difference is.

cohensD(estAge ~ NL_ttest$crmvct)

## [1] 0.2466662

The result of 0.2466662 shows that there is a small effect size. With a Cohen’s d of 0.2:

58 % of the treatment group will be above the mean of the control group;
92 % of the two groups will overlap, and there is a 56 % chance that a person picked at random from the treatment group will have a higher score than a person picked at random from the control group;
Moreover, in order to have one more favorable outcome in the treatment group compared to the control group we need to treat 16.5 people. This means that if 100 people go through the treatment, 6.1 more people will have a favorable outcome compared to if they had received the control treatment.

~ Thank you for your precious attention!

Odoo’s 2nd project. Chi-squared and t-test

odoo_team

22/02/2019