Group work & personal contribution

  • Alexandra Shanina prepared the data for the tests, wrote the hypotheses for the chi-squared test, constructed some of the graphs, and interpreted the graphs and the results of the chi-squared test.

  • Milena Oleshko wrote the hypotheses for the t-test, constructed some of the graphs, and interpreted the graphs and the results of the t-test.

  • Zakharova Victoria planned the project, defined the essential libraries, wrote the assumptions for the two tests, organized the .Rmd file, and helped and supported A. Shanina and M. Oleshko.

In addition, the group held several discussions on defining possible variables for the project and on constructing the graphs and choosing their properties.

Topic & Data

This project is based on data collected by the European Social Survey (ESS8-2016; Edition 2.1 was released on 1 December 2018), which are available for download on the ESS website.

The analysis focuses on migration in Germany. With this in mind, the hypotheses stated below are tested with the chi-squared test and the t-test.

Chi-squared test

The general idea of the chi-squared test is to check whether two categorical variables are dependent or not.

The variables below were chosen for this test.

library(knitr)

Abbreviation <- c("ctzcntr", "tporgwk")
LevelOfMeasure <- c("Binary", "Nominal")
Definition <- c("Citizen of country", "What type of organisation work/worked for")
vardesc_1 <- data.frame(Abbreviation, LevelOfMeasure, Definition)
knitr::kable(vardesc_1)
Abbreviation   LevelOfMeasure   Definition
ctzcntr        Binary           Citizen of country
tporgwk        Nominal          What type of organisation work/worked for

Assumptions

  • Raw data in counts not per cent;
  • Two categorical variables (usually at the nominal level);
  • The independence of observations;
  • Expected cell counts should be 5 or more in at least 80% of the cells, and no cell should have an expected count below 1.

Hypothesis

With these variables, the following hypothesis is tested:

The status of a citizen of the country does not affect the type of organization in which a person works. In other words, the two variables are independent.

H0: being a citizen of the country and the type of organization worked for are independent.

H1: being a citizen of the country and the type of organization worked for are not independent.

Preparing data

library(tidyverse)
# Keep the two variables and drop the ESS missing-value codes
cst_data <- Germany %>% 
              select(tporgwk, ctzcntr) %>% 
              filter(tporgwk != 66 & tporgwk != 77 & tporgwk != 88 & 
                     tporgwk != 99 & ctzcntr != 7 & ctzcntr != 8 & ctzcntr != 9) %>%
              na.omit()

Data as table & its plotted version

# Representation of the data in a format of table
table(cst_data$tporgwk, cst_data$ctzcntr)
##                                                     
##                                                       Yes   No
##   Central or local government                         198    4
##   Other public sector (such as education and health)  372   18
##   A state owned enterprise                            163    6
##   A private firm                                     1567  102
##   Self employed                                       180    7
##   Other                                                75    9
# Create a table as variable
cst_tbl <- matrix(c(198, 372, 163, 1567, 180, 75, 4, 18, 6, 102, 7, 9), nrow = 6)
rownames(cst_tbl) <- c("Central or local government", 
                       "Other public sector (such as education and health)",
                       "A state owned enterprise", 
                       "A private firm", 
                       "Self employed", 
                       "Other")
colnames(cst_tbl) <- c("Citizen","Non-citizen")

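As can be noticed, one observed cell has a value lower than 5. The assumption, however, concerns the expected counts rather than the observed ones. The expected counts reported below show that only one of the twelve cells has an expected count under 5 (4.54), which is about 8% of the cells and within the 20% limit, so the chi-squared test can be conducted.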
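The expected-count condition can be checked directly from the observed table. The sketch below rebuilds the table from the counts shown above and computes the expected counts under independence as row total × column total / grand total. The shortened row labels and the object names are assumptions made for brevity.

```r
# Observed counts from the table above (row labels shortened for brevity)
obs <- matrix(c(198, 372, 163, 1567, 180, 75,
                4, 18, 6, 102, 7, 9),
              nrow = 6,
              dimnames = list(c("Government", "Other public", "State owned",
                                "Private firm", "Self employed", "Other"),
                              c("Citizen", "Non-citizen")))

# Expected counts under independence
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

share_below_5 <- mean(expected < 5)  # share of cells with expected count < 5
min_expected  <- min(expected)       # smallest expected count

round(expected, 2)
share_below_5   # 1 of 12 cells
min_expected    # about 4.54
```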

# Visualization
library(sjPlot)
sjp.xtab(cst_data$tporgwk, cst_data$ctzcntr, 
         margin = "row",
         bar.pos = "stack", coord.flip = FALSE, 
         vjust = "right", hjust = "center", 
         title = "The proportions of citizens and non-citizens in different types of organizations", 
         legend.title = "Citizen of country", 
         axis.titles = "Type of organisation",
         expand.grid = TRUE, geom.colors = c("#bfdbed", "#158cba"))

This stacked bar plot shows the percentage of citizens and non-citizens within each type of organization in the German labor market. The share of positions held by non-citizens is small in every category. This may point to barriers for migrants, although the plot alone cannot show discrimination. Government positions are the most closed: only about 2% of them are held by people without citizenship (4 of 202), whereas private firms employ migrants more readily (about 6%, 102 of 1,669). The other public sector is also of interest, because about 4.6% of its positions (18 of 390) are held by migrants.

Conducting the test

chi_test2 <- chisq.test(cst_tbl)  # stored so that its components can be inspected below
chi_test2
## 
##  Pearson's Chi-squared test
## 
## data:  cst_tbl
## X-squared = 13.516, df = 5, p-value = 0.019
Table of critical values for Chi-squared test


As can be seen, the chi-squared test is significant: the statistic (13.516) exceeds the critical chi-squared value for a significance level of 0.05 (95% confidence level) with 5 degrees of freedom, which is 11.07. In this case, the next step is to look at the residuals, which can be taken from the structure of the test object.
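The comparison with the critical value can be reproduced without a printed table. The sketch below recomputes the statistic from the observed counts as the sum of (observed - expected)^2 / expected, and takes the critical value and the p-value from the chi-squared distribution with qchisq() and pchisq(). The object names are illustrative.

```r
# Observed counts from the table above
obs <- matrix(c(198, 372, 163, 1567, 180, 75,
                4, 18, 6, 102, 7, 9), nrow = 6)
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

x2   <- sum((obs - expected)^2 / expected)        # test statistic
dof  <- (nrow(obs) - 1) * (ncol(obs) - 1)         # degrees of freedom
crit <- qchisq(0.95, df = dof)                    # critical value at alpha = 0.05
p    <- pchisq(x2, df = dof, lower.tail = FALSE)  # upper-tail p-value

round(c(statistic = x2, df = dof, critical = crit, p.value = p), 3)
```

The test is significant at the 0.05 level exactly when the statistic exceeds the critical value, which is the same condition as the p-value being below 0.05.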

The data structure & its elements

  • The structure
## List of 9
##  $ statistic: Named num 13.5
##   ..- attr(*, "names")= chr "X-squared"
##  $ parameter: Named int 5
##   ..- attr(*, "names")= chr "df"
##  $ p.value  : num 0.019
##  $ method   : chr "Pearson's Chi-squared test"
##  $ data.name: chr "cst_tbl"
##  $ observed : num [1:6, 1:2] 198 372 163 1567 180 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:6] "Central or local government" "Other public sector (such as education and health)" "A state owned enterprise" "A private firm" ...
##   .. ..$ : chr [1:2] "Citizen" "Non-citizen"
##  $ expected : num [1:6, 1:2] 191 369 160 1579 177 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:6] "Central or local government" "Other public sector (such as education and health)" "A state owned enterprise" "A private firm" ...
##   .. ..$ : chr [1:2] "Citizen" "Non-citizen"
##  $ residuals: num [1:6, 1:2] 0.501 0.16 0.248 -0.297 0.234 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:6] "Central or local government" "Other public sector (such as education and health)" "A state owned enterprise" "A private firm" ...
##   .. ..$ : chr [1:2] "Citizen" "Non-citizen"
##  $ stdres   : num [1:6, 1:2] 2.238 0.746 1.102 -2.064 1.042 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:6] "Central or local government" "Other public sector (such as education and health)" "A state owned enterprise" "A private firm" ...
##   .. ..$ : chr [1:2] "Citizen" "Non-citizen"
##  - attr(*, "class")= chr "htest"
  • Observed counts
##                                                    Citizen Non-citizen
## Central or local government                            198           4
## Other public sector (such as education and health)     372          18
## A state owned enterprise                               163           6
## A private firm                                        1567         102
## Self employed                                          180           7
## Other                                                   75           9
  • Expected counts
##                                                    Citizen Non-citizen
## Central or local government                         191.08       10.92
## Other public sector (such as education and health)  368.92       21.08
## A state owned enterprise                            159.86        9.14
## A private firm                                     1578.78       90.22
## Self employed                                       176.89       10.11
## Other                                                79.46        4.54
  • Pearson residuals: (observed - expected) / sqrt(expected)
##                                                    Citizen Non-citizen
## Central or local government                          0.501      -2.094
## Other public sector (such as education and health)   0.160      -0.671
## A state owned enterprise                             0.248      -1.037
## A private firm                                      -0.297       1.241
## Self employed                                        0.234      -0.978
## Other                                               -0.500       2.093
  • Standardized residuals
##                                                      Citizen Non-citizen
## Central or local government                         2.238177   -2.238177
## Other public sector (such as education and health)  0.745909   -0.745909
## A state owned enterprise                            1.101529   -1.101529
## A private firm                                     -2.063630    2.063630
## Self employed                                       1.041856   -1.041856
## Other                                              -2.186028    2.186028

Another assumption of the chi-squared test also holds: the expected counts in all cells are greater than 1.

Looking at the standardized residuals, some cells have values below -2, which means that the cell contains fewer observations than expected under independence of the variables, and some have values above 2, which means that the cell contains more observations than expected. The following visualization of the standardized residuals shows this as well.
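Both kinds of residuals can be recomputed by hand, which makes the difference between them explicit. The sketch below is a minimal illustration with assumed object names: Pearson residuals divide the deviation by the square root of the expected count, and standardized residuals additionally divide by the estimated standard error, which depends on the row and column proportions.

```r
# Observed counts from the table above
obs <- matrix(c(198, 372, 163, 1567, 180, 75,
                4, 18, 6, 102, 7, 9), nrow = 6)
n <- sum(obs)
expected <- outer(rowSums(obs), colSums(obs)) / n

# Pearson residuals
pearson <- (obs - expected) / sqrt(expected)

# Standardized residuals: Pearson residuals divided by their estimated standard error
row_prop <- rowSums(obs) / n
col_prop <- colSums(obs) / n
stdres <- pearson / sqrt(outer(1 - row_prop, 1 - col_prop))

round(pearson, 3)
round(stdres, 3)

# Cells whose standardized residual exceeds 2 in absolute value
which(abs(stdres) > 2, arr.ind = TRUE)
```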

# Visualization of residuals
library(graphics)
assocplot(cst_tbl, main = "The standardized residuals of observations")

library(corrplot)
# Standardized residuals taken directly from the test object
corrplot(chisq.test(cst_tbl)$stdres, is.cor = FALSE, method = "number", tl.col = "black")

Note: positive residuals are in blue. Positive values in cells specify an attraction (positive association) between the corresponding row and column variables. Negative residuals are in red. This implies a repulsion (negative association) between the corresponding row and column variables.

Conclusion

According to the results, there is evidence against the null hypothesis. This means that the chosen variables, the status of a citizen of the country and the type of organization in which a person works, are dependent (X-squared = 13.516, df = 5, p-value = 0.019).

Independent t-test

The independent t-test checks whether there is a statistically significant difference between the means of two unrelated groups.

The variables below were chosen for the independent t-test.

Abbreviation_tt <- c("eduyrs", "gndr", "ctzcntr")
LevelOfMeasure_tt <- c("Ratio", "Binary", "Binary")
Definition_tt <- c("Years of full-time education completed", "Gender", "Citizen of country")
# Use the *_tt vectors defined above, not the vectors from the chi-squared section
vardesc_2 <- data.frame(Abbreviation = Abbreviation_tt,
                        LevelOfMeasure = LevelOfMeasure_tt,
                        Definition = Definition_tt)
knitr::kable(vardesc_2)
Abbreviation   LevelOfMeasure   Definition
eduyrs         Ratio            Years of full-time education completed
gndr           Binary           Gender
ctzcntr        Binary           Citizen of country

Assumptions

  • Two groups;
  • Continuous dependent variable;
  • Observations are independent;
  • Normally distributed continuous variable across two groups;
  • Equality of variances of the continuous variable across two groups;
  • A reasonably large sample size (at least 30 observations per group).

Hypothesis

The purpose of this test is therefore to check whether the mean number of years spent on education by non-citizens is the same for both genders.

H0: the mean numbers of years spent on education by male and female non-citizens are equal.

H1: the mean numbers of years spent on education by male and female non-citizens are not equal.

Preparing data

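# Keep the three variables and drop the ESS missing-value codes.
# Note: ctzcntr is only cleaned of missing codes here; no condition selects
# non-citizens, so all respondents with valid answers remain in tt_data
# (n = 2849, judging by the degrees of freedom reported in the tests below).
tt_data <- Germany %>% 
              select(eduyrs, gndr, ctzcntr) %>% 
              filter(eduyrs != 66 & eduyrs != 77 & eduyrs != 88 & eduyrs != 99 & 
                     ctzcntr != 7 & ctzcntr != 8 & ctzcntr != 9) %>% 
              na.omit()
tt_data$eduyrs <- as.numeric(tt_data$eduyrs)
tt_data$gndr <- as.factor(tt_data$gndr)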

Some basic descriptive statistics of the continuous variable, years of full-time education completed:

library(psych)
describe(tt_data$eduyrs)

Data visualization

# Boxplot
ggplot(tt_data, aes(x = gndr, y = eduyrs)) +
        geom_boxplot() +
        # fun replaces the deprecated fun.y argument (ggplot2 3.3.0 and later)
        stat_summary(fun = mean, geom = "point", shape = 3, size = 4) +
        ggtitle("Distribution of years spent on education by non-citizens") + 
        labs(x = "", y = "Years of education") +
        theme_bw() +
        scale_y_continuous(breaks = 0:30*2)

This box plot depicts the two distributions of years spent on education, by gender, among non-citizens in Germany. The median for women is 12 years, which is lower than the median for men (13 years). Outliers are also present. According to the graph, the maximum is 21 years for women and 23 years for men, while the minimum is 5 years for women and 4 years for men. The interquartile ranges differ slightly: the one for women is smaller than the one for men. Overall, it is difficult to judge from the plot alone, without any calculations, whether the mean years of education (the “+” symbol) are equal for both genders.

# Code for mode
getmode <- function(v) {
   uniqv <- unique(v)
   uniqv[which.max(tabulate(match(v, uniqv)))]
}

# Histogram 1
ggplot() + 
  geom_histogram(data = tt_data, aes(x = eduyrs, y=..density..), 
               position = "identity", binwidth = 1, alpha = 0.4) + 
  labs(title = "Distribution of spent years on education by non-citizens", 
      x = "Years spent on education by non-citizens", y = "Density") +
  geom_vline(aes(xintercept = mean(tt_data$eduyrs), color="mean"), size = 0.7) +
  geom_vline(aes(xintercept = median(tt_data$eduyrs), color = "median"), size = 0.7) +
  geom_vline(aes(xintercept = getmode(tt_data$eduyrs), color = "mode"), size = 0.7) +
  scale_x_continuous(breaks = 0:35*5) +
  scale_color_manual(name = "Statistics", values = c(mean = "blue", mode = "red", median = "black"))

This histogram shows the distribution of the number of years spent on education among non-citizens. Overall, the data are distributed almost normally. The graph also shows the mean (blue line), the median (black line), and the mode (red line). These three measures of central tendency do not coincide. It was therefore decided to examine the years of education by migrants’ gender, to see whether there is any difference between men and women. The t-test was used to answer this question.

# Histogram 2
tt_data$gndr <- relevel(tt_data$gndr, ref = "Female")

ggplot(tt_data, aes(x = eduyrs, fill = gndr)) +
        geom_histogram(aes(y=..density..), position = "identity", binwidth = 1, alpha = 0.6) +
        labs(title = "Distribution of spent years on education by non-citizens", 
             subtitle = "taking into account gender", 
             x = "Years spent on education", y = "Density",
             fill = "Gender") +
        facet_grid(. ~ gndr) +
        scale_x_continuous(breaks = 0:30*5) +
        geom_vline(aes(xintercept = mean(tt_data$eduyrs), color="mean"), size = 0.7) +
        geom_vline(aes(xintercept = median(tt_data$eduyrs), color = "median"), size = 0.7) +
        geom_vline(aes(xintercept = getmode(tt_data$eduyrs), color = "mode"), size = 0.7) +
        scale_color_manual(name = "Statistics", values = c(mean = "blue", mode = "red", median = "black"))

This histogram shows the distribution of years spent on education by gender of respondents among migrants. The median (black line), the mode (red line), and the mean (blue line) are shown as well. The graph suggests no obvious difference between men and women: both genders spend approximately the same time on education. However, this statement should be checked as a hypothesis.

Normality of data

qqnorm(tt_data$eduyrs)
qqline(tt_data$eduyrs, col = 2)

The Q-Q plot looks approximately normal, although the distribution is slightly skewed.

Equality of variances

bartlett.test(tt_data$eduyrs ~ tt_data$gndr)
## 
##  Bartlett test of homogeneity of variances
## 
## data:  tt_data$eduyrs by tt_data$gndr
## Bartlett's K-squared = 0.12866, df = 1, p-value = 0.7198
var.test(tt_data$eduyrs, tt_data$gndr)
## 
##  F test to compare two variances
## 
## data:  tt_data$eduyrs and tt_data$gndr
## F = 44.066, num df = 2848, denom df = 2848, p-value <
## 0.00000000000000022
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  40.94496 47.42568
## sample estimates:
## ratio of variances 
##           44.06634

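According to Bartlett’s test, there is no evidence that the variances differ between the two gender groups (K-squared = 0.12866, p-value = 0.7198). The F-test output above does not address the same question: called as var.test(tt_data$eduyrs, tt_data$gndr), it compares the variance of eduyrs with the variance of the gender variable itself, so its result (F = 44.066, p-value < 2.2e-16) says nothing about the equality of variances across genders. Comparing the two gender groups requires the formula form, var.test(eduyrs ~ gndr, data = tt_data).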
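The formula form of the F-test can be illustrated on simulated data. The sketch below does not use the ESS data: the data frame sim, its group sizes, and its parameters are hypothetical and serve only to show that var.test() with a formula compares the variance of the numeric variable between the two levels of the grouping factor.

```r
set.seed(1)  # simulated data, for illustration only
sim <- data.frame(
  eduyrs = c(rnorm(40, mean = 13, sd = 3), rnorm(50, mean = 13.5, sd = 3)),
  gndr   = factor(rep(c("Female", "Male"), times = c(40, 50)))
)

# Formula interface: compares the variance of eduyrs between the two gndr groups
f_res <- var.test(eduyrs ~ gndr, data = sim)

# The F statistic is the ratio of the two group variances
group_var <- tapply(sim$eduyrs, sim$gndr, var)
f_manual  <- unname(group_var["Female"] / group_var["Male"])

f_res$parameter  # degrees of freedom: group sizes minus one
f_res$statistic  # same value as f_manual
```

With this form the degrees of freedom equal the two group sizes minus one, whereas in the output above both are 2848, the full sample size minus one.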

Conducting the test

t.test(tt_data$eduyrs ~ tt_data$gndr, var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  tt_data$eduyrs by tt_data$gndr
## t = -2.5737, df = 2847, p-value = 0.01011
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.56349294 -0.07616885
## sample estimates:
## mean in group Female   mean in group Male 
##             13.08799             13.40782

Conclusion

In summary, the mean numbers of years spent on education by non-citizens differ by gender (t = -2.5737, df = 2847, p-value = 0.01011). The difference is small: women completed 13.09 years of education on average and men 13.41 years.
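The reported t-test output can be cross-checked from its own summary numbers. The sketch below uses only the values printed above (the two group means, the t statistic, and the degrees of freedom) and recovers the p-value and the 95% confidence interval. The object names are illustrative, and small differences in the last digits come from rounding in the printed output.

```r
# Values copied from the t-test output above
mean_f <- 13.08799
mean_m <- 13.40782
t_stat <- -2.5737
dof    <- 2847

diff_means <- mean_f - mean_m                    # difference in means (Female - Male)
se         <- diff_means / t_stat                # standard error implied by t
p_value    <- 2 * pt(-abs(t_stat), df = dof)     # two-sided p-value
ci         <- diff_means + c(-1, 1) * qt(0.975, df = dof) * se

round(p_value, 5)
round(ci, 4)
```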

Non-parametric test

A non-parametric test for two independent samples can also be conducted. This type of test does not assume a normal distribution, and it is only slightly less powerful than the t-test.

wilcox.test(eduyrs ~ gndr, data = tt_data)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  eduyrs by gndr
## W = 946220, p-value = 0.002912
## alternative hypothesis: true location shift is not equal to 0

According to the results of the Wilcoxon rank sum test, there is a difference in years of completed education between the two genders of non-citizens (W = 946220, p-value = 0.002912).
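The W statistic has a simple interpretation in terms of ranks. The sketch below uses simulated data, not the ESS data, and the vectors x and y are hypothetical groups: all observations are ranked together, the ranks of the first group are summed, and the smallest possible rank sum for that group is subtracted.

```r
set.seed(2)  # simulated data, for illustration only
x <- rnorm(30, mean = 13.0, sd = 3)  # hypothetical group 1
y <- rnorm(35, mean = 13.5, sd = 3)  # hypothetical group 2

# Rank all observations together, then sum the ranks of the first group
r <- rank(c(x, y))
w_manual <- sum(r[seq_along(x)]) - length(x) * (length(x) + 1) / 2

w_test <- wilcox.test(x, y)

# The hand-computed value agrees with the statistic reported by wilcox.test()
all.equal(unname(w_test$statistic), w_manual)
```

Because the test works on ranks rather than on the raw values, it does not depend on the normality of eduyrs.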