Data Analysis in Sociology course. Second year students’ portfolio
Bachelor’s Programme ‘Sociology and Social Informatics’. HSE University
May 25, 2019
Alexandra Shanina, Milena Oleshko, Zakharova Victoria

Data

The whole project is based on the data collected by European Social Survey (ESS8-2016 Edition 2.1 was released on 1st of December 2018) that is available for downloading on the website. As for the main topic on which we focus is a migration process in Germany. More information about Germany is availiable here.

Go to the website

Project 1: Describe data

Group work & personal contribution

  • Alexandra Shanina was responsible for graph descriptions and wrote code for variable selecting and filtering for further graph construction.

  • Milena Oleshko created descriptive tables for all used variables, constructed scatter-plot and helped to code other graphs.

  • Zakharova Victoria organized .Rmd and constructed graphs (except the scatter-plot).

Overall, we had several group discussions that were about a deal of aspects such as distinguishing possible variables for the whole project, for instance.

Note: we wanted to organize our work in the following way: “what we do” -> “code” -> “graph” -> “description”. Hope, we did it :)

Identification of variable types

As our topic is migration in Germany, we attempted to choose appropriate variables. We are going to work with the following ones:

setwd("C:/Users/Victoria/Desktop/Data Analysis in Sociology/Project")
Variable <- c("gndr","agea", "imsclbn","impcntr", "livecnta", "hhmmb", "wkhtot","ctzcntr", "mainact", "domicil" ) 
LevelOfMeasure <- c("Nominal", "Ratio", "Interval", "Interval", "Ratio", "Ratio", "Ratio", "Nominal", "Nominal", "Nominal") 
mode_is_it_possible <- c("Yes","Yes","Yes","Yes","Yes","Yes","Yes","Yes","Yes","Yes") 
median_is_it_possible <- c("No", "Yes", "No", "No", "Yes", "Yes", "Yes", "No", "No","No") 
mean_is_it_possible <- c("No", "Yes", "Yes", "Yes","No","No","No","No","No","No") 
QualOrQuant <- c("Qualitive", "Quantitative", "Quantitative","Quantitative","Quantitative","Quantitative","Quantitative", "Qualitive", "Qualitive", "Qualitive") 
ContinOrDiscrete <- c("None", "Discrete", "Discrete", "Discrete", "Continuous", "Continuous","Continuous", "None","None", "None") 
DescriptionOfVariables <- c("Gender", "Age of respondent, calculated", "When should immigrants obtain rights to social benefits/services", "Allow many/few immigrants from poorer countries outside Europe", "What year you first came to live in country", "Number of people living regularly as member of household", "Total hours normally worked per week in main job overtime included", "Citizen of country", "Main activity last 7 days", "Domicile, respondent's description") 


Variables_discription1 <- data.frame(Variable, LevelOfMeasure, QualOrQuant, ContinOrDiscrete, mode_is_it_possible, median_is_it_possible, mean_is_it_possible) 
knitr::kable(Variables_discription1) 
Variable LevelOfMeasure QualOrQuant ContinOrDiscrete mode_is_it_possible median_is_it_possible mean_is_it_possible
gndr Nominal Qualitive None Yes No No
agea Ratio Quantitative Discrete Yes Yes Yes
imsclbn Interval Quantitative Discrete Yes No Yes
impcntr Interval Quantitative Discrete Yes No Yes
livecnta Ratio Quantitative Continuous Yes Yes No
hhmmb Ratio Quantitative Continuous Yes Yes No
wkhtot Ratio Quantitative Continuous Yes Yes No
ctzcntr Nominal Qualitive None Yes No No
mainact Nominal Qualitive None Yes No No
domicil Nominal Qualitive None Yes No No
Variables_discription2 <- data.frame(Variable, DescriptionOfVariables)
knitr::kable(Variables_discription2)
Variable DescriptionOfVariables
gndr Gender
agea Age of respondent, calculated
imsclbn When should immigrants obtain rights to social benefits/services
impcntr Allow many/few immigrants from poorer countries outside Europe
livecnta What year you first came to live in country
hhmmb Number of people living regularly as member of household
wkhtot Total hours normally worked per week in main job overtime included
ctzcntr Citizen of country
mainact Main activity last 7 days
domicil Domicile, respondent’s description
Germany <- haven::read_sav("ESS8DE.sav") 

library(haven)
library(ggplot2)
library(dplyr)
library(cowplot)
library(knitr) 

Histogram

The first graph is a histogram created for single continuous variable(s). We would like to look at the distribution of ages of people who were for the migrants’ immediate access to social benefits. Here it is!

### Choose some variables such as age, gender and opinions about the mentioned issue for construction of histogram

## for male
data_hist1 <- Germany %>% 
                select(imsclbn, gndr, agea) %>% 
                  na.omit() %>% 
                    filter(gndr == 1 & imsclbn == 1)

## for female
data_hist2 <- Germany %>% 
                select(imsclbn, gndr, agea) %>% 
                  na.omit() %>% 
                    filter(gndr == 2 & imsclbn == 1)

### Construct two histograms according to gender and depict them on one picture

pic_hist1 <- ggplot() + 
                geom_histogram(data = data_hist1, aes(x = as.numeric(agea)), 
                col = "#6F15A4", fill = "#D5ACEE", alpha = 0.75, binwidth = 1) +
                  geom_vline(aes(xintercept = mean(data_hist1$agea)), color="#490A72", size=0.75)+
                  geom_vline(aes(xintercept = median(data_hist1$agea)), linetype="dashed", 
                  color="#D9137F", size=0.75) +
                    labs(title = "Distribution of men's age", 
                    x = "Age", y = "Frequency") +
                    theme_minimal() 

pic_hist2 <- ggplot() + 
                geom_histogram(data = data_hist2, aes(x = as.numeric(agea)), 
                col = "#D58E22", fill = "#FACB84", alpha = 0.75, binwidth = 1) +
                  geom_vline(aes(xintercept = mean(data_hist2$agea)), color="#490A72", size=0.75)+
                  geom_vline(aes(xintercept = median(data_hist2$agea)), linetype="dashed", 
                  color="#D9137F", size=0.75) +
                    labs(title = "Distribution of women's age", 
                    x = "Age", y = "Frequency") +
                    theme_minimal() 

two_hist <- plot_grid(pic_hist1, pic_hist2, labels = "AUTO")

title <- ggdraw() + draw_label("Age of people who are for the migrants' immediate access to social benefits", fontface='bold', size = 12.5)

plot_grid(title, two_hist, ncol = 1, rel_heights = c(0.1, 1))
### Now we create a table with mean and median of ages for more accurate visibility

mean_male_age <-  mean(data_hist1$agea)
median_male_age <- median(data_hist1$agea) 
mean_female_age <-  mean(data_hist2$agea)
median_female_age <- median(data_hist2$agea) 

df_age <- data.frame(mean_male_age, mean_female_age, median_male_age, median_female_age)
kable(df_age)
mean_male_age mean_female_age median_male_age median_female_age
43.35758 44.03425 43 43

This graph show the amount of people according to their age group and gender, who agree with the statement that migrants should have the immediate access to social benefits. There were used such variables as Calculated Age (agea), Gender (gndr) and When should immigrants obtain rights to social benefits/services (imsclbn).

It was interesting to explore which age group more frequently answered on “When should immigrants obtain rights to social benefits/services” that immigrants should be treated with all social benefits as soon as they come to the country. That is why, the answers on this particular question were filtered and only one variant of answer was chosen: “Immediately on arrival”.

It can be observed from the graph that males aged thirty, forty and sixty are more likely to provide migrants with all social benefits immediately, thus it could be said they are more tolerant towards migrants. In female group could be seen that women in the age of thirty and fifty are tend to agree with the statement “Migrants should obtain rights to social benefits Immediately on arrival”. Both genders from sixty-five to seventy-five are more conservative, they are less likely to answer that way.

Barplot

Now, it is time to create a barplot that is created for single categorical variable(s). Doing it, we decided to choose a variable such as viewpoints about people from the poorer countries outside Europe and make a division according to respondents’ gender.

### Select some variables in which we are interested in (main one is opinions and one for division)

data_bar_men <- Germany %>% select(impcntr, gndr) %>% na.omit() %>% 
                    filter((impcntr != 7 | impcntr != 8 | impcntr != 9)
                           & (gndr == 1))

data_bar_women <- Germany %>% select(impcntr, gndr) %>% na.omit() %>% 
                    filter((impcntr != 7 | impcntr != 8 | impcntr != 9)
                           & (gndr == 2))


### Rename the chosen variables for better presentation on graphs

data_bar_men$impcntr <- factor(data_bar_men$impcntr, levels = c(1, 2, 3, 4), labels = c("Allow many \nto come and live \nhere", "Allow some", "Allow a few", "Allow none"))
data_bar_men$gndr <- factor(data_bar_men$gndr, levels = c(1), labels = c("Male"))

data_bar_women$impcntr <- factor(data_bar_women$impcntr, levels = c(1, 2, 3, 4), labels = c("Allow many \nto come and live \nhere", "Allow some", "Allow a few", "Allow none"))
data_bar_women$gndr <- factor(data_bar_women$gndr, levels = c(2), labels = c("Female"))


### Construct two barplots and depict them on one picture 

barplot1 <- ggplot() + geom_bar(data = data_bar_men, aes(x = as.factor(impcntr)), fill = "#D5ACEE") +
           xlab(" ") +
           ylab("Number of men") +
           theme(axis.text.x = element_text(angle=90)) +
           theme_bw(base_size = 9)

barplot2 <- ggplot() + geom_bar(data = data_bar_women, aes(x = as.factor(impcntr)), fill = "#FACB84") +
           xlab(" ") +
           ylab("Number of women") +
           theme(axis.text.x = element_text(angle=90)) +
           theme_bw(base_size = 9)

two_bar <- plot_grid(barplot1, barplot2, labels = "AUTO")

title_bar <- ggdraw() + draw_label("Viewpoints about people from the poorer countries outside Europe", fontface='bold', size = 12.5)

plot_grid(title_bar, two_bar, ncol=1, rel_heights=c(0.1, 1))

This bar chart illustrates the different attitudes of both genders Germans towards immigrants from Non-European countries. We used Gender (gndr) and Allow many/few immigrants from poorer countries outside Europe (impcntr) variables to build this graph.

Men and Women in Germany shows the same results. Both genders are prefer to choose “Allow some”, so people are seems to be rather tolerant towards immigrant. It thesis also can be proved by considering that the smallest number of people answered “Allow none”.

Scatterplot

The next is a scatter-plot that is created for bivariate distribution of continuous variables. Here we want to look at the correlation between number of people living in a household and year of migration.

### We do similar manipulations as we have done before: 
## a) choosing variables
data_sc <- Germany %>% 
                select(livecnta, hhmmb) %>% na.omit()

## b) and construction graph
pic_sc <- ggplot(data_sc, aes(x = as.numeric(livecnta), y = as.numeric(hhmmb))) + 
            geom_point(shape = 1) + geom_smooth(method = lm) +
            ylab("Number of people in the household") +
            xlab("Year in which people came to Germany") +
            ggtitle("Correlation between members of household and year of migration") 
pic_sc

This scatter plot depicts the relation between year of migrating to Germany and the number of family members. By this graph we wanted to observe how many people moved to Germany with their families or started a family during 1940 and 2000 years. Variables are used in the graph: Number of people living regularly as member of household (hhmmb) and What year you first came to live in country (livecnta).

It is clearly seen from the graph, that the number of migrants, who came to the Germany increased a lot. Here is depicted the upward trend line. More and more people started to move to Germany with their families or started a family, that can say, that migrants seek not for labor opportunities, but also for better living conditions.

Boxplot

Further, let us make things more complicated and construct a boxplot for bivariate distribution of a continuous and categorical variables.

### We do similar manipulations as we have done before: 
## a) choosing variables
data_box1 <- Germany %>% 
                select(wkhtot, ctzcntr, mainact) %>% na.omit() 

## b) renaming the chosen variables for better presentation on graphs
data_box1$ctzcntr <- factor(data_box1$ctzcntr, levels = c(1, 2), labels = c("Yes", "No"))

## c) and construction graph
box1 <- ggplot() +
  geom_boxplot(data = data_box1, aes(x = as.factor(mainact), y = as.numeric(wkhtot))) +
  theme_bw() + 
  facet_grid(~as.factor(data_box1$ctzcntr)) +
  ylab("Total hours worked per week in main job") + 
  xlab("Main activity last 7 days") +
  ggtitle("Variety of working activity depending on citizenship") +
  theme_set(theme_bw(base_size = 10))

### There is a creation of a table that describes  meaning of numbers regarding to activity
activity_number <- c(1:9)
text_of_description <- c("Paid work", "Education", "Unemployed, looking for job", "Unemployed, not looking for job", "Permanently sick or disabled", "Retired", "Community or military service", "Housework, looking after children, others", "Other")
desc_df <- data.frame(activity_number, text_of_description)

box1
kable(desc_df)
activity_number text_of_description
1 Paid work
2 Education
3 Unemployed, looking for job
4 Unemployed, not looking for job
5 Permanently sick or disabled
6 Retired
7 Community or military service
8 Housework, looking after children, others
9 Other

That boxplot gives the information about the difference in activities of immigrants and citizens. To make this figure we used “Main activity last 7 days” (mainact) and “Total hours normally worked per week in main job overtime included” (wkhtot)

The graph shows that there are more immigrants, who unemployed and not looking for a job (4), there are less people without work among citizens. Immigrants are more engaged in the service sector (8) than citizens are. Also, what is interesting, there are little number of immigrants who work in education sphere (2), community and military service (8). More than that it is seen from the graph that there are no retired immigrants (6), So that point is obvious because we were observing people based on citizenship presence. What is more, there are small number among immigrants, who are sick or disabled (5), while there are a lot of people with citizenship which are permanently sick (9).

Overall, there is a gap among owner of citizenship and immigrants. Immigrants are tend to have low-paid jobs, do not have any privileges. Immigrants engaged mostly in service sphere.

Stacked barplot

The last graph is another type of barplot: stacked barplot for bivariate distribution of categorical variables.

data_st_bar <- Germany %>% 
                select(domicil, impcntr) %>% 
                  na.omit() %>% 
                    filter((impcntr != 7 | impcntr != 8 | impcntr != 9)
                           & (domicil != 7 | domicil != 8 | domicil != 9))

data_st_bar$impcntr <- factor(data_st_bar$impcntr, levels = c(1, 2, 3, 4), labels = c("Allow many to come and live here", "Allow some", "Allow a few", "Allow none"))

data_st_bar$domicil <- factor(data_st_bar$domicil, levels = c(1, 2, 3, 4, 5), labels = c("A big city", "Suburbs or\n outskirts of \n big city", "Town or \nsmall city", "Country \nvillage", "Farm or home \nin countyside"))

data_st_bar <- data_st_bar %>% group_by(domicil, impcntr) %>% summarise(number = n())
data_st_bar$domicil <- as.factor(data_st_bar$domicil)
data_st_bar$impcntr <- as.factor(data_st_bar$impcntr)

st_bar <- ggplot(data = data_st_bar, aes(x = domicil, y = number, fill = impcntr)) +
  geom_bar(stat = "identity", position = "fill") +
  xlab("Place of living")  +
  ylab("Number of people") +
  theme_bw() + 
  labs(fill = "Variation of opinions", title = "Viewpoints about people from the poorer countries outside Europe \nexpressed by respondents who differ in places of dwelling") +
  scale_fill_brewer(palette = "Set3") +
  theme_set(theme_bw(base_size = 10)) 

st_bar

Here is a stacked bar plot which demonstrates how citizens of different country areas accept migrants from poor countries to live with them. By “accepting” it is meant what people answered on the question How about people from the poorer countries outside Europe?: Allow many is acception, allow none is refusal. This graph was constructed by these variables: Allow many/few immigrants from poorer countries outside Europe (impcntr) and Domicile, respondent’s description (domicil).

As it is seen from the bar plot, people from big cities and suburbs are likely to allow some people from poor countries and here is a little amount of people who reject any immigrants from poor countries, so it means that people from big cities are tollerante towards immigrants. The biggest part of native population , who do not allow immigrants or allow only few, is concentrated in small towns. It can be said that citizens of urban area much more tolerant towards immigrants from poorer countries than people, who live in towns or countryside.

Project 2: Chi-squared test & t-test

Group work & personal contribution

  • Alexandra Shanina prepared data for the tests, wrote hypotheses for chi-squared test, constructed some graphs, and interpreted graphs and the results of chi-squared test.

  • Milena Oleshko wrote hypotheses for t-test, constructed some graphs, and interpreted graphs and the results of t-test.

  • Zakharova Victoria planned the project, defined essential libraries, wrote assumptions for two tests, organized .Rmd, helped and supported A. Shanina and M. Oleshko.

All in all, there were several group discussions about some aspects such as defining possible variables for the project, constructing graphs and properties for them.

# In order to see the full-length numbers the following code is implemented

options(scipen = 999)

# Load the data

library(foreign)
Germany <- read.spss("ESS8DE.sav", use.value.labels = T, to.data.frame = T)

Chi-squared test

The general idea of the Chi-squared test it to check whether there is dependence between two categorical variables or not.

In order to conduct this type of test, the presented below variables were chosen.

Abbreviation <- c("ctzcntr", "tporgwk")
LevelOfMeasure <- c("Binominal", "Nominal")
Definition <- c("Citizen of country", "What type of organisation work/worked for")
vardesc_1 <- data.frame(Abbreviation, LevelOfMeasure, Definition) 
knitr::kable(vardesc_1) 
Abbreviation LevelOfMeasure Definition
ctzcntr Binominal Citizen of country
tporgwk Nominal What type of organisation work/worked for

Assumptions

  • Raw data in counts not per cent;
  • Two categorical variables (usually at the nominal level);
  • The independence of observations;
  • The value of the cell expecteds should be 5 or more in at least 80% of the cells, and no cell should have an expected of less than 1.

Hypothesis

Choosing such variables, the following hypothesis is supposed to be checked:

The status of a citizen of the country does not affect the type of organization in which a person works. In other words, there is independence between these variables.

H0: there is independence between being a citizen of the country and type of organization worked for.

H1: the opposite case of H0.

Preparing data

library(tidyverse)
library(sjPlot)
cst_data <- Germany %>% 
              select(tporgwk, ctzcntr) %>% 
              filter(tporgwk != 66 & tporgwk != 77 & tporgwk != 88 & 
                     tporgwk != 99 & ctzcntr != 7 & ctzcntr != 8 & ctzcntr != 9) %>%
              na.omit()

Data as table & its plotted version

# Representation of the data in a format of table
table(cst_data$tporgwk, cst_data$ctzcntr)
##                                                     
##                                                       Yes   No
##   Central or local government                         198    4
##   Other public sector (such as education and health)  372   18
##   A state owned enterprise                            163    6
##   A private firm                                     1567  102
##   Self employed                                       180    7
##   Other                                                75    9
# Create a table as variable
cst_tbl <- matrix(c(198, 372, 163, 1567, 180, 75, 4, 18, 6, 102, 7, 9), nrow = 6)
rownames(cst_tbl) <- c("Central or local government", 
                       "Other public sector (such as education and health)",
                       "A state owned enterprise", 
                       "A private firm", 
                       "Self employed", 
                       "Other")
colnames(cst_tbl) <- c("Citizen","Non-citizen")

As can be noticed, there is a cell with a value that is lower than 5. However, the chi-squared test can be conducted as according to an assumption such values can be no more than in 20% of cells.

# Visualization
sjp.xtab(cst_data$tporgwk, cst_data$ctzcntr, 
         margin = "row",
         bar.pos = "stack", coord.flip = FALSE, 
         vjust = "right", hjust = "center", 
         title = "The proportions of citizens and non-citizens in different types of organizations", 
         legend.title = "Citizen of country", 
         axis.titles = "Type of organisation",
         expand.grid = TRUE, geom.colors = c("#bfdbed", "#158cba"))

This stacked barplot demonstrates the percentage distribution of people with and without citizenship in the German labor market. It is clearly seen that the number of positions occupied by non-citizens is too small. Obviously, there is migrant discrimination. The most closed positions are governmental, there can be seen very small percentage of people without citizenship, while private firms are to employ migrants. It is also interesting to observe public sector position, because 4,8% places are occupied by migrants.

Conducting the test

chisq.test(cst_tbl)
## 
##  Pearson's Chi-squared test
## 
## data:  cst_tbl
## X-squared = 13.516, df = 5, p-value = 0.019

Table of critical values for Chi-squared test

chi_test2 <- chisq.test(cst_tbl)

As can be seen, the chi-squared test is significant as the critical chi-square statistic value for p = 0.05 (95% confidence level) with 5 degrees of freedom is equal to 11.07. In this case, we have to take a look at residuals that can be got from the structure of the test.

The data structure & its elements

  • The structure
    str(chi_test2)
    ## List of 9
    ##  $ statistic: Named num 13.5
    ##   ..- attr(*, "names")= chr "X-squared"
    ##  $ parameter: Named int 5
    ##   ..- attr(*, "names")= chr "df"
    ##  $ p.value  : num 0.019
    ##  $ method   : chr "Pearson's Chi-squared test"
    ##  $ data.name: chr "cst_tbl"
    ##  $ observed : num [1:6, 1:2] 198 372 163 1567 180 ...
    ##   ..- attr(*, "dimnames")=List of 2
    ##   .. ..$ : chr [1:6] "Central or local government" "Other public sector (such as education and health)" "A state owned enterprise" "A private firm" ...
    ##   .. ..$ : chr [1:2] "Citizen" "Non-citizen"
    ##  $ expected : num [1:6, 1:2] 191 369 160 1579 177 ...
    ##   ..- attr(*, "dimnames")=List of 2
    ##   .. ..$ : chr [1:6] "Central or local government" "Other public sector (such as education and health)" "A state owned enterprise" "A private firm" ...
    ##   .. ..$ : chr [1:2] "Citizen" "Non-citizen"
    ##  $ residuals: num [1:6, 1:2] 0.501 0.16 0.248 -0.297 0.234 ...
    ##   ..- attr(*, "dimnames")=List of 2
    ##   .. ..$ : chr [1:6] "Central or local government" "Other public sector (such as education and health)" "A state owned enterprise" "A private firm" ...
    ##   .. ..$ : chr [1:2] "Citizen" "Non-citizen"
    ##  $ stdres   : num [1:6, 1:2] 2.238 0.746 1.102 -2.064 1.042 ...
    ##   ..- attr(*, "dimnames")=List of 2
    ##   .. ..$ : chr [1:6] "Central or local government" "Other public sector (such as education and health)" "A state owned enterprise" "A private firm" ...
    ##   .. ..$ : chr [1:2] "Citizen" "Non-citizen"
    ##  - attr(*, "class")= chr "htest"
    
  • Observed counts
    chi_test2$observed
    ##                                                    Citizen Non-citizen
    ## Central or local government                            198           4
    ## Other public sector (such as education and health)     372          18
    ## A state owned enterprise                               163           6
    ## A private firm                                        1567         102
    ## Self employed                                          180           7
    ## Other                                                   75           9
    
  • Expected counts
    round(chi_test2$expected,2)
    ##                                                    Citizen Non-citizen
    ## Central or local government                         191.08       10.92
    ## Other public sector (such as education and health)  368.92       21.08
    ## A state owned enterprise                            159.86        9.14
    ## A private firm                                     1578.78       90.22
    ## Self employed                                       176.89       10.11
    ## Other                                                79.46        4.54
    
  • Residuals: the difference between observed and expected
    round(chi_test2$residuals, 3)
    ##                                                    Citizen Non-citizen
    ## Central or local government                          0.501      -2.094
    ## Other public sector (such as education and health)   0.160      -0.671
    ## A state owned enterprise                             0.248      -1.037
    ## A private firm                                      -0.297       1.241
    ## Self employed                                        0.234      -0.978
    ## Other                                               -0.500       2.093
    
  • Standardized residuals
    chi_test2$stdres
    ##                                                      Citizen Non-citizen
    ## Central or local government                         2.238177   -2.238177
    ## Other public sector (such as education and health)  0.745909   -0.745909
    ## A state owned enterprise                            1.101529   -1.101529
    ## A private firm                                     -2.063630    2.063630
    ## Self employed                                       1.041856   -1.041856
    ## Other                                              -2.186028    2.186028
    

Another assumption of the chi-squared test is held: expected values in all cells are bigger than 1.

Looking at the standardized residuals, it can be seen there are some cells values of which are lower than -2 (it means that the cell contains fewer observations that it was expected (the case of variables independence)) as well as are higher than 2 (it means that the cell contains more observations that it was expected). The following visualization of the standardized residuals demonstrates it, as well.

# Vizualization of residuals
library(graphics)
assocplot(cst_tbl, main = "The standardized residuals of observations")
library(corrplot)
corrplot(chi_test2$stdres, is.cor = FALSE, method = "number", tl.col = "black")

Note: positive residuals are in blue. Positive values in cells specify an attraction (positive association) between the corresponding row and column variables. Negative residuals are in red. This implies a repulsion (negative association) between the corresponding row and column variables.

Conclusion

According to the got results, it can be concluded that there is evidence against the null hypothesis. By this it means that there is dependence between the chosen variables such as the status of a citizen of the country and the type of organization in which a person works (X-squared = 13.516, p-value = 0.019).

Independent t-test

Independent t-test allows to check whether there is a statistically significant difference between the means in two unrelated groups.

In order to conduct an independent t-test, the presented below variables were chosen.

Abbreviation_tt <- c("eduyrs", " gndr", "ctzcntr") 
LevelOfMeasure_tt <- c("Ratio", "Binominal", "Binominal") 
Definition_tt <- c("Years of full-time education completed", "Gender","Citizen of country") 
vardesc_2 <- data.frame(Abbreviation_tt, LevelOfMeasure_tt, Definition_tt) 
knitr::kable(vardesc_2)
Abbreviation_tt LevelOfMeasure_tt Definition_tt
eduyrs Ratio Years of full-time education completed
gndr Binominal Gender
ctzcntr Binominal Citizen of country

Assumptions

  • Two groups;
  • Continuous dependent variable;
  • Observations are independent;
  • Normally distributed continuous variable across two groups;
  • Equality of variances of the continuous variable across two groups;
  • A reasonably large sample size (at least 30 observations per group).

Hypothesis

In this way, the purpose of conducting such type of test is to compare whether the mean values of spending years on education by non-citizens according to their gender are equal or not.

H0: the mean values of spending years on education by non-citizens taking into account their gender are equal.

H1: the opposite case of H0.

Preparing data

tt_data <- Germany %>% 
              select(eduyrs, gndr, ctzcntr) %>% 
              filter(eduyrs != 66 & eduyrs != 77 & eduyrs != 88 & eduyrs != 99 & 
                    ctzcntr != 1 & ctzcntr != 7 & ctzcntr != 8 & ctzcntr != 9) %>% 
              na.omit()
tt_data$eduyrs <- as.numeric(tt_data$eduyrs) 
tt_data$gndr <- as.factor(tt_data$gndr) 

Looking at some basic descriptive statistics of a continuous variable such as years of full-time education completed.

library(psych)
describe(tt_data$eduyrs)
##    vars    n  mean   sd median trimmed  mad min max range skew kurtosis
## X1    1 2849 13.26 3.31     13    13.1 2.97   1  27    26 0.46     0.51
##      se
## X1 0.06

Data visualization

#Boxplot
ggplot(tt_data, aes(x = gndr, y = eduyrs)) + 
        geom_boxplot() + 
        stat_summary(fun.y = mean, geom = "point", shape = 3, size = 4) + 
        ggtitle("Distribution of years spent on education by non-citizens") + 
        labs(x = "", y = "Years of education") +
        theme_bw() +
        scale_y_continuous(breaks = 0:30*2)

At this box plot two distribution of spent years on education are depicted by gender among non-citizens in Germany. It can be seen that the median value of years spent by female is nearer to 12 and it is less than male’s value. Additionally, the outliers are presented. Also, according to this graph, the maximum value of female is 21 years and men have the bigger level which is equal to 23 years, and the minimum for women is 5 years while the minimum years spent on education among male is 4 years. The interquartile ranges are slightly different: female’s one is less than male’s one. Overall, the equality of means of spent years for both genders (“+” symbol) is quite difficult to define exactly without any calculations.

# Code for mode
getmode <- function(v) {
   uniqv <- unique(v)
   uniqv[which.max(tabulate(match(v, uniqv)))]
}

# Histogram 1
ggplot() + 
  geom_histogram(data = tt_data, aes(x = eduyrs, y=..density..), 
               position = "identity", binwidth = 1, alpha = 0.4) + 
  labs(title = "Distribution of spent years on education by non-citizens", 
      x = "Years spent on education by non-citizens", y = "Density") +
  geom_vline(aes(xintercept = mean(tt_data$eduyrs), color = "mean"), size = 0.7) +
  geom_vline(aes(xintercept = median(tt_data$eduyrs), color = "median"), size = 0.7) +
  geom_vline(aes(xintercept = getmode(tt_data$eduyrs), color = "mode"), size = 0.7) +
  scale_x_continuous(breaks = 0:35*5) +
  scale_color_manual(name = "Statistics", values = c(mean = "blue", mode = "red", median = "black"))

There is presented histogram, which prints out the association between the number of years spend on education and the absence of German citizenship. Overall, it can be said that this data is distributed almost normally. However, the graph also demonstrates the mean (a blue line), the median, which is black, and the mode as a red line. These three central tendency measures are not overlaid. Thus, it was decided to observe a variable of years according to migrants’ gender: if there is any difference between men and women. To answer to this issue the t-test was used.

# Histogram 2
tt_data$gndr <- relevel(tt_data$gndr, ref = "Female")

ggplot(tt_data, aes(x = eduyrs, fill = gndr)) +
        geom_histogram(aes(y=..density..), position = "identity", binwidth = 1, alpha = 0.6) +
        labs(title = "Distribution of spent years on education by non-citizens", 
             subtitle = "taking into account gender", 
             x = "Years spent on education", y = "Density",
             fill = "Gender") +
        facet_grid(. ~ gndr) +
        scale_x_continuous(breaks = 0:30*5) +
        geom_vline(aes(xintercept = mean(tt_data$eduyrs), color="mean"), size = 0.7) +
        geom_vline(aes(xintercept = median(tt_data$eduyrs), color = "median"), size = 0.7) +
        geom_vline(aes(xintercept = getmode(tt_data$eduyrs), color = "mode"), size = 0.7) +
        scale_color_manual(name = "Statistics", values = c(mean = "blue", mode = "red", median = "black"))

This histogram shows the connection between years spend on education and a gender of respondents among migrants. The median (a dotted line), the mode (a red line) and the mean (a blue line) are represented, as well. It can be got from the graph that there is no obvious difference between male and female: both genders are used to spend approximatly the same time for education. However, such statement as a hypothesis should be checked.

Normality of data

qqnorm(tt_data$eduyrs); 
qqline(tt_data$eduyrs, col = 2) 

The Q-Q plot looks approximately normal as the distribution is slightly skewed.

Equality of variances

bartlett.test(tt_data$eduyrs ~ tt_data$gndr)
## 
##  Bartlett test of homogeneity of variances
## 
## data:  tt_data$eduyrs by tt_data$gndr
## Bartlett's K-squared = 0.12866, df = 1, p-value = 0.7198
var.test(tt_data$eduyrs, tt_data$gndr)
## 
##  F test to compare two variances
## 
## data:  tt_data$eduyrs and tt_data$gndr
## F = 44.066, num df = 2848, denom df = 2848, p-value <
## 0.00000000000000022
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  40.94496 47.42568
## sample estimates:
## ratio of variances 
##           44.06634

As for Bartlett’s test, the variances are equal (K-squared = 0.12866, p-value = 0.7198). The same conclusion can be made based on the results of F-test (F = 44.066, p-value < 2.2e-16).

Conducting the test

t.test(tt_data$eduyrs ~ tt_data$gndr, var.equal = T)
## 
##  Two Sample t-test
## 
## data:  tt_data$eduyrs by tt_data$gndr
## t = -2.5737, df = 2847, p-value = 0.01011
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.56349294 -0.07616885
## sample estimates:
## mean in group Female   mean in group Male 
##             13.08799             13.40782

Conclusion

Summing up, the mean values of spending years on education by non-citizens according to their gender are not equal (t-statistic = -2.5737, p-value = 0.01011). There is slightly difference between women’s average years of completed education (m = 13.08799) and men’s ones (m = 13.40782).

Non-parametric test

Non-parametric test for two independent samples can also be conducted. Although this type of test has no assumptions, it is not much less powerful than t-test.

wilcox.test(eduyrs ~ gndr, data = tt_data)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  eduyrs by gndr
## W = 946220, p-value = 0.002912
## alternative hypothesis: true location shift is not equal to 0

According to the results of the Wilcoxon rank sum test, there is difference in years of completed education between both genders of non-citizens (W = 946220, p < 0.002912).

Project 3: ANOVA

# In order to see the full-length numbers the following code is implemented
options(scipen = 999)


# Load libraries 
library(foreign) 
library(knitr) 
library(tidyverse) 
library(graphics) 
library(psych) 
library(dplyr) 
library(ggplot2) 
library(car)
library(magrittr)
library(kableExtra)
library(gplots)
library(DescTools) 

Research purpose & Variables

Research puspose To refine how the years spent on training and the level of satisfaction with the current financial situation are related.

Variables

# Load data and prepare it

Germany <- read.spss("ESS8DE.sav", use.value.labels = T, to.data.frame = T) 

dataset <- Germany %>% select(hincfel, eduyrs, livecnta) %>% 
                       filter((eduyrs != 66 | eduyrs != 77 | eduyrs != 88 | eduyrs != 99) 
                              & (livecnta != 6666 | livecnta != 7777 | 
                                   livecnta != 8888 | livecnta != 9999) &
                                (hincfel != "Refusal" | hincfel != "Don't know" | hincfel != "No answer")) %>% na.omit() 

dataset$eduyrs <- as.numeric(as.character(dataset$eduyrs))
# Recoding variables to reduce lenght of their values

dataset$income <- rep(NA, length(dataset$hincfel))
dataset$income[dataset$hincfel == "Living comfortably on present income"] <- "Living comfortably"
dataset$income[dataset$hincfel == "Coping on present income"] <- "Coping"
dataset$income[dataset$hincfel == "Difficult on present income"] <- "Difficult"
dataset$income[dataset$hincfel == "Very difficult on present income"] <- "Very difficult"
dataset$income <- as.factor(dataset$income)
Abbreviation <- c("hincfel", "eduyrs")
LevelOfMeasure <- c("Nomial", "Ratio")
R_type <- c("factor", "numeric")
Definition <- c("Feeling about household's income nowadays", "Years of full-time education completed")
vardesc <- data.frame(Abbreviation, LevelOfMeasure, Definition, R_type) 
kable(vardesc) 
Abbreviation LevelOfMeasure Definition R_type
hincfel Nomial Feeling about household’s income nowadays factor
eduyrs Ratio Years of full-time education completed numeric

In order to define some observation that are related only to people who have moved to Germany in former times, the data were selected in accordance with a variable of years when people came to Germany.

ANOVA

One-way ANOVA is used to perform statistical analysis in case of checking equality of means among three or more groups.

Assumptions

  • The data are independent and normally distributed.
  • The variances of the sampled populations are equal.
  • The residuals from the data are normally distributed.

It should be noted that ANOVAs are fairly robust to violations to these assumptions. However, it is needed to be alert when these assumptions do not hold.

Hypotheses

H0: the difference between means of groups related to subjective perception of income and time spent on education equals to 0.

HA: the difference between means of groups related to subjective perception of income and time spent on education is not equal to 0.

Social mechanism

The social mechanism, which was observed is the education one has or to be more precise, the years person spend on education. Educational attainment influences directly on the future occupation, thus it partly defines the future income of a person.

In the case of migrants, the educational process can be rather difficult, so not everyone wants to study a lot, sometimes they cannot afford it.People, migrating to another country, have to adapt, which takes time, to the educational system, learn a foreign language, which is official in the country they moved in. Teachers and tutors can have biased attitude towards migrants students, which also has an impact on educational attainment. Moreover, migrants do not always have enough money to let their children or themselves study at university, because education obviously costs a lot, maybe not directly, but by calculating time spend on it and other resources. So, it is really hard for migrants to get higher education and to dedicate their lives to the educational process. What is more, the majority of migrants move to another country to find a job, not for education. However, it is probably applicable only for migrants from non-European countries, because of the significant difference in cultures.

In our research, the main question is how migrants define their well-being. So, migrants were distributed according to their level of education. It was interesting to learn the subjective well-being status of migrants with different sums of years spend on education. The subjective well-being was used because it helps to understand how people estimate their material situation observing o people material conditions surrounding them in society, in our case in Germany.

Supporting the issue, the infographic below might be interesting for further considerations in general ( source ).

Data distribution

describeBy(dataset$eduyrs, dataset$income, mat = TRUE) %>% 
    select(Perception_of_income = group1, N = n, Mean = mean, SD = sd, 
           Median = median, Min = min, Max = max, Skew = skew, 
           Kurtosis = kurtosis, st.error = se) %>% 
    kable(align = c("lrrrrrrrr"), digits = 2, row.names = FALSE,
        caption = "Spent time on education by subjective perception of income") %>% 
    kable_styling(bootstrap_options = c("bordered", "responsive","striped"), 
                  full_width = FALSE)
Spent time on education by subjective perception of income
Perception_of_income N Mean SD Median Min Max Skew Kurtosis st.error
Coping 155 13.23 3.81 13 2 26 0.30 0.70 0.31
Difficult 40 11.65 3.89 11 5 22 0.48 -0.17 0.62
Living comfortably 84 14.88 3.31 15 4 22 -0.24 0.13 0.36
Very difficult 9 12.44 4.61 13 3 18 -0.70 -0.68 1.54

As can be seen, each skew, as well as kurtosis, is less than 2.

# Code for mode
getmode <- function(v) {
   uniqv <- unique(v)
   uniqv[which.max(tabulate(match(v, uniqv)))]
}

# Histogram 1
ggplot(dataset, aes(dataset$eduyrs)) + 
    geom_density(aes(data = dataset$eduyrs), 
                 position = 'identity', alpha = 0.2) +
    labs(x = 'Years spent on education', y = 'Density',
         title = "Distribution of years spent on education") + 
   # scale_x_continuous(limits = c(0, 30)) +
    geom_vline(aes(xintercept = mean(dataset$eduyrs), color="mean"), size = 0.7) +
    geom_vline(aes(xintercept = median(dataset$eduyrs), color = "median"), size = 0.7) +
    geom_vline(aes(xintercept = getmode(dataset$eduyrs), color = "mode"), size = 0.7) +
    scale_color_manual(name = "Statistics", values = c(mean = "blue", mode = "red", 
                                                       median = "black")) +
    theme_bw()

According to the density plot, the variable of years is distributed not as it would like to be in ideal. However, it can be named normal (more or less).

qqnorm(dataset$eduyrs); 
qqline(dataset$eduyrs, col = 2) 

QQ-plot demonstrates that mostly points seem not to fall about a straight line.

ggplot(dataset, aes(x = income, y = eduyrs)) +   
        geom_boxplot() + 
        stat_summary(fun.y = mean, geom = "point", shape = 3, size = 4) + 
        ggtitle("Distribution of years spent on education") + 
        labs(x = "", y = "Years spent on education", 
             subtitle = "per category of income perception") +
        theme_bw() +
        coord_flip() +
        scale_y_continuous(breaks = 0:30*2)

From the boxplot, it can be seen that there are some outliers. Also, a variable, which stands for years spent on education, is not normally distributed in across the groups as it might be wanted. Overall, further the data will be treated as normal.

Analysis

Homogeneity of variances

In order to check whether variances are equal or not, the Levene’s Test is applied.

# Check for homogeneity of variance
leve_test <- leveneTest(eduyrs ~ income, data = dataset)
leve_test
## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value Pr(>F)
## group   3  0.4516 0.7164
##       284

According to the Levene’s test, variances are equal (p-value = 0.7164). The null hypothesis about equality of variances is not rejected.

F-test

aov_test <- aov(dataset$eduyrs ~ dataset$income) 
summary(aov_test)
##                 Df Sum Sq Mean Sq F value    Pr(>F)    
## dataset$income   3    318  106.14   7.711 0.0000571 ***
## Residuals      284   3909   13.76                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

It can be inferred that differences in years spent on education across groups classified by subjective perception of income are not equal in average as F(3, 284) = 7.711, p-value = 0.0000571. The F-ratio is significant.

Normality of residuals

layout(matrix(1:4, 2, 2))
plot(aov_test)

In case of normally distributed residuals, there must be a straight red line in the two upper graphs, and a straight line in the Q-Q plot. Although the first upper graph has not such a straight line, the red line fractionally fluctuated.

The Scale-Location plot shows whether the residuals are spread equally along the predictor range, i.e. homoscedastic. The red line should be straightly horizontal. However, in this case it is not entirely so. The line slopes slightly up around 12, and then there are downward and straight trends. It can be said that the line consistes of several horizontal sections.

As for the Q-Q plot, it can be seen that a majority of dots are on the line.

Previously, we found that skew and kurtosis are less than 2. Shapiro–Wilk test is applied.

anova_residuals <- residuals(object = aov_test)
sha_test <- shapiro.test(x = anova_residuals)
sha_test
## 
##  Shapiro-Wilk normality test
## 
## data:  anova_residuals
## W = 0.99064, p-value = 0.06271
hist(anova_residuals, ylim = c(0, 80))

According to the results of the Shapiro-Wilk normality test, p-value = 0.06271 means that the null hypothesis about normal distibution is not rejected. Consequently, it can be concluded that residuals are distributed normally.

Post hoc test

In case of the equality of variances across groups, the post hoc such as Tukey’s one can be run in order to inspect which groups contribute to the statistical significance of this test.

par(mar = c(5, 15, 3, 1))
Tukey <- TukeyHSD(aov_test)
plot(Tukey, las = 2)

The second pair “Living comfortably - Coping” might be considered as significant rather than minor. The same can be said about the another pair “Living comfortably - Difficult” that is also the most notable category among pairs.

Non-parametric equivalent of ANOVA

The non-parametric test compares the medians of three of more groups. Such test is the Kruskal-Wallis one. As for the null hypothesis, it is stated that mean ranks of the groups are the same.

kr_test <- kruskal.test(eduyrs ~ income, data = dataset) 
kr_test 
## 
##  Kruskal-Wallis rank sum test
## 
## data:  eduyrs by income
## Kruskal-Wallis chi-squared = 24.352, df = 3, p-value = 0.00002109

KW chi-square(3) = 24.352, p-value is < .001, which means that the mean ranks of the groups related to different types of organizations are not the same. Consequently, the Dunn’s test can be implimented, as well.

dunn <- DunnTest(eduyrs ~ income, data = dataset)
dunn
## 
##  Dunn's test of multiple comparisons using rank sums : holm  
## 
##                                   mean.rank.diff     pval    
## Difficult-Coping                      -34.631855   0.0741 .  
## Living comfortably-Coping              39.598502   0.0021 ** 
## Very difficult-Coping                  -5.086022   0.8580    
## Living comfortably-Difficult           74.230357 0.000019 ***
## Very difficult-Difficult               29.545833   0.6683    
## Very difficult-Living comfortably     -44.684524   0.3733    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

According to results, there are two pairs of groups that have statistically significant differences in their medians, while the others do not.

Effect size

The question such as what the substantive significance of group membership in determining amount of time spent on education is. In order to get an answer, looking at the share of variance explained by groups in all the variance of time variable can be helpful.

omega_sq <- function(aov_test){ 
  sum_stats <- summary(aov_test)[[1]]
  SSm <- sum_stats[["Sum Sq"]][1]
  SSr <- sum_stats[["Sum Sq"]][2]
  DFm <- sum_stats[["Df"]][1]
  MSr <- sum_stats[["Mean Sq"]][2]
  W2 <- (SSm-DFm*MSr)/(SSm+SSr+MSr)
  return(W2)
}
omega_sq(aov_test)
## [1] 0.06533891

Overall, it can be consluded that the finding is not so important because the omega squared is below .07 that is medium size effect.

Projects 4-5: Multiple Linear Regression (+ Interaction)

Group work & personal contribution

  • Alexandra Shanina participating in the process of finding needed variables, constructed the model background theory, made interpretations of some models, made interpretations of some graphs, made hypothesis and research question.

  • Milena Oleshko described graphs with distributions – histograms and boxplots, described graphs with correlation, wrote down the interpretations of regression models and the conclusion to the whole project.

  • Zakharova Victoria found variables, wrote equations, coded all graphs, constructed models, made a fabulous html., checked significance and interpreted ANOVE, tied up whole project. Also, helped and supported Alexandra and Milena taking into account their ideas, comments, and even dealt with their criticism.

Theory & Hypotheses

Theory

The key idea of this research project is to observe the social stratification and types of inequality in German society. To make it, it was decided to introduce a model, which is based on socioeconomic status measurement. Socioeconomic status is the set of factors, which effects on ability to access to collectively desired resources. One of the most popular ways to measure person’s economic status is to observe his or her annual income source. So, income was chosen by us as a dependent variable.

It is known, that SES consists of different parts: income depends on a variety of social factors. Our model is to watch what variables affect the income size in Germany. The first variable, which we decided to check, was years spent on education. According to the theory Sewell, educational attainment has a direct causal link to the occupational attainment in the future, so, more time person spent on education, more successful he or she is, the bigger the income size and otherwise source In the case of migrants, there are much more obstacles to spent more time on education than it necessary, because 1) it costs a lot, 2) it takes a lot of time, 3) the process of educational adaptation is hard. The second variable was age. By this, we were curious if there is any discriminating pattern among age groups. It is assumed that different age groups do not have equally access to the resources (as children and pensioners). The third variable is gender. Across the world males and females earnings differ (wage gap among genders). The thought was find whether the income depends on gender. The final is refers to migrants (whether a person has a citizenship or not). By this variable, we wanted to check if there is an inequality between migrants and native population. Migrants on the same job get lower income than native people. Migrants usually agree to work at the low-paid jobs source. Another variable, that was used is the presence of fixed work schedule. From the Goldthorpe theory, it is known that people of high social position and status have not fixed working schedule, they make it by them own (manage their labor by themselves), thus earn much more money. However, there are precarious type of jobs, which do not obtain high status, but provide an ability to earn high income source. By this variable we wanted to know how the presence of fixed working hours may influence on household’s total income. To make this case much more fascinating, we took gender as a moderator, to find out if there is inequality between males and females.

In order to make our case more specific, we decided to filter our data and left only whose people who have any children in the households. As we know from the theory, SES affects children. The aim was to observe the respondents with children, because if there are kids in household, the SES of respondent affects the future perspectives of kids. The interesting case of SES reproduction source.

Below schematical presentation of models are shown.

Research question

Do such factors as years spent on education, age, gender, citizenship, having fixed working hours taling into account gender tend to affect household’s total income?

Hypotheses

Mainly we suppose the following ideas to be checked:

  1. The more years person spent on education, the bigger size of income he or she would get.

  2. Being older, a person earns more (= has more income).

  3. Males would get higher income than females.

  4. Non-migrants would get higher income than migrants. Note: in our case as well as in the models by migrants we mean people who do not have citizenship.

  5. The presence or absence of working schedule affect the difference in incomes.

  6. Hypothesis for interactive model: the type of working schedule would affect the income and at the same time gender would influence the income. So, we expect that the first mentioned variable does not have the “pure” effect. Thus, people with the same working schedule characteristics but different gender would get dissimilar income.

Data

In our data we have the following meanings behind income variable:

In order to deal with income in further regression models we recorded this variable having added number as ordering.

# Libraries
library(foreign)
library(haven)
library(tidyverse)
library(psych)
library(stargazer)
library(polycor) 
library(DataExplorer)
library(magrittr)
library(sjPlot)
library(ggpubr)
library(formattable)
library(effects)
library(ggstatsplot)
library(margins)

# Functions
'%out%' <- function(x,y)!('%in%'(x,y))
mode <- function(v) {
   uniqv <- unique(v)
   uniqv[which.max(tabulate(match(v, uniqv)))]
}

# Loading data
Germany <- read.spss("ESS8DE.sav", use.value.labels = T, to.data.frame = T) 

# Data preparation
data <- Germany %>% 
  select(chldhm, hinctnta, eduyrs, agea, gndr, icwhct, ctzcntr) %>% 
  filter(hinctnta %out% c(77, 88, 99) & 
         (eduyrs %out% c(77, 88, 99)) & 
         (agea != 999) & (gndr != 9) &  
         (icwhct %out% c(6, 9)) &
         (ctzcntr %out% c(7, 8, 9)) &
         (chldhm == "Respondent lives with children at household grid")) %>% 
  select(-chldhm)

# Recoding working hours variable
data$new[data$icwhct == "Yes"] <- "Fixed working hours"
data$new[data$icwhct == "No"] <- "Non-fixed working hours"
data <- data %>% select(-icwhct) %>% rename(icwhct = new)
data$icwhct <- as.factor(data$icwhct)

# Recoding income variable
data$hinctnta_num[data$hinctnta == "J - 1st decile"] <- "1"
data$hinctnta_num[data$hinctnta == "R - 2nd decile"] <- "2"
data$hinctnta_num[data$hinctnta == "C - 3rd decile"] <- "3"
data$hinctnta_num[data$hinctnta == "M - 4th decile"] <- "4"
data$hinctnta_num[data$hinctnta == "F - 5th decile"] <- "5"
data$hinctnta_num[data$hinctnta == "S - 6th decile"] <- "6"
data$hinctnta_num[data$hinctnta == "K - 7th decile"] <- "7"
data$hinctnta_num[data$hinctnta == "P - 8th decile"] <- "8"
data$hinctnta_num[data$hinctnta == "D - 9th decile"] <- "9"
data$hinctnta_num[data$hinctnta == "H - 10th decile"] <- "10"
data <- data %>% select(-hinctnta) %>% rename(hinctnta = hinctnta_num)
data$hinctnta <- factor(data$hinctnta, 
                        levels = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10"))

# Appropriate types
data$eduyrs <- as.numeric(as.character(data$eduyrs))
data$agea <- as.numeric(as.character(data$agea))
data$hinctnta <- as.numeric(as.character(data$hinctnta))

# Used variables in regression models
Number <- c("1", "2", "3", "4", "5", "6")

Abbreviation <- c("hinctnta", "eduyrs", "agea", "gndr", "ctzcntr", "icwhct")

RoleInRegressionModel <- c("dependent", "independent", "independent", 
                           "independent", "independent", "independent")

LevelOfMeasurement <- c("Ratio", "Ratio", "Ratio", "Binomial", "Binomial", "Binomial")

Definition <- c("Household's total net income, all sources", 
                "Years of full-time education completed",
                "Age of respondent",
                "Gender of respondent: male or female",
                "Citizen of country: yes or no",
                "Having a set basic or contracted number of working hours: yes or no")

vardesc <- data.frame(Number, Abbreviation, RoleInRegressionModel, LevelOfMeasurement, Definition)

formattable(vardesc, align = c("l", rep("r", NCOL(vardesc) - 1)), 
            list(`Abbreviation` = formatter("span", style = ~style(color = "green", 
                                                                   font.weight = "bold")))) 
Number Abbreviation RoleInRegressionModel LevelOfMeasurement Definition
1 hinctnta dependent Ratio Household’s total net income, all sources
2 eduyrs independent Ratio Years of full-time education completed
3 agea independent Ratio Age of respondent
4 gndr independent Binomial Gender of respondent: male or female
5 ctzcntr independent Binomial Citizen of country: yes or no
6 icwhct independent Binomial Having a set basic or contracted number of working hours: yes or no

After selecting needed variables we can look at the data frame and define whether there are not available observations or not.

# Looking at NAs
plot_missing(data, title = "NAs in the data")

As can be seen, there are NAs. Although it is possible to deal with it, we are going to remove them to have completely cleaned data for further regression models. After that let us to present basic information about the data and its descriptive statistics.

Descriptive Statistics

# Removing NAs
data <- data %>% na.omit()

# Basic information 
introduce(data)
##   rows columns discrete_columns continuous_columns all_missing_columns
## 1  790       6                3                  3                   0
##   total_missing_values complete_rows total_observations memory_usage
## 1                    0           790               4740        32112
# Descriptive statistics
describe(data)
##          vars   n  mean    sd median trimmed   mad min max range  skew
## eduyrs      1 790 15.21  3.19     15   15.08  2.97   2  25    23  0.25
## agea        2 790 44.79 10.82     45   44.45 10.38  18  83    65  0.34
## gndr*       3 790  1.49  0.50      1    1.49  0.00   1   2     1  0.04
## ctzcntr*    4 790  1.08  0.27      1    1.00  0.00   1   2     1  3.13
## icwhct*     5 790  1.09  0.28      1    1.00  0.00   1   2     1  2.89
## hinctnta    6 790  6.98  2.52      8    7.22  2.97   1  10     9 -0.68
##          kurtosis   se
## eduyrs       0.40 0.11
## agea         0.37 0.38
## gndr*       -2.00 0.02
## ctzcntr*     7.80 0.01
## icwhct*      6.36 0.01
## hinctnta    -0.50 0.09

Overall, there are 790 observations that are going to be base for analysis. 3 variables are defined as discrete, the other 3 are continuous. There are no NAs as we have removed them.

Looking at the second table we mainly interested in continuous variables such as (1) years spent on education, (2) age of respondent, and (3) income. The other variables are factors. Let us present descriptive statistics in graphically.

# Plotting income distribution
ggplot(data, aes(hinctnta)) + 
    geom_histogram(position = 'identity', stat = "count", 
                   fill = "chartreuse3", color = "azure4", alpha = 0.2) +
    labs(x = 'Income', y = 'Density',
         title = "Distribution of respondents' income") + 
    theme_bw() +
    theme(plot.title = element_text(hjust = 0.5)) +
    scale_x_continuous(breaks = unique(data$hinctnta))

As can be seen from the histogram, income is not normally distributed. We can also notice that there are more observations with income related to the second part of deciles. The most popular answer (mode) is the 8th decile. This distribution has a built-up character.

# Plotting age distribution
ggplot(data, aes(agea)) + 
    geom_histogram(position = 'identity', stat = "count", 
                   fill = "mediumpurple1", color = "mediumpurple3", alpha = 0.35) +
    labs(x = 'Age', y = 'Density',
         title = "Distribution of respondents' age") + 
    theme_bw() +
    ylim(0, 40) +
    theme(plot.title = element_text(hjust = 0.5)) +
    geom_vline(aes(xintercept = mean(data$agea), color = "mean"), size = 0.7) +
    geom_vline(aes(xintercept = median(data$agea), color = "median"), size = 0.7) +
    geom_vline(aes(xintercept = mode(data$agea), color = "mode"), size = 0.7) +
    scale_color_manual(name = "Statistics", values = c(mean = "green", 
                                                       median = "blue",
                                                       mode = "red")) +
    scale_y_continuous(breaks = 1:100*10)

As can be noticed, age of respondents is not normally distributed, but still it is roughtly bell-shaped. There are several ups-and-downs. Both mean value and median value are equal to 45 years, while the mode is higher, and it is equal to 49 years.

# Plotting years spent on education distribution
ggplot(data, aes(eduyrs)) + 
    geom_histogram(position = 'identity', stat = "count", 
                   fill = "orange", color = "orange3", alpha = 0.35) +
    labs(x = 'Years of studying', y = 'Density',
         title = "Distribution of years spent on education") + 
    theme_bw() +
    theme(plot.title = element_text(hjust = 0.5)) +
    geom_vline(aes(xintercept = mean(data$eduyrs), color = "mean"), size = 0.7) +
    geom_vline(aes(xintercept = median(data$eduyrs), color = "median"), size = 0.7) +
    geom_vline(aes(xintercept = mode(data$eduyrs), color = "mode"), size = 0.7) +
    scale_color_manual(name = "Statistics", values = c(mean = "green", 
                                                       median = "blue",
                                                       mode = "red")) +
    scale_y_continuous(breaks = 1:100*10)

It can be said that the variable regarding years that respondents spent on education is more or less normally distributed. There are less people who spent not many years. The most popular answer, which is a mode, is equal to 13 years, so it is the peak of the distribution. In this distribution the mean and the median have the same rounded values which are equal to 15 years.

# Plotting years spent on education distribution
ggplot(data, aes(x = ctzcntr, y = hinctnta)) + 
        geom_boxplot(fill = c("lightgreen", "lightskyblue")) + 
        stat_summary(fun.y = mean, geom = "point", shape = 3, size = 4) + 
        ggtitle("Distribution of income per statuses") + 
        labs(x = "Whether person has a status Germany citizen or not", y = "Income") +
        theme_bw() 

This graph depicts two boxplots of income distribution by two categories – (1) people, who have citizenship of Germany, and (2) who do not have it. Comparing these two boxplots it can be said that the median and mean income of people with citizenship is higher than the median value of people who do not have it. Both categories have the same maximum value – 10, but minimum value is higher in case of having citizenship.

# Plotting years spent on education distribution
ggplot(data, aes(x = gndr, y = hinctnta)) + 
        geom_boxplot(fill = c("orchid3", "royalblue3")) + 
        stat_summary(fun.y = mean, geom = "point", shape = 3, size = 4) + 
        ggtitle("Distribution of income per gender") + 
        labs(x = " ", y = "Income") +
        theme_bw() 

This graph shows two distribution of income in Germany by gender. Comparing these two boxplots in can be seen that the median and mean income of male are higher than females have. It also should be said that the third quantile of both categories are about the same value, but the first quantiles differ – females have lower value. Also it is seen that the maximum values of these two categories are equal, and the minimum value of males is higher than females have.

# Plotting years spent on education distribution
ggplot(data, aes(x = icwhct, y = hinctnta)) + 
        geom_boxplot(fill = c("lightgoldenrod1", "honeydew3")) + 
        stat_summary(fun.y = mean, geom = "point", shape = 3, size = 4) + 
        ggtitle("Distribution of income per\nhaving basic or contracted number of working hours") + 
        labs(x = " ", y = "Income") +
        theme_bw() 

This graph depicts the distribution of income in Germany by two categories – (1) people who have fixed working hours, and (2) people who do not have fixed working hours. It can be seen that the interquartile ranges of these two categories are very similar – their first and third quartiles do not differ. It should be mentioned that the median value of people with fixed working hours is higher than median of the second group, and the mean values of these categories do not differ a lot, but the first group has the mean higher than the people with non-fixed working hours have.

Regression Models

Next we would like to provide some some plots are created in order to visualize correlations.

ggscatter(data, x = "agea", y = "hinctnta", 
          add = "reg.line", conf.int = T, 
          cor.coef = T, cor.method = "pearson",
          xlab = "Age", ylab = "Income",
          title = "Correlation plot: income vs age",
          color = "slateblue4")
ggscatter(data, x = "eduyrs", y = "hinctnta", 
          add = "reg.line", conf.int = T, 
          cor.coef = T, cor.method = "pearson",
          xlab = "Years of studying", ylab = "Income",
          title = "Correlation plot: income vs years of studying",
          color = "orangered3")
cor.test(data$hinctnta, data$agea)
## 
##  Pearson's product-moment correlation
## 
## data:  data$hinctnta and data$agea
## t = 3.192, df = 788, p-value = 0.001469
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.04357362 0.18130486
## sample estimates:
##       cor 
## 0.1129819
cor.test(data$hinctnta, data$eduyrs)
## 
##  Pearson's product-moment correlation
## 
## data:  data$hinctnta and data$eduyrs
## t = 10.148, df = 788, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2767790 0.4002284
## sample estimates:
##       cor 
## 0.3399674

As can be seen from the first blue plot, there is a positive tiny correlation (R = 0.11) between total income of respondent’s household and respondent’s age in Germany. The p-value (p = 0.0015) gives us evidence that there is the relationship between our variables. The following might be formulated: the higher person’s age is, the higher household income he or she has.

As for the second graph, it can be seen that there is a positive weak correlation (R = 0.34) between total income of respondent’s household and years spent on education by this person. The p-value is less than 0.001 (p < 2.2e-16), and it gives us the evidence that the relationship between income and years of studying exists. The following might be formulated: the more years person spent on education, the higher household income he or she has.

Additionally, another ways to look at data taking into account distributions of variables are shown below.

set.seed(123)
plot <- ggscatterstats(
  data = data,
  x = agea,
  y = hinctnta,
  xlab = "Age",
  ylab = "Income",
  title = "Distribution and correlation: income vs age", 
  type = "pearson", conf.level = 0.95, method = "lm",
  marginal = TRUE, 
  bf.message = TRUE,
  messages = FALSE
)
print(plot)

This graph shows the summary of previous information in a very fancy way! Here are presented the age and income distribution and the correlation between it, which is depicted by scatterplot! The Pearson’s coefficient here is positive and equals to 0.11, which is not very big, but we will deal with it.

set.seed(123)
plot2 <- ggscatterstats(
  data = data,
  x = eduyrs,
  y = hinctnta,
  xlab = "Years spent on education",
  ylab = "Income",
  title = "Distribution and correlation: income vs years of studying", 
  type = "pearson", conf.level = 0.95, method = "lm",
  marginal = TRUE, 
  bf.message = TRUE,
  messages = FALSE
)
print(plot2)

Another beautiful graph giving the summary about aforesaid information and explanation of income and years of studying. There are presented two histograms, which present income and education years distribution and a scatterplot, showing the correlation between two variables. There is correlation coefficient presented in the graph. This coefficient equals 0,34, which is rather good in our case mainly due to complex variables presented in the data.

##
Summaries of Models

Model 1

The first model has income as the outcome and one predictor – years spent on education.

# Model 1
model1 <- lm(hinctnta ~ eduyrs, data = data)
summary(model1)
## 
## Call:
## lm(formula = hinctnta ~ eduyrs, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.7965 -1.4561  0.5439  1.8120  4.1523 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)  2.89890    0.41051   7.062     0.00000000000362 ***
## eduyrs       0.26807    0.02642  10.148 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.369 on 788 degrees of freedom
## Multiple R-squared:  0.1156, Adjusted R-squared:  0.1145 
## F-statistic:   103 on 1 and 788 DF,  p-value: < 0.00000000000000022
par(mfrow = c(2,3))
plot(density(model1$res), main = "Model 1. Residuals")
plot(model1)

Coefficients

  • the intercept (β0) is 2.9.
  • the slope (β1) is 0.27.
  • equation:

$\hat{Income}$ = 2.9 + 0.27 ⋅ eduyrs + ei

As follows, beta coefficients are statistically significant (according to t-statistics and p-value that is < 0.001).

Residuals

  • Median equals to 0.54.
  • Minimum (-6.8) and maximum (4) are roughly closed to each other in their absolute values.
  • The Residual standard error (RSE) is 2.37 meaning that the observed values deviate from the true regression line by approximately 2 units in average.

Model fit

  • F-statistic is equal to 103 on 1 and 788 DF, and p-value is less than 0.001. So, it can be concluded that probability of that coefficients of predictors are equal to zero is extremely low. Good for us!
  • R-squared is edual to 0.1156 (adjusted R-squared is 0.1145). It says that about 12% (rounded value of R-squared) of the data is explained by the model.

Case as interpretation

Let suppose that person has spent 13 years on education, it means that the income would be equal to 6,38401 (6 rounded). Recording: 6 means that income is from 1500 to 2000 euros monthly.

Checking hypotheses One of hypotheses is that the more years person spent on education, the bigger size of income he or she would get. For instance, person has spent 3 years on education, when his or her household’s total income would be 3.71 (rounded 4) in accordance with the first regression model. Recording: 4 stands for 500 to under 1000 euros per month. Comparing this peron with the previously mentioned one in section “case as interpretation”, it can be concluded that the mentioned above empirical hypothesis is proved.

Model 2

In the second model just new variable is added. So, this model has two predictors – years spent on education and age of respondent.

# Model 2
model2 <- lm(hinctnta ~ eduyrs + agea, data = data)
summary(model2)
## 
## Call:
## lm(formula = hinctnta ~ eduyrs + agea, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.6719 -1.5637  0.4866  1.7940  4.0124 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept) 1.493181   0.546080   2.734             0.006391 ** 
## eduyrs      0.272563   0.026213  10.398 < 0.0000000000000002 ***
## agea        0.029861   0.007735   3.860             0.000123 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.348 on 787 degrees of freedom
## Multiple R-squared:  0.132,  Adjusted R-squared:  0.1298 
## F-statistic: 59.85 on 2 and 787 DF,  p-value: < 0.00000000000000022
par(mfrow = c(2,3))
plot(density(model2$res), main = "Model 2. Residuals")
plot(model2)

Coefficients

  • the intercept (β0) is 1.49.
  • the slope (β1) is 0.27.
  • the slope (β2) is 0.03.
  • equation:

$\hat{Income}$ = 1.49 + 0.27 ⋅ eduyrs + 0.03 ⋅ agea + ei

As follows, beta coefficients are statistically significant (according to t-statistics and p-value that is < 0.001).

Residuals

  • Median equals to 0.49.
  • Minimum (-6.67) and maximum (4) are roughly closed to each other in their absolute values.
  • The Residual standard error (RSE) is 2.35 meaning that the observed values deviate from the true regression line by approximately 2 units in average.

Model fit

  • F-statistic is equal to 59.85 on 2 and 787 DF, and p-value is less than 0.001. So, it can be concluded that probability of that coefficients of predictors are equal to zero is extremely low. Good for us!
  • R-squared is edual to 0.132 (adjusted R-squared is 0.1298). It says that about 13% of the data is explained by the model.

Case as interpretation

Suppose that person has spent 13 years on education and he or she is 20 years old, it means that the income would be equal to 5,632769 (6 rounded). Recording: 6 means that income is from 1500 to 2000 euros monthly.

Checking hypotheses One of hypotheses is that being older, a person earns more (= has more income). For instance, person whose age is 30 has spent 13 years on education, when his or her household’s total income would be 5.9 (rounded 6) in accordance with the first regression model. Recording: 6 means that income is from 1500 to 2000 euros monthly. Comparing this peron with the previously mentioned one in section “case as interpretation”, it can be concluded that the mentioned above empirical hypothesis is not fully proved.

Model 3

The third model also has income as outcome, but for now there are three predictors – years spent on education, age and gender.

# Model 3
model3 <- lm(hinctnta ~ eduyrs + agea + gndr, data = data)
summary(model3)
## 
## Call:
## lm(formula = hinctnta ~ eduyrs + agea + gndr, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9200 -1.5908  0.4021  1.8006  4.1356 
## 
## Coefficients:
##              Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)  1.881150   0.551293   3.412             0.000677 ***
## eduyrs       0.274019   0.025999  10.540 < 0.0000000000000002 ***
## agea         0.027551   0.007696   3.580             0.000365 ***
## gndrFemale  -0.625940   0.166319  -3.764             0.000180 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.329 on 786 degrees of freedom
## Multiple R-squared:  0.1474, Adjusted R-squared:  0.1441 
## F-statistic: 45.29 on 3 and 786 DF,  p-value: < 0.00000000000000022
par(mfrow = c(2,3))
plot(density(model3$res), main = "Model 3. Residuals")
plot(model3)

Coefficients

  • the intercept (β0) is 1.88.
  • the slope (β1) is 0.27.
  • the slope (β2) is 0.03.
  • the slope (β3) is -0.63.
  • equation:

$\hat{Income}$ = 1.88 + 0.27 ⋅ eduyrs + 0.03 ⋅ agea − 0.63 ⋅ gndrFemale + ei

As follows, beta coefficients are statistically significant (according to t-statistics and p-value that is < 0.001).

Residuals

  • Median equals to 0.4.
  • Minimum (-6.9) and maximum (4) are roughly closed to each other in their absolute values.
  • The Residual standard error (RSE) is 2.33 meaning that the observed values deviate from the true regression line by approximately 2 units in average.

Model fit

  • F-statistic is equal to 45.29 on 3 and 786 DF, and p-value is less than 0.001. So, it can be concluded that probability of that coefficients of predictors are equal to zero is extremely low. Good for us!
  • R-squared is edual to 0.1474 (adjusted R-squared is 0.1441). It says that about 14-15% of the data is explained by the model.

Case as interpretation

The same person, who has spent 13 years on education and who is 20 years old, but now we define the gender of this person as male (gndrFemale = 0). It means that the income would be equal to 5,992651 (6 rounded). Recording: 6 means that income is from 1500 to 2000 euros monthly.

Checking hypotheses One of hypotheses is that males would get higher income than females. If the gender of previously mentioned person in section “case as interpretation” was female (gndrFemale = 1), the income would be equal to 5.36 (5 rounded). Recording: 5 means that income is from 1000 to 1500 euros monthly. So, gender inequality in income is noticed. The hypothesis is supported.

Model 4

# Model 4
model4 <- lm(hinctnta ~ eduyrs + agea + gndr + ctzcntr, data = data)
summary(model4)
## 
## Call:
## lm(formula = hinctnta ~ eduyrs + agea + gndr + ctzcntr, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9291 -1.4535  0.4041  1.7570  4.4567 
## 
## Coefficients:
##              Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)  2.358474   0.565981   4.167            0.0000343 ***
## eduyrs       0.258953   0.026220   9.876 < 0.0000000000000002 ***
## agea         0.024091   0.007716   3.122             0.001861 ** 
## gndrFemale  -0.648123   0.165381  -3.919            0.0000967 ***
## ctzcntrNo   -1.049527   0.313357  -3.349             0.000849 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.314 on 785 degrees of freedom
## Multiple R-squared:  0.1594, Adjusted R-squared:  0.1551 
## F-statistic: 37.21 on 4 and 785 DF,  p-value: < 0.00000000000000022
par(mfrow = c(2,3))
plot(density(model4$res), main = "Model 4. Residuals")
plot(model4)

Coefficients

  • the intercept (β0) is 2.36.
  • the slope (β1) is 0.26.
  • the slope (β2) is 0.02.
  • the slope (β3) is -0.65.
  • the slope (β4) is -1.05.
  • equation:

$\hat{Income}$ = 2.36 + 0.26 ⋅ eduyrs + 0.02 ⋅ agea − 0.65 ⋅ gndrFemale − 1.05 ⋅ ctzcntrNo + ei

As follows, beta coefficients are statistically significant (according to t-statistics and p-value that is < 0.001).

Residuals + Median equals to 0.41. + Minimum (-6.9) and maximum (4.5) are roughly closed to each other in their absolute values. + The Residual standard error (RSE) is 2.314 meaning that the observed values deviate from the true regression line by approximately 2 units in average.

Model fit

  • F-statistic is equal to 37.21 on 4 and 785 DF, and p-value is less than 0.001. So, it can be concluded that probability of that coefficients of predictors are equal to zero is extremely low. Good for us!
  • R-squared is edual to 0.1594 (adjusted R-squared is 0.1551). It says that about 15% of the data is explained by the model.

Case as interpretation

Male person (gndrFemale = 0), who is 20 years old and who has spent 13 years on education does not have a status of citizen (ctzcntrNo = 1). As follows, the income would be equal to 5,08 (5 rounded). But if this person has this statuse, the income would be 6.13 (rounded 6). Recording: 5 means that income is from 1000 to 1500 euros monthly, 6 means that income is from 1500 to 2000 euros monthly.

Checking hypotheses One of hypotheses is that people with status of citizen would get higher income than people without it. As illustrated above in section “case as interpretation”, this hypothesis is supported.

Model 5

The specific point of the 5th model is a two-way interaction. This model includes income as outcome, and there are for some predictors – years spent on education, age, gender, presence or absence of German citizenship and working schedule, where gender is presented as moderator.

# Model 5
model5 <- lm(hinctnta ~ eduyrs + agea + gndr + ctzcntr + icwhct*gndr, data = data)
summary(model5)
## 
## Call:
## lm(formula = hinctnta ~ eduyrs + agea + gndr + ctzcntr + icwhct * 
##     gndr, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9198 -1.4213  0.4053  1.7057  4.5123 
## 
## Coefficients:
##                                           Estimate Std. Error t value
## (Intercept)                               2.240270   0.566785   3.953
## eduyrs                                    0.264546   0.026239  10.082
## agea                                      0.024447   0.007769   3.147
## gndrFemale                               -0.530403   0.172285  -3.079
## ctzcntrNo                                -1.028761   0.312702  -3.290
## icwhctNon-fixed working hours             0.177163   0.412123   0.430
## gndrFemale:icwhctNon-fixed working hours -1.305154   0.579162  -2.254
##                                                      Pr(>|t|)    
## (Intercept)                                         0.0000843 ***
## eduyrs                                   < 0.0000000000000002 ***
## agea                                                  0.00171 ** 
## gndrFemale                                            0.00215 ** 
## ctzcntrNo                                             0.00105 ** 
## icwhctNon-fixed working hours                         0.66740    
## gndrFemale:icwhctNon-fixed working hours              0.02450 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.306 on 783 degrees of freedom
## Multiple R-squared:  0.1676, Adjusted R-squared:  0.1613 
## F-statistic: 26.28 on 6 and 783 DF,  p-value: < 0.00000000000000022
par(mfrow = c(2,3))
plot(density(model5$res), main = "Model 5. Residuals")
plot(model5)

Coefficients

  • the intercept (β0) is 2.24.
  • the slope (β1) is 0.26.
  • the slope (β2) is 0.02.
  • the slope (β3) is -0.5.
  • the slope (β4) is -1.03.
  • the slope (β5) is 0.18.
  • the slope (β6) is -1.3.
  • equation:

$\hat{Income}$ = 2.24 + 0.26 ⋅ eduyrs + 0.02 ⋅ agea − 0.5 ⋅ gndrFemale − 1.03 ⋅ ctzcntrNo + 0.18 ⋅ icwhctNo − 1.3 ⋅ gndrFemale ⋅ icwhctNo + ei

As follows, previously used variables have beta coefficients that are statistically significant (according to t-statistics and p-value that is < 0.001). Although a new added variable related to working hours has no direct impact in the model as it is insignificant, in case of interaction this variable and gender there is a significance. Also, we can conclude that there is a moderation.

Residuals + Median equals to 0.41. + Minimum (-6.9) and maximum (4.5) are roughly closed to each other in their absolute values. + The Residual standard error (RSE) is 2.315 meaning that the observed values deviate from the true regression line by approximately 2 units in average.

Model fit

  • F-statistic is equal to 37.13 on 4 and 784 DF, and p-value is less than 0.001. So, it can be concluded that probability of that coefficients of predictors are equal to zero is extremely low. Good for us!
  • R-squared is edual to 0.1593 (adjusted R-squared is 0.155). It says that about 15% of the data is explained by the model.

Interaction

plot_model(model5, type = "int")
plot(allEffects(model5))

As can be seen from the first graph, males are associated with having higher household’s total income in both cases of having or not a set basic or contracted number of hours. Additionally, plots are created to demonstrate effects.

Marginal effects

Marginal effects show how the outcome would change in case of 1-unit change of the predictor, taking into account all the other predictors.

mrgns <- margins(model5)
summary(mrgns)
##                         factor     AME     SE       z      p   lower
##                           agea  0.0244 0.0078  3.1469 0.0017  0.0092
##                      ctzcntrNo -1.0288 0.3127 -3.2899 0.0010 -1.6416
##                         eduyrs  0.2645 0.0262 10.0821 0.0000  0.2131
##                     gndrFemale -0.6460 0.1648 -3.9201 0.0001 -0.9691
##  icwhctNon-fixed working hours -0.4622 0.2921 -1.5824 0.1135 -1.0347
##    upper
##   0.0397
##  -0.4159
##   0.3160
##  -0.3230
##   0.1103
plot(mrgns)

The highest average value of marginal effects has variable related to studying years (eduyrs), the lowest one is gender that is female. All these variable are extremely important in the model as their effects demonstrate how value of the outcomes might be significantly changed.

Standardised coefficients

plot_model(model5, type = "std")

According to the plot presented above, years spent on education can predict household’s total income better than the other variables.

Case as interpretation

Here is again 20 year-old male, who has spent 13 years on education and he is not the citizen of Germany (ctzcntrNo = 1). Having added working hours schedule and gender, suppose, that he has schedule (icwhctNo = 1) and his gender is “male” (gndrFemale = 0). So, the income equals to 5.17 (rounded 5). Recording: 5 means that income is from 1000 to 1500 euros monthly.

Checking hypotheses

One of our hypothesis is that _ people with the same working schedule characteristics but different gender would get dissimilar income_. So, let us continue and assume that there is almost identical person, but the gender is female. In this case the income is 3.37 (rounded 3). Recording: 3 means that income is 300 to under 500 euros montly. Overall, the mentioned hypothesis is supported.

Model Comparisons

# ANOVA
anova(model1, model2, model3, model4, model5)
## Analysis of Variance Table
## 
## Model 1: hinctnta ~ eduyrs
## Model 2: hinctnta ~ eduyrs + agea
## Model 3: hinctnta ~ eduyrs + agea + gndr
## Model 4: hinctnta ~ eduyrs + agea + gndr + ctzcntr
## Model 5: hinctnta ~ eduyrs + agea + gndr + ctzcntr + icwhct * gndr
##   Res.Df    RSS Df Sum of Sq       F     Pr(>F)    
## 1    788 4422.6                                    
## 2    787 4340.4  1    82.187 15.4609 0.00009169 ***
## 3    786 4263.6  1    76.831 14.4534  0.0001548 ***
## 4    785 4203.5  1    60.069 11.3001  0.0008126 ***
## 5    783 4162.2  2    41.255  3.8805  0.0210392 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As can be seen from the table above, the model is improving hierarchically due to new added variables. It is proved by less p-values (< 0.05) and RSS which tends to reduce. As for F-statistics, they also show that model is improving because the null hypothesis about absence of model improvment is rejected. Overall, the 5th model with interaction is better than the others.

Summing Up

tab_model(model1, model2, model3, model4, model5)
  hinctnta hinctnta hinctnta hinctnta hinctnta
Predictors Estimates CI p Estimates CI p Estimates CI p Estimates CI p Estimates CI p
(Intercept) 2.90 2.09 – 3.70 <0.001 1.49 0.42 – 2.56 0.006 1.88 0.80 – 2.96 0.001 2.36 1.25 – 3.47 <0.001 2.24 1.13 – 3.35 <0.001
eduyrs 0.27 0.22 – 0.32 <0.001 0.27 0.22 – 0.32 <0.001 0.27 0.22 – 0.32 <0.001 0.26 0.21 – 0.31 <0.001 0.26 0.21 – 0.32 <0.001
agea 0.03 0.01 – 0.05 <0.001 0.03 0.01 – 0.04 <0.001 0.02 0.01 – 0.04 0.002 0.02 0.01 – 0.04 0.002
Female -0.63 -0.95 – -0.30 <0.001 -0.65 -0.97 – -0.32 <0.001 -0.53 -0.87 – -0.19 0.002
No -1.05 -1.66 – -0.44 0.001 -1.03 -1.64 – -0.42 0.001
Non-fixed working hours 0.18 -0.63 – 0.98 0.667
gndrFemale:icwhctNon-fixed working hours -1.31 -2.44 – -0.17 0.025
Observations 790 790 790 790 790
R2 / R2 adjusted 0.116 / 0.114 0.132 / 0.130 0.147 / 0.144 0.159 / 0.155 0.168 / 0.161

This table depicts the all results of all models. It can be said that our models improve with add new predictors – it can be concluded grounding on the fact that with adding new predictor each time R-square increase, which means that each time our model become more explained. We start with explain 12% (rounded) of model and end with explaining 17% (rounded).

With this table we can conclude, whether our hypotheses were confirmed. We see from all models that more years person spent on education, the bigger size of income her or his household gets – the hypothesis 1 was confirmed. Starting from the second model we can conclude that older person is, the bigger size of income he or she has - hypotheses 2 was confirmed. We also see that being female decrease the income, so our hypothesis 3 that males get income higher than females was confirmed. We can conclude that non-migrants get higher income than migrants, and it confirms our hypothesis 4. We can also conclude, that people with who do not have fixed working hours have higher income – and it confirms our hypothesis 5 that different types of working hours schedule affect to different income. And at the end we can say that the type of working hours with gender together affect the income – which confirm our last hypothesis 6 of interactive effect.