The whole project is based on the data collected by European Social Survey (ESS8-2016 Edition 2.1 was released on 1st of December 2018) that is available for downloading on the website. As for the main topic on which we focus is a migration process in Germany. More information about Germany is availiable here.
Alexandra Shanina was responsible for graph descriptions and wrote code for variable selecting and filtering for further graph construction.
Milena Oleshko created descriptive tables for all used variables, constructed scatter-plot and helped to code other graphs.
Zakharova Victoria organized .Rmd and constructed graphs (except the scatter-plot).
Overall, we had several group discussions that were about a deal of aspects such as distinguishing possible variables for the whole project, for instance.
Note: we wanted to organize our work in the following way: “what we do” -> “code” -> “graph” -> “description”. Hope, we did it :)
As our topic is migration in Germany, we attempted to choose appropriate variables. We are going to work with the following ones:
setwd("C:/Users/Victoria/Desktop/Data Analysis in Sociology/Project")
Variable <- c("gndr","agea", "imsclbn","impcntr", "livecnta", "hhmmb", "wkhtot","ctzcntr", "mainact", "domicil" )
LevelOfMeasure <- c("Nominal", "Ratio", "Interval", "Interval", "Ratio", "Ratio", "Ratio", "Nominal", "Nominal", "Nominal")
mode_is_it_possible <- c("Yes","Yes","Yes","Yes","Yes","Yes","Yes","Yes","Yes","Yes")
median_is_it_possible <- c("No", "Yes", "No", "No", "Yes", "Yes", "Yes", "No", "No","No")
mean_is_it_possible <- c("No", "Yes", "Yes", "Yes","No","No","No","No","No","No")
QualOrQuant <- c("Qualitive", "Quantitative", "Quantitative","Quantitative","Quantitative","Quantitative","Quantitative", "Qualitive", "Qualitive", "Qualitive")
ContinOrDiscrete <- c("None", "Discrete", "Discrete", "Discrete", "Continuous", "Continuous","Continuous", "None","None", "None")
DescriptionOfVariables <- c("Gender", "Age of respondent, calculated", "When should immigrants obtain rights to social benefits/services", "Allow many/few immigrants from poorer countries outside Europe", "What year you first came to live in country", "Number of people living regularly as member of household", "Total hours normally worked per week in main job overtime included", "Citizen of country", "Main activity last 7 days", "Domicile, respondent's description")
Variables_discription1 <- data.frame(Variable, LevelOfMeasure, QualOrQuant, ContinOrDiscrete, mode_is_it_possible, median_is_it_possible, mean_is_it_possible)
knitr::kable(Variables_discription1)
| Variable | LevelOfMeasure | QualOrQuant | ContinOrDiscrete | mode_is_it_possible | median_is_it_possible | mean_is_it_possible |
|---|---|---|---|---|---|---|
| gndr | Nominal | Qualitive | None | Yes | No | No |
| agea | Ratio | Quantitative | Discrete | Yes | Yes | Yes |
| imsclbn | Interval | Quantitative | Discrete | Yes | No | Yes |
| impcntr | Interval | Quantitative | Discrete | Yes | No | Yes |
| livecnta | Ratio | Quantitative | Continuous | Yes | Yes | No |
| hhmmb | Ratio | Quantitative | Continuous | Yes | Yes | No |
| wkhtot | Ratio | Quantitative | Continuous | Yes | Yes | No |
| ctzcntr | Nominal | Qualitive | None | Yes | No | No |
| mainact | Nominal | Qualitive | None | Yes | No | No |
| domicil | Nominal | Qualitive | None | Yes | No | No |
Variables_discription2 <- data.frame(Variable, DescriptionOfVariables)
knitr::kable(Variables_discription2)
| Variable | DescriptionOfVariables |
|---|---|
| gndr | Gender |
| agea | Age of respondent, calculated |
| imsclbn | When should immigrants obtain rights to social benefits/services |
| impcntr | Allow many/few immigrants from poorer countries outside Europe |
| livecnta | What year you first came to live in country |
| hhmmb | Number of people living regularly as member of household |
| wkhtot | Total hours normally worked per week in main job overtime included |
| ctzcntr | Citizen of country |
| mainact | Main activity last 7 days |
| domicil | Domicile, respondent’s description |
Germany <- haven::read_sav("ESS8DE.sav")
library(haven)
library(ggplot2)
library(dplyr)
library(cowplot)
library(knitr)
The first graph is a histogram created for single continuous variable(s). We would like to look at the distribution of ages of people who were for the migrants’ immediate access to social benefits. Here it is!
### Choose some variables such as age, gender and opinions about the mentioned issue for construction of histogram
## for male
data_hist1 <- Germany %>%
select(imsclbn, gndr, agea) %>%
na.omit() %>%
filter(gndr == 1 & imsclbn == 1)
## for female
data_hist2 <- Germany %>%
select(imsclbn, gndr, agea) %>%
na.omit() %>%
filter(gndr == 2 & imsclbn == 1)
### Construct two histograms according to gender and depict them on one picture
pic_hist1 <- ggplot() +
geom_histogram(data = data_hist1, aes(x = as.numeric(agea)),
col = "#6F15A4", fill = "#D5ACEE", alpha = 0.75, binwidth = 1) +
geom_vline(aes(xintercept = mean(data_hist1$agea)), color="#490A72", size=0.75)+
geom_vline(aes(xintercept = median(data_hist1$agea)), linetype="dashed",
color="#D9137F", size=0.75) +
labs(title = "Distribution of men's age",
x = "Age", y = "Frequency") +
theme_minimal()
pic_hist2 <- ggplot() +
geom_histogram(data = data_hist2, aes(x = as.numeric(agea)),
col = "#D58E22", fill = "#FACB84", alpha = 0.75, binwidth = 1) +
geom_vline(aes(xintercept = mean(data_hist2$agea)), color="#490A72", size=0.75)+
geom_vline(aes(xintercept = median(data_hist2$agea)), linetype="dashed",
color="#D9137F", size=0.75) +
labs(title = "Distribution of women's age",
x = "Age", y = "Frequency") +
theme_minimal()
two_hist <- plot_grid(pic_hist1, pic_hist2, labels = "AUTO")
title <- ggdraw() + draw_label("Age of people who are for the migrants' immediate access to social benefits", fontface='bold', size = 12.5)
plot_grid(title, two_hist, ncol = 1, rel_heights = c(0.1, 1))
### Now we create a table with mean and median of ages for more accurate visibility
mean_male_age <- mean(data_hist1$agea)
median_male_age <- median(data_hist1$agea)
mean_female_age <- mean(data_hist2$agea)
median_female_age <- median(data_hist2$agea)
df_age <- data.frame(mean_male_age, mean_female_age, median_male_age, median_female_age)
kable(df_age)
| mean_male_age | mean_female_age | median_male_age | median_female_age |
|---|---|---|---|
| 43.35758 | 44.03425 | 43 | 43 |
This graph show the amount of people according to their age group and gender, who agree with the statement that migrants should have the immediate access to social benefits. There were used such variables as Calculated Age (agea), Gender (gndr) and When should immigrants obtain rights to social benefits/services (imsclbn).
It was interesting to explore which age group more frequently answered on “When should immigrants obtain rights to social benefits/services” that immigrants should be treated with all social benefits as soon as they come to the country. That is why, the answers on this particular question were filtered and only one variant of answer was chosen: “Immediately on arrival”.
It can be observed from the graph that males aged thirty, forty and sixty are more likely to provide migrants with all social benefits immediately, thus it could be said they are more tolerant towards migrants. In female group could be seen that women in the age of thirty and fifty are tend to agree with the statement “Migrants should obtain rights to social benefits Immediately on arrival”. Both genders from sixty-five to seventy-five are more conservative, they are less likely to answer that way.
Now, it is time to create a barplot that is created for single categorical variable(s). Doing it, we decided to choose a variable such as viewpoints about people from the poorer countries outside Europe and make a division according to respondents’ gender.
### Select some variables in which we are interested in (main one is opinions and one for division)
data_bar_men <- Germany %>% select(impcntr, gndr) %>% na.omit() %>%
filter((impcntr != 7 | impcntr != 8 | impcntr != 9)
& (gndr == 1))
data_bar_women <- Germany %>% select(impcntr, gndr) %>% na.omit() %>%
filter((impcntr != 7 | impcntr != 8 | impcntr != 9)
& (gndr == 2))
### Rename the chosen variables for better presentation on graphs
data_bar_men$impcntr <- factor(data_bar_men$impcntr, levels = c(1, 2, 3, 4), labels = c("Allow many \nto come and live \nhere", "Allow some", "Allow a few", "Allow none"))
data_bar_men$gndr <- factor(data_bar_men$gndr, levels = c(1), labels = c("Male"))
data_bar_women$impcntr <- factor(data_bar_women$impcntr, levels = c(1, 2, 3, 4), labels = c("Allow many \nto come and live \nhere", "Allow some", "Allow a few", "Allow none"))
data_bar_women$gndr <- factor(data_bar_women$gndr, levels = c(2), labels = c("Female"))
### Construct two barplots and depict them on one picture
barplot1 <- ggplot() + geom_bar(data = data_bar_men, aes(x = as.factor(impcntr)), fill = "#D5ACEE") +
xlab(" ") +
ylab("Number of men") +
theme(axis.text.x = element_text(angle=90)) +
theme_bw(base_size = 9)
barplot2 <- ggplot() + geom_bar(data = data_bar_women, aes(x = as.factor(impcntr)), fill = "#FACB84") +
xlab(" ") +
ylab("Number of women") +
theme(axis.text.x = element_text(angle=90)) +
theme_bw(base_size = 9)
two_bar <- plot_grid(barplot1, barplot2, labels = "AUTO")
title_bar <- ggdraw() + draw_label("Viewpoints about people from the poorer countries outside Europe", fontface='bold', size = 12.5)
plot_grid(title_bar, two_bar, ncol=1, rel_heights=c(0.1, 1))
This bar chart illustrates the different attitudes of both genders Germans towards immigrants from Non-European countries. We used Gender (gndr) and Allow many/few immigrants from poorer countries outside Europe (impcntr) variables to build this graph.
Men and Women in Germany shows the same results. Both genders are prefer to choose “Allow some”, so people are seems to be rather tolerant towards immigrant. It thesis also can be proved by considering that the smallest number of people answered “Allow none”.
The next is a scatter-plot that is created for bivariate distribution of continuous variables. Here we want to look at the correlation between number of people living in a household and year of migration.
### We do similar manipulations as we have done before:
## a) choosing variables
data_sc <- Germany %>%
select(livecnta, hhmmb) %>% na.omit()
## b) and construction graph
pic_sc <- ggplot(data_sc, aes(x = as.numeric(livecnta), y = as.numeric(hhmmb))) +
geom_point(shape = 1) + geom_smooth(method = lm) +
ylab("Number of people in the household") +
xlab("Year in which people came to Germany") +
ggtitle("Correlation between members of household and year of migration")
pic_sc
This scatter plot depicts the relation between year of migrating to Germany and the number of family members. By this graph we wanted to observe how many people moved to Germany with their families or started a family during 1940 and 2000 years. Variables are used in the graph: Number of people living regularly as member of household (hhmmb) and What year you first came to live in country (livecnta).
It is clearly seen from the graph, that the number of migrants, who came to the Germany increased a lot. Here is depicted the upward trend line. More and more people started to move to Germany with their families or started a family, that can say, that migrants seek not for labor opportunities, but also for better living conditions.
Further, let us make things more complicated and construct a boxplot for bivariate distribution of a continuous and categorical variables.
### We do similar manipulations as we have done before:
## a) choosing variables
data_box1 <- Germany %>%
select(wkhtot, ctzcntr, mainact) %>% na.omit()
## b) renaming the chosen variables for better presentation on graphs
data_box1$ctzcntr <- factor(data_box1$ctzcntr, levels = c(1, 2), labels = c("Yes", "No"))
## c) and construction graph
box1 <- ggplot() +
geom_boxplot(data = data_box1, aes(x = as.factor(mainact), y = as.numeric(wkhtot))) +
theme_bw() +
facet_grid(~as.factor(data_box1$ctzcntr)) +
ylab("Total hours worked per week in main job") +
xlab("Main activity last 7 days") +
ggtitle("Variety of working activity depending on citizenship") +
theme_set(theme_bw(base_size = 10))
### There is a creation of a table that describes meaning of numbers regarding to activity
activity_number <- c(1:9)
text_of_description <- c("Paid work", "Education", "Unemployed, looking for job", "Unemployed, not looking for job", "Permanently sick or disabled", "Retired", "Community or military service", "Housework, looking after children, others", "Other")
desc_df <- data.frame(activity_number, text_of_description)
box1
kable(desc_df)
| activity_number | text_of_description |
|---|---|
| 1 | Paid work |
| 2 | Education |
| 3 | Unemployed, looking for job |
| 4 | Unemployed, not looking for job |
| 5 | Permanently sick or disabled |
| 6 | Retired |
| 7 | Community or military service |
| 8 | Housework, looking after children, others |
| 9 | Other |
That boxplot gives the information about the difference in activities of immigrants and citizens. To make this figure we used “Main activity last 7 days” (mainact) and “Total hours normally worked per week in main job overtime included” (wkhtot)
The graph shows that there are more immigrants, who unemployed and not looking for a job (4), there are less people without work among citizens. Immigrants are more engaged in the service sector (8) than citizens are. Also, what is interesting, there are little number of immigrants who work in education sphere (2), community and military service (8). More than that it is seen from the graph that there are no retired immigrants (6), So that point is obvious because we were observing people based on citizenship presence. What is more, there are small number among immigrants, who are sick or disabled (5), while there are a lot of people with citizenship which are permanently sick (9).
Overall, there is a gap among owner of citizenship and immigrants. Immigrants are tend to have low-paid jobs, do not have any privileges. Immigrants engaged mostly in service sphere.
The last graph is another type of barplot: stacked barplot for bivariate distribution of categorical variables.
data_st_bar <- Germany %>%
select(domicil, impcntr) %>%
na.omit() %>%
filter((impcntr != 7 | impcntr != 8 | impcntr != 9)
& (domicil != 7 | domicil != 8 | domicil != 9))
data_st_bar$impcntr <- factor(data_st_bar$impcntr, levels = c(1, 2, 3, 4), labels = c("Allow many to come and live here", "Allow some", "Allow a few", "Allow none"))
data_st_bar$domicil <- factor(data_st_bar$domicil, levels = c(1, 2, 3, 4, 5), labels = c("A big city", "Suburbs or\n outskirts of \n big city", "Town or \nsmall city", "Country \nvillage", "Farm or home \nin countyside"))
data_st_bar <- data_st_bar %>% group_by(domicil, impcntr) %>% summarise(number = n())
data_st_bar$domicil <- as.factor(data_st_bar$domicil)
data_st_bar$impcntr <- as.factor(data_st_bar$impcntr)
st_bar <- ggplot(data = data_st_bar, aes(x = domicil, y = number, fill = impcntr)) +
geom_bar(stat = "identity", position = "fill") +
xlab("Place of living") +
ylab("Number of people") +
theme_bw() +
labs(fill = "Variation of opinions", title = "Viewpoints about people from the poorer countries outside Europe \nexpressed by respondents who differ in places of dwelling") +
scale_fill_brewer(palette = "Set3") +
theme_set(theme_bw(base_size = 10))
st_bar
Here is a stacked bar plot which demonstrates how citizens of different country areas accept migrants from poor countries to live with them. By “accepting” it is meant what people answered on the question How about people from the poorer countries outside Europe?: Allow many is acception, allow none is refusal. This graph was constructed by these variables: Allow many/few immigrants from poorer countries outside Europe (impcntr) and Domicile, respondent’s description (domicil).
As it is seen from the bar plot, people from big cities and suburbs are likely to allow some people from poor countries and here is a little amount of people who reject any immigrants from poor countries, so it means that people from big cities are tollerante towards immigrants. The biggest part of native population , who do not allow immigrants or allow only few, is concentrated in small towns. It can be said that citizens of urban area much more tolerant towards immigrants from poorer countries than people, who live in towns or countryside.
Alexandra Shanina prepared data for the tests, wrote hypotheses for chi-squared test, constructed some graphs, and interpreted graphs and the results of chi-squared test.
Milena Oleshko wrote hypotheses for t-test, constructed some graphs, and interpreted graphs and the results of t-test.
Zakharova Victoria planned the project, defined essential libraries, wrote assumptions for two tests, organized .Rmd, helped and supported A. Shanina and M. Oleshko.
All in all, there were several group discussions about some aspects such as defining possible variables for the project, constructing graphs and properties for them.
# In order to see the full-length numbers the following code is implemented
options(scipen = 999)
# Load the data
library(foreign)
Germany <- read.spss("ESS8DE.sav", use.value.labels = T, to.data.frame = T)
The general idea of the Chi-squared test it to check whether there is dependence between two categorical variables or not.
In order to conduct this type of test, the presented below variables were chosen.
Abbreviation <- c("ctzcntr", "tporgwk")
LevelOfMeasure <- c("Binominal", "Nominal")
Definition <- c("Citizen of country", "What type of organisation work/worked for")
vardesc_1 <- data.frame(Abbreviation, LevelOfMeasure, Definition)
knitr::kable(vardesc_1)
| Abbreviation | LevelOfMeasure | Definition |
|---|---|---|
| ctzcntr | Binominal | Citizen of country |
| tporgwk | Nominal | What type of organisation work/worked for |
Choosing such variables, the following hypothesis is supposed to be checked:
The status of a citizen of the country does not affect the type of organization in which a person works. In other words, there is independence between these variables.
H0: there is independence between being a citizen of the country and type of organization worked for.
H1: the opposite case of H0.
library(tidyverse)
library(sjPlot)
cst_data <- Germany %>%
select(tporgwk, ctzcntr) %>%
filter(tporgwk != 66 & tporgwk != 77 & tporgwk != 88 &
tporgwk != 99 & ctzcntr != 7 & ctzcntr != 8 & ctzcntr != 9) %>%
na.omit()
# Representation of the data in a format of table
table(cst_data$tporgwk, cst_data$ctzcntr)
##
## Yes No
## Central or local government 198 4
## Other public sector (such as education and health) 372 18
## A state owned enterprise 163 6
## A private firm 1567 102
## Self employed 180 7
## Other 75 9
# Create a table as variable
cst_tbl <- matrix(c(198, 372, 163, 1567, 180, 75, 4, 18, 6, 102, 7, 9), nrow = 6)
rownames(cst_tbl) <- c("Central or local government",
"Other public sector (such as education and health)",
"A state owned enterprise",
"A private firm",
"Self employed",
"Other")
colnames(cst_tbl) <- c("Citizen","Non-citizen")
As can be noticed, there is a cell with a value that is lower than 5. However, the chi-squared test can be conducted as according to an assumption such values can be no more than in 20% of cells.
# Visualization
sjp.xtab(cst_data$tporgwk, cst_data$ctzcntr,
margin = "row",
bar.pos = "stack", coord.flip = FALSE,
vjust = "right", hjust = "center",
title = "The proportions of citizens and non-citizens in different types of organizations",
legend.title = "Citizen of country",
axis.titles = "Type of organisation",
expand.grid = TRUE, geom.colors = c("#bfdbed", "#158cba"))
This stacked barplot demonstrates the percentage distribution of people with and without citizenship in the German labor market. It is clearly seen that the number of positions occupied by non-citizens is too small. Obviously, there is migrant discrimination. The most closed positions are governmental, there can be seen very small percentage of people without citizenship, while private firms are to employ migrants. It is also interesting to observe public sector position, because 4,8% places are occupied by migrants.
chisq.test(cst_tbl)
##
## Pearson's Chi-squared test
##
## data: cst_tbl
## X-squared = 13.516, df = 5, p-value = 0.019

chi_test2 <- chisq.test(cst_tbl)
As can be seen, the chi-squared test is significant as the critical chi-square statistic value for p = 0.05 (95% confidence level) with 5 degrees of freedom is equal to 11.07. In this case, we have to take a look at residuals that can be got from the structure of the test.
str(chi_test2)
## List of 9
## $ statistic: Named num 13.5
## ..- attr(*, "names")= chr "X-squared"
## $ parameter: Named int 5
## ..- attr(*, "names")= chr "df"
## $ p.value : num 0.019
## $ method : chr "Pearson's Chi-squared test"
## $ data.name: chr "cst_tbl"
## $ observed : num [1:6, 1:2] 198 372 163 1567 180 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:6] "Central or local government" "Other public sector (such as education and health)" "A state owned enterprise" "A private firm" ...
## .. ..$ : chr [1:2] "Citizen" "Non-citizen"
## $ expected : num [1:6, 1:2] 191 369 160 1579 177 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:6] "Central or local government" "Other public sector (such as education and health)" "A state owned enterprise" "A private firm" ...
## .. ..$ : chr [1:2] "Citizen" "Non-citizen"
## $ residuals: num [1:6, 1:2] 0.501 0.16 0.248 -0.297 0.234 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:6] "Central or local government" "Other public sector (such as education and health)" "A state owned enterprise" "A private firm" ...
## .. ..$ : chr [1:2] "Citizen" "Non-citizen"
## $ stdres : num [1:6, 1:2] 2.238 0.746 1.102 -2.064 1.042 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:6] "Central or local government" "Other public sector (such as education and health)" "A state owned enterprise" "A private firm" ...
## .. ..$ : chr [1:2] "Citizen" "Non-citizen"
## - attr(*, "class")= chr "htest"
chi_test2$observed
## Citizen Non-citizen
## Central or local government 198 4
## Other public sector (such as education and health) 372 18
## A state owned enterprise 163 6
## A private firm 1567 102
## Self employed 180 7
## Other 75 9
round(chi_test2$expected,2)
## Citizen Non-citizen
## Central or local government 191.08 10.92
## Other public sector (such as education and health) 368.92 21.08
## A state owned enterprise 159.86 9.14
## A private firm 1578.78 90.22
## Self employed 176.89 10.11
## Other 79.46 4.54
round(chi_test2$residuals, 3)
## Citizen Non-citizen
## Central or local government 0.501 -2.094
## Other public sector (such as education and health) 0.160 -0.671
## A state owned enterprise 0.248 -1.037
## A private firm -0.297 1.241
## Self employed 0.234 -0.978
## Other -0.500 2.093
chi_test2$stdres
## Citizen Non-citizen
## Central or local government 2.238177 -2.238177
## Other public sector (such as education and health) 0.745909 -0.745909
## A state owned enterprise 1.101529 -1.101529
## A private firm -2.063630 2.063630
## Self employed 1.041856 -1.041856
## Other -2.186028 2.186028
Another assumption of the chi-squared test is held: expected values in all cells are bigger than 1.
Looking at the standardized residuals, it can be seen there are some cells values of which are lower than -2 (it means that the cell contains fewer observations that it was expected (the case of variables independence)) as well as are higher than 2 (it means that the cell contains more observations that it was expected). The following visualization of the standardized residuals demonstrates it, as well.
# Vizualization of residuals
library(graphics)
assocplot(cst_tbl, main = "The standardized residuals of observations")
library(corrplot)
corrplot(chi_test2$stdres, is.cor = FALSE, method = "number", tl.col = "black")
Note: positive residuals are in blue. Positive values in cells specify an attraction (positive association) between the corresponding row and column variables. Negative residuals are in red. This implies a repulsion (negative association) between the corresponding row and column variables.
According to the got results, it can be concluded that there is evidence against the null hypothesis. By this it means that there is dependence between the chosen variables such as the status of a citizen of the country and the type of organization in which a person works (X-squared = 13.516, p-value = 0.019).
Independent t-test allows to check whether there is a statistically significant difference between the means in two unrelated groups.
In order to conduct an independent t-test, the presented below variables were chosen.
Abbreviation_tt <- c("eduyrs", " gndr", "ctzcntr")
LevelOfMeasure_tt <- c("Ratio", "Binominal", "Binominal")
Definition_tt <- c("Years of full-time education completed", "Gender","Citizen of country")
vardesc_2 <- data.frame(Abbreviation_tt, LevelOfMeasure_tt, Definition_tt)
knitr::kable(vardesc_2)
| Abbreviation_tt | LevelOfMeasure_tt | Definition_tt |
|---|---|---|
| eduyrs | Ratio | Years of full-time education completed |
| gndr | Binominal | Gender |
| ctzcntr | Binominal | Citizen of country |
In this way, the purpose of conducting such type of test is to compare whether the mean values of spending years on education by non-citizens according to their gender are equal or not.
H0: the mean values of spending years on education by non-citizens taking into account their gender are equal.
H1: the opposite case of H0.
tt_data <- Germany %>%
select(eduyrs, gndr, ctzcntr) %>%
filter(eduyrs != 66 & eduyrs != 77 & eduyrs != 88 & eduyrs != 99 &
ctzcntr != 1 & ctzcntr != 7 & ctzcntr != 8 & ctzcntr != 9) %>%
na.omit()
tt_data$eduyrs <- as.numeric(tt_data$eduyrs)
tt_data$gndr <- as.factor(tt_data$gndr)
Looking at some basic descriptive statistics of a continuous variable such as years of full-time education completed.
library(psych)
describe(tt_data$eduyrs)
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 2849 13.26 3.31 13 13.1 2.97 1 27 26 0.46 0.51
## se
## X1 0.06
#Boxplot
ggplot(tt_data, aes(x = gndr, y = eduyrs)) +
geom_boxplot() +
stat_summary(fun.y = mean, geom = "point", shape = 3, size = 4) +
ggtitle("Distribution of years spent on education by non-citizens") +
labs(x = "", y = "Years of education") +
theme_bw() +
scale_y_continuous(breaks = 0:30*2)
At this box plot two distribution of spent years on education are depicted by gender among non-citizens in Germany. It can be seen that the median value of years spent by female is nearer to 12 and it is less than male’s value. Additionally, the outliers are presented. Also, according to this graph, the maximum value of female is 21 years and men have the bigger level which is equal to 23 years, and the minimum for women is 5 years while the minimum years spent on education among male is 4 years. The interquartile ranges are slightly different: female’s one is less than male’s one. Overall, the equality of means of spent years for both genders (“+” symbol) is quite difficult to define exactly without any calculations.
# Code for mode
getmode <- function(v) {
uniqv <- unique(v)
uniqv[which.max(tabulate(match(v, uniqv)))]
}
# Histogram 1
ggplot() +
geom_histogram(data = tt_data, aes(x = eduyrs, y=..density..),
position = "identity", binwidth = 1, alpha = 0.4) +
labs(title = "Distribution of spent years on education by non-citizens",
x = "Years spent on education by non-citizens", y = "Density") +
geom_vline(aes(xintercept = mean(tt_data$eduyrs), color = "mean"), size = 0.7) +
geom_vline(aes(xintercept = median(tt_data$eduyrs), color = "median"), size = 0.7) +
geom_vline(aes(xintercept = getmode(tt_data$eduyrs), color = "mode"), size = 0.7) +
scale_x_continuous(breaks = 0:35*5) +
scale_color_manual(name = "Statistics", values = c(mean = "blue", mode = "red", median = "black"))
There is presented histogram, which prints out the association between the number of years spend on education and the absence of German citizenship. Overall, it can be said that this data is distributed almost normally. However, the graph also demonstrates the mean (a blue line), the median, which is black, and the mode as a red line. These three central tendency measures are not overlaid. Thus, it was decided to observe a variable of years according to migrants’ gender: if there is any difference between men and women. To answer to this issue the t-test was used.
# Histogram 2
tt_data$gndr <- relevel(tt_data$gndr, ref = "Female")
ggplot(tt_data, aes(x = eduyrs, fill = gndr)) +
geom_histogram(aes(y=..density..), position = "identity", binwidth = 1, alpha = 0.6) +
labs(title = "Distribution of spent years on education by non-citizens",
subtitle = "taking into account gender",
x = "Years spent on education", y = "Density",
fill = "Gender") +
facet_grid(. ~ gndr) +
scale_x_continuous(breaks = 0:30*5) +
geom_vline(aes(xintercept = mean(tt_data$eduyrs), color="mean"), size = 0.7) +
geom_vline(aes(xintercept = median(tt_data$eduyrs), color = "median"), size = 0.7) +
geom_vline(aes(xintercept = getmode(tt_data$eduyrs), color = "mode"), size = 0.7) +
scale_color_manual(name = "Statistics", values = c(mean = "blue", mode = "red", median = "black"))
This histogram shows the connection between years spend on education and a gender of respondents among migrants. The median (a dotted line), the mode (a red line) and the mean (a blue line) are represented, as well. It can be got from the graph that there is no obvious difference between male and female: both genders are used to spend approximatly the same time for education. However, such statement as a hypothesis should be checked.
The Q-Q plot looks approximately normal as the distribution is slightly skewed.
bartlett.test(tt_data$eduyrs ~ tt_data$gndr)
##
## Bartlett test of homogeneity of variances
##
## data: tt_data$eduyrs by tt_data$gndr
## Bartlett's K-squared = 0.12866, df = 1, p-value = 0.7198
var.test(tt_data$eduyrs, tt_data$gndr)
##
## F test to compare two variances
##
## data: tt_data$eduyrs and tt_data$gndr
## F = 44.066, num df = 2848, denom df = 2848, p-value <
## 0.00000000000000022
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 40.94496 47.42568
## sample estimates:
## ratio of variances
## 44.06634
As for Bartlett’s test, the variances are equal (K-squared = 0.12866, p-value = 0.7198). The same conclusion can be made based on the results of F-test (F = 44.066, p-value < 2.2e-16).
t.test(tt_data$eduyrs ~ tt_data$gndr, var.equal = T)
##
## Two Sample t-test
##
## data: tt_data$eduyrs by tt_data$gndr
## t = -2.5737, df = 2847, p-value = 0.01011
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.56349294 -0.07616885
## sample estimates:
## mean in group Female mean in group Male
## 13.08799 13.40782
Summing up, the mean values of spending years on education by non-citizens according to their gender are not equal (t-statistic = -2.5737, p-value = 0.01011). There is slightly difference between women’s average years of completed education (m = 13.08799) and men’s ones (m = 13.40782).
Non-parametric test for two independent samples can also be conducted. Although this type of test has no assumptions, it is not much less powerful than t-test.
wilcox.test(eduyrs ~ gndr, data = tt_data)
##
## Wilcoxon rank sum test with continuity correction
##
## data: eduyrs by gndr
## W = 946220, p-value = 0.002912
## alternative hypothesis: true location shift is not equal to 0
According to the results of the Wilcoxon rank sum test, there is difference in years of completed education between both genders of non-citizens (W = 946220, p < 0.002912).
# In order to see the full-length numbers the following code is implemented
options(scipen = 999)
# Load libraries
library(foreign)
library(knitr)
library(tidyverse)
library(graphics)
library(psych)
library(dplyr)
library(ggplot2)
library(car)
library(magrittr)
library(kableExtra)
library(gplots)
library(DescTools)
Research puspose To refine how the years spent on training and the level of satisfaction with the current financial situation are related.
Variables
# Load data and prepare it
Germany <- read.spss("ESS8DE.sav", use.value.labels = T, to.data.frame = T)
dataset <- Germany %>% select(hincfel, eduyrs, livecnta) %>%
filter((eduyrs != 66 | eduyrs != 77 | eduyrs != 88 | eduyrs != 99)
& (livecnta != 6666 | livecnta != 7777 |
livecnta != 8888 | livecnta != 9999) &
(hincfel != "Refusal" | hincfel != "Don't know" | hincfel != "No answer")) %>% na.omit()
dataset$eduyrs <- as.numeric(as.character(dataset$eduyrs))
# Recoding variables to reduce lenght of their values
dataset$income <- rep(NA, length(dataset$hincfel))
dataset$income[dataset$hincfel == "Living comfortably on present income"] <- "Living comfortably"
dataset$income[dataset$hincfel == "Coping on present income"] <- "Coping"
dataset$income[dataset$hincfel == "Difficult on present income"] <- "Difficult"
dataset$income[dataset$hincfel == "Very difficult on present income"] <- "Very difficult"
dataset$income <- as.factor(dataset$income)
Abbreviation <- c("hincfel", "eduyrs")
LevelOfMeasure <- c("Nomial", "Ratio")
R_type <- c("factor", "numeric")
Definition <- c("Feeling about household's income nowadays", "Years of full-time education completed")
vardesc <- data.frame(Abbreviation, LevelOfMeasure, Definition, R_type)
kable(vardesc)
| Abbreviation | LevelOfMeasure | Definition | R_type |
|---|---|---|---|
| hincfel | Nomial | Feeling about household’s income nowadays | factor |
| eduyrs | Ratio | Years of full-time education completed | numeric |
In order to define some observation that are related only to people who have moved to Germany in former times, the data were selected in accordance with a variable of years when people came to Germany.
One-way ANOVA is used to perform statistical analysis in case of checking equality of means among three or more groups.
Assumptions
It should be noted that ANOVAs are fairly robust to violations to these assumptions. However, it is needed to be alert when these assumptions do not hold.
Hypotheses
H0: the difference between means of groups related to subjective perception of income and time spent on education equals to 0.
HA: the difference between means of groups related to subjective perception of income and time spent on education is not equal to 0.
The social mechanism, which was observed is the education one has or to be more precise, the years person spend on education. Educational attainment influences directly on the future occupation, thus it partly defines the future income of a person.
In the case of migrants, the educational process can be rather difficult, so not everyone wants to study a lot, sometimes they cannot afford it.People, migrating to another country, have to adapt, which takes time, to the educational system, learn a foreign language, which is official in the country they moved in. Teachers and tutors can have biased attitude towards migrants students, which also has an impact on educational attainment. Moreover, migrants do not always have enough money to let their children or themselves study at university, because education obviously costs a lot, maybe not directly, but by calculating time spend on it and other resources. So, it is really hard for migrants to get higher education and to dedicate their lives to the educational process. What is more, the majority of migrants move to another country to find a job, not for education. However, it is probably applicable only for migrants from non-European countries, because of the significant difference in cultures.
In our research, the main question is how migrants define their well-being. So, migrants were distributed according to their level of education. It was interesting to learn the subjective well-being status of migrants with different sums of years spend on education. The subjective well-being was used because it helps to understand how people estimate their material situation observing o people material conditions surrounding them in society, in our case in Germany.
Supporting the issue, the infographic below might be interesting for further considerations in general ( source ).
describeBy(dataset$eduyrs, dataset$income, mat = TRUE) %>%
select(Perception_of_income = group1, N = n, Mean = mean, SD = sd,
Median = median, Min = min, Max = max, Skew = skew,
Kurtosis = kurtosis, st.error = se) %>%
kable(align = c("lrrrrrrrr"), digits = 2, row.names = FALSE,
caption = "Spent time on education by subjective perception of income") %>%
kable_styling(bootstrap_options = c("bordered", "responsive","striped"),
full_width = FALSE)
| Perception_of_income | N | Mean | SD | Median | Min | Max | Skew | Kurtosis | st.error |
|---|---|---|---|---|---|---|---|---|---|
| Coping | 155 | 13.23 | 3.81 | 13 | 2 | 26 | 0.30 | 0.70 | 0.31 |
| Difficult | 40 | 11.65 | 3.89 | 11 | 5 | 22 | 0.48 | -0.17 | 0.62 |
| Living comfortably | 84 | 14.88 | 3.31 | 15 | 4 | 22 | -0.24 | 0.13 | 0.36 |
| Very difficult | 9 | 12.44 | 4.61 | 13 | 3 | 18 | -0.70 | -0.68 | 1.54 |
As can be seen, each skew, as well as kurtosis, is less than 2.
# Code for mode
getmode <- function(v) {
uniqv <- unique(v)
uniqv[which.max(tabulate(match(v, uniqv)))]
}
# Histogram 1
ggplot(dataset, aes(dataset$eduyrs)) +
geom_density(aes(data = dataset$eduyrs),
position = 'identity', alpha = 0.2) +
labs(x = 'Years spent on education', y = 'Density',
title = "Distribution of years spent on education") +
# scale_x_continuous(limits = c(0, 30)) +
geom_vline(aes(xintercept = mean(dataset$eduyrs), color="mean"), size = 0.7) +
geom_vline(aes(xintercept = median(dataset$eduyrs), color = "median"), size = 0.7) +
geom_vline(aes(xintercept = getmode(dataset$eduyrs), color = "mode"), size = 0.7) +
scale_color_manual(name = "Statistics", values = c(mean = "blue", mode = "red",
median = "black")) +
theme_bw()
According to the density plot, the variable of years is distributed not as it would like to be in ideal. However, it can be named normal (more or less).
QQ-plot demonstrates that mostly points seem not to fall about a straight line.
ggplot(dataset, aes(x = income, y = eduyrs)) +
geom_boxplot() +
stat_summary(fun.y = mean, geom = "point", shape = 3, size = 4) +
ggtitle("Distribution of years spent on education") +
labs(x = "", y = "Years spent on education",
subtitle = "per category of income perception") +
theme_bw() +
coord_flip() +
scale_y_continuous(breaks = 0:30*2)
From the boxplot, it can be seen that there are some outliers. Also, a variable, which stands for years spent on education, is not normally distributed in across the groups as it might be wanted. Overall, further the data will be treated as normal.
Homogeneity of variances
In order to check whether variances are equal or not, the Levene’s Test is applied.
# Check for homogeneity of variance
leve_test <- leveneTest(eduyrs ~ income, data = dataset)
leve_test
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 3 0.4516 0.7164
## 284
According to the Levene’s test, variances are equal (p-value = 0.7164). The null hypothesis about equality of variances is not rejected.
F-test
aov_test <- aov(dataset$eduyrs ~ dataset$income)
summary(aov_test)
## Df Sum Sq Mean Sq F value Pr(>F)
## dataset$income 3 318 106.14 7.711 0.0000571 ***
## Residuals 284 3909 13.76
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
It can be inferred that differences in years spent on education across groups classified by subjective perception of income are not equal in average as F(3, 284) = 7.711, p-value = 0.0000571. The F-ratio is significant.
Normality of residuals
In case of normally distributed residuals, there must be a straight red line in the two upper graphs, and a straight line in the Q-Q plot. Although the first upper graph has not such a straight line, the red line fractionally fluctuated.
The Scale-Location plot shows whether the residuals are spread equally along the predictor range, i.e. homoscedastic. The red line should be straightly horizontal. However, in this case it is not entirely so. The line slopes slightly up around 12, and then there are downward and straight trends. It can be said that the line consistes of several horizontal sections.
As for the Q-Q plot, it can be seen that a majority of dots are on the line.
Previously, we found that skew and kurtosis are less than 2. Shapiro–Wilk test is applied.
anova_residuals <- residuals(object = aov_test)
sha_test <- shapiro.test(x = anova_residuals)
sha_test
##
## Shapiro-Wilk normality test
##
## data: anova_residuals
## W = 0.99064, p-value = 0.06271
hist(anova_residuals, ylim = c(0, 80))
According to the results of the Shapiro-Wilk normality test, p-value = 0.06271 means that the null hypothesis about normal distibution is not rejected. Consequently, it can be concluded that residuals are distributed normally.
Post hoc test
In case of the equality of variances across groups, the post hoc such as Tukey’s one can be run in order to inspect which groups contribute to the statistical significance of this test.
The second pair “Living comfortably - Coping” might be considered as significant rather than minor. The same can be said about the another pair “Living comfortably - Difficult” that is also the most notable category among pairs.
Non-parametric equivalent of ANOVA
The non-parametric test compares the medians of three of more groups. Such test is the Kruskal-Wallis one. As for the null hypothesis, it is stated that mean ranks of the groups are the same.
kr_test <- kruskal.test(eduyrs ~ income, data = dataset)
kr_test
##
## Kruskal-Wallis rank sum test
##
## data: eduyrs by income
## Kruskal-Wallis chi-squared = 24.352, df = 3, p-value = 0.00002109
KW chi-square(3) = 24.352, p-value is < .001, which means that the mean ranks of the groups related to different types of organizations are not the same. Consequently, the Dunn’s test can be implimented, as well.
dunn <- DunnTest(eduyrs ~ income, data = dataset)
dunn
##
## Dunn's test of multiple comparisons using rank sums : holm
##
## mean.rank.diff pval
## Difficult-Coping -34.631855 0.0741 .
## Living comfortably-Coping 39.598502 0.0021 **
## Very difficult-Coping -5.086022 0.8580
## Living comfortably-Difficult 74.230357 0.000019 ***
## Very difficult-Difficult 29.545833 0.6683
## Very difficult-Living comfortably -44.684524 0.3733
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
According to results, there are two pairs of groups that have statistically significant differences in their medians, while the others do not.
The question such as what the substantive significance of group membership in determining amount of time spent on education is. In order to get an answer, looking at the share of variance explained by groups in all the variance of time variable can be helpful.
omega_sq <- function(aov_test){
sum_stats <- summary(aov_test)[[1]]
SSm <- sum_stats[["Sum Sq"]][1]
SSr <- sum_stats[["Sum Sq"]][2]
DFm <- sum_stats[["Df"]][1]
MSr <- sum_stats[["Mean Sq"]][2]
W2 <- (SSm-DFm*MSr)/(SSm+SSr+MSr)
return(W2)
}
omega_sq(aov_test)
## [1] 0.06533891
Overall, it can be consluded that the finding is not so important because the omega squared is below .07 that is medium size effect.
Alexandra Shanina participating in the process of finding needed variables, constructed the model background theory, made interpretations of some models, made interpretations of some graphs, made hypothesis and research question.
Milena Oleshko described graphs with distributions – histograms and boxplots, described graphs with correlation, wrote down the interpretations of regression models and the conclusion to the whole project.
Zakharova Victoria found variables, wrote equations, coded all graphs, constructed models, made a fabulous html., checked significance and interpreted ANOVE, tied up whole project. Also, helped and supported Alexandra and Milena taking into account their ideas, comments, and even dealt with their criticism.
Theory
The key idea of this research project is to observe the social stratification and types of inequality in German society. To make it, it was decided to introduce a model, which is based on socioeconomic status measurement. Socioeconomic status is the set of factors, which effects on ability to access to collectively desired resources. One of the most popular ways to measure person’s economic status is to observe his or her annual income source. So, income was chosen by us as a dependent variable.
It is known, that SES consists of different parts: income depends on a variety of social factors. Our model is to watch what variables affect the income size in Germany. The first variable, which we decided to check, was years spent on education. According to the theory Sewell, educational attainment has a direct causal link to the occupational attainment in the future, so, more time person spent on education, more successful he or she is, the bigger the income size and otherwise source In the case of migrants, there are much more obstacles to spent more time on education than it necessary, because 1) it costs a lot, 2) it takes a lot of time, 3) the process of educational adaptation is hard. The second variable was age. By this, we were curious if there is any discriminating pattern among age groups. It is assumed that different age groups do not have equally access to the resources (as children and pensioners). The third variable is gender. Across the world males and females earnings differ (wage gap among genders). The thought was find whether the income depends on gender. The final is refers to migrants (whether a person has a citizenship or not). By this variable, we wanted to check if there is an inequality between migrants and native population. Migrants on the same job get lower income than native people. Migrants usually agree to work at the low-paid jobs source. Another variable, that was used is the presence of fixed work schedule. From the Goldthorpe theory, it is known that people of high social position and status have not fixed working schedule, they make it by them own (manage their labor by themselves), thus earn much more money. However, there are precarious type of jobs, which do not obtain high status, but provide an ability to earn high income source. By this variable we wanted to know how the presence of fixed working hours may influence on household’s total income. To make this case much more fascinating, we took gender as a moderator, to find out if there is inequality between males and females.
In order to make our case more specific, we decided to filter our data and left only whose people who have any children in the households. As we know from the theory, SES affects children. The aim was to observe the respondents with children, because if there are kids in household, the SES of respondent affects the future perspectives of kids. The interesting case of SES reproduction source.
Below schematical presentation of models are shown.

Research question
Do such factors as years spent on education, age, gender, citizenship, having fixed working hours taling into account gender tend to affect household’s total income?
Hypotheses
Mainly we suppose the following ideas to be checked:
The more years person spent on education, the bigger size of income he or she would get.
Being older, a person earns more (= has more income).
Males would get higher income than females.
Non-migrants would get higher income than migrants. Note: in our case as well as in the models by migrants we mean people who do not have citizenship.
The presence or absence of working schedule affect the difference in incomes.
Hypothesis for interactive model: the type of working schedule would affect the income and at the same time gender would influence the income. So, we expect that the first mentioned variable does not have the “pure” effect. Thus, people with the same working schedule characteristics but different gender would get dissimilar income.
In our data we have the following meanings behind income variable:

In order to deal with income in further regression models we recorded this variable having added number as ordering.
# Libraries
library(foreign)
library(haven)
library(tidyverse)
library(psych)
library(stargazer)
library(polycor)
library(DataExplorer)
library(magrittr)
library(sjPlot)
library(ggpubr)
library(formattable)
library(effects)
library(ggstatsplot)
library(margins)
# Functions
'%out%' <- function(x,y)!('%in%'(x,y))
mode <- function(v) {
uniqv <- unique(v)
uniqv[which.max(tabulate(match(v, uniqv)))]
}
# Loading data
Germany <- read.spss("ESS8DE.sav", use.value.labels = T, to.data.frame = T)
# Data preparation
data <- Germany %>%
select(chldhm, hinctnta, eduyrs, agea, gndr, icwhct, ctzcntr) %>%
filter(hinctnta %out% c(77, 88, 99) &
(eduyrs %out% c(77, 88, 99)) &
(agea != 999) & (gndr != 9) &
(icwhct %out% c(6, 9)) &
(ctzcntr %out% c(7, 8, 9)) &
(chldhm == "Respondent lives with children at household grid")) %>%
select(-chldhm)
# Recoding working hours variable
data$new[data$icwhct == "Yes"] <- "Fixed working hours"
data$new[data$icwhct == "No"] <- "Non-fixed working hours"
data <- data %>% select(-icwhct) %>% rename(icwhct = new)
data$icwhct <- as.factor(data$icwhct)
# Recoding income variable
data$hinctnta_num[data$hinctnta == "J - 1st decile"] <- "1"
data$hinctnta_num[data$hinctnta == "R - 2nd decile"] <- "2"
data$hinctnta_num[data$hinctnta == "C - 3rd decile"] <- "3"
data$hinctnta_num[data$hinctnta == "M - 4th decile"] <- "4"
data$hinctnta_num[data$hinctnta == "F - 5th decile"] <- "5"
data$hinctnta_num[data$hinctnta == "S - 6th decile"] <- "6"
data$hinctnta_num[data$hinctnta == "K - 7th decile"] <- "7"
data$hinctnta_num[data$hinctnta == "P - 8th decile"] <- "8"
data$hinctnta_num[data$hinctnta == "D - 9th decile"] <- "9"
data$hinctnta_num[data$hinctnta == "H - 10th decile"] <- "10"
data <- data %>% select(-hinctnta) %>% rename(hinctnta = hinctnta_num)
data$hinctnta <- factor(data$hinctnta,
levels = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10"))
# Appropriate types
data$eduyrs <- as.numeric(as.character(data$eduyrs))
data$agea <- as.numeric(as.character(data$agea))
data$hinctnta <- as.numeric(as.character(data$hinctnta))
# Used variables in regression models
Number <- c("1", "2", "3", "4", "5", "6")
Abbreviation <- c("hinctnta", "eduyrs", "agea", "gndr", "ctzcntr", "icwhct")
RoleInRegressionModel <- c("dependent", "independent", "independent",
"independent", "independent", "independent")
LevelOfMeasurement <- c("Ratio", "Ratio", "Ratio", "Binomial", "Binomial", "Binomial")
Definition <- c("Household's total net income, all sources",
"Years of full-time education completed",
"Age of respondent",
"Gender of respondent: male or female",
"Citizen of country: yes or no",
"Having a set basic or contracted number of working hours: yes or no")
vardesc <- data.frame(Number, Abbreviation, RoleInRegressionModel, LevelOfMeasurement, Definition)
formattable(vardesc, align = c("l", rep("r", NCOL(vardesc) - 1)),
list(`Abbreviation` = formatter("span", style = ~style(color = "green",
font.weight = "bold"))))
| Number | Abbreviation | RoleInRegressionModel | LevelOfMeasurement | Definition |
|---|---|---|---|---|
| 1 | hinctnta | dependent | Ratio | Household’s total net income, all sources |
| 2 | eduyrs | independent | Ratio | Years of full-time education completed |
| 3 | agea | independent | Ratio | Age of respondent |
| 4 | gndr | independent | Binomial | Gender of respondent: male or female |
| 5 | ctzcntr | independent | Binomial | Citizen of country: yes or no |
| 6 | icwhct | independent | Binomial | Having a set basic or contracted number of working hours: yes or no |
After selecting needed variables we can look at the data frame and define whether there are not available observations or not.
As can be seen, there are NAs. Although it is possible to deal with it, we are going to remove them to have completely cleaned data for further regression models. After that let us to present basic information about the data and its descriptive statistics.
# Removing NAs
data <- data %>% na.omit()
# Basic information
introduce(data)
## rows columns discrete_columns continuous_columns all_missing_columns
## 1 790 6 3 3 0
## total_missing_values complete_rows total_observations memory_usage
## 1 0 790 4740 32112
# Descriptive statistics
describe(data)
## vars n mean sd median trimmed mad min max range skew
## eduyrs 1 790 15.21 3.19 15 15.08 2.97 2 25 23 0.25
## agea 2 790 44.79 10.82 45 44.45 10.38 18 83 65 0.34
## gndr* 3 790 1.49 0.50 1 1.49 0.00 1 2 1 0.04
## ctzcntr* 4 790 1.08 0.27 1 1.00 0.00 1 2 1 3.13
## icwhct* 5 790 1.09 0.28 1 1.00 0.00 1 2 1 2.89
## hinctnta 6 790 6.98 2.52 8 7.22 2.97 1 10 9 -0.68
## kurtosis se
## eduyrs 0.40 0.11
## agea 0.37 0.38
## gndr* -2.00 0.02
## ctzcntr* 7.80 0.01
## icwhct* 6.36 0.01
## hinctnta -0.50 0.09
Overall, there are 790 observations that are going to be base for analysis. 3 variables are defined as discrete, the other 3 are continuous. There are no NAs as we have removed them.
Looking at the second table we mainly interested in continuous variables such as (1) years spent on education, (2) age of respondent, and (3) income. The other variables are factors. Let us present descriptive statistics in graphically.
# Plotting income distribution
ggplot(data, aes(hinctnta)) +
geom_histogram(position = 'identity', stat = "count",
fill = "chartreuse3", color = "azure4", alpha = 0.2) +
labs(x = 'Income', y = 'Density',
title = "Distribution of respondents' income") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5)) +
scale_x_continuous(breaks = unique(data$hinctnta))
As can be seen from the histogram, income is not normally distributed. We can also notice that there are more observations with income related to the second part of deciles. The most popular answer (mode) is the 8th decile. This distribution has a built-up character.
# Plotting age distribution
ggplot(data, aes(agea)) +
geom_histogram(position = 'identity', stat = "count",
fill = "mediumpurple1", color = "mediumpurple3", alpha = 0.35) +
labs(x = 'Age', y = 'Density',
title = "Distribution of respondents' age") +
theme_bw() +
ylim(0, 40) +
theme(plot.title = element_text(hjust = 0.5)) +
geom_vline(aes(xintercept = mean(data$agea), color = "mean"), size = 0.7) +
geom_vline(aes(xintercept = median(data$agea), color = "median"), size = 0.7) +
geom_vline(aes(xintercept = mode(data$agea), color = "mode"), size = 0.7) +
scale_color_manual(name = "Statistics", values = c(mean = "green",
median = "blue",
mode = "red")) +
scale_y_continuous(breaks = 1:100*10)
As can be noticed, age of respondents is not normally distributed, but still it is roughtly bell-shaped. There are several ups-and-downs. Both mean value and median value are equal to 45 years, while the mode is higher, and it is equal to 49 years.
# Plotting years spent on education distribution
ggplot(data, aes(eduyrs)) +
geom_histogram(position = 'identity', stat = "count",
fill = "orange", color = "orange3", alpha = 0.35) +
labs(x = 'Years of studying', y = 'Density',
title = "Distribution of years spent on education") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5)) +
geom_vline(aes(xintercept = mean(data$eduyrs), color = "mean"), size = 0.7) +
geom_vline(aes(xintercept = median(data$eduyrs), color = "median"), size = 0.7) +
geom_vline(aes(xintercept = mode(data$eduyrs), color = "mode"), size = 0.7) +
scale_color_manual(name = "Statistics", values = c(mean = "green",
median = "blue",
mode = "red")) +
scale_y_continuous(breaks = 1:100*10)
It can be said that the variable regarding years that respondents spent on education is more or less normally distributed. There are less people who spent not many years. The most popular answer, which is a mode, is equal to 13 years, so it is the peak of the distribution. In this distribution the mean and the median have the same rounded values which are equal to 15 years.
# Plotting years spent on education distribution
ggplot(data, aes(x = ctzcntr, y = hinctnta)) +
geom_boxplot(fill = c("lightgreen", "lightskyblue")) +
stat_summary(fun.y = mean, geom = "point", shape = 3, size = 4) +
ggtitle("Distribution of income per statuses") +
labs(x = "Whether person has a status Germany citizen or not", y = "Income") +
theme_bw()
This graph depicts two boxplots of income distribution by two categories – (1) people, who have citizenship of Germany, and (2) who do not have it. Comparing these two boxplots it can be said that the median and mean income of people with citizenship is higher than the median value of people who do not have it. Both categories have the same maximum value – 10, but minimum value is higher in case of having citizenship.
# Plotting years spent on education distribution
ggplot(data, aes(x = gndr, y = hinctnta)) +
geom_boxplot(fill = c("orchid3", "royalblue3")) +
stat_summary(fun.y = mean, geom = "point", shape = 3, size = 4) +
ggtitle("Distribution of income per gender") +
labs(x = " ", y = "Income") +
theme_bw()
This graph shows two distribution of income in Germany by gender. Comparing these two boxplots in can be seen that the median and mean income of male are higher than females have. It also should be said that the third quantile of both categories are about the same value, but the first quantiles differ – females have lower value. Also it is seen that the maximum values of these two categories are equal, and the minimum value of males is higher than females have.
# Plotting years spent on education distribution
ggplot(data, aes(x = icwhct, y = hinctnta)) +
geom_boxplot(fill = c("lightgoldenrod1", "honeydew3")) +
stat_summary(fun.y = mean, geom = "point", shape = 3, size = 4) +
ggtitle("Distribution of income per\nhaving basic or contracted number of working hours") +
labs(x = " ", y = "Income") +
theme_bw()
This graph depicts the distribution of income in Germany by two categories – (1) people who have fixed working hours, and (2) people who do not have fixed working hours. It can be seen that the interquartile ranges of these two categories are very similar – their first and third quartiles do not differ. It should be mentioned that the median value of people with fixed working hours is higher than median of the second group, and the mean values of these categories do not differ a lot, but the first group has the mean higher than the people with non-fixed working hours have.
Next we would like to provide some some plots are created in order to visualize correlations.
ggscatter(data, x = "agea", y = "hinctnta",
add = "reg.line", conf.int = T,
cor.coef = T, cor.method = "pearson",
xlab = "Age", ylab = "Income",
title = "Correlation plot: income vs age",
color = "slateblue4")
ggscatter(data, x = "eduyrs", y = "hinctnta",
add = "reg.line", conf.int = T,
cor.coef = T, cor.method = "pearson",
xlab = "Years of studying", ylab = "Income",
title = "Correlation plot: income vs years of studying",
color = "orangered3")
cor.test(data$hinctnta, data$agea)
##
## Pearson's product-moment correlation
##
## data: data$hinctnta and data$agea
## t = 3.192, df = 788, p-value = 0.001469
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.04357362 0.18130486
## sample estimates:
## cor
## 0.1129819
cor.test(data$hinctnta, data$eduyrs)
##
## Pearson's product-moment correlation
##
## data: data$hinctnta and data$eduyrs
## t = 10.148, df = 788, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2767790 0.4002284
## sample estimates:
## cor
## 0.3399674
As can be seen from the first blue plot, there is a positive tiny correlation (R = 0.11) between total income of respondent’s household and respondent’s age in Germany. The p-value (p = 0.0015) gives us evidence that there is the relationship between our variables. The following might be formulated: the higher person’s age is, the higher household income he or she has.
As for the second graph, it can be seen that there is a positive weak correlation (R = 0.34) between total income of respondent’s household and years spent on education by this person. The p-value is less than 0.001 (p < 2.2e-16), and it gives us the evidence that the relationship between income and years of studying exists. The following might be formulated: the more years person spent on education, the higher household income he or she has.
Additionally, another ways to look at data taking into account distributions of variables are shown below.
set.seed(123)
plot <- ggscatterstats(
data = data,
x = agea,
y = hinctnta,
xlab = "Age",
ylab = "Income",
title = "Distribution and correlation: income vs age",
type = "pearson", conf.level = 0.95, method = "lm",
marginal = TRUE,
bf.message = TRUE,
messages = FALSE
)
print(plot)
This graph shows the summary of previous information in a very fancy way! Here are presented the age and income distribution and the correlation between it, which is depicted by scatterplot! The Pearson’s coefficient here is positive and equals to 0.11, which is not very big, but we will deal with it.
set.seed(123)
plot2 <- ggscatterstats(
data = data,
x = eduyrs,
y = hinctnta,
xlab = "Years spent on education",
ylab = "Income",
title = "Distribution and correlation: income vs years of studying",
type = "pearson", conf.level = 0.95, method = "lm",
marginal = TRUE,
bf.message = TRUE,
messages = FALSE
)
print(plot2)
Another beautiful graph giving the summary about aforesaid information and explanation of income and years of studying. There are presented two histograms, which present income and education years distribution and a scatterplot, showing the correlation between two variables. There is correlation coefficient presented in the graph. This coefficient equals 0,34, which is rather good in our case mainly due to complex variables presented in the data.
##The first model has income as the outcome and one predictor – years spent on education.
# Model 1
model1 <- lm(hinctnta ~ eduyrs, data = data)
summary(model1)
##
## Call:
## lm(formula = hinctnta ~ eduyrs, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.7965 -1.4561 0.5439 1.8120 4.1523
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.89890 0.41051 7.062 0.00000000000362 ***
## eduyrs 0.26807 0.02642 10.148 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.369 on 788 degrees of freedom
## Multiple R-squared: 0.1156, Adjusted R-squared: 0.1145
## F-statistic: 103 on 1 and 788 DF, p-value: < 0.00000000000000022
par(mfrow = c(2,3))
plot(density(model1$res), main = "Model 1. Residuals")
plot(model1)
Coefficients
$\hat{Income}$ = 2.9 + 0.27 ⋅ eduyrs + ei
As follows, beta coefficients are statistically significant (according to t-statistics and p-value that is < 0.001).
Residuals
Model fit
Case as interpretation
Let suppose that person has spent 13 years on education, it means that the income would be equal to 6,38401 (6 rounded). Recording: 6 means that income is from 1500 to 2000 euros monthly.
Checking hypotheses One of hypotheses is that the more years person spent on education, the bigger size of income he or she would get. For instance, person has spent 3 years on education, when his or her household’s total income would be 3.71 (rounded 4) in accordance with the first regression model. Recording: 4 stands for 500 to under 1000 euros per month. Comparing this peron with the previously mentioned one in section “case as interpretation”, it can be concluded that the mentioned above empirical hypothesis is proved.
In the second model just new variable is added. So, this model has two predictors – years spent on education and age of respondent.
# Model 2
model2 <- lm(hinctnta ~ eduyrs + agea, data = data)
summary(model2)
##
## Call:
## lm(formula = hinctnta ~ eduyrs + agea, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.6719 -1.5637 0.4866 1.7940 4.0124
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.493181 0.546080 2.734 0.006391 **
## eduyrs 0.272563 0.026213 10.398 < 0.0000000000000002 ***
## agea 0.029861 0.007735 3.860 0.000123 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.348 on 787 degrees of freedom
## Multiple R-squared: 0.132, Adjusted R-squared: 0.1298
## F-statistic: 59.85 on 2 and 787 DF, p-value: < 0.00000000000000022
par(mfrow = c(2,3))
plot(density(model2$res), main = "Model 2. Residuals")
plot(model2)
Coefficients
$\hat{Income}$ = 1.49 + 0.27 ⋅ eduyrs + 0.03 ⋅ agea + ei
As follows, beta coefficients are statistically significant (according to t-statistics and p-value that is < 0.001).
Residuals
Model fit
Case as interpretation
Suppose that person has spent 13 years on education and he or she is 20 years old, it means that the income would be equal to 5,632769 (6 rounded). Recording: 6 means that income is from 1500 to 2000 euros monthly.
Checking hypotheses One of hypotheses is that being older, a person earns more (= has more income). For instance, person whose age is 30 has spent 13 years on education, when his or her household’s total income would be 5.9 (rounded 6) in accordance with the first regression model. Recording: 6 means that income is from 1500 to 2000 euros monthly. Comparing this peron with the previously mentioned one in section “case as interpretation”, it can be concluded that the mentioned above empirical hypothesis is not fully proved.
The third model also has income as outcome, but for now there are three predictors – years spent on education, age and gender.
# Model 3
model3 <- lm(hinctnta ~ eduyrs + agea + gndr, data = data)
summary(model3)
##
## Call:
## lm(formula = hinctnta ~ eduyrs + agea + gndr, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9200 -1.5908 0.4021 1.8006 4.1356
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.881150 0.551293 3.412 0.000677 ***
## eduyrs 0.274019 0.025999 10.540 < 0.0000000000000002 ***
## agea 0.027551 0.007696 3.580 0.000365 ***
## gndrFemale -0.625940 0.166319 -3.764 0.000180 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.329 on 786 degrees of freedom
## Multiple R-squared: 0.1474, Adjusted R-squared: 0.1441
## F-statistic: 45.29 on 3 and 786 DF, p-value: < 0.00000000000000022
par(mfrow = c(2,3))
plot(density(model3$res), main = "Model 3. Residuals")
plot(model3)
Coefficients
$\hat{Income}$ = 1.88 + 0.27 ⋅ eduyrs + 0.03 ⋅ agea − 0.63 ⋅ gndrFemale + ei
As follows, beta coefficients are statistically significant (according to t-statistics and p-value that is < 0.001).
Residuals
Model fit
Case as interpretation
The same person, who has spent 13 years on education and who is 20 years old, but now we define the gender of this person as male (gndrFemale = 0). It means that the income would be equal to 5,992651 (6 rounded). Recording: 6 means that income is from 1500 to 2000 euros monthly.
Checking hypotheses One of hypotheses is that males would get higher income than females. If the gender of previously mentioned person in section “case as interpretation” was female (gndrFemale = 1), the income would be equal to 5.36 (5 rounded). Recording: 5 means that income is from 1000 to 1500 euros monthly. So, gender inequality in income is noticed. The hypothesis is supported.
# Model 4
model4 <- lm(hinctnta ~ eduyrs + agea + gndr + ctzcntr, data = data)
summary(model4)
##
## Call:
## lm(formula = hinctnta ~ eduyrs + agea + gndr + ctzcntr, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9291 -1.4535 0.4041 1.7570 4.4567
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.358474 0.565981 4.167 0.0000343 ***
## eduyrs 0.258953 0.026220 9.876 < 0.0000000000000002 ***
## agea 0.024091 0.007716 3.122 0.001861 **
## gndrFemale -0.648123 0.165381 -3.919 0.0000967 ***
## ctzcntrNo -1.049527 0.313357 -3.349 0.000849 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.314 on 785 degrees of freedom
## Multiple R-squared: 0.1594, Adjusted R-squared: 0.1551
## F-statistic: 37.21 on 4 and 785 DF, p-value: < 0.00000000000000022
par(mfrow = c(2,3))
plot(density(model4$res), main = "Model 4. Residuals")
plot(model4)
Coefficients
$\hat{Income}$ = 2.36 + 0.26 ⋅ eduyrs + 0.02 ⋅ agea − 0.65 ⋅ gndrFemale − 1.05 ⋅ ctzcntrNo + ei
As follows, beta coefficients are statistically significant (according to t-statistics and p-value that is < 0.001).
Residuals + Median equals to 0.41. + Minimum (-6.9) and maximum (4.5) are roughly closed to each other in their absolute values. + The Residual standard error (RSE) is 2.314 meaning that the observed values deviate from the true regression line by approximately 2 units in average.
Model fit
Case as interpretation
Male person (gndrFemale = 0), who is 20 years old and who has spent 13 years on education does not have a status of citizen (ctzcntrNo = 1). As follows, the income would be equal to 5,08 (5 rounded). But if this person has this statuse, the income would be 6.13 (rounded 6). Recording: 5 means that income is from 1000 to 1500 euros monthly, 6 means that income is from 1500 to 2000 euros monthly.
Checking hypotheses One of hypotheses is that people with status of citizen would get higher income than people without it. As illustrated above in section “case as interpretation”, this hypothesis is supported.
The specific point of the 5th model is a two-way interaction. This model includes income as outcome, and there are for some predictors – years spent on education, age, gender, presence or absence of German citizenship and working schedule, where gender is presented as moderator.
# Model 5
model5 <- lm(hinctnta ~ eduyrs + agea + gndr + ctzcntr + icwhct*gndr, data = data)
summary(model5)
##
## Call:
## lm(formula = hinctnta ~ eduyrs + agea + gndr + ctzcntr + icwhct *
## gndr, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9198 -1.4213 0.4053 1.7057 4.5123
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 2.240270 0.566785 3.953
## eduyrs 0.264546 0.026239 10.082
## agea 0.024447 0.007769 3.147
## gndrFemale -0.530403 0.172285 -3.079
## ctzcntrNo -1.028761 0.312702 -3.290
## icwhctNon-fixed working hours 0.177163 0.412123 0.430
## gndrFemale:icwhctNon-fixed working hours -1.305154 0.579162 -2.254
## Pr(>|t|)
## (Intercept) 0.0000843 ***
## eduyrs < 0.0000000000000002 ***
## agea 0.00171 **
## gndrFemale 0.00215 **
## ctzcntrNo 0.00105 **
## icwhctNon-fixed working hours 0.66740
## gndrFemale:icwhctNon-fixed working hours 0.02450 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.306 on 783 degrees of freedom
## Multiple R-squared: 0.1676, Adjusted R-squared: 0.1613
## F-statistic: 26.28 on 6 and 783 DF, p-value: < 0.00000000000000022
par(mfrow = c(2,3))
plot(density(model5$res), main = "Model 5. Residuals")
plot(model5)
Coefficients
$\hat{Income}$ = 2.24 + 0.26 ⋅ eduyrs + 0.02 ⋅ agea − 0.5 ⋅ gndrFemale − 1.03 ⋅ ctzcntrNo + 0.18 ⋅ icwhctNo − 1.3 ⋅ gndrFemale ⋅ icwhctNo + ei
As follows, previously used variables have beta coefficients that are statistically significant (according to t-statistics and p-value that is < 0.001). Although a new added variable related to working hours has no direct impact in the model as it is insignificant, in case of interaction this variable and gender there is a significance. Also, we can conclude that there is a moderation.
Residuals + Median equals to 0.41. + Minimum (-6.9) and maximum (4.5) are roughly closed to each other in their absolute values. + The Residual standard error (RSE) is 2.315 meaning that the observed values deviate from the true regression line by approximately 2 units in average.
Model fit
Interaction
As can be seen from the first graph, males are associated with having higher household’s total income in both cases of having or not a set basic or contracted number of hours. Additionally, plots are created to demonstrate effects.
Marginal effects
Marginal effects show how the outcome would change in case of 1-unit change of the predictor, taking into account all the other predictors.
mrgns <- margins(model5)
summary(mrgns)
## factor AME SE z p lower
## agea 0.0244 0.0078 3.1469 0.0017 0.0092
## ctzcntrNo -1.0288 0.3127 -3.2899 0.0010 -1.6416
## eduyrs 0.2645 0.0262 10.0821 0.0000 0.2131
## gndrFemale -0.6460 0.1648 -3.9201 0.0001 -0.9691
## icwhctNon-fixed working hours -0.4622 0.2921 -1.5824 0.1135 -1.0347
## upper
## 0.0397
## -0.4159
## 0.3160
## -0.3230
## 0.1103
plot(mrgns)
The highest average value of marginal effects has variable related to studying years (eduyrs), the lowest one is gender that is female. All these variable are extremely important in the model as their effects demonstrate how value of the outcomes might be significantly changed.
Standardised coefficients
According to the plot presented above, years spent on education can predict household’s total income better than the other variables.
Case as interpretation
Here is again 20 year-old male, who has spent 13 years on education and he is not the citizen of Germany (ctzcntrNo = 1). Having added working hours schedule and gender, suppose, that he has schedule (icwhctNo = 1) and his gender is “male” (gndrFemale = 0). So, the income equals to 5.17 (rounded 5). Recording: 5 means that income is from 1000 to 1500 euros monthly.
Checking hypotheses
One of our hypothesis is that _ people with the same working schedule characteristics but different gender would get dissimilar income_. So, let us continue and assume that there is almost identical person, but the gender is female. In this case the income is 3.37 (rounded 3). Recording: 3 means that income is 300 to under 500 euros montly. Overall, the mentioned hypothesis is supported.
# ANOVA
anova(model1, model2, model3, model4, model5)
## Analysis of Variance Table
##
## Model 1: hinctnta ~ eduyrs
## Model 2: hinctnta ~ eduyrs + agea
## Model 3: hinctnta ~ eduyrs + agea + gndr
## Model 4: hinctnta ~ eduyrs + agea + gndr + ctzcntr
## Model 5: hinctnta ~ eduyrs + agea + gndr + ctzcntr + icwhct * gndr
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 788 4422.6
## 2 787 4340.4 1 82.187 15.4609 0.00009169 ***
## 3 786 4263.6 1 76.831 14.4534 0.0001548 ***
## 4 785 4203.5 1 60.069 11.3001 0.0008126 ***
## 5 783 4162.2 2 41.255 3.8805 0.0210392 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As can be seen from the table above, the model is improving hierarchically due to new added variables. It is proved by less p-values (< 0.05) and RSS which tends to reduce. As for F-statistics, they also show that model is improving because the null hypothesis about absence of model improvment is rejected. Overall, the 5th model with interaction is better than the others.
tab_model(model1, model2, model3, model4, model5)
| hinctnta | hinctnta | hinctnta | hinctnta | hinctnta | |||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Predictors | Estimates | CI | p | Estimates | CI | p | Estimates | CI | p | Estimates | CI | p | Estimates | CI | p |
| (Intercept) | 2.90 | 2.09 – 3.70 | <0.001 | 1.49 | 0.42 – 2.56 | 0.006 | 1.88 | 0.80 – 2.96 | 0.001 | 2.36 | 1.25 – 3.47 | <0.001 | 2.24 | 1.13 – 3.35 | <0.001 |
| eduyrs | 0.27 | 0.22 – 0.32 | <0.001 | 0.27 | 0.22 – 0.32 | <0.001 | 0.27 | 0.22 – 0.32 | <0.001 | 0.26 | 0.21 – 0.31 | <0.001 | 0.26 | 0.21 – 0.32 | <0.001 |
| agea | 0.03 | 0.01 – 0.05 | <0.001 | 0.03 | 0.01 – 0.04 | <0.001 | 0.02 | 0.01 – 0.04 | 0.002 | 0.02 | 0.01 – 0.04 | 0.002 | |||
| Female | -0.63 | -0.95 – -0.30 | <0.001 | -0.65 | -0.97 – -0.32 | <0.001 | -0.53 | -0.87 – -0.19 | 0.002 | ||||||
| No | -1.05 | -1.66 – -0.44 | 0.001 | -1.03 | -1.64 – -0.42 | 0.001 | |||||||||
| Non-fixed working hours | 0.18 | -0.63 – 0.98 | 0.667 | ||||||||||||
| gndrFemale:icwhctNon-fixed working hours | -1.31 | -2.44 – -0.17 | 0.025 | ||||||||||||
| Observations | 790 | 790 | 790 | 790 | 790 | ||||||||||
| R2 / R2 adjusted | 0.116 / 0.114 | 0.132 / 0.130 | 0.147 / 0.144 | 0.159 / 0.155 | 0.168 / 0.161 | ||||||||||
This table depicts the all results of all models. It can be said that our models improve with add new predictors – it can be concluded grounding on the fact that with adding new predictor each time R-square increase, which means that each time our model become more explained. We start with explain 12% (rounded) of model and end with explaining 17% (rounded).
With this table we can conclude, whether our hypotheses were confirmed. We see from all models that more years person spent on education, the bigger size of income her or his household gets – the hypothesis 1 was confirmed. Starting from the second model we can conclude that older person is, the bigger size of income he or she has - hypotheses 2 was confirmed. We also see that being female decrease the income, so our hypothesis 3 that males get income higher than females was confirmed. We can conclude that non-migrants get higher income than migrants, and it confirms our hypothesis 4. We can also conclude, that people with who do not have fixed working hours have higher income – and it confirms our hypothesis 5 that different types of working hours schedule affect to different income. And at the end we can say that the type of working hours with gender together affect the income – which confirm our last hypothesis 6 of interactive effect.