3TB Portfolio

Project 1: Descriptive Statistics

Formal block

Research topic
Research question
Contribution

Data loading and preparation

Central Tendency Measures

Visualisation

Barplot of a univariate distribution
Histogram of a univariate distribution
Scatterplot of a bivariate distribution
Boxplot of a bivariate distribution
Stacked barplot of a bivariate distribution

Background & final conclusion
Bibliography

Project 2: Basic Tests

Formal block

Research topic
Research question
Contribution

Preparatory stage

Libraries attachment, data loading and preparation

Part I: Chi-squared test

Variables
Hypothesis
Staked barplot
Contingency tables
Fisher’s exact test
Results plot
Residuals
General conclusion for part I

Part II: T-test

Variables
Hypothesis
Additional data preporation
Boxplot
Normality assumption check, the first way: QQ Plot
Normality assumption check, the second way: Skew and Kurtosis
Normality assumption check, the third way: Histograms
Conclusion about normality assumption
Non-parametric T-test
Effect size for non-parametric T-test
Parametric T-test
Effect size for parametric T-test
Conclusion about double-checking by another test
Mean gross pays conclusion
General conclusion for part II

Part III: ANOVA

Variables
Hypothesis
Additional data preparation
Boxplot
Normality assumption
Another way of normality assumption check:
Equality or homogeneity of variances
Non-parametric test
Kruskal-Wallis test
F-test formula and ANOVA (var.equal = T)
Normality of residuals
General conclusion for part III

Overall project conclusion

Project 3: From Correlation to Linear Regression

Formal block

Research topic
Contribution

Variables

Variables choice explanation
Data loading and preparation

Descriptive statistics

Scatter plots

Correlation matrix and a boxplot
Regression models

Model 1: gross pay by years of education completed
Model 2: gross pay by years of education completed and health (continuous predictor)
Model 3: gross pay by years of education completed and health (continuous predictor) and gender (categorical predictor).

Comparison the models fits
Interpretation of the overall model fit and the coefficients
The regression equation
Output table and conclusion

Conclusion

Bibliography

Project 4: From additive models to interaction ones

Formal block

Research topic
Contribution
Theory & hypothesis

Data loading and preparation
Variables
Correlation matrix of chosen variables
Multiple regression models

Additive model
Interaction model
Comparison of models

Interaction plots
Marginal effects

Average marginal effects
Marginal effects plots

Standardised coefficients model
Project conclusion
Bibliography

Project 1: Descriptive Statistics

Formal block

Research topic

Bulgarian Muslims in the context of the European migration crisis.

Research question

Are local Muslims in Bulgaria discriminated against?

Contribution

Sharin Nikita: formation of the topic & research question, selection of the necessary variables, preparation of the scatter plot and the stacked barplot, graphs commenting, as well as the writing of the final conclusion.
Amaltdinova Dina: formation of the topic, partial participation in selection of necessary variables, preparation of the boxplot and descriptive statistic table (numeric and nominal variable descriptions) and comments for the graph.
Drobyshevska Ekaterina: formation of the topic, preparation of the histogram and the scatter plot, variables description, graphs commenting.
Teterina Ekaterina: formation of the topic; searching and describing variables, calculating Central Tendency Measures, building barchart, structuring and designing the report.

Many parts of the report were done in collaboration, so each member of the group made at least minor contribution to each task.

Data loading and preparation

First, we will change types of our variables to perform further actions and do some other preparatory stuff such as creating variables that we will need later.

As we have correct R data types and, it’s time to describe our variables.

Variable trstlgl: Trust in the legal system – interval, numeric in code
Variable trstplc: Trust in the police – interval, numeric in code
Variable eduyrs: Years of full-time education completed – ratio, numeric in code
Variable grspnum: Approximate monthly gross pay – ratio, numeric in code
Variable rlgdnm: Religion or denomination belonging to at present – nominal, we made it being factor in code
Variable dscrrlg: Discrimination of respondent’s group: religion – nominal, we made it being factor in code

Furtemore, at this preparatory stage we decided to leave only those people who attached to generalised groups of Muslims and Christians as other religion affiliations (Not Muslims & Not Christians) are represented by an insufficient number of observations (only 2).

Central Tendency Measures

In order to describe numeric variables we use Central Tendency Measures

Numeric variable discriptions
variable	category	mode	mean	median
trstlgl: Trust in the legal system	Numeric	0	3.06	3
trstplc: Trust in the police	Numeric	0	3.92	4
eduyrs: Years of full-time education completed	Numeric	12	11.71	12
grspnum: Approximate monthly gross pay	Numeric	510	865.14	760

For nominal variables, however, we should use only mode.

We create a mode function to know what are the most commonly observed value in a set of factor variables – dscrgrp (Discrimination of respondent’s group: religion) and rlgdnm (Religion or denomination belonging to at present).

Nominal variable discriptions
variable	category	mode
dscrrlg: Discrimination of respondent’s group: religion	Factor	No
rlgdnm: Religion or denomination belonging to at present	Factor	Eastern Orthodox

Visualisation

Barplot of a univariate distribution

Comment: Eastern Orthodox religion is dominant in Bulgaria (there are more than 1340 believers), Islam is the second dominant religion (there are almost 300 believers). Other religions are not so popular in Bulgaria and have not so many followers.

Context note: Islam becomes more and more popular in Bulgaria due to the big share of refugees who mostly move from Islamic countries to Europe.

Histogram of a univariate distribution

Comment: The graph shows that the majority of Muslims do not trust the legal system, another big share of Muslims (~50) trust the legal system partly. A small number of Muslims fully trust it.

Scatterplot of a bivariate distribution

Comment: The graph shows that there might be a discrimination in gross pay towards Muslims: even those longest-studied of them don’t get more than 1200, while a significant part of comparable Christians (with a comparable number of study years) gain more than 1400. In general, there are a lot of Christians that get more than Muslims, but here it is barely possible to talk about discrimination – Christians tend to study longer than Muslims, that according to trend lines, can influence the amount of gross pay.

Small note: we decided to limit “usual gross pay” to 3000 because there is one outlier that has an extremely big gross pay and may damage our results by that.

Boxplot of a bivariate distribution

Comment: The following boxplot illustrates the attitude towards the police among representatives of different religious groups. Christians and Muslims. As we can see, the median attitude towards the police among Muslims is higher than among Christians. For sure, this fact does not illustrate the targeted discrimination of Muslims in the context of another religious group presence.

Stacked barplot of a bivariate distribution

Comment: The share of Muslims who are discriminated against on the basis of religion (~5%) is relatively higher than according share in another religious group, Christians (~0%). Such difference allows us to carefully suggest (without any statements) the presence of exclusively anti-Muslim sentiments in Bulgaria.

Background & final conclusion

Bulgaria is a country that has become a transit point for refugees during the European migration crisis in 2015-2016 (1). Theoretically, this invasion of refugees could imply serious economic and social tensions. These tensions, in turn, can become the basis for xenophobic sentiments within the country. Most of these refugees are Muslims. At the same time, in Bulgaria, 78% (2) of the population consider themselves Christians. Thus, it is religion that can become the part of identity on the basis of which xenophobic views could be built. This mainly puts at risk the Muslims of Bulgaria, who make up 10% (2) of the total population of the strata. In this regard, it was interesting to look at the situation in Bulgaria regarding discrimination on the basis of religious affiliation.

Generally speaking, at this research phase it is difficult to state that discrimination is definitely present or absent – simply, because of the fact that support from the available graphs seems to be weak and incomplete (for ex. it mostly does not account contextually important details). However, what can these graphs say then? They can tentatively point out presence of possible discrimination – this is what exactly happened in the case of legal (histogram), job (scatterplot) and general religious discrimination (stacked barplot) graphs. But, at the same time, they can also point out an absence of any discrimination: graph of another area – policessian (boxplot), do not give even a little hint of discrimination against Muslims. Does these cues mean, for example, that there is no any Muslim discrimination in the policessian environment, but for sure there is one in the job area? Not necessarily: as it was said, graphs are pretty non-exhaustive. And due to such nature, graphs, as it seems, can only show where discrimination is more likely (exactly: as overall life experience, in job and legal sectors) and where – less (exactly: in policessian area). To say more, such information seems to be quite helpful in conditions of a limited budget, where it is not possible to explore all areas and they are needed to be prioritised according to the degree of expected usefulness.

Bibliography

Stoyanov, A. (2015). Bulgaria in the Context of the European Migrant Crisis–Challenges and Perspectives. Der Donauraum, 55(3-4), 119-126.
Bulgaria. National Statistical Institute (2011). CENSUS 2011(final data). Retrieved from [https://www.nsi.bg/census2011/PDOCS2/Census2011final_en.pdf].

Project 2: Basic Tests

Formal block

Research topic

Bulgarian Muslims in the context of the European migration crisis.

Research question

Are local Muslims in Bulgaria discriminated against?

Contribution

Sharin Nikita: preparation of regression models, comparison of correlation and regression, description of regression models, regression equation writing-out;
Amaltdinova Dina: analysis of correlation matrix, description of regression models, comparison of correlation and regression, comparison of regression models;
Drobyshevska Ekaterina: choice of topic and variables, preparation of background information, analysis of descriptive statistics, preparation of correlation matrix, final conclusion about model with the best fit;
Teterina Ekaterina: choice of topic and variables, preparation of descriptive statistics, preparation and analysis of descriptive statistics, preparation of correlation matrix, preparation of regression models.

Preparatory stage

Libraries attachment, data loading and preparation

Because our topic is tightly connected with one’s religious affiliation and we are especially interested in position of Muslims in relation to Christians (Christianity is the dominant religion in Bulgaria as we found out in the previous project), we decided to leave only those observations which consider Muslims and Christians.

We also changed class of religious affiliation (rlgdnm_new) and discrimination (dscrrlg_new) variables from character to factor as it is needed to their correct handling in the further blocks of code.

Part I: Chi-squared test

As we mentioned in our first project, Bulgaria is a transit point for refugees in 2015-2016, during the migration crisis in Europe [1]. The arrival of refugees and their assimilation into society can potentially affect the social and economic stability of the country. Majority of refugees are Muslims, 78% [2] of the population in Bulgaria are Christians, therefore we assume that religious affiliation can become a factor on the basis of which xenophobic views can be built. Therefore, we are looking at possible discrimination in Bulgaria on the basis of religion.

Variables

Categorical: religious affiliation and discrimination:
rlgdnm: religion or denomination belonging to at present.
dscrrlg: discrimination of respondent’s group: religion.

Hypothesis

H0: religion affiliation and discrimination are independent;
H1: religion and discrimination are dependent.

Staked barplot

The barplot shows that the overall level of discrimination is rather low, however, compared to Christians Muslims face more discrimination.

Contingency tables

The following table helps to see whether all requirements for observations are satisfied. There are no missing values and min number of it is 5.

Religion affiliation and discrimination
	Discriminated	Non-Discriminated
Christian	5	1383
Muslim	19	278

Now we should check the assumption about expected frequencies.

test <- chisq.test(t); test

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  t
## X-squared = 59.28, df = 1, p-value = 1.368e-14

chisq.test(t)$expected

##            
##             Discriminated Non-Discriminated
##   Christian     19.769733         1368.2303
##   Muslim         4.230267          292.7697

Looking at the contingency table we can see that there is one cell below 5, so we should use the Fisher’s exact test instead of the Chi-square test.

Fisher’s exact test

data_fish <- fisher.test(data_work$dscrrlg_new, data_work$rlgdnm_new)
data_fish

## 
##  Fisher's Exact Test for Count Data
## 
## data:  data_work$dscrrlg_new and data_work$rlgdnm_new
## p-value = 5.24e-11
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.01534274 0.14846651
## sample estimates:
## odds ratio 
## 0.05305162

P value = 5.24e-11 - below threshold, that means that we reject H0. So, relationship between religion and discrimination exists.

Results plot

Residuals

Residuals
	Discriminated	Non-Discriminated
Christian	-7.969	7.969
Muslim	7.969	-7.969

From the table and graphs, wee can see that the value of standardized residual in the category “Christian” who were discriminated is lower than -2 which means that this cell contains fewer observations than it was expected (if variables were independent). In the same category but among those who were not discriminated the value of standardized residual is higher than 2 which means that this cell contains more observations that it was expected. The same situation with “Muslim” - value of residuals of those who were discriminated is higher than 2 therefore there are more observations that is was expected. And for those who were not the value is less than -2 - there are less observations that it was expected. This means that if variables were independent we would have higher percentage of discriminated Christian and lower percentage of discriminated Muslims.

General conclusion for part I

Due to the not-so-big numbers of observation and lack of contextually important details, it is difficult to conclude that in Bulgaria Muslims are systematically discriminated against, unlike Christians. However, we can see that despite the fact that the level of discrimination on religious grounds is low, Muslims are more likely to be subject of discrimination than Christians (barplot). The test also confirmed that there is a correlation between religious affiliation and the fact of discrimination.

Part II: T-test

The working environment is a possible field for religious discrimination. One type of such discrimination is unequal pay for work of equal value – inequality in the mean pay received by people of different religious groups may be the surface sign of such discrimination. So, it seems useful to study the presence of such inequality in the average pay. But, of course, it should be admitted that results received here will not give the final answer to the question of discrimination – they can only serve as a basis for a deeper analysis, analysis with accounting of many factors like education and job specifics.

Variables

Categorical: religious affinity.
rlgdnm: religion or denomination belonging to at present.
Continuous: gross pay.
grspnum: usual (weekly/monthly/annual) gross pay.

Hypothesis

H0: mean values of gross pay are equal between Muslims and Christians.
H1(two-sided): mean values of gross pay are not equal between Muslims and Christians.

If we confirm H0, it would mean that there is no a statistically significant difference between the mean values of gross pay between Muslims and Christians, else – if we fail to confirm H0, it would mean that there is a statistically significant difference between the mean values of gross pay between Muslims and Christians (actual H1 content).

Additional data preporation

In addition to the basic date filtering from NA, we also decided to remove one outlier with an extremely high value (that is 4000, the nearest to it is only 2600) – this decision will make the graphs more readable, since now their space will not be stretched heaviliy due to that outlier displaying.

Boxplot

Boxes & whiskers are not very different in size among boxplots presented – preliminarily, we can declare equality of variances. At the same time there are a lot of outliers – sign of non-normal distribution. Furtermore, situation with normality worsens when considering the second boxplot (attached to the Muslims) – a very strong right skew can be seen here. All in all, at this stage, we can almost confidently say that the data distribution is far from normal – the normality assumption is violated.

Normality assumption check, the first way: QQ Plot

Both in case of Muslims and Christians significant part of dots presented deviate from diagonal of the normal distribution. It is expressed most clearly in Christians heavy right tail representing positive skew. The presence of such noticeable deviations means that distribution of data is non-normal. So, after that it can be said that normality assumption is violated.

Normality assumption check, the second way: Skew and Kurtosis

describeBy(data_work_2$grspnum, data_work_2$rlgdnm_new)

## 
##  Descriptive statistics by group 
## group: Christian
##    vars   n  mean     sd median trimmed    mad min  max range skew kurtosis
## X1    1 213 890.1 453.01    800  822.31 296.52 150 2600  2450 1.62     3.12
##       se
## X1 31.04
## ------------------------------------------------------------ 
## group: Muslim
##    vars  n   mean     sd median trimmed    mad min  max range skew kurtosis
## X1    1 45 677.33 299.31    620  675.41 177.91 100 1500  1400 0.29     0.02
##       se
## X1 44.62

Skew and kurtosis of Muslims’ data are in the intervals consistent with the normal distribution: 0.29 < |0.5| – skew, and 0.02 < |1| – kurtosis, at the same time Christians’ data skew and kurtosis are on the contrary, outside the intervals corresponding to the normal distribution: 1.62 > |0.5| – skew, and 3.12 > |1| – kurtosis. Situation with skew and kurtosis of Christians’ data indicate non-normal distribution – so, normality assumption is violated.

Normality assumption check, the third way: Histograms

Histograms show that distribution in both groups is skewed to the right – so, again, normality assumption is violated.

Conclusion about normality assumption

Generally, there is expressed positive skew – distribution is non-normal, we should use non-parametric T-test.

Non-parametric T-test

wilcox.test(grspnum ~ rlgdnm_new, data = data_work_2)

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  grspnum by rlgdnm_new
## W = 6103, p-value = 0.003879
## alternative hypothesis: true location shift is not equal to 0

Test p-value is ~ 0.0039 – value that is small enough to reject H0. So, the next is true: there is a statistically significant difference between the mean ranks in gross pay between Muslims and Christians.

Effect size for non-parametric T-test

wilcox_effsize(grspnum ~ rlgdnm_new, data = data_work_2, na.rm = T)

## # A tibble: 1 x 7
##   .y.     group1    group2 effsize    n1    n2 magnitude
## * <chr>   <chr>     <chr>    <dbl> <int> <int> <ord>    
## 1 grspnum Christian Muslim   0.180   213    45 small

Effect size is ~ 0.182, relative size of such effect is small.

Parametric T-test

t.test(data_work_2$grspnum ~ data_work_2$rlgdnm_new)

## 
##  Welch Two Sample t-test
## 
## data:  data_work_2$grspnum by data_work_2$rlgdnm_new
## t = 3.9146, df = 92.404, p-value = 0.0001731
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  104.8269 320.7130
## sample estimates:
## mean in group Christian    mean in group Muslim 
##                890.1033                677.3333

Due to data violation of normality assumption we have no right to use this type of test – parametric.
Test p-value is ~ 0.00017 – value that is small enough to reject H0. So we are, again, able to conclude that there is a statistically significant difference between the mean values in gross pay between Muslims and Christians. Here result of non-parametric test (that is performed earlier) is confirmed.

Effect size for parametric T-test

effsize::cohen.d(data_work_2$grspnum ~ data_work_2$rlgdnm_new, na.rm = T)

## 
## Cohen's d
## 
## d estimate: 0.494219 (small)
## 95 percent confidence interval:
##     lower     upper 
## 0.1683032 0.8201348

Effect size is ~ 0.49, relative size of such effect, as in the case of wilcox, is small, but now very close to medium (that is 0.5). However, it is still not medium – so, results related to non-parametric test are also confirmed here.

Conclusion about double-checking by another test

Parametric test consistent with non-parametric in the main – in concluding that there is a statistically significant difference between the mean values (ranks) in gross pay between Muslims and Christians, and in the subordinate – in effect size of stated difference: small for both cases.

Mean gross pays conclusion

Christians on average have a higher monthly gross pay than Muslims: $890 against $677 respectively. As it was shown earlier, this difference is statistically significant and small in its effect size.

## [1] "Mean monthly gross pay of Muslims: 677.33"

## [1] "Mean monthly gross pay of Christians: 890.1"

General conclusion for part II

There is an inequality in the mean gross pay amount received by Muslims and Christians: last earn more on average ($677 against $890 respectively). But, as it was mentioned earlier, in order to understand whether this inequality is the result of discrimination against Muslims, more in-depth research is needed – such ones that will adjust the data values taking into account factors that also can determine gross pay amount (education, job specifics, etc.).

Part III: ANOVA

Variables

Categorical: we decided to make gross pay a categorical variable. grspnum: usual (weekly/monthly/annual) gross pay. Continuous: trust in the legal system.
trstlgl: trust in the legal system.

Hypothesis

H0: mean values of trust in the legal system are equal across all groups.
H1(two-sided): there is at least one group which mean value is statistically significantly different from the others. We decided to look only on Muslims because, we are especially interested in their trust to legal system. We expect that all groups of muslims will have a low trust in legal system, but we also want to find if there are some differences between these groups.

If we confirm H0, it would mean that there is no statistically significant difference between mean values of trust in legal system across Muslims with above average, average and below average level of usual gross pay. According to this hypothesis, H1 would mean that there is at least one group which mean value is significantly different from others.

Additional data preparation

We decided to categorize usual gross pay in categories. For this purpose we looked at data description:

Gross pay among Muslims
GrossPayGroup	N	Mean	Median	Min	Max
Muslim	45	677.33	620	100	1500

Our result of data description analysis is the following categorization:

Above average is >= 1000$;
Average is 500$ - 1000$;
Below average is <= 500$.

Boxplot

The boxplot shows that median values and upper quartiles of boxes indicating people with average and below average income are visually the same, however lower quartile of ‘average’ is lower than that of ‘below average’. ‘Average’ box also has upper whisker, while others lack them. Last thing to note is the fact that overall level of trust in the legal system for ‘above average’ is higher.

Normality assumption

describeBy(data_anova$trstlgl, data_anova$grspnum_cat)

## 
##  Descriptive statistics by group 
## group: Above average
##    vars n mean   sd median trimmed  mad min max range  skew kurtosis   se
## X1    1 9 4.11 2.93      5    4.11 2.97   0   7     7 -0.31    -1.78 0.98
## ------------------------------------------------------------ 
## group: Average
##    vars  n mean   sd median trimmed  mad min max range skew kurtosis   se
## X1    1 29 3.21 2.87      3       3 2.97   0  10    10 0.53    -0.79 0.53
## ------------------------------------------------------------ 
## group: Below average
##    vars n mean   sd median trimmed  mad min max range  skew kurtosis   se
## X1    1 7 3.14 2.04      3    3.14 2.97   0   5     5 -0.37    -1.72 0.77

Skew is equal to -0.37 that is < |0.5| ,which means that distribution approximately symmetric, kurtosis is equal to -1.72 that is > |1|, which means that the distribution is too flat. So we can state that the distribution is non-normal.

Another way of normality assumption check:

Data distribution is far from bell-shaped one – so, the distribution is non-normal, so we have to apply non-parametric test.

Equality or homogeneity of variances

leveneTest(data_anova$trstlgl ~ data_anova$grspnum_cat)

## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value Pr(>F)
## group  2  0.9653 0.3892
##       42

P-value is more than 0.05, so variances are equal. We will use appropriate tests taking into account that variances are equal.

Non-parametric test

Non-parametric test is used if residuals are non-normal – that is exactly our case.

Kruskal-Wallis test

kruskal.test(trstlgl ~ grspnum_cat, data = data_anova)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  trstlgl by grspnum_cat
## Kruskal-Wallis chi-squared = 0.96269, df = 2, p-value = 0.618

P-value is more than 0.05, so F-test is not significant. We can make an assumption that there is no significant difference between mean ranks of trust in legal system across Muslims with above average, average and below average level of usual gross pay. So we confirm the H0.

To double-check the results we will use a parametric test.

F-test formula and ANOVA (var.equal = T)

aov.out <- aov(data_anova$trstlgl ~ data_anova$grspnum_cat)
summary(aov.out)

##                        Df Sum Sq Mean Sq F value Pr(>F)
## data_anova$grspnum_cat  2    6.1   3.037   0.393  0.677
## Residuals              42  324.5   7.726

P-value is more than 0.05, so F-test is not significant. We again can confirm the H0. Non-parametric test consistent with a parametric in the main – in confirming that there is no significant difference between mean values of trust in legal system across Muslims with above average, average and below average level of usual gross pay.

Normality of residuals

anova.res <- residuals(object = aov.out)
describe(anova.res)

##    vars  n mean   sd median trimmed  mad   min  max range skew kurtosis  se
## X1    1 45    0 2.72  -0.14   -0.15 3.01 -4.11 6.79  10.9  0.3    -0.82 0.4

shapiro.test(x = anova.res)

## 
##  Shapiro-Wilk normality test
## 
## data:  anova.res
## W = 0.94207, p-value = 0.02563

Skew and kurtosis give information that the distribution of residuals is normal. However, Shapiro-Wilk test says that it is not normal. That’s why to conclude if the distribution is really normal or not we will make a histogram to visually see it.

The histogram shows that the distribution of residuals is not normal. This proves that we should have used a non-parametric test.

General conclusion for part III

According to boxplot, we see that there is some visual difference of mean values of trust in legal sustem between groups of usual gross pay among Muslims. Also we can spot that trust in legal system is pretty low among all of them. Non-parametric test and parametric test, however, prove that there is no significant difference among groups divided by usual gross pay. So we can make a conclusion that trust in legal system is not associated with Muslim’s usual gross pay and all of Muslims do not trust legal system not depending of their economic status. That may be caused by presence of structural discrimination against Muslims. However we cannot prove it yet as there should be more variables took into account, further research should be done.

Overall project conclusion

As we can see all statistical tests conducted within our project suggest an presence of possible discrimination against Muslims – there are some superficial signs of it (for example, inequality in mean amount of gross pay between Muslims and Christians). But! Due to the lack of the contextually important details accounting, we cannot state for sure if any discrimination against Muslims actually exist or not – some deeper studies are needed for this.

Project 3: From Correlation to Linear Regression

Formal block

Research topic

Wage inequality in Bulgaria.

Contribution

Sharin Nikita: preparation of regression models, comparison of correlation and regression, description of regression models, regression equation writing-out;
Amaltdinova Dina: analysis of correlation matrix, description of regression models, comparison of correlation and regression, comparison of regression models;
Drobyshevska Ekaterina: choice of topic and variables, preparation of background information, analysis of descriptive statistics, preparation of correlation matrix, final conclusion about model with the best fit;
Teterina Ekaterina: choice of topic and variables, preparation of descriptive statistics, preparation and analysis of descriptive statistics, preparation of correlation matrix, preparation of regression models.

Variables

Сontinuous outcome:

grspnum: Usual (weekly/monthly/annual) gross pay* - ratio;
*All money units presented within the project are implied to be in American dollars.

Сontinuous predictors:

eduyrs: Years of full-time education completed - ratio;
health: Subjective general health - interval;
nbthcld: Number of children ever given birth to/ fathered - ratio;

Сategorical predictor:

gndr: Gender - nominal.

Variables choice explanation

Initially we wanted to study how number of education years, health, number of children and religion influence one’s gross pay, but as we were getting deeper into analysis, we realised that religion is not a good predictor as its coefficient is not statistically significant. So, we had to step aside from the topic of our previous projects which was tightly connected with religion. We decided to choose gender as our new categorical predictor.

We expect a positive correlation between education years and gross pay, because more time spent on education implies that this person is a qualified professional and thus earns more than those who studied less.

We also expect a negative correlation between gross pay and number of children, because the more children one has, the more time he (but mainly she) has to spend on them and less time is left for the job. We also found the article, which provides information on such phenomenon: Coverman, S. (1983), the findings of such article say: “The negative effect of domestic labor time on both sexes’ income persists in the within-class analyses” [1]

It is hard to say what to expect from gross pay and health. First, the correlation may be positive as richer people are able to spend more money on their health and thus take care of themselves more. However, people with higher wages may experience a lot of stress because of pressure and responsibility. Moreover, they may work extra hours, so in this case correlation may be negative.

And, traditionally, we expect to see that women will have lower gross pay compared to men. The study of Jolliffe, D. (2002) shows “using separate wage regression estimates for men and women, an Oaxaca decomposition indicates that men’s wages are 24% higher than women’s wages and 86 to 105% of this differential is due to differences in how men and women are rewarded for the same characteristics.” [2]

Data loading and preparation

Initially the health variable had the following scale: 1 – very good and 5 – very bad. But we changed it, because the descending scale is understood better intuitively. Finally, we got the following:

0 – very bad;
1 – bad;
2 – fair;
3 – good;
4 – very good.

Descriptive statistics

describe(ess)

##           vars    n   mean     sd median trimmed    mad min  max range  skew
## gross_pay    1  346 862.73 470.47    755  794.86 363.24 100 4000  3900  2.14
## edu_years    2 2174  11.67   4.02     12   11.63   2.97   0   30    30  0.22
## health       3 2198   2.57   0.95      3    2.63   1.48   0    4     4 -0.49
## child_n      4 1757   1.85   0.86      2    1.73   0.00   1    9     8  2.18
## gender*      5 2198   1.44   0.50      1    1.43   0.00   1    2     1  0.23
##           kurtosis    se
## gross_pay     7.70 25.29
## edu_years     0.70  0.09
## health        0.06  0.02
## child_n      10.43  0.02
## gender*      -1.95  0.01

This table with descriptive statistics provides information about continuos variables, being a categorical variable gender will be described separately.

Our outcome variable - usual gross (gross_pay) pay has non-normal distribution and strongly right-skewed. As for education years (edu_years) variable, kurtosis and skew are within the threshold, so the distribution is normal. According to skew (-0.49 - on the border) and kurtosis, health’s distribution is close to normal. Number of children (child_n) variable has non-normal distribution.

So now we proceed to build several histograms using these variables:

Histograms reaffirm that the variable of “Years of full-time education” has normal distribution, while “Usual gross pay” has non-normal distribution and right-skewed, “Subjective general health” is close to normal but left-skewed and “Number of children” is also non-normal

Gender variable has 2 levels: “Male” and “Female”. Also, there are more females than males in our data, so the mode is “Female”.

getmode <- function(v) {
  uniqv <- unique(v)
  paste("Mode:", uniqv[which.max(tabulate(match(v, uniqv)))], collapse=" ")
}

getmode(ess$gender)

## [1] "Mode: Female"

## # A tibble: 2 x 2
## # Groups:   gender [2]
##   gender     n
##   <fct>  <int>
## 1 Female  1222
## 2 Male     976

Scatter plots

Scatterplot of “Years of full-time education” and “Usual gross pay” shows strong positive correlation between variables. Scatterplot of “Subjective general health” and “Usual gross pay” also shows small positive correlation between variables. While scatterplot of “Number of children” and “Usual gross pay” shows negative correlation between variables.

Correlation matrix and a boxplot

As our outcome variable is distributed non-normally, we will use spearman correlation method.

tab_corr(ess[, 1:4], corr.method = "spearman")

	gross_pay	edu_years	health	child_n
gross_pay		0.440***	0.192**	-0.218***
edu_years	0.440***		0.095	-0.277***
health	0.192**	0.095		-0.149*
child_n	-0.218***	-0.277***	-0.149*
Computed correlation used spearman-method with listwise-deletion.

From the matrix we see that the strongest correlation is between variables “edu_years” and “gross_pay” (r = 0.440) and thus we choose this correlation for further exploration

Despite the fact that the correlations with the “child_n” variable are stronger than with the “health” variable, we take “health” as a continuous predictor as model with “child_n” variable turned out to be not significant when we tested models.

So, our strongest (significant) correlation of the three is gross_pay (outcome) and edu_year (strongest covariate as a predictor). “health” – continuous predictor and “gender” – categorical one.

Boxplot shows that females and males have different medians in gross pay. There are outliers in both females and males, but males have them more.

Regression models

Model 1: gross pay by years of education completed

m1 <- lm(gross_pay ~ edu_years, data = ess)
summary(m1)

## 
## Call:
## lm(formula = gross_pay ~ edu_years, data = ess)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -817.64 -286.21  -68.35  149.51 2932.36 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  199.040     86.372   2.304   0.0218 *  
## edu_years     54.288      6.804   7.979 2.22e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 432.8 on 344 degrees of freedom
## Multiple R-squared:  0.1562, Adjusted R-squared:  0.1537 
## F-statistic: 63.67 on 1 and 344 DF,  p-value: 2.217e-14

Intercept = 199.040 which shows the gross pay for the person who has 0 years of education completed. From the correlation matrix coefficient between “edu_years” and “gross_pay” is equal to 0.440, so the correlation is positive. The slope = 54.288 and shows that with the increase of years of full-time education completed by 1 the pay increases by 54.288 (positive direction of the correlation as the correlation coefficient already seen). P-value < 0.05, which means that variables are significant for prediction of outcome. Adjusted R-squared = 0.1537 means that the model has explained about 15% of the variations.

Extent to which variance in the dependent variable can be attributed to the independent variable is higher in case of correlation (rho ~ 0.44, rho^2 = 0.194) in comparison with case of same-variables regression (R^2 = 0.156). We suppose that observed difference can be explained by the fact that calculated correlation is not regular Pearson’s, but rank Spearman’s one.

Model 2: gross pay by years of education completed and health (continuous predictor)

m2 <- lm(gross_pay ~ edu_years + health, data = ess)
summ(m2, digits = 4, center = T)

Observations	346
Dependent variable	gross_pay
Type	OLS linear regression

F(2,343)	35.0261
R²	0.1696
Adj. R²	0.1648

	Est.	S.E.	t val.	p
(Intercept)	862.7312	23.1152	37.3231	0.0000
edu_years	54.1906	6.7594	8.0171	0.0000
health	75.2640	31.9601	2.3549	0.0191
Standard errors: OLS; Continuous predictors are mean-centered.

Predictors were centred for better interpretation
Intercept = 862.7312. It shows gross pay or the person who has average (calculated mean number) education years completed and has average (calculated mean number) health on the scale. The slope for years of full-time education completed = 54.1906 implies that with an increase of years of full-time education completed by 1 year, the gross pay increases by 54.1906. From the correlation matrix coefficient between “health” and “gross_pay” r = 0.192 (positive correlation). And the slope value for health = 75.264 shows the increase in the gross pay by this value with every following point (positive correlation). P-values < 0.05, thus variables are significant for prediction of outcome. Adjusted R-squared = 0.1648 which is higher than in the previous model (0.1537) and that means that about 17% of the variance is explained on the model.

Model 3: gross pay by years of education completed and health (continuous predictor) and gender (categorical predictor).

m3 <- lm(gross_pay ~ edu_years + health + gender, data = ess)
summary(m3)

## 
## Call:
## lm(formula = gross_pay ~ edu_years + health + gender, data = ess)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -888.22 -231.16  -68.05  152.51 2830.74 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    57.858    127.773   0.453 0.650965    
## edu_years      57.240      6.683   8.565 3.75e-16 ***
## health         65.186     31.480   2.071 0.039136 *  
## genderFemale -172.585     45.919  -3.758 0.000201 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 422 on 342 degrees of freedom
## Multiple R-squared:  0.2025, Adjusted R-squared:  0.1955 
## F-statistic: 28.95 on 3 and 342 DF,  p-value: < 2.2e-16

Intercept = 57.858 means the gross pay for the male who has 0 years of education completed and 0 on the scale of health. The slope for gender = -172.585 implies that if the person is female, then she is expected to has lower gross pay than male.

Comparison the models fits

Our next step is to compare the models to define which fits better for further interpretation and check it by conducting ANOVA test comparison.

anova(m1, m2, m3)

## Analysis of Variance Table
## 
## Model 1: gross_pay ~ edu_years
## Model 2: gross_pay ~ edu_years + health
## Model 3: gross_pay ~ edu_years + health + gender
##   Res.Df      RSS Df Sum of Sq      F   Pr(>F)    
## 1    344 64436366                                 
## 2    343 63411116  1   1025250  5.758 0.016949 *  
## 3    342 60895857  1   2515259 14.126 0.000201 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In all cases p-value < 0.05, thus it can be stated that with each addition of predictors to our initial reduced model it stays significant, but the end model with two predictor variables (health and gender) fits better for further investigation. This better fit can be illustrated by comparison of R^2 adjusted: the third model has the largest value of this measure = 0.1955 against 0.1648 and 0.1536 of the first and the second models respectively. I.e. it is possible to say that third model has the greatest explanatory power of dependent variable variance among all the models created. We use R^2 adjusted to compare models as such measure solves the regular (non-adjusted) R^2 problem of unreasonable increase from adding new variables to the model.

Interpretation of the overall model fit and the coefficients

P-value: In all cases, the p-value is extremely small (< 0.05), therefore it can be conducted that all three variables (edu_years, health, and gender) predict gross pay.

Estimate:

Edu_years: With an increase of years of full-time education completed by 1 the pay increases by 57.240. Thus, it can be said that the more educated the person is the more likely their gross pay will be higher.

Health: The slope value for health shows the increase in the gross pay by 65.186 with every one additional health point. The healthier the person is, the higher the gross pay.

Gender: The slope for gender shows that if the person is female, then she is expected to have gross pay on 172.585 less than male. Men earn more than women – we see gender inequality in the labour market.

R-squared: It indicates the relation of variance explained by the model to total variance of the outcome variable, R-squared is equal to 0.2025 (or if account for problem of unreasonable increase from adding new variables to the model and use adjusted version of R-squared, 0.1955). Such value of R-squared (in each its version: adjusted or non-adjusted) means that model 3 explains an about 20% of the variance.

The regression equation

coef(m3)

##  (Intercept)    edu_years       health genderFemale 
##     57.85812     57.24024     65.18596   -172.58536

Based on output coefficients, we can write out the regression equation:

Exact version: Gross Pay = 57.85812 + 57.24024* Education years + 65.18596* Health - 172.58536* Female gender.

Version with rounded coefficients: Gross Pay = 57.9 + 57.2* Education years + 65.2* Health - 172.6* Female gender.

Output table and conclusion

summ(m3)

Observations	346
Dependent variable	gross_pay
Type	OLS linear regression

F(3,342)	28.95
R²	0.20
Adj. R²	0.20

	Est.	S.E.	t val.	p
(Intercept)	57.86	127.77	0.45	0.65
edu_years	57.24	6.68	8.56	0.00
health	65.19	31.48	2.07	0.04
genderFemale	-172.59	45.92	-3.76	0.00
Standard errors: OLS

Conclusion

In our project we analysed different numeric variables and their correlation, variables like gross_pay, edu_years, health and child_n were chosen out of all as they showed a good correlation between gross_pay and edu_years, gross_pay and health and gross_pay and child_n. However, correlation between gross_pay and child_n was not significant, that is why we decided to remove it from next analysis and modeling. As for categorical variable we’ve chosen gender because gender is good for predicting wage, so we wanted to prove this idea for Bulgaria. Next, we made a correlation matrix and build a regression model, from which we’ve chosen one best fitting model (with two predictor variables: health and gender and outcome variable: gross pay).

As for final step we build the output table, which shows that health, education years and gender predict people’s wages:

Education: people with more years studied are expected to get bigger gross pay in comparison with those ones with less years (one additional year -> 57.24 additional units of money in gross pay);
Health people who feel themselves healthier are expected to get bigger gross pay in comparison with those ones who feel less healthier (one additional health point -> 65.19 additional units of money in gross pay);
Gender males are expected to get bigger gross pay in comparison to females (being male -> 172.59 additional units of money in gross pay).

As for our expectations stated in the beginning of the project, most of them were confirmed by correlation tests and multiple regression model. Actually, only one of our assumptions (about health and gross pay connection) was not confirmed and that is only because we have voiced it in an exhaustive form, implying two opposite situations: both negative and positive associations.

Finally, being united by one model, named predictors can explain an about 20% of gross pay variance.

Bibliography

Coverman, S. (1983). Gender, domestic labor time, and wage inequality. American Sociological Review, 623-637.
Jolliffe, D. (2002). The gender wage gap in Bulgaria: A semiparametric estimation of discrimination. Journal of Comparative Economics, 30(2), 276-295.

Project 4: From additive models to interaction ones

Formal block

Research topic

Predictors of happiness in Bulgaria.

Contribution

Sharin Nikita: preparation of moderation model, interpretation of moderation model coefficients, interaction plots drawing and description, calculation + interpretation of standardized coefficients and average marginal effects;
Amaltdinova Dina: choice of topic and variables, preparation of background information, interpretation of models and coefficients, interaction plots description;
Drobyshevska Ekaterina: preparation of background information, interpretation of the interaction plots, general conclusion;
Teterina Ekaterina: data preparation, descriptive statistics, models preparation, interaction plots drawing and description.

Theory & hypothesis

In our 4th project we are interested what causes the general happiness of people.

Our first assumption is that frequency of meetings with friends, relatives, and colleagues can predict the level of happiness of a person. According to researches, time spent with friends and having more friends are important components in the prediction of a person’s level of happiness, as well as how strong the social network is (Leung A. et al., 2011, Lelkes 2006; Powdthavee 2008;). “Furthermore, having more relatives that a person feels close to was also found to be positively related to happiness.” (Leung A. et al., 2011). Participation in social relationships affects happiness levels (Quoidbach J. et al., 2019). So, we expect a positive correlation between how happy people generally feel and the amount of time they spend with their friends family and colleagues. Since we are talking about the social relationships, we also expect a positive correlation between happiness and the number of people a person lives with.

Our second assumption is that people who are living together with relatives permanently as a household tend to be more happier than those, who live alone. The study of Chinese elderly people shows that “elderly Chinese who live with grandchildren are associated with a much higher degree of happiness than their counterparts”. (Chyi, H., & Mao, S. 2012) The data emphasises the impact of living together on happiness among elderly population, but we want to test this hypotheses on people of every age. Also the research by revealed that “SSR from children living together and SSP to spouse, children living together and friends and neighbors reduce loneliness <…> SSR from children living together and SSP to children (living together and apart) increases SWB-life stability” where SSR and SSP mean “significant in social support (received (SSR) and provided (SSP))” and SWB mean subjective well-being (SWB). (Chalise, H. N., Saito, T., Takahashi, M., & Kai, I. 2007).

Our third assumption is that happiness has a negative relation to the age of a person. The study conducted by Vera-Villarroel and others shows that “Specifically, the results indicate that higher age within the sample, predicts lower levels of happiness.” We assume that during childhood people are more happy, while becoming an adult the stress increases and because of that the level of happiness might decrease.

Also, we assume that the happiness of a person might be negatively influenced by discrimination against the group in which he/she belongs. It is pretty obvious that feeling discrimination might decrease the level of happiness of a person.

Data loading and preparation

We changed class of discr_group from numeric to factor and renamed its levels: from “1” & “2” to “Yes” & “No”, at the same time NA was turned into “No answer”.

Variables

happy (outcome): How happy are you – interval;
agea (predictor): Age of respondent, calculated – ratio;
hhmmb (predictor): Number of people living regularly as member of household – ratio;
sclmeet (predictor): How often socially meet with friends, relatives or colleagues – interval;
dscrgrp (predictor): Member of a group discriminated against in this country – nominal.

describe(ess_4)

##              vars    n  mean    sd median trimmed   mad min max range  skew
## happy           1 2182  5.55  2.46      6    5.61  2.97   0  10    10 -0.25
## age             2 2197 54.55 18.12     57   55.43 19.27  15  90    75 -0.36
## house_memb      3 2198  2.62  1.45      2    2.43  1.48   1  11    10  1.45
## socio_meet      4 2168  4.65  1.80      5    4.72  1.48   1   7     6 -0.24
## discr_group*    5 2198  1.17  0.54      1    1.00  0.00   1   3     2  2.94
##              kurtosis   se
## happy           -0.42 0.05
## age             -0.67 0.39
## house_memb       3.62 0.03
## socio_meet      -1.16 0.04
## discr_group*     6.92 0.01

This table with descriptive statistics provides information about continuous variables, being a categorical variable discr_group will be described separately.

Normality assumption:
Our outcome variable happy and predictor age both have normal distribution, in both cases skew and kurtosis are within the threshold. However, other variables are distributed non-normally. House_memb is skewed to the right (1.45) and kurtosis (3.62) also indicates non-normality. In socio_meet skew is all right (-0.24), but kurtosis is closer to non-normality (-1.16).

Discr_group variable has 3 levels: “Yes”, “No answer” and “No”. Before building the model, we shall exclude level “No answer” in discr_group variable because it does not give any substantial information about respondents and will not help in interpretation.

Furthermore, in discr_group variables there are more “No” than “Yes” – so, the mode is “No”.

## # A tibble: 2 x 2
## # Groups:   discr_group [2]
##   discr_group     n
##   <fct>       <int>
## 1 No           1980
## 2 Yes           161

## [1] "Mode: No"

As we changed our variables from the 3rd project in order the effects of predictors and interaction effect were statistically significant, we need to build a correlation matrix with new variables in order to make sure there is a correlation between the outcome and predictors.

Correlation matrix of chosen variables

tab_corr(ess_4[, 1:4], corr.method = "spearman")

	happy	age	house_memb	socio_meet
happy		-0.287***	0.185***	0.320***
age	-0.287***		-0.440***	-0.297***
house_memb	0.185***	-0.440***		0.121***
socio_meet	0.320***	-0.297***	0.121***
Computed correlation used spearman-method with listwise-deletion.

From the correlation matrix, we can see that correlation is significant between outcome (happiness) and all predictors – age, number of people living with, and frequencies of meeting with friend, relatives, and colleagues.

Multiple regression models

Additive model

ess_4$age = center(ess_4$age)
ess_4$house_memb = center(ess_4$house_memb)
ess_4$socio_meet = center(ess_4$socio_meet)

m_a <- lm(happy ~ age + house_memb + socio_meet + discr_group, data = ess_4)
summ(m_a, digits = 4)

Observations	2106 (35 missing obs. deleted)
Dependent variable	happy
Type	OLS linear regression

F(4,2101)	110.8111
R²	0.1742
Adj. R²	0.1726

	Est.	S.E.	t val.	p
(Intercept)	5.6909	0.0503	113.1895	0.0000
age	-0.0242	0.0031	-7.9054	0.0000
house_memb	0.1268	0.0362	3.5040	0.0005
socio_meet	0.3588	0.0284	12.6352	0.0000
discr_groupYes	-1.3824	0.1832	-7.5443	0.0000
Standard errors: OLS

Interpretations:

Intercept = 5.6909, this value is the level of happiness for a person not attached to discriminated group with other variables are mean centered;
Slope for age = -0.0242, this value indicates negative relationships of age and happiness: each additional year -> 0.0244 fewer points of happy;
Slope for house_memb = 0.1268, this value indicates positive relationships of household members amount and happiness: each additional house member -> 0.1268 additional points of happy;
Slope for socio_meet = 0.3588, this value indicates positive relationships of social meetings frequency and happiness: each additional point on the scale of socio_meet -> 0.3588 additional points of happy. It can be additionally interpreted in the next way: person is happier if he or she meets with his or her f/r/c more often);
The slope for discr_group = -1.3824, this value indicates that if person is the member of a discriminated group, than his or her happiness is expected to be lower at 1.3824;
P-value < 0.05 in all cases means that all variables are significant for the prediction of outcome. R-squared = 0.1726 means that the model has explained about 17% of the happiness variations.

Interaction model

m_i <- lm(happy ~ age + house_memb + socio_meet * discr_group, data = ess_4)
m_i2 <- lm(happy ~ age + house_memb + discr_group * socio_meet, data = ess_4)
summary(m_i)

## 
## Call:
## lm(formula = happy ~ age + house_memb + socio_meet * discr_group, 
##     data = ess_4)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.5515 -1.5044  0.0049  1.5479  6.7700 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                5.690858   0.050226 113.305  < 2e-16 ***
## age                       -0.024422   0.003055  -7.993 2.15e-15 ***
## house_memb                 0.124645   0.036163   3.447 0.000578 ***
## socio_meet                 0.338187   0.029760  11.364  < 2e-16 ***
## discr_groupYes            -1.359807   0.183318  -7.418 1.72e-13 ***
## socio_meet:discr_groupYes  0.210610   0.091692   2.297 0.021720 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.215 on 2100 degrees of freedom
##   (35 observations deleted due to missingness)
## Multiple R-squared:  0.1763, Adjusted R-squared:  0.1743 
## F-statistic: 89.88 on 5 and 2100 DF,  p-value: < 2.2e-16

Interpretations:

Intercept = 5.6909, this is the value of happy for a person not attached to discriminated group with other continuous variables are mean centered;
Slope for age = -0.0244, this value indicates negative relationships of age and happiness: each additional year -> 0.0244 fewer points of happiness;
Slope for house_memb = 0.1246, this value indicates positive relationships of household members amount and happiness: each additional house member -> 0.1246 additional points of happiness;
Slope for socio_meet = 0.3382, this value indicates that if a person is not related to any discriminated group, than each additional point on the scale of socio_meet is expected to increase his or her happiness by 0.3382;
Slope for discr_groupYes = -1.3598, this value indicates that if a person with mean value of social meeting frequency is attached to a discriminated group, than happy is expected to be lower at 1.3598, keeping all other variables constant;
The interaction between the frequency of social meetings (socio_meet) and being attached to discriminated group (discr_groupYes) appears to be significant with p-value < 0.05. Because of this fact, relationships between happy and socio_meet depends on the value of discr_groupYes and/or relationships between happy and discr_groupYes depends on socio_meet value;
Provided that social meeting frequency of a person is not mean average (socio_meet = 0), the fact of being attached to discriminated group (discr_group = “Yes”) make expected value of happy lower at 1.1492 units (-1.3598 + 0.2106);
Provided that a person is attached to discriminated group (discr_group = “Yes”), with one unit increase in frequency of social meetings (socio_meet), expected value of happy is expected to be bigger at 0.5488 units (0.3382 + 0.2106).

Comparison of models

anova(m_a, m_i)

## Analysis of Variance Table
## 
## Model 1: happy ~ age + house_memb + socio_meet + discr_group
## Model 2: happy ~ age + house_memb + socio_meet * discr_group
##   Res.Df   RSS Df Sum of Sq      F  Pr(>F)  
## 1   2101 10327                              
## 2   2100 10302  1    25.881 5.2759 0.02172 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

According to the improvement in R^2 and ANOVA results (p-value < 0.05) the model with interaction effect fits best to predict the level of happiness.

Interaction plots

plot_model(m_i, type = "int", 
           title = "Predicted values of happiness", 
           legend.title = "Member of a discriminated group", 
           axis.title = c(
             "How often socially meet", 
             "Level of happiness"))

Plot interpretation: Happiness value is expected to be higher for people who more often meet with friends, relatives or colleagues. Furthermore, if person is the member of discriminated group, happiness is expected to increase at a higher rate (expected to show steeper increase) with the increase of meetings frequency. Adding everything together, the positive relationship between level of happiness and frequency of social meetings is apparently dependent on the fact of belonging to the discriminated group.

plot_model(m_i2, type = "int", 
           title = "Predicted values of happiness", 
           legend.title = "How often socially meet", 
           axis.title = c(
             "Member of a discriminated group", 
             "Level of happiness"))

Plot interpretation: Looking at this plot make possible to say that frequency of social meetings is significant predictor of level of happiness for both attached and not attached to discriminated group. The more frequent social meetings of the person, the more happier this person is. Furthermore this correlation is moderated by belonging to discriminated group – with the same frequency of social meetings, people from discriminated groups are expected to be less happy than those who are not from. Stated interaction effect is significant as confidence intervals do not overlap.

Large confidence intervals show that we have small amount of observations which increases the probability of error.

Marginal effects

Average marginal effects

margins(m_i)

##       age house_memb socio_meet discr_groupYes
##  -0.02442     0.1246     0.3542         -1.361

There is no point in age and house_memb average marginal effects: model simply doesn’t imply variations of their coefficients, they are constant – so, age and house_memb average marginal effects will be basically coincide with their base coefficients, already interpreted earlier. Socio_meet and discr_groupYes are another matter: these variables are in moderation relationships towards happy, i.e. the way each of them is associating with output differs depending on the value of the relationship colleague. Because of it, their pure effects are not constant – fact, giving sense to the average values of marginal effects. Well, let’s consider average marginal effect of each interaction variable:

0.35 for socio_meet. Such value means that higher social meeting frequency on average (among all possible conditioning options of discr_groupYes) will be purely associated with bigger value of happy: one additional point of social meeting frequency -> 0.35 additional point of happiness;
-1.36 for discr_groupYes. This value means that being attached to discriminated group on average (among all possible conditioning options of socio_meet) will be purely associated with lower value of happy: belonging to discriminated group -> 1.36 fewer points of happiness.

Marginal effects plots

socio_meet_margins <- interplot(
  m_i, var1 = "socio_meet", 
  var2 = "discr_group") + 
     xlab('age') +
     ylab('Estimated Coefficient for Periodicity of social meetings') +
     geom_hline(yintercept = 0, linetype = "dashed")

discrimination_margins <- interplot(
  m_i, 
  var1 = "discr_group", 
  var2 = "socio_meet") + 
     xlab('periodicity of social meetings') +
     ylab('Estimated Coefficient for Being member of disc. group') +
  geom_hline(yintercept = 0, linetype = "dashed")

grid.arrange(socio_meet_margins, discrimination_margins, nrow = 1)

Marginal effects plots allow us to see how the first predictor’s different values are conditioning pure effect of its moderation colleague – the second predictor. As for the case of discr_group conditioning socio_meet, here confidence intervals (of discr_group possible values) are overlapping each other – in fact, discr_group doesn’t moderate socio_meet. It make doubtful average marginal effect of socio_meet, calculated earlier. On contrary, when socio_meet conditioning discr_group, moderation is actually presented: although all possible values of socio_meet is associated with negative coefficient of discr_group, this association is positive. Thus, discr_group coefficient is always negative, but tend to be higher under condition of higher frequency of social meetings.

Standardised coefficients model

ess_4$age = scale(ess_4$age)
ess_4$house_memb = scale(ess_4$house_memb)
ess_4$socio_meet = scale(ess_4$socio_meet)

m_a_scaled <- lm(happy ~ age + house_memb + socio_meet * discr_group, data = ess_4)
summ(m_a_scaled, digits = 4)

Observations	2106 (35 missing obs. deleted)
Dependent variable	happy
Type	OLS linear regression

F(5,2100)	89.8845
R²	0.1763
Adj. R²	0.1743

	Est.	S.E.	t val.	p
(Intercept)	5.6909	0.0502	113.3047	0.0000
age	-0.4401	0.0551	-7.9930	0.0000
house_memb	0.1809	0.0525	3.4468	0.0006
socio_meet	0.6072	0.0534	11.3639	0.0000
discr_groupYes	-1.3598	0.1833	-7.4178	0.0000
socio_meet:discr_groupYes	0.3781	0.1646	2.2969	0.0217
Standard errors: OLS

The standardization of continuous predictors makes them consistent with each other – it allows us to compare them in terms of the association magnitude with output variable. Well, let’s actually do it: in case of a person not attached to any discriminated group socio_meet effect is 0.61, that is the biggest value among all other continuous predictors. Furthermore, this value increases up to 0.99 (0.38 + 0.61), if the person joins a discriminated group: now we account not only direct coefficient of socio_meet on happiness, but also interaction one, that is socio_meet:discr_groupYes. No doubt, it is the biggest possible association in the model, moving on. Age, its coefficient (-0.44) is the most negative and at the same time the second biggest, if we look at the modules – honorary second place in our happiness association contest. And finally… the last place, house_memb: its coefficient (0.18) is both the smallest positive and the smallest modulo.

Project conclusion

After analyzing the models, we see that all our assumptions stated at the beginning were confirmed. Such predictors as frequency of social meetings, age, fact of being discriminated group member and number of people living with appeared to be significant for the prediction of happiness. Going into details, we have found that frequency of social meetings and number of people living with are positively correlated with level of happiness, while age of the person is negatively correlated to the level of happiness. Moving forward, happiness is also predicted by discrimination: members of discriminated groups are less happier than non-members of such groups. Moreover, we discovered, that the correlation between happiness and frequency of meetings with friends, relatives or colleagues is moderated by being attached to discriminated group and/or the correlation between happiness and being attached to discriminated group is moderated, on the contrary, by frequency of social meetings.

Bibliography

Leung, A., Kier, C., Fung, T., Fung, L., & Sproule, R. (2011). Searching for happiness: The importance of social capital. Journal of Happiness Studies, 12(3), 443-462;
Lelkes, O. (2006). Knowing what is good for you: Empirical analysis of personal preferences and the “objective good”. The Journal of Socio-Economics, 35(2), 285-307;
Powdthavee, N. (2008). Putting a price tag on friends, relatives, and neighbours: Using surveys of life satisfaction to value social relationships. The Journal of Socio-Economics, 37(4), 1459-1480;
Quoidbach, J., Taquet, M., Desseilles, M., de Montjoye, Y. A., & Gross, J. J. (2019). Happiness and social behavior. Psychological science, 30(8), 1111-1122;
Vera-Villarroel, P., Celis-Atenas, K., Pavez, P., Lillo, S., Bello, F., Díaz, N., & López, W. (2012). Money, age and happiness: association of subjective wellbeing with socio-demographic variables. Revista latinoamericana de Psicología, 44(2), 155-163;
Chyi, H., & Mao, S. (2012). The determinants of happiness of China’s elderly population. Journal of Happiness Studies, 13(1), 167-185;
Chalise, H. N., Saito, T., Takahashi, M., & Kai, I. (2007). Relationship specialization amongst sources and receivers of social support and its correlations with loneliness and subjective well-being: A cross sectional study of Nepalese older adults. Archives of gerontology and geriatrics, 44(3), 299-314.

3TB Portfolio

Third Temple Builders

16 June 2021

Project 1: Descriptive Statistics

Formal block

Research topic

Research question

Contribution

Data loading and preparation

Central Tendency Measures

Visualisation

Barplot of a univariate distribution

Histogram of a univariate distribution

Scatterplot of a bivariate distribution

Boxplot of a bivariate distribution

Stacked barplot of a bivariate distribution

Background & final conclusion

Bibliography

Project 2: Basic Tests

Formal block

Research topic

Research question

Contribution

Preparatory stage

Libraries attachment, data loading and preparation

Part I: Chi-squared test

Variables

Hypothesis

Staked barplot

Contingency tables

Fisher’s exact test

Results plot

Residuals

General conclusion for part I

Part II: T-test

Variables

Hypothesis

Additional data preporation

Boxplot

Normality assumption check, the first way: QQ Plot

Normality assumption check, the second way: Skew and Kurtosis

Normality assumption check, the third way: Histograms

Conclusion about normality assumption

Non-parametric T-test

Effect size for non-parametric T-test

Parametric T-test

Effect size for parametric T-test

Conclusion about double-checking by another test

Mean gross pays conclusion

General conclusion for part II

Part III: ANOVA

Variables

Hypothesis

Additional data preparation

Boxplot

Normality assumption

Another way of normality assumption check:

Equality or homogeneity of variances

Non-parametric test

Kruskal-Wallis test

F-test formula and ANOVA (var.equal = T)

Normality of residuals

General conclusion for part III

Overall project conclusion

Project 3: From Correlation to Linear Regression

Formal block

Research topic

Contribution

Variables

Variables choice explanation

Data loading and preparation

Descriptive statistics

Scatter plots

Correlation matrix and a boxplot

Regression models

Model 1: gross pay by years of education completed

Model 2: gross pay by years of education completed and health (continuous predictor)

Model 3: gross pay by years of education completed and health (continuous predictor) and gender (categorical predictor).

Comparison the models fits

Interpretation of the overall model fit and the coefficients