Team: Rteam&DA aka Артемида

Areas of members’ responsibilities:

Chi-squared test - Artyom Kulikov & Nadezhda Bykova
T-test - Anastasia Vlasenko & Aleksey Artyushin

While the country of our interest is still Germany, in this particular project we use data from two themes, namely Social Demographics and Climate Change, in order to investigate the connection between them. The data was collected in 2016.

library(foreign)
library(ggplot2)
library(corrplot)
library(gridExtra)
library(sjPlot)
library(knitr)

# Loading the data
data <- read.spss("ESS8DE.sav", use.value.labels = TRUE, to.data.frame = TRUE)

Chi-squared test

Variables

Let’s begin with the chi-squared test. The test requires two categorical variables; we chose gender and people’s attitudes towards increasing taxes on fossil fuels (such as oil, gas and coal) in order to reduce climate change.

# Keeping only respondents with a non-missing answer to the tax question (inctxff)
data1 <- data[!is.na(data$inctxff),]
# Creating shorter names for the two variables used in the test
gender <- data1$gndr
att <- data1$inctxff

Here is a short description of our variables:

Gender: a categorical, nominal variable. Its category counts are shown below.

##   Male Female 
##   1493   1321

People’s attitudes towards increasing taxes on fossil fuels: a categorical, ordinal variable. Its scale and category counts are shown below.

##            Strongly in favour            Somewhat in favour 
##                           241                           816 
## Neither in favour nor against              Somewhat against 
##                           668                           798 
##              Strongly against 
##                           291

Visualization and description of data

Now, in order to form expectations, we need to visualize the data for the chosen variables. Below are a stacked bar plot showing the gender composition of each answer category and a grouped bar chart showing the frequencies of answers in each category by gender.

# Creating plots
set_theme(legend.pos = "top", legend.inside = TRUE, axis.textsize = 0.8, title.align = "center")

# Stacked bars: gender composition within each attitude category
plot1 <- sjp.xtab(att, gender, bar.pos = "stack", legend.title = "Gender",
                  axis.titles = "Attitude towards tax increase",
                  title = "Gender composition of taxation attitude",
                  show.total = FALSE, margin = "row",
                  geom.colors = "Pastel2")

# Grouped bars: answer frequencies by gender
plot2 <- sjp.grpfrq(att, gender, type = "bar", legend.title = "Gender", geom.spacing = -1,
                    axis.titles = "Attitude towards tax increase",
                    title = "Attitude distribution by gender",
                    show.prc = FALSE, geom.colors = "Pastel2")

grid.arrange(plot1, plot2, ncol=2)

As the graphs show, the shares of men and women differ in all five answer categories. “Neither in favour nor against” is the only category in which women outnumber men, while the largest gap is observed in the “Strongly against” category. The second largest gap is in the “Strongly in favour” category, and the remaining two categories differ by roughly 10 percentage points. “Somewhat against” was the most popular answer among men and “Neither in favour nor against” among women. “Strongly in favour” was the least popular answer for both men and women.

Overall, women tend to stick to the “Neither in favour nor against” option, remaining neutral, while men are more likely to express an opinion and choose a side.
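The shares discussed above can be recomputed directly from the counts. The following is a minimal sketch that rebuilds the contingency table by hand from the observed frequencies reported in the Test section of this report. The object names ct, gender_share and answer_share are illustrative; with the survey data loaded, the same table would come from table(gender, att).

```r
# Observed counts copied from the contingency table reported in this document
ct <- rbind(Male   = c(143, 440, 281, 446, 183),
            Female = c(98, 376, 387, 352, 108))
colnames(ct) <- c("Strongly in favour", "Somewhat in favour",
                  "Neither in favour nor against",
                  "Somewhat against", "Strongly against")

# Share of each gender within every answer category (each column sums to 1)
gender_share <- prop.table(ct, margin = 2)
print(round(gender_share, 3))

# Share of each answer within every gender (each row sums to 1)
answer_share <- prop.table(ct, margin = 1)
print(round(answer_share, 3))
```

The first table corresponds to the stacked bar plot (composition of each category), the second to the grouped bar chart (distribution of answers within each gender).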

Assumptions and hypotheses

We then check that the assumptions of the chi-square test are met:

Data in the contingency table is presented as counts (not percentages)
All expected cell counts are greater than 5
Each observation contributes to one group only
Groups are independent
The variables under study are categorical
The sample is, supposedly, reasonably random
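The condition on cell counts can be checked numerically. Below is a small sketch, assuming the observed counts from the contingency table reported later in this document; the names ct and expected are illustrative.

```r
# Observed counts copied from the contingency table reported in this document
ct <- rbind(Male   = c(143, 440, 281, 446, 183),
            Female = c(98, 376, 387, 352, 108))

# Expected counts under independence: row total * column total / grand total
expected <- chisq.test(ct)$expected

print(min(expected))        # smallest expected count
print(all(expected > 5))    # TRUE if the rule of thumb is satisfied
```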

Here are hypotheses for the test:

H0: In the population, the two categorical variables (gender and taxation attitude) are independent.
H1: In the population, the two categorical variables (gender and taxation attitude) are dependent.

Test

To check the dependence of the chosen variables, we apply Pearson’s chi-squared test. For this we need a contingency table of observed frequencies.

ct <- table(gender, att)
kable(ct)

	Strongly in favour	Somewhat in favour	Neither in favour nor against	Somewhat against	Strongly against
Male	143	440	281	446	183
Female	98	376	387	352	108

Then, we run Pearson’s Chi-squared test using function chisq.test().

test <- chisq.test(ct)
test

## 
##  Pearson's Chi-squared test
## 
## data:  ct
## X-squared = 50.32, df = 4, p-value = 3.096e-10

We obtained a chi-square statistic of 50.32 with 4 degrees of freedom and a p-value of 3.096e-10 (0.0000000003096). The critical value of χ² for 4 degrees of freedom at the 0.05 significance level is 9.49. The obtained statistic exceeds this critical value, and the p-value is far below 0.05, meaning that the probability of obtaining the observed, or more extreme, results if the null hypothesis (H0: the variables are independent) were true is extremely low. Therefore, since we have strong evidence of dependence between the variables, we reject the null hypothesis.
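The critical value and the p-value quoted above can be reproduced from the chi-square distribution itself. This is a small sketch using the rounded statistic from the output; the variable names are illustrative.

```r
alpha <- 0.05    # significance level
dof   <- 4       # (2 - 1) * (5 - 1) degrees of freedom
stat  <- 50.32   # chi-square statistic from the test output (rounded)

# Critical value: the 95th percentile of the chi-square distribution
crit <- qchisq(1 - alpha, df = dof)
print(round(crit, 2))

# p-value: upper-tail probability of the observed statistic
p_value <- pchisq(stat, df = dof, lower.tail = FALSE)
print(p_value)

# Decision rule: reject H0 when the statistic exceeds the critical value
print(stat > crit)
```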

Since the chi-square statistic is significant, we take a look at the residuals. Let’s create tables with the expected and observed frequencies and then with the residuals.

Expected frequencies
	Strongly in favour	Somewhat in favour	Neither in favour nor against	Somewhat against	Strongly against
Male	127.8653	432.9382	354.4151	423.3881	154.3934
Female	113.1347	383.0618	313.5849	374.6119	136.6066

Observed frequencies
	Strongly in favour	Somewhat in favour	Neither in favour nor against	Somewhat against	Strongly against
Male	143	440	281	446	183
Female	98	376	387	352	108

Standardized residuals
	Strongly in favour	Somewhat in favour	Neither in favour nor against	Somewhat against	Strongly against
Male	2.042912	0.5878676	-6.517588	1.894942	3.548674
Female	-2.042912	-0.5878676	6.517588	-1.894942	-3.548674
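The code that produced these three tables is not shown in the report. A likely way to obtain them, given here as a sketch rather than the original code, is to read the components of the object returned by chisq.test(). The counts are copied from the observed table above. Note that the residual values reported in this document match the stdres component (standardized residuals), not the residuals component (Pearson residuals).

```r
# Observed counts copied from the contingency table reported in this document
ct <- rbind(Male   = c(143, 440, 281, 446, 183),
            Female = c(98, 376, 387, 352, 108))
colnames(ct) <- c("Strongly in favour", "Somewhat in favour",
                  "Neither in favour nor against",
                  "Somewhat against", "Strongly against")

test <- chisq.test(ct)

print(test$expected)   # expected counts under independence
print(test$observed)   # observed counts (same as ct)
print(test$stdres)     # standardized residuals, as reported in the table
print(test$residuals)  # Pearson residuals (observed - expected) / sqrt(expected)
```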

According to the table, there are residuals with an absolute value greater than 2. We then draw an association plot to take a closer look at the residuals.

We observe that, for male respondents, every cell except the “Neither in favour nor against” category contains more observations than we would expect if the variables were independent. For women it is the other way around: the four remaining categories contain fewer observations than expected. In the “Somewhat in favour” category the difference between expected and observed counts is small, while a larger difference can be seen in its opposite category, “Somewhat against”.

The same pattern can be seen in the correlation plot drawn below. There is a strong positive association between female respondents and the “Neither in favour nor against” category, while for males it is the only category with a negative association.
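The code for the two plots is not included in the report. One possible way to draw them is sketched below; this is an assumption about the approach, not the original code. It uses assocplot() from base R and, if the corrplot package is installed, corrplot() with is.corr = FALSE so that residuals outside the range from -1 to 1 can be displayed. The counts are copied from the observed table above.

```r
# Observed counts copied from the contingency table reported in this document
ct <- rbind(Male   = c(143, 440, 281, 446, 183),
            Female = c(98, 376, 387, 352, 108))
colnames(ct) <- c("Strongly in favour", "Somewhat in favour",
                  "Neither in favour nor against",
                  "Somewhat against", "Strongly against")

# Standardized residuals of the chi-squared test
res <- chisq.test(ct)$stdres

# Association plot: bars above the line mark cells with more observations
# than expected under independence, bars below the line mark fewer
assocplot(ct, main = "Association plot: gender by attitude")

# Residual plot with corrplot, drawn only if the package is available
if (requireNamespace("corrplot", quietly = TRUE)) {
  corrplot::corrplot(res, is.corr = FALSE)
}
```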

Conclusion

Overall, we conclude that the chosen variables are dependent: attitude towards increasing taxes on fossil fuels (such as oil, gas and coal) in order to reduce climate change depends on the respondent’s gender. In particular, women tend to choose the “Neither in favour nor against” option and stay neutral, while men more often choose one of the two sides, with a tendency to be against the tax increase that is stronger than we would expect if the variables were independent.

T-test

Since the variable gender has already been analysed, let us turn to the variable eduyrs, the total number of years spent in education. We compare years of education between genders, so we use an independent-samples t-test rather than a paired one: males and females are two separate groups of respondents in which the same quantity is measured. Let us first take a look at the data:

summary(data1$eduyrs)

##    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
##    2    3    1    5    4    0   47   67  138  226  421  427  257  276  272 
##   17   18   19   20   21   22   23   24   25   26   27   28 NA's 
##  191  165  113   86   37   27   26    7    9    3    1    1    2

First of all, the variable is stored as a factor. Later in the analysis we convert it to numeric, since it is conceptually a continuous, ratio-scale variable. In the histogram of its distribution, the x-axis shows the number of years and the y-axis shows how many times each value occurs. The distribution looks approximately normal, but we will still check this assumption with a QQ plot.
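One point worth double-checking when converting this variable: as.numeric() applied to a factor returns the internal level codes, not the recorded values, whereas as.numeric(as.character()) recovers the values. The toy factor below (made-up values, not survey data) illustrates the difference. If the levels of eduyrs are exactly the values 2 to 28 shown in the summary above, which is an assumption we did not verify against the raw file, the codes would be one lower than the recorded years. Such a constant shift would move the reported means and medians by one year but leave differences between groups, variances and test statistics unchanged.

```r
# Toy example with made-up values (not taken from the survey)
edu <- factor(c(8, 10, 12, 12, 16))

# Level codes: positions of the values among the sorted levels 8, 10, 12, 16
codes <- as.numeric(edu)
print(codes)

# Recorded values: convert to character first, then to numeric
values <- as.numeric(as.character(edu))
print(values)

# The two conversions give different means
print(c(mean_of_codes = mean(codes), mean_of_values = mean(values)))
```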

ggplot(data1, aes(x = as.numeric(eduyrs), fill = gndr)) +
  geom_histogram(binwidth = 1, position = "identity", alpha = .8) +
  theme_minimal() +
  scale_fill_brewer(palette = "Pastel2") +
  ggtitle("Variable's distribution") +
  xlab("Years spent on education") +
  ylab("Frequency") +
  guides(fill=guide_legend(title="Gender"))

Our next step is a boxplot, which depicts the two groups through their quartiles.

ggplot(data, aes(x = gndr, y = as.numeric(eduyrs))) +
  geom_boxplot() +
  ggtitle("Time spent on education for different sexes") +
  xlab("Gender") +
  ylab("Years spent") +
  theme_minimal()

Here we can see that the medians for men and women differ slightly: 13 years for men and 12 years for women. Although there are some outliers, they are few, so we proceed with the analysis without bootstrapping.

Let us check normality a second time with QQ plots. A normal QQ plot compares the sample quantiles with the quantiles of a theoretical normal distribution by plotting them against each other. The QQ plots show that the two groups have similar distributions that are approximately normal.

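# Splitting the sample by gender
female <- subset(data1, data1$gndr == "Female")
male <- subset(data1, data1$gndr == "Male")
# Normal QQ plots with a red reference line (col = 2 is an argument of qqline, not of as.numeric)
plot3 <- qqnorm(as.numeric(female$eduyrs)); qqline(as.numeric(female$eduyrs), col = 2)

plot4 <- qqnorm(as.numeric(male$eduyrs)); qqline(as.numeric(male$eduyrs), col = 2)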

Assumptions and hypotheses

We then check the assumptions of the t-test. Let us start by testing the homogeneity of variances:

H0: Variances do not differ.
H1: Variances differ.

var.test(as.numeric(data1$eduyrs) ~ data1$gndr)

## 
##  F test to compare two variances
## 
## data:  as.numeric(data1$eduyrs) by data1$gndr
## F = 1.0052, num df = 1492, denom df = 1318, p-value = 0.923
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.905048 1.116140
## sample estimates:
## ratio of variances 
##           1.005241

The p-value of the F test is large (0.923, greater than 0.05), so we fail to reject H0 and treat the variances as equal.

The groups are sampled from approximately normal distributions with equal variances, so the assumptions hold.
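To make the F test above more concrete: its statistic is simply the ratio of the two sample variances, and its degrees of freedom are the group sizes minus one. The sketch below shows this on simulated data; the vectors x and y are made up for illustration and are not the survey values.

```r
set.seed(1)
# Two simulated groups with the same true standard deviation (illustrative only)
x <- rnorm(200, mean = 13.4, sd = 3)
y <- rnorm(180, mean = 13.1, sd = 3)

vt <- var.test(x, y)

# The F statistic equals the ratio of the sample variances
print(unname(vt$statistic))
print(var(x) / var(y))

# Degrees of freedom: n - 1 for each group
print(vt$parameter)
```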

Here are hypotheses for the test:

H0: In the population, the mean number of years spent on education does not differ between men and women.
H1: In the population, the mean number of years spent on education differs between men and women.

T-test

t.test(as.numeric(data1$eduyrs) ~ data1$gndr)

## 
##  Welch Two Sample t-test
## 
## data:  as.numeric(data1$eduyrs) by data1$gndr
## t = 2.6906, df = 2769.2, p-value = 0.007175
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.09112623 0.58082024
## sample estimates:
##   mean in group Male mean in group Female 
##             13.44742             13.11145

Conclusion

Since the p-value (0.007) is smaller than 0.05, we reject H0 in favour of H1: the group means are significantly different. According to the test output, the mean is 13.4 years of education for men and 13.1 for women.
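A note on the output above: t.test() uses Welch's version by default (var.equal = FALSE), which is why the output is labelled "Welch Two Sample t-test" even though the F test suggested equal variances. The sketch below, on simulated data that are not the survey values, shows how the Welch statistic is computed by hand and how the pooled-variance version can be requested with var.equal = TRUE. The names men and women are illustrative.

```r
set.seed(2)
# Simulated years of education for two groups (illustrative only)
men   <- rnorm(150, mean = 13.4, sd = 3)
women <- rnorm(130, mean = 13.1, sd = 3)

# Default: Welch's t-test, which does not assume equal variances
welch <- t.test(men, women)

# Welch statistic by hand: difference in means over its standard error
se <- sqrt(var(men) / length(men) + var(women) / length(women))
t_manual <- (mean(men) - mean(women)) / se
print(c(t_manual = t_manual, t_welch = unname(welch$statistic)))

# Pooled-variance (Student) version, appropriate when variances are equal
pooled <- t.test(men, women, var.equal = TRUE)
print(pooled$parameter)   # degrees of freedom: n1 + n2 - 2
```

With nearly equal variances and similar group sizes, the two versions usually give very close results.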

As a robustness check, we also ran the Wilcoxon rank-sum test, a distribution-free nonparametric test that does not rely on the normality assumption.

H0: In the population, the distributions of years spent on education for men and women do not differ in location.
H1: In the population, the distributions of years spent on education for men and women differ in location.

Non-parametric (Wilcoxon) test

wilcox.test(as.numeric(data1$eduyrs) ~ data1$gndr)

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  as.numeric(data1$eduyrs) by data1$gndr
## W = 1050800, p-value = 0.001959
## alternative hypothesis: true location shift is not equal to 0

Conclusion

Again, the p-value (0.002) is very small, so we reject H0: the Wilcoxon test confirms that years of education differ significantly between men and women.
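For intuition about the W statistic reported by wilcox.test(): it counts the pairs in which an observation from the first group exceeds one from the second, with ties counted as one half. The sketch below checks this on simulated integer data, which are made up for illustration and not taken from the survey.

```r
set.seed(3)
# Simulated whole years of education for two groups (illustrative only)
x <- sample(8:20, 60, replace = TRUE)
y <- sample(8:20, 50, replace = TRUE)

# Normal approximation, as used in the report's output
wt <- wilcox.test(x, y, exact = FALSE)

# W by hand: pairs with x > y, plus half of the tied pairs
w_manual <- sum(outer(x, y, ">")) + 0.5 * sum(outer(x, y, "=="))
print(c(W = unname(wt$statistic), W_manual = w_manual))
```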

Finally

Overall, we find that, on average, men receive more education (measured in years). Since we have already found that men tend to take a side on increasing taxes, whereas women tend to stay neutral, we may hypothesize that education is one of the factors behind this difference. Hypothetically, men’s higher mean number of years of education might help explain their attitude towards the tax increase. For instance, education might give them more knowledge about both the taxation system and environmental problems, which may make it easier to choose a side: to oppose the tax increase or to support it. Women, who on average have slightly fewer years of education, might more often lack the knowledge to make such a decision and prefer to stay neutral.

Project 2. Chi-squared and t-test
