Areas of members’ responsibilities:
While the country of our interest is still Germany, in this particular project we use data from two themes, namely Social Demographics and Climate Change, in order to investigate the connection between them. The data was collected in 2016.
library(foreign)
library(ggplot2)
library(corrplot)
library(gridExtra)
library(sjPlot)
library(knitr)
#Uploading data
data <- read.spss("ESS8DE.sav", use.value.labels = TRUE, to.data.frame = TRUE)
Let’s begin with the Chi-squared test. The test requires two categorical variables; we chose gender and people’s attitudes towards increasing taxes on fossil fuels, such as oil, gas and coal, with the aim of reducing climate change.
#Removing NAs from dataset which is used in analysis
data1 <- data[!is.na(data$inctxff),]
#Creating variables and assigning values to them
gender <- data1$gndr
att <- data1$inctxff
Here is a short description of our variables:
## Male Female
## 1493 1321
## Strongly in favour Somewhat in favour
## 241 816
## Neither in favour nor against Somewhat against
## 668 798
## Strongly against
## 291
Now, to form expectations before testing, we visualize the chosen variables. Below are a stacked bar plot showing the gender composition of each answer category and a grouped bar chart showing the frequencies of answers in each category by gender.
#Creating plots
set_theme(legend.pos = "top", legend.inside = TRUE, axis.textsize = 0.8, title.align = "center")
plot1 <- sjp.xtab(att, gender, bar.pos = "stack", legend.title = "Gender",
                  axis.titles = "Attitude towards tax increase",
                  title = "Gender composition of taxation attitude",
                  show.total = FALSE, margin = "row",
                  geom.colors = "Pastel2")
plot2 <- sjp.grpfrq(att, gender, type = "bar", legend.title = "Gender",
                    geom.spacing = -1,
                    axis.titles = "Attitude towards tax increase",
                    title = "Attitude distribution by gender",
                    show.prc = FALSE, geom.colors = "Pastel2")
grid.arrange(plot1, plot2, ncol=2)
As can be seen from the graphs, the proportions of men and women differ in all five answer categories. The only category in which women outnumber men is “Neither in favour nor against”, while the largest gap is observed in the “Strongly against” category. The second largest gap is in the “Strongly in favour” category, while the remaining two categories differ by roughly 10 percentage points. “Somewhat against” was the most popular option for men and “Neither in favour nor against” for women. “Strongly in favour” was chosen by the fewest men and the fewest women.
Overall, women tend to choose the “Neither in favour nor against” option and remain neutral, while men are more likely to express an opinion and choose a side.
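We then check that the assumptions of the chi-square test are met: both variables are categorical, the observations are independent (each respondent falls into exactly one cell), and the expected frequency in every cell is at least 5.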
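Here are the hypotheses for the test: H0 states that gender and attitude towards the tax increase are independent; H1 states that they are dependent.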
To check whether the chosen variables are dependent we apply Pearson’s Chi-squared test. This requires a contingency table of observed frequencies.
ct<-table(gender, att)
kable(ct)
Gender | Strongly in favour | Somewhat in favour | Neither in favour nor against | Somewhat against | Strongly against |
---|---|---|---|---|---|
Male | 143 | 440 | 281 | 446 | 183 |
Female | 98 | 376 | 387 | 352 | 108 |
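The gender shares discussed earlier and the expected-count condition of the test can be recomputed directly from these counts. The following is a minimal sketch that rebuilds the table by hand instead of loading the ESS file; the names `col_share` and `expected` are ours and do not appear in the original analysis.

```r
# Observed counts copied from the contingency table above
ct <- as.table(matrix(
  c(143, 440, 281, 446, 183,
     98, 376, 387, 352, 108),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    gender = c("Male", "Female"),
    att = c("Strongly in favour", "Somewhat in favour",
            "Neither in favour nor against", "Somewhat against",
            "Strongly against")
  )
))

# Share of men and women within each answer category (each column sums to 100)
col_share <- round(100 * prop.table(ct, margin = 2), 1)
print(col_share)

# Expected counts under independence: row total * column total / grand total
expected <- outer(rowSums(ct), colSums(ct)) / sum(ct)

# The chi-square approximation is usually considered safe when every
# expected count is at least 5
min(expected) >= 5
```

Column percentages show the gender composition of each answer, which is what the stacked bar plot displays.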
Then we run Pearson’s Chi-squared test using the function chisq.test().
test <-chisq.test(ct)
test
##
## Pearson's Chi-squared test
##
## data: ct
## X-squared = 50.32, df = 4, p-value = 3.096e-10
We obtained a Chi-square statistic of 50.32 with 4 degrees of freedom and a p-value of 3.096e-10 (0.0000000003096). The critical value of χ² with 4 degrees of freedom at the 0.05 significance level is 9.49. The obtained statistic exceeds this critical value and the p-value is far below 0.05, meaning that the probability of obtaining the observed, or more extreme, results if the null hypothesis (H0: the variables are independent) were true is extremely low. Therefore, since we have strong evidence of dependence between the variables, we reject the null hypothesis.
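The critical value and the p-value quoted here can be checked with the chi-square distribution functions in base R. This is only a sketch that plugs in the rounded statistic from the output; the variable names are ours.

```r
stat  <- 50.32   # chi-square statistic reported in the test output
dof   <- 4       # (2 - 1) * (5 - 1) degrees of freedom
alpha <- 0.05    # significance level

# Critical value: the 95th percentile of the chi-square distribution
crit <- qchisq(1 - alpha, dof)
crit

# p-value: probability of a statistic at least this large under H0
p_value <- pchisq(stat, dof, lower.tail = FALSE)
p_value

# Decision rule: reject H0 when the statistic exceeds the critical value
stat > crit
```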
Since the Chi-square test statistic is significant, we would like to take a look at the residuals. So, let’s create tables of expected and observed frequencies and then of residuals.
Expected frequencies:
Gender | Strongly in favour | Somewhat in favour | Neither in favour nor against | Somewhat against | Strongly against |
---|---|---|---|---|---|
Male | 127.8653 | 432.9382 | 354.4151 | 423.3881 | 154.3934 |
Female | 113.1347 | 383.0618 | 313.5849 | 374.6119 | 136.6066 |
Observed frequencies:
Gender | Strongly in favour | Somewhat in favour | Neither in favour nor against | Somewhat against | Strongly against |
---|---|---|---|---|---|
Male | 143 | 440 | 281 | 446 | 183 |
Female | 98 | 376 | 387 | 352 | 108 |
Standardized residuals:
Gender | Strongly in favour | Somewhat in favour | Neither in favour nor against | Somewhat against | Strongly against |
---|---|---|---|---|---|
Male | 2.042912 | 0.5878676 | -6.517588 | 1.894942 | 3.548674 |
Female | -2.042912 | -0.5878676 | 6.517588 | -1.894942 | -3.548674 |
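The code that produced these three tables is not shown in the report. One way to obtain them, sketched here on a hand-built copy of the contingency table, is to read the `expected`, `observed` and `stdres` components of the object returned by chisq.test(); the residual values above match `stdres` (standardized residuals) rather than `residuals` (Pearson residuals).

```r
# Observed counts copied from the contingency table in the text
ct <- as.table(matrix(
  c(143, 440, 281, 446, 183,
     98, 376, 387, 352, 108),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    gender = c("Male", "Female"),
    att = c("Strongly in favour", "Somewhat in favour",
            "Neither in favour nor against", "Somewhat against",
            "Strongly against")
  )
))

test <- chisq.test(ct)

test$expected   # counts expected under independence
test$observed   # the observed counts, identical to ct
test$stdres     # standardized residuals

# Cells whose standardized residual exceeds 2 in absolute value
abs(test$stdres) > 2
```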
According to the table, some residuals have an absolute value greater than 2. We then draw an association plot to take a closer look at the residuals.
We observe that, for male respondents, all cells except the “Neither in favour nor against” category contain more observations than we would expect if the variables were independent. For women it is the other way around: the four remaining categories contain fewer observations than expected, and only “Neither in favour nor against” contains more. While in the “Somewhat in favour” category the difference between expected and observed counts is small, a considerable difference can be observed even in its opposite category, “Somewhat against”.
The same pattern can be seen in the correlation plot drawn below. There is a strong positive association between female respondents and the “Neither in favour nor against” category, while for males it is the only category with a negative association.
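The plotting code for the association plot and the correlation plot is not included in the report. A possible sketch, assuming the base `assocplot()` function and `corrplot()` from the corrplot package loaded at the top, is given below; the table is rebuilt by hand and the colour choices are ours.

```r
# Observed counts copied from the contingency table in the text
ct <- as.table(matrix(
  c(143, 440, 281, 446, 183,
     98, 376, 387, 352, 108),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    gender = c("Male", "Female"),
    att = c("Strongly in favour", "Somewhat in favour",
            "Neither in favour nor against", "Somewhat against",
            "Strongly against")
  )
))
test <- chisq.test(ct)

# Association plot: bars above the baseline mark cells with more
# observations than expected, bars below mark cells with fewer
assocplot(ct, col = c("steelblue", "tomato"),
          xlab = "Gender", ylab = "Attitude towards tax increase")

# Residual plot with corrplot; is.corr = FALSE because the input is a
# matrix of residuals, not a correlation matrix
if (requireNamespace("corrplot", quietly = TRUE)) {
  corrplot::corrplot(test$stdres, is.corr = FALSE)
}
```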
Overall, we can conclude that the chosen variables are dependent: attitude towards increasing taxes on fossil fuels, such as oil, gas and coal, with the aim of reducing climate change is associated with the gender of the respondent. In particular, females tend to choose the “Neither in favour nor against” option and stay neutral, while males tend to choose one of the two sides of the argument, with a tendency to be against the tax increase more often than we would expect if the variables were independent.
Having analysed gender and attitudes, let us turn to the variable eduyrs, the total number of years spent in education. We compare years of education between men and women. Because the two groups consist of different respondents measured on the same variable, we need an independent-samples test rather than a paired one. Let us first take a look at the data:
summary(data1$eduyrs)
## 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 2 3 1 5 4 0 47 67 138 226 421 427 257 276 272
## 17 18 19 20 21 22 23 24 25 26 27 28 NA's
## 191 165 113 86 37 27 26 7 9 3 1 1 2
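First of all, our variable is stored as a factor. Later in the analysis we convert it to numeric and treat it as a continuous, ratio-scale variable. One caveat: as.numeric() applied directly to a factor returns the internal level codes rather than the labels. Judging by the summary output, the levels run consecutively from 2 to 28, so the codes equal the years minus one; this shifts the reported means and medians down by one year but leaves the differences between groups, the variances and the test statistics unchanged. Moving on to the histogram of the variable's distribution: the x-axis shows the number of years and the y-axis shows how many times each value occurs. The distribution looks approximately normal in the graph, but we will still check this assumption with a QQ plot.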
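Converting a factor to numeric deserves some care, because as.numeric() on a factor returns the internal level codes, not the labels. The toy example below uses a made-up vector named `years`, not the ESS data, to show the difference and the usual safe conversion through as.character().

```r
# Hypothetical factor standing in for eduyrs; levels are the sorted values
years <- factor(c(12, 13, 13, 16, 2, 9))
levels(years)

# Direct conversion returns the position of each value among the levels
codes <- as.numeric(years)
codes

# Converting through character recovers the labelled values
values <- as.numeric(as.character(years))
values

# The two only agree when the levels happen to be 1, 2, 3, ...
mean(codes)
mean(values)
```

When the levels are consecutive integers, the codes differ from the values by a constant, so group differences and rank-based statistics are unaffected, but reported means and medians are shifted.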
ggplot(data1, aes(x = as.numeric(eduyrs), fill = gndr)) +
geom_histogram(binwidth = 1, position = "identity", alpha = .8) +
theme_minimal() +
scale_fill_brewer(palette = "Pastel2") +
ggtitle("Variable's distribution") +
xlab("Years spent on education") +
ylab("Frequency") +
guides(fill=guide_legend(title="Gender"))
Our next step is a boxplot, which summarises each group through its quartiles.
ggplot(data, aes (x = gndr, y = as.numeric(eduyrs))) +
geom_boxplot() +
ggtitle("Time spent on education for different sexes") +
xlab("Gender") +
ylab("Years spent") +
theme_minimal()
Here we can see that the medians for men and women differ slightly: 13 years for men and 12 years for women. Although there are some outliers, they are few, so we can proceed with the analysis without bootstrapping.
Let us check normality a second time, with QQ plots. A normal QQ plot compares the sample quantiles with the quantiles of a theoretical normal distribution by plotting them against each other. The plots show that the two groups have similar distributions that are approximately normal.
female <- subset(data1, data1$gndr == "Female")
male <- subset(data1, data1$gndr == "Male")
# col = 2 belongs to qqline(), not to as.numeric()
plot3 <- qqnorm(as.numeric(female$eduyrs)); qqline(as.numeric(female$eduyrs), col = 2)
plot4 <- qqnorm(as.numeric(male$eduyrs)); qqline(as.numeric(male$eduyrs), col = 2)
We then check the assumptions of the t-test. Let us test the homogeneity of variances:
var.test(as.numeric(data1$eduyrs) ~ data1$gndr)
##
## F test to compare two variances
##
## data: as.numeric(data1$eduyrs) by data1$gndr
## F = 1.0052, num df = 1492, denom df = 1318, p-value = 0.923
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.905048 1.116140
## sample estimates:
## ratio of variances
## 1.005241
The p-value of the variance test is large (0.923, well above 0.05), so we cannot reject H0 of equal variances and treat the variances as equal.
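The p-value and the confidence interval of the F test can be reproduced from the reported ratio of variances and the degrees of freedom. The sketch below only plugs in numbers from the output above; the variable names are ours.

```r
f_stat <- 1.005241   # ratio of variances from the var.test output
df_num <- 1492       # numerator degrees of freedom (men)
df_den <- 1318       # denominator degrees of freedom (women)

# Two-sided p-value: twice the smaller tail probability
p_upper <- pf(f_stat, df_num, df_den, lower.tail = FALSE)
p_value <- 2 * min(p_upper, 1 - p_upper)
p_value

# 95% confidence interval for the true ratio of variances
ci <- f_stat / qf(c(0.975, 0.025), df_num, df_den)
ci
```

The interval contains 1, which is another way of saying that equal variances cannot be ruled out.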
Both groups can therefore be treated as samples from approximately normal distributions with equal variances, so the assumptions of the t-test hold.
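Here are the hypotheses for the test: H0 states that men and women have the same mean number of years of education; H1 states that the means differ.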
t.test(as.numeric(data1$eduyrs) ~ data1$gndr)
##
## Welch Two Sample t-test
##
## data: as.numeric(data1$eduyrs) by data1$gndr
## t = 2.6906, df = 2769.2, p-value = 0.007175
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.09112623 0.58082024
## sample estimates:
## mean in group Male mean in group Female
## 13.44742 13.11145
Since our p-value (0.007) is smaller than 0.05, we reject H0 in favour of H1: the mean number of years of education differs significantly between men and women. According to the test, men have on average 13.4 years of education and women 13.1.
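The p-value and the confidence interval in the t-test output follow from the group means, the t statistic and the degrees of freedom. The sketch below recomputes them from the rounded figures in the output, so small rounding differences are possible; the variable names are ours.

```r
mean_male   <- 13.44742   # group means from the t.test output
mean_female <- 13.11145
t_stat      <- 2.6906     # reported t statistic
df_welch    <- 2769.2     # reported Welch degrees of freedom

# Difference in means and the standard error implied by t = diff / se
mean_diff <- mean_male - mean_female
se_diff   <- mean_diff / t_stat

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df_welch)
p_value

# 95% confidence interval for the difference in means
ci <- mean_diff + c(-1, 1) * qt(0.975, df_welch) * se_diff
ci
```

Note that t.test() runs the Welch version by default, which does not assume equal variances; passing `var.equal = TRUE` requests the pooled-variance Student test instead.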
As a check on the t-test result we also ran a distribution-free (nonparametric) test, that is, a hypothesis test that does not rely on assumptions about the underlying distribution. Since we compare two independent groups defined by a non-metric (categorical) variable, we use the Wilcoxon rank-sum test.
wilcox.test(as.numeric(data1$eduyrs) ~ data1$gndr)
##
## Wilcoxon rank sum test with continuity correction
##
## data: as.numeric(data1$eduyrs) by data1$gndr
## W = 1050800, p-value = 0.001959
## alternative hypothesis: true location shift is not equal to 0
Again, the p-value is very small, so we reject H0 and conclude that the distribution of years of education is shifted between men and women, which agrees with the result of the t-test.
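To get a feeling for the size of the shift, the W statistic can be turned into a rough probability-of-superiority estimate by dividing it by the number of male-female pairs. This is only a sketch: it uses the rounded W from the output and the group sizes from the description of the variables, ignoring the two missing values in eduyrs, so the result is approximate.

```r
w_stat   <- 1050800   # W from the wilcox.test output (rounded in the printout)
n_male   <- 1493      # group sizes before dropping the 2 NAs in eduyrs
n_female <- 1321

# W counts the male-female pairs in which the man has more years of
# education (ties count one half), so W / (n1 * n2) estimates the chance
# that a randomly chosen man has more years than a randomly chosen woman
p_superiority <- w_stat / (n_male * n_female)
p_superiority
```

A value of 0.5 would mean no difference between the groups, so a result only slightly above 0.5 points to a statistically significant but small shift.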
Overall, we find that, on average, men spend more time in education (measured in years), although the difference is only about a third of a year. Since we have already found that men tend to take a side on the tax increase, whereas women tend to stay neutral, we can hypothesise that one of the factors influencing this decision is education. The higher mean number of years of education among men might help to explain their attitude towards the tax increase. For instance, education might give them more knowledge about both the taxation system and environmental problems, which may make it easier to choose a side: to oppose the tax increase or to support it. Women with less education, on the other hand, might lack the knowledge to make such a decision and prefer to stay neutral.