To use the t-test we need to choose two variables: one categorical (with two groups) and one continuous.
H0 hypothesis: there is no difference in mean years spent on education between people who are concerned about climate change and people who are not.
H1 hypothesis: there is a difference in mean years spent on education between people who are concerned about climate change and people who are not.
To check these hypotheses, we used a t-test.
First, we converted the variable that measures the level of concern about climate change into two categories: [1,2] – «not concerned», [3,4] – «concerned».
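A minimal sketch of this recoding, assuming the original item is stored as a numeric score from 1 to 4. The vector `wrclmch` below holds made-up illustration values, not the ESS1 data, and the labels are chosen to match the group names in the t-test output.

```r
# Hypothetical concern scores on the 1-4 scale (illustration only)
wrclmch <- c(1, 2, 3, 4, 2, 3, NA)

# (0, 2] -> "Not concerned", (2, 4] -> "Concerned"; NA stays NA
wrclmch2 <- cut(wrclmch,
                breaks = c(0, 2, 4),
                labels = c("Not concerned", "Concerned"))

# Check the recoding, including missing values
table(wrclmch2, useNA = "ifany")
```

Using `cut()` keeps the result a factor with a fixed level order, which is what the formula interface of `t.test()` expects for the grouping variable.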
H0 hypothesis: variable is distributed normally.
H1 hypothesis: variable is not distributed normally.
shapiro.test(ESS1$eduyrs1)
##
## Shapiro-Wilk normality test
##
## data: ESS1$eduyrs1
## W = 0.98536, p-value = 5.087e-13
Conclusion: According to the Shapiro-Wilk normality test, eduyrs1 is not normally distributed, because the p-value is below 0.05 (p-value = 5.087e-13).
ggplot() +
geom_histogram(data = ESS1, aes(x = eduyrs1), binwidth = 1, fill="#008080", col="#483D8B", alpha = 0.5)+
ggtitle("Distribution of years of full-time education completed") +
theme_bw()
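Conclusion: Although the Shapiro-Wilk test rejects normality, the histogram looks close to normal, so we do not need to log-transform the variable. With a sample this large, the test flags even small deviations from normality.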
To check the equality of variances we can use Levene’s test or Bartlett’s test.
H0 hypothesis: the variances are equal.
H1 hypothesis: the variances are not equal.
We used Levene’s test because it is considered the better option in such situations (it is less sensitive to non-normally distributed data).
leveneTest(eduyrs1 ~ wrclmch2, data=ESS1)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 5.5123 0.01899 *
## 1898
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Conclusion: According to Levene’s test, the variances are not equal, because the p-value is lower than 0.05 (p-value = 0.01899), so we run the t-test without assuming equal variances (Welch’s t-test).
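To illustrate what the median-centred Levene’s test computes, here is a base R sketch on simulated data (not ESS1; the group sizes, means and standard deviations are assumptions). The test is a one-way ANOVA on the absolute deviations of each observation from its group median.

```r
set.seed(42)  # simulated data, not ESS1

# Two hypothetical groups with different spreads
group <- factor(rep(c("Not concerned", "Concerned"), each = 100))
years <- c(rnorm(100, mean = 10.6, sd = 4), rnorm(100, mean = 12.1, sd = 3))

# Absolute deviation of each observation from its group median
abs_dev <- abs(years - ave(years, group, FUN = median))

# One-way ANOVA on the deviations; its F test is the Levene (median) test
levene_tab <- anova(lm(abs_dev ~ group))
levene_tab
```

Centring on the median rather than the mean is what makes the test less sensitive to skewed or heavy-tailed data.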
t = t.test(eduyrs1 ~ wrclmch2, data = ESS1, var.equal = FALSE)
t
##
## Welch Two Sample t-test
##
## data: eduyrs1 by wrclmch2
## t = -6.615, df = 631.37, p-value = 7.928e-11
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.884770 -1.021899
## sample estimates:
## mean in group Not concerned mean in group Concerned
## 10.64885 12.10219
Conclusion: The p-value is lower than 0.05 (p-value = 7.928e-11), so we reject H0 in favour of H1. Mean years of education differ between the two groups: 12.10 years for people who are concerned about climate change versus 10.65 years for people who are not.
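As a sketch of how the Welch statistic and its degrees of freedom are obtained, the following base R code computes them by hand and compares them with `t.test()`. The data are simulated (the sample sizes, means and standard deviations are assumptions), so the numbers will not match the ESS1 output above.

```r
set.seed(42)  # simulated data, not ESS1
x <- rnorm(120, mean = 10.6, sd = 4)   # hypothetical "Not concerned" group
y <- rnorm(300, mean = 12.1, sd = 3)   # hypothetical "Concerned" group

# Squared standard errors, using separate group variances
se2_x <- var(x) / length(x)
se2_y <- var(y) / length(y)
t_stat <- (mean(x) - mean(y)) / sqrt(se2_x + se2_y)

# Welch-Satterthwaite degrees of freedom
df_welch <- (se2_x + se2_y)^2 /
  (se2_x^2 / (length(x) - 1) + se2_y^2 / (length(y) - 1))

# Two-sided p-value
p_value <- 2 * pt(-abs(t_stat), df = df_welch)

# Compare with the built-in Welch test
res <- t.test(x, y, var.equal = FALSE)
c(by_hand = t_stat, built_in = unname(res$statistic))
```

The non-integer degrees of freedom in the output above (df = 631.37) come from this Welch-Satterthwaite approximation.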
ggplot() +
geom_boxplot(data = ESS1, aes(x = wrclmch2, y = eduyrs1 ), col = "#E52B50", fill = "#F0F8FF") +
ylab("Years of full-time education") +
ggtitle("Years of full-time education completed and degree of concern") +
theme_bw()
Conclusion: The boxplot shows that people who are concerned about climate change tend to have slightly more years of education (roughly 9-15 years versus roughly 8-13 years). However, there are outliers: some people who studied for 20-25 years are not concerned, while some people with 25+ years of education are concerned about climate change.
To use the chi-square test we need to choose two categorical variables.
table(ESS1$rdcenr, ESS1$gndr)
##
## Male Female
## Never 42 27
## Sometimes 219 177
## Often 347 337
## Very often 237 313
## Always 102 94
Conclusion: every cell of this table contains far more than 5 observations (the smallest count is 27), so we have enough observations to use the chi-square test.
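Strictly speaking, the usual rule of thumb applies to the expected counts rather than the observed ones. A short sketch of that check, using the observed counts copied from the table above (the object names `tab` and `expected` are chosen for illustration):

```r
# Observed counts copied from the table above
tab <- matrix(c(42, 219, 347, 237, 102,
                27, 177, 337, 313,  94),
              ncol = 2,
              dimnames = list(rdcenr = c("Never", "Sometimes", "Often",
                                         "Very often", "Always"),
                              gndr = c("Male", "Female")))

# Expected counts under independence: row total * column total / grand total
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
round(expected, 1)

# Rule of thumb: all expected counts should be at least 5
all(expected >= 5)
```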
ch = chisq.test(ESS1$rdcenr, ESS1$gndr)
ch
##
## Pearson's Chi-squared test
##
## data: ESS1$rdcenr and ESS1$gndr
## X-squared = 18.689, df = 4, p-value = 0.0009044
Conclusion: The p-value is less than 0.05 (p-value = 0.0009044), so we reject H0 of independence: how often people do things to reduce energy use is related to gender.
df_resid = as.data.frame(ch$residuals)
df_resid
## ESS1.rdcenr ESS1.gndr Freq
## 1 Never Male 1.2803222
## 2 Sometimes Male 1.5002264
## 3 Often Male 0.2802019
## 4 Very often Male -2.2833378
## 5 Always Male 0.4093931
## 6 Never Female -1.2796468
## 7 Sometimes Female -1.4994349
## 8 Often Female -0.2800541
## 9 Very often Female 2.2821332
## 10 Always Female -0.4091771
df_count = as.data.frame(ch$observed)
df_count
## ESS1.rdcenr ESS1.gndr Freq
## 1 Never Male 42
## 2 Sometimes Male 219
## 3 Often Male 347
## 4 Very often Male 237
## 5 Always Male 102
## 6 Never Female 27
## 7 Sometimes Female 177
## 8 Often Female 337
## 9 Very often Female 313
## 10 Always Female 94
ggplot() +
geom_raster(data = df_resid, aes(x = ESS1.gndr, y = ESS1.rdcenr, fill = Freq), hjust = 0.5, vjust = 0.5) +
scale_fill_gradient2("Pearson residuals", low = "#2166ac", mid = "#f7f7f7", high = "#b2182b", midpoint = 0) +
geom_text(data = df_count, aes(x = ESS1.gndr, y = ESS1.rdcenr, label = Freq)) +
xlab("Gender") +
ylab("How often do things to reduce energy use") +
theme_bw()
Conclusion: More women than expected do things to reduce energy use very often (Pearson residual 2.28), and fewer men than expected do so (Pearson residual -2.28).
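For reference, a Pearson residual is (observed - expected) / sqrt(expected). The sketch below recomputes the residuals from the observed counts in the table above and compares them with what `chisq.test()` returns; the object names are illustrative.

```r
# Observed counts copied from the table above
tab <- matrix(c(42, 219, 347, 237, 102,
                27, 177, 337, 313,  94),
              ncol = 2,
              dimnames = list(rdcenr = c("Never", "Sometimes", "Often",
                                         "Very often", "Always"),
                              gndr = c("Male", "Female")))

ch_tab <- chisq.test(tab)

# Pearson residuals by hand
expected <- ch_tab$expected
pearson_resid <- (tab - expected) / sqrt(expected)
round(pearson_resid, 2)

# The chi-squared statistic is the sum of the squared residuals
sum(pearson_resid^2)
```

Residuals with an absolute value above about 2 mark the cells that contribute most to the test statistic, which here is the "Very often" row.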
To use ANOVA we need to choose two variables: one categorical and one continuous.
ggplot() +
geom_boxplot(data = ESS1, aes(x = rdcenr, y = eduyrs1 ), col = "#E52B50", fill = "#F0F8FF") +
ylab("Years of full-time education") +
xlab("How often do things to reduce energy use")+
theme_bw()
We visualised our data. According to the boxplots, the highest median years of education are found among people who answered “often” and “very often”, and the lowest among people who answered “never”.
Because our sample is fairly small (fewer than 5000 observations), we test the assumptions before running the ANOVA.
H0 hypothesis: variable is distributed normally.
H1 hypothesis: variable is not distributed normally.
shapiro.test(ESS1$eduyrs1[ESS1$rdcenr == "Often"])
##
## Shapiro-Wilk normality test
##
## data: ESS1$eduyrs1[ESS1$rdcenr == "Often"]
## W = 0.98715, p-value = 1.026e-05
shapiro.test(ESS1$eduyrs1[ESS1$rdcenr == "Very often"])
##
## Shapiro-Wilk normality test
##
## data: ESS1$eduyrs1[ESS1$rdcenr == "Very often"]
## W = 0.98776, p-value = 0.0001455
shapiro.test(ESS1$eduyrs1[ESS1$rdcenr == "Always"])
##
## Shapiro-Wilk normality test
##
## data: ESS1$eduyrs1[ESS1$rdcenr == "Always"]
## W = 0.97496, p-value = 0.001399
shapiro.test(ESS1$eduyrs1[ESS1$rdcenr == "Never"])
##
## Shapiro-Wilk normality test
##
## data: ESS1$eduyrs1[ESS1$rdcenr == "Never"]
## W = 0.96919, p-value = 0.08606
shapiro.test(ESS1$eduyrs1[ESS1$rdcenr == "Sometimes"])
##
## Shapiro-Wilk normality test
##
## data: ESS1$eduyrs1[ESS1$rdcenr == "Sometimes"]
## W = 0.96886, p-value = 1.805e-07
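We checked the normality of the variable in each category using the Shapiro-Wilk normality test again. Normality is rejected (p-value < 0.05) in the “Often”, “Very often”, “Always” and “Sometimes” categories; only in the “Never” category is it not rejected (p-value = 0.08606).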
ggplot() +
geom_histogram(data = ESS1, aes(x = eduyrs1), binwidth = 1, fill="#008080", col="#483D8B", alpha = 0.5)+
ggtitle("Distribution of years of full-time education completed") +
facet_grid(~rdcenr) +
theme_bw()
Because the Shapiro-Wilk test rejects normality in most categories, we created histograms by category to see whether the distributions are close to normal. Visually, each distribution is close to normal.
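The five separate `shapiro.test()` calls can be written more compactly with `tapply()`. This is a sketch on simulated data: the data frame `d`, its group sizes and its distribution parameters are assumptions for illustration, not the ESS1 values.

```r
set.seed(42)  # simulated data, not ESS1
d <- data.frame(
  rdcenr  = factor(rep(c("Never", "Often", "Always"), times = c(60, 200, 100))),
  eduyrs1 = c(rnorm(60, 11, 4), rnorm(200, 12, 4), rnorm(100, 12, 4))
)

# One Shapiro-Wilk p-value per category
p_by_group <- tapply(d$eduyrs1, d$rdcenr,
                     function(v) shapiro.test(v)$p.value)
round(p_by_group, 4)

# Categories where normality is rejected at the 0.05 level
names(p_by_group)[p_by_group < 0.05]
```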
To check the equality of variances we used Levene’s test and Bartlett’s test.
H0 hypothesis: the variances are equal.
H1 hypothesis: the variances are not equal.
bartlett.test(eduyrs1 ~ rdcenr, data=ESS1)
##
## Bartlett test of homogeneity of variances
##
## data: eduyrs1 by rdcenr
## Bartlett's K-squared = 8.5218, df = 4, p-value = 0.07423
leveneTest(eduyrs1 ~ rdcenr, data=ESS1)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 4 1.6735 0.1535
## 1890
According to Levene’s test and Bartlett’s test, the variances are equal, because the p-value is bigger than 0.05 (for Levene’s test, p-value = 0.1535; for Bartlett’s test, p-value = 0.07423).
H0 hypothesis: mean years of education do not differ between people with different readiness to reduce their energy consumption.
H1 hypothesis: mean years of education differ between at least two groups of people with different readiness to reduce their energy consumption.
oneway.test(eduyrs1 ~ rdcenr, data = ESS1, var.equal = TRUE)
##
## One-way analysis of means
##
## data: eduyrs1 and rdcenr
## F = 4.1317, num df = 4, denom df = 1890, p-value = 0.002456
According to the result of the ANOVA, the F-ratio is significant.
Conclusion: years of education differ between groups with different readiness to reduce energy consumption (F(4, 1890) = 4.1317, p-value = 0.002456).
aov.out <- aov(ESS1$eduyrs1 ~ as.factor(ESS1$rdcenr))
layout(matrix(1:1,2,2))
par(mar=c(6, 25, 3, 2))
plot(TukeyHSD(aov.out), las = 2)
As the variances are equal, we use Tukey’s post-hoc test rather than the Bonferroni correction or the Games-Howell test (the latter is intended for unequal variances).
Conclusion: According to the test, the pairs Very often - Never, Very often - Sometimes and Always - Very often differ in their means.
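Besides the plot, the Tukey results can be read as a table of pairwise differences with adjusted p-values. A minimal sketch on simulated data (the data frame `d`, its three groups and their means are assumptions, not the ESS1 values):

```r
set.seed(42)  # simulated data, not ESS1
d <- data.frame(
  rdcenr  = factor(rep(c("Never", "Often", "Very often"), each = 80),
                   levels = c("Never", "Often", "Very often")),
  eduyrs1 = c(rnorm(80, 10.5, 4), rnorm(80, 11.5, 4), rnorm(80, 12.5, 4))
)

fit <- aov(eduyrs1 ~ rdcenr, data = d)

# Pairwise differences, confidence bounds and adjusted p-values
tukey_tab <- as.data.frame(TukeyHSD(fit)$rdcenr)
tukey_tab

# Pairs whose adjusted p-value is below 0.05
rownames(tukey_tab)[tukey_tab[["p adj"]] < 0.05]
```

In the plot, the same information appears as confidence intervals: a pair differs significantly when its interval does not cross zero.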
layout(matrix(1:4,2,2)); plot(aov.out)
As the boxplot and the diagnostic plots suggest, our data are not normally distributed (the red lines are not straight and horizontal). The plots label the following outliers: 1692, 475 and 96.
kruskal.test(ESS1$eduyrs1 ~ as.factor(ESS1$rdcenr))
##
## Kruskal-Wallis rank sum test
##
## data: ESS1$eduyrs1 by as.factor(ESS1$rdcenr)
## Kruskal-Wallis chi-squared = 17.12, df = 4, p-value = 0.001832
Then we used a non-parametric test (the Kruskal-Wallis test) to check our ANOVA results. The p-value is 0.001832, so we can consider the differences significant, as in the ANOVA.
After conducting the analysis, we can finally answer our question and confirm our hypothesis:
* people who spent more years in education are more ready to reduce their energy consumption.
* moreover, we can claim that these people are more concerned about climate change.