Conducting the t-test

To use a t-test we need to choose two variables: one categorical (with two groups) and one continuous.

H0 hypothesis: there is no difference in the mean number of years spent in education between people who are concerned about climate change and those who are not.

H1 hypothesis: there is a difference in the mean number of years spent in education between people who are concerned about climate change and those who are not.

To check these hypotheses, we used a t-test.

First, we recoded the variable that measures the level of concern about climate change into two categories: values 1 and 2 became «Not concerned», values 3 and 4 became «Concerned».
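The recoding step can be sketched as follows. The name of the original four-point variable is not shown in this report, so `wrclmch` below is an assumption, as is the toy data frame; only `wrclmch2` and the two category labels come from the analysis itself.

```r
# Toy data standing in for ESS1; `wrclmch` is an assumed name for the
# original four-point concern variable (1 = least, 4 = most concerned).
toy <- data.frame(wrclmch = c(1, 2, 3, 4, 2, 3, NA))

# cut() with breaks at 0, 2 and 4 maps values 1-2 to "Not concerned" and
# values 3-4 to "Concerned"; missing values stay NA.
toy$wrclmch2 <- cut(toy$wrclmch,
                    breaks = c(0, 2, 4),
                    labels = c("Not concerned", "Concerned"))

table(toy$wrclmch2, useNA = "ifany")
```

Using `cut()` keeps the result a factor with a fixed level order, which is what the later formula interfaces (`eduyrs1 ~ wrclmch2`) expect.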

Moving on to the preparation for the t-test

Normality of distribution

H0 hypothesis: the variable is normally distributed.

H1 hypothesis: the variable is not normally distributed.

shapiro.test(ESS1$eduyrs1)
## 
##  Shapiro-Wilk normality test
## 
## data:  ESS1$eduyrs1
## W = 0.98536, p-value = 5.087e-13

Conclusion: According to the Shapiro-Wilk normality test, the distribution of years of education (eduyrs1) is not normal, because the p-value is below 0.05 (p-value = 5.087e-13).

Histogram

ggplot() +
  geom_histogram(data = ESS1, aes(x = eduyrs1), binwidth = 1, fill="#008080", col="#483D8B", alpha = 0.5)+
  ggtitle("Distribution of years of full-time education completed") + 
  theme_bw()

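Conclusion: Although the Shapiro-Wilk test rejects normality (with about 1,900 observations it is sensitive to even small deviations), the histogram looks close to normal, so we do not need a log transformation.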

The equality of variances

To check the equality of variances we used Levene's test.

H0 hypothesis: the variances are equal.

H1 hypothesis: the variances are not equal.

We chose Levene's test because it is less sensitive to non-normally distributed data than Bartlett's test.

leveneTest(eduyrs1 ~ wrclmch2, data=ESS1)
## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value  Pr(>F)  
## group    1  5.5123 0.01899 *
##       1898                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Conclusion: According to Levene's test, the variances are not equal, because the p-value is lower than 0.05 (p-value = 0.01899). We therefore use Welch's t-test, which does not assume equal variances.
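To make the logic of the variance check concrete, here is a minimal base-R sketch of what Levene's test with `center = median` does: it runs a one-way ANOVA on the absolute deviations from each group's median. The data are simulated, and the names `toy`, `dev`, `levene_p` and `use_equal_var` are illustrative only; the analysis itself used `leveneTest()`.

```r
set.seed(1)
# Simulated data: two groups with different spread (not the ESS data).
toy <- data.frame(
  y = c(rnorm(50, mean = 10, sd = 2), rnorm(50, mean = 12, sd = 4)),
  g = factor(rep(c("Not concerned", "Concerned"), each = 50))
)

# Absolute deviation of each observation from its own group median.
toy$dev <- abs(toy$y - ave(toy$y, toy$g, FUN = median))

# A one-way ANOVA on these deviations gives the median-centred Levene F test.
levene <- anova(lm(dev ~ g, data = toy))
levene_p <- levene[["Pr(>F)"]][1]

# Decide whether the t-test may assume equal variances.
use_equal_var <- levene_p >= 0.05
use_equal_var
```

The resulting flag could then be passed as `var.equal` to `t.test()`, which is the decision made by hand in this section.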

T-test

# Welch's t-test (var.equal = FALSE) because Levene's test rejected equal variances
t_res <- t.test(eduyrs1 ~ wrclmch2, data = ESS1, var.equal = FALSE)
t_res
## 
##  Welch Two Sample t-test
## 
## data:  eduyrs1 by wrclmch2
## t = -6.615, df = 631.37, p-value = 7.928e-11
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.884770 -1.021899
## sample estimates:
## mean in group Not concerned     mean in group Concerned 
##                    10.64885                    12.10219

Conclusion: The p-value is lower than 0.05 (p-value = 7.928e-11), so we reject H0 in favour of H1. The mean number of years of education is higher among people who are concerned about climate change (12.10) than among those who are not (10.65), so concern about climate change is associated with the years spent in education.
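For readers who want to see where the t statistic and the fractional degrees of freedom come from, the sketch below recomputes a Welch test by hand on simulated data and compares it with `t.test()`. The vectors `x` and `y` are made-up stand-ins for the two groups, not the ESS data.

```r
set.seed(2)
x <- rnorm(40,  mean = 10.6, sd = 4)    # simulated "Not concerned" group
y <- rnorm(120, mean = 12.1, sd = 3.5)  # simulated "Concerned" group

# Variance of each group mean.
vx <- var(x) / length(x)
vy <- var(y) / length(y)

# Welch t statistic and Welch-Satterthwaite degrees of freedom.
t_manual  <- (mean(x) - mean(y)) / sqrt(vx + vy)
df_manual <- (vx + vy)^2 /
  (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))

# Two-sided p-value from the t distribution.
p_manual <- 2 * pt(-abs(t_manual), df = df_manual)

# The built-in test should report the same three numbers.
res <- t.test(x, y, var.equal = FALSE)
c(t = t_manual, df = df_manual, p = p_manual)
```

The non-integer degrees of freedom in the output above (df = 631.37) are a consequence of this Welch-Satterthwaite approximation.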

Boxplot

ggplot() +
  geom_boxplot(data = ESS1, aes(x = wrclmch2, y = eduyrs1 ), col = "#E52B50", fill = "#F0F8FF") + 
  ylab("Years of full-time education") + 
  ggtitle("Years of full-time education completed and Degree of concernment") + 
  theme_bw()

Conclusion: The boxplot shows that people who are concerned about climate change tend to have spent slightly more time in education (the box spans roughly 9-15 years) than those who are not concerned (roughly 8-13 years). However, there are outliers: some people who studied for 20-25 years are not concerned, while some people with 25+ years of education are concerned.

Chi-square test

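To use the chi-square test we need to choose two categorical variables.

H0 hypothesis: how often a person does things to reduce energy use is independent of gender.

H1 hypothesis: how often a person does things to reduce energy use is related to gender.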

Table

table(ESS1$rdcenr, ESS1$gndr)
##             
##              Male Female
##   Never        42     27
##   Sometimes   219    177
##   Often       347    337
##   Very often  237    313
##   Always      102     94

Conclusion: every cell of the table contains well over 5 observations (the smallest count is 27), so we have enough observations to use the chi-square test.
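Strictly speaking, the rule of thumb for the chi-square test concerns the expected counts rather than the observed ones. The sketch below rebuilds the table above as a matrix (the object names `observed` and `expected` are illustrative) and computes the expected counts under independence.

```r
# Observed counts copied from the table above.
observed <- matrix(
  c(42, 219, 347, 237, 102,   # Male
    27, 177, 337, 313,  94),  # Female
  ncol = 2,
  dimnames = list(
    rdcenr = c("Never", "Sometimes", "Often", "Very often", "Always"),
    gndr   = c("Male", "Female")
  )
)

# Expected count in each cell: row total * column total / grand total.
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

round(expected, 1)
min(expected)  # smallest expected count, to compare against the threshold of 5
```

The smallest expected count is about 34.5 (the "Never" row), so the approximation behind the test is not in doubt here.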

Chi-square test

ch = chisq.test(ESS1$rdcenr, ESS1$gndr)
ch
## 
##  Pearson's Chi-squared test
## 
## data:  ESS1$rdcenr and ESS1$gndr
## X-squared = 18.689, df = 4, p-value = 0.0009044

Conclusion: The p-value is less than 0.05 (p-value = 0.0009044), so we reject H0: how often people do things to reduce energy use is related to gender.

Pearson residuals

df_resid = as.data.frame(ch$residuals)
df_resid
##    ESS1.rdcenr ESS1.gndr       Freq
## 1        Never      Male  1.2803222
## 2    Sometimes      Male  1.5002264
## 3        Often      Male  0.2802019
## 4   Very often      Male -2.2833378
## 5       Always      Male  0.4093931
## 6        Never    Female -1.2796468
## 7    Sometimes    Female -1.4994349
## 8        Often    Female -0.2800541
## 9   Very often    Female  2.2821332
## 10      Always    Female -0.4091771
df_count = as.data.frame(ch$observed)
df_count
##    ESS1.rdcenr ESS1.gndr Freq
## 1        Never      Male   42
## 2    Sometimes      Male  219
## 3        Often      Male  347
## 4   Very often      Male  237
## 5       Always      Male  102
## 6        Never    Female   27
## 7    Sometimes    Female  177
## 8        Often    Female  337
## 9   Very often    Female  313
## 10      Always    Female   94
ggplot() + 
  geom_raster(data = df_resid, aes(x = ESS1.gndr, y = ESS1.rdcenr, fill = Freq), hjust = 0.5, vjust = 0.5) + 
  scale_fill_gradient2("Pearson residuals", low = "#2166ac", mid = "#f7f7f7", high = "#b2182b", midpoint = 0) +
  geom_text(data = df_count, aes(x = ESS1.gndr, y = ESS1.rdcenr, label = Freq)) +
  xlab("Gender") +
  ylab("How often do things to reduce energy use") +
  theme_bw()

Conclusion: There are noticeably more women than expected who very often do things to reduce energy use (residual of about 2.28), and fewer men than expected (residual of about -2.28).
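The Pearson residuals shown above are (observed - expected) / sqrt(expected), and their squares add up to the X-squared statistic. A minimal sketch that reproduces them from the observed counts, with illustrative object names:

```r
# Observed counts from the contingency table of rdcenr by gndr.
observed <- matrix(
  c(42, 219, 347, 237, 102,   # Male
    27, 177, 337, 313,  94),  # Female
  ncol = 2,
  dimnames = list(
    rdcenr = c("Never", "Sometimes", "Often", "Very often", "Always"),
    gndr   = c("Male", "Female")
  )
)

# Expected counts under independence.
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson residual for each cell.
pearson <- (observed - expected) / sqrt(expected)
round(pearson, 2)

# The sum of squared residuals is the chi-square statistic.
sum(pearson^2)
```

Cells with an absolute residual above roughly 2 are the ones that contribute most to the significant result, which is why only the "Very often" row stands out.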

ANOVA

To use ANOVA we need to choose two variables: one categorical and one continuous.

ggplot() +
  geom_boxplot(data = ESS1, aes(x = rdcenr, y = eduyrs1 ), col = "#E52B50", fill = "#F0F8FF") + 
  ylab("Years of full-time education") + 
  xlab("How often do things to reduce energy use")+
  theme_bw()

We made a visualisation of our data. According to the boxplots, the highest median years of education are found among people who answered “often” and “very often”, and the lowest among people who answered “never”.

Moving to preparation for ANOVA.

As we have relatively few observations in our data (fewer than 5000), we test the assumptions before running the ANOVA.

Normality of distribution

H0 hypothesis: the variable is normally distributed.

H1 hypothesis: the variable is not normally distributed.

Often

shapiro.test(ESS1$eduyrs1[ESS1$rdcenr == "Often"])
## 
##  Shapiro-Wilk normality test
## 
## data:  ESS1$eduyrs1[ESS1$rdcenr == "Often"]
## W = 0.98715, p-value = 1.026e-05

Very often

shapiro.test(ESS1$eduyrs1[ESS1$rdcenr == "Very often"])
## 
##  Shapiro-Wilk normality test
## 
## data:  ESS1$eduyrs1[ESS1$rdcenr == "Very often"]
## W = 0.98776, p-value = 0.0001455

Always

shapiro.test(ESS1$eduyrs1[ESS1$rdcenr == "Always"])
## 
##  Shapiro-Wilk normality test
## 
## data:  ESS1$eduyrs1[ESS1$rdcenr == "Always"]
## W = 0.97496, p-value = 0.001399

Never

shapiro.test(ESS1$eduyrs1[ESS1$rdcenr == "Never"])
## 
##  Shapiro-Wilk normality test
## 
## data:  ESS1$eduyrs1[ESS1$rdcenr == "Never"]
## W = 0.96919, p-value = 0.08606

Sometimes

shapiro.test(ESS1$eduyrs1[ESS1$rdcenr == "Sometimes"])
## 
##  Shapiro-Wilk normality test
## 
## data:  ESS1$eduyrs1[ESS1$rdcenr == "Sometimes"]
## W = 0.96886, p-value = 1.805e-07

Conclusions:

We checked the normality of the variable within each category using the Shapiro-Wilk normality test. The following results were found:

  • “Sometimes”: p-value = 1.805e-07 < 0.05 ==> the distribution is not normal
  • “Often”: p-value = 1.026e-05 < 0.05 ==> the distribution is not normal
  • “Very often”: p-value = 0.0001455 < 0.05 ==> the distribution is not normal
  • “Always”: p-value = 0.001399 < 0.05 ==> the distribution is not normal
  • “Never”: p-value = 0.08606 > 0.05 ==> normality is not rejected

ggplot() +
  geom_histogram(data = ESS1, aes(x = eduyrs1), binwidth = 1, fill="#008080", col="#483D8B", alpha = 0.5) +
  ggtitle("Distribution of years of full-time education completed") +
  facet_grid(~rdcenr) +
  theme_bw()

Since the test rejects normality in most categories, we also plotted a histogram for each category to check whether the distributions are at least close to normal. Visually, they are close to normal.
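The five separate `shapiro.test()` calls above can also be written as a single pass over the groups, which makes the p-values easier to compare side by side. This is only a sketch on simulated data: `toy` borrows the column names `eduyrs1` and `rdcenr` from the analysis, but its values are random.

```r
set.seed(3)
groups <- c("Never", "Sometimes", "Often", "Very often", "Always")

# Simulated stand-in for ESS1 with the same column names.
toy <- data.frame(
  eduyrs1 = round(rnorm(500, mean = 12, sd = 4)),
  rdcenr  = factor(sample(groups, 500, replace = TRUE), levels = groups)
)

# Run the test within each level of rdcenr and keep only the p-values.
p_by_group <- sapply(split(toy$eduyrs1, toy$rdcenr),
                     function(x) shapiro.test(x)$p.value)

# One row per category; reject_H0 is TRUE where normality is rejected at 5%.
data.frame(p_value = signif(p_by_group, 4), reject_H0 = p_by_group < 0.05)
```

Collecting the results in one table also makes it harder to misreport a single category when writing up the conclusions.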

The equality of variances

To check the equality of variances we used Levene's test and Bartlett's test.

H0 hypothesis: the variances are equal.

H1 hypothesis: the variances are not equal.

Bartlett's test

bartlett.test(eduyrs1 ~ rdcenr, data=ESS1)
## 
##  Bartlett test of homogeneity of variances
## 
## data:  eduyrs1 by rdcenr
## Bartlett's K-squared = 8.5218, df = 4, p-value = 0.07423

Levene's test

leveneTest(eduyrs1 ~ rdcenr, data=ESS1)
## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value Pr(>F)
## group    4  1.6735 0.1535
##       1890

Conclusions:

According to Levene's test and Bartlett's test, the variances are equal, because the p-value is greater than 0.05 in both cases (Levene's test: p-value = 0.1535; Bartlett's test: p-value = 0.07423).

ANOVA

H0 hypothesis: the mean number of years of education does not differ between people with different readiness to reduce their energy consumption.

H1 hypothesis: the mean number of years of education differs between at least some groups of people with different readiness to reduce their energy consumption.

ANOVA test

oneway.test(eduyrs1 ~ rdcenr, data = ESS1, var.equal = TRUE)
## 
##  One-way analysis of means
## 
## data:  eduyrs1 and rdcenr
## F = 4.1317, num df = 4, denom df = 1890, p-value = 0.002456

According to the result of the ANOVA, the F-ratio is significant.

Conclusion: the mean number of years of education differs between people with different readiness to reduce their energy consumption (F(4, 1890) = 4.1317, p-value = 0.002456).
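As an illustration of what the F-ratio measures, the sketch below computes it by hand as the between-group mean square divided by the within-group mean square, and checks the result against `oneway.test()`. The data are simulated and the object names are illustrative.

```r
set.seed(4)
# Simulated data: three groups of 30 observations each.
toy <- data.frame(
  y = c(rnorm(30, mean = 11), rnorm(30, mean = 12), rnorm(30, mean = 12.5)),
  g = factor(rep(c("Never", "Often", "Always"), each = 30))
)

k <- nlevels(toy$g)               # number of groups
n <- nrow(toy)                    # total number of observations
group_mean <- ave(toy$y, toy$g)   # each row's own group mean

# Sums of squares between and within groups.
ss_between <- sum((group_mean - mean(toy$y))^2)
ss_within  <- sum((toy$y - group_mean)^2)

# F = mean square between / mean square within, with (k - 1, n - k) df.
f_manual <- (ss_between / (k - 1)) / (ss_within / (n - k))
p_manual <- pf(f_manual, k - 1, n - k, lower.tail = FALSE)

res <- oneway.test(y ~ g, data = toy, var.equal = TRUE)
c(F = f_manual, p = p_manual)
```

The two degrees of freedom reported with the F-ratio are therefore the number of groups minus one and the number of observations minus the number of groups.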

Tukey Post-hoc test

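# Fit the one-way ANOVA model used for the post-hoc comparisons
aov.out <- aov(ESS1$eduyrs1 ~ as.factor(ESS1$rdcenr))
# Single plotting panel with a wide left margin so the pair labels fit
layout(matrix(1:1,2,2))
par(mar=c(6, 25, 3, 2))
# Plot the Tukey HSD confidence intervals for every pair of categories
plot(TukeyHSD(aov.out), las = 2)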

As the variances are equal, we use Tukey's post-hoc test rather than the Games-Howell test (which is meant for unequal variances) or the more conservative Bonferroni correction.

Conclusion: According to the test, the means differ for the following pairs: Very often - Never, Very often - Sometimes, and Always - Very often.
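Instead of reading the significant pairs off the plot, they can be filtered from the `TukeyHSD()` result directly. The sketch below uses simulated data with a factor called `g`; in the analysis above the corresponding component of the result would be named after the `as.factor(ESS1$rdcenr)` term.

```r
set.seed(5)
# Simulated data: one group with a clearly higher mean than the other two.
toy <- data.frame(
  y = c(rnorm(40, mean = 10), rnorm(40, mean = 10.2), rnorm(40, mean = 13)),
  g = factor(rep(c("Never", "Sometimes", "Very often"), each = 40))
)

fit <- aov(y ~ g, data = toy)

# Matrix with one row per pair and columns diff, lwr, upr, p adj.
tukey <- TukeyHSD(fit)$g

# Keep only the pairs whose adjusted p-value is below 0.05.
significant <- tukey[tukey[, "p adj"] < 0.05, , drop = FALSE]
significant
```

A pair is significant exactly when its confidence interval (`lwr` to `upr`) excludes zero, which is the criterion used when reading the plot.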

Outliers

layout(matrix(1:4,2,2));  plot(aov.out)

As the boxplot and the diagnostic plots confirm, our data are not normally distributed (the red lines are not straight and horizontal). The following observations are flagged as outliers: 1692, 475 and 96.

Kruskal-Wallis test

kruskal.test(ESS1$eduyrs1 ~ as.factor(ESS1$rdcenr)) 
## 
##  Kruskal-Wallis rank sum test
## 
## data:  ESS1$eduyrs1 by as.factor(ESS1$rdcenr)
## Kruskal-Wallis chi-squared = 17.12, df = 4, p-value = 0.001832

Then we used a non-parametric test (the Kruskal-Wallis test) to check our ANOVA results. The p-value is 0.001832, so the differences are significant, as they were in the ANOVA.

Final conclusions:

After conducting the analysis we can answer our question, and our hypotheses are supported:

  • people who spent longer in education are more ready to reduce their energy consumption;
  • moreover, these people are more concerned about climate change.