To use the t-test we need two variables: one categorical and one continuous.
H0 hypothesis: the mean number of years spent on education does not differ between people who are concerned about climate change and people who are not.
H1 hypothesis: the mean number of years spent on education differs between people who are concerned about climate change and people who are not.
To check these hypotheses, we used the t-test.
First, we recoded the variable that measures the level of concern about climate change into two categories: [1,2] – «not concerned», [3,4] – «concerned».
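The recoding step itself is not shown in the report. A minimal sketch of how such a two-category variable could be built, assuming the original item is stored as the codes 1 to 4 in a column called `wrclmch` (a hypothetical name; only `wrclmch2` appears in the analysis) and using a small toy data frame instead of `ESS1`:

```r
# Toy data standing in for ESS1; `wrclmch` is a hypothetical name for the
# original 4-point item, coded 1-4.
toy <- data.frame(wrclmch = c(1, 2, 3, 4, 2, 3, NA))

# cut() uses right-closed intervals by default:
# (0,2] -> "Not concerned", (2,4] -> "Concerned"; NA stays NA.
toy$wrclmch2 <- cut(toy$wrclmch,
                    breaks = c(0, 2, 4),
                    labels = c("Not concerned", "Concerned"))

table(toy$wrclmch2, useNA = "ifany")
```

The labels are chosen to match the group names printed in the t-test output.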
H0 hypothesis: variable is distributed normally.
H1 hypothesis: variable is not distributed normally.
shapiro.test(ESS1$eduyrs1)
##
## Shapiro-Wilk normality test
##
## data: ESS1$eduyrs1
## W = 0.98609, p-value = 3.419e-12
Conclusion: According to the Shapiro-Wilk normality test, years of education (eduyrs1) are not normally distributed, because the p-value is below 0.05 (p-value = 3.419e-12).
ggplot() +
geom_histogram(data = ESS1, aes(x = eduyrs1), binwidth = 1, fill="#008080", col="#483D8B", alpha = 0.5)+
ggtitle("Distribution of years of full-time education completed") +
theme_bw()
Conclusion: Although the formal test rejects normality, the histogram looks roughly bell-shaped, so we did not apply a logarithmic transformation.
To check the equality of variances we used Levene’s test.
H0 hypothesis: the variances are equal.
H1 hypothesis: the variances are not equal.
We used Levene’s test because it is considered a better option in such situations (it is less sensitive to non-normally distributed data).
leveneTest(eduyrs1 ~ wrclmch2, data=ESS1)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 4.4706 0.03462 *
## 1799
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Conclusion: According to Levene’s test, the variances are not equal, because the p-value is lower than 0.05 (p-value = 0.03462).
# Welch t-test: the variances are not assumed to be equal
t = t.test(eduyrs1 ~ wrclmch2, data = ESS1, var.equal = FALSE)
t
##
## Welch Two Sample t-test
##
## data: eduyrs1 by wrclmch2
## t = -6.2532, df = 595.1, p-value = 7.679e-10
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.8294878 -0.9549693
## sample estimates:
## mean in group Not concerned mean in group Concerned
## 10.88679 12.27902
Conclusion: The p-value is lower than 0.05 (p-value = 7.679e-10), so we reject H0 in favour of H1. People who are concerned about climate change have, on average, more years of education (12.28) than people who are not concerned (10.89).
ggplot() +
geom_boxplot(data = ESS1, aes(x = wrclmch2, y = eduyrs1 ), col = "#E52B50", fill = "#F0F8FF") +
ylab("Years of full-time education") +
ggtitle("Years of full-time education completed and Degree of concernment") +
theme_bw()
Conclusion: The boxplot shows that people who are concerned about climate change tend to have spent slightly more time in education (roughly 9-15 years) than people who are not concerned (roughly 8-13 years). However, there are outliers: some people who studied for 20-25 years are not concerned, while people with 25 or more years of education are concerned about climate change.
To use the Chi-square test we need two categorical variables.
H0 hypothesis: how often people do things to reduce energy use does not depend on gender.
H1 hypothesis: how often people do things to reduce energy use depends on gender.
table(ESS1$rdcenr, ESS1$gndr)
##
## Male Female
## Never 39 22
## Sometimes 206 167
## Often 340 313
## Very often 231 299
## Always 94 86
Conclusion: every cell of this table contains well over 5 observations (the smallest count is 22), so the Chi-square test can be applied.
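Strictly speaking, the usual rule of thumb for the Chi-square test refers to the expected counts rather than the observed ones. A small sketch of that check, rebuilding the cross-table from the counts printed above (the object names `tab` and `ch_tab` are ours, not from the report):

```r
# Observed counts copied from the cross-table above
# (matrix() fills column by column: first Male, then Female).
tab <- matrix(c(39, 206, 340, 231, 94,
                22, 167, 313, 299, 86),
              ncol = 2,
              dimnames = list(rdcenr = c("Never", "Sometimes", "Often",
                                         "Very often", "Always"),
                              gndr = c("Male", "Female")))

ch_tab <- chisq.test(tab)

# Expected counts under independence; the rule of thumb asks for all >= 5.
round(ch_tab$expected, 1)
all(ch_tab$expected >= 5)
```

Because the table is the same one passed to `chisq.test()` in the report, the statistic stored in `ch_tab$statistic` should agree with the X-squared value reported there.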
ch = chisq.test(ESS1$rdcenr, ESS1$gndr)
ch
##
## Pearson's Chi-squared test
##
## data: ESS1$rdcenr and ESS1$gndr
## X-squared = 18.721, df = 4, p-value = 0.0008918
Conclusion: The p-value is less than 0.05 (p-value = 0.0008918), so we reject H0: how often people do things to reduce energy use is related to gender.
df_resid = as.data.frame(ch$residuals)
df_resid
## ESS1.rdcenr ESS1.gndr Freq
## 1 Never Male 1.4591143
## 2 Sometimes Male 1.2451573
## 3 Often Male 0.5125822
## 4 Very often Male -2.2823979
## 5 Always Male 0.2983110
## 6 Never Female -1.4779107
## 7 Sometimes Female -1.2611975
## 8 Often Female -0.5191853
## 9 Very often Female 2.3117999
## 10 Always Female -0.3021539
df_count = as.data.frame(ch$observed)
df_count
## ESS1.rdcenr ESS1.gndr Freq
## 1 Never Male 39
## 2 Sometimes Male 206
## 3 Often Male 340
## 4 Very often Male 231
## 5 Always Male 94
## 6 Never Female 22
## 7 Sometimes Female 167
## 8 Often Female 313
## 9 Very often Female 299
## 10 Always Female 86
ggplot() +
geom_raster(data = df_resid, aes(x = ESS1.gndr, y = ESS1.rdcenr, fill = Freq), hjust = 0.5, vjust = 0.5) +
scale_fill_gradient2("Pearson residuals", low = "#2166ac", mid = "#f7f7f7", high = "#b2182b", midpoint = 0) +
geom_text(data = df_count, aes(x = ESS1.gndr, y = ESS1.rdcenr, label = Freq)) +
xlab("Gender") +
ylab("How often do things to reduce energy use") +
theme_bw()
Conclusion: Significantly more women than expected very often do things to reduce energy use, and fewer men than expected do so (Pearson residuals of 2.31 and -2.28, respectively).
To use ANOVA we need two variables: one categorical and one continuous.
ggplot() +
geom_boxplot(data = ESS1, aes(x = rdcenr, y = eduyrs1 ), col = "#E52B50", fill = "#F0F8FF") +
ylab("Years of full-time education") +
xlab("How often do things to reduce energy use")+
theme_bw()
We visualised our data. According to the boxplots, the highest median years of education are found among people who answered “often” and “very often”, and the lowest among people who answered “never”.
As our sample is not very large (fewer than 5000 observations), we test the assumptions before running ANOVA.
H0 hypothesis: variable is distributed normally.
H1 hypothesis: variable is not distributed normally.
shapiro.test(ESS1$eduyrs1[ESS1$rdcenr == "Often"])
##
## Shapiro-Wilk normality test
##
## data: ESS1$eduyrs1[ESS1$rdcenr == "Often"]
## W = 0.98758, p-value = 2.405e-05
shapiro.test(ESS1$eduyrs1[ESS1$rdcenr == "Very often"])
##
## Shapiro-Wilk normality test
##
## data: ESS1$eduyrs1[ESS1$rdcenr == "Very often"]
## W = 0.98695, p-value = 0.0001111
shapiro.test(ESS1$eduyrs1[ESS1$rdcenr == "Always"])
##
## Shapiro-Wilk normality test
##
## data: ESS1$eduyrs1[ESS1$rdcenr == "Always"]
## W = 0.98038, p-value = 0.01228
shapiro.test(ESS1$eduyrs1[ESS1$rdcenr == "Never"])
##
## Shapiro-Wilk normality test
##
## data: ESS1$eduyrs1[ESS1$rdcenr == "Never"]
## W = 0.97221, p-value = 0.1792
shapiro.test(ESS1$eduyrs1[ESS1$rdcenr == "Sometimes"])
##
## Shapiro-Wilk normality test
##
## data: ESS1$eduyrs1[ESS1$rdcenr == "Sometimes"]
## W = 0.97031, p-value = 6.685e-07
We used the Shapiro-Wilk test again to check the normality of the variable within each category. Normality is rejected (p-value < 0.05) in every category except “Never” (p-value = 0.1792).
ggplot() +
geom_histogram(data = ESS1, aes(x = eduyrs1), binwidth = 1, fill="#008080", col="#483D8B", alpha = 0.5)+
ggtitle("Distribution of years of full-time education completed") +
facet_grid(~rdcenr) +
theme_bw()
We also created histograms to see whether the distribution within each category is close to normal. They show that the distribution is not normal.
To check the equality of variances we used Levene’s test and Bartlett’s test.
H0 hypothesis: the variances are equal.
H1 hypothesis: the variances are not equal.
bartlett.test(eduyrs1 ~ rdcenr, data=ESS1)
##
## Bartlett test of homogeneity of variances
##
## data: eduyrs1 by rdcenr
## Bartlett's K-squared = 7.5051, df = 4, p-value = 0.1115
leveneTest(eduyrs1 ~ rdcenr, data=ESS1)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 4 1.2109 0.3041
## 1792
According to Levene’s test and Bartlett’s test, the variances are equal, because the p-values are greater than 0.05 (Levene’s test: p-value = 0.3041; Bartlett’s test: p-value = 0.1115).
H0 hypothesis: there are no differences between people with different duration of education in readiness to reduce the energy-consumption.
H1 hypothesis: there are differences between people with different duration of education in readiness to reduce the energy-consumption.
oneway.test(eduyrs1 ~ rdcenr, data = ESS1, var.equal = TRUE)
##
## One-way analysis of means
##
## data: eduyrs1 and rdcenr
## F = 3.3267, num df = 4, denom df = 1792, p-value = 0.01005
According to the result of the ANOVA, the F-ratio is significant.
Conclusion: people who differ in how often they do things to reduce energy use also differ in their mean years of education (F(4, 1792) = 3.3267, p-value = 0.01005).
aov.out <- aov(ESS1$eduyrs1 ~ as.factor(ESS1$rdcenr))
layout(matrix(1:1,2,2))
par(mar=c(6, 25, 3, 2))
plot(TukeyHSD(aov.out), las = 2)
As the variances are equal, we do not need the Games-Howell test, which is intended for unequal variances, so we use Tukey’s post-hoc test.
Conclusion: According to the test, the means differ for the pairs Very often - Sometimes and Always - Very often.
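The report shows only the plot of the Tukey intervals. As a sketch of how the same comparisons can be read numerically, the example below runs `TukeyHSD()` on simulated data (the data frame `sim` and its values are made up for illustration and are not the ESS data):

```r
set.seed(1)
# Simulated stand-in for ESS1: three groups with slightly different means.
sim <- data.frame(
  group  = factor(rep(c("Sometimes", "Often", "Very often"), each = 50)),
  eduyrs = c(rnorm(50, 11.5, 4), rnorm(50, 12, 4), rnorm(50, 12.5, 4))
)

fit_sim <- aov(eduyrs ~ group, data = sim)
tuk <- TukeyHSD(fit_sim)

# One row per pair: difference in means, CI bounds, adjusted p-value.
tuk$group

# Pairs whose adjusted p-value is below 0.05 (may be empty for toy data).
rownames(tuk$group)[tuk$group[, "p adj"] < 0.05]
```

A pair differs significantly when its confidence interval excludes zero, which is the same information the plot conveys.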
layout(matrix(1:4,2,2)); plot(aov.out)
As the diagnostic plots confirm, the residuals are not normally distributed (the red lines are not straight and horizontal). The following observations are flagged as outliers: 1692, 475 and 96.
kruskal.test(ESS1$eduyrs1 ~ as.factor(ESS1$rdcenr))
##
## Kruskal-Wallis rank sum test
##
## data: ESS1$eduyrs1 by as.factor(ESS1$rdcenr)
## Kruskal-Wallis chi-squared = 13.211, df = 4, p-value = 0.01029
Then we used a non-parametric test (the Kruskal-Wallis test) to check our ANOVA results. The p-value is 0.01029, so the differences are significant, as in the ANOVA.
Research question:
Do the duration of education, gender and income of the respondents affect their readiness to buy and use energy-efficient appliances?
Hypotheses:
H1: People with a longer education and, correspondingly, a higher income are more willing to use energy-efficient appliances.
H2: Women with a longer period of education are more inclined to use energy-efficient appliances.
Literature:
We also looked for a theoretical framework to justify our hypotheses. The first article, “The effects of gender on climate change knowledge and concern in the American public”, was written by Aaron M. McCright and published in the journal “Population and Environment”. The author studies gender differences in scientific knowledge of and concern about climate change in the American public. Contrary to the expectation that there would be no significant differences, women demonstrated a higher level of knowledge of climate change than men, and women were also more likely than men to underestimate their knowledge. The study also found that women are more concerned about climate change than men, and that this gender gap cannot be explained by differences in values, beliefs and social roles. More information can be found here: *https://link.springer.com/article/10.1007/s11111-010-0113-1*
Tamara Shapiro Ledley writes that it is important to get an education and to learn as much as possible. Learning about climate change and its consequences from experience comes too late, so it is better to study more; a more educated person will therefore be more concerned about climate change and will do things to prevent it. In our case these would be things to reduce energy use. *http://environmentalscience.oxfordre.com/view/10.1093/acrefore/9780199389414.001.0001/acrefore-9780199389414-e-56*
In our case, the predictor variables are the number of years of education, gender and income, and the outcome is the degree of readiness to buy energy-efficient appliances.
Years of education
ggplot() +
geom_histogram(data = ESS1, aes(x = eduyrs1), binwidth = 1, fill="#008080", col="#483D8B", alpha = 0.5)+
ggtitle("Years of full-time education completed") +
theme_bw()
summary(ESS1$eduyrs1)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 9 12 12 15 27
The histogram shows that the distribution is close to normal. The most common values are approximately 9 and 14 years of education.
Income
ggplot() +
geom_histogram(data = ESS1, aes(x = hinctnta1), binwidth = 1, fill="#008080", col="#483D8B", alpha = 0.5)+
ggtitle("Income") +
theme_bw()
summary(ESS1$hinctnta1)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 4.000 6.000 5.753 8.000 10.000
1 is the lowest and 10 the highest income category. We can see that most respondents report an income between 5.5 and 8.
Gender
ggplot() +
geom_bar(data = ESS1, aes(x = gndr), fill="#008080", col="#483D8B", alpha = 0.5)+
ggtitle("Gender distribution") +
theme_bw()
There are approximately equal numbers of respondents of both genders.
Readiness to buy and use energy-efficient appliances
ggplot() +
geom_histogram(data = ESS1, aes(x = eneffap1), binwidth = 1, fill="#008080", col="#483D8B", alpha = 0.5)+
ggtitle("Readiness to buy energy-efficient appliances") +
theme_bw()
summary(ESS1$eneffap1)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 8.00 9.00 8.89 10.00 11.00
This graph shows that most people report a high level of readiness to buy energy-efficient appliances. Scale: 1 - not likely at all, 11 - extremely likely.
Years of education and gender
ggplot() +
geom_boxplot(data = ESS1, aes(x = gndr, y = eduyrs1 ), col = "#E52B50", fill = "#F0F8FF") +
ylab("Years of full-time education") +
ggtitle("Years of full-time education completed and Gender") +
theme_bw()
These boxplots show that the median duration of education is higher for women: it is approximately 13 years, while the median for men is close to 11.
ggplot( data = ESS1, aes(x=hinctnta1, y=eduyrs1)) + geom_jitter() +
xlab("Income") +
ylab("Years of full-time education completed") +
theme_bw()
This graph shows that there is almost no relation between the number of years spent on education and income. However, there are outliers with more than 20 years of education, for whom income is relatively high. Conversely, very low income is observed among respondents with fewer than 5 years of education.
ggplot( data = ESS1, aes(x=eneffap1, y=eduyrs1)) + geom_jitter() +
xlab("How likely to buy most energy efficient home appliance") +
ylab("Years of full-time education completed") +
theme_bw()
This scatterplot shows how the number of years of full-time education completed relates to people’s readiness to buy the most energy-efficient home appliance.
We can see that people with about 15 years of education (a completed bachelor’s or master’s degree) tend to be more ready to buy energy-efficient appliances.
We then checked whether the number of years spent on education is related to the willingness to use energy-efficient appliances.
m1 <- lm(eneffap1 ~ eduyrs1, data = ESS1)
summary(m1)
##
## Call:
## lm(formula = eneffap1 ~ eduyrs1, data = ESS1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.1115 -0.7787 0.3322 1.2767 2.6650
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.22411 0.15651 52.546 < 2e-16 ***
## eduyrs1 0.05546 0.01239 4.477 8.03e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.073 on 1795 degrees of freedom
## Multiple R-squared: 0.01105, Adjusted R-squared: 0.01049
## F-statistic: 20.05 on 1 and 1795 DF, p-value: 8.029e-06
The intercept is 8.22 and the coefficient b is 0.06, so the regression equation is Y = 8.22 + 0.06x.
This means that each additional year of education is associated with an increase of 0.06 in the willingness to use energy-efficient appliances (starting from 8.22).
R squared helps to evaluate the quality of the regression model: here the model explains about 1 percent of the variance in the outcome.
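As a worked illustration, the coefficients printed in the `summary(m1)` output can be used to compute a predicted value, here for a respondent with the median of 12 years of education. The definition of R squared is the standard one; the symbols \(SS_{res}\) and \(SS_{tot}\) are our notation, not the report's:

```latex
\hat{y} = 8.224 + 0.0555 \times 12 \approx 8.89
\qquad
R^2 = 1 - \frac{SS_{res}}{SS_{tot}} \approx 0.011
```

The predicted value is close to the sample mean of eneffap1 (8.89), and an R squared of about 0.011 means that roughly 1% of the variation in the outcome is accounted for by years of education.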
The next step is to check how gender affects individuals’ willingness to use energy-efficient appliances.
m2 <- lm(eneffap1 ~ eduyrs1 + gndr1, data = ESS1)
summary(m2)
##
## Call:
## lm(formula = eneffap1 ~ eduyrs1 + gndr1, data = ESS1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.1554 -0.7317 0.3734 1.3209 2.7415
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.93996 0.20386 38.948 < 2e-16 ***
## eduyrs1 0.05258 0.01244 4.225 2.51e-05 ***
## gndr1 0.21340 0.09825 2.172 0.03 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.07 on 1794 degrees of freedom
## Multiple R-squared: 0.01364, Adjusted R-squared: 0.01254
## F-statistic: 12.4 on 2 and 1794 DF, p-value: 4.468e-06
The intercept is about 7.94. The coefficient b is 0.21 for gndr1 and 0.05 for eduyrs1.
Thus we can compose the regression equation:
Y = 7.94 + 0.21(gndr1) + 0.05(eduyrs1) + error
According to this equation, the willingness to use energy-efficient appliances increases by 0.21 for a one-unit change in gndr1 and by 0.05 for each additional year of education, starting from the intercept of 7.94.
It is also necessary to look at R squared in order to evaluate the quality of the regression model.
R squared is approximately 0.014, which tells us that about 1 percent of the variance in the outcome is explained by this model.
m3 <- lm(eneffap1 ~ eduyrs1 + gndr1 + hinctnta1, data = ESS1)
summary(m3)
##
## Call:
## lm(formula = eneffap1 ~ eduyrs1 + gndr1 + hinctnta1, data = ESS1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.3948 -0.7275 0.4772 1.3702 2.8270
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.61991 0.21732 35.063 < 2e-16 ***
## eduyrs1 0.03498 0.01311 2.669 0.00767 **
## gndr1 0.26399 0.09858 2.678 0.00748 **
## hinctnta1 0.07921 0.01923 4.119 3.97e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.061 on 1793 degrees of freedom
## Multiple R-squared: 0.02289, Adjusted R-squared: 0.02125
## F-statistic: 14 on 3 and 1793 DF, p-value: 5.064e-09
In our final model we checked how all predictor variables together influence respondents’ willingness to use energy-efficient products.
Thus, we get the final regression equation
Y = 7.62 + 0.26(gndr1) + 0.03(eduyrs1) + 0.08(hinctnta1) + error
The F-ratio is equal to 14 and the p-value is less than 0.05, so the model as a whole is statistically significant.
Adjusted R squared is equal to 0.02, which means that the model explains about 2% of the variance in the outcome.
sjt.lm(m3)
| eneffap1 | B | CI | p |
| (Intercept) | 7.62 | 7.19 – 8.05 | <.001 |
| eduyrs1 | 0.03 | 0.01 – 0.06 | .008 |
| gndr1 | 0.26 | 0.07 – 0.46 | .007 |
| hinctnta1 | 0.08 | 0.04 – 0.12 | <.001 |
| Observations | 1797 | | |
| R2 / adj. R2 | .023 / .021 | | |
This is our summary table. Gender has the largest unstandardized coefficient (B = 0.26), although the predictors are measured on different scales, so the coefficients are not directly comparable; income has the smallest p-value.
anova(m1, m2)
## Analysis of Variance Table
##
## Model 1: eneffap1 ~ eduyrs1
## Model 2: eneffap1 ~ eduyrs1 + gndr1
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 1795 7710.1
## 2 1794 7689.8 1 20.224 4.7181 0.02998 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As a result, we see that m2 fits the data significantly better than m1 (p-value = 0.02998).
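The F value in this table can be reproduced by hand from the residual sums of squares it reports. This is the standard F-test for nested linear models; the symbols \(RSS_1\), \(RSS_2\), \(df_1\), \(df_2\) are our notation for the two rows of the table:

```latex
F = \frac{(RSS_1 - RSS_2)/(df_1 - df_2)}{RSS_2/df_2}
  = \frac{20.224/1}{7689.8/1794}
  \approx 4.72
```

Adding gndr1 reduces the residual sum of squares by 20.224 at the cost of one degree of freedom, and that reduction is large enough relative to the remaining error variance to be significant at the 0.05 level.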
anova(m2, m3)
## Analysis of Variance Table
##
## Model 1: eneffap1 ~ eduyrs1 + gndr1
## Model 2: eneffap1 ~ eduyrs1 + gndr1 + hinctnta1
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 1794 7689.8
## 2 1793 7617.8 1 72.09 16.968 3.975e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We can see that m3 fits significantly better than m2 (p-value = 3.975e-05).
layout(matrix(1:4,2,2)); plot(fit)
As the graphs confirm, the residuals are not normally distributed (the red lines are not straight and horizontal). The following observations are flagged as outliers: 1407, 1134 and 417.
H0: distribution is normal
H1: distribution is not normal
model=aov(ESS1$eduyrs1 ~ ESS1$eneffap1)
res=model$residuals
shapiro.test(res)
##
## Shapiro-Wilk normality test
##
## data: res
## W = 0.99212, p-value = 3.028e-08
model=aov(ESS1$gndr1 ~ ESS1$eneffap1)
res=model$residuals
shapiro.test(res)
##
## Shapiro-Wilk normality test
##
## data: res
## W = 0.68675, p-value < 2.2e-16
model=aov(ESS1$hinctnta1 ~ ESS1$eneffap1)
res=model$residuals
shapiro.test(res)
##
## Shapiro-Wilk normality test
##
## data: res
## W = 0.96681, p-value < 2.2e-16
According to the Shapiro-Wilk normality test, the residuals are not normally distributed in any of the three cases, because the p-values are below 0.05.
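The tests above are run on the residuals of separate `aov()` models of each predictor on the outcome. The normality assumption of linear regression, however, concerns the residuals of the regression model itself. A minimal sketch of that more direct check, on simulated data (the data frame `sim` and its values are made up; only the variable names mirror the report):

```r
set.seed(42)
n <- 200
# Simulated predictors and outcome; values are invented for illustration.
sim <- data.frame(eduyrs1   = rnorm(n, 12, 4),
                  gndr1     = sample(1:2, n, replace = TRUE),
                  hinctnta1 = sample(1:10, n, replace = TRUE))
sim$eneffap1 <- 7.6 + 0.03 * sim$eduyrs1 + 0.26 * sim$gndr1 +
  0.08 * sim$hinctnta1 + rnorm(n, sd = 2)

fit_sim <- lm(eneffap1 ~ eduyrs1 + gndr1 + hinctnta1, data = sim)

# Shapiro-Wilk test applied to the residuals of the regression model itself.
sw <- shapiro.test(residuals(fit_sim))
sw$p.value
```

With the real data the same call would be applied to the fitted regression object instead of `fit_sim`.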
crPlots(fit)
According to the graphs, the relationships are more or less linear.
H0: the variance of the residuals is constant
H1: the variance of the residuals is not constant
ncvTest(fit)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 29.29541 Df = 1 p = 6.214279e-08
The p-value is lower than 0.05, so we reject H0: the variance of the residuals is not constant, i.e. there is heteroscedasticity.
spreadLevelPlot(fit)
##
## Suggested power transformation: 5.012736
This graph also indicates that the variance of the residuals is not constant, i.e. there is heteroscedasticity.
vif(fit)
## eduyrs1 gndr1 hinctnta1
## 1.131735 1.027436 1.126846
Our VIFs are close to 1, so we can assume that there is no problematic multicollinearity.
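For reference, the variance inflation factor of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the others. The sketch below computes it by hand in base R on simulated data (the values are made up; `vif_manual` is our own helper name). For a model with only numeric predictors and no interactions it should agree with `car::vif()`:

```r
set.seed(7)
n <- 300
# Simulated predictors; income is made mildly dependent on education.
eduyrs1   <- rnorm(n, 12, 4)
gndr1     <- sample(1:2, n, replace = TRUE)
hinctnta1 <- 2 + 0.3 * eduyrs1 + rnorm(n, sd = 2.5)
X <- data.frame(eduyrs1, gndr1, hinctnta1)

# VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing predictor j on the rest.
vif_manual <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
  1 / (1 - r2)
})
round(vif_manual, 3)
```

Values near 1 mean that a predictor is almost uncorrelated with the others; common rules of thumb start to worry at values above 5 or 10.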
H0: there is no autocorrelation
H1: there is autocorrelation
durbinWatsonTest(fit)
## lag Autocorrelation D-W Statistic p-value
## 1 -0.02207406 2.043277 0.394
## Alternative hypothesis: rho != 0
The p-value is higher than 0.05, so we fail to reject H0: there is no evidence of autocorrelation.
ggplot(data = ESS1, aes(x = eduyrs1, y = eneffap1)) +
geom_point() +
geom_smooth(method = "lm", formula = y~x) +
ylab("How likely to buy most energy efficient home appliance")+
xlab("Years of full-time education completed")+
theme_bw()
As the number of years of education increases, the willingness to use energy-efficient appliances rises, too.
ggplot(data = ESS1, aes(x = hinctnta1, y = eneffap1)) +
geom_point() +
geom_smooth(method = "lm", formula = y~x) +
ylab("How likely to buy most energy efficient home appliance")+
xlab("Income")+
theme_bw()
The higher the income, the more likely a person is to buy an energy-efficient appliance.
Women tend to be more willing to use energy-efficient appliances, especially those with a relatively long period of education.
People with many years of education and a relatively high income are more likely to use energy-efficient appliances.
So, both of our hypotheses were supported.
Research question: For the interaction effect, we decided to check whether gender and the duration of a person’s education affect how often he/she does things that benefit nature, in our case how often respondents do things that reduce energy consumption.
Thus, as predictors we use the variables “gndr” (gender of the respondent) and “eduyrs” (years of full-time education), and our outcome variable is “rdcenr”, which means “How often do things to reduce energy use”. It is an ordinal categorical variable with 7 categories: “never”, “hardly ever”, “sometimes”, “often”, “very often”, “always”, “can not reduce energy use”.
Hypotheses:
We put forward the following hypotheses:
Women with a longer education period are more likely to do things that reduce energy consumption.
People with a longer period of education and a high income are more likely to do things to reduce energy use.
ggplot() +
geom_histogram(data = ESS1, aes(x = rdcenr1), binwidth = 1, fill="#008080", col="#483D8B", alpha = 0.5)+
ggtitle("How often do things to reduce energy use") +
theme_bw()
The distribution seems to be almost normal.
ggplot( data = ESS1, aes(x=rdcenr, y=eduyrs1)) + geom_boxplot() +
xlab("How often do things to reduce energy use") +
ylab("Years of full-time education completed") +
theme_bw()
The boxplots represent the distribution of years spent on education by how often respondents do things to reduce energy use. Almost all the categories are centred near 13 years of education, with outliers above 20 years or below 5 years. The category “cannot reduce energy use” is concentrated among people who spent about 8 years on education.
We checked whether the number of years spent on education is related to the frequency of things done to reduce energy use.
m11 <- lm(rdcenr1 ~ eduyrs1, data = ESS1)
summary(m11)
##
## Call:
## lm(formula = rdcenr1 ~ eduyrs1, data = ESS1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2910 -0.3170 -0.1871 0.7739 1.9168
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.057187 0.076472 53.055 <2e-16 ***
## eduyrs1 0.012992 0.006052 2.147 0.032 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.013 on 1795 degrees of freedom
## Multiple R-squared: 0.002561, Adjusted R-squared: 0.002005
## F-statistic: 4.608 on 1 and 1795 DF, p-value: 0.03195
The intercept is 4.06 and the coefficient b is equal to 0.01, so the regression equation is Y = 4.06 + 0.01x.
The interpretation of this model is the following: each additional year of education increases the frequency of doing things to reduce energy use by 0.01.
As adjusted R-squared is 0.002, the model explains about 0.2% of the variance in the outcome.
The next step is to check how gender and the number of years spent on education together affect the frequency of things done to reduce energy use.
m22 <- lm(rdcenr1 ~ eduyrs1 * gndr1, data = ESS1)
summary(m22)
##
## Call:
## lm(formula = rdcenr1 ~ eduyrs1 * gndr1, data = ESS1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.3154 -0.3322 -0.1478 0.7203 2.1316
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.27373 0.23980 13.652 < 2e-16 ***
## eduyrs1 0.06065 0.01938 3.129 0.001783 **
## gndr1 0.53874 0.15318 3.517 0.000447 ***
## eduyrs1:gndr1 -0.03271 0.01213 -2.696 0.007084 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.009 on 1793 degrees of freedom
## Multiple R-squared: 0.01173, Adjusted R-squared: 0.01008
## F-statistic: 7.093 on 3 and 1793 DF, p-value: 9.752e-05
The intercept is 3.27. The coefficient for eduyrs1 is 0.06, the coefficient for gndr1 is 0.54, and the interaction term is -0.03, so the effect of an additional year of education depends on gender. We can construct an equation:
Y = 3.27 + 0.06(eduyrs1) + 0.54(gndr1) - 0.03(eduyrs1:gndr1) + error
Adjusted R-squared is 0.01, which means that this model explains about 1% of the variance in the outcome.
m33 <- lm(rdcenr1 ~ eduyrs1 * gndr1 * hinctnta1, data = ESS1)
summary(m33)
##
## Call:
## lm(formula = rdcenr1 ~ eduyrs1 * gndr1 * hinctnta1, data = ESS1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4002 -0.4354 -0.1330 0.7606 2.1171
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.2884796 0.5318687 6.183 7.78e-10 ***
## eduyrs1 0.0696745 0.0459078 1.518 0.129
## gndr1 0.5869146 0.3306182 1.775 0.076 .
## hinctnta1 -0.0021752 0.0896809 -0.024 0.981
## eduyrs1:gndr1 -0.0343155 0.0280316 -1.224 0.221
## eduyrs1:hinctnta1 -0.0011620 0.0071254 -0.163 0.870
## gndr1:hinctnta1 -0.0181495 0.0585703 -0.310 0.757
## eduyrs1:gndr1:hinctnta1 0.0007829 0.0045381 0.173 0.863
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.007 on 1789 degrees of freedom
## Multiple R-squared: 0.01705, Adjusted R-squared: 0.0132
## F-statistic: 4.432 on 7 and 1789 DF, p-value: 6.768e-05
In this model we checked whether the number of years spent on education, gender and income are connected with the frequency of things done to reduce energy use. According to the table we can construct an equation:
Y(frequency of things done) = 3.29 + 0.07(eduyrs1) + 0.59(gndr1) - 0.002(hinctnta1) - 0.03(eduyrs1:gndr1) - 0.001(eduyrs1:hinctnta1) - 0.02(gndr1:hinctnta1) + 0.001(eduyrs1:gndr1:hinctnta1) + error.
Adjusted R-squared is 0.013, so this model explains about 1% of the variance in the outcome.
sjt.lm(m33)
| rdcenr1 | B | CI | p |
| (Intercept) | 3.29 | 2.25 – 4.33 | <.001 |
| eduyrs1 | 0.07 | -0.02 – 0.16 | .129 |
| gndr1 | 0.59 | -0.06 – 1.24 | .076 |
| hinctnta1 | -0.00 | -0.18 – 0.17 | .981 |
| eduyrs1:gndr1 | -0.03 | -0.09 – 0.02 | .221 |
| eduyrs1:hinctnta1 | -0.00 | -0.02 – 0.01 | .870 |
| gndr1:hinctnta1 | -0.02 | -0.13 – 0.10 | .757 |
| eduyrs1:gndr1:hinctnta1 | 0.00 | -0.01 – 0.01 | .863 |
| Observations | 1797 | | |
| R2 / adj. R2 | .017 / .013 | | |
This is our summary table. None of the individual coefficients is significant at the 0.05 level in this model; gender comes closest (p = .076).
anova(m11, m22)
## Analysis of Variance Table
##
## Model 1: rdcenr1 ~ eduyrs1
## Model 2: rdcenr1 ~ eduyrs1 * gndr1
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 1795 1840.6
## 2 1793 1823.7 2 16.919 8.3168 0.0002539 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As a result, we see that m22 fits significantly better than m11 (p-value = 0.0002539).
anova(m22, m33)
## Analysis of Variance Table
##
## Model 1: rdcenr1 ~ eduyrs1 * gndr1
## Model 2: rdcenr1 ~ eduyrs1 * gndr1 * hinctnta1
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 1793 1823.7
## 2 1789 1813.9 4 9.8111 2.4191 0.04665 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
So, m33 fits slightly better than m22 (p-value = 0.04665).
layout(matrix(1:4,2,2)); plot(fit1)
As the graphs confirm, the residuals are not normally distributed (the red lines are not straight and horizontal). We have the following outliers: ///////////
H0: distribution is normal
H1: distribution is not normal
model=aov(ESS1$eduyrs1 ~ ESS1$rdcenr1)
res=model$residuals
shapiro.test(res)
##
## Shapiro-Wilk normality test
##
## data: res
## W = 0.99136, p-value = 8.008e-09
model=aov(ESS1$gndr1 ~ ESS1$rdcenr1)
res=model$residuals
shapiro.test(res)
##
## Shapiro-Wilk normality test
##
## data: res
## W = 0.70239, p-value < 2.2e-16
model=aov(ESS1$hinctnta1 ~ ESS1$rdcenr1)
res=model$residuals
shapiro.test(res)
##
## Shapiro-Wilk normality test
##
## data: res
## W = 0.96042, p-value < 2.2e-16
According to the Shapiro-Wilk normality test, the residuals are not normally distributed in any of the three cases, because the p-values are below 0.05.
H0: the variance of the residuals is constant
H1: the variance of the residuals is not constant
ncvTest(fit1)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 0.0144896 Df = 1 p = 0.9041879
The p-value is higher than 0.05 (p = 0.904), so we fail to reject H0: there is no evidence that the variance of the residuals is not constant, i.e. no evidence of heteroscedasticity.
spreadLevelPlot(fit1)
##
## Suggested power transformation: -5.515853
This graph suggests some change in the spread of the residuals, although the score test above does not indicate significant heteroscedasticity.
vif(fit1)
## eduyrs1 gndr1 hinctnta1
## 58.19193 48.42437 102.69253
## eduyrs1:gndr1 eduyrs1:hinctnta1 gndr1:hinctnta1
## 116.06432 188.28040 146.57584
## eduyrs1:gndr1:hinctnta1
## 240.70349
H0: there is no autocorrelation
H1: there is autocorrelation
durbinWatsonTest(fit1)
## lag Autocorrelation D-W Statistic p-value
## 1 0.01406075 1.969818 0.548
## Alternative hypothesis: rho != 0
The p-value is higher than 0.05, so we fail to reject H0: there is no evidence of autocorrelation.
ggplot(data = ESS1, aes(x = eduyrs1, y = rdcenr1)) +
geom_point() +
geom_smooth(method = "lm", formula = y~x) +
ylab("How often do things to reduce energy use")+
xlab("Years of full-time education completed")+
theme_bw()
We can conclude that the number of years spent on education is positively related to the frequency of things done to reduce energy use: the slope is positive, so the more years the respondent studied, the more often he/she does things to reduce energy use.
ggplot(data = ESS1, aes(x = hinctnta1, y = rdcenr1)) +
geom_point() +
geom_smooth(method = "lm", formula = y~x) +
ylab("How often do things to reduce energy use")+
xlab("Income")+
theme_bw()
The graph shows that the higher the income, the less a person does to reduce energy use, but the relation is weak, as the slope is almost flat.
fit2 <- lm(rdcenr1 ~ eduyrs1 * hinctnta1, data=ESS1)
sjPlot::plot_model(fit2, type = "int", show.ci = T, mdrt.values = "all")+theme_bw() + ylab("How often do things to reduce energy use")+ xlab("Years of full-time education completed")+
ggtitle("Predicted values for readiness of the most people to reduce energy use")
## Warning in RColorBrewer::brewer.pal(n, pal): n too large, allowed maximum for palette Set1 is 9
## Returning the palette you asked for with that many colors
## Warning: Removed 27 rows containing missing values (geom_path).
This graph shows that people with a higher income and a longer period of education are more likely to do things to reduce energy use.
fit3 <- lm(rdcenr1 ~ eduyrs1 * gndr1, data=ESS1)
sjPlot::plot_model(fit3, type = "int", show.ci = T, mdrt.values = "all")+theme_bw() + ylab("How often do things to reduce energy use")+ xlab("Years of full-time education completed")+
ggtitle("Predicted values for readiness of the most people to reduce energy use")
The graph shows that the more time women spend in full-time education, the more frequently they do things to reduce energy use. Men do these things more frequently than women if they spent less than about 18 years on education; if they studied for more than 18 years, they do such things less frequently than women.
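The crossover point can also be estimated from the coefficients in the `summary(m22)` output, since fit3 uses the same formula as m22. As a sketch, assuming gndr1 is coded with two consecutive integers, the predicted gap \(\Delta\) (our notation) between the two gender codes at a given number of years of education is:

```latex
\Delta(\text{eduyrs1}) = 0.53874 - 0.03271 \cdot \text{eduyrs1},
\qquad
\Delta = 0 \;\Rightarrow\; \text{eduyrs1} = \frac{0.53874}{0.03271} \approx 16.5
```

So the two fitted lines cross at roughly 16 to 17 years of education, somewhat below the value of about 18 years read from the plot.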
So, our first hypothesis was supported: women with a longer education period are more likely to do things that reduce energy consumption.
Our second hypothesis was also supported: people with a higher income and a longer period of education are more likely to do things to reduce energy use.
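McCright, A. M. (2010). The effects of gender on climate change knowledge and concern in the American public. Population and Environment. https://link.springer.com/article/10.1007/s11111-010-0113-1
Ledley, T. S., Rooney-Varga, J., & Niepold, F. Oxford Research Encyclopedia of Environmental Science (online publication). http://environmentalscience.oxfordre.com/view/10.1093/acrefore/9780199389414.001.0001/acrefore-9780199389414-e-56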