Conducting the t-test

To run a t-test we need two variables: one categorical (with two groups) and one continuous.

The first is “wrclmch”, which means How worried about climate change. It is an ordinal categorical variable.
The second is “eduyrs”, which means Years of full-time education completed. It is an interval-level continuous numeric variable.

H0 hypothesis: there is no relationship between concern about climate change and years spent in education (the group means are equal).

H1 hypothesis: there is a relationship between concern about climate change and years spent in education (the group means differ).

To check these hypotheses we used a t-test.

First, we recoded the variable that measures the level of concern about climate change into two categories: [1,2] – «not concerned», [3,4] – «concerned».
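The recoding code itself is not shown in the report. A minimal sketch of one way to do it, assuming `wrclmch` is stored as the integers 1 to 4; the toy data frame `ESS_toy` is hypothetical, and only the column name `wrclmch2` and the two labels come from the output further down:

```r
# Toy stand-in for the survey column; the real ESS1 data frame is not reproduced here.
ESS_toy <- data.frame(wrclmch = c(1, 2, 3, 4, 2, 3, NA))

# cut() maps (0,2] to "Not concerned" and (2,4] to "Concerned"; NA stays NA.
ESS_toy$wrclmch2 <- cut(ESS_toy$wrclmch,
                        breaks = c(0, 2, 4),
                        labels = c("Not concerned", "Concerned"))

table(ESS_toy$wrclmch2, useNA = "ifany")
```

Using `cut()` keeps the result a factor with a fixed level order, which is what `t.test()` and `leveneTest()` expect on the right-hand side of the formula.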

Preparation for the t-test

H0 hypothesis: variable is distributed normally.

H1 hypothesis: variable is not distributed normally.

shapiro.test(ESS1$eduyrs1)

## 
##  Shapiro-Wilk normality test
## 
## data:  ESS1$eduyrs1
## W = 0.98609, p-value = 3.419e-12

Conclusion: According to the Shapiro-Wilk normality test, years of education is not normally distributed, because the p-value is below 0.05 (p-value = 3.419e-12).

ggplot() +
  geom_histogram(data = ESS1, aes(x = eduyrs1), binwidth = 1, fill="#008080", col="#483D8B", alpha = 0.5)+
  ggtitle("Distribution of years of full-time education completed") + 
  theme_bw()

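Conclusion: The histogram nevertheless looks close to normal, and with a sample of this size the Shapiro-Wilk test flags even small deviations, so we did not apply a logarithmic transformation.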

The equality of variances

To check the equality of variances we can use Levene's test or Bartlett's test.

H0 hypothesis: the variances are equal.

H1 hypothesis: the variances are not equal.

We used Levene's test, as it is considered the better option in such situations (it is less sensitive to non-normally distributed data).

leveneTest(eduyrs1 ~ wrclmch2, data=ESS1)

## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value  Pr(>F)  
## group    1  4.4706 0.03462 *
##       1799                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Conclusion: According to Levene's test, the variances are not equal, because the p-value is below 0.05 (p-value = 0.03462). We therefore use the Welch version of the t-test, which does not assume equal variances.

T-test

t = t.test(eduyrs1 ~ wrclmch2, data = ESS1, var.equal = FALSE)
t

## 
##  Welch Two Sample t-test
## 
## data:  eduyrs1 by wrclmch2
## t = -6.2532, df = 595.1, p-value = 7.679e-10
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.8294878 -0.9549693
## sample estimates:
## mean in group Not concerned     mean in group Concerned 
##                    10.88679                    12.27902

Conclusion: The p-value is below 0.05 (p-value = 7.679e-10), so we reject H0 in favour of H1. People who are concerned about climate change have on average more years of education (12.28) than those who are not (10.89), so concern about climate change is related to years spent in education.
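To make the Welch output easier to read, here is a minimal sketch that rebuilds the t statistic and the fractional degrees of freedom by hand. The two vectors are simulated stand-ins, not the ESS data, so the numbers will differ from the report:

```r
set.seed(1)
# Simulated stand-ins for the two groups (not the ESS data).
a <- rnorm(60, mean = 11, sd = 4)
b <- rnorm(200, mean = 12, sd = 3)

# Welch statistic: difference in means over the unpooled standard error.
se2_a <- var(a) / length(a)
se2_b <- var(b) / length(b)
t_manual <- (mean(a) - mean(b)) / sqrt(se2_a + se2_b)

# Welch-Satterthwaite degrees of freedom (usually not an integer).
df_manual <- (se2_a + se2_b)^2 /
  (se2_a^2 / (length(a) - 1) + se2_b^2 / (length(b) - 1))

res <- t.test(a, b, var.equal = FALSE)
c(manual = t_manual, builtin = unname(res$statistic))
c(manual = df_manual, builtin = unname(res$parameter))
```

This is why the report shows df = 595.1 rather than a whole number: the degrees of freedom are estimated from the two group variances.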

Boxplot

ggplot() +
  geom_boxplot(data = ESS1, aes(x = wrclmch2, y = eduyrs1 ), col = "#E52B50", fill = "#F0F8FF") + 
  ylab("Years of full-time education") + 
  ggtitle("Years of full-time education completed and Degree of concernment") + 
  theme_bw()

Conclusion: The boxplot shows that people who are concerned about climate change tend to have slightly more years of education (roughly 9-15 years) than people who are not concerned (roughly 8-13 years). There are outliers, however: some people who studied for 20-25 years are not concerned, while people with 25+ years of education are concerned about climate change.

Chi-square test

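To use the chi-square test we need two categorical variables.

The first is “rdcenr”, which means How often do things to reduce energy use. It is an ordinal categorical variable with 7 categories: “never”, “hardly ever”, “sometimes”, “often”, “very often”, “always”, “cannot reduce energy use”.
The second variable is “gndr”, the gender of the respondent.

H0 hypothesis: how often people do things to reduce energy use is independent of gender.

H1 hypothesis: how often people do things to reduce energy use is related to gender.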

Chi-square test

Table

table(ESS1$rdcenr, ESS1$gndr)

##             
##              Male Female
##   Never        39     22
##   Sometimes   206    167
##   Often       340    313
##   Very often  231    299
##   Always       94     86

Conclusion: this table shows that every cell has enough observations (the smallest count is 22) to use the chi-square test.
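Strictly speaking, the rule of thumb applies to expected rather than observed counts. A short sketch that rebuilds the expected counts from the table above; the matrix `obs` is typed in by hand from the printed output, since the ESS1 data frame is not available here:

```r
# Observed counts copied from the table above (matrix() fills column by column).
obs <- matrix(c(39, 206, 340, 231, 94,
                22, 167, 313, 299, 86),
              ncol = 2,
              dimnames = list(rdcenr = c("Never", "Sometimes", "Often",
                                         "Very often", "Always"),
                              gndr = c("Male", "Female")))

# Expected counts under independence: row total * column total / grand total.
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)
round(expected, 1)

# The usual rule of thumb asks for expected counts of at least 5 in every cell.
all(expected >= 5)
```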

Chi-square test

ch = chisq.test(ESS1$rdcenr, ESS1$gndr)
ch

## 
##  Pearson's Chi-squared test
## 
## data:  ESS1$rdcenr and ESS1$gndr
## X-squared = 18.721, df = 4, p-value = 0.0008918

Conclusion: The p-value is below 0.05 (p-value = 0.0008918), so we reject H0: how often people do things to reduce energy use is related to gender.

Pearson residuals

df_resid = as.data.frame(ch$residuals)
df_resid

##    ESS1.rdcenr ESS1.gndr       Freq
## 1        Never      Male  1.4591143
## 2    Sometimes      Male  1.2451573
## 3        Often      Male  0.5125822
## 4   Very often      Male -2.2823979
## 5       Always      Male  0.2983110
## 6        Never    Female -1.4779107
## 7    Sometimes    Female -1.2611975
## 8        Often    Female -0.5191853
## 9   Very often    Female  2.3117999
## 10      Always    Female -0.3021539

df_count = as.data.frame(ch$observed)
df_count

##    ESS1.rdcenr ESS1.gndr Freq
## 1        Never      Male   39
## 2    Sometimes      Male  206
## 3        Often      Male  340
## 4   Very often      Male  231
## 5       Always      Male   94
## 6        Never    Female   22
## 7    Sometimes    Female  167
## 8        Often    Female  313
## 9   Very often    Female  299
## 10      Always    Female   86

ggplot() + 
  geom_raster(data = df_resid, aes(x = ESS1.gndr, y = ESS1.rdcenr, fill = Freq), hjust = 0.5, vjust = 0.5) + 
  scale_fill_gradient2("Pearson residuals", low = "#2166ac", mid = "#f7f7f7", high = "#b2182b", midpoint = 0) +
  geom_text(data = df_count, aes(x = ESS1.gndr, y = ESS1.rdcenr, label = Freq)) +
  xlab("Gender") +
  ylab("How often do things to reduce energy use") +
  theme_bw()

Conclusion: Significantly more women than expected say they very often do things to reduce energy use (residual 2.31), and fewer men than expected do (residual -2.28).
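The reading of the residuals can be sketched in a few lines. The counts are again typed in from the printed table, and the cut-off of 2 in absolute value is the common rule of thumb, taken here as an assumption rather than something the report states:

```r
# Observed counts copied from the contingency table in this section.
obs <- matrix(c(39, 206, 340, 231, 94,
                22, 167, 313, 299, 86),
              ncol = 2,
              dimnames = list(rdcenr = c("Never", "Sometimes", "Often",
                                         "Very often", "Always"),
                              gndr = c("Male", "Female")))

ch_toy <- chisq.test(obs)

# Pearson residual = (observed - expected) / sqrt(expected)
resid_manual <- (obs - ch_toy$expected) / sqrt(ch_toy$expected)

# Cells whose residual exceeds 2 in absolute value.
which(abs(resid_manual) > 2, arr.ind = TRUE)
```

Only the two “Very often” cells pass the cut-off, which matches the conclusion above.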

ANOVA

To run ANOVA we need two variables: one categorical and one continuous.

The first is “rdcenr”, which means How often do things to reduce energy use. It is an ordinal categorical variable.
The second is “eduyrs”, which means Years of full-time education completed. It is an interval-level continuous numeric variable.

ggplot() +
  geom_boxplot(data = ESS1, aes(x = rdcenr, y = eduyrs1 ), col = "#E52B50", fill = "#F0F8FF") + 
  ylab("Years of full-time education") + 
  xlab("How often do things to reduce energy use")+
  theme_bw()

We visualised the data. According to the boxplots, the highest median years of education are found among people who answered “often” and “very often”, and the lowest among people who answered “never”.

Moving to preparation for ANOVA.

As we have relatively few observations (fewer than 5000), we test the assumptions before running ANOVA.

Normality of distribution

H0 hypothesis: variable is distributed normally.

H1 hypothesis: variable is not distributed normally.

Often

shapiro.test(ESS1$eduyrs1[ESS1$rdcenr == "Often"])

## 
##  Shapiro-Wilk normality test
## 
## data:  ESS1$eduyrs1[ESS1$rdcenr == "Often"]
## W = 0.98758, p-value = 2.405e-05

Very often

shapiro.test(ESS1$eduyrs1[ESS1$rdcenr == "Very often"])

## 
##  Shapiro-Wilk normality test
## 
## data:  ESS1$eduyrs1[ESS1$rdcenr == "Very often"]
## W = 0.98695, p-value = 0.0001111

Always

shapiro.test(ESS1$eduyrs1[ESS1$rdcenr == "Always"])

## 
##  Shapiro-Wilk normality test
## 
## data:  ESS1$eduyrs1[ESS1$rdcenr == "Always"]
## W = 0.98038, p-value = 0.01228

Never

shapiro.test(ESS1$eduyrs1[ESS1$rdcenr == "Never"])

## 
##  Shapiro-Wilk normality test
## 
## data:  ESS1$eduyrs1[ESS1$rdcenr == "Never"]
## W = 0.97221, p-value = 0.1792

Sometimes

shapiro.test(ESS1$eduyrs1[ESS1$rdcenr == "Sometimes"])

## 
##  Shapiro-Wilk normality test
## 
## data:  ESS1$eduyrs1[ESS1$rdcenr == "Sometimes"]
## W = 0.97031, p-value = 6.685e-07

Conclusions:

Using the Shapiro-Wilk normality test, we checked the normality of the variable within each category. The following results were found:

“Sometimes”: p-value < 0.05 ==> the distribution is not normal
“Often”: p-value < 0.05 ==> the distribution is not normal
“Very often”: p-value < 0.05 ==> the distribution is not normal
“Always”: p-value < 0.05 ==> the distribution is not normal
“Never”: p-value = 0.1792 > 0.05 ==> normality cannot be rejected
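The five separate calls can be collapsed into one. A minimal sketch on simulated data; the data frame `toy` and its values are made up, and only the column names mirror the report:

```r
set.seed(42)
# Simulated stand-in: a numeric outcome and a five-level grouping variable.
toy <- data.frame(
  eduyrs1 = c(rnorm(50, 12, 3), rexp(50, 1 / 12), rnorm(50, 13, 3),
              rnorm(50, 12, 4), rnorm(50, 11, 3)),
  rdcenr  = rep(c("Never", "Sometimes", "Often", "Very often", "Always"),
                each = 50)
)

# Run Shapiro-Wilk within each group and keep only the p-values.
p_by_group <- sapply(split(toy$eduyrs1, toy$rdcenr),
                     function(x) shapiro.test(x)$p.value)

data.frame(p_value = round(p_by_group, 4),
           looks_normal = p_by_group > 0.05)
```

Putting the p-values side by side in one table makes it harder to misread a single group, such as one whose p-value is above 0.05.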

ggplot() +
  geom_histogram(data = ESS1, aes(x = eduyrs1), binwidth = 1, fill="#008080", col="#483D8B", alpha = 0.5)+
  ggtitle("Distribution of years of full-time education completed") + 
  facet_grid(~rdcenr) +
  theme_bw()

Because the tests reject normality, we plotted histograms to see how close each distribution is to normal. They show that the distributions are not normal.

The equality of variances

To check the equality of variances we used Levene's test and Bartlett's test.

H0 hypothesis: the variances are equal.

H1 hypothesis: the variances are not equal.

Bartlett.test

bartlett.test(eduyrs1 ~ rdcenr, data=ESS1)

## 
##  Bartlett test of homogeneity of variances
## 
## data:  eduyrs1 by rdcenr
## Bartlett's K-squared = 7.5051, df = 4, p-value = 0.1115

LeveneTest

leveneTest(eduyrs1 ~ rdcenr, data=ESS1)

## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value Pr(>F)
## group    4  1.2109 0.3041
##       1792

Conclusions:

According to Levene's test and Bartlett's test, the variances are equal, because the p-values are above 0.05 (Levene's test: p-value = 0.3041; Bartlett's test: p-value = 0.1115).

ANOVA

H0 hypothesis: there are no differences in readiness to reduce energy consumption between people with different durations of education.

H1 hypothesis: there are differences in readiness to reduce energy consumption between people with different durations of education.

ANOVA test

oneway.test(eduyrs1 ~ rdcenr, data = ESS1, var.equal = TRUE)

## 
##  One-way analysis of means
## 
## data:  eduyrs1 and rdcenr
## F = 3.3267, num df = 4, denom df = 1792, p-value = 0.01005

According to the ANOVA, the F-ratio is significant.

Conclusion: people with different durations of education differ in their readiness to reduce energy consumption (F(4, 1792) = 3.3267, p-value = 0.01005).

Tukey Post-hoc test

aov.out <- aov(ESS1$eduyrs1 ~ as.factor(ESS1$rdcenr))
layout(matrix(1:1,2,2))
par(mar=c(6, 25, 3, 2))
plot(TukeyHSD(aov.out), las = 2)

As the variances are equal, we chose Tukey's post-hoc test rather than the Bonferroni or Games-Howell alternatives (Games-Howell is meant for unequal variances).

Conclusion: According to the test, the means differ for the pairs Very often - Sometimes and Always - Very often.
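The plot is convenient, but the same comparisons can also be read from the table behind it. A minimal sketch using the built-in `PlantGrowth` data as a stand-in for ESS1 (three groups instead of five):

```r
# Built-in PlantGrowth data stands in for ESS1.
aov_toy <- aov(weight ~ group, data = PlantGrowth)
tuk <- TukeyHSD(aov_toy)

# The result is a list with one matrix per factor;
# its columns are diff, lwr, upr and "p adj".
tuk_tab <- as.data.frame(tuk$group)

# Keep only the pairs whose adjusted p-value is below 0.05.
tuk_tab[tuk_tab[["p adj"]] < 0.05, ]
```

With the formula used in the report, the element would be named after the term `as.factor(ESS1$rdcenr)` rather than `group`, so `tuk[[1]]` is the safer way to reach it.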

Outliers

layout(matrix(1:4,2,2));  plot(aov.out)

As the boxplot and the diagnostic plots confirm, our data is not normally distributed (the red lines are not straight and horizontal). The following observations stand out as outliers: 1692, 475 and 96.

Kruskal-Wallis test

kruskal.test(ESS1$eduyrs1 ~ as.factor(ESS1$rdcenr))

## 
##  Kruskal-Wallis rank sum test
## 
## data:  ESS1$eduyrs1 by as.factor(ESS1$rdcenr)
## Kruskal-Wallis chi-squared = 13.211, df = 4, p-value = 0.01029

We then used a non-parametric test (the Kruskal-Wallis test) to check our ANOVA results. The p-value is 0.01029, so the differences are significant, as in the ANOVA.

Conclusions:

People with different durations of education differ in their readiness to reduce energy consumption.
How often people do things to reduce energy use is related to gender.

Regression

Research question:

Do the duration of education, gender and income of respondents affect their readiness to buy and use energy-efficient products?

Hypotheses:

H1: People with a longer education and, correspondingly, a higher income are more willing to use energy-efficient products.

H2: Women with a longer period of education are more inclined to use energy-efficient products.

Literature:

We also looked for a theoretical framework to justify our hypotheses. The first article, “The effects of gender on climate change knowledge and concern in the American public”, was written by Aaron M. McCright and published in the journal “Population and Environment”. The author studies gender differences in scientific knowledge of and concern about climate change in the American public. Significant differences in this area were not expected; however, according to the results of the study, women demonstrated a higher level of knowledge about climate change, and women also underestimated their knowledge in this area more than men did. The study further finds that women are more concerned about climate change than men, and that this gender gap cannot be explained by differences in values, beliefs or social roles. You can find more information here: *https://link.springer.com/article/10.1007/s11111-010-0113-1*
Tamara Shapiro Ledley writes that it is important to get an education and to learn as much as possible. Learning about climate change and its consequences from experience comes too late, so it is better to study more; a person will then be more concerned about climate change and do things to prevent it. In our case these would be things that reduce energy use. *http://environmentalscience.oxfordre.com/view/10.1093/acrefore/9780199389414.001.0001/acrefore-9780199389414-e-56*

In our case, the predictor variables are the number of years of education, gender and income, and the outcome is the degree of readiness to use energy-efficient products.

Descriptive statistics

Years of education

ggplot() +
  geom_histogram(data = ESS1, aes(x = eduyrs1), binwidth = 1, fill="#008080", col="#483D8B", alpha = 0.5)+
  ggtitle("Years of full-time education completed") +
  theme_bw()

summary(ESS1$eduyrs1)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1       9      12      12      15      27

The histogram shows that the distribution is close to normal. The most common values are around 9 and around 14 years of education.

Income

ggplot() +
  geom_histogram(data = ESS1, aes(x = hinctnta1), binwidth = 1, fill="#008080", col="#483D8B", alpha = 0.5)+
  ggtitle("Income") +
  theme_bw()

summary(ESS1$hinctnta1)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   4.000   6.000   5.753   8.000  10.000

Income is measured on a scale from 1 (lowest) to 10 (highest). The middle half of respondents fall between 4 and 8, with a median of 6.

Gender

ggplot() +
  geom_bar(data = ESS1, aes(x = gndr), fill="#008080", col="#483D8B", alpha = 0.5)+
  ggtitle("Gender distribution") +
  theme_bw()

There are approximately equal numbers of respondents of both genders.

Readiness to buy and use energy-efficient products

ggplot() +
  geom_histogram(data = ESS1, aes(x = eneffap1), binwidth = 1, fill="#008080", col="#483D8B", alpha = 0.5)+
  ggtitle("Readiness to buy energy-efficient products") +
  theme_bw()

summary(ESS1$eneffap1)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    8.00    9.00    8.89   10.00   11.00

This graph shows that most people report a high readiness to buy energy-efficient products. Scale: 1 - not at all likely, 11 - extremely likely.

Years of education and gender

ggplot() +
  geom_boxplot(data = ESS1, aes(x = gndr, y = eduyrs1 ), col = "#E52B50", fill = "#F0F8FF") + 
  ylab("Years of full-time education") + 
  ggtitle("Years of full-time education completed and Gender") + 
  theme_bw()

These boxplots show that the median duration of education is higher for women: it is at approximately 13 years, while the median for men is close to 11.

ggplot( data = ESS1, aes(x=hinctnta1, y=eduyrs1)) + geom_jitter() +
  xlab("Income") +
  ylab("Years of full-time education completed") +
  theme_bw()

This graph shows almost no relation between years spent in education and income. It is worth noting, however, that among the outliers with more than 20 years of education income is relatively high, whereas very low income is observed among those with fewer than 5 years of education.

ggplot( data = ESS1, aes(x=eneffap1, y=eduyrs1)) + geom_jitter() +
  xlab("How likely to buy most energy efficient home appliance") +
  ylab("Years of full-time education completed") +
  theme_bw()

This scatterplot shows how the number of years of full-time education completed relates to people's readiness to buy the most energy-efficient home appliance.

We can see that people with about 15 years of education (a completed bachelor's or master's degree) tend to be more ready to buy energy-efficient products.

Regression 1

1

We checked whether the number of years spent in education is related to the willingness to use energy-efficient products.

m1 <- lm(eneffap1 ~ eduyrs1, data = ESS1)
summary(m1)

## 
## Call:
## lm(formula = eneffap1 ~ eduyrs1, data = ESS1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.1115 -0.7787  0.3322  1.2767  2.6650 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.22411    0.15651  52.546  < 2e-16 ***
## eduyrs1      0.05546    0.01239   4.477 8.03e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.073 on 1795 degrees of freedom
## Multiple R-squared:  0.01105,    Adjusted R-squared:  0.01049 
## F-statistic: 20.05 on 1 and 1795 DF,  p-value: 8.029e-06

The intercept is 8.22 and the coefficient b is 0.06, so the regression equation is Y = 8.22 + 0.06x.

This means that each additional year of education raises the predicted willingness to use energy-efficient products by 0.06 points (starting from 8.22).

R squared helps to evaluate the quality of the regression model: the model explains about 1 percent of the variance in the outcome (R squared = 0.011).
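What R squared measures can be sketched by computing it by hand. The built-in `cars` data stands in for ESS1 here, so the value will not match the report:

```r
# Built-in cars data stands in for ESS1.
m_toy <- lm(dist ~ speed, data = cars)

rss <- sum(residuals(m_toy)^2)                 # variation left unexplained
tss <- sum((cars$dist - mean(cars$dist))^2)    # total variation in the outcome
r2_manual <- 1 - rss / tss

c(manual = r2_manual, summary = summary(m_toy)$r.squared)
```

R squared is a share of variance, not a share of respondents: a value of 0.011 says the model accounts for about 1% of the spread in the outcome.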

2

The next step is to check how gender affects the willingness of individuals to use energy-efficient products.

m2 <- lm(eneffap1 ~ eduyrs1 + gndr1, data = ESS1)
summary(m2)

## 
## Call:
## lm(formula = eneffap1 ~ eduyrs1 + gndr1, data = ESS1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.1554 -0.7317  0.3734  1.3209  2.7415 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.93996    0.20386  38.948  < 2e-16 ***
## eduyrs1      0.05258    0.01244   4.225 2.51e-05 ***
## gndr1        0.21340    0.09825   2.172     0.03 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.07 on 1794 degrees of freedom
## Multiple R-squared:  0.01364,    Adjusted R-squared:  0.01254 
## F-statistic:  12.4 on 2 and 1794 DF,  p-value: 4.468e-06

The intercept is about 7.94. The coefficient b is 0.21 for gndr1 and 0.05 for eduyrs1.

Thus we can compose the regression equation:

Y = 7.94 + 0.21(gndr1) + 0.05(eduyrs1) + error

Interpreting this equation, the willingness to use energy-efficient products increases by 0.21 for a one-unit change in gndr1 and by 0.05 for each additional year of education, starting from the intercept of 7.94.

It is also necessary to look at R squared in order to evaluate the quality of the regression model.

R squared is approximately 0.014, which tells us that the model explains about 1 percent of the variance in the outcome.
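As a worked example of reading the m2 equation, the sketch below plugs values into the coefficients printed by `summary(m2)`. The helper `predict_m2` is hypothetical, and the coding of `gndr1` as 1 = male, 2 = female is an assumption that the report does not state:

```r
# Coefficients copied from the summary(m2) output.
b0     <- 7.93996   # intercept
b_edu  <- 0.05258   # eduyrs1
b_gndr <- 0.21340   # gndr1

# Hypothetical helper; ASSUMES gndr1 is coded 1 = male, 2 = female.
predict_m2 <- function(eduyrs1, gndr1) b0 + b_edu * eduyrs1 + b_gndr * gndr1

predict_m2(eduyrs1 = 12, gndr1 = 1)
predict_m2(eduyrs1 = 12, gndr1 = 2)
```

Whatever the coding, two respondents with the same education who differ by one unit of `gndr1` differ by 0.21 in predicted readiness.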

3

m3 <- lm(eneffap1 ~ eduyrs1 + gndr1 + hinctnta1, data = ESS1)
summary(m3)

## 
## Call:
## lm(formula = eneffap1 ~ eduyrs1 + gndr1 + hinctnta1, data = ESS1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.3948 -0.7275  0.4772  1.3702  2.8270 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.61991    0.21732  35.063  < 2e-16 ***
## eduyrs1      0.03498    0.01311   2.669  0.00767 ** 
## gndr1        0.26399    0.09858   2.678  0.00748 ** 
## hinctnta1    0.07921    0.01923   4.119 3.97e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.061 on 1793 degrees of freedom
## Multiple R-squared:  0.02289,    Adjusted R-squared:  0.02125 
## F-statistic:    14 on 3 and 1793 DF,  p-value: 5.064e-09

In our final model we checked how all the predictor variables together influence respondents’ willingness to use energy-efficient products.

Thus, we get the final regression equation

Y = 7.62 + 0.26(gndr1) + 0.03(eduyrs1) + 0.08(hinctnta1) + error

The F-ratio is 14 with a p-value below 0.05 (p-value = 5.064e-09), so the linear model as a whole is significant.

Adjusted R squared is 0.02, which tells us that the model explains about 2% of the variance in the outcome.

sjt.lm(m3)

	eneffap1
	B	CI	p
(Intercept)	7.62	7.19 – 8.05	<.001
eduyrs1	0.03	0.01 – 0.06	.008
gndr1	0.26	0.07 – 0.46	.007
hinctnta1	0.08	0.04 – 0.12	<.001
Observations	1797
R² / adj. R²	.023 / .021

This is our summary table. It shows that gender has the largest coefficient (0.26) for the readiness of respondents to use energy-efficient products, although the predictors are measured on different scales and income has the smallest p-value.

Anova

anova(m1, m2)

## Analysis of Variance Table
## 
## Model 1: eneffap1 ~ eduyrs1
## Model 2: eneffap1 ~ eduyrs1 + gndr1
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
## 1   1795 7710.1                              
## 2   1794 7689.8  1    20.224 4.7181 0.02998 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As a result, we see that m2 fits significantly better than m1 (p-value = 0.02998).

anova(m2, m3)

## Analysis of Variance Table
## 
## Model 1: eneffap1 ~ eduyrs1 + gndr1
## Model 2: eneffap1 ~ eduyrs1 + gndr1 + hinctnta1
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1   1794 7689.8                                  
## 2   1793 7617.8  1     72.09 16.968 3.975e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can see that m3 fits significantly better than m2 (p-value = 3.975e-05).
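The F value in this comparison can be reproduced from the sums of squares in the table. A short worked sketch using only numbers copied from the `anova(m2, m3)` output:

```r
# Values copied from the anova(m2, m3) table.
rss_m3  <- 7617.8   # residual sum of squares of the larger model
ss_gain <- 72.09    # reduction in RSS from adding hinctnta1
df_gain <- 1        # number of added parameters
df_res  <- 1793     # residual degrees of freedom of the larger model

# F = (gain per added parameter) / (residual variance of the larger model)
f_stat <- (ss_gain / df_gain) / (rss_m3 / df_res)
p_val  <- pf(f_stat, df_gain, df_res, lower.tail = FALSE)

c(F = f_stat, p = p_val)
```

Because only one parameter is added, this F is the square of the t value for `hinctnta1` in `summary(m3)` (4.119 squared is about 16.97).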

Assumptions

Outliers

layout(matrix(1:4,2,2));  plot(fit)

As the graphs confirm, our data is not normally distributed (the red lines are not straight and horizontal). The following observations stand out as outliers: 1407, 1134, 417.

Normality of the residuals

H0: distribution is normal

H1: distribution is not normal

model=aov(ESS1$eduyrs1 ~ ESS1$eneffap1) 
res=model$residuals 
shapiro.test(res)

## 
##  Shapiro-Wilk normality test
## 
## data:  res
## W = 0.99212, p-value = 3.028e-08

model=aov(ESS1$gndr1 ~ ESS1$eneffap1) 
res=model$residuals 
shapiro.test(res)

## 
##  Shapiro-Wilk normality test
## 
## data:  res
## W = 0.68675, p-value < 2.2e-16

model=aov(ESS1$hinctnta1 ~ ESS1$eneffap1) 
res=model$residuals 
shapiro.test(res)

## 
##  Shapiro-Wilk normality test
## 
## data:  res
## W = 0.96681, p-value < 2.2e-16

According to the Shapiro-Wilk normality test, the residuals are not normally distributed, because all p-values are below 0.05.
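The checks above fit each predictor against the outcome separately. The normality assumption is usually stated for the residuals of the regression model itself, so a possible complement is to test those directly. A minimal sketch, with the built-in `mtcars` data standing in for ESS1 and `m_toy` playing the role of the final model:

```r
# Built-in mtcars data stands in for ESS1; m_toy plays the role of m3.
m_toy <- lm(mpg ~ wt + hp + qsec, data = mtcars)

res_toy <- residuals(m_toy)
shapiro.test(res_toy)

# A Q-Q plot is a useful companion: with around 1800 observations
# Shapiro-Wilk flags even small departures from normality.
qqnorm(res_toy)
qqline(res_toy)
```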

Linearity

crPlots(fit)

According to the graphs, the relationships are approximately linear.

Homoscedasticity

H0: the variance of the residuals is constant

H1: the variance of the residuals is not constant

ncvTest(fit)

## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 29.29541    Df = 1     p = 6.214279e-08

The p-value is lower than 0.05, so H0 is rejected: the variance of the residuals is not constant, i.e. there is heteroscedasticity.

spreadLevelPlot(fit)

## 
## Suggested power transformation:  5.012736

This graph also indicates that the variance of the residuals is not constant, i.e. there is heteroscedasticity.

Multicollinearity

vif(fit)

##   eduyrs1     gndr1 hinctnta1 
##  1.131735  1.027436  1.126846

Our VIFs are all close to 1, so we can assume that there is no problematic multicollinearity.
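To show what a VIF close to 1 means, here is a sketch that computes the values by hand in base R. The built-in `mtcars` data and its predictors `wt`, `hp` and `qsec` are stand-ins, not the report's variables:

```r
# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
# predictor j on the remaining predictors (mtcars stands in for ESS1).
vif_manual <- c(
  wt   = 1 / (1 - summary(lm(wt ~ hp + qsec, data = mtcars))$r.squared),
  hp   = 1 / (1 - summary(lm(hp ~ wt + qsec, data = mtcars))$r.squared),
  qsec = 1 / (1 - summary(lm(qsec ~ wt + hp, data = mtcars))$r.squared)
)
round(vif_manual, 2)
```

A VIF of 1.13, as for `eduyrs1` in the report, corresponds to an R squared of about 0.12 when that predictor is regressed on the other two, so the predictors overlap only slightly.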

Autocorrelation

H0: there is no autocorrelation

H1: there is autocorrelation

durbinWatsonTest(fit)

##  lag Autocorrelation D-W Statistic p-value
##    1     -0.02207406      2.043277   0.394
##  Alternative hypothesis: rho != 0

The p-value is higher than 0.05, so we cannot reject H0. This means there is no evidence of autocorrelation.
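The link between the two numbers in the output (autocorrelation and D-W statistic) can be sketched in base R. The built-in `mtcars` data stands in for ESS1, so the values differ from the report:

```r
# Built-in mtcars data stands in for ESS1.
m_toy <- lm(mpg ~ wt + hp, data = mtcars)
e <- residuals(m_toy)

# Durbin-Watson statistic: squared successive differences over squared residuals.
dw <- sum(diff(e)^2) / sum(e^2)

# Lag-1 autocorrelation of the residuals; DW is roughly 2 * (1 - r1).
r1 <- sum(e[-1] * e[-length(e)]) / sum(e^2)

c(DW = dw, approx = 2 * (1 - r1))
```

Applied to the report's output, 2 * (1 - (-0.022)) is about 2.044, close to the printed statistic of 2.043; values near 2 indicate no first-order autocorrelation.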

Visualization

1

ggplot(data = ESS1, aes(x = eduyrs1, y = eneffap1)) + 
  geom_point() + 
  geom_smooth(method = "lm", formula = y~x) +
  ylab("How likely to buy most energy efficient home appliance")+
  xlab("Years of full-time education completed")+
  theme_bw()

As the number of years of education increases, the willingness to use energy-efficient products rises too.

2

ggplot(data = ESS1, aes(x = hinctnta1, y = eneffap1)) + 
  geom_point() + 
  geom_smooth(method = "lm", formula = y~x) +
  ylab("How likely to buy most energy efficient home appliance")+
  xlab("Income")+
  theme_bw()

The higher the income, the more likely a person is to buy an energy-efficient appliance.

Conclusions:

Women tend to use energy-efficient products more readily, especially those with a relatively long period of education.
People with many years of education and a relatively high income are more likely to use energy-efficient products.

So, both of our hypotheses were supported.

Regression with interaction effect

Research question: For the interaction effect, we decided to check whether a person’s gender and duration of education affect how often he or she does things that benefit nature, in our case how often respondents do things that reduce energy consumption.

Thus, as predictors we use the variables “gndr” (gender of the respondent) and “eduyrs” (years of full-time education), and our outcome variable is “rdcenr”, which means “How often do things to reduce energy use”. It is an ordinal categorical variable with 7 categories: “never”, “hardly ever”, “sometimes”, “often”, “very often”, “always”, “cannot reduce energy use”.

Hypotheses:

We put forward the following hypotheses:

Women with a longer period of education are more likely to do things that reduce energy consumption.
People with a longer period of education and a high income are more likely to do things to reduce energy use.

Descriptive statistics

ggplot() +
  geom_histogram(data = ESS1, aes(x = rdcenr1), binwidth = 1, fill="#008080", col="#483D8B", alpha = 0.5)+
  ggtitle("How often do things to reduce energy use") +
  theme_bw()

The distribution seems to be almost normal.

ggplot( data = ESS1, aes(x=rdcenr, y=eduyrs1)) + geom_boxplot() +
  xlab("How often do things to reduce energy use") +
  ylab("Years of full-time education completed") +
  theme_bw()

The boxplots show the distribution of years of education for each frequency of doing things to reduce energy use. Almost all categories are centred near 13 years of education, with outliers above 20 years or below 5 years. The category “cannot reduce energy use” is found among people who spent about 8 years in education.

Regression

1

We checked whether the number of years spent in education is related to the frequency of doing things to reduce energy use.

m11 <- lm(rdcenr1 ~ eduyrs1, data = ESS1)
summary(m11)

## 
## Call:
## lm(formula = rdcenr1 ~ eduyrs1, data = ESS1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2910 -0.3170 -0.1871  0.7739  1.9168 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4.057187   0.076472  53.055   <2e-16 ***
## eduyrs1     0.012992   0.006052   2.147    0.032 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.013 on 1795 degrees of freedom
## Multiple R-squared:  0.002561,   Adjusted R-squared:  0.002005 
## F-statistic: 4.608 on 1 and 1795 DF,  p-value: 0.03195

The intercept is 4.06 and the coefficient b is 0.01, so the regression equation is Y = 4.06 + 0.01x.

The model can be interpreted as follows: each additional year of education increases the frequency of doing things to reduce energy use by 0.01.

As adjusted R-squared is 0.002, the model explains about 0.2% of the variance in the outcome.

2

The next step is to check how gender and amount of years spent on education can affect the frequency of things done to reduce energy use.

m22 <- lm(rdcenr1 ~ eduyrs1 * gndr1, data = ESS1)
summary(m22)

## 
## Call:
## lm(formula = rdcenr1 ~ eduyrs1 * gndr1, data = ESS1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.3154 -0.3322 -0.1478  0.7203  2.1316 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    3.27373    0.23980  13.652  < 2e-16 ***
## eduyrs1        0.06065    0.01938   3.129 0.001783 ** 
## gndr1          0.53874    0.15318   3.517 0.000447 ***
## eduyrs1:gndr1 -0.03271    0.01213  -2.696 0.007084 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.009 on 1793 degrees of freedom
## Multiple R-squared:  0.01173,    Adjusted R-squared:  0.01008 
## F-statistic: 7.093 on 3 and 1793 DF,  p-value: 9.752e-05

The intercept is 3.27. Each additional year of education increases the frequency of these actions by 0.06, a one-unit change in gndr1 increases it by 0.54, and the interaction term is -0.03. Because the model includes an interaction, each main effect applies when the other predictor equals zero. We can construct an equation:

Y = 3.27 + 0.06(eduyrs1) + 0.54(gndr1) - 0.03(eduyrs1:gndr1) + error

Adjusted R-squared is 0.010, which means that this model explains about 1% of the variance in the outcome.
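With an interaction in the model, the effect of education is no longer a single number. A worked sketch using the coefficients printed by `summary(m22)`; the helper `edu_slope` is hypothetical, and the coding of `gndr1` as 1 = male, 2 = female is an assumption that the report does not state:

```r
# Coefficients copied from the summary(m22) output.
b_edu   <- 0.06065    # eduyrs1
b_inter <- -0.03271   # eduyrs1:gndr1

# Slope of education at a given value of gndr1.
# ASSUMPTION: gndr1 is coded 1 = male, 2 = female.
edu_slope <- function(gndr1) b_edu + b_inter * gndr1

edu_slope(1)   # 0.06065 - 0.03271
edu_slope(2)   # 0.06065 - 2 * 0.03271
```

Under that assumed coding, the education slope is about 0.03 in the first group and essentially zero in the second, which is what the negative interaction term expresses.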

3

m33 <- lm(rdcenr1 ~ eduyrs1 * gndr1 * hinctnta1, data = ESS1)
summary(m33)

## 
## Call:
## lm(formula = rdcenr1 ~ eduyrs1 * gndr1 * hinctnta1, data = ESS1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4002 -0.4354 -0.1330  0.7606  2.1171 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              3.2884796  0.5318687   6.183 7.78e-10 ***
## eduyrs1                  0.0696745  0.0459078   1.518    0.129    
## gndr1                    0.5869146  0.3306182   1.775    0.076 .  
## hinctnta1               -0.0021752  0.0896809  -0.024    0.981    
## eduyrs1:gndr1           -0.0343155  0.0280316  -1.224    0.221    
## eduyrs1:hinctnta1       -0.0011620  0.0071254  -0.163    0.870    
## gndr1:hinctnta1         -0.0181495  0.0585703  -0.310    0.757    
## eduyrs1:gndr1:hinctnta1  0.0007829  0.0045381   0.173    0.863    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.007 on 1789 degrees of freedom
## Multiple R-squared:  0.01705,    Adjusted R-squared:  0.0132 
## F-statistic: 4.432 on 7 and 1789 DF,  p-value: 6.768e-05

In this model we checked whether years of education, gender and income, together with their interactions, are related to the frequency of doing things to reduce energy use. According to the table we can construct an equation:

Y(frequency of things done) = 3.29 + 0.07(eduyrs1) + 0.59(gndr1) - 0.002(hinctnta1) - 0.03(eduyrs1:gndr1) - 0.001(eduyrs1:hinctnta1) - 0.02(gndr1:hinctnta1) + 0.001(eduyrs1:gndr1:hinctnta1) + error.

Adjusted R-squared is 0.013, so this model explains about 1% of the variance in the outcome.

sjt.lm(m33)

Dependent variable: rdcenr1

Predictors                 B       CI              p
(Intercept)                3.29    2.25 – 4.33     <.001
eduyrs1                    0.07    -0.02 – 0.16    .129
gndr1                      0.59    -0.06 – 1.24    .076
hinctnta1                  -0.00   -0.18 – 0.17    .981
eduyrs1:gndr1              -0.03   -0.09 – 0.02    .221
eduyrs1:hinctnta1          -0.00   -0.02 – 0.01    .870
gndr1:hinctnta1            -0.02   -0.13 – 0.10    .757
eduyrs1:gndr1:hinctnta1    0.00    -0.01 – 0.01    .863
Observations               1797
R² / adj. R²               .017 / .013

This is our summary table. None of the predictors is significant at the 0.05 level; gender comes closest (B = 0.59, p = .076), so it appears to play the largest role in how often a respondent does things to reduce energy use in this model.

Anova

anova(m11, m22)

## Analysis of Variance Table
## 
## Model 1: rdcenr1 ~ eduyrs1
## Model 2: rdcenr1 ~ eduyrs1 * gndr1
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1   1795 1840.6                                  
## 2   1793 1823.7  2    16.919 8.3168 0.0002539 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The F-test is significant (F = 8.32, p < 0.001), so m22 fits the data significantly better than m11.

anova(m22, m33)

## Analysis of Variance Table
## 
## Model 1: rdcenr1 ~ eduyrs1 * gndr1
## Model 2: rdcenr1 ~ eduyrs1 * gndr1 * hinctnta1
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
## 1   1793 1823.7                              
## 2   1789 1813.9  4    9.8111 2.4191 0.04665 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

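The F-test is significant at the 0.05 level (F = 2.42, p = 0.047), so m33 fits slightly better than m22, although the improvement is marginal.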
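The F statistic in the second comparison can be recomputed from the numbers in the ANOVA table, which makes clear what anova() is testing. This is a minimal sketch using the rounded values printed above; the variable names are illustrative.

```r
sum_sq    <- 9.8111   # reduction in RSS when moving from m22 to m33
df_diff   <- 4        # extra coefficients in m33
rss_large <- 1813.9   # residual sum of squares of m33
df_large  <- 1789     # residual degrees of freedom of m33

# F = (drop in RSS per extra coefficient) / (residual variance of the larger model)
f_stat <- (sum_sq / df_diff) / (rss_large / df_large)
p_val  <- pf(f_stat, df_diff, df_large, lower.tail = FALSE)
print(c(F = round(f_stat, 4), p = round(p_val, 4)))
```

Up to rounding, this matches the F value and p-value reported in the table.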

Assumptions

Outliers

layout(matrix(1:4,2,2));  plot(fit1)

As the diagnostic plots show, the model assumptions are not fully met: the red trend lines are not straight and horizontal, and the residuals do not appear to be normally distributed. We have the following outliers: ///////////

Normality of the residuals

H0: distribution is normal

H1: distribution is not normal

model=aov(ESS1$eduyrs1 ~ ESS1$rdcenr1) 
res=model$residuals 
shapiro.test(res)

## 
##  Shapiro-Wilk normality test
## 
## data:  res
## W = 0.99136, p-value = 8.008e-09

model=aov(ESS1$gndr1 ~ ESS1$rdcenr1) 
res=model$residuals 
shapiro.test(res)

## 
##  Shapiro-Wilk normality test
## 
## data:  res
## W = 0.70239, p-value < 2.2e-16

model=aov(ESS1$hinctnta1 ~ ESS1$rdcenr1) 
res=model$residuals 
shapiro.test(res)

## 
##  Shapiro-Wilk normality test
## 
## data:  res
## W = 0.96042, p-value < 2.2e-16

According to the Shapiro-Wilk normality test, the residuals are not normally distributed in any of the three cases: every p-value is below 0.05, so H0 is rejected.
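An alternative check is to test the residuals of the fitted regression model itself rather than the residuals of each predictor regressed on the outcome. The sketch below shows the idea on simulated data; the data frame sim and its columns are assumptions made only for illustration and are not taken from ESS1.

```r
set.seed(1)
n <- 500

# Simulated stand-in for the survey data (assumption, not ESS1)
sim <- data.frame(edu = rnorm(n, mean = 13, sd = 4),
                  inc = sample(1:10, n, replace = TRUE))
sim$y <- 3 + 0.05 * sim$edu + 0.02 * sim$inc + rnorm(n)

fit_sim <- lm(y ~ edu * inc, data = sim)

# Shapiro-Wilk test applied directly to the model residuals
res_sim <- residuals(fit_sim)
sw <- shapiro.test(res_sim)
print(sw)
```

For the models in this report the same call would be shapiro.test(residuals(m33)), which tests the assumption that actually matters for the regression.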

Homoscedasticity

H0: the variance of the residuals is constant

H1: the variance of the residuals is not constant

ncvTest(fit1)

## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 0.0144896    Df = 1     p = 0.9041879

The p-value (0.904) is higher than 0.05, so we fail to reject H0: there is no evidence that the variance of the residuals is non-constant, that is, no evidence of heteroscedasticity.
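The p-value of the score test can be reproduced from the reported chi-square statistic, which shows how the decision about H0 follows from the output. This is a small sketch using the printed values; the variable names are illustrative.

```r
chisq <- 0.0144896   # Chisquare reported by ncvTest(fit1)
df    <- 1

# Upper-tail probability of the chi-square distribution
p_val <- pchisq(chisq, df = df, lower.tail = FALSE)
print(round(p_val, 4))

# Reject H0 (constant variance) only if p is below 0.05
reject_h0 <- p_val < 0.05
print(reject_h0)
```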

spreadLevelPlot(fit1)

## 
## Suggested power transformation:  -5.515853

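The spread-level plot suggests a power transformation of about -5.5. This is far from 1 and hints that the spread of the residuals may change with the fitted values, although the score test did not detect significant heteroscedasticity (p = 0.904).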

Multicollinearity

vif(fit1)

##                 eduyrs1                   gndr1               hinctnta1 
##                58.19193                48.42437               102.69253 
##           eduyrs1:gndr1       eduyrs1:hinctnta1         gndr1:hinctnta1 
##               116.06432               188.28040               146.57584 
## eduyrs1:gndr1:hinctnta1 
##               240.70349
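All VIF values are far above the commonly used rule-of-thumb thresholds of 5 or 10, which is typical when raw predictors are multiplied to form interaction terms. The sketch below, on simulated data, illustrates how centering the predictors before building the interaction can reduce this kind of multicollinearity. The data, the column names and the helper vif_manual are assumptions for illustration; for terms with one degree of freedom the helper uses the standard definition VIF = 1 / (1 - R²).

```r
set.seed(42)
n <- 1000

# Simulated stand-in for years of education and income decile (not ESS1)
sim <- data.frame(edu = rnorm(n, mean = 13, sd = 4),
                  inc = sample(1:10, n, replace = TRUE))
sim$y <- 3 + 0.05 * sim$edu + 0.02 * sim$inc + rnorm(n)

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing column j
# of the design matrix on all the other columns
vif_manual <- function(fit) {
  X <- model.matrix(fit)[, -1, drop = FALSE]  # drop the intercept column
  out <- sapply(seq_len(ncol(X)), function(j) {
    r2 <- summary(lm(X[, j] ~ X[, -j, drop = FALSE]))$r.squared
    1 / (1 - r2)
  })
  setNames(out, colnames(X))
}

# Interaction built from raw predictors
fit_raw <- lm(y ~ edu * inc, data = sim)

# Interaction built from mean-centered predictors
sim$edu_c <- sim$edu - mean(sim$edu)
sim$inc_c <- sim$inc - mean(sim$inc)
fit_centered <- lm(y ~ edu_c * inc_c, data = sim)

vif_raw <- vif_manual(fit_raw)
vif_centered <- vif_manual(fit_centered)
print(round(vif_raw, 1))
print(round(vif_centered, 1))
```

Centering changes the interpretation of the main effects (they become effects at the mean of the other predictor) but not the fit of the model.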

Autocorrelation

H0: there is no autocorrelation

H1: there is autocorrelation

durbinWatsonTest(fit1)

##  lag Autocorrelation D-W Statistic p-value
##    1      0.01406075      1.969818   0.548
##  Alternative hypothesis: rho != 0

The p-value (0.548) is higher than 0.05, so we fail to reject H0: there is no evidence of autocorrelation in the residuals.
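The Durbin-Watson statistic is close to 2 when the lag-1 autocorrelation of the residuals is close to 0, because D-W is approximately 2(1 - r). With the reported autocorrelation of 0.014 this gives about 1.97, in line with the output. The sketch below computes both quantities by hand on simulated data; all names and data are assumptions for illustration.

```r
set.seed(7)
n <- 300

# Simulated regression with independent errors (assumption, not ESS1)
x <- rnorm(n)
y <- 1 + 0.5 * x + rnorm(n)
e <- residuals(lm(y ~ x))

# Durbin-Watson statistic: squared successive differences over squared residuals
dw <- sum(diff(e)^2) / sum(e^2)

# Lag-1 autocorrelation of the residuals
r1 <- sum(e[-1] * e[-n]) / sum(e^2)

print(c(dw = dw, approx = 2 * (1 - r1)))
```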

Visualization

1

ggplot(data = ESS1, aes(x = eduyrs1, y = rdcenr1)) + 
  geom_point() + 
  geom_smooth(method = "lm", formula = y~x) +
  ylab("How often do things to reduce energy use")+
  xlab("Years of full-time education completed")+
  theme_bw()

The plot suggests that the number of years spent in education is related to how often respondents do things to reduce energy use. The fitted line has a positive slope: the more years of education a respondent has completed, the more often he/she does things to reduce energy use.

2

ggplot(data = ESS1, aes(x = hinctnta1, y = rdcenr1)) + 
  geom_point() + 
  geom_smooth(method = "lm", formula = y~x) +
  ylab("How often do things to reduce energy use")+
  xlab("Income")+
  theme_bw()

The graph shows that the higher the income, the less a person does to reduce energy use, but the relationship is weak, as the slope is almost flat.

3

fit2 <- lm(rdcenr1 ~ eduyrs1 * hinctnta1, data = ESS1)
sjPlot::plot_model(fit2, type = "int", show.ci = TRUE, mdrt.values = "all") +
  theme_bw() +
  ylab("How often do things to reduce energy use") +
  xlab("Years of full-time education completed") +
  ggtitle("Predicted values for readiness of the most people to reduce energy use")

## Warning in RColorBrewer::brewer.pal(n, pal): n too large, allowed maximum for palette Set1 is 9
## Returning the palette you asked for with that many colors

## Warning: Removed 27 rows containing missing values (geom_path).

This graph suggests that people with a higher income and a longer period of education are more likely to do things to reduce energy use.

4

fit3 <- lm(rdcenr1 ~ eduyrs1 * gndr1, data = ESS1)
sjPlot::plot_model(fit3, type = "int", show.ci = TRUE, mdrt.values = "all") +
  theme_bw() +
  ylab("How often do things to reduce energy use") +
  xlab("Years of full-time education completed") +
  ggtitle("Predicted values for readiness of the most people to reduce energy use")

The graph shows that the more time women spend in full-time education, the more frequently they do things to reduce energy use. Men do these things more frequently than women if they have spent fewer than about 18 years in education. If they have studied for more than 18 years, they do such things less frequently than women.
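The point where the two predicted lines cross can also be computed from the coefficients of an interaction model: the gender difference is b_gender + b_interaction × education, which equals zero at education = -b_gender / b_interaction. The sketch below illustrates this on simulated data with a 0/1 gender dummy; the data, the coding and the true coefficients are assumptions and do not reproduce fit3.

```r
set.seed(3)
n <- 2000

# Simulated data with a 0/1 group dummy (assumption, not the ESS1 coding)
sim <- data.frame(edu = runif(n, 8, 25), g = rbinom(n, 1, 0.5))
sim$y <- 3 + 0.03 * sim$edu + 0.5 * sim$g -
  0.03 * sim$edu * sim$g + rnorm(n, sd = 0.05)

fit_sim <- lm(y ~ edu * g, data = sim)
b <- coef(fit_sim)

# Years of education at which the predicted group difference is zero
crossover <- unname(-b["g"] / b["edu:g"])
print(round(crossover, 1))
```

Applied to fit3, the same ratio of the gndr1 and eduyrs1:gndr1 coefficients would give the crossing point read off the plot, provided gndr1 enters the model as a single numeric term.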

Conclusions:

So, our first hypothesis was supported: women with a longer period of education are more likely to do things that reduce energy consumption.
Our second hypothesis was also supported: people with a higher income and a longer period of education are more likely to do things to reduce energy use.

References:

Aaron M. McCright (2010). The effects of gender on climate change knowledge and concern in the American public
Tamara Shapiro Ledley, Juliette Rooney-Varga, and Frank Niepold Subject: Environmental Issues and Problems, Sustainability and Solutions Online Publication

Climate change in Finland

Lkhasaranova Yu., Mikhailova K., Nakhatova M.
