Research question: Do the duration of education, gender, and income of respondents affect their readiness to buy and use energy-efficient products?
Hypotheses:
H1: People with more years of education and a higher income are more willing to use energy-efficient products.
H2: Women with more years of education are more inclined to use energy-efficient products.
In our case, the predictor variables are the number of years of education, gender, and income, and the outcome is the degree of readiness to use energy-efficient products.
Years of education
library(ggplot2)  # needed for all ggplot() calls in this report
ggplot() +
geom_histogram(data = ESS1, aes(x = eduyrs1), binwidth = 1, fill="#008080", col="#483D8B", alpha = 0.5)+
ggtitle("Years of full-time education completed") +
theme_bw()
summary(ESS1$eduyrs1)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 9.00 12.00 11.99 15.00 27.00
The histogram shows that the distribution is close to normal. The most common values are approximately 9 and 14 years of education.
Income
ggplot() +
geom_histogram(data = ESS1, aes(x = hinctnta1), binwidth = 1, fill="#008080", col="#483D8B", alpha = 0.5)+
ggtitle("Income") +
theme_bw()
summary(ESS1$hinctnta1)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 4.000 6.000 5.746 8.000 10.000
Income is coded from 1 (lowest) to 10 (highest). Half of the respondents report an income between 4 and 8 (the first and third quartiles), with a median of 6.
Gender
ggplot() +
geom_bar(data = ESS1, aes(x = gndr), fill="#008080", col="#483D8B", alpha = 0.5)+  # geom_bar counts categories; geom_histogram(stat = "count") only triggers a warning
ggtitle("Gender distribution") +
theme_bw()
The numbers of male and female respondents are approximately equal.
Readiness to buy and use energy-efficient products
ggplot() +
geom_histogram(data = ESS1, aes(x = eneffap1), binwidth = 1, fill="#008080", col="#483D8B", alpha = 0.5)+
ggtitle("Readiness to buy energy-efficient products") +
theme_bw()
summary(ESS1$eneffap1)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 8.000 9.000 8.888 10.000 11.000
This graph shows that most respondents report a high readiness to buy energy-efficient products (the median is 9). The scale runs from 1 (not at all likely) to 11 (extremely likely).
Years of education and gender
ggplot() +
geom_boxplot(data = ESS1, aes(x = gndr, y = eduyrs1 ), col = "#E52B50", fill = "#F0F8FF") +
ylab("Years of full-time education") +
ggtitle("Years of full-time education completed and Gender") +
theme_bw()
These boxplots show that the median duration of education is higher for women: it is approximately 13 years, while the median for men is close to 11.
ggplot( data = ESS1, aes(x=hinctnta1, y=eduyrs1)) + geom_jitter() +
xlab("Income") +
ylab("Years of full-time education completed") +
theme_bw()
This graph shows almost no relationship between years of education and income. However, there are outliers: respondents with more than 20 years of education tend to have a relatively high income, while very low incomes appear among respondents with fewer than 5 years of education.
ggplot( data = ESS1, aes(x=eneffap1, y=eduyrs1)) + geom_jitter() +
xlab("How likely to buy most energy efficient home appliance") +
ylab("Years of full-time education completed") +
theme_bw()
This scatterplot shows how the number of years of full-time education completed relates to respondents' readiness to buy the most energy-efficient home appliance.
Respondents with about 15 years of education (roughly a completed bachelor's or master's degree) appear to be more ready to buy energy-efficient products.
We then tested whether the number of years of education is related to willingness to use energy-efficient products.
m1 <- lm(eneffap1 ~ eduyrs1, data = ESS1)
summary(m1)
##
## Call:
## lm(formula = eneffap1 ~ eduyrs1, data = ESS1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.1197 -0.7737 0.3803 1.2839 2.6875
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.19718 0.15584 52.60 < 2e-16 ***
## eduyrs1 0.05765 0.01235 4.67 3.23e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.076 on 1808 degrees of freedom
## Multiple R-squared: 0.01192, Adjusted R-squared: 0.01137
## F-statistic: 21.81 on 1 and 1808 DF, p-value: 3.233e-06
The intercept is 8.20 and the slope b is 0.06, so the regression equation is Y = 8.20 + 0.06x.
This means that each additional year of education is associated with an increase of 0.06 points in willingness to use energy-efficient products, starting from 8.20 at zero years of education.
R squared helps to evaluate the quality of the regression model: here the model explains only about 1 percent of the variance in the outcome.
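As a quick check of the fitted equation, the unrounded coefficients of m1 can be evaluated at the sample mean of eduyrs1 (11.99, taken from the summary output above). A least-squares line passes through the point of means, so the result should match the mean of eneffap1 (8.888):

```latex
\hat{Y} = 8.19718 + 0.05765 \times 11.99 \approx 8.89
```

The agreement with the reported mean shows that the equation is read correctly; it says nothing about how well the model predicts individual respondents.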
The next step is to check how gender affects individuals' willingness to use energy-efficient products.
m2 <- lm(eneffap1 ~ eduyrs1 + gndr1, data = ESS1)
summary(m2)
##
## Call:
## lm(formula = eneffap1 ~ eduyrs1 + gndr1, data = ESS1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.1671 -0.7117 0.3856 1.3311 2.7672
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.89823 0.20281 38.945 < 2e-16 ***
## eduyrs1 0.05452 0.01241 4.395 1.17e-05 ***
## gndr1 0.22551 0.09807 2.300 0.0216 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.073 on 1807 degrees of freedom
## Multiple R-squared: 0.0148, Adjusted R-squared: 0.01371
## F-statistic: 13.57 on 2 and 1807 DF, p-value: 1.407e-06
The intercept is about 7.9, and the coefficients are 0.23 for gndr1 and 0.05 for eduyrs1.
Thus we can write the regression equation:
Y = 7.9 + 0.23(gndr1) + 0.05(eduyrs1)
Interpreting this equation, a one-unit increase in gndr1 is associated with a 0.23-point increase in willingness to use energy-efficient products, holding education constant, and each additional year of education is associated with a 0.05-point increase, holding gender constant. The intercept 7.9 is the predicted value when both predictors equal zero.
It is also necessary to look at R squared in order to evaluate the quality of the regression model.
R squared is approximately 0.015, which tells us that the model explains about 1.5 percent of the variance in the outcome.
m3 <- lm(eneffap1 ~ eduyrs1 + gndr1 + hinctnta1, data = ESS1)
summary(m3)
##
## Call:
## lm(formula = eneffap1 ~ eduyrs1 + gndr1 + hinctnta1, data = ESS1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.4070 -0.7273 0.4832 1.3805 2.8476
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.57569 0.21616 35.047 < 2e-16 ***
## eduyrs1 0.03663 0.01307 2.802 0.00513 **
## gndr1 0.27676 0.09839 2.813 0.00496 **
## hinctnta1 0.08016 0.01919 4.176 3.1e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.064 on 1806 degrees of freedom
## Multiple R-squared: 0.02423, Adjusted R-squared: 0.0226
## F-statistic: 14.95 on 3 and 1806 DF, p-value: 1.302e-09
In our final model we checked how all three predictor variables together relate to respondents' willingness to use energy-efficient products.
Thus, we get the final regression equation:
Y = 7.6 + 0.28(gndr1) + 0.04(eduyrs1) + 0.08(hinctnta1)
The F-statistic is 14.95 with a p-value far below 0.05 (1.302e-09), so the model as a whole fits significantly better than a model with no predictors.
Adjusted R squared is equal to 0.02, which means the model explains only about 2% of the variance in the outcome.
sjt.lm(m3)
| | eneffap1 | | |
| | B | CI | p |
| (Intercept) | 7.58 | 7.15 – 8.00 | <.001 |
| eduyrs1 | 0.04 | 0.01 – 0.06 | .005 |
| gndr1 | 0.28 | 0.08 – 0.47 | .005 |
| hinctnta1 | 0.08 | 0.04 – 0.12 | <.001 |
| Observations | 1810 | | |
| R2 / adj. R2 | .024 / .023 | | |
This is our summary table. Gender has the largest unstandardized coefficient (0.28), but the predictors are measured on different scales, so coefficient size alone does not show which predictor matters most. Income has the smallest p-value of the three.
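Because the three predictors use different units, one common way to compare them is to refit the model on standardized variables. The sketch below is only an illustration: it uses simulated data, since ESS1 is not reproduced here, and the names `toy`, `raw_fit` and `std_fit` and the simulated effect sizes are assumptions, not values from the survey.

```r
# Simulated stand-in for ESS1 (hypothetical data, not the real survey).
set.seed(1)
n <- 500
toy <- data.frame(
  eduyrs1   = round(rnorm(n, mean = 12, sd = 4)),
  gndr1     = sample(1:2, n, replace = TRUE),
  hinctnta1 = sample(1:10, n, replace = TRUE)
)
# Assumed effect sizes, chosen only to resemble the reported model.
toy$eneffap1 <- 7.5 + 0.04 * toy$eduyrs1 + 0.3 * toy$gndr1 +
  0.08 * toy$hinctnta1 + rnorm(n, sd = 2)

# Unstandardized fit: coefficients are points per unit of each predictor.
raw_fit <- lm(eneffap1 ~ eduyrs1 + gndr1 + hinctnta1, data = toy)

# Standardized fit: every variable is rescaled to mean 0 and SD 1,
# so coefficients are in SD units and can be compared with each other.
std_toy <- as.data.frame(scale(toy))
std_fit <- lm(eneffap1 ~ eduyrs1 + gndr1 + hinctnta1, data = std_toy)

round(cbind(raw = coef(raw_fit), standardized = coef(std_fit)), 3)
```

On the real data the same two calls would be run on ESS1. Standardizing a two-level variable such as gndr1 is a simplification, so its standardized coefficient should be read with care.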
anova(m1, m2)
## Analysis of Variance Table
##
## Model 1: eneffap1 ~ eduyrs1
## Model 2: eneffap1 ~ eduyrs1 + gndr1
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 1808 7789.5
## 2 1807 7766.8 1 22.729 5.2881 0.02159 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The F-test is significant (p = 0.022), so m2 fits significantly better than m1.
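The F value in this table can be reproduced by hand from the residual sums of squares, which makes clear what the comparison measures: the drop in unexplained variation from adding gndr1, relative to the residual variance of the larger model.

```latex
F = \frac{(RSS_{1} - RSS_{2}) / (df_{1} - df_{2})}{RSS_{2} / df_{2}}
  = \frac{22.729 / 1}{7766.8 / 1807}
  \approx 5.29
```

Since only one term is added, this F equals the square of the t value for gndr1 in m2: 2.300 squared is also about 5.29.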
anova(m2, m3)
## Analysis of Variance Table
##
## Model 1: eneffap1 ~ eduyrs1 + gndr1
## Model 2: eneffap1 ~ eduyrs1 + gndr1 + hinctnta1
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 1807 7766.8
## 2 1806 7692.5 1 74.29 17.442 3.104e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Adding income improves the fit significantly (p < 0.001), so m3 is preferred over m2.
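The residual sums of squares in the two ANOVA tables can also be tied back to the R squared values reported for the models. The total sum of squares is recovered from m1 (RSS = 7789.5, R squared = 0.01192) and then reused for m3; the figures are rounded, so the match is approximate.

```latex
TSS = \frac{RSS_{m1}}{1 - R^{2}_{m1}} = \frac{7789.5}{1 - 0.01192} \approx 7883.5

R^{2}_{m3} = 1 - \frac{RSS_{m3}}{TSS} = 1 - \frac{7692.5}{7883.5} \approx 0.024
```

This agrees with the multiple R squared of 0.02423 in the summary of m3: each added predictor lowers the residual sum of squares, but the explained share of variance stays small.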
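fit <- m3  # assumption: `fit` is never defined in the source; its VIF output lists the same three predictors as m3
layout(matrix(1:4,2,2)); plot(fit)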
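As the diagnostic plots show, the residuals do not behave ideally: the red smoothing lines are not horizontal, which points to non-constant variance or non-linearity (normality itself is judged from the Q-Q plot). The plots flag observations 1407, 1134, and 417 as potential outliers.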
H0: distribution is normal
H1: distribution is not normal
model=aov(ESS1$eduyrs1 ~ ESS1$eneffap1)
res=model$residuals
shapiro.test(res)
##
## Shapiro-Wilk normality test
##
## data: res
## W = 0.9921, p-value = 2.648e-08
model=aov(ESS1$gndr1 ~ ESS1$eneffap1)
res=model$residuals
shapiro.test(res)
##
## Shapiro-Wilk normality test
##
## data: res
## W = 0.68947, p-value < 2.2e-16
model=aov(ESS1$hinctnta1 ~ ESS1$eneffap1)
res=model$residuals
shapiro.test(res)
##
## Shapiro-Wilk normality test
##
## data: res
## W = 0.96707, p-value < 2.2e-16
According to the Shapiro-Wilk tests, the residuals of all three auxiliary models (each predictor modelled on eneffap1) are not normally distributed, because every p-value is below 0.05, so H0 is rejected.
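The normality assumption in linear regression concerns the residuals of the fitted regression model, so a complementary check is to run the same test on those residuals directly. The sketch below uses simulated data and hypothetical names (`toy`, `toy_fit`, `sw`); on the real data the equivalent call would be `shapiro.test(resid(m3))`.

```r
# Simulated stand-in for ESS1 (hypothetical data, not the real survey).
set.seed(2)
n <- 500
toy <- data.frame(
  eduyrs1   = round(rnorm(n, mean = 12, sd = 4)),
  gndr1     = sample(1:2, n, replace = TRUE),
  hinctnta1 = sample(1:10, n, replace = TRUE)
)
latent <- 7.5 + 0.04 * toy$eduyrs1 + 0.3 * toy$gndr1 +
  0.08 * toy$hinctnta1 + rnorm(n, sd = 2)
# Round and clip to the 1..11 answer scale, as in the survey item.
toy$eneffap1 <- pmin(pmax(round(latent), 1), 11)

toy_fit <- lm(eneffap1 ~ eduyrs1 + gndr1 + hinctnta1, data = toy)

# Test the residuals of the regression model itself.
sw <- shapiro.test(resid(toy_fit))
c(W = unname(sw$statistic), p = sw$p.value)
```

With samples of this size the test flags even mild departures from normality, so a Q-Q plot of the same residuals is a useful companion.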
library(car)  # provides crPlots, ncvTest, spreadLevelPlot, vif and durbinWatsonTest
crPlots(fit)
According to the component-plus-residual plots, the relationships are approximately linear.
H0: the variance of the residuals is constant
H1: the variance of the residuals is not constant
ncvTest(fit)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 31.92184 Df = 1 p = 1.605026e-08
The p-value is far below 0.05, so H0 is rejected: the variance of the residuals is not constant, i.e. there is heteroscedasticity.
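To show what a score test for non-constant variance measures, the sketch below computes a Breusch-Pagan-style statistic by hand in base R. It is an illustration on simulated data with deliberately non-constant variance, not the code of `ncvTest`; all object names are hypothetical.

```r
# Simulated data in which the error SD grows with x (an assumption
# made on purpose so that heteroscedasticity is present).
set.seed(3)
n <- 500
x <- runif(n, 1, 10)
y <- 2 + 0.5 * x + rnorm(n, sd = 0.3 * x)
toy_fit <- lm(y ~ x)

e <- resid(toy_fit)
u <- e^2 / mean(e^2)            # squared residuals scaled by the ML variance estimate
aux <- lm(u ~ fitted(toy_fit))  # auxiliary regression on the fitted values

# Score statistic: half of the regression sum of squares of the
# auxiliary model, compared with a chi-square with 1 df.
reg_ss  <- sum((fitted(aux) - mean(u))^2)
chisq   <- reg_ss / 2
p_value <- pchisq(chisq, df = 1, lower.tail = FALSE)
c(chisq = chisq, p = p_value)
```

A small p-value means the squared residuals change systematically with the fitted values, which is the pattern reported for the model above.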
spreadLevelPlot(fit)
##
## Suggested power transformation: 4.999119
This plot also indicates that the variance of the residuals is not constant, i.e. heteroscedasticity is present.
vif(fit)
## eduyrs1 gndr1 hinctnta1
## 1.134046 1.028212 1.128128
All VIFs are close to 1, so multicollinearity among the predictors is not a concern.
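A VIF can be translated back into the share of a predictor's variance that the other predictors explain, which makes the statement "close to 1" concrete. For eduyrs1, using the reported value of 1.134046:

```latex
VIF_{j} = \frac{1}{1 - R_{j}^{2}}
\quad\Longrightarrow\quad
R_{\text{eduyrs1}}^{2} = 1 - \frac{1}{1.134046} \approx 0.118
```

So gender and income together account for roughly 12% of the variance in years of education, which is far from the near-complete overlap that would signal problematic multicollinearity.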
H0: there is no autocorrelation
H1: there is autocorrelation
durbinWatsonTest(fit)
## lag Autocorrelation D-W Statistic p-value
## 1 -0.01746633 2.034066 0.49
## Alternative hypothesis: rho != 0
The p-value is higher than 0.05, so we fail to reject H0: there is no evidence of autocorrelation in the residuals.
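The two numbers in the Durbin-Watson output are linked by a standard approximation, which explains why a statistic near 2 indicates no autocorrelation. With the reported lag-1 autocorrelation of -0.01746633:

```latex
d \approx 2\,(1 - \hat{\rho}) = 2\,(1 + 0.0175) \approx 2.03
```

This is close to the reported D-W statistic of 2.034066. Values near 0 or 4 would instead indicate strong positive or negative autocorrelation.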
ggplot(data = ESS1, aes(x = eduyrs1, y = eneffap1)) +
geom_point() +
geom_smooth(method = "lm", formula = y~x) +
ylab("How likely to buy most energy efficient home appliance")+
xlab("Years of full-time education completed")+
theme_bw()
As the number of years of education increases, willingness to use energy-efficient products rises too.
ggplot(data = ESS1, aes(x = hinctnta1, y = eneffap1)) +
geom_point() +
geom_smooth(method = "lm", formula = y~x) +
ylab("How likely to buy most energy efficient home appliance")+
xlab("Income")+
theme_bw()
The higher a person's income, the more likely that person is to buy an energy-efficient appliance.
fit1 <- lm(eneffap1 ~ eduyrs1 * gndr, data=ESS1)
sjPlot::plot_model(fit1, type = "int", show.ci = TRUE, mdrt.values = "all") +
theme_bw() +
ylab("How likely to buy most energy efficient home appliance") +
xlab("Years of full-time education completed") +
ggtitle("Predicted readiness to buy energy-efficient products")
This graph shows that the line for women has a steeper slope and a lower intercept. Overall, this suggests that women with more years of education are more inclined to use energy-efficient products than men.
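A difference in slopes seen on an interaction plot can be checked numerically by looking at the interaction coefficient and its p-value. The sketch below uses simulated data in which the slope for women is built in as steeper; the names `toy` and `toy_int`, the factor labels and the effect sizes are all assumptions for illustration. On the real data the equivalent step would be `summary(fit1)`.

```r
# Simulated stand-in for ESS1 (hypothetical data, not the real survey).
set.seed(4)
n <- 500
toy <- data.frame(
  eduyrs1 = round(runif(n, 5, 20)),
  gndr    = factor(sample(c("Male", "Female"), n, replace = TRUE),
                   levels = c("Male", "Female"))
)
# Assumed pattern: lower intercept and steeper slope for women.
toy$eneffap1 <- 8 + 0.03 * toy$eduyrs1 +
  ifelse(toy$gndr == "Female", -0.4 + 0.05 * toy$eduyrs1, 0) +
  rnorm(n, sd = 2)

toy_int <- lm(eneffap1 ~ eduyrs1 * gndr, data = toy)

# The interaction row estimates how much the education slope differs
# between the two groups; its p-value tests whether that difference is zero.
coef(summary(toy_int))["eduyrs1:gndrFemale", ]
```

If the interaction term is not significant, the different slopes in the plot may be due to sampling variation alone.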
fit2 <- lm(eneffap1 ~ eduyrs1 * hinctnta1, data=ESS1)
sjPlot::plot_model(fit2, type = "int", show.ci = TRUE, mdrt.values = "all") +
theme_bw() +
ylab("How likely to buy most energy efficient home appliance") +
xlab("Years of full-time education completed") +
ggtitle("Predicted readiness to buy energy-efficient products")
## Warning in RColorBrewer::brewer.pal(n, pal): n too large, allowed maximum for palette Set1 is 9
## Returning the palette you asked for with that many colors
This graph suggests that people with a higher income are more likely to buy energy-efficient products, even though their income is not strongly related to the duration of their education.
Women are more willing to use energy-efficient products, especially those with a relatively long period of education.
People with many years of education and a relatively high income are more likely to use energy-efficient products.
Thus, both of our hypotheses were supported.