Elizaveta Dyachenko, Nadezhda Bykova
18 April 2019
As usually, the first step in every analysis is to download needed packages and data.
library(foreign)
library(ggplot2)
library(sjPlot)
library(dplyr)
data <- read.spss("ESS5e03.sav", to.data.frame = TRUE)Then we make a separate database in order to inspect only the variables we are interested in.
savevars <- c("eduyrs", "agea", "agertr", "hinctnta", "gndr", "cntry", "icmnact")
ESS1 <- data[savevars] So, what variables did we choose?
eduyrs - Years of full-time education completed
agea - Age
gndr - Gender
agertr - What age you would like to/would have liked to retire
hinctnta - Household’s total net income in deciles
cntry - Country
icmnact - Main activity
Now let’s take a closer look at our variables.
str(ESS1) ## 'data.frame': 52458 obs. of 7 variables:
## $ eduyrs : Factor w/ 45 levels "0","1","2","3",..: 16 16 14 16 16 14 18 14 13 13 ...
## $ agea : Factor w/ 88 levels "14","15","16",..: 9 30 6 10 45 49 13 27 35 55 ...
## $ agertr : Factor w/ 77 levels "0","1","2","5",..: NA NA NA NA 45 45 NA NA 43 46 ...
## $ hinctnta: Factor w/ 10 levels "J","R","C","M",..: NA 5 1 10 6 5 6 2 6 8 ...
## $ gndr : Factor w/ 2 levels "Male","Female": 1 1 2 2 2 2 1 1 1 1 ...
## $ cntry : Factor w/ 27 levels "Belgium","Bulgaria",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ icmnact : Factor w/ 3 levels "In paid work",..: 3 1 3 1 3 2 1 1 1 2 ...
It seems like some of the variables shouldn’t be factors. Let’s transfrom education, age, desired age of retirement and household’s income into numeric variables and look at their distributions.
for(i in 1:3) {
ESS1[,i] <- as.numeric(as.character(ESS1[,i]))
}
ESS1$hinctnta1<-as.numeric(ESS1$hinctnta)
summary(ESS1) ## eduyrs agea agertr hinctnta
## Min. : 0.00 Min. : 14.00 Min. : 0.00 R : 4801
## 1st Qu.:10.00 1st Qu.: 33.00 1st Qu.: 56.00 J : 4647
## Median :12.00 Median : 48.00 Median : 60.00 C : 4488
## Mean :12.29 Mean : 48.51 Mean : 60.04 F : 4369
## 3rd Qu.:15.00 3rd Qu.: 63.00 3rd Qu.: 64.00 M : 4365
## Max. :55.00 Max. :102.00 Max. :120.00 (Other):17168
## NA's :629 NA's :153 NA's :27902 NA's :12620
## gndr cntry icmnact
## Male :23782 Germany : 3031 In paid work:24049
## Female:28655 Greece : 2715 Retired :14055
## NA's : 21 Russian Federation: 2595 All others :14220
## Ireland : 2576 NA's : 134
## Bulgaria : 2434
## United Kingdom : 2422
## (Other) :36685
## hinctnta1
## Min. : 1.000
## 1st Qu.: 3.000
## Median : 5.000
## Mean : 5.049
## 3rd Qu.: 7.000
## Max. :10.000
## NA's :12620
As we can see, mean number of years spend of education (eduyrs) among respondents is 12, as well as median. Someone was able to study for 55 years… What would you want to study for so long?
As for the age of respondents (agea), it ranges from 14 to 102 years old with mean and median equal to 48.
Both mean and median desired age of retirement (agertr) for infromants is 60. However, there was at least one human who expressed the desire to work till he/she is 120 years old. Such a hardworking one!
Decile income (hinctntal1) ranges from 1 (minimum) to 10 (maximum) with mean and median equal to 5.
Now let’s inspect the distributions of our variables visually. As our outcome variable is age of retirement, we will start with it.
ggplot(ESS1, aes(x = agertr)) +
geom_histogram(fill="blue") +
labs(x = "age of retirement", y = "number of records", title = "Distribution of Desired Age of Retirement") +
theme(plot.title = element_text(face="bold", hjust=0.5)) What do you think about this distribution? Looks like a bell shape, so I would say that age of retirement is normally distributed.
Our first predictor is age, so now let’s move to it.
ggplot(ESS1, aes(x = agea)) +
geom_histogram(fill="blue") +
labs(x = "age", y = "number of records", title = "Distribution of Age") +
theme(plot.title = element_text(face="bold", hjust=0.5)) Well, this one is also similar to bell shape but I would say it is right skewed.
Second predictor is income, let’s procede to it.
ggplot(ESS1, aes(x = hinctnta1)) +
geom_histogram(fill="blue") +
labs(x = "household income", y = "number of records", title = "Distribution of Household Income") +
theme(plot.title = element_text(face="bold", hjust=0.5)) This graph doesn’t look normal but still we can make some conclusions. Most of the respondents’ income is very low, between 1 and 2 deciles, while the number of people with incom close to 10 is much less.
The last continuous predictor is education years.
ggplot(ESS1, aes(x = eduyrs)) +
geom_histogram(fill="blue") +
labs(x = "years of education", y = "number of records", title = "Distribution of Years of Completed Education") +
theme(plot.title = element_text(face="bold", hjust=0.5)) The distribution of years spend on education would be considered to be normal if it didn’t have this tail on the right, so it is right skewed.
As it is not very logical to ask people who don’t work yet or people who are already retired about their desired age of retirement, for our analysis we only need people who are now working on a paid position.
ESS2 <- na.omit(subset(ESS1, icmnact=="In paid work")) Now, we need to understand if our variables are correlated at all. Let’s build few scatterplots and try to guess the correlation coefficients.
ggplot(ESS2, aes(x = agertr, y = agea, color = gndr)) +
geom_point(aes()) +
labs(x = "age of retirement", y = "age", title = "Desired Age of Retirement VS Age", color = "gender") +
theme(plot.title = element_text(face="bold", hjust=0.5)) What can you say about correlation of these variables? Obviously, there is a positive correlation but quite weak.
Now we move on to another pair of variables.
ggplot(ESS2, aes(x = agertr, y = hinctnta1, color = gndr)) +
geom_point() +
labs(x = "age od retirement", y = "household income", title = "Desired Age of Retirement VS Household Income", color = "gender") +
theme(plot.title = element_text(face="bold", hjust=0.5))Looks like no correlation at all…
Let’s procede to the last pair of variables.
ggplot(ESS2, aes(x = agertr, y = eduyrs, color = gndr)) +
geom_point() +
labs(x = "age od retirement", y = "years of education", title = "Desired Age of Retirement VS Years of Education Completed", color = "gender") +
theme(plot.title = element_text(face="bold", hjust=0.5))This one looks like a cloud actually, not like a graph of correlated variables.
Now let’s look at the numbers! We only need the row with agertr (first one).
ESS3 <- select(ESS2, agertr, agea, hinctnta1, eduyrs)
sjp.corr(ESS3, title = "Pearson's Correlation Coefficients") If the correlation coefficient is less than 0.1, the correlation is considered to be almost absent. That is why we were right and there is no correlation between age of retirement and household income (correlation coefficient is equal to 0.06). 0.079 also means almost no correlation between the age of retirement and number of education years. Finally, as we predicted, age of retirement and age have weak positive correlation which is equal to 0.264.
Finally, now we need to construct models! As we have already mentioned, the desired age of retirement is the predicted variable, the one which our models should predict with help of independent variables: age, gender, income, and education.
Let’s start with age and look at the details of our model. Oh, wait! Don’t we need to do something else first? As we remember, minimal age among the respondents is 14, that’s why we need to centralize it.
ESS2$agea_cent <- ESS2$agea - mean(ESS2$agea) Now we can inspect our model with the help of fuction summary().
m1 <- lm(agertr ~ agea_cent, data = ESS2)
summary(m1) ##
## Call:
## lm(formula = agertr ~ agea_cent, data = ESS2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -57.250 -2.506 0.195 2.505 58.728
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 60.55648 0.05998 1009.56 <2e-16 ***
## agea_cent 0.24455 0.01035 23.63 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.169 on 7425 degrees of freedom
## Multiple R-squared: 0.06996, Adjusted R-squared: 0.06983
## F-statistic: 558.5 on 1 and 7425 DF, p-value: < 2.2e-16
Well, what do all these numbers mean? Let me remind you and we will start from the bottom. F-statistic tells us how good the model is, in other words it helps to get if our model is really better than the one that predicts desired age of retirement with mean value. In our case the F-statistic is statistically significant as the p-value < 2.2e-16 and to be significant it needs to be less than 0.05. Thus, we can conclude that our model is better than no model at all. Moving on to adjusted R-squared, it should be mentioned what it is. R-squared is also called the coefficient of determination and it indicates the proportion of variance we can explain with our model. So, our first model explains only 7% of variation in age which people want to be retired at. In coefficients we need only the first column. The intercept is the predicted value of age of retirement when age is equal to the minimum. So, if a respondent is 14 years old he/she is very likely to say that his/her desired age of retirement is equal to 60. The lower number shows how the desired age of retirement changes with the change in age of a person. So, our regression equation would look like this:
\[Predicted Age of Retirement = 60,5 + 0,24 * Age\]
The interpretation sounds like this:
With each increase in the age of a person by one year the age he/she would want to be retired at increases by 0.24.
Let’s move to another model which will predict the desired age of retirement by both age and gender.
m2 <- lm(agertr ~ agea_cent + gndr, data = ESS2)
summary(m2) ##
## Call:
## lm(formula = agertr ~ agea_cent + gndr, data = ESS2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -57.962 -2.723 -0.002 2.384 57.993
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 61.30832 0.08332 735.86 <2e-16 ***
## agea_cent 0.23869 0.01025 23.30 <2e-16 ***
## gndrFemale -1.52691 0.11879 -12.85 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.113 on 7424 degrees of freedom
## Multiple R-squared: 0.09021, Adjusted R-squared: 0.08996
## F-statistic: 368 on 2 and 7424 DF, p-value: < 2.2e-16
According to the significant F-statistic (p-value < 2.2e-16), the model is still good and, according to the R-squared, it now explains 9% of the variation in age of retirement. The desired retirement age for a person of minimum age is now equal to 61 years. The regression equation now looks this way:
\[Predicted Age of Retirement = 61,3 + 0,24 * Age - 1,53 * Gender\]
The interpretation:
With each increase in the age of a person by one year the age he/she would want to be retired at increases by 0.24 and if a respondent is a female, her retired age also decreases by 1.53.
To the next model! Now we explain the age a person would like to be retired at by age, gender, and household income.
m3 <- lm(agertr ~ agea_cent + gndr + hinctnta1, data = ESS2)
summary(m3)##
## Call:
## lm(formula = agertr ~ agea_cent + gndr + hinctnta1, data = ESS2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -58.321 -2.667 0.033 2.424 57.785
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 60.44109 0.17162 352.182 < 2e-16 ***
## agea_cent 0.24357 0.01026 23.745 < 2e-16 ***
## gndrFemale -1.44434 0.11939 -12.097 < 2e-16 ***
## hinctnta1 0.13259 0.02295 5.776 7.95e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.102 on 7423 degrees of freedom
## Multiple R-squared: 0.09428, Adjusted R-squared: 0.09391
## F-statistic: 257.6 on 3 and 7423 DF, p-value: < 2.2e-16
According to the significant F-statistic (p-value < 2.2e-16), the model is again good and, according to the R-squared, it still explains 9% of the variation in age of retirement. The desired retirement age for a person of minimum age is 60 years. The regression equation is here:
\[Predicted Age of Retirement = 60,4 + 0,24 * Age - 1,44 * Gender + 0,13 * Income\]
The interpretation:
a) With each increase in the age of a person by one year the age he/she would want to be retired at increases by 0.24. b) If a respondent is a female, her retired age also decreases by 1.44. c) With each increase in hoyusehold income by 1 decile the retired age increases by 0.13.
This time we also add education years to the model.
m4 <- lm(agertr ~ agea_cent + gndr + hinctnta1 + eduyrs, data = ESS2)
summary(m4) ##
## Call:
## lm(formula = agertr ~ agea_cent + gndr + hinctnta1 + eduyrs,
## data = ESS2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -58.964 -2.634 0.073 2.459 57.756
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 59.09525 0.24139 244.817 < 2e-16 ***
## agea_cent 0.24686 0.01022 24.145 < 2e-16 ***
## gndrFemale -1.52126 0.11930 -12.751 < 2e-16 ***
## hinctnta1 0.07234 0.02410 3.002 0.0027 **
## eduyrs 0.13201 0.01672 7.896 3.31e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.081 on 7422 degrees of freedom
## Multiple R-squared: 0.1018, Adjusted R-squared: 0.1013
## F-statistic: 210.3 on 4 and 7422 DF, p-value: < 2.2e-16
According to the significant F-statistic (p-value < 2.2e-16), the model is good and, according to the R-squared, it already explains 10% of the variation in age of retirement. Well, this model looks like the best one! The desired retirement age for a person of minimum age is 59 years. The regression equation:
\[Predicted Age of Retirement = 59,1 + 0,25 * Age - 1,52 * Gender + 0,07 * Income + 0,13 * Education Years\]
The interpretation:
a) With each increase in the age of a person by one year the age he/she would want to be retired at increases by 0.24. b) If a respondent is a female, her retired age also decreases by 1.52. c) With each increase in household income by 1 decile the retired age increases by 0.07. d) With each increase in years of completed education by one the desired age of retirement increases by 0.13.
Let’s also add some visualizations into our ananlysis.
plot(m4) The first graph named ‘Residuals vs Fitted’ shows that residuals are distributed linearly because the points are equally spread around a line. The ‘Normal Q-Q’ plot illustrates that residuals are distributed normally as the line is almost straight even though there are some outliers. From the third graph called ‘Scale-Location’, we can conclude that the residuals are spead equally along the ranges of predictors, again because of the almost straight line. Finally, the last plot ‘Residuals vs Leverage’ means that we don’t have highly influential observations that don’t get along with the majority of other cases because we don’t have the dots outside of the line of ‘Cook’s distance’.
Now let’s figure out which model among those we made is the best one! For this purpose we will need ANOVA test.
anova(m1, m2) ## Analysis of Variance Table
##
## Model 1: agertr ~ agea_cent
## Model 2: agertr ~ agea_cent + gndr
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 7425 198412
## 2 7424 194093 1 4319.4 165.21 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As p-value < 2.2e-16, we can conclude that F-statistic is statistically significant. Thus, model 2 is better than model 1. It means that age is not enough to predict the desired retirement age, age and gender together predict it better.
anova(m2, m3) ## Analysis of Variance Table
##
## Model 1: agertr ~ agea_cent + gndr
## Model 2: agertr ~ agea_cent + gndr + hinctnta1
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 7424 194093
## 2 7423 193224 1 868.5 33.365 7.949e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As p-value < 7.949e-09, we can conclude that F-statistic is statistically significant. Thus, model 3 is better than model 2. In other words, to predict the desired age of retire besides age and gender you also need to know the household income of a person.
anova(m3, m4) ## Analysis of Variance Table
##
## Model 1: agertr ~ agea_cent + gndr + hinctnta1
## Model 2: agertr ~ agea_cent + gndr + hinctnta1 + eduyrs
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 7423 193224
## 2 7422 191615 1 1609.4 62.339 3.307e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As p-value < 3.307e-15, we can conclude that F-statistic is statistically significant. Thus, model 4 is better than model 3. So, if you really want to predict the desired retirement age of a person you should know his/her age, gender, income, and number of years he/she spent on education.
Congratulations! The model 4 is the winner!