Firstly, a working directory must be set and the relevant packages imported. I am also clearing my environment of any variables left over from prior analyses. Finally, I define a couple of helper functions to save myself from endless copy-pasting.
setwd("C:/Users/matt/MEGA/School/ECN3000/ECN3015/CW01")
require(psych)
require(foreign)
require(colorRamps)
require(tidyverse)
require(gplots)
require(knitr)
require(MASS)
require(nnet)
require(lmtest)
require(stargazer)
rm(list=ls())
#outlogit function used for presenting a multinomial logit model in an appropriate way
outlogit<- function(x){
z <- cbind(round(exp(summary(x)$coefficients), 3), round(summary(x)$coefficients, 3),
round((1 - pnorm(abs(summary(x)$coefficients/summary(x)$standard.errors), 0, 1)) * 2, 3))
z <- cbind(z, ifelse(z[,3]<0.01, 99, ifelse(z[,3]<0.05, 95, ifelse(z[,3]<0.1, 90, 0) ) ) )
colnames(z)<- c("OR","coefficient", "p-value", "sig")
return(z)
}
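#MattHist function: histogram with an overlaid density curve, a fitted normal curve and a small horizontal boxplot beneath the bars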
MattHist <- function(v, color="deepskyblue", main=NULL, breaks="Sturges"){
g<-hist(v, breaks=breaks, plot=FALSE)
hist(v,
freq = FALSE,
breaks= breaks,
col=paste0(color, "1"),
lty=1,
main = main,
axes=TRUE,
ylim = c(-max(g$density)/20,max(g$density)))
lines(density(v, adjust=5, na.rm=TRUE),
col=paste0(color,"4"),
lwd=2)
curve(dnorm(x , mean=mean(v, na.rm = TRUE), sd=sd(v, na.rm=TRUE)),
col="Black",
lwd=1,
add=TRUE)
boxplot(v,
horizontal=TRUE,
outline=TRUE,
axes=FALSE,
col = paste0(color,"2"),
whisklty=1,
staplelty=0,
boxwex=max(g$density)/10,
add=TRUE,
at=-max(g$density)/20,
pch=16,
outcol=paste0(color,"2"))
rm(list = c("g"))
}
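#sharebox function: horizontal boxplots of p by the categories of v, alongside a bar chart of each category's share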
sharebox<-function(p, v, cts=NULL, main="Breakdown", color="deepskyblue"){
par(oma=c(0,5,0,0.5), mar=c(4,0,1,0.5), cex.axis=0.7, cex.lab=0.7, cex.main=0.7, las=1)
nf<-layout(matrix(c(1,2), 1, 2, byrow = TRUE), widths=c(3,1))
boxplot(p~v,
horizontal=TRUE, outline=TRUE, axes=TRUE,
names=cts,
col = paste0(color,"2"),
main=main,
whisklty=1,
staplelty=0,
notch=FALSE,
boxwex=0.5,
pch=16,
outcol=paste0(color,"2"),
xlab="Subjective Wellbeing")
barplot(prop.table(table(v)),
horiz=TRUE,
las=1,
axes=TRUE,
names.arg = "",
col=paste0(color,"3"),
xlab = "%",
main="Share")
}
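#misobs function: prints the percentage of observations missing for a variable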
misobs<-function(v){
cat(paste0(round(100*length(which(is.na(v)))/length(v)), "% of observations missing"))
}
Then, the data file needs to be imported and cleared of observations which cannot be used due to null values of the predicted variables. The resulting dataset is attached to simplify the code in further steps.
lmsurvey <- read.spss("./DAT/SN7924.sav", use.value.labels=FALSE, to.data.frame=TRUE, reencode = FALSE)
fullCases <- lmsurvey[!(is.na(lmsurvey$WB_SATIS))&!(is.na(lmsurvey$WB_WORTH)),]
attach(fullCases)
cat(c("Original observations:", length(lmsurvey$DG_AGE), "\nComplete observations:", length(fullCases$DG_AGE)))## Original observations: 567481
## Complete observations: 303492
As the histogram of subjective life satisfaction shows, the distribution of responses is negatively skewed, with mode and median both at 8 and a mean of 7.54. Values below 5 are outliers, likely related to factors such as poor health or life in poverty, hence the single bin. Values 5 and 6 have almost identical density, as do 9 and 10. Values 7 and 8 then differ somewhat, with 8 being the mode, possibly reflecting a general preference for even numbers in self-assessments.
For the individual feeling of the extent to which life is worthwhile, similar characteristics apply, with a slightly higher mean of 7.79, fewer outliers and consequently a stronger negative skew.
This shows not only that individuals in the sample tend not to see things in a grave manner, but also that 75% neither feel completely satisfied nor perceive their activities as fully worthwhile.
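As a quick numerical cross-check of these distributional claims, a minimal sketch using the already loaded psych package (not part of the original workflow):
describe(cbind(WB_SATIS, WB_WORTH)) #mean, median, skew and kurtosis for both wellbeing scales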
Because the resulting logit analysis will focus on subjective life satisfaction, the variable shall be recoded into the appropriate bins identified above. The same, however, shall be done for feeling worthwhile, to capture differences in the relative meaning of the two variables. It is, for example, expected that employment may have a positive effect on feeling worthwhile, while not so much on life satisfaction, as working is a burden. Here, I needed to combine 7 and 8 into a single level, as the model could not distinguish between them.
recode <- c(0,0,0,0,0,1,1,2,2,3,3)
fullCases$WB_SATIS5 <- recode[WB_SATIS+1]
fullCases$WB_WORTH5 <- recode[WB_WORTH+1]
detach(fullCases); attach(fullCases)
rm(list="recode")
par(mar=c(2,2,0.5,0), #margins: L&B 2 for axes, T 0.5 for title
cex.axis=0.7, #reduce size of axis numbers
cex.main=0.8, #reduce title size
mfrow=c(2,2))
MattHist(WB_SATIS, "deepskyblue", "Life Satisfaction: Collected", c(0,4.5,6.5,8.5,10.5))
MattHist(WB_SATIS5, "tomato", "Life Satisfaction: Recoded", seq(-0.5, 3.5, 1))
MattHist(WB_WORTH, "turquoise", "Life Worthwhile: Collected", c(0,4.5,6.5,8.5,10.5))
MattHist(WB_WORTH5, "orange", "Life Worthwhile: Recoded", seq(-0.5, 3.5, 1))
form<-c("WB_SATIS5", "PRED")
Due to the similar distributions of these variables, a heatmap comes in handy for a first-look inspection of any potential relationship between the two.
Below, a frequency matrix is graphed, expressing the number of respondents in thousands. There indeed seems to be a roughly linear relationship between the two variables.
Looking further, their correlation coefficient, estimated to be in the neighbourhood of 0.61, is significant, and a simple one-variable OLS regression yields a slope with a p-value of essentially zero, confirming a strong association (though not causation by itself).
heatmap.2(xtabs(~WB_SATIS5 + WB_WORTH5), #first is Y, second is X (rows, columns)
cellnote = round(xtabs(~WB_SATIS5 + WB_WORTH5)/1000, 0),
xlab="Worthwhile level", #FIRST
ylab="Satisfaction level", #SECOND
notecol="black",
density.info="none",
trace="none",
col = rev(heat.colors(20)),
dendrogram="none",
Colv="NA",
Rowv="NA",
lwid=c(0.1,4.5),
lhei=c(0.1,4.5),
key=FALSE,
symm=TRUE)
cat("Correlation coefficient 95% confidence interval: ",
cor.test(WB_SATIS5, WB_WORTH5)$conf.int,
"\n")## Correlation coefficient 95% confidence interval: 0.6008662 0.6053933
summary(lm(WB_SATIS5~WB_WORTH5))$coefficients## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.6925858 0.003378361 205.0065 0
## WB_WORTH5 0.6204777 0.001489520 416.5622 0
In comparison, the self-assessed variables for the respondent's immediate mental state on the day prior to the survey exhibit similar characteristics, with a negative relationship between them, yet more distinct distributions. Because these two variables relate to a specific point in time, the very short run, they do not seem ideal for estimation from variables which vary at a far lower frequency.
For this reason, only a brief graphical description follows, for basic insight.
par(mar=c(2,2,0.5,0), #margins: L&B 2 for axes, T 0.5 for title
cex.axis=0.7, #reduce size of axis numbers
cex.main=0.8, #reduce title size
mfrow=c(1,2) #1row 2cols plot matrix
)
MattHist(WB_HAPPY, "darkgoldenrod", "Feeling happy", 10)
MattHist(WB_ANXIOUS, "sienna", "Feeling anxious", 10)
heatmap.2(xtabs(~WB_HAPPY + WB_ANXIOUS),
cellnote = round(xtabs(~WB_HAPPY + WB_ANXIOUS)/1000, 0),
ylab="Happiness level",
xlab="Anxiety level",
notecol="black",
density.info="none",
trace="none",
col = rev(heat.colors(20)),
dendrogram="none",
Colv="NA",
Rowv="NA",
lwid=c(0.1,4.5),
lhei=c(0.1,4.5),
key=FALSE,
symm=TRUE)
cat("Correlation coefficient 95% confidence interval:\n", cor.test(WB_HAPPY, WB_ANXIOUS)$conf.int)## Correlation coefficient 95% confidence interval:
## -0.4697965 -0.4642267
cat("\nLinear model estimate:\n")##
## Linear model estimate:
summary(lm(WB_HAPPY~WB_ANXIOUS))$coefficients## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.4515512 0.005013846 1685.6422 0
## WB_ANXIOUS -0.3547329 0.001220563 -290.6305 0
Overall, this process has shown that the sampled British population generally reports high levels of happiness and of feeling worthwhile. Consequently, there must be a handful of factors causing positive and negative deviations from this baseline.
A logit model is the most feasible technique for inspecting said factors and will be utilized in the final section.
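As a minimal illustration of the technique (logit0 is an illustrative name only; the full specification is developed in the final section), the recoded satisfaction variable could be regressed on the recoded worthwhile variable alone:
logit0 <- multinom(factor(WB_SATIS5) ~ WB_WORTH5, maxit=200) #proof-of-concept multinomial logit
summary(logit0)$coefficients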
The dataset “Annual Population Survey: Personal Well-Being, April 2012 - March 2015” was obtained from the UK Data Service. After removal of variables which were too detailed to be useful, the following remained.
## [1] "DG_AGE" "DG_GENDER" "DG_MARSTA" "DG_REGION" "DG_EDU"
## [6] "DG_ORIGIN" "DG_UKSINCE" "DG_ETH" "DG_RELIG" "EC_ECAC"
## [11] "EC_UNDUR" "EC_FTPTW" "EC_PERMJOB" "EC_HOMEJOB" "EC_JOBTYPE"
## [16] "EC_JOBSCTR" "EC_PPWK" "EC_WORKHR" "EC_BENFTS" "AC_TYPE"
## [21] "AC_TENURE" "AC_LSTMOV" "HE_QHEALTH" "HE_SMOK" "HE_SMOKpd"
## [26] "HE_SMOKever" "WB_SATIS" "WB_WORTH" "WB_ANXIOUS" "WB_HAPPY"
## [31] "BS_PHONE" "BS_PROXY" "WB_SATIS5" "WB_WORTH5"
The variables classifiable as scales which shall be used in this piece are the following.
The age variable is quite simple and expected to have an effect. No observations are missing, and the variable is roughly normally distributed with a fairly low degree of kurtosis. The minimum age is 16, due to the removal of observations lacking self-assessment values, as younger subjects were not trusted to provide them honestly.
This variable can be treated as is in the next steps.
misobs(DG_AGE)## 0% of observations missing
par(oma=c(2,2,0,0), mar=c(0,0,0.6,0), cex.main=0.8, cex.axis=0.8)
MattHist(DG_AGE, "darkslategray", "Age")
A surprising element is the quadratic shape that the relationship between age and the subjective level of satisfaction exhibits. Apparently, people around 45 are in their mid-life crisis!
lm_age <- lm(WB_SATIS5 ~ I(DG_AGE+16) + I((DG_AGE+16)^2), na.action = na.omit)
summary(lm_age)##
## Call:
## lm(formula = WB_SATIS5 ~ I(DG_AGE + 16) + I((DG_AGE + 16)^2),
## na.action = na.omit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.40986 -0.11104 0.02414 0.89885 1.04616
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.565e+00 2.039e-02 125.82 <2e-16 ***
## I(DG_AGE + 16) -1.982e-02 6.169e-04 -32.13 <2e-16 ***
## I((DG_AGE + 16)^2) 1.606e-04 4.470e-06 35.93 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8164 on 303489 degrees of freedom
## Multiple R-squared: 0.006323, Adjusted R-squared: 0.006317
## F-statistic: 965.6 on 2 and 303489 DF, p-value: < 2.2e-16
par(oma=c(0,0,0,0), mar=c(4,4,2,1), cex.main=0.8, cex.axis=0.8)
plot(DG_AGE,WB_SATIS5,
pch=16,
xlab="Age",
ylab="Life Satisfaction",
main="Age and life satisfaction",
col="grey80")
grid(nx = NULL, col = "lightgrey", lty = "dotted", equilogs = TRUE)
agelmx<-seq(0, 100, 1)
agelmy<-predict(lm_age, list(DG_AGE=agelmx), type="response") #?predict
lines(agelmx, agelmy, lwd=2, col="blue")
which.is.max(-agelmy)## [1] 47
form<-rbind(form, c("DG_AGE", "-16 & quad"))
rm(list=c("lm_age", "agelmy", "agelmx"))Firstly, it is important to note that the income variable is only available in cases where EC_ECAC=1, that is where the individual is in employment. Furthermore, 14% of total remaining observations is missing due to rejection to answer.
Therefore, the variable will act as a slope dummy: out-of-employment observations are filled with a value of 1, so that after the logarithmic transformation the term equals 0. It is important that this is interpreted correctly from the output! A toy illustration of the trick follows.
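The sketch below uses hypothetical values (toy_income and toy_ecac are not from the survey), purely to show the recode:
toy_income <- c(NA, NA, 250, 400) #hypothetical weekly pay; NA = not reported
toy_ecac <- c(2, 3, 1, 1) #hypothetical economic activity codes (1 = employed)
toy_recode <- ifelse(is.na(toy_income) & toy_ecac != 1, 1, toy_income) #out-of-employment cases set to 1
log(toy_recode) #gives 0, 0, ~5.52, ~5.99, so the log-income term vanishes for the recoded cases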
kable(rbind(table(ifelse(is.na(EC_PPWK)|EC_PPWK == 0, "Missing", "Available"), EC_ECAC),
100*round(prop.table(table(ifelse(is.na(EC_PPWK), "Missing [%]", "Available [%]"), EC_ECAC)), 2)))
| | 1 | 2 | 3 |
|---|---|---|---|
| Available | 120079 | 0 | 0 |
| Missing | 42395 | 11761 | 129257 |
| Available [%] | 40 | 0 | 0 |
| Missing [%] | 14 | 4 | 43 |
fullCases$EC_PPWK_ECAC1 <- ifelse(is.na(EC_PPWK)&EC_ECAC!=1, 1, EC_PPWK); detach(fullCases); attach(fullCases)
form<-rbind(form, c("EC_PPWK_ECAC1", "log"))
kable(
rbind(
table(ifelse(is.na(EC_PPWK_ECAC1)|EC_PPWK_ECAC1 == 0, "Missing", "Available"), EC_ECAC),
100*round(prop.table(table(ifelse(is.na(EC_PPWK_ECAC1), "Missing [%]", "Available [%]"), EC_ECAC)), 2))
)
| | 1 | 2 | 3 |
|---|---|---|---|
| Available | 120079 | 11761 | 129257 |
| Missing | 42395 | 0 | 0 |
| Available [%] | 40 | 4 | 43 |
| Missing [%] | 14 | 0 | 0 |
par(oma=c(2,2,0,0), mar=c(0,0,0.6,0), cex.main=0.8, cex.axis=0.8, mfrow=c(1,2))
MattHist(log(EC_PPWK), "deepskyblue", "Log of income")
MattHist(log(EC_PPWK_ECAC1), "deepskyblue", "Log of income, including inactive")
kable(
rbind(
c("EC_PPWK", round(cor.test(EC_PPWK, WB_WORTH5, use="complete.obs")$conf.int[1:2], 3)),
c("EC_PPWK_ECAC1", round(cor.test(EC_PPWK_ECAC1, WB_WORTH5, use="complete.obs")$conf.int[1:2], 3))),
col.names = c("Variable","min", "max"), caption = "Correlation with WB_WORTH5 confidence intervals")| Variable | min | max |
|---|---|---|
| EC_PPWK | 0.019 | 0.03 |
| EC_PPWK_ECAC1 | 0.042 | 0.05 |
misobs(EC_PPWK_ECAC1)## 14% of observations missing
The massive group of outliers generated through the recoding process does influence most descriptive statistics of the variable; it remains usable nonetheless, thanks to its dummy properties and the otherwise log-normal distribution of the original information. This way, the missing observations are reduced to 14%, which may still be workable; the progress of this exercise shall show.
Furthermore, the transformation had a positive effect on the correlation coefficient, as one may expect from such an artificial variable.
In contrast to the somewhat closely related income, hours worked are expected to have a negative effect on the predicted variable.
It might be possible to combine these two variables into one, that is, an hourly rate (a sketch follows below). However, there are two important considerations that argue against doing so.
Firstly, the number of missing observations for hours worked is far lower, which may enable insight into variation among individuals who declined to provide information on their income.
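For completeness, a hedged sketch of what such a combined variable could look like (hour_rate is a hypothetical name and is not used further):
hour_rate <- ifelse(!is.na(EC_WORKHR) & EC_WORKHR > 0, EC_PPWK/EC_WORKHR, NA) #hourly pay where hours are positive
round(cor.test(hour_rate, WB_SATIS5)$conf.int, 3) #correlation with recoded life satisfaction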
kable(rbind(table(ifelse(is.na(EC_WORKHR)|EC_WORKHR == 0, "Missing", "Available"), EC_ECAC),
100*round(prop.table(table(ifelse(is.na(EC_WORKHR), "Missing [%]", "Available [%]"), EC_ECAC)), 2)),
caption = "WORKHR Before Recoding")| 1 | 2 | 3 | |
|---|---|---|---|
| Available | 158856 | 0 | 0 |
| Missing | 3618 | 11761 | 129257 |
| Available [%] | 52 | 0 | 0 |
| Missing [%] | 1 | 4 | 43 |
fullCases$EC_WORKHR_ECAC1 <- ifelse(is.na(EC_WORKHR)&EC_ECAC!=1, 0, EC_WORKHR); detach(fullCases); attach(fullCases)
form<-rbind(form, c("EC_WORKHR_ECAC1", "quad"))
kable(rbind(table(ifelse(is.na(EC_WORKHR_ECAC1), "Missing", "Available"), EC_ECAC),
100*round(prop.table(table(ifelse(is.na(EC_WORKHR_ECAC1), "Missing [%]", "Available [%]"), EC_ECAC)), 2)
), caption = "WORKHR After Recoding")| 1 | 2 | 3 | |
|---|---|---|---|
| Available | 158955 | 11761 | 129257 |
| Missing | 3519 | 0 | 0 |
| Available [%] | 52 | 4 | 43 |
| Missing [%] | 1 | 0 | 0 |
par(oma=c(2,2,0,0), mar=c(0,0,0.6,0), cex.main=0.8, cex.axis=0.8)
MattHist(EC_WORKHR_ECAC1, main = "Hours worked")
misobs(EC_WORKHR_ECAC1)## 1% of observations missing
Secondly, a quadratic component is expected in the relationship, due to diminishing satisfaction from work. The highest level of life satisfaction appears at 43 weekly hours at work, suggesting that the 40-hour week is appropriate.
Not so fast! I have to admit that the zero values assigned to people out of employment did create some degree of bias. The blue line is fitted on the sample including the out-of-employment zeros, while the red one is not. Apparently, some people just love their job. Regardless, this effect should disappear once the appropriate dummies are included. It has also already been shown that people out of a job are less happy. Let's inspect it further!
lm_workhr <- lm(WB_SATIS5 ~ EC_WORKHR_ECAC1 + I(EC_WORKHR_ECAC1^2), na.action = na.omit)
lm_workhr2 <- lm(WB_SATIS5 ~ EC_WORKHR + I(EC_WORKHR^2), na.action = na.omit)
par(oma=c(0,0,0,0), mar=c(4,4,2,1), cex.main=0.8, cex.axis=0.8)
plot(EC_WORKHR,WB_SATIS5,
pch=16,
xlab="Weekly hours worked",
ylab="Life Satisfaction",
main="Hours Worked and Life Satisfaction",
col="grey80")
grid(nx = NULL, col = "lightgrey", lty = "dotted", equilogs = TRUE)
whrlmx<-seq(0, 100, 1)
whrlmy<-predict(lm_workhr, list(EC_WORKHR_ECAC1=whrlmx), type="response") #?predict
lines(whrlmx, whrlmy, lwd=2, col="blue")
whr2lmy<-predict(lm_workhr2, list(EC_WORKHR=whrlmx), type="response") #?predict
lines(whrlmx, whr2lmy, lwd=2, col="red")
which.is.max(-whr2lmy)## [1] 48
which.is.max(whrlmy)## [1] 44
rm(list=c("lm_workhr", "lm_workhr2", "whr2lmy", "whrlmx", "whrlmy"))Because of the amount of available values, this shall be omitted in the first analyses. Maybe create a dummy for people new to the UK?
misobs(DG_UKSINCE)## 89% of observations missing
boxplot.stats(DG_UKSINCE)$stats## [1] 1926 1974 1998 2006 2015
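A hedged sketch of such a dummy (DG_NEWUK and the 2010 cut-off are illustrative assumptions; the variable is not added to the dataset):
DG_NEWUK <- ifelse(!is.na(DG_UKSINCE) & DG_UKSINCE >= 2010, 1, 0) #arrived in the UK in 2010 or later
round(100*mean(DG_NEWUK), 2) #share of recent arrivals in the sample, in %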
#plot(DG_UKSINCE, WB_SATIS5)
In terms of ordinal categorical variables, the following of those available shall be used.
Shall we group GCSE and above, or just say that education has no significant effect? Of course, having no qualifications past the age of 16 goes along with lower wellbeing, unfair as that may sound. There are a lot of missing observations, and the chosen treatment may create a lot of false negatives in the dummy: let's just assume all these people have GCSE-level education or above.
misobs(DG_EDU)## 24% of observations missing
sharebox(WB_SATIS, DG_EDU,
cts=c("Postgraduate", "Graduate", "A-level", "GCSE", "Other", "None", "Not known"),
main="Breakdown by Education",
color="lightsalmon")fullCases$DG_LOWEDU<-ifelse(is.na(DG_EDU), 0, ifelse(DG_EDU>=5, 1, 0))
detach(fullCases); attach(fullCases)
form<-rbind(form, c("DG_LOWEDU", "DUM"))
drop <- cbind("DG_EDU", drop)There does not seem to be distributional difference between employed and inactive individuals. Furthermore, unemployed people may be distinguished by the unemployment duration variable. Regardless, I am creating two dummies for each, expecting them to be removed while clearing model from insignificant variables.
misobs(EC_ECAC)## 0% of observations missing
sharebox(WB_SATIS, EC_ECAC,
cts=c("Employed", "Unemployed", "Inactive"),
main="Breakdown by Economic Activity",
color="aquamarine")fullCases$EC_UNEMP<-ifelse(EC_ECAC == 2, 1, 0)
fullCases$EC_INAC<-ifelse(EC_ECAC == 3, 1, 0)
detach(fullCases); attach(fullCases)
form<-rbind(form, c("EC_UNEMP", "DUM"))
form<-rbind(form, c("EC_INAC", "DUM"))
drop <- cbind("EC_ECAC", drop)Availability of this variable is decent, given that it is available for most unemployed individuals. Surprisingly, a sample of individuals in employment presents with a value as well.
Still, some influence of unemployment duration is observed on subjective level of wellbeing. Group for 3y+ and 3m-3y seems feasible People unemployed for less than 3 months shall be kept as a benchmark.
misobs(EC_UNDUR)## 96% of observations missing
kable(100*round(prop.table(table(ifelse(is.na(EC_UNDUR), "Missing [%]", "Available [%]"), EC_ECAC)), 2))
| | 1 | 2 | 3 |
|---|---|---|---|
| Available [%] | 0 | 4 | 0 |
| Missing [%] | 53 | 0 | 43 |
sharebox(WB_SATIS, EC_UNDUR,
cts=c("3m-", "3-6m", "6-12m", "12-18m", "18-24m", "2-3y", "3-4y", "4-5y", "5y+"),
main="Breakdown by Unemployment Duration",
color="yellow")fullCases$EC_UNEMP3Y<-ifelse(is.na(EC_UNDUR), 0, ifelse(EC_UNDUR >= 7, 1, 0))
fullCases$EC_UNEMP3M<-ifelse(is.na(EC_UNDUR), 0, ifelse(EC_UNDUR %in% 2:6, 1, 0))
detach(fullCases); attach(fullCases)
form<-rbind(form, c("EC_UNEMP3Y", "DUM"))
form<-rbind(form, c("EC_UNEMP3M", "DUM"))
drop <- cbind("EC_UNDUR", drop)Here, we can observe that people with a routine job, or those who never worked exhibit a different distribution than those in the higher groups. These shall therefore receive a dummy, with the higher grades as a benchmark.
misobs(EC_JOBTYPE)## 0% of observations missing
sharebox(WB_SATIS, EC_JOBTYPE, main="Breakdown by Job Type",
cts=c("High Mgmt/Prof", "Lower Mgmt", "Intermediate", "Small Emp's", "Tech & S'visory",
"Semi-routine", "Routine", "Never Worked"),
color="navajowhite")fullCases$EC_ROUTJ <- ifelse(EC_JOBTYPE >= 6, 1, 0)
detach(fullCases); attach(fullCases)
form<-rbind(form, c("EC_ROUTJ", "DUM"))
drop <- cbind("EC_JOBTYPE", drop)Owning one’s house does remove some degree of dissatisfaction, just as not having to pay rent does. Let’s keep these as benchmark. Renters shall receive their own dummy. The 6 squatters shall be left in the benchmark category
misobs(AC_TENURE)## 0% of observations missing
sharebox(WB_SATIS, AC_TENURE,
cts=c("Owned", "Mortgage", "Part Rent", "Rented", "Rent Free", "Squat"),
main="Breakdown by Accommodation Tenure",
color="seagreen")fullCases$AC_RENT<-ifelse(AC_TENURE %in% 3:4, 1, 0)
detach(fullCases); attach(fullCases)
form<-rbind(form, c("AC_RENT", "DUM"))
drop <- cbind("AC_TENURE", drop)Based on the boxplot, use of this variable shouldn’t be necessary. It however also shows an issue with sample selection, as people generally move relatively often.
misobs(AC_LSTMOV)## 0% of observations missing
sharebox(WB_SATIS, AC_LSTMOV,
cts=c("1-", "1-2", "2-3", "3-5", "5-10", "10+"),
main="Breakdown by years at location",
color= "tomato")drop <- cbind("AC_LSTMOV", drop) Good and Very good shall be kept as a benchmark. Three dummies shall be created.
misobs(HE_QHEALTH)## 3% of observations missing
heatmap.2(xtabs(~WB_SATIS5 + HE_QHEALTH), #first is Y, second is X (rows, columns)
cellnote = round(xtabs(~WB_SATIS5 + HE_QHEALTH)/1000, 0),
xlab="Subjective Health level", #FIRST
ylab="Satisfaction level", #SECOND
notecol="black",
density.info="none",
trace="none",
col = rev(heat.colors(20)),
dendrogram="none",
Colv="NA",
Rowv="NA",
lwid=c(0.1,4.5),
lhei=c(0.1,4.5),
key=FALSE,
symm=TRUE)
sharebox(WB_SATIS, HE_QHEALTH,
cts=c("Very good", "Good", "Fair", "Bad", "Very bad"),
main="Breakdown by Health",
color="orangered")fullCases$HE_Q3<-ifelse(HE_QHEALTH==3, 1, 0)
fullCases$HE_Q4<-ifelse(HE_QHEALTH==4, 1, 0)
fullCases$HE_Q5<-ifelse(HE_QHEALTH==5, 1, 0)
detach(fullCases); attach(fullCases)
form<-rbind(form, c("HE_Q3", "DUM"))
form<-rbind(form, c("HE_Q4", "DUM"))
form<-rbind(form, c("HE_Q5", "DUM"))
drop <- cbind("HE_QHEALTH", drop)After scales and ordinal categories have been dealt with, dummies shall be created for nominal categories to capture additional effects of relevant factors. The sharebox function will assist selection of appropriate attributes to consider. Some of the insignificant distinguishing variables will be dropped without recoding, making my life easier.
The gender variable presents with a minimum of 1 and maximum of 2, for which reason it is recoded into a dummy with 1 for males After doing so, the share of males is determined. The boxplot has no point here, interesting deviations from the overall distribution of predicted variable.
misobs(DG_GENDER)## 0% of observations missing
summary(DG_GENDER) #Need to recode from 1 or 2 to dummy## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 2.000 1.561 2.000 2.000
fullCases$DG_MALE <- ifelse(DG_GENDER == 2, 0, 1)
detach(fullCases); attach(fullCases)
cat(c("Share of males:\t", round(mean(DG_MALE), 2), "%"))## Share of males: 0.44 %
form<-rbind(form, c("DG_MALE", "DUM"))
drop <- cbind("DG_GENDER", drop)For marital status, two dummies are created to represent partnered individuals and individuals, who are either separated, divorced, or widowed.
misobs(DG_MARSTA)## 0% of observations missing
sharebox(WB_SATIS, DG_MARSTA,
cts=c("Single", "Married", "Separated", "Divorced", "Widowed", "CP", "Sep CP", "Div CP", "Wid CP"),
main="Breakdown by Marital Status",
color=c("blue", "green", "red", "red", "red", "green", "red", "red", "red")) fullCases$DG_MAR_PART <- ifelse(DG_MARSTA %in% c(2, 6), 1, 0)
fullCases$DG_MAR_EXPART <- ifelse(DG_MARSTA %in% c(3:5, 7:9), 1, 0)
#Single left as benchmark
detach(fullCases); attach(fullCases)
form<-rbind(form, c("DG_MAR_PART", "DUM"))
form<-rbind(form, c("DG_MAR_EXPART", "DUM"))
drop <- cbind("DG_MARSTA", drop) par(oma=c(0,0,0,0), mar=c(0,0,0.7,0), cex=0.8)
pie(c("Expartnered"=mean(DG_MAR_EXPART),
"Partnered"=mean(DG_MAR_PART),
"Single/other"=1-mean(DG_MAR_PART)-mean(DG_MAR_EXPART)),
col=rainbow(3),
main="Marital Status Breakdown") ####Religion In general, Atheists and buddhists tend to be less prone to reaching life satisfaction level of 10, yet present with lower variance in general. Similarity of the two camps could be attributed to the fact that buddhism is a life philospohy rather than religion, as it does not follow concepts of faith or worship.
In addition, Muslims seem to vary the most.
misobs(DG_RELIG)## 2% of observations missing
sharebox(WB_SATIS, DG_RELIG,
cts=c("Atheist", "Christian", "Budhist", "Hindu", "Jewish", "Muslim", "Sikh", "Other"),
main="Breakdown by Religion",
color="rosybrown")fullCases$DG_RELIGIOUS <- ifelse(DG_RELIG %in% c(1, 3), 0, 1)
detach(fullCases); attach(fullCases)
form<-rbind(form, c("DG_RELIG", "DUM"))
drop <- cbind("DG_RELIG", drop)
cat(c("Share of religious: ", 100*round(mean(DG_RELIGIOUS, na.rm = TRUE), 2), "%"))## Share of religious: 72 %
London… As always! A dummy is created.
misobs(DG_REGION)## 0% of observations missing
sharebox(WB_SATIS, DG_REGION,
cts=c("North East", "North West", "Yorkshire+", "East Midlands", "West Midlands", "East of England", "London",
"South East", "South West", "Northern Ireland", "Scotland", "Wales"),
main="Breakdown by Region",
color="paleturquoise")fullCases$DG_LONDON <- ifelse(DG_REGION == "E12000007", 1, 0)
form<-rbind(form, c("DG_LONDON", "DUM"))
detach(fullCases); attach(fullCases)
drop <- cbind("DG_REGION", drop)The origin variable has 12 values, less than 0.5% missing. People from Pakistan apparently tend to be more disappointed. 1684 individuals are not really worth the dummy, are they? So that’s this off the table
misobs(DG_ORIGIN)## 0% of observations missing
sharebox(WB_SATIS, DG_ORIGIN,
cts=c("China", "Hong Kong", "India", "ROI", "Pakistan", "Poland", "England", "NI", "Scotland", "Wales",
"Other UK", "Other"),
main="Breakdown by Origin",
color="firebrick")detach(fullCases); attach(fullCases)
drop <- cbind("DG_ORIGIN", drop)No missing, that’s good. 92% are white, which kinda sucks, though.
None of the groups are weird enough to be worth a dummy. But because nobody seems to be significantly happier than the whites, non-white people will receive a dummy.
misobs(DG_ETH)## 0% of observations missing
sharebox(WB_SATIS, DG_ETH,
cts=c("White", "Mixed", "Indian", "Pakistani", "Bangladeshi", "Chinese", "Other Asian", "Black", "Other"),
main = "Breakdown by Ethnicity",
color="plum")fullCases$DG_NONW<-ifelse(DG_ETH==1,0,1)
detach(fullCases); attach(fullCases)
form<-rbind(form, c("DG_NONW", "DUM"))
drop <- cbind("DG_ETH", drop) We are only interested in people who currently smoke, as people who claim to have smoked are either in denial, or as happy as non-smoker
misobs(DG_ORIGIN)## 0% of observations missing
sharebox(WB_SATIS, HE_SMOKpd,
main="Breakdown by Smoking Status",
cts=c("Smoker", "Ex-smoker", "Non-smoker"),
color="navajowhite")fullCases$HE_SMOKER <- ifelse(HE_SMOKpd ==1, 1, 0)
detach(fullCases); attach(fullCases)
form<-rbind(form, c("HE_SMOKER", "DUM"))
drop <- cbind("HE_SMOK", "HE_SMOKpd", "HE_SMOKever", drop)Aaaand now it is the right time to drop all of the original variables which have been recoded and look at what is left to work with and shall be incorporated into the model!
drop<-cbind(drop, "EC_PERMJOB", "EC_FTPTW", "EC_HOMEJOB", "EC_JOBSCTR", "EC_BNFTS", "AC_TYPE", "BS_PHONE", "BS_PROXY")
fullCases <- fullCases[,!names(fullCases) %in% drop]
rm(list="drop")
colnames(fullCases)## [1] "DG_AGE" "DG_UKSINCE" "EC_PPWK"
## [4] "EC_WORKHR" "EC_BENFTS" "WB_SATIS"
## [7] "WB_WORTH" "WB_ANXIOUS" "WB_HAPPY"
## [10] "WB_SATIS5" "WB_WORTH5" "EC_PPWK_ECAC1"
## [13] "EC_WORKHR_ECAC1" "DG_LOWEDU" "EC_UNEMP"
## [16] "EC_INAC" "EC_UNEMP3Y" "EC_UNEMP3M"
## [19] "EC_ROUTJ" "AC_RENT" "HE_Q3"
## [22] "HE_Q4" "HE_Q5" "DG_MALE"
## [25] "DG_MAR_PART" "DG_MAR_EXPART" "DG_RELIGIOUS"
## [28] "DG_LONDON" "DG_NONW" "HE_SMOKER"
Now that all variables to be included are handled and ready, the first multinomial logit model can be estimated. The variables to be used have been accumulated in the form object along the way.
cat(paste0(form[,1], ";"))## WB_SATIS5; DG_AGE; EC_PPWK_ECAC1; EC_WORKHR_ECAC1; DG_LOWEDU; EC_UNEMP; EC_INAC; EC_UNEMP3Y; EC_UNEMP3M; EC_ROUTJ; AC_RENT; HE_Q3; HE_Q4; HE_Q5; DG_MALE; DG_MAR_PART; DG_MAR_EXPART; DG_RELIGIOUS; DG_LONDON; DG_NONW; HE_SMOKER;
Now, the predicted variable needs to be converted into a factor and a reference satisfaction level chosen.
fullCases$WB_SATIS5 <- as.factor(WB_SATIS5)
levels(fullCases$WB_SATIS5) <- c("0-4", "5-6", "7-8", "9-10")
fullCases$WB_SATIS5 = relevel(fullCases$WB_SATIS5, ref = "7-8")
detach(fullCases); attach(fullCases)
And the estimation of the model is executed.
logit1<-multinom(WB_SATIS5~
I(DG_AGE-16)+
I((DG_AGE-16)^2)+
I(DG_MALE*(DG_AGE-16))+
I(DG_MALE*(DG_AGE-16)^2)+
log(EC_PPWK_ECAC1)+
EC_WORKHR_ECAC1+
I(EC_WORKHR_ECAC1^2)+
DG_LOWEDU+
EC_UNEMP+
EC_INAC+
EC_UNEMP3M+
EC_UNEMP3Y+
EC_ROUTJ+
AC_RENT+
HE_Q3+
HE_Q4+
HE_Q5+
DG_MALE+
DG_MAR_PART+
DG_MAR_EXPART+
DG_RELIGIOUS+
DG_LONDON+
DG_NONW+
HE_SMOKER,
na.action = na.omit,
maxit=1000)
logit1s<-summary(logit1)
The most appropriate way to display the model output is using the stargazer package, as done below. In addition to giving the output an easily interpretable format, it reports the standard errors and marks the significance level of each coefficient.
stargazer(logit1,
title="Logit model: Subjective life satisfaction level",
type="text",
style="qje",
keep.stat="chi2",
order="Constant")##
## Logit model: Subjective life satisfaction level
## ====================================================================
## 0-4 5-6 9-10
## (1) (2) (3)
## --------------------------------------------------------------------
## Constant -1.745*** -0.793*** -0.135***
## (0.002) (0.003) (0.004)
##
## I(DG_AGE - 16) 0.068*** 0.030*** -0.024***
## (0.002) (0.001) (0.001)
##
## I((DG_AGE - 16)^2) -0.001*** -0.0004*** 0.0004***
## (0.00003) (0.00002) (0.00002)
##
## I(DG_MALE * (DG_AGE - 16)) -0.010*** 0.002 -0.002**
## (0.002) (0.001) (0.001)
##
## I(DG_MALE * (DG_AGE - 16)^2) 0.0001* -0.00005** 0.00003
## (0.00004) (0.00002) (0.00002)
##
## log(EC_PPWK_ECAC1) -0.556*** -0.332*** -0.064***
## (0.002) (0.010) (0.008)
##
## EC_WORKHR_ECAC1 0.029*** 0.028*** -0.014***
## (0.002) (0.003) (0.002)
##
## I(EC_WORKHR_ECAC1^2) -0.0002*** -0.0002*** 0.0002***
## (0.00004) (0.00004) (0.00003)
##
## DG_LOWEDU 0.038*** 0.102*** 0.209***
## (0.003) (0.007) (0.010)
##
## EC_UNEMP -1.165*** -0.721*** -0.800***
## (0.003) (0.007) (0.003)
##
## EC_INAC -2.010*** -1.154*** -0.237***
## (0.003) (0.008) (0.006)
##
## EC_UNEMP3M 0.192*** 0.251*** 0.095***
## (0.002) (0.005) (0.002)
##
## EC_UNEMP3Y 0.305*** 0.295*** 0.291***
## (0.0004) (0.001) (0.0004)
##
## EC_ROUTJ 0.125*** 0.170*** 0.037***
## (0.005) (0.008) (0.009)
##
## AC_RENT 0.231*** 0.230*** -0.016*
## (0.008) (0.007) (0.009)
##
## HE_Q3 1.065*** 0.719*** -0.528***
## (0.005) (0.010) (0.012)
##
## HE_Q4 2.363*** 1.317*** -0.689***
## (0.004) (0.005) (0.001)
##
## HE_Q5 3.300*** 1.519*** -0.437***
## (0.001) (0.001) (0.0003)
##
## DG_MALE 0.326*** -0.076*** -0.114***
## (0.001) (0.001) (0.001)
##
## DG_MAR_PART -0.584*** -0.366*** 0.408***
## (0.005) (0.007) (0.007)
##
## DG_MAR_EXPART 0.200*** 0.108*** 0.001
## (0.004) (0.007) (0.005)
##
## DG_RELIGIOUS -0.151*** -0.033*** 0.130***
## (0.004) (0.011) (0.010)
##
## DG_LONDON 0.071*** 0.062*** -0.162***
## (0.001) (0.002) (0.002)
##
## DG_NONW 0.357*** 0.404*** 0.057***
## (0.001) (0.003) (0.002)
##
## HE_SMOKER 0.529*** 0.263*** -0.006
## (0.006) (0.008) (0.010)
##
## ====================================================================
## Notes: ***Significant at the 1 percent level.
## **Significant at the 5 percent level.
## *Significant at the 10 percent level.
Using a more direct approach, the code below also works for determining the odds ratios and p-values by hand.
exp(logit1s$coefficients)
round(2*(1-pnorm(abs(logit1s$coefficients/logit1s$standard.errors),0,1)), 3)
The only coefficients not significant at the 95% confidence level are the male-age interaction terms and, in the 9-10 equation, AC_RENT, DG_MAR_EXPART and HE_SMOKER, as the check below confirms.
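A hedged cross-check of that claim, using only objects created above (pvals is just an illustrative name), lists the cells of the p-value matrix exceeding 0.05:
pvals <- 2*(1-pnorm(abs(logit1s$coefficients/logit1s$standard.errors))) #two-sided p-values; rows = outcome bins, columns = regressors
which(pvals > 0.05, arr.ind = TRUE) #coefficients not significant at the 5% level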
Based on the variables used, the benchmark individual is the one for whom every regressor equals zero: roughly speaking, a 16-year-old, employed, healthy, non-smoking, non-religious, white, single woman with at least GCSE-level education, who does not rent her home, lives outside London, has zero recorded working hours and a recoded weekly pay of 1.
For this benchmark only the intercepts matter: the probability of the reference bin is P(7-8) = 1/(1 + the sum of exp(intercept) over the other bins), and each remaining bin gets P(j) = exp(intercept_j) * P(7-8). Solving the benchmark probabilities for the individual bins follows.
p_8<-1/(1+sum(exp(logit1s$coefficients[,1])))
p_0<-p_8*exp(logit1s$coefficients[,1])
p<-c(p_0[1:2], "7-8"=p_8, p_0[3])
kable(t(t(round(p*100, 2))), col.names = "% likelihood")
| | % likelihood |
|---|---|
| 0-4 | 6.99 |
| 5-6 | 18.09 |
| 7-8 | 39.98 |
| 9-10 | 34.95 |
rm(list=c("p_8", "p_0", "p"))Now to assess how well the model classifies observations based on exogenous variables into specific bins of wellbeing level, a random sample test shall be conducted. Firstly, the relevant columns are taken from the source table, incomplete cases are removed, and 50,000 rows are selected at random. Then, the new column with estimated bin of WB_SATIS5 is added by selecting the category with highest probability based on the outcome of the LOGIT model for each of the sample observation. Finally, the differences between actual and predicted values are contrasted in the heatmap below. –Note: There was an issue with accurately predicting the level of satisfaction at 7 This issue was chosen to be eliminated by merging 7 and 8 into one level, forming more appropriate base for analysis.–
varsused<-c("WB_SATIS5", "DG_AGE", "EC_PPWK_ECAC1", "EC_WORKHR_ECAC1",
"DG_LOWEDU", "EC_UNEMP", "EC_INAC", "EC_UNEMP3M",
"EC_UNEMP3Y", "EC_ROUTJ", "AC_RENT", "HE_Q3",
"HE_Q4", "HE_Q5", "DG_MALE", "DG_MAR_PART",
"DG_MAR_EXPART", "DG_RELIGIOUS", "DG_LONDON",
"DG_NONW", "HE_SMOKER")
stest <- fullCases[,names(fullCases) %in% varsused]
stest <- stest[complete.cases(stest),]
stest <- stest[sample(nrow(stest), 50000),]
stest$WB_PRED<-predict(logit1, type="class", newdata=stest[,!names(stest)=="WB_SATIS5"])
stest$WB_SATIS5<-factor(stest$WB_SATIS5, c("0-4", "5-6", "7-8", "9-10"))
stest$WB_PRED<-factor(stest$WB_PRED, c("0-4", "5-6", "7-8", "9-10"))
heatmap.2(xtabs(~stest$WB_SATIS5 + stest$WB_PRED), #first is Y, second is X (rows, columns)
cellnote = xtabs(~stest$WB_SATIS5 + stest$WB_PRED),
ylab="Actual", #FIRST
xlab="Predicted", #SECOND
notecol="black",
density.info="none",
trace="none",
col = rev(heat.colors(20)),
dendrogram="none",
Colv="NA",
Rowv="NA",
lwid=c(0.1,4.5),
lhei=c(0.1,4.5),
key=FALSE,
symm=TRUE)cat(paste0("Misclassification error: ",
round(100*mean(as.character(stest$WB_SATIS5) != as.character(stest$WB_PRED)), 2),
"%"))#Misclassification error## Misclassification error: 48.72%
par(mar=c(2,2,0.5,0), #margins: L&B 2 for axes, T 0.5 for title
cex.axis=0.7, #reduce size of axis numbers
cex.main=0.8, #reduce title size
mfrow=c(1,2))
MattHist(as.numeric(stest$WB_SATIS5), "turquoise", "Life Satisfaction: Sample", c(0.5,1.5,2.5,3.5,4.5))
MattHist(as.numeric(stest$WB_PRED), "turquoise", "Life Satisfaction: Predicted", c(0.5,1.5,2.5,3.5,4.5))
Next follows a visualization of the benchmark case with age varying, comparing male versus female probabilities of falling into the individual categories. There is an indication that the midlife crisis is more pronounced in males.
val <- data.frame(DG_AGE=0:99,
EC_PPWK_ECAC1=1,
EC_WORKHR_ECAC1=1,
DG_LOWEDU=0,
EC_UNEMP=0,
EC_INAC=0,
EC_UNEMP3M=0,
EC_UNEMP3Y=0,
EC_ROUTJ=0,
AC_RENT=0,
HE_Q3=0,
HE_Q4=0,
HE_Q5=0,
DG_MALE=0,
DG_MAR_PART=0,
DG_MAR_EXPART=0,
DG_RELIGIOUS=0,
DG_LONDON=0,
DG_NONW=0,
HE_SMOKER=0)
pred1p<-predict(logit1, type="probs", newdata=val)
pred1c<-predict(logit1, type="class", newdata=val)
predpdat<-cbind(val, pred1p)
par(mar=c(4,4,0.5,2), #margins: L&B 2 for axes, T 0.5 for title
oma=c(0,0,0,0),
cex.axis=0.7, #reduce size of axis numbers
cex.main=0.8, #reduce title size
mfrow=c(1,2))
plot(predpdat$DG_AGE, rep(0:1, 50),
pch=16,
xlab="Age",
ylab="Absolute Probability",
col="white",
xaxs="i",
yaxs="i")
lines(predpdat$DG_AGE, predpdat$"9-10", lwd=2, col="grey80", lty=1)
lines(predpdat$DG_AGE, predpdat$"7-8", lwd=2, col="grey60", lty=2)
lines(predpdat$DG_AGE, predpdat$"5-6", lwd=2, col="grey40", lty=3)
lines(predpdat$DG_AGE, predpdat$"0-4", lwd=2, col="grey20", lty=4)
legend("top", c("9-10", "7-8", "5-6", "0-4"),
col = paste0("grey",seq(80,20,-20)),
lty = 1:4, lwd=3,
ncol=2, cex=0.8, title="Wellbeing Category")
plot(predpdat$DG_AGE, rep(0:1, 50),
pch=16,
xlab="Age",
ylab="Aggregate Probability",
col="white",
xaxs="i",
yaxs="i")
polygon(c(0,predpdat$DG_AGE,99), c(0,predpdat$"5-6"+predpdat$"0-4"+predpdat$"7-8"+predpdat$"9-10",0), col="grey80")
polygon(c(0,predpdat$DG_AGE,99), c(0,predpdat$"5-6"+predpdat$"0-4"+predpdat$"7-8",0), col="grey60")
polygon(c(0,predpdat$DG_AGE,99), c(0,predpdat$"5-6"+predpdat$"0-4",0), col="grey40")
polygon(c(0,predpdat$DG_AGE,99), c(0,predpdat$"0-4",0), col="grey20")val <- data.frame(DG_AGE=0:99,
EC_PPWK_ECAC1=1,
EC_WORKHR_ECAC1=1,
DG_LOWEDU=0,
EC_UNEMP=0,
EC_INAC=0,
EC_UNEMP3M=0,
EC_UNEMP3Y=0,
EC_ROUTJ=0,
AC_RENT=0,
HE_Q3=0,
HE_Q4=0,
HE_Q5=0,
DG_MALE=1,
DG_MAR_PART=0,
DG_MAR_EXPART=0,
DG_RELIGIOUS=0,
DG_LONDON=0,
DG_NONW=0,
HE_SMOKER=0)
pred1p<-predict(logit1, type="probs", newdata=val)
pred1c<-predict(logit1, type="class", newdata=val)
predpdat<-cbind(val, pred1p)
par(mar=c(4,4,0.5,2), #margins: L&B 2 for axes, T 0.5 for title
oma=c(0,0,0,0),
cex.axis=0.7, #reduce size of axis numbers
cex.main=0.8, #reduce title size
mfrow=c(1,2))
plot(predpdat$DG_AGE, rep(0:1, 50),
pch=16,
xlab="Age",
ylab="Absolute Probability",
col="white",
xaxs="i",
yaxs="i")
lines(predpdat$DG_AGE, predpdat$"9-10", lwd=2, col="grey80", lty=1)
lines(predpdat$DG_AGE, predpdat$"7-8", lwd=2, col="grey60", lty=2)
lines(predpdat$DG_AGE, predpdat$"5-6", lwd=2, col="grey40", lty=3)
lines(predpdat$DG_AGE, predpdat$"0-4", lwd=2, col="grey20", lty=4)
legend("top", c("9-10", "7-8", "5-6", "0-4"),
col = paste0("grey",seq(80,20,-20)),
lty = 1:4, lwd=3,
ncol=2, cex=0.8, title="Wellbeing Category")
plot(predpdat$DG_AGE, rep(0:1, 50),
pch=16,
xlab="Age",
ylab="Aggregate Probability",
col="white",
xaxs="i",
yaxs="i")
polygon(c(0,predpdat$DG_AGE,99), c(0,predpdat$"5-6"+predpdat$"0-4"+predpdat$"7-8"+predpdat$"9-10",0), col="grey80")
polygon(c(0,predpdat$DG_AGE,99), c(0,predpdat$"5-6"+predpdat$"0-4"+predpdat$"7-8",0), col="grey60")
polygon(c(0,predpdat$DG_AGE,99), c(0,predpdat$"5-6"+predpdat$"0-4",0), col="grey40")
polygon(c(0,predpdat$DG_AGE,99), c(0,predpdat$"0-4",0), col="grey20")Visualization of benchmark case with variable hours worked
val <- data.frame(DG_AGE=16,
EC_PPWK_ECAC1=1,
EC_WORKHR_ECAC1=1:100,
DG_LOWEDU=0,
EC_UNEMP=0,
EC_INAC=0,
EC_UNEMP3M=0,
EC_UNEMP3Y=0,
EC_ROUTJ=0,
AC_RENT=0,
HE_Q3=0,
HE_Q4=0,
HE_Q5=0,
DG_MALE=0,
DG_MAR_PART=0,
DG_MAR_EXPART=0,
DG_RELIGIOUS=0,
DG_LONDON=0,
DG_NONW=0,
HE_SMOKER=0)
pred1p<-predict(logit1, type="probs", newdata=val)
pred1c<-predict(logit1, type="class", newdata=val)
predpdat<-cbind(val, pred1p)
par(mar=c(4,4,0.5,2), #margins: L&B 2 for axes, T 0.5 for title
oma=c(0,0,0,0),
cex.axis=0.7, #reduce size of axis numbers
cex.main=0.8, #reduce title size
mfrow=c(1,2))
plot(predpdat$EC_WORKHR_ECAC1, rep(0:1, 50),
pch=16,
xlab="Weekly Hours Worked",
ylab="Absolute Probability",
col="white",
xaxs="i",
yaxs="i")
lines(predpdat$EC_WORKHR_ECAC1, predpdat$"9-10", lwd=2, col="grey80", lty=1)
lines(predpdat$EC_WORKHR_ECAC1, predpdat$"7-8", lwd=2, col="grey60", lty=2)
lines(predpdat$EC_WORKHR_ECAC1, predpdat$"5-6", lwd=2, col="grey40", lty=3)
lines(predpdat$EC_WORKHR_ECAC1, predpdat$"0-4", lwd=2, col="grey20", lty=4)
legend("top", c("9-10", "7-8", "5-6", "0-4"),
col = paste0("grey",seq(80,20,-20)),
lty = 1:4, lwd=3,
ncol=2, cex=0.8, title="Wellbeing Category")
plot(predpdat$EC_WORKHR_ECAC1, rep(0:1, 50),
pch=16,
xlab="Weekly Hours Worked",
ylab="Aggregate Probability",
col="white",
xaxs="i",
yaxs="i")
polygon(c(1,predpdat$EC_WORKHR_ECAC1,100), c(0,predpdat$"5-6"+predpdat$"0-4"+predpdat$"7-8"+predpdat$"9-10",0), col="grey80")
polygon(c(1,predpdat$EC_WORKHR_ECAC1,100), c(0,predpdat$"5-6"+predpdat$"0-4"+predpdat$"7-8",0), col="grey60")
polygon(c(1,predpdat$EC_WORKHR_ECAC1,100), c(0,predpdat$"5-6"+predpdat$"0-4",0), col="grey40")
polygon(c(1,predpdat$EC_WORKHR_ECAC1,100), c(0,predpdat$"0-4",0), col="grey20")
### Effect of income
A visualization of my own case with weekly pay varying.
val <- data.frame(DG_AGE=22,
EC_PPWK_ECAC1=1:500,
EC_WORKHR_ECAC1=20,
DG_LOWEDU=0,
EC_UNEMP=0,
EC_INAC=0,
EC_UNEMP3M=0,
EC_UNEMP3Y=0,
EC_ROUTJ=0,
AC_RENT=1,
HE_Q3=0,
HE_Q4=0,
HE_Q5=0,
DG_MALE=1,
DG_MAR_PART=0,
DG_MAR_EXPART=0,
DG_RELIGIOUS=0,
DG_LONDON=0,
DG_NONW=0,
HE_SMOKER=0)
pred1p<-predict(logit1, type="probs", newdata=val)
pred1c<-predict(logit1, type="class", newdata=val)
predpdat<-cbind(val, pred1p)
par(mar=c(4,4,0.5,2), #margins: L&B 2 for axes, T 0.5 for title
oma=c(0,0,0,0),
cex.axis=0.7, #reduce size of axis numbers
cex.main=0.8, #reduce title size
mfrow=c(1,2))
plot(predpdat$EC_PPWK_ECAC1, rep(0:1, 250),
pch=16,
xlab="Weekly Pay",
ylab="Absolute Probability",
col="white",
xaxs="i",
yaxs="i")
lines(predpdat$EC_PPWK_ECAC1, predpdat$"9-10", lwd=2, col="grey80", lty=1)
lines(predpdat$EC_PPWK_ECAC1, predpdat$"7-8", lwd=2, col="grey60", lty=2)
lines(predpdat$EC_PPWK_ECAC1, predpdat$"5-6", lwd=2, col="grey40", lty=3)
lines(predpdat$EC_PPWK_ECAC1, predpdat$"0-4", lwd=2, col="grey20", lty=4)
legend("top", c("9-10", "7-8", "5-6", "0-4"),
col = paste0("grey",seq(80,20,-20)),
lty = 1:4, lwd=3,
ncol=2, cex=0.8, title="Wellbeing Category")
plot(predpdat$EC_PPWK_ECAC1, rep(0:1, 250),
pch=16,
xlab="Weekly Pay",
ylab="Aggregate Probability",
col="white",
xaxs="i",
yaxs="i")
polygon(c(1,predpdat$EC_PPWK_ECAC1,500), c(0,predpdat$"5-6"+predpdat$"0-4"+predpdat$"7-8"+predpdat$"9-10",0), col="grey80")
polygon(c(1,predpdat$EC_PPWK_ECAC1,500), c(0,predpdat$"5-6"+predpdat$"0-4"+predpdat$"7-8",0), col="grey60")
polygon(c(1,predpdat$EC_PPWK_ECAC1,500), c(0,predpdat$"5-6"+predpdat$"0-4",0), col="grey40")
polygon(c(1,predpdat$EC_PPWK_ECAC1,500), c(0,predpdat$"0-4",0), col="grey20")
Looking at the previous chart, and knowing that I earn GBP 350 a week for my 20 hours of work, the reader can easily take the appropriate cross-section and draw the pie chart below.
par(mar=c(0,0,1.5,0), #margins: L&B 2 for axes, T 0.5 for title
cex.main=0.8)
val <- data.frame(DG_AGE=22,
EC_PPWK_ECAC1=350,
EC_WORKHR_ECAC1=20,
DG_LOWEDU=0,
EC_UNEMP=0,
EC_INAC=0,
EC_UNEMP3M=0,
EC_UNEMP3Y=0,
EC_ROUTJ=0,
AC_RENT=1,
HE_Q3=0,
HE_Q4=0,
HE_Q5=0,
DG_MALE=1,
DG_MAR_PART=0,
DG_MAR_EXPART=0,
DG_RELIGIOUS=0,
DG_LONDON=0,
DG_NONW=0,
HE_SMOKER=0)
me<-predict(logit1, type="probs", newdata=val)
pie(me[order(me,decreasing=TRUE)],
main="Author: life satisfaction\ncategory probabilities",
labels=paste0(names(me[order(me,decreasing=TRUE)]), " - " ,round(100*me[order(me,decreasing=TRUE)]), "%"),
clockwise = TRUE,
init.angle = 0,
col=paste0("grey",seq(80,20,-20)))Will be added later