http://mattved.eu/

Environment preparation

Firstly, a working directory must be set and relevant packages imported. I am also clearing my environment from any variables which may be left over from prior analyses. Finally, I had to define myself a couple of functions to free myself from endless copy-pasting.

setwd("C:/Users/matt/MEGA/School/ECN3000/ECN3015/CW01")
require(psych)
require(foreign)
require(colorRamps)
require(tidyverse)
require(gplots)
require(knitr)
require(MASS)
require(nnet)
require(lmtest)
require(stargazer)

rm(list=ls())
#outlogit function used for presenting a multinomial logit model in an appropriate way
outlogit<- function(x){
    z <- cbind(round(exp(summary(x)$coefficients), 3), round(summary(x)$coefficients, 3),
               round((1 - pnorm(abs(summary(x)$coefficients/summary(x)$standard.errors), 0, 1)) * 2, 3))
    z <- cbind(z, ifelse(z[,3]<0.01, 99, ifelse(z[,3]<0.05, 95, ifelse(z[,3]<0.1, 90, 0) ) ) )
    colnames(z)<- c("OR","coefficient", "p-value", "sig")
    return(z)
}
MattHist <- function(v, color="deepskyblue", main=NULL, breaks="Sturges"){
    g<-hist(v, breaks=breaks, plot=FALSE)
    hist(v,
     freq = FALSE, 
     breaks= breaks, 
     col=paste0(color, "1"), 
     lty=1, 
     main = main,
     axes=TRUE, 
     ylim = c(-max(g$density)/20,max(g$density)))
    lines(density(v, adjust=5, na.rm=TRUE), 
          col=paste0(color,"4"), 
          lwd=2)
    curve(dnorm(x , mean=mean(v, na.rm = TRUE), sd=sd(v, na.rm=TRUE)), 
          col="Black", 
          lwd=1, 
          add=TRUE)
    boxplot(v, 
            horizontal=TRUE,  
            outline=TRUE,
            axes=FALSE, 
            col = paste0(color,"2"), 
            whisklty=1, 
            staplelty=0, 
            boxwex=max(g$density)/10, 
            add=TRUE, 
            at=-max(g$density)/20, 
            pch=16, 
            outcol=paste0(color,"2"))
    rm(list = c("g"))
}
sharebox<-function(p, v, cts=NULL, main="Breakdown", color="deepskyblue"){
    par(oma=c(0,5,0,0.5), mar=c(4,0,1,0.5), cex.axis=0.7, cex.lab=0.7, cex.main=0.7, las=1)
    nf<-layout(matrix(c(1,2), 1, 2, byrow = TRUE), widths=c(3,1))
    boxplot(p~v, 
        horizontal=TRUE,  outline=TRUE,  axes=TRUE,
        names=cts,
        col = paste0(color,"2"), 
        main=main,
        whisklty=1, 
        staplelty=0, 
        notch=FALSE, 
        boxwex=0.5, 
        pch=16, 
        outcol=paste0(color,"2"), 
        xlab="Subjective Wellbeing")
    barplot(prop.table(table(v)), 
        horiz=TRUE, 
        las=1, 
        axes=TRUE, 
        names.arg = "", 
        col=paste0(color,"3"), 
        xlab = "%", 
        main="Share")

}
misobs<-function(v){
    cat(paste0(round(100*length(which(is.na(v)))/length(DG_AGE)), "% of observations missing"))
}

Data import

Then, the data file needs to be imported and cleared of observations which cannot be used due to null values of predicted obsetrvations. Then, the dataset is attached for simplification of code in further steps

lmsurvey <- read.spss("./DAT/SN7924.sav", use.value.labels=FALSE, to.data.frame=TRUE, reencode = FALSE)
fullCases <- lmsurvey[!(is.na(lmsurvey$WB_SATIS))&!(is.na(lmsurvey$WB_WORTH)),]
attach(fullCases)
cat(c("Original observations:", length(lmsurvey$DG_AGE), "\nComplete observations:", length(fullCases$DG_AGE)))
## Original observations: 567481 
## Complete observations: 303492

Predicted variables

Subjective wellbeing and feeling worthwhile

As the histogram of subjective life satisfaction shows, the distribution of responses is negatively skewed, with mode and median both at value of 8 and mean at 7.54. Values below 5 are outliers, likely to be caused be related to factors such as bad health or life in poverty, hence the single bin. 5 and 6 observe almost identical density, which is also the case for 9 and 10. 7 and 8 then differ somewhat, with 8 being mode, possibly due to general preference in even numbers, when it comes to self-assessed levels.
For individual feeling of an extent to which life is worthwhile, similar characteristics apply, with just slightly higher mean at 7.79, fewer outliers and consequential higher negative skew.
This not only shows that individuals in the sample tend not to see things in a grave manner, but also that 75% don’t feel either completely satisfied, nor perceive their activities fully worthwhile.
Because the resulting LOGIT analysis will focus on subjective life satisfaction, the variable shall be recorded into the appropriate bins identified above. Same, however shall be done for feeling worthwhile, for capturing difference in relative meaning of the two variables. It is for example expected that employment may have positive effect on worthwhile, while not so much on life satisfaction, as working is pain. Here, I needed to combine 7 and 8 into a single level, as the model couldn’t distinguish between them

recode <- c(0,0,0,0,0,1,1,2,2,3,3)
fullCases$WB_SATIS5 <- recode[WB_SATIS+1]
fullCases$WB_WORTH5 <- recode[WB_WORTH+1]
  detach(fullCases); attach(fullCases)
rm(ls="recode")
par(mar=c(2,2,0.5,0), #margins: L&B 2 for axes, T 0.5 for title
    cex.axis=0.7, #reduce size of axis numbers 
    cex.main=0.8, #reduce title size
    mfrow=c(2,2))
MattHist(WB_SATIS, "deepskyblue", "Life Satisfaction: Collected", c(0,4.5,6.5,8.5,10.5))
MattHist(WB_SATIS5, "tomato", "Life Satisfaction: Recoded", seq(-0.5, 3.5, 1))
MattHist(WB_WORTH, "turquoise", "Life Worthwhile: Collected", c(0,4.5,6.5,8.5,10.5))
MattHist(WB_WORTH5, "orange", "Life Worthwhile: Recoded", seq(-0.5, 3.5, 1))

form<-(c("WB_SATIS5", "PRED"))

Due to similar distribution of these variables, an inspection using heatmap may come handy for a first-look inspection of any potential relationship between these two.
Below, a frequency matrix is graphed, expressing number of respondents in thousands. There indeed seems to be some linear relationship between those two variables.
Looking further, their correlation coefficient estimated ot be in neighborhood of 0.61 is significant and some causation is confirmed by simple one-variable OLS estimates with p-value of zero.

heatmap.2(xtabs(~WB_SATIS5 + WB_WORTH5), #first is Y, second is X (rows, columns)
  cellnote = round(xtabs(~WB_SATIS5 + WB_WORTH5)/1000, 0),
  xlab="Worthwhile level", #FIRST
  ylab="Satisfaction level", #SECOND
  notecol="black",
  density.info="none",
  trace="none",
  col = rev(heat.colors(20)),
  dendrogram="none",
  Colv="NA",
  Rowv="NA",
  lwid=c(0.1,4.5), 
  lhei=c(0.1,4.5),
  key=FALSE,
  symm=TRUE)

cat("Correlation coefficient 95% confidence interval: ", 
    cor.test(WB_SATIS5, WB_WORTH5)$conf.int, 
    "\n")
## Correlation coefficient 95% confidence interval:  0.6008662 0.6053933
summary(lm(WB_SATIS5~WB_WORTH5))$coefficients
##              Estimate  Std. Error  t value Pr(>|t|)
## (Intercept) 0.6925858 0.003378361 205.0065        0
## WB_WORTH5   0.6204777 0.001489520 416.5622        0

Happiness and anxiety

In comparison, the self-assessed variables for immediate mental state on the day prior to responding to the survey exhibit similar characteristics with negative relationship yet yield more distinct distribution. Because these two variables are related to the specific point in time, very short run, they do not seem ideal for being estimated based on variables which vary at far lower frequency.
Due to this, only a brief graphical description follows for basic insight.

par(mar=c(2,2,0.5,0), #margins: L&B 2 for axes, T 0.5 for title
    cex.axis=0.7, #reduce size of axis numbers 
    cex.main=0.8, #reduce title size
    mfrow=c(1,2) #1row 2cols plot matrix
    )
MattHist(WB_HAPPY, "darkgoldenrod", "Feeling happy", 10)
MattHist(WB_ANXIOUS, "sienna", "Feeling anxious", 10)

heatmap.2(xtabs(~WB_HAPPY + WB_ANXIOUS),
  cellnote = round(xtabs(~WB_HAPPY + WB_ANXIOUS)/1000, 0),
  ylab="Happiness level",
  xlab="Anxiety level",
  notecol="black",
  density.info="none",
  trace="none",
  col = rev(heat.colors(20)),
  dendrogram="none",
  Colv="NA",
  Rowv="NA",
  lwid=c(0.1,4.5), 
  lhei=c(0.1,4.5),
  key=FALSE,
  symm=TRUE)

cat("Correlation coefficient 95% confidence interval:\n", cor.test(WB_HAPPY, WB_ANXIOUS)$conf.int)
## Correlation coefficient 95% confidence interval:
##  -0.4697965 -0.4642267
cat("\nLinear model estimate:\n")
## 
## Linear model estimate:
summary(lm(WB_HAPPY~WB_ANXIOUS))$coefficients
##               Estimate  Std. Error   t value Pr(>|t|)
## (Intercept)  8.4515512 0.005013846 1685.6422        0
## WB_ANXIOUS  -0.3547329 0.001220563 -290.6305        0

Wrapup

Overall, this process has shown that the sample of British population generally experiences great level of happiness and feeling worthwhile. Consequently, there must be a handful of factors causing positive and negative disturbances.
Logit model is the most feasible technique for inspecting said factors and will be utilized in the final section.

Explanatory variables

The dataset“Annual Population Survey: Personal Well-Being, April 2012 - March 2015” was obtained from UK Data Service. After removal of variables which were too detailed to be useful, the following remained.

##  [1] "DG_AGE"      "DG_GENDER"   "DG_MARSTA"   "DG_REGION"   "DG_EDU"     
##  [6] "DG_ORIGIN"   "DG_UKSINCE"  "DG_ETH"      "DG_RELIG"    "EC_ECAC"    
## [11] "EC_UNDUR"    "EC_FTPTW"    "EC_PERMJOB"  "EC_HOMEJOB"  "EC_JOBTYPE" 
## [16] "EC_JOBSCTR"  "EC_PPWK"     "EC_WORKHR"   "EC_BENFTS"   "AC_TYPE"    
## [21] "AC_TENURE"   "AC_LSTMOV"   "HE_QHEALTH"  "HE_SMOK"     "HE_SMOKpd"  
## [26] "HE_SMOKever" "WB_SATIS"    "WB_WORTH"    "WB_ANXIOUS"  "WB_HAPPY"   
## [31] "BS_PHONE"    "BS_PROXY"    "WB_SATIS5"   "WB_WORTH5"

Scales

Variable classifiable as scales, which shall be used in this piece are following.

  • DG_AGE
  • EC_PPWK
  • EC_WORKHR
  • DG_UKSINCE

Age

Age variable is pretty simple and expected to have an effect. No observations are missing, the variable is normally distributed, with fairly lower degree of kurtosis. Minimum age is 16, due to the removal of variables which did not include self-assessment values, for the subjects were not trusted to do so honestly.
This variable can be treated as is in the next steps.

misobs(DG_AGE)
## 0% of observations missing
par(oma=c(2,2,0,0), mar=c(0,0,0.6,0), cex.main=0.8, cex.axis=0.8)
MattHist(DG_AGE, "darkslategray", "Age")

Surprising element is the following quadratic element the relationship between age and subjective level of satisfaction exhibits. Apparently, 45 year olds are in the mid-life crisis!

lm_age <- lm(WB_SATIS5 ~ I(DG_AGE+16) + I((DG_AGE+16)^2), na.action = na.omit)
summary(lm_age)
## 
## Call:
## lm(formula = WB_SATIS5 ~ I(DG_AGE + 16) + I((DG_AGE + 16)^2), 
##     na.action = na.omit)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.40986 -0.11104  0.02414  0.89885  1.04616 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         2.565e+00  2.039e-02  125.82   <2e-16 ***
## I(DG_AGE + 16)     -1.982e-02  6.169e-04  -32.13   <2e-16 ***
## I((DG_AGE + 16)^2)  1.606e-04  4.470e-06   35.93   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8164 on 303489 degrees of freedom
## Multiple R-squared:  0.006323,   Adjusted R-squared:  0.006317 
## F-statistic: 965.6 on 2 and 303489 DF,  p-value: < 2.2e-16
par(oma=c(0,0,0,0), mar=c(4,4,2,1), cex.main=0.8, cex.axis=0.8)
plot(DG_AGE,WB_SATIS5, 
     pch=16,
     xlab="Age",
     ylab="Life Satisfaction",
     main="Age and life satisfaction",
     col="grey80")
grid(nx = NULL, col = "lightgrey", lty = "dotted", equilogs = TRUE)
agelmx<-seq(0, 100, 1)
agelmy<-predict(lm_age, list(DG_AGE=agelmx), type="response") #?predict
lines(agelmx, agelmy, lwd=2, col="blue")

which.is.max(-agelmy)
## [1] 47
form<-rbind(form, c("DG_AGE", "-16 & quad"))
rm(list=c("lm_age", "agelmy", "agelmx"))

Weekly income

Firstly, it is important to note that the income variable is only available in cases where EC_ECAC=1, that is where the individual is in employment. Furthermore, 14% of total remaining observations is missing due to rejection to answer.
Therefore, the variable will act as a slope dummy, allowing to fill the observations out-of-employment with value of 1, so that after logarithmic transformation, it is set to 0. It is important that it is correctly interpreted from the output!

kable(rbind(table(ifelse(is.na(EC_PPWK)|EC_PPWK == 0, "Missing", "Available"), EC_ECAC),
      100*round(prop.table(table(ifelse(is.na(EC_PPWK), "Missing [%]", "Available [%]"), EC_ECAC)), 2)))
1 2 3
Available 120079 0 0
Missing 42395 11761 129257
Available [%] 40 0 0
Missing [%] 14 4 43
fullCases$EC_PPWK_ECAC1 <- ifelse(is.na(EC_PPWK)&EC_ECAC!=1, 1, EC_PPWK); detach(fullCases); attach(fullCases)
form<-rbind(form, c("EC_PPWK_ECAC1", "log"))

kable(
    rbind(
        table(ifelse(is.na(EC_PPWK_ECAC1)|EC_PPWK_ECAC1 == 0, "Missing", "Available"), EC_ECAC),
        100*round(prop.table(table(ifelse(is.na(EC_PPWK_ECAC1), "Missing [%]", "Available [%]"), EC_ECAC)), 2))
    )
1 2 3
Available 120079 11761 129257
Missing 42395 0 0
Available [%] 40 4 43
Missing [%] 14 0 0
par(oma=c(2,2,0,0), mar=c(0,0,0.6,0), cex.main=0.8, cex.axis=0.8, mfrow=c(1,2))
MattHist(log(EC_PPWK), "deepskyblue", "Log of income")
MattHist(log(EC_PPWK_ECAC1), "deepskyblue", "Log of income, including inactive")

kable(
    rbind(
        c("EC_PPWK", round(cor.test(EC_PPWK, WB_WORTH5, use="complete.obs")$conf.int[1:2], 3)),
        c("EC_PPWK_ECAC1", round(cor.test(EC_PPWK_ECAC1, WB_WORTH5, use="complete.obs")$conf.int[1:2], 3))), 
    col.names = c("Variable","min", "max"), caption = "Correlation with WB_WORTH5 confidence intervals")
Correlation with WB_WORTH5 confidence intervals
Variable min max
EC_PPWK 0.019 0.03
EC_PPWK_ECAC1 0.042 0.05
misobs(EC_PPWK_ECAC1)
## 14% of observations missing

The massive group of outliers generated through the recoding process did influence most descriptive statistics on the variable, however it is usable in spite, due to the dummy properties and otherwise log-normal distribution of the original information. This way, the missing observations are reduced to 14%, which may still be workable. Progress of this exercise shall show.
Furthermore, the transformation had a positive effect on the correlation coefficient, as one may expect from a fake variable.

Weekly hours worked

in contrast to somewhat closely related income, hours worked are expected to have a negative effect on the predicted variable.
It might be possible to combine these two variables into one, that is hourly rate. However, there are two important components that require attention.
Firstly, the number of missing observations is far lower and may enable insight into variation among individuals who rejected to provide information on their income.

kable(rbind(table(ifelse(is.na(EC_WORKHR)|EC_WORKHR == 0, "Missing", "Available"), EC_ECAC),
      100*round(prop.table(table(ifelse(is.na(EC_WORKHR), "Missing [%]", "Available [%]"), EC_ECAC)), 2)),
       caption = "WORKHR Before Recoding")
WORKHR Before Recoding
1 2 3
Available 158856 0 0
Missing 3618 11761 129257
Available [%] 52 0 0
Missing [%] 1 4 43
fullCases$EC_WORKHR_ECAC1 <- ifelse(is.na(EC_WORKHR)&EC_ECAC!=1, 0, EC_WORKHR); detach(fullCases); attach(fullCases)
form<-rbind(form, c("EC_WORKHR_ECAC1", "quad"))

kable(rbind(table(ifelse(is.na(EC_WORKHR_ECAC1), "Missing", "Available"), EC_ECAC),
      100*round(prop.table(table(ifelse(is.na(EC_WORKHR_ECAC1), "Missing [%]", "Available [%]"), EC_ECAC)), 2)
      ), caption = "WORKHR After Recoding")
WORKHR After Recoding
1 2 3
Available 158955 11761 129257
Missing 3519 0 0
Available [%] 52 4 43
Missing [%] 1 0 0
par(oma=c(2,2,0,0), mar=c(0,0,0.6,0), cex.main=0.8, cex.axis=0.8)
MattHist(EC_WORKHR_ECAC1, main = "Hours worked")

misobs(EC_WORKHR_ECAC1)
## 1% of observations missing

Secondly, a quadratic component is expected to be present in the relationship due to diminishing satisfaction from work. The highest level of life satisfaction is at 43 weekly hours at work, suggesting that the 40 hour week is appropriate.

Bullshit! I have to admit that the zero values for unemployed people did create some degree of bias. The blue line shows the sample including unemployed individuals, while the red one does not. Apparently, some people just love their job. Regardless, this effect should disappear through inclusion of appropriate dummies. It has already also shown that people out of job are less happy. Let’s inspect it further!

lm_workhr <- lm(WB_SATIS5 ~ EC_WORKHR_ECAC1 + I(EC_WORKHR_ECAC1^2), na.action = na.omit)
lm_workhr2 <- lm(WB_SATIS5 ~ EC_WORKHR + I(EC_WORKHR^2), na.action = na.omit)
par(oma=c(0,0,0,0), mar=c(4,4,2,1), cex.main=0.8, cex.axis=0.8)
plot(EC_WORKHR,WB_SATIS5, 
     pch=16,
     xlab="Weekly hours worked",
     ylab="Life Satisfaction",
     main="Hours Worked and Life Satisfaction",
     col="grey80")
grid(nx = NULL, col = "lightgrey", lty = "dotted", equilogs = TRUE)
whrlmx<-seq(0, 100, 1)
whrlmy<-predict(lm_workhr, list(EC_WORKHR_ECAC1=whrlmx), type="response") #?predict
lines(whrlmx, whrlmy, lwd=2, col="blue")
whr2lmy<-predict(lm_workhr2, list(EC_WORKHR=whrlmx), type="response") #?predict
lines(whrlmx, whr2lmy, lwd=2, col="red")

which.is.max(-whr2lmy)
## [1] 48
which.is.max(whrlmy)
## [1] 44
rm(list=c("lm_workhr", "lm_workhr2", "whr2lmy", "whrlmx", "whrlmy"))

Time of first arrival to the UK

Because of the amount of available values, this shall be omitted in the first analyses. Maybe create a dummy for people new to the UK?

misobs(DG_UKSINCE)
## 89% of observations missing
boxplot.stats(DG_UKSINCE)$stats
## [1] 1926 1974 1998 2006 2015
#plot(DG_UKSINCE, WB_SATIS5)

Ordinal categories

In terms of ordinal categorical variables, following of the available ones shall be used

  • DG_EDU
  • EC_ECAC
  • EC_UNDUR
  • EC_JOBTYPE
  • AC_TENURE
  • AC_LASTMOV
  • HE_QHEALTH

Education

Shall we group GCSE and above or just say that education has no significant effect? I mean, of course life of somebody with no education at age over 16 is worthless. And I know it is not fair to say, but it is true! There is a lot of missing observations, which I am treating with what might create a lot of false negatives in the dummy. Let’s just assume all these people have above GSCE education.

misobs(DG_EDU)
## 24% of observations missing
sharebox(WB_SATIS, DG_EDU,
         cts=c("Postgraduate", "Graduate", "A-level", "GCSE", "Other", "None", "Not known"), 
         main="Breakdown by Education", 
         color="lightsalmon")

fullCases$DG_LOWEDU<-ifelse(is.na(DG_EDU), 0, ifelse(DG_EDU>=5, 1, 0))
detach(fullCases); attach(fullCases)
form<-rbind(form, c("DG_LOWEDU", "DUM"))
drop <- cbind("DG_EDU", drop)

Economic Activity

There does not seem to be distributional difference between employed and inactive individuals. Furthermore, unemployed people may be distinguished by the unemployment duration variable. Regardless, I am creating two dummies for each, expecting them to be removed while clearing model from insignificant variables.

misobs(EC_ECAC)
## 0% of observations missing
sharebox(WB_SATIS, EC_ECAC,
         cts=c("Employed", "Unemployed", "Inactive"), 
         main="Breakdown by Economic Activity", 
         color="aquamarine")

fullCases$EC_UNEMP<-ifelse(EC_ECAC == 2, 1, 0)
fullCases$EC_INAC<-ifelse(EC_ECAC == 3, 1, 0)
detach(fullCases); attach(fullCases)

form<-rbind(form, c("EC_UNEMP", "DUM"))
form<-rbind(form, c("EC_INAC", "DUM"))

drop <- cbind("EC_ECAC", drop)

Unemployment Duration

Availability of this variable is decent, given that it is available for most unemployed individuals. Surprisingly, a sample of individuals in employment presents with a value as well.
Still, some influence of unemployment duration is observed on subjective level of wellbeing. Group for 3y+ and 3m-3y seems feasible People unemployed for less than 3 months shall be kept as a benchmark.

misobs(EC_UNDUR)
## 96% of observations missing
kable(100*round(prop.table(table(ifelse(is.na(EC_UNDUR), "Missing [%]", "Available [%]"), EC_ECAC)), 2))
1 2 3
Available [%] 0 4 0
Missing [%] 53 0 43
sharebox(WB_SATIS, EC_UNDUR,
         cts=c("3m-", "3-6m", "6-12m", "12-18m", "18-24m", "2-3y", "3-4y", "4-5y", "5y+"), 
         main="Breakdown by Unemployment Duration", 
         color="yellow")

fullCases$EC_UNEMP3Y<-ifelse(is.na(EC_UNDUR), 0, ifelse(EC_UNDUR >= 7, 1, 0))
fullCases$EC_UNEMP3M<-ifelse(is.na(EC_UNDUR), 0, ifelse(EC_UNDUR %in% 2:6, 1, 0))
detach(fullCases); attach(fullCases)
form<-rbind(form, c("EC_UNEMP3Y", "DUM"))
form<-rbind(form, c("EC_UNEMP3M", "DUM"))

drop <- cbind("EC_UNDUR", drop)

Job class

Here, we can observe that people with a routine job, or those who never worked exhibit a different distribution than those in the higher groups. These shall therefore receive a dummy, with the higher grades as a benchmark.

misobs(EC_JOBTYPE)
## 0% of observations missing
sharebox(WB_SATIS, EC_JOBTYPE, main="Breakdown by Smoking Status", 
         cts=c("High Mgmt/Prof", "Lower Mgmt", "Intermediate", "Small Emp's", "Tech & S'visory",
               "Semi-routine", "Routine", "Never Worked"), 
         color="navajowhite")

fullCases$EC_ROUTJ <- ifelse(EC_JOBTYPE >= 6, 1, 0)
detach(fullCases); attach(fullCases)
form<-rbind(form, c("EC_ROUTJ", "DUM"))
drop <- cbind("EC_JOBTYPE", drop)

Accommodation Tenure

Owning one’s house does remove some degree of dissatisfaction, just as not having to pay rent does. Let’s keep these as benchmark. Renters shall receive their own dummy. The 6 squatters shall be left in the benchmark category

misobs(AC_TENURE)
## 0% of observations missing
sharebox(WB_SATIS, AC_TENURE,
         cts=c("Owned", "Mortgage", "Part Rent", "Rented", "Rent Free", "Squat"), 
         main="Breakdown by Accommodation Tenure", 
         color="seagreen")

fullCases$AC_RENT<-ifelse(AC_TENURE %in% 3:4, 1, 0)
detach(fullCases); attach(fullCases)
form<-rbind(form, c("AC_RENT", "DUM"))
drop <- cbind("AC_TENURE", drop)

Time lived at the location

Based on the boxplot, use of this variable shouldn’t be necessary. It however also shows an issue with sample selection, as people generally move relatively often.

misobs(AC_LSTMOV)
## 0% of observations missing
sharebox(WB_SATIS, AC_LSTMOV,
         cts=c("1-", "1-2", "2-3", "3-5", "5-10", "10+"),
         main="Breakdown by years at location",
         color= "tomato")

drop <- cbind("AC_LSTMOV", drop) 

Quality of health

Good and Very good shall be kept as a benchmark. Three dummies shall be created.

misobs(HE_QHEALTH)
## 3% of observations missing
heatmap.2(xtabs(~WB_SATIS5 + HE_QHEALTH), #first is Y, second is X (rows, columns)
  cellnote = round(xtabs(~WB_SATIS5 + HE_QHEALTH)/1000, 0),
  xlab="Subjective Health level", #FIRST
  ylab="Satisfaction level", #SECOND
  notecol="black",
  density.info="none",
  trace="none",
  col = rev(heat.colors(20)),
  dendrogram="none",
  Colv="NA",
  Rowv="NA",
  lwid=c(0.1,4.5), 
  lhei=c(0.1,4.5),
  key=FALSE,
  symm=TRUE)

sharebox(WB_SATIS, HE_QHEALTH,
   cts=c("Very good", "Good", "Fair", "Bad", "Very bad"), 
   main="Breakdown by Health", 
   color="orangered")

fullCases$HE_Q3<-ifelse(HE_QHEALTH==3, 1, 0)
fullCases$HE_Q4<-ifelse(HE_QHEALTH==4, 1, 0)
fullCases$HE_Q5<-ifelse(HE_QHEALTH==5, 1, 0)
detach(fullCases); attach(fullCases)
form<-rbind(form, c("HE_Q3", "DUM"))
form<-rbind(form, c("HE_Q4", "DUM"))
form<-rbind(form, c("HE_Q5", "DUM"))
drop <- cbind("HE_QHEALTH", drop)

Nominal Categories

After scales and ordinal categories have been dealt with, dummies shall be created for nominal categories to capture additional effects of relevant factors. The sharebox function will assist selection of appropriate attributes to consider. Some of the insignificant distinguishing variables will be dropped without recoding, making my life easier.

  • DG_GENDER
  • DG_MARSTA
  • DG_RELIG
  • DG_REGION
  • DG_ORIGIN
  • DG_ETH
  • HE_SMOK

Gender

The gender variable presents with a minimum of 1 and maximum of 2, for which reason it is recoded into a dummy with 1 for males After doing so, the share of males is determined. The boxplot has no point here, interesting deviations from the overall distribution of predicted variable.

misobs(DG_GENDER)
## 0% of observations missing
  summary(DG_GENDER) #Need to recode from 1 or 2 to dummy
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   2.000   1.561   2.000   2.000
  fullCases$DG_MALE <- ifelse(DG_GENDER == 2, 0, 1)
  detach(fullCases); attach(fullCases)
  cat(c("Share of males:\t", round(mean(DG_MALE), 2), "%"))
## Share of males:   0.44 %
  form<-rbind(form, c("DG_MALE", "DUM"))
  drop <- cbind("DG_GENDER", drop)

Marital Status

For marital status, two dummies are created to represent partnered individuals and individuals, who are either separated, divorced, or widowed.

misobs(DG_MARSTA)
## 0% of observations missing
sharebox(WB_SATIS, DG_MARSTA,
         cts=c("Single", "Married", "Separated", "Divorced", "Widowed", "CP", "Sep CP", "Div CP", "Wid CP"), 
         main="Breakdown by Marital Status", 
         color=c("blue", "green", "red", "red", "red", "green", "red", "red", "red"))

  fullCases$DG_MAR_PART <- ifelse(DG_MARSTA %in% c(2, 6), 1, 0)
  fullCases$DG_MAR_EXPART <- ifelse(DG_MARSTA %in% c(3:5, 7:9), 1, 0)
  #Single left as benchmark
  detach(fullCases); attach(fullCases)
  form<-rbind(form, c("DG_MAR_PART", "DUM"))
  form<-rbind(form, c("DG_MAR_EXPART", "DUM"))
  drop <- cbind("DG_MARSTA", drop)
  par(oma=c(0,0,0,0), mar=c(0,0,0.7,0), cex=0.8)
  pie(c("Expartnered"=mean(DG_MAR_EXPART), 
        "Partnered"=mean(DG_MAR_PART), 
        "Single/other"=1-mean(DG_MAR_PART)-mean(DG_MAR_EXPART)),
      col=rainbow(3),
      main="Marital Status Breakdown")

####Religion In general, Atheists and buddhists tend to be less prone to reaching life satisfaction level of 10, yet present with lower variance in general. Similarity of the two camps could be attributed to the fact that buddhism is a life philospohy rather than religion, as it does not follow concepts of faith or worship.
In addition, Muslims seem to vary the most.

misobs(DG_RELIG)
## 2% of observations missing
sharebox(WB_SATIS, DG_RELIG,
         cts=c("Atheist", "Christian", "Budhist", "Hindu", "Jewish", "Muslim", "Sikh", "Other"), 
         main="Breakdown by Religion", 
         color="rosybrown")

fullCases$DG_RELIGIOUS <- ifelse(DG_RELIG %in% c(1, 3), 0, 1)
detach(fullCases); attach(fullCases)
form<-rbind(form, c("DG_RELIG", "DUM"))
drop <- cbind("DG_RELIG", drop)
cat(c("Share of religious: ", 100*round(mean(DG_RELIGIOUS, na.rm = TRUE), 2), "%"))
## Share of religious:  72 %

Region

London… As always! Created a dummy.

misobs(DG_REGION)
## 0% of observations missing
sharebox(WB_SATIS, DG_REGION,
         cts=c("North East", "North West", "Yorkshire+", "East Midlands", "West Midlands", "East of England", "London", 
                "South East", "South West", "Northern Ireland", "Scotland", "Wales"), 
         main="Breakdown by Region", 
         color="paleturquoise")

fullCases$DG_LONDON <- ifelse(DG_REGION == "E12000007", 1, 0)
form<-rbind(form, c("DG_LONDON", "DUM"))
detach(fullCases); attach(fullCases)
drop <- cbind("DG_REGION", drop)

Origin

The origin variable has 12 values, less than 0.5% missing. People from Pakistan apparently tend to be more disappointed. 1684 individuals are not really worth the dummy, are they? So that’s this off the table

misobs(DG_ORIGIN)
## 0% of observations missing
sharebox(WB_SATIS, DG_ORIGIN,
         cts=c("China", "Hong Kong", "India", "ROI", "Pakistan", "Poland", "England", "NI", "Scotland", "Wales",
               "Other UK", "Other"), 
         main="Breakdown by Origin", 
         color="firebrick")

detach(fullCases); attach(fullCases)
drop <- cbind("DG_ORIGIN", drop)

Ethnicity

No missing, that’s good. 92% are white, which kinda sucks, though.
None of the groups are weird enough to be worth a dummy. But because nobody seems to be significantly happier than the whites, non-white people will receive a dummy.

misobs(DG_ETH)
## 0% of observations missing
sharebox(WB_SATIS, DG_ETH,
         cts=c("White", "Mixed", "Indian", "Pakistani", "Bangladeshi", "Chinese", "Other Asian", "Black", "Other"),
         main = "Breakdown by Ethnicity",
         color="plum")

fullCases$DG_NONW<-ifelse(DG_ETH==1,0,1)
detach(fullCases); attach(fullCases)
form<-rbind(form, c("DG_NONW", "DUM"))
drop <- cbind("DG_ETH", drop) 

Smoker Status

We are only interested in people who currently smoke, as people who claim to have smoked are either in denial, or as happy as non-smoker

misobs(DG_ORIGIN)
## 0% of observations missing
sharebox(WB_SATIS, HE_SMOKpd, 
         main="Breakdown by Smoking Status", 
         cts=c("Smoker", "Ex-smoker", "Non-smoker"), 
         color="navajowhite")

fullCases$HE_SMOKER <- ifelse(HE_SMOKpd ==1, 1, 0)
detach(fullCases); attach(fullCases)
form<-rbind(form, c("HE_SMOKER", "DUM"))
drop <- cbind("HE_SMOK", "HE_SMOKpd", "HE_SMOKever", drop)

The Drop

Aaaand now it is the right time to drop all of the original variables which have been recoded and look at what is left to work with and shall be incorporated into the model!

drop<-cbind(drop, "EC_PERMJOB", "EC_FTPTW", "EC_HOMEJOB", "EC_JOBSCTR", "EC_BNFTS", "AC_TYPE", "BS_PHONE", "BS_PROXY")
fullCases <- fullCases[,!names(fullCases) %in% drop]
rm(list="drop")
colnames(fullCases)
##  [1] "DG_AGE"          "DG_UKSINCE"      "EC_PPWK"        
##  [4] "EC_WORKHR"       "EC_BENFTS"       "WB_SATIS"       
##  [7] "WB_WORTH"        "WB_ANXIOUS"      "WB_HAPPY"       
## [10] "WB_SATIS5"       "WB_WORTH5"       "EC_PPWK_ECAC1"  
## [13] "EC_WORKHR_ECAC1" "DG_LOWEDU"       "EC_UNEMP"       
## [16] "EC_INAC"         "EC_UNEMP3Y"      "EC_UNEMP3M"     
## [19] "EC_ROUTJ"        "AC_RENT"         "HE_Q3"          
## [22] "HE_Q4"           "HE_Q5"           "DG_MALE"        
## [25] "DG_MAR_PART"     "DG_MAR_EXPART"   "DG_RELIGIOUS"   
## [28] "DG_LONDON"       "DG_NONW"         "HE_SMOKER"

LOGIT model

Now that all variables to be included are finally handled and ready, the first multinomial logit model can be estimated. All those to be used were accumulated in a variable.

cat(paste0(form[,1], ";"))
## WB_SATIS5; DG_AGE; EC_PPWK_ECAC1; EC_WORKHR_ECAC1; DG_LOWEDU; EC_UNEMP; EC_INAC; EC_UNEMP3Y; EC_UNEMP3M; EC_ROUTJ; AC_RENT; HE_Q3; HE_Q4; HE_Q5; DG_MALE; DG_MAR_PART; DG_MAR_EXPART; DG_RELIG; DG_LONDON; DG_NONW; HE_SMOKER;

Now, the predicted variable needs to be converted into factor and reference happiness level chosen.

fullCases$WB_SATIS5 <- as.factor(WB_SATIS5)
levels(fullCases$WB_SATIS5) <- c("0-4", "5-6", "7-8", "9-10")
fullCases$WB_SATIS5 = relevel(fullCases$WB_SATIS5, ref = "7-8")
detach(fullCases); attach(fullCases)

And the estimation of the model executed.

logit1<-multinom(WB_SATIS5~
                I(DG_AGE-16)+
                I((DG_AGE-16)^2)+
                I(DG_MALE*(DG_AGE-16))+
                I(DG_MALE*(DG_AGE-16)^2)+
                log(EC_PPWK_ECAC1)+
                EC_WORKHR_ECAC1+
                I(EC_WORKHR_ECAC1^2)+
                DG_LOWEDU+
                EC_UNEMP+
                EC_INAC+
                EC_UNEMP3M+
                EC_UNEMP3Y+
                EC_ROUTJ+
                AC_RENT+
                HE_Q3+
                HE_Q4+
                HE_Q5+
                DG_MALE+
                DG_MAR_PART+
                DG_MAR_EXPART+
                DG_RELIGIOUS+
                DG_LONDON+
                DG_NONW+
                HE_SMOKER,
                na.action = na.omit,
                maxit=1000)
logit1s<-summary(logit1)

The most appropriate way to display the model output is using package stargazer as done below. In addition to giving the output easily interpretable format, it calculates significance for each of the values, including standard error, t-statistic and p-value.

stargazer(logit1,
          title="Logit model: Subjective life satisfaction level",
          type="text",
          style="qje",
          keep.stat="chi2",
          order="Constant")
## 
## Logit model: Subjective life satisfaction level
## ====================================================================
##                                  0-4           5-6          9-10    
##                                  (1)           (2)          (3)     
## --------------------------------------------------------------------
## Constant                      -1.745***     -0.793***    -0.135***  
##                                (0.002)       (0.003)      (0.004)   
##                                                                     
## I(DG_AGE - 16)                0.068***      0.030***     -0.024***  
##                                (0.002)       (0.001)      (0.001)   
##                                                                     
## I((DG_AGE - 16)2)             -0.001***    -0.0004***    0.0004***  
##                               (0.00003)     (0.00002)    (0.00002)  
##                                                                     
## I(DG_MALE * (DG_AGE - 16))    -0.010***       0.002       -0.002**  
##                                (0.002)       (0.001)      (0.001)   
##                                                                     
## I(DG_MALE * (DG_AGE - 16)2)    0.0001*     -0.00005**     0.00003   
##                               (0.00004)     (0.00002)    (0.00002)  
##                                                                     
## log(EC_PPWK_ECAC1)            -0.556***     -0.332***    -0.064***  
##                                (0.002)       (0.010)      (0.008)   
##                                                                     
## EC_WORKHR_ECAC1               0.029***      0.028***     -0.014***  
##                                (0.002)       (0.003)      (0.002)   
##                                                                     
## I(EC_WORKHR_ECAC12)          -0.0002***    -0.0002***    0.0002***  
##                               (0.00004)     (0.00004)    (0.00003)  
##                                                                     
## DG_LOWEDU                     0.038***      0.102***      0.209***  
##                                (0.003)       (0.007)      (0.010)   
##                                                                     
## EC_UNEMP                      -1.165***     -0.721***    -0.800***  
##                                (0.003)       (0.007)      (0.003)   
##                                                                     
## EC_INAC                       -2.010***     -1.154***    -0.237***  
##                                (0.003)       (0.008)      (0.006)   
##                                                                     
## EC_UNEMP3M                    0.192***      0.251***      0.095***  
##                                (0.002)       (0.005)      (0.002)   
##                                                                     
## EC_UNEMP3Y                    0.305***      0.295***      0.291***  
##                               (0.0004)       (0.001)      (0.0004)  
##                                                                     
## EC_ROUTJ                      0.125***      0.170***      0.037***  
##                                (0.005)       (0.008)      (0.009)   
##                                                                     
## AC_RENT                       0.231***      0.230***      -0.016*   
##                                (0.008)       (0.007)      (0.009)   
##                                                                     
## HE_Q3                         1.065***      0.719***     -0.528***  
##                                (0.005)       (0.010)      (0.012)   
##                                                                     
## HE_Q4                         2.363***      1.317***     -0.689***  
##                                (0.004)       (0.005)      (0.001)   
##                                                                     
## HE_Q5                         3.300***      1.519***     -0.437***  
##                                (0.001)       (0.001)      (0.0003)  
##                                                                     
## DG_MALE                       0.326***      -0.076***    -0.114***  
##                                (0.001)       (0.001)      (0.001)   
##                                                                     
## DG_MAR_PART                   -0.584***     -0.366***     0.408***  
##                                (0.005)       (0.007)      (0.007)   
##                                                                     
## DG_MAR_EXPART                 0.200***      0.108***       0.001    
##                                (0.004)       (0.007)      (0.005)   
##                                                                     
## DG_RELIGIOUS                  -0.151***     -0.033***     0.130***  
##                                (0.004)       (0.011)      (0.010)   
##                                                                     
## DG_LONDON                     0.071***      0.062***     -0.162***  
##                                (0.001)       (0.002)      (0.002)   
##                                                                     
## DG_NONW                       0.357***      0.404***      0.057***  
##                                (0.001)       (0.003)      (0.002)   
##                                                                     
## HE_SMOKER                     0.529***      0.263***       -0.006   
##                                (0.006)       (0.008)      (0.010)   
##                                                                     
## ====================================================================
## Notes:                        ***Significant at the 1 percent level.
##                                **Significant at the 5 percent level.
##                                *Significant at the 10 percent level.

Using more direct approach, the code below also works for determining the p-value.

exp(logit1s$coefficients)
round(2*(1-pnorm(abs(logit1s$coefficients/logit1s$standard.errors),0,1)), 3)

the only issues with significance below 95% confidence level are following

  • Expartnered people have same likelihood of falling into 9-10 as those who are single, yet higher likelihood to be lower on the scale
  • People who pay rent are equally likely to fall into 9-1O as those who do not
  • Smokers are equally likely to fall into 9-10 as non smokers, but tend to fall lower on the scale

Benchmark probabilities

Based on the varianles used, following benchmark individual is assumed.

  • 16 years old
  • Single
  • Female
  • White
  • Atheist or Budhist
  • Lives out of London
  • 0 income
  • 0 hours worked weekly
  • At least GCSE
  • In employment
  • At least intermediate profession
  • Does not pay rent
  • Good health
  • Non-smoker

Solving benchmark probabilities for individual bins follows

p_8<-1/(1+sum(exp(logit1s$coefficients[,1])))
p_0<-p_8*exp(logit1s$coefficients[,1])
p<-c(p_0[1:2], "7-8"=p_8, p_0[3])
kable(t(t(round(p*100, 2))), col.names = "% likelihood")
% likelihood
0-4 6.99
5-6 18.09
7-8 39.98
9-10 34.95
rm(list=c("p_8", "p_0", "p"))

Assessment

Predictive ability

Now to assess how well the model classifies observations based on exogenous variables into specific bins of wellbeing level, a random sample test shall be conducted. Firstly, the relevant columns are taken from the source table, incomplete cases are removed, and 50,000 rows are selected at random. Then, the new column with estimated bin of WB_SATIS5 is added by selecting the category with highest probability based on the outcome of the LOGIT model for each of the sample observation. Finally, the differences between actual and predicted values are contrasted in the heatmap below. –Note: There was an issue with accurately predicting the level of satisfaction at 7 This issue was chosen to be eliminated by merging 7 and 8 into one level, forming more appropriate base for analysis.–

varsused<-c("WB_SATIS5", "DG_AGE", "EC_PPWK_ECAC1", "EC_WORKHR_ECAC1", 
            "DG_LOWEDU", "EC_UNEMP", "EC_INAC", "EC_UNEMP3M",
            "EC_UNEMP3Y", "EC_ROUTJ", "AC_RENT", "HE_Q3",
            "HE_Q4", "HE_Q5", "DG_MALE", "DG_MAR_PART",
            "DG_MAR_EXPART", "DG_RELIGIOUS", "DG_LONDON",
            "DG_NONW", "HE_SMOKER")

stest <- fullCases[,names(fullCases) %in% varsused]
stest <- stest[complete.cases(stest),]
stest <- stest[sample(nrow(stest), 50000),]

stest$WB_PRED<-predict(logit1, type="class", newdata=stest[,!names(stest)=="WB_SATIS5"])
stest$WB_SATIS5<-factor(stest$WB_SATIS5, c("0-4", "5-6", "7-8", "9-10"))
stest$WB_PRED<-factor(stest$WB_PRED, c("0-4", "5-6", "7-8", "9-10"))

heatmap.2(xtabs(~stest$WB_SATIS5 + stest$WB_PRED), #first is Y, second is X (rows, columns)
  cellnote = xtabs(~stest$WB_SATIS5 + stest$WB_PRED),
  ylab="Actual", #FIRST
  xlab="Predicted", #SECOND
  notecol="black",
  density.info="none",
  trace="none",
  col = rev(heat.colors(20)),
  dendrogram="none",
  Colv="NA",
  Rowv="NA",
  lwid=c(0.1,4.5), 
  lhei=c(0.1,4.5),
  key=FALSE,
  symm=TRUE)

cat(paste0("Misclassification error: ",
           round(100*mean(as.character(stest$WB_SATIS5) != as.character(stest$WB_PRED)), 2),
           "%"))#Misclassification error
## Misclassification error: 48.72%
par(mar=c(2,2,0.5,0), #margins: L&B 2 for axes, T 0.5 for title
    cex.axis=0.7, #reduce size of axis numbers 
    cex.main=0.8, #reduce title size
    mfrow=c(1,2))
MattHist(as.numeric(stest$WB_SATIS5), "turquoise", "Life Worthwhile: Sample", c(0.5,1.5,2.5,3.5,4.5))
MattHist(as.numeric(stest$WB_PRED), "turquoise", "Life Worthwhile: Predicted", c(0.5,1.5,2.5,3.5,4.5))

Visualization

Effect of age

Visualization of benchmark case with variable age, comparing male vs female probabilities of falling into individual categories. There is an indice suggesting higher likelihood of midlife crisis to occur in males.

Females

val <- data.frame(DG_AGE=0:99,
                EC_PPWK_ECAC1=1,
                EC_WORKHR_ECAC1=1,
                DG_LOWEDU=0,
                EC_UNEMP=0,
                EC_INAC=0,
                EC_UNEMP3M=0,
                EC_UNEMP3Y=0,
                EC_ROUTJ=0,
                AC_RENT=0,
                HE_Q3=0,
                HE_Q4=0,
                HE_Q5=0,
                DG_MALE=0,
                DG_MAR_PART=0,
                DG_MAR_EXPART=0,
                DG_RELIGIOUS=0,
                DG_LONDON=0,
                DG_NONW=0,
                HE_SMOKER=0)
pred1p<-predict(logit1, type="probs", newdata=val)
pred1c<-predict(logit1, type="class", newdata=val)
predpdat<-cbind(val, pred1p)

par(mar=c(4,4,0.5,2), #margins: L&B 2 for axes, T 0.5 for title
    oma=c(0,0,0,0),
    cex.axis=0.7, #reduce size of axis numbers 
    cex.main=0.8, #reduce title size
    mfrow=c(1,2))

plot(predpdat$DG_AGE, rep(0:1, 50),
     pch=16,
     xlab="Age",
     ylab="Absolute Probability",
     col="white",
     xaxs="i", 
     yaxs="i")

lines(predpdat$DG_AGE, predpdat$"9-10", lwd=2, col="grey80", lty=1)
lines(predpdat$DG_AGE, predpdat$"7-8", lwd=2, col="grey60", lty=2)
lines(predpdat$DG_AGE, predpdat$"5-6", lwd=2, col="grey40", lty=3)
lines(predpdat$DG_AGE, predpdat$"0-4", lwd=2, col="grey20", lty=4)

legend("top", c("9-10", "7-8", "5-6", "0-4"), 
       col = paste0("grey",seq(80,20,-20)),
       lty = 1:4, lwd=3,
       ncol=2, cex=0.8, title="Wellbeing Category")

plot(predpdat$DG_AGE, rep(0:1, 50),
     pch=16,
     xlab="Age",
     ylab="Aggregate Probability",
     col="white",
     xaxs="i", 
     yaxs="i")
polygon(c(0,predpdat$DG_AGE,99), c(0,predpdat$"5-6"+predpdat$"0-4"+predpdat$"7-8"+predpdat$"9-10",0), col="grey80")
polygon(c(0,predpdat$DG_AGE,99), c(0,predpdat$"5-6"+predpdat$"0-4"+predpdat$"7-8",0), col="grey60")
polygon(c(0,predpdat$DG_AGE,99), c(0,predpdat$"5-6"+predpdat$"0-4",0), col="grey40")
polygon(c(0,predpdat$DG_AGE,99), c(0,predpdat$"0-4",0), col="grey20")

Males

val <- data.frame(DG_AGE=0:99,
                EC_PPWK_ECAC1=1,
                EC_WORKHR_ECAC1=1,
                DG_LOWEDU=0,
                EC_UNEMP=0,
                EC_INAC=0,
                EC_UNEMP3M=0,
                EC_UNEMP3Y=0,
                EC_ROUTJ=0,
                AC_RENT=0,
                HE_Q3=0,
                HE_Q4=0,
                HE_Q5=0,
                DG_MALE=1,
                DG_MAR_PART=0,
                DG_MAR_EXPART=0,
                DG_RELIGIOUS=0,
                DG_LONDON=0,
                DG_NONW=0,
                HE_SMOKER=0)
pred1p<-predict(logit1, type="probs", newdata=val)
pred1c<-predict(logit1, type="class", newdata=val)
predpdat<-cbind(val, pred1p)

par(mar=c(4,4,0.5,2), #margins: L&B 2 for axes, T 0.5 for title
    oma=c(0,0,0,0),
    cex.axis=0.7, #reduce size of axis numbers 
    cex.main=0.8, #reduce title size
    mfrow=c(1,2))

plot(predpdat$DG_AGE, rep(0:1, 50),
     pch=16,
     xlab="Age",
     ylab="Absolute Probability",
     col="white",
     xaxs="i", 
     yaxs="i")

lines(predpdat$DG_AGE, predpdat$"9-10", lwd=2, col="grey80", lty=1)
lines(predpdat$DG_AGE, predpdat$"7-8", lwd=2, col="grey60", lty=2)
lines(predpdat$DG_AGE, predpdat$"5-6", lwd=2, col="grey40", lty=3)
lines(predpdat$DG_AGE, predpdat$"0-4", lwd=2, col="grey20", lty=4)

legend("top", c("9-10", "7-8", "5-6", "0-4"), 
       col = paste0("grey",seq(80,20,-20)),
       lty = 1:4, lwd=3,
       ncol=2, cex=0.8, title="Wellbeing Category")

plot(predpdat$DG_AGE, rep(0:1, 50),
     pch=16,
     xlab="Age",
     ylab="Aggregate Probability",
     col="white",
     xaxs="i", 
     yaxs="i")
polygon(c(0,predpdat$DG_AGE,99), c(0,predpdat$"5-6"+predpdat$"0-4"+predpdat$"7-8"+predpdat$"9-10",0), col="grey80")
polygon(c(0,predpdat$DG_AGE,99), c(0,predpdat$"5-6"+predpdat$"0-4"+predpdat$"7-8",0), col="grey60")
polygon(c(0,predpdat$DG_AGE,99), c(0,predpdat$"5-6"+predpdat$"0-4",0), col="grey40")
polygon(c(0,predpdat$DG_AGE,99), c(0,predpdat$"0-4",0), col="grey20")

Effect of hours worked

Visualization of benchmark case with variable hours worked

val <- data.frame(DG_AGE=16,
                EC_PPWK_ECAC1=1,
                EC_WORKHR_ECAC1=1:100,
                DG_LOWEDU=0,
                EC_UNEMP=0,
                EC_INAC=0,
                EC_UNEMP3M=0,
                EC_UNEMP3Y=0,
                EC_ROUTJ=0,
                AC_RENT=0,
                HE_Q3=0,
                HE_Q4=0,
                HE_Q5=0,
                DG_MALE=0,
                DG_MAR_PART=0,
                DG_MAR_EXPART=0,
                DG_RELIGIOUS=0,
                DG_LONDON=0,
                DG_NONW=0,
                HE_SMOKER=0)
pred1p<-predict(logit1, type="probs", newdata=val)
pred1c<-predict(logit1, type="class", newdata=val)
predpdat<-cbind(val, pred1p)

par(mar=c(4,4,0.5,2), #margins: L&B 2 for axes, T 0.5 for title
    oma=c(0,0,0,0),
    cex.axis=0.7, #reduce size of axis numbers 
    cex.main=0.8, #reduce title size
    mfrow=c(1,2))

plot(predpdat$EC_WORKHR_ECAC1, rep(0:1, 50),
     pch=16,
     xlab="Weekly Hours Worked",
     ylab="Absolute Probability",
     col="white",
     xaxs="i", 
     yaxs="i")

lines(predpdat$EC_WORKHR_ECAC1, predpdat$"9-10", lwd=2, col="grey80", lty=1)
lines(predpdat$EC_WORKHR_ECAC1, predpdat$"7-8", lwd=2, col="grey60", lty=2)
lines(predpdat$EC_WORKHR_ECAC1, predpdat$"5-6", lwd=2, col="grey40", lty=3)
lines(predpdat$EC_WORKHR_ECAC1, predpdat$"0-4", lwd=2, col="grey20", lty=4)

legend("top", c("9-10", "7-8", "5-6", "0-4"), 
       col = paste0("grey",seq(80,20,-20)),
       lty = 1:4, lwd=3,
       ncol=2, cex=0.8, title="Wellbeing Category")

plot(predpdat$EC_WORKHR_ECAC1, rep(0:1, 50),
     pch=16,
     xlab="Weekly Hours Worked",
     ylab="Aggregate Probability",
     col="white",
     xaxs="i", 
     yaxs="i")
polygon(c(1,predpdat$EC_WORKHR_ECAC1,100), c(0,predpdat$"5-6"+predpdat$"0-4"+predpdat$"7-8"+predpdat$"9-10",0), col="grey80")
polygon(c(1,predpdat$EC_WORKHR_ECAC1,100), c(0,predpdat$"5-6"+predpdat$"0-4"+predpdat$"7-8",0), col="grey60")
polygon(c(1,predpdat$EC_WORKHR_ECAC1,100), c(0,predpdat$"5-6"+predpdat$"0-4",0), col="grey40")
polygon(c(1,predpdat$EC_WORKHR_ECAC1,100), c(0,predpdat$"0-4",0), col="grey20")

###Effect of income Visualization of myself with variable pay

  • 22 years old
  • Single
  • Male
  • White
  • Atheist
  • Lives out of London
  • GBP 350 income
  • 20 hours worked weekly
  • Over GCSE
  • In employment
  • Intermediate profession
  • Rent payer
  • Good health
  • Non-smoker
val <- data.frame(DG_AGE=22,
                EC_PPWK_ECAC1=1:500,
                EC_WORKHR_ECAC1=20,
                DG_LOWEDU=0,
                EC_UNEMP=0,
                EC_INAC=0,
                EC_UNEMP3M=0,
                EC_UNEMP3Y=0,
                EC_ROUTJ=0,
                AC_RENT=1,
                HE_Q3=0,
                HE_Q4=0,
                HE_Q5=0,
                DG_MALE=1,
                DG_MAR_PART=0,
                DG_MAR_EXPART=0,
                DG_RELIGIOUS=0,
                DG_LONDON=0,
                DG_NONW=0,
                HE_SMOKER=0)
pred1p<-predict(logit1, type="probs", newdata=val)
pred1c<-predict(logit1, type="class", newdata=val)
predpdat<-cbind(val, pred1p)

par(mar=c(4,4,0.5,2), #margins: L&B 2 for axes, T 0.5 for title
    oma=c(0,0,0,0),
    cex.axis=0.7, #reduce size of axis numbers 
    cex.main=0.8, #reduce title size
    mfrow=c(1,2))

plot(predpdat$EC_PPWK_ECAC1, rep(0:1, 250),
     pch=16,
     xlab="Weekly Pay",
     ylab="Absolute Probability",
     col="white",
     xaxs="i", 
     yaxs="i")

lines(predpdat$EC_PPWK_ECAC1, predpdat$"9-10", lwd=2, col="grey80", lty=1)
lines(predpdat$EC_PPWK_ECAC1, predpdat$"7-8", lwd=2, col="grey60", lty=2)
lines(predpdat$EC_PPWK_ECAC1, predpdat$"5-6", lwd=2, col="grey40", lty=3)
lines(predpdat$EC_PPWK_ECAC1, predpdat$"0-4", lwd=2, col="grey20", lty=4)

legend("top", c("9-10", "7-8", "5-6", "0-4"), 
       col = paste0("grey",seq(80,20,-20)),
       lty = 1:4, lwd=3,
       ncol=2, cex=0.8, title="Wellbeing Category")

plot(predpdat$EC_PPWK_ECAC1, rep(0:1, 250),
     pch=16,
     xlab="Weekly Pay",
     ylab="Aggregate Probability",
     col="white",
     xaxs="i", 
     yaxs="i")
polygon(c(1,predpdat$EC_PPWK_ECAC1,500), c(0,predpdat$"5-6"+predpdat$"0-4"+predpdat$"7-8"+predpdat$"9-10",0), col="grey80")
polygon(c(1,predpdat$EC_PPWK_ECAC1,500), c(0,predpdat$"5-6"+predpdat$"0-4"+predpdat$"7-8",0), col="grey60")
polygon(c(1,predpdat$EC_PPWK_ECAC1,500), c(0,predpdat$"5-6"+predpdat$"0-4",0), col="grey40")
polygon(c(1,predpdat$EC_PPWK_ECAC1,500), c(0,predpdat$"0-4",0), col="grey20")

Self assessment

Looking at the previous chart, while also being told that I earn GBP 350 a week for my 20 hours of work, the reader can easily take an appropriate section and draw the pie chart below.

par(mar=c(0,0,1.5,0), #margins: L&B 2 for axes, T 0.5 for title
    cex.main=0.8)
val <- data.frame(DG_AGE=22,
                EC_PPWK_ECAC1=350,
                EC_WORKHR_ECAC1=20,
                DG_LOWEDU=0,
                EC_UNEMP=0,
                EC_INAC=0,
                EC_UNEMP3M=0,
                EC_UNEMP3Y=0,
                EC_ROUTJ=0,
                AC_RENT=1,
                HE_Q3=0,
                HE_Q4=0,
                HE_Q5=0,
                DG_MALE=1,
                DG_MAR_PART=0,
                DG_MAR_EXPART=0,
                DG_RELIGIOUS=0,
                DG_LONDON=0,
                DG_NONW=0,
                HE_SMOKER=0)
me<-predict(logit1, type="probs", newdata=val)
pie(me[order(me,decreasing=TRUE)],
        main="Author: life satisfaction\ncategory probabilities",
        labels=paste0(names(me[order(me,decreasing=TRUE)]), " - " ,round(100*me[order(me,decreasing=TRUE)]), "%"),
        clockwise = TRUE,
        init.angle = 0,
        col=paste0("grey",seq(80,20,-20)))

Conclusion

Will be added later