Our goal is to use this dataset to create a model that predicts happiness

require(ggplot2)
require(plyr)
require(stringr)
require(randomForest)
require(pander)
require(repmis)
happiness <- source_data('https://github.com/TheFedExpress/Data/blob/master/happiness.csv?raw=true')
Summary of Happiness Dataset (continued below)
V1 year workstat prestige divorce
Length:17137 Min. :1994 Length:17137 Min. :17.00 Length:17137
Class :character 1st Qu.:1996 Class :character 1st Qu.:33.00 Class :character
Mode :character Median :1998 Mode :character Median :43.00 Mode :character
NA Mean :1999 NA Mean :43.88 NA
NA 3rd Qu.:2004 NA 3rd Qu.:51.00 NA
NA Max. :2006 NA Max. :86.00 NA
NA NA NA NA’s :854 NA
Table continues below
widowed educ reg16 babies preteen
Length:17137 Min. : 0.00 Length:17137 Min. :0.0000 Min. :0.0000
Class :character 1st Qu.:12.00 Class :character 1st Qu.:0.0000 1st Qu.:0.0000
Mode :character Median :13.00 Mode :character Median :0.0000 Median :0.0000
NA Mean :13.32 NA Mean :0.2071 Mean :0.2557
NA 3rd Qu.:16.00 NA 3rd Qu.:0.0000 3rd Qu.:0.0000
NA Max. :20.00 NA Max. :6.0000 Max. :6.0000
NA NA’s :44 NA NA’s :101 NA’s :101
Table continues below
teens income region attend
Min. :0.0000 Length:17137 Length:17137 Length:17137
1st Qu.:0.0000 Class :character Class :character Class :character
Median :0.0000 Mode :character Mode :character Mode :character
Mean :0.1814 NA NA NA
3rd Qu.:0.0000 NA NA NA
Max. :7.0000 NA NA NA
NA’s :88 NA NA NA
Table continues below
happy owngun tvhours vhappy mothfath16
Length:17137 Length:17137 Min. : 0.000 Min. :0.0000 Min. :0.000
Class :character Class :character 1st Qu.: 1.000 1st Qu.:0.0000 1st Qu.:0.000
Mode :character Mode :character Median : 2.000 Median :0.0000 Median :1.000
NA NA Mean : 2.904 Mean :0.3069 Mean :0.693
NA NA 3rd Qu.: 4.000 3rd Qu.:1.0000 3rd Qu.:1.000
NA NA Max. :24.000 Max. :1.0000 Max. :1.000
NA NA NA’s :5343 NA NA’s :5
Table continues below
black gwbush04 female blackfemale gwbush00
Min. :0.0000 Min. :0.000 Min. :0.0000 Min. :0.00000 Min. :0.000
1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.000
Median :0.0000 Median :1.000 Median :1.0000 Median :0.00000 Median :1.000
Mean :0.1384 Mean :0.502 Mean :0.5591 Mean :0.08905 Mean :0.522
3rd Qu.:0.0000 3rd Qu.:1.000 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:1.000
Max. :1.0000 Max. :1.000 Max. :1.0000 Max. :1.00000 Max. :1.000
NA NA’s :15207 NA NA NA’s :13701
Table continues below
occattend regattend y94 y96 y98
Min. :0.000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
Median :0.000 Median :0.0000 Median :0.0000 Median :0.0000 Median :0.0000
Mean :0.285 Mean :0.1312 Mean :0.1737 Mean :0.1683 Mean :0.1637
3rd Qu.:1.000 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.0000
Max. :1.000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
NA’s :273 NA’s :273 NA NA NA
y00 y02 y04 y06 unem10
Min. :0.000 Min. :0.00000 Min. :0.00000 Min. :0.0000 Min. :0.000
1st Qu.:0.000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.000
Median :0.000 Median :0.00000 Median :0.00000 Median :0.0000 Median :0.000
Mean :0.162 Mean :0.07989 Mean :0.07802 Mean :0.1742 Mean :0.318
3rd Qu.:0.000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.0000 3rd Qu.:1.000
Max. :1.000 Max. :1.00000 Max. :1.00000 Max. :1.0000 Max. :1.000
NA NA NA NA NA’s :5796

Create some basic histograms with the built in “hist” function to get a sense of the data:

hist(happiness$prestige)  

hist(happiness$tvhours)

We create more involved plots to see how the continuous variables relate to happiness:

Note: While not quite continuous, we can treat them as such.
ggplot(data = happiness) + geom_density(aes(x = tvhours, color = happy)) + facet_wrap(~happy) + labs(title = "TVHOURS by Happiness")

ggplot(data = happiness) + geom_density(aes(x = tvhours, color = vhappy)) + facet_wrap(~vhappy) + labs(title = "TVHOURS by Vhappiness")

ggplot(happiness, aes(x = educ)) + geom_density() + labs(title = "Education Density")

ggplot(happiness, aes(x = educ, color = happy)) + geom_density() + facet_wrap(~happy) + labs(title = "Education by Happiness")

ggplot(happiness, aes(x = educ, color = vhappy)) + geom_density() + facet_wrap(~vhappy) + labs(title = "Education by Vhappiness")

ggplot(happiness, aes(y = tvhours, x= prestige, color = vhappy)) + geom_point() + facet_wrap(~vhappy) + labs(title = "TV Hours Vs Presige Faceted by Vhappiness")

ggplot(happiness, aes(y = educ, x = 1, color = vhappy)) + geom_boxplot() + facet_wrap(~vhappy) + labs(title = "Education by Vhappiness")

We can see that education is a little more right skewed for vhappy people, indicating more education can help happiness. Also, We see that there are slightly fewer vhappy people who watch 10+ hours of TV.

Perhaps subsetting and cleaning the data will help differentiate the distributions even further:

We want to subset the data for only working age people, so our variables offer more predictive power. (A teenager’s happiness isn’t likely to be affected by their income)
We also want to remove rows with NAs that can’t be adjusted for. We also choose the “vhappy” indicator variable over the “happy” ordinal variable, as it simplifies out model. This could wind up proving to be a mistake because the distibutions of our variables for “pretty happy”" and “very happy” look similar.
happiness_temp <- happiness[which(!happiness$workstat %in% c('school','other', 'retired') & !is.na(happiness$workstat)),]
happiness_temp <-happiness_temp[happiness_temp$teens == 0,]
happiness_temp <-happiness_temp[happiness_temp$preteen == 0,]
happiness_temp <- happiness_temp[!is.na(happiness_temp$prestige) & happiness_temp$prestige > 17,]
happiness_subset <-happiness_temp[!is.na(happiness_temp$income),c("workstat","divorce","educ",
                      "income", "regattend", "babies","tvhours","female","prestige","vhappy")]

len <- length(happiness_subset)
rows <-nrow(happiness_subset)

for (i in 1:len){
  if (!is.numeric(happiness_subset[,i])){
    happiness_subset[,i] <- as.character(happiness_subset[,i])
  }
}

A small adjustment for “divorce”:

happiness_subset$divorce[happiness_subset$divorce == "yes"] <- "1"
happiness_subset$divorce <- as.numeric(happiness_subset$divorce)

happiness_subset[,7][is.na(happiness_subset[,7])] <- 0
for (i in 1:len){
  if (is.numeric(happiness_subset[,i])){
    happiness_subset[,i][is.na(happiness_subset[,i])] <- 0
  }else
    happiness_subset[,i][is.na(happiness_subset[,i])] <- "unknown"
}
pander(head(happiness_subset, 10), type = "grid", caption = "Sample of new dataset")
Sample of new dataset (continued below)
  workstat divorce educ income regattend babies
3 working fulltime 0 12 $15000 - 19999 1 0
4 working fulltime 1 8 $15000 - 19999 0 0
6 working parttime 0 15 $25000 or more 0 0
8 working fulltime 0 12 $10000 - 14999 1 0
12 working fulltime 0 19 $25000 or more 0 0
13 working fulltime 0 16 $25000 or more 0 0
14 working fulltime 0 16 $25000 or more 0 0
15 working fulltime 0 18 $25000 or more 0 0
16 working fulltime 0 16 $25000 or more 0 0
18 keeping house 0 16 $25000 or more 0 2
  tvhours female prestige vhappy
3 1 1 29 0
4 3 0 42 0
6 0 1 43 0
8 4 0 44 1
12 2 1 75 0
13 2 0 51 1
14 0 0 50 1
15 1 0 73 0
16 1 1 41 0
18 3 1 59 1

Let’s find out the liklihood of happiness given some values of our variables:

pander(aggregate(vhappy ~ income, happiness_subset,mean), type = "grid")
income vhappy
$1000 to 2999 0.1867
$10000 - 14999 0.2302
$15000 - 19999 0.2021
$20000 - 24999 0.232
$25000 or more 0.333
$3000 to 3999 0.2113
$4000 to 4999 0.1719
$5000 to 5999 0.2439
$6000 to 6999 0.08571
$7000 to 7999 0.1795
$8000 to 9999 0.2126
lt $1000 0.2603
pander(aggregate(vhappy ~ income + female, happiness_subset, mean), type = "grid")
income female vhappy
$1000 to 2999 0 0.1667
$10000 - 14999 0 0.2028
$15000 - 19999 0 0.1792
$20000 - 24999 0 0.225
$25000 or more 0 0.313
$3000 to 3999 0 0.1
$4000 to 4999 0 0.2069
$5000 to 5999 0 0.2222
$6000 to 6999 0 0.1154
$7000 to 7999 0 0.2308
$8000 to 9999 0 0.2679
lt $1000 0 0.2222
$1000 to 2999 1 0.1961
$10000 - 14999 1 0.2466
$15000 - 19999 1 0.2194
$20000 - 24999 1 0.238
$25000 or more 1 0.3538
$3000 to 3999 1 0.2549
$4000 to 4999 1 0.1429
$5000 to 5999 1 0.2545
$6000 to 6999 1 0.06818
$7000 to 7999 1 0.1538
$8000 to 9999 1 0.1864
lt $1000 1 0.2826
pander(aggregate(vhappy ~ educ, happiness_subset,mean), type = "grid")
educ vhappy
0 0.3
1 0
2 0.25
3 0.2308
4 0.4167
5 0.1429
6 0.122
7 0.425
8 0.2462
9 0.1951
10 0.2524
11 0.2317
12 0.2748
13 0.2705
14 0.2871
15 0.2794
16 0.347
17 0.3746
18 0.3325
19 0.329
20 0.4143
pander(aggregate(vhappy ~ tvhours, happiness_subset,mean), type = "grid")
tvhours vhappy
0 0.3112
1 0.3376
2 0.2867
3 0.2871
4 0.2493
5 0.2754
6 0.2074
7 0.2564
8 0.2462
9 0
10 0.3023
11 0.6667
12 0.16
13 0
14 0.125
15 0.4
16 1
18 0
20 0.2
21 0
22 0
pander(aggregate(vhappy ~ female, happiness_subset,mean), type = "grid")
female vhappy
0 0.2846
1 0.3076
pander(aggregate(vhappy ~ babies, happiness_subset,mean), type = "grid")
babies vhappy
0 0.2929
1 0.3036
2 0.3607
3 0.2826
4 0.4
6 0
pander(aggregate(vhappy ~ divorce, happiness_subset,mean), type = "grid")
divorce vhappy
0 0.2845
1 0.383
pander(aggregate(vhappy ~ regattend, happiness_subset,mean), type = "grid")
regattend vhappy
0 0.2824
1 0.4118
pander(aggregate(vhappy ~ workstat, happiness_subset,mean), type = "grid")
workstat vhappy
keeping house 0.3333
temp not working 0.2927
unempl, laid off 0.1443
working fulltime 0.3015
working parttime 0.286
table = aggregate(vhappy ~ workstat, happiness_subset,mean)
ggplot(table, aes(x = workstat, y = vhappy)) + geom_bar(stat="identity") + labs(title = "Vhappy by Workstat")

The probabilities would probably pass a signficance test for independance, but nothing jumps out as an obvious predictor of happiness.

We also want to investiage the distributions of the continuous variables in our subset:

ggplot(happiness_subset, aes(x = tvhours, color = vhappy)) + geom_density() + facet_wrap(~vhappy) + labs(title = "Tvhours by Vhappiness")

ggplot(happiness_subset, aes(x = educ, color = vhappy)) + geom_density() + facet_wrap(~vhappy) + labs(title = "Education by Vhappiness")

ggplot(happiness_subset, aes(x = prestige)) + geom_histogram() + facet_wrap(~vhappy) + labs(title = "Age by Vhappiness")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(happiness_subset, aes(y = educ, x = 1, color = vhappy)) + geom_boxplot() + facet_wrap(~vhappy) + labs(title = "Education by Vhappiness")

It looks like we will not be keeping tvhours, but education looks somewhat promising, though not much better, if any, than the full dataset.

We now set the data up to be input into a predictive model:

Here we want to transform our categorical variables into numeric indicator variables. Income is categorical, but ordinal so we keep its order in tact here.
happiness_subset$workstat <-factor(happiness_subset$workstat)
workstats <- unique(happiness_subset$workstat)
happiness_transform <- happiness_subset
happiness_transform$workstat <- as.character(happiness_transform$workstat)
count = 0
for (stat in workstats){
  happiness_transform$workstat[happiness_transform$workstat == stat] <- count
  count <- count + 1
}
happiness_transform$income <- str_split(string = happiness_transform$income, pattern = "to | - | or", n = 2)
happiness_transform$income <- sapply(happiness_transform$income, FUN = function (x) x[1])
happiness_transform$income <- str_extract(string = happiness_transform$income, "\\d{4,5}")
happiness_transform$income <- as.numeric(happiness_transform$income)
incomes <- unique(happiness_transform$income)
income_sorted <- sort(incomes)

count = 0
for (stat in income_sorted){
  happiness_transform$income[happiness_transform$income == stat] <- count
  count <- count + 1
}
happiness_transform <- as.matrix(as.data.frame(happiness_transform))
happiness_transform <- apply(happiness_transform, 2, as.numeric)

We build and implement our model:

We use random forest because it is powerful right out of the box. We choose variables based on the above summary statistics and plots.
train <- happiness_transform[1:6000,c(1,3,4,5,6,9,len)]
veryhappy_train <- as.character(happiness_transform[1:6000,len])
test <- happiness_transform[6001:rows,c(1,3,4,5,9,len)]
veryhappy_test <- as.character(happiness_transform[6001:rows,len])

fit <- randomForest(as.factor(vhappy) ~ educ + workstat + income + prestige + regattend,
                    data=train, 
                    importance=TRUE, 
                    ntree=2000)
predction <- predict(fit, test)
table(Predicted = predction, Actual = veryhappy_test)
##          Actual
## Predicted    0    1
##         0 1477  600
##         1   32   22
Unfortunatley, our model did not do a very good job in predicting happiness. It put almost all people in the “not very happy” category. I suspect it found some happy people in the corner of space and made the division lines there. Because none of the variables found high numbers of happy people (like 80-90 percent), there were likely unhappy people mixed in, even for values of variables, that one would think would predict a happy person. Perhaps extending the income brackets higher would have helped the dataset. It is also possible we should have grouped “pretty happy” and “very happy” instead of “not too happy” and “pretty happy”. It does seem like we found variables that indicated if a person was more likely to be happy, so it is possible that through tuning our model, we could make better predictions. A model with soft decision boundries could work as well, but such thoughts are beyond the scope of this analysis.