Our goal is to use this dataset to create a model that predicts happiness
require(ggplot2)
require(plyr)
require(stringr)
require(randomForest)
require(pander)
require(repmis)
happiness <- source_data('https://github.com/TheFedExpress/Data/blob/master/happiness.csv?raw=true')
Summary of Happiness Dataset (continued below)
| Length:17137 |
Min. :1994 |
Length:17137 |
Min. :17.00 |
Length:17137 |
| Class :character |
1st Qu.:1996 |
Class :character |
1st Qu.:33.00 |
Class :character |
| Mode :character |
Median :1998 |
Mode :character |
Median :43.00 |
Mode :character |
| NA |
Mean :1999 |
NA |
Mean :43.88 |
NA |
| NA |
3rd Qu.:2004 |
NA |
3rd Qu.:51.00 |
NA |
| NA |
Max. :2006 |
NA |
Max. :86.00 |
NA |
| NA |
NA |
NA |
NA’s :854 |
NA |
Table continues below
| Length:17137 |
Min. : 0.00 |
Length:17137 |
Min. :0.0000 |
Min. :0.0000 |
| Class :character |
1st Qu.:12.00 |
Class :character |
1st Qu.:0.0000 |
1st Qu.:0.0000 |
| Mode :character |
Median :13.00 |
Mode :character |
Median :0.0000 |
Median :0.0000 |
| NA |
Mean :13.32 |
NA |
Mean :0.2071 |
Mean :0.2557 |
| NA |
3rd Qu.:16.00 |
NA |
3rd Qu.:0.0000 |
3rd Qu.:0.0000 |
| NA |
Max. :20.00 |
NA |
Max. :6.0000 |
Max. :6.0000 |
| NA |
NA’s :44 |
NA |
NA’s :101 |
NA’s :101 |
Table continues below
| Min. :0.0000 |
Length:17137 |
Length:17137 |
Length:17137 |
| 1st Qu.:0.0000 |
Class :character |
Class :character |
Class :character |
| Median :0.0000 |
Mode :character |
Mode :character |
Mode :character |
| Mean :0.1814 |
NA |
NA |
NA |
| 3rd Qu.:0.0000 |
NA |
NA |
NA |
| Max. :7.0000 |
NA |
NA |
NA |
| NA’s :88 |
NA |
NA |
NA |
Table continues below
| Length:17137 |
Length:17137 |
Min. : 0.000 |
Min. :0.0000 |
Min. :0.000 |
| Class :character |
Class :character |
1st Qu.: 1.000 |
1st Qu.:0.0000 |
1st Qu.:0.000 |
| Mode :character |
Mode :character |
Median : 2.000 |
Median :0.0000 |
Median :1.000 |
| NA |
NA |
Mean : 2.904 |
Mean :0.3069 |
Mean :0.693 |
| NA |
NA |
3rd Qu.: 4.000 |
3rd Qu.:1.0000 |
3rd Qu.:1.000 |
| NA |
NA |
Max. :24.000 |
Max. :1.0000 |
Max. :1.000 |
| NA |
NA |
NA’s :5343 |
NA |
NA’s :5 |
Table continues below
| Min. :0.0000 |
Min. :0.000 |
Min. :0.0000 |
Min. :0.00000 |
Min. :0.000 |
| 1st Qu.:0.0000 |
1st Qu.:0.000 |
1st Qu.:0.0000 |
1st Qu.:0.00000 |
1st Qu.:0.000 |
| Median :0.0000 |
Median :1.000 |
Median :1.0000 |
Median :0.00000 |
Median :1.000 |
| Mean :0.1384 |
Mean :0.502 |
Mean :0.5591 |
Mean :0.08905 |
Mean :0.522 |
| 3rd Qu.:0.0000 |
3rd Qu.:1.000 |
3rd Qu.:1.0000 |
3rd Qu.:0.00000 |
3rd Qu.:1.000 |
| Max. :1.0000 |
Max. :1.000 |
Max. :1.0000 |
Max. :1.00000 |
Max. :1.000 |
| NA |
NA’s :15207 |
NA |
NA |
NA’s :13701 |
Table continues below
| Min. :0.000 |
Min. :0.0000 |
Min. :0.0000 |
Min. :0.0000 |
Min. :0.0000 |
| 1st Qu.:0.000 |
1st Qu.:0.0000 |
1st Qu.:0.0000 |
1st Qu.:0.0000 |
1st Qu.:0.0000 |
| Median :0.000 |
Median :0.0000 |
Median :0.0000 |
Median :0.0000 |
Median :0.0000 |
| Mean :0.285 |
Mean :0.1312 |
Mean :0.1737 |
Mean :0.1683 |
Mean :0.1637 |
| 3rd Qu.:1.000 |
3rd Qu.:0.0000 |
3rd Qu.:0.0000 |
3rd Qu.:0.0000 |
3rd Qu.:0.0000 |
| Max. :1.000 |
Max. :1.0000 |
Max. :1.0000 |
Max. :1.0000 |
Max. :1.0000 |
| NA’s :273 |
NA’s :273 |
NA |
NA |
NA |
| Min. :0.000 |
Min. :0.00000 |
Min. :0.00000 |
Min. :0.0000 |
Min. :0.000 |
| 1st Qu.:0.000 |
1st Qu.:0.00000 |
1st Qu.:0.00000 |
1st Qu.:0.0000 |
1st Qu.:0.000 |
| Median :0.000 |
Median :0.00000 |
Median :0.00000 |
Median :0.0000 |
Median :0.000 |
| Mean :0.162 |
Mean :0.07989 |
Mean :0.07802 |
Mean :0.1742 |
Mean :0.318 |
| 3rd Qu.:0.000 |
3rd Qu.:0.00000 |
3rd Qu.:0.00000 |
3rd Qu.:0.0000 |
3rd Qu.:1.000 |
| Max. :1.000 |
Max. :1.00000 |
Max. :1.00000 |
Max. :1.0000 |
Max. :1.000 |
| NA |
NA |
NA |
NA |
NA’s :5796 |
Create some basic histograms with the built in “hist” function to get a sense of the data:
hist(happiness$prestige)

hist(happiness$tvhours)

We create more involved plots to see how the continuous variables relate to happiness:
Note: While not quite continuous, we can treat them as such.
ggplot(data = happiness) + geom_density(aes(x = tvhours, color = happy)) + facet_wrap(~happy) + labs(title = "TVHOURS by Happiness")

ggplot(data = happiness) + geom_density(aes(x = tvhours, color = vhappy)) + facet_wrap(~vhappy) + labs(title = "TVHOURS by Vhappiness")

ggplot(happiness, aes(x = educ)) + geom_density() + labs(title = "Education Density")

ggplot(happiness, aes(x = educ, color = happy)) + geom_density() + facet_wrap(~happy) + labs(title = "Education by Happiness")

ggplot(happiness, aes(x = educ, color = vhappy)) + geom_density() + facet_wrap(~vhappy) + labs(title = "Education by Vhappiness")

ggplot(happiness, aes(y = tvhours, x= prestige, color = vhappy)) + geom_point() + facet_wrap(~vhappy) + labs(title = "TV Hours Vs Presige Faceted by Vhappiness")

ggplot(happiness, aes(y = educ, x = 1, color = vhappy)) + geom_boxplot() + facet_wrap(~vhappy) + labs(title = "Education by Vhappiness")

We can see that education is a little more right skewed for vhappy people, indicating more education can help happiness. Also, We see that there are slightly fewer vhappy people who watch 10+ hours of TV.
Perhaps subsetting and cleaning the data will help differentiate the distributions even further:
We want to subset the data for only working age people, so our variables offer more predictive power. (A teenager’s happiness isn’t likely to be affected by their income)
We also want to remove rows with NAs that can’t be adjusted for. We also choose the “vhappy” indicator variable over the “happy” ordinal variable, as it simplifies out model. This could wind up proving to be a mistake because the distibutions of our variables for “pretty happy”" and “very happy” look similar.
happiness_temp <- happiness[which(!happiness$workstat %in% c('school','other', 'retired') & !is.na(happiness$workstat)),]
happiness_temp <-happiness_temp[happiness_temp$teens == 0,]
happiness_temp <-happiness_temp[happiness_temp$preteen == 0,]
happiness_temp <- happiness_temp[!is.na(happiness_temp$prestige) & happiness_temp$prestige > 17,]
happiness_subset <-happiness_temp[!is.na(happiness_temp$income),c("workstat","divorce","educ",
"income", "regattend", "babies","tvhours","female","prestige","vhappy")]
len <- length(happiness_subset)
rows <-nrow(happiness_subset)
for (i in 1:len){
if (!is.numeric(happiness_subset[,i])){
happiness_subset[,i] <- as.character(happiness_subset[,i])
}
}
A small adjustment for “divorce”:
happiness_subset$divorce[happiness_subset$divorce == "yes"] <- "1"
happiness_subset$divorce <- as.numeric(happiness_subset$divorce)
happiness_subset[,7][is.na(happiness_subset[,7])] <- 0
for (i in 1:len){
if (is.numeric(happiness_subset[,i])){
happiness_subset[,i][is.na(happiness_subset[,i])] <- 0
}else
happiness_subset[,i][is.na(happiness_subset[,i])] <- "unknown"
}
pander(head(happiness_subset, 10), type = "grid", caption = "Sample of new dataset")
Sample of new dataset (continued below)
| 3 |
working fulltime |
0 |
12 |
$15000 - 19999 |
1 |
0 |
| 4 |
working fulltime |
1 |
8 |
$15000 - 19999 |
0 |
0 |
| 6 |
working parttime |
0 |
15 |
$25000 or more |
0 |
0 |
| 8 |
working fulltime |
0 |
12 |
$10000 - 14999 |
1 |
0 |
| 12 |
working fulltime |
0 |
19 |
$25000 or more |
0 |
0 |
| 13 |
working fulltime |
0 |
16 |
$25000 or more |
0 |
0 |
| 14 |
working fulltime |
0 |
16 |
$25000 or more |
0 |
0 |
| 15 |
working fulltime |
0 |
18 |
$25000 or more |
0 |
0 |
| 16 |
working fulltime |
0 |
16 |
$25000 or more |
0 |
0 |
| 18 |
keeping house |
0 |
16 |
$25000 or more |
0 |
2 |
| 3 |
1 |
1 |
29 |
0 |
| 4 |
3 |
0 |
42 |
0 |
| 6 |
0 |
1 |
43 |
0 |
| 8 |
4 |
0 |
44 |
1 |
| 12 |
2 |
1 |
75 |
0 |
| 13 |
2 |
0 |
51 |
1 |
| 14 |
0 |
0 |
50 |
1 |
| 15 |
1 |
0 |
73 |
0 |
| 16 |
1 |
1 |
41 |
0 |
| 18 |
3 |
1 |
59 |
1 |
Let’s find out the liklihood of happiness given some values of our variables:
pander(aggregate(vhappy ~ income, happiness_subset,mean), type = "grid")
| $1000 to 2999 |
0.1867 |
| $10000 - 14999 |
0.2302 |
| $15000 - 19999 |
0.2021 |
| $20000 - 24999 |
0.232 |
| $25000 or more |
0.333 |
| $3000 to 3999 |
0.2113 |
| $4000 to 4999 |
0.1719 |
| $5000 to 5999 |
0.2439 |
| $6000 to 6999 |
0.08571 |
| $7000 to 7999 |
0.1795 |
| $8000 to 9999 |
0.2126 |
| lt $1000 |
0.2603 |
pander(aggregate(vhappy ~ income + female, happiness_subset, mean), type = "grid")
| $1000 to 2999 |
0 |
0.1667 |
| $10000 - 14999 |
0 |
0.2028 |
| $15000 - 19999 |
0 |
0.1792 |
| $20000 - 24999 |
0 |
0.225 |
| $25000 or more |
0 |
0.313 |
| $3000 to 3999 |
0 |
0.1 |
| $4000 to 4999 |
0 |
0.2069 |
| $5000 to 5999 |
0 |
0.2222 |
| $6000 to 6999 |
0 |
0.1154 |
| $7000 to 7999 |
0 |
0.2308 |
| $8000 to 9999 |
0 |
0.2679 |
| lt $1000 |
0 |
0.2222 |
| $1000 to 2999 |
1 |
0.1961 |
| $10000 - 14999 |
1 |
0.2466 |
| $15000 - 19999 |
1 |
0.2194 |
| $20000 - 24999 |
1 |
0.238 |
| $25000 or more |
1 |
0.3538 |
| $3000 to 3999 |
1 |
0.2549 |
| $4000 to 4999 |
1 |
0.1429 |
| $5000 to 5999 |
1 |
0.2545 |
| $6000 to 6999 |
1 |
0.06818 |
| $7000 to 7999 |
1 |
0.1538 |
| $8000 to 9999 |
1 |
0.1864 |
| lt $1000 |
1 |
0.2826 |
pander(aggregate(vhappy ~ educ, happiness_subset,mean), type = "grid")
| 0 |
0.3 |
| 1 |
0 |
| 2 |
0.25 |
| 3 |
0.2308 |
| 4 |
0.4167 |
| 5 |
0.1429 |
| 6 |
0.122 |
| 7 |
0.425 |
| 8 |
0.2462 |
| 9 |
0.1951 |
| 10 |
0.2524 |
| 11 |
0.2317 |
| 12 |
0.2748 |
| 13 |
0.2705 |
| 14 |
0.2871 |
| 15 |
0.2794 |
| 16 |
0.347 |
| 17 |
0.3746 |
| 18 |
0.3325 |
| 19 |
0.329 |
| 20 |
0.4143 |
pander(aggregate(vhappy ~ tvhours, happiness_subset,mean), type = "grid")
| 0 |
0.3112 |
| 1 |
0.3376 |
| 2 |
0.2867 |
| 3 |
0.2871 |
| 4 |
0.2493 |
| 5 |
0.2754 |
| 6 |
0.2074 |
| 7 |
0.2564 |
| 8 |
0.2462 |
| 9 |
0 |
| 10 |
0.3023 |
| 11 |
0.6667 |
| 12 |
0.16 |
| 13 |
0 |
| 14 |
0.125 |
| 15 |
0.4 |
| 16 |
1 |
| 18 |
0 |
| 20 |
0.2 |
| 21 |
0 |
| 22 |
0 |
pander(aggregate(vhappy ~ female, happiness_subset,mean), type = "grid")
pander(aggregate(vhappy ~ babies, happiness_subset,mean), type = "grid")
| 0 |
0.2929 |
| 1 |
0.3036 |
| 2 |
0.3607 |
| 3 |
0.2826 |
| 4 |
0.4 |
| 6 |
0 |
pander(aggregate(vhappy ~ divorce, happiness_subset,mean), type = "grid")
pander(aggregate(vhappy ~ regattend, happiness_subset,mean), type = "grid")
pander(aggregate(vhappy ~ workstat, happiness_subset,mean), type = "grid")
| keeping house |
0.3333 |
| temp not working |
0.2927 |
| unempl, laid off |
0.1443 |
| working fulltime |
0.3015 |
| working parttime |
0.286 |
table = aggregate(vhappy ~ workstat, happiness_subset,mean)
ggplot(table, aes(x = workstat, y = vhappy)) + geom_bar(stat="identity") + labs(title = "Vhappy by Workstat")

The probabilities would probably pass a signficance test for independance, but nothing jumps out as an obvious predictor of happiness.
We also want to investiage the distributions of the continuous variables in our subset:
ggplot(happiness_subset, aes(x = tvhours, color = vhappy)) + geom_density() + facet_wrap(~vhappy) + labs(title = "Tvhours by Vhappiness")

ggplot(happiness_subset, aes(x = educ, color = vhappy)) + geom_density() + facet_wrap(~vhappy) + labs(title = "Education by Vhappiness")

ggplot(happiness_subset, aes(x = prestige)) + geom_histogram() + facet_wrap(~vhappy) + labs(title = "Age by Vhappiness")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(happiness_subset, aes(y = educ, x = 1, color = vhappy)) + geom_boxplot() + facet_wrap(~vhappy) + labs(title = "Education by Vhappiness")

It looks like we will not be keeping tvhours, but education looks somewhat promising, though not much better, if any, than the full dataset.
We build and implement our model:
We use random forest because it is powerful right out of the box. We choose variables based on the above summary statistics and plots.
train <- happiness_transform[1:6000,c(1,3,4,5,6,9,len)]
veryhappy_train <- as.character(happiness_transform[1:6000,len])
test <- happiness_transform[6001:rows,c(1,3,4,5,9,len)]
veryhappy_test <- as.character(happiness_transform[6001:rows,len])
fit <- randomForest(as.factor(vhappy) ~ educ + workstat + income + prestige + regattend,
data=train,
importance=TRUE,
ntree=2000)
predction <- predict(fit, test)
table(Predicted = predction, Actual = veryhappy_test)
## Actual
## Predicted 0 1
## 0 1477 600
## 1 32 22
Unfortunatley, our model did not do a very good job in predicting happiness. It put almost all people in the “not very happy” category. I suspect it found some happy people in the corner of space and made the division lines there. Because none of the variables found high numbers of happy people (like 80-90 percent), there were likely unhappy people mixed in, even for values of variables, that one would think would predict a happy person. Perhaps extending the income brackets higher would have helped the dataset. It is also possible we should have grouped “pretty happy” and “very happy” instead of “not too happy” and “pretty happy”. It does seem like we found variables that indicated if a person was more likely to be happy, so it is possible that through tuning our model, we could make better predictions. A model with soft decision boundries could work as well, but such thoughts are beyond the scope of this analysis.