Final Project

Our goal is to use this dataset to create a model that predicts happiness

require(ggplot2)
require(plyr)
require(stringr)
require(randomForest)
require(pander)
require(repmis)
happiness <- source_data('https://github.com/TheFedExpress/Data/blob/master/happiness.csv?raw=true')

Summary of Happiness Dataset (continued below)
V1	year	workstat	prestige	divorce
Length:17137	Min. :1994	Length:17137	Min. :17.00	Length:17137
Class :character	1st Qu.:1996	Class :character	1st Qu.:33.00	Class :character
Mode :character	Median :1998	Mode :character	Median :43.00	Mode :character
NA	Mean :1999	NA	Mean :43.88	NA
NA	3rd Qu.:2004	NA	3rd Qu.:51.00	NA
NA	Max. :2006	NA	Max. :86.00	NA
NA	NA	NA	NA’s :854	NA

Table continues below
widowed	educ	reg16	babies	preteen
Length:17137	Min. : 0.00	Length:17137	Min. :0.0000	Min. :0.0000
Class :character	1st Qu.:12.00	Class :character	1st Qu.:0.0000	1st Qu.:0.0000
Mode :character	Median :13.00	Mode :character	Median :0.0000	Median :0.0000
NA	Mean :13.32	NA	Mean :0.2071	Mean :0.2557
NA	3rd Qu.:16.00	NA	3rd Qu.:0.0000	3rd Qu.:0.0000
NA	Max. :20.00	NA	Max. :6.0000	Max. :6.0000
NA	NA’s :44	NA	NA’s :101	NA’s :101

Table continues below
teens	income	region	attend
Min. :0.0000	Length:17137	Length:17137	Length:17137
1st Qu.:0.0000	Class :character	Class :character	Class :character
Median :0.0000	Mode :character	Mode :character	Mode :character
Mean :0.1814	NA	NA	NA
3rd Qu.:0.0000	NA	NA	NA
Max. :7.0000	NA	NA	NA
NA’s :88	NA	NA	NA

Table continues below
happy	owngun	tvhours	vhappy	mothfath16
Length:17137	Length:17137	Min. : 0.000	Min. :0.0000	Min. :0.000
Class :character	Class :character	1st Qu.: 1.000	1st Qu.:0.0000	1st Qu.:0.000
Mode :character	Mode :character	Median : 2.000	Median :0.0000	Median :1.000
NA	NA	Mean : 2.904	Mean :0.3069	Mean :0.693
NA	NA	3rd Qu.: 4.000	3rd Qu.:1.0000	3rd Qu.:1.000
NA	NA	Max. :24.000	Max. :1.0000	Max. :1.000
NA	NA	NA’s :5343	NA	NA’s :5

Table continues below
black	gwbush04	female	blackfemale	gwbush00
Min. :0.0000	Min. :0.000	Min. :0.0000	Min. :0.00000	Min. :0.000
1st Qu.:0.0000	1st Qu.:0.000	1st Qu.:0.0000	1st Qu.:0.00000	1st Qu.:0.000
Median :0.0000	Median :1.000	Median :1.0000	Median :0.00000	Median :1.000
Mean :0.1384	Mean :0.502	Mean :0.5591	Mean :0.08905	Mean :0.522
3rd Qu.:0.0000	3rd Qu.:1.000	3rd Qu.:1.0000	3rd Qu.:0.00000	3rd Qu.:1.000
Max. :1.0000	Max. :1.000	Max. :1.0000	Max. :1.00000	Max. :1.000
NA	NA’s :15207	NA	NA	NA’s :13701

Table continues below
occattend	regattend	y94	y96	y98
Min. :0.000	Min. :0.0000	Min. :0.0000	Min. :0.0000	Min. :0.0000
1st Qu.:0.000	1st Qu.:0.0000	1st Qu.:0.0000	1st Qu.:0.0000	1st Qu.:0.0000
Median :0.000	Median :0.0000	Median :0.0000	Median :0.0000	Median :0.0000
Mean :0.285	Mean :0.1312	Mean :0.1737	Mean :0.1683	Mean :0.1637
3rd Qu.:1.000	3rd Qu.:0.0000	3rd Qu.:0.0000	3rd Qu.:0.0000	3rd Qu.:0.0000
Max. :1.000	Max. :1.0000	Max. :1.0000	Max. :1.0000	Max. :1.0000
NA’s :273	NA’s :273	NA	NA	NA

y00	y02	y04	y06	unem10
Min. :0.000	Min. :0.00000	Min. :0.00000	Min. :0.0000	Min. :0.000
1st Qu.:0.000	1st Qu.:0.00000	1st Qu.:0.00000	1st Qu.:0.0000	1st Qu.:0.000
Median :0.000	Median :0.00000	Median :0.00000	Median :0.0000	Median :0.000
Mean :0.162	Mean :0.07989	Mean :0.07802	Mean :0.1742	Mean :0.318
3rd Qu.:0.000	3rd Qu.:0.00000	3rd Qu.:0.00000	3rd Qu.:0.0000	3rd Qu.:1.000
Max. :1.000	Max. :1.00000	Max. :1.00000	Max. :1.0000	Max. :1.000
NA	NA	NA	NA	NA’s :5796

Create some basic histograms with the built in “hist” function to get a sense of the data:

hist(happiness$prestige)

hist(happiness$tvhours)

We create more involved plots to see how the continuous variables relate to happiness:

Note: While not quite continuous, we can treat them as such.

ggplot(data = happiness) + geom_density(aes(x = tvhours, color = happy)) + facet_wrap(~happy) + labs(title = "TVHOURS by Happiness")

ggplot(data = happiness) + geom_density(aes(x = tvhours, color = vhappy)) + facet_wrap(~vhappy) + labs(title = "TVHOURS by Vhappiness")

ggplot(happiness, aes(x = educ)) + geom_density() + labs(title = "Education Density")

ggplot(happiness, aes(x = educ, color = happy)) + geom_density() + facet_wrap(~happy) + labs(title = "Education by Happiness")

ggplot(happiness, aes(x = educ, color = vhappy)) + geom_density() + facet_wrap(~vhappy) + labs(title = "Education by Vhappiness")

ggplot(happiness, aes(y = tvhours, x= prestige, color = vhappy)) + geom_point() + facet_wrap(~vhappy) + labs(title = "TV Hours Vs Presige Faceted by Vhappiness")

ggplot(happiness, aes(y = educ, x = 1, color = vhappy)) + geom_boxplot() + facet_wrap(~vhappy) + labs(title = "Education by Vhappiness")

We can see that education is a little more right skewed for vhappy people, indicating more education can help happiness. Also, We see that there are slightly fewer vhappy people who watch 10+ hours of TV.

Perhaps subsetting and cleaning the data will help differentiate the distributions even further:

We want to subset the data for only working age people, so our variables offer more predictive power. (A teenager’s happiness isn’t likely to be affected by their income)

We also want to remove rows with NAs that can’t be adjusted for. We also choose the “vhappy” indicator variable over the “happy” ordinal variable, as it simplifies out model. This could wind up proving to be a mistake because the distibutions of our variables for “pretty happy”" and “very happy” look similar.

happiness_temp <- happiness[which(!happiness$workstat %in% c('school','other', 'retired') & !is.na(happiness$workstat)),]
happiness_temp <-happiness_temp[happiness_temp$teens == 0,]
happiness_temp <-happiness_temp[happiness_temp$preteen == 0,]
happiness_temp <- happiness_temp[!is.na(happiness_temp$prestige) & happiness_temp$prestige > 17,]
happiness_subset <-happiness_temp[!is.na(happiness_temp$income),c("workstat","divorce","educ",
                      "income", "regattend", "babies","tvhours","female","prestige","vhappy")]

len <- length(happiness_subset)
rows <-nrow(happiness_subset)

for (i in 1:len){
  if (!is.numeric(happiness_subset[,i])){
    happiness_subset[,i] <- as.character(happiness_subset[,i])
  }
}

A small adjustment for “divorce”:

happiness_subset$divorce[happiness_subset$divorce == "yes"] <- "1"
happiness_subset$divorce <- as.numeric(happiness_subset$divorce)

happiness_subset[,7][is.na(happiness_subset[,7])] <- 0
for (i in 1:len){
  if (is.numeric(happiness_subset[,i])){
    happiness_subset[,i][is.na(happiness_subset[,i])] <- 0
  }else
    happiness_subset[,i][is.na(happiness_subset[,i])] <- "unknown"
}
pander(head(happiness_subset, 10), type = "grid", caption = "Sample of new dataset")

Sample of new dataset (continued below)
	workstat	divorce	educ	income	regattend	babies
3	working fulltime	0	12	$15000 - 19999	1	0
4	working fulltime	1	8	$15000 - 19999	0	0
6	working parttime	0	15	$25000 or more	0	0
8	working fulltime	0	12	$10000 - 14999	1	0
12	working fulltime	0	19	$25000 or more	0	0
13	working fulltime	0	16	$25000 or more	0	0
14	working fulltime	0	16	$25000 or more	0	0
15	working fulltime	0	18	$25000 or more	0	0
16	working fulltime	0	16	$25000 or more	0	0
18	keeping house	0	16	$25000 or more	0	2

	tvhours	female	prestige	vhappy
3	1	1	29	0
4	3	0	42	0
6	0	1	43	0
8	4	0	44	1
12	2	1	75	0
13	2	0	51	1
14	0	0	50	1
15	1	0	73	0
16	1	1	41	0
18	3	1	59	1

Let’s find out the liklihood of happiness given some values of our variables:

pander(aggregate(vhappy ~ income, happiness_subset,mean), type = "grid")

income	vhappy
$1000 to 2999	0.1867
$10000 - 14999	0.2302
$15000 - 19999	0.2021
$20000 - 24999	0.232
$25000 or more	0.333
$3000 to 3999	0.2113
$4000 to 4999	0.1719
$5000 to 5999	0.2439
$6000 to 6999	0.08571
$7000 to 7999	0.1795
$8000 to 9999	0.2126
lt $1000	0.2603

pander(aggregate(vhappy ~ income + female, happiness_subset, mean), type = "grid")

income	female	vhappy
$1000 to 2999	0	0.1667
$10000 - 14999	0	0.2028
$15000 - 19999	0	0.1792
$20000 - 24999	0	0.225
$25000 or more	0	0.313
$3000 to 3999	0	0.1
$4000 to 4999	0	0.2069
$5000 to 5999	0	0.2222
$6000 to 6999	0	0.1154
$7000 to 7999	0	0.2308
$8000 to 9999	0	0.2679
lt $1000	0	0.2222
$1000 to 2999	1	0.1961
$10000 - 14999	1	0.2466
$15000 - 19999	1	0.2194
$20000 - 24999	1	0.238
$25000 or more	1	0.3538
$3000 to 3999	1	0.2549
$4000 to 4999	1	0.1429
$5000 to 5999	1	0.2545
$6000 to 6999	1	0.06818
$7000 to 7999	1	0.1538
$8000 to 9999	1	0.1864
lt $1000	1	0.2826

pander(aggregate(vhappy ~ educ, happiness_subset,mean), type = "grid")

educ	vhappy
0	0.3
1	0
2	0.25
3	0.2308
4	0.4167
5	0.1429
6	0.122
7	0.425
8	0.2462
9	0.1951
10	0.2524
11	0.2317
12	0.2748
13	0.2705
14	0.2871
15	0.2794
16	0.347
17	0.3746
18	0.3325
19	0.329
20	0.4143

pander(aggregate(vhappy ~ tvhours, happiness_subset,mean), type = "grid")

tvhours	vhappy
0	0.3112
1	0.3376
2	0.2867
3	0.2871
4	0.2493
5	0.2754
6	0.2074
7	0.2564
8	0.2462
9	0
10	0.3023
11	0.6667
12	0.16
13	0
14	0.125
15	0.4
16	1
18	0
20	0.2
21	0
22	0

pander(aggregate(vhappy ~ female, happiness_subset,mean), type = "grid")

female	vhappy
0	0.2846
1	0.3076

pander(aggregate(vhappy ~ babies, happiness_subset,mean), type = "grid")

babies	vhappy
0	0.2929
1	0.3036
2	0.3607
3	0.2826
4	0.4
6	0

pander(aggregate(vhappy ~ divorce, happiness_subset,mean), type = "grid")

divorce	vhappy
0	0.2845
1	0.383

pander(aggregate(vhappy ~ regattend, happiness_subset,mean), type = "grid")

regattend	vhappy
0	0.2824
1	0.4118

pander(aggregate(vhappy ~ workstat, happiness_subset,mean), type = "grid")

workstat	vhappy
keeping house	0.3333
temp not working	0.2927
unempl, laid off	0.1443
working fulltime	0.3015
working parttime	0.286

table = aggregate(vhappy ~ workstat, happiness_subset,mean)
ggplot(table, aes(x = workstat, y = vhappy)) + geom_bar(stat="identity") + labs(title = "Vhappy by Workstat")

The probabilities would probably pass a signficance test for independance, but nothing jumps out as an obvious predictor of happiness.

We also want to investiage the distributions of the continuous variables in our subset:

ggplot(happiness_subset, aes(x = tvhours, color = vhappy)) + geom_density() + facet_wrap(~vhappy) + labs(title = "Tvhours by Vhappiness")

ggplot(happiness_subset, aes(x = educ, color = vhappy)) + geom_density() + facet_wrap(~vhappy) + labs(title = "Education by Vhappiness")

ggplot(happiness_subset, aes(x = prestige)) + geom_histogram() + facet_wrap(~vhappy) + labs(title = "Age by Vhappiness")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(happiness_subset, aes(y = educ, x = 1, color = vhappy)) + geom_boxplot() + facet_wrap(~vhappy) + labs(title = "Education by Vhappiness")

It looks like we will not be keeping tvhours, but education looks somewhat promising, though not much better, if any, than the full dataset.

We now set the data up to be input into a predictive model:

Here we want to transform our categorical variables into numeric indicator variables. Income is categorical, but ordinal so we keep its order in tact here.

happiness_subset$workstat <-factor(happiness_subset$workstat)
workstats <- unique(happiness_subset$workstat)
happiness_transform <- happiness_subset
happiness_transform$workstat <- as.character(happiness_transform$workstat)
count = 0
for (stat in workstats){
  happiness_transform$workstat[happiness_transform$workstat == stat] <- count
  count <- count + 1
}
happiness_transform$income <- str_split(string = happiness_transform$income, pattern = "to | - | or", n = 2)
happiness_transform$income <- sapply(happiness_transform$income, FUN = function (x) x[1])
happiness_transform$income <- str_extract(string = happiness_transform$income, "\\d{4,5}")
happiness_transform$income <- as.numeric(happiness_transform$income)
incomes <- unique(happiness_transform$income)
income_sorted <- sort(incomes)

count = 0
for (stat in income_sorted){
  happiness_transform$income[happiness_transform$income == stat] <- count
  count <- count + 1
}
happiness_transform <- as.matrix(as.data.frame(happiness_transform))
happiness_transform <- apply(happiness_transform, 2, as.numeric)

We build and implement our model:

We use random forest because it is powerful right out of the box. We choose variables based on the above summary statistics and plots.

train <- happiness_transform[1:6000,c(1,3,4,5,6,9,len)]
veryhappy_train <- as.character(happiness_transform[1:6000,len])
test <- happiness_transform[6001:rows,c(1,3,4,5,9,len)]
veryhappy_test <- as.character(happiness_transform[6001:rows,len])

fit <- randomForest(as.factor(vhappy) ~ educ + workstat + income + prestige + regattend,
                    data=train, 
                    importance=TRUE, 
                    ntree=2000)
predction <- predict(fit, test)
table(Predicted = predction, Actual = veryhappy_test)

##          Actual
## Predicted    0    1
##         0 1477  600
##         1   32   22

Unfortunatley, our model did not do a very good job in predicting happiness. It put almost all people in the “not very happy” category. I suspect it found some happy people in the corner of space and made the division lines there. Because none of the variables found high numbers of happy people (like 80-90 percent), there were likely unhappy people mixed in, even for values of variables, that one would think would predict a happy person. Perhaps extending the income brackets higher would have helped the dataset. It is also possible we should have grouped “pretty happy” and “very happy” instead of “not too happy” and “pretty happy”. It does seem like we found variables that indicated if a person was more likely to be happy, so it is possible that through tuning our model, we could make better predictions. A model with soft decision boundries could work as well, but such thoughts are beyond the scope of this analysis.

	tvhours	female	prestige	vhappy
3	1	1	29	0
4	3	0	42	0
6	0	1	43	0
8	4	0	44	1
12	2	1	75	0
13	2	0	51	1
14	0	0	50	1
15	1	0	73	0
16	1	1	41	0
18	3	1	59	1

	tvhours	female	prestige	vhappy
3	1	1	29	0
4	3	0	42	0
6	0	1	43	0
8	4	0	44	1
12	2	1	75	0
13	2	0	51	1
14	0	0	50	1
15	1	0	73	0
16	1	1	41	0
18	3	1	59	1

	tvhours	female	prestige	vhappy
3	1	1	29	0
4	3	0	42	0
6	0	1	43	0
8	4	0	44	1
12	2	1	75	0
13	2	0	51	1
14	0	0	50	1
15	1	0	73	0
16	1	1	41	0
18	3	1	59	1