HRS Data Analysis (WIP)

Initial Idea: Looking at the correlation between income and cognition levels. For all analysis I used wave 4. The next step would be looking at the results across the waves 1-15.

HRS$H4ATOTB0<-HRS$H4ATOTB
HRS<- HRS[order(HRS$H4ATOTB),]


for(i in 1:21384){
  if(HRS$H4ATOTB0[i]<0) HRS$H4ATOTB0[i]<-0
}

for(i in 1:length(HRS$HHIDPN)){
  HRS$H4ATOTB0P[i]<-HRS$H4ATOTB0[i]/sum(HRS$H4ATOTB0,na.rm = TRUE)
}

HRS$H4ATOTB0I<-1:42053


#Here we are assiging all people with negative wealth as zero I am unsure of any complications as a result of this action.This allows for a nice scale of wealth with zero to 100 on the graphs if negative wealth remains then the total will be over 100% with entires creating -% of total wealth. Negative wealth is created by having more debt than assets. However there are questions that these negative wealth holders posess as one has about -4.3 million in debt. I belive that they are not really in poverty and they are stitting pretty but just curretly have lots of bills for some abnormal reason (No way to confirm tho). 


ggplot(data= HRS,aes(x= H4ATOTB0))+
  geom_histogram(fill= 'lightblue', color= 'black', breaks= c(seq(0,100000, by=10000)))+
  theme_minimal()+
  scale_x_continuous(breaks= c(seq(0,100000, by=10000)),labels=c(seq(0,90000, by=10000), "100000"))+
labs(x= 'Wealth ', title = 'Wealth distribution between 0-100K')

#calculating number of observations with over $1m total wealth


ggplot(data= HRS,aes(x= H4ATOTB0))+
  geom_histogram(fill= 'lightblue', color= 'black', breaks= c(seq(0,1000000, by=100000)))+
  theme_minimal()+
  scale_x_continuous(breaks= c(seq(0,1000000, by=100000)),labels=c(seq(0,900000, by=100000), "1000000"))+
  scale_y_continuous(breaks = c(seq(0,10000, by=1000)))+
  labs(x= 'Total Wealth', title = 'Wealth distribution between 0 and 1 million', caption = 'Note: There are 427 observations with over a million total wealth.' )

#Top 10% responsible for 56% of total wealth

Here are two graphs that display the distribution of wealth. Note that people with negative wealth were assigned a zero. This was discussed in the R code itself in the form of comments. Now moving on to looking at cognition.

#summary(HRS$H4ATOTB)            #missing 14,000 entries, could be probalamatic when comparing wealth level vs total cognition summary score. Could be more
#summary(HRS$S4MSTOT)            #Also age is not looked at here and possible plays a much bigger role. 
lm.out4<-lm(formula = S4MSTOT~H4ATOTB, data = HRS)
summary(lm.out4)                #Looks like wealth is repocaple for about 1% of the variation...

## 
## Call:
## lm(formula = S4MSTOT ~ H4ATOTB, data = HRS)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.5789  -1.0267   0.8942   1.8765   2.3389 
## 
## Coefficients:
##                   Estimate     Std. Error t value            Pr(>|t|)    
## (Intercept) 12.96633171939  0.02902036149   446.8 <0.0000000000000002 ***
## H4ATOTB      0.00000036045  0.00000003368    10.7 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.233 on 7479 degrees of freedom
##   (34572 observations deleted due to missingness)
## Multiple R-squared:  0.01508,    Adjusted R-squared:  0.01495 
## F-statistic: 114.5 on 1 and 7479 DF,  p-value: < 0.00000000000000022

ggplot(data = HRS,aes(y= H4ATOTB, x= S4MSTOT))+        
  geom_point()+ labs(title= 'Total Wealth and Cognition score', y= 'Wealth', x='score')

#sum(HRS$H4ATOTB0P[HRS$H4ATOTB0I<12500], na.rm = TRUE) #bottom 58% own 10% of total wealth 
#(12500/21384) 
#I am not 100% sure if all the assumption of the t.test so I would not be confident with the significant levels untl more research is done
#comparing the bottom 58% (10% of total wealth) vs top 42% (own 90% of total wealth) 
t.test(HRS$S4MSTOT[HRS$H4ATOTB0I>12500&HRS$R4AGEY_B>=66&HRS$R4AGEY_B<=70],HRS$S4MSTOT[HRS$H4ATOTB0I<=12500&HRS$R4AGEY_B>=66&HRS$R4AGEY_B<=70]) #1.18 diff

## 
##  Welch Two Sample t-test
## 
## data:  HRS$S4MSTOT[HRS$H4ATOTB0I > 12500 & HRS$R4AGEY_B >= 66 & HRS$R4AGEY_B <=  and HRS$S4MSTOT[HRS$H4ATOTB0I <= 12500 & HRS$R4AGEY_B >= 66 & HRS$R4AGEY_B <=     70] and     70]
## t = 10.44, df = 923.71, p-value < 0.00000000000000022
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.9584981 1.4022852
## sample estimates:
## mean of x mean of y 
##  13.77184  12.59145

t.test(HRS$S4MSTOT[HRS$H4ATOTB0I>12500&HRS$R4AGEY_B>=71&HRS$R4AGEY_B<=75],HRS$S4MSTOT[HRS$H4ATOTB0I<=12500&HRS$R4AGEY_B>=71&HRS$R4AGEY_B<=75]) #1.05 diff

## 
##  Welch Two Sample t-test
## 
## data:  HRS$S4MSTOT[HRS$H4ATOTB0I > 12500 & HRS$R4AGEY_B >= 71 & HRS$R4AGEY_B <=  and HRS$S4MSTOT[HRS$H4ATOTB0I <= 12500 & HRS$R4AGEY_B >= 71 & HRS$R4AGEY_B <=     75] and     75]
## t = 8.2875, df = 955.61, p-value = 0.0000000000000003888
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.8034161 1.3019607
## sample estimates:
## mean of x mean of y 
##  13.54424  12.49155

#t.test(HRS$S4MSTOT[HRS$H4ATOTB0I>12500&HRS$R4AGEY_B>=76&HRS$R4AGEY_B<=80],HRS$S4MSTOT[HRS$H4ATOTB0I<=12500&HRS$R4AGEY_B>=76&HRS$R4AGEY_B<=80]) #.80 diff
#t.test(HRS$S4MSTOT[HRS$H4ATOTB0I>12500&HRS$R4AGEY_B>=81&HRS$R4AGEY_B<=85],HRS$S4MSTOT[HRS$H4ATOTB0I<=12500&HRS$R4AGEY_B>=81&HRS$R4AGEY_B<=85]) #1.01 diff
#t.test(HRS$S4MSTOT[HRS$H4ATOTB0I>12500&HRS$R4AGEY_B>=86&HRS$R4AGEY_B<=90],HRS$S4MSTOT[HRS$H4ATOTB0I<=12500&HRS$R4AGEY_B>=86&HRS$R4AGEY_B<=90]) #.804 diff not significant at a 95% conf
#.test(HRS$S4MSTOT[HRS$H4ATOTB0I>12500&HRS$R4AGEY_B>=91&HRS$R4AGEY_B<=95],HRS$S4MSTOT[HRS$H4ATOTB0I<=12500&HRS$R4AGEY_B>=91&HRS$R4AGEY_B<=95]) #2.8415 sample size is about 30 

#bottom 10% vs top 90% 
t.test(HRS$S4MSTOT[HRS$H4ATOTB0I>2138&HRS$R4AGEY_B>=66&HRS$R4AGEY_B<=70],HRS$S4MSTOT[HRS$H4ATOTB0I<=2138&HRS$R4AGEY_B>=66&HRS$R4AGEY_B<=70]) #1.47 diff

## 
##  Welch Two Sample t-test
## 
## data:  HRS$S4MSTOT[HRS$H4ATOTB0I > 2138 & HRS$R4AGEY_B >= 66 & HRS$R4AGEY_B <=  and HRS$S4MSTOT[HRS$H4ATOTB0I <= 2138 & HRS$R4AGEY_B >= 66 & HRS$R4AGEY_B <=     70] and     70]
## t = 3.599, df = 39.359, p-value = 0.0008823
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.6469164 2.3060313
## sample estimates:
## mean of x mean of y 
##  13.32263  11.84615

t.test(HRS$S4MSTOT[HRS$H4ATOTB0I>2138&HRS$R4AGEY_B>=71&HRS$R4AGEY_B<=75],HRS$S4MSTOT[HRS$H4ATOTB0I<=2138&HRS$R4AGEY_B>=71&HRS$R4AGEY_B<=75]) #3.398 diff

## 
##  Welch Two Sample t-test
## 
## data:  HRS$S4MSTOT[HRS$H4ATOTB0I > 2138 & HRS$R4AGEY_B >= 71 & HRS$R4AGEY_B <=  and HRS$S4MSTOT[HRS$H4ATOTB0I <= 2138 & HRS$R4AGEY_B >= 71 & HRS$R4AGEY_B <=     75] and     75]
## t = 5.5449, df = 40.711, p-value = 0.000001956
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  2.160057 4.635710
## sample estimates:
## mean of x mean of y 
## 13.202762  9.804878

#t.test(HRS$S4MSTOT[HRS$H4ATOTB0I>2138&HRS$R4AGEY_B>=76&HRS$R4AGEY_B<=80],HRS$S4MSTOT[HRS$H4ATOTB0I<=2138&HRS$R4AGEY_B>=76&HRS$R4AGEY_B<=80]) # not sig
#t.test(HRS$S4MSTOT[HRS$H4ATOTB0I>2138&HRS$R4AGEY_B>=81&HRS$R4AGEY_B<=85],HRS$S4MSTOT[HRS$H4ATOTB0I<=2138&HRS$R4AGEY_B>=81&HRS$R4AGEY_B<=85]) #2.83 diff
#t.test(HRS$S4MSTOT[HRS$H4ATOTB0I>2138&HRS$R4AGEY_B>=86&HRS$R4AGEY_B<=90],HRS$S4MSTOT[HRS$H4ATOTB0I<=2138&HRS$R4AGEY_B>=86&HRS$R4AGEY_B<=90]) # not sig
#t.test(HRS$S4MSTOT[HRS$H4ATOTB0I>2138&HRS$R4AGEY_B>=91&HRS$R4AGEY_B<=95],HRS$S4MSTOT[HRS$H4ATOTB0I<=2138&HRS$R4AGEY_B>=91&HRS$R4AGEY_B<=95]) #not enough y observ
#top 10% vs bottom 10%
t.test(HRS$S4MSTOT[HRS$H4ATOTB0I>19201&HRS$R4AGEY_B>=66&HRS$R4AGEY_B<=70],HRS$S4MSTOT[HRS$H4ATOTB0I<=2138&HRS$R4AGEY_B>=66&HRS$R4AGEY_B<=70]) #2.17 diff

## 
##  Welch Two Sample t-test
## 
## data:  HRS$S4MSTOT[HRS$H4ATOTB0I > 19201 & HRS$R4AGEY_B >= 66 & HRS$R4AGEY_B <=  and HRS$S4MSTOT[HRS$H4ATOTB0I <= 2138 & HRS$R4AGEY_B >= 66 & HRS$R4AGEY_B <=     70] and     70]
## t = 5.2032, df = 41.747, p-value = 0.000005571
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  1.326032 3.006870
## sample estimates:
## mean of x mean of y 
##  14.01261  11.84615

t.test(HRS$S4MSTOT[HRS$H4ATOTB0I>19201&HRS$R4AGEY_B>=71&HRS$R4AGEY_B<=75],HRS$S4MSTOT[HRS$H4ATOTB0I<=2138&HRS$R4AGEY_B>=71&HRS$R4AGEY_B<=75]) #4.09 diff

## 
##  Welch Two Sample t-test
## 
## data:  HRS$S4MSTOT[HRS$H4ATOTB0I > 19201 & HRS$R4AGEY_B >= 71 & HRS$R4AGEY_B <=  and HRS$S4MSTOT[HRS$H4ATOTB0I <= 2138 & HRS$R4AGEY_B >= 71 & HRS$R4AGEY_B <=     75] and     75]
## t = 6.6085, df = 42.506, p-value = 0.00000005013
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  2.844053 5.343426
## sample estimates:
## mean of x mean of y 
## 13.898618  9.804878

#t.test(HRS$S4MSTOT[HRS$H4ATOTB0I>19201&HRS$R4AGEY_B>=76&HRS$R4AGEY_B<=80],HRS$S4MSTOT[HRS$H4ATOTB0I<=2138&HRS$R4AGEY_B>=76&HRS$R4AGEY_B<=80]) # not sig lack ofob
#t.test(HRS$S4MSTOT[HRS$H4ATOTB0I>19201&HRS$R4AGEY_B>=81&HRS$R4AGEY_B<=85],HRS$S4MSTOT[HRS$H4ATOTB0I<=2138&HRS$R4AGEY_B>=81&HRS$R4AGEY_B<=85]) #3.26 diff
#t.test(HRS$S4MSTOT[HRS$H4ATOTB0I>19201&HRS$R4AGEY_B>=86&HRS$R4AGEY_B<=90],HRS$S4MSTOT[HRS$H4ATOTB0I<=2138&HRS$R4AGEY_B>=86&HRS$R4AGEY_B<=90]) # not sig lack ofob
#t.test(HRS$S4MSTOT[HRS$H4ATOTB0I>19201&HRS$R4AGEY_B>=91&HRS$R4AGEY_B<=95],HRS$S4MSTOT[HRS$H4ATOTB0I<=2138&HRS$R4AGEY_B>=91&HRS$R4AGEY_B<=95]) # not sig 'lack ob

#top 20% vs bottom 20% (4276.8 vs )
t.test(HRS$S4MSTOT[HRS$H4ATOTB0I>17107&HRS$R4AGEY_B>=66&HRS$R4AGEY_B<=70],HRS$S4MSTOT[HRS$H4ATOTB0I<=4277&HRS$R4AGEY_B>=66&HRS$R4AGEY_B<=70]) #2.32 diff

## 
##  Welch Two Sample t-test
## 
## data:  HRS$S4MSTOT[HRS$H4ATOTB0I > 17107 & HRS$R4AGEY_B >= 66 & HRS$R4AGEY_B <=  and HRS$S4MSTOT[HRS$H4ATOTB0I <= 4277 & HRS$R4AGEY_B >= 66 & HRS$R4AGEY_B <=     70] and     70]
## t = 7.9748, df = 105.96, p-value = 0.000000000001895
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  1.709517 2.840758
## sample estimates:
## mean of x mean of y 
##  13.96264  11.68750

t.test(HRS$S4MSTOT[HRS$H4ATOTB0I>17107&HRS$R4AGEY_B>=71&HRS$R4AGEY_B<=75],HRS$S4MSTOT[HRS$H4ATOTB0I<=4277&HRS$R4AGEY_B>=71&HRS$R4AGEY_B<=75]) #2.24 diff

## 
##  Welch Two Sample t-test
## 
## data:  HRS$S4MSTOT[HRS$H4ATOTB0I > 17107 & HRS$R4AGEY_B >= 71 & HRS$R4AGEY_B <=  and HRS$S4MSTOT[HRS$H4ATOTB0I <= 4277 & HRS$R4AGEY_B >= 71 & HRS$R4AGEY_B <=     75] and     75]
## t = 6.3336, df = 112.06, p-value = 0.000000005126
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  1.548303 2.958035
## sample estimates:
## mean of x mean of y 
##  13.71395  11.46078

#t.test(HRS$S4MSTOT[HRS$H4ATOTB0I>17107&HRS$R4AGEY_B>=76&HRS$R4AGEY_B<=80],HRS$S4MSTOT[HRS$H4ATOTB0I<=4277&HRS$R4AGEY_B>=76&HRS$R4AGEY_B<=80]) # 1.38 diff
#t.test(HRS$S4MSTOT[HRS$H4ATOTB0I>17107&HRS$R4AGEY_B>=81&HRS$R4AGEY_B<=85],HRS$S4MSTOT[HRS$H4ATOTB0I<=4277&HRS$R4AGEY_B>=81&HRS$R4AGEY_B<=85]) #3.26 diff
#t.test(HRS$S4MSTOT[HRS$H4ATOTB0I>17107&HRS$R4AGEY_B>=86&HRS$R4AGEY_B<=90],HRS$S4MSTOT[HRS$H4ATOTB0I<=4277&HRS$R4AGEY_B>=86&HRS$R4AGEY_B<=90]) # Not sig 
#t.test(HRS$S4MSTOT[HRS$H4ATOTB0I>17107&HRS$R4AGEY_B>=91&HRS$R4AGEY_B<=95],HRS$S4MSTOT[HRS$H4ATOTB0I<=4277&HRS$R4AGEY_B>=91&HRS$R4AGEY_B<=95])

This chunk of code shows a graph of an individuals wealth and their cognition score. I would not put to much weight on the graph as age is not addressed like it is in the T-Test that show the difference in means. However, this does not account for education level. EI higher educated people have better cognitive functions and better paying jobs. Might need to break down that as well. I do not believe there is much left to go on here the next step would to be filter for education, current health afflictions etc… etc… It might be worth looking at commonalities between low cognition scores and higher ones and researching more along those lines

Looking at the code book life expectations stood out to me as an interesting variable and I wondered if income could play a role in ones expectations to live to 75

#--------Looking at how income might effect life expectations------------# 

#summary(HRS$R4LIV75)
#summary(HRS$R4LIV75P)
#summary(HRS$R4EFFORT)


t.test(HRS$R4LIV75[HRS$H4ATOTB0I>17107],HRS$R4LIV75[HRS$H4ATOTB0I<=4277])

## 
##  Welch Two Sample t-test
## 
## data:  HRS$R4LIV75[HRS$H4ATOTB0I > 17107] and HRS$R4LIV75[HRS$H4ATOTB0I <= 4277]
## t = 14.166, df = 2972.5, p-value < 0.00000000000000022
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  12.04113 15.90981
## sample estimates:
## mean of x mean of y 
##  72.20237  58.22690

t.test(HRS$R4LIV75P[HRS$H4ATOTB0I>17107],HRS$R4LIV75P[HRS$H4ATOTB0I<=4277])

## 
##  Welch Two Sample t-test
## 
## data:  HRS$R4LIV75P[HRS$H4ATOTB0I > 17107] and HRS$R4LIV75P[HRS$H4ATOTB0I <= 4277]
## t = -7.688, df = 8502.8, p-value = 0.00000000000001662
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.615429 -1.552667
## sample estimates:
## mean of x mean of y 
##  81.56536  83.64941

#here we can see that the top 20% think that they will live longer and the bottom 20% have a lower expectations.
#Looking at the actual probabilities both are much closer then there actual expectations.

t.test(HRS$R4AGEY_B[HRS$H4ATOTB0I<=4277],HRS$R4AGEY_B[HRS$H4ATOTB0I>17107])

## 
##  Welch Two Sample t-test
## 
## data:  HRS$R4AGEY_B[HRS$H4ATOTB0I <= 4277] and HRS$R4AGEY_B[HRS$H4ATOTB0I > 17107]
## t = 6.6165, df = 8147.4, p-value = 0.00000000003909
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  1.138776 2.097605
## sample estimates:
## mean of x mean of y 
##  67.25742  65.63923

summary(HRS$R4AGEY_B[HRS$H4ATOTB0I<=4277])

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   27.00   58.00   65.00   67.26   77.00  105.00

summary(HRS$R4AGEY_B[HRS$H4ATOTB0I>17107])

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   32.00   58.00   65.00   65.64   73.00  100.00   20669

#looking at this it seems that the average age is pretty close but significantly by a small amount 

t.test(HRS$S4LIV75[HRS$H4ATOTB0I>12500],HRS$S4LIV75[HRS$H4ATOTB0I<=12500])

## 
##  Welch Two Sample t-test
## 
## data:  HRS$S4LIV75[HRS$H4ATOTB0I > 12500] and HRS$S4LIV75[HRS$H4ATOTB0I <= 12500]
## t = 13.309, df = 6708.6, p-value < 0.00000000000000022
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   7.632999 10.269909
## sample estimates:
## mean of x mean of y 
##  70.23907  61.28761

t.test(HRS$S4LIV75P[HRS$H4ATOTB0I>12500],HRS$S4LIV75P[HRS$H4ATOTB0I<=12500])

## 
##  Welch Two Sample t-test
## 
## data:  HRS$S4LIV75P[HRS$H4ATOTB0I > 12500] and HRS$S4LIV75P[HRS$H4ATOTB0I <= 12500]
## t = 7.4659, df = 13964, p-value = 0.0000000000000876
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  1.101396 1.885616
## sample estimates:
## mean of x mean of y 
##  80.36981  78.87630

#Here the difference narrows a bit but still supports the orignal claim that wealth effects how we precive our mortality 

#Attempt to make a visual representation of the above claime. 
T4Live75B2<- HRS$R4LIV75[HRS$H4ATOTB0I<=4277]
T4Live75T2<- HRS$R4LIV75[HRS$H4ATOTB0I>17107]
T4Live75B2P<-HRS$R4LIV75P[HRS$H4ATOTB0I<=4277]
T4Live75T2P<-HRS$R4LIV75P[HRS$H4ATOTB0I>17107]

T4lifeBEP<- data.frame(T4Live75B2,T4Live75B2P)
T4lifeTEP<- data.frame(T4Live75T2,T4Live75T2P)
summary(T4lifeBEP$T4Live75B2) #1712 valid obs

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   40.00   50.00   58.23   90.00  100.00    2567

summary(T4lifeTEP$T4Live75T2) #1942 valid obs

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     0.0    50.0    75.0    72.2    90.0   100.0   23004

grid.arrange(
ggplot(data = T4lifeBEP, aes(x= T4Live75B2, y=T4Live75B2P))+
  geom_hex()+
  scale_alpha_continuous(limits = c(60,100))+
  labs(x= 'Personal estimate of living to 75', y = 'Actual Probability', title = 'Bottom 20%')+
  scale_y_continuous(limits = c(55,90),breaks = c(seq(50,100, by=5)))+
  theme_minimal()+
  theme(plot.title = element_text(hjust = 0.5, family = 'mono', face= 'bold', size = 21),axis.title.x = element_text(hjust = 0.5,family ='mono' ,face = 'bold' ,size = 14),
        axis.title.y= element_text(hjust = 0.5,family = 'mono', size = 14, vjust = 0, face = 'bold')),


ggplot(data = T4lifeTEP, aes(x= T4Live75T2, y=T4Live75T2P))+
  geom_hex(show.legend = FALSE)+
  scale_alpha_continuous(limits = c(60,100))+
  labs(x= 'Personal estimate of living to 75', y = 'Actual Probability', title = 'Top 20%')+
  scale_y_continuous(limits = c(55,90),breaks = c(seq(50,100, by=5)))+
  theme_minimal()+
  theme(plot.title = element_text(hjust = 0.5, family = 'mono', face= 'bold', size = 21),axis.title.x = element_text(hjust = 0.5,family ='mono' ,face = 'bold' ,size = 14),
        axis.title.y= element_text(hjust = 0.5,family = 'mono', size = 14, vjust = 0, face = 'bold')),

nrow=2)

To me there looks like their might be a possibility that it does. However, access to health care and current health status probably has more of an impact then wealth. These are some areas worth exploring if this was to be looked at more closely.

At this point I stoped to look at some distributions of deomgraphic Identifiers. Comments are in the code. Nothing jumped out as unormal.

#----Going back to look at the distribution of basic identifiers------#
#----No hypothesis are being made just looking at how single variables are distributed---#
#----------------------Age--------------------------------------------#
#summary(HRS$R4AGEY_B)

ggplot(data = HRS, aes(x=R4AGEY_B))+
  geom_histogram(binwidth = 10,color= 'black', fill= '#2cfaa8', size= 1)+
  labs(x= 'Age', y = 'Count', title = '4th Wave Age Distribution')+
  scale_y_continuous(limits=c(0,8000),breaks = seq(0,8000,1000))+
  scale_x_continuous(limits=c(35,105),breaks = seq(0,105,10))+
  theme_ft_rc()+
  theme(plot.title = element_text(hjust = 0.5, family = 'mono', face= 'bold', size = 21),axis.title.x = element_text(hjust = 0.5,family ='mono' ,face = 'bold' ,size = 14, color = 'White'),
        axis.title.y= element_text(hjust = 0.5,family = 'mono', size = 14, color = 'White', vjust = 5, face = 'bold'))

## Warning: Removed 20697 rows containing non-finite values (stat_bin).

#Age seems pretty focused on the retirment age then decreases age increase proboably due to death. The younger ages are probably related to proxy interviews 
#Gender is well balanced according to the code book

#-----Education level-----------#
ggplot(data = HRS, aes(x=S4EDUC))+
  geom_bar(color= 'black', fill= '#fa2ce5', size= 1)+
  labs(x= 'Education', y = 'Count', title = '4th Wave Education Distribution')+
  scale_x_discrete( limits=c('1.Lt High-school', '2.GED', '3.High-school graduate', '4.Some college', '5.College and above'))+
  scale_y_continuous(limits = c(0,5000))+
  theme_ft_rc()+
  theme(plot.title = element_text(hjust = 0.5, family = 'mono', face= 'bold', size = 21),axis.title.x = element_text(hjust = 0.5,family ='mono' ,face = 'bold' ,size = 14, color = 'White'),
        axis.title.y= element_text(hjust = 0.5,family = 'mono', size = 14, color = 'White', vjust = 5, face = 'bold'))

## Warning: Removed 27701 rows containing non-finite values (stat_count).

#-----------Religion-----------#

ggplot(data = HRS, aes(x=S4RELIG))+
  geom_bar(color= 'black', fill= '#2ce5fa', size= 1)+
  labs(x= 'Religion', y = 'Count', title = '4th Wave Religion Distribution')+
  scale_x_discrete(limits=c('1.Protestant', '2.Catholic', '3.Jewish', '4.None/no pref', '5.Other'))+
  scale_y_continuous(breaks= c(seq(0,11000, by=1000)))+
  theme_ft_rc()+
  theme(plot.title = element_text(hjust = 0.5, family = 'mono', face= 'bold', size = 21),axis.title.x = element_text(hjust = 0.5,family ='mono' ,face = 'bold' ,size = 14, color = 'White'),
        axis.title.y= element_text(hjust = 0.5,family = 'mono', size = 14, color = 'White', vjust = 5, face = 'bold'))

## Warning: Removed 27748 rows containing non-finite values (stat_count).

#I think age and Education are what I would expect. Religion is really interesting as Protestant dominates. I would have expected more None/ Pref but this might be more of bias of my curent culture

#--------Race----------------#

ggplot(data = HRS, aes(x=S4RACEM))+
  geom_bar(color= 'black', fill= '#fa2c7e', size= 1)+
  labs(x= 'Race', y = 'Count', title = '4th Wave Race Distribution')+
  scale_x_discrete(limits=c('1.White/Caucasian', '2.Black/African American', '3.Other'))+
  scale_y_continuous(breaks= c(seq(0,15000, by=2000)))+
  theme_ft_rc()+
  theme(plot.title = element_text(hjust = 0.5, family = 'mono', face= 'bold', size = 21),axis.title.x = element_text(hjust = 0.5,family ='mono' ,face = 'bold' ,size = 14, color = 'White'),
        axis.title.y= element_text(hjust = 0.5,family = 'mono', size = 14, color = 'White', vjust = 5, face = 'bold'))

## Warning: Removed 27696 rows containing non-finite values (stat_count).

#Mostly White according to the code book African American  was oversampled. It might be hard to do comparisons due to sample size

#---------------Gender------------------------------#

ggplot(data = HRS, aes(x=S4GENDER))+
  geom_bar(color= 'black', fill= '#8394cf', size= 1)+
  labs(x= 'Gender', y = 'Count', title = '4th Wave Gender Distribution')+
  scale_x_discrete(limits= c('1.Male','2.Female'))+
  scale_y_continuous(limits=c(0,8000),breaks= c(seq(0,10000, by=2000)))+
  theme_ft_rc()+
  theme(plot.title = element_text(hjust = 0.5, family = 'mono', face= 'bold', size = 21),axis.title.x = element_text(hjust = 0.5,family ='mono' ,face = 'bold' ,size = 14, color = 'White'),
        axis.title.y= element_text(hjust = 0.5,family = 'mono', size = 14, color = 'White', vjust = 5, face = 'bold'))

## Warning: Removed 27537 rows containing non-finite values (stat_count).

The next variable that struck me was retirment satisfaction. There is alot of graphs and Chunks to sift through here. I will have a summary below this chunk.

#---------------------------------------------------------------------------------------------#
#----------Looking for variables that have a strong effect on Retirment satisfaction----------#
#---------------------------------------------------------------------------------------------#
#RwRETSAT--- also includes spouces-----#
#summary(HRS$R4RETSAT)
ggplot(data = HRS, aes(x=R4RETSAT))+
  geom_bar(color= 'black', fill= '#08F7BD', size= 1.5)+
  labs(x= 'Satisfaction Level', y = 'Count', title = '4th Wave Satisfaction')+
  scale_x_discrete(limits= c('1.Very', '2.Moderately', '3.Not at all'))+
  scale_y_continuous(limits=c(0,5000),breaks= c(seq(0,5000, by=1000)))+
  theme_ft_rc()+
  theme(plot.title = element_text(hjust = 0.5, family = 'mono', face= 'bold', size = 21),axis.title.x = element_text(hjust = 0.5,family ='mono' ,face = 'bold' ,size = 14, color = 'White'),
             axis.title.y= element_text(hjust = 0.5,family = 'mono', size = 14, color = 'White', vjust = 5, face = 'bold'), axis.text = element_text(colour = 'white'), panel.grid.major.x = element_line(color = NA))

## Warning: Removed 34814 rows containing non-finite values (stat_count).

#What might cause satisfaction in retirment 'Money'
#0-100K
HRS%>% filter(!is.na(H4ATOTB0)) %>% filter(!is.na(R4RETSAT))%>% ggplot(aes(x= H4ATOTB0, fill=R4RETSAT))+
  geom_histogram(color= 'black', breaks= c(seq(0,100000, by=10000)))+
  scale_fill_discrete(limits= c('1.Very', '2.Moderately', '3.Not at all'))+
  scale_x_continuous(breaks= c(seq(0,100000, by=10000)),labels=c(seq(0,90000, by=10000), "100000"))+
  scale_y_continuous(breaks= c(seq(0,1000, by=100)))+
  labs(x= 'Wealth ', title = 'Total wealth and retirment satisfaction')+
  theme_ft_rc()+
  theme(plot.title = element_text(hjust = 0.5, family = 'mono', face= 'bold', size = 21),axis.title.x = element_text(hjust = 0.5,family ='mono' ,face = 'bold' ,size = 14, color = 'White'),
        axis.title.y= element_text(hjust = 0.5,family = 'mono', size = 14, color = 'White', vjust = 5, face = 'bold'))

#0-1million
HRS%>% filter(!is.na(H4ATOTB0)) %>% filter(!is.na(R4RETSAT))%>% ggplot(aes(x= H4ATOTB0, fill=R4RETSAT))+
  geom_histogram(color= 'black', breaks= c(seq(0,1000000, by=100000)))+
  theme_minimal()+
  scale_x_continuous(breaks= c(seq(0,1000000, by=100000)),labels=c(seq(0,900000, by=100000), "1000000"))+
  scale_y_continuous(breaks = c(seq(0,10000, by=1000)))+
  labs(x= 'Total Wealth', title = 'Wave 4 Wealth distribution between 0 and 1 million', caption = 'Note: There are 427 observations with over a million total wealth.' )+
  theme_ft_rc()+
  theme(plot.title = element_text(hjust = 0.5, family = 'mono', face= 'bold', size = 21),axis.title.x = element_text(hjust = 0.5,family ='mono' ,face = 'bold' ,size = 14, color = 'White'),
        axis.title.y= element_text(hjust = 0.5,family = 'mono', size = 14, color = 'White', vjust = 5, face = 'bold'))

#Looks like people with more money are more satisfied with their retirment
#Might be worth looking at a proportions test to determine independence

#-----Income---H4ITOT----------#
#summary(HRS$H4ITOT)
#looking at income levels between 0-100K

HRS%>% filter(!is.na(H4ATOTB0)) %>% filter(!is.na(R4RETSAT))%>% ggplot(aes(x= H4ITOT, fill=R4RETSAT))+
  geom_histogram(color= 'black', breaks= c(seq(0,100000, by=10000)))+
  scale_fill_discrete(limits= c('1.Very', '2.Moderately', '3.Not at all'))+
  scale_x_continuous(breaks= c(seq(0,100000, by=10000)),labels=c(seq(0,90000, by=10000), "100000"))+
  scale_y_continuous(breaks= c(seq(0,2000, by=200)))+
  labs(x= 'Income ', title = 'Income level and retirment satisfaction')+
  theme_ft_rc()+
  theme(plot.title = element_text(hjust = 0.5, family = 'mono', face= 'bold', size = 21),axis.title.x = element_text(hjust = 0.5,family ='mono' ,face = 'bold' ,size = 14, color = 'White'),
        axis.title.y= element_text(hjust = 0.5,family = 'mono', size = 14, color = 'White', vjust = 5, face = 'bold'))

HRS%>% filter(!is.na(H4ATOTB0)) %>% filter(!is.na(R4RETSAT))%>% ggplot(aes(x= H4ITOT, fill=R4RETSAT))+
  geom_histogram(color= 'black', breaks= c(seq(100000,300000 ,by=50000)))+
  scale_fill_discrete(limits= c('1.Very', '2.Moderately', '3.Not at all'))+
  scale_x_continuous(breaks= c(seq(100000,300000 ,by=50000)),labels=c(seq(100000,300000 ,by=50000)))+
  scale_y_continuous()+
  labs(x= 'Income ', title = 'Income level and retirment satisfaction')+
  theme_ft_rc()+
  theme(plot.title = element_text(hjust = 0.5, family = 'mono', face= 'bold', size = 21),axis.title.x = element_text(hjust = 0.5,family ='mono' ,face = 'bold' ,size = 14, color = 'White'),
        axis.title.y= element_text(hjust = 0.5,family = 'mono', size = 14, color = 'White', vjust = 5, face = 'bold'))

#note the sample decrease and the as their are fewer people with high salaries. Thus it is best to see if the percentages change as income level comes up
#However income might not be the best factor for retirment satisfaction as income will likly ends. 
HRS%>% filter(!is.na(S4EDUC)) %>% filter(!is.na(R4RETSAT))%>% ggplot(aes(x=S4EDUC, fill=R4RETSAT ))+
  geom_bar(color= 'black', size= 1)+
  labs(x= 'Education', y = 'Count', title = 'Education and Retirment satisfaction')+
  scale_x_discrete( limits=c('1.Lt High-school', '2.GED', '3.High-school graduate', '4.Some college', '5.College and above'))+
  scale_y_continuous(breaks= c(seq(0,2000, by=200)))+
  theme_ft_rc()+
  theme(plot.title = element_text(hjust = 0.5, family = 'mono', face= 'bold', size = 21),axis.title.x = element_text(hjust = 0.5,family ='mono' ,face = 'bold' ,size = 14, color = 'White'),
        axis.title.y= element_text(hjust = 0.5,family = 'mono', size = 14, color = 'White', vjust = 5, face = 'bold'), panel.grid.major.x = element_line(color = NA))

#table(HRS$S4EDUC,HRS$R4RETSAT)
#prop.table(table(HRS$S4EDUC,HRS$R4RETSAT),1)
#edusat<-chisq.test(HRS$S4EDUC,HRS$R4RETSAT)
#edusat$statistic
#edusat$observed
#edusat$expected

Though I think that the way they worded this question is correct I belive that it may be more interesting merge Moderately and Not at all. I think this works well because who wants to moderately enjoy retirment the answer is grey and displays disinterest. focusing on those who are Very happy and those who are sort of meh is more appealing research wise as no one wants to have a moderatly satisfying retirment people want a very satisfying retirment and that is that is the question I want to focus on. Furthermore, having and A B answer potentially opens up more models like logistic regression. I will creat a new varible that combines ‘Moderately and Not at all’

#--------------------------------------------------------------------------------#
#----Combining Moderately and Not at at all in wave 4 retirment satisfaction-----# 
#--------------------------------------------------------------------------------#

#str(HRS$R4RETSAT)
HRS$R4RETSATB<-HRS$R4RETSAT
levels(HRS$R4RETSATB)

## [1] "1.Very"       "2.Moderately" "3.Not at all"

levels(HRS$R4RETSATB)<-c("Very" ,"Moderate to None","Moderate to None")
levels(HRS$R4RETSATB)

## [1] "Very"             "Moderate to None"

HRS%>%filter(!is.na(R4RETSATB))%>% ggplot(aes(x=R4RETSATB))+
  geom_bar(color= 'black', fill= '#08F7BD', size= 1.5)+
  labs(x= 'Satisfaction Level', y = 'Count', title = '4th Wave Satisfaction')+
  scale_y_continuous(limits=c(0,5000),breaks= c(seq(0,5000, by=1000)))+
  theme_ft_rc()+
  theme(plot.title = element_text(hjust = 0.5, family = 'mono', face= 'bold', size = 21),axis.title.x = element_text(hjust = 0.5,family ='mono' ,face = 'bold' ,size = 14, color = 'White'),
        axis.title.y= element_text(hjust = 0.5,family = 'mono', size = 14, color = 'White', vjust = 5, face = 'bold'))

#distribution of satisfaction wealth 0-100K
HRS%>% filter(!is.na(H4ATOTB0)) %>% filter(!is.na(R4RETSATB))%>% ggplot(aes(x= H4ATOTB0, fill=R4RETSATB))+
  geom_histogram(color= 'black', breaks= c(seq(0,100000, by=10000)))+
  scale_x_continuous(breaks= c(seq(0,100000, by=10000)),labels=c(seq(0,90000, by=10000), "100000"))+
  scale_y_continuous(breaks= c(seq(0,1000, by=100)))+
  labs(x= 'Wealth ', title = 'Total wealth and retirment satisfaction')+
  theme_ft_rc()+
  theme(plot.title = element_text(hjust = 0.5, family = 'mono', face= 'bold', size = 21),axis.title.x = element_text(hjust = 0.5,family ='mono' ,face = 'bold' ,size = 14, color = 'White'),
        axis.title.y= element_text(hjust = 0.5,family = 'mono', size = 14, color = 'White', vjust = 5, face = 'bold'))

#As wealth increase satisfaction increase.... as seen before 
#Percentages of very to not
HRS$H4wealthbin <- cut(HRS$H4ATOTB0, breaks = c(seq(0, 150000, by = 10000), Inf), right = FALSE)
#table(HRS$H4wealthbin,HRS$R4RETSATB)
satisfactionwealth<-as.data.frame(prop.table(table(HRS$H4wealthbin,HRS$R4RETSATB),1))
colnames(satisfactionwealth)<- c("Wealth","Satisfaction","Percentage")

satisfactionwealth%>%filter(Satisfaction== "Very")%>%ggplot(aes(x= Wealth, y= Percentage))+
  geom_col()+
  scale_x_discrete(labels=c(seq(10000,150000, by=10000), '+150000'))+
  labs(title = 'Wave 4 Retirment satisfaction and wealth ', y= "Percent of Very satisfied")

#----Income level--------#
#summary(HRS$H4ITOT)
grid.arrange(
HRS%>% filter(!is.na(H4ITOT)) %>% filter(!is.na(R4RETSATB))%>% ggplot(aes(x= H4ITOT, fill=R4RETSATB))+
  geom_histogram(color= 'black',breaks= c(seq(0,200000, by=10000)))+
  scale_x_continuous(breaks= c(seq(0,200000, by=20000)),labels=c(seq(0,190000, by=20000), "200000")),

HRS%>% filter(!is.na(H4ITOT)) %>% filter(!is.na(R4RETSATB))%>% ggplot(aes(x= H4ITOT, fill=R4RETSATB))+
  geom_density(color= 'black',position = 'fill')+
  scale_x_continuous(breaks= c(seq(0,200000, by=20000)),labels=c(seq(0,190000, by=20000), "200000"), limits = c(0,200000)))

## Warning: Removed 80 rows containing non-finite values (stat_density).

#As Income increase so does satisfaction level however it does not appear to be linear...
#Money loses impact on satisfaction as it increases. going from 20000 to 40000 has a bigger impact then 40000 to 60000
# income past 120000 has a much smaller sample size which explains the high volitility of satisfied disatisfied 
#THis is obvious but inflation needs to be accounted for when doing cross wave analysis.
#Due to the smaller sample size of those outside 100K income levels I would not rely on them
#for claims but I the satisfaction level is at a level I would except.

#----Number of children-------------#
#table(HRS$H4CHILD,HRS$R4RETSATB)
#prop.table(table(HRS$H4CHILD,HRS$R4RETSATB),1)
R4satChild<- as.data.frame(prop.table(table(HRS$H4CHILD,HRS$R4RETSATB),1))
colnames(R4satChild)<- c("Children","Satisfaction","Percentage")
levels(R4satChild$Children)<- c('1','2','3','4','5','6','7','8','9','10','10+','10+','10+','10+','10+','10+','10+','10+','10+','10+')
R4satChild%>%filter(Satisfaction== "Very")%>%ggplot(aes(x= Children , y= Percentage))+
  geom_col()+
  scale_x_discrete(limits= c('1','2','3','4','5','6','7','8','9','10'))

## Warning: Removed 10 rows containing missing values (position_stack).

#We can see that number of children does not seem to have a real effect on retirment satisfaction

#--Gender-----------------------#
table(HRS$S4GENDER,HRS$R4RETSATB)

##           
##            Very Moderate to None
##   1.Male   1174              642
##   2.Female 1708             1028

prop.table(table(HRS$S4GENDER,HRS$R4RETSATB),1)

##           
##                 Very Moderate to None
##   1.Male   0.6464758        0.3535242
##   2.Female 0.6242690        0.3757310

#Very even

#---Self Report of Health------#
#table(HRS$R4SHLT,HRS$R4RETSATB)
#prop.table(table(HRS$R4SHLT,HRS$R4RETSATB),1)
R4HNRSP<- as.data.frame(prop.table(table(HRS$R4SHLT,HRS$R4RETSATB),1))
colnames(R4HNRSP)<- c("SelfReportedHealth","Satisfaction","Percentage")
R4HNRSP%>%filter(Satisfaction== "Very")%>%ggplot(aes(x= SelfReportedHealth , y= Percentage))+
  geom_col()+
  labs(title = 'Self Reported Health and Retirment Satisfaction', x= 'Health', y='Percent')

#Self-Reported Health of the person seems to play quite a large role in retirment satisfaction  Look at the corelation of INcome and self- Reported Health
#Seems to play a much larger role then income. Should look at Income lower income levels and their self reported health 

#---AGE---#
#table(HRS$R4AGEY_B,HRS$R4RETSATB)
#prop.table(table(HRS$R4AGEY_E,HRS$R4RETSATB),1)
R4AGESAT<- as.data.frame(prop.table(table(HRS$R4AGEY_E,HRS$R4RETSATB),1))
colnames(R4AGESAT)<- c("Age","Satisfaction","Percentage")
grid.arrange(
HRS %>%filter(!is.na(R4AGEY_E))%>% filter(!is.na(R4RETSATB))%>%ggplot(aes(x=R4AGEY_E, fill= R4RETSATB))+
  geom_histogram()+ labs(title = 'AGE and Retirment satisfaction'),

HRS %>%filter(!is.na(R4AGEY_E))%>% filter(!is.na(R4RETSATB))%>%ggplot(aes(x=R4AGEY_E, fill= R4RETSATB))+
  geom_density(position = 'fill')+
  scale_x_continuous())

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#It is hard to tell if age plays a significant factor without more wave analysis.
#Though it seems that people are more satisfied when they retire in their later years. (However note the small sample size for those with early retirment)
#One hypothesis is that people who retired early were forced to by a medical condition. 
#ALso note the drop in very satisfied in the one hundresds. THis sample is also very small so it is hard
#to make any inferences about these observations. 
#Conclusion: Age of retirment does not seem to play a factor in satisfaction 

#---Education---#

#--Need to look at this variable and determin if it is actually the spouces education or motheres education...
#---Does the respondent have an education level--#

table(HRS$S4EDUC,HRS$R4RETSATB)

##                         
##                          Very Moderate to None
##   1.Lt High-school        598              567
##   2.GED                   126               75
##   3.High-school graduate  995              540
##   4.Some college          578              282
##   5.College and above     560              180

prop.table(table(HRS$S4EDUC,HRS$R4RETSATB),1)

##                         
##                               Very Moderate to None
##   1.Lt High-school       0.5133047        0.4866953
##   2.GED                  0.6268657        0.3731343
##   3.High-school graduate 0.6482085        0.3517915
##   4.Some college         0.6720930        0.3279070
##   5.College and above    0.7567568        0.2432432

#As education level increases so does the chance they are satisfied with retirment.
#Worth to look at if this research area is pursued. 

#--Marital Status-----#
table(HRS$R4MSTAT,HRS$R4RETSATB)

##                          
##                           Very Moderate to None
##   1.Married               2793             1566
##   2.Married,spouse absent   25               34
##   3.Partnered               64               69
##   4.Separated               29               57
##   5.Divorced               243              256
##   6.Separated/divorced       0                0
##   7.Widowed               1085              800
##   8.Never married          116               97

prop.table(table(HRS$R4MSTAT,HRS$R4RETSATB),1)

##                          
##                                Very Moderate to None
##   1.Married               0.6407433        0.3592567
##   2.Married,spouse absent 0.4237288        0.5762712
##   3.Partnered             0.4812030        0.5187970
##   4.Separated             0.3372093        0.6627907
##   5.Divorced              0.4869739        0.5130261
##   6.Separated/divorced                              
##   7.Widowed               0.5755968        0.4244032
##   8.Never married         0.5446009        0.4553991

#-----Vigorus phys activity 3/wk----------#
table(HRS$R4VIGACT,HRS$R4RETSATB)

##        
##         Very Moderate to None
##   0.No  2470             2079
##   1.Yes 1884              801

prop.table(table(HRS$R4VIGACT,HRS$R4RETSATB),1)

##        
##              Very Moderate to None
##   0.No  0.5429765        0.4570235
##   1.Yes 0.7016760        0.2983240

#seems that the more active people have better chance of enjoying retirment... probably corelates to health a bit

#----Smoke Cigarettes-----#
#Smoke ever
table(HRS$R4SMOKEV,HRS$R4RETSATB)

##        
##         Very Moderate to None
##   0.No  1688             1005
##   1.Yes 2625             1858

prop.table(table(HRS$R4SMOKEV,HRS$R4RETSATB),1)

##        
##              Very Moderate to None
##   0.No  0.6268102        0.3731898
##   1.Yes 0.5855454        0.4144546

#smoke now
table(HRS$R4SMOKEN,HRS$R4RETSATB)

##        
##         Very Moderate to None
##   0.No  3888             2387
##   1.Yes  468              496

prop.table(table(HRS$R4SMOKEN,HRS$R4RETSATB),1)

##        
##              Very Moderate to None
##   0.No  0.6196016        0.3803984
##   1.Yes 0.4854772        0.5145228

#People who smoke seem to not enjoy retirment as much.... might be worth looking at the other waves......

I think that looking at Retirment satisfaction and Variables that correlate to it would be a good research topic/ area. Though it might be better to focus on only a major variables like wealth/income and health. I would also argue that this research topic aligns with the consumer well being and due to the large sample size of lower income levels could help provide insights to the underserved Americans.

HRS Data Analysis (WIP)

Chris Beland

January 21, 2020