knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE, fig.align = 'center')library(ggplot2)
library(plyr)
library(dplyr)
library(rockchalk)
library(vcd)brfss2013 <- read.csv("brfss2013.csv")Individual states and U.S territories collected data in 2013 (About 1% were collected in 2014) through landline and cell phone surveys of 491,773 randomly selected adults aged over 18. There are 491,775 samples across 330 variables covering a broad range of preventive health practices and risky behaviours linked to chronic health conditions.
The study is observational; a self-report survey
The data are specific to individual states and territories. Inferences can be generalized to the U.S adult population broken down by state.
We can derive insights about associations, trends and patterns but cannot claim causal links between variables in observational data. Experiments are usually required to establish causality.
Further information about the survey is available on the Behavioural Risk Factor System (BRFSS) website.
A full description of dataset variables is available in the BRFSS Codebook.
Is there an association between income, access to health care and health status?
U.S president, Barack Obama, partly justified his signature domestic policy initiative, The Affordable Care Act, by the need address the inferior health outcomes of low income groups by increasing their access to health care. I wish to attempt to adduce statistical evidence for the existence of associations between income, access to health care and ultimate health outcomes.
I will investigate 3 variables:
income_group
income2, a categorical variable with 8 levelsgen_health
genhlthhealth_plan
hlthpln1Is there a relationship between income and risky hehaviours; specifically smoking and drinking
The objective here is to find out if low income earners’ poor health status is also associated with their own lifestyle choices. Do they bear some responsibility for their poor health beyond their relative lack of access to health care (if such is the case)?
I will examine these 3 variables:
income_group
income, a categorical variable with 8 levelssmoke_days
smokday2ave_drink
avedrnk2Is there an association between men’s marital status and income levels - adjusting for educational attainment?
Pursuing my investigation of income disparities and the variables with which they may be associated, I wish to examine the validity of one of the claims made by some American social conservatives in favour of marriage. This question is especially relevant as Americans are waiting ever longer to get married. Indeed many are ruling out this option entirely.
This research question considers one of the numerous pro-marriage claims made by familyfacts.org, a website run by The Heritage Foundation, an American conservative think tank. They emphatically claim that married men make more money.
I will examine these 3 variables:
income_group
income, a categorical variable with 8 levelsmarital_status
marital with 6 levelseducation_level
educa with 6 levelsIs there an association between income, access to health care and health status?
Data Preparation
The code below extracts the data required for this analysis, removes rows with missing values, renames some of the variables and transforms income2 by collapsing its 8 levels to 3 to ease analysis and visualization of patterns in the data.
Only respondents without children were considered. This is to improve comparability between respondents.
Additional data wrangling steps are described with the code.
## Subset variables required for this analysis
Q1Data <- dplyr::select(brfss2013, X_state, iyear, dispcode,
pvtresd1, children, genhlth, hlthpln1, income2)
## Rename variables with more descriptive names
Q1Data <- plyr::rename(Q1Data, c('X_state' ='state',
'pvtresd1' = 'private_resid',
'genhlth' = 'gen_health',
'hlthpln1' = 'health_plan',
'income2' = 'income'))
## Restrict data to respondents without children
Q1Data <- filter(Q1Data, children == 0)
## Remove rows with missing values
Q1Data <- Q1Data[complete.cases(Q1Data),]
## Recode income variable to combine data into 3 larger groups: low income <$25,000; mid income $25,000-75,000; high income > $75,000
Q1Data <- mutate(Q1Data, income_group = income)
Q1Data$income_group <- combineLevels(Q1Data$income_group,
levs = c("Less than $10,000", "Less than $15,000", "Less than $20,000",
"Less than $25,000"), newLabel = "Low Income")
Q1Data$income_group <- combineLevels(Q1Data$income_group,
levs = c("Less than $35,000", "Less than $50,000",
"Less than $75,000"),
newLabel = "Mid Income")
Q1Data$income_group <- combineLevels(Q1Data$income_group,
levs = c("$75,000 or more"),
newLabel = "High Income")Exploratory Analysis
Low income groups make up just over 30% of this reduced sample of 230,858 observations.
attach(Q1Data)
options(digits = 2)
addmargins(table(income_group),1)## income_group
## Low Income Mid Income High Income Sum
## 73429 101118 56311 230858
prop.table(table(Q1Data$income_group))##
## Low Income Mid Income High Income
## 0.32 0.44 0.24
However 67.5% of respondents reporting themselves as being in poor health are from low income groups. More than twice the expected proportion if there were no association between income and overall health. 27.2% of middle income earners say they are in poor health whilst “only” 5.4% of high income earners fall into this category.
health_table <- table(Q1Data$gen_health, Q1Data$income_group,
dnn = c("Health Status", "Income Group"))
health_table2 <- prop.table(health_table, 1)
health_table3 <- addmargins(health_table2, 2)
health_table3## Income Group
## Health Status Low Income Mid Income High Income Sum
## Excellent 0.163 0.418 0.419 1.000
## Fair 0.529 0.375 0.096 1.000
## Good 0.332 0.471 0.197 1.000
## Poor 0.675 0.272 0.054 1.000
## Very good 0.197 0.480 0.322 1.000
The plot shows that as reported health status deteriorates from “excellent” to “poor”, the proportional representation of low income earners in each category increases; the opposite pattern is visible for high incomes. There does seem to be a relationship between income and health status.
plot(health_table2,
xlab = "Health Status",
ylab = "Income Group",
main = "Self Reported Health Status by Income Group",
color = "steelblue") Low income respondents’ poorer health may well be due to their relative lack of access to health care.
The bar chart below shows almost all high income earners are covered by a health plan.
At the other end of the income scale, low income earners are significantly less likely to enjoy such coverage.
ggplot(data = Q1Data, aes(x = income_group, fill = health_plan)) +
geom_bar(position = 'fill') +
labs(x = "income group",
y = "%",
title = "Low Incomes Less Likely to Have Health Plans",
fill = "Health Plan") +
theme_classic()Conclusion
The table and the mosaic plot below combine all three variables - income, health coverage and health status.
There is an association between income levels and health status; lower income earners report worse health status. As we move up the income scale reported health status improves.
Both income levels and health status seem associated with access to health care. High income, healthier groups almost all benefit from health plans whilst low income, less healthy groups report reduced access to health care. It is important to mention, as seen in the plot, that most Americans do benefit from health coverage. Low income groups are disproportionately represented amongst those who do not.
The plot cells are color coded to show the direction and strength of the associations between the 3 variables.
options(digits = 3)
health_table3 <- xtabs(~ gen_health + health_plan + income_group, data = Q1Data)
health_table4 <- prop.table(health_table3,3)
ftable(health_table4)## income_group Low Income Mid Income High Income
## gen_health health_plan
## Excellent No 0.012938 0.008080 0.004173
## Yes 0.062537 0.132855 0.249330
## Fair No 0.031064 0.007140 0.001154
## Yes 0.221956 0.123153 0.058408
## Good No 0.045513 0.017633 0.004813
## Yes 0.284247 0.321990 0.249560
## Poor No 0.015743 0.002235 0.000408
## Yes 0.130248 0.040497 0.014722
## Very good No 0.027428 0.015250 0.006020
## Yes 0.168326 0.331168 0.411412
plot3 <- xtabs(~ gen_health + income_group + health_plan, data = Q1Data)
mosaic(plot3, shade = TRUE, main = "")For greater emphasis, the association plot below highlights the cells in which there are more or fewer observations than expected.
When the cell is pink and below the dashed line, this indicates fewer than expected observations for the group; blue and above the line indicates more observations than expected. The width of the cell indicates the magnitude of the deviation.
assoc(plot3, main = "Association btw Income, Health Access & Health Status", shade = TRUE)Is there a relationship between income and risky hehaviours; specifically smoking and drinking
Data Preparation
The code below extracts the data required for this analysis, removes rows with missing values, renames some of the variables and transforms income2 by collapsing its 8 levels into 3 to ease analysis and visualization of patterns in the data.
Only respondents without children were considered. This is to improve comparability between survey respondents.
Additional data wrangling steps are described with the code.
## Subset variables required for this analysis
Q2Data <- dplyr::select(brfss2013, X_state, iyear, dispcode,
pvtresd1, children,income2, smokday2, avedrnk2)
## Rename variables with more descriptive names
Q2Data <- plyr::rename(Q2Data, c('X_state' = 'state', 'pvtresd1' = 'private_resid',
'income2' = 'income', 'smokday2' = 'smoke_days',
'avedrnk2' = 'ave_drink'))
## Restrict data to respondents without children so as to harmonize subsequent income grouping
Q2Data <- filter(Q2Data, children == 0)
## Remove rows with missing values
Q2Data <- Q2Data[complete.cases(Q2Data),]
## Recode income variable to combine data into 3 larger groups: <$25,000; $25,000-75,000; > $75,000
Q2Data <- mutate(Q2Data, income_group = income)
Q2Data$income_group <- combineLevels(Q2Data$income_group,
levs = c("Less than $10,000", "Less than $15,000",
"Less than $20,000", "Less than $25,000"),
newLabel = "Low Income")
Q2Data$income_group <- combineLevels(Q2Data$income_group,
levs = c("Less than $35,000", "Less than $50,000",
"Less than $75,000"),
newLabel = "Mid Income")
Q2Data$income_group <- combineLevels(Q2Data$income_group,
levs = c("$75,000 or more"),
newLabel = "High Income")Exploratory Analysis
Income
Income levels are defined as (“Low Income <25,000” “Mid Income 25,000-75,000” “High Income >75,000”)
attach(Q2Data)
options(digits = 2)
addmargins(table(income_group, dnn = "Income Group"),1)## Income Group
## Low Income Mid Income High Income Sum
## 12927 26555 16944 56426
The table above shows the counts in the different income categories. The reduced sample has 55,659 observations.
The table below shows the proportions. About half the respondents in this dataset are middle income and the remaining half are roughly divided between low and high income groups.
income_table1 <-prop.table(table(income_group, dnn = "Income Group"))
addmargins(income_table1, 1)## Income Group
## Low Income Mid Income High Income Sum
## 0.23 0.47 0.30 1.00
smoke_days
Most respondents (73.4%) in this sample never smoke. However around 15,000 (> 25%) smoke either some days or every day. This imbalance should not prevent the revelation of any patterns that may exist in the data.
smoke_table1 <- table(smoke_days, dnn = "frequency of days now smoking")
smoke_table1## frequency of days now smoking
## Every day Not at all Some days
## 10946 41389 4091
smoke_table2 <- prop.table(smoke_table1)
smoke_table2## frequency of days now smoking
## Every day Not at all Some days
## 0.194 0.734 0.073
ave_drink - average number of alcoholic drinks consumed per day in the past 30 days
The boxplot reveals the presence of some extreme outliers. The summary statistics confirm this. The median number of daily drinks is 2, but the observations stretch to 76.
boxplot(ave_drink, main = "Average no. of drinks per day", col = "grey")summary(Q2Data$ave_drink)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 1 2 2 2 76
The most extreme observations just seem implausible.
Excluding the top 1% of observations removes almost all the extreme values.
75% of the remaining respondents had 1 or 2 drinks a day on average during the month before the survey. The remainder had more; up to a maximum of 9.
Q2Data <- filter(Q2Data, ave_drink < quantile(ave_drink, 0.99))
boxplot(Q2Data$ave_drink,
main = "Average no. of drinks per day (without extreme outliers)",
col = "steelblue")summary(Q2Data$ave_drink)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 1 2 2 2 9
A close examination of the table below reveals some evidence for association between the 3 variables.
About 30% of low income earners smoke “every day” whilst 18.5% of middle income and only 11.8% of high income earners do.
The pattern repeats itself among those who smoke “some days”.
options(digits = 2)
options(scipen = 999)
cont_table1 <- xtabs(~smoke_days +ave_drink + income_group, data = Q2Data)
cont_table2 <-addmargins(prop.table(cont_table1,3),2)
cont_table3 <- ftable(cont_table2)
cont_table3## income_group Low Income Mid Income High Income
## smoke_days ave_drink
## Every day 1 0.089502 0.058126 0.034245
## 2 0.085059 0.057211 0.041082
## 3 0.053321 0.032170 0.021760
## 4 0.027533 0.015399 0.008383
## 5 0.014044 0.008805 0.005410
## 6 0.022217 0.009643 0.004756
## 7 0.002460 0.001372 0.000713
## 8 0.004681 0.001868 0.001189
## 9 0.000793 0.000152 0.000119
## Sum 0.299611 0.184746 0.117658
## Not at all 1 0.321511 0.405207 0.383234
## 2 0.157026 0.225797 0.310583
## 3 0.054511 0.069485 0.084542
## 4 0.023486 0.025728 0.026754
## 5 0.011823 0.010939 0.010761
## 6 0.012933 0.010215 0.008502
## 7 0.002460 0.001410 0.001308
## 8 0.003015 0.002897 0.002021
## 9 0.000793 0.000229 0.000119
## Sum 0.587559 0.751906 0.827824
## Some days 1 0.040387 0.020964 0.016885
## 2 0.036420 0.023403 0.019976
## 3 0.018250 0.009491 0.010048
## 4 0.007141 0.004688 0.004281
## 5 0.003729 0.001906 0.001962
## 6 0.005634 0.001830 0.001011
## 7 0.000397 0.000457 0.000059
## 8 0.000793 0.000534 0.000178
## 9 0.000079 0.000076 0.000119
## Sum 0.112830 0.063348 0.054518
Focusing on the relationship between income and drinking:
Roughly equal proportions report taking 1 drink a day on average - higher incomes slightly less. A smaller proportion of low income earners consume 2 drinks per day on average but. As the number of average daily drinks goes up beyond 2, the data shows lower income groups consuming slightly more than other groups.
It is worth mentioning that relatively few respondents report consuming more than 4 drinks per day on average.
table(Q2Data$ave_drink, Q2Data$income_group, dnn = c("Av. Daily Drinks", "Income Group"))## Income Group
## Av. Daily Drinks Low Income Mid Income High Income
## 1 5689 12706 7306
## 2 3510 8039 6251
## 3 1589 2916 1957
## 4 733 1202 663
## 5 373 568 305
## 6 514 569 240
## 7 67 85 35
## 8 107 139 57
## 9 21 12 6
plot1 <- Q2Data %>%
group_by(income_group, ave_drink) %>%
summarise(count = n()) %>%
mutate(drink_percent = count/sum(count))
ggplot(data = plot1, aes(x = income_group, y = drink_percent*100, fill = as.factor(ave_drink)))+
geom_bar(stat = "identity")+
ylab("%")+
xlab("Income Group")+
ggtitle("Drink vs Income")+
labs(fill = "average daily drinks")+
theme_classic()The mosaic plot below offers an alternative view.
drinktable1 <- prop.table(table(Q2Data$ave_drink, Q2Data$income_group),2)
mosaicplot(drinktable1, main = "Income vs Average Drinks p/Day", color = "steelblue")In both plots it is difficult to identify any clear differences in the patterns of alcohol consumption between the income groups. Associations plots may be of help in clarifying the situation as they are based on rigorous statistical analyses of association.
The plot below offers a good visualization of the associations between all 3 variables.
We see that low income respondents are more represented than expected among those who smoke “every day” and “some days”. The same is true within the drink segments although the association is weaker; low income groups are over represented within each segment except among those who do not smoke at all. The opposite is true for high income respondents. A consistent result.
Associations between income, drinking and those who smoke only some days are very weak.
assoc(~ smoke_days + income_group + ave_drink,
data = Q2Data,
shade = TRUE,
main = "Income, Smoking & Drinking")Conclusion
The association between income and regular smoking seems clearly established. Regarding the association between those variables with drinking: although some association is apparent in the data, the number of observations drop significantly as the reported number of daily drinks consumed goes up. This calls for caution in drawing any conclusions.
In addition to health care access and income, poorer Americans’ health status may also be associated with certain risky behaviours such as regular smoking and perhaps drinking.
Taking into account educational attainment, is there an association between men’s marital status and income levels?
Income levels are defined as (“Low Income <25,000” “Mid Income 25,000-75,000” “High Income >75,000”)
Data Preparation
The data set is restricted to males only as required by the research question.
Data preparation steps are similar to those performed for the previous questions and are described with the code.
## Subset variables required for this analysis
Q3Data <- dplyr::select(brfss2013, X_state, iyear, dispcode, pvtresd1, employ1,
sex, income2, marital, educa)
## Rename variables with more descriptive names
Q3Data <- plyr::rename(Q3Data, c(X_state = "state", pvtresd1 = "private_resid",
income2 = "income", marital = "marital_status", educa = "education_level",
employ1 = "employ_status"))
## Exclude females
Q3Data <- filter(Q3Data, sex == "Male")
## Restrict marriage status to 3 levels: 'married', 'never married' and
## 'unmarried couple'
Q3Data <- filter(Q3Data, marital_status == "Married" | marital_status == "A member of an unmarried couple" |
marital_status == "Never married")
## Exclude some levels from `education_level`
Q3Data <- filter(Q3Data, education_level != "Never attended school or only kindergarten")
Q3Data <- filter(Q3Data, education_level != "Grades 1 through 8 (Elementary)")
## Rename `education_level` factor levels
Q3Data$education_level <- plyr::revalue(Q3Data$education_level, c(`Grades 9 though 11 (Some high school)` = "some high school",
`Grade 12 or GED (High school graduate)` = "high school grad.", `College 1 year to 3 years (Some college or technical school)` = "some college",
`College 4 years or more (College graduate)` = "college grad."))
## Rename 1 `marrital_status` level
Q3Data$marital_status <- plyr::revalue(Q3Data$marital_status, c(`A member of an unmarried couple` = "Unmarried couple"))## Recode income variable to combine data into 3 larger groups: <$25,000; $25,000-75,000; > $75,000
Q3Data <- mutate(Q3Data, income_group = income)
Q3Data$income_group <- combineLevels(Q3Data$income_group,
levs = c("Less than $10,000", "Less than $15,000",
"Less than $20,000", "Less than $25,000"),
newLabel = "Low Income")
Q3Data$income_group <- combineLevels(Q3Data$income_group,
levs = c("Less than $35,000", "Less than $50,000",
"Less than $75,000"),
newLabel = "Mid Income")
Q3Data$income_group <- combineLevels(Q3Data$income_group,
levs = c("$75,000 or more"),
newLabel = "High Income")
## Remove rows with missing values from key variables: "marital_status", "education_level" and "income_group"
Q3Data <- Q3Data[complete.cases(Q3Data[,8:10]),]
## Filter variable "employ_status" to remove levels: "A student" and "Unable to work"
Q3Data <- filter(Q3Data, employ_status != "Unable to work" & employ_status != "A student" )
## Drop unused factor levels
Q3Data$marital_status <- droplevels(Q3Data$marital_status)
Q3Data$employ_status <- droplevels(Q3Data$employ_status)
Q3Data$education_level <- droplevels(Q3Data$education_level)Exploratory Analysis
Summaries of the 3 categorical variables
Income levels are defined as (“Low Income <25,000” “Mid Income 25,000-75,000” “High Income >75,000”)
attach(Q3Data)
summary(income_group)## Low Income Mid Income High Income
## 20841 55127 49432
summary(marital_status)## Unmarried couple Married Never married
## 4468 95689 25243
summary(education_level)## some college college grad. high school grad. some high school
## 31630 54472 34079 5219
Almost 89% of high income earners in this data set are married compared to about 50% of low income earners.
This and other interesting insights are shown in the tables and visualized in the mosaic plot. There does seem to be an association between marital status and income.
options(digits = 2)
mar_table1 <- table(Q3Data$marital_status, Q3Data$income_group,
dnn = c("Income Group", "Marital Status"))
mar_table2 <- prop.table(mar_table1,2)
addmargins(mar_table2, 1)## Marital Status
## Income Group Low Income Mid Income High Income
## Unmarried couple 0.063 0.034 0.025
## Married 0.499 0.751 0.888
## Never married 0.438 0.214 0.087
## Sum 1.000 1.000 1.000
plot(mar_table2, col = "steelblue",
main = "Relationship b/w Marital Status and Income",
xlab = "Marital Status",
ylab = "Income Group")There are certainly several other variables that could conceivably explain income levels. One possible confounder, in that it could correlate with both marital status and income, is educational attainment. familyfacts.org, the organisation whose claims directly inspired this research question, accounted for education level in their research.
The table below offers some insights.
In this sample 54,472 respondents are college graduates. 46% of those are married, 35% have never been and a similar proportion describe themselves as unmarried couples.
The table shows that married men dominate the high income category at every level of educational attainment.
This is preliminary evidence of association between the variables
options(digits = 1)
table(Q3Data$education_level)##
## some college college grad. high school grad. some high school
## 31630 54472 34079 5219
mar_table3 <- xtabs(~income_group + education_level+marital_status, data = Q3Data)
mar_table4 <- addmargins(prop.table(mar_table3,3),2)
mar_table5 <- ftable(mar_table4)
mar_table5## marital_status Unmarried couple Married Never married
## income_group education_level
## Low Income some college 0.081 0.026 0.103
## college grad. 0.039 0.017 0.071
## high school grad. 0.122 0.050 0.152
## some high school 0.053 0.016 0.035
## Sum 0.295 0.109 0.362
## Mid Income some college 0.131 0.128 0.129
## college grad. 0.138 0.143 0.183
## high school grad. 0.124 0.144 0.140
## some high school 0.030 0.017 0.016
## Sum 0.424 0.433 0.468
## High Income some college 0.059 0.092 0.039
## college grad. 0.177 0.299 0.099
## high school grad. 0.041 0.063 0.029
## some high school 0.004 0.004 0.003
## Sum 0.281 0.459 0.170
The marital_status labels on the right vertical axis are impossible to read in the association plot below. Nevertheless, the plot still provides a visual depiction of the associations described above.
assoc(~education_level + income_group + marital_status, data = Q3Data, shade = TRUE)