Setup

Part 1: Data

The observations in the sample are collected through random sampling of adults in the USA. Landline telephone interviews are used to collect data from randomly selected adults living in a household, whereas celluler telephone interviews are used for adults who reside in a private residence or college housing. So, this is an observational study and thus no random assignments are made. Therefore, no causal conclusion can be drawn from these data. Only correlational inference can be made in this case. However, results from these data are generalizable to the larger population.


Part 2: Research questions

Research quesion 1: Is there a relationship between general health status of the respondents and their health care coverage status? Does the income level have anything to do with their ability to access healthcare plan?

Health care coverage is a widely talked about issue in the USA. The income level of individuals may impact their ability to access healthcare. Therefore, it will be interesting to explore what the data say.

Research quesion 2: Is there a correlation between inadequate sleep time and the general health status of the respondents?

Inadequate sleep is known for contributing to the poor health of individuals over time.

Research quesion 3: What’s the relationship between high blood pressure and smoking habits? What’s the rate of diabetes positivity in terms of physical activity?

Smoking is known to bad for health and exercise is helpful for diabetes patients. So, it will be interesting to dig into the related data.


Part 3: Exploratory data analysis

Research quesion 1:

## 'data.frame':    491775 obs. of  3 variables:
##  $ genhlth : Factor w/ 5 levels "Excellent","Very good",..: 4 3 3 2 3 2 4 3 1 3 ...
##  $ hlthpln1: Factor w/ 2 levels "Yes","No": 1 1 1 1 1 1 1 1 1 1 ...
##  $ income2 : Factor w/ 8 levels "Less than $10,000",..: 7 8 8 7 6 8 NA 6 8 4 ...

All three variables here are categorical variables with multiple levels.

##        Var1    Yes    No
## 1 Excellent  77024  8126
## 2 Very good 144394 14168
## 3      Good 130269 19631
## 4      Fair  56882  9599
## 5      Poor  24389  3452

This dataframe shows the count of respondents in terms of their health status and healthcare coverage. I will convert these counts into proportions in order to visualize.

##        Var1  Yes   No
## 1 Excellent 0.18 0.15
## 2 Very good 0.33 0.26
## 3      Good 0.30 0.36
## 4      Fair 0.13 0.17
## 5      Poor 0.06 0.06
## Using Var1 as id variables
##         Var1 Health_Plan Percentage
## 1  Excellent         Yes       0.18
## 2  Very good         Yes       0.33
## 3       Good         Yes       0.30
## 4       Fair         Yes       0.13
## 5       Poor         Yes       0.06
## 6  Excellent          No       0.15
## 7  Very good          No       0.26
## 8       Good          No       0.36
## 9       Fair          No       0.17
## 10      Poor          No       0.06

We see that over three-quarters (Good+Very good+Excellent) of the respondents reported having good health regardless of their healthcare coverage status.

Let’s explore the health care coverage in terms of income level.

##   Var1 Less than $10,000 Less than $15,000 Less than $20,000 Less than $25,000
## 1  Yes             18732             21143             26695             33312
## 2   No              6551              5558              8061              8295
##   Less than $35,000 Less than $50,000 Less than $75,000 $75,000 or more
## 1             41738             55575             61732          113023
## 2              7024              5824              3414            2771
## [1] "Var1"              "Less than $10,000" "Less than $15,000"
## [4] "Less than $20,000" "Less than $25,000" "Less than $35,000"
## [7] "Less than $50,000" "Less than $75,000" "$75,000 or more"
##   Var1 Less_than_10k Less_than_15k Less_than_20k Less_than_25k Less_than_35k
## 1  Yes         18732         21143         26695         33312         41738
## 2   No          6551          5558          8061          8295          7024
##   Less_than_50k Less_than_75k k75k_or_more
## 1         55575         61732       113023
## 2          5824          3414         2771

The columns are renamed here.

##   Var1 Less_than_10k Less_than_15k Less_than_20k Less_than_25k Less_than_35k
## 1  Yes          0.74          0.79          0.77           0.8          0.86
## 2   No          0.26          0.21          0.23           0.2          0.14
##   Less_than_50k Less_than_75k k75k_or_more
## 1          0.91          0.95         0.98
## 2          0.09          0.05         0.02
## Using Var1 as id variables

Converted the frequency table into proportions.

We see that as the income level grows, so does the proportion of respondents having access to healthcare coverage with an exception in the \(15\)K-\(20\)K range. So, the correlation is evident.

Research quesion 2:

## 'data.frame':    491775 obs. of  2 variables:
##  $ genhlth : Factor w/ 5 levels "Excellent","Very good",..: 4 3 3 2 3 2 4 3 1 3 ...
##  $ sleptim1: num  NA 6 9 8 6 8 7 6 8 8 ...

No wonder ‘sleep time’ is a numerical variable. Let’s have a look at some of the summary statistics.

##   sleep_mean sleep_median sleep_sd sleep_min sleepmax
## 1   7.042359            7  1.42982         0       16

For simplicity, I’ve removed data points representing sleep time more than \(16\) hours. The average sleep time is roughly \(7\) hours with a standard deviation of \(1.43\) hours.

## Warning in (function (kind = NULL, normal.kind = NULL, sample.kind = NULL) :
## non-uniform 'Rounding' sampler used

We can see that the median sleep time of the respondents who reported poor health is less than the other four groups. So, health status and sleep time seems to be correlated.

Research quesion 3:

## 'data.frame':    491775 obs. of  5 variables:
##  $ bphigh4 : Factor w/ 4 levels "Yes","Yes, but female told only during pregnancy",..: 1 3 3 3 1 1 1 1 3 3 ...
##  $ diabete3: Factor w/ 4 levels "Yes","Yes, but female told only during pregnancy",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ weight2 : Factor w/ 570 levels "",".b","100",..: 154 30 63 31 169 128 9 1 139 73 ...
##  $ smokday2: Factor w/ 3 levels "Every day","Some days",..: 3 NA 2 NA 3 NA 3 1 NA NA ...
##  $ exerany2: Factor w/ 2 levels "Yes","No": 2 1 2 1 2 1 1 1 1 1 ...
##   bphigh4 diabete3 weight2   smokday2 exerany2
## 1     Yes       No     250 Not at all       No
## 2      No       No     127       <NA>      Yes
## 3      No       No     160  Some days       No
## 4      No       No     128       <NA>      Yes
## 5     Yes       No     265 Not at all       No
## 6     Yes       No     225       <NA>      Yes

Changed weight to a numeric variable.

##    wt_mean wt_median    wt_sd wt_min wt_max
## 1 80.21934        73 59.64982      1    570

It looks like not everybody reported their weights in pounds.

Let’s explore the relationship between high blood pressure and smoking habits.
##         Var1   Yes    No
## 1  Every day 20645 33489
## 2  Some days  8062 13005
## 3 Not at all 67848 67710
##         Var1  Yes   No
## 1  Every day 0.21 0.29
## 2  Some days 0.08 0.11
## 3 Not at all 0.70 0.59
## Using Var1 as id variables
##         Var1 high.bp Percentage
## 1  Every day     Yes       0.21
## 2  Some days     Yes       0.08
## 3 Not at all     Yes       0.70
## 4  Every day      No       0.29
## 5  Some days      No       0.11
## 6 Not at all      No       0.59

Seventy percent (\(70\)%) of the respondents reporting high blood pressure do not smoke at all.

Diabetes vs Exercise
##   Var1    Yes    No
## 1  Yes  35003 23434
## 3   No 288417 97793
##   Var1  Yes   No
## 1  Yes 0.11 0.19
## 3   No 0.89 0.81
## Using Var1 as id variables
##   Var1 Exercise Percentage
## 1  Yes      Yes       0.11
## 2   No      Yes       0.89
## 3  Yes       No       0.19
## 4   No       No       0.81

\(1\) out of \(10\) respondents who do physical activity reported having diabetes. On the other hand, \(1\) in \(5\) who do not do any physical activity reported having diabetes. So, the diabetes positivity rate among respondents who do not do physical activity is almost twice the rate of respondents who are physically active. Although the correlation between diabetes and exercise is somewhat obvious, we cannot conclude that lack of exercise causes diabetes.