R Markdown

Dataset - National Health Interview Adult Survey

The dataset is data from the 2021 National Health Interview Adult Survey. The survey contained questions related to household and family composition, demographics about the survey taker, satisfaction with life, health insurance, medication, immunization, preventive screenings, and multiple health problems such as hypertension, cardiovascular conditions, cancer, vision, hearing, mobility, and more.

This survey is important for tracking Americans’ health across many different aspects of their lives. Comparing it with previous surveys can also reveal trends in Americans’ health.

Questions

1. Does education level play a role in mental or physical health?
2. Which health issues correlate with other health issues?
3. Which health issues are more common among certain demographics?
4. Has COVID possibly had an effect on certain health issues?
5. Is there a link between physical health and mental health?

Columns

1: General Health

1: Excellent
2: Very Good
3: Good
4: Fair
5: Poor
7: Refused
8: Not Ascertained
9: Don't Know

2: Life Satisfaction

1: Very Satisfied
2: Satisfied
3: Dissatisfied
4: Very Dissatisfied
7: Refused
8: Not Ascertained
9: Don't Know

3: General Demographics

Classification of County Lived In
  1: Large central metro
  2: Large fringe metro
  3: Medium and small metro
  4: Nonmetropolitan
  
Household Region
  1: Northeast
  2: Midwest
  3: South
  4: West
  
Age
  18-84: 18-84 with number corresponding
  85: 85+
  97: Refused
  98: Not Ascertained
  99: Don't Know
  
Age 65+
  1: Less than 65
  2: 65 or older
  7: Refused
  8: Not Ascertained
  9: Don't Know
  
Sex
  1: Male
  2: Female
  7: Refused
  8: Not Ascertained
  9: Don't Know
  

Education Level

0: Never attended/Kindergarten only
1: Grade 1-11
2: 12th grade, no diploma
3: GED or equivalent
4: High School Graduate
5: Some college, no degree
6: Associate degree: occupational, technical, or vocational program
7: Associate degree: academic program
8: Bachelor's degree
9: Master's degree
10: Professional School or Doctoral degree
97: Refused
98: Not Ascertained
99: Don't Know

Weight

Person's weight in lbs

Height

Person's height in inches (the sampled values, such as 63 to 74, are consistent with inches)

Medical Problems

Questions were laid out as... 
  Told you have (condition)?
  Told you have (condition) on 2 or more visits?
  Had (condition) in past 12 months?
  
...with the possible responses being:
  1: Yes. 1 is also answered if the respondent is taking medication to control the issue
  2: No
  7: Refused
  8: Not Ascertained
  9: Don't Know
  

Cancer

Types Included
  1
  2
  3
  4

Age when first told had (type) cancer?
  1-84: 1-84 years, with the corresponding number
  85: 85+ years
  97: Refused
  98: Not Ascertained
  99: Don't Know

Others

Days Missed Work
  0-129: 0 to 129 with corresponding value
  130: 130+ days
  997: Refused
  998: Not Ascertained
  999: Don't Know
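
The nonresponse codes listed above (7/8/9, 97/98/99, 997/998/999) are stored as ordinary numbers, so they can distort means and plots unless they are set aside first. A minimal sketch, assuming a toy data frame with hypothetical column names (health, age, days_missed) and a hypothetical helper codes_to_na(); none of these names come from the survey file:

```r
# Toy data only; the values below are made up for illustration.
toy <- data.frame(
  health      = c(1, 3, 7, 2, 9),        # 7 and 9 are nonresponse codes
  age         = c(34, 85, 97, 61, 99),   # 97 and 99 are nonresponse codes
  days_missed = c(0, 12, 997, 3, 999)    # 997 and 999 are nonresponse codes
)

# Hypothetical helper: replace the listed codes with NA.
codes_to_na <- function(x, codes) {
  x[x %in% codes] <- NA
  x
}

toy$health      <- codes_to_na(toy$health, c(7, 8, 9))
toy$age         <- codes_to_na(toy$age, c(97, 98, 99))
toy$days_missed <- codes_to_na(toy$days_missed, c(997, 998, 999))

# Summaries now skip the recoded values.
mean(toy$days_missed, na.rm = TRUE)   # averages only 0, 12, and 3
```

Recoding to NA keeps the rows in the data, so other columns for the same respondent remain usable.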

Unclear Data (Week 5 Assignment)

Most of the column names were unclear until I read the Codebook. However, it was often easy to tell which category a column fell under: EDUCP_A likely had something to do with education, while variables with CAN in their names had to do with cancer. I have an Excel sheet of the data where the columns are color coded by whether I know them from the codebook, whether they are missing from the codebook, or whether I will not be using them. Some of the unclear columns are the ones that start with DRK, PA18, MOD, VIG, and STR. I am still working on figuring those out.

Among the columns I do know, there are a few that I am unclear about. Among the cancer columns, one asks at what age respondents were told they had colon-rectal cancer. However, two other questions ask about colon cancer and rectal cancer, so I am trying to figure out whether those record the same thing or are separate.

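# Keep only the three age-at-diagnosis columns (colorectal, colon, rectal)
dfColonRectal <- adult22[, c("COLRCAGETC_A", "COLONAGETC_A", "RECTUAGETC_A")]

# Valid ages are 1-85; compare against the number 85, not the string "85"
dfColonRectalAge <- subset(dfColonRectal, COLRCAGETC_A <= 85)
#count(dfColonRectalAge) = 196
#print(dfColonRectalAge)

# Rows where the colorectal age matches the colon age or the rectal age
dfColonRectalAgeTest <- subset(dfColonRectalAge, COLRCAGETC_A == COLONAGETC_A | COLRCAGETC_A == RECTUAGETC_A)
#count(dfColonRectalAgeTest) = 196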

Both counts are 196, so every respondent with a valid colorectal age gave the same age for either colon or rectal cancer. This won’t cause problems for the data; I just have to make sure I don’t count ColoRectal and Colon, or ColoRectal and Rectal, as separate cancers, for example when counting how many types of cancer one person has.
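
The same check can be tried on a few made-up rows to see how missing values behave. This is only a sketch: the ages below are invented, and treating an NA comparison as "no match" is an assumption about how blanks should be handled.

```r
# Made-up ages for illustration; NA means the question did not apply.
toy <- data.frame(
  COLRCAGETC_A = c(60, 72, 55, 97, NA),
  COLONAGETC_A = c(60, NA, 55, NA, NA),
  RECTUAGETC_A = c(NA, 72, NA, NA, NA)
)

# Keep valid colorectal ages (1-85); subset() drops rows where the test is NA.
valid <- subset(toy, COLRCAGETC_A <= 85)

# `%in% TRUE` turns an NA comparison into FALSE instead of NA.
same_as_colon  <- (valid$COLRCAGETC_A == valid$COLONAGETC_A) %in% TRUE
same_as_rectal <- (valid$COLRCAGETC_A == valid$RECTUAGETC_A) %in% TRUE

nrow(valid)                           # rows with a valid colorectal age
sum(same_as_colon | same_as_rectal)   # rows matching the colon or rectal age
```

If the two numbers agree, every colorectal age is already recorded under colon or rectal.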

library(ggplot2)

# Weight (the mean and max include special codes such as 999, so the mean is inflated)
mean(adult22$WEIGHTLBTC_A)
## [1] 246.2174
max(adult22$WEIGHTLBTC_A)
## [1] 999
min(adult22$WEIGHTLBTC_A)
## [1] 100
# Age (85 means 85+, and the max of 99 is the "Don't Know" code)
mean(adult22$AGEP_A)
## [1] 53.05092
max(adult22$AGEP_A)
## [1] 99
min(adult22$AGEP_A)
## [1] 18
# Age65+
ggplot(adult22, aes(x = AGE65)) +
  geom_bar()
## Warning: Removed 27525 rows containing non-finite values (`stat_count()`).

# Sex
ggplot(adult22, aes(x = SEX_A)) +
  geom_bar()

# Education Level

ggplot(adult22, aes(x = EDUCP_A)) +
  geom_bar()

#General Health
mean(adult22$PHSTAT_A)
## [1] 2.440273
ggplot(adult22, aes(x = PHSTAT_A)) +
  geom_bar()

plot(adult22$AGEP_A, adult22$PHSTAT_A)
abline(lm(adult22$PHSTAT_A ~ adult22$AGEP_A), col = "red", lwd = 3)

#Weight and Health

# dfWeightFilter <- adult22[adult22$WEIGHTLBTC_A < '997', ]

dfWeightFilter <- adult22 %>% 
  filter(WEIGHTLBTC_A <= 900)

plot(dfWeightFilter$WEIGHTLBTC_A)

dfWHFilter <- dfWeightFilter %>% 
  filter(PHSTAT_A <= 6)

dfHighHealth <- dfWHFilter %>%
  filter(PHSTAT_A < 3 )

dfHighWeight <- dfWHFilter %>%
  filter(WEIGHTLBTC_A >= 250 )

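# Counts by weight band, using numeric cut points.
# Respondents at exactly 150 lbs fall in neither Weight1 nor Weight2,
# and Weight4 counts everyone at or below 250 lbs rather than those above 250.
Weight1 <- nrow(dfWeightFilter[dfWeightFilter$WEIGHTLBTC_A < 150, ])
Weight2 <- nrow(dfWeightFilter[dfWeightFilter$WEIGHTLBTC_A > 150 & dfWeightFilter$WEIGHTLBTC_A <= 200, ])
Weight3 <- nrow(dfWeightFilter[dfWeightFilter$WEIGHTLBTC_A > 200 & dfWeightFilter$WEIGHTLBTC_A <= 250, ])
Weight4 <- nrow(dfWeightFilter[dfWeightFilter$WEIGHTLBTC_A <= 250, ])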

dfWeightCount <- data.frame(Weight1, Weight2, Weight3, Weight4)
print(dfWeightCount)
##   Weight1 Weight2 Weight3 Weight4
## 1    6451   11717    5008   24210
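
Hand-written cut points can leave a value such as exactly 150 out of every band. One way to get bands that cover each value exactly once is base R's cut(). A sketch with made-up weights (the vector and the band labels are illustrative, not taken from the survey):

```r
# Made-up weights in pounds, for illustration only.
weights <- c(120, 150, 151, 200, 201, 250, 251, 299)

# Intervals are right-closed by default: (-Inf,150], (150,200], (200,250], (250,Inf]
weight_band <- cut(weights,
                   breaks = c(-Inf, 150, 200, 250, Inf),
                   labels = c("150 or less", "151-200", "201-250", "over 250"))

# Each weight lands in exactly one band, so the counts add up to length(weights).
table(weight_band)
```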
plot(dfWHFilter$WEIGHTLBTC_A, dfWHFilter$PHSTAT_A, xlab = "Weight", ylab = "General Health")

plot(dfHighWeight$WEIGHTLBTC_A, dfHighWeight$PHSTAT_A, xlab = "Weight", ylab = "General Health")

hist(dfWeightFilter$WEIGHTLBTC_A)

#Weight and Height

plot(adult22$WEIGHTLBTC_A, adult22$HEIGHTTC_A)

# Group_By

library(magrittr)
## 
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
## 
##     set_names
## The following object is masked from 'package:tidyr':
## 
##     extract
library(dplyr)

dfEdu <- adult22 %>% group_by(adult22$EDUCP_A)

mean(dfEdu$EDUCP_A) 
## [1] 6.443528
# roughly code 6 (an associate degree); the mean treats the codes as numbers and includes the 97-99 nonresponse codes

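# Probability of at least an associate degree (6, 7, 8, 9, 10)
# The bounds must be numbers. Quoted bounds ('6', '10') are compared as text, where "6" <= "10" is FALSE,
# so no row matches; that is how the 0 recorded below was produced.

prob_Associate_Up <- nrow(dfEdu[dfEdu$EDUCP_A >= 6 & dfEdu$EDUCP_A <= 10, ])

prob_All <- nrow(dfEdu)

prob_Associate_Up/prob_All
## [1] 0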
# Probability of below grade 12

prob_Under_12 <- nrow(dfEdu[dfEdu$EDUCP_A <= 1, ])

prob_Under_12/prob_All
## [1] 0.06802647
# Probability of associate or higher and positive life satisfaction
# With quoted bounds ('6', '10') both counts are 0, and 0/0 gives the NaN recorded below.

prob_Associate_Satisfied <- nrow(dfEdu[dfEdu$EDUCP_A >= 6 & dfEdu$EDUCP_A <= 10 & dfEdu$LSATIS4_A <= 2, ])

prob_Associate_Satisfied/prob_Associate_Up
## [1] NaN
#Probability of below grade 12 and satisfied
prob_Under12_Satisfied <- nrow(dfEdu[dfEdu$EDUCP_A <= 1 & dfEdu$LSATIS4_A <= 2, ])

prob_Under12_Satisfied/prob_Under_12
## [1] 0.917597
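
When one side of a comparison is quoted, R converts the number to text and compares character by character, so a range such as '6' to '10' can match nothing at all. That is how a count of 0 and a ratio of NaN can arise. A minimal sketch with made-up education codes (the vector edu is hypothetical, not the survey column):

```r
edu <- c(0, 1, 4, 6, 8, 10, 97)   # made-up education codes

# Quoted bounds: as text, "10" sorts before "6", so nothing is in the range.
n_quoted <- sum(edu >= "6" & edu <= "10")

# Numeric bounds: 6, 8, and 10 are in the range.
n_numeric <- sum(edu >= 6 & edu <= 10)

n_quoted    # 0
n_numeric   # 3
n_quoted / n_quoted   # 0/0 is NaN
```

Single-digit codes happen to sort the same way as text and as numbers, which is why quoted comparisons on columns like LSATIS4_A still give the expected counts.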
#Probability of normal BMI (18.5 to 24.9) and general health

dfHealth <- adult22 %>% group_by(adult22$PHSTAT_A)

prob_NormBMI <- nrow(dfHealth[dfHealth$BMICAT_A == 2, ])

prob_NormBMI/prob_All
## [1] 0.307186
# "GoodHealth" here means PHSTAT_A <= 4, i.e. Excellent through Fair
prob_NormBMI_GoodHealth <- nrow(dfHealth[dfHealth$BMICAT_A == 2 & dfHealth$PHSTAT_A <= 4, ])

prob_NormBMI_GoodHealth/prob_NormBMI
## [1] 0.970332
#Probability of overweight BMI and positive/negative health

prob_OverweightBMI <- nrow(dfHealth[dfHealth$BMICAT_A == 3, ])

prob_OverweightBMI/prob_All
## [1] 0.3357926
prob_OverweightBMI_GoodHealth <- nrow(dfHealth[dfHealth$BMICAT_A == 3 & dfHealth$PHSTAT_A <= 4, ])

prob_OverweightBMI_GoodHealth/prob_OverweightBMI
## [1] 0.9696284
prob_OverweightBMI_BadHealth <- nrow(dfHealth[dfHealth$BMICAT_A == 3 & dfHealth$PHSTAT_A == 5, ])

prob_OverweightBMI_BadHealth/prob_OverweightBMI
## [1] 0.02994076
prob_GoodHealth <- nrow(dfHealth[dfHealth$PHSTAT_A <= 4, ])

# How many of all BMIs rated their health as fair or better

prob_GoodHealth/prob_All
## [1] 0.9626053
# About 96% of respondents rated their health as fair or better (PHSTAT_A <= 4 includes "Fair"). Even among different BMIs, the share was above 90%.

# Do most people simply see themselves as healthy, or were most of the survey takers healthy in general? -- Check the more specific medical issues
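
The conditional shares above can also be read off a single cross-tabulation instead of one nrow() call per category. A sketch on made-up rows, assuming the same coding as the codebook (BMICAT_A 2 = normal, 3 = overweight, 4 = obese; PHSTAT_A 1 to 5); the data frame toy and the column fair_or_better are hypothetical:

```r
# Made-up rows for illustration only.
toy <- data.frame(
  BMICAT_A = c(2, 2, 2, 3, 3, 3, 4, 4),
  PHSTAT_A = c(1, 2, 5, 3, 4, 5, 2, 5)
)

# TRUE when general health is Excellent through Fair.
toy$fair_or_better <- toy$PHSTAT_A <= 4

tab <- table(BMI = toy$BMICAT_A, FairOrBetter = toy$fair_or_better)

# margin = 1 divides each row by its total, giving the share within each BMI category.
shares <- prop.table(tab, margin = 1)
shares
```

Each row of shares sums to 1, so the TRUE column is the share of that BMI category reporting fair or better health.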

BMI New Column

# Sort BMI by Underweight, Normal, Overweight, Obese

adult22_raw <- adult22

adult22BMI <- adult22_raw

# Recode BMICAT_A into labels; any other code (or a missing value) becomes "Unknown".
# Testing BMICAT_A == 2 explicitly keeps code 9 from being labelled "Normal".
adult22BMI <-
  adult22BMI |>
    mutate(BMI_Status = case_when(BMICAT_A == 1 ~ "Under",
                                  BMICAT_A == 2 ~ "Normal",
                                  BMICAT_A == 3 ~ "Over",
                                  BMICAT_A == 4 ~ "Obese",
                                  TRUE ~ "Unknown"))

Normal <- nrow(adult22BMI[adult22BMI$BMI_Status == 'Normal',])
# Life Satisfaction and General Health

prob_GoodLS_Health <- nrow(dfHealth[dfHealth$LSATIS4_A <= 2 & dfHealth$PHSTAT_A <= 4, ])
prob_GoodLS_Health/prob_All
## [1] 0.9275976
#Prob out of those who have high general health
prob_GoodLS_Health/prob_GoodHealth
## [1] 0.9636323
#Bad life satisfaction and bad health out of all
prob_BadLS_Health <- nrow(dfHealth[dfHealth$LSATIS4_A >= 3 & dfHealth$LSATIS4_A <= 4 & dfHealth$PHSTAT_A == 5, ])

prob_BadLS_Health/prob_All
## [1] 0.01182597
#Bad life satisfaction among those with low health
# Count of respondents reporting poor health (PHSTAT_A == 5)
prob_Low_Health <- nrow(dfHealth[dfHealth$PHSTAT_A == 5, ])

prob_BadLS_Health/prob_Low_Health
## [1] 0.3180934
plot(adult22$EDUCP_A, adult22$LSATIS4_A)
abline(lm(adult22$LSATIS4_A ~ adult22$EDUCP_A), col = "red", lwd = 3)

# Because the survey was mostly multiple choice, there are no major anomalies. The only values that fall outside the typical range of responses are "don't know", "refused", and "not ascertained", and even those have specific codes that are consistent across questions.

# There were a few strange ones among these, such as a few people with "don't know/not ascertained" recorded for their age, which is something they should know. Probably a wrong click or just not paying attention?
#Education Dataframe Sample
dfEduSample <- dfEdu[ , c("EDUCP_A")]  
dfEdu1 <- sample_n(dfEduSample,100, replace = TRUE)
dfEdu2 <- sample_n(dfEduSample,100, replace = TRUE)
dfEdu3 <- sample_n(dfEduSample,100, replace = TRUE)
dfEdu4 <- sample_n(dfEduSample,100, replace = TRUE)
dfEdu5 <- sample_n(dfEduSample,100, replace = TRUE)
print(dfEdu1)
## # A tibble: 100 × 1
##    EDUCP_A
##      <int>
##  1       8
##  2       9
##  3      10
##  4       4
##  5       8
##  6       7
##  7       4
##  8       9
##  9       1
## 10       7
## # ℹ 90 more rows
print(mean(dfEdu1$EDUCP_A))
## [1] 6.64
print(dfEdu2)
## # A tibble: 100 × 1
##    EDUCP_A
##      <int>
##  1       9
##  2       3
##  3      10
##  4       4
##  5       4
##  6       8
##  7       9
##  8       4
##  9       4
## 10       8
## # ℹ 90 more rows
print(mean(dfEdu2$EDUCP_A))
## [1] 5.58
print(dfEdu3)
## # A tibble: 100 × 1
##    EDUCP_A
##      <int>
##  1       5
##  2       8
##  3       6
##  4       4
##  5       4
##  6       7
##  7       7
##  8      99
##  9      10
## 10       8
## # ℹ 90 more rows
print(mean(dfEdu3$EDUCP_A))
## [1] 6.98
print(dfEdu4)
## # A tibble: 100 × 1
##    EDUCP_A
##      <int>
##  1       4
##  2       5
##  3       4
##  4       5
##  5       8
##  6       9
##  7       5
##  8       8
##  9       8
## 10       7
## # ℹ 90 more rows
print(mean(dfEdu4$EDUCP_A))
## [1] 7.48
print(dfEdu5)
## # A tibble: 100 × 1
##    EDUCP_A
##      <int>
##  1       4
##  2       4
##  3       1
##  4       5
##  5       4
##  6      10
##  7       4
##  8       8
##  9       4
## 10       4
## # ℹ 90 more rows
print(mean(dfEdu5$EDUCP_A))
## [1] 5.56
# Across the samples, the average falls between 5 (some college) and 8 (Bachelor's degree). However, if a sample draws any of the codes 97, 98, or 99 (refused, not ascertained, don't know), its mean is pulled sharply upward.
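
Repeated sampling like this can be written once with replicate(), and dropping the nonresponse codes first keeps a stray 99 from dominating a sample mean. A sketch on simulated codes (the vector edu, its probabilities, and the seed are all made up for illustration and are not the survey data):

```r
set.seed(1)   # arbitrary seed so the draws can be reproduced

# Simulated education codes, including an occasional 99 ("Don't Know").
edu <- sample(c(1, 4, 5, 8, 9, 10, 99), size = 1000, replace = TRUE,
              prob = c(0.05, 0.25, 0.20, 0.25, 0.15, 0.09, 0.01))

# Keep only real education levels (0-10) before sampling.
edu_valid <- edu[edu <= 10]

# Five samples of 100 drawn with replacement, one mean per sample.
sample_means <- replicate(5, mean(sample(edu_valid, 100, replace = TRUE)))
sample_means
```

Setting a seed is a design choice for reproducibility; without it, each knit of the document prints different samples.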
dfWeightHeightSample <- dfHealth[ , c("WEIGHTLBTC_A", "HEIGHTTC_A")]  
dfWH1 <- sample_n(dfWeightHeightSample,100, replace = TRUE)
dfWH2 <- sample_n(dfWeightHeightSample,100, replace = TRUE)
dfWH3 <- sample_n(dfWeightHeightSample,100, replace = TRUE)
dfWH4 <- sample_n(dfWeightHeightSample,100, replace = TRUE)
dfWH5 <- sample_n(dfWeightHeightSample,100, replace = TRUE)
print(dfWH1)
## # A tibble: 100 × 2
##    WEIGHTLBTC_A HEIGHTTC_A
##           <int>      <int>
##  1          128         65
##  2          250         72
##  3          143         64
##  4          215         71
##  5          997         63
##  6          996         96
##  7          124         64
##  8          996         96
##  9          160         63
## 10          215         74
## # ℹ 90 more rows
print(dfWH2)
## # A tibble: 100 × 2
##    WEIGHTLBTC_A HEIGHTTC_A
##           <int>      <int>
##  1          160         64
##  2          240         70
##  3          200         72
##  4          180         72
##  5          110         62
##  6          180         64
##  7          996         96
##  8          215         69
##  9          240         69
## 10          200         70
## # ℹ 90 more rows
print(dfWH3)
## # A tibble: 100 × 2
##    WEIGHTLBTC_A HEIGHTTC_A
##           <int>      <int>
##  1          130         59
##  2          175         71
##  3          180         65
##  4          150         71
##  5          999         65
##  6          155         71
##  7          140         65
##  8          215         63
##  9          996         96
## 10          190         62
## # ℹ 90 more rows
print(dfWH4)
## # A tibble: 100 × 2
##    WEIGHTLBTC_A HEIGHTTC_A
##           <int>      <int>
##  1          122         63
##  2          131         64
##  3          996         96
##  4          110         60
##  5          150         69
##  6          144         63
##  7          168         64
##  8          180         64
##  9          178         68
## 10          165         63
## # ℹ 90 more rows
print(dfWH5)
## # A tibble: 100 × 2
##    WEIGHTLBTC_A HEIGHTTC_A
##           <int>      <int>
##  1          195         73
##  2          170         66
##  3          138         63
##  4          170         70
##  5          140         71
##  6          175         71
##  7          210         63
##  8          129         68
##  9          120         64
## 10          996         96
## # ℹ 90 more rows
plot(dfWH1$WEIGHTLBTC_A,dfWH1$HEIGHTTC_A,type="p",main="Normal Distribution",xlab="Weight(lbs)",ylab="Height")
points(dfWH2$WEIGHTLBTC_A, dfWH2$HEIGHTTC_A, col = "green")
points(dfWH3$WEIGHTLBTC_A, dfWH3$HEIGHTTC_A, col = "blue")
points(dfWH4$WEIGHTLBTC_A, dfWH4$HEIGHTTC_A, col = "red")
points(dfWH5$WEIGHTLBTC_A, dfWH5$HEIGHTTC_A, col = "yellow")

# Among the samples, the points stay in the same area for weight and height. Each sample also has about the same number of outliers, which are special codes (for example 996, 997, and 999 for weight and 96 for height) rather than real measurements.
dfGenHealthSample <- dfHealth[ , c("PHSTAT_A")]  
dfGH1 <- sample_n(dfGenHealthSample,100, replace = TRUE)
dfGH2 <- sample_n(dfGenHealthSample,100, replace = TRUE)
dfGH3 <- sample_n(dfGenHealthSample,100, replace = TRUE)
dfGH4 <- sample_n(dfGenHealthSample,100, replace = TRUE)
dfGH5 <- sample_n(dfGenHealthSample,100, replace = TRUE)

# Average
print(mean(dfGH1$PHSTAT_A))
## [1] 2.34
print(mean(dfGH2$PHSTAT_A))
## [1] 2.27
print(mean(dfGH3$PHSTAT_A))
## [1] 2.18
print(mean(dfGH4$PHSTAT_A))
## [1] 2.45
print(mean(dfGH5$PHSTAT_A))
## [1] 2.41
# The average tends to be between 2 and 3, which makes sense because the general health among all survey takers is often a 2 (Very good) or 3 (Good).

Looking at data among cancer types

Types:

BLADDCAN_A BLOODCAN_A BONECAN_A BRAINCAN_A BREASCAN_A CERVICAN_A ESOPHCAN_A GALLBCAN_A LARYNCAN_A LEUKECAN_A LIVERCAN_A LUNGCAN_A LYMPHCAN_A MELANCAN_A MOUTHCAN_A OVARYCAN_A PANCRCAN_A PROSTCAN_A SKNMCAN_A SKNNMCAN_A SKNDKCAN_A STOMACAN_A THROACAN_A THYROCAN_A UTERUCAN_A HDNCKCAN_A COLRCCAN_A OTHERCANP_A

Number of reported cancers: NUMCAN_A

Age Told has Cancer

BLADDAGETC_A BLOODAGETC_A BONEAGETC_A BRAINAGETC_A BREASAGETC_A CERVIAGETC_A COLONAGETC_A ESOPHAGETC_A GALLBAGETC_A LARYNAGETC_A LEUKEAGETC_A LIVERAGETC_A LUNGAGETC_A LYMPHAGETC_A MELANAGETC_A MOUTHAGETC_A OVARYAGETC_A PANCRAGETC_A PROSTAGETC_A SKNMAGETC_A SKNNMAGETC_A SKNDKAGETC_A STOMAAGETC_A THROAAGETC_A THYROAGETC_A UTERUAGETC_A HDNCKAGETC_A COLRCAGETC_A OTHERAGETC_A

# Cancers df

dfCancer <- adult22 %>% 
  filter(NUMCAN_A > 0 & NUMCAN_A < 7)

ggplot(dfCancer, aes(x = NUMCAN_A)) +
  geom_bar()

ggplot(dfCancer, aes(NUMCAN_A, LSATIS4_A, colour=NUMCAN_A)) + 
    geom_line() + 
    geom_point()

ggplot(dfCancer, aes(NUMCAN_A, AGEP_A, colour=NUMCAN_A)) + 
    geom_line() + 
    geom_point()

plot(dfCancer$AGEP_A, dfCancer$NUMCAN_A)
abline(lm(dfCancer$NUMCAN_A ~ dfCancer$AGEP_A), col = "red", lwd = 3)

# Age CI of those with cancer
resultCAN <- t.test(dfCancer$AGEP_A)
confidence_intervalCAN <- resultCAN$conf.int
confidence_intervalCAN
## [1] 68.11267 68.97246
## attr(,"conf.level")
## [1] 0.95
mean(dfCancer$AGEP_A)
## [1] 68.54257
# Respondents with no reported cancer
dfNoCancer <- adult22 %>% 
  filter(NUMCAN_A == 0)

# Age CI of those with no cancer
resultNOCAN <- t.test(dfNoCancer$AGEP_A)
confidence_intervalNONE <- resultNOCAN$conf.int
confidence_intervalNONE
## [1] 50.60551 51.06352
## attr(,"conf.level")
## [1] 0.95
mean(dfNoCancer$AGEP_A)
## [1] 50.83452
# Age CI of all
result <- t.test(adult22$AGEP_A)
confidence_interval <- result$conf.int
confidence_interval
## [1] 52.83239 53.26945
## attr(,"conf.level")
## [1] 0.95
mean(adult22$AGEP_A)
## [1] 53.05092
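
Each interval above is what t.test() reports for a single mean: the sample mean plus or minus a t quantile times the standard error. A sketch on simulated ages (not the survey data; the seed, sample size, mean, and spread are arbitrary assumptions) that rebuilds the interval by hand and compares it with t.test():

```r
set.seed(42)   # arbitrary seed for reproducibility

# Simulated ages for illustration only.
age <- round(rnorm(200, mean = 53, sd = 18))

n  <- length(age)
se <- sd(age) / sqrt(n)                 # standard error of the mean

# 95% interval: mean +/- t(0.975, n - 1) * standard error
manual_ci  <- mean(age) + c(-1, 1) * qt(0.975, df = n - 1) * se
builtin_ci <- as.numeric(t.test(age)$conf.int)

all.equal(manual_ci, builtin_ci)        # the two intervals agree
```

Because the standard error shrinks with the square root of the sample size, large groups give narrow intervals, which is consistent with the tight ranges printed above.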