Dataset

The dataset is data from the 2021 National Health Interview Adult Survey. The survey contained questions related to household and family composition, demographics about the survey taker, satisfaction with life, health insurance, medication, immunization, preventive screenings, and multiple health problems such as hypertension, cardiovascular conditions, cancer, vision, hearing, mobility, and more.

This survey is important in following the health of American’s based on many different factors of their lives. Looking at previous surveys can also help to see trends in Americans’ health.

For the data used, responses of “refused”,“don’t know” and “not ascertained” were filtered out as these responses had a tendency to be much smaller than the others and were not helpful for what was being looked at.

Columns

General Health

1. Excellent 
2. Very Good 
3. Good 
4. Fair 
5. Poor 

Life Satisfaction

1. Very Satisfied 
2. Satisfied 
3. Dissatisfied 
4. Very Dissatisfied

Race (Single and Multiple Race Groups)

1.  White only
2.  Black/African American only
3.  Asian only
4.  AIAN only (American Indians and Alaska Natives)
5.  AIAN and any other group
6.  Other
7.  Hispanic who did not identify with one of the other categories

Education Level

0. Never attended/Kindergarten only 
1. Grade 1-11 
2. 12th grade, nodiploma 
3. GED or equivalent 
4. High School Graduate 
5. Some college, nodegree 
6. Associate degree: occupational, technical, or vocational program 
7. Associate degree: academic program 
8. Bachelor's degree 
9. Master's degree 
10. Professional School or Doctoral degree

BMI

1. Underweight 
2. Normal 
3. Overweight 
4. Obese

Vision

Do you wear glasses or contact lenses?
1. Yes 
2. No
Do you have difficulty seeing? 
1. No difficulty 
2. Some difficulty 
3. A lot of difficulty 
4. Cannot do at all

Cancer

Separate columns for each type of cancer.

Number of reported types of cancer. 
0. No cancer 
1-3. 1 to 3 
4. 4 or more

Demographics

Looking at demographics can help to give a bit of an idea about what sort of data we should look at.

Sex


dfSexFilter <- adult22 %>% 
  filter(SEX_A == 1 | SEX_A == 2)

dfSexFilter <-
  dfSexFilter |>
    group_by(dfSexFilter$SEX_A) |>
    mutate(Sex_Status = ifelse(SEX_A == 1,
                                 "Male", 
                                 "Female")) |>

  ungroup()
ggplot(dfSexFilter, aes(x = Sex_Status, fill=Sex_Status)) + xlab("Sex") +
  geom_bar()

Age

The far right bar, 85, is for those who are 85+.



dfAgeFilter <- adult22 %>% 
  filter(AGEP_A < 86)

ggplot(dfAgeFilter, aes(x = AGEP_A)) + xlab("Age") +
  geom_bar()

The largest group based on age were those that were 85 or older.

Education Level

dfEduFilter <- adult22 %>% 
  filter(EDUCP_A < 97)


dfEduFilter <-
  dfEduFilter |>
    group_by(dfEduFilter$EDUCP_A) |>
    mutate(Edu_Status = ifelse(EDUCP_A == 1,
                                "Grade 1-11", 
                               ifelse(EDUCP_A == 2,
                                 "12th Grade, no Diploma",
                                 ifelse(EDUCP_A == 3,
                                 "GED or Equivalent",
                                 ifelse(EDUCP_A == 4,
                                 "High School Graduate",
                                 ifelse(EDUCP_A == 5,
                                 "Some College, no Degree",
                                 ifelse(EDUCP_A == 6,
                                 "Associate degree: occupational, technical, or vocational program",
                                 ifelse(EDUCP_A == 7,
                                 "Associate degree: academic program",
                                 ifelse(EDUCP_A == 8,
                                 "Bachelor's degree",
                                 ifelse(EDUCP_A == 9,
                                 "Master's degree ",
                                 ifelse(EDUCP_A == 10,
                                 "Professional School or Doctoral degree",
                                 ifelse(EDUCP_A == 97,
                                 "Refused",
                                "Don't Know")))))))))))) |> 
ungroup()
  
  
dfEduFilter$Edu_Status <- factor(dfEduFilter$Edu_Status, levels = c("Grade 1-11", "12th Grade, no Diploma", "GED or Equivalent","High School Graduate", "Some College, no Degree", "Associate degree: occupational, technical, or vocational program", "Associate degree: academic program", "Bachelor's degree", "Master's degree ", "Professional School or Doctoral degree", "Refused", "Don't Know"))

ggplot(dfEduFilter, aes(x = EDUCP_A, fill=Edu_Status)) +
  geom_bar() + theme(axis.title.x=element_blank(),
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank()) + xlab("Education Level")

Race

dfRaceFilter <- adult22 %>% 
  filter(RACEALLP_A <= 8 )

dfRaceFilter2 <- dfRaceFilter  %>% 
  filter(RACEALLP_A != 7 )


dfRaceFilter2 <-
  dfRaceFilter2 |>
    group_by(dfRaceFilter2$RACEALLP_A) |>
    mutate(Race = ifelse(RACEALLP_A == 1,
                                "White", 
                               ifelse(RACEALLP_A == 2,
                                 "African American/Black",
                                 ifelse(RACEALLP_A == 3,
                                 "Asian",
                                 ifelse(RACEALLP_A == 4,
                                 "AIAN*",
                                 ifelse(RACEALLP_A == 5,
                                 "AIAN + Other Group",
                                 ifelse(RACEALLP_A == 6,
                                 "Other",
                                 ifelse(RACEALLP_A == 8,
                                 "Hispanic**",
                                "Don't Know")))))))) |> 
ungroup()
  
ggplot(dfRaceFilter2, aes(x = Race, fill=Race)) +
  geom_bar() + theme(axis.title.x=element_blank(),
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank())

There was a much larger amount of those who identified as white who took the survey.

General Physical Health


dfGHFilter <- adult22 %>% 
  filter(PHSTAT_A < '7')

ggplot(dfGHFilter, aes(x = factor(PHSTAT_A))) + scale_x_discrete(labels=c("Excellent", "Very Good", "Good", "Fair", "Poor")) +
  geom_bar() + 
  xlab("General Physical Health")

NA
NA
NA

Most people put their General Physical Health in the more positive side (Excellent, Very good, Good) side of the scale.

Life Satisfaction


dfFilteredLS <- adult22 %>% 
  filter(LSATIS4_A < '5')


ggplot(dfFilteredLS, aes(x = factor(LSATIS4_A))) + scale_x_discrete(labels=c("Very Satisfied", "Satisfied", "Dissatisfied", "Very Dissatisfied")) +
  geom_bar() +
  xlab("Life Satisfaction")

NA
NA
NA
NA

Most people considered themselves to be positively satisfied with life. The difference between the positive and negative side of the scale was larger than for General Physical Health.

BMI

BMI can often be contributed to health and certain levels can be contributed to health issues. However, BMI is fundamentally flawed as it does not take into consideration muscle fat. For example, a body builder could be considered to have an “overweight” BMI, but would be much healthier than someone who had mostly non-muscle fat.

Histogram

hist(dfFilteredBMI$BMICAT_A)

The distribution was relatively equal, except for the Underweight category, so there was a potential for this category to cause problems.

Bootstrap

This is to help us get a better idea of the average Life Satisfaction and General Physical Health based on a person’s BMI category, by creating samples.


dfFilteredLSBMI <- adult22 %>%
  filter(BMICAT_A < 5 & LSATIS4_A <5)

dfFilteredPHBMI <- adult22 %>%
  filter(BMICAT_A < 5 & PHSTAT_A <7)

boot_ci <- function (v, func = median, conf = 0.95, n_iter = 100) {
  boot_func <- \(x, i) func(x[i])
  
  b <- boot(v, boot_func, R = n_iter)
  b <- boot.ci(b, conf = conf, type = "perc")
  
  return(c("lower" = b$percent[4],
           "upper" = b$percent[5]))
}

Bootstrap: Life Satisfaction vs BMI

df_ciLS <- dfFilteredLSBMI |>
  group_by(BMICAT_A) |>
  summarise(ci_lower = boot_ci(LSATIS4_A, mean)['lower'],
            mean_LS = mean(LSATIS4_A),
            ci_upper = boot_ci(LSATIS4_A, mean)['upper'])

df_ciLS |>
  ggplot() +
  geom_errorbarh(mapping = aes(y = BMICAT_A, 
                               xmin=ci_lower, xmax=ci_upper,
                               color = '95% C.I.'), height = 0.5) +
  geom_point(mapping = aes(x = mean_LS, y = BMICAT_A,
                           color = 'Group Mean'),
             shape = '|',
             size = 5) +
  scale_color_manual(values=c('black', 'red')) +
  theme_minimal() +
  labs(title = "Life Satisfaction by BMI Category",
       x = "Life Satisfaction",
       y = "BMI Category",
       color = '')

Category 1, underweight, seems to have a much larger range than the other 3, which like stated above, could be an issue because of how small the distribution was.

Bootstrap: General Physical Health vs BMI

df_ciPH <- dfFilteredPHBMI |>
  group_by(BMICAT_A) |>
  summarise(ci_lower = boot_ci(PHSTAT_A, mean)['lower'],
            mean_PH = mean(PHSTAT_A),
            ci_upper = boot_ci(PHSTAT_A, mean)['upper'])

df_ciPH |>
  ggplot() +
  geom_errorbarh(mapping = aes(y = BMICAT_A, 
                               xmin=ci_lower, xmax=ci_upper,
                               color = '95% C.I.'), height = 0.5) +
  geom_point(mapping = aes(x = mean_PH, y = BMICAT_A,
                           color = 'Group Mean'),
             shape = '|',
             size = 5) +
  scale_color_manual(values=c('black', 'red')) +
  theme_minimal() +
  labs(title = "General Health by BMI Category",
       x = "General Health",
       y = "BMI Category",
       color = '')

There were way less people in the underweight BMI (1) than the other categories, similar to the issue in Life Satisfaction, so that group does not give as much information, or as accurate of information as the others and was removed for the graphs below.

General Health by BMI Status


dfBMIGH <- adult22 %>%
  filter(PHSTAT_A < 7)

dfNormalBMI <- dfBMIGH %>%
  filter(BMICAT_A == 2 )

dfOverBMI <- dfBMIGH %>%
  filter(BMICAT_A == 3 )

dfObeseBMI <- dfBMIGH %>%
  filter(BMICAT_A == 4 )

BMIGH <- data.frame(                                            
   Status  = c(2, 3, 4),       
   Physical_Health= c( mean(dfNormalBMI$PHSTAT_A), mean(dfOverBMI$PHSTAT_A), mean(dfObeseBMI$PHSTAT_A)))

ggplot(BMIGH, aes(x = factor(Status), y = Physical_Health)) + 
  geom_point() +
  scale_x_discrete(labels= c( "Normal", "Overweight", "Obese")) +
  xlab("BMI Status") +
  ylab("Average General Physical Health")

NA
NA

Life Satisfaction by BMI Status


dfBMILS <- adult22 %>%
  filter(LSATIS4_A < 5)

dfNormalBMILS <- dfBMILS %>%
  filter(BMICAT_A == 2 )

dfOverBMILS <- dfBMILS %>%
  filter(BMICAT_A == 3 )

dfObeseBMILS <- dfBMILS %>%
  filter(BMICAT_A == 4 )

BMILS <- data.frame(                                            
   Status  = c(2, 3, 4),       
   Life_Satis= c( mean(dfNormalBMILS$LSATIS4_A), mean(dfOverBMILS$LSATIS4_A), mean(dfObeseBMILS$LSATIS4_A)))

ggplot(BMILS, aes(x = factor(Status), y = Life_Satis)) + 
  geom_point() +
  scale_x_discrete(labels= c("Normal", "Overweight", "Obese")) +
  xlab("BMI Status") +
  ylab("Average Life Satisfaction")

NA
NA

Education Level

Vs General Physical Health


dfFilteredEdu <- dfGHFilter %>%
  filter(EDUCP_A < 97)

df_ciEdu <- dfFilteredEdu |>
  group_by(EDUCP_A) |>
  summarise(ci_lower = boot_ci(PHSTAT_A, mean)['lower'],
            mean_PH = mean(PHSTAT_A),
            ci_upper = boot_ci(PHSTAT_A, mean)['upper'])


df_ciEdu |>
  ggplot() +
  geom_errorbarh(mapping = aes(y = EDUCP_A, 
                               xmin=ci_lower, xmax=ci_upper,
                               color = '95% C.I.'), height = 0.5) +
  geom_point(mapping = aes(x = mean_PH, y = EDUCP_A,
                           color = 'Group Mean'),
             shape = '|',
             size = 5) +
  scale_color_manual(values=c('black', 'red')) +
  theme_minimal() +
  labs(title = "General Physical Health by Education",
       x = "Physical Health",
       y = "Education",
       color = '')

We can see that as a person’s education level decreases, their average physical health generally decreases, however it was still in the same physical health category. There seems to be a few groups forming, such as the bottom 3, the middle 4, and the top 3.

Vs Life Satisfaction


dfFilteredEduLS <- dfFilteredLS %>%
  filter(EDUCP_A < 97)

df_ciEduLS <- dfFilteredEduLS |>
  group_by(EDUCP_A) |>
  summarise(ci_lower = boot_ci(LSATIS4_A, mean)['lower'],
            mean_LS = mean(LSATIS4_A),
            ci_upper = boot_ci(LSATIS4_A, mean)['upper'])


df_ciEduLS |>
  ggplot() +
  geom_errorbarh(mapping = aes(y = EDUCP_A, 
                               xmin=ci_lower, xmax=ci_upper,
                               color = '95% C.I.'), height = 0.5) +
  geom_point(mapping = aes(x = mean_LS, y = EDUCP_A,
                           color = 'Group Mean'),
             shape = '|',
             size = 5) +
  scale_color_manual(values=c('black', 'red')) +
  theme_minimal() +
  labs(title = "Life Satisfaction by Education",
       x = "Life Satisfaction",
       y = "Education",
       color = '')

Similar to general health, there is a trend of as education level decreases, life satisfaction decreases.

Vision

dfFilteredVis <- adult22 %>%
  filter(VISIONDF_A <4)

hist(dfFilteredVis$VISIONDF_A)


hist(dfFilteredVis$WEARGLSS_A)

Vision vs General Physical Health

dfGlasses <- adult22 %>%
  filter(PHSTAT_A < 5)

dfNoGlasses <- dfGlasses %>%
  filter(WEARGLSS_A == '2')

dfWearGlasses <- dfGlasses %>%
  filter(WEARGLSS_A== '1' )

WearGlasses <- data.frame(                                            
   Status  = c(1, 2),       
   Gen_Health= c(mean(dfWearGlasses$PHSTAT_A), mean(dfNoGlasses$PHSTAT_A)))


ggplot(WearGlasses, aes(x = factor(Status), y = Gen_Health)) + 
  geom_point() +
  scale_x_discrete(labels= c("Wear Glasses/Contacts", "No Glasses/Contacts")) +
  xlab("")+
  ylab("Average Physical Health")

dfVision <- dfFilteredVis %>%
  filter(PHSTAT_A < 5)

dfNoneVis <- dfVision %>%
  filter(VISIONDF_A == '1')

dfSomeVis <- dfVision %>%
  filter(VISIONDF_A== '2' )

dfAlotVis <- dfVision %>%
  filter(VISIONDF_A== '3' )

dfCannotVis <- dfVision %>%
  filter(VISIONDF_A== '4' )

VisDifficulty <- data.frame(                                            
   Status  = c(1, 2,3,4),       
   Gen_Health= c(mean(dfNoneVis$PHSTAT_A), mean(dfSomeVis$PHSTAT_A), mean(dfAlotVis$PHSTAT_A), mean(dfCannotVis$PHSTAT_A)))


ggplot(VisDifficulty, aes(x = factor(Status), y = Gen_Health)) + 
  geom_point() +
  scale_x_discrete(labels= c("None", "Some", "A Lot","Cannot See")) +
  xlab("Difficulty Seeing") +
  ylab("Average Physical Health")

Vision vs Life Satisfaction

dfGlassesLS <- adult22 %>%
  filter(LSATIS4_A < 7)

dfNoGlassesLS <- dfGlassesLS %>%
  filter(WEARGLSS_A == '2')

dfWearGlassesLS <- dfGlassesLS %>%
  filter(WEARGLSS_A== '1' )

WearGlassesLS <- data.frame(                                            
   Status  = c(1, 2),       
   Life_Satis= c(mean(dfWearGlassesLS$LSATIS4_A), mean(dfNoGlassesLS$LSATIS4_A)))


ggplot(WearGlassesLS, aes(x = factor(Status), y = Life_Satis)) + 
  geom_point() +
  scale_x_discrete(labels= c("Wear Glasses/Contacts", "No Glasses/Contacts")) +
  xlab("")+
  ylab("Average Life Satisfaction")

Difficulty Seeing and Life Satisfaction

dfVisionLS <- dfFilteredVis %>%
  filter(LSATIS4_A < 7)

dfNoneVisLS <- dfVisionLS %>%
  filter(VISIONDF_A == '1')

dfSomeVisLS <- dfVisionLS %>%
  filter(VISIONDF_A== '2' )

dfAlotVisLS <- dfVisionLS %>%
  filter(VISIONDF_A== '3' )

dfCannotVisLS <- dfVisionLS %>%
  filter(VISIONDF_A== '4' )

VisDifficultyLS <- data.frame(                                            
   Status  = c(1, 2,3,4),       
   Life_Satis= c(mean(dfNoneVisLS$LSATIS4_A), mean(dfSomeVisLS$LSATIS4_A), mean(dfAlotVisLS$LSATIS4_A), mean(dfCannotVisLS$LSATIS4_A)))


ggplot(VisDifficultyLS, aes(x = factor(Status), y = Life_Satis)) + 
  geom_point() +
  scale_x_discrete(labels= c("None", "Some", "A Lot","Cannot See")) +
  xlab("Difficulty Seeing") +
  ylab("Average Life Satisfaction")

For both physical health and life satisfaction, there was a decrease for those who wear glasses/contacts. Since people can be split into positive and negative life satisfaction, we can try a binary response.

dfVisionAll <- dfFilteredVis %>%
  filter(WEARGLSS_A< '3' )

dfVisionAllLS <-
  dfVisionAll |>
    group_by(dfVisionAll$LSATIS4_A) |>
    mutate(LifeSatis_Status = ifelse(LSATIS4_A == 1,
                                0, 
                                ifelse(LSATIS4_A == 2,
                                0, 
                                 1))) |> 
  ungroup()

model <- glm(LifeSatis_Status ~ VISIONDF_A, data = dfVisionAllLS,
             family = binomial(link = 'logit'))

model$coefficients
(Intercept)  VISIONDF_A 
  -4.325735    1.004927 
sigmoid <- \(x) 1 / (1 + exp(-(-4.325735 + 1.004927 * x)))

dfVisionAllLS |>
  ggplot(mapping = aes(x = VISIONDF_A, y = LifeSatis_Status)) +
  geom_jitter(width = 0, height = 0.1, shape = 'O', size = 3) +
  geom_function(fun = sigmoid, color = 'blue', linewidth = 1) +
  labs(title = "Life Satisfaction Binary Response - Difficulty Seeing") +
  scale_y_continuous(breaks = c(0, 1)) +
  theme_minimal()

Between Positive Life Satisfaction(0) and Negative Life Satisfaction(1). We can see that there is a small increase in probability that as someone’s vision gets worst, the life satisfaction will become more negative.

Smoking

Smoking, Life Satisfaction, and General Health

  1. Current, every day smoker
  2. Current, some days smoker
  3. Former smoker
  4. Non-Smoker

dfFilteredLSSmoke <- dfFilteredLS %>%
  filter(SMKCIGST_A < 5)


dfFilteredLSSmoke |>
  ggplot(mapping = aes(x = SMKCIGST_A, y = LSATIS4_A)) +
  geom_point(size = 2) +
  geom_smooth(method = "lm", se = FALSE, color = 'darkblue') + 
  theme_minimal() + xlab("Smoking Status") + ylab("Life Satisfaction")

  

dfFilteredPHSmoke <- dfGHFilter %>%
  filter(SMKCIGST_A < 5)


dfFilteredPHSmoke |>
  ggplot(mapping = aes(x = SMKCIGST_A, y = PHSTAT_A)) +
  geom_point(size = 2) +
  geom_smooth(method = "lm", se = FALSE, color = 'darkblue') + 
  theme_minimal() + xlab("Smoking Status") + ylab("General Health")

NA
NA

There is a bit of a decrease in life satisfaction and general health the more someone smokes.

Smoking and Lung Cancer - Binary Response

We can split people up into someone who has smoked and someone who has not smoked, so we can try a binary response.


dfSmokeCan <- adult22 %>%
  filter(SMKCIGST_A < 5 & LUNGCAN_A <3)

dfSmokeCan <-
  dfSmokeCan |>
    group_by(dfSmokeCan$SMKCIGST_A) |>
    mutate(Smoke = ifelse(SMKCIGST_A == "4",
                                0,
                                 1)) |> 
  ungroup()

model <- glm(Smoke ~ LUNGCAN_A, data = dfSmokeCan,
             family = binomial(link = 'logit'))

model$coefficients
(Intercept)   LUNGCAN_A 
   3.264507   -1.755153 
sigmoid <- \(x) 1 / (1 + exp(-(3.264507 -1.755153 * x)))

dfSmokeCan |>
  ggplot(mapping = aes(x = LUNGCAN_A, y = Smoke)) +
  geom_jitter(width = 0, height = 0.1, shape = 'O', size = 3) +
  geom_function(fun = sigmoid, color = 'blue', linewidth = 1) +
  labs(title = "Binary Response: Lung Cancer and Smoking") +
  scale_y_continuous(breaks = c(0, 1)) +
  theme_minimal()

This graph shows that for those who have lung cancer (1), there is a higher probability of them being a smoker or former smoker(1).

100 Cigarettes Smoked and Lung Cancer - Binary Response

We can also split the smokers up into those who have smoked 100 cigarettes, and those who have not.

dfSmoke100 <- adult22 %>%
  filter(SMKEV_A < 3 & LUNGCAN_A <3)

dfSmoke100Can <-
  dfSmoke100 |>
    group_by(dfSmoke100$SMKEV_A) |>
    mutate(Smoke_100 = ifelse(SMKEV_A == "2",
                                0,
                                 1)) |> 
  ungroup()

model <- glm(Smoke_100 ~ LUNGCAN_A, data = dfSmoke100Can,
             family = binomial(link = 'logit'))

model$coefficients
(Intercept)   LUNGCAN_A 
   3.263094   -1.753740 
sigmoid <- \(x) 1 / (1 + exp(-(3.263094 -1.753740 * x)))

dfSmoke100Can |>
  ggplot(mapping = aes(x = LUNGCAN_A, y = Smoke_100)) +
  geom_jitter(width = 0, height = 0.1, shape = 'O', size = 3) +
  geom_function(fun = sigmoid, color = 'blue', linewidth = 1) +
  labs(title = "Binary Response: Lung Cancer and Smoked 100 Cigarettes") +
  scale_y_continuous(breaks = c(0, 1)) +
  theme_minimal()

This graph shows that for those who have lung cancer (1), there is a higher probability of them having smoked 100 cigarettes(1). The model is extremely close to the previous model. So, how many of the smokers with lung cancer smoked 100 cigarettes?

LungCan <- nrow(dfSmoke100[dfSmoke100$LUNGCAN_A == '1',])
Smoked100 <- nrow(dfSmoke100[dfSmoke100$SMKEV_A == '1',])
Smoked100Can <- nrow(dfSmoke100[dfSmoke100$SMKEV_A == '1' & dfSmoke100$LUNGCAN_A == '1', ])
NotSmoked100Can <- nrow(dfSmoke100[dfSmoke100$SMKEV_A == '2' & dfSmoke100$LUNGCAN_A == '1', ])

LungCan
[1] 116
Smoked100Can
[1] 95
95/116*100
[1] 81.89655
Smoked100
[1] 1511
NotSmoked100Can
[1] 21
95/1511*100
[1] 6.287227

Out of 116 people who stated they have lung cancer, 95 people, or 81.9% had smoked at least 100 cigarettes. But out of 1511 people who have smoked at least 100 cigarettes, only 21, or 6.3% have lung cancer.

Covid

Month of last Covid vaccine for those who got it for the first time.

fCovidVax <- adult22

dfCovidVaxFirst <- dfCovidVax %>%
  filter(CVDVAC1M_A != 'NA' & CVDVAC1Y_A !='NA')

dfCovidVaxLast <- dfCovidVax %>%
  filter(CVDVAC2M_A != 'NA' & CVDVAC2Y_A !='NA')

dfCovidVaxFirst$LastCovidVaxFirstTime <- with(dfCovidVaxFirst, sprintf("%d-01-%02d", CVDVAC1M_A, CVDVAC1Y_A))

dfCovidVaxLast$LastCovidVaxNotFirstTime <- with(dfCovidVaxLast, sprintf("%d-01-%02d", CVDVAC2M_A, CVDVAC2Y_A))

dfCovidVaxFirst_ <- dfCovidVaxFirst %>%
  filter(CVDDIAG_A < 3)

dfCovidVaxLast_ <- dfCovidVaxLast %>%
  filter(CVDDIAG_A < 3)

dfCovidVaxFirst1 <- dfCovidVaxFirst_ |>
  select(LastCovidVaxFirstTime, CVDDIAG_A)

dfCovidVaxLast1 <- dfCovidVaxLast_ |>
  select(LastCovidVaxNotFirstTime, CVDDIAG_A)

dftransf <- dfCovidVaxFirst1 |>
  select(LastCovidVaxFirstTime, CVDDIAG_A)

dftransf['date']<- as.Date(dftransf$LastCovidVaxFirstTime, format="%m-%d-%Y")

dftransfilt <- dftransf %>% filter(!is.na(date))

dfFirstV <- dftransfilt |>
  select(date, CVDDIAG_A)

dffirstVCount<- dfFirstV %>% group_by(date) %>%tally() 

dfMerge <- merge(dffirstVCount, dfFirstVSum, by = "date", all.x=TRUE, all.y=TRUE)

dfMerge['percent'] = dfMerge['Frequency']/dfMerge['n']

dfMerge <- dfMerge %>% filter(percent >.0000001)
dfMerge |>
  ggplot(mapping = aes(x = date, y = n)) +
  geom_line() +
  labs(title = "Date of First Covid Vaccine") +
  theme_hc() +
  scale_x_date(date_labels = "%Y (%b)") + ylab("Number of People")

The first vaccine came out in Dec 2021, which is around where the graph starts to increase. There was an increase in April and May 2021. In June 2021 when it started to decrease, there was a slowdown in people getting the vaccine and a record low number of cases.

Source: https://www.kff.org/coronavirus-covid-19/poll-finding/kff-covid-19-vaccine-monitor-june-2021/

On September 9 2021, Biden made an announcement requiring federal employees to be vaccinated by the end of January 2022, which is around the time of the largest peak between Sept 2021 and January 2022.

Source: https://www.whitehouse.gov/briefing-room/presidential-actions/2021/09/09/executive-order-on-requiring-coronavirus-disease-2019-vaccination-for-federal-employees/

Cancer

Cancer by Sex

Told has cancer.


dfFemale = adult22 %>%
  filter(SEX_A == 2)

dfFemaleCan = dfFemale %>%
  filter(CANEV_A == 1)

dfMale = adult22 %>%
  filter(SEX_A == 1)

dfMaleCan = dfMale %>%
  filter(CANEV_A == 1)

TotalF <- nrow(dfFemale)
TotalM <- nrow(dfMale)
PercentF <- (nrow(dfFemaleCan)/nrow(dfFemale))*100
PercentM <- (nrow(dfMaleCan)/nrow(dfMale))*100

PercentCan <- data.frame(                                            
   Sex  = c("Female", "Male"),       
   Percent= c((PercentF), (PercentM)))

ggplot(PercentCan, aes(x = Sex, y=Percent)) +
  geom_point()

Percent by Type of Cancer

dfCan <- adult22 %>%
  filter(NUMCAN_A < 7 & NUMCAN_A !=0)

dfBlad <- dfCan %>%
  filter(BLADDCAN_A == 1)

dfBlood <- dfCan %>%
  filter(BLOODCAN_A == 1)

dfBrain <- dfCan %>%
  filter(BRAINCAN_A == 1)

dfBone <- dfCan %>%
  filter(BONECAN_A == 1)

dfBreast <- dfCan %>%
  filter(BREASCAN_A == 1)

dfCervical <- dfCan %>%
  filter(CERVICAN_A == 1)

dfEsoph <- dfCan %>%
  filter(BLADDCAN_A == 1)

dfGall <- dfCan %>%
  filter(ESOPHCAN_A == 1)

dfLarynx <- dfCan %>%
  filter(LARYNCAN_A == 1)

dfLeuk <- dfCan %>%
  filter(LEUKECAN_A == 1)

dfLiver <- dfCan %>%
  filter(LIVERCAN_A == 1)

dfLung <- dfCan %>%
  filter(LUNGCAN_A == 1)

dfLymph <- dfCan %>%
  filter(LYMPHCAN_A == 1)

dfMela <- dfCan %>%
  filter(MELANCAN_A == 1)

dfMouth <- dfCan %>%
  filter(MOUTHCAN_A == 1)

dfOvary <- dfCan %>%
  filter(OVARYCAN_A == 1)

dfPanc <- dfCan %>%
  filter(PANCRCAN_A== 1)

dfProst <- dfCan %>%
  filter(PROSTCAN_A == 1)

dfSkinMela <- dfCan %>%
  filter(SKNMCAN_A == 1)

dfSkinNoMela <- dfCan %>%
  filter(SKNNMCAN_A == 1)

dfSkinUnknown <- dfCan %>%
  filter(SKNDKCAN_A == 1)

dfStomach <- dfCan %>%
  filter(STOMACAN_A == 1)

dfThroat <- dfCan %>%
  filter(THROACAN_A == 1)

dfThyroid <- dfCan %>%
  filter(THYROCAN_A == 1)

dfUterus <- dfCan %>%
  filter(UTERUCAN_A == 1)

dfHandNeck <- dfCan %>%
  filter(HDNCKCAN_A == 1)

dfColRec <- dfCan %>%
  filter(COLRCCAN_A == 1)

dfOther <- dfCan %>%
  filter(OTHERCANP_A == 1)
TBladF <- nrow(dfBlad %>%
  filter(SEX_A == 2))/TotalF*100
TBloodF <- nrow(dfBlood %>%
  filter(SEX_A == 2))/TotalF*100
TBoneF <- nrow(dfBone %>%
  filter(SEX_A == 2))/TotalF*100
TBrainF <- nrow(dfBrain %>%
  filter(SEX_A == 2))/TotalF*100
TBreastF <- nrow(dfBreast %>%
  filter(SEX_A == 2))/TotalF*100
TEsophF <- nrow(dfEsoph %>%
  filter(SEX_A == 2))/TotalF*100
TCervF <- nrow(dfCervical %>%
  filter(SEX_A == 2))/TotalF*100
TGallF <- nrow(dfGall %>%
  filter(SEX_A == 2))/TotalF*100
TLarynxF <- nrow(dfLarynx %>%
  filter(SEX_A == 2))/TotalF*100
TLeukF <- nrow(dfLeuk %>%
  filter(SEX_A == 2))/TotalF*100
TLungF <- nrow(dfLung %>%
  filter(SEX_A == 2))/TotalF*100
TLiverF <- nrow(dfLiver %>%
  filter(SEX_A == 2))/TotalF*100
TMelaF <- nrow(dfMela %>%
  filter(SEX_A == 2))/TotalF*100
TMouthF <- nrow(dfMouth %>%
  filter(SEX_A == 2))/TotalF*100
TOvaryF <- nrow(dfOvary %>%
  filter(SEX_A == 2))/TotalF*100
TPancF <- nrow(dfBlad %>%
  filter(SEX_A == 2))/TotalF*100
TSkinMelaF <- nrow(dfSkinMela %>%
  filter(SEX_A == 2))/TotalF*100
TSkinNoMelaF <- nrow(dfSkinNoMela %>%
  filter(SEX_A == 2))/TotalF*100
TSkinUnknownF <- nrow(dfSkinUnknown %>%
  filter(SEX_A == 2))/TotalF*100
TStomachF <- nrow(dfStomach %>%
  filter(SEX_A == 2))/TotalF*100
TThyroidF <- nrow(dfThyroid %>%
  filter(SEX_A == 2))/TotalF*100
TThroatF <- nrow(dfThroat %>%
  filter(SEX_A == 2))/TotalF*100
TUterusF <- nrow(dfUterus %>%
  filter(SEX_A == 2))/TotalF*100
THandNeckF <- nrow(dfHandNeck %>%
  filter(SEX_A == 2))/TotalF*100
TColRecF <- nrow(dfColRec %>%
  filter(SEX_A == 2))/TotalF*100
TOtherF <- nrow(dfOther %>%
  filter(SEX_A == 2))/TotalF*100
Cancer Among Females
CancerF <- data.frame(                                            
   Cancer  = c("Bladder", "Blood","Brain","Bone","Breast","Esophagus","Cervical","GallBladder","Larynx","Leukemia","Lung","Liver","Melanoma","Mouth","Ovarian","Pancreas","Skin Melanoma","Skin-Not Melanoma","Skin-Unknown","Stomach","Thyroid","Throat","Uterus","Hand and Neck","ColonRectal","Other"),       
   Percent= c((TBladF), (TBloodF),(TBrainF),(TBoneF),(TBreastF),(TEsophF),(TCervF),(TGallF),(TLarynxF),(TLeukF),(TLungF),(TLiverF),(TMelaF),(TMouthF),(TOvaryF),(TPancF),(TSkinMelaF),(TSkinNoMelaF),(TSkinUnknownF),(TStomachF),(TThyroidF),(TThroatF),(TUterusF),(THandNeckF),(TColRecF),(TOtherF)))

ggplot(CancerF, aes(x = Percent, y=Cancer)) +
  geom_point() + labs(title = "Cancer Among Females")

Cancer Among Males
TBladM <- nrow(dfBlad %>%
  filter(SEX_A == 1))/TotalM*100
TBloodM <- nrow(dfBlood %>%
  filter(SEX_A == 1))/TotalM*100
TBoneM <- nrow(dfBone %>%
  filter(SEX_A == 1))/TotalM*100
TBrainM <- nrow(dfBrain %>%
  filter(SEX_A == 1))/TotalM*100
TBreastM <- nrow(dfBreast %>%
  filter(SEX_A == 1))/TotalM*100
TEsophM <- nrow(dfEsoph %>%
  filter(SEX_A == 1))/TotalM*100
TGallM <- nrow(dfGall %>%
  filter(SEX_A == 1))/TotalM*100
TLarynxM <- nrow(dfLarynx %>%
  filter(SEX_A == 1))/TotalM*100
TLeukM <- nrow(dfLeuk %>%
  filter(SEX_A == 1))/TotalM*100
TLungM <- nrow(dfLung %>%
  filter(SEX_A == 1))/TotalM*100
TLiverM <- nrow(dfLiver %>%
  filter(SEX_A == 1))/TotalM*100
TMelaM <- nrow(dfMela %>%
  filter(SEX_A == 1))/TotalM*100
TMouthM <- nrow(dfMouth %>%
  filter(SEX_A == 1))/TotalM*100
TPancM <- nrow(dfPanc %>%
  filter(SEX_A == 1))/TotalM*100
TProstM <- nrow(dfProst %>%
  filter(SEX_A == 1))/TotalM*100
TSkinMelaM <- nrow(dfSkinMela %>%
  filter(SEX_A == 1))/TotalM*100
TSkinNoMelaM <- nrow(dfSkinNoMela %>%
  filter(SEX_A == 1))/TotalM*100
TSkinUnknownM <- nrow(dfSkinUnknown %>%
  filter(SEX_A == 1))/TotalM*100
TStomachM <- nrow(dfStomach %>%
  filter(SEX_A == 1))/TotalM*100
TThyroidM <- nrow(dfThyroid %>%
  filter(SEX_A == 1))/TotalM*100
TThroatM <- nrow(dfThroat %>%
  filter(SEX_A == 1))/TotalM*100
THandNeckM <- nrow(dfHandNeck %>%
  filter(SEX_A == 1))/TotalM*100
TColRecM <- nrow(dfColRec %>%
  filter(SEX_A == 1))/TotalM*100
TOtherM <- nrow(dfOther %>%
  filter(SEX_A == 1))/TotalM*100
CancerM <- data.frame(                                            
   Cancer  = c("Bladder", "Blood","Brain","Bone","Breast","Esophagus","GallBladder","Larynx","Leukemia","Lung","Liver","Melanoma","Mouth","Pancreas","Skin Melanoma","Skin-Not Melanoma","Skin-Unknown","Stomach","Thyroid","Throat","Hand and Neck","ColonRectal","Other"),       
   Percent= c((TBladM), (TBloodM),(TBrainM),(TBoneM),(TBreastM),(TEsophM),(TGallM),(TLarynxM),(TLeukM),(TLungM),(TLiverM),(TMelaM),(TMouthM),(TPancM),(TSkinMelaM),(TSkinNoMelaM),(TSkinUnknownM),(TStomachM),(TThyroidM),(TThroatM),(THandNeckM),(TColRecM),(TOtherM)))

ggplot(CancerM, aes(x = Percent, y=Cancer)) +
  geom_point() + labs(title = "Cancer Among Males")

Cancer Among Both
df_Canmerge <- merge(CancerF,CancerM,by="Cancer", all.x = TRUE) 

names(df_Canmerge)[names(df_Canmerge) == "Percent.x"] <- "Female"
names(df_Canmerge)[names(df_Canmerge) == "Percent.y"] <- "Male"
dfCancerMerged <- melt(df_Canmerge, id.vars="Cancer")
names(dfCancerMerged)[names(dfCancerMerged) == "value"] <- "Percent"
names(dfCancerMerged)[names(dfCancerMerged) == "variable"] <- "Sex"

ggplot(dfCancerMerged, aes(Percent,Cancer, col=Sex)) + 
  geom_point() + labs(title = "Cancer Among Males and Females")

Both males and females had a higher percentage for Skin-Not Melanoma cancer. The highest percentage for cancer was breast cancer in females.In many of the cancer categories, females had a slightly lower percentage.

Summary

While many of the ways to stay healthy are commonly known and advertised, it is interesting to look at the data to get a more specific idea. While many of these did not have much of a difference between categories, there is some information that can be gained that follow many of the health ideas we already know. These include

  1. It is less healthier, mentally and physically, for you to smoke. Smoking at least 100 cigarettes can make this worst and if you end up with lung cancer, it could be contributed to you smoking.
  2. Cancer is important to watch out for, and there are some specifically that are more common based on your sex and overall more common.
  3. While it is mostly unavoidable, those who have difficulty seeing can have worst life satisfaction, so if you can get it fixed by surgery or glasses/contacts, it will be helpful.
  4. It is better to for your mental and physical health to stay around a normal BMI. However, this could be argued based on if your weight comes from non-muscle fat or muscle fat.
  5. A higher education could potentially lead you to feel more satisfied and have better health.
---
title: "2021 National Health Interview Adult Survey"
output: html_notebook
editor_options: 
  markdown: 
    wrap: 72
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)

 adult22 <- read.csv("~/Documents/DataDive/adult22.csv")
 
library(tidyverse)
library(gapminder)
library(ggthemes)
library(dplyr)
library(ggplot2)
library(magrittr)
library(effsize)
library(pwrss)
library(broom)
library(lindia)
library(boot)
library(car)
library(xts)
library(tsibble)
library(lubridate)
library(reshape2)



 adult22_raw <- adult22
 
```

# Dataset

The dataset is data from the 2021 National Health Interview Adult
Survey. The survey contained questions related to household and family
composition, demographics about the survey taker, satisfaction with
life, health insurance, medication, immunization, preventive screenings,
and multiple health problems such as hypertension, cardiovascular
conditions, cancer, vision, hearing, mobility, and more.

This survey is important in following the health of American's based on
many different factors of their lives. Looking at previous surveys can
also help to see trends in Americans' health.

For the data used, responses of "refused","don't know" and "not
ascertained" were filtered out as these responses had a tendency to be
much smaller than the others and were not helpful for what was being
looked at.

# Columns

### General Health

```         
1. Excellent 
2. Very Good 
3. Good 
4. Fair 
5. Poor 
```

### Life Satisfaction

```         
1. Very Satisfied 
2. Satisfied 
3. Dissatisfied 
4. Very Dissatisfied
```

### Race (Single and Multiple Race Groups)

```         
1.  White only
2.  Black/African American only
3.  Asian only
4.  AIAN only (American Indians and Alaska Natives)
5.  AIAN and any other group
6.  Other
7.  Hispanic who did not identify with one of the other categories
```

### Education Level

```         
0. Never attended/Kindergarten only 
1. Grade 1-11 
2. 12th grade, nodiploma 
3. GED or equivalent 
4. High School Graduate 
5. Some college, nodegree 
6. Associate degree: occupational, technical, or vocational program 
7. Associate degree: academic program 
8. Bachelor's degree 
9. Master's degree 
10. Professional School or Doctoral degree
```

### BMI

```         
1. Underweight 
2. Normal 
3. Overweight 
4. Obese
```

# Vision

```         
Do you wear glasses or contact lenses?
1. Yes 
2. No
```

```         
Do you have difficulty seeing? 
1. No difficulty 
2. Some difficulty 
3. A lot of difficulty 
4. Cannot do at all
```

### Cancer

```         
Separate columns for each type of cancer.

Number of reported types of cancer. 
0. No cancer 
1-3. 1 to 3 
4. 4 or more
```

# Demographics

Looking at demographics can help to give a bit of an idea about what
sort of data we should look at.

### Sex

```{r}

dfSexFilter <- adult22 %>% 
  filter(SEX_A == 1 | SEX_A == 2)

dfSexFilter <-
  dfSexFilter |>
    group_by(dfSexFilter$SEX_A) |>
    mutate(Sex_Status = ifelse(SEX_A == 1,
                                 "Male", 
                                 "Female")) |>

  ungroup()
ggplot(dfSexFilter, aes(x = Sex_Status, fill=Sex_Status)) + xlab("Sex") +
  geom_bar()
```

### Age

The far right bar, 85, is for those who are 85+.

```{r}


dfAgeFilter <- adult22 %>% 
  filter(AGEP_A < 86)

ggplot(dfAgeFilter, aes(x = AGEP_A)) + xlab("Age") +
  geom_bar()
```

The largest group based on age were those that were 85 or older.

### Education Level

```{r}
dfEduFilter <- adult22 %>% 
  filter(EDUCP_A < 97)


dfEduFilter <-
  dfEduFilter |>
    group_by(dfEduFilter$EDUCP_A) |>
    mutate(Edu_Status = ifelse(EDUCP_A == 1,
                                "Grade 1-11", 
                               ifelse(EDUCP_A == 2,
                                 "12th Grade, no Diploma",
                                 ifelse(EDUCP_A == 3,
                                 "GED or Equivalent",
                                 ifelse(EDUCP_A == 4,
                                 "High School Graduate",
                                 ifelse(EDUCP_A == 5,
                                 "Some College, no Degree",
                                 ifelse(EDUCP_A == 6,
                                 "Associate degree: occupational, technical, or vocational program",
                                 ifelse(EDUCP_A == 7,
                                 "Associate degree: academic program",
                                 ifelse(EDUCP_A == 8,
                                 "Bachelor's degree",
                                 ifelse(EDUCP_A == 9,
                                 "Master's degree ",
                                 ifelse(EDUCP_A == 10,
                                 "Professional School or Doctoral degree",
                                 ifelse(EDUCP_A == 97,
                                 "Refused",
                                "Don't Know")))))))))))) |> 
ungroup()
  
  
dfEduFilter$Edu_Status <- factor(dfEduFilter$Edu_Status, levels = c("Grade 1-11", "12th Grade, no Diploma", "GED or Equivalent","High School Graduate", "Some College, no Degree", "Associate degree: occupational, technical, or vocational program", "Associate degree: academic program", "Bachelor's degree", "Master's degree ", "Professional School or Doctoral degree", "Refused", "Don't Know"))

ggplot(dfEduFilter, aes(x = EDUCP_A, fill=Edu_Status)) +
  geom_bar() + theme(axis.title.x=element_blank(),
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank()) + xlab("Education Level")

```

### Race

```{r}
dfRaceFilter <- adult22 %>% 
  filter(RACEALLP_A <= 8 )

dfRaceFilter2 <- dfRaceFilter  %>% 
  filter(RACEALLP_A != 7 )


dfRaceFilter2 <-
  dfRaceFilter2 |>
    group_by(dfRaceFilter2$RACEALLP_A) |>
    mutate(Race = ifelse(RACEALLP_A == 1,
                                "White", 
                               ifelse(RACEALLP_A == 2,
                                 "African American/Black",
                                 ifelse(RACEALLP_A == 3,
                                 "Asian",
                                 ifelse(RACEALLP_A == 4,
                                 "AIAN*",
                                 ifelse(RACEALLP_A == 5,
                                 "AIAN + Other Group",
                                 ifelse(RACEALLP_A == 6,
                                 "Other",
                                 ifelse(RACEALLP_A == 8,
                                 "Hispanic**",
                                "Don't Know")))))))) |> 
ungroup()
  
ggplot(dfRaceFilter2, aes(x = Race, fill=Race)) +
  geom_bar() + theme(axis.title.x=element_blank(),
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank())

```

There was a much larger amount of those who identified as white who took
the survey.

### General Physical Health

```{r}

dfGHFilter <- adult22 %>% 
  filter(PHSTAT_A < '7')

ggplot(dfGHFilter, aes(x = factor(PHSTAT_A))) + scale_x_discrete(labels=c("Excellent", "Very Good", "Good", "Fair", "Poor")) +
  geom_bar() + 
  xlab("General Physical Health")
```

Most people put their General Physical Health in the more positive side
(Excellent, Very good, Good) side of the scale.

### Life Satisfaction

```{r}

dfFilteredLS <- adult22 %>% 
  filter(LSATIS4_A < '5')


ggplot(dfFilteredLS, aes(x = factor(LSATIS4_A))) + scale_x_discrete(labels=c("Very Satisfied", "Satisfied", "Dissatisfied", "Very Dissatisfied")) +
  geom_bar() +
  xlab("Life Satisfaction")

```

Most people considered themselves to be positively satisfied with life.
The difference between the positive and negative side of the scale was
larger than for General Physical Health.

# BMI

BMI can often be contributed to health and certain levels can be
contributed to health issues. However, BMI is fundamentally flawed as it
does not take into consideration muscle fat. For example, a body builder
could be considered to have an "overweight" BMI, but would be much
healthier than someone who had mostly non-muscle fat.

### Histogram

```{r}
hist(dfFilteredBMI$BMICAT_A)
```

The distribution was relatively equal, except for the Underweight
category, so there was a potential for this category to cause problems.

### Bootstrap

This is to help us get a better idea of the average Life Satisfaction
and General Physical Health based on a person's BMI category, by
creating samples.

```{r}

dfFilteredLSBMI <- adult22 %>%
  filter(BMICAT_A < 5 & LSATIS4_A <5)

dfFilteredPHBMI <- adult22 %>%
  filter(BMICAT_A < 5 & PHSTAT_A <7)

boot_ci <- function (v, func = median, conf = 0.95, n_iter = 100) {
  boot_func <- \(x, i) func(x[i])
  
  b <- boot(v, boot_func, R = n_iter)
  b <- boot.ci(b, conf = conf, type = "perc")
  
  return(c("lower" = b$percent[4],
           "upper" = b$percent[5]))
}
```

#### Bootstrap: Life Satisfaction vs BMI

```{r}
df_ciLS <- dfFilteredLSBMI |>
  group_by(BMICAT_A) |>
  summarise(ci_lower = boot_ci(LSATIS4_A, mean)['lower'],
            mean_LS = mean(LSATIS4_A),
            ci_upper = boot_ci(LSATIS4_A, mean)['upper'])

```

```{r}

df_ciLS |>
  ggplot() +
  geom_errorbarh(mapping = aes(y = BMICAT_A, 
                               xmin=ci_lower, xmax=ci_upper,
                               color = '95% C.I.'), height = 0.5) +
  geom_point(mapping = aes(x = mean_LS, y = BMICAT_A,
                           color = 'Group Mean'),
             shape = '|',
             size = 5) +
  scale_color_manual(values=c('black', 'red')) +
  theme_minimal() +
  labs(title = "Life Satisfaction by BMI Category",
       x = "Life Satisfaction",
       y = "BMI Category",
       color = '')
```

Category 1, underweight, seems to have a much larger range than the
other 3, which like stated above, could be an issue because of how small
the distribution was.

#### Bootstrap: General Physical Health vs BMI

```{r}
df_ciPH <- dfFilteredPHBMI |>
  group_by(BMICAT_A) |>
  summarise(ci_lower = boot_ci(PHSTAT_A, mean)['lower'],
            mean_PH = mean(PHSTAT_A),
            ci_upper = boot_ci(PHSTAT_A, mean)['upper'])

```

```{r}

df_ciPH |>
  ggplot() +
  geom_errorbarh(mapping = aes(y = BMICAT_A, 
                               xmin=ci_lower, xmax=ci_upper,
                               color = '95% C.I.'), height = 0.5) +
  geom_point(mapping = aes(x = mean_PH, y = BMICAT_A,
                           color = 'Group Mean'),
             shape = '|',
             size = 5) +
  scale_color_manual(values=c('black', 'red')) +
  theme_minimal() +
  labs(title = "General Health by BMI Category",
       x = "General Health",
       y = "BMI Category",
       color = '')
```

There were way less people in the underweight BMI (1) than the other
categories, similar to the issue in Life Satisfaction, so that group
does not give as much information, or as accurate of information as the
others and was removed for the graphs below.

### General Health by BMI Status

```{r}

dfBMIGH <- adult22 %>%
  filter(PHSTAT_A < 7)

dfNormalBMI <- dfBMIGH %>%
  filter(BMICAT_A == 2 )

dfOverBMI <- dfBMIGH %>%
  filter(BMICAT_A == 3 )

dfObeseBMI <- dfBMIGH %>%
  filter(BMICAT_A == 4 )

BMIGH <- data.frame(                                            
   Status  = c(2, 3, 4),       
   Physical_Health= c( mean(dfNormalBMI$PHSTAT_A), mean(dfOverBMI$PHSTAT_A), mean(dfObeseBMI$PHSTAT_A)))

ggplot(BMIGH, aes(x = factor(Status), y = Physical_Health)) + 
  geom_point() +
  scale_x_discrete(labels= c( "Normal", "Overweight", "Obese")) +
  xlab("BMI Status") +
  ylab("Average General Physical Health")


```

### Life Satisfaction by BMI Status

```{r}

dfBMILS <- adult22 %>%
  filter(LSATIS4_A < 5)

dfNormalBMILS <- dfBMILS %>%
  filter(BMICAT_A == 2 )

dfOverBMILS <- dfBMILS %>%
  filter(BMICAT_A == 3 )

dfObeseBMILS <- dfBMILS %>%
  filter(BMICAT_A == 4 )

BMILS <- data.frame(                                            
   Status  = c(2, 3, 4),       
   Life_Satis= c( mean(dfNormalBMILS$LSATIS4_A), mean(dfOverBMILS$LSATIS4_A), mean(dfObeseBMILS$LSATIS4_A)))

ggplot(BMILS, aes(x = factor(Status), y = Life_Satis)) + 
  geom_point() +
  scale_x_discrete(labels= c("Normal", "Overweight", "Obese")) +
  xlab("BMI Status") +
  ylab("Average Life Satisfaction")


```

# Education Level

### Vs General Physical Health

```{r}

dfFilteredEdu <- dfGHFilter %>%
  filter(EDUCP_A < 97)

df_ciEdu <- dfFilteredEdu |>
  group_by(EDUCP_A) |>
  summarise(ci_lower = boot_ci(PHSTAT_A, mean)['lower'],
            mean_PH = mean(PHSTAT_A),
            ci_upper = boot_ci(PHSTAT_A, mean)['upper'])


df_ciEdu |>
  ggplot() +
  geom_errorbarh(mapping = aes(y = EDUCP_A, 
                               xmin=ci_lower, xmax=ci_upper,
                               color = '95% C.I.'), height = 0.5) +
  geom_point(mapping = aes(x = mean_PH, y = EDUCP_A,
                           color = 'Group Mean'),
             shape = '|',
             size = 5) +
  scale_color_manual(values=c('black', 'red')) +
  theme_minimal() +
  labs(title = "General Physical Health by Education",
       x = "Physical Health",
       y = "Education",
       color = '')
```

We can see that as a person's education level decreases, their average
physical health generally decreases, however it was still in the same
physical health category. There seems to be a few groups forming, such
as the bottom 3, the middle 4, and the top 3.

## Vs Life Satisfaction

```{r}

dfFilteredEduLS <- dfFilteredLS %>%
  filter(EDUCP_A < 97)

df_ciEduLS <- dfFilteredEduLS |>
  group_by(EDUCP_A) |>
  summarise(ci_lower = boot_ci(LSATIS4_A, mean)['lower'],
            mean_LS = mean(LSATIS4_A),
            ci_upper = boot_ci(LSATIS4_A, mean)['upper'])


df_ciEduLS |>
  ggplot() +
  geom_errorbarh(mapping = aes(y = EDUCP_A, 
                               xmin=ci_lower, xmax=ci_upper,
                               color = '95% C.I.'), height = 0.5) +
  geom_point(mapping = aes(x = mean_LS, y = EDUCP_A,
                           color = 'Group Mean'),
             shape = '|',
             size = 5) +
  scale_color_manual(values=c('black', 'red')) +
  theme_minimal() +
  labs(title = "Life Satisfaction by Education",
       x = "Life Satisfaction",
       y = "Education",
       color = '')
```

Similar to general health, there is a trend of as education level
decreases, life satisfaction decreases.

# Vision

```{r}
dfFilteredVis <- adult22 %>%
  filter(VISIONDF_A <4)

hist(dfFilteredVis$VISIONDF_A)

hist(dfFilteredVis$WEARGLSS_A)
```

### Vision vs General Physical Health

```{r}
dfGlasses <- adult22 %>%
  filter(PHSTAT_A < 5)

dfNoGlasses <- dfGlasses %>%
  filter(WEARGLSS_A == '2')

dfWearGlasses <- dfGlasses %>%
  filter(WEARGLSS_A== '1' )

WearGlasses <- data.frame(                                            
   Status  = c(1, 2),       
   Gen_Health= c(mean(dfWearGlasses$PHSTAT_A), mean(dfNoGlasses$PHSTAT_A)))


ggplot(WearGlasses, aes(x = factor(Status), y = Gen_Health)) + 
  geom_point() +
  scale_x_discrete(labels= c("Wear Glasses/Contacts", "No Glasses/Contacts")) +
  xlab("")+
  ylab("Average Physical Health")
```

```{r}
dfVision <- dfFilteredVis %>%
  filter(PHSTAT_A < 5)

dfNoneVis <- dfVision %>%
  filter(VISIONDF_A == '1')

dfSomeVis <- dfVision %>%
  filter(VISIONDF_A== '2' )

dfAlotVis <- dfVision %>%
  filter(VISIONDF_A== '3' )

dfCannotVis <- dfVision %>%
  filter(VISIONDF_A== '4' )

VisDifficulty <- data.frame(                                            
   Status  = c(1, 2,3,4),       
   Gen_Health= c(mean(dfNoneVis$PHSTAT_A), mean(dfSomeVis$PHSTAT_A), mean(dfAlotVis$PHSTAT_A), mean(dfCannotVis$PHSTAT_A)))


ggplot(VisDifficulty, aes(x = factor(Status), y = Gen_Health)) + 
  geom_point() +
  scale_x_discrete(labels= c("None", "Some", "A Lot","Cannot See")) +
  xlab("Difficulty Seeing") +
  ylab("Average Physical Health")
```

### Vision vs Life Satisfaction

```{r}
dfGlassesLS <- adult22 %>%
  filter(LSATIS4_A < 7)

dfNoGlassesLS <- dfGlassesLS %>%
  filter(WEARGLSS_A == '2')

dfWearGlassesLS <- dfGlassesLS %>%
  filter(WEARGLSS_A== '1' )

WearGlassesLS <- data.frame(                                            
   Status  = c(1, 2),       
   Life_Satis= c(mean(dfWearGlassesLS$LSATIS4_A), mean(dfNoGlassesLS$LSATIS4_A)))


ggplot(WearGlassesLS, aes(x = factor(Status), y = Life_Satis)) + 
  geom_point() +
  scale_x_discrete(labels= c("Wear Glasses/Contacts", "No Glasses/Contacts")) +
  xlab("")+
  ylab("Average Life Satisfaction")
```

### Difficulty Seeing and Life Satisfaction

```{r}
dfVisionLS <- dfFilteredVis %>%
  filter(LSATIS4_A < 7)

dfNoneVisLS <- dfVisionLS %>%
  filter(VISIONDF_A == '1')

dfSomeVisLS <- dfVisionLS %>%
  filter(VISIONDF_A== '2' )

dfAlotVisLS <- dfVisionLS %>%
  filter(VISIONDF_A== '3' )

dfCannotVisLS <- dfVisionLS %>%
  filter(VISIONDF_A== '4' )

VisDifficultyLS <- data.frame(                                            
   Status  = c(1, 2,3,4),       
   Life_Satis= c(mean(dfNoneVisLS$LSATIS4_A), mean(dfSomeVisLS$LSATIS4_A), mean(dfAlotVisLS$LSATIS4_A), mean(dfCannotVisLS$LSATIS4_A)))


ggplot(VisDifficultyLS, aes(x = factor(Status), y = Life_Satis)) + 
  geom_point() +
  scale_x_discrete(labels= c("None", "Some", "A Lot","Cannot See")) +
  xlab("Difficulty Seeing") +
  ylab("Average Life Satisfaction")
```

For both physical health and life satisfaction, there was a decrease for
those who wear glasses/contacts. Since people can be split into positive
and negative life satisfaction, we can try a binary response.

```{r}
dfVisionAll <- dfFilteredVis %>%
  filter(WEARGLSS_A< '3' )

dfVisionAllLS <-
  dfVisionAll |>
    group_by(dfVisionAll$LSATIS4_A) |>
    mutate(LifeSatis_Status = ifelse(LSATIS4_A == 1,
                                0, 
                                ifelse(LSATIS4_A == 2,
                                0, 
                                 1))) |> 
  ungroup()

model <- glm(LifeSatis_Status ~ VISIONDF_A, data = dfVisionAllLS,
             family = binomial(link = 'logit'))

model$coefficients



```

```{r}
sigmoid <- \(x) 1 / (1 + exp(-(-4.325735 + 1.004927 * x)))

dfVisionAllLS |>
  ggplot(mapping = aes(x = VISIONDF_A, y = LifeSatis_Status)) +
  geom_jitter(width = 0, height = 0.1, shape = 'O', size = 3) +
  geom_function(fun = sigmoid, color = 'blue', linewidth = 1) +
  labs(title = "Life Satisfaction Binary Response - Difficulty Seeing") +
  scale_y_continuous(breaks = c(0, 1)) +
  theme_minimal()
```

Between Positive Life Satisfaction(0) and Negative Life Satisfaction(1).
We can see that there is a small increase in probability that as
someone's vision gets worst, the life satisfaction will become more
negative.

# Smoking

### Smoking, Life Satisfaction, and General Health

1.  Current, every day smoker
2.  Current, some days smoker
3.  Former smoker
4.  Non-Smoker

```{r}

dfFilteredLSSmoke <- dfFilteredLS %>%
  filter(SMKCIGST_A < 5)


dfFilteredLSSmoke |>
  ggplot(mapping = aes(x = SMKCIGST_A, y = LSATIS4_A)) +
  geom_point(size = 2) +
  geom_smooth(method = "lm", se = FALSE, color = 'darkblue') + 
  theme_minimal() + xlab("Smoking Status") + ylab("Life Satisfaction")
  

dfFilteredPHSmoke <- dfGHFilter %>%
  filter(SMKCIGST_A < 5)


dfFilteredPHSmoke |>
  ggplot(mapping = aes(x = SMKCIGST_A, y = PHSTAT_A)) +
  geom_point(size = 2) +
  geom_smooth(method = "lm", se = FALSE, color = 'darkblue') + 
  theme_minimal() + xlab("Smoking Status") + ylab("General Health")


```

There is a bit of a decrease in life satisfaction and general health the
more someone smokes.

### Smoking and Lung Cancer - Binary Response

We can split people up into someone who has smoked and someone who has
not smoked, so we can try a binary response.

```{r}

dfSmokeCan <- adult22 %>%
  filter(SMKCIGST_A < 5 & LUNGCAN_A <3)

dfSmokeCan <-
  dfSmokeCan |>
    group_by(dfSmokeCan$SMKCIGST_A) |>
    mutate(Smoke = ifelse(SMKCIGST_A == "4",
                                0,
                                 1)) |> 
  ungroup()

model <- glm(Smoke ~ LUNGCAN_A, data = dfSmokeCan,
             family = binomial(link = 'logit'))

model$coefficients

```

```{r}
sigmoid <- \(x) 1 / (1 + exp(-(3.264507 -1.755153 * x)))

dfSmokeCan |>
  ggplot(mapping = aes(x = LUNGCAN_A, y = Smoke)) +
  geom_jitter(width = 0, height = 0.1, shape = 'O', size = 3) +
  geom_function(fun = sigmoid, color = 'blue', linewidth = 1) +
  labs(title = "Binary Response: Lung Cancer and Smoking") +
  scale_y_continuous(breaks = c(0, 1)) +
  theme_minimal()
```

This graph shows that for those who have lung cancer (1), there is a
higher probability of them being a smoker or former smoker(1).

### 100 Cigarettes Smoked and Lung Cancer - Binary Response

We can also split the smokers up into those who have smoked 100
cigarettes, and those who have not.

```{r}
dfSmoke100 <- adult22 %>%
  filter(SMKEV_A < 3 & LUNGCAN_A <3)

dfSmoke100Can <-
  dfSmoke100 |>
    group_by(dfSmoke100$SMKEV_A) |>
    mutate(Smoke_100 = ifelse(SMKEV_A == "2",
                                0,
                                 1)) |> 
  ungroup()

model <- glm(Smoke_100 ~ LUNGCAN_A, data = dfSmoke100Can,
             family = binomial(link = 'logit'))

model$coefficients

```

```{r}
sigmoid <- \(x) 1 / (1 + exp(-(3.263094 -1.753740 * x)))

dfSmoke100Can |>
  ggplot(mapping = aes(x = LUNGCAN_A, y = Smoke_100)) +
  geom_jitter(width = 0, height = 0.1, shape = 'O', size = 3) +
  geom_function(fun = sigmoid, color = 'blue', linewidth = 1) +
  labs(title = "Binary Response: Lung Cancer and Smoked 100 Cigarettes") +
  scale_y_continuous(breaks = c(0, 1)) +
  theme_minimal()

```

This graph shows that for those who have lung cancer (1), there is a
higher probability of them having smoked 100 cigarettes(1). The model is
extremely close to the previous model. So, how many of the smokers with
lung cancer smoked 100 cigarettes?

```{r}
LungCan <- nrow(dfSmoke100[dfSmoke100$LUNGCAN_A == '1',])
Smoked100 <- nrow(dfSmoke100[dfSmoke100$SMKEV_A == '1',])
Smoked100Can <- nrow(dfSmoke100[dfSmoke100$SMKEV_A == '1' & dfSmoke100$LUNGCAN_A == '1', ])
NotSmoked100Can <- nrow(dfSmoke100[dfSmoke100$SMKEV_A == '2' & dfSmoke100$LUNGCAN_A == '1', ])

LungCan
Smoked100Can
95/116*100
Smoked100
NotSmoked100Can
95/1511*100
```

Out of 116 people who stated they have lung cancer, 95 people, or 81.9%
had smoked at least 100 cigarettes. But out of 1511 people who have
smoked at least 100 cigarettes, only 21, or 6.3% have lung cancer.

# Covid

Month of last Covid vaccine for those who got it for the first time.

```{r}
fCovidVax <- adult22

dfCovidVaxFirst <- dfCovidVax %>%
  filter(CVDVAC1M_A != 'NA' & CVDVAC1Y_A !='NA')

dfCovidVaxLast <- dfCovidVax %>%
  filter(CVDVAC2M_A != 'NA' & CVDVAC2Y_A !='NA')

dfCovidVaxFirst$LastCovidVaxFirstTime <- with(dfCovidVaxFirst, sprintf("%d-01-%02d", CVDVAC1M_A, CVDVAC1Y_A))

dfCovidVaxLast$LastCovidVaxNotFirstTime <- with(dfCovidVaxLast, sprintf("%d-01-%02d", CVDVAC2M_A, CVDVAC2Y_A))

dfCovidVaxFirst_ <- dfCovidVaxFirst %>%
  filter(CVDDIAG_A < 3)

dfCovidVaxLast_ <- dfCovidVaxLast %>%
  filter(CVDDIAG_A < 3)

dfCovidVaxFirst1 <- dfCovidVaxFirst_ |>
  select(LastCovidVaxFirstTime, CVDDIAG_A)

dfCovidVaxLast1 <- dfCovidVaxLast_ |>
  select(LastCovidVaxNotFirstTime, CVDDIAG_A)

dftransf <- dfCovidVaxFirst1 |>
  select(LastCovidVaxFirstTime, CVDDIAG_A)

dftransf['date']<- as.Date(dftransf$LastCovidVaxFirstTime, format="%m-%d-%Y")

dftransfilt <- dftransf %>% filter(!is.na(date))

dfFirstV <- dftransfilt |>
  select(date, CVDDIAG_A)

dffirstVCount<- dfFirstV %>% group_by(date) %>%tally() 

dfMerge <- merge(dffirstVCount, dfFirstVSum, by = "date", all.x=TRUE, all.y=TRUE)

dfMerge['percent'] = dfMerge['Frequency']/dfMerge['n']

dfMerge <- dfMerge %>% filter(percent >.0000001)

```

```{r}
dfMerge |>
  ggplot(mapping = aes(x = date, y = n)) +
  geom_line() +
  labs(title = "Date of First Covid Vaccine") +
  theme_hc() +
  scale_x_date(date_labels = "%Y (%b)") + ylab("Number of People")

```

The first vaccine came out in Dec 2021, which is around where the graph
starts to increase. There was an increase in April and May 2021. In June
2021 when it started to decrease, there was a slowdown in people getting
the vaccine and a record low number of cases.

Source:
<https://www.kff.org/coronavirus-covid-19/poll-finding/kff-covid-19-vaccine-monitor-june-2021/>

On September 9 2021, Biden made an announcement requiring federal
employees to be vaccinated by the end of January 2022, which is around
the time of the largest peak between Sept 2021 and January 2022.

Source:
<https://www.whitehouse.gov/briefing-room/presidential-actions/2021/09/09/executive-order-on-requiring-coronavirus-disease-2019-vaccination-for-federal-employees/>

# Cancer

### Cancer by Sex

#### Told has cancer.

```{r}

dfFemale = adult22 %>%
  filter(SEX_A == 2)

dfFemaleCan = dfFemale %>%
  filter(CANEV_A == 1)

dfMale = adult22 %>%
  filter(SEX_A == 1)

dfMaleCan = dfMale %>%
  filter(CANEV_A == 1)

TotalF <- nrow(dfFemale)
TotalM <- nrow(dfMale)
PercentF <- (nrow(dfFemaleCan)/nrow(dfFemale))*100
PercentM <- (nrow(dfMaleCan)/nrow(dfMale))*100

PercentCan <- data.frame(                                            
   Sex  = c("Female", "Male"),       
   Percent= c((PercentF), (PercentM)))

ggplot(PercentCan, aes(x = Sex, y=Percent)) +
  geom_point()
```

#### Percent by Type of Cancer

```{r}
dfCan <- adult22 %>%
  filter(NUMCAN_A < 7 & NUMCAN_A !=0)

dfBlad <- dfCan %>%
  filter(BLADDCAN_A == 1)

dfBlood <- dfCan %>%
  filter(BLOODCAN_A == 1)

dfBrain <- dfCan %>%
  filter(BRAINCAN_A == 1)

dfBone <- dfCan %>%
  filter(BONECAN_A == 1)

dfBreast <- dfCan %>%
  filter(BREASCAN_A == 1)

dfCervical <- dfCan %>%
  filter(CERVICAN_A == 1)

dfEsoph <- dfCan %>%
  filter(BLADDCAN_A == 1)

dfGall <- dfCan %>%
  filter(ESOPHCAN_A == 1)

dfLarynx <- dfCan %>%
  filter(LARYNCAN_A == 1)

dfLeuk <- dfCan %>%
  filter(LEUKECAN_A == 1)

dfLiver <- dfCan %>%
  filter(LIVERCAN_A == 1)

dfLung <- dfCan %>%
  filter(LUNGCAN_A == 1)

dfLymph <- dfCan %>%
  filter(LYMPHCAN_A == 1)

dfMela <- dfCan %>%
  filter(MELANCAN_A == 1)

dfMouth <- dfCan %>%
  filter(MOUTHCAN_A == 1)

dfOvary <- dfCan %>%
  filter(OVARYCAN_A == 1)

dfPanc <- dfCan %>%
  filter(PANCRCAN_A== 1)

dfProst <- dfCan %>%
  filter(PROSTCAN_A == 1)

dfSkinMela <- dfCan %>%
  filter(SKNMCAN_A == 1)

dfSkinNoMela <- dfCan %>%
  filter(SKNNMCAN_A == 1)

dfSkinUnknown <- dfCan %>%
  filter(SKNDKCAN_A == 1)

dfStomach <- dfCan %>%
  filter(STOMACAN_A == 1)

dfThroat <- dfCan %>%
  filter(THROACAN_A == 1)

dfThyroid <- dfCan %>%
  filter(THYROCAN_A == 1)

dfUterus <- dfCan %>%
  filter(UTERUCAN_A == 1)

dfHandNeck <- dfCan %>%
  filter(HDNCKCAN_A == 1)

dfColRec <- dfCan %>%
  filter(COLRCCAN_A == 1)

dfOther <- dfCan %>%
  filter(OTHERCANP_A == 1)

```

```{r}
TBladF <- nrow(dfBlad %>%
  filter(SEX_A == 2))/TotalF*100
TBloodF <- nrow(dfBlood %>%
  filter(SEX_A == 2))/TotalF*100
TBoneF <- nrow(dfBone %>%
  filter(SEX_A == 2))/TotalF*100
TBrainF <- nrow(dfBrain %>%
  filter(SEX_A == 2))/TotalF*100
TBreastF <- nrow(dfBreast %>%
  filter(SEX_A == 2))/TotalF*100
TEsophF <- nrow(dfEsoph %>%
  filter(SEX_A == 2))/TotalF*100
TCervF <- nrow(dfCervical %>%
  filter(SEX_A == 2))/TotalF*100
TGallF <- nrow(dfGall %>%
  filter(SEX_A == 2))/TotalF*100
TLarynxF <- nrow(dfLarynx %>%
  filter(SEX_A == 2))/TotalF*100
TLeukF <- nrow(dfLeuk %>%
  filter(SEX_A == 2))/TotalF*100
TLungF <- nrow(dfLung %>%
  filter(SEX_A == 2))/TotalF*100
TLiverF <- nrow(dfLiver %>%
  filter(SEX_A == 2))/TotalF*100
TMelaF <- nrow(dfMela %>%
  filter(SEX_A == 2))/TotalF*100
TMouthF <- nrow(dfMouth %>%
  filter(SEX_A == 2))/TotalF*100
TOvaryF <- nrow(dfOvary %>%
  filter(SEX_A == 2))/TotalF*100
TPancF <- nrow(dfBlad %>%
  filter(SEX_A == 2))/TotalF*100
TSkinMelaF <- nrow(dfSkinMela %>%
  filter(SEX_A == 2))/TotalF*100
TSkinNoMelaF <- nrow(dfSkinNoMela %>%
  filter(SEX_A == 2))/TotalF*100
TSkinUnknownF <- nrow(dfSkinUnknown %>%
  filter(SEX_A == 2))/TotalF*100
TStomachF <- nrow(dfStomach %>%
  filter(SEX_A == 2))/TotalF*100
TThyroidF <- nrow(dfThyroid %>%
  filter(SEX_A == 2))/TotalF*100
TThroatF <- nrow(dfThroat %>%
  filter(SEX_A == 2))/TotalF*100
TUterusF <- nrow(dfUterus %>%
  filter(SEX_A == 2))/TotalF*100
THandNeckF <- nrow(dfHandNeck %>%
  filter(SEX_A == 2))/TotalF*100
TColRecF <- nrow(dfColRec %>%
  filter(SEX_A == 2))/TotalF*100
TOtherF <- nrow(dfOther %>%
  filter(SEX_A == 2))/TotalF*100
```

##### Cancer Among Females

```{r}
CancerF <- data.frame(                                            
   Cancer  = c("Bladder", "Blood","Brain","Bone","Breast","Esophagus","Cervical","GallBladder","Larynx","Leukemia","Lung","Liver","Melanoma","Mouth","Ovarian","Pancreas","Skin Melanoma","Skin-Not Melanoma","Skin-Unknown","Stomach","Thyroid","Throat","Uterus","Hand and Neck","ColonRectal","Other"),       
   Percent= c((TBladF), (TBloodF),(TBrainF),(TBoneF),(TBreastF),(TEsophF),(TCervF),(TGallF),(TLarynxF),(TLeukF),(TLungF),(TLiverF),(TMelaF),(TMouthF),(TOvaryF),(TPancF),(TSkinMelaF),(TSkinNoMelaF),(TSkinUnknownF),(TStomachF),(TThyroidF),(TThroatF),(TUterusF),(THandNeckF),(TColRecF),(TOtherF)))

ggplot(CancerF, aes(x = Percent, y=Cancer)) +
  geom_point() + labs(title = "Cancer Among Females")
```

##### Cancer Among Males

```{r}
TBladM <- nrow(dfBlad %>%
  filter(SEX_A == 1))/TotalM*100
TBloodM <- nrow(dfBlood %>%
  filter(SEX_A == 1))/TotalM*100
TBoneM <- nrow(dfBone %>%
  filter(SEX_A == 1))/TotalM*100
TBrainM <- nrow(dfBrain %>%
  filter(SEX_A == 1))/TotalM*100
TBreastM <- nrow(dfBreast %>%
  filter(SEX_A == 1))/TotalM*100
TEsophM <- nrow(dfEsoph %>%
  filter(SEX_A == 1))/TotalM*100
TGallM <- nrow(dfGall %>%
  filter(SEX_A == 1))/TotalM*100
TLarynxM <- nrow(dfLarynx %>%
  filter(SEX_A == 1))/TotalM*100
TLeukM <- nrow(dfLeuk %>%
  filter(SEX_A == 1))/TotalM*100
TLungM <- nrow(dfLung %>%
  filter(SEX_A == 1))/TotalM*100
TLiverM <- nrow(dfLiver %>%
  filter(SEX_A == 1))/TotalM*100
TMelaM <- nrow(dfMela %>%
  filter(SEX_A == 1))/TotalM*100
TMouthM <- nrow(dfMouth %>%
  filter(SEX_A == 1))/TotalM*100
TPancM <- nrow(dfPanc %>%
  filter(SEX_A == 1))/TotalM*100
TProstM <- nrow(dfProst %>%
  filter(SEX_A == 1))/TotalM*100
TSkinMelaM <- nrow(dfSkinMela %>%
  filter(SEX_A == 1))/TotalM*100
TSkinNoMelaM <- nrow(dfSkinNoMela %>%
  filter(SEX_A == 1))/TotalM*100
TSkinUnknownM <- nrow(dfSkinUnknown %>%
  filter(SEX_A == 1))/TotalM*100
TStomachM <- nrow(dfStomach %>%
  filter(SEX_A == 1))/TotalM*100
TThyroidM <- nrow(dfThyroid %>%
  filter(SEX_A == 1))/TotalM*100
TThroatM <- nrow(dfThroat %>%
  filter(SEX_A == 1))/TotalM*100
THandNeckM <- nrow(dfHandNeck %>%
  filter(SEX_A == 1))/TotalM*100
TColRecM <- nrow(dfColRec %>%
  filter(SEX_A == 1))/TotalM*100
TOtherM <- nrow(dfOther %>%
  filter(SEX_A == 1))/TotalM*100
```

```{r}
CancerM <- data.frame(                                            
   Cancer  = c("Bladder", "Blood","Brain","Bone","Breast","Esophagus","GallBladder","Larynx","Leukemia","Lung","Liver","Melanoma","Mouth","Pancreas","Skin Melanoma","Skin-Not Melanoma","Skin-Unknown","Stomach","Thyroid","Throat","Hand and Neck","ColonRectal","Other"),       
   Percent= c((TBladM), (TBloodM),(TBrainM),(TBoneM),(TBreastM),(TEsophM),(TGallM),(TLarynxM),(TLeukM),(TLungM),(TLiverM),(TMelaM),(TMouthM),(TPancM),(TSkinMelaM),(TSkinNoMelaM),(TSkinUnknownM),(TStomachM),(TThyroidM),(TThroatM),(THandNeckM),(TColRecM),(TOtherM)))

ggplot(CancerM, aes(x = Percent, y=Cancer)) +
  geom_point() + labs(title = "Cancer Among Males")
```

##### Cancer Among Both

```{r}
df_Canmerge <- merge(CancerF,CancerM,by="Cancer", all.x = TRUE) 

names(df_Canmerge)[names(df_Canmerge) == "Percent.x"] <- "Female"
names(df_Canmerge)[names(df_Canmerge) == "Percent.y"] <- "Male"

```

```{r}
dfCancerMerged <- melt(df_Canmerge, id.vars="Cancer")
names(dfCancerMerged)[names(dfCancerMerged) == "value"] <- "Percent"
names(dfCancerMerged)[names(dfCancerMerged) == "variable"] <- "Sex"

ggplot(dfCancerMerged, aes(Percent,Cancer, col=Sex)) + 
  geom_point() + labs(title = "Cancer Among Males and Females")

```

Both males and females had a higher percentage for Skin-Not Melanoma
cancer. The highest percentage for cancer was breast cancer in
females.In many of the cancer categories, females had a slightly lower
percentage.

# Summary

While many of the ways to stay healthy are commonly known and
advertised, it is interesting to look at the data to get a more specific
idea. While many of these did not have much of a difference between
categories, there is some information that can be gained that follow
many of the health ideas we already know. These include

1.  It is less healthier, mentally and physically, for you to smoke.
    Smoking at least 100 cigarettes can make this worst and if you end
    up with lung cancer, it could be contributed to you smoking.
2.  Cancer is important to watch out for, and there are some
    specifically that are more common based on your sex and overall more
    common.
3.  While it is mostly unavoidable, those who have difficulty seeing can
    have worst life satisfaction, so if you can get it fixed by surgery
    or glasses/contacts, it will be helpful.
4.  It is better to for your mental and physical health to stay around a
    normal BMI. However, this could be argued based on if your weight
    comes from non-muscle fat or muscle fat.
5.  A higher education could potentially lead you to feel more satisfied
    and have better health.
