Dataset
The dataset is data from the 2021 National Health Interview Adult
Survey. The survey contained questions related to household and family
composition, demographics about the survey taker, satisfaction with
life, health insurance, medication, immunization, preventive screenings,
and multiple health problems such as hypertension, cardiovascular
conditions, cancer, vision, hearing, mobility, and more.
This survey is important in following the health of American’s based
on many different factors of their lives. Looking at previous surveys
can also help to see trends in Americans’ health.
For the data used, responses of “refused”,“don’t know” and “not
ascertained” were filtered out as these responses had a tendency to be
much smaller than the others and were not helpful for what was being
looked at.
Columns
General Health
1. Excellent
2. Very Good
3. Good
4. Fair
5. Poor
Life Satisfaction
1. Very Satisfied
2. Satisfied
3. Dissatisfied
4. Very Dissatisfied
Race (Single and Multiple Race Groups)
1. White only
2. Black/African American only
3. Asian only
4. AIAN only (American Indians and Alaska Natives)
5. AIAN and any other group
6. Other
7. Hispanic who did not identify with one of the other categories
Education Level
0. Never attended/Kindergarten only
1. Grade 1-11
2. 12th grade, nodiploma
3. GED or equivalent
4. High School Graduate
5. Some college, nodegree
6. Associate degree: occupational, technical, or vocational program
7. Associate degree: academic program
8. Bachelor's degree
9. Master's degree
10. Professional School or Doctoral degree
BMI
1. Underweight
2. Normal
3. Overweight
4. Obese
Vision
Do you wear glasses or contact lenses?
1. Yes
2. No
Do you have difficulty seeing?
1. No difficulty
2. Some difficulty
3. A lot of difficulty
4. Cannot do at all
Cancer
Separate columns for each type of cancer.
Number of reported types of cancer.
0. No cancer
1-3. 1 to 3
4. 4 or more
Demographics
Looking at demographics can help to give a bit of an idea about what
sort of data we should look at.
Sex
dfSexFilter <- adult22 %>%
filter(SEX_A == 1 | SEX_A == 2)
dfSexFilter <-
dfSexFilter |>
group_by(dfSexFilter$SEX_A) |>
mutate(Sex_Status = ifelse(SEX_A == 1,
"Male",
"Female")) |>
ungroup()
ggplot(dfSexFilter, aes(x = Sex_Status, fill=Sex_Status)) + xlab("Sex") +
geom_bar()

Age
The far right bar, 85, is for those who are 85+.
dfAgeFilter <- adult22 %>%
filter(AGEP_A < 86)
ggplot(dfAgeFilter, aes(x = AGEP_A)) + xlab("Age") +
geom_bar()

The largest group based on age were those that were 85 or older.
Education Level
dfEduFilter <- adult22 %>%
filter(EDUCP_A < 97)
dfEduFilter <-
dfEduFilter |>
group_by(dfEduFilter$EDUCP_A) |>
mutate(Edu_Status = ifelse(EDUCP_A == 1,
"Grade 1-11",
ifelse(EDUCP_A == 2,
"12th Grade, no Diploma",
ifelse(EDUCP_A == 3,
"GED or Equivalent",
ifelse(EDUCP_A == 4,
"High School Graduate",
ifelse(EDUCP_A == 5,
"Some College, no Degree",
ifelse(EDUCP_A == 6,
"Associate degree: occupational, technical, or vocational program",
ifelse(EDUCP_A == 7,
"Associate degree: academic program",
ifelse(EDUCP_A == 8,
"Bachelor's degree",
ifelse(EDUCP_A == 9,
"Master's degree ",
ifelse(EDUCP_A == 10,
"Professional School or Doctoral degree",
ifelse(EDUCP_A == 97,
"Refused",
"Don't Know")))))))))))) |>
ungroup()
dfEduFilter$Edu_Status <- factor(dfEduFilter$Edu_Status, levels = c("Grade 1-11", "12th Grade, no Diploma", "GED or Equivalent","High School Graduate", "Some College, no Degree", "Associate degree: occupational, technical, or vocational program", "Associate degree: academic program", "Bachelor's degree", "Master's degree ", "Professional School or Doctoral degree", "Refused", "Don't Know"))
ggplot(dfEduFilter, aes(x = EDUCP_A, fill=Edu_Status)) +
geom_bar() + theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank()) + xlab("Education Level")

Race
dfRaceFilter <- adult22 %>%
filter(RACEALLP_A <= 8 )
dfRaceFilter2 <- dfRaceFilter %>%
filter(RACEALLP_A != 7 )
dfRaceFilter2 <-
dfRaceFilter2 |>
group_by(dfRaceFilter2$RACEALLP_A) |>
mutate(Race = ifelse(RACEALLP_A == 1,
"White",
ifelse(RACEALLP_A == 2,
"African American/Black",
ifelse(RACEALLP_A == 3,
"Asian",
ifelse(RACEALLP_A == 4,
"AIAN*",
ifelse(RACEALLP_A == 5,
"AIAN + Other Group",
ifelse(RACEALLP_A == 6,
"Other",
ifelse(RACEALLP_A == 8,
"Hispanic**",
"Don't Know")))))))) |>
ungroup()
ggplot(dfRaceFilter2, aes(x = Race, fill=Race)) +
geom_bar() + theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank())

There was a much larger amount of those who identified as white who
took the survey.
General Physical Health
dfGHFilter <- adult22 %>%
filter(PHSTAT_A < '7')
ggplot(dfGHFilter, aes(x = factor(PHSTAT_A))) + scale_x_discrete(labels=c("Excellent", "Very Good", "Good", "Fair", "Poor")) +
geom_bar() +
xlab("General Physical Health")

NA
NA
NA
Most people put their General Physical Health in the more positive
side (Excellent, Very good, Good) side of the scale.
Life Satisfaction
dfFilteredLS <- adult22 %>%
filter(LSATIS4_A < '5')
ggplot(dfFilteredLS, aes(x = factor(LSATIS4_A))) + scale_x_discrete(labels=c("Very Satisfied", "Satisfied", "Dissatisfied", "Very Dissatisfied")) +
geom_bar() +
xlab("Life Satisfaction")

NA
NA
NA
NA
Most people considered themselves to be positively satisfied with
life. The difference between the positive and negative side of the scale
was larger than for General Physical Health.
BMI
BMI can often be contributed to health and certain levels can be
contributed to health issues. However, BMI is fundamentally flawed as it
does not take into consideration muscle fat. For example, a body builder
could be considered to have an “overweight” BMI, but would be much
healthier than someone who had mostly non-muscle fat.
Histogram
hist(dfFilteredBMI$BMICAT_A)

The distribution was relatively equal, except for the Underweight
category, so there was a potential for this category to cause
problems.
Bootstrap
This is to help us get a better idea of the average Life Satisfaction
and General Physical Health based on a person’s BMI category, by
creating samples.
dfFilteredLSBMI <- adult22 %>%
filter(BMICAT_A < 5 & LSATIS4_A <5)
dfFilteredPHBMI <- adult22 %>%
filter(BMICAT_A < 5 & PHSTAT_A <7)
boot_ci <- function (v, func = median, conf = 0.95, n_iter = 100) {
boot_func <- \(x, i) func(x[i])
b <- boot(v, boot_func, R = n_iter)
b <- boot.ci(b, conf = conf, type = "perc")
return(c("lower" = b$percent[4],
"upper" = b$percent[5]))
}
Bootstrap: Life Satisfaction vs BMI
df_ciLS <- dfFilteredLSBMI |>
group_by(BMICAT_A) |>
summarise(ci_lower = boot_ci(LSATIS4_A, mean)['lower'],
mean_LS = mean(LSATIS4_A),
ci_upper = boot_ci(LSATIS4_A, mean)['upper'])
df_ciLS |>
ggplot() +
geom_errorbarh(mapping = aes(y = BMICAT_A,
xmin=ci_lower, xmax=ci_upper,
color = '95% C.I.'), height = 0.5) +
geom_point(mapping = aes(x = mean_LS, y = BMICAT_A,
color = 'Group Mean'),
shape = '|',
size = 5) +
scale_color_manual(values=c('black', 'red')) +
theme_minimal() +
labs(title = "Life Satisfaction by BMI Category",
x = "Life Satisfaction",
y = "BMI Category",
color = '')

Category 1, underweight, seems to have a much larger range than the
other 3, which like stated above, could be an issue because of how small
the distribution was.
Bootstrap: General Physical Health vs BMI
df_ciPH <- dfFilteredPHBMI |>
group_by(BMICAT_A) |>
summarise(ci_lower = boot_ci(PHSTAT_A, mean)['lower'],
mean_PH = mean(PHSTAT_A),
ci_upper = boot_ci(PHSTAT_A, mean)['upper'])
df_ciPH |>
ggplot() +
geom_errorbarh(mapping = aes(y = BMICAT_A,
xmin=ci_lower, xmax=ci_upper,
color = '95% C.I.'), height = 0.5) +
geom_point(mapping = aes(x = mean_PH, y = BMICAT_A,
color = 'Group Mean'),
shape = '|',
size = 5) +
scale_color_manual(values=c('black', 'red')) +
theme_minimal() +
labs(title = "General Health by BMI Category",
x = "General Health",
y = "BMI Category",
color = '')

There were way less people in the underweight BMI (1) than the other
categories, similar to the issue in Life Satisfaction, so that group
does not give as much information, or as accurate of information as the
others and was removed for the graphs below.
General Health by BMI Status
dfBMIGH <- adult22 %>%
filter(PHSTAT_A < 7)
dfNormalBMI <- dfBMIGH %>%
filter(BMICAT_A == 2 )
dfOverBMI <- dfBMIGH %>%
filter(BMICAT_A == 3 )
dfObeseBMI <- dfBMIGH %>%
filter(BMICAT_A == 4 )
BMIGH <- data.frame(
Status = c(2, 3, 4),
Physical_Health= c( mean(dfNormalBMI$PHSTAT_A), mean(dfOverBMI$PHSTAT_A), mean(dfObeseBMI$PHSTAT_A)))
ggplot(BMIGH, aes(x = factor(Status), y = Physical_Health)) +
geom_point() +
scale_x_discrete(labels= c( "Normal", "Overweight", "Obese")) +
xlab("BMI Status") +
ylab("Average General Physical Health")

NA
NA
Life Satisfaction by BMI Status
dfBMILS <- adult22 %>%
filter(LSATIS4_A < 5)
dfNormalBMILS <- dfBMILS %>%
filter(BMICAT_A == 2 )
dfOverBMILS <- dfBMILS %>%
filter(BMICAT_A == 3 )
dfObeseBMILS <- dfBMILS %>%
filter(BMICAT_A == 4 )
BMILS <- data.frame(
Status = c(2, 3, 4),
Life_Satis= c( mean(dfNormalBMILS$LSATIS4_A), mean(dfOverBMILS$LSATIS4_A), mean(dfObeseBMILS$LSATIS4_A)))
ggplot(BMILS, aes(x = factor(Status), y = Life_Satis)) +
geom_point() +
scale_x_discrete(labels= c("Normal", "Overweight", "Obese")) +
xlab("BMI Status") +
ylab("Average Life Satisfaction")

NA
NA
Education Level
Vs General Physical Health
dfFilteredEdu <- dfGHFilter %>%
filter(EDUCP_A < 97)
df_ciEdu <- dfFilteredEdu |>
group_by(EDUCP_A) |>
summarise(ci_lower = boot_ci(PHSTAT_A, mean)['lower'],
mean_PH = mean(PHSTAT_A),
ci_upper = boot_ci(PHSTAT_A, mean)['upper'])
df_ciEdu |>
ggplot() +
geom_errorbarh(mapping = aes(y = EDUCP_A,
xmin=ci_lower, xmax=ci_upper,
color = '95% C.I.'), height = 0.5) +
geom_point(mapping = aes(x = mean_PH, y = EDUCP_A,
color = 'Group Mean'),
shape = '|',
size = 5) +
scale_color_manual(values=c('black', 'red')) +
theme_minimal() +
labs(title = "General Physical Health by Education",
x = "Physical Health",
y = "Education",
color = '')

We can see that as a person’s education level decreases, their
average physical health generally decreases, however it was still in the
same physical health category. There seems to be a few groups forming,
such as the bottom 3, the middle 4, and the top 3.
Vs Life Satisfaction
dfFilteredEduLS <- dfFilteredLS %>%
filter(EDUCP_A < 97)
df_ciEduLS <- dfFilteredEduLS |>
group_by(EDUCP_A) |>
summarise(ci_lower = boot_ci(LSATIS4_A, mean)['lower'],
mean_LS = mean(LSATIS4_A),
ci_upper = boot_ci(LSATIS4_A, mean)['upper'])
df_ciEduLS |>
ggplot() +
geom_errorbarh(mapping = aes(y = EDUCP_A,
xmin=ci_lower, xmax=ci_upper,
color = '95% C.I.'), height = 0.5) +
geom_point(mapping = aes(x = mean_LS, y = EDUCP_A,
color = 'Group Mean'),
shape = '|',
size = 5) +
scale_color_manual(values=c('black', 'red')) +
theme_minimal() +
labs(title = "Life Satisfaction by Education",
x = "Life Satisfaction",
y = "Education",
color = '')

Similar to general health, there is a trend of as education level
decreases, life satisfaction decreases.
Vision
dfFilteredVis <- adult22 %>%
filter(VISIONDF_A <4)
hist(dfFilteredVis$VISIONDF_A)

hist(dfFilteredVis$WEARGLSS_A)

Vision vs General Physical Health
dfGlasses <- adult22 %>%
filter(PHSTAT_A < 5)
dfNoGlasses <- dfGlasses %>%
filter(WEARGLSS_A == '2')
dfWearGlasses <- dfGlasses %>%
filter(WEARGLSS_A== '1' )
WearGlasses <- data.frame(
Status = c(1, 2),
Gen_Health= c(mean(dfWearGlasses$PHSTAT_A), mean(dfNoGlasses$PHSTAT_A)))
ggplot(WearGlasses, aes(x = factor(Status), y = Gen_Health)) +
geom_point() +
scale_x_discrete(labels= c("Wear Glasses/Contacts", "No Glasses/Contacts")) +
xlab("")+
ylab("Average Physical Health")

dfVision <- dfFilteredVis %>%
filter(PHSTAT_A < 5)
dfNoneVis <- dfVision %>%
filter(VISIONDF_A == '1')
dfSomeVis <- dfVision %>%
filter(VISIONDF_A== '2' )
dfAlotVis <- dfVision %>%
filter(VISIONDF_A== '3' )
dfCannotVis <- dfVision %>%
filter(VISIONDF_A== '4' )
VisDifficulty <- data.frame(
Status = c(1, 2,3,4),
Gen_Health= c(mean(dfNoneVis$PHSTAT_A), mean(dfSomeVis$PHSTAT_A), mean(dfAlotVis$PHSTAT_A), mean(dfCannotVis$PHSTAT_A)))
ggplot(VisDifficulty, aes(x = factor(Status), y = Gen_Health)) +
geom_point() +
scale_x_discrete(labels= c("None", "Some", "A Lot","Cannot See")) +
xlab("Difficulty Seeing") +
ylab("Average Physical Health")

Vision vs Life Satisfaction
dfGlassesLS <- adult22 %>%
filter(LSATIS4_A < 7)
dfNoGlassesLS <- dfGlassesLS %>%
filter(WEARGLSS_A == '2')
dfWearGlassesLS <- dfGlassesLS %>%
filter(WEARGLSS_A== '1' )
WearGlassesLS <- data.frame(
Status = c(1, 2),
Life_Satis= c(mean(dfWearGlassesLS$LSATIS4_A), mean(dfNoGlassesLS$LSATIS4_A)))
ggplot(WearGlassesLS, aes(x = factor(Status), y = Life_Satis)) +
geom_point() +
scale_x_discrete(labels= c("Wear Glasses/Contacts", "No Glasses/Contacts")) +
xlab("")+
ylab("Average Life Satisfaction")

Difficulty Seeing and Life Satisfaction
dfVisionLS <- dfFilteredVis %>%
filter(LSATIS4_A < 7)
dfNoneVisLS <- dfVisionLS %>%
filter(VISIONDF_A == '1')
dfSomeVisLS <- dfVisionLS %>%
filter(VISIONDF_A== '2' )
dfAlotVisLS <- dfVisionLS %>%
filter(VISIONDF_A== '3' )
dfCannotVisLS <- dfVisionLS %>%
filter(VISIONDF_A== '4' )
VisDifficultyLS <- data.frame(
Status = c(1, 2,3,4),
Life_Satis= c(mean(dfNoneVisLS$LSATIS4_A), mean(dfSomeVisLS$LSATIS4_A), mean(dfAlotVisLS$LSATIS4_A), mean(dfCannotVisLS$LSATIS4_A)))
ggplot(VisDifficultyLS, aes(x = factor(Status), y = Life_Satis)) +
geom_point() +
scale_x_discrete(labels= c("None", "Some", "A Lot","Cannot See")) +
xlab("Difficulty Seeing") +
ylab("Average Life Satisfaction")

For both physical health and life satisfaction, there was a decrease
for those who wear glasses/contacts. Since people can be split into
positive and negative life satisfaction, we can try a binary
response.
dfVisionAll <- dfFilteredVis %>%
filter(WEARGLSS_A< '3' )
dfVisionAllLS <-
dfVisionAll |>
group_by(dfVisionAll$LSATIS4_A) |>
mutate(LifeSatis_Status = ifelse(LSATIS4_A == 1,
0,
ifelse(LSATIS4_A == 2,
0,
1))) |>
ungroup()
model <- glm(LifeSatis_Status ~ VISIONDF_A, data = dfVisionAllLS,
family = binomial(link = 'logit'))
model$coefficients
(Intercept) VISIONDF_A
-4.325735 1.004927
sigmoid <- \(x) 1 / (1 + exp(-(-4.325735 + 1.004927 * x)))
dfVisionAllLS |>
ggplot(mapping = aes(x = VISIONDF_A, y = LifeSatis_Status)) +
geom_jitter(width = 0, height = 0.1, shape = 'O', size = 3) +
geom_function(fun = sigmoid, color = 'blue', linewidth = 1) +
labs(title = "Life Satisfaction Binary Response - Difficulty Seeing") +
scale_y_continuous(breaks = c(0, 1)) +
theme_minimal()

Between Positive Life Satisfaction(0) and Negative Life
Satisfaction(1). We can see that there is a small increase in
probability that as someone’s vision gets worst, the life satisfaction
will become more negative.
Smoking
Smoking, Life Satisfaction, and General Health
- Current, every day smoker
- Current, some days smoker
- Former smoker
- Non-Smoker
dfFilteredLSSmoke <- dfFilteredLS %>%
filter(SMKCIGST_A < 5)
dfFilteredLSSmoke |>
ggplot(mapping = aes(x = SMKCIGST_A, y = LSATIS4_A)) +
geom_point(size = 2) +
geom_smooth(method = "lm", se = FALSE, color = 'darkblue') +
theme_minimal() + xlab("Smoking Status") + ylab("Life Satisfaction")

dfFilteredPHSmoke <- dfGHFilter %>%
filter(SMKCIGST_A < 5)
dfFilteredPHSmoke |>
ggplot(mapping = aes(x = SMKCIGST_A, y = PHSTAT_A)) +
geom_point(size = 2) +
geom_smooth(method = "lm", se = FALSE, color = 'darkblue') +
theme_minimal() + xlab("Smoking Status") + ylab("General Health")

NA
NA
There is a bit of a decrease in life satisfaction and general health
the more someone smokes.
Smoking and Lung Cancer - Binary Response
We can split people up into someone who has smoked and someone who
has not smoked, so we can try a binary response.
dfSmokeCan <- adult22 %>%
filter(SMKCIGST_A < 5 & LUNGCAN_A <3)
dfSmokeCan <-
dfSmokeCan |>
group_by(dfSmokeCan$SMKCIGST_A) |>
mutate(Smoke = ifelse(SMKCIGST_A == "4",
0,
1)) |>
ungroup()
model <- glm(Smoke ~ LUNGCAN_A, data = dfSmokeCan,
family = binomial(link = 'logit'))
model$coefficients
(Intercept) LUNGCAN_A
3.264507 -1.755153
sigmoid <- \(x) 1 / (1 + exp(-(3.264507 -1.755153 * x)))
dfSmokeCan |>
ggplot(mapping = aes(x = LUNGCAN_A, y = Smoke)) +
geom_jitter(width = 0, height = 0.1, shape = 'O', size = 3) +
geom_function(fun = sigmoid, color = 'blue', linewidth = 1) +
labs(title = "Binary Response: Lung Cancer and Smoking") +
scale_y_continuous(breaks = c(0, 1)) +
theme_minimal()

This graph shows that for those who have lung cancer (1), there is a
higher probability of them being a smoker or former smoker(1).
100 Cigarettes Smoked and Lung Cancer - Binary Response
We can also split the smokers up into those who have smoked 100
cigarettes, and those who have not.
dfSmoke100 <- adult22 %>%
filter(SMKEV_A < 3 & LUNGCAN_A <3)
dfSmoke100Can <-
dfSmoke100 |>
group_by(dfSmoke100$SMKEV_A) |>
mutate(Smoke_100 = ifelse(SMKEV_A == "2",
0,
1)) |>
ungroup()
model <- glm(Smoke_100 ~ LUNGCAN_A, data = dfSmoke100Can,
family = binomial(link = 'logit'))
model$coefficients
(Intercept) LUNGCAN_A
3.263094 -1.753740
sigmoid <- \(x) 1 / (1 + exp(-(3.263094 -1.753740 * x)))
dfSmoke100Can |>
ggplot(mapping = aes(x = LUNGCAN_A, y = Smoke_100)) +
geom_jitter(width = 0, height = 0.1, shape = 'O', size = 3) +
geom_function(fun = sigmoid, color = 'blue', linewidth = 1) +
labs(title = "Binary Response: Lung Cancer and Smoked 100 Cigarettes") +
scale_y_continuous(breaks = c(0, 1)) +
theme_minimal()

This graph shows that for those who have lung cancer (1), there is a
higher probability of them having smoked 100 cigarettes(1). The model is
extremely close to the previous model. So, how many of the smokers with
lung cancer smoked 100 cigarettes?
LungCan <- nrow(dfSmoke100[dfSmoke100$LUNGCAN_A == '1',])
Smoked100 <- nrow(dfSmoke100[dfSmoke100$SMKEV_A == '1',])
Smoked100Can <- nrow(dfSmoke100[dfSmoke100$SMKEV_A == '1' & dfSmoke100$LUNGCAN_A == '1', ])
NotSmoked100Can <- nrow(dfSmoke100[dfSmoke100$SMKEV_A == '2' & dfSmoke100$LUNGCAN_A == '1', ])
LungCan
[1] 116
Smoked100Can
[1] 95
95/116*100
[1] 81.89655
Smoked100
[1] 1511
NotSmoked100Can
[1] 21
95/1511*100
[1] 6.287227
Out of 116 people who stated they have lung cancer, 95 people, or
81.9% had smoked at least 100 cigarettes. But out of 1511 people who
have smoked at least 100 cigarettes, only 21, or 6.3% have lung
cancer.
Covid
Month of last Covid vaccine for those who got it for the first
time.
fCovidVax <- adult22
dfCovidVaxFirst <- dfCovidVax %>%
filter(CVDVAC1M_A != 'NA' & CVDVAC1Y_A !='NA')
dfCovidVaxLast <- dfCovidVax %>%
filter(CVDVAC2M_A != 'NA' & CVDVAC2Y_A !='NA')
dfCovidVaxFirst$LastCovidVaxFirstTime <- with(dfCovidVaxFirst, sprintf("%d-01-%02d", CVDVAC1M_A, CVDVAC1Y_A))
dfCovidVaxLast$LastCovidVaxNotFirstTime <- with(dfCovidVaxLast, sprintf("%d-01-%02d", CVDVAC2M_A, CVDVAC2Y_A))
dfCovidVaxFirst_ <- dfCovidVaxFirst %>%
filter(CVDDIAG_A < 3)
dfCovidVaxLast_ <- dfCovidVaxLast %>%
filter(CVDDIAG_A < 3)
dfCovidVaxFirst1 <- dfCovidVaxFirst_ |>
select(LastCovidVaxFirstTime, CVDDIAG_A)
dfCovidVaxLast1 <- dfCovidVaxLast_ |>
select(LastCovidVaxNotFirstTime, CVDDIAG_A)
dftransf <- dfCovidVaxFirst1 |>
select(LastCovidVaxFirstTime, CVDDIAG_A)
dftransf['date']<- as.Date(dftransf$LastCovidVaxFirstTime, format="%m-%d-%Y")
dftransfilt <- dftransf %>% filter(!is.na(date))
dfFirstV <- dftransfilt |>
select(date, CVDDIAG_A)
dffirstVCount<- dfFirstV %>% group_by(date) %>%tally()
dfMerge <- merge(dffirstVCount, dfFirstVSum, by = "date", all.x=TRUE, all.y=TRUE)
dfMerge['percent'] = dfMerge['Frequency']/dfMerge['n']
dfMerge <- dfMerge %>% filter(percent >.0000001)
dfMerge |>
ggplot(mapping = aes(x = date, y = n)) +
geom_line() +
labs(title = "Date of First Covid Vaccine") +
theme_hc() +
scale_x_date(date_labels = "%Y (%b)") + ylab("Number of People")

The first vaccine came out in Dec 2021, which is around where the
graph starts to increase. There was an increase in April and May 2021.
In June 2021 when it started to decrease, there was a slowdown in people
getting the vaccine and a record low number of cases.
Source: https://www.kff.org/coronavirus-covid-19/poll-finding/kff-covid-19-vaccine-monitor-june-2021/
On September 9 2021, Biden made an announcement requiring federal
employees to be vaccinated by the end of January 2022, which is around
the time of the largest peak between Sept 2021 and January 2022.
Source: https://www.whitehouse.gov/briefing-room/presidential-actions/2021/09/09/executive-order-on-requiring-coronavirus-disease-2019-vaccination-for-federal-employees/
Cancer
Cancer by Sex
Told has cancer.
dfFemale = adult22 %>%
filter(SEX_A == 2)
dfFemaleCan = dfFemale %>%
filter(CANEV_A == 1)
dfMale = adult22 %>%
filter(SEX_A == 1)
dfMaleCan = dfMale %>%
filter(CANEV_A == 1)
TotalF <- nrow(dfFemale)
TotalM <- nrow(dfMale)
PercentF <- (nrow(dfFemaleCan)/nrow(dfFemale))*100
PercentM <- (nrow(dfMaleCan)/nrow(dfMale))*100
PercentCan <- data.frame(
Sex = c("Female", "Male"),
Percent= c((PercentF), (PercentM)))
ggplot(PercentCan, aes(x = Sex, y=Percent)) +
geom_point()

Percent by Type of Cancer
dfCan <- adult22 %>%
filter(NUMCAN_A < 7 & NUMCAN_A !=0)
dfBlad <- dfCan %>%
filter(BLADDCAN_A == 1)
dfBlood <- dfCan %>%
filter(BLOODCAN_A == 1)
dfBrain <- dfCan %>%
filter(BRAINCAN_A == 1)
dfBone <- dfCan %>%
filter(BONECAN_A == 1)
dfBreast <- dfCan %>%
filter(BREASCAN_A == 1)
dfCervical <- dfCan %>%
filter(CERVICAN_A == 1)
dfEsoph <- dfCan %>%
filter(BLADDCAN_A == 1)
dfGall <- dfCan %>%
filter(ESOPHCAN_A == 1)
dfLarynx <- dfCan %>%
filter(LARYNCAN_A == 1)
dfLeuk <- dfCan %>%
filter(LEUKECAN_A == 1)
dfLiver <- dfCan %>%
filter(LIVERCAN_A == 1)
dfLung <- dfCan %>%
filter(LUNGCAN_A == 1)
dfLymph <- dfCan %>%
filter(LYMPHCAN_A == 1)
dfMela <- dfCan %>%
filter(MELANCAN_A == 1)
dfMouth <- dfCan %>%
filter(MOUTHCAN_A == 1)
dfOvary <- dfCan %>%
filter(OVARYCAN_A == 1)
dfPanc <- dfCan %>%
filter(PANCRCAN_A== 1)
dfProst <- dfCan %>%
filter(PROSTCAN_A == 1)
dfSkinMela <- dfCan %>%
filter(SKNMCAN_A == 1)
dfSkinNoMela <- dfCan %>%
filter(SKNNMCAN_A == 1)
dfSkinUnknown <- dfCan %>%
filter(SKNDKCAN_A == 1)
dfStomach <- dfCan %>%
filter(STOMACAN_A == 1)
dfThroat <- dfCan %>%
filter(THROACAN_A == 1)
dfThyroid <- dfCan %>%
filter(THYROCAN_A == 1)
dfUterus <- dfCan %>%
filter(UTERUCAN_A == 1)
dfHandNeck <- dfCan %>%
filter(HDNCKCAN_A == 1)
dfColRec <- dfCan %>%
filter(COLRCCAN_A == 1)
dfOther <- dfCan %>%
filter(OTHERCANP_A == 1)
TBladF <- nrow(dfBlad %>%
filter(SEX_A == 2))/TotalF*100
TBloodF <- nrow(dfBlood %>%
filter(SEX_A == 2))/TotalF*100
TBoneF <- nrow(dfBone %>%
filter(SEX_A == 2))/TotalF*100
TBrainF <- nrow(dfBrain %>%
filter(SEX_A == 2))/TotalF*100
TBreastF <- nrow(dfBreast %>%
filter(SEX_A == 2))/TotalF*100
TEsophF <- nrow(dfEsoph %>%
filter(SEX_A == 2))/TotalF*100
TCervF <- nrow(dfCervical %>%
filter(SEX_A == 2))/TotalF*100
TGallF <- nrow(dfGall %>%
filter(SEX_A == 2))/TotalF*100
TLarynxF <- nrow(dfLarynx %>%
filter(SEX_A == 2))/TotalF*100
TLeukF <- nrow(dfLeuk %>%
filter(SEX_A == 2))/TotalF*100
TLungF <- nrow(dfLung %>%
filter(SEX_A == 2))/TotalF*100
TLiverF <- nrow(dfLiver %>%
filter(SEX_A == 2))/TotalF*100
TMelaF <- nrow(dfMela %>%
filter(SEX_A == 2))/TotalF*100
TMouthF <- nrow(dfMouth %>%
filter(SEX_A == 2))/TotalF*100
TOvaryF <- nrow(dfOvary %>%
filter(SEX_A == 2))/TotalF*100
TPancF <- nrow(dfBlad %>%
filter(SEX_A == 2))/TotalF*100
TSkinMelaF <- nrow(dfSkinMela %>%
filter(SEX_A == 2))/TotalF*100
TSkinNoMelaF <- nrow(dfSkinNoMela %>%
filter(SEX_A == 2))/TotalF*100
TSkinUnknownF <- nrow(dfSkinUnknown %>%
filter(SEX_A == 2))/TotalF*100
TStomachF <- nrow(dfStomach %>%
filter(SEX_A == 2))/TotalF*100
TThyroidF <- nrow(dfThyroid %>%
filter(SEX_A == 2))/TotalF*100
TThroatF <- nrow(dfThroat %>%
filter(SEX_A == 2))/TotalF*100
TUterusF <- nrow(dfUterus %>%
filter(SEX_A == 2))/TotalF*100
THandNeckF <- nrow(dfHandNeck %>%
filter(SEX_A == 2))/TotalF*100
TColRecF <- nrow(dfColRec %>%
filter(SEX_A == 2))/TotalF*100
TOtherF <- nrow(dfOther %>%
filter(SEX_A == 2))/TotalF*100
Cancer Among Females
CancerF <- data.frame(
Cancer = c("Bladder", "Blood","Brain","Bone","Breast","Esophagus","Cervical","GallBladder","Larynx","Leukemia","Lung","Liver","Melanoma","Mouth","Ovarian","Pancreas","Skin Melanoma","Skin-Not Melanoma","Skin-Unknown","Stomach","Thyroid","Throat","Uterus","Hand and Neck","ColonRectal","Other"),
Percent= c((TBladF), (TBloodF),(TBrainF),(TBoneF),(TBreastF),(TEsophF),(TCervF),(TGallF),(TLarynxF),(TLeukF),(TLungF),(TLiverF),(TMelaF),(TMouthF),(TOvaryF),(TPancF),(TSkinMelaF),(TSkinNoMelaF),(TSkinUnknownF),(TStomachF),(TThyroidF),(TThroatF),(TUterusF),(THandNeckF),(TColRecF),(TOtherF)))
ggplot(CancerF, aes(x = Percent, y=Cancer)) +
geom_point() + labs(title = "Cancer Among Females")

Cancer Among Males
TBladM <- nrow(dfBlad %>%
filter(SEX_A == 1))/TotalM*100
TBloodM <- nrow(dfBlood %>%
filter(SEX_A == 1))/TotalM*100
TBoneM <- nrow(dfBone %>%
filter(SEX_A == 1))/TotalM*100
TBrainM <- nrow(dfBrain %>%
filter(SEX_A == 1))/TotalM*100
TBreastM <- nrow(dfBreast %>%
filter(SEX_A == 1))/TotalM*100
TEsophM <- nrow(dfEsoph %>%
filter(SEX_A == 1))/TotalM*100
TGallM <- nrow(dfGall %>%
filter(SEX_A == 1))/TotalM*100
TLarynxM <- nrow(dfLarynx %>%
filter(SEX_A == 1))/TotalM*100
TLeukM <- nrow(dfLeuk %>%
filter(SEX_A == 1))/TotalM*100
TLungM <- nrow(dfLung %>%
filter(SEX_A == 1))/TotalM*100
TLiverM <- nrow(dfLiver %>%
filter(SEX_A == 1))/TotalM*100
TMelaM <- nrow(dfMela %>%
filter(SEX_A == 1))/TotalM*100
TMouthM <- nrow(dfMouth %>%
filter(SEX_A == 1))/TotalM*100
TPancM <- nrow(dfPanc %>%
filter(SEX_A == 1))/TotalM*100
TProstM <- nrow(dfProst %>%
filter(SEX_A == 1))/TotalM*100
TSkinMelaM <- nrow(dfSkinMela %>%
filter(SEX_A == 1))/TotalM*100
TSkinNoMelaM <- nrow(dfSkinNoMela %>%
filter(SEX_A == 1))/TotalM*100
TSkinUnknownM <- nrow(dfSkinUnknown %>%
filter(SEX_A == 1))/TotalM*100
TStomachM <- nrow(dfStomach %>%
filter(SEX_A == 1))/TotalM*100
TThyroidM <- nrow(dfThyroid %>%
filter(SEX_A == 1))/TotalM*100
TThroatM <- nrow(dfThroat %>%
filter(SEX_A == 1))/TotalM*100
THandNeckM <- nrow(dfHandNeck %>%
filter(SEX_A == 1))/TotalM*100
TColRecM <- nrow(dfColRec %>%
filter(SEX_A == 1))/TotalM*100
TOtherM <- nrow(dfOther %>%
filter(SEX_A == 1))/TotalM*100
CancerM <- data.frame(
Cancer = c("Bladder", "Blood","Brain","Bone","Breast","Esophagus","GallBladder","Larynx","Leukemia","Lung","Liver","Melanoma","Mouth","Pancreas","Skin Melanoma","Skin-Not Melanoma","Skin-Unknown","Stomach","Thyroid","Throat","Hand and Neck","ColonRectal","Other"),
Percent= c((TBladM), (TBloodM),(TBrainM),(TBoneM),(TBreastM),(TEsophM),(TGallM),(TLarynxM),(TLeukM),(TLungM),(TLiverM),(TMelaM),(TMouthM),(TPancM),(TSkinMelaM),(TSkinNoMelaM),(TSkinUnknownM),(TStomachM),(TThyroidM),(TThroatM),(THandNeckM),(TColRecM),(TOtherM)))
ggplot(CancerM, aes(x = Percent, y=Cancer)) +
geom_point() + labs(title = "Cancer Among Males")

Cancer Among Both
df_Canmerge <- merge(CancerF,CancerM,by="Cancer", all.x = TRUE)
names(df_Canmerge)[names(df_Canmerge) == "Percent.x"] <- "Female"
names(df_Canmerge)[names(df_Canmerge) == "Percent.y"] <- "Male"
dfCancerMerged <- melt(df_Canmerge, id.vars="Cancer")
names(dfCancerMerged)[names(dfCancerMerged) == "value"] <- "Percent"
names(dfCancerMerged)[names(dfCancerMerged) == "variable"] <- "Sex"
ggplot(dfCancerMerged, aes(Percent,Cancer, col=Sex)) +
geom_point() + labs(title = "Cancer Among Males and Females")

Both males and females had a higher percentage for Skin-Not Melanoma
cancer. The highest percentage for cancer was breast cancer in
females.In many of the cancer categories, females had a slightly lower
percentage.
Summary
While many of the ways to stay healthy are commonly known and
advertised, it is interesting to look at the data to get a more specific
idea. While many of these did not have much of a difference between
categories, there is some information that can be gained that follow
many of the health ideas we already know. These include
- It is less healthier, mentally and physically, for you to smoke.
Smoking at least 100 cigarettes can make this worst and if you end up
with lung cancer, it could be contributed to you smoking.
- Cancer is important to watch out for, and there are some
specifically that are more common based on your sex and overall more
common.
- While it is mostly unavoidable, those who have difficulty seeing can
have worst life satisfaction, so if you can get it fixed by surgery or
glasses/contacts, it will be helpful.
- It is better to for your mental and physical health to stay around a
normal BMI. However, this could be argued based on if your weight comes
from non-muscle fat or muscle fat.
- A higher education could potentially lead you to feel more satisfied
and have better health.
---
title: "2021 National Health Interview Adult Survey"
output: html_notebook
editor_options: 
  markdown: 
    wrap: 72
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)

 adult22 <- read.csv("~/Documents/DataDive/adult22.csv")
 
library(tidyverse)
library(gapminder)
library(ggthemes)
library(dplyr)
library(ggplot2)
library(magrittr)
library(effsize)
library(pwrss)
library(broom)
library(lindia)
library(boot)
library(car)
library(xts)
library(tsibble)
library(lubridate)
library(reshape2)



 adult22_raw <- adult22
 
```

# Dataset

The dataset is data from the 2021 National Health Interview Adult
Survey. The survey contained questions related to household and family
composition, demographics about the survey taker, satisfaction with
life, health insurance, medication, immunization, preventive screenings,
and multiple health problems such as hypertension, cardiovascular
conditions, cancer, vision, hearing, mobility, and more.

This survey is important in following the health of American's based on
many different factors of their lives. Looking at previous surveys can
also help to see trends in Americans' health.

For the data used, responses of "refused","don't know" and "not
ascertained" were filtered out as these responses had a tendency to be
much smaller than the others and were not helpful for what was being
looked at.

# Columns

### General Health

```         
1. Excellent 
2. Very Good 
3. Good 
4. Fair 
5. Poor 
```

### Life Satisfaction

```         
1. Very Satisfied 
2. Satisfied 
3. Dissatisfied 
4. Very Dissatisfied
```

### Race (Single and Multiple Race Groups)

```         
1.  White only
2.  Black/African American only
3.  Asian only
4.  AIAN only (American Indians and Alaska Natives)
5.  AIAN and any other group
6.  Other
7.  Hispanic who did not identify with one of the other categories
```

### Education Level

```         
0. Never attended/Kindergarten only 
1. Grade 1-11 
2. 12th grade, nodiploma 
3. GED or equivalent 
4. High School Graduate 
5. Some college, nodegree 
6. Associate degree: occupational, technical, or vocational program 
7. Associate degree: academic program 
8. Bachelor's degree 
9. Master's degree 
10. Professional School or Doctoral degree
```

### BMI

```         
1. Underweight 
2. Normal 
3. Overweight 
4. Obese
```

# Vision

```         
Do you wear glasses or contact lenses?
1. Yes 
2. No
```

```         
Do you have difficulty seeing? 
1. No difficulty 
2. Some difficulty 
3. A lot of difficulty 
4. Cannot do at all
```

### Cancer

```         
Separate columns for each type of cancer.

Number of reported types of cancer. 
0. No cancer 
1-3. 1 to 3 
4. 4 or more
```

# Demographics

Looking at demographics can help to give a bit of an idea about what
sort of data we should look at.

### Sex

```{r}

dfSexFilter <- adult22 %>% 
  filter(SEX_A == 1 | SEX_A == 2)

dfSexFilter <-
  dfSexFilter |>
    group_by(dfSexFilter$SEX_A) |>
    mutate(Sex_Status = ifelse(SEX_A == 1,
                                 "Male", 
                                 "Female")) |>

  ungroup()
ggplot(dfSexFilter, aes(x = Sex_Status, fill=Sex_Status)) + xlab("Sex") +
  geom_bar()
```

### Age

The far right bar, 85, is for those who are 85+.

```{r}


dfAgeFilter <- adult22 %>% 
  filter(AGEP_A < 86)

ggplot(dfAgeFilter, aes(x = AGEP_A)) + xlab("Age") +
  geom_bar()
```

The largest group based on age were those that were 85 or older.

### Education Level

```{r}
dfEduFilter <- adult22 %>% 
  filter(EDUCP_A < 97)


dfEduFilter <-
  dfEduFilter |>
    group_by(dfEduFilter$EDUCP_A) |>
    mutate(Edu_Status = ifelse(EDUCP_A == 1,
                                "Grade 1-11", 
                               ifelse(EDUCP_A == 2,
                                 "12th Grade, no Diploma",
                                 ifelse(EDUCP_A == 3,
                                 "GED or Equivalent",
                                 ifelse(EDUCP_A == 4,
                                 "High School Graduate",
                                 ifelse(EDUCP_A == 5,
                                 "Some College, no Degree",
                                 ifelse(EDUCP_A == 6,
                                 "Associate degree: occupational, technical, or vocational program",
                                 ifelse(EDUCP_A == 7,
                                 "Associate degree: academic program",
                                 ifelse(EDUCP_A == 8,
                                 "Bachelor's degree",
                                 ifelse(EDUCP_A == 9,
                                 "Master's degree ",
                                 ifelse(EDUCP_A == 10,
                                 "Professional School or Doctoral degree",
                                 ifelse(EDUCP_A == 97,
                                 "Refused",
                                "Don't Know")))))))))))) |> 
ungroup()
  
  
dfEduFilter$Edu_Status <- factor(dfEduFilter$Edu_Status, levels = c("Grade 1-11", "12th Grade, no Diploma", "GED or Equivalent","High School Graduate", "Some College, no Degree", "Associate degree: occupational, technical, or vocational program", "Associate degree: academic program", "Bachelor's degree", "Master's degree ", "Professional School or Doctoral degree", "Refused", "Don't Know"))

ggplot(dfEduFilter, aes(x = EDUCP_A, fill=Edu_Status)) +
  geom_bar() + theme(axis.title.x=element_blank(),
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank()) + xlab("Education Level")

```

### Race

```{r}
dfRaceFilter <- adult22 %>% 
  filter(RACEALLP_A <= 8 )

dfRaceFilter2 <- dfRaceFilter  %>% 
  filter(RACEALLP_A != 7 )


dfRaceFilter2 <-
  dfRaceFilter2 |>
    group_by(dfRaceFilter2$RACEALLP_A) |>
    mutate(Race = ifelse(RACEALLP_A == 1,
                                "White", 
                               ifelse(RACEALLP_A == 2,
                                 "African American/Black",
                                 ifelse(RACEALLP_A == 3,
                                 "Asian",
                                 ifelse(RACEALLP_A == 4,
                                 "AIAN*",
                                 ifelse(RACEALLP_A == 5,
                                 "AIAN + Other Group",
                                 ifelse(RACEALLP_A == 6,
                                 "Other",
                                 ifelse(RACEALLP_A == 8,
                                 "Hispanic**",
                                "Don't Know")))))))) |> 
ungroup()
  
ggplot(dfRaceFilter2, aes(x = Race, fill=Race)) +
  geom_bar() + theme(axis.title.x=element_blank(),
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank())

```

There was a much larger amount of those who identified as white who took
the survey.

### General Physical Health

```{r}

dfGHFilter <- adult22 %>% 
  filter(PHSTAT_A < '7')

ggplot(dfGHFilter, aes(x = factor(PHSTAT_A))) + scale_x_discrete(labels=c("Excellent", "Very Good", "Good", "Fair", "Poor")) +
  geom_bar() + 
  xlab("General Physical Health")
```

Most people put their General Physical Health in the more positive side
(Excellent, Very good, Good) side of the scale.

### Life Satisfaction

```{r}

dfFilteredLS <- adult22 %>% 
  filter(LSATIS4_A < '5')


ggplot(dfFilteredLS, aes(x = factor(LSATIS4_A))) + scale_x_discrete(labels=c("Very Satisfied", "Satisfied", "Dissatisfied", "Very Dissatisfied")) +
  geom_bar() +
  xlab("Life Satisfaction")

```

Most people considered themselves to be positively satisfied with life.
The difference between the positive and negative side of the scale was
larger than for General Physical Health.

# BMI

BMI can often be contributed to health and certain levels can be
contributed to health issues. However, BMI is fundamentally flawed as it
does not take into consideration muscle fat. For example, a body builder
could be considered to have an "overweight" BMI, but would be much
healthier than someone who had mostly non-muscle fat.

### Histogram

```{r}
hist(dfFilteredBMI$BMICAT_A)
```

The distribution was relatively equal, except for the Underweight
category, so there was a potential for this category to cause problems.

### Bootstrap

This is to help us get a better idea of the average Life Satisfaction
and General Physical Health based on a person's BMI category, by
creating samples.

```{r}

dfFilteredLSBMI <- adult22 %>%
  filter(BMICAT_A < 5 & LSATIS4_A <5)

dfFilteredPHBMI <- adult22 %>%
  filter(BMICAT_A < 5 & PHSTAT_A <7)

boot_ci <- function (v, func = median, conf = 0.95, n_iter = 100) {
  boot_func <- \(x, i) func(x[i])
  
  b <- boot(v, boot_func, R = n_iter)
  b <- boot.ci(b, conf = conf, type = "perc")
  
  return(c("lower" = b$percent[4],
           "upper" = b$percent[5]))
}
```

#### Bootstrap: Life Satisfaction vs BMI

```{r}
df_ciLS <- dfFilteredLSBMI |>
  group_by(BMICAT_A) |>
  summarise(ci_lower = boot_ci(LSATIS4_A, mean)['lower'],
            mean_LS = mean(LSATIS4_A),
            ci_upper = boot_ci(LSATIS4_A, mean)['upper'])

```

```{r}

df_ciLS |>
  ggplot() +
  geom_errorbarh(mapping = aes(y = BMICAT_A, 
                               xmin=ci_lower, xmax=ci_upper,
                               color = '95% C.I.'), height = 0.5) +
  geom_point(mapping = aes(x = mean_LS, y = BMICAT_A,
                           color = 'Group Mean'),
             shape = '|',
             size = 5) +
  scale_color_manual(values=c('black', 'red')) +
  theme_minimal() +
  labs(title = "Life Satisfaction by BMI Category",
       x = "Life Satisfaction",
       y = "BMI Category",
       color = '')
```

Category 1, underweight, seems to have a much larger range than the
other 3, which like stated above, could be an issue because of how small
the distribution was.

#### Bootstrap: General Physical Health vs BMI

```{r}
df_ciPH <- dfFilteredPHBMI |>
  group_by(BMICAT_A) |>
  summarise(ci_lower = boot_ci(PHSTAT_A, mean)['lower'],
            mean_PH = mean(PHSTAT_A),
            ci_upper = boot_ci(PHSTAT_A, mean)['upper'])

```

```{r}

df_ciPH |>
  ggplot() +
  geom_errorbarh(mapping = aes(y = BMICAT_A, 
                               xmin=ci_lower, xmax=ci_upper,
                               color = '95% C.I.'), height = 0.5) +
  geom_point(mapping = aes(x = mean_PH, y = BMICAT_A,
                           color = 'Group Mean'),
             shape = '|',
             size = 5) +
  scale_color_manual(values=c('black', 'red')) +
  theme_minimal() +
  labs(title = "General Health by BMI Category",
       x = "General Health",
       y = "BMI Category",
       color = '')
```

There were way less people in the underweight BMI (1) than the other
categories, similar to the issue in Life Satisfaction, so that group
does not give as much information, or as accurate of information as the
others and was removed for the graphs below.

### General Health by BMI Status

```{r}

dfBMIGH <- adult22 %>%
  filter(PHSTAT_A < 7)

dfNormalBMI <- dfBMIGH %>%
  filter(BMICAT_A == 2 )

dfOverBMI <- dfBMIGH %>%
  filter(BMICAT_A == 3 )

dfObeseBMI <- dfBMIGH %>%
  filter(BMICAT_A == 4 )

BMIGH <- data.frame(                                            
   Status  = c(2, 3, 4),       
   Physical_Health= c( mean(dfNormalBMI$PHSTAT_A), mean(dfOverBMI$PHSTAT_A), mean(dfObeseBMI$PHSTAT_A)))

ggplot(BMIGH, aes(x = factor(Status), y = Physical_Health)) + 
  geom_point() +
  scale_x_discrete(labels= c( "Normal", "Overweight", "Obese")) +
  xlab("BMI Status") +
  ylab("Average General Physical Health")


```

### Life Satisfaction by BMI Status

```{r}

dfBMILS <- adult22 %>%
  filter(LSATIS4_A < 5)

dfNormalBMILS <- dfBMILS %>%
  filter(BMICAT_A == 2 )

dfOverBMILS <- dfBMILS %>%
  filter(BMICAT_A == 3 )

dfObeseBMILS <- dfBMILS %>%
  filter(BMICAT_A == 4 )

BMILS <- data.frame(                                            
   Status  = c(2, 3, 4),       
   Life_Satis= c( mean(dfNormalBMILS$LSATIS4_A), mean(dfOverBMILS$LSATIS4_A), mean(dfObeseBMILS$LSATIS4_A)))

ggplot(BMILS, aes(x = factor(Status), y = Life_Satis)) + 
  geom_point() +
  scale_x_discrete(labels= c("Normal", "Overweight", "Obese")) +
  xlab("BMI Status") +
  ylab("Average Life Satisfaction")


```

# Education Level

### Vs General Physical Health

```{r}

dfFilteredEdu <- dfGHFilter %>%
  filter(EDUCP_A < 97)

df_ciEdu <- dfFilteredEdu |>
  group_by(EDUCP_A) |>
  summarise(ci_lower = boot_ci(PHSTAT_A, mean)['lower'],
            mean_PH = mean(PHSTAT_A),
            ci_upper = boot_ci(PHSTAT_A, mean)['upper'])


df_ciEdu |>
  ggplot() +
  geom_errorbarh(mapping = aes(y = EDUCP_A, 
                               xmin=ci_lower, xmax=ci_upper,
                               color = '95% C.I.'), height = 0.5) +
  geom_point(mapping = aes(x = mean_PH, y = EDUCP_A,
                           color = 'Group Mean'),
             shape = '|',
             size = 5) +
  scale_color_manual(values=c('black', 'red')) +
  theme_minimal() +
  labs(title = "General Physical Health by Education",
       x = "Physical Health",
       y = "Education",
       color = '')
```

We can see that as a person's education level decreases, their average
physical health generally decreases, however it was still in the same
physical health category. There seems to be a few groups forming, such
as the bottom 3, the middle 4, and the top 3.

## Vs Life Satisfaction

```{r}

dfFilteredEduLS <- dfFilteredLS %>%
  filter(EDUCP_A < 97)

df_ciEduLS <- dfFilteredEduLS |>
  group_by(EDUCP_A) |>
  summarise(ci_lower = boot_ci(LSATIS4_A, mean)['lower'],
            mean_LS = mean(LSATIS4_A),
            ci_upper = boot_ci(LSATIS4_A, mean)['upper'])


df_ciEduLS |>
  ggplot() +
  geom_errorbarh(mapping = aes(y = EDUCP_A, 
                               xmin=ci_lower, xmax=ci_upper,
                               color = '95% C.I.'), height = 0.5) +
  geom_point(mapping = aes(x = mean_LS, y = EDUCP_A,
                           color = 'Group Mean'),
             shape = '|',
             size = 5) +
  scale_color_manual(values=c('black', 'red')) +
  theme_minimal() +
  labs(title = "Life Satisfaction by Education",
       x = "Life Satisfaction",
       y = "Education",
       color = '')
```

Similar to general health, there is a trend of as education level
decreases, life satisfaction decreases.

# Vision

```{r}
dfFilteredVis <- adult22 %>%
  filter(VISIONDF_A <4)

hist(dfFilteredVis$VISIONDF_A)

hist(dfFilteredVis$WEARGLSS_A)
```

### Vision vs General Physical Health

```{r}
dfGlasses <- adult22 %>%
  filter(PHSTAT_A < 5)

dfNoGlasses <- dfGlasses %>%
  filter(WEARGLSS_A == '2')

dfWearGlasses <- dfGlasses %>%
  filter(WEARGLSS_A== '1' )

WearGlasses <- data.frame(                                            
   Status  = c(1, 2),       
   Gen_Health= c(mean(dfWearGlasses$PHSTAT_A), mean(dfNoGlasses$PHSTAT_A)))


ggplot(WearGlasses, aes(x = factor(Status), y = Gen_Health)) + 
  geom_point() +
  scale_x_discrete(labels= c("Wear Glasses/Contacts", "No Glasses/Contacts")) +
  xlab("")+
  ylab("Average Physical Health")
```

```{r}
dfVision <- dfFilteredVis %>%
  filter(PHSTAT_A < 5)

dfNoneVis <- dfVision %>%
  filter(VISIONDF_A == '1')

dfSomeVis <- dfVision %>%
  filter(VISIONDF_A== '2' )

dfAlotVis <- dfVision %>%
  filter(VISIONDF_A== '3' )

dfCannotVis <- dfVision %>%
  filter(VISIONDF_A== '4' )

VisDifficulty <- data.frame(                                            
   Status  = c(1, 2,3,4),       
   Gen_Health= c(mean(dfNoneVis$PHSTAT_A), mean(dfSomeVis$PHSTAT_A), mean(dfAlotVis$PHSTAT_A), mean(dfCannotVis$PHSTAT_A)))


ggplot(VisDifficulty, aes(x = factor(Status), y = Gen_Health)) + 
  geom_point() +
  scale_x_discrete(labels= c("None", "Some", "A Lot","Cannot See")) +
  xlab("Difficulty Seeing") +
  ylab("Average Physical Health")
```

### Vision vs Life Satisfaction

```{r}
dfGlassesLS <- adult22 %>%
  filter(LSATIS4_A < 7)

dfNoGlassesLS <- dfGlassesLS %>%
  filter(WEARGLSS_A == '2')

dfWearGlassesLS <- dfGlassesLS %>%
  filter(WEARGLSS_A== '1' )

WearGlassesLS <- data.frame(                                            
   Status  = c(1, 2),       
   Life_Satis= c(mean(dfWearGlassesLS$LSATIS4_A), mean(dfNoGlassesLS$LSATIS4_A)))


ggplot(WearGlassesLS, aes(x = factor(Status), y = Life_Satis)) + 
  geom_point() +
  scale_x_discrete(labels= c("Wear Glasses/Contacts", "No Glasses/Contacts")) +
  xlab("")+
  ylab("Average Life Satisfaction")
```

### Difficulty Seeing and Life Satisfaction

```{r}
dfVisionLS <- dfFilteredVis %>%
  filter(LSATIS4_A < 7)

dfNoneVisLS <- dfVisionLS %>%
  filter(VISIONDF_A == '1')

dfSomeVisLS <- dfVisionLS %>%
  filter(VISIONDF_A== '2' )

dfAlotVisLS <- dfVisionLS %>%
  filter(VISIONDF_A== '3' )

dfCannotVisLS <- dfVisionLS %>%
  filter(VISIONDF_A== '4' )

VisDifficultyLS <- data.frame(                                            
   Status  = c(1, 2,3,4),       
   Life_Satis= c(mean(dfNoneVisLS$LSATIS4_A), mean(dfSomeVisLS$LSATIS4_A), mean(dfAlotVisLS$LSATIS4_A), mean(dfCannotVisLS$LSATIS4_A)))


ggplot(VisDifficultyLS, aes(x = factor(Status), y = Life_Satis)) + 
  geom_point() +
  scale_x_discrete(labels= c("None", "Some", "A Lot","Cannot See")) +
  xlab("Difficulty Seeing") +
  ylab("Average Life Satisfaction")
```

For both physical health and life satisfaction, there was a decrease for
those who wear glasses/contacts. Since people can be split into positive
and negative life satisfaction, we can try a binary response.

```{r}
dfVisionAll <- dfFilteredVis %>%
  filter(WEARGLSS_A< '3' )

dfVisionAllLS <-
  dfVisionAll |>
    group_by(dfVisionAll$LSATIS4_A) |>
    mutate(LifeSatis_Status = ifelse(LSATIS4_A == 1,
                                0, 
                                ifelse(LSATIS4_A == 2,
                                0, 
                                 1))) |> 
  ungroup()

model <- glm(LifeSatis_Status ~ VISIONDF_A, data = dfVisionAllLS,
             family = binomial(link = 'logit'))

model$coefficients



```

```{r}
sigmoid <- \(x) 1 / (1 + exp(-(-4.325735 + 1.004927 * x)))

dfVisionAllLS |>
  ggplot(mapping = aes(x = VISIONDF_A, y = LifeSatis_Status)) +
  geom_jitter(width = 0, height = 0.1, shape = 'O', size = 3) +
  geom_function(fun = sigmoid, color = 'blue', linewidth = 1) +
  labs(title = "Life Satisfaction Binary Response - Difficulty Seeing") +
  scale_y_continuous(breaks = c(0, 1)) +
  theme_minimal()
```

Between Positive Life Satisfaction(0) and Negative Life Satisfaction(1).
We can see that there is a small increase in probability that as
someone's vision gets worst, the life satisfaction will become more
negative.

# Smoking

### Smoking, Life Satisfaction, and General Health

1.  Current, every day smoker
2.  Current, some days smoker
3.  Former smoker
4.  Non-Smoker

```{r}

dfFilteredLSSmoke <- dfFilteredLS %>%
  filter(SMKCIGST_A < 5)


dfFilteredLSSmoke |>
  ggplot(mapping = aes(x = SMKCIGST_A, y = LSATIS4_A)) +
  geom_point(size = 2) +
  geom_smooth(method = "lm", se = FALSE, color = 'darkblue') + 
  theme_minimal() + xlab("Smoking Status") + ylab("Life Satisfaction")
  

dfFilteredPHSmoke <- dfGHFilter %>%
  filter(SMKCIGST_A < 5)


dfFilteredPHSmoke |>
  ggplot(mapping = aes(x = SMKCIGST_A, y = PHSTAT_A)) +
  geom_point(size = 2) +
  geom_smooth(method = "lm", se = FALSE, color = 'darkblue') + 
  theme_minimal() + xlab("Smoking Status") + ylab("General Health")


```

There is a bit of a decrease in life satisfaction and general health the
more someone smokes.

### Smoking and Lung Cancer - Binary Response

We can split people up into someone who has smoked and someone who has
not smoked, so we can try a binary response.

```{r}

dfSmokeCan <- adult22 %>%
  filter(SMKCIGST_A < 5 & LUNGCAN_A <3)

dfSmokeCan <-
  dfSmokeCan |>
    group_by(dfSmokeCan$SMKCIGST_A) |>
    mutate(Smoke = ifelse(SMKCIGST_A == "4",
                                0,
                                 1)) |> 
  ungroup()

model <- glm(Smoke ~ LUNGCAN_A, data = dfSmokeCan,
             family = binomial(link = 'logit'))

model$coefficients

```

```{r}
sigmoid <- \(x) 1 / (1 + exp(-(3.264507 -1.755153 * x)))

dfSmokeCan |>
  ggplot(mapping = aes(x = LUNGCAN_A, y = Smoke)) +
  geom_jitter(width = 0, height = 0.1, shape = 'O', size = 3) +
  geom_function(fun = sigmoid, color = 'blue', linewidth = 1) +
  labs(title = "Binary Response: Lung Cancer and Smoking") +
  scale_y_continuous(breaks = c(0, 1)) +
  theme_minimal()
```

This graph shows that for those who have lung cancer (1), there is a
higher probability of them being a smoker or former smoker(1).

### 100 Cigarettes Smoked and Lung Cancer - Binary Response

We can also split the smokers up into those who have smoked 100
cigarettes, and those who have not.

```{r}
dfSmoke100 <- adult22 %>%
  filter(SMKEV_A < 3 & LUNGCAN_A <3)

dfSmoke100Can <-
  dfSmoke100 |>
    group_by(dfSmoke100$SMKEV_A) |>
    mutate(Smoke_100 = ifelse(SMKEV_A == "2",
                                0,
                                 1)) |> 
  ungroup()

model <- glm(Smoke_100 ~ LUNGCAN_A, data = dfSmoke100Can,
             family = binomial(link = 'logit'))

model$coefficients

```

```{r}
sigmoid <- \(x) 1 / (1 + exp(-(3.263094 -1.753740 * x)))

dfSmoke100Can |>
  ggplot(mapping = aes(x = LUNGCAN_A, y = Smoke_100)) +
  geom_jitter(width = 0, height = 0.1, shape = 'O', size = 3) +
  geom_function(fun = sigmoid, color = 'blue', linewidth = 1) +
  labs(title = "Binary Response: Lung Cancer and Smoked 100 Cigarettes") +
  scale_y_continuous(breaks = c(0, 1)) +
  theme_minimal()

```

This graph shows that for those who have lung cancer (1), there is a
higher probability of them having smoked 100 cigarettes(1). The model is
extremely close to the previous model. So, how many of the smokers with
lung cancer smoked 100 cigarettes?

```{r}
LungCan <- nrow(dfSmoke100[dfSmoke100$LUNGCAN_A == '1',])
Smoked100 <- nrow(dfSmoke100[dfSmoke100$SMKEV_A == '1',])
Smoked100Can <- nrow(dfSmoke100[dfSmoke100$SMKEV_A == '1' & dfSmoke100$LUNGCAN_A == '1', ])
NotSmoked100Can <- nrow(dfSmoke100[dfSmoke100$SMKEV_A == '2' & dfSmoke100$LUNGCAN_A == '1', ])

LungCan
Smoked100Can
95/116*100
Smoked100
NotSmoked100Can
95/1511*100
```

Out of 116 people who stated they have lung cancer, 95 people, or 81.9%
had smoked at least 100 cigarettes. But out of 1511 people who have
smoked at least 100 cigarettes, only 21, or 6.3% have lung cancer.

# Covid

Month of last Covid vaccine for those who got it for the first time.

```{r}
fCovidVax <- adult22

dfCovidVaxFirst <- dfCovidVax %>%
  filter(CVDVAC1M_A != 'NA' & CVDVAC1Y_A !='NA')

dfCovidVaxLast <- dfCovidVax %>%
  filter(CVDVAC2M_A != 'NA' & CVDVAC2Y_A !='NA')

dfCovidVaxFirst$LastCovidVaxFirstTime <- with(dfCovidVaxFirst, sprintf("%d-01-%02d", CVDVAC1M_A, CVDVAC1Y_A))

dfCovidVaxLast$LastCovidVaxNotFirstTime <- with(dfCovidVaxLast, sprintf("%d-01-%02d", CVDVAC2M_A, CVDVAC2Y_A))

dfCovidVaxFirst_ <- dfCovidVaxFirst %>%
  filter(CVDDIAG_A < 3)

dfCovidVaxLast_ <- dfCovidVaxLast %>%
  filter(CVDDIAG_A < 3)

dfCovidVaxFirst1 <- dfCovidVaxFirst_ |>
  select(LastCovidVaxFirstTime, CVDDIAG_A)

dfCovidVaxLast1 <- dfCovidVaxLast_ |>
  select(LastCovidVaxNotFirstTime, CVDDIAG_A)

dftransf <- dfCovidVaxFirst1 |>
  select(LastCovidVaxFirstTime, CVDDIAG_A)

dftransf['date']<- as.Date(dftransf$LastCovidVaxFirstTime, format="%m-%d-%Y")

dftransfilt <- dftransf %>% filter(!is.na(date))

dfFirstV <- dftransfilt |>
  select(date, CVDDIAG_A)

dffirstVCount<- dfFirstV %>% group_by(date) %>%tally() 

dfMerge <- merge(dffirstVCount, dfFirstVSum, by = "date", all.x=TRUE, all.y=TRUE)

dfMerge['percent'] = dfMerge['Frequency']/dfMerge['n']

dfMerge <- dfMerge %>% filter(percent >.0000001)

```

```{r}
dfMerge |>
  ggplot(mapping = aes(x = date, y = n)) +
  geom_line() +
  labs(title = "Date of First Covid Vaccine") +
  theme_hc() +
  scale_x_date(date_labels = "%Y (%b)") + ylab("Number of People")

```

The first vaccine came out in Dec 2021, which is around where the graph
starts to increase. There was an increase in April and May 2021. In June
2021 when it started to decrease, there was a slowdown in people getting
the vaccine and a record low number of cases.

Source:
<https://www.kff.org/coronavirus-covid-19/poll-finding/kff-covid-19-vaccine-monitor-june-2021/>

On September 9 2021, Biden made an announcement requiring federal
employees to be vaccinated by the end of January 2022, which is around
the time of the largest peak between Sept 2021 and January 2022.

Source:
<https://www.whitehouse.gov/briefing-room/presidential-actions/2021/09/09/executive-order-on-requiring-coronavirus-disease-2019-vaccination-for-federal-employees/>

# Cancer

### Cancer by Sex

#### Told has cancer.

```{r}

dfFemale = adult22 %>%
  filter(SEX_A == 2)

dfFemaleCan = dfFemale %>%
  filter(CANEV_A == 1)

dfMale = adult22 %>%
  filter(SEX_A == 1)

dfMaleCan = dfMale %>%
  filter(CANEV_A == 1)

TotalF <- nrow(dfFemale)
TotalM <- nrow(dfMale)
PercentF <- (nrow(dfFemaleCan)/nrow(dfFemale))*100
PercentM <- (nrow(dfMaleCan)/nrow(dfMale))*100

PercentCan <- data.frame(                                            
   Sex  = c("Female", "Male"),       
   Percent= c((PercentF), (PercentM)))

ggplot(PercentCan, aes(x = Sex, y=Percent)) +
  geom_point()
```

#### Percent by Type of Cancer

```{r}
dfCan <- adult22 %>%
  filter(NUMCAN_A < 7 & NUMCAN_A !=0)

dfBlad <- dfCan %>%
  filter(BLADDCAN_A == 1)

dfBlood <- dfCan %>%
  filter(BLOODCAN_A == 1)

dfBrain <- dfCan %>%
  filter(BRAINCAN_A == 1)

dfBone <- dfCan %>%
  filter(BONECAN_A == 1)

dfBreast <- dfCan %>%
  filter(BREASCAN_A == 1)

dfCervical <- dfCan %>%
  filter(CERVICAN_A == 1)

dfEsoph <- dfCan %>%
  filter(BLADDCAN_A == 1)

dfGall <- dfCan %>%
  filter(ESOPHCAN_A == 1)

dfLarynx <- dfCan %>%
  filter(LARYNCAN_A == 1)

dfLeuk <- dfCan %>%
  filter(LEUKECAN_A == 1)

dfLiver <- dfCan %>%
  filter(LIVERCAN_A == 1)

dfLung <- dfCan %>%
  filter(LUNGCAN_A == 1)

dfLymph <- dfCan %>%
  filter(LYMPHCAN_A == 1)

dfMela <- dfCan %>%
  filter(MELANCAN_A == 1)

dfMouth <- dfCan %>%
  filter(MOUTHCAN_A == 1)

dfOvary <- dfCan %>%
  filter(OVARYCAN_A == 1)

dfPanc <- dfCan %>%
  filter(PANCRCAN_A== 1)

dfProst <- dfCan %>%
  filter(PROSTCAN_A == 1)

dfSkinMela <- dfCan %>%
  filter(SKNMCAN_A == 1)

dfSkinNoMela <- dfCan %>%
  filter(SKNNMCAN_A == 1)

dfSkinUnknown <- dfCan %>%
  filter(SKNDKCAN_A == 1)

dfStomach <- dfCan %>%
  filter(STOMACAN_A == 1)

dfThroat <- dfCan %>%
  filter(THROACAN_A == 1)

dfThyroid <- dfCan %>%
  filter(THYROCAN_A == 1)

dfUterus <- dfCan %>%
  filter(UTERUCAN_A == 1)

dfHandNeck <- dfCan %>%
  filter(HDNCKCAN_A == 1)

dfColRec <- dfCan %>%
  filter(COLRCCAN_A == 1)

dfOther <- dfCan %>%
  filter(OTHERCANP_A == 1)

```

```{r}
TBladF <- nrow(dfBlad %>%
  filter(SEX_A == 2))/TotalF*100
TBloodF <- nrow(dfBlood %>%
  filter(SEX_A == 2))/TotalF*100
TBoneF <- nrow(dfBone %>%
  filter(SEX_A == 2))/TotalF*100
TBrainF <- nrow(dfBrain %>%
  filter(SEX_A == 2))/TotalF*100
TBreastF <- nrow(dfBreast %>%
  filter(SEX_A == 2))/TotalF*100
TEsophF <- nrow(dfEsoph %>%
  filter(SEX_A == 2))/TotalF*100
TCervF <- nrow(dfCervical %>%
  filter(SEX_A == 2))/TotalF*100
TGallF <- nrow(dfGall %>%
  filter(SEX_A == 2))/TotalF*100
TLarynxF <- nrow(dfLarynx %>%
  filter(SEX_A == 2))/TotalF*100
TLeukF <- nrow(dfLeuk %>%
  filter(SEX_A == 2))/TotalF*100
TLungF <- nrow(dfLung %>%
  filter(SEX_A == 2))/TotalF*100
TLiverF <- nrow(dfLiver %>%
  filter(SEX_A == 2))/TotalF*100
TMelaF <- nrow(dfMela %>%
  filter(SEX_A == 2))/TotalF*100
TMouthF <- nrow(dfMouth %>%
  filter(SEX_A == 2))/TotalF*100
TOvaryF <- nrow(dfOvary %>%
  filter(SEX_A == 2))/TotalF*100
TPancF <- nrow(dfBlad %>%
  filter(SEX_A == 2))/TotalF*100
TSkinMelaF <- nrow(dfSkinMela %>%
  filter(SEX_A == 2))/TotalF*100
TSkinNoMelaF <- nrow(dfSkinNoMela %>%
  filter(SEX_A == 2))/TotalF*100
TSkinUnknownF <- nrow(dfSkinUnknown %>%
  filter(SEX_A == 2))/TotalF*100
TStomachF <- nrow(dfStomach %>%
  filter(SEX_A == 2))/TotalF*100
TThyroidF <- nrow(dfThyroid %>%
  filter(SEX_A == 2))/TotalF*100
TThroatF <- nrow(dfThroat %>%
  filter(SEX_A == 2))/TotalF*100
TUterusF <- nrow(dfUterus %>%
  filter(SEX_A == 2))/TotalF*100
THandNeckF <- nrow(dfHandNeck %>%
  filter(SEX_A == 2))/TotalF*100
TColRecF <- nrow(dfColRec %>%
  filter(SEX_A == 2))/TotalF*100
TOtherF <- nrow(dfOther %>%
  filter(SEX_A == 2))/TotalF*100
```

##### Cancer Among Females

```{r}
CancerF <- data.frame(                                            
   Cancer  = c("Bladder", "Blood","Brain","Bone","Breast","Esophagus","Cervical","GallBladder","Larynx","Leukemia","Lung","Liver","Melanoma","Mouth","Ovarian","Pancreas","Skin Melanoma","Skin-Not Melanoma","Skin-Unknown","Stomach","Thyroid","Throat","Uterus","Hand and Neck","ColonRectal","Other"),       
   Percent= c((TBladF), (TBloodF),(TBrainF),(TBoneF),(TBreastF),(TEsophF),(TCervF),(TGallF),(TLarynxF),(TLeukF),(TLungF),(TLiverF),(TMelaF),(TMouthF),(TOvaryF),(TPancF),(TSkinMelaF),(TSkinNoMelaF),(TSkinUnknownF),(TStomachF),(TThyroidF),(TThroatF),(TUterusF),(THandNeckF),(TColRecF),(TOtherF)))

ggplot(CancerF, aes(x = Percent, y=Cancer)) +
  geom_point() + labs(title = "Cancer Among Females")
```

##### Cancer Among Males

```{r}
TBladM <- nrow(dfBlad %>%
  filter(SEX_A == 1))/TotalM*100
TBloodM <- nrow(dfBlood %>%
  filter(SEX_A == 1))/TotalM*100
TBoneM <- nrow(dfBone %>%
  filter(SEX_A == 1))/TotalM*100
TBrainM <- nrow(dfBrain %>%
  filter(SEX_A == 1))/TotalM*100
TBreastM <- nrow(dfBreast %>%
  filter(SEX_A == 1))/TotalM*100
TEsophM <- nrow(dfEsoph %>%
  filter(SEX_A == 1))/TotalM*100
TGallM <- nrow(dfGall %>%
  filter(SEX_A == 1))/TotalM*100
TLarynxM <- nrow(dfLarynx %>%
  filter(SEX_A == 1))/TotalM*100
TLeukM <- nrow(dfLeuk %>%
  filter(SEX_A == 1))/TotalM*100
TLungM <- nrow(dfLung %>%
  filter(SEX_A == 1))/TotalM*100
TLiverM <- nrow(dfLiver %>%
  filter(SEX_A == 1))/TotalM*100
TMelaM <- nrow(dfMela %>%
  filter(SEX_A == 1))/TotalM*100
TMouthM <- nrow(dfMouth %>%
  filter(SEX_A == 1))/TotalM*100
TPancM <- nrow(dfPanc %>%
  filter(SEX_A == 1))/TotalM*100
TProstM <- nrow(dfProst %>%
  filter(SEX_A == 1))/TotalM*100
TSkinMelaM <- nrow(dfSkinMela %>%
  filter(SEX_A == 1))/TotalM*100
TSkinNoMelaM <- nrow(dfSkinNoMela %>%
  filter(SEX_A == 1))/TotalM*100
TSkinUnknownM <- nrow(dfSkinUnknown %>%
  filter(SEX_A == 1))/TotalM*100
TStomachM <- nrow(dfStomach %>%
  filter(SEX_A == 1))/TotalM*100
TThyroidM <- nrow(dfThyroid %>%
  filter(SEX_A == 1))/TotalM*100
TThroatM <- nrow(dfThroat %>%
  filter(SEX_A == 1))/TotalM*100
THandNeckM <- nrow(dfHandNeck %>%
  filter(SEX_A == 1))/TotalM*100
TColRecM <- nrow(dfColRec %>%
  filter(SEX_A == 1))/TotalM*100
TOtherM <- nrow(dfOther %>%
  filter(SEX_A == 1))/TotalM*100
```

```{r}
CancerM <- data.frame(                                            
   Cancer  = c("Bladder", "Blood","Brain","Bone","Breast","Esophagus","GallBladder","Larynx","Leukemia","Lung","Liver","Melanoma","Mouth","Pancreas","Skin Melanoma","Skin-Not Melanoma","Skin-Unknown","Stomach","Thyroid","Throat","Hand and Neck","ColonRectal","Other"),       
   Percent= c((TBladM), (TBloodM),(TBrainM),(TBoneM),(TBreastM),(TEsophM),(TGallM),(TLarynxM),(TLeukM),(TLungM),(TLiverM),(TMelaM),(TMouthM),(TPancM),(TSkinMelaM),(TSkinNoMelaM),(TSkinUnknownM),(TStomachM),(TThyroidM),(TThroatM),(THandNeckM),(TColRecM),(TOtherM)))

ggplot(CancerM, aes(x = Percent, y=Cancer)) +
  geom_point() + labs(title = "Cancer Among Males")
```

##### Cancer Among Both

```{r}
df_Canmerge <- merge(CancerF,CancerM,by="Cancer", all.x = TRUE) 

names(df_Canmerge)[names(df_Canmerge) == "Percent.x"] <- "Female"
names(df_Canmerge)[names(df_Canmerge) == "Percent.y"] <- "Male"

```

```{r}
dfCancerMerged <- melt(df_Canmerge, id.vars="Cancer")
names(dfCancerMerged)[names(dfCancerMerged) == "value"] <- "Percent"
names(dfCancerMerged)[names(dfCancerMerged) == "variable"] <- "Sex"

ggplot(dfCancerMerged, aes(Percent,Cancer, col=Sex)) + 
  geom_point() + labs(title = "Cancer Among Males and Females")

```

Both males and females had a higher percentage for Skin-Not Melanoma
cancer. The highest percentage for cancer was breast cancer in
females.In many of the cancer categories, females had a slightly lower
percentage.

# Summary

While many of the ways to stay healthy are commonly known and
advertised, it is interesting to look at the data to get a more specific
idea. While many of these did not have much of a difference between
categories, there is some information that can be gained that follow
many of the health ideas we already know. These include

1.  It is less healthier, mentally and physically, for you to smoke.
    Smoking at least 100 cigarettes can make this worst and if you end
    up with lung cancer, it could be contributed to you smoking.
2.  Cancer is important to watch out for, and there are some
    specifically that are more common based on your sex and overall more
    common.
3.  While it is mostly unavoidable, those who have difficulty seeing can
    have worst life satisfaction, so if you can get it fixed by surgery
    or glasses/contacts, it will be helpful.
4.  It is better to for your mental and physical health to stay around a
    normal BMI. However, this could be argued based on if your weight
    comes from non-muscle fat or muscle fat.
5.  A higher education could potentially lead you to feel more satisfied
    and have better health.
