R Markdown

Dataset - National Health Interview Adult Survey

The dataset is data from the 2021 National Health Interview Adult Survey. The survey contained questions related to household and family composition, demographics about the survey taker, satisfaction with life, health insurance, medication, immunization, preventive screenings, and multiple health problems such as hypertension, cardiovascular conditions, cancer, vision, hearing, mobility, and more.

This survey is important in following the health of American’s based on many different factors of their lives. Looking at previous surveys can also help to see trends in Americans’ health.

Questions

1. Does education level play a role in the mental or physical health?
2. What are some health issues that correlate to other health issues?
3: What health issues are more common among certain demographics?
4: Has COVID possibly had an effect on certain health issues?
5: Is there a link between physical health and mental health?

Columns

General Health

1: Excellent
2: Very Good
3: Good
4: Fair
5: Poor
7: Refused
8: Not Ascertained
9: Don't Know

2: Life Satisfaction

1: Very Satisfied
2: Satisfied
3: Dissatisfied
4: Very Dissatisfied
7: Refused
8: Not Ascertained
9: Don't Know

3: General Demographics

Classification of County Lived In
  1: Large central metro
  2: Large fringe metro
  3: Medium and small metro
  4: Nonmetropolitan
  
Household Region
  1: Northeast
  2: Midwest
  3: South
  4: West
  
Age
  18-84: 18-84 with number corresponding
  85: 85+
  97: Refused
  98: Not Ascertained
  99: Don't Know
  
Age 65+
  1: Less than 65
  2: 65 or older
  7: Refused
  8: Not Ascertained
  9: Don't Know
  
Sex
  1: Male
  2: Female
  7: Refused
  8: Not Ascertained
  9: Don't Know

Education Level

0: Never attended/Kindergarten only
1: Grade 1-11
2: 12th grade, no diploma
3: GED or equivalent
4: High School Graduate
5: Some college, no degree
6: Associate degree: occupational, technical, or vocational program
7: Associate degree: academic program
8: Bachelor's degree
9: Master's degree
10: Professional School or Doctoral degree
97: Refused
98: Not Ascertained
99: Don't Know

Weight

Person's weight in lbs

Height

Person's height in ???

Medical Problems

Questions were laid out as... 
  Told you have (condition)?
  Told you have (condition) on 2 or more visits?
  Had (condition) in past 12 months?
  
...with the possible responses being,
  1: Yes. 1 answered if respondant is taking medication to control the issue
  2: No
  7: Refused
  8: Not Ascertained
  9: Don't Know

Cancer

Types Included
  1. 
  2
  3
  4

Age when first told had (type) cancer?
  1-84: 1-84 years, with the corresponding number
  85: 85+ years
  97: Refused
  98: Not Ascertained
  99: Don't Know

Others

Days Missed Work
  0-129: 0 to 129 with corresponding value
  130: 130+ days
  997: Refused
  998: Not Ascertained
  999: Don't Know"

Most of the column names were unclear until I read the Codebook, however it was often easy to tell what category something fell under such as EDUCP_A, likely had something to do with education, while variable with CAN in them had to do with Cancer. I have an Excel sheet of the data where I have the columns color coded by if I know them from the codebook, if they are not in the codebook, or if I will not be using that column. Some of these unclear ones are the ones that start with DRK, PA18, MOD, VIG, and STR. I am still working on figuring those out.

Among the columns I do know, there are a few that I am unclear about. Among the cancer ones, they are asked what age were they told they have colon-rectal cancer. However, two other questions ask about colon cancer and rectal cancer, so I am trying to figure out if those are the same things, or separated.

dfColonRectal <- adult22[ , c("COLRCAGETC_A", "COLONAGETC_A", "RECTUAGETC_A")]  

dfColonRectalAge <-subset(dfColonRectal, COLRCAGETC_A<="85")
#count(dfColonRectalAge) = 196
#print(dfColonRectalAge)

dfColonRectalAgeTest <-subset(dfColonRectalAge, COLRCAGETC_A==COLONAGETC_A | COLRCAGETC_A==RECTUAGETC_A)
#count(dfColonRectalAgeTest) = 196

Both have 196, so that means they have the same age that they put for ColoRectal in either Colon or Rectal. So this won’t cause problems for the data, I just have to make sure I don’t include ColoRectal and Colon, or ColoRectal and Rectal as separate cancers. Such as if I am counting how many types of cancer one person has.

Weight

dfWeightFilter <- adult22 %>% 
  filter(WEIGHTLBTC_A <= 996)

paste("Mean:",mean(dfWeightFilter$WEIGHTLBTC_A))

## [1] "Mean: 230.906970736928"

paste("Max:",max(dfWeightFilter$WEIGHTLBTC_A))

## [1] "Max: 996"

paste("Min:",min(dfWeightFilter$WEIGHTLBTC_A))

## [1] "Min: 100"

Age

# Age

dfAgeFilter <- adult22 %>% 
  filter(AGEP_A < 97)

paste("Mean:",mean(dfAgeFilter$AGEP_A))

## [1] "Mean: 52.9485989777794"

paste("Max:",max(dfAgeFilter$AGEP_A))

## [1] "Max: 85"

paste("Min:",min(dfAgeFilter$AGEP_A))

## [1] "Min: 18"

paste("Over 85:",nrow(dfAgeFilter[dfAgeFilter$AGEP_A == '85', ]))

## [1] "Over 85: 1002"

paste("Under 85:",nrow(dfAgeFilter[dfAgeFilter$AGEP_A < '85', ]))

## [1] "Under 85: 26585"

Sex

dfSexFilter <- adult22 %>% 
  filter(SEX_A < 7)

dfSexFilter <-
  dfSexFilter |>
    group_by(dfSexFilter$SEX_A) |>
    mutate(Sex_Status = ifelse(SEX_A == 1,
                                 "Male", 
                                 "Female")) |>

  ungroup()
ggplot(dfSexFilter, aes(x = Sex_Status)) +
  geom_bar()

Education Level

dfEduFilter <- adult22 %>% 
  filter(EDUCP_A < 97)


dfEduFilter <-
  dfEduFilter |>
    group_by(dfEduFilter$EDUCP_A) |>
    mutate(Edu_Status = ifelse(EDUCP_A == 1,
                                "Grade 1-11", 
                               ifelse(EDUCP_A == 2,
                                 "12th Grade, no Diploma",
                                 ifelse(EDUCP_A == 3,
                                 "GED or Equivalent",
                                 ifelse(EDUCP_A == 4,
                                 "High School Graduate",
                                 ifelse(EDUCP_A == 5,
                                 "Some College, no Degree",
                                 ifelse(EDUCP_A == 6,
                                 "Associate degree: occupational, technical, or vocational program",
                                 ifelse(EDUCP_A == 7,
                                 "Associate degree: academic program",
                                 ifelse(EDUCP_A == 8,
                                 "Bachelor's degree",
                                 ifelse(EDUCP_A == 9,
                                 "Master's degree ",
                                 ifelse(EDUCP_A == 10,
                                 "Professional School or Doctoral degree",
                                 ifelse(EDUCP_A == 97,
                                 "Refused",
                                "Don't Know")))))))))))) |> 
ungroup()
  
  
dfEduFilter$Edu_Status <- factor(dfEduFilter$Edu_Status, levels = c("Grade 1-11", "12th Grade, no Diploma", "GED or Equivalent","High School Graduate", "Some College, no Degree", "Associate degree: occupational, technical, or vocational program", "Associate degree: academic program", "Bachelor's degree", "Master's degree ", "Professional School or Doctoral degree", "Refused", "Don't Know"))

ggplot(dfEduFilter, aes(x = EDUCP_A, fill=Edu_Status)) +
  geom_bar() + theme(axis.title.x=element_blank(),
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank())

#General Health
dfGHFilter <- adult22 %>% 
  filter(PHSTAT_A < '7')
dfGHFilter <- dfGHFilter %>% 
  filter(AGEP_A < '97')

mean(dfGHFilter$PHSTAT_A)

## [1] 2.439941

ggplot(dfGHFilter, aes(x = PHSTAT_A)) +
  geom_bar()

plot(dfGHFilter$AGEP_A , dfGHFilter$PHSTAT_A)
  abline(lm(dfGHFilter$PHSTAT_A ~ dfGHFilter$AGEP_A), col = "red", lwd = 3)

#Weight and Health

# dfWeightFilter <- adult22[adult22$WEIGHTLBTC_A < '997', ]


plot(dfWeightFilter$WEIGHTLBTC_A)

dfWHFilter <- dfWeightFilter %>% 
  filter(PHSTAT_A <= 6)

dfHighHealth <- dfWHFilter %>%
  filter(PHSTAT_A < 3 )

dfHighWeight <- dfWHFilter %>%
  filter(WEIGHTLBTC_A >= 250 )

Weight1 <- nrow(dfWeightFilter[dfWeightFilter$WEIGHTLBTC_A < '150', ])
Weight2 <- nrow(dfWeightFilter[dfWeightFilter$WEIGHTLBTC_A > '150' & dfWeightFilter$WEIGHTLBTC_A <= '200', ])
Weight3 <- nrow(dfWeightFilter[dfWeightFilter$WEIGHTLBTC_A > '200' & dfWeightFilter$WEIGHTLBTC_A <= '250', ])
Weight4 <- nrow(dfWeightFilter[dfWeightFilter$WEIGHTLBTC_A <= '250', ])

dfWeightCount <- data.frame(Weight1, Weight2, Weight3, Weight4)
print(dfWeightCount)

##   Weight1 Weight2 Weight3 Weight4
## 1    6451   11717    5008   24210

plot(dfWHFilter$WEIGHTLBTC_A, dfWHFilter$PHSTAT_A, xlab = "Weight", ylab = "General Health")

plot(dfHighWeight$WEIGHTLBTC_A, dfHighWeight$PHSTAT_A, xlab = "Weight", ylab = "General Health")

hist(dfWeightFilter$WEIGHTLBTC_A, )

#Weight and Height

plot(adult22$WEIGHTLBTC_A, adult22$HEIGHTTC_A)

# Group_By


dfEdu <- adult22 %>% group_by(adult22$EDUCP_A)

mean(dfEdu$EDUCP_A)

## [1] 6.443528

# which is an associate degree

# Probability of at least an associate degree (6, 7, 8, 9, 10)

prob_Associate_Up<- nrow(dfEdu[dfEdu$EDUCP_A >= '6' & dfEdu$EDUCP_A <= '10', ])

prob_All <- nrow(dfEdu)

prob_Associate_Up/prob_All

## [1] 0

# Probability of below grade 12

prob_Under_12 <- nrow(dfEdu[dfEdu$EDUCP_A <= '1', ])

prob_Under_12/prob_All

## [1] 0.06802647

# Probability of associate or higher and positive life satisfaction

prob_Associate_Satisfied <- nrow(dfEdu[dfEdu$EDUCP_A >= '6' & dfEdu$EDUCP_A <= '10' & dfEdu$LSATIS4_A <= '2', ])

prob_Associate_Satisfied/prob_Associate_Up

## [1] NaN

#Probability of below grade 12 and satisfied
prob_Under12_Satisfied <- nrow(dfEdu[dfEdu$EDUCP_A <= '1' & dfEdu$LSATIS4_A <= '2', ])

prob_Under12_Satisfied/prob_Under_12

## [1] 0.917597

plot

## function (x, y, ...) 
## UseMethod("plot")
## <bytecode: 0x7f857b6650b8>
## <environment: namespace:base>

#Probability of normal BMI(18.5 to 24.9) and general health

dfHealth <- adult22 %>% group_by(adult22$PHSTAT_A)

prob_NormBMI <- nrow(dfHealth[dfHealth$BMICAT_A == '2', ])

prob_NormBMI/prob_All

## [1] 0.307186

prob_NormBMI_GoodHealth <- nrow(dfHealth[dfHealth$BMICAT_A == '2' & dfHealth$PHSTAT_A <= '4', ])

prob_NormBMI_GoodHealth/prob_NormBMI

## [1] 0.970332

#Probability of overweight BMI and positive/negative health

prob_OverweightBMI <- nrow(dfHealth[dfHealth$BMICAT_A == '3', ])

prob_OverweightBMI/prob_All

## [1] 0.3357926

prob_OverweightBMI_GoodHealth <- nrow(dfHealth[dfHealth$BMICAT_A == '3' & dfHealth$PHSTAT_A <= '4', ])

prob_OverweightBMI_GoodHealth/prob_OverweightBMI

## [1] 0.9696284

prob_OverweightBMI_BadHealth <- nrow(dfHealth[dfHealth$BMICAT_A == '3' & dfHealth$PHSTAT_A == '5', ])

prob_OverweightBMI_BadHealth/prob_OverweightBMI

## [1] 0.02994076

prob_GoodHealth <- nrow(dfHealth[dfHealth$PHSTAT_A <= '4', ])

# How many of all BMIs considered themselves to be in good health

prob_GoodHealth/prob_All

## [1] 0.9626053

# About 96% of people considered themselves to be in good, or greater health. Even among different BMIs, the percent that considered themselves to be in good health was above 90%.

# Why do most people see themselves to be in good health, or were most of the survey takers healthy in general? -- Check the more specific medical issues

BMI New Column

# Sort BMI by Underweight, Normal, Overweight, Obese

adult22_raw <- adult22

adult22BMI <- adult22_raw

adult22BMI <-
  adult22BMI |>
    group_by(adult22BMI$BMICAT_A) |>
    mutate(BMI_Status = ifelse(BMICAT_A == 1,
                                 "Under", 
                               ifelse(BMICAT_A == 3,
                                 "Over",
                                 ifelse(BMICAT_A == 4,
                                 "Obese",
                                 ifelse(BMICAT_A,
                                 "Normal",
                                 "Unknown"))))) |>
    ungroup()

dfAllBMI <- adult22BMI %>%
  filter(BMICAT_A < 5)

nrow(dfAllBMI[dfAllBMI$BMICAT_A == '1',])

## [1] 432

nrow(dfAllBMI[dfAllBMI$BMICAT_A == '2',])

## [1] 8494

nrow(dfAllBMI[dfAllBMI$BMICAT_A == '3',])

## [1] 9285

nrow(dfAllBMI[dfAllBMI$BMICAT_A == '4',])

## [1] 8814

hist(dfAllBMI$BMICAT_A)

# Life Satisfaction and General Health

prob_GoodLS_Health <- nrow(dfHealth[dfHealth$LSATIS4_A <= '2' & dfHealth$PHSTAT_A <= '4', ])
prob_GoodLS_Health/prob_All

## [1] 0.9275976

#Prob out of those who have high general health
prob_GoodLS_Health/prob_GoodHealth

## [1] 0.9636323

#Bad life satisfaction and bad health out of all
prob_BadLS_Health <- nrow(dfHealth[dfHealth$LSATIS4_A >= '3' & dfHealth$LSATIS4_A <=4 & dfHealth$PHSTAT_A == '5', ])

prob_BadLS_Health/prob_All

## [1] 0.01182597

#Bad life satisfaction among those with low health
prob_Low_LS <- nrow(dfHealth[dfHealth$PHSTAT_A == '5',])

prob_BadLS_Health/prob_Low_LS

## [1] 0.3180934

plot(adult22$EDUCP_A , adult22$LSATIS4_A)
  abline(lm(adult22$LSATIS4_A ~ adult22$EDUCP_A), col = "red", lwd = 3)

Because the survey was mostly multiple choice, there are not any major anomalies. The only thing that falls out of the typical range of responses are the “don’t know, refuse, or not ascertained” but even those have specific values that are consistent across questions.

There were a few strange ones among these, such as a few people putting “don’t know/not ascertained” for their age, which is something they should know. Probably a wrong click or just not paying attention?

Education Dataframe Samples

dfEduSample <- dfEdu[ , c("EDUCP_A")]  
dfEdu1 <- sample_n(dfEduSample,100, replace = TRUE)
dfEdu2 <- sample_n(dfEduSample,100, replace = TRUE)
dfEdu3 <- sample_n(dfEduSample,100, replace = TRUE)
dfEdu4 <- sample_n(dfEduSample,100, replace = TRUE)
dfEdu5 <- sample_n(dfEduSample,100, replace = TRUE)

print(dfEdu1)

## # A tibble: 100 × 1
##    EDUCP_A
##      <int>
##  1       7
##  2       5
##  3       5
##  4       9
##  5       4
##  6       8
##  7       8
##  8       7
##  9       8
## 10       9
## # ℹ 90 more rows

paste("Sample 1 Mean:", mean(dfEdu1$EDUCP_A))

## [1] "Sample 1 Mean: 5.64"

print(dfEdu2)

## # A tibble: 100 × 1
##    EDUCP_A
##      <int>
##  1       5
##  2       1
##  3       5
##  4       9
##  5       8
##  6       4
##  7       4
##  8       8
##  9       4
## 10       8
## # ℹ 90 more rows

paste("Sample 2 Mean:", mean(dfEdu2$EDUCP_A))

## [1] "Sample 2 Mean: 6.13"

print(dfEdu3)

## # A tibble: 100 × 1
##    EDUCP_A
##      <int>
##  1       8
##  2       4
##  3       8
##  4       9
##  5       4
##  6       6
##  7       8
##  8       4
##  9       5
## 10       4
## # ℹ 90 more rows

paste("Sample 3 Mean:", mean(dfEdu3$EDUCP_A))

## [1] "Sample 3 Mean: 5.92"

print(dfEdu4)

## # A tibble: 100 × 1
##    EDUCP_A
##      <int>
##  1       7
##  2       4
##  3       4
##  4       8
##  5       7
##  6       4
##  7       6
##  8       8
##  9       9
## 10       6
## # ℹ 90 more rows

paste("Sample 4 Mean:", mean(dfEdu4$EDUCP_A))

## [1] "Sample 4 Mean: 6.35"

print(dfEdu5)

## # A tibble: 100 × 1
##    EDUCP_A
##      <int>
##  1       1
##  2       8
##  3       8
##  4       4
##  5       4
##  6       2
##  7       7
##  8       1
##  9       5
## 10       9
## # ℹ 90 more rows

paste("Sample 5 Mean:", mean(dfEdu5$EDUCP_A))

## [1] "Sample 5 Mean: 5.35"

# The average tends to be between 5 (some college) and 8 (Bachelor's degree), among all the samples. However if any sample ends up with the 97,98, or 99 that correspond with "don't know", then the sample will be greatly skewed.

dfWeightHeightSample <- dfHealth[ , c("WEIGHTLBTC_A", "HEIGHTTC_A")]  
dfWH1 <- sample_n(dfWeightHeightSample,100, replace = TRUE)
dfWH2 <- sample_n(dfWeightHeightSample,100, replace = TRUE)
dfWH3 <- sample_n(dfWeightHeightSample,100, replace = TRUE)
dfWH4 <- sample_n(dfWeightHeightSample,100, replace = TRUE)
dfWH5 <- sample_n(dfWeightHeightSample,100, replace = TRUE)
print(dfWH1)

## # A tibble: 100 × 2
##    WEIGHTLBTC_A HEIGHTTC_A
##           <int>      <int>
##  1          105         66
##  2          175         67
##  3          170         66
##  4          185         73
##  5          125         65
##  6          265         73
##  7          999         68
##  8          165         62
##  9          160         60
## 10          210         72
## # ℹ 90 more rows

print(dfWH2)

## # A tibble: 100 × 2
##    WEIGHTLBTC_A HEIGHTTC_A
##           <int>      <int>
##  1          116         63
##  2          212         66
##  3          140         64
##  4          120         62
##  5          175         67
##  6          135         65
##  7          999         67
##  8          200         69
##  9          140         64
## 10          252         61
## # ℹ 90 more rows

print(dfWH3)

## # A tibble: 100 × 2
##    WEIGHTLBTC_A HEIGHTTC_A
##           <int>      <int>
##  1          225         62
##  2          164         67
##  3          180         64
##  4          175         71
##  5          149         67
##  6          996         96
##  7          182         71
##  8          180         64
##  9          130         70
## 10          135         68
## # ℹ 90 more rows

print(dfWH4)

## # A tibble: 100 × 2
##    WEIGHTLBTC_A HEIGHTTC_A
##           <int>      <int>
##  1          200         60
##  2          173         71
##  3          174         65
##  4          997         97
##  5          212         72
##  6          185         69
##  7          185         67
##  8          142         64
##  9          230         63
## 10          190         69
## # ℹ 90 more rows

print(dfWH5)

## # A tibble: 100 × 2
##    WEIGHTLBTC_A HEIGHTTC_A
##           <int>      <int>
##  1          997         97
##  2          190         64
##  3          190         66
##  4          165         70
##  5          260         76
##  6          137         60
##  7          145         60
##  8          165         69
##  9          200         66
## 10          137         64
## # ℹ 90 more rows

plot(dfWH1$WEIGHTLBTC_A,dfWH1$HEIGHTTC_A,type="p",main="Normal Distribution",xlab="Weight(lbs)",ylab="Height")
 points(dfWH2$WEIGHTLBTC_A,dfWH2$HEIGHTTC_A, col="green")
 points(dfWH3$WEIGHTLBTC_A,dfWH3$HEIGHTTC_A,col="blue")
 points(dfWH4$WEIGHTLBTC_A,dfWH4$HEIGHTTC_A,col="red")
 points(dfWH5$WEIGHTLBTC_A,dfWH5$HEIGHTTC_A,col="yellow")
abline(lm(dfWeightHeightSample$HEIGHTTC_A ~ dfWeightHeightSample$WEIGHTLBTC_A), col = "red", lwd = 3)

dfGenHealthSample <- dfHealth[ , c("PHSTAT_A")]  
dfGH1 <- sample_n(dfGenHealthSample,100, replace = TRUE)
dfGH2 <- sample_n(dfGenHealthSample,100, replace = TRUE)
dfGH3 <- sample_n(dfGenHealthSample,100, replace = TRUE)
dfGH4 <- sample_n(dfGenHealthSample,100, replace = TRUE)
dfGH5 <- sample_n(dfGenHealthSample,100, replace = TRUE)

# Average
print(mean(dfGH1$PHSTAT_A))

## [1] 2.41

print(mean(dfGH2$PHSTAT_A))

## [1] 2.55

print(mean(dfGH3$PHSTAT_A))

## [1] 2.37

print(mean(dfGH4$PHSTAT_A))

## [1] 2.38

print(mean(dfGH5$PHSTAT_A))

## [1] 2.57

# The average tends to be between 2 and 3, which makes sense because the general health among all survey takers is often a 2 (Very good) or 3 (Good).

Looking at data among cancer types

Types:

BLADDCAN_A BLOODCAN_A BONECAN_A BRAINCAN_A BREASCAN_A CERVICAN_A ESOPHCAN_A GALLBCAN_A LARYNCAN_A LEUKECAN_A LIVERCAN_A LUNGCAN_A LYMPHCAN_A MELANCAN_A MOUTHCAN_A OVARYCAN_A PANCRCAN_A PROSTCAN_A SKNMCAN_A SKNNMCAN_A SKNDKCAN_A STOMACAN_A THROACAN_A THYROCAN_A UTERUCAN_A HDNCKCAN_A COLRCCAN_A OTHERCANP_A

Number of reported cancers: NUMCAN_A

Age Told has Cancer

BLADDAGETC_A BLOODAGETC_A BONEAGETC_A BRAINAGETC_A BREASAGETC_A CERVIAGETC_A COLONAGETC_A ESOPHAGETC_A GALLBAGETC_A LARYNAGETC_A LEUKEAGETC_A LIVERAGETC_A LUNGAGETC_A LYMPHAGETC_A MELANAGETC_A MOUTHAGETC_A OVARYAGETC_A PANCRAGETC_A PROSTAGETC_A SKNMAGETC_A SKNNMAGETC_A SKNDKAGETC_A STOMAAGETC_A THROAAGETC_A THYROAGETC_A UTERUAGETC_A HDNCKAGETC_A COLRCAGETC_A OTHERAGETC_A

# Cancers df

dfCancer <- adult22 %>% 
  filter(NUMCAN_A > 0 & NUMCAN_A < 7)

ggplot(dfCancer, aes(x = NUMCAN_A)) +
  geom_bar()

ggplot(dfCancer, aes(NUMCAN_A, LSATIS4_A, colour=NUMCAN_A)) + 
    geom_line() + 
    geom_point()

ggplot(dfCancer, aes(NUMCAN_A, AGEP_A, colour=NUMCAN_A)) + 
    geom_line() + 
    geom_point()

plot(dfCancer$AGEP_A, dfCancer$NUMCAN_A)
  abline(lm(dfCancer$NUMCAN_A ~ dfCancer$AGEP_A), col = "red", lwd = 3)

# Age CI of those with cancer
resultCAN <- t.test(dfCancer$AGEP_A)
confidence_intervalCAN <- resultCAN$conf.int
confidence_intervalCAN

## [1] 68.11267 68.97246
## attr(,"conf.level")
## [1] 0.95

mean(dfCancer$AGEP_A)

## [1] 68.54257

# Age CI of those without cancer
dfNoCancer <- adult22 %>% 
  filter(NUMCAN_A == 0)

# Age CI of those with no cancer
resultNOCAN <- t.test(dfNoCancer$AGEP_A)
confidence_intervalNONE <- resultNOCAN$conf.int
confidence_intervalNONE

## [1] 50.60551 51.06352
## attr(,"conf.level")
## [1] 0.95

mean(dfNoCancer$AGEP_A)

## [1] 50.83452

# Age CI of all
result <- t.test(adult22$AGEP_A)
confidence_interval <- result$conf.int
confidence_interval

## [1] 52.83239 53.26945
## attr(,"conf.level")
## [1] 0.95

mean(adult22$AGEP_A)

## [1] 53.05092

Hypothesis 1

Those with normal BMI have higher average life satisfaction and physical health than those who are underweight or overweight. The lower the value, the higher the life satisfaction is

Average Normal BMI Life Satisfaction < Averagee Underweigth BMI Life Satisfaction, Average Overweight Life Satisfaction, Average Obese Life Satisfaction

Same goes for General Physical Health and BMI

dfFilteredLS <- adult22 %>%
  filter(BMICAT_A < 5 & LSATIS4_A <7)

cohen.d(dfFilteredLS$BMICAT_A, dfFilteredLS$LSATIS4_A)

## 
## Cohen's d
## 
## d estimate: 1.877823 (large)
## 95 percent confidence interval:
##    lower    upper 
## 1.857558 1.898087

# Effect size is 1.500028

dfFilteredPH <- adult22 %>%
  filter(BMICAT_A < 5 & PHSTAT_A <7)

cohen.d(dfFilteredPH$BMICAT_A, dfFilteredPH$PHSTAT_A)

## 
## Cohen's d
## 
## d estimate: 0.5713476 (medium)
## 95 percent confidence interval:
##     lower     upper 
## 0.5541439 0.5885514

#Effect size is 0.5916467

#got error of out of workspace until I added the simulate.p.value. In then was taking a very long time to run the cell.

#fisher.test(select(adult22, BMICAT_A, LSATIS4_A), simulate.p.value = TRUE)
#fisher.test(select(adult22, BMICAT_A, PHSTAT_A), simulate.p.value = TRUE)

dfFilteredBMI <- adult22 %>%
  filter(BMICAT_A < 5)

sd(dfFilteredBMI$BMICAT_A)

## [1] 0.8390505

sd(dfFilteredLS$LSATIS4_A)

## [1] 0.6045232

sd(dfFilteredPH$PHSTAT_A)

## [1] 1.054588

chisq.test(dfFilteredPH$BMICAT_A, dfFilteredPH$PHSTAT_A)

## 
##  Pearson's Chi-squared test
## 
## data:  dfFilteredPH$BMICAT_A and dfFilteredPH$PHSTAT_A
## X-squared = 1708.1, df = 12, p-value < 2.2e-16

chisq.test(dfFilteredLS$BMICAT_A, dfFilteredLS$LSATIS4_A)

## Warning in chisq.test(dfFilteredLS$BMICAT_A, dfFilteredLS$LSATIS4_A):
## Chi-squared approximation may be incorrect

## 
##  Pearson's Chi-squared test
## 
## data:  dfFilteredLS$BMICAT_A and dfFilteredLS$LSATIS4_A
## X-squared = 173.42, df = 9, p-value < 2.2e-16

chisq.test(dfFilteredPH$BMICAT_A, dfFilteredPH$PHSTAT_A, simulate.p.value = TRUE)

## 
##  Pearson's Chi-squared test with simulated p-value (based on 2000
##  replicates)
## 
## data:  dfFilteredPH$BMICAT_A and dfFilteredPH$PHSTAT_A
## X-squared = 1708.1, df = NA, p-value = 0.0004998

chisq.test(dfFilteredLS$BMICAT_A, dfFilteredLS$LSATIS4_A, simulate.p.value = TRUE)

## 
##  Pearson's Chi-squared test with simulated p-value (based on 2000
##  replicates)
## 
## data:  dfFilteredLS$BMICAT_A and dfFilteredLS$LSATIS4_A
## X-squared = 173.42, df = NA, p-value = 0.0004998

dfUnderBMI <- adult22 %>%
  filter(BMICAT_A == 1 )

dfUnderBMI <- dfUnderBMI %>%
  filter(LSATIS4_A < 7 )

dfNormalBMI <- adult22 %>%
  filter(BMICAT_A == 2 )

dfNormalBMI <- dfNormalBMI %>%
  filter(LSATIS4_A < 7 )

dfOverBMI <- adult22 %>%
  filter(BMICAT_A == 3 )

dfOverBMI <- dfOverBMI %>%
  filter(LSATIS4_A < 7 )

dfObeseBMI <- adult22 %>%
  filter(BMICAT_A == 4 )

dfObeseBMI <- dfObeseBMI %>%
  filter(LSATIS4_A < 7 )

mean(dfUnderBMI$LSATIS4_A)

## [1] 1.6875

mean(dfNormalBMI$LSATIS4_A)

## [1] 1.570281

mean(dfOverBMI$LSATIS4_A)

## [1] 1.576346

mean(dfObeseBMI$LSATIS4_A)

## [1] 1.670496

Status  = c("Underweight", "Normal BMI", "Overweight", "Obese")

LifeSatisfaction = c(mean(dfUnderBMI$LSATIS4_A), mean(dfNormalBMI$LSATIS4_A), mean(dfOverBMI$LSATIS4_A), mean(dfObeseBMI$LSATIS4_A))

dfPlot <- data.frame(Status, LifeSatisfaction)

ggplot(dfPlot, aes(x=Status, LifeSatisfaction)) +  geom_point(fill='black')

hist(dfUnderBMI$LSATIS4_A)

hist(dfNormalBMI$LSATIS4_A)

hist(dfOverBMI$LSATIS4_A)

hist(dfObeseBMI$LSATIS4_A)

dfUnderBMI <- adult22 %>%
  filter(BMICAT_A == 1 )

dfUnderBMI <- dfUnderBMI %>%
  filter(PHSTAT_A < 7 )

dfNormalBMI <- adult22 %>%
  filter(BMICAT_A == 2 )

dfNormalBMI <- dfNormalBMI %>%
  filter(PHSTAT_A < 7 )

dfOverBMI <- adult22 %>%
  filter(BMICAT_A == 3 )

dfOverBMI <- dfOverBMI %>%
  filter(PHSTAT_A < 7 )

dfObeseBMI <- adult22 %>%
  filter(BMICAT_A == 4 )

dfObeseBMI <- dfObeseBMI %>%
  filter(PHSTAT_A < 7 )

mean(dfUnderBMI$PHSTAT_A)

## [1] 2.516204

mean(dfNormalBMI$PHSTAT_A)

## [1] 2.176284

mean(dfOverBMI$PHSTAT_A)

## [1] 2.360845

mean(dfObeseBMI$PHSTAT_A)

## [1] 2.759814

Status  = c("Underweight", "Normal BMI", "Overweight", "Obese")

PhysicalHealth = c(mean(dfUnderBMI$PHSTAT_A), mean(dfNormalBMI$PHSTAT_A), mean(dfOverBMI$PHSTAT_A), mean(dfObeseBMI$PHSTAT_A))

dfPlot <- data.frame(Status, PhysicalHealth)

ggplot(dfPlot, aes(x=Status, PhysicalHealth)) +  geom_point(fill='black')

#Check with ANOVA \[ H_0 : \text{average Life Satisfaction and Physical Health price are equal across all BMIs} \]

hist(dfFilteredBMI$BMICAT_A)

hist(dfFilteredLS$LSATIS4_A)

hist(dfFilteredPH$PHSTAT_A)

#PHSTAT_A and LSATIS4_A are response variables
#BMICAT_A is eplanatory variable

hist(dfUnderBMI$PHSTAT_A)

hist(dfNormalBMI$PHSTAT_A)

hist(dfOverBMI$PHSTAT_A)

hist(dfObeseBMI$PHSTAT_A)

hist(dfUnderBMI$LSATIS4_A)

hist(dfNormalBMI$LSATIS4_A)

hist(dfOverBMI$LSATIS4_A)

hist(dfObeseBMI$LSATIS4_A)

m <- aov(PHSTAT_A ~ BMICAT_A, data = dfFilteredPH)
summary(m)

##                Df Sum Sq Mean Sq F value Pr(>F)    
## BMICAT_A        1   1309  1309.0    1231 <2e-16 ***
## Residuals   27017  28739     1.1                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

m2 <- aov(LSATIS4_A ~ BMICAT_A, data = dfFilteredLS)
summary(m2)

##                Df Sum Sq Mean Sq F value Pr(>F)    
## BMICAT_A        1     34   33.69    92.5 <2e-16 ***
## Residuals   26955   9817    0.36                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

#P is less than significance, so we reject null hypothesis.

pairwise.t.test(dfFilteredPH$PHSTAT_A, dfFilteredPH$BMICAT_A, p.adjust.method = "bonferroni")

## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  dfFilteredPH$PHSTAT_A and dfFilteredPH$BMICAT_A 
## 
##   1       2       3      
## 2 1.2e-10 -       -      
## 3 0.013   < 2e-16 -      
## 4 8.9e-06 < 2e-16 < 2e-16
## 
## P value adjustment method: bonferroni

pairwise.t.test(dfFilteredLS$LSATIS4_A, dfFilteredLS$BMICAT_A, p.adjust.method = "bonferroni")

## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  dfFilteredLS$LSATIS4_A and dfFilteredLS$BMICAT_A 
## 
##   1       2       3      
## 2 0.00048 -       -      
## 3 0.00108 1.00000 -      
## 4 1.00000 < 2e-16 < 2e-16
## 
## P value adjustment method: bonferroni

boot_ciLS <- function (v, func = median, conf = 0.95, n_iter = 100) {
  boot_func <- \(x, i) func(x[i])
  
  b <- boot(v, boot_func, R = n_iter)
  b <- boot.ci(b, conf = conf, type = "perc")
  
  return(c("lower" = b$percent[4],
           "upper" = b$percent[5]))
}

df_ciLS <- dfFilteredLS |>
  group_by(BMICAT_A) |>
  summarise(ci_lower = boot_ciLS(LSATIS4_A, mean)['lower'],
            mean_LS = mean(LSATIS4_A),
            ci_upper = boot_ciLS(LSATIS4_A, mean)['upper'])

df_ciLS

## # A tibble: 4 × 4
##   BMICAT_A ci_lower mean_LS ci_upper
##      <int>    <dbl>   <dbl>    <dbl>
## 1        1     1.62    1.69     1.76
## 2        2     1.56    1.57     1.58
## 3        3     1.57    1.58     1.59
## 4        4     1.65    1.67     1.68

df_ciLS |>
  ggplot() +
  geom_errorbarh(mapping = aes(y = BMICAT_A, 
                               xmin=ci_lower, xmax=ci_upper,
                               color = '95% C.I.'), height = 0.5) +
  geom_point(mapping = aes(x = mean_LS, y = BMICAT_A,
                           color = 'Group Mean'),
             shape = '|',
             size = 5) +
  scale_color_manual(values=c('black', 'red')) +
  theme_minimal() +
  labs(title = "Life Satisfaction by BMI Category",
       x = "Life Satisfaction",
       y = "BMI Category",
       color = '')

# 1 is underweight, which had way less people in it, so it could mess with the data a bit.

boot_ciPH <- function (v, func = median, conf = 0.95, n_iter = 100) {
  boot_func <- \(x, i) func(x[i])
  
  b <- boot(v, boot_func, R = n_iter)
  b <- boot.ci(b, conf = conf, type = "perc")
  
  return(c("lower" = b$percent[4],
           "upper" = b$percent[5]))
}

df_ciPH <- dfFilteredPH |>
  group_by(BMICAT_A) |>
  summarise(ci_lower = boot_ciPH(PHSTAT_A, mean)['lower'],
            mean_PH = mean(PHSTAT_A),
            ci_upper = boot_ciPH(PHSTAT_A, mean)['upper'])

df_ciPH

## # A tibble: 4 × 4
##   BMICAT_A ci_lower mean_PH ci_upper
##      <int>    <dbl>   <dbl>    <dbl>
## 1        1     2.41    2.52     2.65
## 2        2     2.15    2.18     2.20
## 3        3     2.34    2.36     2.38
## 4        4     2.74    2.76     2.79

df_ciPH |>
  ggplot() +
  geom_errorbarh(mapping = aes(y = BMICAT_A, 
                               xmin=ci_lower, xmax=ci_upper,
                               color = '95% C.I.'), height = 0.5) +
  geom_point(mapping = aes(x = mean_PH, y = BMICAT_A,
                           color = 'Group Mean'),
             shape = '|',
             size = 5) +
  scale_color_manual(values=c('black', 'red')) +
  theme_minimal() +
  labs(title = "General Health by BMI Category",
       x = "General Health",
       y = "BMI Category",
       color = '')

# Underweight category has the same problem as above.

# Both of these show that the average is not the same among BMI Categories.

# Age could also be a factor. 

dfFilteredPHAge <- dfFilteredPH %>%
  filter(AGEP_A < 86)

dfFilteredLSAge <- dfFilteredLS %>%
  filter(AGEP_A < 86)

modelLS <- lm(AGEP_A ~ LSATIS4_A, dfFilteredLSAge)
modelLS$coefficients

## (Intercept)   LSATIS4_A 
## 53.00383657 -0.04623706

modelPH <- lm(AGEP_A ~ PHSTAT_A, dfFilteredPHAge)
modelPH$coefficients

## (Intercept)    PHSTAT_A 
##    42.67638     4.20775

dfFilteredLSAge |>
  ggplot(mapping = aes(x = LSATIS4_A, y = AGEP_A)) +
  geom_point(size = 2) +
  geom_smooth(method = "lm", se = FALSE, color = 'darkblue') + 
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

dfFilteredPHAge |>
  ggplot(mapping = aes(x = PHSTAT_A, y = AGEP_A)) +
  geom_point(size = 2) +
  geom_smooth(method = "lm", se = FALSE, color = 'darkblue') + 
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

# With Life Satisfaction, there does not seem to be much of a regression compared to General Health, based on age.

# Checking Age vs BMI Category
dfFilteredBMIAge <- dfFilteredBMI %>%
  filter(AGEP_A < 86)

boot_ciBMIAge <- function (v, func = median, conf = 0.95, n_iter = 100) {
  boot_func <- \(x, i) func(x[i])
  
  b <- boot(v, boot_func, R = n_iter)
  b <- boot.ci(b, conf = conf, type = "perc")
  
  return(c("lower" = b$percent[4],
           "upper" = b$percent[5]))
}

df_ciBMIAge <- dfFilteredBMI |>
  group_by(BMICAT_A) |>
  summarise(ci_lower = boot_ciBMIAge(AGEP_A, mean)['lower'],
            mean_Age = mean(AGEP_A),
            ci_upper = boot_ciBMIAge(AGEP_A, mean)['upper'])

df_ciBMIAge

## # A tibble: 4 × 4
##   BMICAT_A ci_lower mean_Age ci_upper
##      <int>    <dbl>    <dbl>    <dbl>
## 1        1     48.4     50.4     52.2
## 2        2     51.4     51.9     52.3
## 3        3     54.0     54.4     54.8
## 4        4     52.4     52.7     53.0

df_ciBMIAge |>
  ggplot() +
  geom_errorbarh(mapping = aes(y = BMICAT_A, 
                               xmin=ci_lower, xmax=ci_upper,
                               color = '95% C.I.'), height = 0.5) +
  geom_point(mapping = aes(x = mean_Age, y = BMICAT_A,
                           color = 'Group Mean'),
             shape = '|',
             size = 5) +
  scale_color_manual(values=c('black', 'red')) +
  theme_minimal() +
  labs(title = "BMI Category by Age",
       x = "Age",
       y = "BMI Category",
       color = '')

# Same problem once again with Underweight BMI.

# Age with General Health graph

dfFilteredPHAge <- dfFilteredPH %>%
  filter(AGEP_A < 86)

boot_ciPHAge <- function (v, func = median, conf = 0.95, n_iter = 100) {
  boot_func <- \(x, i) func(x[i])
  
  b <- boot(v, boot_func, R = n_iter)
  b <- boot.ci(b, conf = conf, type = "perc")
  
  return(c("lower" = b$percent[4],
           "upper" = b$percent[5]))
}

df_ciPHAge <- dfFilteredPHAge |>
  group_by(PHSTAT_A) |>
  summarise(ci_lower = boot_ciPHAge(AGEP_A, mean)['lower'],
            mean_Age = mean(AGEP_A),
            ci_upper = boot_ciPHAge(AGEP_A, mean)['upper'])

df_ciPHAge

## # A tibble: 5 × 4
##   PHSTAT_A ci_lower mean_Age ci_upper
##      <int>    <dbl>    <dbl>    <dbl>
## 1        1     46.7     47.1     47.5
## 2        2     50.4     50.9     51.3
## 3        3     54.8     55.2     55.6
## 4        4     59.1     59.7     60.4
## 5        5     63.0     63.9     64.6

df_ciPHAge |>
  ggplot() +
  geom_errorbarh(mapping = aes(y = PHSTAT_A, 
                               xmin=ci_lower, xmax=ci_upper,
                               color = '95% C.I.'), height = 0.5) +
  geom_point(mapping = aes(x = mean_Age, y = PHSTAT_A,
                           color = 'Group Mean'),
             shape = '|',
             size = 5) +
  scale_color_manual(values=c('black', 'red')) +
  theme_minimal() +
  labs(title = "General Health by Age",
       x = "Age",
       y = "General Health",
       color = '')

# General Health decreases age Age increases. (A higher General Health meaning worst)

dfFilteredLSAge <- dfFilteredLS %>%
  filter(AGEP_A < 86)

boot_ciLSAge <- function (v, func = median, conf = 0.95, n_iter = 100) {
  boot_func <- \(x, i) func(x[i])
  
  b <- boot(v, boot_func, R = n_iter)
  b <- boot.ci(b, conf = conf, type = "perc")
  
  return(c("lower" = b$percent[4],
           "upper" = b$percent[5]))
}

df_ciLSAge <- dfFilteredLSAge |>
  group_by(LSATIS4_A) |>
  summarise(ci_lower = boot_ciLSAge(AGEP_A, mean)['lower'],
            mean_Age = mean(AGEP_A),
            ci_upper = boot_ciLSAge(AGEP_A, mean)['upper'])

df_ciLSAge

## # A tibble: 4 × 4
##   LSATIS4_A ci_lower mean_Age ci_upper
##       <int>    <dbl>    <dbl>    <dbl>
## 1         1     53.1     53.4     53.6
## 2         2     52.0     52.3     52.7
## 3         3     54.1     55.3     56.6
## 4         4     55.2     57.5     59.9

df_ciLSAge |>
  ggplot() +
  geom_errorbarh(mapping = aes(y = LSATIS4_A, 
                               xmin=ci_lower, xmax=ci_upper,
                               color = '95% C.I.'), height = 0.5) +
  geom_point(mapping = aes(x = mean_Age, y = LSATIS4_A,
                           color = 'Group Mean'),
             shape = '|',
             size = 5) +
  scale_color_manual(values=c('black', 'red')) +
  theme_minimal() +
  labs(title = "Average Age by Life Satisfaction",
       x = "Age",
       y = "Life Satisfaction",
       color = '')

## It seems that the average Life Satisfaction and General Health are not the same among BMI categories. Additionally, age plays a part in the average General Health, but not in Life Satisfaction and BMI category.

Cancer vs No Cancer

IF LSATIS4_A = 1 or 2, positive life satisfaction. If LSATIS4_A = 3 or 4, life satisfaction is negative.

dfFilteredLS <-
  dfFilteredLS |>
    group_by(dfFilteredLS$LSATIS4_A) |>
    mutate(LifeSatis_Status = ifelse(LSATIS4_A == 1,
                                0, 
                                ifelse(LSATIS4_A == 2,
                                0, 
                                 1))) |> 
ungroup()

plot(dfFilteredLS$LifeSatis_Status)

Life Satisfaction and Age

dfFilteredLSAge <- dfFilteredLS %>%
  filter(AGEP_A < 86)


model <- glm(LifeSatis_Status ~ AGEP_A, data = dfFilteredLSAge,
             family = binomial(link = 'logit'))

model$coefficients

##  (Intercept)       AGEP_A 
## -3.548698218  0.008911942

sigmoid <- \(x) 1 / (1 + exp(-(-3.548698 + 0.0089 * x)))

dfFilteredLSAge |>
  ggplot(mapping = aes(x = AGEP_A, y = LifeSatis_Status)) +
  geom_jitter(width = 0, height = 0.1, shape = 'O', size = 3) +
  geom_function(fun = sigmoid, color = 'blue', linewidth = 1) +
  labs(title = "Life Satisfaction Binary Response with Age") +
  scale_y_continuous(breaks = c(0, 1)) +
  theme_minimal()

0 = -3.459 + .0089(Age) 3.459 = .0089(Age) Age = 3.459/.0089 = 388.65

For every 1 year increase in age, the odds that the person’s Life Satisfaction is negative is multiplied by \(e^{-0.0089}\). At age 388.65, there is a 50/50 chance of having positive life satisfaction or negative life satisfaction.

Cancer vs Life Satisfaction

dfFilteredLSCancer <- dfFilteredLS %>%
  filter(NUMCAN_A < 7)

model <- glm(LifeSatis_Status ~ NUMCAN_A, data = dfFilteredLSCancer,
             family = binomial(link = 'logit'))

model$coefficients

## (Intercept)    NUMCAN_A 
##  -3.1159401   0.2863125

sigmoid <- \(x) 1 / (1 + exp(-(-3.1159 + 0.286 * x)))

dfFilteredLSCancer |>
  ggplot(mapping = aes(x = NUMCAN_A, y = LifeSatis_Status)) +
  geom_jitter(width = 0, height = 0.1, shape = 'O', size = 3) +
  geom_function(fun = sigmoid, color = 'blue', linewidth = 1) +
  labs(title = "Life Satisfaction Binary Response with Number of Cancer") +
  scale_y_continuous(breaks = c(0, 1)) +
  theme_minimal()

0 = -3.1159 + .286(Number of Cancers) 3.1159 = .286(Number of Cancers) Number of Cancers = 3.1159/.286 = 10.89

For every 1 increase in the number of cancers, the odds that the person’s Life Satisfaction is negative is multiplied by \(e^{-0.286}\). When the number of cancers is 10.89, there is a 50/50 chance of having positive life satisfaction or negative life satisfaction.

dfFilteredLSEdu <- dfFilteredLS %>%
  filter(EDUCP_A < 97)

model <- glm(LifeSatis_Status ~ EDUCP_A, data = dfFilteredLSEdu,
             family = binomial(link = 'logit'))

model$coefficients

## (Intercept)     EDUCP_A 
##  -2.3230721  -0.1325631

sigmoid <- \(x) 1 / (1 + exp(-(-2.323 - 0.1326 * x)))

dfFilteredLSEdu |>
  ggplot(mapping = aes(x = EDUCP_A, y = LifeSatis_Status)) +
  geom_jitter(width = 0, height = 0.1, shape = 'O', size = 3) +
  geom_function(fun = sigmoid, color = 'blue', linewidth = 1) +
  labs(title = "Life Satisfaction Binary Response with Education") +
  scale_y_continuous(breaks = c(0, 1)) +
  theme_minimal()

0 = -2.323 - .1326(Education) 2.323 = -.1326(Education) Education = -2.323/.1326 = -17.5

For every 1 increase in the person’s education, the odds that the person’s Life Satisfaction is negative is multiplied by \(e^{0.286}\). When a person’s education level if -17.5, which is greatly below the value of 0, which is “no education/Kindergarten only”, there is a 50/50 chance of a negative life satisfaction or a positive life satisfaction

Life Satisfaction and Days of Work Missed in Last 12 Months

dfFilteredLSWork <- dfFilteredLS %>%
  filter(EMPDYSMSS3_A < 997)

model <- glm(LifeSatis_Status ~ EMPDYSMSS3_A, data = dfFilteredLSWork,
             family = binomial(link = 'logit'))

model$coefficients

##  (Intercept) EMPDYSMSS3_A 
##  -3.57897281   0.01228414

sigmoid <- \(x) 1 / (1 + exp(-(-3.57897 + .0122 * x)))

dfFilteredLSWork |>
  ggplot(mapping = aes(x = EMPDYSMSS3_A, y = LifeSatis_Status)) +
  geom_jitter(width = 0, height = 0.1, shape = 'O', size = 3) +
  geom_function(fun = sigmoid, color = 'blue', linewidth = 1) +
  labs(title = "Life Satisfaction Binary Response with Hours Worked Per Week") +
  scale_y_continuous(breaks = c(0, 1)) +
  theme_minimal()

These all follow close to a log function, or without much of a slope, so there is not a need to transform the explanatory variables. However, some of these variables may fit the Poisson Regression better, considering many of them are count values, and will not be below 0.

#Marriage and Life Satisfaction

dfFilteredLSMar <- dfFilteredLS %>%
  filter(MARITAL_A < 4)

hist(dfFilteredLSMar$MARITAL_A)

dfFilteredLSMar |>
  ggplot(mapping = aes(x = MARITAL_A, y = LSATIS4_A)) +
  geom_point(size = 2) +
  geom_smooth(method = "lm", se = FALSE, color = 'darkblue') + 
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

modelLSM <- lm(LSATIS4_A ~ MARITAL_A, dfFilteredLSMar)
modelLSM$coefficients

## (Intercept)   MARITAL_A 
##   1.3511036   0.1264522

#Smoking and Life Satisfaction: Smoking Frequency

dfFilteredLSSmoke <- dfFilteredLS %>%
  filter(SMKNOW_A < 4 & SMKCIGST_A < 5)

dfFilteredLSSmoke |>
  ggplot(mapping = aes(x = SMKNOW_A, y = LSATIS4_A)) +
  geom_point(size = 2) +
  geom_smooth(method = "lm", se = FALSE, color = 'darkblue') + 
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

modelSmoke <- lm(LSATIS4_A ~ SMKNOW_A, dfFilteredLSSmoke)
modelSmoke$coefficients

## (Intercept)    SMKNOW_A 
##   1.9189437  -0.0961429

modelSmoke3 <- glm(LSATIS4_A ~ SMKNOW_A + MARITAL_A,
             data = dfFilteredLSSmoke, 
             family = poisson(link = 'log'))
vif(modelSmoke3)

##  SMKNOW_A MARITAL_A 
##  1.010728  1.010728

Smoking and Life Satisfaction: Cigarettes Smoked

#How many cigarettes smoked a day, how many days smoked in last 30 days, and LS
dfFilteredLSSmoke <- dfFilteredLSSmoke %>%
  filter(CIGNOW_A < 96 & SMK30D_A < 31)

dfFilteredLSSmoke |>
  ggplot(mapping = aes(x = CIGNOW_A, y = LSATIS4_A)) +
  geom_point(size = 2) +
  geom_smooth(method = "lm", se = FALSE, color = 'darkblue') + 
  theme_minimal()

# Binary LS


dfFilteredLSSmoke |>
  ggplot(mapping = aes(x = CIGNOW_A, y = LifeSatis_Status)) +
  geom_point(size = 2) +
  geom_smooth(method = "lm", se = FALSE, color = 'darkblue') + 
  theme_minimal()

# FOR ALL BELOW
# Get an error which says Warning: Unknown or uninitialised column: `family`. Error in glm(LSATIS4_A ~ CIGNOW_A, dfFilteredLSSmoke) : 'family' not recognized. Looking it up, it seems to be a tibbles bug and I am having trouble fixing it. Did not happen earlier and I had not changed the code when it stopped working.

# modelSmoke <- glm(LSATIS4_A ~ CIGNOW_A, dfFilteredLSSmoke)
# modelSmoke$coefficients

#Add in SMK30D_A
# modelSmoke <- glm(LSATIS4_A ~ CIGNOW_A + SMK30D_A, dfFilteredLSSmoke)
# modelSmoke$coefficients


#model <- glm(LifeSatis_Status ~ CIGNOW_A + SMK30D_A, data = dfFilteredLSSmoke, family = binomial(link = 'logit'))

#model$coefficients

#then get sigmoid based on coefficents and plot it.
# If coefficent for explanatory variable is positive, as seen in the graphs, then the more cigarettes smoked, the lower the person's life satisfaction.

#Week 12: Time-Based Data

CVDVAC1M_A: month of last covid vaccine for those who got it the first time CVDVAC1Y_A: year of last covid vaccine CVDVAC2M_A: month of last covid vaccine for those with 2+ covid vaccines CVDVAC2Y_A: year for above

CVDDIAG_A: Ever had covid? 1=yes, 2=no

dfCovidVax <- adult22

dfCovidVaxFirst <- dfCovidVax %>%
  filter(CVDVAC1M_A != 'NA' & CVDVAC1Y_A !='NA')

dfCovidVaxLast <- dfCovidVax %>%
  filter(CVDVAC2M_A != 'NA' & CVDVAC2Y_A !='NA')

#need day of month, but unknown so will set to 01
dfCovidVaxFirst$LastCovidVaxFirstTime <- with(dfCovidVaxFirst, sprintf("%d-01-%02d", CVDVAC1M_A, CVDVAC1Y_A))

dfCovidVaxLast$LastCovidVaxNotFirstTime <- with(dfCovidVaxLast, sprintf("%d-01-%02d", CVDVAC2M_A, CVDVAC2Y_A))


dfCovidVaxFirst_ <- dfCovidVaxFirst %>%
  filter(CVDDIAG_A < 3)

dfCovidVaxLast_ <- dfCovidVaxLast %>%
  filter(CVDDIAG_A < 3)

#LAst Vax for those who got it for first time
dfCovidVaxFirst1 <- dfCovidVaxFirst_ |>
  select(LastCovidVaxFirstTime, CVDDIAG_A)
#did not remove duplicates each row is a separate person, but this causes issues with tsibble as multiple dates are the same.

##LAst Vax for those who did not get it for first time
dfCovidVaxLast1 <- dfCovidVaxLast_ |>
  select(LastCovidVaxNotFirstTime, CVDDIAG_A)

#View(dfCovidVaxFirst1)

dftransf <- dfCovidVaxFirst1 |>
  select(LastCovidVaxFirstTime, CVDDIAG_A)

dftransf['date']<- as.Date(dftransf$LastCovidVaxFirstTime, format="%m-%d-%Y")

dftransfilt <- dftransf %>% filter(!is.na(date))

#View(dftransfilt)

dfFirstV <- dftransfilt |>
  select(date, CVDDIAG_A)

#View(dfFirstV)

#not able to make a tsibble as I cannot seem to get rid of duplicates
#CovidVaxFirst_ts <- as_tsibble(dfFirstV, index=date) |>
  #index_by(date=date)
  #summarise(num_vax = sum(CVDDIAG_A)) |>
  #fill_gaps()

#change CVDDIAG_A to 0 or 1 
dfFirstV <-
  dfFirstV |>
    group_by(dfFirstV$CVDDIAG_A) |>
    mutate(Got_Covid = ifelse(CVDDIAG_A == 1,
                                1, 
                                 0)) |> 
  ungroup()

#sum for each date
dfFirstVSum<- dfFirstV %>% 
  group_by(date) %>% 
  summarise(Frequency = sum(Got_Covid))

dfFirstVSum

## # A tibble: 41 × 2
##    date       Frequency
##    <date>         <dbl>
##  1 2020-01-01         1
##  2 2020-02-01         2
##  3 2020-03-01         1
##  4 2020-04-01         5
##  5 2020-05-01         3
##  6 2020-06-01         3
##  7 2020-07-01         1
##  8 2020-08-01         5
##  9 2020-09-01         0
## 10 2020-10-01         1
## # ℹ 31 more rows

#percent of people who got first vax on each day, that got covid
dffirstVCount<- dfFirstV %>% group_by(date) %>%tally() 

dfMerge <- merge(dffirstVCount, dfFirstVSum, by = "date", all.x=TRUE, all.y=TRUE)

dfMerge['percent'] = dfMerge['Frequency']/dfMerge['n']


dfMerge <- dfMerge %>% filter(percent >.0000001)

dfMerge

##          date    n Frequency    percent
## 1  2020-01-01    4         1 0.25000000
## 2  2020-02-01    7         2 0.28571429
## 3  2020-03-01   15         1 0.06666667
## 4  2020-04-01   17         5 0.29411765
## 5  2020-05-01   15         3 0.20000000
## 6  2020-06-01    4         3 0.75000000
## 7  2020-07-01    2         1 0.50000000
## 8  2020-08-01    9         5 0.55555556
## 9  2020-10-01    2         1 0.50000000
## 10 2020-11-01    5         2 0.40000000
## 11 2020-12-01   20         7 0.35000000
## 12 2021-01-01  173        61 0.35260116
## 13 2021-02-01  345       112 0.32463768
## 14 2021-03-01  779       252 0.32349166
## 15 2021-04-01 1088       379 0.34834559
## 16 2021-05-01 1010       294 0.29108911
## 17 2021-06-01  698       237 0.33954155
## 18 2021-07-01  519       171 0.32947977
## 19 2021-08-01  718       236 0.32869081
## 20 2021-09-01 1056       277 0.26231061
## 21 2021-10-01 2010       479 0.23830846
## 22 2021-11-01 2669       595 0.22292994
## 23 2021-12-01 2535       625 0.24654832
## 24 2022-01-01 1494       435 0.29116466
## 25 2022-02-01  910       276 0.30329670
## 26 2022-03-01  868       248 0.28571429
## 27 2022-04-01 1085       237 0.21843318
## 28 2022-05-01  698       179 0.25644699
## 29 2022-06-01  469       140 0.29850746
## 30 2022-07-01  299        76 0.25418060
## 31 2022-08-01  197        54 0.27411168
## 32 2022-09-01  457       116 0.25382932
## 33 2022-10-01  593       162 0.27318718
## 34 2022-11-01  204        57 0.27941176
## 35 2022-12-01   29        14 0.48275862

dfMerge |>
  ggplot(mapping = aes(x = date, y = percent)) +
  geom_line() +
  geom_smooth(method = 'lm', color = 'blue', se=FALSE) +
  labs(title = "Percent of People who got First Covid Vax, and got Covid") +
  theme_hc() +
  scale_x_date(date_labels = "%Y (%b)")

## `geom_smooth()` using formula = 'y ~ x'

## First, this graph does not take into account when people got Covid and if it was before or after their first vaccine. ##One thing I noticed is that there were people claiming they got their first Covid vax before it actually came out, so, these may be false imputs before December 2020. In 2021, covid cases began to fall, which lines up with the graph above. Source: https://www.yalemedicine.org/news/covid-timeline

This could be subset into ’Before Vaccine” and into different sets based on which vaccine was obtained. The trendline is not that strong at some points, but from around the beginning of Jan 2021 and the end of 2022, the trend seems to follow.

dfMerge |>
  ggplot(mapping = aes(x = date, y = n)) +
  geom_line() +
  geom_smooth(method = 'lm', color = 'blue', se=FALSE) +
  labs(title = "Number of people who got their first Covid Vaccine") +
  theme_hc() +
  scale_x_date(date_labels = "%Y (%b)")

## `geom_smooth()` using formula = 'y ~ x'

This graph is the number of people who got their first Covid vaccine. The first vaccine came out in Dec 2021, which is around where the graph starts to increase. There was an increase in April and May 2021. In June 2021 when it started to decrease, there was a slowdown in people getting the vaccine and a record low number of cases. Source: https://www.kff.org/coronavirus-covid-19/poll-finding/kff-covid-19-vaccine-monitor-june-2021/

##On September 9, Biden made an announcement requiring federal employees to be vaccinated, which is around the time of the largest peak between Sept 2021 and January 2022. Source: https://www.whitehouse.gov/briefing-room/presidential-actions/2021/09/09/executive-order-on-requiring-coronavirus-disease-2019-vaccination-for-federal-employees/

Data Dive

Julia French

2023-09-04

R Markdown

Dataset - National Health Interview Adult Survey

Questions

Columns

General Health

2: Life Satisfaction

3: General Demographics

Education Level

Weight

Height

Medical Problems

Cancer

Others

Weight

Age

Sex

Education Level

BMI New Column

Because the survey was mostly multiple choice, there are not any major anomalies. The only thing that falls out of the typical range of responses are the “don’t know, refuse, or not ascertained” but even those have specific values that are consistent across questions.

There were a few strange ones among these, such as a few people putting “don’t know/not ascertained” for their age, which is something they should know. Probably a wrong click or just not paying attention?

Education Dataframe Samples

Looking at data among cancer types

Hypothesis 1

Those with normal BMI have higher average life satisfaction and physical health than those who are underweight or overweight. The lower the value, the higher the life satisfaction is

Average Normal BMI Life Satisfaction < Averagee Underweigth BMI Life Satisfaction, Average Overweight Life Satisfaction, Average Obese Life Satisfaction

Same goes for General Physical Health and BMI

Cancer vs No Cancer

Cancer vs Life Satisfaction

Life Satisfaction and Days of Work Missed in Last 12 Months

Smoking and Life Satisfaction: Cigarettes Smoked

This could be subset into ’Before Vaccine” and into different sets based on which vaccine was obtained. The trendline is not that strong at some points, but from around the beginning of Jan 2021 and the end of 2022, the trend seems to follow.