The dataset comes from the adult questionnaire of the 2021 National Health Interview Survey (NHIS). The survey asks about household and family composition, respondent demographics, satisfaction with life, health insurance, medication, immunization, preventive screenings, and a range of health problems including hypertension, cardiovascular conditions, cancer, vision, hearing, and mobility.
This survey is important for tracking the health of Americans across many aspects of their lives. Comparing it with surveys from previous years can also reveal trends in Americans’ health.
1. Does education level play a role in mental or physical health?
2. Which health issues correlate with other health issues?
3. Which health issues are more common among certain demographics?
4. Has COVID possibly had an effect on certain health issues?
5. Is there a link between physical health and mental health?
General Health
1: Excellent
2: Very Good
3: Good
4: Fair
5: Poor
7: Refused
8: Not Ascertained
9: Don't Know
Life Satisfaction
1: Very Satisfied
2: Satisfied
3: Dissatisfied
4: Very Dissatisfied
7: Refused
8: Not Ascertained
9: Don't Know
Classification of County Lived In
1: Large central metro
2: Large fringe metro
3: Medium and small metro
4: Nonmetropolitan
Household Region
1: Northeast
2: Midwest
3: South
4: West
Age
18-84: Age in years (the value is the age)
85: 85+
97: Refused
98: Not Ascertained
99: Don't Know
Age 65+
1: Less than 65
2: 65 or older
7: Refused
8: Not Ascertained
9: Don't Know
Sex
1: Male
2: Female
7: Refused
8: Not Ascertained
9: Don't Know
Education Level
0: Never attended/Kindergarten only
1: Grade 1-11
2: 12th grade, no diploma
3: GED or equivalent
4: High School Graduate
5: Some college, no degree
6: Associate degree: occupational, technical, or vocational program
7: Associate degree: academic program
8: Bachelor's degree
9: Master's degree
10: Professional School or Doctoral degree
97: Refused
98: Not Ascertained
99: Don't Know
Person's weight in lbs
Person's height in inches
Questions were laid out as...
Told you have (condition)?
Told you have (condition) on 2 or more visits?
Had (condition) in past 12 months?
...with the possible responses being,
1: Yes (1 is also recorded if the respondent is taking medication to control the condition)
2: No
7: Refused
8: Not Ascertained
9: Don't Know
Types Included
1.
2
3
4
Age when first told had (type) cancer?
1-84: Age in years (the value is the age)
85: 85+ years
97: Refused
98: Not Ascertained
99: Don't Know
Days Missed Work
0-129: Number of days (the value is the count)
130: 130+ days
997: Refused
998: Not Ascertained
999: Don't Know
Most of the column names were unclear until I read the codebook. It was often easy to tell which category a column fell under, though: EDUCP_A likely had something to do with education, while variables with CAN in the name had to do with cancer. I keep an Excel sheet of the data in which the columns are color coded by whether I found them in the codebook, whether they are missing from the codebook, or whether I will not be using them. The unclear ones include those that start with DRK, PA18, MOD, VIG, and STR. I am still working on figuring those out.
Among the columns I do know, a few are still unclear to me. In the cancer group, one question asks at what age the respondent was told they had colorectal cancer, while two other questions ask the same thing about colon cancer and rectal cancer. I am trying to figure out whether these record the same diagnosis or are tracked separately.
dfColonRectal <- adult22[ , c("COLRCAGETC_A", "COLONAGETC_A", "RECTUAGETC_A")]
# Keep rows with a valid colorectal age (codes above 85 are non-responses)
dfColonRectalAge <- subset(dfColonRectal, COLRCAGETC_A <= 85)
#count(dfColonRectalAge) = 196
#print(dfColonRectalAge)
# Keep rows where the colorectal age matches the colon age or the rectal age
dfColonRectalAgeTest <- subset(dfColonRectalAge, COLRCAGETC_A == COLONAGETC_A | COLRCAGETC_A == RECTUAGETC_A)
#count(dfColonRectalAgeTest) = 196
Both counts are 196, so every respondent with a colorectal age gave the same age under either colon or rectal cancer. This won’t cause problems for the data; I just have to make sure I don’t count colorectal and colon, or colorectal and rectal, as separate cancers, for example when counting how many types of cancer one person has.
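A minimal sketch of that counting rule, using a small made-up data frame rather than the survey rows. It assumes that a cancer type the respondent did not report shows up as NA in its age column, which should be checked against the real file. LUNGAGETC_A stands in for any other cancer type.

```r
# Toy data: age first told, NA when the cancer type was not reported (assumption).
toy <- data.frame(
  COLRCAGETC_A = c(60, NA, 72),
  COLONAGETC_A = c(60, NA, NA),
  RECTUAGETC_A = c(NA, NA, 72),
  LUNGAGETC_A  = c(NA, 55, 70)
)

# Drop the colon and rectal columns so colorectal cancer is counted only once.
keep <- setdiff(names(toy), c("COLONAGETC_A", "RECTUAGETC_A"))

# A type counts as reported when a valid age (85 or below) is present.
reported <- !is.na(toy[keep]) & toy[keep] <= 85
n_types <- rowSums(reported, na.rm = TRUE)
n_types
```

With the colon and rectal columns left in, the first and third toy respondents would each be credited with one cancer type too many.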
dfWeightFilter <- adult22 %>%
filter(WEIGHTLBTC_A <= 996)
paste("Mean:",mean(dfWeightFilter$WEIGHTLBTC_A))
## [1] "Mean: 230.906970736928"
paste("Max:",max(dfWeightFilter$WEIGHTLBTC_A))
## [1] "Max: 996"
paste("Min:",min(dfWeightFilter$WEIGHTLBTC_A))
## [1] "Min: 100"
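The recorded maximum of 996 suggests that 996 is itself a special code rather than a real weight, which would also inflate the mean. That reading is an assumption to confirm in the codebook. The sketch below shows one way to turn such codes into NA before summarizing. The helper name `recode_missing` and the toy values are made up for illustration.

```r
# Replace codebook sentinel values with NA so summaries ignore them.
recode_missing <- function(x, codes) {
  x[x %in% codes] <- NA
  x
}

weight_toy <- c(150, 200, 996, 999)   # made-up values, not survey rows
weight_clean <- recode_missing(weight_toy, codes = 996:999)

mean(weight_clean, na.rm = TRUE)
max(weight_clean, na.rm = TRUE)
```

Recoding to NA keeps the rows in the data frame, so the same respondents remain available for variables they did answer.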
# Age
dfAgeFilter <- adult22 %>%
filter(AGEP_A < 97)
paste("Mean:",mean(dfAgeFilter$AGEP_A))
## [1] "Mean: 52.9485989777794"
paste("Max:",max(dfAgeFilter$AGEP_A))
## [1] "Max: 85"
paste("Min:",min(dfAgeFilter$AGEP_A))
## [1] "Min: 18"
paste("85 and over:",nrow(dfAgeFilter[dfAgeFilter$AGEP_A == 85, ]))
## [1] "85 and over: 1002"
paste("Under 85:",nrow(dfAgeFilter[dfAgeFilter$AGEP_A < 85, ]))
## [1] "Under 85: 26585"
# Keep valid responses and attach a readable label
dfSexFilter <- adult22 %>%
  filter(SEX_A < 7) %>%
  mutate(Sex_Status = ifelse(SEX_A == 1, "Male", "Female"))
ggplot(dfSexFilter, aes(x = Sex_Status)) +
geom_bar()
# Labels in codebook order for EDUCP_A codes 0 through 10
edu_labels <- c("Never attended/Kindergarten only", "Grade 1-11", "12th Grade, no Diploma", "GED or Equivalent", "High School Graduate", "Some College, no Degree", "Associate degree: occupational, technical, or vocational program", "Associate degree: academic program", "Bachelor's degree", "Master's degree", "Professional School or Doctoral degree")
# Keep valid responses and map each code to its label as an ordered set of factor levels
dfEduFilter <- adult22 %>%
  filter(EDUCP_A < 97) %>%
  mutate(Edu_Status = factor(EDUCP_A, levels = 0:10, labels = edu_labels))
ggplot(dfEduFilter, aes(x = EDUCP_A, fill=Edu_Status)) +
geom_bar() + theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank())
#General Health
dfGHFilter <- adult22 %>%
  filter(PHSTAT_A < 7, AGEP_A < 97)
mean(dfGHFilter$PHSTAT_A)
## [1] 2.439941
ggplot(dfGHFilter, aes(x = PHSTAT_A)) +
geom_bar()
plot(dfGHFilter$AGEP_A , dfGHFilter$PHSTAT_A)
abline(lm(dfGHFilter$PHSTAT_A ~ dfGHFilter$AGEP_A), col = "red", lwd = 3)
#Weight and Health
# dfWeightFilter <- adult22[adult22$WEIGHTLBTC_A < '997', ]
plot(dfWeightFilter$WEIGHTLBTC_A)
dfWHFilter <- dfWeightFilter %>%
filter(PHSTAT_A <= 6)
dfHighHealth <- dfWHFilter %>%
filter(PHSTAT_A < 3 )
dfHighWeight <- dfWHFilter %>%
filter(WEIGHTLBTC_A >= 250 )
# Counts by weight band (respondents at exactly 150 lb fall in none of the first three bands)
Weight1 <- nrow(dfWeightFilter[dfWeightFilter$WEIGHTLBTC_A < 150, ])
Weight2 <- nrow(dfWeightFilter[dfWeightFilter$WEIGHTLBTC_A > 150 & dfWeightFilter$WEIGHTLBTC_A <= 200, ])
Weight3 <- nrow(dfWeightFilter[dfWeightFilter$WEIGHTLBTC_A > 200 & dfWeightFilter$WEIGHTLBTC_A <= 250, ])
# Weight4 is a cumulative count: everyone at or below 250 lb
Weight4 <- nrow(dfWeightFilter[dfWeightFilter$WEIGHTLBTC_A <= 250, ])
dfWeightCount <- data.frame(Weight1, Weight2, Weight3, Weight4)
print(dfWeightCount)
## Weight1 Weight2 Weight3 Weight4
## 1 6451 11717 5008 24210
plot(dfWHFilter$WEIGHTLBTC_A, dfWHFilter$PHSTAT_A, xlab = "Weight", ylab = "General Health")
plot(dfHighWeight$WEIGHTLBTC_A, dfHighWeight$PHSTAT_A, xlab = "Weight", ylab = "General Health")
hist(dfWeightFilter$WEIGHTLBTC_A)
#Weight and Height
plot(adult22$WEIGHTLBTC_A, adult22$HEIGHTTC_A)
# Group_By
dfEdu <- adult22 %>% group_by(adult22$EDUCP_A)
mean(dfEdu$EDUCP_A)
## [1] 6.443528
# which falls between codes 6 and 7, the two associate degree categories
# Probability of at least an associate degree (6, 7, 8, 9, 10)
# Numeric comparison, so that code 10 is included in the 6-10 range
prob_Associate_Up <- nrow(dfEdu[dfEdu$EDUCP_A >= 6 & dfEdu$EDUCP_A <= 10, ])
prob_All <- nrow(dfEdu)
prob_Associate_Up/prob_All
# Probability of below grade 12
prob_Under_12 <- nrow(dfEdu[dfEdu$EDUCP_A <= 1, ])
prob_Under_12/prob_All
## [1] 0.06802647
# Probability of associate or higher and positive life satisfaction
prob_Associate_Satisfied <- nrow(dfEdu[dfEdu$EDUCP_A >= 6 & dfEdu$EDUCP_A <= 10 & dfEdu$LSATIS4_A <= 2, ])
prob_Associate_Satisfied/prob_Associate_Up
#Probability of below grade 12 and satisfied
prob_Under12_Satisfied <- nrow(dfEdu[dfEdu$EDUCP_A <= 1 & dfEdu$LSATIS4_A <= 2, ])
prob_Under12_Satisfied/prob_Under_12
## [1] 0.917597
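Whether a number is quoted matters for these range checks. When R compares a number with a character string, the number is converted to text and the comparison is alphabetical, which behaves differently once a code has two digits. A small illustration on made-up codes:

```r
codes <- c(1, 6, 10)

# Quoted: values are converted to text, and "10" sorts before "6".
as_text <- codes >= "6"

# Unquoted: ordinary numeric comparison.
as_number <- codes >= 6

as_text    # FALSE  TRUE FALSE
as_number  # FALSE  TRUE  TRUE
```

For EDUCP_A this means a quoted range such as 6 to 10 can never match any row, because no text value is both at or after "6" and at or before "10".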
#Probability of normal BMI(18.5 to 24.9) and general health
dfHealth <- adult22 %>% group_by(adult22$PHSTAT_A)
prob_NormBMI <- nrow(dfHealth[dfHealth$BMICAT_A == '2', ])
prob_NormBMI/prob_All
## [1] 0.307186
prob_NormBMI_GoodHealth <- nrow(dfHealth[dfHealth$BMICAT_A == '2' & dfHealth$PHSTAT_A <= '4', ])
prob_NormBMI_GoodHealth/prob_NormBMI
## [1] 0.970332
#Probability of overweight BMI and positive/negative health
prob_OverweightBMI <- nrow(dfHealth[dfHealth$BMICAT_A == '3', ])
prob_OverweightBMI/prob_All
## [1] 0.3357926
prob_OverweightBMI_GoodHealth <- nrow(dfHealth[dfHealth$BMICAT_A == '3' & dfHealth$PHSTAT_A <= '4', ])
prob_OverweightBMI_GoodHealth/prob_OverweightBMI
## [1] 0.9696284
prob_OverweightBMI_BadHealth <- nrow(dfHealth[dfHealth$BMICAT_A == '3' & dfHealth$PHSTAT_A == '5', ])
prob_OverweightBMI_BadHealth/prob_OverweightBMI
## [1] 0.02994076
prob_GoodHealth <- nrow(dfHealth[dfHealth$PHSTAT_A <= '4', ])
# How many of all BMIs considered themselves to be in good health
prob_GoodHealth/prob_All
## [1] 0.9626053
# About 96% of respondents rated their health as fair or better (PHSTAT_A of 4 or lower). Within each BMI category checked above, that share was also above 90%.
# Do most people simply see themselves as being in reasonable health, or were most of the survey takers healthy in general? -- Check the more specific medical issues
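The ratios above are conditional probabilities: a count within a subgroup divided by the size of that subgroup. A compact way to sketch the same calculation is to take the mean of a logical vector. The toy data frame below is made up, and the variable names are only for illustration.

```r
toy <- data.frame(
  BMICAT_A = c(2, 2, 2, 3, 3, 4),
  PHSTAT_A = c(1, 3, 5, 2, 4, 5)
)

normal <- toy$BMICAT_A == 2
fair_or_better <- toy$PHSTAT_A <= 4

# P(normal BMI): share of all rows in category 2.
p_normal <- mean(normal)

# P(fair or better | normal BMI): restrict to the conditioning group first.
p_fair_given_normal <- mean(fair_or_better[normal])

p_normal
p_fair_given_normal
```

Because the denominator is the subgroup itself, rows with non-response codes in the conditioning variable never enter the conditional share.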
# Sort BMI by Underweight, Normal, Overweight, Obese
adult22_raw <- adult22
adult22BMI <- adult22_raw
adult22BMI <-
adult22BMI |>
group_by(adult22BMI$BMICAT_A) |>
  # Label each BMI category; any other code (such as unknown) becomes "Unknown"
  mutate(BMI_Status = case_when(BMICAT_A == 1 ~ "Under",
                                BMICAT_A == 2 ~ "Normal",
                                BMICAT_A == 3 ~ "Over",
                                BMICAT_A == 4 ~ "Obese",
                                TRUE ~ "Unknown")) |>
ungroup()
dfAllBMI <- adult22BMI %>%
filter(BMICAT_A < 5)
nrow(dfAllBMI[dfAllBMI$BMICAT_A == '1',])
## [1] 432
nrow(dfAllBMI[dfAllBMI$BMICAT_A == '2',])
## [1] 8494
nrow(dfAllBMI[dfAllBMI$BMICAT_A == '3',])
## [1] 9285
nrow(dfAllBMI[dfAllBMI$BMICAT_A == '4',])
## [1] 8814
hist(dfAllBMI$BMICAT_A)
# Life Satisfaction and General Health
prob_GoodLS_Health <- nrow(dfHealth[dfHealth$LSATIS4_A <= '2' & dfHealth$PHSTAT_A <= '4', ])
prob_GoodLS_Health/prob_All
## [1] 0.9275976
#Prob out of those who have high general health
prob_GoodLS_Health/prob_GoodHealth
## [1] 0.9636323
#Bad life satisfaction and bad health out of all
prob_BadLS_Health <- nrow(dfHealth[dfHealth$LSATIS4_A >= '3' & dfHealth$LSATIS4_A <=4 & dfHealth$PHSTAT_A == '5', ])
prob_BadLS_Health/prob_All
## [1] 0.01182597
#Bad life satisfaction among those with low health
prob_Low_LS <- nrow(dfHealth[dfHealth$PHSTAT_A == '5',])
prob_BadLS_Health/prob_Low_LS
## [1] 0.3180934
plot(adult22$EDUCP_A , adult22$LSATIS4_A)
abline(lm(adult22$LSATIS4_A ~ adult22$EDUCP_A), col = "red", lwd = 3)
dfEduSample <- dfEdu[ , c("EDUCP_A")]
dfEdu1 <- sample_n(dfEduSample,100, replace = TRUE)
dfEdu2 <- sample_n(dfEduSample,100, replace = TRUE)
dfEdu3 <- sample_n(dfEduSample,100, replace = TRUE)
dfEdu4 <- sample_n(dfEduSample,100, replace = TRUE)
dfEdu5 <- sample_n(dfEduSample,100, replace = TRUE)
print(dfEdu1)
## # A tibble: 100 × 1
## EDUCP_A
## <int>
## 1 7
## 2 3
## 3 7
## 4 7
## 5 8
## 6 4
## 7 4
## 8 4
## 9 6
## 10 8
## # ℹ 90 more rows
paste("Sample 1 Mean:", mean(dfEdu1$EDUCP_A))
## [1] "Sample 1 Mean: 6.16"
print(dfEdu2)
## # A tibble: 100 × 1
## EDUCP_A
## <int>
## 1 4
## 2 6
## 3 1
## 4 8
## 5 5
## 6 5
## 7 1
## 8 6
## 9 8
## 10 9
## # ℹ 90 more rows
paste("Sample 2 Mean:", mean(dfEdu2$EDUCP_A))
## [1] "Sample 2 Mean: 7.3"
print(dfEdu3)
## # A tibble: 100 × 1
## EDUCP_A
## <int>
## 1 9
## 2 1
## 3 5
## 4 7
## 5 8
## 6 8
## 7 4
## 8 5
## 9 9
## 10 6
## # ℹ 90 more rows
paste("Sample 3 Mean:", mean(dfEdu3$EDUCP_A))
## [1] "Sample 3 Mean: 5.99"
print(dfEdu4)
## # A tibble: 100 × 1
## EDUCP_A
## <int>
## 1 8
## 2 6
## 3 4
## 4 6
## 5 5
## 6 5
## 7 4
## 8 4
## 9 4
## 10 4
## # ℹ 90 more rows
paste("Sample 4 Mean:", mean(dfEdu4$EDUCP_A))
## [1] "Sample 4 Mean: 5.84"
print(dfEdu5)
## # A tibble: 100 × 1
## EDUCP_A
## <int>
## 1 5
## 2 8
## 3 5
## 4 5
## 5 8
## 6 10
## 7 1
## 8 4
## 9 9
## 10 5
## # ℹ 90 more rows
paste("Sample 5 Mean:", mean(dfEdu5$EDUCP_A))
## [1] "Sample 5 Mean: 5.67"
# The sample means tend to fall between 5 (some college) and 8 (Bachelor's degree). However, if a sample includes any of the codes 97, 98, or 99 (refused, not ascertained, don't know), its mean will be greatly skewed.
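One way to keep those codes out of the samples is to drop them before drawing. The sketch below uses made-up education codes and base R's `sample()` in place of `sample_n()`. The seed is arbitrary and only makes the draws repeatable.

```r
set.seed(1)  # arbitrary seed so the draws can be repeated

educ_toy <- c(1, 4, 4, 5, 8, 9, 10, 97, 99)  # made-up codes
educ_valid <- educ_toy[educ_toy <= 10]       # keep substantive codes 0-10

# Draw five samples of 100 with replacement and take each sample mean.
sample_means <- replicate(5, mean(sample(educ_valid, 100, replace = TRUE)))
sample_means
```

Every sample mean now stays within the 0 to 10 range of real education codes.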
dfWeightHeightSample <- dfHealth[ , c("WEIGHTLBTC_A", "HEIGHTTC_A")]
dfWH1 <- sample_n(dfWeightHeightSample,100, replace = TRUE)
dfWH2 <- sample_n(dfWeightHeightSample,100, replace = TRUE)
dfWH3 <- sample_n(dfWeightHeightSample,100, replace = TRUE)
dfWH4 <- sample_n(dfWeightHeightSample,100, replace = TRUE)
dfWH5 <- sample_n(dfWeightHeightSample,100, replace = TRUE)
print(dfWH1)
## # A tibble: 100 × 2
## WEIGHTLBTC_A HEIGHTTC_A
## <int> <int>
## 1 279 70
## 2 180 65
## 3 140 62
## 4 150 68
## 5 150 61
## 6 180 64
## 7 135 64
## 8 223 67
## 9 138 69
## 10 165 65
## # ℹ 90 more rows
print(dfWH2)
## # A tibble: 100 × 2
## WEIGHTLBTC_A HEIGHTTC_A
## <int> <int>
## 1 168 68
## 2 205 67
## 3 165 66
## 4 215 73
## 5 996 96
## 6 110 64
## 7 273 67
## 8 160 60
## 9 184 69
## 10 125 65
## # ℹ 90 more rows
print(dfWH3)
## # A tibble: 100 × 2
## WEIGHTLBTC_A HEIGHTTC_A
## <int> <int>
## 1 996 96
## 2 185 73
## 3 163 65
## 4 140 63
## 5 150 63
## 6 136 68
## 7 180 64
## 8 128 62
## 9 209 64
## 10 189 74
## # ℹ 90 more rows
print(dfWH4)
## # A tibble: 100 × 2
## WEIGHTLBTC_A HEIGHTTC_A
## <int> <int>
## 1 150 67
## 2 215 63
## 3 126 64
## 4 140 62
## 5 230 69
## 6 160 71
## 7 130 63
## 8 163 68
## 9 230 71
## 10 165 70
## # ℹ 90 more rows
print(dfWH5)
## # A tibble: 100 × 2
## WEIGHTLBTC_A HEIGHTTC_A
## <int> <int>
## 1 215 75
## 2 195 69
## 3 996 96
## 4 153 70
## 5 147 65
## 6 135 64
## 7 145 68
## 8 996 96
## 9 160 66
## 10 240 66
## # ℹ 90 more rows
plot(dfWH1$WEIGHTLBTC_A,dfWH1$HEIGHTTC_A,type="p",main="Weight vs. Height by Sample",xlab="Weight(lbs)",ylab="Height")
points(dfWH2$WEIGHTLBTC_A,dfWH2$HEIGHTTC_A, col="green")
points(dfWH3$WEIGHTLBTC_A,dfWH3$HEIGHTTC_A,col="blue")
points(dfWH4$WEIGHTLBTC_A,dfWH4$HEIGHTTC_A,col="red")
points(dfWH5$WEIGHTLBTC_A,dfWH5$HEIGHTTC_A,col="yellow")
abline(lm(dfWeightHeightSample$HEIGHTTC_A ~ dfWeightHeightSample$WEIGHTLBTC_A), col = "red", lwd = 3)
dfGenHealthSample <- dfHealth[ , c("PHSTAT_A")]
dfGH1 <- sample_n(dfGenHealthSample,100, replace = TRUE)
dfGH2 <- sample_n(dfGenHealthSample,100, replace = TRUE)
dfGH3 <- sample_n(dfGenHealthSample,100, replace = TRUE)
dfGH4 <- sample_n(dfGenHealthSample,100, replace = TRUE)
dfGH5 <- sample_n(dfGenHealthSample,100, replace = TRUE)
# Average
print(mean(dfGH1$PHSTAT_A))
## [1] 2.51
print(mean(dfGH2$PHSTAT_A))
## [1] 2.49
print(mean(dfGH3$PHSTAT_A))
## [1] 2.46
print(mean(dfGH4$PHSTAT_A))
## [1] 2.31
print(mean(dfGH5$PHSTAT_A))
## [1] 2.31
# The average tends to be between 2 and 3, which makes sense because the general health among all survey takers is often a 2 (Very good) or 3 (Good).
Types:
BLADDCAN_A BLOODCAN_A BONECAN_A BRAINCAN_A BREASCAN_A CERVICAN_A ESOPHCAN_A GALLBCAN_A LARYNCAN_A LEUKECAN_A LIVERCAN_A LUNGCAN_A LYMPHCAN_A MELANCAN_A MOUTHCAN_A OVARYCAN_A PANCRCAN_A PROSTCAN_A SKNMCAN_A SKNNMCAN_A SKNDKCAN_A STOMACAN_A THROACAN_A THYROCAN_A UTERUCAN_A HDNCKCAN_A COLRCCAN_A OTHERCANP_A
Number of reported cancers: NUMCAN_A
Age Told has Cancer
BLADDAGETC_A BLOODAGETC_A BONEAGETC_A BRAINAGETC_A BREASAGETC_A CERVIAGETC_A COLONAGETC_A ESOPHAGETC_A GALLBAGETC_A LARYNAGETC_A LEUKEAGETC_A LIVERAGETC_A LUNGAGETC_A LYMPHAGETC_A MELANAGETC_A MOUTHAGETC_A OVARYAGETC_A PANCRAGETC_A PROSTAGETC_A SKNMAGETC_A SKNNMAGETC_A SKNDKAGETC_A STOMAAGETC_A THROAAGETC_A THYROAGETC_A UTERUAGETC_A HDNCKAGETC_A COLRCAGETC_A OTHERAGETC_A
# Cancers df
dfCancer <- adult22 %>%
filter(NUMCAN_A > 0 & NUMCAN_A < 7)
ggplot(dfCancer, aes(x = NUMCAN_A)) +
geom_bar()
ggplot(dfCancer, aes(NUMCAN_A, LSATIS4_A, colour=NUMCAN_A)) +
geom_line() +
geom_point()
ggplot(dfCancer, aes(NUMCAN_A, AGEP_A, colour=NUMCAN_A)) +
geom_line() +
geom_point()
plot(dfCancer$AGEP_A, dfCancer$NUMCAN_A)
abline(lm(dfCancer$NUMCAN_A ~ dfCancer$AGEP_A), col = "red", lwd = 3)
# Age CI of those with cancer
resultCAN <- t.test(dfCancer$AGEP_A)
confidence_intervalCAN <- resultCAN$conf.int
confidence_intervalCAN
## [1] 68.11267 68.97246
## attr(,"conf.level")
## [1] 0.95
mean(dfCancer$AGEP_A)
## [1] 68.54257
# Age CI of those without cancer
dfNoCancer <- adult22 %>%
filter(NUMCAN_A == 0)
resultNOCAN <- t.test(dfNoCancer$AGEP_A)
confidence_intervalNONE <- resultNOCAN$conf.int
confidence_intervalNONE
## [1] 50.60551 51.06352
## attr(,"conf.level")
## [1] 0.95
mean(dfNoCancer$AGEP_A)
## [1] 50.83452
# Age CI of all
result <- t.test(adult22$AGEP_A)
confidence_interval <- result$conf.int
confidence_interval
## [1] 52.83239 53.26945
## attr(,"conf.level")
## [1] 0.95
mean(adult22$AGEP_A)
## [1] 53.05092
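For reference, the interval that `t.test()` reports for a single sample is the mean plus or minus a t critical value times the standard error. The sketch below reproduces it by hand on made-up ages.

```r
age_toy <- c(23, 35, 41, 52, 58, 64, 70, 77)  # made-up ages

n <- length(age_toy)
se <- sd(age_toy) / sqrt(n)        # standard error of the mean
t_crit <- qt(0.975, df = n - 1)    # two-sided 95% critical value

manual_ci <- mean(age_toy) + c(-1, 1) * t_crit * se
ttest_ci <- as.numeric(t.test(age_toy)$conf.int)

manual_ci
ttest_ci
```

The width shrinks with the square root of the sample size, which is why the interval for the much larger no-cancer group is narrower than the one for the cancer group.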
dfFilteredLS <- adult22 %>%
filter(BMICAT_A < 5 & LSATIS4_A <7)
cohen.d(dfFilteredLS$BMICAT_A, dfFilteredLS$LSATIS4_A)
##
## Cohen's d
##
## d estimate: 1.877823 (large)
## 95 percent confidence interval:
## lower upper
## 1.857558 1.898087
# Effect size is about 1.88 (large)
dfFilteredPH <- adult22 %>%
filter(BMICAT_A < 5 & PHSTAT_A <7)
cohen.d(dfFilteredPH$BMICAT_A, dfFilteredPH$PHSTAT_A)
##
## Cohen's d
##
## d estimate: 0.5713476 (medium)
## 95 percent confidence interval:
## lower upper
## 0.5541439 0.5885514
# Effect size is about 0.57 (medium)
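As a reminder of what the statistic measures, Cohen's d in its common pooled-standard-deviation form is the difference between two sample means divided by their pooled standard deviation. The sketch below is a hand-rolled version on toy vectors. It is not the `cohen.d()` implementation used above, whose defaults may differ.

```r
# Cohen's d with a pooled standard deviation (common textbook form).
cohens_d <- function(x, y) {
  nx <- length(x)
  ny <- length(y)
  pooled_sd <- sqrt(((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2))
  (mean(x) - mean(y)) / pooled_sd
}

d_toy <- cohens_d(c(1, 2, 3), c(2, 3, 4))
d_toy  # -1
```

Called on BMICAT_A and LSATIS4_A, the statistic therefore compares the average BMI code with the average satisfaction code, two scales with different meanings, so the size label should be read with that in mind.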
# Got an out-of-workspace error until I added simulate.p.value. It then took a very long time to run the cell.
#fisher.test(select(adult22, BMICAT_A, LSATIS4_A), simulate.p.value = TRUE)
#fisher.test(select(adult22, BMICAT_A, PHSTAT_A), simulate.p.value = TRUE)
dfFilteredBMI <- adult22 %>%
filter(BMICAT_A < 5)
sd(dfFilteredBMI$BMICAT_A)
## [1] 0.8390505
sd(dfFilteredLS$LSATIS4_A)
## [1] 0.6045232
sd(dfFilteredPH$PHSTAT_A)
## [1] 1.054588
chisq.test(dfFilteredPH$BMICAT_A, dfFilteredPH$PHSTAT_A)
##
## Pearson's Chi-squared test
##
## data: dfFilteredPH$BMICAT_A and dfFilteredPH$PHSTAT_A
## X-squared = 1708.1, df = 12, p-value < 2.2e-16
chisq.test(dfFilteredLS$BMICAT_A, dfFilteredLS$LSATIS4_A)
## Warning in chisq.test(dfFilteredLS$BMICAT_A, dfFilteredLS$LSATIS4_A):
## Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: dfFilteredLS$BMICAT_A and dfFilteredLS$LSATIS4_A
## X-squared = 173.42, df = 9, p-value < 2.2e-16
chisq.test(dfFilteredPH$BMICAT_A, dfFilteredPH$PHSTAT_A, simulate.p.value = TRUE)
##
## Pearson's Chi-squared test with simulated p-value (based on 2000
## replicates)
##
## data: dfFilteredPH$BMICAT_A and dfFilteredPH$PHSTAT_A
## X-squared = 1708.1, df = NA, p-value = 0.0004998
chisq.test(dfFilteredLS$BMICAT_A, dfFilteredLS$LSATIS4_A, simulate.p.value = TRUE)
##
## Pearson's Chi-squared test with simulated p-value (based on 2000
## replicates)
##
## data: dfFilteredLS$BMICAT_A and dfFilteredLS$LSATIS4_A
## X-squared = 173.42, df = NA, p-value = 0.0004998
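The "approximation may be incorrect" warning on the life satisfaction table is the one R raises when some expected cell count is below 5. The expected counts come from the row and column totals, and they can be inspected directly. A sketch on a made-up 2 x 2 table:

```r
# Made-up contingency table; the dimnames are placeholders.
tab <- matrix(c(10, 20, 30, 40), nrow = 2,
              dimnames = list(bmi = c("a", "b"), health = c("x", "y")))

# Expected count per cell: row total * column total / grand total.
expected_manual <- outer(rowSums(tab), colSums(tab)) / sum(tab)

# The same values as stored by chisq.test().
expected_test <- chisq.test(tab)$expected

expected_test
any(expected_test < 5)  # TRUE would trigger the approximation warning
```

Running the same check on the BMI by life satisfaction table would show which cells are too sparse, most likely those combining the small underweight group with the rare "very dissatisfied" answer, though that is a guess until checked.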
dfUnderBMI <- adult22 %>%
filter(BMICAT_A == 1 )
dfUnderBMI <- dfUnderBMI %>%
filter(LSATIS4_A < 7 )
dfNormalBMI <- adult22 %>%
filter(BMICAT_A == 2 )
dfNormalBMI <- dfNormalBMI %>%
filter(LSATIS4_A < 7 )
dfOverBMI <- adult22 %>%
filter(BMICAT_A == 3 )
dfOverBMI <- dfOverBMI %>%
filter(LSATIS4_A < 7 )
dfObeseBMI <- adult22 %>%
filter(BMICAT_A == 4 )
dfObeseBMI <- dfObeseBMI %>%
filter(LSATIS4_A < 7 )
mean(dfUnderBMI$LSATIS4_A)
## [1] 1.6875
mean(dfNormalBMI$LSATIS4_A)
## [1] 1.570281
mean(dfOverBMI$LSATIS4_A)
## [1] 1.576346
mean(dfObeseBMI$LSATIS4_A)
## [1] 1.670496
Status = c("Underweight", "Normal BMI", "Overweight", "Obese")
LifeSatisfaction = c(mean(dfUnderBMI$LSATIS4_A), mean(dfNormalBMI$LSATIS4_A), mean(dfOverBMI$LSATIS4_A), mean(dfObeseBMI$LSATIS4_A))
dfPlot <- data.frame(Status, LifeSatisfaction)
ggplot(dfPlot, aes(x=Status, LifeSatisfaction)) + geom_point(fill='black')
hist(dfUnderBMI$LSATIS4_A)
hist(dfNormalBMI$LSATIS4_A)
hist(dfOverBMI$LSATIS4_A)
hist(dfObeseBMI$LSATIS4_A)
dfUnderBMI <- adult22 %>%
filter(BMICAT_A == 1 )
dfUnderBMI <- dfUnderBMI %>%
filter(PHSTAT_A < 7 )
dfNormalBMI <- adult22 %>%
filter(BMICAT_A == 2 )
dfNormalBMI <- dfNormalBMI %>%
filter(PHSTAT_A < 7 )
dfOverBMI <- adult22 %>%
filter(BMICAT_A == 3 )
dfOverBMI <- dfOverBMI %>%
filter(PHSTAT_A < 7 )
dfObeseBMI <- adult22 %>%
filter(BMICAT_A == 4 )
dfObeseBMI <- dfObeseBMI %>%
filter(PHSTAT_A < 7 )
mean(dfUnderBMI$PHSTAT_A)
## [1] 2.516204
mean(dfNormalBMI$PHSTAT_A)
## [1] 2.176284
mean(dfOverBMI$PHSTAT_A)
## [1] 2.360845
mean(dfObeseBMI$PHSTAT_A)
## [1] 2.759814
Status = c("Underweight", "Normal BMI", "Overweight", "Obese")
PhysicalHealth = c(mean(dfUnderBMI$PHSTAT_A), mean(dfNormalBMI$PHSTAT_A), mean(dfOverBMI$PHSTAT_A), mean(dfObeseBMI$PHSTAT_A))
dfPlot <- data.frame(Status, PhysicalHealth)
ggplot(dfPlot, aes(x=Status, PhysicalHealth)) + geom_point(fill='black')
#Check with ANOVA \[ H_0 : \text{average Life Satisfaction and average General Health are each equal across all BMI categories} \]
hist(dfFilteredBMI$BMICAT_A)
hist(dfFilteredLS$LSATIS4_A)
hist(dfFilteredPH$PHSTAT_A)
#PHSTAT_A and LSATIS4_A are the response variables
#BMICAT_A is the explanatory variable
hist(dfUnderBMI$PHSTAT_A)
hist(dfNormalBMI$PHSTAT_A)
hist(dfOverBMI$PHSTAT_A)
hist(dfObeseBMI$PHSTAT_A)
hist(dfUnderBMI$LSATIS4_A)
hist(dfNormalBMI$LSATIS4_A)
hist(dfOverBMI$LSATIS4_A)
hist(dfObeseBMI$LSATIS4_A)
m <- aov(PHSTAT_A ~ BMICAT_A, data = dfFilteredPH)
summary(m)
## Df Sum Sq Mean Sq F value Pr(>F)
## BMICAT_A 1 1309 1309.0 1231 <2e-16 ***
## Residuals 27017 28739 1.1
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
m2 <- aov(LSATIS4_A ~ BMICAT_A, data = dfFilteredLS)
summary(m2)
## Df Sum Sq Mean Sq F value Pr(>F)
## BMICAT_A 1 34 33.69 92.5 <2e-16 ***
## Residuals 26955 9817 0.36
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# The p-value is below the significance level, so we reject the null hypothesis.
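The Df of 1 in the recorded ANOVA tables indicates that BMICAT_A entered the model as a number, which fits a single straight-line trend across the codes. To compare the four category means, the variable can be wrapped in `factor()`. The sketch below contrasts the two on made-up data.

```r
set.seed(2)  # arbitrary seed for the made-up responses

toy <- data.frame(
  BMICAT_A = rep(1:4, each = 10),
  PHSTAT_A = sample(1:5, 40, replace = TRUE)
)

# Numeric predictor: one slope, so 1 degree of freedom.
m_numeric <- aov(PHSTAT_A ~ BMICAT_A, data = toy)

# Factor predictor: one mean per category, so 4 - 1 = 3 degrees of freedom.
m_factor <- aov(PHSTAT_A ~ factor(BMICAT_A), data = toy)

df_numeric <- summary(m_numeric)[[1]]$Df[1]
df_factor <- summary(m_factor)[[1]]$Df[1]
c(df_numeric, df_factor)  # 1 3
```

The factor version is the one that matches the null hypothesis of equal means across all BMI categories.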
pairwise.t.test(dfFilteredPH$PHSTAT_A, dfFilteredPH$BMICAT_A, p.adjust.method = "bonferroni")
##
## Pairwise comparisons using t tests with pooled SD
##
## data: dfFilteredPH$PHSTAT_A and dfFilteredPH$BMICAT_A
##
## 1 2 3
## 2 1.2e-10 - -
## 3 0.013 < 2e-16 -
## 4 8.9e-06 < 2e-16 < 2e-16
##
## P value adjustment method: bonferroni
pairwise.t.test(dfFilteredLS$LSATIS4_A, dfFilteredLS$BMICAT_A, p.adjust.method = "bonferroni")
##
## Pairwise comparisons using t tests with pooled SD
##
## data: dfFilteredLS$LSATIS4_A and dfFilteredLS$BMICAT_A
##
## 1 2 3
## 2 0.00048 - -
## 3 0.00108 1.00000 -
## 4 1.00000 < 2e-16 < 2e-16
##
## P value adjustment method: bonferroni
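With four BMI categories there are six pairwise comparisons, and the Bonferroni method multiplies each raw p-value by the number of comparisons, capping the result at 1. That is why several entries in the tables above read exactly 1.00000. A small sketch with made-up p-values:

```r
# Number of pairwise comparisons among four groups.
n_pairs <- choose(4, 2)
n_pairs  # 6

# Bonferroni adjustment on three made-up p-values: multiply by 3, cap at 1.
p_raw <- c(0.01, 0.02, 0.5)
p_bonf <- p.adjust(p_raw, method = "bonferroni")
p_bonf  # 0.03 0.06 1.00
```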
boot_ciLS <- function (v, func = median, conf = 0.95, n_iter = 100) {
boot_func <- \(x, i) func(x[i])
b <- boot(v, boot_func, R = n_iter)
b <- boot.ci(b, conf = conf, type = "perc")
return(c("lower" = b$percent[4],
"upper" = b$percent[5]))
}
df_ciLS <- dfFilteredLS |>
group_by(BMICAT_A) |>
summarise(ci_lower = boot_ciLS(LSATIS4_A, mean)['lower'],
mean_LS = mean(LSATIS4_A),
ci_upper = boot_ciLS(LSATIS4_A, mean)['upper'])
df_ciLS
## # A tibble: 4 × 4
## BMICAT_A ci_lower mean_LS ci_upper
## <int> <dbl> <dbl> <dbl>
## 1 1 1.62 1.69 1.74
## 2 2 1.56 1.57 1.58
## 3 3 1.56 1.58 1.59
## 4 4 1.66 1.67 1.68
df_ciLS |>
ggplot() +
geom_errorbarh(mapping = aes(y = BMICAT_A,
xmin=ci_lower, xmax=ci_upper,
color = '95% C.I.'), height = 0.5) +
geom_point(mapping = aes(x = mean_LS, y = BMICAT_A,
color = 'Group Mean'),
shape = '|',
size = 5) +
scale_color_manual(values=c('black', 'red')) +
theme_minimal() +
labs(title = "Life Satisfaction by BMI Category",
x = "Life Satisfaction",
y = "BMI Category",
color = '')
# Category 1 (underweight) has far fewer respondents than the others, so its confidence interval is much wider and its mean is less reliable.
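The bootstrap helper above is called twice per group, so the lower and upper bounds come from two different sets of resamples. The sketch below is a base R alternative that takes both bounds from a single set. It uses a plain percentile interval via `quantile()` rather than `boot.ci()`, so the numbers can differ slightly. The name `boot_mean_ci` and the toy ratings are made up.

```r
set.seed(3)  # arbitrary seed so the resampling can be repeated

# Percentile bootstrap interval for the mean, both bounds from one run.
boot_mean_ci <- function(v, conf = 0.95, n_iter = 1000) {
  boot_means <- replicate(n_iter, mean(sample(v, length(v), replace = TRUE)))
  alpha <- 1 - conf
  q <- quantile(boot_means, c(alpha / 2, 1 - alpha / 2), names = FALSE)
  c(lower = q[1], upper = q[2])
}

ls_toy <- c(1, 1, 2, 2, 2, 1, 3, 2, 1, 4)  # made-up satisfaction ratings
ci <- boot_mean_ci(ls_toy)
ci
```

A small group such as the underweight category gives a wide interval here for the same reason as in the plot: each resample is drawn from few observations.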
boot_ciPH <- function (v, func = median, conf = 0.95, n_iter = 100) {
boot_func <- \(x, i) func(x[i])
b <- boot(v, boot_func, R = n_iter)
b <- boot.ci(b, conf = conf, type = "perc")
return(c("lower" = b$percent[4],
"upper" = b$percent[5]))
}
df_ciPH <- dfFilteredPH |>
group_by(BMICAT_A) |>
summarise(ci_lower = boot_ciPH(PHSTAT_A, mean)['lower'],
mean_PH = mean(PHSTAT_A),
ci_upper = boot_ciPH(PHSTAT_A, mean)['upper'])
df_ciPH
## # A tibble: 4 × 4
## BMICAT_A ci_lower mean_PH ci_upper
## <int> <dbl> <dbl> <dbl>
## 1 1 2.39 2.52 2.63
## 2 2 2.15 2.18 2.20
## 3 3 2.34 2.36 2.38
## 4 4 2.74 2.76 2.78
df_ciPH |>
ggplot() +
geom_errorbarh(mapping = aes(y = BMICAT_A,
xmin=ci_lower, xmax=ci_upper,
color = '95% C.I.'), height = 0.5) +
geom_point(mapping = aes(x = mean_PH, y = BMICAT_A,
color = 'Group Mean'),
shape = '|',
size = 5) +
scale_color_manual(values=c('black', 'red')) +
theme_minimal() +
labs(title = "General Health by BMI Category",
x = "General Health",
y = "BMI Category",
color = '')
# Underweight category has the same problem as above.
# Both of these show that the average is not the same among BMI Categories.
# Age could also be a factor.
dfFilteredPHAge <- dfFilteredPH %>%
filter(AGEP_A < 86)
dfFilteredLSAge <- dfFilteredLS %>%
filter(AGEP_A < 86)
modelLS <- lm(AGEP_A ~ LSATIS4_A, dfFilteredLSAge)
modelLS$coefficients
## (Intercept) LSATIS4_A
## 53.00383657 -0.04623706
modelPH <- lm(AGEP_A ~ PHSTAT_A, dfFilteredPHAge)
modelPH$coefficients
## (Intercept) PHSTAT_A
## 42.67638 4.20775
dfFilteredLSAge |>
ggplot(mapping = aes(x = LSATIS4_A, y = AGEP_A)) +
geom_point(size = 2) +
geom_smooth(method = "lm", se = FALSE, color = 'darkblue') +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
dfFilteredPHAge |>
ggplot(mapping = aes(x = PHSTAT_A, y = AGEP_A)) +
geom_point(size = 2) +
geom_smooth(method = "lm", se = FALSE, color = 'darkblue') +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
# Life Satisfaction shows almost no linear relationship with age (slope of about -0.05), whereas General Health does (slope of about 4.2).
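To read those coefficients: in `lm(AGEP_A ~ PHSTAT_A)` the intercept is the fitted age at a rating of 0 and the slope is the change in fitted age per one-step increase in the rating. The sketch below builds made-up data with a known intercept and slope and checks that `lm()` recovers them.

```r
# Made-up data that follows age = 40 + 4 * rating exactly.
toy <- data.frame(PHSTAT_A = c(1, 2, 3, 4, 5))
toy$AGEP_A <- 40 + 4 * toy$PHSTAT_A

m <- lm(AGEP_A ~ PHSTAT_A, data = toy)
coefs <- coef(m)
coefs  # intercept 40, slope 4
```

By the same reading, the recorded slope of about 4.2 means each one-step worsening of the General Health rating goes with an average age roughly four years higher.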
# Checking Age vs BMI Category
dfFilteredBMIAge <- dfFilteredBMI %>%
filter(AGEP_A < 86)
boot_ciBMIAge <- function (v, func = median, conf = 0.95, n_iter = 100) {
boot_func <- \(x, i) func(x[i])
b <- boot(v, boot_func, R = n_iter)
b <- boot.ci(b, conf = conf, type = "perc")
return(c("lower" = b$percent[4],
"upper" = b$percent[5]))
}
df_ciBMIAge <- dfFilteredBMI |>
group_by(BMICAT_A) |>
summarise(ci_lower = boot_ciBMIAge(AGEP_A, mean)['lower'],
mean_Age = mean(AGEP_A),
ci_upper = boot_ciBMIAge(AGEP_A, mean)['upper'])
df_ciBMIAge
## # A tibble: 4 × 4
## BMICAT_A ci_lower mean_Age ci_upper
## <int> <dbl> <dbl> <dbl>
## 1 1 48.4 50.4 52.8
## 2 2 51.5 51.9 52.3
## 3 3 54.0 54.4 54.8
## 4 4 52.3 52.7 53.0
df_ciBMIAge |>
ggplot() +
geom_errorbarh(mapping = aes(y = BMICAT_A,
xmin=ci_lower, xmax=ci_upper,
color = '95% C.I.'), height = 0.5) +
geom_point(mapping = aes(x = mean_Age, y = BMICAT_A,
color = 'Group Mean'),
shape = '|',
size = 5) +
scale_color_manual(values=c('black', 'red')) +
theme_minimal() +
labs(title = "Average Age by BMI Category",
x = "Age",
y = "BMI Category",
color = '')
# Same problem once again with Underweight BMI.
# Age with General Health graph
dfFilteredPHAge <- dfFilteredPH %>%
filter(AGEP_A < 86)
boot_ciPHAge <- function (v, func = median, conf = 0.95, n_iter = 100) {
boot_func <- \(x, i) func(x[i])
b <- boot(v, boot_func, R = n_iter)
b <- boot.ci(b, conf = conf, type = "perc")
return(c("lower" = b$percent[4],
"upper" = b$percent[5]))
}
df_ciPHAge <- dfFilteredPHAge |>
group_by(PHSTAT_A) |>
summarise(ci_lower = boot_ciPHAge(AGEP_A, mean)['lower'],
mean_Age = mean(AGEP_A),
ci_upper = boot_ciPHAge(AGEP_A, mean)['upper'])
df_ciPHAge
## # A tibble: 5 × 4
## PHSTAT_A ci_lower mean_Age ci_upper
## <int> <dbl> <dbl> <dbl>
## 1 1 46.6 47.1 47.5
## 2 2 50.5 50.9 51.4
## 3 3 54.8 55.2 55.7
## 4 4 59.1 59.7 60.4
## 5 5 62.9 63.9 64.9
df_ciPHAge |>
ggplot() +
geom_errorbarh(mapping = aes(y = PHSTAT_A,
xmin=ci_lower, xmax=ci_upper,
color = '95% C.I.'), height = 0.5) +
geom_point(mapping = aes(x = mean_Age, y = PHSTAT_A,
color = 'Group Mean'),
shape = '|',
size = 5) +
scale_color_manual(values=c('black', 'red')) +
theme_minimal() +
labs(title = "Average Age by General Health",
x = "Age",
y = "General Health",
color = '')
# General Health worsens as age increases (a higher General Health value means worse health).
dfFilteredLSAge <- dfFilteredLS %>%
filter(AGEP_A < 86)
boot_ciLSAge <- function (v, func = median, conf = 0.95, n_iter = 100) {
boot_func <- \(x, i) func(x[i])
b <- boot(v, boot_func, R = n_iter)
b <- boot.ci(b, conf = conf, type = "perc")
return(c("lower" = b$percent[4],
"upper" = b$percent[5]))
}
df_ciLSAge <- dfFilteredLSAge |>
group_by(LSATIS4_A) |>
summarise(ci_lower = boot_ciLSAge(AGEP_A, mean)['lower'],
mean_Age = mean(AGEP_A),
ci_upper = boot_ciLSAge(AGEP_A, mean)['upper'])
df_ciLSAge
## # A tibble: 4 × 4
## LSATIS4_A ci_lower mean_Age ci_upper
## <int> <dbl> <dbl> <dbl>
## 1 1 53.0 53.4 53.7
## 2 2 52.0 52.3 52.7
## 3 3 54.0 55.3 56.8
## 4 4 55.7 57.5 59.4
df_ciLSAge |>
ggplot() +
geom_errorbarh(mapping = aes(y = LSATIS4_A,
xmin=ci_lower, xmax=ci_upper,
color = '95% C.I.'), height = 0.5) +
geom_point(mapping = aes(x = mean_Age, y = LSATIS4_A,
color = 'Group Mean'),
shape = '|',
size = 5) +
scale_color_manual(values=c('black', 'red')) +
theme_minimal() +
labs(title = "Average Age by Life Satisfaction",
x = "Age",
y = "Life Satisfaction",
color = '')
## It seems that average Life Satisfaction and average General Health are
not the same among BMI categories. Additionally, age plays a part in
average General Health, but not in Life Satisfaction or BMI
category.