getwd()
## [1] "C:/Users/Jerome/Documents/0000_Work_Files/0000_Coursera/Statistics_with_R_Specialization/Course_1_Probability_&_Data/Week5_Project"
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.0.2
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(statsr)
## Warning: package 'statsr' was built under R version 4.0.3
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.0.2
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v tibble  3.0.4     v purrr   0.3.4
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.0
## Warning: package 'tibble' was built under R version 4.0.3
## Warning: package 'tidyr' was built under R version 4.0.2
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
load("brfss2013.RData")

According to the BRFSS Data User Guide, the BRFSS uses two sample frames. One is a landline sample frame; the other is the cell phone sample frame. These sample frames are obtained for each state from the Centers for Disease Control and Prevention (CDC). The landline sample uses disproportionate stratified sampling based on the density of landline telephone numbers. The cell phone list is a random sample of known cell phone numbers. Each cell phone number has an equal probability of being selected. Weights are assigned following sampling to remove, as much as possible, any bias from the survey responses.
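To illustrate what such weights do, here is a toy sketch with made-up values. The `response` and `weight` vectors are hypothetical and are not taken from the BRFSS file. It compares an unweighted proportion with a weighted one using base R's `weighted.mean()`:

```r
# Hypothetical respondents: 1 = reports the behavior, 0 = does not.
response <- c(1, 0, 1, 1, 0)

# Hypothetical survey weights (made-up values, not from the BRFSS data).
# A larger weight means the respondent stands in for more people.
weight <- c(100, 400, 100, 100, 300)

# Every respondent counts equally.
unweighted <- mean(response)

# Each respondent counts in proportion to the assigned weight.
weighted <- weighted.mean(response, weight)

c(unweighted = unweighted, weighted = weighted)
```

In this toy case the two estimates differ because the respondents who answered 0 carry most of the weight.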

Because the households and persons within households are sampled randomly, the results may be generalized to the population. Thus the data for an individual state may be generalized to that state; if the entire data set is used, the results may be generalized to the entire United States. Random assignment is NOT used. The BRFSS is an observational survey, not a clinical trial with persons assigned to control and treatment groups, so it is not possible to draw causal conclusions from this data set.

According to the website MDEDGE (https://www.mdedge.com/chestphysician/article/79880/health-policy/hawaii-named-healthiest-state-2013, accessed 6 August 2021), Hawaii had the healthiest population in the USA in 2013, and Mississippi ranked last, 50th of 50. The same rankings held in 2020 (https://www.beckershospitalreview.com/rankings-and-ratings/50-states-ranked-from-healthiest-to-unhealthiest.html). For purposes of this exercise, I want to compare various indicators for these two states to see the extent of the differences between them. Thus one of the three variables used in the research questions is state. Data from Hawaii and Mississippi will be compared in this study.

Subset the dataset to create a dataset of only Hawaii and Mississippi

hims0 <- subset(brfss2013, X_state %in% c("Hawaii", "Mississippi"))
write.csv(hims0, "hims0.csv", row.names = FALSE)
hims0 <- read.csv("hims0.csv", header = TRUE)

Research Questions: My goal in this project is to compare select health indicators for the state with the lowest ratings for behavioral risk factors to the same indicators for the state with the highest ratings. The indicators I selected for this study are nicotine use; vaccination for flu and pneumonia among persons age 65 or older; and the numbers of people in good health vs. poor health. To do this, I will create several new variables and, for research question 2, select only certain cases to analyze.

Research Question 1: Is there a difference between the levels of nicotine use between residents of Hawaii and Mississippi?

To answer this question, I will construct a new variable, nicotine, that flags respondents with a value of “Yes” for the smoke100 variable or a value of “Every day” for the usenow3 variable (smokeless tobacco). The values of the new variable will then be compared between the two states. To create the variable properly, I must first delete the NA cases from those two variables.

# Drop respondents with a missing smoke100 response
hims0 <- filter(hims0, !is.na(smoke100))
write.csv(hims0, "hims0.csv", row.names = FALSE)
hims0 <- read.csv("hims0.csv", header = TRUE)
table(hims0$smoke100)
## 
##   No  Yes 
## 8574 6350
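Dropping missing responses can be written in two ways that return the same rows. Comparing against the string "NA" relies on the comparison itself returning NA for a missing value and on `filter()` discarding rows where the condition is NA, while `!is.na()` tests for missingness explicitly. A minimal sketch on a made-up data frame (the `toy` object and its values are hypothetical):

```r
library(dplyr)

# Hypothetical data frame with one missing response (not BRFSS data).
toy <- data.frame(smoke100 = c("Yes", "No", NA, "Yes"))

# The comparison yields NA, not FALSE, for the missing row.
toy$smoke100 != "NA"

# filter() keeps only rows where the condition is TRUE, so the NA row is dropped.
by_string <- filter(toy, smoke100 != "NA")

# is.na() states the intent directly and returns the same rows.
by_is_na <- filter(toy, !is.na(smoke100))

nrow(by_string)
nrow(by_is_na)
```

The explicit form is easier to read and does not depend on how the missing values were encoded when the file was read.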

Now filter NAs and “Some days” from usenow3 (the smokeless tobacco column)

# Drop missing usenow3 responses, then drop the "Some days" users
hims0 <- filter(hims0, !is.na(usenow3))
hims0 <- filter(hims0, usenow3 != "Some days")
table(hims0$usenow3)
## 
##  Every day Not at all 
##        300      14341
write.csv(hims0, "hims0.csv", row.names = FALSE)

hims0 <-read.csv("hims0.csv", header = TRUE)

Now create the variable “nicotine” by combining the “Yes” values in smoke100 and the “Every day” values in usenow3

hims1  <- mutate(hims0, nicotine = ifelse(smoke100 == "Yes" | usenow3 == "Every day", 1,0))
write.csv(hims1, "hims1.csv", row.names = FALSE)

hims1 <-read.csv("hims1.csv", header = TRUE)
table(hims1$nicotine)
## 
##    0    1 
## 8311 6330

Compare the use of nicotine products between the 2 states

table(hims1$X_state, hims1$nicotine)
##              
##                  0    1
##   Hawaii      4379 3219
##   Mississippi 3932 3111

Calculate the total number of Hawaiians and Mississippians in the study, then calculate the percentage in each state who use nicotine.

4379+3219
## [1] 7598
3932+3111
## [1] 7043
3219/7598
## [1] 0.4236641
3111/7043
## [1] 0.4417152
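The same totals and percentages can be computed from the cross-tabulation rather than by hand. A minimal sketch, with the counts copied from the table above into a hypothetical matrix named `counts`:

```r
# Counts copied from the state-by-nicotine table (columns: 0 = no use, 1 = use).
counts <- matrix(c(4379, 3932, 3219, 3111), ncol = 2,
                 dimnames = list(c("Hawaii", "Mississippi"), c("0", "1")))

# Number of respondents per state.
totals <- rowSums(counts)

# Share of each state's respondents in each category (each row sums to 1).
shares <- prop.table(counts, margin = 1)

totals
round(shares, 3)
```

In the project itself, `prop.table(table(hims1$X_state, hims1$nicotine), margin = 1)` would give the same shares without retyping any numbers.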

These figures show that 42.4% of the Hawaiians in the sample use nicotine, compared to 44.2% of the Mississippians, a gap of about 1.8 percentage points. This does not seem like a large difference between the two populations. If Hawaiians are healthier than Mississippians, it is not because Hawaiians use nicotine products appreciably less than Mississippians.

Now display the results graphically

hims2 <- mutate(hims1, nicotine1 = ifelse(smoke100 == "Yes" | usenow3 == "Every day", "Yes", "No"))
write.csv(hims2, "hims2.csv", row.names = FALSE)

hims2 <-read.csv("hims2.csv", header = TRUE)
ggplot(data=hims2, aes(x=X_state, fill = nicotine1)) + geom_bar()

As can be seen from the plot, the two states are just about equal in terms of tobacco use, confirming the calculated percentages.
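By default `geom_bar()` stacks raw counts, and the two states have different numbers of respondents, so bar heights reflect sample size as well as nicotine use. A hedged sketch of one alternative: `position = "fill"` rescales each bar to proportions. The `plot_data` frame is a stand-in rebuilt from the counts above; in the project one would plot `hims2` directly.

```r
library(ggplot2)

# Stand-in data rebuilt from the state-by-nicotine counts reported above.
plot_data <- data.frame(
  X_state   = rep(c("Hawaii", "Mississippi"), times = 2),
  nicotine1 = rep(c("No", "Yes"), each = 2),
  n         = c(4379, 3932, 3219, 3111)
)

# position = "fill" scales every bar to height 1, so the segments show shares.
p <- ggplot(plot_data, aes(x = X_state, y = n, fill = nicotine1)) +
  geom_col(position = "fill") +
  labs(y = "Proportion of respondents")
p
```

With the full data set, the equivalent call would be `ggplot(hims2, aes(x = X_state, fill = nicotine1)) + geom_bar(position = "fill")`.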

table(hims2$nicotine1, hims2$nicotine)
##      
##          0    1
##   No  8311    0
##   Yes    0 6330
nicotine_use <- matrix(c(3219, 3111, 4379,3932),ncol = 2)
colnames (nicotine_use) <- c("Use", "Don't Use")
rownames (nicotine_use) <- c("Hawaii", "Mississippi")
nicotine_use
##              Use Don't Use
## Hawaii      3219      4379
## Mississippi 3111      3932

Now conduct a two-sample test for equality of proportions.

result.prop <- prop.test(nicotine_use)

result.prop
## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  nicotine_use
## X-squared = 4.7793, df = 1, p-value = 0.0288
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.034248843 -0.001853269
## sample estimates:
##    prop 1    prop 2 
## 0.4236641 0.4417152

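The test shows a statistically significant difference between the two groups (p = 0.029), but the estimated difference is small: about 1.8 percentage points, with a 95% confidence interval from roughly 0.2 to 3.4 percentage points. With more than 7,000 respondents per state, even a difference this small can reach significance, so the result may arise largely because of the large sample size rather than a meaningful gap in nicotine use.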
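One way to see the role of sample size is to rerun the test with nearly the same proportions but about one tenth as many respondents. This is a sketch with hypothetical counts (the real counts divided by ten and rounded), not a result from the survey:

```r
# Observed counts from the survey: users and non-users in each state.
full <- matrix(c(3219, 3111, 4379, 3932), ncol = 2,
               dimnames = list(c("Hawaii", "Mississippi"), c("Use", "Don't Use")))

# Hypothetical smaller sample with nearly the same proportions.
small <- round(full / 10)

p_full  <- prop.test(full)$p.value
p_small <- prop.test(small)$p.value

# Difference in proportions (Hawaii minus Mississippi), in percentage points.
est <- prop.test(full)$estimate
diff_pp <- 100 * (est[[1]] - est[[2]])

c(p_full = p_full, p_small = p_small)
diff_pp
```

The difference stays near 1.8 percentage points in both versions, but only the full sample gives a p-value below 0.05, which is why the size of the difference deserves as much attention as its significance.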

Research Question 2: How do the states compare in terms of vaccinations for flu and pneumonia for persons 65 or older?

The first step in this analysis is to create a subset of persons age 65 or older by filtering X_age65yr to drop missing values and respondents aged 18 to 64.

# Drop missing age groups, then keep only respondents age 65 or older
hims65 <- filter(hims2, !is.na(X_age65yr))
hims65 <- filter(hims65, X_age65yr != "Age 18 to 64")
table(hims65$flushot6)
## 
##   No  Yes 
## 1637 2957
table(hims65$X_age65yr, hims65$X_state)
##                  
##                   Hawaii Mississippi
##   Age 65 or older   2203        2586
write.csv(hims65, file = "hims65.csv", row.names = FALSE)
hims65 <- read.csv("hims65.csv", header = TRUE )

The next step is to examine the vaccination columns, flushot6 (flu) and pneuvac3 (pneumonia), to determine whether they contain NA values.

table(hims65$flushot6)
## 
##   No  Yes 
## 1637 2957
table(hims65$pneuvac3)
## 
##   No  Yes 
## 1428 2913

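These tables cannot confirm the absence of NA values, because table() omits them by default. The totals (4,594 for flushot6 and 4,341 for pneuvac3) fall short of the 4,789 respondents age 65 or older, so both columns contain missing responses. Now create a vaccine indicator and drop the cases where it is missing.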
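`table()` leaves out missing values unless asked to include them, so a check for NAs needs the `useNA` argument or `is.na()`. A minimal sketch on a made-up vector (the values are hypothetical):

```r
# Hypothetical responses with two missing values (not BRFSS data).
flu <- c("Yes", "No", NA, "Yes", NA)

# Default behavior: NA is silently left out of the counts.
default_counts <- table(flu)

# useNA = "ifany" adds an <NA> entry when missing values exist.
full_counts <- table(flu, useNA = "ifany")

# Direct count of missing values.
n_missing <- sum(is.na(flu))

default_counts
full_counts
n_missing
```

Applied to the project data, `table(hims65$flushot6, useNA = "ifany")` and `sum(is.na(hims65$pneuvac3))` would report the missing responses directly.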

hims65 <- mutate(hims65, vax = ifelse(flushot6 == "Yes" | pneuvac3 == "Yes", "Yes", "No"))
# Drop respondents whose vaccination status could not be determined
hims65 <- filter(hims65, !is.na(vax))
table(hims65$X_state, hims65$vax)
##              
##                 No  Yes
##   Hawaii       363 1696
##   Mississippi  489 1958
write.csv(hims65, file = "hims65.csv", row.names = FALSE)
hims65 <- read.csv("hims65.csv", header = TRUE )
vaccinated <- matrix(c(1696, 1958, 363,489),ncol = 2)
colnames (vaccinated) <- c("Vaxed", "Not Vaxed")
rownames (vaccinated) <- c("Hawaii", "Mississippi")
vaccinated
##             Vaxed Not Vaxed
## Hawaii       1696       363
## Mississippi  1958       489
result.prop <- prop.test(vaccinated)

result.prop
## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  vaccinated
## X-squared = 3.888, df = 1, p-value = 0.04863
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  0.0002438377 0.0468308826
## sample estimates:
##    prop 1    prop 2 
## 0.8237008 0.8001635
ggplot(data=hims65, aes(x=X_state, fill = vax)) + geom_bar()
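The vax indicator combines two columns with `|`, and R's logical OR treats missing values in a specific way: `TRUE | NA` is `TRUE`, but `FALSE | NA` is `NA`. A short sketch with hypothetical values (the `flushot` and `pneuvac` vectors are made up) shows which respondents end up with a missing indicator:

```r
# Hypothetical responses for four people (not BRFSS data).
flushot <- c("Yes", "No", "No", NA)
pneuvac <- c(NA,    "No", NA,   NA)

# Same construction as the vax indicator.
vax <- ifelse(flushot == "Yes" | pneuvac == "Yes", "Yes", "No")
vax
```

The first person counts as vaccinated despite a missing pneumonia answer, while the third and fourth get NA. A respondent is therefore dropped by the later NA filter only when neither answer is "Yes" and at least one answer is missing.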

With respect to nicotine use, Hawaii had a slightly smaller percentage of persons using nicotine. The difference was statistically significant, but possibly only because of the large sample size. The same holds true for vaccination against flu or pneumonia among persons 65 or older. A higher percentage of persons in Hawaii are vaccinated (82.4% vs. 80.0%), thus avoiding risky behavior, but the percentage difference is not great, and the p-value (0.049) is closer to .05 than it was in the case of nicotine use (0.029). Again, with a sample this large, the significance may arise simply because of the number of observations.

Research Question 3: What is the difference between the states in terms of the number of people reporting good health vs. poor health?

To answer this question, I will use the variable genhlth to create a new variable, health_status. Any response of excellent, very good, or good will be tabulated as good. Responses of fair or poor will be tabulated as poor.

The first step is to eliminate the NA responses from genhlth.

table(hims2$genhlth)
## 
## Excellent      Fair      Good      Poor Very good 
##      2369      2141      4905      1056      4134
hims2 <- filter(hims2, !is.na(genhlth))
write.csv(hims2, file = "hims2.csv", row.names = FALSE)
hims2 <- read.csv("hims2.csv", header = TRUE )

Now create the binary indicator for health.

hims3  <- mutate(hims2, health_status = ifelse(genhlth == "Excellent" | genhlth == "Very good" | genhlth == "Good", "Good","Poor"))
table(hims3$X_state, hims3$health_status)
##              
##               Good Poor
##   Hawaii      6449 1141
##   Mississippi 4959 2056
healthy <- matrix(c(6449, 4959, 1141,2056),ncol = 2)
colnames (healthy) <- c("Good", "Poor")
rownames (healthy) <- c("Hawaii", "Mississippi")
healthy
##             Good Poor
## Hawaii      6449 1141
## Mississippi 4959 2056
result.prop <- prop.test(healthy)
result.prop
## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  healthy
## X-squared = 433.69, df = 1, p-value < 2.2e-16
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  0.1292742 0.1562396
## sample estimates:
##    prop 1    prop 2 
## 0.8496706 0.7069138
ggplot(data=hims3, aes(x=X_state, fill = health_status)) + geom_bar()

write.csv(hims3, file = "hims3.csv", row.names = FALSE)
hims3 <- read.csv("hims3.csv", header = TRUE)

The two specific behavioral risk factors examined in the first two research questions did not show much difference between the two states. However, responses to the general question “How is your general health?” differ widely between the two states. About 85% of the Hawaiian respondents reported good health, compared with only 71% of the Mississippian respondents, a difference of about 14 percentage points. The p-value of the test is extremely small, far below .001. Even considering the relatively large sample size, there is a substantial and statistically significant difference in the proportion of persons reporting good health between the two states. A much more detailed analysis of the BRFSS data would be required to pinpoint the reasons for the large difference in the proportions of healthy persons.
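To put the three comparisons side by side, the tests can be rerun from the counts reported above and collected in one table. This is a sketch; the helper `summarise_test` and the object names are hypothetical.

```r
# Counts from the three contingency tables (rows: Hawaii, Mississippi).
tables <- list(
  nicotine   = matrix(c(3219, 3111, 4379, 3932), ncol = 2),
  vaccinated = matrix(c(1696, 1958, 363, 489), ncol = 2),
  healthy    = matrix(c(6449, 4959, 1141, 2056), ncol = 2)
)

# Run prop.test() on one table and keep the estimates, interval, and p-value.
summarise_test <- function(m) {
  res <- prop.test(m)
  data.frame(
    hawaii      = res$estimate[[1]],
    mississippi = res$estimate[[2]],
    difference  = res$estimate[[1]] - res$estimate[[2]],
    ci_low      = res$conf.int[1],
    ci_high     = res$conf.int[2],
    p_value     = res$p.value
  )
}

# One row per research question.
comparison <- do.call(rbind, lapply(tables, summarise_test))
round(comparison, 4)
```

Reading the `difference` column alongside `p_value` keeps the size of each gap in view, which matters here because all three tests are significant while only the health-status gap is large.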