THIS IS AN INDIVIDUAL ASSESSMENT, THIS DOCUMENT AND YOUR ANSWERS ARE FOR YOUR EYES ONLY. ANY VIOLATION OF THIS POLICY WILL BE IMMEDIATELY REPORTED TO THE UNDERGRADUATE CONDUCT BOARD.

Replace the underscores below with your name acknowledging that you have read and understood the Duke Community Standard.

I, Mishek Thapa, hereby state that I have not communicated with or gained information in any way from my classmates or anyone other than the Professor or TA during this exam, and that all work is my own.

Load packages

# load required packages here
library(broom)
library(infer)
library(tidyverse)
library(NHANES)
data(NHANES)
set.seed(2048)

Questions

Question 1

The NHANES dataset is filtered for people between age 26 & 64 and the NAs in the work variable.

nhanes <- NHANES %>%
  filter(!is.na(Work),
         Age<= 64 & Age>=26)
nrow(nhanes) #1525

Question 2

Measuring a participant's blood pressure multiple times helps create a more accurate representation of a person's blood pressure. Having a single blood pressure could be less accurate because a single blood pressure might have resulted from some uncontrolled stimulus and having a few additional measurements corrects for it. In this dataset, if there are multiple measurements, then first is always ignored. The first measurement may be less accurate because of unexpected stimuli and the future measurements could prevent those issues. Blood pressure is also a fairly easy test to conduct because it does not require much medical expertise, so multiple tests can easily be conducted without having too much inconvenience.

Recording an average allows for compiling the measurements into a single summary statistic, so a single person's blood pressure measurements are equally represented by a single number.

Question 3

BPSysAve is visualized with the levels of the Work variable into boxplots. This is able to compare the range, the center of the distribution, and outlier information about the BP Systolic Averages between the work categories.

ggplot(data = nhanes, aes(x = Work, y = BPSysAve)) + 
  geom_boxplot() + 
  labs(title = "BP Systolic Averages in US National Center for Health Statistics Sample",
       subtitle = "based on Work Status",
       x = "Work Status", y = "BP Systolic Average")

#removes 185 observations because of NAs in BPSysAve Variable

This graph shows that BP Systolic averages in this sample slightly differ between work status. There appears to be more frequent outliers in the blood pressures of working people in this sample. However, the center of the distribution appears to be about the same between the 3 levels. The median and the middle 75% of the two graphs appear almost the same. Not much can be concluded about the differences between the categories with this visualization, so hypothesis tests may be helpful in understanding what this graph is showing.

(average_summary <- nhanes %>%
  filter(!is.na(BPSysAve)) %>%
  group_by(Work) %>%
  summarise(mean = mean(BPSysAve),
            median = median(BPSysAve),
            sd = sd(BPSysAve),
            min = min(BPSysAve),
            max = max(BPSysAve),
            q1 = quantile (BPSysAve, probs = 0.25),
            q2 = quantile (BPSysAve, probs = 0.75)))

The relevant summary statistics for BPSysAve can be found in this table above for all three work statuses. Most of the values for each statistic are very near each other between the three levels except for the minimum and maximum statistics. The minimum for notworking is the smallest and its maximum is the greatest, while the minimum for people looking for job's is greatest and the maximum is the smallest in this sample. These are summary statistics of the sample though and conclusions about the American population cannot be made by simply comparing these statistics. Though these statistics are similar, there have small differences and hypothesis tests would help understand if they are relevant to the population.

Question 4

A. Hypothesis Test

This information is here to assist in the descriptions for the hypothesis tests.

#remove NAs in BPSysAve
work_test <- nhanes %>%
  filter(!is.na(BPSysAve),
         Work!= "Looking")
#calculate the quanitites of each Work status for hypothesis test descriptions 
work_count <- nhanes %>%
  count(Work)
#calculates the current difference in mean BPSysAve between Work and not-working Americans
diff_bpSys <- average_summary$mean[2]- average_summary$mean[3]

Null Hypothesis: There is no difference in average BP systolic measurements between working and not-working people in the American population. Alternative Hypothesis: There is a difference in average BP systolic measurements between working and not-working people in the American population. Difference in means between working and not-working Americans in this sample is 1.1573098

To evaluate whether average systolic blood pressures are different for Americans between 26 & 64 who are not working and working, you would take the cards from the sample with BPSysAverage values for Working and Not Working Americans mix them randomly. Then, separate them into two groups: for Working Americans group take 3678 of the cards and for not working americans take 1258 cards. Compute the difference in averages (mean could be a statistic, for example) between the two groups and repeat the simulation 15,000 times, in total. Then, you would calculate the proportion of differences that are greater than the observed difference (rdiff_bpSys) and multiply by two (because a difference is two-tailed) i.e. the p-value. If the p-value is above 0.05, then the null hypothesis cannot be rejected and there is not evidence that there is a difference in the average BP Systolics of Working and Not-Working people. If the p-value is below 0.05, then the null hypothesis can be rejected and there is some evidence that there is a difference in the average BP Systolics of Working and Not-Working people

B. Confidence Interval

To create a confidence interval for estimating the difference between the average systolic blood pressures for Americans who are "Working" and NotWorking", you would first create a bootstrap. This bootstrap would entail separating chips of the two Work levels into two bags with the Working Americans bag would have the sample's 3678 chips and the not-Working American chips bag having 1258 pieces Then, one would take out 3678 chips with replacement from the Working chips bag and 1258 chips from the not-Working chips bag, then calculate the difference in the average (mean, for example) of the two groups. Then, completing this simulation 1500 times would produce a distribution of values to help make a confidence interval. Then a confidence interval can be created to calculate the range of values that are in the middle __% of this bootstrap. The confidence interval can then be interpreted with the data by saying that there is __% confidence for the difference in the average blood pressures between the Working and Not-Working people in the American population to be between [_,_].

Question 5

A permute hypothesis test to determine if an actual difference in means exists between the Working and Not Working population.

null_nhanes <- work_test %>%
  specify(response = BPSysAve,
          explanatory = Work) %>%
  hypothesize(null = "independence") %>%
  generate(1500, type = "permute") %>%
  calculate(stat = "diff in means",
            order = c("NotWorking", "Working"))

ggplot(data = null_nhanes, aes(stat)) +
  geom_histogram(binwidth = 0.1) + 
  labs(title = "Null Distribution of ")

p_work <- null_nhanes %>%
  filter(stat >= diff_bpSys) %>%
  summarise(p_value = 2 * n()/nrow(null_nhanes))

Null Hypothesis: There is no difference in the actual mean between Working and Not Working people in the population. Alternative Hypothesis: There is a difference in the actual mean between Working and Not Working people in the population.

The p-value for this null distribution given the observation is 0.0306667 (a value below 0.05). Since this p-value is so low, there is evidence that the actual mean BP Systolic is different between people who work and do not work in the population and the null-hypothesis can be rejected.

bs_work <- nhanes %>%
  filter(Work != "Looking",
         !is.na(BPSysAve)) %>%
  specify(response = BPSysAve,
          explanatory = Work) %>%
  generate(1500, type = "bootstrap") %>%
  calculate(stat = "diff in means",
            order = c("NotWorking","Working"))
con_work <- bs_work %>%
summarize(lower_bound = quantile(stat, 0.025),
           upper_bound = quantile(stat, 0.975))
ggplot(data = bs_work, aes(stat)) +
  geom_histogram(binwidth = 0.1) + 
  labs(title = "Bootstrap distribution of difference in means",
       subtitle = "between Working and NotWorking people",
       x = "Difference in means") + 
  geom_vline(xintercept = con_work$lower_bound, color = "red") +
  geom_vline(xintercept = con_work$upper_bound, color = "red")

Question 6

Creates a new variable called healthy to determine if a person gets a healthy amount of sleep and finds the percentage of people who get a healthy amount of sleep and the percentage of people who do not in this sample.

nhanes <- nhanes %>%
  mutate(healthy =
           case_when(
             SleepHrsNight <= 9 & SleepHrsNight >= 7 ~ "yes",
             SleepHrsNight < 7 | SleepHrsNight > 9 ~ "no"
             )
         )

sleep_percentage <- nhanes %>%
  filter(!is.na(healthy)) %>%
  count(healthy) %>%
  mutate(percentage = n / sum(n) * 100)

The percentage of people who get a healthy amount of sleep in this sample is NA and the percentage of people who get a unhealthy amount of sleep is . This shows that there is a signficant difference in the quantity of people who have unhealthy sleeping habits and people who have healthy sleeping habits.

Find percentages of people who get healthy and unhealthy amount of sleep given their work status.

work_health <- nhanes %>%
  filter(!is.na(healthy)) %>%
  group_by(Work) %>%
  count(healthy) %>%
  mutate(percentage = n/sum(n))
ggplot(data = work_health, aes(x = healthy, y = percentage, fill = healthy)) + 
  geom_col() +
  facet_grid( . ~ Work) + 
  guides(fill=FALSE) +
  labs(title = "Percentage of Healthy and Unhealthy Sleeping Habits",
       subtitle = "between Various Working Status", 
       x= "Healthy Sleeping Habit?")

In this sample, the people who are looking for work seems to have the highest percentage of people who have healthy sleeping habits. People who are not working have the greatest percentage of bad sleeping habits. There appear to be small differences between the different work statuses, but it is difficult to conclude if these differences are significant enough to appear in the general population as well. A hypothesis test would help conclude if the visuazlized differences here can be seen in the population.

Question 7

Creates datasets for healthy and unhealthy sleepers and counts the number of observations.

healthy_sleep <- nhanes %>%
  filter(healthy == "yes")
n_unhealthy <- nrow(healthy_sleep)

unhealthy_sleep <- nhanes %>%
  filter(healthy == "no")
n_healthy <- nrow(unhealthy_sleep)

The number of healthy sleepers in this sample and in this new dataset is 3070. The number of unhealthy sleepers in this sample and in this new dataset is 2041.

Question 8

Finds the current difference in BPSys between who are working and people who are not working. Then performs a simulation to check whether that difference is statistically significant to the American population.

#current difference in mean for BPSysAve between healthy sleeping people who work and do not work
healthy_summary <- healthy_sleep %>%
  filter(!is.na(BPSysAve)) %>%
  group_by(Work) %>%
  summarise(mean = mean(BPSysAve))
diff_healthy <- healthy_summary$mean[2] - healthy_summary$mean[3]

null_healthy <- healthy_sleep %>%
  filter(Work != "Looking") %>%
  specify(response = BPSysAve,
          explanatory = Work) %>%
  hypothesize(null = "independence") %>%
  generate(1500, type = "permute") %>%
  calculate(stat = "diff in means",
            order = c("NotWorking", "Working"))

ggplot(data = null_nhanes, aes(stat)) +
  geom_histogram(binwidth = 0.1) + 
  labs(title = "Null distribution of Difference in Means",
       subtitle = "between NotWorking and Working Americans", 
       x = "Difference in Mean")

p_healthy <- null_nhanes %>%
  filter(stat >= diff_healthy) %>%
  summarise(p_value = 2 * n()/nrow(null_nhanes))

Null Hypothesis: There is a statistically significant difference in the actual mean of BP Systolic Averages in the healthy sleeping American population between Working and Notworking people.

Alternative Hypothesis: There is not a statistically significant difference in the actual mean of BP Systolic Averages in the healthy sleeping American population between Working and Notworking people.

The p-value for this null distribution given the observation is 0.0013333. Since the p-value is below 0.05, the null hypothesis can be rejected and there is evidence that a statistically significant difference in the actual mean of BP Systolic Averages exists in the healthy sleeping American population between Working and Notworking people.

Creates a bootstrap to find the confidence interval for the previous simulation.

bs_healthy <- healthy_sleep %>%
  filter(Work != "Looking",
         !is.na(BPSysAve)) %>%
  specify(response = BPSysAve,
          explanatory = Work) %>%
  generate(1500, type = "bootstrap") %>%
  calculate(stat = "diff in means",
            order = c("NotWorking","Working"))
con_healthy <- bs_healthy %>%
summarize(lower_bound = quantile(stat, 0.025),
           upper_bound = quantile(stat, 0.975))

There is 95% confidence that the mean BP Systolic for Not working people is greater than that of Non-working by at least0.6645223 mm HG and at most 3.4686494 mm HG.

Question 9

Finds the observed difference in mean of the BPSystolic averages between people wit unhealthy sleeping habits in the sample. Then, a simulation is conducted to determine if the difference in means is significant to the population.

unhealthy_summary <- unhealthy_sleep %>%
  filter(!is.na(BPSysAve)) %>%
  group_by(Work) %>%
  summarise(mean = mean(BPSysAve))

diff_unhealthy <- unhealthy_summary$mean[2] - unhealthy_summary$mean[3]

null_unhealthy <- unhealthy_sleep %>%
  filter(Work != "Looking") %>%
  specify(response = BPSysAve,
          explanatory = Work) %>%
  hypothesize(null = "independence") %>%
  generate(1500, type = "permute") %>%
  calculate(stat = "diff in means",
            order = c("NotWorking", "Working"))

ggplot(data = null_nhanes, aes(stat)) +
  geom_histogram(binwidth = 0.1) +
  labs(title = "Null Distribution of Difference in Means for people of Unhealthy Sleep habits",
                 subtitle = "between NotWorking and Working people",
       x = "Difference in means")

p_unhealthy <- null_nhanes %>%
  filter(stat >= diff_unhealthy) %>%
  summarise(p_value = 2 * n()/nrow(null_nhanes))

Null Hypothesis: There is a statistically significant difference in the actual mean of BP Systolic Averages in the unhealthy sleeping American population between Working and Notworking people.

Alternative Hypothesis: There is not a statistically significant difference in the actual mean of BP Systolic Averages in the unhealthy sleeping American population between Working and Notworking people.

The p-value for this null distribution given the observation is 0.8226667. Since the p-value is above 0.05, the null hypothesis cannot be rejected and there is not evidence that a statistically significant difference in BP Systolic Averages exists in the unhealthy sleeping American population between Working and Notworking people.

Question 10

Even though not working appears to cause Americans to have higher blood pressure on average, one cannot assume causation because there could have been a confounding variable that caused this to occur. Healthy sleeping habits may have resulted in the higher blood pressure average, or it could not have. However, the presence of this confounding variable that correlates with the blood pressure average shows how uncertain the relationship between two variables can be.

Question 11

Fits a regression model predicting systolic blood pressure (BPSysAve) from Age, BMI, TotChol, HealthGen, SmokeNow, DaysPhysHlthBad, DaysMentHlthBad, PhysActive, Work, the categorical sleep health variable you created earlier and the interaction between Work and the categorical sleep health variable you created earlier.

(nhanes_lm <- lm(BPSysAve ~ Age + BMI + TotChol + HealthGen + SmokeNow +
     DaysPhysHlthBad + DaysMentHlthBad + PhysActive + Work + 
       healthy + Work * healthy, data = nhanes
     )
 )

This model, on average, expects that when all other variables are held constant, average systolic blood pressure reading to go up by 0.39 mm Hg for each additional year in age, by 1.06 mm HG if a person smokes ciggarettes regularly, by 2.43 mm HG if a person rated their own general health as good, by 4.76 mm HG if a person rated their own general health as fair, and by 5.44 if they rated their own general health as poor. In addition, this model, on average, expects that when all other variables are held constant that their average systolic blood pressure reading to go down by 0.18 if they rated their own general health as very good.

Question 12

BPSysAve = 86.43595 + 0.38520(Age) + 0.23645(BMI) + 1.48259(Total cholestorol) - 0.18347(Very good - general health) + 2.42859(Good - general health) + 4.76434(Fair - general health) + 5.44528(Poor - general health) + 1.05772(Currently Smokes Regularly) - 0.13373(Days Physical Health was bad) - 0.05664(Days Mental Health was bad) - 0.46792(Participates in Physical Activity) + 2.72941 (Not Working) + 0.47221(Working) ## Intercept, Working Coefficient, and Not Working coefficients are adjusted for healthy sleeping habits. As a result, interaction variables are eliminated.

This model, on average, expects that when all other variables are held constant, for the average systolic blood pressure reading to go up by 2.73 mm HG if a person is not working and up by 0.47 mm HG if a person is working.

BPSysAve = 82.21932 + 0.38520(Age) + 0.23645(BMI) + 1.48259(Total cholestorol) - 0.18347(Very good - general health) + 2.42859(Good - general health) + 4.76434(Fair - general health) + 5.44528(Poor - general health) + 1.05772(Currently Smokes Regularly) - 0.13373(Quantity of Days Physical Health was bad) - 0.05664(Quantity of Days Mental Health was bad) - 0.46792(Participates in Physical Activity) + 2.68954 (Not Working) + 5.85298(Working)

This model, on average, expects that when all other variables are held constant, for the average systolic blood pressure reading to go up by 2.69 mm HG if a person is not working and up by 5.85 mm HG if a person is working.

Question 13

Perform model selection based on AIC, and print a summary output of the final model.

glance(nhanes_lm)$AIC

final_model <- step(nhanes_lm, direction = "forward")

tidy(final_model) %>% 
  select(term, estimate) %>%
  mutate(estimate = round(estimate, 3))

The final model did not change because the full model itself had the lowest AIC after forward selection was conducted. Backward selection/forward selection was not conducted because there is a conflict that happens when removing a variable because linear models omit data when there are NAs, and when removing a variable in backward selection, R does not retain the observations that were lost when the linear model was made. Still, this indicates that the variables in the full model had a lot of NAs and there needs to be another method to solve the problem that does not remove the data, but fills it in instead. https://stackoverflow.com/questions/46817564/stepwise-regression-error-in-r

# nhanes_full <- mice(nhanes, maxit=50, meth='pmm', seed=500)
# summary(nhanes_full)

This method was attempted as well, but there was an issue with the computational ability of the computer and teh dataset.

Question 14

lm_r <- glance(final_model)$r.squared

The proportion of the variability in this sample that this linear model considers is given by the r.squared 0.1326809. This r squared appears a bit low, but this is probably one of the best (if not the best) models since forward selection AIC step process showed that this model was most efficient. Thus, this r-squared value explains part of the predictive ability of the linear model in estimating BP Systolic measurements of a person.

Bonus

The protocol was unclear what to do if all three variables are present, so this is used to investigate the rules.

nhanes %>%
  filter(!is.na(BPSys1),
         !is.na(BPSys2),
         !is.na(BPSys3)) %>%
  select(BPSys1, BPSys2, BPSys3, BPSysAve)

This shows that when all three measurements are present, then the mean of the 2nd and 3rd measurements is used to find the BPSysAve.

dat <- NHANES %>%
 mutate(condition  =
          case_when( #rule 1: when all three measurements are either 0 or NA
                    (BPSys1 == 0 | is.na(BPSys1)) & (BPSys2 == 0 | is.na(BPSys2)) &
                     (BPSys3 == 0 | is.na(BPSys3)) & (BPSysAve == 0 | is.na(BPSysAve)) ~ "satisfied",
                    #a rule for when measurements were all NAs was not in the protocol  
                    
                    #rule: 2 NAs and one measurement above zero
                    is.na(BPSys1) & is.na(BPSys2) & BPSys3 > 0 & 
                      BPSys3 == BPSysAve ~ "satisfied",
                   
                    is.na(BPSys2) & is.na(BPSys3) & BPSys1 > 0 &
                      BPSys1 == BPSysAve ~ "satisfied",
                   
                    is.na(BPSys1) & is.na(BPSys3) & BPSys2 > 0 &
                      BPSys2 == BPSysAve ~ "satisfied", 
                    
                    #rule: one NA and two measurements above zero
                    (is.na(BPSys1) | BPSys1 == 0) & (!is.na(BPSys2) | BPSys2 > 0) &
                      (!is.na(BPSys3) | BPSys3 > 0) & BPSys3 == BPSysAve ~ "satisfied",
                   
                    (is.na(BPSys2) | BPSys2 == 0) & (!is.na(BPSys1) | BPSys1 > 0) & 
                      (!is.na(BPSys3) | BPSys3 > 0) & BPSys3 == BPSysAve ~ "satisfied",
                    
                    (is.na(BPSys3) | BPSys3 == 0) & (!is.na(BPSys1) | BPSys1 > 0) & 
                      (!is.na(BPSys2) | BPSys2 > 0) & BPSys2 == BPSysAve ~ "satisfied",
                    
                    #rule: all three measurements are available and above zero. 
                    (BPSys1 > 0 | !is.na(BPSys1)) & (BPSys2 > 0 | !is.na(BPSys2)) 
                    & (BPSys3 > 0 | !is.na(BPSys3)) & (BPSysAve == (BPSys2 + BPSys3)/2) ~ "satisfied",
                    
                    TRUE ~ "unsatisfied"
                    )
        )
dat %>%
  filter(condition == "unsatisfied") %>%
  select(BPSys1, BPSys2, BPSys3, BPSysAve)

dat %>%
  count(condition)

There are a lot of instances in which the conditions were not satisfied because a rule was not correctly implemented when creating the BPSysAve variable. The rule is: "If only two blood pressure readings were obtained, the second blood pressure reading is the average."

In this table above, however, it is apparant that when the first measure is NA and the last two are non-NA, the dataset has BPSysAve as the arithmentic average of the two. The mentioned rule would require BPSys3 to be BPSysAve since only two blood pressure readings were obtained.

STA 199 - Spring 2018 - Midterm 02

Academic Honesty Statement