Part 1
Data Wrangling
#get Variables needed
dataset1 <- NHANES[c(3,9,13)] %>%
na.omit()
#filter out population not of interest from data set
dataset1 <- filter(dataset1, Gender=="male")
dataset1 <- filter(dataset1, Education %in% c("High School","College Grad"))
#remove population column
dataset1 <- dataset1 [c(2,3)]
#group by df
ByEducation1 <- dataset1 %>% group_by(Education)
#create individual df by education
HS1 <- dataset1 %>%
filter(Education == "High School")
CG1 <- dataset1 %>%
filter(Education == "College Grad")
str(dataset1)
## tibble [803 x 2] (S3: tbl_df/tbl/data.frame)
## $ Education: Factor w/ 5 levels "8th Grade","9 - 11th Grade",..: 3 3 3 5 3 5 5 3 5 5 ...
## $ Poverty : num [1:803] 1.36 1.36 1.03 3.28 2.18 5 5 2.31 5 1.07 ...
## - attr(*, "na.action")= 'omit' Named int [1:3312] 4 6 7 14 17 20 23 27 35 38 ...
## ..- attr(*, "names")= chr [1:3312] "4" "6" "7" "14" ...
For this data set, the Poverty variable measures the ratio of family income to the poverty guidelines, and smaller numbers indicate more poverty. (Outcome Variable)
The Education variable is a categorical variable that tells us what education level each person has achieved; here we are comparing College Grads and High School Grads. (Predictor Variable)
Lastly, we filter the data down to males only. (Population of Interest)
Statistics
#create variability statistics
VarStats1 <- ByEducation1 %>% summarise(Variance = var(Poverty),
Standard_Deviation = sd(Poverty),
Winsorized_Standard_Deviation = winsor.sd(Poverty,trim=.1),
Median_Absolute_Deviation = mad(Poverty),
Inter_Quartile_Range = IQR(Poverty),
Total_Observations = n())
LocationStats1 <- ByEducation1 %>% summarise(Mean = mean(Poverty),
Trimmed_mean = mean(Poverty, tr = .1),
Winsorized_Mean = winsor.mean(Poverty),
Median = median(Poverty),
Mode = mode(Poverty),
Total_Observations = n())
#shape statistics
ShapeStats1 <- ByEducation1 %>% summarise(Skew = skew(Poverty))
Graphing of Statistics
#variability stats
highchart() %>%
hc_chart(type = "column") %>%
hc_add_theme(hc_theme_538()) %>%
hc_xAxis(categories = VarStats1$Education) %>%
hc_add_series(VarStats1$Variance, name = "Variance") %>%
hc_add_series(VarStats1$Standard_Deviation, name = "Standard Deviation") %>%
hc_add_series(VarStats1$Winsorized_Standard_Deviation, name = "Winsorized Standard Deviation") %>%
hc_add_series(VarStats1$Median_Absolute_Deviation, name = "Median Absolute Deviation") %>%
hc_add_series(VarStats1$Inter_Quartile_Range, name = "Inter Quartile Range") %>%
hc_title(text = "Variability Statistics by Group")
VarStats1
## # A tibble: 2 x 7
## Education Variance Standard_Deviation Winsorized_Standard~ Median_Absolute~
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 High School 2.30 1.52 1.48 1.68
## 2 College Grad 1.75 1.32 1.20 0
## # ... with 2 more variables: Inter_Quartile_Range <dbl>,
## # Total_Observations <int>
#Location stats
highchart() %>%
hc_chart(type = "column") %>%
hc_add_theme(hc_theme_538()) %>%
hc_xAxis(categories = LocationStats1$Education) %>%
hc_add_series(LocationStats1$Mean, name = "Mean") %>%
hc_add_series(LocationStats1$Trimmed_mean, name = "Trimmed Mean") %>%
hc_add_series(LocationStats1$Winsorized_Mean, name = "Winsorized Mean") %>%
hc_add_series(LocationStats1$Median, name = "Median") %>%
hc_add_series(LocationStats1$Mode, name = "Mode") %>%
hc_title(text = "Location Statistics by Group")
LocationStats1
## # A tibble: 2 x 7
## Education Mean Trimmed_mean Winsorized_Mean Median Mode Total_Observatio~
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
## 1 High School 2.57 2.53 2.51 2.31 5 337
## 2 College Grad 4.10 4.36 4.37 5 5 466
#shape stats
highchart() %>%
hc_chart(type = "column") %>%
hc_add_theme(hc_theme_538()) %>%
hc_xAxis(categories = ShapeStats1$Education) %>%
hc_add_series(ShapeStats1$Skew, name = "Skew") %>%
hc_title(text = "Shape Statistics by Group")
ShapeStats1
## # A tibble: 2 x 2
## Education Skew
## <fct> <dbl>
## 1 High School 0.322
## 2 College Grad -1.33
Within the variability statistics we have several ways to quantify spread: the variance, standard deviation, winsorized standard deviation, median absolute deviation, and interquartile range. The variance is the average squared deviation from the mean; it is not a robust statistic because extreme values pull it around. The standard deviation is the square root of the variance, which puts the measure back in the original units and makes it easier to interpret, but it is still not robust. The winsorized standard deviation is like the ordinary standard deviation except that the most extreme values are recoded to the next most extreme value (here the top and bottom 10%), which makes it robust. The median absolute deviation is the median of the absolute deviations from the median (scaled so it is comparable to the standard deviation), another robust measure. The interquartile range is the range between the 1st and 3rd quartiles; it is robust because it ignores the extreme values entirely.
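As a quick illustration of why the winsorized and median-based measures are robust, here is a small sketch on a made-up vector with one extreme value (the vector and the 10% trim are purely illustrative, using the same winsor.sd() and mad() functions as above):
#toy example (not NHANES data): one extreme value inflates sd() but barely moves the robust measures
x <- c(1.2, 1.5, 1.8, 2.0, 2.1, 2.4, 2.6, 3.0, 3.2, 15)
sd(x)                    #ordinary standard deviation, pulled up by the 15
winsor.sd(x, trim = .1)  #recodes the most extreme 10% at each end before computing the sd
mad(x)                   #median absolute deviation, unaffected by the outlier
IQR(x)                   #interquartile range, also unaffected by the outlier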
Next are the location statistics: the mean, trimmed mean, winsorized mean, median, and mode. The mean is the average of all values; it is not robust because extreme values enter directly into the calculation. The trimmed mean removes the highest 10% and lowest 10% of values before averaging, which makes it robust. The winsorized mean instead recodes the values below the 10th percentile and above the 90th percentile to those percentile values before averaging; no data points are dropped, and the result is still robust. The median is the middle value in the series and is robust because extreme values do not affect it. Finally, the mode is the most frequently occurring value in the series; since it depends only on the most common value, extreme values have little influence on it, though it can be unstable when few values repeat.
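Using the same made-up vector, the location measures behave the same way; this is only a sketch of the idea, not part of the NHANES analysis:
#toy example: the outlier drags the ordinary mean but not the robust location measures
x <- c(1.2, 1.5, 1.8, 2.0, 2.1, 2.4, 2.6, 3.0, 3.2, 15)
mean(x)                    #ordinary mean, pulled toward 15
mean(x, trim = .1)         #trimmed mean: drops the lowest and highest 10% before averaging
winsor.mean(x, trim = .1)  #winsorized mean: recodes the extreme 10% at each end, then averages
median(x)                  #middle value, unaffected by the outlier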
Finally we have the shape statistics, here just the skew. The skew tells us which way the distribution is stretched. The high school group is slightly positively skewed, meaning the longer tail extends toward higher values, while the college graduates are negatively skewed, with the longer tail extending toward lower values and most of the values bunched near the top of the scale.
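For reference, the sign of the skew statistic follows the longer tail; a small sketch with the same skew() function on made-up vectors makes the direction concrete:
#toy example: a long right tail gives positive skew, a long left tail gives negative skew
right_tailed <- c(1, 1, 2, 2, 2, 3, 3, 4, 6, 9)
left_tailed <- 10 - right_tailed
skew(right_tailed)  #positive: most values are low with a few large ones
skew(left_tailed)   #negative: most values are high with a few small ones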
General Graphing of Data
#density plot
hchart(density(HS1$Poverty), type = "area", name = "High School Grads") %>%
hc_add_theme(hc_theme_538()) %>%
hc_add_series(density(CG1$Poverty), type = "area", name = "College Grad") %>%
hc_title(text = "Poverty Level by Education Among Males")
#Box and Whisker
hcboxplot(x = dataset1$Poverty, var = dataset1$Education, name = "Poverty Level", color = "#2980b9", outliers = TRUE) %>%
hc_chart(type = "column") %>%
hc_title(text = "Poverty Level by Education Among Males") %>%
hc_yAxis(title = list(text = "Poverty Level")) %>%
hc_add_series(data = dataset1, type = "scatter", hcaes(x = "Education", y = "Poverty", group = "Education"))%>%
hc_plotOptions(scatter = list(color = "red", marker = list(radius = 2, symbol = "circle", lineWidth = 1))) %>%
hc_plotOptions(scatter = list(jitter = list(x = .1, y = 0))) %>%
hc_add_theme(hc_theme_538())
#histograms
hchart(CG1$Poverty, name = "College Grads") %>%
hc_add_theme(hc_theme_538()) %>%
hc_xAxis(title = list(text = "Poverty Level")) %>%
hc_yAxis(title = list(text = "Values")) %>%
hc_title(text = "Histogram for College Grads by Poverty Level")
hchart(HS1$Poverty, name = "High School Grads") %>%
hc_add_theme(hc_theme_538()) %>%
hc_xAxis(title = list(text = "Poverty Level")) %>%
hc_yAxis(title = list(text = "Values")) %>%
hc_title(text = "Histogram for High School Grads by Poverty Level")
From the density plots and histograms we can see that college grads have a mode of 5, the highest possible value of the income-to-poverty ratio (i.e., the least poverty), and their distribution is left skewed. The distribution for high school grads is much flatter and is bimodal, with peaks around 1.4 and 5.
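To make the two peaks concrete, a short sketch (using the same HS1 data frame) pulls the local maxima out of the high school density estimate; small wiggles in the estimate can add minor extra peaks:
#locate local maxima of the high school density estimate; the main peaks should fall near 1.4 and 5
d <- density(HS1$Poverty)
peaks <- which(diff(sign(diff(d$y))) == -2) + 1
d$x[peaks]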
MMDifferences <- dataset1 %>%
summarise(Mean_Difference = mean(dataset1$Poverty[dataset1$Education=="High School"])-mean(dataset1$Poverty[dataset1$Education=="College Grad"]),
Mean_Difference_Trimmed = mean(dataset1$Poverty[dataset1$Education=="High School"],tr=.1)-mean(dataset1$Poverty[dataset1$Education=="College Grad"],tr=.1),
Mean_Difference_Winsorized = winsor.mean(dataset1$Poverty[dataset1$Education=="High School"],tr=.1)-winsor.mean(dataset1$Poverty[dataset1$Education=="College Grad"],tr=.1),
Median_Difference = median(dataset1$Poverty[dataset1$Education=="High School"])-
median(dataset1$Poverty[dataset1$Education=="College Grad"]))
Mean Differences Charted
#mean differences Charted
highchart() %>%
hc_chart(type = "column") %>%
hc_add_theme(hc_theme_538()) %>%
hc_add_series(MMDifferences$Mean_Difference, name = "Mean Difference") %>%
hc_add_series(MMDifferences$Mean_Difference_Trimmed, name = "Trimmed Mean Difference") %>%
hc_add_series(MMDifferences$Mean_Difference_Winsorized, name = "Winsorized Mean Difference") %>%
hc_add_series(MMDifferences$Median_Difference, name = "Median Difference") %>%
hc_title(text = "Mean and Median Differences")
MMDifferences
## # A tibble: 1 x 4
## Mean_Difference Mean_Difference_Trimmed Mean_Difference_Wins~ Median_Differen~
## <dbl> <dbl> <dbl> <dbl>
## 1 -1.53 -1.82 -1.55 -2.69
The mean differences show the gap in poverty level between high school graduates and college graduates. We have three types of mean difference: the regular mean, the trimmed mean, and the winsorized mean. The trimmed and winsorized mean differences are robust because their underlying statistics are robust, while the regular mean difference is not. All of the differences (high school minus college grad) are negative, ranging from -1.53 to -2.69, meaning high school graduates' income-to-poverty ratios are 1.53 to 2.69 points lower than college graduates'; the median difference shows the same pattern. In other words, college grads show less poverty on every measure.
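The same sign convention can be read straight off the location statistics table above, taking high school minus college grad:
#differences in group means and medians read from LocationStats1 (high school minus college grad)
LocationStats1$Mean[LocationStats1$Education == "High School"] -
LocationStats1$Mean[LocationStats1$Education == "College Grad"]
LocationStats1$Median[LocationStats1$Education == "High School"] -
LocationStats1$Median[LocationStats1$Education == "College Grad"]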
Outliers
dataset1 %>% filter(Education=="College Grad") %>%
mutate(zscore=scale(Poverty)) %>% filter(zscore>3 | zscore< -3)
## # A tibble: 3 x 3
## Education Poverty zscore[,1]
## <fct> <dbl> <dbl>
## 1 College Grad 0 -3.10
## 2 College Grad 0.09 -3.04
## 3 College Grad 0.09 -3.04
dataset1 %>% filter(Education=="High School") %>%
mutate(zscore=scale(Poverty)) %>% filter(zscore>3 | zscore< -3)
## # A tibble: 0 x 3
## # ... with 3 variables: Education <fct>, Poverty <dbl>, zscore <dbl[,1]>
Using a cutoff of |z| > 3, the college grad group has three outliers: one at a poverty ratio of 0 and two at 0.09. The high school group has no outliers in the data set.
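The z-scores behind this cutoff can be reproduced by hand from the group mean and standard deviation; a quick sketch for the college grad group:
#z-scores by hand for the college grad group: (value - group mean) / group sd
cg_mean <- mean(CG1$Poverty)
cg_sd <- sd(CG1$Poverty)
(0 - cg_mean) / cg_sd     #z-score for the flagged poverty ratio of 0
(0.09 - cg_mean) / cg_sd  #z-score for the flagged poverty ratios of 0.09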
Observation Statistic
obs=(mean(dataset1$Poverty[dataset1$Education == "High School"]))-
(mean(dataset1$Poverty[dataset1$Education == "College Grad"]))
Randomization Test
#randomization test
N=1000000
meandiff=numeric(N)
for (i in 1:N) {
dat <- sample(dataset1$Poverty,803,replace=FALSE)
meandiff[i] <- (mean(dat[1:337])-mean(dat[338:803]))
}
Graphing of Randomization Test
hchart(meandiff, name = "Distribution Points") %>%
hc_add_theme(hc_theme_538()) %>%
hc_title(text = "Distribution of Poverty Level Mean Differences Under H0")
P-Value
pvalue <- length(which(abs(meandiff)>=abs(obs)))/N
pvalue
## [1] 1
The p-value here gives us no evidence against H0.
Bootstrapping of Mean Differences
N=1000000
boot=numeric(N)
for (i in 1:N) {
boot[i] <- mean(sample(dataset1$Poverty[dataset1$Education == "High School"],337,replace=T))-
mean(sample(dataset1$Poverty[dataset1$Education == "College Grad"],466,replace=T))
}
Visualizing Bootstrapping
hchart(boot, name = "Bootstrap Distribution Points") %>%
hc_add_theme(hc_theme_538()) %>%
hc_title(text = "Bootstrap Distribution of Poverty Level Mean Differences Under H0")
Bootstrap Standard Error of Mean Differences
SEb=sd(boot)
SEb
## [1] 0.1026668
T Interval
n=1000000
moe=qt(0.975,n-1)*SEb
moe
## [1] 0.2012235
obs+c(-1,1)*moe
## [1] -1.726675 -1.324228
Percentile Interval
quantile(boot,c(0.025,0.975))
## 2.5% 97.5%
## -1.725593 -1.323169
Part 2
Data Wrangling
#get variables needed
dataset2 <- NHANES[c(3,13,50)] %>%
na.omit()
#filter out population not of interest
dataset2 <- filter(dataset2, Gender=="male")
#remove population column
dataset2 <- dataset2[c(2,3)]
str(dataset2)
## tibble [3,538 x 2] (S3: tbl_df/tbl/data.frame)
## $ Poverty : num [1:3538] 1.36 1.36 1.36 2.2 5 2.2 1.24 1.27 1.03 3.28 ...
## $ SleepHrsNight: int [1:3538] 4 4 4 7 5 4 7 6 6 6 ...
## - attr(*, "na.action")= 'omit' Named int [1:2842] 4 6 7 14 17 20 23 27 35 39 ...
## ..- attr(*, "names")= chr [1:2842] "4" "6" "7" "14" ...
For this part we keep males as the population of interest. The predictor variable is now sleep hours per night, and the outcome variable is again the poverty ratio based on the poverty guidelines.
Create Model
model <- lm(Poverty ~ SleepHrsNight, data = dataset2)
fit <- augment(model) %>%
arrange(SleepHrsNight)
Regression Plot
dataset2 %>%
hchart('scatter', hcaes(x = SleepHrsNight, y = Poverty)) %>%
hc_add_theme(hc_theme_538()) %>%
hc_title(text = "Scatter Plot of Sleep Hours and Poverty") %>%
hc_add_series(fit, type = "line", hcaes(x = SleepHrsNight, y = .fitted), name = "Regression", id = "Regression")
#regression coefficients
model$coefficients
## (Intercept) SleepHrsNight
## 2.69732469 0.03924504
#RSE / Residual Standard Error
summary(model)$sigma
## [1] 1.662045
#R Squared / Coefficient of Determination
summary(model)$r.squared*100
## [1] 0.09476255
The regression coefficient shows that for every one-hour increase in sleep, the predicted poverty ratio increases by 0.0392. The residual standard error is the average error when predicting poverty from sleep hours, which here is about 1.66. The coefficient of determination is the proportion of variance in poverty accounted for by sleep hours, which ends up being only about 0.09%.
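As a check on what these numbers mean, the same quantities can be recovered by hand from the fitted model; a short sketch using the model and dataset2 objects from above:
#reproduce a prediction, the residual standard error, and R squared by hand
coef(model)[1] + coef(model)[2]*8  #predicted poverty ratio for 8 hours of sleep
sqrt(sum(resid(model)^2) / df.residual(model))  #residual standard error: typical size of a prediction error
1 - sum(resid(model)^2) / sum((dataset2$Poverty - mean(dataset2$Poverty))^2)  #R squared: proportion of variance explained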
Model Predictions
PredictDF <- data.frame(SleepHrsNight = c(2,3,4,5,6,7,8,9,10,11,12))
predictions <- predict.lm(model, PredictDF, level = 0.95, interval = "none")
predictions <- as.data.frame(predictions)
predictions["SleepHours"] <- c(2,3,4,5,6,7,8,9,10,11,12)
predictions
## predictions SleepHours
## 1 2.775815 2
## 2 2.815060 3
## 3 2.854305 4
## 4 2.893550 5
## 5 2.932795 6
## 6 2.972040 7
## 7 3.011285 8
## 8 3.050530 9
## 9 3.089775 10
## 10 3.129020 11
## 11 3.168265 12
Model Statistics
Graphing Model Stats
hcboxplot(x = resid(model)) %>%
hc_add_theme(hc_theme_538()) %>%
hc_title(text = "Residuals from Regression Analysis")
highchart() %>%
hc_chart(type = "column") %>%
hc_add_theme(hc_theme_538()) %>%
hc_add_series(PearsonR$., name = "Pearson R", id = "Pearson R") %>%
hc_add_series(SpearmanRho$., name = "Spearman Rho") %>%
hc_add_series(KendallTau$., name = "Kendall Tau") %>%
hc_title(text = "Model Statistics")
PearsonR*100
## .
## 1 3.078353
SpearmanRho*100
## .
## 1 1.320926
KendallTau*100
## .
## 1 1.124039
In this section we look at the correlation between sleep hours per night and poverty, which we can measure in three ways: Pearson's R, Spearman's Rho, and Kendall's Tau. Pearson's R uses all values, including extreme ones, and gives a correlation of about 3.08%. Spearman's Rho and Kendall's Tau are both robust alternatives; they give correlations of about 1.32% and 1.12%, respectively.
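One way to see why Spearman's Rho is robust: it is just Pearson's R computed on the ranks of the data, so the size of an extreme value no longer matters, only its position. A quick sketch on dataset2:
#Spearman's rho equals Pearson's r on the ranks (ties averaged)
cor(rank(dataset2$Poverty), rank(dataset2$SleepHrsNight))
cor(dataset2$Poverty, dataset2$SleepHrsNight, method = "spearman")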
Randomization Test
beta <- model$coefficients[2]
N=25000
regcoef=numeric(N)
for (i in 1:N) {
y<-sample(dataset2$Poverty,3538,replace=F)
x<-sample(dataset2$SleepHrsNight,3538,replace=F)
mod<-lm(y~x)
regcoef[i] <- mod$coefficients[2]
}
Visualizing Regression Coefficients
hchart(regcoef, name = "Data Points") %>%
hc_add_theme(hc_theme_538()) %>%
hc_title(text = "Regression Coefficient Distribution Under H0")
P Value
pvalue=length(which(abs(regcoef)>=abs(beta)))/N
pvalue
## [1] 0.06692
The p-value here (about 0.067) does not provide evidence against H0 at the 0.05 level.
Bootstrapped Estimate of Population Regression Coefficient
N=25000
boot=numeric(N)
for (i in 1:N) {
dat <- dataset2[sample(nrow(dataset2),3538,replace=T),]
mod <- lm(Poverty ~ SleepHrsNight,data=dat)
boot[i] <- mod$coefficients[2]
}
Visualizing Bootstrapped Estimate of Population Regression Coefficients
hchart(boot, name = "Data Points") %>%
hc_add_theme(hc_theme_538()) %>%
hc_title(text = "Regression Coefficient Distribution Under H0")
T-interval with bootstrapped Standard Error
SEb=sd(boot)
moe=qt(0.975,N-1)*SEb
beta+c(-1,1)*moe
Percentile Interval
quantile(boot,c(0.025,0.975))
## 2.5% 97.5%
## -0.00368832 0.08200702
Summary Part 1
For this analysis we looked into education and its relation to poverty level for males. I chose these variables because people often say that not having an education leads to poor economic conditions in life. In our analysis we do not see an effect: the randomization test gave a p-value of 1, so there is no statistical conclusion validity. With regard to external validity, we think there is enough data here to be representative of the population. As for causal conclusion validity, we do not think this analysis supports education being a cause of poverty. I also think the poverty ratio is not the best outcome measure, since it can only take values between 0 and 5; this limited spread may play a part in the lack of an observed relationship between poverty and education. Within this data set the next place to look would be household income, but it is only available as a range of values or the midpoint of a range. A look at how these relationships play out in the female population would also be worthwhile.
Summary Part 2
The next analysis looked at sleep hours and poverty level among males. I wanted to understand sleep and the long-term effects of poverty, since people claim that sleeping too long will hold you back. In the analysis we see no statistical conclusion validity: the null hypothesis stands, with a p-value of about 0.07. Causal conclusion validity also does not hold, as the analysis shows no effect of hours of sleep on poverty level. For a follow-up analysis, as stated earlier, I believe the poverty ratio is not an adequate measure of poverty within this data set. There may also be a population of interest, such as single parents, where this relationship could show some difference.