Variable Selection & Research Question

  • Continuous variable: Behav_AlcDaysPerYear_N. “This variable represents the number of days spent drinking alcoholic beverages in the past year.” (NHIS Abbreviated Codebook.pdf)
  • Categorical variable: Demo_agerange_C. This variable is a categorization of a continuous age variable.
  • Do younger age groups drink more alcohol than older age groups, as measured by days spent drinking per year? I suspect that the answer is yes. To test this hypothesis in RStudio, I will group the continuous variable, which represents the number of days spent drinking per year, by the age categories “18-29” and “50-59” and compare the two groups. This research could have important health and marketing applications.

Data Prep

load

library(readr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
nhisdata<-read.csv('/Volumes/FLASHDRIVE/Data 333/NHIS Data.csv')

Comparison of Means

Table

nhisdata%>%
  filter(Demo_agerange_C %in% c("18-29","50-59"))%>%
  group_by(Demo_agerange_C)%>%
  summarize(avg_Behav_AlcDaysPerYear_N=mean(Behav_AlcDaysPerYear_N,na.rm=TRUE))
## # A tibble: 2 x 2
##   Demo_agerange_C avg_Behav_AlcDaysPerYear_N
## * <chr>                                <dbl>
## 1 18-29                                 53.3
## 2 50-59                                 67.0

Visualization

nhisdata%>%
  filter(Demo_agerange_C %in% c("18-29","50-59"))%>%
  ggplot(aes(x=Behav_AlcDaysPerYear_N,y=Demo_agerange_C,fill=Demo_agerange_C))+
  geom_boxplot(alpha=0.7)+
  stat_summary(fun=mean, geom="point", shape=20, size=14, color="red")+ 
  theme(legend.position = "none")
## Warning: Removed 53815 rows containing non-finite values (stat_boxplot).
## Warning: Removed 53815 rows containing non-finite values (stat_summary).

Interpretation

  • It appears that the “50-59” age group drinks on more days per year than the “18-29” age group: the older group averages about 67 drinking days per year and the younger group about 53. Interestingly, the boxplot visualization shows outliers in both groups. The abundance of outliers, especially in the younger age group, may have pulled the means above the medians. Both distributions are positively skewed, so the mean may not be the best measure of central tendency here.
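
Since the mean may be pulled upward by outliers, one way to check is to report the median next to the mean for each group. The sketch below uses a small invented data frame (`toy`) in place of `nhisdata`; only the column names come from the analysis above, and the values are hypothetical.

```r
library(dplyr)

# Hypothetical toy data standing in for nhisdata; the values are invented
# purely for illustration and are NOT taken from the NHIS file.
toy <- data.frame(
  Demo_agerange_C = rep(c("18-29", "50-59"), each = 6),
  Behav_AlcDaysPerYear_N = c(0, 12, 24, 52, NA, 365,
                             0, 2, 12, 104, 260, NA)
)

# Mean, median, and count of non-missing answers for each age group
toy_summary <- toy %>%
  group_by(Demo_agerange_C) %>%
  summarize(
    mean_days   = mean(Behav_AlcDaysPerYear_N, na.rm = TRUE),
    median_days = median(Behav_AlcDaysPerYear_N, na.rm = TRUE),
    n_reported  = sum(!is.na(Behav_AlcDaysPerYear_N))
  )

print(toy_summary)
```

Replacing `toy` with `nhisdata` (after the same `filter()` step used above) should give the corresponding figures for the real data. A mean well above the median in a group is consistent with positive skew.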

Comparison of Distributions

Visualization

nhisdata%>%
  filter(Demo_agerange_C %in% c("18-29","50-59"))%>%
  ggplot(aes(x=Behav_AlcDaysPerYear_N,fill=Demo_agerange_C))+
  geom_histogram(binwidth=10)+
  facet_wrap(~Demo_agerange_C)+ 
  theme(legend.position = "none")
## Warning: Removed 53815 rows containing non-finite values (stat_bin).

Interpretation

  • The distributions for both groups are not normal. In both groups, most people report a number of drinking days per year close to zero. The older age group has more observations clustered near zero than the younger age group. It also has more observations above 300 days, which may be one reason the older group's mean is higher than the younger group's.

Sampling Distribution & T-test

Sampling Distribution

Nhisdata<-nhisdata%>%
  select(Demo_agerange_C,Behav_AlcDaysPerYear_N)%>%
  filter(Demo_agerange_C %in% c("18-29","50-59"))
young_data<-Nhisdata%>%
  filter(Demo_agerange_C=="18-29")
old_data<-Nhisdata%>%
  filter(Demo_agerange_C=="50-59")
sample(young_data$Behav_AlcDaysPerYear_N,40)%>%
  mean(na.rm=TRUE)
## [1] 52.86667
sample(old_data$Behav_AlcDaysPerYear_N,40)%>%
  mean(na.rm=TRUE)
## [1] 75.39394
replicate(10000,
          sample(young_data$Behav_AlcDaysPerYear_N,40)%>%
  mean(na.rm=TRUE)
  )%>%
  data.frame()%>%
  rename("mean"=1) %>%
  ggplot()+
  geom_histogram(aes(x=mean),fill="red",alpha=0.7)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

replicate(10000,
          sample(old_data$Behav_AlcDaysPerYear_N,40)%>%
  mean(na.rm=TRUE)
  )%>%
  data.frame()%>%
  rename("mean"=1) %>%
  ggplot()+
  geom_histogram(aes(x=mean),fill="light blue",alpha=0.7)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
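
Note that `sample(..., 40)` above draws from a column that still contains missing values, so `mean(na.rm=TRUE)` may average fewer than 40 responses in a given draw. A possible variation, sketched below, drops the missing values first so that every sample has exactly 40 usable observations. The vector `days` is a hypothetical right-skewed stand-in for `young_data$Behav_AlcDaysPerYear_N`, and the seed is arbitrary.

```r
set.seed(333)  # arbitrary seed so the sketch is reproducible

# Hypothetical right-skewed stand-in for young_data$Behav_AlcDaysPerYear_N,
# capped at 365 days, with some missing values mixed in.
days <- c(pmin(rexp(5000, rate = 1 / 50), 365), rep(NA, 2000))

# Drop missing values BEFORE sampling so each sample has 40 usable values.
days_complete <- days[!is.na(days)]

# 10,000 sample means, each based on exactly 40 observations
sample_means <- replicate(10000, mean(sample(days_complete, 40)))

# The centre of the sampling distribution should sit near the overall mean
mean(sample_means)
mean(days_complete)

# Its spread should be close to sd / sqrt(n)
sd(sample_means)
sd(days_complete) / sqrt(40)
```

Even though the underlying values are skewed, the means of the samples should pile up roughly symmetrically around the overall mean, which is what the two histograms above illustrate.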

T-test

t.test(Behav_AlcDaysPerYear_N~Demo_agerange_C,data=Nhisdata)
## 
##  Welch Two Sample t-test
## 
## data:  Behav_AlcDaysPerYear_N by Demo_agerange_C
## t = -30.903, df = 140111, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -14.60878 -12.86623
## sample estimates:
## mean in group 18-29 mean in group 50-59 
##            53.29934            67.03684


Interpretation

  • According to the sample means reported by the t-test, the 18-29 age group drank on an average of 53.3 days per year and the 50-59 age group on an average of 67.0 days per year. The p-value is less than 0.05, so the difference between the two group means is statistically significant. Because the older group has the higher mean, the data do not support my original hypothesis that younger people drink more than older people.
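
Because the original hypothesis is directional (younger people drink more), a one-sided test is one way to express it directly. The sketch below compares a two-sided and a one-sided Welch t-test on invented data; `toy`, the seed, and the 0.05 threshold `alpha` are assumptions for illustration, not part of the original analysis.

```r
set.seed(333)  # arbitrary seed so the sketch is reproducible

# Hypothetical data: two right-skewed groups with invented values
toy <- data.frame(
  Demo_agerange_C = rep(c("18-29", "50-59"), each = 200),
  Behav_AlcDaysPerYear_N = c(rexp(200, rate = 1 / 53),
                             rexp(200, rate = 1 / 67))
)

# Two-sided test: is there any difference in means?
two_sided <- t.test(Behav_AlcDaysPerYear_N ~ Demo_agerange_C, data = toy)

# One-sided test: is the mean of the first group ("18-29") GREATER than
# the mean of the second group ("50-59")?
one_sided <- t.test(Behav_AlcDaysPerYear_N ~ Demo_agerange_C, data = toy,
                    alternative = "greater")

alpha <- 0.05  # conventional significance level (an assumption here)
two_sided$p.value < alpha
one_sided$p.value < alpha
```

With the formula interface, the first group is the one that sorts first, here “18-29”, so `alternative = "greater"` matches the hypothesis as stated. In the real output above, the t statistic is negative and the confidence interval lies entirely below zero, which points in the opposite direction from the hypothesis.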