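library(Hmisc)      # supplies the helper behind mean_cl_normal used later
library(tidyverse)  # read_csv, select, filter, rename, recode, ggplot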
a.drugs2014 <- read_csv("drugs2014.csv")
select(a.drugs2014, "IRALCFY", "IRALCFM", "IRSEX", "CATAG7", "HEALTH2" ) -> Alcohol2014

With the select command, we keep only the variables named in the lab instructions.

mean(Alcohol2014$IRALCFY)
## [1] 425.2426
max(Alcohol2014$IRALCFY)
## [1] 993

The mean comes out to 425.2426, which cannot be right because a year has only 365 days. The cause is the coding of non-drinkers: people who did not use alcohol in the past year are recorded as 993, and people who have never used alcohol are recorded as 991. Including these codes inflates the mean, so it says nothing useful about how many days per year people drink.

The max is not meaningful either. The 993 it reports is simply the code for no alcohol use in the past year, while the largest genuine value is 365, the number of days in a year. Once again, the codes used to record people who did not or do not drink distort the results when they are treated as day counts.

One way to solve this problem is to exclude these values when calculating summary statistics; another is to recode them as NA or 0.
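As a minimal sketch of the NA option, the base R snippet below recodes the two non-use codes on a small made-up vector. The vector `days` and its values are assumptions for illustration only and are not taken from the survey file.

```r
# Toy vector standing in for IRALCFY (values are made up for illustration).
days <- c(12, 52, 365, 991, 993, 104, 993)

# Treat the two non-use codes as missing instead of as day counts.
days_clean <- replace(days, days %in% c(991, 993), NA)

mean(days)                      # inflated by the 991/993 codes
mean(days_clean, na.rm = TRUE)  # average over valid day counts only
max(days_clean, na.rm = TRUE)   # can no longer exceed 365
```

Recoding to NA keeps the rows in the table, whereas filtering removes them; which is preferable depends on whether non-drinkers are needed for later comparisons.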

Alcohol2014 %>% filter(IRALCFY < 366) -> Drinkers

Drinkers$IRALCFY <- as.numeric(Drinkers$IRALCFY)

mean(Drinkers$IRALCFY)
## [1] 80.85421
max(Drinkers$IRALCFY)
## [1] 365
str(Drinkers$IRALCFY)
##  num [1:34371] 2 52 4 104 24 104 260 260 360 104 ...

The number of observations now differs from the Alcohol2014 table because the values 991 and 993 have been excluded. What remains are the actual day counts reported by people who drank in the past year. The mean is now a meaningful figure, 80.85421 drinking days per year, and the max of 365 represents people who drink every day of the year.

rename(Alcohol2014, SEX = IRSEX) -> Alcohol2014

rename(Drinkers, SEX = IRSEX) -> Drinkers

Alcohol2014$SEX %>% recode('1'='Male', '2'='Female') -> Alcohol2014$SEX

Drinkers$SEX %>% recode('1'='Male', '2'='Female') -> Drinkers$SEX
rename(Alcohol2014, AGE = CATAG7) ->Alcohol2014
rename(Drinkers, AGE = CATAG7) ->Drinkers

Alcohol2014$AGE %>% recode('1'='12-13', '2'='14-15', '3'='16-17', '4'='18-20', '5'='21-25', '6'='26-34', '7'='35+') -> Alcohol2014$AGE

Drinkers$AGE %>% recode('1'='12-13', '2'='14-15', '3'='16-17', '4'='18-20', '5'='21-25', '6'='26-34', '7'='35+') -> Drinkers$AGE
ggplot(Drinkers, aes(IRALCFY)) + geom_histogram(bins=25)  + xlim(0,365) + xlab('Total Annual Drinking Days')

In the histogram above, the x axis, “Total Annual Drinking Days”, shows the number of days on which respondents drank over the course of the year. The y axis, “count”, shows how many respondents fall into each range of drinking days.

The histogram shows that most people drink on 100 or fewer days a year. Someone who drank twice every week would drink on about 104 days a year (2 × 52). Since the majority of respondents fall between zero and a hundred drinking days, most people drink less than twice a week.
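The twice-a-week threshold can also be checked numerically rather than read off the plot. The sketch below uses only the first ten IRALCFY values printed by str() earlier, so its result describes those ten rows and not the full Drinkers table; the variable names are assumptions.

```r
# First ten IRALCFY values as printed by str(Drinkers$IRALCFY) above.
days <- c(2, 52, 4, 104, 24, 104, 260, 260, 360, 104)

twice_weekly <- 2 * 52       # 104 drinking days per year
mean(days < twice_weekly)    # share of these rows below twice a week
median(days)                 # middle value, less sensitive to heavy drinkers
```

Applied to the full column, `mean(Drinkers$IRALCFY < 104)` would give the share of all past-year drinkers who drink less than twice a week.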

ggplot(Alcohol2014, aes(IRALCFY, color=SEX)) + geom_freqpoly(bins=25)  + xlim(0,365) + xlab('Total Annual Drinking Days')

A frequency polygon is useful for comparing the distribution of one variable across groups, because the lines can overlap on a single plot. That makes the comparison easier to see than it would be with two separate histograms.

Drinkers %>% ggplot(aes(x=AGE, y=IRALCFY)) + geom_boxplot(notch=TRUE, outlier.color = "gray") 

Drinkers below the legal drinking age of 21 report fewer drinking days than those who are 21 or older. The boxplot above shows the median number of drinking days increasing with age, though with very similar results across the 21-25, 26-34, and 35+ groups. The 18-20 group shows a large increase in the median, which could mean that people start drinking more frequently from age 18 onwards.

Drinkers %>% ggplot(aes(x=SEX, y=IRALCFY)) + geom_boxplot(notch=TRUE, outlier.color = "gray") 

The boxplot above shows that men tend to drink more often than women. Although the difference in the median number of drinking days is not large, males drink on more days than females.

Drinkers %>% group_by(AGE) %>% summarise(Mean.IRALCFY = mean(IRALCFY))
## # A tibble: 7 x 2
##   AGE   Mean.IRALCFY
##   <chr>        <dbl>
## 1 12-13         26.2
## 2 14-15         25.3
## 3 16-17         34.8
## 4 18-20         55.5
## 5 21-25         81.9
## 6 26-34         85.3
## 7 35+           95.7
t.test(IRALCFY ~ SEX, data = Drinkers)
## 
##  Welch Two Sample t-test
## 
## data:  IRALCFY by SEX
## t = -26.05, df = 32491, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -26.89249 -23.12881
## sample estimates:
## mean in group Female   mean in group Male 
##             68.86589             93.87654

Based on the results of the t-test, we reject the null hypothesis that the two means are equal: the p-value is far below 0.05 and the 95 percent confidence interval for the difference does not include zero. The group means, 68.87 days for females and 93.88 days for males, show that males tend to drink on more days than females.
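To make the reported t and df less of a black box, the sketch below recomputes the Welch statistic by hand and compares it with t.test. The two groups are simulated with rpois purely for illustration; they are not the survey data, and the names `female`, `male`, and `t_manual` are assumptions.

```r
# Simulated stand-ins for the two groups (not the survey data).
set.seed(1)
female <- rpois(200, lambda = 70)
male   <- rpois(220, lambda = 95)

# Welch statistic: difference in means divided by its standard error.
v_f <- var(female) / length(female)
v_m <- var(male) / length(male)
t_manual <- (mean(female) - mean(male)) / sqrt(v_f + v_m)

# Welch-Satterthwaite approximation for the degrees of freedom.
df_manual <- (v_f + v_m)^2 /
  (v_f^2 / (length(female) - 1) + v_m^2 / (length(male) - 1))

# Two-sided p-value from the t distribution.
p_manual <- 2 * pt(-abs(t_manual), df_manual)

fit <- t.test(female, male)   # Welch is the default (var.equal = FALSE)
all.equal(unname(fit$statistic), t_manual)
all.equal(unname(fit$parameter), df_manual)
```

The sign of t depends only on the order of the groups: with Female first and Male second, a negative t means the female mean is the smaller one, which matches the output above.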

Underage <- filter(Drinkers, AGE == "18-20")
Legal <- filter(Drinkers, AGE == "21-25")
t.test(Underage$IRALCFY, Legal$IRALCFY)
## 
##  Welch Two Sample t-test
## 
## data:  Underage$IRALCFY and Legal$IRALCFY
## t = -17.099, df = 6867.3, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -29.36643 -23.32543
## sample estimates:
## mean of x mean of y 
##  55.50665  81.85258

The t-test on age shows that legal-age drinkers drink on more days, about 81.85 per year, than those not yet of age, about 55.51. This makes sense, since legal-age drinkers can purchase and consume alcohol legally. The p-value supports the alternative hypothesis that the difference between mu1 and mu2 is not zero, so drinkers aged 21-25 drink on more days per year than those aged 18-20. The 95 percent confidence intervals for the difference in means are -29.36643 to -23.32543 for the age test and -26.89249 to -23.12881 for the sex test. Neither interval contains zero, which is strong evidence of a difference and leads us to reject the null hypothesis in both cases.

Comparing these results to the frequency polygon, the t.test results appear consistent with the graph, which also shows males drinking on more days than females.
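The link between the confidence interval and the test decision can be checked directly on the object that t.test returns. This is a sketch on simulated data (the groups `younger` and `older` are assumptions and are not meant to resemble the survey values); it relies only on the `conf.int` and `p.value` components of the result.

```r
# Simulated groups for illustration only.
set.seed(2)
younger <- rnorm(150, mean = 55, sd = 60)
older   <- rnorm(150, mean = 82, sd = 60)

fit <- t.test(younger, older)   # two-sided Welch test, 95% interval by default
ci  <- fit$conf.int

excludes_zero <- ci[1] > 0 || ci[2] < 0   # interval lies entirely on one side of 0
rejects_null  <- fit$p.value < 0.05       # usual 5% decision rule

excludes_zero == rejects_null   # the two criteria agree
```

For a two-sided test, a 95 percent interval that excludes zero and a p-value below 0.05 are two views of the same result, so the interval adds information about the size of the difference rather than a second, independent piece of evidence.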

Drinkers %>% ggplot(aes(x=AGE, y=IRALCFY, fill=AGE)) + stat_summary(fun = "mean", geom = "bar") + stat_summary(fun.data = "mean_cl_normal", geom ="errorbar", fun.args = list(conf.int=0.95), width=0.35) + theme(legend.position = "none") + labs(y="Average Drinking Days Last Year")

Drinkers %>% ggplot(aes(x=SEX, y=IRALCFY, fill=SEX)) + stat_summary(fun = "mean", geom = "bar") + stat_summary(fun.data = "mean_cl_normal", geom ="errorbar", fun.args = list(conf.int=0.95), width=0.35) + theme(legend.position = "none") + labs(y="Average Drinking Days Last Year")

Looking at the bar graph comparing the age groups, I would expect the 12-13 and 14-15 groups not to have a p-value below 0.05, because their means are so close. The graph shows both groups averaging about 25 drinking days per year.
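The error bars on such a bar graph are t-based confidence intervals around each group mean. The sketch below computes one by hand in base R on a simulated group and compares it with the interval from a one-sample t.test; the data and variable names are assumptions, and the formula shown is, to my understanding, the same one mean_cl_normal relies on.

```r
# Simulated group for illustration only (not survey data).
set.seed(3)
x <- rnorm(40, mean = 25, sd = 30)

n  <- length(x)
se <- sd(x) / sqrt(n)                 # standard error of the mean
conf_level <- 0.95                    # a proportion between 0 and 1, not 95
half_width <- qt((1 + conf_level) / 2, df = n - 1) * se
ci_manual  <- mean(x) + c(-1, 1) * half_width

# A one-sample t.test reports the same t-based interval.
ci_ttest <- as.numeric(t.test(x, conf.level = conf_level)$conf.int)
all.equal(ci_manual, ci_ttest)
```

Overlapping error bars are only a rough visual guide to significance, which is why the expectation is then checked with a t-test.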

Drinkers12.13 <- filter(Drinkers, AGE == "12-13")
Drinkers14.15 <- filter(Drinkers, AGE == "14-15")
t.test(Drinkers12.13$IRALCFY, Drinkers14.15$IRALCFY)
## 
##  Welch Two Sample t-test
## 
## data:  Drinkers12.13$IRALCFY and Drinkers14.15$IRALCFY
## t = 0.28141, df = 390.74, p-value = 0.7785
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -5.654680  7.543813
## sample estimates:
## mean of x mean of y 
##  26.22134  25.27678

As expected, the p-value for the comparison of the two age groups is 0.7785, far above 0.05, so we fail to reject the null hypothesis that both groups drink on a similar number of days per year. The means agree: 26.22134 for 12 to 13 year olds and 25.27678 for 14 to 15 year olds. The confidence interval, -5.654680 to 7.543813, contains 0, which is also consistent with the null hypothesis.