Alcohol and Drug Analysis

Introduction

In this statistical analysis, I examine how many days in the past year males and females used drugs and alcohol. I analyze the differences between male and female use, and also how use differs by age. Here are my findings:


Analyzing the Histogram

The x-axis of this histogram shows the number of days in the past year on which a person consumed alcohol. The y-axis shows the count of people, male and female, at each number of days. Taken together, the histogram shows how often males and females of all ages drank over the past year. Looking at the histogram, you can conclude that most people who drank last year drank less than twice a week (fewer than about 104 days). The mean number of drinking days in the past year was 88, which works out to about 1.7 days a week (88 / 52). This is still below twice a week, which supports my claim. The histogram excludes people who have never drunk alcohol and people who did not drink in the past year. Including them, at zero days each, would only lower the average, which further supports the point that males and females of all ages used alcohol less than twice a week last year.
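The conversion from days per year to days per week can be checked with a few lines of Python. This is only an arithmetic sketch: the mean of 88 comes from the text, and the 52-week year is a rounding assumption.

```python
# Convert the reported annual mean into a weekly rate.
MEAN_DRINKING_DAYS_PER_YEAR = 88  # mean reported in the text
WEEKS_PER_YEAR = 52               # assumption: 365 / 7 rounded down

days_per_week = MEAN_DRINKING_DAYS_PER_YEAR / WEEKS_PER_YEAR
# Number of annual drinking days that corresponds to twice a week.
twice_a_week_days = 2 * WEEKS_PER_YEAR

print(f"{days_per_week:.2f} drinking days per week")
print(f"Twice a week corresponds to {twice_a_week_days} days per year")
```

The mean sits between once and twice a week, so "less than twice a week" is the safer way to describe it.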


Histogram vs. Frequency Polygon

A histogram and a frequency polygon suit different situations. A histogram is good for showing the distribution of one variable for a single group, while a frequency polygon is good for comparing the distribution of that variable across several groups. Take our histogram from above: if we wanted to separate males from females and compare how many days each drank over the year, stacked or overlapping bars would be hard to read. A frequency polygon draws one line per group, so the male and female distributions can be laid over each other and compared directly. Both are very good tools for analyzing data, but you have to use them in the right situations.


Effect of a Larger Binwidth on the Frequency Polygon

The larger binwidth makes this frequency polygon easier to read than the one above. Wider bins combine more observations into each point, which smooths out the bin-to-bin jumps and makes the overall slope of each line much easier to see. This helps to show that most males and females drank fewer than 100 days last year. With the larger binwidth, you get a clearer picture of how the number of people falls off as the number of drinking days rises.
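The smoothing effect of a wider bin can be illustrated with a small Python sketch. The `days` values below are made up for illustration and are not the survey data, and `bin_counts` is a hypothetical helper that uses simple left-edge binning, which may not match the plotting software's exact bin placement.

```python
from collections import Counter

def bin_counts(values, binwidth):
    """Count values per bin; each bin is keyed by its left edge."""
    counts = Counter((v // binwidth) * binwidth for v in values)
    return dict(sorted(counts.items()))

# Hypothetical drinking-day values, invented for this example only.
days = [1, 3, 5, 12, 12, 24, 30, 48, 52, 52, 104, 156, 200, 260, 365]

narrow = bin_counts(days, 10)  # many small bins, jagged counts
wide = bin_counts(days, 50)    # fewer bins, smoother overall shape

print("binwidth 10:", narrow)
print("binwidth 50:", wide)
```

With the wider bins, the counts drop steadily from the first bin onward, which is the kind of overall trend that is harder to pick out when every bin holds only one or two observations.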


Age and Sex Influences on Drinking

From these boxplots, you can conclude that older age groups tend to drink more frequently than younger ones, and that males tend to drink more frequently than females. A tentative conclusion you can draw from this is that older males tend to drink more than older females, and younger males tend to drink more than younger females. The two boxplots show age and sex separately, so they cannot confirm this on their own, but it is a reasonable inference given the information in them.


##    12-13    14-15    16-17    18-20    21-25    26-34      35+ 
## 26.22134 25.27678 34.79498 55.50665 81.85258 85.31651 95.65255

## 
##  Welch Two Sample t-test
## 
## data:  IRALCFY by SEX
## t = 26.05, df = 32491, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  23.12881 26.89249
## sample estimates:
##   mean in group Male mean in group Female 
##             93.87654             68.86589

Summary of T-Test Results

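In this t-test, I see that the mean number of drinking days differs between males (about 93.9) and females (about 68.9), a difference of about 25 days. The 95 percent confidence interval for the difference (23.1 to 26.9) does not contain 0, and the p-value is less than 2.2e-16, so the null hypothesis of equal means is rejected.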
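As a consistency check, the t statistic can be approximately recovered from the reported group means and confidence interval. This is a sketch under one stated assumption: because the degrees of freedom are very large (32491), the normal critical value is used in place of the exact t critical value.

```python
from statistics import NormalDist

# Values copied from the t-test output above.
mean_male, mean_female = 93.87654, 68.86589
ci_low, ci_high = 23.12881, 26.89249

diff = mean_male - mean_female        # estimated difference in means
half_width = (ci_high - ci_low) / 2   # margin of error of the 95% interval

# Assumption: with df = 32491 the t critical value is close to the normal one.
z_crit = NormalDist().inv_cdf(0.975)

std_err = half_width / z_crit         # implied standard error of the difference
t_stat = diff / std_err               # should land near the reported t = 26.05

print(f"difference = {diff:.2f}, standard error = {std_err:.3f}, t = {t_stat:.2f}")
```

The difference in means sits at the center of the confidence interval, and it is about 26 standard errors away from 0, which is why the p-value is so small.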


## 
##  Welch Two Sample t-test
## 
## data:  underage$IRALCFY and legal$IRALCFY
## t = -17.099, df = 6867.3, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -29.36643 -23.32543
## sample estimates:
## mean of x mean of y 
##  55.50665  81.85258

Summary of T-Test Results

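In this t-test, I see that the mean number of drinking days for legal drinkers (about 81.9) is higher than that of underage drinkers (about 55.5), a difference of about 26 days. These sample means match the 18-20 and 21-25 rows in the table of means by age group above, so the comparison appears to be between those two groups. The 95 percent confidence interval for the difference (-29.4 to -23.3) does not contain 0, and the p-value is less than 2.2e-16, so the null hypothesis of equal means is rejected.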


Validating the T-Test

We have about 5,500 data points in the female category of drinking days, and about 8,000 data points in the male category of drinking days. With samples this large, the sampling distribution of each group mean is approximately normal even if the underlying distribution of drinking days is skewed, so we can treat the t-test calculations as valid. The t-test shows that the mean number of drinking days for females is smaller than the mean number of drinking days for males, which agrees with what the frequency polygon above shows.



Probability Values Based on Age

Based on my visual inspection of the age groups in the bar graph, I would expect the comparison between ages 12-13 and 14-15 to have a probability value greater than 0.05, because these two groups look the most alike.


## 
##  Welch Two Sample t-test
## 
## data:  Test1$IRALCFY and Test2$IRALCFY
## t = 0.28141, df = 390.74, p-value = 0.7785
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -5.654680  7.543813
## sample estimates:
## mean of x mean of y 
##  26.22134  25.27678

Final Annotation

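Based on my visual analysis, age groups 12-13 and 14-15 seemed to have the closest medians, and the t-test on their means agrees. The age group 12-13 had a mean of about 26 drinking days, while the age group 14-15 had a mean of about 25. With a p-value of 0.7785 and a 95 percent confidence interval (-5.65 to 7.54) that contains 0, we fail to reject the null hypothesis that the two means are equal. The bar graph gave a clear visual representation, which helped me pick out age groups 12-13 and 14-15 as the closest pair.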
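The decision rule applied to the three t-tests in this report can be summarized in a short Python sketch. The confidence intervals and p-values are copied from the output above. For the first two tests the output only gives an upper bound (p-value < 2.2e-16), so that bound is used as a stand-in. The labels and the `decide` helper are my own naming, not part of the original analysis.

```python
ALPHA = 0.05  # conventional significance level

# (label, p-value or its reported upper bound, CI low, CI high)
tests = [
    ("male vs. female", 2.2e-16, 23.12881, 26.89249),
    ("underage vs. legal", 2.2e-16, -29.36643, -23.32543),
    ("12-13 vs. 14-15", 0.7785, -5.654680, 7.543813),
]

def decide(p_value, ci_low, ci_high, alpha=ALPHA):
    """Return (reject_null, ci_excludes_zero) for a two-sided test."""
    reject_null = p_value < alpha
    ci_excludes_zero = ci_low > 0 or ci_high < 0
    return reject_null, ci_excludes_zero

results = {label: decide(p, lo, hi) for label, p, lo, hi in tests}

for label, (reject_null, ci_excludes_zero) in results.items():
    verdict = "reject" if reject_null else "fail to reject"
    print(f"{label}: {verdict} the null (CI excludes 0: {ci_excludes_zero})")
```

In each case the two checks agree: a 95 percent confidence interval that excludes 0 goes with a p-value below 0.05, and an interval that contains 0 goes with a p-value above it.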