The data set used is taken from the 2012 National Survey on Drug Use and Health (NSDUH) public use data file. This survey is part of a series conducted to measure the prevalence and correlates of drug use in the United States. The surveys are designed to provide quarterly, as well as annual, estimates. The data set provides information on the use of illicit drugs, alcohol, and tobacco among members of United States households aged 12 and older.
Surveys have been conducted periodically since 1971, with the most recent ones in 1979, 1982, 1985, 1988, and 1990 through 2012. Currently, public use files are available for surveys from 1979 onward.
The original data file is available as an ASCII file with 3,120 variables and 55,268 observations. Because the original file is so large, we work with only a subset of it here. The survey data come from an extensive survey of respondents, with random samples taken from each of the 50 states. The achieved sample size for the 2012 survey was 68,309 persons.
The null hypothesis can be stated as: "The variation in the age at first cigarette use cannot explain the variation in the age at first alcohol use" (i.e., any apparent association is due to random variation only).
Alternative hypothesis: The age at first cigarette use can explain some of the variation in the age at first alcohol use (by an individual).
The rationale behind formulating this hypothesis is that we might expect an effect of first cigarette usage on first alcohol usage for an individual.
We consider two variables here for our regression analysis:
The dependent variable (DV): Age when first had alcoholic beverage (ALCTRY)
Independent variable (IV): Age when first had a cigarette (CIGTRY)
Units of measure: For both the variables the unit of measure is ‘years’. Age in years is treated as a continuous variable for this analysis.
Each response is recorded with reference to a case id, which is the unique identifier for each individual. Since the data set is large both in the number of variables and in the total number of observations, we subset it to include only the case id, the independent variable, and the dependent variable. This gives the following three-column subset of the original data set:
drugs.data.table<-read.table(file.choose(), header = TRUE, sep=',')
summary(drugs.data.table)
##      CASEID        ALCTRY          CIGTRY
##  Min.   :  1   Min.   : 2.00   Min.   : 7.00
##  1st Qu.: 90   1st Qu.:13.00   1st Qu.:14.00
##  Median :179   Median :16.00   Median :16.00
##  Mean   :179   Mean   :15.47   Mean   :15.82
##  3rd Qu.:268   3rd Qu.:18.00   3rd Qu.:18.00
##  Max.   :357   Max.   :27.00   Max.   :35.00
str(drugs.data.table)
## 'data.frame': 357 obs. of 3 variables:
## $ CASEID: int 1 2 3 4 5 6 7 8 9 10 ...
## $ ALCTRY: int 12 17 22 21 13 18 6 18 17 18 ...
## $ CIGTRY: int 13 11 22 16 11 18 12 18 20 8 ...
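The subsetting step described above can be sketched as follows. This is only an illustration: the data frame `nsduh.full` and its `OTHER1`/`OTHER2` columns are hypothetical stand-ins for the full public use file, and only the column names CASEID, ALCTRY, and CIGTRY come from the analysis itself.

```r
# Hypothetical stand-in for the full NSDUH file (extra columns are invented)
nsduh.full <- data.frame(
  CASEID = 1:5,
  ALCTRY = c(12, 17, 22, 21, 13),
  CIGTRY = c(13, 11, 22, 16, 11),
  OTHER1 = c("a", "b", "c", "d", "e"),  # placeholder for the other variables
  OTHER2 = c(0, 1, 0, 1, 1)
)

# Keep only the identifier, the dependent variable, and the independent variable
keep.cols <- c("CASEID", "ALCTRY", "CIGTRY")
drugs.subset <- nsduh.full[, keep.cols]

# Each case id should identify exactly one respondent
stopifnot(!anyDuplicated(drugs.subset$CASEID))

str(drugs.subset)
```

Selecting columns by name rather than by position keeps the subset correct even if the column order of the source file changes.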
By definition a regression equation describes how the mean value of a y-variable (ALCTRY in our case) relates to specific values of the x-variable(s) (CIGTRY) used to predict y.
# Linear Model
attach(drugs.data.table)
Drugs.lm <- lm(ALCTRY~CIGTRY)
# Summary of the model
summary(Drugs.lm)
##
## Call:
## lm(formula = ALCTRY ~ CIGTRY)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -10.0548  -1.5604  -0.0564   1.4381  11.4428
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  7.54775    0.69339   10.88   <2e-16 ***
## CIGTRY       0.50079    0.04268   11.73   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.983 on 355 degrees of freedom
## Multiple R-squared: 0.2795, Adjusted R-squared: 0.2774
## F-statistic: 137.7 on 1 and 355 DF, p-value: < 2.2e-16
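For a single predictor, the least-squares estimates reported by `lm()` have a closed form: the slope is cov(x, y) / var(x) and the intercept is mean(y) - slope * mean(x). The sketch below checks this on simulated data; the variable names and the simulated values are assumptions for illustration and are not the survey data.

```r
set.seed(1)
# Simulated ages, loosely shaped like the variables above (not the survey data)
cig.age <- round(rnorm(100, mean = 16, sd = 3))
alc.age <- 7.5 + 0.5 * cig.age + rnorm(100, sd = 3)

fit <- lm(alc.age ~ cig.age)

# Closed-form least-squares estimates for simple linear regression
b1 <- cov(cig.age, alc.age) / var(cig.age)
b0 <- mean(alc.age) - b1 * mean(cig.age)

# Both rows should agree
rbind(lm = coef(fit), by.hand = c(b0, b1))
```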
# Scatter diagram
plot(CIGTRY, ALCTRY, pch=21, bg='blue', main="Age (in years) of first Alcohol Use vs Age (in years) of first Cigarette Use", xlab = "Age (years) of first cigarette try", ylab = "Age (years) of first alcohol try")
# Add the fitted regression line
abline(Drugs.lm, lwd=1)
par(mfrow=c(1,1))
Drugs.res <- resid(Drugs.lm)
plot(fitted(Drugs.lm), Drugs.res, pch=21, cex=1, bg='blue',main="Plot of Fitted Values vs. Residuals", xlab = "Fitted Values of Model", ylab = "Residuals")
abline(0,0, lwd=1)
plot(CIGTRY, ALCTRY, pch=21, bg='blue', main="Age (in years) of first Alcohol Use vs Age (in years) of first Cigarette Use", xlab = "Age (years) of first cigarette try", ylab = "Age (years) of first alcohol try")
abline(Drugs.lm, lwd=1)
# Red lines use the lower and upper 95% limits of the intercept and slope from confint();
# they are not a pointwise confidence band for the fitted line
abline(confint(Drugs.lm)[,1],col="red",lwd=1)
abline(confint(Drugs.lm)[,2],col="red",lwd=1)
Drugs.resid <- Drugs.lm$residuals
Drugs.st.resid <- rstandard(Drugs.lm)
#print(cbind(Drugs.resid, Drugs.st.resid))
par(mfrow=c(1,2))
plot(Drugs.resid,pch=23, bg='blue',cex=2,lwd=2,main="Residuals")
plot(Drugs.st.resid, pch=23, bg='red', cex=2, lwd=2, main="Standardized Residuals")
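Beyond plotting residuals by index, two common checks are to count the observations whose standardized residual exceeds 2 in absolute value and to compare the standardized residuals with a normal distribution. A minimal sketch on simulated data follows; the data, the object names, and the cutoff of 2 are illustrative assumptions and not part of the original analysis.

```r
set.seed(2)
# Simulated data standing in for the survey subset (not the real values)
x <- round(rnorm(200, mean = 16, sd = 3))
y <- 7.5 + 0.5 * x + rnorm(200, sd = 3)
sim.lm <- lm(y ~ x)

sim.st.resid <- rstandard(sim.lm)

# Observations with unusually large standardized residuals
flagged <- which(abs(sim.st.resid) > 2)
length(flagged)

# Normal Q-Q plot of the standardized residuals
qqnorm(sim.st.resid, main = "Normal Q-Q Plot of Standardized Residuals")
qqline(sim.st.resid)
```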
Here we summarize the output of our regression analysis as well as the plots:
1. Linear regression model: The regression line fitted to this data can be written as $\hat{y}_i = 7.54775 + 0.50079\,x_i$ (the hat denotes an estimate).
b0 = 7.54775 (y-intercept: the predicted age of first alcohol use when the age of first cigarette use is 0; this lies outside the observed range of CIGTRY, whose minimum is 7, so it has no practical interpretation)
b1 = 0.50079 (slope: each additional year in the age of first cigarette use is associated with an increase of 0.50079 years in the predicted age of first alcohol use)
From the summary of the model we see that both coefficients are highly significant (p-values below 2e-16).
The value of R^2 is 0.2795, which means that the age at first cigarette use explains 27.95% of the observed variation in the age at first alcohol use. Taking the square root gives the correlation coefficient, r = (R^2)^{1/2} ≈ 0.529 (positive, since the slope is positive), so there is a moderate positive correlation between the two variables.
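The correlation coefficient can be recovered from the reported R^2 with a short calculation, taking the sign from the slope. The two input values below are copied from the model summary; the object names are illustrative.

```r
r.squared <- 0.2795   # Multiple R-squared from the summary
slope     <- 0.50079  # CIGTRY coefficient from the summary

# In simple linear regression, r has the sign of the slope and magnitude sqrt(R^2)
r <- sign(slope) * sqrt(r.squared)
round(r, 3)  # 0.529
```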
The scatter plot suggests an increasing linear (positive slope) relationship between the variables, with no obvious curvature or other non-linear pattern.
The plots of the residuals do not exhibit any pattern, so the linear model appears to be correct on average across the fitted values.
p-value: The p-value is very low, which means that an association this strong between x and y would be very unlikely to arise by chance alone if the null hypothesis were true. Since the p-value (< 2.2e-16) is far below the alpha level of 0.05, we conclude that some of the variability observed in the dependent variable is associated with the variability in the independent variable (F-statistic of 137.7 on 1 and 355 degrees of freedom), i.e. the relationship is not due to random variation alone; this indicates an association rather than a causal effect. Based on the p-value approach, we therefore reject the null hypothesis.
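The decision rule can be reproduced from the reported test statistic. The sketch below uses the t value and residual degrees of freedom from the model summary; the alpha of 0.05 is the conventional level and is an assumption here, as are the object names.

```r
t.value <- 11.73  # t value for CIGTRY from the summary
df.res  <- 355    # residual degrees of freedom
alpha   <- 0.05   # conventional significance level (an assumption)

# Two-sided p-value for the slope
p.value <- 2 * pt(-abs(t.value), df = df.res)

# With one predictor, the F-statistic is the square of the slope's t value
f.value <- t.value^2  # close to the reported 137.7, up to rounding of t

p.value < alpha  # TRUE means the null hypothesis is rejected
```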
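Confidence interval: The red lines in the confidence interval plot are drawn from the lower and upper 95% confidence limits of b0 and b1. A 95% confidence interval means that, if the sampling were repeated many times, about 95% of the intervals constructed in this way would contain the true value of the coefficient.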
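A 95% confidence interval for a coefficient is the estimate plus or minus a t quantile times its standard error, which is what `confint()` computes for a linear model. The sketch below applies this to the reported slope and then, on simulated data (not the survey data), uses `predict()` with `interval = "confidence"` to draw a pointwise band for the fitted mean. That band is a different object from lines drawn through the coefficient limits. All object names are illustrative.

```r
# Part 1: slope interval from the reported summary values
est    <- 0.50079  # CIGTRY estimate
se     <- 0.04268  # its standard error
df.res <- 355      # residual degrees of freedom
t.crit <- qt(0.975, df = df.res)
slope.ci <- est + c(-1, 1) * t.crit * se
round(slope.ci, 3)

# Part 2: pointwise 95% band for the mean response, on simulated data
set.seed(3)
sim <- data.frame(x = round(rnorm(150, mean = 16, sd = 3)))
sim$y <- 7.5 + 0.5 * sim$x + rnorm(150, sd = 3)
sim.lm <- lm(y ~ x, data = sim)

grid <- data.frame(x = seq(min(sim$x), max(sim$x), length.out = 50))
band <- predict(sim.lm, newdata = grid, interval = "confidence", level = 0.95)

plot(sim$x, sim$y, pch = 21, bg = "blue", xlab = "x", ylab = "y")
abline(sim.lm, lwd = 1)
lines(grid$x, band[, "lwr"], col = "red")
lines(grid$x, band[, "upr"], col = "red")
```

The band is narrowest near the mean of x and widens toward the extremes, which straight lines through the coefficient limits cannot show.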
Prepared by: RTI International Research Triangle Park, North Carolina 27709 January 2013 Authors: Katherine B. Morton Peilan C. Martin Bonnie E. Shook-Sa James R. Chromy Erica L. Hirsch