Applied Linear Regression: Project 1

Abstract

The data set used is taken from the 2012 National Survey on Drug Use and Health (NSDUH) public use data file. The NSDUH is a survey series conducted to measure the prevalence and correlates of drug use in the United States. The surveys are designed to provide quarterly as well as annual estimates. The data set provides information on the use of illicit drugs, alcohol, and tobacco among members of United States households aged 12 and older.

Surveys have been conducted periodically since 1971, with the most recent ones in 1979, 1982, 1985, 1988, and 1990 through 2012. Currently, public use files are available for surveys from 1979 onward.

Dataset Used

The original data file is available as an ASCII file with 3,120 variables and 55,268 observations. Because the original file is so large, we analyze only a subset of it here. The data come from an extensive survey of respondents, with random samples drawn from each of the 50 states. The achieved sample size for the 2012 survey was 68,309 persons.

Hypothesis under test and the variables involved

The null hypothesis can be stated as: "The variation in the age of first cigarette use cannot explain the variation in the age of first alcohol use" (i.e. any apparent association is due to random variation only).

Alternative hypothesis: The age of first cigarette use can explain some of the variation in the age of first alcohol use (by an individual).

The rationale behind formulating this hypothesis is that we might expect an effect of first cigarette usage on first alcohol usage for an individual.
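In regression terms, the two hypotheses can be written as statements about the slope. The following is a sketch assuming the usual simple linear model with independent random errors; the symbols $\beta_0$, $\beta_1$, and $\varepsilon_i$ are introduced here for illustration only.

```latex
\begin{aligned}
\text{ALCTRY}_i &= \beta_0 + \beta_1\,\text{CIGTRY}_i + \varepsilon_i \\
H_0 &: \beta_1 = 0 \\
H_1 &: \beta_1 \neq 0
\end{aligned}
```

Under $H_0$ the age of first cigarette use carries no linear information about the age of first alcohol use, which is what the slope test in the regression output evaluates.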

We consider two variables here for our regression analysis:

  1. The dependent variable (DV): Age when first had alcoholic beverage (ALCTRY)

  2. Independent variable (IV): Age when first had a cigarette (CIGTRY)

Units of measure: For both variables the unit of measure is 'years'. Age in years is treated as a continuous variable for this analysis.

Each response is recorded against a case ID, which uniquely identifies each individual. Because the dataset is very large, both in the number of variables and in the number of observations, we subset it to include only the case ID, the independent variable, and the dependent variable. The resulting subset (357 observations), used for all subsequent analysis, has these three columns:

drugs.data.table<-read.table(file.choose(), header = TRUE, sep=',')


summary(drugs.data.table)
##      CASEID        ALCTRY          CIGTRY     
##  Min.   :  1   Min.   : 2.00   Min.   : 7.00  
##  1st Qu.: 90   1st Qu.:13.00   1st Qu.:14.00  
##  Median :179   Median :16.00   Median :16.00  
##  Mean   :179   Mean   :15.47   Mean   :15.82  
##  3rd Qu.:268   3rd Qu.:18.00   3rd Qu.:18.00  
##  Max.   :357   Max.   :27.00   Max.   :35.00
str(drugs.data.table)
## 'data.frame':    357 obs. of  3 variables:
##  $ CASEID: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ ALCTRY: int  12 17 22 21 13 18 6 18 17 18 ...
##  $ CIGTRY: int  13 11 22 16 11 18 12 18 20 8 ...
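The subsetting step itself is not shown in this report. A minimal sketch of how such a three-column subset might be built is given below, using a small made-up data frame in place of the full survey file. The object names `full.data` and `subset.data`, the extra column `OTHERVAR`, the value 991 standing in for a non-age code, and the cutoff of 100 are all assumptions for illustration, not part of the original analysis.

```r
# Hypothetical stand-in for the full survey file (made-up values, not NSDUH data)
full.data <- data.frame(
  CASEID   = 1:6,
  ALCTRY   = c(12, 17, 991, 21, 13, 18),  # 991 stands in for a non-age special code (assumed)
  CIGTRY   = c(13, 11, 22, 991, 11, 18),
  OTHERVAR = c(1, 0, 1, 1, 0, 0)          # placeholder for the remaining variables
)

# Keep only the identifier, the dependent variable, and the independent variable
keep.cols   <- c("CASEID", "ALCTRY", "CIGTRY")
subset.data <- full.data[, keep.cols]

# Drop rows whose recorded "age" is not a plausible age in years (assumed cutoff of 100)
valid       <- subset.data$ALCTRY < 100 & subset.data$CIGTRY < 100
subset.data <- subset.data[valid, ]

print(subset.data)
```

Filtering before fitting matters because any non-age code left in either column would be treated by `lm()` as a genuine age and could distort the fit.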

Linear Regression Model

By definition a regression equation describes how the mean value of a y-variable (ALCTRY in our case) relates to specific values of the x-variable(s) (CIGTRY) used to predict y.

# Linear Model

attach(drugs.data.table)

Drugs.lm <- lm(ALCTRY~CIGTRY) 

# Summary of the model

summary(Drugs.lm)
## 
## Call:
## lm(formula = ALCTRY ~ CIGTRY)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.0548  -1.5604  -0.0564   1.4381  11.4428 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.54775    0.69339   10.88   <2e-16 ***
## CIGTRY       0.50079    0.04268   11.73   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.983 on 355 degrees of freedom
## Multiple R-squared:  0.2795, Adjusted R-squared:  0.2774 
## F-statistic: 137.7 on 1 and 355 DF,  p-value: < 2.2e-16
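The coefficients in this output are ordinary least-squares estimates, and in a one-predictor model the reported R-squared equals the squared sample correlation. The sketch below checks both facts on simulated data, since the survey subset is not bundled with this report. The variables `x`, `y`, and `fit`, and the parameters used to simulate them, are hypothetical.

```r
# Simulated data (hypothetical; not the survey sample)
set.seed(1)
x <- round(runif(200, min = 8, max = 25))
y <- 7.5 + 0.5 * x + rnorm(200, sd = 3)

fit <- lm(y ~ x)

# Closed-form least-squares estimates
b1 <- cov(x, y) / var(x)        # slope
b0 <- mean(y) - b1 * mean(x)    # intercept

# Compare with the coefficients returned by lm()
print(rbind(lm = coef(fit), closed.form = c(b0, b1)))

# R-squared from the model versus the squared sample correlation
r.squared <- summary(fit)$r.squared
print(c(r.squared = r.squared, cor.squared = cor(x, y)^2))
```

The two rows of the first table and the two entries of the second vector should agree up to floating-point error.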

Scatter Plot

# drugs.data.table is already attached above, so it is not attached a second time here
# Scatter diagram

plot(CIGTRY, ALCTRY, pch=21, bg='blue', main="Age (in years) of first Alcohol Use vs Age (in years) of first Cigarette Use", xlab = "Age (years) of first cigarette try", ylab = "Age (years) of first alcohol try")

Regression

plot(CIGTRY, ALCTRY, pch=21, bg='blue', main="Age (in years) of first Alcohol Use vs Age (in years) of first Cigarette Use", xlab = "Age (years) of first cigarette try", ylab = "Age (years) of first alcohol try")
abline(Drugs.lm$coef, lwd=1)

Plot of fitted vs Residuals

par(mfrow=c(1,1))

Drugs.res <- resid(Drugs.lm)
plot(fitted(Drugs.lm), Drugs.res, pch=21, cex=1, bg='blue',main="Plot of Fitted Values vs. Residuals", xlab = "Fitted Values of Model", ylab = "Residuals")
abline(0,0, lwd=1)

Confidence Intervals

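plot(CIGTRY, ALCTRY, pch=21, bg='blue', main="Age (in years) of first Alcohol Use vs Age (in years) of first Cigarette Use", xlab = "Age (years) of first cigarette try", ylab = "Age (years) of first alcohol try")
abline(Drugs.lm$coef, lwd=1)
# 95% pointwise confidence band for the mean response, evaluated on a grid of CIGTRY values
new.x <- data.frame(CIGTRY = seq(min(CIGTRY), max(CIGTRY), length.out = 100))
Drugs.ci <- predict(Drugs.lm, newdata = new.x, interval = "confidence", level = 0.95)
lines(new.x$CIGTRY, Drugs.ci[, "lwr"], col="red", lwd=1)
lines(new.x$CIGTRY, Drugs.ci[, "upr"], col="red", lwd=1)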

Plotting the residuals

Drugs.resid <- Drugs.lm$residuals 
Drugs.st.resid <- rstandard(Drugs.lm) 
#print(cbind(Drugs.resid, Drugs.st.resid))

par(mfrow=c(1,2))
plot(Drugs.resid,pch=23, bg='blue',cex=2,lwd=2,main="Residuals") 
plot(Drugs.st.resid, pch=23, bg='red', cex=2, lwd=2, main="Standardized Residuals")
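Index plots of residuals can be supplemented by a count of unusually large standardized residuals and a normal Q-Q plot. The sketch below does this on simulated data, since the survey subset is not bundled with this report. The objects `x`, `y`, `fit`, `st.resid`, and `flagged` are hypothetical, and the threshold of 2 is a common rule of thumb rather than something taken from the original analysis.

```r
# Simulated data (hypothetical; not the survey sample)
set.seed(2)
x <- round(runif(200, min = 8, max = 25))
y <- 7.5 + 0.5 * x + rnorm(200, sd = 3)
fit <- lm(y ~ x)

# Standardized residuals, as computed by rstandard() in the report
st.resid <- rstandard(fit)

# Observations whose standardized residual exceeds 2 in absolute value
flagged <- which(abs(st.resid) > 2)
print(length(flagged))

# Normal Q-Q plot of the standardized residuals as a check on the normality assumption
qqnorm(st.resid, main = "Normal Q-Q Plot of Standardized Residuals")
qqline(st.resid)
```

With roughly normal errors, about 5% of observations would be expected to fall beyond the threshold, so a handful of flagged points is not by itself a sign of a poor fit.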

Interpretation of the results from the entire analysis

Here we summarize the output of our regression analysis as well as the plots:

1. Linear regression model: The regression line fitted to this data can be written as $\hat{y}_i = 7.54775 + 0.50079\,x_i$ (the hat denotes an estimate).

b0 = 7.54775 (y-intercept: the predicted age of first alcohol use when the age of first cigarette use is 0; since the smallest observed CIGTRY is 7, this is an extrapolation with no direct practical meaning)

b1 = 0.50079 (slope: for each additional year in the age of first cigarette use, the predicted age of first alcohol use increases by 0.50079 years on average)

From the summary of the model we see that both coefficients are highly statistically significant (p-values below 2e-16).
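As an illustration of how the fitted line is used, the predicted age of first alcohol use for a respondent who first smoked at age 16 (the sample median of CIGTRY) works out as follows. The choice of $x = 16$ is just an example value.

```latex
\hat{y} = 7.54775 + 0.50079 \times 16 = 7.54775 + 8.01264 = 15.56039 \approx 15.56 \text{ years}
```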

  1. The value of R^2 is 0.2795, which means that the age of first cigarette use explains 27.95% of the observed variation in the age of first alcohol use (both measured in years). The correlation coefficient is $r = \sqrt{R^2} \approx 0.529$, taking the positive root because the slope is positive, so there is a moderate positive correlation between the two variables.

  2. The scatter plots suggest an increasing linear relationship (positive slope) between the variables, with no obvious nonlinear pattern.

  3. The plots of the residuals do not exhibit any pattern, so the linear model appears to be correct on average across the range of fitted values.

  4. p-value: The p-value is less than 2.2e-16, far below the alpha level of 0.05, so an association this strong between x and y would be very unlikely to arise by chance alone if the null hypothesis were true. The variability observed in the dependent variable is therefore associated with the variability in the independent variable (F-statistic 137.7 on 1 and 355 degrees of freedom) and is not explained by random variation alone. Based on the p-value approach we reject the null hypothesis. This establishes an association, not a causal effect.

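  5. Confidence interval: The 95% confidence intervals for the coefficients are approximately 6.18 to 8.91 for b0 and 0.42 to 0.58 for b1, computed from the standard errors in the model summary. Under repeated sampling, about 95% of intervals constructed this way would contain the true intercept and slope. Neither interval contains zero, which is consistent with the significance tests above. The red lines in the confidence interval plot show this uncertainty around the fitted line.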
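The summary quantities discussed in the list above can be reproduced by hand from the regression output. The worked steps below use only the estimates and standard errors printed by `summary(Drugs.lm)`. The critical value $t_{0.975,\,355} \approx 1.967$ is an approximation supplied here, not a figure from the original output.

```latex
\begin{aligned}
r &= \sqrt{R^2} = \sqrt{0.2795} \approx 0.529 \\
t_{b_1} &= \frac{b_1}{\mathrm{SE}(b_1)} = \frac{0.50079}{0.04268} \approx 11.73 \\
F &= t_{b_1}^2 \approx 137.7 \quad \text{(one predictor, so } F = t^2\text{)} \\
b_1 \pm t_{0.975,\,355}\,\mathrm{SE}(b_1) &= 0.50079 \pm 1.967 \times 0.04268 \approx (0.42,\ 0.58) \\
b_0 \pm t_{0.975,\,355}\,\mathrm{SE}(b_0) &= 7.54775 \pm 1.967 \times 0.69339 \approx (6.18,\ 8.91)
\end{aligned}
```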

References

  1. Morton, K. B., Martin, P. C., Shook-Sa, B. E., Chromy, J. R., and Hirsch, E. L. (January 2013). 2012 National Survey on Drug Use and Health: Sample Design Report. Prepared by RTI International, Research Triangle Park, North Carolina 27709, for the Substance Abuse and Mental Health Services Administration, Rockville, Maryland 20857.