Examination of the ToothGrowth Dataset

Aaron Chandler

Exploratory Data Analysis

library(datasets)
toothgrowth<-ToothGrowth

The data set has the following structural characteristics:
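One way to inspect that structure is sketched below with base R's str() and dim(). The name tg_raw is an illustrative choice and is not part of the original analysis.

```r
library(datasets)

# Illustrative copy of the built-in data set (tg_raw is a hypothetical name)
tg_raw <- ToothGrowth

# Compact display of column types and the first few values
str(tg_raw)

# Number of observations and variables
tg_dim <- dim(tg_raw)
print(tg_dim)
```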

The five-number summary shows:

A comparison of the five-number summaries across the “supp” categories shows:

Next, a plot between the dose and length shows the nature of each variable and the relationship between the two.
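The summaries and plot described above could be produced along the following lines. This is a sketch under assumptions: it uses fivenum() (Tukey's five-number summary) on tooth length, tapply() to split by supplement, and a basic scatter plot. The object names are illustrative.

```r
library(datasets)

tg_raw <- ToothGrowth  # hypothetical name for an illustrative copy

# Five-number summary (min, lower hinge, median, upper hinge, max) of tooth length
five_num <- fivenum(tg_raw$len)
print(five_num)

# The same summary computed separately for each supplement type
by_supp <- tapply(tg_raw$len, tg_raw$supp, fivenum)
print(by_supp)

# Tooth length against dose, with points colored by supplement
plot(tg_raw$dose, tg_raw$len,
     col = as.integer(tg_raw$supp),
     xlab = "Dose (mg)", ylab = "Tooth length",
     main = "Tooth length by dose")
legend("topleft", legend = levels(tg_raw$supp),
       col = seq_along(levels(tg_raw$supp)), pch = 1)
```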

The exploratory analysis shows a positive relationship between dose and tooth length. However, because both explanatory variables, dose and supp, are treated as categorical, each must be transformed to allow for meaningful analysis.

Inferential Analysis

First, transform each variable into dummy variables. Supp requires only one dummy variable, while Dose requires two. If dv_supp=0, then supp=“VC”; if dv_1 and dv_2 both equal 0, then dose=.5.

tg<-toothgrowth
tg$dv_supp[tg$supp=="OJ"]<-1
tg$dv_supp[tg$supp=="VC"]<-0
tg$dv_1[tg$dose==1]<-1
tg$dv_1[tg$dose!=1]<-0
tg$dv_2[tg$dose==2]<-1
tg$dv_2[tg$dose!=2]<-0

myvars<- c("len","dv_supp","dv_1","dv_2")
tg1<-tg[myvars]
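As a sanity check on the dummy coding, the manual columns can be compared with the design matrix R builds from factors. The sketch below assumes VC and the .5 dose are the reference levels, as described above. The names tg_check, supp_f and dose_f are illustrative.

```r
library(datasets)

tg_check <- ToothGrowth  # hypothetical name for an illustrative copy

# Manual dummies, coded the same way as in the text
dv_supp <- as.numeric(tg_check$supp == "OJ")
dv_1    <- as.numeric(tg_check$dose == 1)
dv_2    <- as.numeric(tg_check$dose == 2)

# Factor versions, with VC as the reference level for supp
supp_f <- relevel(tg_check$supp, ref = "VC")
dose_f <- factor(tg_check$dose)

# Treatment-contrast design matrix generated by R
mm <- model.matrix(~ supp_f + dose_f)

# TRUE if the manual dummies match the generated columns exactly
same_coding <- all(mm[, "supp_fOJ"] == dv_supp) &&
  all(mm[, "dose_f1"] == dv_1) &&
  all(mm[, "dose_f2"] == dv_2)
print(same_coding)
```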

Now that the data are prepared, let’s set up the hypothesis testing.

Hypothesis Testing

The data set on which to estimate parameters now contains 3 dummy variables as explanatory variables and length as the dependent variable. Given unknown population means and variances of the length of the guinea pig teeth, and 60 observations, we could use the normal distribution for statistical testing of the regression and its coefficients. However, given the relatively small number of observations, and to be conservative in that regard, the t-distribution is selected.

For each b in the estimated regression, h_0: b_i=0, h_a: b_i!=0.

This requires a two-tailed test, with alpha set to .05. Thus, the critical t-value for rejection is approximately 2.003 for 56 degrees of freedom. For each b, the null hypothesis is rejected if the absolute value of the calculated t-statistic exceeds this critical value.
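The critical value can be checked directly with qt(). The sketch below assumes 60 observations and 4 estimated parameters (the intercept plus three dummies), which gives 56 residual degrees of freedom. The variable names are illustrative.

```r
# Two-tailed test at alpha = .05
alpha <- 0.05

# Residual degrees of freedom: 60 observations minus 4 estimated coefficients
df_resid <- 60 - 4

# Upper-tail quantile of the t-distribution; reject when |t| exceeds it
t_crit <- qt(1 - alpha / 2, df = df_resid)
print(round(t_crit, 4))
```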

reg<-lm(len~dv_supp+dv_1+dv_2,data=tg1)
summary(reg)
## 
## Call:
## lm(formula = len ~ dv_supp + dv_1 + dv_2, data = tg1)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -7.085 -2.751 -0.800  2.446  9.650 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   8.7550     0.9883   8.859 3.05e-12 ***
## dv_supp       3.7000     0.9883   3.744 0.000429 ***
## dv_1          9.1300     1.2104   7.543 4.38e-10 ***
## dv_2         15.4950     1.2104  12.802  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.828 on 56 degrees of freedom
## Multiple R-squared:  0.7623, Adjusted R-squared:  0.7496 
## F-statistic: 59.88 on 3 and 56 DF,  p-value: < 2.2e-16

The summary results of the regression further support the findings of the exploratory analysis. The estimated coefficients are all positive and statistically significantly different from 0 at the .001 significance level, so we reject the null hypothesis for each beta coefficient in the regression. The same conclusion is drawn for the overall regression: the F-test indicates that at least one of the coefficients differs statistically significantly from 0 at the .001 level. In this regard, the adjusted R-squared of 75% is meaningful.

Further explanation of the coefficients is necessary given the use of dummy variables. First, the intercept represents the estimated average tooth length of guinea pigs that receive the VC supplement at a .5 mg dose of Vitamin C. That value is 8.755. Next, the intercept plus dv_supp gives the estimated average tooth length of guinea pigs that receive the OJ supplement at the same .5 mg dose (8.755 + 3.700 = 12.455). Finally, the intercept plus either dv_1 or dv_2 gives the estimated average tooth length of guinea pigs receiving the VC supplement at a 1 mg dose (8.755 + 9.130 = 17.885) or a 2 mg dose (8.755 + 15.495 = 24.250).
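The group estimates described above can be recovered from the fitted coefficients. The following is a minimal sketch that refits the same model on a fresh copy of the data. The names tg_fit, reg_fit and est are illustrative.

```r
library(datasets)

# Rebuild the dummy variables on an illustrative copy of the data
tg_fit <- transform(ToothGrowth,
                    dv_supp = as.numeric(supp == "OJ"),
                    dv_1    = as.numeric(dose == 1),
                    dv_2    = as.numeric(dose == 2))

reg_fit <- lm(len ~ dv_supp + dv_1 + dv_2, data = tg_fit)
b <- coef(reg_fit)

# Model-implied average tooth length for selected groups
est <- c(
  vc_half = unname(b["(Intercept)"]),                 # VC, .5 mg
  oj_half = unname(b["(Intercept)"] + b["dv_supp"]),  # OJ, .5 mg
  vc_1mg  = unname(b["(Intercept)"] + b["dv_1"]),     # VC, 1 mg
  vc_2mg  = unname(b["(Intercept)"] + b["dv_2"])      # VC, 2 mg
)
print(round(est, 3))
```

Because the model is additive, these are model-implied averages and need not equal the raw cell means for each supplement and dose combination.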

The last step is to calculate t-confidence intervals of the coefficients.
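A sketch of that calculation using confint(), which builds t-based intervals for lm coefficients. A 95% level is assumed here to match the alpha of .05 used above. The names tg_ci and reg_ci are illustrative.

```r
library(datasets)

# Rebuild the dummy variables on an illustrative copy of the data
tg_ci <- transform(ToothGrowth,
                   dv_supp = as.numeric(supp == "OJ"),
                   dv_1    = as.numeric(dose == 1),
                   dv_2    = as.numeric(dose == 2))

reg_ci <- lm(len ~ dv_supp + dv_1 + dv_2, data = tg_ci)

# 95% t-confidence intervals: estimate +/- t critical value * standard error
ci <- confint(reg_ci, level = 0.95)
print(round(ci, 3))
```

An interval that excludes 0 is consistent with rejecting the null hypothesis for that coefficient at the .05 level.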

Assumptions

The above conclusions are predicated on the assumptions behind the t-distribution being respected. These assumptions include that the data are iid normal, that the data are symmetric around the mean (i.e. no skewness), and that the t quantiles approach those of the normal distribution for large degrees of freedom.
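These assumptions can be probed informally through the regression residuals. The sketch below is one possible check, not part of the original analysis: it uses a Shapiro-Wilk test and a normal Q-Q plot, and the names tg_diag, reg_diag and sw are illustrative.

```r
library(datasets)

# Rebuild the dummy variables on an illustrative copy of the data
tg_diag <- transform(ToothGrowth,
                     dv_supp = as.numeric(supp == "OJ"),
                     dv_1    = as.numeric(dose == 1),
                     dv_2    = as.numeric(dose == 2))

reg_diag <- lm(len ~ dv_supp + dv_1 + dv_2, data = tg_diag)
res <- residuals(reg_diag)

# Shapiro-Wilk test: a small p-value suggests departure from normality
sw <- shapiro.test(res)
print(sw)

# Normal Q-Q plot: points close to the line suggest roughly normal residuals
qqnorm(res)
qqline(res)
```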