library(datasets)
toothgrowth<-ToothGrowth
The data set has the following structural characteristics:
Number of Observations:
nrow(toothgrowth)
## [1] 60
Number of Columns:
ncol(toothgrowth)
## [1] 3
Column Names:
names(toothgrowth)
## [1] "len" "supp" "dose"
The five number summary shows:
summary(toothgrowth)
## len supp dose
## Min. : 4.20 OJ:30 Min. :0.500
## 1st Qu.:13.07 VC:30 1st Qu.:0.500
## Median :19.25 Median :1.000
## Mean :18.81 Mean :1.167
## 3rd Qu.:25.27 3rd Qu.:2.000
## Max. :33.90 Max. :2.000
“Supp” is a categorical variable with two categories, OJ and VC, each with 30 observations.
A comparison of the five number summaries across the “supp” categories shows:
Supp=“OJ”
oj_len<-toothgrowth[toothgrowth$supp=="OJ",]
summary(oj_len)
## len supp dose
## Min. : 8.20 OJ:30 Min. :0.500
## 1st Qu.:15.53 VC: 0 1st Qu.:0.500
## Median :22.70 Median :1.000
## Mean :20.66 Mean :1.167
## 3rd Qu.:25.73 3rd Qu.:2.000
## Max. :30.90 Max. :2.000
sd(oj_len$len)
## [1] 6.605561
Supp=“VC”
vc_len<-toothgrowth[toothgrowth$supp=="VC",]
summary(vc_len)
## len supp dose
## Min. : 4.20 OJ: 0 Min. :0.500
## 1st Qu.:11.20 VC:30 1st Qu.:0.500
## Median :16.50 Median :1.000
## Mean :16.96 Mean :1.167
## 3rd Qu.:23.10 3rd Qu.:2.000
## Max. :33.90 Max. :2.000
sd(vc_len$len)
## [1] 8.266029
Comparing the two subsets shows that the “OJ” supplement has a higher mean length and a higher first quartile, median, and third quartile, along with a lower standard deviation, than the “VC” subset.
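The same comparison can be made in one step rather than by subsetting each group. The following is a minimal sketch, assuming the built-in ToothGrowth data; the names supp_mean and supp_sd are illustrative and not part of the original analysis.

```r
library(datasets)

# Mean tooth length for each supplement type (OJ, VC)
supp_mean <- tapply(ToothGrowth$len, ToothGrowth$supp, mean)

# Standard deviation of tooth length for each supplement type
supp_sd <- tapply(ToothGrowth$len, ToothGrowth$supp, sd)

# Stack the two statistics into one small table, rounded for display
print(round(rbind(mean = supp_mean, sd = supp_sd), 2))
```

tapply() splits the first argument by the grouping variable and applies the function to each piece, so no intermediate data frames are needed.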
Next, a plot between the dose and length shows the nature of each variable and the relationship between the two.
plot(toothgrowth$dose,toothgrowth$len,main="Comparison of Length and Vitamin C Dose", ylab="Teeth Length",xlab="Vitamin C Dose")
The plot reveals that “dose” takes only three distinct values, .5, 1, and 2 mg, so it can also be treated as a categorical variable; a table shows the count. The plot also shows that length tends to reach greater values at higher doses. A comparison of the five number summaries is also included below.
table(toothgrowth$dose)
##
## 0.5 1 2
## 20 20 20
dose_.5<-toothgrowth[toothgrowth$dose==.5,]
dose_1<-toothgrowth[toothgrowth$dose==1,]
dose_2<-toothgrowth[toothgrowth$dose==2,]
summary(dose_.5$len)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.200 7.225 9.850 10.600 12.250 21.500
sd(dose_.5$len)
## [1] 4.499763
summary(dose_1$len)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 13.60 16.25 19.25 19.74 23.38 27.30
sd(dose_1$len)
## [1] 4.415436
summary(dose_2$len)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18.50 23.52 25.95 26.10 27.83 33.90
sd(dose_2$len)
## [1] 3.77415
The results of the analysis above show that tooth length is greater at higher doses of Vitamin C, and that the 2 mg group also has the lowest standard deviation in length.
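The per-dose figures can also be gathered into a single table. This is a sketch only, assuming the built-in ToothGrowth data; dose_stats is an illustrative name.

```r
library(datasets)

# Split tooth length by dose level, then compute mean and sd for each group.
# The result is a 2 x 3 matrix with one column per dose (0.5, 1, 2).
dose_stats <- sapply(
  split(ToothGrowth$len, ToothGrowth$dose),
  function(x) c(mean = mean(x), sd = sd(x))
)

print(round(dose_stats, 3))
```

Reading across the "mean" row shows the increase in length with dose, and the "sd" row shows which group is least dispersed.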
The findings of the exploratory analysis show a positive relationship between tooth length and Dose, and longer teeth on average for the OJ level of Supp than for VC. However, given that both explanatory variables are of a categorical nature, each must be transformed to allow for meaningful analysis.
First, transform each variable into dummy variables. Supp will require only one dummy variable, while Dose will require two. If dv_supp=0, then supp=“VC”; if dv_1 and dv_2 both equal 0, then dose=.5.
tg<-toothgrowth
tg$dv_supp[tg$supp=="OJ"]<-1
tg$dv_supp[tg$supp=="VC"]<-0
tg$dv_1[tg$dose==1]<-1
tg$dv_1[tg$dose!=1]<-0
tg$dv_2[tg$dose==2]<-1
tg$dv_2[tg$dose!=2]<-0
myvars<- c("len","dv_supp","dv_1","dv_2")
tg1<-tg[myvars]
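The indicator columns can also be built in a vectorised way, which avoids assigning into a column that does not exist yet. The following is a sketch under that assumption; tg_alt is a hypothetical name and is not used elsewhere in this report.

```r
library(datasets)

# Each comparison returns TRUE/FALSE; as.integer() converts that to 1/0
tg_alt <- data.frame(
  len     = ToothGrowth$len,
  dv_supp = as.integer(ToothGrowth$supp == "OJ"),  # 1 = OJ, 0 = VC
  dv_1    = as.integer(ToothGrowth$dose == 1),     # 1 = 1 mg dose
  dv_2    = as.integer(ToothGrowth$dose == 2)      # 1 = 2 mg dose
)

# Rows with dv_1 = 0 and dv_2 = 0 form the .5 mg baseline group
print(table(dv_1 = tg_alt$dv_1, dv_2 = tg_alt$dv_2))
```

The cross-tabulation is a quick check that every observation falls into exactly one of the three dose groups.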
Now that the data is set, let’s set up the hypothesis testing.
The data set on which to estimate parameters now contains 3 dummy variables as explanatory variables and length as the dependent variable. Given unknown population means and variances of the length of the guinea pig teeth, and 60 observations, we could use the normal distribution for statistical testing of the regression and its coefficients. However, given the relatively small number of observations, and to be conservative in that regard, the t-distribution is selected.
For each b in the estimated regression, h_0: b_i=0, h_a: b_i!=0.
This requires a two-tailed test, and alpha is set to .05. Thus, the critical t-value for rejection is approximately 2.0032 for 56 degrees of freedom. For each b, the null hypothesis is rejected if the absolute value of the calculated t-statistic exceeds 2.0032.
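The critical value can be reproduced from the t-distribution quantile function. This is a small sketch; alpha, df_resid, t_crit, and z_crit are illustrative names, and the degrees of freedom are taken as 60 observations minus 4 estimated coefficients.

```r
alpha <- 0.05

# Residual degrees of freedom: 60 observations minus 4 coefficients
# (intercept plus three dummy variables)
df_resid <- 60 - 4

# Two-tailed test: put alpha / 2 in each tail
t_crit <- qt(1 - alpha / 2, df = df_resid)

# Normal-distribution counterpart, for comparison with the t value
z_crit <- qnorm(1 - alpha / 2)

print(c(t_critical = t_crit, z_critical = z_crit))
```

The t critical value is slightly larger than the normal one, which is the sense in which the t-distribution is the more conservative choice.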
reg<-lm(len~dv_supp+dv_1+dv_2,data=tg1)
summary(reg)
##
## Call:
## lm(formula = len ~ dv_supp + dv_1 + dv_2, data = tg1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.085 -2.751 -0.800 2.446 9.650
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.7550 0.9883 8.859 3.05e-12 ***
## dv_supp 3.7000 0.9883 3.744 0.000429 ***
## dv_1 9.1300 1.2104 7.543 4.38e-10 ***
## dv_2 15.4950 1.2104 12.802 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.828 on 56 degrees of freedom
## Multiple R-squared: 0.7623, Adjusted R-squared: 0.7496
## F-statistic: 59.88 on 3 and 56 DF, p-value: < 2.2e-16
The summary results of the regression further support the findings of the exploratory analysis. The estimated coefficients are all positive and statistically significantly different from 0 at the .001 level of significance; thus we reject the null hypothesis for each beta coefficient in the regression. The same conclusion is drawn for the overall regression: at least one of the coefficients differs statistically significantly from 0 at the .001 level of significance. In this regard, the adjusted R-Squared of 75% is meaningful.
Further explanation of the coefficients is necessary given the use of dummy variables. First, the intercept represents the estimated average tooth length of guinea pigs that receive the VC supplement and .5 mg doses of Vitamin C. That value is 8.755. Next, the intercept plus dv_supp indicates the average tooth length of guinea pigs that receive the OJ supplement but only .5 mg doses. Finally, the intercept plus either dv_1 or dv_2 indicates the average tooth length of guinea pigs receiving the 1 mg or 2 mg dose, respectively, and the VC supplement.
The last step is to calculate t-confidence intervals of the coefficients.
confint(reg)
## 2.5 % 97.5 %
## (Intercept) 6.775238 10.734762
## dv_supp 1.720238 5.679762
## dv_1 6.705297 11.554703
## dv_2 13.070297 17.919703
Each confidence interval bound lies approximately 2.0032 standard errors from its estimated coefficient, one on either side.
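The intervals can be rebuilt by hand as estimate plus or minus the critical t-value times the standard error. The sketch below refits the same regression so that it runs on its own; the dummy variables are constructed in a vectorised form, and manual_ci is an illustrative name.

```r
library(datasets)

# Rebuild the dummy-variable data set and the regression
tg1 <- data.frame(
  len     = ToothGrowth$len,
  dv_supp = as.integer(ToothGrowth$supp == "OJ"),
  dv_1    = as.integer(ToothGrowth$dose == 1),
  dv_2    = as.integer(ToothGrowth$dose == 2)
)
reg <- lm(len ~ dv_supp + dv_1 + dv_2, data = tg1)

# Pull the estimates and standard errors from the coefficient table
coef_table <- coef(summary(reg))
est <- coef_table[, "Estimate"]
se  <- coef_table[, "Std. Error"]

# Critical t-value for a 95% two-sided interval
t_crit <- qt(0.975, df = df.residual(reg))

# Estimate +/- t_crit * standard error
manual_ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)
print(manual_ci)

# Should agree with confint() up to numerical tolerance
print(all.equal(unname(manual_ci), unname(confint(reg))))
```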
The above conclusions are predicated on the assumptions behind the t-distribution being respected. These assumptions include that the data are iid normal and that the data are symmetric around the mean (i.e. no skewness). In addition, the t quantiles and parameters approach those of the normal distribution for large degrees of freedom.
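One common way to probe these assumptions is to examine the regression residuals. The following is a sketch only, not part of the original analysis: it assumes that checking the residuals is an acceptable stand-in for checking the data, and it fits the model with factor terms, which is an equivalent parameterisation of the dummy-variable regression and yields the same residuals. The names reg_chk, res, sw, and skew are illustrative.

```r
library(datasets)

# Equivalent model using factors instead of hand-made dummy variables
reg_chk <- lm(len ~ supp + factor(dose), data = ToothGrowth)
res <- residuals(reg_chk)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal
sw <- shapiro.test(res)
print(sw)

# Sample skewness of the residuals; values near 0 suggest symmetry
skew <- mean((res - mean(res))^3) / mean((res - mean(res))^2)^(3 / 2)
print(skew)
```

A small Shapiro-Wilk p-value or a skewness far from 0 would be a reason to treat the t-based conclusions with more caution.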