I chose the built-in R dataset 'ToothGrowth' because the study struck me as unusual, so I wanted to analyze it. The ToothGrowth dataset contains the results of an experiment studying the effect of vitamin C on tooth growth in 60 guinea pigs.
data("ToothGrowth")
head(ToothGrowth)
## len supp dose
## 1 4.2 VC 0.5
## 2 11.5 VC 0.5
## 3 7.3 VC 0.5
## 4 5.8 VC 0.5
## 5 6.4 VC 0.5
## 6 10.0 VC 0.5
Summary of the data
summary(ToothGrowth)
## len supp dose
## Min. : 4.20 OJ:30 Min. :0.500
## 1st Qu.:13.07 VC:30 1st Qu.:0.500
## Median :19.25 Median :1.000
## Mean :18.81 Mean :1.167
## 3rd Qu.:25.27 3rd Qu.:2.000
## Max. :33.90 Max. :2.000
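Since summary() pools all 60 rows together, a per-group view can make the structure of the experiment easier to see. The following is a minimal sketch, not part of the original analysis; the names `counts` and `group_means` are my own illustrative choices.

```r
# Load the built-in dataset (ships with base R's datasets package).
data("ToothGrowth")

# Number of observations for each supplement / dose combination.
counts <- table(ToothGrowth$supp, ToothGrowth$dose)
print(counts)

# Mean tooth length for each supplement / dose combination.
group_means <- aggregate(len ~ supp + dose, data = ToothGrowth, FUN = mean)
print(group_means)
```

The table shows whether the design is balanced, and the group means give a first impression of how length changes with dose before any model is fitted.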
Plotting to Check Linearity
# len (y) against dose (x), with a fitted least-squares line
library(ggplot2)
ggplot(ToothGrowth, aes(x = dose, y = len)) +
  geom_point(size = 2, shape = 23) +
  geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'
# Use bare column names in the formula so the model also works with predict() on new data
m1 <- lm(len ~ dose, data = ToothGrowth)
print(m1)
##
## Call:
## lm(formula = len ~ dose, data = ToothGrowth)
##
## Coefficients:
## (Intercept)         dose
##       7.423        9.764
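print() only reports the two coefficients. To judge how well the line fits, the model summary and confidence intervals are a natural next step. Below is a rough sketch under my own assumptions: the model is refit as `m_dose` (a hypothetical name) with bare column names, and the dose of 1.5 is just an illustrative value that was not one of the doses in the experiment.

```r
data("ToothGrowth")

# Refit with bare column names so predict() can match `dose` in newdata.
m_dose <- lm(len ~ dose, data = ToothGrowth)

# Coefficient table with standard errors, plus R-squared.
print(summary(m_dose))

# 95% confidence intervals for the intercept and slope.
print(confint(m_dose))

# Predicted mean tooth length at an illustrative dose of 1.5,
# with a 95% confidence interval for that mean.
new_dose <- data.frame(dose = 1.5)
pred <- predict(m_dose, newdata = new_dose, interval = "confidence")
print(pred)
```

Writing the formula as len ~ dose rather than with ToothGrowth$ prefixes is what lets predict() pick up the dose column from the new data frame.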
residuals <- resid(m1)
hist(residuals)
qqnorm(residuals)
qqline(residuals)
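The histogram and Q-Q plot check normality by eye. As a possible complement, a residuals-versus-fitted plot and a Shapiro-Wilk test could be added. This is only a sketch and was not part of the original analysis; `m_check`, `res`, and `sw` are illustrative names, and the model is refit here so the block runs on its own.

```r
data("ToothGrowth")
m_check <- lm(len ~ dose, data = ToothGrowth)
res <- resid(m_check)

# Residuals vs fitted values: a curved pattern would hint at nonlinearity,
# a funnel shape at non-constant variance.
plot(fitted(m_check), res,
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs fitted")
abline(h = 0, lty = 2)

# Shapiro-Wilk test of normality on the residuals.
# A small p-value would be evidence against normality.
sw <- shapiro.test(res)
print(sw)
```

Because the dose only takes a few distinct values, the fitted values fall into vertical bands in the first plot. That is expected here and not a problem in itself.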
Conclusion: This was a very simple dataset that needed no data cleaning. I believe a linear regression is a reasonable fit, since the guinea pigs' tooth length increases as the dose increases: the fitted slope of 9.764 means each one-unit increase in dose is associated with an increase of about 9.8 in tooth length. The residuals are also roughly normally distributed in the normal probability plot.
Sources: http://www.sthda.com/english/wiki/r-built-in-data-sets