Refer to the file homework04data01.csv, which can be found in the homework section of Canvas, to answer Questions 1-4. This data set consists of antler-length data gathered by zoologists. They are interested in seeing if we can use antler length (in inches) to classify four species (a, b, c, or d for ease of notation).
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.2.2
# Load the data from the CSV file
data <- read.csv("homework04data01.csv")
# Create a box plot for the antler length of each species
ggplot(data, aes(x = Species, y = Length)) +
  geom_boxplot() +
  xlab("Species") +
  ylab("Antler length (inches)")
The box plot shows that the interquartile ranges and the median antler
lengths differ across species a, b, c, and d. Species a and c have
higher lengths, with higher upper and lower values, than species b
and d. Species b has the lowest average value of the four
species.
model <- aov(Length~Species, data = data)
summary(model)
##             Df Sum Sq Mean Sq F value   Pr(>F)
## Species      3   90.0  30.001   8.121 8.58e-05 ***
## Residuals   81  299.2   3.694
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Since the p-value of 8.58e-05 is less than the usual significance levels of 0.05 and 0.1, we reject the null hypothesis that all four species have the same mean antler length. Rejecting the null hypothesis means that at least one species has a mean antler length that differs from the others; the ANOVA alone does not tell us which species differ.
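As a rough check on the ANOVA table above, the F statistic and its p-value can be recomputed from the reported mean squares and degrees of freedom. This is only a sketch that uses the rounded values printed by summary(model); the variable names are illustrative and are not part of the original analysis.

```r
# Values copied (rounded) from the ANOVA table printed by summary(model)
ms_species   <- 30.001  # Mean Sq for Species
ms_residuals <- 3.694   # Mean Sq for Residuals
df_species   <- 3
df_residuals <- 81

# F statistic: between-species mean square over residual mean square
f_value <- ms_species / ms_residuals

# Upper-tail probability of the F distribution gives the p-value
p_value <- pf(f_value, df_species, df_residuals, lower.tail = FALSE)

print(round(f_value, 2))
print(p_value < 0.05)
```

Because the inputs are rounded, the recomputed p-value may differ slightly in its last digits from the one R prints.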
Construct 95% confidence intervals for each pair of species by using Tukey multiple comparisons of means. Do any of the pairings imply a difference in average antler length between species?
tukey_test <- TukeyHSD(aov(Length ~ Species, data = data))
tukey_test
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = Length ~ Species, data = data)
##
## $Species
##           diff         lwr        upr     p adj
## b-a -1.3656667 -3.08777369  0.3564404 0.1683081
## c-a  1.5306364 -0.02706805  3.0883408 0.0559481
## d-a -0.6176429 -2.09373459  0.8584489 0.6920003
## c-b  2.8963030  1.20807683  4.5845292 0.0001303
## d-b  0.7480238 -0.86520632  2.3612539 0.6183736
## d-c -2.1482792 -3.58469904 -0.7118594 0.0010291
plot(tukey_test, las = 1 )
Based on the results above, the 95% confidence intervals for the pairs d-c and c-b do not contain zero. Moreover, the adjusted p-values for these two pairs (0.0010 and 0.0001, respectively) are both below 0.005. As a result, the null hypothesis of equal means can be rejected for these pairs at several significance levels, including α = 0.1, 0.05, 0.01, and 0.005.
H0: The means of the two species are equal. Hence only for the two pairs c-b and d-c can we conclude that the means are not equal; we cannot conclude this for the other pairs, at least at a significance level of α = 0.05.
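The rule used above (a pair differs when its interval excludes zero) can be sketched in a few lines. The data frame below is typed in by hand from the TukeyHSD() output rather than extracted from tukey_test, and the names tukey_ci and significant_pairs are illustrative assumptions.

```r
# Tukey intervals copied from the TukeyHSD() output above
tukey_ci <- data.frame(
  pair  = c("b-a", "c-a", "d-a", "c-b", "d-b", "d-c"),
  lwr   = c(-3.08777369, -0.02706805, -2.09373459,
            1.20807683, -0.86520632, -3.58469904),
  upr   = c(0.3564404, 3.0883408, 0.8584489,
            4.5845292, 2.3612539, -0.7118594),
  p_adj = c(0.1683081, 0.0559481, 0.6920003,
            0.0001303, 0.6183736, 0.0010291),
  stringsAsFactors = FALSE
)

# An interval excludes zero when both endpoints have the same sign
tukey_ci$excludes_zero <- tukey_ci$lwr > 0 | tukey_ci$upr < 0

# Pairs whose 95% interval excludes zero
significant_pairs <- tukey_ci$pair[tukey_ci$excludes_zero]
print(significant_pairs)
```

For a 95% family-wise interval, excluding zero corresponds to an adjusted p-value below 0.05, so both criteria should pick out the same pairs.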
Dry drilling is one of the processes for hydraulic drilling of rock. An experiment was conducted to determine whether the time (y, minutes) it takes to dry drill a certain distance in rock changes with the depth at which drilling begins (x, feet). Refer to the file 6414_HW4_Drill.csv, which can be found in the homework section of Canvas, to answer Questions 5-10. Use the significance level 0.05.
data1 <- read.csv("6414_HW4_Drill.csv")
# Create a scatter plot of drilling time against starting depth
ggplot(data1, aes(x = DEPTH, y = TIME)) +
  geom_point() +
  xlab("Depth (feet)") +
  ylab("Time (minutes)")
6. Construct and run a simple linear regression model using depth as the
predicting variable and time as the response variable. Submit your
solution and report the estimated coefficients.
model_lm <- lm(TIME ~ DEPTH, data = data1)
summary(model_lm)
##
## Call:
## lm(formula = TIME ~ DEPTH, data = data1)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3.5466 -0.4563  0.1022  0.5472  2.2607
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.789603   0.666312   7.188 3.13e-06 ***
## DEPTH       0.014388   0.002847   5.053 0.000143 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.432 on 15 degrees of freedom
## Multiple R-squared: 0.63, Adjusted R-squared: 0.6053
## F-statistic: 25.54 on 1 and 15 DF, p-value: 0.0001429
coefficients(model_lm)
## (Intercept)       DEPTH
##  4.78960280  0.01438785
The estimated intercept is 4.79 and the estimated coefficient for depth is 0.014, so the fitted model is TIME = 4.79 + 0.014 * DEPTH. For every 1-foot increase in depth, the predicted time increases by about 0.014 minutes.
The R-squared value for the model is 0.63, which means that 63% of the variability in time can be explained by the linear relationship with depth. Overall, the results suggest that this is a moderately strong linear regression model, with depth accounting for a significant proportion of the variability in time.
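To make the interpretation of the slope concrete, here is a minimal sketch that evaluates the fitted line at two depths. The function predict_time is a hypothetical helper, and the depths of 100 and 101 feet are illustrative values that are not taken from the data set.

```r
# Coefficients copied from the output of coefficients(model_lm)
intercept <- 4.78960280
slope     <- 0.01438785

# Hypothetical helper: predicted drilling time (minutes) at a depth (feet)
predict_time <- function(depth) {
  intercept + slope * depth
}

# Illustrative depths only; they are not taken from the data set
time_at_100 <- predict_time(100)
time_at_101 <- predict_time(101)

print(time_at_100)
# The change in predicted time for one extra foot equals the slope
print(time_at_101 - time_at_100)
```

With the model object available, predict(model_lm, newdata = data.frame(DEPTH = 100)) should give the same prediction.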
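# Extract the residuals from the fitted model
residuals <- model_lm$residuals
# Normal Q-Q plot of the residuals
qqnorm(residuals)
# Histogram of the residuals
hist(residuals)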
Together, these plots help us assess whether the error terms are approximately normally distributed. The normality plot compares the distribution of the residuals to a theoretical normal distribution, while the histogram provides a visual representation of the distribution of the residuals.
The points on the normality plot fall close to a straight line and the histogram is roughly bell-shaped, which is consistent with residuals that are approximately normally distributed.
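The comparison that a normality plot makes can be sketched by building its points by hand. The residuals below are simulated stand-ins, not the real model_lm$residuals. The sample size of 17 follows from the 15 residual degrees of freedom plus 2 estimated coefficients, while the seed and the standard deviation of 1.432 are assumptions made for illustration.

```r
set.seed(1)  # arbitrary seed, used only to make the sketch reproducible

# Simulated stand-in for model_lm$residuals (an assumption, not real data)
sim_residuals <- rnorm(17, mean = 0, sd = 1.432)

# Theoretical quantiles: standard normal quantiles at evenly spread probabilities
theoretical <- qnorm(ppoints(length(sim_residuals)))

# Sample quantiles: the residuals sorted from smallest to largest
sample_q <- sort(sim_residuals)

# qqnorm() computes the same theoretical quantiles for its plot
qq <- qqnorm(sim_residuals, plot.it = FALSE)

# A correlation close to 1 means the points lie near a straight line
qq_cor <- cor(theoretical, sample_q)
print(qq_cor)
```

Replacing sim_residuals with the actual residuals would give the coordinates of the points drawn by qqnorm(residuals).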
plot(model_lm, which = 1)
The plot does not show any clear pattern or trend, indicating that the
identical distribution (constant variance) assumption may hold. However,
there are a few outliers with large residuals, which may indicate that
the model does not fit well for those observations.
The normality plot and histogram of the residuals suggest that the errors are approximately normally distributed. The residual vs. fitted plot shows that there is no obvious pattern in the residuals, indicating that the identical distribution assumption is likely to hold.
Therefore, we can conclude that the simple linear regression model is a reasonable fit for the data, and that depth is a significant predictor of time at the 0.05 significance level (p-value = 0.000143).
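The significance of depth can be double-checked from the coefficient table alone. The sketch below recomputes the t statistic, its two-sided p-value, and a 95% confidence interval for the slope from the rounded values printed by summary(model_lm); the variable names are illustrative.

```r
# Values copied from the coefficient table printed by summary(model_lm)
slope_estimate <- 0.014388
slope_se       <- 0.002847
df_residual    <- 15

# t statistic and two-sided p-value for H0: slope = 0
t_value       <- slope_estimate / slope_se
slope_p_value <- 2 * pt(abs(t_value), df_residual, lower.tail = FALSE)

# 95% confidence interval for the slope
t_crit   <- qt(0.975, df_residual)
slope_ci <- slope_estimate + c(-1, 1) * t_crit * slope_se

print(round(t_value, 2))
print(slope_p_value < 0.05)
# TRUE here means the whole interval lies above zero
print(slope_ci[1] > 0)
```

With the model object available, confint(model_lm, "DEPTH", level = 0.95) should report essentially the same interval.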