Hw4_Regression

Refer to the file homework04data01.csv, which can be found in the homework section of Canvas, to answer Questions 1-4. This data set consists of antler-length data gathered by zoologists. They are interested in seeing if we can use antler length (in inches) to classify four species (a, b, c, or d for ease of notation).

Produce box plots, and based on those box plots, tell whether the antler lengths of the four species are different or not.

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 4.2.2

# Load the data from the CSV file
data <- read.csv("homework04data01.csv")

# Create a box plot for the antler length of each species
ggplot(data, aes(x = Species, y = Length)) +
  geom_boxplot() +
  xlab("Species") +
  ylab("Antler length (inches)")

The box plots show that the interquartile ranges and the median antler lengths differ across species a, b, c, and d. Species a and c have higher antler lengths, with higher upper and lower values, than species b and d. Species b has the lowest central value of the four species.

Use analysis of variance to test if there is a difference in mean antler length based on species. Use α = 0.10.

model <- aov(Length~Species, data = data)
summary(model)

##             Df Sum Sq Mean Sq F value   Pr(>F)    
## Species      3   90.0  30.001   8.121 8.58e-05 ***
## Residuals   81  299.2   3.694                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Since the p-value of 8.58e-05 is less than the significance level α = 0.10 (and also less than 0.05), we reject the null hypothesis that all four species have the same mean antler length. We conclude that the mean antler length differs for at least one species; the ANOVA alone does not tell us which species differ.

From the Analysis of Variance results, identify and report SSE, SSTr, MSE, MSTr.

As shown in the summary of the ANOVA model, the error sum of squares (SSE) is 299.2, the treatment sum of squares (SSTr) is 90.0, the mean square error (MSE) is 3.694, and the mean square for treatments (MSTr) is 30.001.
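The quantities in the ANOVA table are linked by simple arithmetic, which gives a quick way to double-check the reported values. The sketch below re-derives the mean squares, the F statistic, and the p-value from the sums of squares and degrees of freedom printed above. The variable names are illustrative only, and small discrepancies from the table are expected because the printed values are rounded.

```r
# Values copied from the printed ANOVA table (rounded as shown there)
ss_tr <- 90.0    # treatment (Species) sum of squares, SSTr
ss_e  <- 299.2   # error (Residuals) sum of squares, SSE
df_tr <- 3       # degrees of freedom for Species
df_e  <- 81      # degrees of freedom for Residuals

# Mean squares are sums of squares divided by their degrees of freedom
ms_tr <- ss_tr / df_tr
ms_e  <- ss_e / df_e

# F statistic and its upper-tail p-value under the F(df_tr, df_e) distribution
f_stat  <- ms_tr / ms_e
p_value <- pf(f_stat, df_tr, df_e, lower.tail = FALSE)

print(c(MSTr = ms_tr, MSE = ms_e, F = f_stat, p = p_value))
```

The recomputed values should agree with the table up to rounding, which confirms that MSTr = SSTr / 3, MSE = SSE / 81, and F = MSTr / MSE.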

Construct 95% confidence intervals for each pair of species by using Tukey multiple comparisons of means. Do any of the pairings imply a difference in average antler length between species?

tukey_test <- TukeyHSD(aov(Length ~ Species, data = data))

tukey_test

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = Length ~ Species, data = data)
## 
## $Species
##           diff         lwr        upr     p adj
## b-a -1.3656667 -3.08777369  0.3564404 0.1683081
## c-a  1.5306364 -0.02706805  3.0883408 0.0559481
## d-a -0.6176429 -2.09373459  0.8584489 0.6920003
## c-b  2.8963030  1.20807683  4.5845292 0.0001303
## d-b  0.7480238 -0.86520632  2.3612539 0.6183736
## d-c -2.1482792 -3.58469904 -0.7118594 0.0010291

plot(tukey_test, las = 1 )

Based on the 95% confidence intervals above, only the intervals for (d-c) and (c-b) exclude zero. Moreover, the adjusted p-values for these two pairs are about 0.001 or smaller (0.0010291 and 0.0001303, respectively). As a result, the null hypothesis of equal means can be rejected for these pairs at several levels of significance, including α = 0.1, 0.05, 0.01, and 0.005.

H0: the means of the two species in a pair are equal. Hence, only for the above two pairs can we say the means are not equal; for the other pairs, we cannot reject equal means at a significance level of α = 0.05.
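The rule used above (a pair differs when its interval excludes zero) can be applied mechanically. The sketch below does so on a small table whose numbers are copied from the printed Tukey output; the data frame and column names (`tukey_tbl`, `excludes_zero`) are illustrative and not part of the original analysis.

```r
# Tukey results copied from the printed output above
tukey_tbl <- data.frame(
  pair  = c("b-a", "c-a", "d-a", "c-b", "d-b", "d-c"),
  lwr   = c(-3.08777369, -0.02706805, -2.09373459,
             1.20807683, -0.86520632, -3.58469904),
  upr   = c( 0.3564404,  3.0883408,  0.8584489,
             4.5845292,  2.3612539, -0.7118594),
  p_adj = c(0.1683081, 0.0559481, 0.6920003,
            0.0001303, 0.6183736, 0.0010291),
  stringsAsFactors = FALSE
)

# A pair differs at the 95% family-wise level when its interval excludes zero
tukey_tbl$excludes_zero <- tukey_tbl$lwr > 0 | tukey_tbl$upr < 0

# Pairs whose intervals exclude zero
sig_pairs <- tukey_tbl$pair[tukey_tbl$excludes_zero]
print(sig_pairs)
```

With the fitted object available, the same check could be run on the matrix `tukey_test$Species`, which holds the `diff`, `lwr`, `upr`, and `p adj` columns.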

Dry drilling is one of the processes for hydraulic drilling of rock. An experiment was conducted to determine whether the time (y, minutes) it takes to dry drill a certain distance in rock changes with the depth at which drilling begins (x, feet). Refer to the file 6414_HW4_Drill.csv, which can be found in the homework section of Canvas, to answer Questions 5-10. Use the significance level 0.05.

Construct and submit a scatter plot of y versus x. Do you observe a relationship?

data1 <- read.csv("6414_HW4_Drill.csv")
# Create a scatter plot of drilling time versus starting depth
ggplot(data1, aes(x = DEPTH, y = TIME)) +
  geom_point() +
  xlab("Depth (feet)") +
  ylab("Time (minutes)")

Construct and run a simple linear regression model using depth as the predicting variable and time as the response variable. Submit your solution and report the estimated coefficients.

model_lm <- lm(TIME ~ DEPTH, data = data1)
summary(model_lm)

## 
## Call:
## lm(formula = TIME ~ DEPTH, data = data1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5466 -0.4563  0.1022  0.5472  2.2607 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4.789603   0.666312   7.188 3.13e-06 ***
## DEPTH       0.014388   0.002847   5.053 0.000143 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.432 on 15 degrees of freedom
## Multiple R-squared:   0.63,  Adjusted R-squared:  0.6053 
## F-statistic: 25.54 on 1 and 15 DF,  p-value: 0.0001429

coefficients(model_lm)

## (Intercept)       DEPTH 
##  4.78960280  0.01438785

The estimated intercept is 4.79 and the estimated slope is 0.0144, so the fitted model is TIME = 4.79 + 0.0144 * DEPTH. For every 1-foot increase in depth, the predicted drilling time increases by about 0.0144 minutes.
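To make the fitted equation concrete, the sketch below plugs a depth into it using the coefficients printed by `coefficients(model_lm)`. The helper `predict_time` and the depth of 100 feet are hypothetical choices for illustration. The output above does not show the range of depths in the data, so this illustrates the arithmetic rather than a validated prediction.

```r
# Coefficients copied from the printed output of coefficients(model_lm)
b0 <- 4.78960280   # intercept (minutes)
b1 <- 0.01438785   # slope (minutes per foot of depth)

# Hypothetical helper: predicted drilling time for a given starting depth
predict_time <- function(depth_ft) {
  b0 + b1 * depth_ft
}

# Predicted time at an illustrative depth of 100 feet
pred_100 <- predict_time(100)
print(pred_100)

# The change in predicted time per extra foot of depth equals the slope
print(predict_time(101) - predict_time(100))
```

With the fitted object available, `predict(model_lm, newdata = data.frame(DEPTH = 100))` would return the same point prediction.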

Give an interpretation. Is the predictor statistically significant? You can answer by using the p-value. Is this a strong simple linear regression model based on the significance?

Since the p-value for the slope in the linear regression is 0.000143, which is less than the significance level of 0.05, depth is a statistically significant predictor of time.

For every 1-foot increase in depth, the predicted time increases by about 0.0144 minutes. The R-squared value for the model is 0.63, which means that 63% of the variability in time can be explained by the linear relationship with depth. Overall, the results suggest that this is a moderately strong linear regression model, with depth accounting for a significant proportion of the variability in time.
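The significance figures quoted above can be cross-checked from the regression output itself. The sketch below recomputes the slope's t statistic, its two-sided p-value, and R-squared from the printed estimate, standard error, F statistic, and residual degrees of freedom. The variable names are illustrative, and the results match the printed output only up to rounding.

```r
# Values copied from the printed summary(model_lm) output
est      <- 0.014388   # slope estimate for DEPTH
se       <- 0.002847   # standard error of the slope
df_resid <- 15         # residual degrees of freedom
f_stat   <- 25.54      # overall F statistic (1 and 15 DF)

# t statistic for H0: slope = 0, and its two-sided p-value
t_value <- est / se
p_value <- 2 * pt(abs(t_value), df_resid, lower.tail = FALSE)

# With a single predictor, R-squared can be recovered from F:
# R^2 = F / (F + residual df)
r_squared <- f_stat / (f_stat + df_resid)

print(c(t = t_value, p = p_value, R2 = r_squared))
```

In simple linear regression the squared t statistic for the slope equals the overall F statistic, which is why the slope's p-value and the F-test p-value agree in the output.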

Generate the normality plot and the histogram in order to test the normal distribution assumption of the error terms.

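residuals <- model_lm$residuals
# Normal Q-Q plot of the residuals, with a reference line
qqnorm(residuals)
qqline(residuals)

# Histogram of the residuals
hist(residuals)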

Together, these plots help us assess whether the error terms are normally distributed. The normality plot compares the distribution of the residuals to a theoretical normal distribution, while the histogram provides a visual representation of the distribution of the residuals.

The points on the normality plot fall roughly along a straight line and the histogram is roughly bell-shaped, so the residuals appear to be approximately normally distributed.
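As a complement to the plots, a formal check such as the Shapiro-Wilk test (a technique not used in the original analysis) could be applied to the residuals. The drilling data file is not reproduced here, so the sketch below runs on synthetic data: the depths, the random seed, and the simulated errors are all assumptions. The intercept, slope, residual standard error, and sample size (15 residual degrees of freedom plus 2 coefficients) are borrowed from the regression output only to keep the scale similar. Its output says nothing about the real residuals.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

# Synthetic stand-in for the drilling data (NOT the real measurements)
n     <- 17
depth <- seq(0, 400, length.out = n)   # hypothetical depths in feet
time  <- 4.79 + 0.0144 * depth + rnorm(n, sd = 1.432)
sim   <- data.frame(DEPTH = depth, TIME = time)

# Fit the same kind of model as in the analysis
fit_sim <- lm(TIME ~ DEPTH, data = sim)

# Shapiro-Wilk test: H0 is that the residuals come from a normal distribution
sw <- shapiro.test(resid(fit_sim))
print(sw)
```

On the real fit, the analogous call would be `shapiro.test(resid(model_lm))`; a p-value above the chosen significance level would be consistent with the normality assumption, though with only 17 observations the test has limited power.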

Generate the residuals vs. fitted values graph and test the identical distribution assumption.

plot(model_lm, which = 1)

The plot does not show any clear pattern or trend, indicating that the identical distribution assumption may hold. However, there are a few outliers with large residuals, which may indicate that the model does not fit those observations well.

What is your overall conclusion, i.e., give a concise interpretation of your results.

We can conclude that there is a linear relationship between depth and time. The estimated regression equation is TIME = 4.79 + 0.0144 * DEPTH. The p-value for the slope coefficient is very small (0.000143), indicating that the predictor (depth) is statistically significant in predicting the response variable (time).

The normality plot and histogram of the residuals suggest that the errors are approximately normally distributed. The residual vs. fitted plot shows that there is no obvious pattern in the residuals, indicating that the identical distribution assumption is likely to hold.

Therefore, we can conclude that the simple linear regression model is a good fit for the data, and that depth is a significant predictor of time.

Abhilasha

2023-02-20