library(tidyr)
## Warning: package 'tidyr' was built under R version 4.4.2
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.4.2
## Warning: package 'ggplot2' was built under R version 4.4.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ purrr     1.0.2
## ✔ forcats   1.0.0     ✔ readr     2.1.5
## ✔ ggplot2   3.5.1     ✔ stringr   1.5.1
## ✔ lubridate 1.9.3     ✔ tibble    3.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(caret)
## Warning: package 'caret' was built under R version 4.4.2
## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift
library(dplyr)
library(readxl)
## Warning: package 'readxl' was built under R version 4.4.2
setwd("C:\\Users\\srini\\OneDrive\\Desktop\\Regression Analysis\\HW4")
getwd()
## [1] "C:/Users/srini/OneDrive/Desktop/Regression Analysis/HW4"

Part 2:

Question 4:

antler_data=read.csv("hw4-s25-antlers.csv")
ggplot(antler_data, aes(x = Species, y = Length)) +
  geom_boxplot() +
  xlab("Species") +
  ylab("Antler length (inches)")

The box plot shows the distribution of antler lengths across four species. Species c has the highest median antler length, while species b has the lowest. Species a and d have similar medians, falling between the two. Variability differs among species, with species c showing the widest range and species b the narrowest. While some overlap exists among species a, b, and d, species c stands out with consistently longer antlers. The whiskers indicate variation, but no extreme outliers are evident. These patterns suggest notable differences in antler lengths among species.

Question 5:

aov_model=aov(Length~Species, data = antler_data)
summary(aov_model)
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## Species      3   90.0  30.001   8.121 8.58e-05 ***
## Residuals   81  299.2   3.694                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANOVA results indicate a significant difference in mean antler lengths among species. The F-value of 8.121 measures the ratio of variance between species to variance within species. The p-value (8.58 × 10⁻⁵) is much smaller than 0.05, leading to the rejection of the null hypothesis that all species have the same mean antler length. This suggests that at least one species differs significantly from the others in terms of antler length.

The F-critical value for 𝛼=0.05 is 2.717. Since the observed F-value (8.121) is greater than this critical value, we reject the null hypothesis, confirming that at least one species has a significantly different mean antler length.
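The critical value and p-value quoted above can be recomputed from the F distribution. The following is a minimal sketch that assumes only the degrees of freedom (3 and 81) and the observed F-value from the ANOVA table; the variable names are illustrative and do not come from the original analysis.

```r
# Degrees of freedom and observed F taken from the ANOVA table above
df_treatment <- 3
df_error     <- 81
f_observed   <- 8.121

# Upper 5% critical value of F(3, 81); should agree with the reported 2.717
f_critical <- qf(0.95, df1 = df_treatment, df2 = df_error)

# Upper-tail probability of the observed F, i.e. the ANOVA p-value
p_value <- pf(f_observed, df1 = df_treatment, df2 = df_error, lower.tail = FALSE)

# TRUE means the null hypothesis of equal means is rejected at alpha = 0.05
reject_null <- f_observed > f_critical

print(c(f_critical = f_critical, p_value = p_value))
print(reject_null)
```

Comparing the observed F to the critical value and comparing the p-value to α are equivalent decisions, so both checks should lead to the same conclusion.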

Question 6:

SSE (Error Sum of Squares) : 299.2

SSTr (Treatment Sum of Squares) : 90

MSE (Mean Square Error) : 3.694

MSTr (Mean Square for Treatments) : 30.001
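As a quick check, the mean squares follow from dividing each sum of squares by its degrees of freedom, and the F-statistic is their ratio. The sketch below uses the rounded values printed in the ANOVA table, so the results can differ from the table in the last digit; the variable names are illustrative.

```r
# Sums of squares and degrees of freedom from the ANOVA table (rounded)
ss_treatment <- 90.0
ss_error     <- 299.2
df_treatment <- 3
df_error     <- 81

# Mean square = sum of squares / degrees of freedom
ms_treatment <- ss_treatment / df_treatment
ms_error     <- ss_error / df_error

# F-statistic = MSTr / MSE
f_statistic <- ms_treatment / ms_error

print(c(MSTr = ms_treatment, MSE = ms_error, F = f_statistic))
```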

Question 7:

# Reuse the model fitted in Question 5 instead of refitting it
tukey_test=TukeyHSD(aov_model)

tukey_test
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = Length ~ Species, data = antler_data)
## 
## $Species
##           diff         lwr        upr     p adj
## b-a -1.3656667 -3.08777369  0.3564404 0.1683081
## c-a  1.5306364 -0.02706805  3.0883408 0.0559481
## d-a -0.6176429 -2.09373459  0.8584489 0.6920003
## c-b  2.8963030  1.20807683  4.5845292 0.0001303
## d-b  0.7480238 -0.86520632  2.3612539 0.6183736
## d-c -2.1482792 -3.58469904 -0.7118594 0.0010291
plot(tukey_test, las = 1 )

Based on the Tukey HSD results, the 95% family-wise confidence intervals for c-b and d-c do not contain zero. Their adjusted p-values (about 0.0001 and 0.0010) are both below 0.005. As a result, the null hypothesis of equal means can be rejected for these two pairs at several significance levels, including α = 0.1, 0.05, 0.01, and 0.005.

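However, the intervals for b-a, c-a, d-a, and d-b include zero, indicating no significant difference in mean antler length for these species pairs at the 5% family-wise level. The c-a comparison is borderline (adjusted p = 0.056): its interval barely includes zero, and it would be significant at α = 0.1.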
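One way to read the Tukey table systematically is to flag every pair whose interval excludes zero. The sketch below re-enters the printed bounds and adjusted p-values by hand (rounded to four decimals), so it runs without the data file; `tukey_table` and its column names are illustrative and are not objects from the original analysis.

```r
# Interval bounds and adjusted p-values copied from the TukeyHSD output above
tukey_table <- data.frame(
  pair  = c("b-a", "c-a", "d-a", "c-b", "d-b", "d-c"),
  lwr   = c(-3.0878, -0.0271, -2.0937, 1.2081, -0.8652, -3.5847),
  upr   = c(0.3564, 3.0883, 0.8584, 4.5845, 2.3613, -0.7119),
  p_adj = c(0.1683, 0.0559, 0.6920, 0.0001, 0.6184, 0.0010),
  stringsAsFactors = FALSE
)

# An interval excludes zero when both bounds share the same sign
tukey_table$excludes_zero <- tukey_table$lwr > 0 | tukey_table$upr < 0

# Pairs that differ significantly at the 95% family-wise level
significant_pairs <- tukey_table$pair[tukey_table$excludes_zero]
print(significant_pairs)
```

The interval check and the rule `p_adj < 0.05` should flag the same pairs, since both come from the same 95% family-wise procedure.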

Part 3:

Question 8:

salary_data=read.csv("hw4-s25-Salary-MLR.csv")
salary_lm=lm(Salary~YearsExperience, data=salary_data)
summary(salary_lm)
## 
## Call:
## lm(formula = Salary ~ YearsExperience, data = salary_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -10940  -6074  -1575   5665  19526 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      20522.4     3171.3   6.471 4.41e-07 ***
## YearsExperience  10549.5      503.6  20.948  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8146 on 29 degrees of freedom
## Multiple R-squared:  0.938,  Adjusted R-squared:  0.9359 
## F-statistic: 438.8 on 1 and 29 DF,  p-value: < 2.2e-16

Estimated Coefficients:

  1. β0 : 20522.4

  2. β1 : 10549.5

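Estimated Variance (σ2) : 8146² ≈ 6.64 × 10⁷ (8146 is the residual standard error, the estimate of σ, so the variance estimate is its square)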

R squared (R2) : 0.938 Adjusted R squared : 0.9359
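`summary()` on an `lm` fit reports the residual standard error, which estimates σ rather than σ², so the variance estimate is its square. The sketch below illustrates this on R's built-in `cars` data, used here only as a stand-in because the salary file is not bundled with this write-up, and then squares the value reported above.

```r
# Stand-in example on the built-in `cars` data (NOT the salary data)
fit <- lm(dist ~ speed, data = cars)

# sigma() returns the residual standard error, i.e. the estimate of sigma
residual_se <- sigma(fit)

# The variance estimate is the square of the residual standard error ...
variance_hat <- residual_se^2

# ... which equals SSE divided by the residual degrees of freedom
variance_manual <- sum(resid(fit)^2) / df.residual(fit)
print(all.equal(variance_hat, variance_manual))

# Applying the same step to the residual standard error reported for salary_lm
salary_variance_hat <- 8146^2
print(salary_variance_hat)
```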

Question 9:

salary_mlm=lm(Salary~., data=salary_data)
summary(salary_mlm)
## 
## Call:
## lm(formula = Salary ~ ., data = salary_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -9966  -4996  -2296   3956  23107 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      17847.9     3787.0   4.713 6.08e-05 ***
## YearsExperience   9779.1      787.9  12.412 6.69e-13 ***
## Rating            1452.0     1149.8   1.263    0.217    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8064 on 28 degrees of freedom
## Multiple R-squared:  0.9413, Adjusted R-squared:  0.9372 
## F-statistic: 224.7 on 2 and 28 DF,  p-value: < 2.2e-16

Estimated Coefficients:

  1. β0 : 17847.9

  2. β1 : 9779.1

  3. β2 : 1452.0

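Estimated Variance (σ2) : 8064² ≈ 6.50 × 10⁷ (8064 is the residual standard error, the estimate of σ, so the variance estimate is its square)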

R squared (R2) : 0.9413 Adjusted R squared : 0.9372

Question 10:

The multiple regression model provides only a slight improvement over the simple regression model in explaining salary variation. R² increases from 0.938 to 0.9413, and adjusted R² rises from 0.9359 to 0.9372, indicating a marginally better fit. The residual standard error decreases from 8146 to 8064, a small reduction in unexplained variation. The coefficient for YearsExperience decreases from 10,549.5 to 9,779.1 when Rating is included as an additional predictor. Rating itself has a positive coefficient of 1,452.0, but it is not statistically significant (p = 0.217), so the data do not give clear evidence that Rating affects salary once experience is accounted for. Overall, YearsExperience remains the dominant predictor, with Rating contributing at most a minor additional effect.
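Adjusted R² penalizes R² for the number of predictors, which is why it rises less than R² when Rating is added. The sketch below recomputes it from the reported R² values, assuming n = 31 observations (inferred from the 29 residual degrees of freedom of the simple model plus its 2 coefficients); `adjusted_r2` is an illustrative helper, not a function from any package.

```r
# Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1),
# where p is the number of predictors (excluding the intercept)
adjusted_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

n_obs <- 31  # assumption: 29 residual df + 2 coefficients in the simple model

adj_simple   <- adjusted_r2(0.938,  n = n_obs, p = 1)  # Salary ~ YearsExperience
adj_multiple <- adjusted_r2(0.9413, n = n_obs, p = 2)  # Salary ~ YearsExperience + Rating

print(c(simple = adj_simple, multiple = adj_multiple))
```

Because the inputs are the rounded R² values from the summaries, the results can differ from the reported adjusted R² in the fourth decimal place.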