library(tidyr)
## Warning: package 'tidyr' was built under R version 4.4.2
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.4.2
## Warning: package 'ggplot2' was built under R version 4.4.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ purrr 1.0.2
## ✔ forcats 1.0.0 ✔ readr 2.1.5
## ✔ ggplot2 3.5.1 ✔ stringr 1.5.1
## ✔ lubridate 1.9.3 ✔ tibble 3.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(caret)
## Warning: package 'caret' was built under R version 4.4.2
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
library(dplyr)
library(readxl)
## Warning: package 'readxl' was built under R version 4.4.2
setwd("C:\\Users\\srini\\OneDrive\\Desktop\\Regression Analysis\\HW4")
getwd()
## [1] "C:/Users/srini/OneDrive/Desktop/Regression Analysis/HW4"
antler_data=read.csv("hw4-s25-antlers.csv")
ggplot(antler_data, aes(x = Species, y = Length)) +
  geom_boxplot() +
  xlab("Species") +
  ylab("Antler length (inches)")
The box plot shows the distribution of antler lengths across four species. Species c has the highest median antler length, while species b has the lowest. Species a and d have similar medians, falling between the two. Variability differs among species, with species c showing the widest range and species b the narrowest. While some overlap exists among species a, b, and d, species c stands out with consistently longer antlers. The whiskers indicate variation, but no extreme outliers are evident. These patterns suggest notable differences in antler lengths among species.
aov_model=aov(Length~Species, data = antler_data)
summary(aov_model)
## Df Sum Sq Mean Sq F value Pr(>F)
## Species 3 90.0 30.001 8.121 8.58e-05 ***
## Residuals 81 299.2 3.694
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ANOVA results indicate a significant difference in mean antler lengths among species. The F-value of 8.121 measures the ratio of variance between species to variance within species. The p-value (8.58 × 10⁻⁵) is much smaller than 0.05, leading to the rejection of the null hypothesis that all species have the same mean antler length. This suggests that at least one species differs significantly from the others in terms of antler length.
The F-critical value for 𝛼=0.05 is 2.717. Since the observed F-value (8.121) is greater than this critical value, we reject the null hypothesis, confirming that at least one species has a significantly different mean antler length.
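The critical value and p-value quoted above can be reproduced from the F distribution in base R. This is a minimal sketch that copies the degrees of freedom (3 and 81) and the observed F statistic (8.121) from the ANOVA table; the variable names are illustrative and not part of the original analysis.

```r
# Degrees of freedom and observed F statistic, copied from the ANOVA table
df_treatment <- 3
df_error <- 81
f_observed <- 8.121
alpha <- 0.05

# Upper-tail critical value of the F distribution at level alpha
f_critical <- qf(1 - alpha, df_treatment, df_error)

# p-value: probability of an F statistic at least as large as the observed one
p_value <- pf(f_observed, df_treatment, df_error, lower.tail = FALSE)

# Reject H0 when the observed F exceeds the critical value
reject_h0 <- f_observed > f_critical

print(c(f_critical = f_critical, p_value = p_value))
print(reject_h0)
```

Comparing F to the critical value and comparing the p-value to α are equivalent decision rules, so both checks lead to the same conclusion.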
SSE (Error Sum of Squares) : 299.2
SSTr (Treatment Sum of Squares) : 90
MSE (Mean Square Error) : 3.694
MSTr (Mean Square for Treatments) : 30.001
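These four quantities are linked: each mean square is its sum of squares divided by its degrees of freedom, and the F statistic is the ratio of the two mean squares. A short check, assuming the rounded values printed in the ANOVA table (so the results agree with the table only up to rounding):

```r
# Sums of squares and degrees of freedom, copied from the ANOVA table
sstr <- 90.0        # treatment (Species) sum of squares
sse <- 299.2        # error (Residuals) sum of squares
df_treatment <- 3   # number of species minus 1
df_error <- 81      # number of observations minus number of species

# Mean squares are sums of squares divided by their degrees of freedom
mstr <- sstr / df_treatment
mse <- sse / df_error

# The F statistic is the ratio of the two mean squares
f_stat <- mstr / mse

print(c(MSTr = mstr, MSE = mse, F = f_stat))
```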
tukey_test=TukeyHSD(aov(Length ~ Species, data = antler_data))
tukey_test
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = Length ~ Species, data = antler_data)
##
## $Species
## diff lwr upr p adj
## b-a -1.3656667 -3.08777369 0.3564404 0.1683081
## c-a 1.5306364 -0.02706805 3.0883408 0.0559481
## d-a -0.6176429 -2.09373459 0.8584489 0.6920003
## c-b 2.8963030 1.20807683 4.5845292 0.0001303
## d-b 0.7480238 -0.86520632 2.3612539 0.6183736
## d-c -2.1482792 -3.58469904 -0.7118594 0.0010291
plot(tukey_test, las = 1 )
Based on the 95% family-wise confidence intervals above, only the c-b and d-c comparisons exclude zero. Their adjusted p-values are 0.0001303 and 0.0010291, so the null hypothesis of equal means is rejected for both pairs at α = 0.1, 0.05, 0.01, and 0.005.
The intervals for b-a, c-a, d-a, and d-b all include zero, indicating no significant difference in mean antler length for these species pairs at the 5% level. The c-a comparison is borderline (p adj = 0.056): it would be significant at α = 0.1 but not at α = 0.05.
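One way to read the Tukey table mechanically is to flag each pair whose interval excludes zero. The sketch below re-enters the printed values by hand into a data frame; the column names are illustrative. With the fitted object available, `tukey_test$Species` holds the same numbers as a matrix.

```r
# Values copied from the TukeyHSD output above
tukey <- data.frame(
  pair  = c("b-a", "c-a", "d-a", "c-b", "d-b", "d-c"),
  lwr   = c(-3.08777369, -0.02706805, -2.09373459,
             1.20807683, -0.86520632, -3.58469904),
  upr   = c(0.3564404, 3.0883408, 0.8584489,
            4.5845292, 2.3612539, -0.7118594),
  p_adj = c(0.1683081, 0.0559481, 0.6920003,
            0.0001303, 0.6183736, 0.0010291)
)

# An interval excludes zero when both endpoints have the same sign
tukey$excludes_zero <- tukey$lwr > 0 | tukey$upr < 0

# Equivalent check at the 5% family-wise level using the adjusted p-value
tukey$significant <- tukey$p_adj < 0.05

# Pairs with a significant difference in mean antler length
sig_pairs <- tukey$pair[tukey$significant]
print(tukey)
print(sig_pairs)
```

The two flags agree for every row, since a 95% family-wise interval excludes zero exactly when the adjusted p-value is below 0.05.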
salary_data=read.csv("hw4-s25-Salary-MLR.csv")
salary_lm=lm(Salary~YearsExperience, data=salary_data)
summary(salary_lm)
##
## Call:
## lm(formula = Salary ~ YearsExperience, data = salary_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10940 -6074 -1575 5665 19526
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20522.4 3171.3 6.471 4.41e-07 ***
## YearsExperience 10549.5 503.6 20.948 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8146 on 29 degrees of freedom
## Multiple R-squared: 0.938, Adjusted R-squared: 0.9359
## F-statistic: 438.8 on 1 and 29 DF, p-value: < 2.2e-16
Estimated Coefficients:
β0 : 20522.4
β1 : 10549.5
Residual standard error (σ) : 8146
Estimated Variance (σ2) : 8146² ≈ 6.64 × 10⁷
R squared (R2) : 0.938 Adjusted R squared : 0.9359
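The summary output reports the residual standard error, which is the square root of the estimated error variance. A small sketch of the conversion, assuming the rounded value 8146 and the 29 residual degrees of freedom printed above; on the fitted model itself, `sigma(salary_lm)^2` would give the unrounded figure.

```r
# Residual standard error and residual degrees of freedom from summary(salary_lm)
rse <- 8146
df_resid <- 29

# The estimated error variance is the square of the residual standard error
sigma2_hat <- rse^2

# Multiplying by the residual degrees of freedom recovers the residual sum of squares
sse <- sigma2_hat * df_resid

print(c(sigma2_hat = sigma2_hat, sse = sse))
```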
salary_mlm=lm(Salary~., data=salary_data)
summary(salary_mlm)
##
## Call:
## lm(formula = Salary ~ ., data = salary_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9966 -4996 -2296 3956 23107
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17847.9 3787.0 4.713 6.08e-05 ***
## YearsExperience 9779.1 787.9 12.412 6.69e-13 ***
## Rating 1452.0 1149.8 1.263 0.217
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8064 on 28 degrees of freedom
## Multiple R-squared: 0.9413, Adjusted R-squared: 0.9372
## F-statistic: 224.7 on 2 and 28 DF, p-value: < 2.2e-16
Estimated Coefficients:
β0 : 17847.9
β1 : 9779.1
β2 : 1452.0
Residual standard error (σ) : 8064
Estimated Variance (σ2) : 8064² ≈ 6.50 × 10⁷
R squared (R2) : 0.9413 Adjusted R squared : 0.9372
The multiple regression model provides only a slight improvement over the simple regression model in explaining salary variation. R² increases from 0.938 to 0.9413 and adjusted R² from 0.9359 to 0.9372, a marginally better fit. The residual standard error decreases from 8146 to 8064, a small reduction in unexplained variation. The coefficient for YearsExperience decreases from 10,549.5 to 9,779.1 once Rating is included. Rating has a coefficient of 1,452.0, but its p-value (0.217) is well above 0.05, so its positive effect is not statistically distinguishable from zero given YearsExperience. Overall, YearsExperience remains the dominant predictor, and Rating adds little explanatory power.
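The comparison can be checked numerically from the two summaries alone. The sketch below assumes n = 31 observations (29 residual degrees of freedom plus 2 coefficients in the simple model) and uses the rounded R² and t values printed above, so results match the output only up to rounding. The helper `adj_r2` is an illustrative name, not a base R function.

```r
# Sample size implied by the simple model: 29 residual df + 2 coefficients
n <- 31

# Adjusted R-squared penalises R-squared for the number of predictors p
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

adj_simple <- adj_r2(0.938, n, p = 1)     # Salary ~ YearsExperience
adj_multiple <- adj_r2(0.9413, n, p = 2)  # Salary ~ YearsExperience + Rating

# For a single added predictor, the partial F statistic equals its squared t value
t_rating <- 1.263
f_partial <- t_rating^2
p_partial <- pf(f_partial, 1, 28, lower.tail = FALSE)

print(c(adj_simple = adj_simple, adj_multiple = adj_multiple))
print(c(f_partial = f_partial, p_partial = p_partial))
```

With both fitted models in hand, `anova(salary_lm, salary_mlm)` would carry out the same partial F test directly.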