Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?
df <- read.csv('https://raw.githubusercontent.com/jforster19/Data605/main/Salary_Data.csv')
df <- df |> mutate(gender_num=if_else(str_trim(Gender)=='Male',1,0))
glimpse(df)
## Rows: 375
## Columns: 7
## $ Age <int> 32, 28, 45, 36, 52, 29, 42, 31, 26, 38, 29, 48, 35…
## $ Gender <chr> "Male", "Female", "Male", "Female", "Male", "Male"…
## $ Education.Level <chr> "Bachelor's", "Master's", "PhD", "Bachelor's", "Ma…
## $ Job.Title <chr> "Software Engineer", "Data Analyst", "Senior Manag…
## $ Years.of.Experience <dbl> 5, 3, 15, 7, 20, 2, 12, 4, 1, 10, 3, 18, 6, 14, 2,…
## $ Salary <int> 90000, 65000, 150000, 60000, 200000, 55000, 120000…
## $ gender_num <dbl> 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0,…
summary(df)
## Age Gender Education.Level Job.Title
## Min. :23.00 Length:375 Length:375 Length:375
## 1st Qu.:31.00 Class :character Class :character Class :character
## Median :36.00 Mode :character Mode :character Mode :character
## Mean :37.43
## 3rd Qu.:44.00
## Max. :53.00
## NA's :2
## Years.of.Experience Salary gender_num
## Min. : 0.00 Min. : 350 Min. :0.0000
## 1st Qu.: 4.00 1st Qu.: 55000 1st Qu.:0.0000
## Median : 9.00 Median : 95000 Median :1.0000
## Mean :10.03 Mean :100577 Mean :0.5173
## 3rd Qu.:15.00 3rd Qu.:140000 3rd Qu.:1.0000
## Max. :25.00 Max. :250000 Max. :1.0000
## NA's :2 NA's :2
hist(df$Salary)
num_df <- df |> select(c(Age,Years.of.Experience,Salary,gender_num)) |> filter(!is.na(Salary))
#summary(num_df)
kdepairs(num_df)
## Warning in par(usr): argument 1 does not name a graphical parameter
## Warning in par(usr): argument 1 does not name a graphical parameter
## Warning in par(usr): argument 1 does not name a graphical parameter
## Warning in par(usr): argument 1 does not name a graphical parameter
The correlation plot does not work with NA values and therefor they were excluded to compare the relationships between the points. The plots indicate that age and years of experience are very highly correlated to one another and salary, which can be explored further to see if any independence assumptions are violated. It would make sense that in general years of experience could be a proxy measure to guess an employee’s age.
salary_lm <- lm(Salary~Years.of.Experience^2+Age+gender_num*Years.of.Experience,data=num_df)
summary(salary_lm)
##
## Call:
## lm(formula = Salary ~ Years.of.Experience^2 + Age + gender_num *
## Years.of.Experience, data = num_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -58472 -6889 476 10726 73296
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -30692.0 16917.6 -1.814 0.070460 .
## Years.of.Experience 4198.8 706.4 5.944 6.45e-09 ***
## Age 2279.6 625.9 3.642 0.000309 ***
## gender_num 3025.9 3259.9 0.928 0.353904
## Years.of.Experience:gender_num 430.6 274.0 1.571 0.116963
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17130 on 368 degrees of freedom
## Multiple R-squared: 0.8752, Adjusted R-squared: 0.8738
## F-statistic: 645.1 on 4 and 368 DF, p-value: < 2.2e-16
The F statistic’s p-value indicates that the model provides a better fit than the just the intercept itself. Both of the p-values for the years of experience and age are significant, while it appears that gender and the interaction term may not be significant. The R-squared as well as the adjusted term indicate that the model predicates about 87% of the variance of the dependent variable
Coefficient Interpretation: For each additional unit (year) of age there is an approximate 2279.6 increase in the expected salary. For each additional unit (year) of experience the squared term has a 4198.8 increase in the expected salary Given the one hot encoding for males vs females, there is a 3025.9 increase for salary for males There is a 430.6 increase in salary for each unit of increase in the interaction term (gender x years of experience)
par(mfrow=c(2,2))
plot(salary_lm)
The standard diagnostic plots indicate that potentially one of the normality assumptions is violated. The Fitted Residuals do not appear to have any clear pattern and seem to be somewhat evenly scattered around zero; however, the normal qq plot has a very large deviation from the projected normal line and seems to indicate a violation that the residuals be approximately normal.
vif(salary_lm)
## there are higher-order terms (interactions) in this model
## consider setting type = 'predictor'; see ?vif
## Years.of.Experience Age
## 27.180601 24.803066
## gender_num Years.of.Experience:gender_num
## 3.369806 4.663457
As expected the VIF factor for both years of experience and age exceed the acceptable level of collinearity to not violate the assumptions needed to conduct a linear regression model which makes it unreliable to assess these coefficients.
salary_lm2 <- lm(Salary~Years.of.Experience^2+Years.of.Experience*gender_num,data=num_df)
summary(salary_lm2)
##
## Call:
## lm(formula = Salary ~ Years.of.Experience^2 + Years.of.Experience *
## gender_num, data = num_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -65254 -7617 395 9736 74543
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30302.2 2437.5 12.431 <2e-16 ***
## Years.of.Experience 6663.4 205.9 32.370 <2e-16 ***
## gender_num 3337.3 3312.5 1.007 0.314
## Years.of.Experience:gender_num 325.2 277.0 1.174 0.241
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17420 on 369 degrees of freedom
## Multiple R-squared: 0.8707, Adjusted R-squared: 0.8696
## F-statistic: 828.2 on 3 and 369 DF, p-value: < 2.2e-16
This model may be slightly less statistically significant given the p-values; however, the R squared and adjusted R-Squared seem to be around the same value as other iterations of this model. The R-Squared Adjusted remains around 87% and while only
Coefficient Interpretation: The intercept seems to make a bit more sense in this case as a base salary is positive consistent with the underlying data at 30,302.2 should someone have zero years of experience For each additional unit (year) of experience the squared term has a 6664.3 increase in the expected salary Given the one hot encoding for males vs females, there is a 3337.3 increase for salary for males There is a 325.2 increase in salary for each unit of increase in the interaction term (gender x years of experience)
vif(salary_lm2)
## there are higher-order terms (interactions) in this model
## consider setting type = 'predictor'; see ?vif
## Years.of.Experience gender_num
## 2.234103 3.367487
## Years.of.Experience:gender_num
## 4.611432
The values in the VIF calculation still seem to indicate some collinearity, but all of the terms are under the threshold of 5.
salary_lm3 <- lm(Salary~Years.of.Experience+gender_num,data=num_df)
summary(salary_lm3)
##
## Call:
## lm(formula = Salary ~ Years.of.Experience + gender_num, data = num_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -66150 -8522 967 9624 75409
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28504.1 1897.2 15.024 < 2e-16 ***
## Years.of.Experience 6843.1 137.8 49.662 < 2e-16 ***
## gender_num 6598.0 1806.0 3.653 0.000296 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17430 on 370 degrees of freedom
## Multiple R-squared: 0.8702, Adjusted R-squared: 0.8695
## F-statistic: 1240 on 2 and 370 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(salary_lm3)
Perhaps more transformations (boxcox or power transform) on the input are necessary given that the residuals similar to prior version of the models do not appear to approximate a normal curve
A more standard linear model appears to be more appropriate and squaring any of the independent variables in this case does not make sense. There needs to be some non-linear relationship in the data to warrant a quadratic/polynomial term in the model and perhaps in those cases a linear prediction is not the best solution in trying to make predictions, but of course it depends on the use case and what someone is trying to accomplish.