Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?

Load dataset

df <- read.csv('https://raw.githubusercontent.com/jforster19/Data605/main/Salary_Data.csv')
df <- df |> mutate(gender_num=if_else(str_trim(Gender)=='Male',1,0))
glimpse(df)
## Rows: 375
## Columns: 7
## $ Age                 <int> 32, 28, 45, 36, 52, 29, 42, 31, 26, 38, 29, 48, 35…
## $ Gender              <chr> "Male", "Female", "Male", "Female", "Male", "Male"…
## $ Education.Level     <chr> "Bachelor's", "Master's", "PhD", "Bachelor's", "Ma…
## $ Job.Title           <chr> "Software Engineer", "Data Analyst", "Senior Manag…
## $ Years.of.Experience <dbl> 5, 3, 15, 7, 20, 2, 12, 4, 1, 10, 3, 18, 6, 14, 2,…
## $ Salary              <int> 90000, 65000, 150000, 60000, 200000, 55000, 120000…
## $ gender_num          <dbl> 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0,…

Review Summary Statistics

summary(df)
##       Age           Gender          Education.Level     Job.Title        
##  Min.   :23.00   Length:375         Length:375         Length:375        
##  1st Qu.:31.00   Class :character   Class :character   Class :character  
##  Median :36.00   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :37.43                                                           
##  3rd Qu.:44.00                                                           
##  Max.   :53.00                                                           
##  NA's   :2                                                               
##  Years.of.Experience     Salary         gender_num    
##  Min.   : 0.00       Min.   :   350   Min.   :0.0000  
##  1st Qu.: 4.00       1st Qu.: 55000   1st Qu.:0.0000  
##  Median : 9.00       Median : 95000   Median :1.0000  
##  Mean   :10.03       Mean   :100577   Mean   :0.5173  
##  3rd Qu.:15.00       3rd Qu.:140000   3rd Qu.:1.0000  
##  Max.   :25.00       Max.   :250000   Max.   :1.0000  
##  NA's   :2           NA's   :2

Assessing the distribution of the predictor

hist(df$Salary)

Review relationship of dependent and independent variables available

num_df <- df |> select(c(Age,Years.of.Experience,Salary,gender_num)) |> filter(!is.na(Salary))
#summary(num_df)
kdepairs(num_df)
## Warning in par(usr): argument 1 does not name a graphical parameter

## Warning in par(usr): argument 1 does not name a graphical parameter

## Warning in par(usr): argument 1 does not name a graphical parameter

## Warning in par(usr): argument 1 does not name a graphical parameter

The correlation plot does not work with NA values and therefor they were excluded to compare the relationships between the points. The plots indicate that age and years of experience are very highly correlated to one another and salary, which can be explored further to see if any independence assumptions are violated. It would make sense that in general years of experience could be a proxy measure to guess an employee’s age.

Building a MLR Model

salary_lm <- lm(Salary~Years.of.Experience^2+Age+gender_num*Years.of.Experience,data=num_df)
summary(salary_lm)
## 
## Call:
## lm(formula = Salary ~ Years.of.Experience^2 + Age + gender_num * 
##     Years.of.Experience, data = num_df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -58472  -6889    476  10726  73296 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    -30692.0    16917.6  -1.814 0.070460 .  
## Years.of.Experience              4198.8      706.4   5.944 6.45e-09 ***
## Age                              2279.6      625.9   3.642 0.000309 ***
## gender_num                       3025.9     3259.9   0.928 0.353904    
## Years.of.Experience:gender_num    430.6      274.0   1.571 0.116963    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17130 on 368 degrees of freedom
## Multiple R-squared:  0.8752, Adjusted R-squared:  0.8738 
## F-statistic: 645.1 on 4 and 368 DF,  p-value: < 2.2e-16

The F statistic’s p-value indicates that the model provides a better fit than the just the intercept itself. Both of the p-values for the years of experience and age are significant, while it appears that gender and the interaction term may not be significant. The R-squared as well as the adjusted term indicate that the model predicates about 87% of the variance of the dependent variable

Coefficient Interpretation: For each additional unit (year) of age there is an approximate 2279.6 increase in the expected salary. For each additional unit (year) of experience the squared term has a 4198.8 increase in the expected salary Given the one hot encoding for males vs females, there is a 3025.9 increase for salary for males There is a 430.6 increase in salary for each unit of increase in the interaction term (gender x years of experience)

Diagnostic Plots

par(mfrow=c(2,2))
plot(salary_lm)

The standard diagnostic plots indicate that potentially one of the normality assumptions is violated. The Fitted Residuals do not appear to have any clear pattern and seem to be somewhat evenly scattered around zero; however, the normal qq plot has a very large deviation from the projected normal line and seems to indicate a violation that the residuals be approximately normal.

Variance Inflation Factor: Assessing Multicollinearity

vif(salary_lm)
## there are higher-order terms (interactions) in this model
## consider setting type = 'predictor'; see ?vif
##            Years.of.Experience                            Age 
##                      27.180601                      24.803066 
##                     gender_num Years.of.Experience:gender_num 
##                       3.369806                       4.663457

As expected the VIF factor for both years of experience and age exceed the acceptable level of collinearity to not violate the assumptions needed to conduct a linear regression model which makes it unreliable to assess these coefficients.

Alternative MLR Model

salary_lm2 <- lm(Salary~Years.of.Experience^2+Years.of.Experience*gender_num,data=num_df)
summary(salary_lm2)
## 
## Call:
## lm(formula = Salary ~ Years.of.Experience^2 + Years.of.Experience * 
##     gender_num, data = num_df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -65254  -7617    395   9736  74543 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     30302.2     2437.5  12.431   <2e-16 ***
## Years.of.Experience              6663.4      205.9  32.370   <2e-16 ***
## gender_num                       3337.3     3312.5   1.007    0.314    
## Years.of.Experience:gender_num    325.2      277.0   1.174    0.241    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17420 on 369 degrees of freedom
## Multiple R-squared:  0.8707, Adjusted R-squared:  0.8696 
## F-statistic: 828.2 on 3 and 369 DF,  p-value: < 2.2e-16

This model may be slightly less statistically significant given the p-values; however, the R squared and adjusted R-Squared seem to be around the same value as other iterations of this model. The R-Squared Adjusted remains around 87% and while only

Coefficient Interpretation: The intercept seems to make a bit more sense in this case as a base salary is positive consistent with the underlying data at 30,302.2 should someone have zero years of experience For each additional unit (year) of experience the squared term has a 6664.3 increase in the expected salary Given the one hot encoding for males vs females, there is a 3337.3 increase for salary for males There is a 325.2 increase in salary for each unit of increase in the interaction term (gender x years of experience)

vif(salary_lm2)
## there are higher-order terms (interactions) in this model
## consider setting type = 'predictor'; see ?vif
##            Years.of.Experience                     gender_num 
##                       2.234103                       3.367487 
## Years.of.Experience:gender_num 
##                       4.611432

The values in the VIF calculation still seem to indicate some collinearity, but all of the terms are under the threshold of 5.

Simpler MLR

salary_lm3 <- lm(Salary~Years.of.Experience+gender_num,data=num_df)
summary(salary_lm3)
## 
## Call:
## lm(formula = Salary ~ Years.of.Experience + gender_num, data = num_df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -66150  -8522    967   9624  75409 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          28504.1     1897.2  15.024  < 2e-16 ***
## Years.of.Experience   6843.1      137.8  49.662  < 2e-16 ***
## gender_num            6598.0     1806.0   3.653 0.000296 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17430 on 370 degrees of freedom
## Multiple R-squared:  0.8702, Adjusted R-squared:  0.8695 
## F-statistic:  1240 on 2 and 370 DF,  p-value: < 2.2e-16

Diagnostic Plots

par(mfrow=c(2,2))
plot(salary_lm3)

Perhaps more transformations (boxcox or power transform) on the input are necessary given that the residuals similar to prior version of the models do not appear to approximate a normal curve

A more standard linear model appears to be more appropriate and squaring any of the independent variables in this case does not make sense. There needs to be some non-linear relationship in the data to warrant a quadratic/polynomial term in the model and perhaps in those cases a linear prediction is not the best solution in trying to make predictions, but of course it depends on the use case and what someone is trying to accomplish.