Combining the two seems appropriate because employees negotiating pay, CEOs included, will typically accept a lower salary in exchange for a higher bonus, and vice versa. Total compensation therefore captures the trade-off that either variable alone would miss.
For my model I chose the following variables: number of years at the firm, industry, compensation over five years, and the company's sales.
library(car)  # provides vif()
fullmod3 <- lm(compensation ~ YearsFirm + Industry + Compfor5Yrs + Sales, data = ceo_df2)
vif(fullmod3)
##                 GVIF Df GVIF^(1/(2*Df))
## YearsFirm   1.206710  1        1.098503
## Industry    2.065558 48        1.007585
## Compfor5Yrs 1.219638  1        1.104372
## Sales       1.510742  1        1.229123
Outside of a few outliers, the model looks linear. Industry has the highest GVIF (about 2.07 on 48 degrees of freedom), but its adjusted value GVIF^(1/(2*Df)) is only about 1.01, so it doesn't suggest multicollinearity; it is just something we need to keep an eye on.
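To make the adjustment in the last column concrete, here is a small sketch that recomputes GVIF^(1/(2*Df)) from the GVIF and Df values printed above. The vector names `gvif`, `df`, and `adj` are my own. Squaring the adjusted value is a commonly used way to put it on the scale of an ordinary VIF, so I include that as well.

```r
# GVIF and Df values copied from the vif() output above
gvif <- c(YearsFirm = 1.206710, Industry = 2.065558,
          Compfor5Yrs = 1.219638, Sales = 1.510742)
df <- c(YearsFirm = 1, Industry = 48, Compfor5Yrs = 1, Sales = 1)

# Degrees-of-freedom adjustment: GVIF^(1/(2*Df))
adj <- gvif^(1 / (2 * df))
round(adj, 6)

# Squared adjusted values, roughly comparable to an ordinary VIF
round(adj^2, 3)
```

For the one-degree-of-freedom terms the squared value is just the GVIF itself; only Industry, with 48 degrees of freedom, changes noticeably.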
As I stated in the question above, linearity looks fine apart from a few outliers, and the same applies to heteroskedasticity.
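A visual check can be complemented with a numeric one. The sketch below computes a studentized Breusch-Pagan statistic by hand in base R. It uses the built-in `mtcars` data as a stand-in because the CEO data are not reproduced here, so the model and the names `fit`, `aux`, and `bp_stat` are illustrative assumptions, not my actual results.

```r
# Stand-in data: built-in mtcars, NOT the CEO data set
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Auxiliary regression of squared residuals on the same predictors
sq_resid <- residuals(fit)^2
aux <- lm(sq_resid ~ wt + hp, data = mtcars)

# Studentized (Koenker) Breusch-Pagan statistic: n * R^2 of the auxiliary fit
bp_stat <- nobs(fit) * summary(aux)$r.squared
bp_df <- length(coef(aux)) - 1
p_value <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

c(statistic = bp_stat, df = bp_df, p.value = p_value)
```

A small p-value would point toward non-constant error variance; a large one is consistent with the residual plot looking fine.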
Based on the change in adjusted R-squared, it seems that the industry variable had the biggest impact.
The model accounts for about 39% of the variability in CEO compensation.
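One way to make this comparison is to refit the model with each term removed and record how far adjusted R-squared moves. The sketch below does that on the built-in `mtcars` data as a stand-in, since the CEO data are not reproduced here. The formula and the names `full`, `terms_to_drop`, and `change` are assumptions for illustration only.

```r
# Stand-in data: built-in mtcars, NOT the CEO data set
full <- lm(mpg ~ wt + hp + factor(cyl) + qsec, data = mtcars)
adj_full <- summary(full)$adj.r.squared

terms_to_drop <- c("wt", "hp", "factor(cyl)", "qsec")

# Refit without each term and record the change in adjusted R-squared
change <- sapply(terms_to_drop, function(term) {
  reduced <- update(full, as.formula(paste(". ~ . -", term)))
  summary(reduced)$adj.r.squared - adj_full
})

# Most negative change = term whose removal hurts the fit the most
sort(change)
```

Applied to the CEO model, the same loop would use the terms YearsFirm, Industry, Compfor5Yrs, and Sales.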
It seems that R handles missing data by simply recording an NA for all the variables, and no, this does not seem appropriate.
Other variables I would find important are the number of employees, valuation or stock price, and other financial information such as debt, capital investments, etc.
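How missing values are treated depends on the function and its `na.action` setting, so it is worth checking directly. The sketch below uses a small made-up data frame (`toy` is hypothetical, not the CEO data) and assumes R's default option `na.action = na.omit`. It shows how many rows `lm()` actually uses and how `na.exclude` carries the missing rows through as NA.

```r
# Made-up data with one missing response and one missing predictor value
toy <- data.frame(
  y = c(10, 12, NA, 15, 18, 21),
  x = c(1, 2, 3, NA, 5, 6)
)

fit <- lm(y ~ x, data = toy)

nrow(toy)        # rows in the data
nobs(fit)        # rows actually used in the fit
na.action(fit)   # which rows were dropped

# na.exclude keeps the dropped rows as NA in residuals and fitted values
fit_ex <- lm(y ~ x, data = toy, na.action = na.exclude)
residuals(fit_ex)
```

Comparing `nrow()` with `nobs()` on the real model would show how many CEOs are lost to missing values.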