CEO salaries have drawn increasingly negative attention from shareholders and media outlets. The scope of this assignment is to identify what variables contribute to CEO salaries and bonuses. Each row is an observation of a corporate CEO and each variable is some type of descriptor for the CEO and/or firm.

Does it seem appropriate to combine salary and bonuses together as a single variable, “compensation”, or should these two be estimated separately?

In my approach I estimated each separately, as salary and bonus did not seem to be accurate predictors of one another. Adding the two estimates together would be the best way to predict total compensation.
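As a rough illustration of the "estimate separately, then add" approach, the sketch below fits one model per component and sums the predictions. The data frame `toy` and all of its values are simulated stand-ins, not the assignment's `ceo_salary` data, and the single-predictor formulas are deliberately simpler than the models reported here.

```r
# Simulated stand-in data: column names mirror the report, values are made up.
set.seed(1)
n <- 100
toy <- data.frame(
  Sales   = runif(n, 1, 50),
  Profits = runif(n, 0, 10)
)
toy$Salary <- 300 + 8 * toy$Sales + rnorm(n, sd = 40)
toy$Bonus  <- 50 + 20 * toy$Profits + rnorm(n, sd = 60)

# Fit one model per compensation component
salary_fit <- lm(Salary ~ Sales, data = toy)
bonus_fit  <- lm(Bonus ~ Profits, data = toy)

# Predicted total compensation is the sum of the two component predictions
toy$pred_compensation <- predict(salary_fit, toy) + predict(bonus_fit, toy)

# A quick check of how strongly the two components move together
cor(toy$Salary, toy$Bonus)
```

A low correlation between the two components is one informal sign that they respond to different drivers and are better modeled separately.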

What variable explains most of the variation in CEO compensation?

For Salary, the interaction between Sales and Profits and the interaction between Age and YearsCEO (standardized coefficient 2.74) have the most influence on the prediction. For Bonus, Compfor5Yrs has the largest standardized coefficient (0.26) and so explains Bonus the most.

## 
## Call:
## lm(formula = Salary ~ Age * YearsFirm + Age * YearsCEO + YearsFirm * 
##     YearsCEO + StGains + Compfor5Yrs + poly(Sales, 3, raw = TRUE) * 
##     Profits, data = ceo_salary)
## 
## Standardized Coefficients::
##                         (Intercept)                                 Age 
##                          0.00000000                          0.04579288 
##                           YearsFirm                            YearsCEO 
##                          1.43602782                         -1.66437129 
##                             StGains                         Compfor5Yrs 
##                         -0.23949606                          0.26978291 
##         poly(Sales, 3, raw = TRUE)1         poly(Sales, 3, raw = TRUE)2 
##                          1.88665860                         -4.84469295 
##         poly(Sales, 3, raw = TRUE)3                             Profits 
##                          5.22425860                          0.44663280 
##                       Age:YearsFirm                        Age:YearsCEO 
##                         -1.41876482                          2.73956436 
##                  YearsFirm:YearsCEO poly(Sales, 3, raw = TRUE)1:Profits 
##                         -0.79927243                         -1.36777581 
## poly(Sales, 3, raw = TRUE)2:Profits poly(Sales, 3, raw = TRUE)3:Profits 
##                          4.15508416                         -4.89859395
## 
## Call:
## lm(formula = Bonus ~ YearsFirm + Other + Compfor5Yrs + Profits + 
##     ReturnOver5Yrs, data = ceo_salary)
## 
## Standardized Coefficients::
##    (Intercept)      YearsFirm          Other    Compfor5Yrs        Profits 
##      0.0000000      0.1106899      0.1126141      0.2557700      0.1879157 
## ReturnOver5Yrs 
##      0.1634008
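For readers who want to see where standardized coefficients come from, here is a minimal sketch using the built-in `mtcars` data rather than `ceo_salary`, and a main-effects model rather than the models above. Each raw slope is rescaled by sd(x) / sd(y), which matches refitting the model on z-scored variables.

```r
# Main-effects model on a built-in dataset (illustration only)
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Raw slopes, dropping the intercept
b <- coef(fit)[-1]

# Rescale each slope by sd(predictor) / sd(response)
sx <- sapply(mtcars[, names(b)], sd)
sy <- sd(mtcars$mpg)
std_beta <- b * sx / sy
std_beta

# Cross-check: refit on z-scored variables and compare the slopes
z <- as.data.frame(scale(mtcars[, c("mpg", "wt", "hp")]))
fit_z <- lm(mpg ~ wt + hp, data = z)
all.equal(unname(std_beta), unname(coef(fit_z)[-1]))
```

The z-score cross-check only applies to main-effects models. With polynomial and interaction terms, as in the Salary model, the rescaled values are harder to compare with one another and magnitudes above 1 are common.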

What is the overall fit of your best fitting model(s)?

The Salary model accounts for about 60% of the variance in Salary, while the Bonus model explains only about 19% of the variance in Bonus.
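The share of variance explained can be computed by hand from the residuals. The sketch below uses the built-in `mtcars` data as a stand-in for `ceo_salary`. It also shows adjusted R-squared, which penalizes the number of terms and may be the fairer number for a model with as many terms as the Salary model.

```r
# Stand-in model on a built-in dataset (illustration only)
fit <- lm(mpg ~ wt + hp, data = mtcars)

# R-squared = 1 - residual sum of squares / total sum of squares
ss_res <- sum(resid(fit)^2)
ss_tot <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
r2 <- 1 - ss_res / ss_tot

# Adjusted R-squared penalizes the number of predictors p
n <- nobs(fit)
p <- length(coef(fit)) - 1
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Both should agree with what summary() reports
c(r2 = r2, adj_r2 = adj_r2)
c(summary(fit)$r.squared, summary(fit)$adj.r.squared)
```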

Based on the Normal Q-Q plots, both models over fit above roughly +1.5 standard deviations, the Bonus model far more so, which is consistent with its low explanatory power. The Salary model under fits slightly below -2 standard deviations.
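To look at the tails numerically rather than by eye, the quantile pairs behind a Normal Q-Q plot can be extracted directly. This is a sketch on the built-in `mtcars` data, not the CEO models. The 1.5 cutoff simply mirrors the threshold mentioned above.

```r
# Stand-in model on a built-in dataset (illustration only)
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Standardized residuals, so the axes are in standard-deviation units
z <- rstandard(fit)

# qqnorm() returns theoretical (x) and sample (y) quantiles; skip the plot here
qq <- qqnorm(z, plot.it = FALSE)

# Gap between sample and theoretical quantiles in the upper tail
upper <- qq$x > 1.5
data.frame(
  theoretical = qq$x[upper],
  sample      = qq$y[upper],
  gap         = (qq$y - qq$x)[upper]
)

# To draw the usual chart instead: qqnorm(z); qqline(z)
```

Large positive gaps in the upper tail mean the largest residuals are bigger than a normal distribution would predict.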

Heteroskedasticity exists in the dataset, but after transforming the independent variables with polynomials and interactions the data became more normalized. I also checked for autocorrelation. The first Durbin-Watson test below (p = 0.83) shows no evidence of it, while the second (p = 0.032) points to mild positive autocorrelation at the 5% level.

##  lag Autocorrelation D-W Statistic p-value
##    1    -0.002822324      1.972921    0.83
##  Alternative hypothesis: rho != 0
##  lag Autocorrelation D-W Statistic p-value
##    1       0.1158413      1.766219   0.032
##  Alternative hypothesis: rho != 0
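The lag-1 autocorrelation and the D-W statistic reported above are closely related, and both can be computed from the residuals alone. The sketch below uses simulated data (the variables `x` and `y` are made up) purely to show the arithmetic. It does not reproduce the p-values, which come from a separate resampling or distributional step.

```r
# Simulated regression with independent errors (illustration only)
set.seed(42)
x <- rnorm(200)
y <- 1 + 2 * x + rnorm(200)
fit <- lm(y ~ x)
e <- resid(fit)
n <- length(e)

# Durbin-Watson statistic: squared successive differences over squared residuals
dw <- sum(diff(e)^2) / sum(e^2)

# Lag-1 residual autocorrelation
r1 <- sum(e[-1] * e[-n]) / sum(e^2)

# Exact link between the two; the last term shrinks as n grows,
# which gives the familiar approximation dw ~ 2 * (1 - r1)
dw_from_r1 <- 2 * (1 - r1) - (e[1]^2 + e[n]^2) / sum(e^2)

c(dw = dw, r1 = r1, dw_from_r1 = dw_from_r1)
```

Values near 2 indicate little lag-1 autocorrelation, while values well below 2 indicate positive autocorrelation.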

How does R handle missing data when using it in a regression model? Does this seem appropriate for the situation?

Where a variable has missing data, R excludes that observation from the calculation entirely (by default, `lm` drops any row with a missing value in a model variable). As analysts we should not accept this silently and should correct the data or apply transformations to address the issue. In addition to the transformations already performed, we could try applying log functions to variables, or include dummy variables as supplemental variables in the model to flag where data is missing.
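The default behavior is easy to confirm on a small example. The data frame `d` below is made up for illustration. The second model sketches the dummy-variable idea: fill the gaps with the observed mean and add an indicator for the rows that were filled. This is only one possible approach and it can bias estimates, so treat it as a starting point rather than a recommendation.

```r
# Tiny made-up dataset with two missing predictor values
d <- data.frame(
  y = c(1, 2, 3, 4, 5, 6),
  x = c(1, NA, 3, 4, NA, 6.5)
)

# Default lm() behavior: rows with NA in a model variable are dropped
fit <- lm(y ~ x, data = d)
nrow(d)          # rows supplied
nobs(fit)        # rows actually used in the fit
na.action(fit)   # which rows were dropped

# Missing-indicator sketch: mean-fill x and flag the filled rows
d$x_missing <- as.integer(is.na(d$x))
d$x_filled  <- ifelse(is.na(d$x), mean(d$x, na.rm = TRUE), d$x)
fit2 <- lm(y ~ x_filled + x_missing, data = d)
nobs(fit2)       # all rows retained
```

Comparing `nobs(fit)` with `nrow(d)` is a quick way to see how many observations a model silently lost.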

What other unaccounted for ‘Z’ variables do you believe exist that influence CEO salaries?

I could imagine a prior-role variable mattering, since a COO, for example, would be more likely to be promoted to CEO. In addition, I would be interested in historical CEO data for each company, as a company may have a limit on CEO compensation packages. Other candidates include whether the company is international, whether it is publicly traded, and where it is located.