#1. Descriptive Analytics

1.1 Prepare and display a histogram for the response variable of interest: salary

library(carData)
data("Salaries")
hist(Salaries$salary) 

1.2 Prepare a qqplot with a qqline line for salary

qqnorm(Salaries$salary)
qqline(Salaries$salary)

1.3 Answer briefly: based on your review of the data, is salary somewhat normally distributed? Why or why not?

I would say that salary is not normally distributed. The histogram shows a clear right (positive) skew rather than a symmetric, bell-shaped distribution.
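
As a rough numeric check on the skew described above, a small base-R sketch (nothing beyond base R assumed) comparing the mean to the median and computing a moment-based skewness:

x <- Salaries$salary
c(mean = mean(x), median = median(x))   # mean well above the median points to a right skew
mean((x - mean(x))^3) / sd(x)^3         # moment-based sample skewness; positive = right-skewed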

1.4 Using the ggpairs(){GGally} function, display a correlation chart with all variables

library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(ggplot2)
ggpairs(Salaries)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

1.5 Answer briefly: based on your review of the data, does there appear to be a salary gender gap? Why or why not?

Yes, there is a salary gender gap between males and females. Males have a higher median salary as well as a higher third-quartile (Q3) salary. The maximum salary among females is also significantly lower than the maximum among males, which can be seen in the gap between the female maximum and the male maximum and outliers.
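
A small sketch to isolate the salary-by-sex comparison described above (this re-creates the relevant ggpairs panel with base R only):

boxplot(salary ~ sex, data = Salaries, ylab = "salary")   # side-by-side salary distributions by sex
tapply(Salaries$salary, Salaries$sex, summary)            # medians, quartiles and maxima by sex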

1.6 Answer briefly: are there any other observations from the correlation chart worth noting? Please elaborate.

Yes. Two positive correlations stand out: salary is correlated at about 0.41 with yrs.since.phd, and it is also positively correlated with yrs.service. Since yrs.since.phd and yrs.service are themselves strongly correlated, there may be some multicollinearity between them, and the scatterplot of salary against yrs.since.phd shows signs of heteroskedasticity. Overall, this implies that faculty salary levels are positively correlated with experience.
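
A quick sketch of the pairwise correlations among just the numeric variables, to back up the values read off the ggpairs chart:

round(cor(Salaries[, c("yrs.since.phd", "yrs.service", "salary")]), 2)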

#2. Small OLS Regression Model

2.1 Fit an OLS regression model that predicts salaries on sex (only) and store the results in an object named “fit.sexonly”. Then use the summary() function to view the results of this model.

fit.sexonly= lm(salary ~ sex, data = Salaries)
summary(fit.sexonly)
## 
## Call:
## lm(formula = salary ~ sex, data = Salaries)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -57290 -23502  -6828  19710 116455 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   101002       4809  21.001  < 2e-16 ***
## sexMale        14088       5065   2.782  0.00567 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 30030 on 395 degrees of freedom
## Multiple R-squared:  0.01921,    Adjusted R-squared:  0.01673 
## F-statistic: 7.738 on 1 and 395 DF,  p-value: 0.005667

2.2 Based on this result, does this model support the argument for gender salary inequality? Why or why not?

The model does support the argument for gender salary inequality: the p-value for sexMale is 0.00567, which indicates a statistically significant difference in salary between the sexes. Unfortunately, it doesn’t tell the whole story. The Adjusted R-squared is only 0.0167, meaning sex by itself explains less than 2% of the variance in salary.
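
A small sketch, using base R’s confint(), to put a confidence interval around the estimated male-female difference:

confint(fit.sexonly, "sexMale")   # 95% CI for the sexMale coefficient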

#3. Large OLS Regression Model

3.1 Then fit an OLS regression model that predicts salaries on ALL remaining variables and store the results in a model named “fit.all”. Use the summary() function to display the results.

fit.all= lm(salary~., data = Salaries)
summary(fit.all)
## 
## Call:
## lm(formula = salary ~ ., data = Salaries)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -65248 -13211  -1775  10384  99592 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    65955.2     4588.6  14.374  < 2e-16 ***
## rankAssocProf  12907.6     4145.3   3.114  0.00198 ** 
## rankProf       45066.0     4237.5  10.635  < 2e-16 ***
## disciplineB    14417.6     2342.9   6.154 1.88e-09 ***
## yrs.since.phd    535.1      241.0   2.220  0.02698 *  
## yrs.service     -489.5      211.9  -2.310  0.02143 *  
## sexMale         4783.5     3858.7   1.240  0.21584    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22540 on 390 degrees of freedom
## Multiple R-squared:  0.4547, Adjusted R-squared:  0.4463 
## F-statistic:  54.2 on 6 and 390 DF,  p-value: < 2.2e-16

3.2 Does this larger model support the salary inequality hypothesis? Why or why not? Why do you think that results changed when more variables were added to the model?

The larger model shows that the other predictors (rank, discipline, yrs.since.phd and yrs.service) matter more in determining salary than sex does. Sex is no longer significant (p ≈ 0.22), and the sexMale t-value is closer to zero than those of the other variables, meaning it has the weakest relationship with salary once the other predictors are controlled for. The results changed because some of the salary differences attributed to sex in the small model are explained by rank, discipline and experience in the larger model.
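
A short sketch comparing the sexMale coefficient across the two models, with its confidence interval in the full model:

c(sexonly = coef(fit.sexonly)["sexMale"], all = coef(fit.all)["sexMale"])   # the estimate shrinks
confint(fit.all, "sexMale")                                                 # interval now includes zero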

#4. Analysis of Variance

4.1 Conduct an ANOVA test (use the anova() function) to compare the reduced model (fit.sexonly) against the full model (fit.all).

anova(fit.sexonly, fit.all)
## Analysis of Variance Table
## 
## Model 1: salary ~ sex
## Model 2: salary ~ rank + discipline + yrs.since.phd + yrs.service + sex
##   Res.Df        RSS Df Sum of Sq      F    Pr(>F)    
## 1    395 3.5632e+11                                  
## 2    390 1.9812e+11  5 1.582e+11 62.286 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

4.2 According to the results of the ANOVA test, which of the two models has more predictive power? Briefly explain why. According to this analysis, what is your conclusion about whether there is a gender pay gap (note: this is not the end of this saga - we will be specifying other models later, which will yield different results.)

I would say that fit.all has more predictive power. The fit.all model has more predictors, and the ANOVA p-value (< 2.2e-16) shows a highly significant improvement in fit; the large F-statistic likewise indicates a strong relationship with salary. According to this analysis, the case for a gender pay gap weakens: once rank, discipline and experience are in the model, sex is no longer a significant predictor.

#5. Test a Couple of OLS Assumptions

5.1 We will discuss multicollinearity in depth in a few days. Because multicollinearity is a most important problem in predictive modeling, let’s get a little exposure to it. Obtain the Variance Inflation Factors (VIF’s) for each of the variables in the full model. To do this, first load the library {car} which contains the vif() function. After you load the car library enter vif(fit.all). Each variable will have a VIF score associated with it.

library(car)
vif(fit.all)
##                   GVIF Df GVIF^(1/(2*Df))
## rank          2.013193  2        1.191163
## discipline    1.064105  1        1.031555
## yrs.since.phd 7.518936  1        2.742068
## yrs.service   5.923038  1        2.433729
## sex           1.030805  1        1.015285

5.2 Is there a problem with multicollinearity? Why or why not? Look at the column labeled GVIF. We will discuss this in more depth soon, but for now, any GVIF>5 raises a red flag, but is tolerable. Anything above 10 is severe.

The GVIF values for yrs.since.phd and yrs.service are both above 5, and yrs.since.phd is fairly close to 10, so there are some issues with multicollinearity between those two variables.
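
As a quick probe (a sketch, beyond what the assignment asks), dropping one of the two collinear experience variables and re-checking the VIFs shows whether the flag goes away:

fit.noservice <- update(fit.all, . ~ . - yrs.service)   # hypothetical refit without yrs.service
vif(fit.noservice)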

5.3 Conduct a Breusch-Pagan test for Heteroskedasticity

library(lmtest)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
bptest(fit.all)
## 
##  studentized Breusch-Pagan test
## 
## data:  fit.all
## BP = 65.055, df = 6, p-value = 4.205e-12

5.4 Display the residual plot. Note: the plot() function with lm() objects displays 4 key linear model plots. In this case, we only need the first plot, which is the residual vs. fitted values plot. To display the first plot only, enter this command plot(fit.all, which=1).

plot(fit.all, which = 1)

Heteroskedasticity? Why or why not? In your answer, please refer to both the BP test and the residual plot.

Heteroskedasticity is present according to both the residual plot and the BP test. In the residual plot, the spread of the residuals around the fitted line starts out narrow and then fans out as the fitted values increase. The p-value of the BP test (4.2e-12) is far below .05, so we reject the null hypothesis of homoskedasticity.
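
As a supplementary diagnostic (a sketch; the scale-location plot is the third plot produced by plot() on an lm object), the fanning-out of the residuals is even easier to see here:

plot(fit.all, which = 3)   # scale-location plot: a rising trend indicates heteroskedasticity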

#6. Weighted Least Squares (WLS) Regression.

6.1 Regardless of your heteroskedasticity tests above, fit a WLS model using residuals from the full model (fit.all). Store this new model in an object named fit.all.wls. Tip: fitting a WLS model with residuals is a two-step process. The first step is to run an OLS model (fit.all) and extract the residuals, which you already did above. So all you have to do now is the second step, which is to fit the WLS model, using 1/fit.all$residuals^2 as the weights. After you fit the model, display the summary() results of your WLS model.

fit.all.wls = lm(salary~.,data = Salaries, weights = 1/fit.all$residuals^2)
summary(fit.all.wls)
## 
## Call:
## lm(formula = salary ~ ., data = Salaries, weights = 1/fit.all$residuals^2)
## 
## Weighted Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.2985 -1.0024 -0.8546  0.9960  1.3857 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   64897.52     710.12  91.390   <2e-16 ***
## rankAssocProf 13372.32     350.33  38.171   <2e-16 ***
## rankProf      45544.09     474.68  95.947   <2e-16 ***
## disciplineB   14293.17     182.62  78.269   <2e-16 ***
## yrs.since.phd   529.08      26.90  19.666   <2e-16 ***
## yrs.service    -503.16      37.82 -13.303   <2e-16 ***
## sexMale        5974.91     691.68   8.638   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.001 on 390 degrees of freedom
## Multiple R-squared:  0.9977, Adjusted R-squared:  0.9977 
## F-statistic: 2.846e+04 on 6 and 390 DF,  p-value: < 2.2e-16

6.2 Respond briefly: based on your WLS results, is there empirical evidence of gender salary inequality? Why or why not?

Yes, based on the model’s p-values and Adjusted R-squared. The Adjusted R-squared is 0.9977, which means this weighted model explains nearly all of the (weighted) variance in salary. The overall p-value is far below .05, the predictors are all highly significant, and the sexMale coefficient in particular is now highly significant (p < 2e-16).

6.3 Briefly comment on the differences between the OLS and WLS models. Which one do you believe? Why and why not?

The key difference between the OLS and WLS models is the weighting: WLS weights each observation by the inverse of its squared OLS residual, so the most extreme residuals (the outliers) get the smallest weights and pull the fitted line around much less. This brings the fit more in line with the bulk of the data points and makes the weighted residuals much better behaved. I believe the WLS results more, because down-weighting the outliers that were distorting the OLS model should give more accurate predictions.
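
A small sketch placing the OLS and WLS coefficient estimates side by side, to make the differences described above concrete:

round(cbind(OLS = coef(fit.all), WLS = coef(fit.all.wls)), 1)   # compare estimates term by term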

#7. Transformations - Categorical Data

7.1 Fit a regression model using the Cars93{MASS} data set to predict Price as a function of Type, MPG.city, AirBags and Origin. Then, display the summary() results of this model. Notice that R automatically converts Factor variables with categorical data to binary variables. For example, the “AirBags” factor variable was converted into 2 binary variables, one for each type of AirBag, with “Driver & Passenger” left out as the reference variable.

library(MASS)
cars93.fit = lm(Price ~ Type+MPG.city+AirBags+Origin, data = Cars93)
summary(cars93.fit)
## 
## Call:
## lm(formula = Price ~ Type + MPG.city + AirBags + Origin, data = Cars93)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.177  -3.853  -1.176   2.865  28.119 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         38.6020     4.7806   8.075 4.62e-12 ***
## TypeLarge            3.0755     2.7739   1.109 0.270739    
## TypeMidsize          5.1573     2.1830   2.362 0.020496 *  
## TypeSmall           -0.2819     2.5978  -0.109 0.913856    
## TypeSporty           0.3151     2.3294   0.135 0.892722    
## TypeVan             -0.8718     2.9036  -0.300 0.764744    
## MPG.city            -0.7957     0.1912  -4.162 7.68e-05 ***
## AirBagsDriver only  -4.3447     1.9076  -2.278 0.025322 *  
## AirBagsNone         -8.9089     2.2844  -3.900 0.000195 ***
## Originnon-USA        5.1411     1.4387   3.573 0.000590 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.302 on 83 degrees of freedom
## Multiple R-squared:  0.616,  Adjusted R-squared:  0.5744 
## F-statistic: 14.79 on 9 and 83 DF,  p-value: 5.166e-14

7.2 Provide a brief interpretation of the coefficient values and significance for the AirBagsDriver only and AirBagsNone predictors. Please remember to comment on the sign of the effect.

The t-values for both air-bag coefficients are sufficiently far from zero and their p-values are below .05, which suggests that both dummy variables have a relationship with price. The negative signs mean the effect is negative: relative to cars with driver and passenger air bags (the reference level), price drops by about $4,345 for driver-only air bags and by about $8,909 for cars with no air bags.
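
A quick sketch to confirm which air-bag category is the reference level (the first factor level is the one absorbed into the intercept):

levels(Cars93$AirBags)   # first level = reference category
table(Cars93$AirBags)    # how many cars fall in each category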

7.3 Relevel and Compare. Now, suppose that you want to compare prices of cars with air bags to those without air bags. To do this, please re-level the AirBags factor variable so that the reference level is changed to “None”. Then fit the regression model again after re-leveling AirBags. Display the summary() results of this model.

Cars93$AirBags <- relevel(Cars93$AirBags, ref = "None")
summary(lm(Price ~ Type+MPG.city+AirBags+Origin, data = Cars93))
## 
## Call:
## lm(formula = Price ~ Type + MPG.city + AirBags + Origin, data = Cars93)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.177  -3.853  -1.176   2.865  28.119 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                29.6931     4.7225   6.288 1.43e-08 ***
## TypeLarge                   3.0755     2.7739   1.109 0.270739    
## TypeMidsize                 5.1573     2.1830   2.362 0.020496 *  
## TypeSmall                  -0.2819     2.5978  -0.109 0.913856    
## TypeSporty                  0.3151     2.3294   0.135 0.892722    
## TypeVan                    -0.8718     2.9036  -0.300 0.764744    
## MPG.city                   -0.7957     0.1912  -4.162 7.68e-05 ***
## AirBagsDriver & Passenger   8.9089     2.2844   3.900 0.000195 ***
## AirBagsDriver only          4.5643     1.6720   2.730 0.007735 ** 
## Originnon-USA               5.1411     1.4387   3.573 0.000590 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.302 on 83 degrees of freedom
## Multiple R-squared:  0.616,  Adjusted R-squared:  0.5744 
## F-statistic: 14.79 on 9 and 83 DF,  p-value: 5.166e-14

7.4 Inspect the coefficients in the two models (before and after re-leveling) and answer briefly: What is the difference in interpretation of the “AirBagsDriver only” coefficient between the 2 models? Explain any change in coefficient values, sign and significance.

After re-leveling, the air-bag variables still show a relationship with price, but because the reference level is now “None”, each coefficient is interpreted as the increase in price from adding air bags: about $4,564 for driver-only air bags and about $8,909 for driver and passenger air bags, relative to a car with none. The signs flip from negative to positive because the direction of the comparison is reversed, while the p-values remain significant, which reaffirms the relationship between air bags and price.
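
A quick arithmetic check (a sketch using only the coefficients reported above): after re-leveling, each air-bag coefficient should equal the original coefficient minus the original AirBagsNone coefficient.

(-4.3447) - (-8.9089)   # = 4.5642, matching AirBagsDriver only in the re-leveled model
0 - (-8.9089)           # = 8.9089, matching AirBagsDriver & Passenger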

#8. Polynomial Transformation

8.1 Using the Salaries{car} data set, fit a linear model to predict salary as a function of rank and yrs.since.phd. Store the linear model in an object named “fit.linear”. Display the summary() results.

fit.linear = lm(salary ~ rank+yrs.since.phd, data = Salaries)
summary(fit.linear)
## 
## Call:
## lm(formula = salary ~ rank + yrs.since.phd, data = Salaries)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -67148 -16683  -1323  11835 105552 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   81186.23    2964.17  27.389  < 2e-16 ***
## rankAssocProf 13932.18    4345.76   3.206  0.00146 ** 
## rankProf      47860.42    4412.63  10.846  < 2e-16 ***
## yrs.since.phd   -80.37     129.47  -0.621  0.53510    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23650 on 393 degrees of freedom
## Multiple R-squared:  0.3948, Adjusted R-squared:  0.3902 
## F-statistic: 85.47 on 3 and 393 DF,  p-value: < 2.2e-16

8.2 Answer briefly: What are the best predictors of faculty salaries? Why?

According to the model, the best predictor of faculty salaries is rank. Both rank dummy variables (rankAssocProf and rankProf) have t-values far from zero, which suggests a real relationship, and their p-values are significant to highly significant, unlike yrs.since.phd.

8.3 Who makes higher salaries, Assistant Professors, Associate Professors or Professors? How much more, on average?

Professors make more than the other ranks. The intercept of roughly $81K is the baseline salary for the reference level, Assistant Professors; Associate Professors earn about $14K more than that on average, and full Professors earn about $48K more than Assistant Professors (roughly $34K more than Associate Professors).
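
A small sketch computing those rank gaps directly from the fitted coefficients:

coef(fit.linear)["rankProf"]                                       # Prof vs. AsstProf (the reference level)
coef(fit.linear)["rankProf"] - coef(fit.linear)["rankAssocProf"]   # Prof vs. AssocProf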

8.4 Does the number of years since obtaining a PhD make a difference in the salary? Why or why not?

No, it does not. The p-value for yrs.since.phd is 0.54, which indicates that it does not have a significant impact on salaries in this model, and its t-value is near zero, indicating little to no relationship once rank is accounted for.

8.5 Then fit a polynomial model of power 4. Leave rank alone (since it is a Factor variable). Apply the polynomial transformation only to the yrs.since.phd variable using the poly() function. Store the resulting model in an object named “fit.poly”. Display the summary() results.

fit.poly=lm(salary~rank+poly(yrs.since.phd, 4), data=Salaries) 
summary(fit.poly)
## 
## Call:
## lm(formula = salary ~ rank + poly(yrs.since.phd, 4), data = Salaries)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -54287 -16094  -1994  11395 103566 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                80921       5974  13.545  < 2e-16 ***
## rankAssocProf              15277       6409   2.384   0.0176 *  
## rankProf                   45255       7420   6.099 2.57e-09 ***
## poly(yrs.since.phd, 4)1    -2133      44221  -0.048   0.9615    
## poly(yrs.since.phd, 4)2   -44716      37340  -1.198   0.2318    
## poly(yrs.since.phd, 4)3   -55759      28253  -1.974   0.0491 *  
## poly(yrs.since.phd, 4)4    23288      24699   0.943   0.3463    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23470 on 390 degrees of freedom
## Multiple R-squared:  0.4086, Adjusted R-squared:  0.3995 
## F-statistic: 44.91 on 6 and 390 DF,  p-value: < 2.2e-16

8.6 Conduct an ANOVA test to evaluate if the polynomial model has more predictive power than the linear model.

anova(fit.linear, fit.poly)
## Analysis of Variance Table
## 
## Model 1: salary ~ rank + yrs.since.phd
## Model 2: salary ~ rank + poly(yrs.since.phd, 4)
##   Res.Df        RSS Df  Sum of Sq      F  Pr(>F)  
## 1    393 2.1985e+11                               
## 2    390 2.1485e+11  3 4999068143 3.0247 0.02953 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

8.7 Does the polynomial model have more predictive power than the linear model? Why or why not?

The polynomial model fits slightly better: the ANOVA comparison is significant at p ≈ 0.03, driven mainly by the third-degree term, which has a p-value of about 0.049 and a t-value sufficiently far from zero. However, I would not say it will necessarily predict better, because a fourth-degree polynomial is prone to overfitting and may not hold up once new data are used.

8.8 Based on these polynomial regression results, how would you interpret the effect of yrs.since.phd? (As we discussed in class, polynomials of power > 2 are very hard to interpret, but give it a try.) In your answer, please look at the coefficients of all the different polynomial terms and provide a general interpretation.

Looking at the polynomial regression, the first three polynomial coefficients are negative, which suggests that, once rank is controlled for, additional years since the PhD tend to push salary down. I read it as: the more years that have passed since the PhD, the more it tends to depress a faculty member’s salary.
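
Because a fourth-degree polynomial is hard to read from the coefficients alone, a short sketch (assuming we hold rank fixed at “Prof”) that plots the fitted salary curve against yrs.since.phd can make the shape clearer:

grid <- data.frame(rank = factor("Prof", levels = levels(Salaries$rank)),
                   yrs.since.phd = seq(min(Salaries$yrs.since.phd),
                                       max(Salaries$yrs.since.phd), length.out = 100))
grid$pred <- predict(fit.poly, newdata = grid)                    # fitted polynomial curve
plot(grid$yrs.since.phd, grid$pred, type = "l",
     xlab = "Years since PhD", ylab = "Predicted salary (Prof)")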

8.9 There is a well-known phenomenon in academics called “salary compression” in which newly minted PhD’s command higher salaries in the market than older professors. Take a look at the coefficient values and significance levels of both the rank and the polynomial terms and discuss whether you see evidence of salary compression or not. Please briefly explain your rationale.

Looking at both regression models, there is definitely some evidence of salary compression: holding rank constant, salary tends to fall as years since the PhD accumulate (the negative polynomial terms), with only the fourth-degree term suggesting an eventual upturn in salary.

#9. Standardized Coefficients

9.1 Using the Cars93{MASS} data set, fit a model to predict a car’s price as a function of the car’s type, city miles per gallon, air bags and origin. Store the results in an object named fit.unstd and display the summary results for this linear model object.

fit.unstd = lm(Price ~ Type+MPG.city+AirBags+Origin, data = Cars93)
summary(fit.unstd)
## 
## Call:
## lm(formula = Price ~ Type + MPG.city + AirBags + Origin, data = Cars93)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.177  -3.853  -1.176   2.865  28.119 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                29.6931     4.7225   6.288 1.43e-08 ***
## TypeLarge                   3.0755     2.7739   1.109 0.270739    
## TypeMidsize                 5.1573     2.1830   2.362 0.020496 *  
## TypeSmall                  -0.2819     2.5978  -0.109 0.913856    
## TypeSporty                  0.3151     2.3294   0.135 0.892722    
## TypeVan                    -0.8718     2.9036  -0.300 0.764744    
## MPG.city                   -0.7957     0.1912  -4.162 7.68e-05 ***
## AirBagsDriver & Passenger   8.9089     2.2844   3.900 0.000195 ***
## AirBagsDriver only          4.5643     1.6720   2.730 0.007735 ** 
## Originnon-USA               5.1411     1.4387   3.573 0.000590 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.302 on 83 degrees of freedom
## Multiple R-squared:  0.616,  Adjusted R-squared:  0.5744 
## F-statistic: 14.79 on 9 and 83 DF,  p-value: 5.166e-14

9.2 Then, using the lm.beta(){lm.beta} function, extract the standardized regression coefficients for this model and display the results. Store the results in an object named lm.std and display its summary().

library(lm.beta)
lm.std <- lm.beta(fit.unstd)
summary(lm.std)
## 
## Call:
## lm(formula = Price ~ Type + MPG.city + AirBags + Origin, data = Cars93)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.177  -3.853  -1.176   2.865  28.119 
## 
## Coefficients:
##                           Estimate Standardized Std. Error t value
## (Intercept)               29.69310      0.00000    4.72251   6.288
## TypeLarge                  3.07554      0.10338    2.77387   1.109
## TypeMidsize                5.15727      0.22813    2.18304   2.362
## TypeSmall                 -0.28187     -0.01227    2.59777  -0.109
## TypeSporty                 0.31511      0.01173    2.32941   0.135
## TypeVan                   -0.87178     -0.02683    2.90359  -0.300
## MPG.city                  -0.79575     -0.46296    0.19121  -4.162
## AirBagsDriver & Passenger  8.90892      0.34998    2.28440   3.900
## AirBagsDriver only         4.56426      0.23687    1.67198   2.730
## Originnon-USA              5.14108      0.26742    1.43868   3.573
##                           Pr(>|t|)    
## (Intercept)               1.43e-08 ***
## TypeLarge                 0.270739    
## TypeMidsize               0.020496 *  
## TypeSmall                 0.913856    
## TypeSporty                0.892722    
## TypeVan                   0.764744    
## MPG.city                  7.68e-05 ***
## AirBagsDriver & Passenger 0.000195 ***
## AirBagsDriver only        0.007735 ** 
## Originnon-USA             0.000590 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.302 on 83 degrees of freedom
## Multiple R-squared:  0.616,  Adjusted R-squared:  0.5744 
## F-statistic: 14.79 on 9 and 83 DF,  p-value: 5.166e-14

9.3 Answer briefly: what is the difference between the unstandardized and standardized regression results? Why would you use standardized variables or coefficients?

The coefficient estimates, t-values and p-values are identical; the only difference is the extra “Standardized” column in the standardized results. You would use standardized coefficients to compare how much effect each predictor has on the dependent variable (price, in this case) on a common scale, since they are expressed in standard deviations rather than in each variable’s original units.
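
A tiny sketch verifying one standardized coefficient by hand, using the usual formula beta_std = b * sd(x) / sd(y) for the numeric predictor MPG.city:

coef(fit.unstd)["MPG.city"] * sd(Cars93$MPG.city) / sd(Cars93$Price)   # should be close to -0.463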

9.4 Answer briefly: is it OK to standardize binary or categorical variables like Type or AirBags? How would you get around this issue?

Yes, it is OK to standardize binary or categorical variables. The best way to get around this issue is to create dummy (0/1) variables for each category, which is what R already does automatically for factor variables.
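
A short sketch showing the dummy (indicator) coding that R builds automatically for a factor, which is what makes standardizing categorical predictors awkward in the first place:

head(model.matrix(~ AirBags, data = Cars93))   # one 0/1 column per non-reference level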

#10. Log Models

10.1 Using the read.table() function, read the Credit.csv data set into a data frame named credit. Ensure that you use header=TRUE. We want to use this data to predict credit Rating. Then display a histogram and a qq-plot for the Rating variable. It should be pretty obvious from the histogram that this variable is not normal, although the qq-plot is borderline.

credit <- read.table("C:/Users/Christian/Documents/Credit.csv", header = TRUE, sep = ",")
hist(credit$Rating)

qqnorm(credit$Rating)
qqline(credit$Rating)

10.2 Even if the response variable is not normal, if the residuals of the regression model are fairly normal, then it is OK to use the response variable without transformation. Let’s explore that. Fit a model called fit.linear to predict Rating, using Income, Age and Gender as predictors. Display a summary() of the results. Then plot the resulting fit.linear model, but just display the residual Q-Q plot, using the which=2 parameter.

fit.linear <- lm(Rating ~ Income+Age+Gender, data = credit)
summary(fit.linear)
## 
## Call:
## lm(formula = Rating ~ Income + Age + Gender, data = credit)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -180.226  -77.204   -0.342   78.129  169.052 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  212.0946    17.1029  12.401   <2e-16 ***
## Income         3.5034     0.1367  25.628   <2e-16 ***
## Age           -0.3304     0.2793  -1.183    0.238    
## GenderFemale   5.4432     9.4804   0.574    0.566    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 94.74 on 396 degrees of freedom
## Multiple R-squared:  0.6279, Adjusted R-squared:  0.6251 
## F-statistic: 222.7 on 3 and 396 DF,  p-value: < 2.2e-16
plot(fit.linear, which = 2)

10.3 After inspecting the residual Q-Q plot, do you think that this model is OK? Or do you think that you need to log-transform the Rating variable?

The residuals follow the Q-Q line closely enough (they are not badly skewed), so the model looks OK as is and we do not need to log-transform the Rating variable.

10.4 Regardless of your answer to 10.3, it would not be a bad idea to test a few log transformations to see if we get better predictive accuracy. Please fit both a log-linear model (logging only the response variable Rating) and a log-log model (logging both the response variable Rating and the predictor Income). Store the results of the first model in an object named fit.log.linear and the second one in an object named fit.log.log. Display the summary() for both models.

fit.log.linear = lm(log (Rating)~ Income+Age+Gender, data = credit)
summary(fit.log.linear)
## 
## Call:
## lm(formula = log(Rating) ~ Income + Age + Gender, data = credit)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.99344 -0.21076  0.04697  0.25875  0.52991 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   5.4381923  0.0599182  90.760   <2e-16 ***
## Income        0.0088430  0.0004789  18.465   <2e-16 ***
## Age          -0.0013459  0.0009784  -1.376     0.17    
## GenderFemale  0.0229726  0.0332137   0.692     0.49    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3319 on 396 degrees of freedom
## Multiple R-squared:  0.4654, Adjusted R-squared:  0.4614 
## F-statistic: 114.9 on 3 and 396 DF,  p-value: < 2.2e-16
fit.log.log = lm(log(Rating)~ log(Income)+Age+Gender, data = credit)
summary(fit.log.log)
## 
## Call:
## lm(formula = log(Rating) ~ log(Income) + Age + Gender, data = credit)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.9002 -0.2105  0.0400  0.2712  0.6775 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   4.2661905  0.0993755  42.930   <2e-16 ***
## log(Income)   0.4389052  0.0248821  17.639   <2e-16 ***
## Age          -0.0011171  0.0009974  -1.120    0.263    
## GenderFemale  0.0137189  0.0339043   0.405    0.686    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3388 on 396 degrees of freedom
## Multiple R-squared:  0.4429, Adjusted R-squared:  0.4387 
## F-statistic: 104.9 on 3 and 396 DF,  p-value: < 2.2e-16

10.5 Please provide a quick interpretation of the Income or log(Income) coefficient for each of the three fitted models.

In the linear model, the Income coefficient (3.50) means that each one-unit increase in Income is associated with about 3.5 additional Rating points. In the log-linear model, the coefficient (0.0088) means that each one-unit increase in Income is associated with roughly a 0.9% increase in Rating. In the log-log model, the log(Income) coefficient (0.44) is an elasticity: a 1% increase in Income is associated with about a 0.44% increase in Rating. In all three models the Income term has a very large t-value and a highly significant p-value, so Income clearly has a strong effect on Rating.
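
A short sketch of those conversions, using only the coefficients reported above (the percentage conversion for the log-linear model is 100*(exp(b)-1)):

3.5034                        # linear: ~3.5 extra Rating points per one-unit increase in Income
100 * (exp(0.0088430) - 1)    # log-linear: ~0.89% higher Rating per one-unit increase in Income
0.4389052                     # log-log: an elasticity; ~0.44% higher Rating per 1% increase in Income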

10.6 Using the Adjusted R-Square as a guide, which of the three models is the best (please note that you cannot compare the 3 models with ANOVA because they are not nested)

Going strictly by the Adjusted R-squared, the untransformed linear model is the best of the three: its Adjusted R-squared is about 0.63, versus 0.46 for the log-linear model and 0.44 for the log-log model, so it explains the largest share of the variance in its response. Between the two log models, the log-linear model does slightly better.
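
A small sketch collecting the three adjusted R-squared values in one place for easier comparison:

sapply(list(linear = fit.linear, log.linear = fit.log.linear, log.log = fit.log.log),
       function(m) summary(m)$adj.r.squared)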