Homework 3

Formatting Note

Being that this homework has many more sub questions than previous ones, I’m going to move the R chunks to right after the sub questions versus doing the massive wall of code outputs. I assume this will be a welcome change compared to previous efforts.

Question 1

For recent data in Jacksonville, Florida, on y = selling price of home (in dollars), x1 = size of home (in square feet), and x2 = lot size (in square feet), the prediction equation is ŷ = −10,536 + 53.8x1 + 2.84x2.

-A) A particular home of 1240 square feet on a lot of 18,000 square feet sold for $145,000. Find the predicted selling price and the residual, and interpret.

##Question A math work
  #observed price of home sold
house_price_actual <- 145000

  #predicted price of home sold
house_price_prediction <-  -10536 + (53.8*1240) + (2.84*18000)

  #get residual
  house_price_residual <- house_price_prediction - house_price_actual

For the predicted sale price of the house using 1240 sq ft for x1 and 18000 square ft for x2, I found 107296 and the residual when comparing to the observed sale price of $145,000 is -37704. This means that our model was under-predicted the observed value of home sales by a pretty substantial amount. I suspect this is due to the model failing to account for other variables that explain other factors in the final sale price of a home. Distance to downtown, school districts, or other variables likely play a part in describing the distance between our predicted and actual values.

-B) For fixed lot size, how much is the house selling price predicted to increase for each square-foot increase in home size? Why?

For each square foot the predicted home price increase is $53.80. This is because the slope for that variable is in change for each increment of the variable on the outcome variable controlling for the other variables in the equation for this particular type of multiple regression. Since lot sized is fixed here, we wind up with the proof of that concept as the intercept and x2 terms would remain constant while only the 53.8x term would change.

-C) According to this prediction equation, for fixed home size, how much would lot size need to increase to have the same impact as a one-square-foot increase in home size?

If we divide the x1 slope by the x2 slope, we will get the square footage the lot size would have to increase by in order to equal the incremental change from a square foot increase of the home size. Calculated as 18.94 square feet rounding to the same digits as that slope term.

Question 2

The data file concerns salary and other characteristics of all faculty in a small Midwestern college collected in the early 1980s for presentation in legal proceedings for which discrimination against women in salary was at issue. All persons in the data hold tenured or tenure track positions; temporary faculty are not included. The variables include degree, a factor with levels PhD and MS; rank, a factor with levels Asst, Assoc, and Prof; sex, a factor with levels Male and Female; Year, years in current rank; ysdeg, years since highest degree, and salary, academic year salary in dollars.

##load the data for the following questions
salary_dat <- as_tibble(salary)

-A) Test the hypothesis that the mean salary for men and women is the same, without regard to any other variable but sex. Explain your findings.

## Ho: men_salary = women_salary. Ha: men_salary =/= women_salary, telling it to fit intercept explicitly in the lm function of means with a factor variable
salary_lm_a <- lm(salary ~ 1 + sex, data = salary_dat)

#use confint to find 95% for sex parameter
tidy(salary_lm_a) #I had a plan for using the broom functions I swear

# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)   24697.      938.     26.3  5.76e-31
2 sexFemale     -3340.     1808.     -1.85 7.06e- 2

#I just realized I should finish the homework instead of mess around with pacakges.
confint(salary_lm_a)

               2.5 %    97.5 %
(Intercept) 22812.81 26580.773
sexFemale   -6970.55   291.257

By fitting a model for the data and getting a fit for the intercept, which will represent the Male salary data and the fit for the Female survey data we can compare their means and get p-value for the Female salary data. With the estimated level for the Female variable at -3340 of the intercept/Male data, we can say that we estimate that men are being paid more by just over 3300 dollars. We are confident of this at the .1 level for the two sided test, and looking at the confint range it seems likely that people in the Female dataset at this school were getting paid lower looking at the sex variable alone.

-B) Run a multiple linear regression with salary as the outcome variable and everything else as predictors, including sex. Assuming no interactions between sex and the other predictors, obtain a 95% confidence interval for the difference in salary between males and females.

## Do the linear model for all of the variables. Year is lowercase unlike Q
salary_lm_b <- lm(salary ~ sex + degree + rank + year + ysdeg, data = salary_dat)

#look at the model summary. point estimate has flipped to positive. hmm.
summary(salary_lm_b)


Call:
lm(formula = salary ~ sex + degree + rank + year + ysdeg, data = salary_dat)

Residuals:
    Min      1Q  Median      3Q     Max 
-4045.2 -1094.7  -361.5   813.2  9193.1 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 15746.05     800.18  19.678  < 2e-16 ***
sexFemale    1166.37     925.57   1.260    0.214    
degreePhD    1388.61    1018.75   1.363    0.180    
rankAssoc    5292.36    1145.40   4.621 3.22e-05 ***
rankProf    11118.76    1351.77   8.225 1.62e-10 ***
year          476.31      94.91   5.018 8.65e-06 ***
ysdeg        -124.57      77.49  -1.608    0.115    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2398 on 45 degrees of freedom
Multiple R-squared:  0.855, Adjusted R-squared:  0.8357 
F-statistic: 44.24 on 6 and 45 DF,  p-value: < 2.2e-16

#confint for the sexFemale
confint(salary_lm_b, "sexFemale", level = .95)

              2.5 %   97.5 %
sexFemale -697.8183 3030.565

Only the next question asks for the interpretation, but I’ll put that question for this variable examination here. The p-value for the sexFemale variable is above thresholds for significance, so the confidence in our interval of difference is pretty low. I’m suspicious of it. The positive point estimate as well as the range on the interval, however, would suggest that holding all else equal that sexFemale is associated with a higher salary.

-C) Interpret your finding for each predictor variable; discuss (a) statistical significance, (b) interpretation of the coefficient / slope in relation to the outcome variable and other variables

The rank variable and its levels of Professor and Associate Professor are significant all the way up at the functional 0 level in relation to the Assistant Professor level. The coefficients show that each level of rank that a Professor climbs is associate with a little bit more than five thousand dollar increase in our predicted salary formula.

The year someone has been in the current rank variable level is significant at the same level code, but with a much lower slope when compared to the coefficient leap between ranks suggesting that getting across the cataegory threshold in rank is more important than time served for salary. Between 10 and 12 years of service equates to the change in a rank level.

The degreePhD coefficient is not significant with its p of .180, but it does have a positive value of 1388.61. I was a little surprised by this one, but perhaps the rank variable or something else picks up the effects of higher education? I would have assumed it would have been a little bit higher from the very little I know about general education union contracts. But hey, it was the 1980’s who knows what was going on.

The ysdeg variable has a slightly negative slope, but it is just beyond the .1 level of significance. The only thing I could think of that might explain the slope generally, though I’m not certain it’s really helping the model all that much given the other categories, would be to adjust some of the longer serving professors who only have masters degrees downwards slightly, and since we don’t have an interaction between the ysdeg and degree level it’s messy enough to not be at a statistically significant level.

-D) Change the baseline category for the rank variable. Interpret the coefficients related to rank again. My lack of clear reading of the instructions on the variables almost got me again, just like the quizzes! However, I realized just in time that the rank variable has three levels and I was thinking about the degree variable in my head for some reason. Phew, okay I’m going to relevel the rank variable on Associate Professor and check it out.

#relevel the rank variable, run the model again
salary_dat$rank <- relevel(salary_dat$rank, ref = "Assoc")

#new model
salary_lm_d <- lm(salary ~ rank + sex + degree + year + ysdeg, data = salary_dat)

#summarizing the model
summary(salary_lm_d)


Call:
lm(formula = salary ~ rank + sex + degree + year + ysdeg, data = salary_dat)

Residuals:
    Min      1Q  Median      3Q     Max 
-4045.2 -1094.7  -361.5   813.2  9193.1 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 21038.41    1109.12  18.969  < 2e-16 ***
rankAsst    -5292.36    1145.40  -4.621 3.22e-05 ***
rankProf     5826.40    1012.93   5.752 7.28e-07 ***
sexFemale    1166.37     925.57   1.260    0.214    
degreePhD    1388.61    1018.75   1.363    0.180    
year          476.31      94.91   5.018 8.65e-06 ***
ysdeg        -124.57      77.49  -1.608    0.115    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2398 on 45 degrees of freedom
Multiple R-squared:  0.855, Adjusted R-squared:  0.8357 
F-statistic: 44.24 on 6 and 45 DF,  p-value: < 2.2e-16

Now that the middle rank is the reference category, the coefficients have shifted so that the roughly five thousand dollar increase we predict as you go up the scale is now seen in the in the -5292.36 coefficient value of stepping down to the Asst level, and the same 5826.4 increase that we saw in the last model going from Assoc to Prof.

-E) Finkelstein (1980), in a discussion of the use of regression in discrimination cases, wrote, “[a] variable may reflect a position or status bestowed by the employer, in which case if there is discrimination in the award of the position or status, the variable may be ‘tainted.’” Thus, for example, if discrimination is at work in promotion of faculty to higher ranks, using rank to adjust salaries before comparing the sexes may not be acceptable to the courts. Exclude the variable rank, refit, and summarize how your findings changed, if they did.

#same as salary_lm_b but removing rank.
salary_lm_e <- lm(salary ~ sex + degree + year + ysdeg, data = salary_dat)

#summary of model
summary(salary_lm_e)


Call:
lm(formula = salary ~ sex + degree + year + ysdeg, data = salary_dat)

Residuals:
    Min      1Q  Median      3Q     Max 
-8146.9 -2186.9  -491.5  2279.1 11186.6 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 17183.57    1147.94  14.969  < 2e-16 ***
sexFemale   -1286.54    1313.09  -0.980 0.332209    
degreePhD   -3299.35    1302.52  -2.533 0.014704 *  
year          351.97     142.48   2.470 0.017185 *  
ysdeg         339.40      80.62   4.210 0.000114 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3744 on 47 degrees of freedom
Multiple R-squared:  0.6312,    Adjusted R-squared:  0.5998 
F-statistic: 20.11 on 4 and 47 DF,  p-value: 1.048e-09

confint(salary_lm_e, "sexFemale", level = .95)

              2.5 %   97.5 %
sexFemale -3928.138 1355.049

The sexFemale coefficient is back to being a negative value relative to the Male level in our intercept. The p value is above significance thresholds, however.

Outside of the main point of the question, I’m a little stumped on why the degreePhD coefficient is negative relative to the Masterslevel in the intercept besides perhaps small sample size messiness. Theysdeg variable looks like it picks up some of the descriptive power that the rank variable left behind as we would expect. I would have assumed some of the same for PhD. I’m sure my self-narration of something that’s potentially indicating the whole function is wrong and therefore all of my answers will be amusing after I see the actual answers.

-F) Everyone in this dataset was hired the year they earned their highest degree. It is also known that a new Dean was appointed 15 years ago, and everyone in the dataset who earned their highest degree 15 years ago or less than that has been hired by the new Dean. Some people have argued that the new Dean has been making offers that are a lot more generous to newly hired faculty than the previous one and that this might explain some of the variation in Salary.Create a new variable that would allow you to test this hypothesis and run another multiple regression model to test this. Select variables carefully to make sure there is no multicollinearity. Explain why multicollinearity would be a concern in this case and how you avoided it. Do you find support for the hypothesis that the people hired by the new Dean are making higher than those that were not?

#relevel rank
salary_dat$rank <- relevel(salary_dat$rank, ref = "Asst")

#make a dummy variable for years since degree less than 15 to note as hired by new_dean.
salary_dat <- salary_dat %>%
  mutate(new_dean = case_when(
    ysdeg <= 15 ~ 1,
    ysdeg > 15 ~ 0
  ))



#the model
salary_lm_f <- lm(salary ~ year + sex + degree + rank + new_dean  + new_dean*year, data = salary_dat)

#summary of model
summary(salary_lm_f)


Call:
lm(formula = salary ~ year + sex + degree + rank + new_dean + 
    new_dean * year, data = salary_dat)

Residuals:
    Min      1Q  Median      3Q     Max 
-3309.8 -1102.5  -265.2   539.2  9339.4 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)   12011.05    1931.93   6.217 1.62e-07 ***
year            495.72      97.42   5.088 7.20e-06 ***
sexFemale      1115.96     862.06   1.295   0.2022    
degreePhD      1161.63     859.23   1.352   0.1833    
rankAssoc      5416.07    1079.72   5.016 9.14e-06 ***
rankProf      11612.80    1284.64   9.040 1.37e-11 ***
new_dean       3789.08    1867.72   2.029   0.0486 *  
year:new_dean  -195.43     183.99  -1.062   0.2940    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2358 on 44 degrees of freedom
Multiple R-squared:  0.8629,    Adjusted R-squared:  0.8411 
F-statistic: 39.58 on 7 and 44 DF,  p-value: < 2.2e-16

confint(salary_lm_f, "new_dean", level = .95)

            2.5 %  97.5 %
new_dean 24.93623 7553.22

#check variance inflation factor for model
car::vif(salary_lm_f)

                  GVIF Df GVIF^(1/(2*Df))
year          2.639661  1        1.624703
sex           1.366920  1        1.169154
degree        1.562142  1        1.249857
rank          3.611003  2        1.378501
new_dean      8.153135  1        2.855369
year:new_dean 3.665354  1        1.914512

Multicollinearity is a concern in this question because the years since highest degree and the new variable new_dean would be measuring similar things and therefore not clearing the additivity consideration. The variable year in rank would also potentially be overlapping with the new_dean dummy variable, as the same pre and post 15 year threshold would potentially be seen there. To deal with this, I added an interaction effect with the year in rank variable and the new_dean variable, and after looking at the adjusted R-squared for the model with and without ysdeg and a ysdeg*new_dean and even the monstrosity of ysdeg*years*new_dean, dropping ysdeg made for a better fit. This makes sense as the variable itself lost significance when the rank level was included previously, so dropping that term and adding the interaction effect should deal with the multicollinearity. I also checked the variable inflation factor, and using the second GVIF since I’m using more than one coefficient, and none of the values is near or over 5 to investigate further. The new_dean and ysdeg did have the highest values, however.

I do find support at the .05 significance level that being hired by the new Dean are making a higher wage than those that were not. The estimated dollar amount of the effect is 3789.08, and I printed the 95% confidence interval above as well.

Question 3

-A) Using the house.selling.price data, run and report regression results modeling y = selling price (in dollars) in terms of size of home (in square feet) and whether the home is new (1 = yes; 0 = no). (In other words, price is the outcome variable and size and new are the explanatory variables.)

data("house.selling.price")
#I like tibbling and assigning to dats, 
#even if smss wants only data() to be the way
house_dat <- as_tibble(house.selling.price)

#make the model for Old set as reference
house_lm_a_old <- lm(Price ~ New + Size, data = house_dat)


#summary of new model
summary(house_lm_a_old)


Call:
lm(formula = Price ~ New + Size, data = house_dat)

Residuals:
    Min      1Q  Median      3Q     Max 
-205102  -34374   -5778   18929  163866 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -40230.867  14696.140  -2.738  0.00737 ** 
New          57736.283  18653.041   3.095  0.00257 ** 
Size           116.132      8.795  13.204  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 53880 on 97 degrees of freedom
Multiple R-squared:  0.7226,    Adjusted R-squared:  0.7169 
F-statistic: 126.3 on 2 and 97 DF,  p-value: < 2.2e-16

-B) Report and interpret the prediction equation, and form separate equations relating selling price to size for new and for not new homes. In particular, for each variable; discuss statistical significance and interpret the meaning of the coefficient. The combined equation above would be y = -40230.87 + 116.13x1 + 57736.28x2 where x1 = sq ft and x2 = if the house is New (dummy variable = 1). Each of the terms is significant at the .001 level, and size is significant all the way to the 0 code. These terms are all strongly associated with the estimated Price for a home.

A New home is worth 57,736.28 in price compared to an old home, or 17505.41 if we were just looking at a model for new homes. The intercept is the starting point price for an old home, which at -40,230.88 functions as a sort of tax on being pre-owned that the size of the house must compensate for. An older home would have to be larger than a newer home to have the same predicted price.

We can consolidate the starting price point estimates down for an equation for new homes as Price = 17505.41 + 116.13*x1 and for old homes Price = 116.13*x1 -40230.87.

-C) Find the predicted selling price for a home of 3000 square feet that is (i) new, (ii) not new.

#for the new house with x1 = 3000 sq ft
price_new_3k <- (116.13*3000) + 17505.41
  
#for the old house with x1 3000 sq ft
price_old_3k <- (116.13*3000) - 40230

For the 3000 square foot new home, 365895.4. For the old 3000 square foot home the prediction is 308160. I wish I would have been shopping for a home in Gainesville, FL in 2006. Well, except for the whole housing bubble thing that is about to happen back then. Living through that once was enough.

-D) Fit another model, this time with an interaction term allowing interaction between size and new, and report the regression results

#make the new house price model with new and size interaction term
house_lm_d <- lm(Price ~ Size + New + New*Size, data = house_dat)

#summary of the model
summary(house_lm_d)


Call:
lm(formula = Price ~ Size + New + New * Size, data = house_dat)

Residuals:
    Min      1Q  Median      3Q     Max 
-175748  -28979   -6260   14693  192519 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -22227.808  15521.110  -1.432  0.15536    
Size           104.438      9.424  11.082  < 2e-16 ***
New         -78527.502  51007.642  -1.540  0.12697    
Size:New        61.916     21.686   2.855  0.00527 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 52000 on 96 degrees of freedom
Multiple R-squared:  0.7443,    Adjusted R-squared:  0.7363 
F-statistic: 93.15 on 3 and 96 DF,  p-value: < 2.2e-16

-E) Report the lines relating the predicted selling price to the size for homes that are (i) new, (ii) not new.

#start with the overall interaction model to simplify
  #smss uses z for categorical, switch to that vs x2 from earlier.
  #z = new, x = sq ft

#house_lm_e <- -22227.81 -78527.50x + 104.44x + 61.92(x*z)

#old model line
#house_lm_e_old <- -22227.81 + (104.44*x)

#new model line
#house_lm_e_new <- -100755.31 + (104.44*x) + (61.92*x) 
#house_lm_e_new <- -100755.31 + (166.36*x)

I commented out the line functions as to not have to get the actual assignment of the Size variables and all that, but the line for old homes is Predicted Price = -22227.81 + (104.44*size) and for new homes it is Predicted Price = -100755.31 + (166.36*size).

-F) Find the predicted selling price for a home of 3000 square feet that is (i) new, (ii) not new.

#(i) x = 3000 sq ft for new home
lm_int_predict_30k_new <- (166.36*3000) -100755.31

#(ii) x = 3000 sq ft for old home
lm_int_predict_30k_old <- (104.44*3000) -22227.81

For (i) the predicted selling price is 398324.7. For (ii) the predicted selling price is 291092.2.

-G) Find the predicted selling price for a home of 1500 square feet that is (i) new, (ii) not new. Comparing to (F), explain how the difference in predicted selling prices changes as the size of home increases.

#(i) x = 1500 sq ft for new home
lm_int_predict_15k_new <- (166.36*1500) - 100755.31

#(ii) x = 1500 sq ft for old home
lm_int_predict_15k_old <- (104.44*1500) -22227.81

For (i) the predicted selling price is 148784.7 and for (ii) it is predicted to be 134432.2. The gap is smaller when compared to the answers for (F), since the slope is higher for new homes. We would expect this as the house size got bigger that new homes would be worth an increasing level more than old homes, roughly 62 dollars per square foot more for each incremental increase.

-H) Do you think the model with interaction or the one without it represents the relationship of size and new to the outcome price? What makes you prefer one model over another?

#do anova on the two models instead of summary!

anova(house_lm_a_old, house_lm_d)

Analysis of Variance Table

Model 1: Price ~ New + Size
Model 2: Price ~ Size + New + New * Size
  Res.Df        RSS Df  Sum of Sq      F   Pr(>F)   
1     97 2.8161e+11                                 
2     96 2.5957e+11  1 2.2041e+10 8.1519 0.005272 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Instead of just comparing the R^2 and such from the summaries of the models already printed, I decided to look at the anova results comparing the house_lm_a_old model with no interaction to the house_lm_d model which included the interaction effects. At the .001 significance level, the model including the interaction term does improve the model at capturing the data. I would use the house_lm_d model for predictions as a result if choosing between the two.