HW3_DACSS603

Question 1

For recent data in Jacksonville, Florida, on y = selling price of home (in dollars), x1 = size of home (in square feet), and x2 = lot size (in square feet), the prediction equation is ŷ = −10,536 + 53.8x1 +2.84x2

(A) A particular home of 1240 square feet on a lot of 18,000 square feet sold for $145,000. Find the predicted selling price and the residual, and interpret.

# the predicted selling price

 -10536 + (53.8*1240) + (2.84*18000)

[1] 107296

Based on this model, we would predict the selling price of the home to be $107,296. However, the actual selling price was $145,000. So, the residual is 145000 - 107296 = 37704. This means that the actual price is $37,704 greater than the predicted price, i.e., the model underpredicts for this home. One reason this could happen is that other variables that affect the selling price are not included in the model. Perhaps whether a house is new is another factor that could be incorporated, or perhaps there is an interaction between home size and lot size.
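As a quick check of the arithmetic above, here is a minimal sketch that wraps the prediction equation in a function. The coefficients come from the question; the helper name predict_jax_price is my own and is only for illustration.

```r
# Coefficients copied from the prediction equation in the question
b0 <- -10536
b1 <- 53.8   # dollars per sq. foot of home size
b2 <- 2.84   # dollars per sq. foot of lot size

# Hypothetical helper: predicted selling price for a given home and lot size
predict_jax_price <- function(home_sqft, lot_sqft) {
  b0 + b1 * home_sqft + b2 * lot_sqft
}

predicted <- predict_jax_price(1240, 18000)
# Residual = observed price minus predicted price
residual <- 145000 - predicted

predicted
residual
```

A positive residual means the observed price is above the fitted value.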

(B) For fixed lot size, how much is the house selling price predicted to increase for each square-foot increase in home size? Why?

We can figure this out by holding the lot size constant. In the prediction equation (ŷ = −10,536 + 53.8x1 + 2.84x2), the coefficient on x1 is 53.8, so for a fixed lot size the house selling price is predicted to increase by $53.80 for each sq. foot increase in home size. This is because, with x2 held fixed, the 2.84x2 term does not change, so the only change in ŷ comes from the 53.8x1 term.

(C) According to this prediction equation, for fixed home size, how much would lot size need to increase to have the same impact as a one-square-foot increase in home size?

Again, looking at the formula, for a fixed home size we see that the house selling price is predicted to increase by $2.84 (~$3) for each sq. foot increase in lot size. So lot size would have to increase by ~19 sq. feet (53.8/2.84 = 18.94) in order to have the same impact as a one-square-foot increase in home size.
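The ratio above can be verified directly. This is a small sketch using only the two slope coefficients from the prediction equation; the variable name lot_equiv is my own.

```r
b1 <- 53.8   # slope for home size (x1)
b2 <- 2.84   # slope for lot size (x2)

# Lot-size increase (sq. feet) with the same predicted effect as 1 sq. foot of home size
lot_equiv <- b1 / b2
lot_equiv

# Check: that much extra lot size changes the predicted price by b1 dollars
b2 * lot_equiv
```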

Question 2

(ALR, 5.17, slightly modified)

(Data file: salary in alr4 R package). The data file concerns salary and other characteristics of all faculty in a small Midwestern college collected in the early 1980s for presentation in legal proceedings for which discrimination against women in salary was at issue. All persons in the data hold tenured or tenure track positions; temporary faculty are not included. The variables include degree, a factor with levels PhD and MS; rank, a factor with levels Asst, Assoc, and Prof; sex, a factor with levels Male and Female; Year, years in current rank; ysdeg, years since highest degree, and salary, academic year salary in dollars.

(A) Test the hypothesis that the mean salary for men and women is the same, without regard to any other variable but sex. Explain your findings.

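library(alr4)     # provides the salary data
library(dplyr)    # %>%, group_by(), summarise()
library(ggplot2)  # ggplot()

data("salary")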

#box plot
boxplot(salary~sex, salary)

summary_salary <- salary %>%
  group_by(sex) %>%
  summarise(average_salary = mean(salary), min_salary = min(salary), max_salary = max(salary))

#summary plot
ggplot(summary_salary) +   
  geom_point(aes(x = sex, y = average_salary), color = "#FF5C35", size = 4) +
  geom_errorbar(aes(x = sex, ymin = min_salary, ymax = max_salary), color = "#FF5C35", width = 0.5) +
  labs(x = "Sex",  y = "Salary") +
  geom_text(aes(x = sex, y = max_salary, label = max_salary), 
            family = "Avenir", size=3, color = "#33475b", hjust = -3) +
  geom_text(aes(x = sex, y = min_salary, label = min_salary), 
            family = "Avenir", size=3, color = "#33475b", hjust = -3) +
  geom_text(aes(x = sex, y = average_salary, label = round(average_salary, digits = 2)), 
            family = "Avenir", size=3, color = "#33475b", hjust = -0.5) +
  theme(axis.text.x = element_text(family = "Avenir", color = "#33475b", size=10),
        axis.text.y = element_text(family = "Avenir", color = "#33475b", size=8),
        axis.title.y = element_text(family = "Avenir", color = "#33475b", size=13),
        axis.title.x = element_text(family = "Avenir", color = "#33475b", size=13))

Without regard to any other variable, we see that the sample mean salary for men and women is not the same: the mean salary for men is higher. However, it also seems like a woman has the highest salary (so there is a bigger range for women). From the visuals alone, we don’t know if the difference in means is statistically significant.

# t.test to test the null hypothesis that men and women have the same mean salary.

t.test(salary~sex, data = salary)


    Welch Two Sample t-test

data:  salary by sex
t = 1.7744, df = 21.591, p-value = 0.09009
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -567.8539 7247.1471
sample estimates:
  mean in group Male mean in group Female 
            24696.79             21357.14

Even though men have a mean salary that is higher than women ($24,696.79 vs. $21,357.14), the p-value (.09) is higher than a significance level of 5%, and the 95% confidence interval for the difference includes 0. Therefore, we fail to reject the null hypothesis that the mean salaries for men and women are the same.
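To see where the reported p-value comes from, here is a minimal sketch that recomputes the two-sided p-value from the t statistic and the Welch degrees of freedom printed in the output above. The inputs are the rounded values from the printout, so the result should agree only up to rounding.

```r
# Values copied (rounded) from the t.test() output
t_stat   <- 1.7744
df_welch <- 21.591

# Two-sided p-value: probability of a t statistic at least this extreme in either tail
p_two_sided <- 2 * pt(-abs(t_stat), df = df_welch)
p_two_sided

# Decision at the 5% significance level (TRUE would mean "reject")
p_two_sided < 0.05
```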

(B) Run a multiple linear regression with salary as the outcome variable and everything else as predictors, including sex. Assuming no interactions between sex and the other predictors, obtain a 95% confidence interval for the difference in salary between males and females.

I used the lm() function to run a linear regression.

summary(lm(salary ~ ., data = salary))


Call:
lm(formula = salary ~ ., data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-4045.2 -1094.7  -361.5   813.2  9193.1 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 15746.05     800.18  19.678  < 2e-16 ***
degreePhD    1388.61    1018.75   1.363    0.180    
rankAssoc    5292.36    1145.40   4.621 3.22e-05 ***
rankProf    11118.76    1351.77   8.225 1.62e-10 ***
sexFemale    1166.37     925.57   1.260    0.214    
year          476.31      94.91   5.018 8.65e-06 ***
ysdeg        -124.57      77.49  -1.608    0.115    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2398 on 45 degrees of freedom
Multiple R-squared:  0.855, Adjusted R-squared:  0.8357 
F-statistic: 44.24 on 6 and 45 DF,  p-value: < 2.2e-16

Next, I used confint() to find the 95% confidence intervals for all of the coefficients.

confint(lm(salary ~ ., data = salary))

                 2.5 %      97.5 %
(Intercept) 14134.4059 17357.68946
degreePhD    -663.2482  3440.47485
rankAssoc    2985.4107  7599.31080
rankProf     8396.1546 13841.37340
sexFemale    -697.8183  3030.56452
year          285.1433   667.47476
ysdeg        -280.6397    31.49105

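From this we see that the 95% confidence interval for the difference in salary between females and males (the sexFemale coefficient), holding the other predictors constant, has a lower bound of -$697.82 and an upper bound of $3,030.56. So we are 95% confident that, controlling for degree, rank, year, and ysdeg, women make somewhere between $697.82 less and $3,030.56 more than men. Because the interval includes 0, the difference is not statistically significant at the 0.05 level.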
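The interval for sexFemale can also be rebuilt by hand from the regression table, as estimate ± t critical value × standard error. This is a sketch using the rounded estimate and standard error from the summary output and the 45 residual degrees of freedom, so it should match confint() only up to rounding.

```r
# Values copied from the regression summary for sexFemale
est <- 1166.37
se  <- 925.57
df_resid <- 45

# Critical value for a 95% two-sided interval
t_crit <- qt(0.975, df = df_resid)

ci_lower <- est - t_crit * se
ci_upper <- est + t_crit * se
c(lower = ci_lower, upper = ci_upper)
```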

(C) Interpret your finding for each predictor variable; discuss (a) statistical significance, (b) interpretation of the coefficient / slope in relation to the outcome variable and other variables

Salary and degreePhD - from the regression we see that if an employee has a PhD (the baseline is MS), with all other variables held constant, we would predict their salary to be $1,388.61 higher, a positive slope. However, the p-value of this effect is 0.18, so it is not significant at the 0.05 level.
Salary and rankAssoc - we can see that if an employee has a rank of Associate we’d predict their salary to be $5,292.36 higher than an Assistant’s (the baseline is rankAsst), all other variables held constant. Again, this indicates a positive slope. The p-value for rankAssoc is 3.22e-05, which is statistically significant at the 0.001 level, so there is a strong effect.
Salary and rankProf - we can see that if an employee has a rank of Professor, holding all other variables constant, we’d predict their salary to be $11,118.76 higher than an Assistant’s (the baseline is rankAsst), a positive slope. The p-value for rankProf is 1.62e-10, which is statistically significant at the 0.001 level, so there is a strong effect.
Salary and sexFemale - from the regression output we can see that if an employee is female and every other variable is held constant, we’d predict their salary to be $1,166.37 higher than a male employee’s (a positive slope). However, the p-value for sexFemale is .21, which is not statistically significant at the 0.05 level.
Salary and year - We see that, holding all else constant, we predict that salary increases by $476.31 for each additional year in current rank, a positive slope. The p-value for year is 8.65e-06, which is statistically significant at the 0.001 level.
Salary and ysdeg - We can see that, holding all else constant, each additional year since degree is associated with a predicted salary that is $124.57 lower, a negative slope. However, the p-value for ysdeg is .12, which is not statistically significant at the 0.05 level.

The strongest predictors of salary are rank and years in rank. The adjusted R-squared value suggests that the model explains 83.57% of the variation in salary.

(D) Change the baseline category for the rank variable. Interpret the coefficients related to rank again.

I changed the baseline category for the rank variable from Assistant to Professor.

#salary$professor <- ifelse(salary$rank == 'Prof', 1, 0)
salary$rank <- relevel(salary$rank, ref = 'Prof')
summary(lm(salary ~ ., data = salary))


Call:
lm(formula = salary ~ ., data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-4045.2 -1094.7  -361.5   813.2  9193.1 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  26864.81    1375.29  19.534  < 2e-16 ***
degreePhD     1388.61    1018.75   1.363    0.180    
rankAsst    -11118.76    1351.77  -8.225 1.62e-10 ***
rankAssoc    -5826.40    1012.93  -5.752 7.28e-07 ***
sexFemale     1166.37     925.57   1.260    0.214    
year           476.31      94.91   5.018 8.65e-06 ***
ysdeg         -124.57      77.49  -1.608    0.115    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2398 on 45 degrees of freedom
Multiple R-squared:  0.855, Adjusted R-squared:  0.8357 
F-statistic: 44.24 on 6 and 45 DF,  p-value: < 2.2e-16

Salary and rankAsst - we can see that if an employee has a rank of Assistant, holding all other variables constant, we’d predict their salary to be $11,118.76 lower than the baseline category of Professor. This indicates a negative slope. The p-value for rankAsst with the baseline of Professor is 1.62e-10, which is statistically significant at the 0.001 level, so there is a strong effect.
Salary and rankAssoc - we can see that if an employee has a rank of Associate, holding all other variables constant, we’d predict their salary to be $5,826.40 lower than the baseline category of Professor. Here the p-value for rankAssoc is 7.28e-07, which is statistically significant at the 0.001 level, so there is a strong effect.
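Changing the baseline does not change the fitted model, only how the rank effects are expressed. As a sketch, the Professor-baseline coefficients can be recovered from the Assistant-baseline coefficients reported earlier; the values below are copied from the two summary outputs and the variable names are my own.

```r
# Coefficients from the model with Asst as the baseline
asst_base <- c(intercept = 15746.05, rankAssoc = 5292.36, rankProf = 11118.76)

# With Prof as the baseline, the intercept absorbs the Prof effect
prof_intercept <- asst_base[["intercept"]] + asst_base[["rankProf"]]

# Each rank is now compared with Prof instead of Asst
asst_vs_prof  <- -asst_base[["rankProf"]]
assoc_vs_prof <- asst_base[["rankAssoc"]] - asst_base[["rankProf"]]

c(intercept = prof_intercept, rankAsst = asst_vs_prof, rankAssoc = assoc_vs_prof)
```

These agree with the intercept, rankAsst, and rankAssoc estimates in the releveled output, and the other coefficients, residuals, and R-squared are unchanged.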

(E) Finkelstein (1980), in a discussion of the use of regression in discrimination cases, wrote, “[a] variable may reflect a position or status bestowed by the employer, in which case if there is discrimination in the award of the position or status, the variable may be ‘tainted.’” Thus, for example, if discrimination is at work in promotion of faculty to higher ranks, using rank to adjust salaries before comparing the sexes may not be acceptable to the courts.

Exclude the variable rank, refit, and summarize how your findings changed, if they did.

#exclude the variable rank

summary(lm(salary ~ . -rank, data = salary))


Call:
lm(formula = salary ~ . - rank, data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-8146.9 -2186.9  -491.5  2279.1 11186.6 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 17183.57    1147.94  14.969  < 2e-16 ***
degreePhD   -3299.35    1302.52  -2.533 0.014704 *  
sexFemale   -1286.54    1313.09  -0.980 0.332209    
year          351.97     142.48   2.470 0.017185 *  
ysdeg         339.40      80.62   4.210 0.000114 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3744 on 47 degrees of freedom
Multiple R-squared:  0.6312,    Adjusted R-squared:  0.5998 
F-statistic: 20.11 on 4 and 47 DF,  p-value: 1.048e-09

Ignoring rank:

Salary and degreePhD - We see that if an employee had a PhD, we’d predict that their salary would be $3,299.35 lower (a negative slope), holding the other variables constant. This effect has a p-value of 0.014704, so it is significant at the 0.05 level.
Salary and sexFemale - We see that if an employee is female, we’d predict their salary would be $1,286.54 lower, a negative slope. This effect has a p-value of 0.332209, so it is not statistically significant at the 0.05 level.
Salary and year - We see that the model would predict that an employee’s salary would go up by $351.97 for every year at that rank (a positive slope). This is significant at the 0.05 level.
Salary and ysdeg - We see that the model would predict that an employee’s salary would go up by $339.40 for every year since their degree (a positive slope). This is significant at the 0.001 level.

Compared with the model that includes rank, the signs of degreePhD, sexFemale, and ysdeg all flip, degreePhD and ysdeg become statistically significant, sexFemale remains insignificant, and the adjusted R-squared drops from 0.8357 to 0.5998.

Note: if rank is tainted by Sex, we could argue that other variables in this data set are tainted as well, so using data like these to resolve issues of discrimination will never satisfy everyone.

(F) Everyone in this dataset was hired the year they earned their highest degree. It is also known that a new Dean was appointed 15 years ago, and everyone in the dataset who earned their highest degree 15 years ago or less than that has been hired by the new Dean. Some people have argued that the new Dean has been making offers that are a lot more generous to newly hired faculty than the previous one and that this might explain some of the variation in Salary.

Create a new variable that would allow you to test this hypothesis and run another multiple regression model to test this. Select variables carefully to make sure there is no multicollinearity. Explain why multicollinearity would be a concern in this case and how you avoided it. Do you find support for the hypothesis that the people hired by the new Dean are making higher than those that were not?

In order to test this hypothesis I created a dummy variable (newdean). If ysdeg is 15 years or less it will code as 1. Otherwise it will be coded as 0.

Multicollinearity could be a concern if two predictors are strongly correlated, because the model then cannot separate their effects and the standard errors of their coefficients are inflated. In order to avoid this I removed ysdeg because of the overlap with the new dummy variable (the dummy variable was based on ysdeg).
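One way to check the overlap is to look at the correlation between ysdeg and the dummy built from it. The sketch below uses a hypothetical helper (newdean_cor) and a few made-up ysdeg values purely to illustrate the idea; the last block runs the same check on the real data only if the alr4 package happens to be installed.

```r
# Hypothetical helper: correlation between ysdeg and the dummy derived from it
newdean_cor <- function(ysdeg, cutoff = 15) {
  newdean <- ifelse(ysdeg <= cutoff, 1, 0)
  cor(ysdeg, newdean)
}

# Made-up illustrative values, NOT the salary data
toy_ysdeg <- c(1, 5, 10, 15, 16, 20, 30, 35)
toy_r <- newdean_cor(toy_ysdeg)
toy_r  # negative: larger ysdeg goes with newdean = 0

# Same check on the real data (assumes alr4 is installed)
if (requireNamespace("alr4", quietly = TRUE)) {
  data("salary", package = "alr4")
  print(newdean_cor(salary$ysdeg))
}
```

A correlation far from zero suggests that keeping both ysdeg and newdean in the same model would make their individual coefficients hard to interpret.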

#dummy variable

salary$newdean <- ifelse(salary$ysdeg <= 15, 1, 0)

summary(lm(salary ~ . - ysdeg, data = salary))


Call:
lm(formula = salary ~ . - ysdeg, data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-3403.3 -1387.0  -167.0   528.2  9233.8 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  24425.32    1107.52  22.054  < 2e-16 ***
degreePhD      818.93     797.48   1.027   0.3100    
rankAsst    -11096.95    1191.00  -9.317 4.54e-12 ***
rankAssoc    -6124.28    1028.58  -5.954 3.65e-07 ***
sexFemale      907.14     840.54   1.079   0.2862    
year           434.85      78.89   5.512 1.65e-06 ***
newdean       2163.46    1072.04   2.018   0.0496 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2362 on 45 degrees of freedom
Multiple R-squared:  0.8594,    Adjusted R-squared:  0.8407 
F-statistic: 45.86 on 6 and 45 DF,  p-value: < 2.2e-16

Based on this regression, it appears that employees hired under the new dean are predicted to make $2,163.46 more than employees hired before the new dean, holding degree, rank, sex, and year constant. This effect is statistically significant at the 0.05 level, but only barely (p = 0.0496). So there is some support for the hypothesis, although newdean is a much weaker predictor of salary than rank or year.

Question 3

(SMSS 13.7 & 13.8 combined, modified)

(Data file: house.selling.price in smss R package)

(A) Using the house.selling.price data, run and report regression results modeling y = selling price (in dollars) in terms of Size of home (in square feet) and whether the home is New (1 = yes; 0 = no). (In other words, price is the outcome variable and size and new are the explanatory variables.)

I loaded the data and then I ran a linear regression using the lm() function.

#loading the data
library(smss)  # provides house.selling.price
data("house.selling.price")

# price is the outcome variable and size and new are the explanatory variables
summary(lm(Price ~ Size + New, data = house.selling.price))


Call:
lm(formula = Price ~ Size + New, data = house.selling.price)

Residuals:
    Min      1Q  Median      3Q     Max 
-205102  -34374   -5778   18929  163866 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -40230.867  14696.140  -2.738  0.00737 ** 
Size           116.132      8.795  13.204  < 2e-16 ***
New          57736.283  18653.041   3.095  0.00257 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 53880 on 97 degrees of freedom
Multiple R-squared:  0.7226,    Adjusted R-squared:  0.7169 
F-statistic: 126.3 on 2 and 97 DF,  p-value: < 2.2e-16

Overall we can see that the model explains about 71.7% of the variation in selling price (adjusted R-squared = 0.7169). We also see that the p-values for both Size and New are less than .05:

We reject the null hypothesis that there is no relationship between Size and selling price of the home, holding New constant.
We reject the null hypothesis that there is no relationship between New and selling price of the home, holding Size constant.

Both variables are statistically significant predictors of the selling price of the home.

(B) Report and interpret the prediction equation, and form separate equations relating selling price to size for new and for not new homes. In particular, for each variable; discuss statistical significance and interpret the meaning of the coefficient.

Based on the above regression model, the prediction equation for the price of a home would be yhat = -40230.867 + 116.132x + 57736.283z where x = the size of the home and z = 1 if the home is new or z = 0 if the home is not new.

For new homes, z = 1 so the prediction equation for the selling price of a new home is yhat = -40230.867 + 116.132x + 57736.283 or yhat = 17505.416 + 116.132x
For not new homes, z = 0 so the prediction equation for the selling price of a not-new home is yhat = -40230.867 + 116.132x + 0 or yhat = -40230.867 + 116.132x

We see in the regression results and the formulas that Size and New both have a positive effect on the selling price of a home. For both new and not new homes, for every 1 sq. foot increase in Size we predict an increase of ~$116 in the selling price. For houses of all sizes, we predict a new home (z = 1) to be ~$57,736 more expensive than a not new home of the same size. In this model, the impact of each variable is separate: there is no interaction, so the two lines are parallel. We see that both variables have small p-values; Size is significant at the 0.001 level and New is significant at the 0.01 level.
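To illustrate what "no interaction" means here, this is a minimal sketch that codes the prediction equation as a function and compares new and not new homes at two sizes. The coefficients are copied from the regression output; the helper name predict_additive and the example sizes are my own choices.

```r
# Coefficients from lm(Price ~ Size + New)
coef_add <- c(intercept = -40230.867, size = 116.132, new = 57736.283)

# Hypothetical helper: predicted price from the model without interaction
predict_additive <- function(size, new) {
  coef_add[["intercept"]] + coef_add[["size"]] * size + coef_add[["new"]] * new
}

sizes <- c(1500, 3000)
# New-minus-not-new gap at each size
gap_additive <- predict_additive(sizes, new = 1) - predict_additive(sizes, new = 0)
gap_additive
```

The gap equals the New coefficient at every size, which is what the parallel lines imply.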

(C) Find the predicted selling price for a home of 3000 square feet that is (i) new, (ii) not new.

#predicted selling price for a new home

17505.416 + (116.132*3000)

[1] 365901.4

#predicted selling price for a not new home

-40230.867 + (116.132*3000)

[1] 308165.1

Based on the predictive formulas from the regression model, the predicted selling price for a new home that is 3000 sq. feet is $365,901 and the predicted selling price for a not new home of the same size is $308,165.

(D) Fit another model, this time with an interaction term allowing interaction between Size and New, and report the regression results

# price is the outcome variable and size and new are the explanatory variables - allowing interaction between size and new
summary(lm(Price ~ Size + New + Size*New, data = house.selling.price))


Call:
lm(formula = Price ~ Size + New + Size * New, data = house.selling.price)

Residuals:
    Min      1Q  Median      3Q     Max 
-175748  -28979   -6260   14693  192519 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -22227.808  15521.110  -1.432  0.15536    
Size           104.438      9.424  11.082  < 2e-16 ***
New         -78527.502  51007.642  -1.540  0.12697    
Size:New        61.916     21.686   2.855  0.00527 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 52000 on 96 degrees of freedom
Multiple R-squared:  0.7443,    Adjusted R-squared:  0.7363 
F-statistic: 93.15 on 3 and 96 DF,  p-value: < 2.2e-16

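We now see that the model explains 73.6% of the variation in selling price (adjusted R-squared = 0.7363). This is an increase over the model that uses only Size and New as variables.

Selling Price and Size:New - We see the coefficient is $61.92 (a positive slope), meaning each additional sq. foot adds ~$61.92 more to the predicted price of a new home than to that of a not new home. The p-value for Size:New is 0.00527, so it is significant at the 0.01 level. We can reject the null hypothesis that the effect of Size on selling price is the same for new and not new homes.
Selling Price and New - Now that we are including the interaction term, New on its own is insignificant, as its p-value went from 0.00257 to 0.12697. In this model the New coefficient is the predicted difference between new and not new homes when Size = 0, so it is not meaningful on its own.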

(E) Report the lines relating the predicted selling price to the size for homes that are (i) new, (ii) not new.

Based on the above regression model, the prediction equation for the price of a home would be yhat = -22227.808 + 104.438x - 78527.502z + 61.916xz where x = the size of the home and z = 1 if the home is new or z = 0 if the home is not new.

For new homes, z = 1 so the prediction equation for the selling price of a new home is yhat = -22227.808 + 104.438x - 78527.502 + 61.916x or yhat = -100755.310 + 166.354x
For not new homes, z = 0 so the prediction equation for the selling price of a not-new home is yhat = -22227.808 + 104.438x + 0 + 0 or yhat = -22227.808 + 104.438x
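The intercept and slope of each line follow from the four coefficients in the interaction model. Here is a small sketch of that arithmetic, using the estimates from the regression output; the variable names are my own.

```r
# Coefficients from lm(Price ~ Size + New + Size*New)
b_int  <- -22227.808   # intercept
b_size <- 104.438      # Size
b_new  <- -78527.502   # New
b_sxn  <- 61.916       # Size:New

# Not new homes (z = 0): only the intercept and Size terms remain
line_not_new <- c(intercept = b_int, slope = b_size)

# New homes (z = 1): New shifts the intercept, Size:New shifts the slope
line_new <- c(intercept = b_int + b_new, slope = b_size + b_sxn)

line_not_new
line_new
```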

(F) Find the predicted selling price for a home of 3000 square feet that is (i) new, (ii) not new.

#predicted selling price for a new home
#x = size (3000)
#z = new/not new (1)
#yhat = -22227.808 + 104.438x + -78527.502z + 61.916xz

-22227.808 + (104.438*3000) + (-78527.502*1) + (61.916*3000*1)

[1] 398306.7

#predicted selling price for a not new home

-22227.808 + (104.438*3000) + (-78527.502*0) + (61.916*3000*0)

[1] 291086.2

Based on the predictive formulas from the regression model, the predicted selling price for a new home that is 3000 sq. feet is $398,306.70 and the predicted selling price for a not new home of the same size is $291,086.20.

(G) Find the predicted selling price for a home of 1500 square feet that is (i) new, (ii) not new. Comparing to (F), explain how the difference in predicted selling prices changes as the size of home increases.

#predicted selling price for a new home
# yhat = -22227.808 + 104.438x + -78527.502z + 61.916xz

-22227.808 + (104.438*1500) + (-78527.502*1) + (61.916*1500*1)

[1] 148775.7

#predicted selling price for a not new home
# yhat = -22227.808 + 104.438x + -78527.502z + 61.916xz

-22227.808 + (104.438*1500) + (-78527.502*0) + (61.916*1500*0)

[1] 134429.2

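Based on the predictive formulas from the regression model, the predicted selling price for a new home that is 1500 sq. feet is $148,775.70 and the predicted selling price for a not new home of the same size is $134,429.20, a difference of about $14,346. For a 3000 sq. foot home in (F), the difference was about $107,220 ($398,306.70 vs. $291,086.20). So the gap between new and not new homes grows as the size of the home increases. This is because for new homes each additional sq. foot adds ~$166 (104.438 + 61.916) to the predicted selling price, while for not new homes each sq. foot adds only ~$104.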
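In the interaction model, the predicted difference between a new and a not new home of the same size is -78527.502 + 61.916 × Size. The sketch below evaluates that expression at the two sizes used above and also finds the size at which the predicted gap would be zero. The helper name price_gap is my own, and the break-even size is just algebra on the fitted coefficients, not a claim about real homes.

```r
b_new <- -78527.502   # New coefficient
b_sxn <- 61.916       # Size:New coefficient

# Hypothetical helper: predicted new-minus-not-new price gap at a given size
price_gap <- function(size) b_new + b_sxn * size

gap_1500 <- price_gap(1500)
gap_3000 <- price_gap(3000)
c(gap_1500 = gap_1500, gap_3000 = gap_3000)

# Size at which the model predicts no difference between new and not new homes
break_even <- -b_new / b_sxn
break_even
```

Each extra square foot widens the predicted gap by the Size:New coefficient.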

(H) Do you think the model with interaction or the one without it represents the relationship of Size and New to the outcome price? What makes you prefer one model over another?

I think the model that allows for interactions does a better job of representing the relationship of Size and New to the selling price. When comparing the two models we see that adjusted R-squared is greater for the model that allows the interaction (0.7363 vs. 0.7169), which suggests that it explains slightly more of the variation in selling price even after accounting for the extra term. Additionally, we see that the interaction term Size:New is statistically significant (p = 0.00527), so we can reject the null hypothesis that the effect of Size on selling price is the same for new and not new homes.