DACSS 603 Homework 3
(SMSS 11.2, except part (d))
A. A particular home of 1240 square feet on a lot of 18,000 square feet sold for $145,000. Find the predicted selling price and the residual, and interpret.
# predictive equation: ŷ = −10,536 + 53.8x1 + 2.84x2
yhat = -10536 + 53.8*1240 + 2.84*18000
print(paste("The predicted selling price is", yhat))
[1] "The predicted selling price is 107296"
[1] "The residual error is 37704"
Interpretation: The property sold 37,704 higher than what the model predicted.
B. For fixed lot size, how much is the house selling price predicted to increase for each square-foot increase in home size? Why?
For fixed lot size, the positive regression coefficient is 53.8/ sq ft. This means a house selling price increases $53.8 for each sq ft increase in size, given that X2(independent variable) is constant.
C. According to this prediction equation, for fixed home size, how much would lot size need to increase to have the same impact as a one-square-foot increase in home size?
#Z=lot size increase to have the same impact as a 1 sq ft increase in home size
#(2.84)*(Z)=(53.8)*(1 sq ft)
x1_coef= 53.8 #house size
x2_coef= 2.84 #lot size
Z=x1_coef/x2_coef
Z
[1] 18.94366
Z1 <- as.integer(round(Z)) #round up answer
print(paste("A lot size increase of", Z1, "sq ft have same impact per 1 sq ft increase in house size"))
[1] "A lot size increase of 19 sq ft have same impact per 1 sq ft increase in house size"
(ALR, 5.17, slightly modified)
(Data file: salary in alr4 R package). The data file concerns salary and other characteristics of all faculty in a small Midwestern college collected in the early 1980s for presentation in legal proceedings for which discrimination against women in salary was at issue. All persons in the data hold tenured or tenure track positions; temporary faculty are not included. The variables include degree, a factor with levels PhD and MS; rank, a factor with levels Asst, Assoc, and Prof; sex, a factor with levels Male and Female; Year, years in current rank; ysdeg, years since highest degree, and salary, academic year salary in dollars.
[1] 52 6
kable(head(salary), format = "markdown", digits = 10,caption = "**1980 Salaries of faculty at Midwestern college**")
degree | rank | sex | year | ysdeg | salary |
---|---|---|---|---|---|
Masters | Prof | Male | 25 | 35 | 36350 |
Masters | Prof | Male | 13 | 22 | 35350 |
Masters | Prof | Male | 10 | 23 | 28200 |
Masters | Prof | Female | 7 | 27 | 26775 |
PhD | Prof | Male | 19 | 30 | 33696 |
Masters | Prof | Male | 16 | 21 | 28516 |
A. Test the hypothesis that the mean salary for men and women is the same, without regard to any other variable but sex. Explain your findings.
t.test(salary~sex, data=salary)
Welch Two Sample t-test
data: salary by sex
t = 1.7744, df = 21.591, p-value = 0.09009
alternative hypothesis: true difference in means between group Male and group Female is not equal to 0
95 percent confidence interval:
-567.8539 7247.1471
sample estimates:
mean in group Male mean in group Female
24696.79 21357.14
EXPLANATION: With a p-value of 0.09, we fail to reject the null hypothesis. We accept the alternative hypothesis: true difference in means between group Male and group Female is not equal to zero. The mean of Male group (24,696) is higher than mean of Female group (21,357)
B.1. Run a multiple linear regression with salary as the outcome variable and everything else as predictors, including sex.
Call:
lm(formula = salary ~ degree + rank + sex + year + ysdeg, data = salary)
Residuals:
Min 1Q Median 3Q Max
-4045.2 -1094.7 -361.5 813.2 9193.1
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 15746.05 800.18 19.678 < 2e-16 ***
degreePhD 1388.61 1018.75 1.363 0.180
rankAssoc 5292.36 1145.40 4.621 3.22e-05 ***
rankProf 11118.76 1351.77 8.225 1.62e-10 ***
sexFemale 1166.37 925.57 1.260 0.214
year 476.31 94.91 5.018 8.65e-06 ***
ysdeg -124.57 77.49 -1.608 0.115
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2398 on 45 degrees of freedom
Multiple R-squared: 0.855, Adjusted R-squared: 0.8357
F-statistic: 44.24 on 6 and 45 DF, p-value: < 2.2e-16
t.test(salary~sex, data=salary)
Welch Two Sample t-test
data: salary by sex
t = 1.7744, df = 21.591, p-value = 0.09009
alternative hypothesis: true difference in means between group Male and group Female is not equal to 0
95 percent confidence interval:
-567.8539 7247.1471
sample estimates:
mean in group Male mean in group Female
24696.79 21357.14
The 95% confidence interval for the difference in salary between males and females is [-567.8539 , 7247.1471]
C. Interpret your finding for each predictor variable; discuss (a) statistical significance, (b) interpretation of the coefficient / slope in relation to the outcome variable and other variablestab <- matrix(c("degree", "NOT significant", "PhD salary change =1388.61", "rank","YES- significant","Assoc salary change=5292.36", "sex","NOT significant","Female salary change= 1166.37","year"," YES- significant"," Salary change per year=476.31", "ysdegree"," NOT significant"," Salary change per ysdegree=-124.57 " ), ncol=3, byrow=TRUE)
colnames(tab) <- c('Variable','Statistical Significance','Slope:salary increase per change in variable')
tab <- as.table(tab)
kable(tab)
Variable | Statistical Significance | Slope:salary increase per change in variable | |
---|---|---|---|
A | degree | NOT significant | PhD salary change =1388.61 |
B | rank | YES- significant | Assoc salary change=5292.36 |
C | sex | NOT significant | Female salary change= 1166.37 |
D | year | YES- significant | Salary change per year=476.31 |
E | ysdegree | NOT significant | Salary change per ysdegree=-124.57 |
D. Change the baseline category for the rank variable. Interpret the coefficients related to rank again.
Call:
lm(formula = rank ~ degree + salary + sex + year + ysdeg, data = salary)
Coefficients:
(Intercept) degreePhD salary sexFemale year
-0.5925499 -0.4710229 0.0001085 -0.3130130 -0.0616205
ysdeg
0.0469866
INTERPRETATION: Based on the coefficients, degree, sex and year variables have NEGATIVE linear relationship. The variables salary and ysdeg have POSITIVE linear relationship.
E. Finkelstein (1980), in a discussion of the use of regression in discrimination cases, wrote, “[a] variable may reflect a position or status bestowed by the employer, in which case if there is discrimination in the award of the position or status, the variable may be ‘tainted.’” Thus, for example, if discrimination is at work in promotion of faculty to higher ranks, using rank to adjust salaries before comparing the sexes may not be acceptable to the courts.
Exclude the variable rank, refit, and summarize how your findings changed, if they did.
Call:
lm(formula = salary ~ degree + sex + year + ysdeg, data = salary)
Residuals:
Min 1Q Median 3Q Max
-8146.9 -2186.9 -491.5 2279.1 11186.6
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 17183.57 1147.94 14.969 < 2e-16 ***
degreePhD -3299.35 1302.52 -2.533 0.014704 *
sexFemale -1286.54 1313.09 -0.980 0.332209
year 351.97 142.48 2.470 0.017185 *
ysdeg 339.40 80.62 4.210 0.000114 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3744 on 47 degrees of freedom
Multiple R-squared: 0.6312, Adjusted R-squared: 0.5998
F-statistic: 20.11 on 4 and 47 DF, p-value: 1.048e-09
INTERPRETATION:
Excluding rank, the model is NOT a better fit with Multiple R-squared: 0.6312,Adjusted R-squared: 0.5998. The variables : degree, year and ysdeg are statistically significant. The results showed that Female sex had negative coefficient -1286.54.
F. Everyone in this dataset was hired the year they earned their highest degree. It is also known that a new Dean was appointed 15 years ago, and everyone in the dataset who earned their highest degree 15 years ago or less than that has been hired by the new Dean. Some people have argued that the new Dean has been making offers that are a lot more generous to newly hired faculty than the previous one and that this might explain some of the variation in Salary.
Create a new variable that would allow you to test this hypothesis and run another multiple regression model to test this. Select variables carefully to make sure there is no multicollinearity. Explain why multicollinearity would be a concern in this case and how you avoided it. Do you find support for the hypothesis that the people hired by the new Dean are making higher than those that were not?
lm(salary ~ rank+ sex + year*ysdeg, data = salary)
Call:
lm(formula = salary ~ rank + sex + year * ysdeg, data = salary)
Coefficients:
(Intercept) rankAssoc rankProf sexFemale year
16352.169 5254.351 10351.178 867.705 312.454
ysdeg year:ysdeg
-86.317 5.581
Based on the new variable coefficient ysdeg:year, there is support for the hypothesis that the people hired by the new Dean are making higher than those that were not.
(SMSS 13.7 & 13.8 combined, modified)
(Data file: house.selling.price in smss R package)
[1] 100 7
case | Taxes | Beds | Baths | New | Price | Size |
---|---|---|---|---|---|---|
1 | 3104 | 4 | 2 | 0 | 279900 | 2048 |
2 | 1173 | 2 | 1 | 0 | 146500 | 912 |
3 | 3076 | 4 | 2 | 0 | 237700 | 1654 |
4 | 1608 | 3 | 2 | 0 | 200000 | 2068 |
5 | 1454 | 3 | 3 | 0 | 159900 | 1477 |
6 | 2997 | 3 | 2 | 1 | 499900 | 3153 |
A. Using the house.selling.price data, run and report regression results modeling y = selling price (in dollars) in terms of size of home (in square feet) and whether the home is new (1 = yes; 0 = no). (In other words, price is the outcome variable and size and new are the explanatory variables.)
Call:
lm(formula = Price ~ Size + New, data = house.selling.price)
Residuals:
Min 1Q Median 3Q Max
-205102 -34374 -5778 18929 163866
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -40230.867 14696.140 -2.738 0.00737 **
Size 116.132 8.795 13.204 < 2e-16 ***
New 57736.283 18653.041 3.095 0.00257 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 53880 on 97 degrees of freedom
Multiple R-squared: 0.7226, Adjusted R-squared: 0.7169
F-statistic: 126.3 on 2 and 97 DF, p-value: < 2.2e-16
Price_New_aov <-aov(Price ~ New, data = house.selling.price)
summary.aov(Price_New_aov)
Df Sum Sq Mean Sq F value Pr(>F)
New 1 2.274e+11 2.274e+11 28.29 6.61e-07 ***
Residuals 98 7.878e+11 8.039e+09
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
lm(Price ~ New, data = house.selling.price)
Call:
lm(formula = Price ~ New, data = house.selling.price)
Coefficients:
(Intercept) New
138567 152396
INTERPRETATION: The coefficient 152396 means the selling price increase for variable “NEW” = 1.
C. Find the predicted selling price for a home of 3000 square feet that is (i) new, (ii) not new.
Dummy variable: 1=NEW, 0=NOT NEW
EQUATION: Price =138567 + 57736.283New +116.132 size
Price_NEW = 138567 + 57736.283*1 +116.132 *3000
print(paste("Predicted Selling Price for NEW home is", Price_NEW))
[1] "Predicted Selling Price for NEW home is 544699.283"
Price_OLD = 138567 + 57736.283*0 +116.132 *3000
print(paste("Predicted Selling Price for NOT NEW home is", Price_OLD))
[1] "Predicted Selling Price for NOT NEW home is 486963"
D. Fit another model, this time with an interaction term allowing interaction between size and new, and report the regression results
Call:
lm(formula = Price ~ Size * New, data = house.selling.price)
Residuals:
Min 1Q Median 3Q Max
-175748 -28979 -6260 14693 192519
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -22227.808 15521.110 -1.432 0.15536
Size 104.438 9.424 11.082 < 2e-16 ***
New -78527.502 51007.642 -1.540 0.12697
Size:New 61.916 21.686 2.855 0.00527 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 52000 on 96 degrees of freedom
Multiple R-squared: 0.7443, Adjusted R-squared: 0.7363
F-statistic: 93.15 on 3 and 96 DF, p-value: < 2.2e-16
This model is only slightly better fitting than previous model based on adjusted R-squared of 73.63% compared to 71.69%. The variables size and the new variable size:new are statistically significant with p-values < 0.05
E. Report the lines relating the predicted selling price to the size for homes that are (i) new, (ii) not new.
ggplot(house.selling.price, aes(x = Size, y = Price, color=New)) +
geom_point() +
geom_smooth(method = "lm",colour = 7) +
labs(x="Size", y="Price", title = "Predicted selling price vs size for NEW and NOT NEW homes")
ggplot(New_HSP, aes(x = Size, y = Price, color=New)) +
geom_point() +
geom_smooth(method = "lm",colour = 7) +
labs(x="Size", y="Price", title = "Predicted selling price vs size for NEW homes")
ggplot(Old_HSP, aes(x = Size, y = Price)) +
geom_point() +
geom_smooth(method = "lm",colour = 7) +
labs(x="Size", y="Price", title = "Predicted selling price vs size for NOT NEW homes")
F. Find the predicted selling price for a home of 3000 square feet that is (i) new, (ii) not new.
Dummy variable: 1=NEW, 0=NOT NEW
EQUATION: Price =138567 + 57736.283New +116.132 size
Price_NEW = 138567 + 57736.283*1 +116.132 *3000
print(paste("Predicted Selling Price for NEW home is", Price_NEW))
[1] "Predicted Selling Price for NEW home is 544699.283"
Price_OLD = 138567 + 57736.283*0 +116.132 *3000
print(paste("Predicted Selling Price for NOT NEW home is", Price_OLD))
[1] "Predicted Selling Price for NOT NEW home is 486963"
G. Find the predicted selling price for a home of 1500 square feet that is (i) new, (ii) not new. Comparing to (F), explain how the difference in predicted selling prices changes as the size of home increases.
Dummy variable: 1=NEW, 0=NOT NEW
EQUATION: Price =138567 + 57736.283New +116.132 size
Price_NEW_1500 = 138567 + 57736.283*1 +116.132 *1500
print(paste("Predicted Selling Price for 1500 sqft NEW home is", Price_NEW_1500))
[1] "Predicted Selling Price for 1500 sqft NEW home is 370501.283"
Price_OLD_1500 = 138567 + 57736.283*0 +116.132 *1500
print(paste("Predicted Selling Price for 1500 sqft NOT NEW home is", Price_OLD_1500))
[1] "Predicted Selling Price for 1500 sqft NOT NEW home is 312765"
Price_diff_3000 = Price_NEW - Price_OLD
Price_diff_1500 = Price_NEW_1500 - Price_OLD_1500
print(paste("There is no price difference in prices NEW vs NOT NEW 3000 sqft", Price_diff_3000," vs NEW vs NOT NEW 1500 sqft house",Price_diff_1500 ))
[1] "There is no price difference in prices NEW vs NOT NEW 3000 sqft 57736.2830000001 vs NEW vs NOT NEW 1500 sqft house 57736.283"
H. Do you think the model with interaction or the one without it represents the relationship of size and new to the outcome price? What makes you prefer one model over another?
See Figure 1 and 2 below. The model with the interaction Size:New is a better fitting than model without interaction based on adjusted R-squared of 73.63% compared to 71.69%. The variables are statistically significant with p-value: < 2.2e-16
ggplot(house.selling.price, aes(x = Size*New, y = Price)) +
geom_point() +
geom_smooth(method = "lm",colour = 7) +
labs(x="Size", y="Price", title = "Figure 1:Predicted selling price With Interaction")
ggplot(house.selling.price, aes(x = Size, y = Price)) +
geom_point() +
geom_smooth(method = "lm",colour = 7) +
labs(x="Size", y="Price", title = "Figure 2:Predicted selling price WITHOUT INTERACTION")