DACSS-603
(SMSS 11.2, except part (d))
For recent data in Jacksonville, Florida, on y = selling price of home (in dollars), x1 = size of home (in square feet), and x2 = lot size (in square feet), the prediction equation is ŷ = −10,536 + 53.8x1 + 2.84x2
A particular home of 1240 square feet on a lot of 18,000 square feet sold for $145,000. Find the predicted selling price and the residual, and interpret.
predicted_price <- -10536 + 53.8*1240 + 2.84*18000
predicted_price
[1] 107296
Given the variables above (y = selling price, x1 = house size, x2 = lot size) and the values plugged into the prediction equation (a 1,240 sq ft house on an 18,000 sq ft lot), the predicted selling price of the home is $107,296. Next, I calculate the residual between this prediction and the actual selling price ($145,000).
residual <- 145000 - predicted_price
residual
[1] 37704
The residual of $37,704 indicates that the home sold for $37,704 more than the model's predicted selling price; the model underpredicts this home's price.
For fixed lot size, how much is the house selling price predicted to increase for each square-foot increase in home size? Why?
Since the coefficient on home size (x1) is 53.8, the predicted price goes up $53.80 for each one-square-foot increase in home size, holding lot size fixed.
According to this prediction equation, for fixed home size, how much would lot size need to increase to have the same impact as a one-square-foot increase in home size?
Since the coefficient on lot size (x2) is 2.84, each additional square foot of lot adds $2.84 to the predicted price, holding home size fixed. To find how much lot size must increase to match the impact of one extra square foot of home size, divide the two coefficients.
53.8/2.84
[1] 18.94366
As the division shows, lot size would need to increase by about 18.94 sq ft to have the same impact on predicted price as a 1 sq ft increase in home size.
(ALR, 5.17, slightly modified)
(Data file: salary in alr4 R package). The data file concerns salary and other characteristics of all faculty in a small Midwestern college collected in the early 1980s for presentation in legal proceedings for which discrimination against women in salary was at issue. All persons in the data hold tenured or tenure track positions; temporary faculty are not included. The variables include degree, a factor with levels PhD and MS; rank, a factor with levels Asst, Assoc, and Prof; sex, a factor with levels Male and Female; Year, years in current rank; ysdeg, years since highest degree, and salary, academic year salary in dollars.
A. Test the hypothesis that the mean salary for men and women is the same, without regard to any other variable but sex. Explain your findings.
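The data-loading step is not shown in the code; with the alr4 package named in the prompt, it would be:
library(alr4)  # provides the 'salary' data
head(salary)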
degree rank sex year ysdeg salary
1 Masters Prof Male 25 35 36350
2 Masters Prof Male 13 22 35350
3 Masters Prof Male 10 23 28200
4 Masters Prof Female 7 27 26775
5 PhD Prof Male 19 30 33696
6 Masters Prof Male 16 21 28516
summary(salary)
degree rank sex year ysdeg
Masters:34 Asst :18 Male :38 Min. : 0.000 Min. : 1.00
PhD :18 Assoc:14 Female:14 1st Qu.: 3.000 1st Qu.: 6.75
Prof :20 Median : 7.000 Median :15.50
Mean : 7.481 Mean :16.12
3rd Qu.:11.000 3rd Qu.:23.25
Max. :25.000 Max. :35.00
salary
Min. :15000
1st Qu.:18247
Median :23719
Mean :23798
3rd Qu.:27259
Max. :38045
Now I will compare mean salary between the sexes with a two-sample t-test.
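The call producing the output below is not shown; judging from the "data: salary by sex" line, it would be the default (Welch) two-sample t-test:
t.test(salary ~ sex, data = salary)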
Welch Two Sample t-test
data: salary by sex
t = 1.7744, df = 21.591, p-value = 0.09009
alternative hypothesis: true difference in means between group Male and group Female is not equal to 0
95 percent confidence interval:
-567.8539 7247.1471
sample estimates:
mean in group Male mean in group Female
24696.79 21357.14
24696.79-21357.14
[1] 3339.65
Without regard to education level or job rank, the mean salary for males is $24,696.79 and the mean salary for females is $21,357.14, a difference of $3,339.65. However, the Welch test gives p = 0.09 and a 95% confidence interval (-567.85, 7247.15) that includes zero, so at the 0.05 level we cannot reject the null hypothesis that the mean salaries are the same.
B. Run a multiple linear regression with salary as the outcome variable and everything else as predictors, including sex. Assuming no interactions between sex and the other predictors, obtain a 95% confidence interval for the difference in salary between males and females.
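The code that defines the fitted model object 'fit' used below is not shown; reconstructing it from the Call line of the summary output further down:
fit <- lm(salary ~ degree + rank + sex + year + ysdeg, data = salary)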
set.seed(3)
df<- data.frame(degree = sample(salary$degree, size = 10, replace = T),
rank = sample(salary$rank, size = 10, replace = T),
sex = sample(salary$sex, size = 10, replace = T),
year = sample(salary$year, size = 10, replace = T),
ysdeg = sample(salary$ysdeg, size = 10, replace = T))
predict(fit,df)
1 2 3 4 5 6 7
15368.63 20671.56 14287.78 29407.55 20069.45 32082.22 23606.80
8 9 10
16837.87 21524.03 28077.52
summary(fit)
Call:
lm(formula = salary ~ degree + rank + sex + year + ysdeg, data = salary)
Residuals:
Min 1Q Median 3Q Max
-4045.2 -1094.7 -361.5 813.2 9193.1
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 15746.05 800.18 19.678 < 2e-16 ***
degreePhD 1388.61 1018.75 1.363 0.180
rankAssoc 5292.36 1145.40 4.621 3.22e-05 ***
rankProf 11118.76 1351.77 8.225 1.62e-10 ***
sexFemale 1166.37 925.57 1.260 0.214
year 476.31 94.91 5.018 8.65e-06 ***
ysdeg -124.57 77.49 -1.608 0.115
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2398 on 45 degrees of freedom
Multiple R-squared: 0.855, Adjusted R-squared: 0.8357
F-statistic: 44.24 on 6 and 45 DF, p-value: < 2.2e-16
confint(fit,'sexFemale')
2.5 % 97.5 %
sexFemale -697.8183 3030.565
Based on the multiple regression, the 95% confidence interval for the sexFemale coefficient runs from -697.82 to 3030.57: holding degree, rank, year, and ysdeg fixed, the estimated difference in salary for females relative to males lies between -$697.82 and $3,030.57. Since the interval includes zero, the difference is not statistically significant.
C. Interpret your finding for each predictor variable; discuss (a) statistical significance, (b) interpretation of the coefficient / slope in relation to the outcome variable and other variables
?salary
To summarize, holding the other predictors fixed, predicted salary increases by:
a. $1,388.61 if the individual has a PhD rather than a Masters
b. $5,292.36 if the individual is an Associate rather than an Assistant Professor
c. $11,118.76 if the individual is a full Professor rather than an Assistant Professor
d. $1,166.37 if the individual is female
e. $476.31 for every year the individual spends at their current rank
However, predicted salary decreases by $124.57 for every year since the individual earned their highest degree; this is the only negative slope. Only rank (both rankAssoc and rankProf) and year are statistically significant (p < 0.05); degree, sex, and ysdeg are not.
D. Change the baseline category for the rank variable. Interpret the coefficients related to rank again.
fit2 <- lm(salary~rank+sex+degree+year+ysdeg, data=salary)
set.seed(3)
df2 <- data.frame(rank=sample(salary$rank,size=10,replace=T),
sex=sample(salary$sex,size=10,replace=T),
degree=sample(salary$degree,size=10,replace=T),
year=sample(salary$year,size=10,replace=T),
ysdeg=sample(salary$ysdeg,size=10,replace=T))
predict(fit2,df2)
1 2 3 4 5 6 7
25098.78 25963.92 15676.39 24969.76 22624.44 26255.82 16925.83
8 9 10
29123.01 31254.18 28077.52
summary(fit2)
Call:
lm(formula = salary ~ rank + sex + degree + year + ysdeg, data = salary)
Residuals:
Min 1Q Median 3Q Max
-4045.2 -1094.7 -361.5 813.2 9193.1
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 15746.05 800.18 19.678 < 2e-16 ***
rankAssoc 5292.36 1145.40 4.621 3.22e-05 ***
rankProf 11118.76 1351.77 8.225 1.62e-10 ***
sexFemale 1166.37 925.57 1.260 0.214
degreePhD 1388.61 1018.75 1.363 0.180
year 476.31 94.91 5.018 8.65e-06 ***
ysdeg -124.57 77.49 -1.608 0.115
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2398 on 45 degrees of freedom
Multiple R-squared: 0.855, Adjusted R-squared: 0.8357
F-statistic: 44.24 on 6 and 45 DF, p-value: < 2.2e-16
In this attempt I reordered the terms in the model formula (putting 'rank' first), but none of the estimates changed. That is expected: reordering terms in a formula does not change which level of a factor serves as the baseline, so this is not actually a change of baseline.
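To genuinely change the baseline, the reference level of the factor has to be changed with relevel() before refitting; a minimal sketch (not run here), using 'Assoc' as the new baseline and a copy of the data so the original factor is untouched:
salary2 <- salary
salary2$rank <- relevel(salary2$rank, ref = "Assoc")  # 'Assoc' becomes the baseline
summary(lm(salary ~ rank + sex + degree + year + ysdeg, data = salary2))
With 'Assoc' as the baseline, rankAsst should come out to about -$5,292.36 (Assistant relative to Associate) and rankProf to about $11,118.76 - $5,292.36 = $5,826.40 (Professor relative to Associate); the fitted values, R-squared, and the other coefficients would be unchanged.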
E. Finkelstein (1980), in a discussion of the use of regression in discrimination cases, wrote, “[a] variable may reflect a position or status bestowed by the employer, in which case if there is discrimination in the award of the position or status, the variable may be ‘tainted.’” Thus, for example, if discrimination is at work in promotion of faculty to higher ranks, using rank to adjust salaries before comparing the sexes may not be acceptable to the courts.
Exclude the variable rank, refit, and summarize how your findings changed, if they did.
fit3 <- lm(salary~degree+sex+year+ysdeg,data=salary)
set.seed(3)
df3 <- data.frame(degree=sample(salary$degree,size=10,replace=T),
sex=sample(salary$sex,size=10,replace=T),
year=sample(salary$year,size=10,replace=T),
ysdeg=sample(salary$ysdeg,size=10,replace=T))
predict(fit3,df3)
1 2 3 4 5 6 7
15631.50 25840.16 23116.76 29904.74 30290.05 27114.13 19427.73
8 9 10
25588.74 23098.28 21451.56
summary(fit3)
Call:
lm(formula = salary ~ degree + sex + year + ysdeg, data = salary)
Residuals:
Min 1Q Median 3Q Max
-8146.9 -2186.9 -491.5 2279.1 11186.6
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 17183.57 1147.94 14.969 < 2e-16 ***
degreePhD -3299.35 1302.52 -2.533 0.014704 *
sexFemale -1286.54 1313.09 -0.980 0.332209
year 351.97 142.48 2.470 0.017185 *
ysdeg 339.40 80.62 4.210 0.000114 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3744 on 47 degrees of freedom
Multiple R-squared: 0.6312, Adjusted R-squared: 0.5998
F-statistic: 20.11 on 4 and 47 DF, p-value: 1.048e-09
With the variable 'rank' removed, predicted salary decreases by:
a. $3,299.35 if the individual has a PhD
b. $1,286.54 if the individual is female
However, predicted salary increases by:
a. $351.97 for each year the individual spends at their current rank
b. $339.40 for each year since the individual earned their highest degree
The slopes are now split evenly, two positive and two negative. 'degreePhD', 'year', and 'ysdeg' are statistically significant (p < 0.05), while 'sexFemale' is not. Compared with the full model in part B, the findings changed noticeably: the sex coefficient flipped sign (from +$1,166.37 to -$1,286.54, though it is not significant in either model), the PhD effect flipped from positive to negative and became significant, ysdeg flipped from negative to positive and became significant, and the fit is much worse (R-squared fell from 0.855 to 0.631).
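For a direct comparison with the interval in part B, the corresponding 95% confidence interval for sexFemale under this rank-free model can be pulled with the same confint() call used earlier (shown here as a sketch, output not reproduced):
confint(fit3, 'sexFemale')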
F. Everyone in this dataset was hired the year they earned their highest degree. It is also known that a new Dean was appointed 15 years ago, and everyone in the dataset who earned their highest degree 15 or fewer years ago was hired by the new Dean. Some people have argued that the new Dean has been making offers that are a lot more generous to newly hired faculty than the previous one and that this might explain some of the variation in salary.
Create a new variable that would allow you to test this hypothesis and run another multiple regression model to test it. Select variables carefully to make sure there is no multicollinearity. Explain why multicollinearity would be a concern in this case and how you avoided it. Do you find support for the hypothesis that the people hired by the new Dean are making higher salaries than those who were not?
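The model summarized below adds a year-by-ysdeg interaction rather than a new indicator variable; its call, reconstructed from the Call line of the output:
summary(lm(salary ~ rank + degree + sex + year + ysdeg + year*ysdeg, data = salary))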
Call:
lm(formula = salary ~ rank + degree + sex + year + ysdeg + year *
ysdeg, data = salary)
Residuals:
Min 1Q Median 3Q Max
-4055.4 -1007.7 -172.6 800.0 9275.7
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 16289.506 944.422 17.248 < 2e-16 ***
rankAssoc 5627.056 1184.705 4.750 2.19e-05 ***
rankProf 11475.286 1389.241 8.260 1.71e-10 ***
degreePhD 1557.213 1028.855 1.514 0.1373
sexFemale 1233.531 925.994 1.332 0.1897
year 318.343 174.450 1.825 0.0748 .
ysdeg -172.406 89.161 -1.934 0.0596 .
year:ysdeg 7.094 6.578 1.078 0.2867
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2394 on 44 degrees of freedom
Multiple R-squared: 0.8588, Adjusted R-squared: 0.8363
F-statistic: 38.22 on 7 and 44 DF, p-value: < 2.2e-16
The slope for the year:ysdeg interaction is positive, which is consistent with the hypothesis, but at p = 0.287 it is not statistically significant, so this model offers only weak support.
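A model closer to what the prompt describes would create an explicit indicator for being hired by the new Dean (highest degree earned 15 or fewer years ago). A minimal sketch, not run here; the variable name 'newdean' is mine. ysdeg is dropped from this model because newdean is computed directly from ysdeg, so including both would build multicollinearity into the design:
salary$newdean <- as.integer(salary$ysdeg <= 15)  # 1 = hired by the new Dean
summary(lm(salary ~ degree + rank + sex + year + newdean, data = salary))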
A. Using the house.selling.price data, run and report regression results modeling y = selling price (in dollars) in terms of size of home (in square feet) and whether the home is new (1 = yes; 0 = no). (In other words, price is the outcome variable and size and new are the explanatory variables.)
Before doing any specific calculations, I’m going to first present summary info to get a sense of the dataset.
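The loading step is not shown; assuming the data come from the smss companion package for SMSS (an assumption on my part), it would be:
library(smss)  # assumed source of 'house.selling.price'
head(house.selling.price)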
case Taxes Beds Baths New Price Size
1 1 3104 4 2 0 279900 2048
2 2 1173 2 1 0 146500 912
3 3 3076 4 2 0 237700 1654
4 4 1608 3 2 0 200000 2068
5 5 1454 3 3 0 159900 1477
6 6 2997 3 2 1 499900 3153
summary(house.selling.price)
case Taxes Beds Baths
Min. : 1.00 Min. : 20 Min. :2 Min. :1.00
1st Qu.: 25.75 1st Qu.:1178 1st Qu.:3 1st Qu.:2.00
Median : 50.50 Median :1614 Median :3 Median :2.00
Mean : 50.50 Mean :1908 Mean :3 Mean :1.96
3rd Qu.: 75.25 3rd Qu.:2238 3rd Qu.:3 3rd Qu.:2.00
Max. :100.00 Max. :6627 Max. :5 Max. :4.00
New Price Size
Min. :0.00 Min. : 21000 Min. : 580
1st Qu.:0.00 1st Qu.: 93225 1st Qu.:1215
Median :0.00 Median :132600 Median :1474
Mean :0.11 Mean :155331 Mean :1629
3rd Qu.:0.00 3rd Qu.:169625 3rd Qu.:1865
Max. :1.00 Max. :587000 Max. :4050
str(house.selling.price)
'data.frame': 100 obs. of 7 variables:
$ case : int 1 2 3 4 5 6 7 8 9 10 ...
$ Taxes: int 3104 1173 3076 1608 1454 2997 4054 3002 6627 320 ...
$ Beds : int 4 2 4 3 3 3 3 3 5 3 ...
$ Baths: int 2 1 2 2 3 2 2 2 4 2 ...
$ New : int 0 0 0 0 0 1 0 1 0 0 ...
$ Price: int 279900 146500 237700 200000 159900 499900 265500 289900 587000 70000 ...
$ Size : int 2048 912 1654 2068 1477 3153 1355 2075 3990 1160 ...
?house.selling.price
Now I will run the regression with y = selling price (USD) as the outcome and house size (sq ft) and whether the home is new (1 = yes, 0 = no) as the explanatory variables.
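The call itself is not shown; reconstructed from the Call line of the output below:
summary(lm(Price ~ Size + New, data = house.selling.price))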
Call:
lm(formula = Price ~ Size + New, data = house.selling.price)
Residuals:
Min 1Q Median 3Q Max
-205102 -34374 -5778 18929 163866
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -40230.867 14696.140 -2.738 0.00737 **
Size 116.132 8.795 13.204 < 2e-16 ***
New 57736.283 18653.041 3.095 0.00257 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 53880 on 97 degrees of freedom
Multiple R-squared: 0.7226, Adjusted R-squared: 0.7169
F-statistic: 126.3 on 2 and 97 DF, p-value: < 2.2e-16
B. Report and interpret the prediction equation, and form separate equations relating selling price to size for new and for not-new homes. In particular, for each variable, discuss statistical significance and interpret the meaning of the coefficient.
The prediction equation is ŷ = -40230.87 + 116.13x1 + 57736.28x2, where x1 = house size and x2 = 1 for new homes and 0 otherwise. Since x2 only takes the values 1 and 0, the equation splits into two separate lines.
New homes (x2 = 1):
ŷ = -40230.87 + 116.13x1 + 57736.28 = 17505.41 + 116.13x1
Not-new homes (x2 = 0):
ŷ = -40230.87 + 116.13x1
With these equations in mind, I will look at the first 10 entries of the 'house.selling.price' dataset:
head(house.selling.price,10)
case Taxes Beds Baths New Price Size
1 1 3104 4 2 0 279900 2048
2 2 1173 2 1 0 146500 912
3 3 3076 4 2 0 237700 1654
4 4 1608 3 2 0 200000 2068
5 5 1454 3 3 0 159900 1477
6 6 2997 3 2 1 499900 3153
7 7 4054 3 2 0 265500 1355
8 8 3002 3 2 1 289900 2075
9 9 6627 5 4 0 587000 3990
10 10 320 3 2 0 70000 1160
Next, I will run the regression relating selling price to newness and size; this is the same model as in part A, with the terms reordered.
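The call, reconstructed from the Call line of the output below:
summary(lm(Price ~ New + Size, data = house.selling.price))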
Call:
lm(formula = Price ~ New + Size, data = house.selling.price)
Residuals:
Min 1Q Median 3Q Max
-205102 -34374 -5778 18929 163866
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -40230.867 14696.140 -2.738 0.00737 **
New 57736.283 18653.041 3.095 0.00257 **
Size 116.132 8.795 13.204 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 53880 on 97 degrees of freedom
Multiple R-squared: 0.7226, Adjusted R-squared: 0.7169
F-statistic: 126.3 on 2 and 97 DF, p-value: < 2.2e-16
Now I will conduct a correlation test:
cor.test(house.selling.price$Size,house.selling.price$New)
Pearson's product-moment correlation
data: house.selling.price$Size and house.selling.price$New
t = 4.1212, df = 98, p-value = 7.891e-05
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.2032530 0.5399831
sample estimates:
cor
0.3843277
Controlling for the other predictor, 'New' and 'Size' have p-values of 0.00257 and < 2e-16, respectively. Both are statistically significant at the 0.05 level, so we can reject the null hypotheses that 'New' is unrelated to 'Price' and that 'Size' is unrelated to 'Price'. Interpreting the coefficients: holding newness fixed, each additional square foot adds $116.13 to the predicted price, and holding size fixed, a new home is predicted to sell for $57,736.28 more than a comparable not-new home. The correlation test shows the correlation between 'New' and 'Size' is 0.3843, a weak-to-moderate positive relationship, meaning newer homes tend to be somewhat larger.
C. Find the predicted selling price for a home of 3,000 sqft that is (a) new, and (b) not new.
sqftNew <- -40230.87 + 116.13*3000 + 57736
sqftNew
[1] 365895.1
If the house is new, the predicted selling price is $365,895.10.
sqftNotNew <- -40230.87 + 116.13*3000
sqftNotNew
[1] 308159.1
If the house is not new, the predicted selling price is $308,159.10.
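Both predictions can also be read off the fitted model from part A with predict(), which avoids rounding the coefficients by hand; a minimal sketch (the object name fitA is mine):
fitA <- lm(Price ~ Size + New, data = house.selling.price)
predict(fitA, newdata = data.frame(Size = 3000, New = c(1, 0)))  # new first, then not new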
D. Fit another model, this time with an interaction term allowing interaction between size and new, and report the regression results
size_new_interaction <- summary(lm(Price~Size+New+Size*New, data=house.selling.price))
size_new_interaction
Call:
lm(formula = Price ~ Size + New + Size * New, data = house.selling.price)
Residuals:
Min 1Q Median 3Q Max
-175748 -28979 -6260 14693 192519
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -22227.808 15521.110 -1.432 0.15536
Size 104.438 9.424 11.082 < 2e-16 ***
New -78527.502 51007.642 -1.540 0.12697
Size:New 61.916 21.686 2.855 0.00527 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 52000 on 96 degrees of freedom
Multiple R-squared: 0.7443, Adjusted R-squared: 0.7363
F-statistic: 93.15 on 3 and 96 DF, p-value: < 2.2e-16
E. Report the lines relating the predicted selling price to the size for homes that are (a) new and (b) not new.
library(ggplot2)
ggplot(data=house.selling.price,aes(x=Size,y=Price, color=New))+
geom_point()+
geom_smooth(method="lm",se=F)
From the interaction model in part D, the two requested lines are:
New homes (New = 1): ŷ = (-22227.808 - 78527.502) + (104.438 + 61.916)x = -100755.31 + 166.35x
Not-new homes (New = 0): ŷ = -22227.81 + 104.44x
As the scatterplot shows, price and size have a positive, roughly linear relationship: as size increases, price does as well. Going by the colors of the dots (which correspond to the newness of the house), the picture is less clear-cut. New houses (light blue dots) are scattered through the graph, mostly along the fitted line, while the not-new homes (the darker dots) are concentrated at the lower end of the size and price ranges, though a few exceed the price and size of the new houses.
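Note that because New is stored as an integer, the aes(color = New) mapping above produces a continuous color scale, and geom_smooth() fits a single pooled line rather than one line per group. To draw the two separate lines the question asks about, New can be converted to a factor; a minimal sketch:
library(ggplot2)
ggplot(house.selling.price, aes(x = Size, y = Price, color = factor(New))) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)  # one fitted line per group
Fitting a separate line within each group this way reproduces the same two lines implied by the interaction model in part D.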
F. Find the predicted selling price for a home of 3,000 sqft that is (a) new and (b) not new.
New_B <- -22227.81 + 104.44*3000 - 78527.50 + 61.9*3000*1
New_B
[1] 398264.7
The predicted selling price for a new 3,000 sq ft home is $398,264.70.
NotNew_B <- -22227.81 + 104.4*3000
NotNew_B
[1] 290972.2
The predicted selling price for a not-new 3,000 sq ft home is $290,972.20.
G. Find the predicted selling price for a home of 1500 square feet that is (i) new, (ii) not new. Comparing to (F), explain how the difference in predicted selling prices changes as the size of home increases.
NewC <- -22227.81 + 104.4*1500 - 78527.50 + 61.9*1500
NewC
[1] 148694.7
The predicted selling price of a new 1,500 sq ft home is $148,694.70.
NotNew_C <- -22227.81 + 104.4*1500
NotNew_C
[1] 134372.2
The predicted selling price of a 1,500 sq ft home that is not new is $134,372.20.
Compared to part F (where the home, at 3,000 sq ft, is twice as large), the predicted selling prices here are much lower. More importantly, the gap between new and not-new homes changes with size: at 1,500 sq ft the new home is predicted to sell for $148,694.70 - $134,372.20 = $14,322.50 more than the not-new home, while at 3,000 sq ft the gap widens to $398,264.70 - $290,972.20 = $107,292.50. This is exactly what the positive interaction term implies: each additional square foot is worth about $61.90 more for a new home than for a not-new one, so price rises more steeply with size for new homes (about $166.35 per sq ft) than for not-new homes (about $104.44 per sq ft), and the new-home price premium grows as home size increases.
H. Do you think the model with interaction or the one without it better represents the relationship of 'Size' and 'New' to the outcome 'Price'? What makes you prefer one over the other?
I think the model without the interaction best shows the relationship of 'Size' and 'New' to the outcome 'Price': its coefficients give one overall size effect and one overall newness effect that are easy to interpret. The model with the interaction instead describes how the 'Size'-'Price' slope differs between new and not-new homes. That said, the interaction term is statistically significant (p = 0.00527) and the adjusted R-squared is higher (0.7363 vs. 0.7169), so there is also a statistical case for the interaction model.
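To put the comparison on firmer footing, the two models can be compared directly with a nested F-test; a minimal sketch (object names are mine; Size * New expands to Size + New + Size:New, the same interaction model as above):
fit_plain <- lm(Price ~ Size + New, data = house.selling.price)
fit_inter <- lm(Price ~ Size * New, data = house.selling.price)
anova(fit_plain, fit_inter)  # tests whether adding the interaction significantly improves fit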