HOMEWORK 3

DACSS 603, Spring 2022

Erin Tracy
3/20/2022

Question 1

For recent data in Jacksonville, Florida, on y = selling price of home (in dollars), x1 = size of home (in square feet), and x2 = lot size (in square feet), the prediction equation is ŷ = −10,536 + 53.8x1 + 2.84x2

A. A particular home of 1240 square feet on a lot of 18,000 square feet sold for $145,000. Find the predicted selling price and the residual, and interpret.

prediction<- -10536+53.8*1240+2.84*18000 
prediction
[1] 107296
residual<- 145000 -prediction
residual
[1] 37704

The predicted selling price is $107,296, found by substituting x1 = 1240 (home size) and x2 = 18,000 (lot size) into the prediction equation.

Since the actual selling price was $145,000, the residual (actual minus predicted) is $37,704.

The residual is positive, so the model underpredicted: this home sold for $37,704 more than the equation predicts.
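The same calculation can be wrapped in a small function so other homes can be checked without retyping the equation. This is only a sketch: the name predict_price is my own, and the coefficients are copied from the prediction equation in the question.

```r
# Hypothetical helper: evaluates the Question 1 prediction equation
predict_price <- function(home_size, lot_size) {
  -10536 + 53.8 * home_size + 2.84 * lot_size
}

predicted <- predict_price(home_size = 1240, lot_size = 18000)
residual_value <- 145000 - predicted   # actual minus predicted

predicted
residual_value
```

A positive residual_value means the home sold for more than predicted.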

B. For fixed lot size, how much is the house selling price predicted to increase for each square-foot increase in home size? Why?

lot_size<- 30000

#House sizes
a<- 2000
b<- 2001
c<- 2002
d<- 2003
e<- 2004

prediction_a<- -10536+53.8*a+2.84*lot_size
prediction_b<- -10536+53.8*b+2.84*lot_size
prediction_c<- -10536+53.8*c+2.84*lot_size
prediction_d<- -10536+53.8*d+2.84*lot_size
prediction_e<- -10536+53.8*e+2.84*lot_size


prediction_a
[1] 182264
prediction_b
[1] 182317.8
prediction_c
[1] 182371.6
prediction_d
[1] 182425.4
prediction_e
[1] 182479.2
prediction_a - prediction_b
[1] -53.8
prediction_b - prediction_c
[1] -53.8
prediction_c - prediction_d
[1] -53.8
prediction_d - prediction_e
[1] -53.8

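I first tried a sample lot size of 10,000 sq ft with a 1,000 sq ft house, then the values shown above (a 30,000 sq ft lot and a 2,000 sq ft house), increasing the house size by 1 sq ft each time. In both cases the predicted price rose by $53.80 per step. This is because 53.8 is the coefficient on x1: with lot size (x2) held fixed, the intercept and the 2.84x2 term do not change, so each additional square foot of home size adds exactly $53.80 to the predicted price, whatever the starting sizes are.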

C. According to this prediction equation, for fixed home size, how much would lot size need to increase to have the same impact as a one-square-foot increase in home size?

53.8/2.84
[1] 18.94366
house_size<- 1000

#Lot sizes, each 53.8/2.84 sq ft larger than the last:
step<- 53.8/2.84
x<- 10000
y<- x + step
z<- y + step


prediction_x<- -10536+53.8*house_size+2.84*x
prediction_y<- -10536+53.8*house_size+2.84*y
prediction_z<- -10536+53.8*house_size+2.84*z


prediction_x
[1] 71664
prediction_y
[1] 71717.8
prediction_z
[1] 71771.6
prediction_x - prediction_y
[1] -53.8
prediction_y - prediction_z
[1] -53.8

For each additional square foot of house size, the predicted price increases by $53.80. For each additional square foot of lot size, it increases by $2.84. To match the effect of 1 sq ft of house size by changing lot size alone, lot size must increase by 53.8 / 2.84 = 18.94 sq ft.

I did a test using sample data again, and this proved correct.

Question 2

(Data file: salary in alr4 R package). The data file concerns salary and other characteristics of all faculty in a small Midwestern college collected in the early 1980s for presentation in legal proceedings for which discrimination against women in salary was at issue. All persons in the data hold tenured or tenure track positions; temporary faculty are not included. The variables include degree, a factor with levels PhD and MS; rank, a factor with levels Asst, Assoc, and Prof; sex, a factor with levels Male and Female; Year, years in current rank; ysdeg, years since highest degree, and salary, academic year salary in dollars.

library(alr4)
data("salary")
?salary
head(salary)
   degree rank    sex year ysdeg salary
1 Masters Prof   Male   25    35  36350
2 Masters Prof   Male   13    22  35350
3 Masters Prof   Male   10    23  28200
4 Masters Prof Female    7    27  26775
5     PhD Prof   Male   19    30  33696
6 Masters Prof   Male   16    21  28516
str(salary)
'data.frame':   52 obs. of  6 variables:
 $ degree: Factor w/ 2 levels "Masters","PhD": 1 1 1 1 2 1 2 1 2 2 ...
 $ rank  : Factor w/ 3 levels "Asst","Assoc",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ sex   : Factor w/ 2 levels "Male","Female": 1 1 1 2 1 1 2 1 1 1 ...
 $ year  : int  25 13 10 7 19 16 0 16 13 13 ...
 $ ysdeg : int  35 22 23 27 30 21 32 18 30 31 ...
 $ salary: int  36350 35350 28200 26775 33696 28516 24900 31909 31850 32850 ...
summary(salary)
     degree      rank        sex          year            ysdeg      
 Masters:34   Asst :18   Male  :38   Min.   : 0.000   Min.   : 1.00  
 PhD    :18   Assoc:14   Female:14   1st Qu.: 3.000   1st Qu.: 6.75  
              Prof :20               Median : 7.000   Median :15.50  
                                     Mean   : 7.481   Mean   :16.12  
                                     3rd Qu.:11.000   3rd Qu.:23.25  
                                     Max.   :25.000   Max.   :35.00  
     salary     
 Min.   :15000  
 1st Qu.:18247  
 Median :23719  
 Mean   :23798  
 3rd Qu.:27259  
 Max.   :38045  

A. Test the hypothesis that the mean salary for men and women is the same, without regard to any other variable but sex. Explain your findings.

t.test(salary~sex, data = salary)

    Welch Two Sample t-test

data:  salary by sex
t = 1.7744, df = 21.591, p-value = 0.09009
alternative hypothesis: true difference in means between group Male and group Female is not equal to 0
95 percent confidence interval:
 -567.8539 7247.1471
sample estimates:
  mean in group Male mean in group Female 
            24696.79             21357.14 

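The sample mean salary for women (21,357) is lower than for men (24,697), a difference of about 3,340. However, the Welch t-test gives t = 1.77 and p = 0.090, and the 95% confidence interval for the difference (-568 to 7,247) includes zero. At the 0.05 level we fail to reject the null hypothesis that the mean salaries of men and women are the same.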

B. Run a multiple linear regression with salary as the outcome variable and everything else as predictors, including sex. Assuming no interactions between sex and the other predictors, obtain a 95% confidence interval for the difference in salary between males and females.

fit <- lm(salary~ degree + rank + sex + year + ysdeg, data = salary)


head(predict(fit), n = 10)
       1        2        3        4        5        6        7 
34412.44 30316.19 28762.69 28001.84 33566.07 31869.70 25433.42 
       8        9       10 
32243.42 30708.21 30583.64 
set.seed(3)
df<- data.frame(degree = sample(salary$degree, size = 10, replace = T),
                rank = sample(salary$rank, size = 10, replace = T),
                sex = sample(salary$sex, size = 10, replace = T),
                year = sample(salary$year, size = 10, replace = T),
                ysdeg = sample(salary$ysdeg, size = 10, replace = T))
                

predict(fit,df)
       1        2        3        4        5        6        7 
15368.63 20671.56 14287.78 29407.55 20069.45 32082.22 23606.80 
       8        9       10 
16837.87 21524.03 28077.52 
summary(fit)

Call:
lm(formula = salary ~ degree + rank + sex + year + ysdeg, data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-4045.2 -1094.7  -361.5   813.2  9193.1 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 15746.05     800.18  19.678  < 2e-16 ***
degreePhD    1388.61    1018.75   1.363    0.180    
rankAssoc    5292.36    1145.40   4.621 3.22e-05 ***
rankProf    11118.76    1351.77   8.225 1.62e-10 ***
sexFemale    1166.37     925.57   1.260    0.214    
year          476.31      94.91   5.018 8.65e-06 ***
ysdeg        -124.57      77.49  -1.608    0.115    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2398 on 45 degrees of freedom
Multiple R-squared:  0.855, Adjusted R-squared:  0.8357 
F-statistic: 44.24 on 6 and 45 DF,  p-value: < 2.2e-16
confint(fit, 'sexFemale')
              2.5 %   97.5 %
sexFemale -697.8183 3030.565

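The 95% confidence interval for the sexFemale coefficient is (-697.82, 3030.57). Holding degree, rank, year, and ysdeg constant, women are estimated to earn between about $698 less and $3,031 more than men. Because the interval includes zero, the difference is not statistically significant at the 0.05 level.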
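As a check on confint(), the interval can be rebuilt by hand as estimate plus or minus the t critical value times the standard error. This sketch uses the rounded numbers printed by summary(fit), so it only matches confint() to about two decimal places; the variable names are my own.

```r
# Values copied from the summary(fit) output
estimate  <- 1166.37   # sexFemale coefficient
std_error <- 925.57    # its standard error
df_resid  <- 45        # residual degrees of freedom

t_crit <- qt(0.975, df = df_resid)   # two-sided 95% critical value
ci <- estimate + c(-1, 1) * t_crit * std_error
ci
```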

library(stargazer)

m1<-lm(salary~sex,data=salary)
m2<-lm(salary~sex+degree+rank, data=salary)
m3<-lm(salary~sex+degree+rank+degree*rank, data=salary)
stargazer(m1,m2,m3, type='text',
          covariate.labels = c('sexFemale', 'degreePhD', 'rankAssoc', 'rankProf',
                               'degreePhD:rankAssoc', 'degreePhD:rankProf'),
          dep.var.labels = 'salary',
          star.cutoffs = NA)

===============================================================================
                                        Dependent variable:                    
                    -----------------------------------------------------------
                                              salary                           
                            (1)                 (2)                 (3)        
-------------------------------------------------------------------------------
sexFemale               -3,339.647           -862.412            -578.041      
                        (1,807.716)          (978.365)           (985.957)     
                                                                               
degreePhD                                    1,038.021           3,524.061     
                                             (942.928)          (1,695.485)    
                                                                               
rankAssoc                                    4,710.542           6,003.751     
                                            (1,174.879)         (1,618.400)    
                                                                               
rankProf                                    11,650.640          12,568.690     
                                            (1,001.670)         (1,138.609)    
                                                                               
degreePhD:rankAssoc                                             -3,504.919     
                                                                (2,398.807)    
                                                                               
degreePhD:rankProf                                              -3,670.395     
                                                                (2,282.361)    
                                                                               
Constant                24,696.790          17,921.290          17,242.450     
                         (937.978)           (855.470)           (931.848)     
                                                                               
-------------------------------------------------------------------------------
Observations                52                  52                  52         
R2                         0.064               0.764               0.779       
Adjusted R2                0.045               0.744               0.750       
Residual Std. Error 5,782.082 (df = 50) 2,992.955 (df = 47) 2,958.782 (df = 45)
F Statistic         3.413 (df = 1; 50)  38.087 (df = 4; 47) 26.497 (df = 6; 45)
===============================================================================
Note:                                                                        NA

(Just experimenting with Stargazer)

C. Interpret your finding for each predictor variable; discuss (a) statistical significance, (b) interpretation of the coefficient / slope in relation to the outcome variable and other variables

set.seed(3)
df<- data.frame(degree = sample(salary$degree, size = 10, replace = T),
                rank = sample(salary$rank, size = 10, replace = T),
                sex = sample(salary$sex, size = 10, replace = T),
                year = sample(salary$year, size = 10, replace = T),
                ysdeg = sample(salary$ysdeg, size = 10, replace = T))
predict(fit,df)
       1        2        3        4        5        6        7 
15368.63 20671.56 14287.78 29407.55 20069.45 32082.22 23606.80 
       8        9       10 
16837.87 21524.03 28077.52 
summary(fit)

Call:
lm(formula = salary ~ degree + rank + sex + year + ysdeg, data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-4045.2 -1094.7  -361.5   813.2  9193.1 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 15746.05     800.18  19.678  < 2e-16 ***
degreePhD    1388.61    1018.75   1.363    0.180    
rankAssoc    5292.36    1145.40   4.621 3.22e-05 ***
rankProf    11118.76    1351.77   8.225 1.62e-10 ***
sexFemale    1166.37     925.57   1.260    0.214    
year          476.31      94.91   5.018 8.65e-06 ***
ysdeg        -124.57      77.49  -1.608    0.115    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2398 on 45 degrees of freedom
Multiple R-squared:  0.855, Adjusted R-squared:  0.8357 
F-statistic: 44.24 on 6 and 45 DF,  p-value: < 2.2e-16
?salary

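Holding the other predictors constant, and relative to the baseline categories (Masters degree, Asst rank, Male), predicted salary increases by:

1,389 if the degree is PhD (p = 0.180, not significant)

5,292 if the rank is Assoc rather than Asst (p < 0.001)

11,119 if the rank is Prof rather than Asst (p < 0.001)

1,166 if female (p = 0.214, not significant)

476 for each additional year in current rank (p < 0.001)

Predicted salary decreases by:

125 for each additional year since highest degree (p = 0.115, not significant)

All slopes are positive except the one for years since highest degree.

Rank and years in current rank are statistically significant (p < 0.05); degree, sex, and years since highest degree are not.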

D. Change the baseline category for the rank variable. Interpret the coefficients related to rank again.

fit3 <- lm(salary~ degree + year + rank + sex + ysdeg, data = salary)


set.seed(3)
df3<- data.frame(degree = sample(salary$degree, size = 10, replace = T),
                year = sample(salary$year, size = 10, replace = T),
                rank = sample(salary$rank, size = 10, replace = T),
                sex = sample(salary$sex, size = 10, replace = T),
                ysdeg = sample(salary$ysdeg, size = 10, replace = T))

predict(fit3,df3)
       1        2        3        4        5        6        7 
17963.93 12836.86 25406.54 32741.72 22980.27 32558.53 23606.80 
       8        9       10 
20225.00 17713.56 28656.89 
summary(fit3)

Call:
lm(formula = salary ~ degree + year + rank + sex + ysdeg, data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-4045.2 -1094.7  -361.5   813.2  9193.1 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 15746.05     800.18  19.678  < 2e-16 ***
degreePhD    1388.61    1018.75   1.363    0.180    
year          476.31      94.91   5.018 8.65e-06 ***
rankAssoc    5292.36    1145.40   4.621 3.22e-05 ***
rankProf    11118.76    1351.77   8.225 1.62e-10 ***
sexFemale    1166.37     925.57   1.260    0.214    
ysdeg        -124.57      77.49  -1.608    0.115    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2398 on 45 degrees of freedom
Multiple R-squared:  0.855, Adjusted R-squared:  0.8357 
F-statistic: 44.24 on 6 and 45 DF,  p-value: < 2.2e-16

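I reordered the variables in the formula to change the baseline, but nothing changed, so this did not accomplish the task. Reordering terms does not change the baseline category: Asst is still the reference level of rank, so rankAssoc and rankProf are still differences from Asst. Changing the baseline requires releveling the rank factor itself.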
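A sketch of what changing the baseline would look like, assuming Prof is chosen as the new reference level. The rank coefficients under the new baseline follow by arithmetic from the output above: Asst is 11,118.76 below Prof and Assoc is 5,826.40 below Prof. The refit at the end would also give the new standard errors, which cannot be worked out from the printed output; the object names salary_prof and fit_prof are my own.

```r
# Coefficients copied from summary(fit), where Asst is the baseline rank
intercept  <- 15746.05
rank_assoc <- 5292.36
rank_prof  <- 11118.76

# Re-expressed with Prof as the baseline
intercept_prof <- intercept + rank_prof    # 26864.81
rank_asst_new  <- -rank_prof               # -11118.76
rank_assoc_new <- rank_assoc - rank_prof   # -5826.40

c(intercept_prof, rank_asst_new, rank_assoc_new)

# Refit with the releveled factor, if alr4 is installed
if (requireNamespace("alr4", quietly = TRUE)) {
  data("salary", package = "alr4")
  salary_prof <- salary                      # copy, so later models are unaffected
  salary_prof$rank <- relevel(salary_prof$rank, ref = "Prof")
  fit_prof <- lm(salary ~ degree + rank + sex + year + ysdeg, data = salary_prof)
  print(coef(fit_prof))
}
```

Only the intercept and the rank coefficients change; the other slopes and the overall fit stay the same, because the model is the same one written against a different reference group.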

E. Finkelstein (1980), in a discussion of the use of regression in discrimination cases, wrote, “[a] variable may reflect a position or status bestowed by the employer, in which case if there is discrimination in the award of the position or status, the variable may be ‘tainted.’” Thus, for example, if discrimination is at work in promotion of faculty to higher ranks, using rank to adjust salaries before comparing the sexes may not be acceptable to the courts.

Exclude the variable rank, refit, and summarize how your findings changed, if they did.

fit2 <- lm(salary~ degree + sex + year + ysdeg, data = salary)


set.seed(3)
df2<- data.frame(degree = sample(salary$degree, size = 10, replace = T),
                sex = sample(salary$sex, size = 10, replace = T),
                year = sample(salary$year, size = 10, replace = T),
                ysdeg = sample(salary$ysdeg, size = 10, replace = T))

predict(fit2,df2)
       1        2        3        4        5        6        7 
15631.50 25840.16 23116.76 29904.74 30290.05 27114.13 19427.73 
       8        9       10 
25588.74 23098.28 21451.56 
summary(fit2)

Call:
lm(formula = salary ~ degree + sex + year + ysdeg, data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-8146.9 -2186.9  -491.5  2279.1 11186.6 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 17183.57    1147.94  14.969  < 2e-16 ***
degreePhD   -3299.35    1302.52  -2.533 0.014704 *  
sexFemale   -1286.54    1313.09  -0.980 0.332209    
year          351.97     142.48   2.470 0.017185 *  
ysdeg         339.40      80.62   4.210 0.000114 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3744 on 47 degrees of freedom
Multiple R-squared:  0.6312,    Adjusted R-squared:  0.5998 
F-statistic: 20.11 on 4 and 47 DF,  p-value: 1.048e-09

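I removed rank and refit the model.

Salary decreases by:

3,299 if the degree is PhD

1,287 if female

Salary increases by:

352 for each year in current rank

339 for each year since highest degree

Without rank, the slopes for degree PhD and female are negative, and the slopes for years in current rank and years since highest degree are positive. Compared with the full model, three coefficients changed sign: female (from +1,166 to -1,287), degree PhD (from +1,389 to -3,299), and years since highest degree (from -125 to +339).

Degree PhD, years in current rank, and years since highest degree are all statistically significant (p < 0.05); sex is not statistically significant (p = 0.332). The model also fits less well: adjusted R-squared drops from 0.8357 to 0.5998.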

F. Everyone in this dataset was hired the year they earned their highest degree. It is also known that a new Dean was appointed 15 years ago, and everyone in the dataset who earned their highest degree 15 years ago or less than that has been hired by the new Dean. Some people have argued that the new Dean has been making offers that are a lot more generous to newly hired faculty than the previous one and that this might explain some of the variation in Salary.

Create a new variable that would allow you to test this hypothesis and run another multiple regression model to test this. Select variables carefully to make sure there is no multicollinearity. Explain why multicollinearity would be a concern in this case and how you avoided it. Do you find support for the hypothesis that the people hired by the new Dean are making higher than those that were not?

fit5 <- lm(salary~ rank + degree + sex + year + ysdeg+ year*ysdeg, data = salary)
summary(fit5)

Call:
lm(formula = salary ~ rank + degree + sex + year + ysdeg + year * 
    ysdeg, data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-4055.4 -1007.7  -172.6   800.0  9275.7 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 16289.506    944.422  17.248  < 2e-16 ***
rankAssoc    5627.056   1184.705   4.750 2.19e-05 ***
rankProf    11475.286   1389.241   8.260 1.71e-10 ***
degreePhD    1557.213   1028.855   1.514   0.1373    
sexFemale    1233.531    925.994   1.332   0.1897    
year          318.343    174.450   1.825   0.0748 .  
ysdeg        -172.406     89.161  -1.934   0.0596 .  
year:ysdeg      7.094      6.578   1.078   0.2867    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2394 on 44 degrees of freedom
Multiple R-squared:  0.8588,    Adjusted R-squared:  0.8363 
F-statistic: 38.22 on 7 and 44 DF,  p-value: < 2.2e-16

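The slope for year:ysdeg is positive (7.09) but not statistically significant (p = 0.287). This interaction also does not test the hypothesis directly, because it does not separate faculty hired by the new Dean (ysdeg of 15 or less) from those hired earlier. A direct test needs an indicator variable for ysdeg <= 15. Since that indicator is built from ysdeg, including both in the same model would introduce multicollinearity, so ysdeg should be dropped when the indicator is added. This model therefore does not provide support for the hypothesis.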
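A minimal sketch of the indicator-variable approach, not run here, so no results are reported. The names hired_by_new_dean, new_dean, and fit_dean are my own, and keeping degree, rank, sex, and year as controls is an assumption; year may also need to be dropped if it turns out to be strongly correlated with the indicator.

```r
# Hypothetical helper: 1 if the highest degree was earned 15 or fewer years ago
# (hired by the new Dean), 0 otherwise
hired_by_new_dean <- function(ysdeg) as.integer(ysdeg <= 15)

hired_by_new_dean(c(1, 15, 16, 35))

# Fit the model with the indicator in place of ysdeg, if alr4 is installed
if (requireNamespace("alr4", quietly = TRUE)) {
  data("salary", package = "alr4")
  salary_dean <- salary
  salary_dean$new_dean <- hired_by_new_dean(salary_dean$ysdeg)
  # ysdeg is left out because new_dean is computed directly from it
  fit_dean <- lm(salary ~ new_dean + degree + rank + sex + year, data = salary_dean)
  print(summary(fit_dean))
}
```

A positive and statistically significant new_dean coefficient would support the hypothesis; the sign and p-value have to be read from the fitted output.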

Question 3

(Data file: house.selling.price in smss R package)

library(smss)
data(house.selling.price)
head(house.selling.price)
  case Taxes Beds Baths New  Price Size
1    1  3104    4     2   0 279900 2048
2    2  1173    2     1   0 146500  912
3    3  3076    4     2   0 237700 1654
4    4  1608    3     2   0 200000 2068
5    5  1454    3     3   0 159900 1477
6    6  2997    3     2   1 499900 3153
summary(house.selling.price)
      case            Taxes           Beds       Baths     
 Min.   :  1.00   Min.   :  20   Min.   :2   Min.   :1.00  
 1st Qu.: 25.75   1st Qu.:1178   1st Qu.:3   1st Qu.:2.00  
 Median : 50.50   Median :1614   Median :3   Median :2.00  
 Mean   : 50.50   Mean   :1908   Mean   :3   Mean   :1.96  
 3rd Qu.: 75.25   3rd Qu.:2238   3rd Qu.:3   3rd Qu.:2.00  
 Max.   :100.00   Max.   :6627   Max.   :5   Max.   :4.00  
      New           Price             Size     
 Min.   :0.00   Min.   : 21000   Min.   : 580  
 1st Qu.:0.00   1st Qu.: 93225   1st Qu.:1215  
 Median :0.00   Median :132600   Median :1474  
 Mean   :0.11   Mean   :155331   Mean   :1629  
 3rd Qu.:0.00   3rd Qu.:169625   3rd Qu.:1865  
 Max.   :1.00   Max.   :587000   Max.   :4050  
str(house.selling.price)
'data.frame':   100 obs. of  7 variables:
 $ case : int  1 2 3 4 5 6 7 8 9 10 ...
 $ Taxes: int  3104 1173 3076 1608 1454 2997 4054 3002 6627 320 ...
 $ Beds : int  4 2 4 3 3 3 3 3 5 3 ...
 $ Baths: int  2 1 2 2 3 2 2 2 4 2 ...
 $ New  : int  0 0 0 0 0 1 0 1 0 0 ...
 $ Price: int  279900 146500 237700 200000 159900 499900 265500 289900 587000 70000 ...
 $ Size : int  2048 912 1654 2068 1477 3153 1355 2075 3990 1160 ...
?house.selling.price

A. Using the house.selling.price data, run and report regression results modeling y = selling price (in dollars) in terms of size of home (in square feet) and whether the home is new (1 = yes; 0 = no). (In other words, price is the outcome variable and size and new are the explanatory variables.)

summary(lm(Price~Size+New, data= house.selling.price))

Call:
lm(formula = Price ~ Size + New, data = house.selling.price)

Residuals:
    Min      1Q  Median      3Q     Max 
-205102  -34374   -5778   18929  163866 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -40230.867  14696.140  -2.738  0.00737 ** 
Size           116.132      8.795  13.204  < 2e-16 ***
New          57736.283  18653.041   3.095  0.00257 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 53880 on 97 degrees of freedom
Multiple R-squared:  0.7226,    Adjusted R-squared:  0.7169 
F-statistic: 126.3 on 2 and 97 DF,  p-value: < 2.2e-16

B. Report and interpret the prediction equation, and form separate equations relating selling price to size for new and for not new homes. In particular, for each variable; discuss statistical significance and interpret the meaning of the coefficient.

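ŷ = -40230.87 + 116.13x1 + 57736.28x2, where x1 is house size (in square feet) and x2 is 1 if the home is new and 0 if it is not new

Since x2 is 1 if new and 0 if not new, the equation can be written separately for each group:

NEW HOMES

ŷ = -40230.87 + 116.13x1 + 57736.28 = 17505.41 + 116.13x1

NOT NEW HOMES

ŷ = -40230.87 + 116.13x1

Size is statistically significant (p < 2e-16): for homes with the same new/not new status, each additional square foot adds about $116.13 to the predicted price. New is also statistically significant (p = 0.00257): for homes of the same size, a new home is predicted to sell for about $57,736 more than one that is not new.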

C. Find the predicted selling price for a home of 3000 square feet that is (i) new, (ii) not new.

New3000<--40231+116*3000+57736
New3000
[1] 365505
NotNew3000<--40231+116*3000
NotNew3000
[1] 307769

If this home is new, the predicted selling price is 365,505

If this home is not new, the predicted selling price is 307,769
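The calculation above rounds the coefficients to whole numbers. As a sketch, repeating it with the coefficients as printed in the part A output gives predictions about 396 higher (roughly 365,901 for new and 308,165 for not new), because rounding 116.132 down to 116 loses 0.132 per square foot over 3000 square feet. The variable names are my own.

```r
# Coefficients copied from the regression output in part A
b0     <- -40230.867
b_size <- 116.132
b_new  <- 57736.283

size <- 3000
price_not_new <- b0 + b_size * size
price_new     <- price_not_new + b_new

price_new
price_not_new
```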

D. Fit another model, this time with an interaction term allowing interaction between size and new, and report the regression results

housedata_interaction<-summary(lm(Price~Size+New+Size*New, data= house.selling.price))
housedata_interaction

Call:
lm(formula = Price ~ Size + New + Size * New, data = house.selling.price)

Residuals:
    Min      1Q  Median      3Q     Max 
-175748  -28979   -6260   14693  192519 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -22227.808  15521.110  -1.432  0.15536    
Size           104.438      9.424  11.082  < 2e-16 ***
New         -78527.502  51007.642  -1.540  0.12697    
Size:New        61.916     21.686   2.855  0.00527 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 52000 on 96 degrees of freedom
Multiple R-squared:  0.7443,    Adjusted R-squared:  0.7363 
F-statistic: 93.15 on 3 and 96 DF,  p-value: < 2.2e-16

E. Report the lines relating the predicted selling price to the size for homes that are (i) new, (ii) not new.

library(ggplot2)

ggplot(data = house.selling.price, aes(x = Size, y = Price, color = factor(New))) +
  geom_point() +
  geom_smooth(method = "lm", se = F)
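The two lines can also be written out from the interaction model in part D. For homes that are not new, the New and Size:New terms drop out; for new homes, the New coefficient is added to the intercept and the Size:New coefficient is added to the slope. This sketch does that arithmetic with the printed coefficients (the variable names are my own), giving approximately:

Not new: ŷ = -22227.81 + 104.44 Size

New: ŷ = -100755.31 + 166.35 Size

```r
# Coefficients copied from the interaction model output in part D
b0      <- -22227.808
b_size  <- 104.438
b_new   <- -78527.502
b_inter <- 61.916

not_new_line <- c(intercept = b0, slope = b_size)
new_line     <- c(intercept = b0 + b_new, slope = b_size + b_inter)

not_new_line
new_line
```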

F. Find the predicted selling price for a home of 3000 square feet that is (i) new, (ii) not new.

BNew3000<--22228+104*3000-78528+62*3000*1
BNew3000
[1] 397244
BNotNew3000<--22228+104*3000
BNotNew3000
[1] 289772

If this home is new, the predicted selling price is 397,244

If this home is not new, the predicted selling price is 289,772

G. Find the predicted selling price for a home of 1500 square feet that is (i) new, (ii) not new. Comparing to (F), explain how the difference in predicted selling prices changes as the size of home increases.

BNew1500<--22228+104*1500-78528+62*1500
BNew1500
[1] 148244
BNotNew1500<--22228+104*1500
BNotNew1500
[1] 133772

If this home is new, the predicted selling price is 148,244

If this home is not new, the predicted selling price is 133,772

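It makes sense that a 1,500 sq ft house would be predicted to sell for much less than a 3,000 sq ft house. The gap between new and not new homes also grows with size: it is 148,244 - 133,772 = 14,472 at 1,500 sq ft but 397,244 - 289,772 = 107,472 at 3,000 sq ft. This is the interaction at work: each additional square foot adds about $62 more to the predicted price of a new home than to one that is not new.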
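The new versus not new gap can be written as a function of size, since the shared intercept and Size terms cancel when the two predictions are subtracted. This is a sketch using the same rounded coefficients as parts F and G; the function name gap is my own.

```r
# Predicted price of a new home minus a not new home of the same size,
# using the rounded New (-78528) and Size:New (62) coefficients
gap <- function(size) -78528 + 62 * size

gap(1500)   # 14472
gap(3000)   # 107472

# Size at which the predicted gap is zero
78528 / 62
```

The last line shows that, under this model, the gap only becomes positive above roughly 1,267 sq ft; for smaller homes the model predicts a lower price for new homes.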

H. Do you think the model with interaction or the one without it represents the relationship of size and new to the outcome price? What makes you prefer one model over another?

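Unlike the example that we reviewed in class with classic cars, for house prices the model with interaction better represents the relationship of size and new to price. Both new and not new houses are more expensive if they are larger, but price rises faster with size for new houses, and the model without interaction cannot capture that. The interaction term is statistically significant (p = 0.00527), and adjusted R-squared rises from 0.7169 to 0.7363 when it is added.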