Multiple Linear Regression
For recent data in Jacksonville, Florida, on y = selling price of home (in dollars), x1 = size of home (in square feet), and x2 = lot size (in square feet), the prediction equation is ŷ = −10,536 + 53.8x1 + 2.84x2.
Written in terms of the variable names, the prediction equation is
\[{selling\;price}={-10,536}+{53.8}*{size\;of\;home}+{2.84}*{lot\;size}\]
To calculate the predicted selling price, I’ll substitute the given values for size of home and lot size.
# assign variables
a <- -10536
b1 <- 53.8
b2 <- 2.84
x1 <- 1240
x2 <- 18000
actualP <- 145000
# calculate predicted price
predP <- a + (b1 * x1) + (b2 * x2)
# view predicted price
predP
[1] 107296
# calculate residual
residual <- actualP - predP
# view residual
residual
[1] 37704
A one-square-foot increase in home size predicts a $53.80 increase in selling price. To calculate the square-foot increase in lot size that would predict the same increase in selling price, I’ll use the following simple equation.
\[2.84*{lot\;size\;increase}=53.80\] or \[{lot\;size\;increase}=\frac{53.80}{2.84}\]
# calculate lot size increase
lotInc <- b1/b2
# view lot size increase
lotInc
[1] 18.94366
(Data file: salary in the alr4 R package). The data file concerns salary and other characteristics of all faculty in a small Midwestern college, collected in the early 1980s for presentation in legal proceedings in which discrimination against women in salary was at issue. All persons in the data hold tenured or tenure-track positions; temporary faculty are not included. The variables include degree, a factor with levels Masters and PhD; rank, a factor with levels Asst, Assoc, and Prof; sex, a factor with levels Male and Female; year, years in current rank; ysdeg, years since highest degree; and salary, academic year salary in dollars.
First I’ll inspect the data.
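A minimal sketch of how the data can be loaded and summarized, assuming the alr4 package is installed:
# load the alr4 package, which contains the salary data
library(alr4)
# summarize the data
summary(salary)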
degree rank sex year ysdeg
Masters:34 Asst :18 Male :38 Min. : 0.000 Min. : 1.00
PhD :18 Assoc:14 Female:14 1st Qu.: 3.000 1st Qu.: 6.75
Prof :20 Median : 7.000 Median :15.50
Mean : 7.481 Mean :16.12
3rd Qu.:11.000 3rd Qu.:23.25
Max. :25.000 Max. :35.00
salary
Min. :15000
1st Qu.:18247
Median :23719
Mean :23798
3rd Qu.:27258
Max. :38045
There are five predictor variables, three of which are categorical (degree, rank, sex) and two of which are quantitative (year, which refers to years in current rank, and ysdeg, which refers to years since highest degree earned).
The null hypothesis is that the mean salary for males and females is the same. That is, sex has no effect on salary. The alternative hypothesis, then, is that the mean salary for males and females is not the same.
Thus \[H_0:\mu_1=\mu_2\] and \[H_a:\mu_1\ne\mu_2\]
I’ll use the t.test() function to conduct a two-sample t-test.
# conduct t test
t.test(salary ~ sex, data = salary, var.equal = TRUE)
Two Sample t-test
data: salary by sex
t = 1.8474, df = 50, p-value = 0.0706
alternative hypothesis: true difference in means between group Male and group Female is not equal to 0
95 percent confidence interval:
-291.257 6970.550
sample estimates:
mean in group Male mean in group Female
24696.79 21357.14
While we can see that the mean salary for males ($24,696.79) differs from that for females ($21,357.14), the \(P\)-value of .071 tells us that a difference this large would not be so unlikely under the null hypothesis as to force us to reject it. We are thus unable to reject the null hypothesis at the conventional .05 level.
Now I’ll fit a multiple regression model and get a 95% confidence interval for the difference in mean salaries.
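A sketch of the model fit, using the formula shown in the output below; fitS is the object name referenced by the confint() call further down:
# fit the multiple regression model
fitS <- lm(salary ~ degree + rank + sex + year + ysdeg, data = salary)
# view the model summary
summary(fitS)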
Call:
lm(formula = salary ~ degree + rank + sex + year + ysdeg, data = salary)
Residuals:
Min 1Q Median 3Q Max
-4045.2 -1094.7 -361.5 813.2 9193.1
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 15746.05 800.18 19.678 < 0.0000000000000002 ***
degreePhD 1388.61 1018.75 1.363 0.180
rankAssoc 5292.36 1145.40 4.621 0.000032163431 ***
rankProf 11118.76 1351.77 8.225 0.000000000162 ***
sexFemale 1166.37 925.57 1.260 0.214
year 476.31 94.91 5.018 0.000008653790 ***
ysdeg -124.57 77.49 -1.608 0.115
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2398 on 45 degrees of freedom
Multiple R-squared: 0.855, Adjusted R-squared: 0.8357
F-statistic: 44.24 on 6 and 45 DF, p-value: < 0.00000000000000022
# get CI
confint(fitS, "sexFemale", level = .95)
2.5 % 97.5 %
sexFemale -697.8183 3030.565
The baseline category for sex is male, which we know because sexFemale is included in the above regression output. The coefficient for sexFemale tells us that, controlling for all other variables, a female professor is predicted to earn an average of $1,166.37 more than a male professor.
The 95% confidence interval runs from −$697.82 to $3,030.57. This tells us that, controlling for the other variables, a female professor is predicted to earn anywhere from $697.82 less than a male professor to $3,030.57 more. Because the confidence interval includes zero, the data are consistent with male professors earning more, with female professors earning more, or with no difference between them.
We are thus unable to reject the null hypothesis at this time. This corresponds with the conclusion reached based on the \(P\)-value.
We can see the coefficient/slope of each predictor variable by looking at the Estimate column in the summary of our lm() output (above). The coefficient/slope of each predictor indicates the expected mean change in our response variable (salary) for each one-unit increase in that predictor, holding all other variables constant. The Std. Error column gives the standard error of each estimate, the t value column gives the \(t\) statistic, and the Pr(>|t|) column gives the \(P\)-value.
Looking at the predictor variables relative to one another, it appears that one’s rank and number of years in that rank are the most significant predictors of salary. The high \(P\)-values for every other predictor tell us that we cannot rule out the possibility that their effects are zero.
We can see from the summary of our lm() output (above) that the baseline category of the rank variable is Asst (assistant professor), because no rankAsst term appears in the output. R treats the first factor level as the reference (baseline) category and absorbs it into the intercept, which avoids perfect collinearity among the dummy variables. We can use the levels() function to retrieve the factor levels of a variable and the relevel() function to change which level serves as the baseline.
Here I’ll change the baseline category from Assistant Prof. to Full Prof.
# get factor levels
levels(salary$rank)
[1] "Asst" "Assoc" "Prof"
Call:
lm(formula = salary ~ degree + rankNew + sex + year + ysdeg,
data = salary)
Residuals:
Min 1Q Median 3Q Max
-4045.2 -1094.7 -361.5 813.2 9193.1
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 26864.81 1375.29 19.534 < 0.0000000000000002 ***
degreePhD 1388.61 1018.75 1.363 0.180
rankNewAsst -11118.76 1351.77 -8.225 0.000000000162 ***
rankNewAssoc -5826.40 1012.93 -5.752 0.000000727809 ***
sexFemale 1166.37 925.57 1.260 0.214
year 476.31 94.91 5.018 0.000008653790 ***
ysdeg -124.57 77.49 -1.608 0.115
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2398 on 45 degrees of freedom
Multiple R-squared: 0.855, Adjusted R-squared: 0.8357
F-statistic: 44.24 on 6 and 45 DF, p-value: < 0.00000000000000022
Two things to note: first, the rank coefficients now express how assistant and associate professors compare with the new baseline, full professor, which is why they are negative; second, all of the other estimates, standard errors, \(R^2\), and the overall \(F\)-statistic are unchanged, because changing the baseline simply reparameterizes the model without changing its fit.
We can also exclude particular variables from a regression model. Here I’ll exclude rank and see what happens.
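A sketch of the fit, matching the Call shown below (fitS3 is a hypothetical object name):
# fit the model with all predictors except rank
fitS3 <- lm(salary ~ . - rank, data = salary)
# view the model summary
summary(fitS3)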
Call:
lm(formula = salary ~ . - rank, data = salary)
Residuals:
Min 1Q Median 3Q Max
-8146.9 -2186.9 -491.5 2279.1 11186.6
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 17183.57 1147.94 14.969 < 0.0000000000000002 ***
degreePhD -3299.35 1302.52 -2.533 0.014704 *
sexFemale -1286.54 1313.09 -0.980 0.332209
year 351.97 142.48 2.470 0.017185 *
ysdeg 339.40 80.62 4.210 0.000114 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3744 on 47 degrees of freedom
Multiple R-squared: 0.6312, Adjusted R-squared: 0.5998
F-statistic: 20.11 on 4 and 47 DF, p-value: 0.000000001048
Ignoring rank altogether changes the model rather dramatically. The coefficients for having a PhD, being female, and years since highest degree have all reversed sign. In this model, the predictors with the smallest \(P\)-values are years since highest degree (ysdeg), years in current rank (year), and having a PhD.
Placed in the context of an investigation of sex discrimination, I’m not sure we could justify ignoring rank. Discrimination of any kind could easily manifest itself in the promotion of one group over another, indicating that we should control for it (i.e. include it in our regression model) instead of ignoring it.
If the claim is that which dean one was hired by is a predictor of salary, a logical new variable would be the binary variable dean, with factor levels “new” and “old.” I’ll use the mutate() function to create this new variable, assigning each professor a category based on when he/she was hired. Those hired in the past fifteen years will be assigned to the new dean and those hired prior (over fifteen years ago) to the old dean.
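A sketch of how this could be done with dplyr; I’m assuming the cutoff is ysdeg <= 15 (hired within the last fifteen years), which is consistent with the rows shown below, and salaryD is the name used in the later model calls:
# load dplyr for mutate()
library(dplyr)
# assign each professor to the new or old dean based on the assumed ysdeg <= 15 cutoff
salaryD <- salary %>%
  mutate(dean = factor(ifelse(ysdeg <= 15, "new", "old")))
# view the first twenty rows
head(salaryD, 20)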
degree rank sex year ysdeg salary dean
1 Masters Prof Male 25 35 36350 old
2 Masters Prof Male 13 22 35350 old
3 Masters Prof Male 10 23 28200 old
4 Masters Prof Female 7 27 26775 old
5 PhD Prof Male 19 30 33696 old
6 Masters Prof Male 16 21 28516 old
7 PhD Prof Female 0 32 24900 old
8 Masters Prof Male 16 18 31909 old
9 PhD Prof Male 13 30 31850 old
10 PhD Prof Male 13 31 32850 old
11 Masters Prof Male 12 22 27025 old
12 Masters Assoc Male 15 19 24750 old
13 Masters Prof Male 9 17 28200 old
14 PhD Assoc Male 9 27 23712 old
15 Masters Prof Male 9 24 25748 old
16 Masters Prof Male 7 15 29342 new
17 Masters Prof Male 13 20 31114 old
18 PhD Assoc Male 11 14 24742 new
19 PhD Assoc Male 10 15 22906 new
20 PhD Prof Male 6 21 24450 old
Now I can fit a new model that includes this new variable.
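A sketch of the fit, matching the Call shown below (fitD is a hypothetical object name):
# fit a model with all variables, including dean
fitD <- lm(salary ~ ., data = salaryD)
# view the model summary
summary(fitD)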
Call:
lm(formula = salary ~ ., data = salaryD)
Residuals:
Min 1Q Median 3Q Max
-3621.2 -1336.8 -271.6 530.1 9247.6
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 15516.79 814.81 19.043 < 0.0000000000000002 ***
degreePhD 1135.00 1031.16 1.101 0.277
rankAssoc 5234.01 1138.47 4.597 0.000035985932 ***
rankProf 11411.45 1362.02 8.378 0.000000000116 ***
sexFemale 1084.09 921.49 1.176 0.246
year 460.35 95.09 4.841 0.000016263785 ***
ysdeg -47.86 97.71 -0.490 0.627
deanold -1749.09 1372.83 -1.274 0.209
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2382 on 44 degrees of freedom
Multiple R-squared: 0.8602, Adjusted R-squared: 0.838
F-statistic: 38.68 on 7 and 44 DF, p-value: < 0.00000000000000022
We can see from the regression output that the baseline category is those hired by the new dean. Based on this model, we expect those hired by the old dean to earn an average of $1,749.09 less than those hired by the new dean. This difference is not statistically significant, however, so based on this model we would not be able to reject the null hypothesis that the hiring dean has no effect on salary.
It is possible that there is some multicollinearity within this model. Multicollinearity means that some or all of the predictor variables in a multiple regression model are highly correlated with one another; here, dean was constructed directly from ysdeg, so the two are bound to be strongly related. One symptom is that adding dean barely improves the fit: \(R^2\) was .855 in our first model and is .860 in this newest one, an indication that the new variable adds little independent predictive power.
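One way to check this, sketched here assuming the car package is available and using the hypothetical object name fitD from above, is to look at variance inflation factors and at how strongly dean tracks ysdeg:
# load car for variance inflation factors
library(car)
# generalized VIFs (GVIFs), since the model contains factor predictors
vif(fitD)
# correlation between ysdeg and an indicator for being hired under the old dean
cor(salaryD$ysdeg, as.numeric(salaryD$dean == "old"))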
It’s also worth noting that larger sample sizes can help mitigate the problems posed by multicollinearity. A general rule is that a dataset should have 10x as many observations as predictor variables contained within the model. According to this guideline, we would ideally have 60 observations for our new model with 6 variables. Since we have only 52, we may be entering into the territory of a too-small sample size.
Given the potential for multicollinearity, I’ll fit a new model removing the variable ysdeg and compare the two.
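A sketch of the fit, matching the Call shown below (fitD2 is a hypothetical object name):
# refit the dean model without ysdeg
fitD2 <- lm(salary ~ . - ysdeg, data = salaryD)
# view the model summary
summary(fitD2)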
Call:
lm(formula = salary ~ . - ysdeg, data = salaryD)
Residuals:
Min 1Q Median 3Q Max
-3403.3 -1387.0 -167.0 528.2 9233.8
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 15491.84 806.32 19.213 < 0.0000000000000002 ***
degreePhD 818.93 797.48 1.027 0.3100
rankAssoc 4972.66 997.17 4.987 0.00000961362451 ***
rankProf 11096.95 1191.00 9.317 0.00000000000454 ***
sexFemale 907.14 840.54 1.079 0.2862
year 434.85 78.89 5.512 0.00000164625970 ***
deanold -2163.46 1072.04 -2.018 0.0496 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2362 on 45 degrees of freedom
Multiple R-squared: 0.8594, Adjusted R-squared: 0.8407
F-statistic: 45.86 on 6 and 45 DF, p-value: < 0.00000000000000022
We can see that the adjusted \(R^2\) value of this final model is 0.8407, which is higher than the adjusted \(R^2\) values for the original model (0.8357) and the model with both ysdeg and dean (0.838). This is a good indication that this final model is the best-fitting of the three.
First I’ll inspect the data.
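Assuming the house selling price data have already been loaded as hSP (the name used in the model calls below), the structure can be inspected with str():
# inspect the structure of the house selling price data
str(hSP)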
'data.frame': 100 obs. of 7 variables:
$ case : int 1 2 3 4 5 6 7 8 9 10 ...
$ Taxes: int 3104 1173 3076 1608 1454 2997 4054 3002 6627 320 ...
$ Beds : int 4 2 4 3 3 3 3 3 5 3 ...
$ Baths: int 2 1 2 2 3 2 2 2 4 2 ...
$ New : int 0 0 0 0 0 1 0 1 0 0 ...
$ Price: int 279900 146500 237700 200000 159900 499900 265500 289900 587000 70000 ...
$ Size : int 2048 912 1654 2068 1477 3153 1355 2075 3990 1160 ...
It’s worth noting that while the variable New is conceptually a categorical (dummy) variable, R has read it in as an integer. I’m not sure whether that will matter for my analysis, but I’ll bear in mind that I may need to convert it to a factor at some point.
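A sketch of the fit whose summary appears below (fitP is a hypothetical object name):
# fit a model with New and Size as predictors
fitP <- lm(Price ~ New + Size, data = hSP)
# view the model summary
summary(fitP)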
Call:
lm(formula = Price ~ New + Size, data = hSP)
Residuals:
Min 1Q Median 3Q Max
-205102 -34374 -5778 18929 163866
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -40230.867 14696.140 -2.738 0.00737 **
New 57736.283 18653.041 3.095 0.00257 **
Size 116.132 8.795 13.204 < 0.0000000000000002 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 53880 on 97 degrees of freedom
Multiple R-squared: 0.7226, Adjusted R-squared: 0.7169
F-statistic: 126.3 on 2 and 97 DF, p-value: < 0.00000000000000022
The coefficient/slope for the predictor variable New is 57736.283, indicating that, holding size constant, a new house is predicted to sell for an average of $57,736.28 more than a comparable house that is not new.
The coefficient/slope for the predictor variable Size is 116.132, indicating that, holding New constant, we would expect an average increase in selling price of $116.13 for each one-square-foot increase in house size.
Both coefficients are statistically significant at the .01 level, so we can be confident in saying that neither the effect of size nor the effect of being new is zero.
New is a binary (or dummy) variable where 1 = “new” and 0 = “not new”. Thus our regression equation will look a little different from the case in which all predictors are quantitative. Since New has two levels (“new” and “not new”), our equation needs only one \(z\) term.
Starting with \[E(y)=\alpha+\beta{x}+\beta_1{z}\] we get \[{estimated\;selling\;price}=-40,230.867+116.132*size+57,736.283*z\] where \(z\)=1 if the house is new and \(z\)=0 if it’s not new.
If the house is new then the final term of the prediction equation will be 57,736.283 and the equation can be simplified to \[{estimated\;selling\;price}=17,505.42+116.132*size\] If the house is not new then the final term of the above equation will be 0 and the equation can be simplified to \[{estimated\;selling\;price}=-40,230.867+116.132*size\]
In both cases, the effect of size remains the same: a one-square-foot increase in size predicts an average increase in price of $116.13. Both coefficients/slopes are statistically significant, so we can be confident that neither is zero. Put more simply, a house that is new will be more expensive than a comparable one that is not, but for both new and not-new houses, bigger houses will likely be more expensive.
We can calculate the price of a 3,000-square-foot house that is new as follows \[{estimated\;selling\;price}=17,505.42+116.132*3,000\]
and for a house that is not new as follows \[{estimated\;selling\;price}=-40,230.867+116.132*3,000\]
# assign objects
a <- -40230.867
b1 <- 116.132
bZ <- 57736.283
# calculate estimated price if new
priceNew <- a + (b1*3000) + (bZ*1)
# calculate price if not new
priceNotNew <- a + (b1*3000) + (bZ*0)
Thus, the estimated price for a 3,000-square-foot house that is new is $365,901.42. The estimated price for a 3,000-square-foot house that is not new is $308,165.13.
In addition to allowing us to include multiple predictor variables, multiple regression allows us to indicate interaction terms by creating cross-product terms. An interaction between two predictor variables exists when the value of one changes the effect of the other on the response variable.
I’ll now fit a model in which size interacts with whether the house is new or not new.
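A sketch of the fit, matching the Call shown below; fitP2 is the object name used in the interact_plot() call further down:
# fit a model with an interaction between Size and New
fitP2 <- lm(Price ~ Size * New, data = hSP)
# view the model summary
summary(fitP2)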
Call:
lm(formula = Price ~ Size * New, data = hSP)
Residuals:
Min 1Q Median 3Q Max
-175748 -28979 -6260 14693 192519
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -22227.808 15521.110 -1.432 0.15536
Size 104.438 9.424 11.082 < 0.0000000000000002 ***
New -78527.502 51007.642 -1.540 0.12697
Size:New 61.916 21.686 2.855 0.00527 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 52000 on 96 degrees of freedom
Multiple R-squared: 0.7443, Adjusted R-squared: 0.7363
F-statistic: 93.15 on 3 and 96 DF, p-value: < 0.00000000000000022
Looking at the output of our regression model, the interaction term Size:New is statistically significant, which indicates that the effect of size on selling price differs between new and not-new houses.
To see this interaction, we can plot the prediction lines for new houses and for houses that are not new separately.
# load the interactions package, which provides interact_plot()
library(interactions)
# plot predicted price against size separately for new and not-new houses
interact_plot(fitP2, pred = Size, modx = New, data = hSP, modx.labels = c("Not New", "New"))
This makes it easy to see what the summary of our lm() output (above) tells us: an increase in size predicts an increase in selling price for both houses that are new and those that are not, but being new confers an additional increase in selling price of $61.92 per one-square-foot increase in house size. The larger the house, the larger the difference in price based on being new or not.
Based on our new regression output, our new prediction equation is \[{estimated\;selling\;price}=-22,227.808+104.438*size-78,527.502*z+61.916*size*z\] where \(z\)=1 if the house is new and \(z\)=0 if it’s not.
# assign objects
a <- -22227.808
b1 <- 104.438
b2 <- -78527.502
b3 <- 61.916
# calculate price if new
priceNew2 <- a + (b1 * 3000) + (b2 * 1) + (b3 * 3000 * 1)
# calculate price if not new
priceNotNew2 <- a + (b1 * 3000) + (b2 * 0) + (b3 * 3000 * 0)
Thus, when we include the interaction term \(size*new\), the estimated price for a 3,000-square-foot house that is new is $398,306.69 and the estimated price for a 3,000-square-foot house that is not new is $291,086.19.
# calculate price if new
priceNew3 <- a + (b1 * 1500) + (b2 * 1) + (b3 * 1500 * 1)
# calculate price if not new
priceNotNew3 <- a + (b1 * 1500) + (b2 * 0) + (b3 * 1500 * 0)
The estimated price for a 1,500-square-foot house that is new is $148,775.69 and the estimated price for a 1,500-square-foot house that is not new is $134,429.19.
For a 3,000-square-foot house, the difference in price between one that’s new and one that’s not is $107,220.50.
For a 1,500-square-foot house, the difference in price between one that’s new and one that’s not is $14,346.50.
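These differences can be verified directly from the objects computed above:
# difference between new and not new at 3,000 square feet
priceNew2 - priceNotNew2
# difference between new and not new at 1,500 square feet
priceNew3 - priceNotNew3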
This fits with what’s represented in the above plot. The difference between the lines for new houses and not new houses is relatively small around the 1,500-square-foot mark on the \(x\)-axis and much larger around the 3,000-square-foot mark. As the houses get bigger in size, the impact of whether the house is new or not new increases.
The model with the interaction between size and new seems more appropriate to me, for a few reasons.