Homework Three

DACSS 603

Cynthia Hester
March 20, 2022

Question 1

(SMSS 11.2, except part (d))

For recent data in Jacksonville, Florida, on y = selling price of home (in dollars), x1 = size of home (in square feet), and x2 = lot size (in square feet), the prediction equation is ŷ = −10,536 + 53.8x1 + 2.84x2.


Solutions

Part A

A particular home of 1240 square feet on a lot of 18,000 square feet sold for $145,000. Find the predicted selling price and the residual, and interpret.


Here’s what we have:

The selling price of the home depends on the size of the home and the lot size. Therefore, selling price is the dependent (response) variable, size of home is the first explanatory variable, and lot size is the second explanatory variable. The multiple linear regression model for the problem is:

\(Y_{i} = \beta_{0} + \beta_{1}x_{i1} + \beta_{2}x_{i2} + \epsilon_{i}\)

where:

\(Y_{i}\) is the dependent variable, the selling price of the home

\(\beta_{0}\) is the y-intercept

\(\beta_{1}\) and \(\beta_{2}\) are the slopes of the explanatory variables

\(x_{i1}\) is the first explanatory variable, the size of the home

\(x_{i2}\) is the second explanatory variable, the lot size

\(\epsilon_{i}\) is the error term (residual)

For a home of 1240 square feet on a lot of 18,000 square feet, we predict the selling price with the estimated prediction equation:

\(\hat{y}\) = -10,536 + 53.8(1240) + 2.84(18,000) = $107,296

Calculating selling price in R

house_selling_price <- -10536 + 53.8*1240 + 2.84*18000    # predicted selling price
 print(house_selling_price)                               # review of output
[1] 107296

House selling price = $107,296


Now we calculate the residual, which is the difference between the observed value and the value the model predicts for that observation. It is the vertical distance between a data point and the regression line, and a measure of how well the line fits that individual point.

residual_house_price<-145000-house_selling_price    #residual price of house
print(residual_house_price)
[1] 37704

Interpretation:

Since the residual is positive ($37,704), the actual selling price is more than the predicted selling price: the house sold for $37,704 more than the model predicts.


Part B

For fixed lot size, how much is the house selling price predicted to increase for each square-foot increase in home size? Why?

Interpretation:

The slope coefficient for the first explanatory variable, home size, is 53.8 and is positive. Thus, for every additional square-foot increase in the size of the home, the predicted selling price increases by $53.80, holding the other explanatory variable, lot size, fixed.
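As a quick numerical check, a sketch using the given prediction equation shows the one-square-foot effect directly (the home and lot values here are just the ones from Part A):

p1 <- -10536 + 53.8*1240 + 2.84*18000   # predicted price, 1240 sq ft home
p2 <- -10536 + 53.8*1241 + 2.84*18000   # one additional square foot, same lot
p2 - p1                                 # difference equals the size coefficient, 53.8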


Part C

According to this prediction equation, for fixed home size, how much would the lot size need to increase to have the same impact as a one-square-foot increase in home size?

Interpretation:

The slope coefficient for the second explanatory variable, lot size, is 2.84 and is positive. For every additional square-foot increase in lot size, the predicted selling price increases by $2.84, holding home size fixed. Therefore, taking the equation:

\(\hat{y} = -10{,}536 + 53.8x_{1} + 2.84x_{2}\)

To find the lot-size increase with the same impact, we divide the home-size coefficient by the lot-size coefficient: 53.8 × 1 / 2.84, where 1 is the one-square-foot increase in home size and 2.84 is the lot-size coefficient.

lot_increase<-53.8*1/2.84
print(lot_increase)
[1] 18.94366

The lot size would need to increase by about 18.94 (roughly 19) square feet to have the same impact on the predicted selling price as a one-square-foot increase in home size.


Question 2

(ALR, 5.17, slightly modified)

(Data file: salary in alr4 R package).

The data file concerns salary and other characteristics of all faculty in a small Midwestern college collected in the early 1980s for presentation in legal proceedings for which discrimination against women in salary was at issue. All persons in the data hold tenured or tenure track positions; temporary faculty are not included. The variables include degree, a factor with levels PhD and MS; rank, a factor with levels Asst, Assoc, and Prof; sex, a factor with levels Male and Female; Year, years in current rank; ysdeg, years since highest degree, and salary, academic year salary in dollars.


First, I inspect the data set to understand what it is about.

library(alr4)    # provides the salary data
library(dplyr)   # provides glimpse()
glimpse(salary)
Rows: 52
Columns: 6
$ degree <fct> Masters, Masters, Masters, Masters, PhD, Masters, PhD~
$ rank   <fct> Prof, Prof, Prof, Prof, Prof, Prof, Prof, Prof, Prof,~
$ sex    <fct> Male, Male, Male, Female, Male, Male, Female, Male, M~
$ year   <int> 25, 13, 10, 7, 19, 16, 0, 16, 13, 13, 12, 15, 9, 9, 9~
$ ysdeg  <int> 35, 22, 23, 27, 30, 21, 32, 18, 30, 31, 22, 19, 17, 2~
$ salary <int> 36350, 35350, 28200, 26775, 33696, 28516, 24900, 3190~
summary(salary)       
     degree      rank        sex          year            ysdeg      
 Masters:34   Asst :18   Male  :38   Min.   : 0.000   Min.   : 1.00  
 PhD    :18   Assoc:14   Female:14   1st Qu.: 3.000   1st Qu.: 6.75  
              Prof :20               Median : 7.000   Median :15.50  
                                     Mean   : 7.481   Mean   :16.12  
                                     3rd Qu.:11.000   3rd Qu.:23.25  
                                     Max.   :25.000   Max.   :35.00  
     salary     
 Min.   :15000  
 1st Qu.:18247  
 Median :23719  
 Mean   :23798  
 3rd Qu.:27259  
 Max.   :38045  
head(salary,15)       #First 15 rows of the data set
    degree  rank    sex year ysdeg salary
1  Masters  Prof   Male   25    35  36350
2  Masters  Prof   Male   13    22  35350
3  Masters  Prof   Male   10    23  28200
4  Masters  Prof Female    7    27  26775
5      PhD  Prof   Male   19    30  33696
6  Masters  Prof   Male   16    21  28516
7      PhD  Prof Female    0    32  24900
8  Masters  Prof   Male   16    18  31909
9      PhD  Prof   Male   13    30  31850
10     PhD  Prof   Male   13    31  32850
11 Masters  Prof   Male   12    22  27025
12 Masters Assoc   Male   15    19  24750
13 Masters  Prof   Male    9    17  28200
14     PhD Assoc   Male    9    27  23712
15 Masters  Prof   Male    9    24  25748

Solutions

Part A

Test the hypothesis that the mean salary for men and women is the same, without regard to any other variable but sex. Explain your findings.


First, to gain an understanding of the data, I import the salary data from the alr4 package, which accompanies the ALR textbook.

library(alr4)
data("salary")
dim(salary)
[1] 52  6
library(skimr)
summary(salary)
     degree      rank        sex          year            ysdeg      
 Masters:34   Asst :18   Male  :38   Min.   : 0.000   Min.   : 1.00  
 PhD    :18   Assoc:14   Female:14   1st Qu.: 3.000   1st Qu.: 6.75  
              Prof :20               Median : 7.000   Median :15.50  
                                     Mean   : 7.481   Mean   :16.12  
                                     3rd Qu.:11.000   3rd Qu.:23.25  
                                     Max.   :25.000   Max.   :35.00  
     salary     
 Min.   :15000  
 1st Qu.:18247  
 Median :23719  
 Mean   :23798  
 3rd Qu.:27259  
 Max.   :38045  

As the group summaries below show, the mean salary for men is $24,696.79 and the mean salary for women is $21,357.14, a difference of $3,339.65.


Now we test the hypothesis

Hypothesis Testing:

\(H_{0}\): \(\mu_{M} = \mu_{F}\) — the mean salaries of men and women are the same

\(H_{a}\): \(\mu_{M} \neq \mu_{F}\) — the mean salaries of men and women differ

Significance Level = 0.05

Test: two-sample t-test

Now that I have a better understanding of the data, I compute the mean and standard deviation of salary for men and women.

## Calculating mean salary by sex using pipes and the group_by function

salary %>% group_by(sex) %>% 
  summarise(mean = mean(salary), 
              sd = sd(salary))
# A tibble: 2 x 3
  sex      mean    sd
  <fct>   <dbl> <dbl>
1 Male   24697. 5646.
2 Female 21357. 6152.
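Before assuming equal variances in the t-test that follows, that assumption could be checked with an F test; a minimal sketch using base R's var.test():

var.test(salary ~ sex, data = salary)   # F test of equal variances by sex;
                                        # a small p-value would argue against var.equal = TRUE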

We now run a two-sample t-test (assuming equal variances) to test the hypothesis and obtain a 95% confidence interval.

t.test(salary~sex, data = salary, var.equal = T,    #t-test to calculate the 95% CI
       conf.level = 0.95, alternative = "two.sided")

    Two Sample t-test

data:  salary by sex
t = 1.8474, df = 50, p-value = 0.0706
alternative hypothesis: true difference in means between group Male and group Female is not equal to 0
95 percent confidence interval:
 -291.257 6970.550
sample estimates:
  mean in group Male mean in group Female 
            24696.79             21357.14 

Interpretation:

The p-value of the t-test is 0.0706. Since 0.0706 is greater than the 0.05 significance level, there is not enough evidence to reject the null hypothesis \(H_{0}\). We therefore fail to reject the null hypothesis that the mean salaries for men and women are the same.


Part B

Run a multiple linear regression with salary as the outcome variable and everything else as predictors, including sex. Assuming no interactions between sex and the other predictors, obtain a 95% confidence interval for the difference in salary between males and females.


First, we run a multiple regression model using salary.

lm_model <- lm(salary~.,data = salary)    #multiple linear regression
summary(lm_model)                         #review of output

Call:
lm(formula = salary ~ ., data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-4045.2 -1094.7  -361.5   813.2  9193.1 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 15746.05     800.18  19.678  < 2e-16 ***
degreePhD    1388.61    1018.75   1.363    0.180    
rankAssoc    5292.36    1145.40   4.621 3.22e-05 ***
rankProf    11118.76    1351.77   8.225 1.62e-10 ***
sexFemale    1166.37     925.57   1.260    0.214    
year          476.31      94.91   5.018 8.65e-06 ***
ysdeg        -124.57      77.49  -1.608    0.115    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2398 on 45 degrees of freedom
Multiple R-squared:  0.855, Adjusted R-squared:  0.8357 
F-statistic: 44.24 on 6 and 45 DF,  p-value: < 2.2e-16

Part 2 of B

Assuming no interactions between sex and the other predictors, obtain a 95% confidence interval for the difference in salary between males and females.

I now calculate the 95% confidence interval for the difference between male and female salaries, using a t-test.

t.test(salary~sex, data = salary, var.equal = T,    #t-test to calculate the 95% CI
       conf.level = 0.95, alternative = "two.sided")

    Two Sample t-test

data:  salary by sex
t = 1.8474, df = 50, p-value = 0.0706
alternative hypothesis: true difference in means between group Male and group Female is not equal to 0
95 percent confidence interval:
 -291.257 6970.550
sample estimates:
  mean in group Male mean in group Female 
            24696.79             21357.14 

The 95% confidence interval for the difference in mean salaries between males and females is:

[-291.257, 6970.550], where -291.257 is the lower bound and 6970.550 is the upper bound


Using the summary function on a linear model with sex as the only predictor, we again see the p-value of 0.0706 (7.06%). The point estimate for the sex variable is a difference of about $3,340 in favor of males.

lm_sex <- lm(salary ~ sex, data = salary)   # simple regression with sex as the only predictor
summary(lm_sex)$coef
             Estimate Std. Error  t value     Pr(>|t|)
(Intercept) 24696.789   937.9776 26.32983 5.761530e-31
sexFemale   -3339.647  1807.7156 -1.84744 7.060394e-02
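Since the question asks for the interval implied by the multiple regression, the 95% confidence interval for the adjusted male-female difference could also be read directly off the sexFemale coefficient of the full model with confint(); a sketch (output not shown here):

lm_full <- lm(salary ~ ., data = salary)      # full model with all predictors
confint(lm_full, "sexFemale", level = 0.95)   # 95% CI for the sex difference, adjusted for the other variables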

Part C

Interpret your finding for each predictor variable; discuss (a) statistical significance, (b) interpretation of the coefficient / slope in relation to the outcome variable and other variables.

Here, I fit the multiple linear regression to assess the statistical significance of each predictor variable.

lm_model <- lm(salary~.,data = salary)    #multiple linear regression
summary(lm_model)

Call:
lm(formula = salary ~ ., data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-4045.2 -1094.7  -361.5   813.2  9193.1 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 15746.05     800.18  19.678  < 2e-16 ***
degreePhD    1388.61    1018.75   1.363    0.180    
rankAssoc    5292.36    1145.40   4.621 3.22e-05 ***
rankProf    11118.76    1351.77   8.225 1.62e-10 ***
sexFemale    1166.37     925.57   1.260    0.214    
year          476.31      94.91   5.018 8.65e-06 ***
ysdeg        -124.57      77.49  -1.608    0.115    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2398 on 45 degrees of freedom
Multiple R-squared:  0.855, Adjusted R-squared:  0.8357 
F-statistic: 44.24 on 6 and 45 DF,  p-value: < 2.2e-16

Interpretation:

Intercept: The estimated salary is $15,746.05 when all predictors are at their baseline values (Masters degree, Assistant rank, male, year = 0, ysdeg = 0).

degreePhD: For a person holding a PhD, the slope is positive: an average increase in salary of $1,388.61 relative to a Masters, keeping all other variables constant. The p-value is 0.180 > 0.05, so the coefficient is not statistically significant.

rankAssoc: For a person holding the Associate rank, the slope is positive: an average increase in salary of $5,292.36 relative to the Assistant rank, keeping all other variables constant. The p-value is 3.22e-05 < 0.05, so the coefficient is statistically significant.

rankProf: For a person holding the Professor rank, the slope is positive: an average increase in salary of $11,118.76 relative to the Assistant rank, keeping all other variables constant. The p-value is 1.62e-10 < 0.05, so the coefficient is statistically significant.

sexFemale: For females, the slope is positive: an average increase in salary of $1,166.37 relative to males, keeping all other variables constant. The p-value is 0.214 > 0.05, so the coefficient is not statistically significant.

year: For each additional year in the current rank, the slope is positive: an average increase in salary of $476.31, keeping all other variables constant. The p-value is 8.65e-06 < 0.05, so the coefficient is statistically significant.

ysdeg: For each additional year since the highest degree, the slope is negative: an average decrease in salary of $124.57, keeping all other variables constant. The p-value is 0.115 > 0.05, so the coefficient is not statistically significant.


Part D

Change the baseline category for the rank variable. Interpret the coefficients related to rank again.


I first create dummy variables for the rank variable, making Associate the new baseline category.

salary$D1 <- ifelse(salary$rank=="Asst",1,0)
salary$D2 <- ifelse(salary$rank=="Prof",1,0)
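An alternative to hand-coded dummies, which should give the same fit, is to change the factor's reference level directly with relevel(); a sketch (rank2 is just an illustrative name):

salary$rank2 <- relevel(salary$rank, ref = "Assoc")   # Assoc becomes the baseline level
lm_relevel <- lm(salary ~ degree + rank2 + sex + year + ysdeg, data = salary)
summary(lm_relevel)$coefficients   # rank2Asst and rank2Prof are now contrasts against Assoc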

Revised linear model after changing the base category of rank:

# Linear Model after changing the base category of rank
lm_model <- lm(salary~degree+D1+D2+sex+year+ysdeg,data = salary)
summary(lm_model)

Call:
lm(formula = salary ~ degree + D1 + D2 + sex + year + ysdeg, 
    data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-4045.2 -1094.7  -361.5   813.2  9193.1 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 21038.41    1109.12  18.969  < 2e-16 ***
degreePhD    1388.61    1018.75   1.363    0.180    
D1          -5292.36    1145.40  -4.621 3.22e-05 ***
D2           5826.40    1012.93   5.752 7.28e-07 ***
sexFemale    1166.37     925.57   1.260    0.214    
year          476.31      94.91   5.018 8.65e-06 ***
ysdeg        -124.57      77.49  -1.608    0.115    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2398 on 45 degrees of freedom
Multiple R-squared:  0.855, Adjusted R-squared:  0.8357 
F-statistic: 44.24 on 6 and 45 DF,  p-value: < 2.2e-16

Analysis:

Revised Rank variables:

D1 = Assistant Rank

D1 = -5292.36 implies that a person holding the Assistant rank earns on average $5,292.36 less than a person holding the Associate rank, keeping all other variables constant.

D2 = Professor Rank

D2 = 5826.40 implies that a person holding the Professor rank earns on average $5,826.40 more than a person holding the Associate rank, keeping all other variables constant.


Part E

Finkelstein (1980), in a discussion of the use of regression in discrimination cases, wrote, “[a] variable may reflect a position or status bestowed by the employer, in which case if there is discrimination in the award of the position or status, the variable may be ‘tainted.’” Thus, for example, if discrimination is at work in promotion of faculty to higher ranks, using rank to adjust salaries before comparing the sexes may not be acceptable to the courts.

Exclude the variable rank, refit, and summarize how your findings changed, if they did.


Excluding the variable rank in the linear regression model

#linear regression model using the lm function

lm_model_no_rank<-lm(formula=salary~sex+degree+year+ysdeg,data=salary)
summary(lm_model_no_rank)                       #review of linear model output

Call:
lm(formula = salary ~ sex + degree + year + ysdeg, data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-8146.9 -2186.9  -491.5  2279.1 11186.6 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 17183.57    1147.94  14.969  < 2e-16 ***
sexFemale   -1286.54    1313.09  -0.980 0.332209    
degreePhD   -3299.35    1302.52  -2.533 0.014704 *  
year          351.97     142.48   2.470 0.017185 *  
ysdeg         339.40      80.62   4.210 0.000114 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3744 on 47 degrees of freedom
Multiple R-squared:  0.6312,    Adjusted R-squared:  0.5998 
F-statistic: 20.11 on 4 and 47 DF,  p-value: 1.048e-09

Interpretation:

Excluding the rank variable, the coefficient for sexFemale has a negative slope, indicating a salary advantage for males; however, the p-value is 0.33, greater than 0.05, so the difference is not statistically significant.

Excluding the rank variable, the coefficient for degreePhD has a negative slope, indicating a decrease in salary of approximately $3,299.35 for PhD holders relative to Masters holders. With a p-value of 0.0147, less than the 0.05 benchmark, the coefficient is statistically significant.

Excluding the rank variable, the coefficient for year has a positive slope: each additional year in the current rank is associated with a salary increase of approximately $351.97. With a p-value of 0.017, less than the 0.05 benchmark, the coefficient is statistically significant.

Excluding the rank variable, the coefficient for ysdeg has a positive slope: each additional year since the highest degree is associated with a salary increase of approximately $339.40. With a p-value of 0.000114, less than the 0.05 benchmark, the coefficient is statistically significant.

Analysis:

By excluding the rank variable, the coefficients for degreePhD, year, and ysdeg all have p-values below the 0.05 benchmark and are statistically significant, while sexFemale has a p-value above 0.05 and remains statistically insignificant; this may indicate there is no gender salary discrimination. Notably, when rank is removed, degreePhD becomes statistically significant (and its sign flips from positive to negative), where before it was insignificant. The variable year is statistically significant both with and without rank in the model, whereas ysdeg was insignificant before the removal of rank but becomes significant (with a positive sign) after it.


Part F

Everyone in this dataset was hired the year they earned their highest degree. It is also known that a new Dean was appointed 15 years ago, and everyone in the dataset who earned their highest degree 15 years ago or less than that has been hired by the new Dean. Some people have argued that the new Dean has been making offers that are a lot more generous to newly hired faculty than the previous one and that this might explain some of the variation in Salary.

Create a new variable that would allow you to test this hypothesis and run another multiple regression model to test this. Select variables carefully to make sure there is no multicollinearity. Explain why multicollinearity would be a concern in this case and how you avoided it. Do you find support for the hypothesis that the people hired by the new Dean are making higher than those that were not?


Let’s start with our hypotheses

Hypotheses:

\(H_{0}\): the mean salary of faculty hired by the new Dean is less than or equal to the mean salary of faculty hired by the previous Dean

\(H_{a}\): the mean salary of faculty hired by the new Dean is greater than the mean salary of faculty hired by the previous Dean

Faculty whose highest degree was earned 15 or fewer years ago (ysdeg ≤ 15) were hired by the new Dean; those with ysdeg of 16 or more were hired by the previous Dean. I create a dummy variable representing this, because new-Dean versus previous-Dean hire is binary, and dummy variables are by definition dichotomous: new-Dean hires (ysdeg ≤ 15) are coded 1 and previous-Dean hires (ysdeg ≥ 16) are coded 0.

I now check if there is any multicollinearity in our model

lm_model_ysdeg <- lm(salary~.,data = salary)    #multiple linear regression
summary(lm_model_ysdeg)

Call:
lm(formula = salary ~ ., data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-4045.2 -1094.7  -361.5   813.2  9193.1 

Coefficients: (2 not defined because of singularities)
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 15746.05     800.18  19.678  < 2e-16 ***
degreePhD    1388.61    1018.75   1.363    0.180    
rankAssoc    5292.36    1145.40   4.621 3.22e-05 ***
rankProf    11118.76    1351.77   8.225 1.62e-10 ***
sexFemale    1166.37     925.57   1.260    0.214    
year          476.31      94.91   5.018 8.65e-06 ***
ysdeg        -124.57      77.49  -1.608    0.115    
D1                NA         NA      NA       NA    
D2                NA         NA      NA       NA    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2398 on 45 degrees of freedom
Multiple R-squared:  0.855, Adjusted R-squared:  0.8357 
F-statistic: 44.24 on 6 and 45 DF,  p-value: < 2.2e-16

The document would not knit with the vif() call in this chunk: the call errors because the model above contains aliased coefficients (the redundant D1 and D2 dummies duplicate rank). The chunk is therefore shown commented out; it runs when the model is fit without the redundant dummies.

# vif(lm_model_ysdeg)   # variance inflation factor (VIF)
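A sketch that should knit, assuming the car package is installed, computes VIF on a model built from the original columns only (no aliased dummies):

library(car)   # provides vif()
lm_orig <- lm(salary ~ degree + rank + sex + year + ysdeg, data = salary)
vif(lm_orig)   # generalized VIFs; values well above 5 flag collinearity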

Interpretation:

When the linear regression model is run including all predictors, ysdeg has a p-value of 0.115, which is larger than the benchmark of 0.05 and is therefore not statistically significant. Furthermore, when running vif() on the model fit without the redundant dummies, the variance inflation factor for ysdeg is 8.967, significantly larger than for the other variables and well above the commonly accepted cutoff of 5. This indicates potentially severe correlation between ysdeg and the other predictor variables in the model.


As described above, I create the dummy variable hires: faculty hired by the new Dean (ysdeg ≤ 15) are coded 1, and those hired by the previous Dean are coded 0. I then refit the model, replacing ysdeg with hires.

salary$hires<-ifelse(salary$ysdeg<=15,1,0)  #dummy variable 

new_dean<-lm(salary~rank+degree+sex+year+hires,data=salary)
summary(new_dean)                         # review model output

Call:
lm(formula = salary ~ rank + degree + sex + year + hires, data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-3403.3 -1387.0  -167.0   528.2  9233.8 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 13328.38    1483.38   8.985 1.33e-11 ***
rankAssoc    4972.66     997.17   4.987 9.61e-06 ***
rankProf    11096.95    1191.00   9.317 4.54e-12 ***
degreePhD     818.93     797.48   1.027   0.3100    
sexFemale     907.14     840.54   1.079   0.2862    
year          434.85      78.89   5.512 1.65e-06 ***
hires        2163.46    1072.04   2.018   0.0496 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2362 on 45 degrees of freedom
Multiple R-squared:  0.8594,    Adjusted R-squared:  0.8407 
F-statistic: 45.86 on 6 and 45 DF,  p-value: < 2.2e-16

Analysis:

To avoid multicollinearity I removed the variable with high correlation with the other predictors: ysdeg. Because of its overlap with year (and with the new hires variable, which is defined directly from it), it was not statistically significant in the full model (p-value 0.115, above the 0.05 benchmark) and had a variance inflation factor of about 8.9, well above the benchmark of 5.

When the model is refit omitting ysdeg and including hires, year remains statistically significant (p-value 1.65e-06), and the adjusted R-squared of 0.8407 indicates a strong model fit. The coefficient on hires is $2,163.46 with a p-value of 0.0496, just below 0.05, which provides modest support for the hypothesis that faculty hired by the new Dean are paid more, holding the other predictors constant. Avoiding multicollinearity matters because we want predictors that contribute independent information: when predictors are highly correlated, coefficient estimates become unstable and their standard errors inflate, so individual variables may not provide unique information.


Question 3


(SMSS 13.7 & 13.8 combined, modified)

(Data file: house.selling.price in smss R package)


Solutions


Part A

Using the house.selling.price data, run and report regression results modeling y = selling price (in dollars) in terms of size of home (in square feet) and whether the home is new (1 = yes; 0 = no). (In other words, price is the outcome variable and size and new are the explanatory variables.)

I first import the house selling price data and inspect it.

library(smss)
data(house.selling.price)
str(house.selling.price)
'data.frame':   100 obs. of  7 variables:
 $ case : int  1 2 3 4 5 6 7 8 9 10 ...
 $ Taxes: int  3104 1173 3076 1608 1454 2997 4054 3002 6627 320 ...
 $ Beds : int  4 2 4 3 3 3 3 3 5 3 ...
 $ Baths: int  2 1 2 2 3 2 2 2 4 2 ...
 $ New  : int  0 0 0 0 0 1 0 1 0 0 ...
 $ Price: int  279900 146500 237700 200000 159900 499900 265500 289900 587000 70000 ...
 $ Size : int  2048 912 1654 2068 1477 3153 1355 2075 3990 1160 ...

I then look at the first 10 rows of the data for ease of review and create a new object for the dataset.

head(house.selling.price,10)          #first 10 rows of data set
   case Taxes Beds Baths New  Price Size
1     1  3104    4     2   0 279900 2048
2     2  1173    2     1   0 146500  912
3     3  3076    4     2   0 237700 1654
4     4  1608    3     2   0 200000 2068
5     5  1454    3     3   0 159900 1477
6     6  2997    3     2   1 499900 3153
7     7  4054    3     2   0 265500 1355
8     8  3002    3     2   1 289900 2075
9     9  6627    5     4   0 587000 3990
10   10   320    3     2   0  70000 1160
selling_price<-house.selling.price     #house selling price

I fit a multiple linear regression modeling selling price as a function of home size and whether the home is new.

reg_selling_price<-lm(Price~New+Size,data=house.selling.price)  
summary(reg_selling_price)        #review of output

Call:
lm(formula = Price ~ New + Size, data = house.selling.price)

Residuals:
    Min      1Q  Median      3Q     Max 
-205102  -34374   -5778   18929  163866 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -40230.867  14696.140  -2.738  0.00737 ** 
New          57736.283  18653.041   3.095  0.00257 ** 
Size           116.132      8.795  13.204  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 53880 on 97 degrees of freedom
Multiple R-squared:  0.7226,    Adjusted R-squared:  0.7169 
F-statistic: 126.3 on 2 and 97 DF,  p-value: < 2.2e-16


I also run a correlation test to examine the relationship between Size and New.

cor.test(selling_price$Size, selling_price$New) #correlation test

    Pearson's product-moment correlation

data:  selling_price$Size and selling_price$New
t = 4.1212, df = 98, p-value = 7.891e-05
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.2032530 0.5399831
sample estimates:
      cor 
0.3843277 

Analysis:

Controlling for one another, the predictor variables New and Size have p-values of 0.00257 and < 2e-16 respectively, both statistically significant at the 0.05 benchmark. This indicates we can reject the null hypothesis \(H_{0}\) that there is no association between New and Price, and likewise reject the null hypothesis of no association between Size and Price. The correlation test between Size and New gives a correlation of 0.3843277, which indicates a weak-to-moderate positive correlation.


Part B

Report and interpret the prediction equation, and form separate equations relating selling price to size for new and for not new homes. In particular, for each variable; discuss statistical significance and interpret the meaning of the coefficient.


We first start with a modification of the previously seen linear regression model.

\(E(y) = \alpha + \beta_{1}x_{1} + \beta_{2}x_{2}\)

where:

\(E(y)\) is the expected selling price

\(\alpha\) is the intercept

\(\beta_{1}\) and \(\beta_{2}\) are the slopes of the explanatory variables (whether the home is new, and its size)

Each coefficient \(\beta\) is the expected change in the outcome for a one-unit increase in the corresponding predictor, holding the other predictor fixed.

I now run the multiple linear regression.

reg_selling_price<-lm(Price~New+Size,data = selling_price)
summary(reg_selling_price)     #view regression model output

Call:
lm(formula = Price ~ New + Size, data = selling_price)

Residuals:
    Min      1Q  Median      3Q     Max 
-205102  -34374   -5778   18929  163866 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -40230.867  14696.140  -2.738  0.00737 ** 
New          57736.283  18653.041   3.095  0.00257 ** 
Size           116.132      8.795  13.204  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 53880 on 97 degrees of freedom
Multiple R-squared:  0.7226,    Adjusted R-squared:  0.7169 
F-statistic: 126.3 on 2 and 97 DF,  p-value: < 2.2e-16

We then extract the coefficients of the model for easier readability.

reg_coefs <- coef(lm(Price ~ New + Size, data = selling_price))   # named coefficient vector
print(reg_coefs)    # (Intercept) -40230.867, New 57736.283, Size 116.132

We now plug the coefficients into the prediction equation, where the dummy variable New is 1 for new homes and 0 for not-new homes:

\(\hat{y} = -40231 + 57736(New) + 116.1(Size)\)

For not-new homes (New = 0): \(\hat{y} = -40231 + 116.1(Size)\)

For new homes (New = 1): \(\hat{y} = -40231 + 57736 + 116.1(Size) = 17505 + 116.1(Size)\)

Analysis:

We see that for both new and not-new houses, each additional square foot of size is associated with a predicted price increase of about $116. Holding size fixed, a new home is predicted to sell for about $57,736 more than a not-new home. In this model, without an interaction term, the two effects are additive and separate.


Part C

Find the predicted selling price for a home of 3000 square feet that is (i) new, (ii) not new.


(i)

We can use the prediction equations from Part B. For a new home (New = 1), the estimated equation is:

\(E(y) = 17505 + 116.10(3000)\)

new = 17505              #variable created for "new" retrieved from previous problem
sq_ft = 116.10           #variable created for "square feet"

new_house_price<-(new+sq_ft*3000)         #object created for "new_house_price"
print(new_house_price)                    #view output
[1] 365805

New_House = $365,805

(ii)

Estimated equation: \(E(y) = -40230.9 + 116.10(3000)\)

not_new_price = -40230.9        #new object created  "not_new_price" for used houses
sq_ft   = 116.10

not_new_price<-(not_new_price+sq_ft*3000)
                    
print(not_new_price)   #view output
[1] 308069.1

Not_New_House = $308,069
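The same two predictions could be obtained directly with predict() on the fitted model; a sketch:

fit <- lm(Price ~ New + Size, data = house.selling.price)
predict(fit, newdata = data.frame(New = c(1, 0), Size = 3000))   # new, then not new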


Part D

Fit another model, this time with an interaction term allowing interaction between size and new, and report the regression results.

Here we fit a linear model showing the interaction between New and Size.

                              #interaction between size and new
new_size_reg<-(lm(Price~New*Size,data = house.selling.price)) 
summary(new_size_reg)         #view regression model output

Call:
lm(formula = Price ~ New * Size, data = house.selling.price)

Residuals:
    Min      1Q  Median      3Q     Max 
-175748  -28979   -6260   14693  192519 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -22227.808  15521.110  -1.432  0.15536    
New         -78527.502  51007.642  -1.540  0.12697    
Size           104.438      9.424  11.082  < 2e-16 ***
New:Size        61.916     21.686   2.855  0.00527 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 52000 on 96 degrees of freedom
Multiple R-squared:  0.7443,    Adjusted R-squared:  0.7363 
F-statistic: 93.15 on 3 and 96 DF,  p-value: < 2.2e-16

The same model's coefficients, shown separately for readability:

#view regression model output
summary(lm(Price~New*Size,data = house.selling.price))$coefficients
                Estimate   Std. Error   t value     Pr(>|t|)
(Intercept) -22227.80793 15521.109973 -1.432102 1.553627e-01
New         -78527.50235 51007.641896 -1.539524 1.269661e-01
Size           104.43839     9.424079 11.082080 7.198590e-19
New:Size        61.91588    21.685692  2.855149 5.271610e-03

Analysis:

The interaction term New:Size has a p-value of 0.00527, which is less than the standard benchmark of 0.05 and is therefore statistically significant: the effect of size on price differs between new and not-new homes.


Part E

Report the lines relating the predicted selling price to the size for homes that are (i) new, (ii) not new.


(i) and (ii)

From the interaction model, the lines relating predicted selling price to size are:

Not new (New = 0): \(\hat{y} = -22227.81 + 104.44(Size)\)

New (New = 1): \(\hat{y} = (-22227.81 - 78527.50) + (104.44 + 61.92)(Size) = -100755.31 + 166.35(Size)\)

We can visualize the two fits with ggplot, faceting on the dummy variable New (0 = not new, 1 = new):

library(ggplot2)
ggplot(house.selling.price, aes(x = Size, y = Price)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE, fullrange = TRUE) +
  facet_wrap(~ New) +
  labs(x = "Size in square feet", y = "Price in dollars",
       title = "Price vs Size for Houses that are Not New and New")
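The two lines can also be recovered programmatically from the fitted coefficients; a sketch:

b <- coef(lm(Price ~ New * Size, data = house.selling.price))
# not-new line: intercept b["(Intercept)"], slope b["Size"];
# new line: add the New offset to the intercept and the interaction to the slope
c(intercept_new = unname(b["(Intercept)"] + b["New"]),
  slope_new     = unname(b["Size"] + b["New:Size"]))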


Part F

Find the predicted selling price for a home of 3000 square feet that is (i) new, (ii) not new.

Predicted value of new house

(i)

I fit a linear model to determine the predicted selling price of a 3000 sq. ft. new home

new_house_predict<-lm(Price~New+Size+Size*New,data = house.selling.price)
summary(new_house_predict)     #review of output

Call:
lm(formula = Price ~ New + Size + Size * New, data = house.selling.price)

Residuals:
    Min      1Q  Median      3Q     Max 
-175748  -28979   -6260   14693  192519 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -22227.808  15521.110  -1.432  0.15536    
New         -78527.502  51007.642  -1.540  0.12697    
Size           104.438      9.424  11.082  < 2e-16 ***
New:Size        61.916     21.686   2.855  0.00527 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 52000 on 96 degrees of freedom
Multiple R-squared:  0.7443,    Adjusted R-squared:  0.7363 
F-statistic: 93.15 on 3 and 96 DF,  p-value: < 2.2e-16

Using the estimated equation, we plug in the coefficients from the interaction model calculated earlier. Note that for the new house we use the dummy variable New = 1, and for the not_new house we use New = 0.

Estimated equation:

\(E(y) = -22227.81 - 78527.50 + 104.44(3000) + 61.92(3000)\)

#Predicting new house price for home with 3000 sq ft.

new_house_predict_est<-(-22227.81-78527.50+104.44*3000+61.92*3000)
print(new_house_predict_est)                  
[1] 398324.7

New house predicted selling price (3000 sq. ft.) = $398,324.70

(ii)

We now calculate the predicted price for the used, i.e. not_new, house.

Estimated equation:

\(E(y) = -22227.81 + 104.44(3000) + 61.92(0)\)

not_new_predict_est<-(-22227.81+104.44*3000+61.92*0)
print(not_new_predict_est)
[1] 291092.2

Not_new house predicted selling price (3000 sq. ft.) = $291,092.20


Part G

Find the predicted selling price for a home of 1500 square feet that is (i) new, (ii) not new. Comparing to (F), explain how the difference in predicted selling prices changes as the size of home increases.

(i)

Predicting selling price of a new house of 1500 square feet

Estimated equation: \(E(y) = -22227.81 - 78527.50 + 104.44(1500) + 61.92(1500)\)

new_house_predict_est<-(-22227.81-78527.50+104.44*1500+61.92*1500)
print(new_house_predict_est)   
[1] 148784.7

New house predicted selling price (1500 sq. ft.) = $148,784.70

(ii)

Estimated equation: \(E(y) = -22227.81 + 104.44(1500) + 61.92(0)\)

#Predicted value of used house with 1500 square feet

not_new_predict_est<-(-22227.81+104.44*1500+61.92*0)
print(not_new_predict_est)
[1] 134432.2

Not_new house predicted selling price (1500 sq. ft.) = $134,432.20

Interpretation:

The predicted selling price difference between a new 3000 sq ft house and a new 1500 sq ft house is:

$398,324.70 - $148,784.70 = $249,540

The same difference expressed as a percentage:

predict_selling_new_diff<-(398324.70 - 148784.70)/((398324.70 + 148784.70)/2)*100
print(predict_selling_new_diff)   #review output
[1] 91.22124

Percentage difference between new 3000 sq.ft. and new 1500 sq.ft. is 91.22%

The predicted selling price difference between a new 3000 sq ft house and a not_new 3000 sq ft house is:

$398,324.70 - $291,092.20 = $107,232.50

The predicted selling price difference between a not_new 3000 sq ft and a not_new 1500 sq ft house is:

$291,092.20 - $134,432.20 = $156,660

Predicted selling price difference between not_new 3000 sq. ft and not_new 1500 sq ft house as a percentage

predict_selling_price_used_diff<-(291092.20 - 134432.20)/((291092.20 + 134432.20)/2)*100
print(predict_selling_price_used_diff)   #review output
[1] 73.6315

Percentage difference between a not_new 3000 sq ft and not_new 1500 sq ft house is 73.63%

The predicted selling price difference between a new 1500 sq ft house and a not_new 1500 sq ft house is:

$148,784.70 - $134,432.20 = $14,352.50

Overall, because the interaction term is positive, the gap between new and not_new homes widens as size increases: the new-home premium is $14,352.50 at 1500 sq ft but $107,232.50 at 3000 sq ft. As the size of a new house increases, it gains value faster than a comparably sized not_new house.
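A sketch that computes all four predictions from the interaction model and makes the widening gap explicit:

fit_int <- lm(Price ~ New * Size, data = house.selling.price)
nd <- expand.grid(New = c(0, 1), Size = c(1500, 3000))
cbind(nd, predicted = predict(fit_int, nd))   # the new-vs-not_new gap grows with Size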


Part H

Do you think the model with interaction or the one without it represents the relationship of size and new to the outcome price? What makes you prefer one model over another?


Model 1 with interaction term

new_house_predict<-lm(Price~New+Size+Size*New,data = house.selling.price)
summary(new_house_predict)     #review of output

Call:
lm(formula = Price ~ New + Size + Size * New, data = house.selling.price)

Residuals:
    Min      1Q  Median      3Q     Max 
-175748  -28979   -6260   14693  192519 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -22227.808  15521.110  -1.432  0.15536    
New         -78527.502  51007.642  -1.540  0.12697    
Size           104.438      9.424  11.082  < 2e-16 ***
New:Size        61.916     21.686   2.855  0.00527 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 52000 on 96 degrees of freedom
Multiple R-squared:  0.7443,    Adjusted R-squared:  0.7363 
F-statistic: 93.15 on 3 and 96 DF,  p-value: < 2.2e-16

Model 2 without interaction term

reg_selling_price<-lm(Price~New+Size,data=house.selling.price)  
summary(reg_selling_price)        #review of output

Call:
lm(formula = Price ~ New + Size, data = house.selling.price)

Residuals:
    Min      1Q  Median      3Q     Max 
-205102  -34374   -5778   18929  163866 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -40230.867  14696.140  -2.738  0.00737 ** 
New          57736.283  18653.041   3.095  0.00257 ** 
Size           116.132      8.795  13.204  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 53880 on 97 degrees of freedom
Multiple R-squared:  0.7226,    Adjusted R-squared:  0.7169 
F-statistic: 126.3 on 2 and 97 DF,  p-value: < 2.2e-16

Analysis:

The model with the interaction term has an adjusted \(R^{2}\) of 0.7363, compared to 0.7169 for the model without it, suggesting that the interaction term is contributing to the model. Comparing the \(R^{2}\) values, Model 1 carries greater explanatory power (0.7443 vs. 0.7226 in Model 2), and its interaction term is itself statistically significant (p = 0.00527). Therefore, Model 1 would be my preference.
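A formal comparison of the two nested models is the partial F test given by anova(); a sketch:

m2 <- lm(Price ~ New + Size, data = house.selling.price)   # Model 2: no interaction
m1 <- lm(Price ~ New * Size, data = house.selling.price)   # Model 1: with interaction
anova(m2, m1)   # a small p-value favors keeping the interaction term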