DACSS 603
(SMSS 11.2, except part (d))
For recent data in Jacksonville, Florida, on y = selling price of home (in dollars), x1 = size of home (in square feet), and x2 = lot size (in square feet), the prediction equation is ŷ = −10,536 + 53.8x1 + 2.84x2.
Part A
A particular home of 1240 square feet on a lot of 18,000 square feet sold for $145,000. Find the predicted selling price and the residual, and interpret.
Here’s what we have:
The selling price of the home depends on the size of the home and the lot size, so selling price is the dependent (response) variable. The size of the home is the first explanatory variable and lot size is the second. The multiple linear regression model for the problem is therefore:
\(Y_{i} = \beta_{0} + \beta_{1}x_{i1} + \beta_{2}x_{i2} + \epsilon_{i}\)
where:
\(Y_{i}\) is the dependent variable, the selling price of the home
\(\beta_{0}\) is the y-intercept
\(\beta_{1}\) and \(\beta_{2}\) are the slopes of the explanatory variables
\(x_{i1}\) is the first explanatory variable, the size of the home
\(x_{i2}\) is the second explanatory variable, the lot size
\(\epsilon_{i}\) is the error or residual
For a home of 1240 square feet on a lot of 18,000 square feet, we predict the selling price using the estimated prediction equation:
\(\hat{y}\) = -10,536 + 53.8(1240) + 2.84(18,000) = $107,296
Calculating selling price in R
house_selling_price<- -(10536)+53.8*1240+2.84*18000 # selling_price in R
print(house_selling_price) # review of output
[1] 107296
House selling price = $107296
Now we calculate the residual, which is the difference between the observed value and the value the model predicts for that observation. It is the vertical distance between a data point and the regression line, and a measure of how well the line fits that individual point.
residual_house_price<-145000-house_selling_price #residual price of house
print(residual_house_price)
[1] 37704
Since the residual is positive, the actual selling price is more than the predicted selling price: the house sold for $37,704 more than the model predicted.
Part B
For fixed lot size, how much is the house selling price predicted to increase for each square-foot increase in home size? Why?
The slope coefficient for the first explanatory variable, home size, is 53.8 and is positive. Thus, for every additional square-foot increase in the size of the home, the predicted selling price increases by $53.80, holding the other explanatory variable, lot size, fixed.
Part C
According to this prediction equation, for fixed home size, how much would the lot size need to increase to have the same impact as a one-square-foot increase in home size?
The slope coefficient for the second explanatory variable, lot size, is 2.84 and is positive: each additional square foot of lot size increases the predicted selling price by $2.84, holding home size fixed. Therefore, taking the equation:
\(\hat{y}\) = -10,536+53.8\(x_{1}\)+2.84\(x_{2}\)
We divide the home-size coefficient (53.8) by the lot-size coefficient (2.84): one extra square foot of home adds $53.80 while each extra square foot of lot adds $2.84, so the required lot increase is 53.8/2.84.
lot_increase<-53.8*1/2.84
print(lot_increase)
[1] 18.94366
The lot size would need to increase by about 18.94 (roughly 19) square feet to have the same impact on predicted selling price as a one-square-foot increase in home size.
(ALR, 5.17, slightly modified)
(Data file: salary in alr4 R package).
The data file concerns salary and other characteristics of all faculty in a small Midwestern college collected in the early 1980s for presentation in legal proceedings for which discrimination against women in salary was at issue. All persons in the data hold tenured or tenure track positions; temporary faculty are not included. The variables include degree, a factor with levels PhD and MS; rank, a factor with levels Asst, Assoc, and Prof; sex, a factor with levels Male and Female; year, years in current rank; ysdeg, years since highest degree; and salary, academic year salary in dollars.
First, I inspect the data set to get an understanding of what it contains.
glimpse(salary)
Rows: 52
Columns: 6
$ degree <fct> Masters, Masters, Masters, Masters, PhD, Masters, PhD~
$ rank <fct> Prof, Prof, Prof, Prof, Prof, Prof, Prof, Prof, Prof,~
$ sex <fct> Male, Male, Male, Female, Male, Male, Female, Male, M~
$ year <int> 25, 13, 10, 7, 19, 16, 0, 16, 13, 13, 12, 15, 9, 9, 9~
$ ysdeg <int> 35, 22, 23, 27, 30, 21, 32, 18, 30, 31, 22, 19, 17, 2~
$ salary <int> 36350, 35350, 28200, 26775, 33696, 28516, 24900, 3190~
summary(salary)
degree rank sex year ysdeg
Masters:34 Asst :18 Male :38 Min. : 0.000 Min. : 1.00
PhD :18 Assoc:14 Female:14 1st Qu.: 3.000 1st Qu.: 6.75
Prof :20 Median : 7.000 Median :15.50
Mean : 7.481 Mean :16.12
3rd Qu.:11.000 3rd Qu.:23.25
Max. :25.000 Max. :35.00
salary
Min. :15000
1st Qu.:18247
Median :23719
Mean :23798
3rd Qu.:27259
Max. :38045
head(salary,15) #First 15 rows of the data set
degree rank sex year ysdeg salary
1 Masters Prof Male 25 35 36350
2 Masters Prof Male 13 22 35350
3 Masters Prof Male 10 23 28200
4 Masters Prof Female 7 27 26775
5 PhD Prof Male 19 30 33696
6 Masters Prof Male 16 21 28516
7 PhD Prof Female 0 32 24900
8 Masters Prof Male 16 18 31909
9 PhD Prof Male 13 30 31850
10 PhD Prof Male 13 31 32850
11 Masters Prof Male 12 22 27025
12 Masters Assoc Male 15 19 24750
13 Masters Prof Male 9 17 28200
14 PhD Assoc Male 9 27 23712
15 Masters Prof Male 9 24 25748
Part A
Test the hypothesis that the mean salary for men and women is the same, without regard to any other variable but sex. Explain your findings.
Since the salary data from the alr4 package were already imported and inspected above, I move straight to comparing the salaries of men and women.
We see that mean salary for men is $24,696.79 and mean salary for women is $21,357.14 a difference of $3339.65.
Now we test the hypothesis
Hypothesis test:
\(H_{0}\): \(\mu_{\text{men}} = \mu_{\text{women}}\) (mean salaries of men and women are the same)
\(H_{a}\): \(\mu_{\text{men}} \neq \mu_{\text{women}}\) (there is a difference in mean salaries of men and women)
Significance level: 0.05
Test statistic: two-sample t-test
Now that I have a better understanding of the data, I compute the group means and check whether the variability in salary is similar for men and women.
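The code producing the group summary below is not shown in the original; a minimal dplyr sketch that would reproduce it:
library(dplyr)
salary %>%
  group_by(sex) %>%                                # split rows by sex
  summarize(mean = mean(salary), sd = sd(salary))  # group means and standard deviations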
# A tibble: 2 x 3
sex mean sd
<fct> <dbl> <dbl>
1 Male 24697. 5646.
2 Female 21357. 6152.
We now use the two-sample t-test (assuming equal variances) to carry out the test and obtain a 95% confidence interval.
t.test(salary~sex, data = salary, var.equal = T, #t-test to calculate the 95% CI
conf.level = 0.95, alternative = "two.sided")
Two Sample t-test
data: salary by sex
t = 1.8474, df = 50, p-value = 0.0706
alternative hypothesis: true difference in means between group Male and group Female is not equal to 0
95 percent confidence interval:
-291.257 6970.550
sample estimates:
mean in group Male mean in group Female
24696.79 21357.14
The p-value of the t-test is 0.0706. Since 0.0706 > 0.05, there is not enough evidence at the 5% significance level to reject the null hypothesis \(H_{0}\): if the mean salaries were truly equal, a difference at least this large would occur about 7.06% of the time. We therefore fail to reject the null hypothesis \(H_{0}\) that the mean salaries for men and women are the same.
Part B
Run a multiple linear regression with salary as the outcome variable and everything else as predictors, including sex. Assuming no interactions between sex and the other predictors, obtain a 95% confidence interval for the difference in salary between males and females.
First, we run a multiple regression model with salary as the outcome and all other variables as predictors.
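The fitting code is not shown; a one-line sketch that reproduces the Call below (lm_full is an assumed object name):
lm_full <- lm(salary ~ ., data = salary)  # regress salary on all other variables
summary(lm_full)                          # full regression table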
Call:
lm(formula = salary ~ ., data = salary)
Residuals:
Min 1Q Median 3Q Max
-4045.2 -1094.7 -361.5 813.2 9193.1
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 15746.05 800.18 19.678 < 2e-16 ***
degreePhD 1388.61 1018.75 1.363 0.180
rankAssoc 5292.36 1145.40 4.621 3.22e-05 ***
rankProf 11118.76 1351.77 8.225 1.62e-10 ***
sexFemale 1166.37 925.57 1.260 0.214
year 476.31 94.91 5.018 8.65e-06 ***
ysdeg -124.57 77.49 -1.608 0.115
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2398 on 45 degrees of freedom
Multiple R-squared: 0.855, Adjusted R-squared: 0.8357
F-statistic: 44.24 on 6 and 45 DF, p-value: < 2.2e-16
Part 2 of B
Assuming no interactions between sex and the other predictors, obtain a 95% confidence interval for the difference in salary between males and females.
I now calculate the 95% confidence interval for the difference between male and female salaries, using a t-test.
t.test(salary~sex, data = salary, var.equal = T, #t-test to calculate the 95% CI
conf.level = 0.95, alternative = "two.sided")
Two Sample t-test
data: salary by sex
t = 1.8474, df = 50, p-value = 0.0706
alternative hypothesis: true difference in means between group Male and group Female is not equal to 0
95 percent confidence interval:
-291.257 6970.550
sample estimates:
mean in group Male mean in group Female
24696.79 21357.14
The 95% confidence interval for the difference in mean salaries between males and females is:
[-291.257, 6970.550], where -291.257 is the lower bound and 6970.550 is the upper bound.
Using the summary() function on a simple linear model of salary on sex, we once again see a p-value of 0.0706, or 7.06%. Furthermore, the point estimate for the sex variable is $3,339.65 (about $3,340) in favor of males.
Estimate Std. Error t value Pr(>|t|)
(Intercept) 24696.789 937.9776 26.32983 5.761530e-31
sexFemale -3339.647 1807.7156 -1.84744 7.060394e-02
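The problem asks for the interval based on the multiple regression in which sex is one of several predictors. As a hedged sketch, that interval could be read directly off the full fit from above with confint() (lm_full is the assumed object name):
confint(lm_full, "sexFemale", level = 0.95)  # CI for the sex coefficient, adjusting for all other predictors
Using the displayed estimate (1166.37) and standard error (925.57) with \(t_{0.975,\,45} \approx 2.014\), this works out to roughly (-698, 3031) for the female-minus-male difference after adjustment.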
Part C
Interpret your finding for each predictor variable; discuss (a) statistical significance, (b) interpretation of the coefficient / slope in relation to the outcome variable and other variables.
Here, I calculate the multiple linear regression to determine the statistical significance of the predictor variables.
Call:
lm(formula = salary ~ ., data = salary)
Residuals:
Min 1Q Median 3Q Max
-4045.2 -1094.7 -361.5 813.2 9193.1
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 15746.05 800.18 19.678 < 2e-16 ***
degreePhD 1388.61 1018.75 1.363 0.180
rankAssoc 5292.36 1145.40 4.621 3.22e-05 ***
rankProf 11118.76 1351.77 8.225 1.62e-10 ***
sexFemale 1166.37 925.57 1.260 0.214
year 476.31 94.91 5.018 8.65e-06 ***
ysdeg -124.57 77.49 -1.608 0.115
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2398 on 45 degrees of freedom
Multiple R-squared: 0.855, Adjusted R-squared: 0.8357
F-statistic: 44.24 on 6 and 45 DF, p-value: < 2.2e-16
Intercept: The estimated salary is $15,746.05 for the baseline case: a male Assistant professor with a Masters degree, 0 years in rank, and 0 years since highest degree.
degreePhD: A person holding a PhD is predicted to earn $1,388.61 more than a Masters holder, holding all other variables constant. The p-value is 0.180 > 0.05, so this effect is not statistically significant.
rankAssoc: A person at the Associate rank is predicted to earn $5,292.36 more than one at the baseline Assistant rank, holding all other variables constant. The p-value is 3.22e-05 < 0.05 and is statistically significant.
rankProf: A person at the Professor rank is predicted to earn $11,118.76 more than one at the Assistant rank, holding all other variables constant. The p-value is 1.62e-10 < 0.05 and is statistically significant.
sexFemale: Females are predicted to earn $1,166.37 more than males, holding all other variables constant. The p-value is 0.214 > 0.05, so this effect is not statistically significant.
year: Each extra year spent in the current rank is associated with a $476.31 increase in salary, holding all other variables constant. The p-value is 8.65e-06 < 0.05 and is statistically significant.
ysdeg: Each additional year since earning the highest degree is associated with a $124.57 decrease in salary, holding all other variables constant. The p-value is 0.115 > 0.05, so this effect is not statistically significant.
Part D
Change the baseline category for the rank variable. Interpret the coefficients related to rank again.
I first create new dummy variables for the rank variable so that Associate becomes the baseline category.
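The dummy-construction code is not shown; a sketch consistent with the coefficients below, where D1 indicates Assistant and D2 indicates Professor so that Associate is the baseline (relevel(salary$rank, ref = "Assoc") would be an equivalent shortcut):
salary$D1 <- as.numeric(salary$rank == "Asst")  # 1 = Assistant, 0 otherwise
salary$D2 <- as.numeric(salary$rank == "Prof")  # 1 = Professor, 0 otherwise
lm_rebased <- lm(salary ~ degree + D1 + D2 + sex + year + ysdeg, data = salary)
summary(lm_rebased)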
Revised linear models after changing the base category of rank
Call:
lm(formula = salary ~ degree + D1 + D2 + sex + year + ysdeg,
data = salary)
Residuals:
Min 1Q Median 3Q Max
-4045.2 -1094.7 -361.5 813.2 9193.1
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 21038.41 1109.12 18.969 < 2e-16 ***
degreePhD 1388.61 1018.75 1.363 0.180
D1 -5292.36 1145.40 -4.621 3.22e-05 ***
D2 5826.40 1012.93 5.752 7.28e-07 ***
sexFemale 1166.37 925.57 1.260 0.214
year 476.31 94.91 5.018 8.65e-06 ***
ysdeg -124.57 77.49 -1.608 0.115
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2398 on 45 degrees of freedom
Multiple R-squared: 0.855, Adjusted R-squared: 0.8357
F-statistic: 44.24 on 6 and 45 DF, p-value: < 2.2e-16
Revised Rank variables:
D1 = -5292.36 implies that a person holding the Assistant rank earns on average $5,292.36 less than a person holding the Associate rank, holding all other variables constant.
D2 = 5826.40 implies that a person holding the Professor rank earns on average $5,826.40 more than a person holding the Associate rank, holding all other variables constant.
Part E
Finkelstein (1980), in a discussion of the use of regression in discrimination cases, wrote, “[a] variable may reflect a position or status bestowed by the employer, in which case if there is discrimination in the award of the position or status, the variable may be ‘tainted.’” Thus, for example, if discrimination is at work in promotion of faculty to higher ranks, using rank to adjust salaries before comparing the sexes may not be acceptable to the courts.
Exclude the variable rank, refit, and summarize how your findings changed, if they did.
Excluding the variable rank in the linear regression model
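A sketch of the refit that produces the output below (lm_norank is an assumed object name):
lm_norank <- lm(salary ~ sex + degree + year + ysdeg, data = salary)  # rank excluded
summary(lm_norank)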
Call:
lm(formula = salary ~ sex + degree + year + ysdeg, data = salary)
Residuals:
Min 1Q Median 3Q Max
-8146.9 -2186.9 -491.5 2279.1 11186.6
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 17183.57 1147.94 14.969 < 2e-16 ***
sexFemale -1286.54 1313.09 -0.980 0.332209
degreePhD -3299.35 1302.52 -2.533 0.014704 *
year 351.97 142.48 2.470 0.017185 *
ysdeg 339.40 80.62 4.210 0.000114 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3744 on 47 degrees of freedom
Multiple R-squared: 0.6312, Adjusted R-squared: 0.5998
F-statistic: 20.11 on 4 and 47 DF, p-value: 1.048e-09
Excluding the rank variable, the coefficient for sexFemale is negative, indicating a salary advantage for males; however, the p-value of 0.33 is greater than 0.05, so the difference is not statistically significant.
Excluding the rank variable, the coefficient for degreePhD is negative, indicating a decrease in salary of approximately $3,299.35 for PhD holders relative to Masters holders. With a p-value of 0.0147, below the 0.05 benchmark, this is statistically significant.
Excluding the rank variable, the coefficient for year is positive, indicating an increase in salary of approximately $351.97 per year in rank. With a p-value of 0.017, below the 0.05 benchmark, this is statistically significant.
Excluding the rank variable, the coefficient for ysdeg is positive, indicating an increase in salary of approximately $339.40 per year since highest degree. With a p-value of 0.000114, below the 0.05 benchmark, this is statistically significant.
Excluding the rank variable, the coefficients for degreePhD, year, and ysdeg all have p-values below the standard benchmark of 0.05, indicating statistical significance, whereas sexFemale has a p-value above 0.05 and is not statistically significant. Of note, once rank is removed, degreePhD becomes statistically significant where before it was not; ysdeg flips from insignificant (and negative) to significant (and positive); year remains statistically significant in both models; and sexFemale remains statistically insignificant, which may indicate there is no sex-based salary discrimination in these data.
Part F
Everyone in this dataset was hired the year they earned their highest degree. It is also known that a new Dean was appointed 15 years ago, and everyone in the dataset who earned their highest degree 15 years ago or less than that has been hired by the new Dean. Some people have argued that the new Dean has been making offers that are a lot more generous to newly hired faculty than the previous one and that this might explain some of the variation in Salary.
Create a new variable that would allow you to test this hypothesis and run another multiple regression model to test this. Select variables carefully to make sure there is no multicollinearity. Explain why multicollinearity would be a concern in this case and how you avoided it. Do you find support for the hypothesis that the people hired by the new Dean are making higher than those that were not?
Let’s start with our hypotheses
Hypotheses:
\(H_{0}\): \(\mu_{\text{new}} \leq \mu_{\text{old}}\) (the mean salary of the new Dean's hires is no higher than that of the former Dean's hires)
\(H_{a}\): \(\mu_{\text{new}} > \mu_{\text{old}}\) (the mean salary of the new Dean's hires is higher than that of the former Dean's hires)
Faculty whose highest degree was earned 16 or more years ago (ysdeg of 16 or more) were hired by the old dean, and those at 15 years or less were hired by the new dean. I create a dummy variable for the dean who made the hire: new dean versus old dean is binary, and dummy variables are by definition dichotomous. New-dean hires (ysdeg of 15 or less) are coded 1 and old-dean hires (ysdeg of 16 or more) are coded 0, so a positive coefficient on the dummy means new-dean hires earn more.
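The construction code is not shown; a minimal sketch under the coding just described (the variable name hires is taken from the model output further below):
salary$hires <- as.numeric(salary$ysdeg <= 15)  # 1 = hired by the new Dean, 0 = hired by the old Dean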
I now check whether there is any multicollinearity in our model.
Call:
lm(formula = salary ~ ., data = salary)
Residuals:
Min 1Q Median 3Q Max
-4045.2 -1094.7 -361.5 813.2 9193.1
Coefficients: (2 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 15746.05 800.18 19.678 < 2e-16 ***
degreePhD 1388.61 1018.75 1.363 0.180
rankAssoc 5292.36 1145.40 4.621 3.22e-05 ***
rankProf 11118.76 1351.77 8.225 1.62e-10 ***
sexFemale 1166.37 925.57 1.260 0.214
year 476.31 94.91 5.018 8.65e-06 ***
ysdeg -124.57 77.49 -1.608 0.115
D1 NA NA NA NA
D2 NA NA NA NA
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2398 on 45 degrees of freedom
Multiple R-squared: 0.855, Adjusted R-squared: 0.8357
F-statistic: 44.24 on 6 and 45 DF, p-value: < 2.2e-16
The document would not knit with the vif() call in place, because the model above contains aliased coefficients (the D1 and D2 dummies duplicate rank), so the chunk is shown commented out; it runs when separated from the rest of the model.
# vif(lm_model_ysdeg)  # variance inflation factor (VIF)
When the linear regression model is run including all predictors, ysdeg has a p-value of 0.115, larger than the benchmark of 0.05, and is therefore not statistically significant. Furthermore, when the variance inflation factor (VIF) is computed, ysdeg has a VIF of 8.967, noticeably larger than the other variables and well above the commonly accepted threshold of 5. This indicates potentially severe correlation between ysdeg and the other predictors in the model, notably year.
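As a sketch, the VIFs can be computed with the car package once the redundant dummies are dropped, since vif() errors out on models with aliased coefficients (values reported for factors are generalized VIFs):
library(car)
vif(lm(salary ~ degree + rank + sex + year + ysdeg, data = salary))  # one (G)VIF per predictor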
With the hires dummy defined as above, I refit the model, dropping ysdeg.
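A sketch of the refit that produces the output below (lm_hires is an assumed object name):
lm_hires <- lm(salary ~ rank + degree + sex + year + hires, data = salary)  # ysdeg dropped, hires added
summary(lm_hires)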
Call:
lm(formula = salary ~ rank + degree + sex + year + hires, data = salary)
Residuals:
Min 1Q Median 3Q Max
-3403.3 -1387.0 -167.0 528.2 9233.8
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 13328.38 1483.38 8.985 1.33e-11 ***
rankAssoc 4972.66 997.17 4.987 9.61e-06 ***
rankProf 11096.95 1191.00 9.317 4.54e-12 ***
degreePhD 818.93 797.48 1.027 0.3100
sexFemale 907.14 840.54 1.079 0.2862
year 434.85 78.89 5.512 1.65e-06 ***
hires 2163.46 1072.04 2.018 0.0496 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2362 on 45 degrees of freedom
Multiple R-squared: 0.8594, Adjusted R-squared: 0.8407
F-statistic: 45.86 on 6 and 45 DF, p-value: < 2.2e-16
To avoid multicollinearity I removed the predictor with high correlation with the others in the original model: ysdeg. Because everyone in the data was hired the year they earned their highest degree, ysdeg overlaps heavily with year and determines hires exactly. It also had a p-value of 0.115, above the 0.05 benchmark, and a variance inflation factor (VIF) of about 8.9, well exceeding the benchmark of 5.
When the linear model is run omitting the ysdeg variable, year has a statistically significant p-value of 1.65e-06, and the adjusted R-squared of 0.8407 indicates the model explains a large share of the variation in salary. Avoiding multicollinearity matters because we want predictors that contribute independent information: when predictors are highly correlated, coefficient estimates become unstable and their standard errors inflate, so the variables may not provide unique or independent information. As for the hypothesis itself, the hires coefficient is 2163.46 with a p-value of 0.0496, just below 0.05, so there is modest support for the claim that faculty hired by the new Dean are paid more, holding rank, degree, sex, and year in rank constant.
(SMSS 13.7 & 13.8 combined, modified)
(Data file: house.selling.price in smss R package)
Part A
Using the house.selling.price data, run and report regression results modeling y = selling price (in dollars) in terms of size of home (in square feet) and whether the home is new (1 = yes; 0 = no). (In other words, price is the outcome variable and size and new are the explanatory variables.)
I first import the house selling price data and inspect it.
'data.frame': 100 obs. of 7 variables:
$ case : int 1 2 3 4 5 6 7 8 9 10 ...
$ Taxes: int 3104 1173 3076 1608 1454 2997 4054 3002 6627 320 ...
$ Beds : int 4 2 4 3 3 3 3 3 5 3 ...
$ Baths: int 2 1 2 2 3 2 2 2 4 2 ...
$ New : int 0 0 0 0 0 1 0 1 0 0 ...
$ Price: int 279900 146500 237700 200000 159900 499900 265500 289900 587000 70000 ...
$ Size : int 2048 912 1654 2068 1477 3153 1355 2075 3990 1160 ...
I then look at the first 10 rows of the data for ease of review and create a new object for the dataset.
head(house.selling.price,10) #first 10 rows of data set
case Taxes Beds Baths New Price Size
1 1 3104 4 2 0 279900 2048
2 2 1173 2 1 0 146500 912
3 3 3076 4 2 0 237700 1654
4 4 1608 3 2 0 200000 2068
5 5 1454 3 3 0 159900 1477
6 6 2997 3 2 1 499900 3153
7 7 4054 3 2 0 265500 1355
8 8 3002 3 2 1 289900 2075
9 9 6627 5 4 0 587000 3990
10 10 320 3 2 0 70000 1160
selling_price<-house.selling.price #house selling price
I fit a multiple linear regression relating selling price to the size of the home and whether the home is new.
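A sketch of the fit that produces the Call below (price_model is an assumed object name):
price_model <- lm(Price ~ New + Size, data = house.selling.price)  # price on new-home indicator and size
summary(price_model)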
Call:
lm(formula = Price ~ New + Size, data = house.selling.price)
Residuals:
Min 1Q Median 3Q Max
-205102 -34374 -5778 18929 163866
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -40230.867 14696.140 -2.738 0.00737 **
New 57736.283 18653.041 3.095 0.00257 **
Size 116.132 8.795 13.204 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 53880 on 97 degrees of freedom
Multiple R-squared: 0.7226, Adjusted R-squared: 0.7169
F-statistic: 126.3 on 2 and 97 DF, p-value: < 2.2e-16
The fitted coefficients of the house selling price model are: intercept = -40,230.87, New = 57,736.28, and Size = 116.13.
I perform a correlation test to examine the relationship between the two predictors, New and Size.
cor.test(selling_price$Size, selling_price$New) #correlation test
Pearson's product-moment correlation
data: selling_price$Size and selling_price$New
t = 4.1212, df = 98, p-value = 7.891e-05
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.2032530 0.5399831
sample estimates:
cor
0.3843277
Controlling for each other, the predictors New and Size have p-values of 0.00257 and < 2e-16 respectively, both below the 0.05 benchmark and therefore statistically significant. We can thus reject the null hypothesis \(H_{0}\) that New has no effect on Price, and likewise reject the null hypothesis that Size has no effect on Price. The correlation test between the two predictors themselves gives a correlation of 0.3843277 between Size and New, a weak correlation, so collinearity between the predictors is not a serious concern.
Part B
Report and interpret the prediction equation, and form separate equations relating selling price to size for new and for not new homes. In particular, for each variable; discuss statistical significance and interpret the meaning of the coefficient.
We start from the general multiple regression model:
\(E(y) = \alpha + \beta_{1}x_{1} + \beta_{2}x_{2} + \dots + \beta_{p}x_{p}\)
where:
\(E(y)\) is the expected selling price (new or not new)
\(y\) is the dependent/outcome variable
\(\alpha\) is the intercept
\(\beta_{1}\) and \(\beta_{2}\) are the slopes of the explanatory variables
each coefficient \(\beta_{p}\) is the expected increase in the outcome \(y\) for a one-unit increase in the corresponding predictor \(x_{p}\), holding the other predictors fixed
I now run the multiple linear regression.
Call:
lm(formula = Price ~ New + Size, data = selling_price)
Residuals:
Min 1Q Median 3Q Max
-205102 -34374 -5778 18929 163866
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -40230.867 14696.140 -2.738 0.00737 **
New 57736.283 18653.041 3.095 0.00257 **
Size 116.132 8.795 13.204 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 53880 on 97 degrees of freedom
Multiple R-squared: 0.7226, Adjusted R-squared: 0.7169
F-statistic: 126.3 on 2 and 97 DF, p-value: < 2.2e-16
We then pull out the model coefficients for easier readability: intercept = -40,230.87, New = 57,736.28, and Size = 116.13.
We now plug the coefficients into the prediction equation, where the dummy variable New is 1 for new homes and 0 for not-new homes:
\(\hat{y} = -40231 + 116.1(\text{Size}) + 57736(\text{New})\)
For new homes (New = 1): \(\hat{y} = (-40231 + 57736) + 116.1(\text{Size}) = 17505 + 116.1(\text{Size})\)
For not-new homes (New = 0): \(\hat{y} = -40231 + 116.1(\text{Size})\)
Both predictors are statistically significant (Size: p < 2e-16; New: p = 0.00257). Each additional square foot of size is associated with a $116.13 increase in predicted price, for new and not-new homes alike, and at any given size a new home is predicted to sell for $57,736 more than a comparable not-new home. Because the model has no interaction term, the two effects are additive and the two prediction lines are parallel.
Part C
Find the predicted selling price for a home of 3000 square feet that is (i) new, (ii) not new.
(i)
We can use the results from Part B to answer the following questions:
Using the new-home line from Part B:
\(E(y) = 17505 + 116.10(3000)\)
new = 17505 #variable created for "new" retrieved from previous problem
sq_ft = 116.10 #variable created for "square feet"
new_house_price<-(new+sq_ft*3000) #object created for "new_house_price"
print(new_house_price) #view output
[1] 365805
New_House = $365,805
(ii)
Using the not-new line: \(E(y) = -40230.9 + 116.10(3000)\)
not_new_price = -40230.9 #new object created "not_new_price" for used houses
sq_ft = 116.10
not_new_price<-(not_new_price+sq_ft*3000)
print(not_new_price) #view output
[1] 308069.1
Not_New_House = $308,069
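The hand calculations above use rounded coefficients; as a sketch, predict() on the fitted model (price_model, the assumed object name from above) gives the same answers up to rounding:
predict(price_model, newdata = data.frame(Size = 3000, New = c(1, 0)))  # new home first, then not new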
Part D
Fit another model, this time with an interaction term allowing interaction between size and new, and report the regression results.
Here we fit a linear model showing the interaction between New and Size.
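A sketch of the fit that produces the Call below; in R, New * Size expands to New + Size + New:Size (interaction_model is an assumed object name):
interaction_model <- lm(Price ~ New * Size, data = house.selling.price)
summary(interaction_model)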
Call:
lm(formula = Price ~ New * Size, data = house.selling.price)
Residuals:
Min 1Q Median 3Q Max
-175748 -28979 -6260 14693 192519
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -22227.808 15521.110 -1.432 0.15536
New -78527.502 51007.642 -1.540 0.12697
Size 104.438 9.424 11.082 < 2e-16 ***
New:Size 61.916 21.686 2.855 0.00527 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 52000 on 96 degrees of freedom
Multiple R-squared: 0.7443, Adjusted R-squared: 0.7363
F-statistic: 93.15 on 3 and 96 DF, p-value: < 2.2e-16
The model coefficients, shown at full precision for better readability:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -22227.80793 15521.109973 -1.432102 1.553627e-01
New -78527.50235 51007.641896 -1.539524 1.269661e-01
Size 104.43839 9.424079 11.082080 7.198590e-19
New:Size 61.91588 21.685692 2.855149 5.271610e-03
The interaction term New:Size has a p-value of 0.00527, less than the standard benchmark of 0.05, and is thus statistically significant: the effect of size on price differs between new and not-new homes.
Part E
Report the lines relating the predicted selling price to the size for homes that are (i) new, (ii) not new.
(i) and (ii)
From the interaction model, the lines relating predicted selling price to size are:
New homes (New = 1): \(\hat{y} = (-22227.81 - 78527.50) + (104.44 + 61.92)(\text{Size}) = -100755.31 + 166.35(\text{Size})\)
Not-new homes (New = 0): \(\hat{y} = -22227.81 + 104.44(\text{Size})\)
We also plot the fitted lines using ggplot, faceting on the dummy variable New (0 = not new, 1 = new).
ggplot(new_size_reg, aes(x = Size, y = Price)) +            # price on the y-axis, size on the x-axis
  geom_point() +
  geom_smooth(method = "lm", se = TRUE, fullrange = TRUE) + # fitted regression line in each panel
  facet_wrap(. ~ New) +                                     # one panel per level of New
  labs(x = "Size in square feet", y = "Price in dollars",
       title = "Price vs Size for Houses that are Not_New and New")
Part F
Find the predicted selling price for a home of 3000 square feet that is (i) new, (ii) not new.
Predicted value of new house
(i)
I fit a linear model to determine the predicted selling price of a 3000 sq. ft. new home
Call:
lm(formula = Price ~ New + Size + Size * New, data = house.selling.price)
Residuals:
Min 1Q Median 3Q Max
-175748 -28979 -6260 14693 192519
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -22227.808 15521.110 -1.432 0.15536
New -78527.502 51007.642 -1.540 0.12697
Size 104.438 9.424 11.082 < 2e-16 ***
New:Size 61.916 21.686 2.855 0.00527 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 52000 on 96 degrees of freedom
Multiple R-squared: 0.7443, Adjusted R-squared: 0.7363
F-statistic: 93.15 on 3 and 96 DF, p-value: < 2.2e-16
Using our trusty estimated equation, we plug in the coefficients from the interaction model calculated earlier. For the new house we use the dummy value New = 1, and for the not-new house New = 0.
Estimated equation:
\(E(y) = -22227.81 - 78527.50 + 104.44(3000) + 61.92(3000)\)
#Predicting new house price for home with 3000 sq ft.
new_house_predict_est<-(-22227.81-78527.50+104.44*3000+61.92*3000)
print(new_house_predict_est)
[1] 398324.7
New house predicted selling price(3000 sq.ft.) = $398324.70
(ii)
We now calculate the predicted price for the used, i.e. not_new, house.
Estimated equation:
\(E(y) = -22227.81 + 104.44(3000) + 61.92(0)\)
not_new_predict_est<-(-22227.81+104.44*3000+61.92*0)
print(not_new_predict_est)
[1] 291092.2
Not_new house predicted selling price(3000 sq.ft.) = $291092.20
Part G
Find the predicted selling price for a home of 1500 square feet that is (i) new, (ii) not new. Comparing to (F), explain how the difference in predicted selling prices changes as the size of home increases.
(i)
Predicting selling price of a new house of 1500 square feet
Estimated equation: \(E(y) = -22227.81 - 78527.50 + 104.44(1500) + 61.92(1500)\)
new_house_predict_est<-(-22227.81-78527.50+104.44*1500+61.92*1500)
print(new_house_predict_est)
[1] 148784.7
New house predicted selling price(1500 sq.ft.) = $148784.70
(ii)
Estimated equation: \(E(y) = -22227.81 + 104.44(1500) + 61.92(0)\)
#Predicted value of used house with 1500 square feet
not_new_predict_est<-(-22227.81+104.44*1500+61.92*0)
print(not_new_predict_est)
[1] 134432.2
Not_new house predicted selling price(1500 sq.ft.) = $134432.20
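As a sketch, all four predictions (both sizes, new and not new) can be generated in one predict() call on the interaction fit (interaction_model is the assumed object name from Part D), again matching the hand calculations up to rounding:
predict(interaction_model,
        newdata = expand.grid(New = c(1, 0), Size = c(3000, 1500)))  # all four size/new combinations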
The predicted selling price difference between a new 3000 sq ft house and a new 1500 sq ft house is:
$398324.70-$148784.70 = $249540
Predicted selling price difference between a new 3000 sq ft and a new 1500 sq ft house, expressed as a percentage:
predict_selling_new_diff<-(398324.70 - 148784.70)/((398324.70 + 148784.70)/2)*100
print(predict_selling_new_diff) #review output
[1] 91.22124
Percentage difference between new 3000 sq.ft. and new 1500 sq.ft. is 91.22%
The predicted selling price difference between a new 3000 sq ft house and a not_new 3000 sq ft house is:
$398324.70-$291092.20 = $107232.50
The predicted selling price difference between a not_new 3000 sq ft and a not_new 1500 sq ft house is:
$291092.20-$134432.20 = $156660
Predicted selling price difference between not_new 3000 sq. ft and not_new 1500 sq ft house as a percentage
predict_selling_price_used_diff<-(291092.20 - 134432.20)/((291092.20 + 134432.20)/2)*100
print(predict_selling_price_used_diff) #review output
[1] 73.6315
Percentage difference between a not_new 3000 sq ft and not_new 1500 sq ft house is 73.63%
The predicted selling price difference between a new 1500 sq ft house and a not_new 1500 sq ft house is:
$148784.70-$134432.20 = $14352.50
Overall, the gap between new and not-new homes widens as size increases: the new-home premium is $14,352.50 at 1500 sq ft but $107,232.50 at 3000 sq ft. This is exactly what the positive interaction coefficient implies: each additional square foot adds $61.92 more to the predicted price of a new home than to that of a not-new home.
Part H
Do you think the model with interaction or the one without it represents the relationship of size and new to the outcome price? What makes you prefer one model over another?
Call:
lm(formula = Price ~ New + Size + Size * New, data = house.selling.price)
Residuals:
Min 1Q Median 3Q Max
-175748 -28979 -6260 14693 192519
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -22227.808 15521.110 -1.432 0.15536
New -78527.502 51007.642 -1.540 0.12697
Size 104.438 9.424 11.082 < 2e-16 ***
New:Size 61.916 21.686 2.855 0.00527 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 52000 on 96 degrees of freedom
Multiple R-squared: 0.7443, Adjusted R-squared: 0.7363
F-statistic: 93.15 on 3 and 96 DF, p-value: < 2.2e-16
Call:
lm(formula = Price ~ New + Size, data = house.selling.price)
Residuals:
Min 1Q Median 3Q Max
-205102 -34374 -5778 18929 163866
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -40230.867 14696.140 -2.738 0.00737 **
New 57736.283 18653.041 3.095 0.00257 **
Size 116.132 8.795 13.204 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 53880 on 97 degrees of freedom
Multiple R-squared: 0.7226, Adjusted R-squared: 0.7169
F-statistic: 126.3 on 2 and 97 DF, p-value: < 2.2e-16
The model with the interaction term has an adjusted \(R^{2}\) of 0.7363, compared to 0.7169 for the model without it, suggesting the interaction term contributes to the model even after penalizing the extra parameter. (Plain \(R^{2}\) also rises, 0.7443 in Model 1 vs. 0.7226 in Model 2, but \(R^{2}\) can never decrease when a term is added, so the adjusted value is the fairer comparison.) The interaction coefficient itself is statistically significant (p = 0.00527). Model 1, with the interaction, is therefore my preference.
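An equivalent way to choose between the nested models is an F-test via anova(); a sketch with assumed object names, whose p-value for the single added term matches the interaction's 0.00527:
m_add <- lm(Price ~ New + Size, data = house.selling.price)  # Model 2: no interaction
m_int <- lm(Price ~ New * Size, data = house.selling.price)  # Model 1: with interaction
anova(m_add, m_int)                                          # F-test for the added New:Size term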