library(distill)
library(dplyr)
library(tidyverse)
library(knitr)
library(alr4)
library(ggplot2)
library(smss)

Question 1

(SMSS 11.2, except part (d))

For recent data in Jacksonville, Florida, on y = selling price of home (in dollars), x1 = size of home (in square feet), and x2 = lot size (in square feet), the prediction equation is ŷ = −10,536 + 53.8x1 + 2.84x2.

A. A particular home of 1240 square feet on a lot of 18,000 square feet sold for $145,000. Find the predicted selling price and the residual, and interpret.

# predictive equation: ŷ = −10,536 + 53.8x1 + 2.84x2
yhat = -10536 + 53.8*1240 + 2.84*18000 
print(paste("The predicted selling price is", yhat))

[1] "The predicted selling price is 107296"

Residual = 145000-yhat
print(paste("The residual error is", Residual))

[1] "The residual error is 37704"

Interpretation: The property sold 37,704 higher than what the model predicted.

B. For fixed lot size, how much is the house selling price predicted to increase for each square-foot increase in home size? Why?

For fixed lot size, the positive regression coefficient is 53.8/ sq ft. This means a house selling price increases $53.8 for each sq ft increase in size, given that X2(independent variable) is constant.

C. According to this prediction equation, for fixed home size, how much would lot size need to increase to have the same impact as a one-square-foot increase in home size?

#Z=lot size increase to have the same impact as a 1 sq ft increase in home size
#(2.84)*(Z)=(53.8)*(1 sq ft)
x1_coef= 53.8 #house size
x2_coef= 2.84 #lot size
Z=x1_coef/x2_coef
Z

[1] 18.94366

Z1 <- as.integer(round(Z)) #round up answer
print(paste("A lot size increase of", Z1, "sq ft have same impact per 1 sq ft increase in house size"))

[1] "A lot size increase of 19 sq ft have same impact per 1 sq ft increase in house size"

Question 2

(ALR, 5.17, slightly modified)

(Data file: salary in alr4 R package). The data file concerns salary and other characteristics of all faculty in a small Midwestern college collected in the early 1980s for presentation in legal proceedings for which discrimination against women in salary was at issue. All persons in the data hold tenured or tenure track positions; temporary faculty are not included. The variables include degree, a factor with levels PhD and MS; rank, a factor with levels Asst, Assoc, and Prof; sex, a factor with levels Male and Female; Year, years in current rank; ysdeg, years since highest degree, and salary, academic year salary in dollars.

data("salary") # load the UN11 data
dim(salary)

[1] 52  6

kable(head(salary), format = "markdown", digits = 10,caption = "**1980 Salaries of faculty at Midwestern college**")

Table 1: **1980 Salaries of faculty at Midwestern college**
degree	rank	sex	year	ysdeg	salary
Masters	Prof	Male	25	35	36350
Masters	Prof	Male	13	22	35350
Masters	Prof	Male	10	23	28200
Masters	Prof	Female	7	27	26775
PhD	Prof	Male	19	30	33696
Masters	Prof	Male	16	21	28516

A. Test the hypothesis that the mean salary for men and women is the same, without regard to any other variable but sex. Explain your findings.

t.test(salary~sex, data=salary)


    Welch Two Sample t-test

data:  salary by sex
t = 1.7744, df = 21.591, p-value = 0.09009
alternative hypothesis: true difference in means between group Male and group Female is not equal to 0
95 percent confidence interval:
 -567.8539 7247.1471
sample estimates:
  mean in group Male mean in group Female 
            24696.79             21357.14

EXPLANATION: With a p-value of 0.09, we fail to reject the null hypothesis. We accept the alternative hypothesis: true difference in means between group Male and group Female is not equal to zero. The mean of Male group (24,696) is higher than mean of Female group (21,357)

B.1. Run a multiple linear regression with salary as the outcome variable and everything else as predictors, including sex.

MLM <-lm(salary ~ degree + rank + sex + year + ysdeg, data = salary)
summary(MLM)


Call:
lm(formula = salary ~ degree + rank + sex + year + ysdeg, data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-4045.2 -1094.7  -361.5   813.2  9193.1 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 15746.05     800.18  19.678  < 2e-16 ***
degreePhD    1388.61    1018.75   1.363    0.180    
rankAssoc    5292.36    1145.40   4.621 3.22e-05 ***
rankProf    11118.76    1351.77   8.225 1.62e-10 ***
sexFemale    1166.37     925.57   1.260    0.214    
year          476.31      94.91   5.018 8.65e-06 ***
ysdeg        -124.57      77.49  -1.608    0.115    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2398 on 45 degrees of freedom
Multiple R-squared:  0.855, Adjusted R-squared:  0.8357 
F-statistic: 44.24 on 6 and 45 DF,  p-value: < 2.2e-16

B.2. Assuming no interactions between sex and the other predictors, obtain a 95% confidence interval for the difference in salary between males and females.

t.test(salary~sex, data=salary)


    Welch Two Sample t-test

data:  salary by sex
t = 1.7744, df = 21.591, p-value = 0.09009
alternative hypothesis: true difference in means between group Male and group Female is not equal to 0
95 percent confidence interval:
 -567.8539 7247.1471
sample estimates:
  mean in group Male mean in group Female 
            24696.79             21357.14

The 95% confidence interval for the difference in salary between males and females is [-567.8539 , 7247.1471]

C. Interpret your finding for each predictor variable; discuss (a) statistical significance, (b) interpretation of the coefficient / slope in relation to the outcome variable and other variables

tab <- matrix(c("degree",  "NOT significant",  "PhD salary change =1388.61", "rank","YES- significant","Assoc salary change=5292.36", "sex","NOT significant","Female salary change= 1166.37","year","  YES- significant"," Salary change per year=476.31", "ysdegree","  NOT significant","  Salary change per ysdegree=-124.57 " ), ncol=3, byrow=TRUE)
colnames(tab) <- c('Variable','Statistical Significance','Slope:salary increase per change in variable')
tab <- as.table(tab)
kable(tab)

	Variable	Statistical Significance	Slope:salary increase per change in variable
A	degree	NOT significant	PhD salary change =1388.61
B	rank	YES- significant	Assoc salary change=5292.36
C	sex	NOT significant	Female salary change= 1166.37
D	year	YES- significant	Salary change per year=476.31
E	ysdegree	NOT significant	Salary change per ysdegree=-124.57

D. Change the baseline category for the rank variable. Interpret the coefficients related to rank again.

MLM_Rank <-lm(rank ~ degree + salary + sex + year + ysdeg, data = salary)
print(MLM_Rank)


Call:
lm(formula = rank ~ degree + salary + sex + year + ysdeg, data = salary)

Coefficients:
(Intercept)    degreePhD       salary    sexFemale         year  
 -0.5925499   -0.4710229    0.0001085   -0.3130130   -0.0616205  
      ysdeg  
  0.0469866

INTERPRETATION: Based on the coefficients, degree, sex and year variables have NEGATIVE linear relationship. The variables salary and ysdeg have POSITIVE linear relationship.

E. Finkelstein (1980), in a discussion of the use of regression in discrimination cases, wrote, “[a] variable may reflect a position or status bestowed by the employer, in which case if there is discrimination in the award of the position or status, the variable may be ‘tainted.’” Thus, for example, if discrimination is at work in promotion of faculty to higher ranks, using rank to adjust salaries before comparing the sexes may not be acceptable to the courts.

Exclude the variable rank, refit, and summarize how your findings changed, if they did.

MLM <-lm(salary ~ degree  + sex + year + ysdeg, data = salary)
summary(MLM)


Call:
lm(formula = salary ~ degree + sex + year + ysdeg, data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-8146.9 -2186.9  -491.5  2279.1 11186.6 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 17183.57    1147.94  14.969  < 2e-16 ***
degreePhD   -3299.35    1302.52  -2.533 0.014704 *  
sexFemale   -1286.54    1313.09  -0.980 0.332209    
year          351.97     142.48   2.470 0.017185 *  
ysdeg         339.40      80.62   4.210 0.000114 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3744 on 47 degrees of freedom
Multiple R-squared:  0.6312,    Adjusted R-squared:  0.5998 
F-statistic: 20.11 on 4 and 47 DF,  p-value: 1.048e-09

INTERPRETATION:
Excluding rank, the model is NOT a better fit with Multiple R-squared: 0.6312,Adjusted R-squared: 0.5998. The variables : degree, year and ysdeg are statistically significant. The results showed that Female sex had negative coefficient -1286.54.

F. Everyone in this dataset was hired the year they earned their highest degree. It is also known that a new Dean was appointed 15 years ago, and everyone in the dataset who earned their highest degree 15 years ago or less than that has been hired by the new Dean. Some people have argued that the new Dean has been making offers that are a lot more generous to newly hired faculty than the previous one and that this might explain some of the variation in Salary.

Create a new variable that would allow you to test this hypothesis and run another multiple regression model to test this. Select variables carefully to make sure there is no multicollinearity. Explain why multicollinearity would be a concern in this case and how you avoided it. Do you find support for the hypothesis that the people hired by the new Dean are making higher than those that were not?

lm(salary ~ rank+ sex + year*ysdeg, data = salary)


Call:
lm(formula = salary ~ rank + sex + year * ysdeg, data = salary)

Coefficients:
(Intercept)    rankAssoc     rankProf    sexFemale         year  
  16352.169     5254.351    10351.178      867.705      312.454  
      ysdeg   year:ysdeg  
    -86.317        5.581

Based on the new variable coefficient ysdeg:year, there is support for the hypothesis that the people hired by the new Dean are making higher than those that were not.

Question 3

(SMSS 13.7 & 13.8 combined, modified)

(Data file: house.selling.price in smss R package)

data("house.selling.price") 
dim(house.selling.price)

[1] 100   7

kable(head(house.selling.price))

case	Taxes	Beds	Baths	New	Price	Size
1	3104	4	2	0	279900	2048
2	1173	2	1	0	146500	912
3	3076	4	2	0	237700	1654
4	1608	3	2	0	200000	2068
5	1454	3	3	0	159900	1477
6	2997	3	2	1	499900	3153

A. Using the house.selling.price data, run and report regression results modeling y = selling price (in dollars) in terms of size of home (in square feet) and whether the home is new (1 = yes; 0 = no). (In other words, price is the outcome variable and size and new are the explanatory variables.)

Reg <-lm(Price ~ Size + New, data = house.selling.price)
summary(Reg)


Call:
lm(formula = Price ~ Size + New, data = house.selling.price)

Residuals:
    Min      1Q  Median      3Q     Max 
-205102  -34374   -5778   18929  163866 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -40230.867  14696.140  -2.738  0.00737 ** 
Size           116.132      8.795  13.204  < 2e-16 ***
New          57736.283  18653.041   3.095  0.00257 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 53880 on 97 degrees of freedom
Multiple R-squared:  0.7226,    Adjusted R-squared:  0.7169 
F-statistic: 126.3 on 2 and 97 DF,  p-value: < 2.2e-16

B. Report and interpret the prediction equation, and form separate equations relating selling price to size for new and for not new homes. In particular, for each variable; discuss statistical significance and interpret the meaning of the coefficient.

Price_New_aov <-aov(Price ~ New, data = house.selling.price)
summary.aov(Price_New_aov)

            Df    Sum Sq   Mean Sq F value   Pr(>F)    
New          1 2.274e+11 2.274e+11   28.29 6.61e-07 ***
Residuals   98 7.878e+11 8.039e+09                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

lm(Price ~ New, data = house.selling.price)


Call:
lm(formula = Price ~ New, data = house.selling.price)

Coefficients:
(Intercept)          New  
     138567       152396

INTERPRETATION: The coefficient 152396 means the selling price increase for variable “NEW” = 1.

C. Find the predicted selling price for a home of 3000 square feet that is (i) new, (ii) not new.

Dummy variable: 1=NEW, 0=NOT NEW

EQUATION: Price =138567 + 57736.283New +116.132 size

Price_NEW = 138567  + 57736.283*1 +116.132 *3000
print(paste("Predicted Selling Price for NEW home is", Price_NEW))

[1] "Predicted Selling Price for NEW home is 544699.283"

Price_OLD = 138567  + 57736.283*0 +116.132 *3000
print(paste("Predicted Selling Price for NOT NEW home is", Price_OLD))

[1] "Predicted Selling Price for NOT NEW home is 486963"

D. Fit another model, this time with an interaction term allowing interaction between size and new, and report the regression results

Size_New <-lm(Price ~ Size*New, data = house.selling.price)
summary(Size_New)


Call:
lm(formula = Price ~ Size * New, data = house.selling.price)

Residuals:
    Min      1Q  Median      3Q     Max 
-175748  -28979   -6260   14693  192519 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -22227.808  15521.110  -1.432  0.15536    
Size           104.438      9.424  11.082  < 2e-16 ***
New         -78527.502  51007.642  -1.540  0.12697    
Size:New        61.916     21.686   2.855  0.00527 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 52000 on 96 degrees of freedom
Multiple R-squared:  0.7443,    Adjusted R-squared:  0.7363 
F-statistic: 93.15 on 3 and 96 DF,  p-value: < 2.2e-16

This model is only slightly better fitting than previous model based on adjusted R-squared of 73.63% compared to 71.69%. The variables size and the new variable size:new are statistically significant with p-values < 0.05

E. Report the lines relating the predicted selling price to the size for homes that are (i) new, (ii) not new.

New_HSP <- subset(house.selling.price, New %in% c("1"))
Old_HSP <- subset(house.selling.price, New=="0")

ggplot(house.selling.price, aes(x = Size, y = Price, color=New)) +
     geom_point() + 
    geom_smooth(method = "lm",colour = 7) +
    labs(x="Size", y="Price", title = "Predicted selling price vs size for NEW and NOT NEW homes")

ggplot(New_HSP, aes(x = Size, y = Price, color=New)) +
     geom_point() + 
    geom_smooth(method = "lm",colour = 7)  +
    labs(x="Size", y="Price", title = "Predicted selling price vs size for NEW homes")

ggplot(Old_HSP, aes(x = Size, y = Price)) +
     geom_point() + 
    geom_smooth(method = "lm",colour = 7) +
    labs(x="Size", y="Price", title = "Predicted selling price vs size for NOT NEW homes")

F. Find the predicted selling price for a home of 3000 square feet that is (i) new, (ii) not new.

Dummy variable: 1=NEW, 0=NOT NEW

EQUATION: Price =138567 + 57736.283New +116.132 size

Price_NEW = 138567  + 57736.283*1 +116.132 *3000
print(paste("Predicted Selling Price for NEW home is", Price_NEW))

[1] "Predicted Selling Price for NEW home is 544699.283"

Price_OLD = 138567  + 57736.283*0 +116.132 *3000
print(paste("Predicted Selling Price for NOT NEW home is", Price_OLD))

[1] "Predicted Selling Price for NOT NEW home is 486963"

G. Find the predicted selling price for a home of 1500 square feet that is (i) new, (ii) not new. Comparing to (F), explain how the difference in predicted selling prices changes as the size of home increases.

Dummy variable: 1=NEW, 0=NOT NEW

EQUATION: Price =138567 + 57736.283New +116.132 size

Price_NEW_1500 = 138567  + 57736.283*1 +116.132 *1500
print(paste("Predicted Selling Price for 1500 sqft NEW home is", Price_NEW_1500))

[1] "Predicted Selling Price for 1500 sqft NEW home is 370501.283"

Price_OLD_1500 = 138567  + 57736.283*0 +116.132 *1500
print(paste("Predicted Selling Price for 1500 sqft NOT NEW home is", Price_OLD_1500))

[1] "Predicted Selling Price for 1500 sqft NOT NEW home is 312765"

Price_diff_3000 = Price_NEW - Price_OLD
Price_diff_1500 = Price_NEW_1500 - Price_OLD_1500
print(paste("There is no price difference in prices NEW vs NOT NEW 3000 sqft", Price_diff_3000," vs NEW vs NOT NEW 1500 sqft house",Price_diff_1500 ))

[1] "There is no price difference in prices NEW vs NOT NEW 3000 sqft 57736.2830000001  vs NEW vs NOT NEW 1500 sqft house 57736.283"

H. Do you think the model with interaction or the one without it represents the relationship of size and new to the outcome price? What makes you prefer one model over another?

See Figure 1 and 2 below. The model with the interaction Size:New is a better fitting than model without interaction based on adjusted R-squared of 73.63% compared to 71.69%. The variables are statistically significant with p-value: < 2.2e-16

ggplot(house.selling.price, aes(x = Size*New, y = Price)) +
   geom_point() +
 geom_smooth(method = "lm",colour = 7) +
     labs(x="Size", y="Price", title = "Figure 1:Predicted selling price With Interaction")

ggplot(house.selling.price, aes(x = Size, y = Price)) +
   geom_point() +
 geom_smooth(method = "lm",colour = 7) +
     labs(x="Size", y="Price", title = "Figure 2:Predicted selling price WITHOUT INTERACTION")

Multiple Regression

Question 1

Question 2

Question 3