# Loading packages
library(alr4)
## Loading required package: car
## Loading required package: carData
## Loading required package: effects
## lattice theme set by effectsTheme()
## See ?effectsTheme for details.
library(smss)
For recent data in Jacksonville, Florida, on y = selling price of home (in dollars), x1 = size of home (in square feet), and x2 = lot size (in square feet), the prediction equation is ŷ = −10,536 + 53.8x1 + 2.84x2.
A particular home of 1240 square feet on a lot of 18,000 square feet sold for $145,000. Find the predicted selling price and the residual, and interpret.
# Writing function that returns the predicted selling price from x1 and x2
y_func <- function(x1, x2) {-10536 + 53.8*x1 + 2.84*x2}
# Inputting x1 and x2 values to find predicted selling price
pred <- y_func(x1 = 1240, x2 = 18000)
pred
## [1] 107296
The predicted selling price is $107,296, while the actual selling price was $145,000.
# Finding residual
residual <- 145000 - pred
residual
## [1] 37704
The residual is $37,704, meaning this particular home sold for $37,704 more than the model predicts.
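As a quick check of the arithmetic above, the prediction and the residual can be wrapped in two small helpers. This is a minimal sketch; `predict_price` and `home_residual` are illustrative names and are not part of the original analysis.

```r
# Prediction equation from the problem statement:
# y-hat = -10536 + 53.8 * x1 + 2.84 * x2
predict_price <- function(x1, x2) {
  -10536 + 53.8 * x1 + 2.84 * x2
}

# Residual = observed price minus predicted price
home_residual <- function(observed, x1, x2) {
  observed - predict_price(x1, x2)
}

pred <- predict_price(x1 = 1240, x2 = 18000)
res <- home_residual(observed = 145000, x1 = 1240, x2 = 18000)
pred
res
```

A positive residual means the observed price lies above the fitted value for that home.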
For fixed lot size, how much is the house selling price predicted to increase for each square-foot increase in home size? Why?
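For a fixed lot size, the selling price is predicted to increase by $53.80 for each additional square foot of home size. This is because 53.8 is the coefficient of x1, which gives the change in predicted price per one-unit increase in x1 when x2 is held constant.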
According to this prediction equation, for fixed home size, how much would lot size need to increase to have the same impact as a one-square-foot increase in home size?
# Calculating how much lot size would need to increase to equal a 1 sqft increase in home size
53.8/2.84
## [1] 18.94366
The lot size would need to increase by 18.94 square feet to have the same impact as a one-square-foot increase in home size.
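The trade-off can be verified by checking that both changes move the predicted price by the same amount. The sketch below uses only the two slope coefficients from the prediction equation; the variable names are illustrative.

```r
# Slope coefficients from the prediction equation
b_size <- 53.8   # dollars per square foot of home size
b_lot  <- 2.84   # dollars per square foot of lot size

# Lot-size increase whose effect matches one extra square foot of home size
lot_equiv <- b_size / b_lot

# Both changes shift the predicted price by the same number of dollars
effect_home <- b_size * 1
effect_lot  <- b_lot * lot_equiv
all.equal(effect_home, effect_lot)
```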
The data file concerns salary and other characteristics of all faculty in a small Midwestern college collected in the early 1980s for presentation in legal proceedings for which discrimination against women in salary was at issue. All persons in the data hold tenured or tenure track positions; temporary faculty are not included. The variables include degree, a factor with levels PhD and MS; rank, a factor with levels Asst, Assoc, and Prof; sex, a factor with levels Male and Female; Year, years in current rank; ysdeg, years since highest degree, and salary, academic year salary in dollars.
# Loading data
data("salary")
head(salary)
## degree rank sex year ysdeg salary
## 1 Masters Prof Male 25 35 36350
## 2 Masters Prof Male 13 22 35350
## 3 Masters Prof Male 10 23 28200
## 4 Masters Prof Female 7 27 26775
## 5 PhD Prof Male 19 30 33696
## 6 Masters Prof Male 16 21 28516
Test the hypothesis that the mean salary for men and women is the same, without regard to any other variable but sex. Explain your findings.
# Creating linear regression model with sex as explanatory variable
summary(lm(salary ~ sex, data = salary))
##
## Call:
## lm(formula = salary ~ sex, data = salary)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8602.8 -4296.6 -100.8 3513.1 16687.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 24697 938 26.330 <2e-16 ***
## sexFemale -3340 1808 -1.847 0.0706 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5782 on 50 degrees of freedom
## Multiple R-squared: 0.0639, Adjusted R-squared: 0.04518
## F-statistic: 3.413 on 1 and 50 DF, p-value: 0.0706
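The coefficient for sexFemale is -3340, indicating that women earn on average $3,340 less than men when no other variable is considered. With a p-value of 0.0706, this difference is not statistically significant at the 0.05 level (only at the 0.10 level), so we fail to reject the hypothesis that mean salaries for men and women are the same.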
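Regressing on a single two-level factor is equivalent to a pooled-variance two-sample t-test: the slope is the difference in group means and the p-values coincide. The sketch below illustrates this on a small made-up data frame (`toy`, with hypothetical values), not on the salary data.

```r
# Hypothetical toy data (NOT the salary data set): pay for two groups
toy <- data.frame(
  pay   = c(30, 32, 35, 31, 28, 27, 29, 26),
  group = factor(rep(c("A", "B"), each = 4))
)

# Regression on a two-level factor
fit <- lm(pay ~ group, data = toy)
p_lm <- summary(fit)$coefficients["groupB", "Pr(>|t|)"]

# Pooled-variance two-sample t-test on the same data
p_t <- t.test(pay ~ group, data = toy, var.equal = TRUE)$p.value

# The slope equals the difference in group means, and the p-values agree
slope <- unname(coef(fit)["groupB"])
mean_diff <- mean(toy$pay[toy$group == "B"]) - mean(toy$pay[toy$group == "A"])
all.equal(p_lm, p_t)
all.equal(slope, mean_diff)
```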
Run a multiple linear regression with salary as the outcome variable and everything else as predictors, including sex. Assuming no interactions between sex and the other predictors, obtain a 95% confidence interval for the difference in salary between males and females.
# Linear regression with salary as outcome variable and everything else as predictors
model <- lm(formula = salary ~ ., data = salary)
confint(model)
## 2.5 % 97.5 %
## (Intercept) 14134.4059 17357.68946
## degreePhD -663.2482 3440.47485
## rankAssoc 2985.4107 7599.31080
## rankProf 8396.1546 13841.37340
## sexFemale -697.8183 3030.56452
## year 285.1433 667.47476
## ysdeg -280.6397 31.49105
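The 95% confidence interval for the difference in salary between women and men (the sexFemale coefficient), holding the other predictors fixed, runs from -$697.82 to $3,030.56. Because the interval contains zero, the data are consistent with no difference in salary between the sexes once the other variables are accounted for.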
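The sexFemale interval can be reproduced by hand as estimate ± t critical value × standard error. The sketch below assumes the values that `summary(model)` reports for sexFemale (estimate 1166.37, standard error 925.57) and the model's 45 residual degrees of freedom.

```r
# Estimate and standard error for sexFemale, copied from the full model summary
est <- 1166.37
se  <- 925.57
df  <- 45            # residual degrees of freedom of the full model

# 95% interval: estimate +/- t critical value * standard error
t_crit <- qt(0.975, df)
ci <- c(lower = est - t_crit * se, upper = est + t_crit * se)
ci
```

With the fitted model available, `confint(model, "sexFemale")` returns the same row directly.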
Interpret your finding for each predictor variable; discuss (a) statistical significance, (b) interpretation of the coefficient / slope in relation to the outcome variable and other variables
model <- lm(formula = salary ~ ., data = salary)
summary(model)
##
## Call:
## lm(formula = salary ~ ., data = salary)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4045.2 -1094.7 -361.5 813.2 9193.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15746.05 800.18 19.678 < 2e-16 ***
## degreePhD 1388.61 1018.75 1.363 0.180
## rankAssoc 5292.36 1145.40 4.621 3.22e-05 ***
## rankProf 11118.76 1351.77 8.225 1.62e-10 ***
## sexFemale 1166.37 925.57 1.260 0.214
## year 476.31 94.91 5.018 8.65e-06 ***
## ysdeg -124.57 77.49 -1.608 0.115
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2398 on 45 degrees of freedom
## Multiple R-squared: 0.855, Adjusted R-squared: 0.8357
## F-statistic: 44.24 on 6 and 45 DF, p-value: < 2.2e-16
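degreePhD - Not statistically significant (p-value of 0.180). Holding the other variables fixed, faculty with a PhD are predicted to earn $1,388.61 more than those with a Masters degree.
rankAssoc - Statistically significant (p-value of 3.22e-05). Holding the other variables fixed, associate professors are predicted to earn $5,292.36 more than assistant professors (the baseline rank).
rankProf - Statistically significant (p-value of 1.62e-10), with the smallest p-value of any explanatory variable. Holding the other variables fixed, full professors are predicted to earn $11,118.76 more than assistant professors.
sexFemale - Not statistically significant (p-value of 0.214). Holding the other variables fixed, women are predicted to earn $1,166.37 more than men.
year - Statistically significant (p-value of 8.65e-06). Each additional year in current rank is associated with a salary increase of $476.31, holding the other variables fixed.
ysdeg - Not statistically significant (p-value of 0.115). Each additional year since highest degree is associated with a salary decrease of $124.57, holding the other variables fixed.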
Change the baseline category for the rank variable. Interpret the coefficients related to rank again.
salary$rank <- relevel(salary$rank, ref = 'Prof')
summary(lm(salary ~ ., data = salary))
##
## Call:
## lm(formula = salary ~ ., data = salary)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4045.2 -1094.7 -361.5 813.2 9193.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26864.81 1375.29 19.534 < 2e-16 ***
## degreePhD 1388.61 1018.75 1.363 0.180
## rankAsst -11118.76 1351.77 -8.225 1.62e-10 ***
## rankAssoc -5826.40 1012.93 -5.752 7.28e-07 ***
## sexFemale 1166.37 925.57 1.260 0.214
## year 476.31 94.91 5.018 8.65e-06 ***
## ysdeg -124.57 77.49 -1.608 0.115
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2398 on 45 degrees of freedom
## Multiple R-squared: 0.855, Adjusted R-squared: 0.8357
## F-statistic: 44.24 on 6 and 45 DF, p-value: < 2.2e-16
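With a baseline category of Prof, the coefficient of rankAsst is -11118.76, meaning assistant professors are predicted to earn $11,118.76 less than full professors, holding the other variables fixed. The coefficient for rankAssoc is -5826.40, meaning associate professors are predicted to earn $5,826.40 less than full professors. Both differences are statistically significant, and the fit of the model (residuals, R-squared, other coefficients) is unchanged by the choice of baseline.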
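Changing the baseline only re-expresses the same fitted differences, so the releveled coefficients can be derived from the earlier ones. The sketch below uses the rank coefficients and intercept reported when Asst was the baseline; the variable names are illustrative.

```r
# Values from the model with Asst as the baseline rank
intercept_asst <- 15746.05
assoc_vs_asst  <- 5292.36
prof_vs_asst   <- 11118.76

# Re-expressed with Prof as the baseline
intercept_prof <- intercept_asst + prof_vs_asst   # baseline shifts to Prof
asst_vs_prof   <- -prof_vs_asst                   # sign flips
assoc_vs_prof  <- assoc_vs_asst - prof_vs_asst    # difference of differences

c(intercept = intercept_prof, rankAsst = asst_vs_prof, rankAssoc = assoc_vs_prof)
```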
Exclude the variable rank, refit, and summarize how your findings changed, if they did.
# Creating linear regression model that excludes rank
summary(lm(salary ~ degree + sex + year + ysdeg, data = salary))
##
## Call:
## lm(formula = salary ~ degree + sex + year + ysdeg, data = salary)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8146.9 -2186.9 -491.5 2279.1 11186.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17183.57 1147.94 14.969 < 2e-16 ***
## degreePhD -3299.35 1302.52 -2.533 0.014704 *
## sexFemale -1286.54 1313.09 -0.980 0.332209
## year 351.97 142.48 2.470 0.017185 *
## ysdeg 339.40 80.62 4.210 0.000114 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3744 on 47 degrees of freedom
## Multiple R-squared: 0.6312, Adjusted R-squared: 0.5998
## F-statistic: 20.11 on 4 and 47 DF, p-value: 1.048e-09
Excluding rank, ysdeg becomes the most statistically significant variable (p-value of 0.000114), and its coefficient changes sign from -124.57 to 339.40. The coefficient of degreePhD also changes sign (from 1388.61 to -3299.35) and becomes significant at the 0.05 level. The coefficient of sexFemale is -1286.54 without rank, whereas it is 1166.37 when rank is included. This suggests that women earn $1,286.54 less than men when rank is not controlled for. However, the p-value (0.332) is not statistically significant, so this evidence may not hold up in court.
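One way to drop a predictor without retyping the whole formula is `update()`, which edits an existing formula. The sketch below writes the full formula out explicitly (rather than `salary ~ .`) so that it runs without the data; the object names are illustrative.

```r
# Full model formula written out explicitly
full_formula <- salary ~ degree + rank + sex + year + ysdeg

# Drop rank; each "." means "keep what was on that side of the formula"
reduced_formula <- update(full_formula, . ~ . - rank)
reduced_formula
```

With the data loaded, `lm(reduced_formula, data = salary)` would then fit the model without rank.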
# Creating new column that determines old dean or new dean
# Caution: '15' is quoted, so R compares ysdeg to it as text rather than as a number
salary$dean <- ifelse(salary$ysdeg >= '15', "Old Dean", "New Dean")
salary$dean
## [1] "Old Dean" "Old Dean" "Old Dean" "Old Dean" "Old Dean" "Old Dean"
## [7] "Old Dean" "Old Dean" "Old Dean" "Old Dean" "Old Dean" "Old Dean"
## [13] "Old Dean" "Old Dean" "Old Dean" "Old Dean" "Old Dean" "New Dean"
## [19] "Old Dean" "Old Dean" "Old Dean" "Old Dean" "New Dean" "Old Dean"
## [25] "New Dean" "Old Dean" "New Dean" "Old Dean" "Old Dean" "Old Dean"
## [31] "Old Dean" "Old Dean" "New Dean" "Old Dean" "Old Dean" "Old Dean"
## [37] "New Dean" "Old Dean" "Old Dean" "Old Dean" "Old Dean" "New Dean"
## [43] "Old Dean" "Old Dean" "Old Dean" "New Dean" "Old Dean" "Old Dean"
## [49] "New Dean" "New Dean" "New Dean" "Old Dean"
# Creating linear regression model testing significance of dean
summary(lm(salary ~ dean, data = salary))
##
## Call:
## lm(formula = salary ~ dean, data = salary)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9311.1 -4185.9 -573.6 3931.8 13383.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20580 1727 11.913 3.23e-16 ***
## deanOld Dean 4082 1945 2.098 0.041 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5729 on 50 degrees of freedom
## Multiple R-squared: 0.08091, Adjusted R-squared: 0.06253
## F-statistic: 4.402 on 1 and 50 DF, p-value: 0.04097
The dean variable is significant at the 0.05 level (p-value of 0.041). The coefficient deanOld Dean = 4082 indicates that faculty coded as "Old Dean" earn on average $4,082 more than faculty coded as "New Dean".
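The comparison `salary$ysdeg >= '15'` above uses a quoted threshold. In R, comparing a number with a character string converts the number to text, so the comparison is alphabetical rather than numeric. The sketch below shows the difference on a few hypothetical ysdeg values (not taken from the data) and builds the indicator with a numeric threshold.

```r
# Hypothetical ysdeg values, chosen only to illustrate the comparison
ysdeg <- c(2, 9, 12, 15, 30)

# Quoted threshold: numbers are converted to text and compared as strings
as_text <- ysdeg >= "15"

# Unquoted threshold: ordinary numeric comparison
as_number <- ysdeg >= 15

data.frame(ysdeg, as_text, as_number)

# Dean indicator built with the numeric comparison
dean <- ifelse(ysdeg >= 15, "Old Dean", "New Dean")
dean
```

Values such as 2 and 9 are classified differently by the two comparisons, so refitting with the numeric threshold could change the grouping and the regression results.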
# Loading data
data("house.selling.price")
head(house.selling.price)
## case Taxes Beds Baths New Price Size
## 1 1 3104 4 2 0 279900 2048
## 2 2 1173 2 1 0 146500 912
## 3 3 3076 4 2 0 237700 1654
## 4 4 1608 3 2 0 200000 2068
## 5 5 1454 3 3 0 159900 1477
## 6 6 2997 3 2 1 499900 3153
Using the house.selling.price data, run and report regression results modeling y = selling price (in dollars) in terms of size of home (in square feet) and whether the home is new (1 = yes; 0 = no). In particular, for each variable; discuss statistical significance and interpret the meaning of the coefficient.
summary(lm(Price ~ Size + New, data = house.selling.price))
##
## Call:
## lm(formula = Price ~ Size + New, data = house.selling.price)
##
## Residuals:
## Min 1Q Median 3Q Max
## -205102 -34374 -5778 18929 163866
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -40230.867 14696.140 -2.738 0.00737 **
## Size 116.132 8.795 13.204 < 2e-16 ***
## New 57736.283 18653.041 3.095 0.00257 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 53880 on 97 degrees of freedom
## Multiple R-squared: 0.7226, Adjusted R-squared: 0.7169
## F-statistic: 126.3 on 2 and 97 DF, p-value: < 2.2e-16
Both house size and whether or not the house is new are statistically significant explanatory variables for price. More specifically, Size is significant at the 0.001 level (p-value < 2e-16), while New is significant at the 0.01 level (p-value of 0.00257). Holding New fixed, a one-square-foot increase in size corresponds to a $116.13 increase in predicted price, and for a given size, new houses are predicted to cost $57,736.28 more than houses that are not new.
Report and interpret the prediction equation, and form separate equations relating selling price to size for new and for not new homes.
# Prediction equation
# y = -40230.867 + 116.132 * Size + 57736.283 * New
# Prediction equation for new homes
# y = -40230.867 + 116.132 * Size + 57736.283 * 1
#   = 17505.416 + 116.132 * Size
# Prediction equation for old homes
# y = -40230.867 + 116.132 * Size + 57736.283 * 0
#   = -40230.867 + 116.132 * Size
Find the predicted selling price for a home of 3000 square feet that is (i) new, (ii) not new.
# Predicted price for new homes
-40230.867 + 116.132 * 3000 + 57736.283
## [1] 365901.4
The predicted selling price for new homes of 3000 sqft is $365,901
# Predicted price for old homes
-40230.867 + 116.132 * 3000 + 57736.283 * 0
## [1] 308165.1
The predicted selling price for old homes of 3000 sqft is $308,165
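The two predictions above can also be produced by one small function of size and the new indicator. This is a sketch using the coefficients reported by the no-interaction model; `predict_no_int` is an illustrative name.

```r
# Coefficients from the model Price ~ Size + New
b0     <- -40230.867
b_size <- 116.132
b_new  <- 57736.283

# Predicted price for a given size and new indicator (1 = new, 0 = not new)
predict_no_int <- function(size, new) {
  b0 + b_size * size + b_new * new
}

price_new <- predict_no_int(3000, new = 1)
price_old <- predict_no_int(3000, new = 0)

# Without an interaction, the new-versus-old gap equals b_new at every size
price_new - price_old
```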
Fit another model, this time with an interaction term allowing interaction between size and new, and report the regression results
summary(lm(Price ~ Size + New + Size * New, data = house.selling.price))
##
## Call:
## lm(formula = Price ~ Size + New + Size * New, data = house.selling.price)
##
## Residuals:
## Min 1Q Median 3Q Max
## -175748 -28979 -6260 14693 192519
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -22227.808 15521.110 -1.432 0.15536
## Size 104.438 9.424 11.082 < 2e-16 ***
## New -78527.502 51007.642 -1.540 0.12697
## Size:New 61.916 21.686 2.855 0.00527 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 52000 on 96 degrees of freedom
## Multiple R-squared: 0.7443, Adjusted R-squared: 0.7363
## F-statistic: 93.15 on 3 and 96 DF, p-value: < 2.2e-16
Report the lines relating the predicted selling price to the size for homes that are (i) new, (ii) not new.
# Prediction equation that includes interaction between size and new
# y = -22227.808 + 104.438 * Size - 78527.502 * New + 61.916 * Size * New
# Prediction equation for new homes
# y = -22227.808 + 104.438 * Size - 78527.502 * 1 + 61.916 * Size * 1
#   = -100755.310 + 166.354 * Size
# Prediction equation for old homes
# y = -22227.808 + 104.438 * Size - 78527.502 * 0 + 61.916 * Size * 0
#   = -22227.808 + 104.438 * Size
Find the predicted selling price for a home of 3000 square feet that is (i) new, (ii) not new.
# Predicted selling price for new homes
-22227.808 + 104.438 * 3000 - 78527.502 * 1 + 61.916 * 3000 * 1
## [1] 398306.7
The predicted selling price for new homes of 3000 sqft is $398,307
# Predicted selling price for old homes
-22227.808 + 104.438 * 3000 - 78527.502 * 0 + 61.916 * 3000 * 0
## [1] 291086.2
The predicted selling price for old homes of 3000 sqft is $291,086
Find the predicted selling price for a home of 1500 square feet that is (i) new, (ii) not new. Comparing to (F), explain how the difference in predicted selling prices changes as the size of home increases.
# Predicted selling price for new homes
-22227.808 + 104.438 * 1500 - 78527.502 * 1 + 61.916 * 1500 * 1
## [1] 148775.7
The predicted selling price for new homes of 1500 sqft is $148,776
# Predicted selling price for old homes
-22227.808 + 104.438 * 1500 - 78527.502 * 0 + 61.916 * 1500 * 0
## [1] 134429.2
The predicted selling price for old homes of 1500 sqft is $134,429
As the size of the home increases, the difference in predicted price between new and old homes also increases. For houses of 3000 sqft, the price difference between new and old houses is $107,220. For houses of 1500 sqft, the price difference is just $14,346.
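In the interaction model the new-versus-old gap depends only on the New and Size:New coefficients, since the intercept and the Size slope cancel in the difference. The sketch below computes that gap as a function of size; `price_gap` and `break_even` are illustrative names.

```r
# Coefficients from the interaction model's regression output
b_new <- -78527.502   # New
b_int <- 61.916       # Size:New

# Predicted new-minus-old price gap at a given size
price_gap <- function(size) {
  b_new + b_int * size
}

gap_1500 <- price_gap(1500)
gap_3000 <- price_gap(3000)
c(gap_1500 = gap_1500, gap_3000 = gap_3000)

# Size at which the predicted gap is zero; below it, new homes are predicted cheaper
break_even <- -b_new / b_int
break_even
```

Each additional square foot widens the predicted gap by the interaction coefficient, 61.916 dollars.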
Do you think the model with interaction or the one without it represents the relationship of size and new to the outcome price? What makes you prefer one model over another?
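Since Size affects the relationship between New and Price so strongly, I prefer the model that includes the interaction between Size and New. The interaction term is statistically significant (p-value of 0.00527), and the adjusted R-squared rises from 0.7169 to 0.7363 when it is included.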