Third homework for DACSS 603.
For recent data in Jacksonville, Florida, on y = selling price of home (in dollars), x1 = size of home (in square feet), and x2 = lot size (in square feet), the prediction equation is ŷ = −10,536 + 53.8x1 + 2.84x2.
## x1 = size of home in sq ft
## x2 = lot size in sq ft
options(scipen = 100)
hw1_int <- -10536
hw1_x1 <- 53.8
hw1_x2 <- 2.84
hw1a <- hw1_int + (hw1_x1 * 1240) + (hw1_x2 * 18000)
hw1a_resid <- 145000 - hw1a
t1 <- c(1240, 1241, 1242)
t2 <- c(hw1_int + (hw1_x1 * t1) + (hw1_x2 * 18000))
t12 <- data.frame(t1, t2)
table_t12 <- knitr::kable(t12)
The predicted selling price for this home is $107,296 and the residual is $37,704. In this case, the house sold for $37,704 more than its predicted price.
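Written out, the arithmetic behind these two figures uses the home size of 1,240 sq ft, the lot size of 18,000 sq ft, and the actual selling price of $145,000 from the code above:

$$
\hat{y} = -10{,}536 + 53.8(1240) + 2.84(18{,}000) = -10{,}536 + 66{,}712 + 51{,}120 = 107{,}296
$$

$$
y - \hat{y} = 145{,}000 - 107{,}296 = 37{,}704
$$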
| t1 | t2 |
|---|---|
| 1240 | 107296.0 |
| 1241 | 107349.8 |
| 1242 | 107403.6 |
Using the same parameters as in the question above, with lot size held at 18,000 sq ft, each 1 sq ft increase in home size from 1,240 sq ft raises the predicted selling price by $53.80, which is the coefficient on x1.
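53.8 / 2.84 ≈ 18.94 sq ft. Lot size would need to increase by about 18.94 square feet to have the same effect on predicted price as a one-square-foot increase in home size.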
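The ratio can be checked with a short sketch in base R. The variable names below are illustrative and not part of the original homework code; only the two coefficients come from the prediction equation.

```r
# Coefficients from the prediction equation in the question
b_home <- 53.8  # dollars per square foot of home size (x1)
b_lot  <- 2.84  # dollars per square foot of lot size (x2)

# Lot-size increase that matches a one-square-foot increase in home size
lot_equiv <- b_home / b_lot
round(lot_equiv, 2)

# Check: both changes move the predicted price by the same dollar amount
all.equal(b_lot * lot_equiv, b_home * 1)
```

Dividing the two slopes works because the model is additive, so each coefficient is a constant dollar change per square foot.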
(Data file: salary in alr4 R package). The data file concerns salary and other characteristics of all faculty in a small Midwestern college collected in the early 1980s for presentation in legal proceedings for which discrimination against women in salary was at issue. All persons in the data hold tenured or tenure track positions; temporary faculty are not included. The variables include degree, a factor with levels PhD and MS; rank, a factor with levels Asst, Assoc, and Prof; sex, a factor with levels Male and Female; Year, years in current rank; ysdeg, years since highest degree, and salary, academic year salary in dollars.
library(alr4)
library(dplyr)
hw2 <- salary
#Gender Salary Tests
hw2_male_salary <- subset(hw2, hw2$sex == "Male", select = (c("salary")))
hw2_female_salary <- subset(hw2, hw2$sex == "Female", select = (c("salary")))
hw2_saltest <- t.test(hw2_male_salary , hw2_female_salary)
#Multiple LM
hw2_lm <- lm(salary ~ degree + rank + sex + year + ysdeg, data = hw2)
#Gender Confidence Interval
hw2_sex_ci <- confint(hw2_lm, 'sexFemale', level = 0.95)
c11 <- round(hw2_sex_ci[1], 4)
c12 <- round(hw2_sex_ci[2], 4)
#Coefficients Summary
hw2_lm_summary <- summary(hw2_lm)
#Level Change for Rank
hw2$rank <- relevel(hw2$rank, ref = "Assoc")
hw2_lm <- lm(salary ~ degree + rank + sex + year + ysdeg, data = hw2)
hw2_lm_summary <- summary(hw2_lm)
# Regression Without Rank
hw2_lm2 <- lm(salary ~ degree + sex + year + ysdeg, data = hw2)
hw2_lm2_summary <- summary(hw2_lm2)
# Indicator for faculty whose highest degree was earned 15 or fewer years ago
hw2$rank <- relevel(hw2$rank, ref = "Asst")
hw2$ysdeg15 <- ifelse(hw2$ysdeg <= 15, 1, 0)
hw2_lm3 <- lm(salary ~ degree + sex + rank + ysdeg15, data = hw2)
hw2_lm3_summary <- summary(hw2_lm3)
H0: The mean salaries of men and women are equal.
Ha: The mean salaries of men and women are not equal.
Looking at gender alone, I conducted a Welch two-sample t-test comparing the salaries of the male and female employees. With a t statistic of 1.774 on 21.59 degrees of freedom, the p-value is 0.0901, so at the 0.05 level I cannot reject the null hypothesis that the mean salaries of men and women are equal.
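The 95% confidence interval for the sexFemale coefficient, which estimates the difference in salary between females and males after controlling for degree, rank, years in current rank, and years since highest degree, is (-697.8183, 3030.5645). Because the interval contains 0, the adjusted difference is not statistically significant.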
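As a cross-check, the same interval can be rebuilt by hand from the sexFemale estimate and standard error reported in the coefficient table. This is a minimal sketch that assumes 52 faculty members in the data and 7 estimated coefficients, giving 45 residual degrees of freedom. That count is an assumption, not something stated in the assignment text.

```r
# Values copied from the sexFemale row of the coefficient table
est <- 1166.3731
se  <- 925.56888

# Assumption: 52 observations and 7 estimated coefficients -> 45 residual df
df_resid <- 52 - 7

# Two-sided 95% interval: estimate +/- critical t value * standard error
t_crit <- qt(0.975, df = df_resid)
ci <- c(lower = est - t_crit * se, upper = est + t_crit * se)
round(ci, 2)
```

In the fitted workflow, `confint()` does this calculation directly on the model object; the manual version only makes the pieces visible.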
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 21038.4085 | 1109.11624 | 18.968623 | 0.0000000 |
| degreePhD | 1388.6133 | 1018.74688 | 1.363060 | 0.1796454 |
| rankAsst | -5292.3608 | 1145.39802 | -4.620543 | 0.0000322 |
| rankProf | 5826.4032 | 1012.93301 | 5.752012 | 0.0000007 |
| sexFemale | 1166.3731 | 925.56888 | 1.260169 | 0.2141043 |
| year | 476.3090 | 94.91357 | 5.018345 | 0.0000087 |
| ysdeg | -124.5743 | 77.48628 | -1.607695 | 0.1148967 |
For degree - Whether an employee holds a Master's or a PhD is not statistically significant (p = 0.18). Holding the other variables constant, an employee with a PhD is predicted to earn $1388.61 more than one with a Master's.
For rank - The employee's rank within the college is statistically significant. Using the Assistant rank as the baseline, an Associate rank adds $5292.36 to predicted salary, and a Professor rank adds $11,118.76.
For sex - The employee's sex is not statistically significant (p = 0.21).
For year - The number of years an employee has spent in their current rank is statistically significant. Each additional year in the current rank is associated with a $476.31 increase in predicted salary.
For ysdeg - The number of years since the highest degree was earned is not statistically significant in predicting salary. As modeled, each additional year since the highest degree is associated with a $124.57 decrease in predicted salary, holding the other variables constant.
Using the Associate rank as the baseline, an employee with an Assistant rank is predicted to earn $5292.36 less, while a Professor is predicted to earn $5826.40 more.
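Changing the reference level only re-expresses the same fitted differences. Starting from the Associate-baseline coefficients in the table, the Professor effect relative to Assistant can be recovered as follows (the subscript labels are notation introduced here for illustration):

$$
\hat{\beta}_{\text{Prof vs. Asst}} = \hat{\beta}_{\text{Prof vs. Assoc}} - \hat{\beta}_{\text{Asst vs. Assoc}} = 5826.40 - (-5292.36) = 11118.76
$$

Likewise, the Associate effect relative to Assistant is the Assistant coefficient with its sign flipped, $-(-5292.36) = 5292.36$.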
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 17183.5717 | 1147.94172 | 14.9690280 | 0.0000000 |
| degreePhD | -3299.3488 | 1302.51952 | -2.5330514 | 0.0147040 |
| sexFemale | -1286.5443 | 1313.08854 | -0.9797849 | 0.3322090 |
| year | 351.9686 | 142.48087 | 2.4702865 | 0.0171854 |
| ysdeg | 339.3990 | 80.62097 | 4.2098109 | 0.0001144 |
By not fitting rank, the variables that are statistically significant in predicting salary are degree, years in current rank, and years since highest degree. The most notable change is that degree and years since highest degree are now statistically significant, whereas neither was in the former model.
After plotting year and ysdeg against each other, the two variables appear to be highly correlated. Thus, in fitting the model, I removed year, since it might mask the true effect of the ysdeg variable, which is the variable of interest.
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 17585.646 | 1621.1719 | 10.8474898 | 0.0000000 |
| degreePhD | 1126.187 | 1018.3701 | 1.1058716 | 0.2745324 |
| sexFemale | -829.240 | 997.5536 | -0.8312737 | 0.4101134 |
| rankAssoc | 4825.252 | 1276.0371 | 3.7814356 | 0.0004482 |
| rankProf | 11925.702 | 1512.4142 | 7.8852089 | 0.0000000 |
| ysdeg15 | 319.032 | 1303.7673 | 0.2447001 | 0.8077769 |
As it turns out, the coefficient on ysdeg15 is not statistically significant (p = 0.81), so there is not enough evidence that being hired by the new Dean is associated with a different salary.
(SMSS 13.7 & 13.8 combined, modified)
(Data file: house.selling.price in smss R package)
library(smss)
data(house.selling.price)
hw3_lm <- lm(Price ~ Size + New, house.selling.price)
hw3_lm_summary <- summary(hw3_lm)
hw3_lm1 <- lm(Price ~ Size, house.selling.price)
hw3_lm1_summary <- summary(hw3_lm1)
hw3_lm2 <- lm(Price ~ New, house.selling.price)
hw3_lm2_summary <- summary(hw3_lm2)
new3000 <- round(hw3_lm$coefficients[1] + (hw3_lm$coefficients[2] * 3000) + hw3_lm$coefficients[3], 2)
old3000 <- round(hw3_lm$coefficients[1] + (hw3_lm$coefficients[2] * 3000), 2)
# Interaction Model
hw3_lm_int <- lm(Price ~ Size * New, data = house.selling.price)
hw3_lm_int_summary <- summary(hw3_lm_int)
new3000_int <- round(hw3_lm_int$coefficients[1] + (hw3_lm_int$coefficients[2] * 3000) + hw3_lm_int$coefficients[3] + (hw3_lm_int$coefficients[4] * 3000), 2)
# Older homes have New = 0, so neither the New nor the Size:New coefficient applies
old3000_int <- round(hw3_lm_int$coefficients[1] + (hw3_lm_int$coefficients[2] * 3000), 2)
new1500_int <- round(hw3_lm_int$coefficients[1] + (hw3_lm_int$coefficients[2] * 1500) + hw3_lm_int$coefficients[3] + (hw3_lm_int$coefficients[4] * 1500), 2)
old1500_int <- round(hw3_lm_int$coefficients[1] + (hw3_lm_int$coefficients[2] * 1500), 2)
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | -40230.8668 | 14696.139626 | -2.737513 | 0.0073651 |
| Size | 116.1316 | 8.794993 | 13.204284 | 0.0000000 |
| New | 57736.2828 | 18653.040780 | 3.095275 | 0.0025701 |
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | -50926.2547 | 14896.372755 | -3.418702 | 0.0009181 |
| Size | 126.5941 | 8.467517 | 14.950559 | 0.0000000 |
In the model with Size alone, the predicted price of a home goes up by almost $127 for each additional square foot. In this model, Size is statistically significant in predicting the price of a house.
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 138567.4 | 9503.741 | 14.580302 | 0.0000000 |
| New | 152396.2 | 28654.858 | 5.318338 | 0.0000007 |
In the model with New alone, a new home is predicted to sell for $152,396 more than an older home. In this model, whether or not the house is new is statistically significant in predicting the price of a house.
For a new, 3000 square foot house, the predicted price will be $365900.18
For an older, 3000 square foot house, the predicted price will be $308163.90
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | -22227.80793 | 15521.109973 | -1.432102 | 0.1553627 |
| Size | 104.43839 | 9.424079 | 11.082080 | 0.0000000 |
| New | -78527.50235 | 51007.641896 | -1.539524 | 0.1269661 |
| Size:New | 61.91588 | 21.685692 | 2.855149 | 0.0052716 |
For this model, the size of the house and the interaction between “newness” and size are significant in predicting the price of a house. Unlike in the previous models, “New” by itself is not significant.
Predicted selling price = -22227.81 + 104.44(Size) - 78527.50(New) + 61.92(Size × New), where New = 1 for a new house and 0 for an older one.
For a new, 3000 square foot house, the predicted price will be $398307.51
For an older, 3000 square foot house, the predicted price will be $291087.36
For a new, 1500 square foot house, the predicted price will be $148776.10
For an older, 1500 square foot house, the predicted price will be $134429.78
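The four predictions can be reproduced from the printed interaction-model coefficients with a small helper. This is a sketch only: `predict_price_int` is a hypothetical function name, and because the coefficients are the rounded values from the table, results can differ from the fitted model by a cent or two.

```r
# Coefficients copied from the interaction-model table (rounded as printed)
b <- c(intercept = -22227.80793, size = 104.43839,
       new = -78527.50235, size_new = 61.91588)

# Hypothetical helper: `new` is 1 for a new home and 0 for an older one
predict_price_int <- function(size, new) {
  unname(b["intercept"] + b["size"] * size +
           b["new"] * new + b["size_new"] * size * new)
}

# All four size/newness combinations discussed in the text
grid <- expand.grid(size = c(1500, 3000), new = c(0, 1))
grid$price <- round(predict_price_int(grid$size, grid$new), 2)
grid

# Gap between new and older homes at each size: b_new + b_size_new * size
gap <- predict_price_int(c(1500, 3000), 1) - predict_price_int(c(1500, 3000), 0)
round(gap, 2)
```

For an older home both the New term and the Size:New term drop out, since New = 0. The gap between new and older homes is therefore not constant: it is about $14,346 at 1500 square feet and about $107,220 at 3000 square feet.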
A larger home raises the predicted selling price, which matches common experience.
- Do you think the model with interaction or the one without it represents the relationship of size and new to the outcome price? What makes you prefer one model over another?
Using adjusted R-squared to compare the models, the interaction model wins out: its adjusted R-squared of .736 is slightly higher than that of the model without interaction, which is .717. I think both models do a sufficient job of capturing the reality that a larger home costs more. I prefer the interaction model because it supports my thinking that the size of a house combined with its “newness” makes a difference, which is reflected in the statistically significant interaction term, at least when these two variables are the only predictors.
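For reference, the standard definition of adjusted R-squared is given below, where $n$ is the number of observations and $p$ is the number of predictors excluding the intercept (symbols introduced here, not taken from the assignment):

$$
\bar{R}^2 = 1 - \left(1 - R^2\right)\frac{n - 1}{n - p - 1}
$$

Because the factor $\frac{n-1}{n-p-1}$ grows with $p$, an extra term such as Size:New raises adjusted R-squared only if it improves the fit by more than the penalty for the added parameter, which is what makes the comparison between the two models meaningful.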