DACSS 603 HW#3

Third homework for DACSS 603.

Alexander Hong
2022-03-31

Question 1

For recent data in Jacksonville, Florida, on y = selling price of home (in dollars), x1 = size of home (in square feet), and x2 = lot size (in square feet), the prediction equation is ŷ = −10,536 + 53.8x1 + 2.84x2.

## x1 = size of home in sq ft
## x2 = lot size in sq ft

options(scipen = 100)

hw1_int <- -10536
hw1_x1 <- 53.8
hw1_x2 <- 2.84

hw1a <- hw1_int + (hw1_x1 * 1240) + (hw1_x2 * 18000)
hw1a_resid <- 145000 - hw1a

t1 <- c(1240, 1241, 1242)
t2 <- c(hw1_int + (hw1_x1 * t1) + (hw1_x2 * 18000))
t12 <- data.frame(t1, t2)
table_t12 <- knitr::kable(t12)

The predicted selling price for this home is $107,296 and the residual is $37,704 (the actual price of $145,000 minus the predicted price). In this case, the house sold for more than $37,000 above its predicted price.

t1 t2
1240 107296.0
1241 107349.8
1242 107403.6

Using the same setup as above, with lot size held at 18,000 sq ft, each one-square-foot increase in home size from 1,240 sq ft raises the predicted selling price by $53.80, which is the coefficient on x1.

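53.8 / 2.84 ≈ 18.94, so lot size would need to increase by about 18.94 square feet to have the same effect on the predicted price as a one-square-foot increase in home size.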
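
The trade-off between the two slopes can be checked directly. This is a minimal sketch that reuses the coefficients from the prediction equation; the variable names are illustrative and not part of the original code.

```r
# Slopes from the prediction equation, in dollars per square foot
b_home <- 53.8   # home size (x1)
b_lot  <- 2.84   # lot size (x2)

# Lot-size increase (sq ft) that matches a 1 sq ft increase in home size
lot_equiv <- b_home / b_lot
round(lot_equiv, 2)  # 18.94

# Sanity check: that much extra lot adds the same predicted dollars
all.equal(b_lot * lot_equiv, b_home)
```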

Question 2

(Data file: salary in alr4 R package). The data file concerns salary and other characteristics of all faculty in a small Midwestern college collected in the early 1980s for presentation in legal proceedings for which discrimination against women in salary was at issue. All persons in the data hold tenured or tenure track positions; temporary faculty are not included. The variables include degree, a factor with levels PhD and MS; rank, a factor with levels Asst, Assoc, and Prof; sex, a factor with levels Male and Female; Year, years in current rank; ysdeg, years since highest degree, and salary, academic year salary in dollars.

library(alr4)
library(dplyr)

hw2 <- salary

#Gender Salary Tests
hw2_male_salary <- hw2$salary[hw2$sex == "Male"]
hw2_female_salary <- hw2$salary[hw2$sex == "Female"]

# Welch two-sample t-test (unequal variances, the t.test default)
hw2_saltest <- t.test(hw2_male_salary, hw2_female_salary)

#Multiple LM
hw2_lm <- lm(salary ~ degree + rank + sex + year + ysdeg, data = hw2)

#Gender Confidence Interval
hw2_sex_ci <- confint(hw2_lm, 'sexFemale', level = 0.95)
c11 <- round(hw2_sex_ci[1], 4)
c12 <- round(hw2_sex_ci[2], 4)

#Coefficients Summary
hw2_lm_summary <- summary(hw2_lm)

#Level Change for Rank
hw2$rank <- relevel(hw2$rank, ref = "Assoc")
hw2_lm <- lm(salary ~ degree + rank + sex + year + ysdeg, data = hw2)
hw2_lm_summary <- summary(hw2_lm)

# Regression Without Rank
hw2_lm2 <- lm(salary ~ degree + sex + year + ysdeg, data = hw2)
hw2_lm2_summary <- summary(hw2_lm2)

# Indicator for faculty whose highest degree was earned 15 or fewer years ago
hw2$rank <- relevel(hw2$rank, ref = "Asst")
hw2$ysdeg15 <- ifelse(hw2$ysdeg <=  15, 1, 0)

hw2_lm3 <- lm(salary ~ degree + sex + rank + ysdeg15, data = hw2)
hw2_lm3_summary <- summary(hw2_lm3)

H0: The mean salaries of men and women are equal.

Ha: The mean salaries of men and women are not equal.

Considering sex alone, I conducted a two-sample t-test on the salaries of the male and female faculty. With a t statistic of 1.774438 on 21.591032 degrees of freedom, the p-value is 0.0900941, so I cannot reject the null hypothesis that the mean salaries of men and women are equal at the 5% level.
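
The reported p-value follows from the t statistic and the Welch degrees of freedom. The sketch below recomputes it from those two reported numbers only; the variable names are illustrative.

```r
# Values reported by the two-sample t-test above
t_stat <- 1.774438
df     <- 21.591032

# Two-sided p-value: probability of a |t| at least this large under H0
p_value <- 2 * pt(-abs(t_stat), df)
p_value

# Decision at the 5% level: FALSE means H0 is not rejected
p_value < 0.05
```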

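The 95% confidence interval for the difference in salary between females and males (the sexFemale coefficient in the multiple regression, holding the other predictors constant) is (-697.8183, 3030.5645). Because the interval contains zero, the difference is not statistically significant.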
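
The same interval can be rebuilt by hand from the sexFemale row of the coefficient table below. This sketch assumes 45 residual degrees of freedom (52 faculty minus 7 estimated coefficients); that count is an assumption about the data set, not something printed in this write-up.

```r
# Estimate and standard error from the sexFemale row
est <- 1166.3731
se  <- 925.56888

# Assumed residual degrees of freedom: 52 observations - 7 coefficients
df_resid <- 52 - 7

# 95% interval: estimate +/- critical t value * standard error
ci <- est + c(-1, 1) * qt(0.975, df_resid) * se
round(ci, 2)
```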

Estimate Std. Error t value Pr(>|t|)
(Intercept) 21038.4085 1109.11624 18.968623 0.0000000
degreePhD 1388.6133 1018.74688 1.363060 0.1796454
rankAsst -5292.3608 1145.39802 -4.620543 0.0000322
rankProf 5826.4032 1012.93301 5.752012 0.0000007
sexFemale 1166.3731 925.56888 1.260169 0.2141043
year 476.3090 94.91357 5.018345 0.0000087
ysdeg -124.5743 77.48628 -1.607695 0.1148967

For degree - Holding the other predictors constant, the difference between a Master's and a PhD was not statistically significant. The model estimates that a PhD is associated with a salary $1,388.61 higher.

For rank - Rank within the college was statistically significant. With Assistant as the baseline and the other predictors held constant, an Associate is estimated to earn $5,292.36 more, and a full Professor $11,118.76 more.

For sex - Sex was not statistically significant once the other predictors are controlled for.

For year - The number of years in the current rank was statistically significant. Each additional year in rank is associated with a $476.31 increase in salary.

For ysdeg - The number of years since the highest degree was not statistically significant in predicting salary. As modeled, each additional year since the highest degree is associated with a $124.57 decrease in salary, holding the other predictors constant.

Using the Associate rank as the baseline instead, an Assistant is estimated to earn $5,292.36 less than an Associate, while a Professor is estimated to earn $5,826.40 more.
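
Changing the reference level only re-expresses the same fitted model. As a rough check using the rank coefficients from the table above (the vector names are illustrative), the Assistant-baseline effects can be recovered from the Associate-baseline ones:

```r
# Rank coefficients with Assoc as the reference level (from the table)
assoc_base <- c(Asst = -5292.3608, Prof = 5826.4032)

# Re-express with Asst as the reference: subtract the Asst effect
asst_base <- c(
  Assoc = 0 - assoc_base[["Asst"]],
  Prof  = assoc_base[["Prof"]] - assoc_base[["Asst"]]
)
round(asst_base, 2)  # Assoc 5292.36, Prof 11118.76
```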

Estimate Std. Error t value Pr(>|t|)
(Intercept) 17183.5717 1147.94172 14.9690280 0.0000000
degreePhD -3299.3488 1302.51952 -2.5330514 0.0147040
sexFemale -1286.5443 1313.08854 -0.9797849 0.3322090
year 351.9686 142.48087 2.4702865 0.0171854
ysdeg 339.3990 80.62097 4.2098109 0.0001144

Without rank in the model, the variables that are statistically significant in predicting salary are degree, years in current rank, and years since highest degree. The most notable change is that degree and years since highest degree are now statistically significant, whereas neither was in the model that included rank. The estimated coefficients for degreePhD and sexFemale also change sign, from positive to negative.

library(alr4)
library(dplyr)

hw2 <- salary
plot(hw2$year, hw2$ysdeg)

After plotting year against ysdeg, the two variables appear to be strongly correlated. I therefore removed year from the model, since it might mask the true effect of ysdeg, which is the variable of interest here.

Estimate Std. Error t value Pr(>|t|)
(Intercept) 17585.646 1621.1719 10.8474898 0.0000000
degreePhD 1126.187 1018.3701 1.1058716 0.2745324
sexFemale -829.240 997.5536 -0.8312737 0.4101134
rankAssoc 4825.252 1276.0371 3.7814356 0.0004482
rankProf 11925.702 1512.4142 7.8852089 0.0000000
ysdeg15 319.032 1303.7673 0.2447001 0.8077769

As it turns out, there is not enough evidence that being hired by the new Dean (the ysdeg15 indicator, p ≈ 0.81) had a significant effect on salary.

Question 3

(SMSS 13.7 & 13.8 combined, modified)

(Data file: house.selling.price in smss R package)

library(smss)

data(house.selling.price)
hw3_lm <- lm(Price ~ Size + New, house.selling.price)
hw3_lm_summary <- summary(hw3_lm)

hw3_lm1 <- lm(Price ~ Size, house.selling.price)
hw3_lm1_summary <- summary(hw3_lm1)
hw3_lm2 <- lm(Price ~ New, house.selling.price)
hw3_lm2_summary <- summary(hw3_lm2)

new3000 <- round(hw3_lm$coefficients[1] + (hw3_lm$coefficients[2] * 3000) + hw3_lm$coefficients[3], 2)
old3000 <- round(hw3_lm$coefficients[1] + (hw3_lm$coefficients[2] * 3000), 2)

# Interaction Model
hw3_lm_int <- lm(Price ~ Size + New + Size*New, house.selling.price)
hw3_lm_int_summary <- summary(hw3_lm_int)

# For an older home New = 0, so the New and Size:New terms drop out
new3000_int <- round(hw3_lm_int$coefficients[1] + (hw3_lm_int$coefficients[2] * 3000) + hw3_lm_int$coefficients[3] + (hw3_lm_int$coefficients[4] * 3000), 2)
old3000_int <- round(hw3_lm_int$coefficients[1] + (hw3_lm_int$coefficients[2] * 3000), 2)

new1500_int <- round(hw3_lm_int$coefficients[1] + (hw3_lm_int$coefficients[2] * 1500) + hw3_lm_int$coefficients[3] + (hw3_lm_int$coefficients[4] * 1500), 2)
old1500_int <- round(hw3_lm_int$coefficients[1] + (hw3_lm_int$coefficients[2] * 1500), 2)

Estimate Std. Error t value Pr(>|t|)
(Intercept) -40230.8668 14696.139626 -2.737513 0.0073651
Size 116.1316 8.794993 13.204284 0.0000000
New 57736.2828 18653.040780 3.095275 0.0025701

Estimate Std. Error t value Pr(>|t|)
(Intercept) -50926.2547 14896.372755 -3.418702 0.0009181
Size 126.5941 8.467517 14.950559 0.0000000

In the model with Size alone, the predicted price rises by almost $127 for each additional square foot. Size is statistically significant in predicting the price of a house.

Estimate Std. Error t value Pr(>|t|)
(Intercept) 138567.4 9503.741 14.580302 0.0000000
New 152396.2 28654.858 5.318338 0.0000007

In the model with New alone, a new home is predicted to sell for $152,396 more than an older one. Whether the house is new is statistically significant in predicting its price.

For a new, 3000 square foot house, the predicted price will be $365900.18

For an older, 3000 square foot house, the predicted price will be $308163.9
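
In the model without interaction, the gap between a new and an older home is the same at every size. The sketch below illustrates this using the coefficients reported for Price ~ Size + New; the function name `predict_additive` is hypothetical, and results can differ from the text by a few cents because the printed coefficients are rounded.

```r
# Coefficients reported for Price ~ Size + New
b <- c(intercept = -40230.8668, size = 116.1316, new = 57736.2828)

# Predicted price; `new` is 1 for a new home and 0 otherwise
predict_additive <- function(size, new) {
  b[["intercept"]] + b[["size"]] * size + b[["new"]] * new
}

# New-minus-old gap at two sizes
sizes <- c(1500, 3000)
gap <- predict_additive(sizes, 1) - predict_additive(sizes, 0)
gap  # identical at both sizes: equal to the New coefficient
```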

Estimate Std. Error t value Pr(>|t|)
(Intercept) -22227.80793 15521.109973 -1.432102 0.1553627
Size 104.43839 9.424079 11.082080 0.0000000
New -78527.50235 51007.641896 -1.539524 0.1269661
Size:New 61.91588 21.685692 2.855149 0.0052716

For this model, the size of the house and the interaction between “newness” and size were deemed to be significant in predicting the price of a house. Unlike the previous models, “New” in itself was not considered to be significant.

Predicted Price = −22227.81 + 104.44(Size) − 78527.50(New) + 61.92(Size × New), where New = 1 for a new house and 0 otherwise.

For a new, 3000 square foot house, the predicted price will be $398307.51

For an older, 3000 square foot house, the predicted price will be about $291087 (for an older house, both the New term and the Size:New interaction term are zero)

For a new, 1500 square foot house, the predicted price will be $148776.1

For an older, 1500 square foot house, the predicted price will be about $134430
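
With the interaction term, the new-versus-old gap depends on size. The sketch below is built from the interaction model's reported coefficients; `predict_interaction` and `break_even` are illustrative names, and results can differ from a fitted-model prediction by a few cents because the printed coefficients are rounded.

```r
# Coefficients reported for the interaction model
b_int <- c(intercept = -22227.80793, size = 104.43839,
           new = -78527.50235, size_new = 61.91588)

# Predicted price; `new` is 1 for a new home and 0 otherwise
predict_interaction <- function(size, new) {
  b_int[["intercept"]] + b_int[["size"]] * size +
    b_int[["new"]] * new + b_int[["size_new"]] * size * new
}

# New-minus-old gap at two sizes: it grows with size
sizes <- c(1500, 3000)
gap_int <- predict_interaction(sizes, 1) - predict_interaction(sizes, 0)
round(gap_int)

# Size (sq ft) at which the predicted gap is zero
break_even <- -b_int[["new"]] / b_int[["size_new"]]
round(break_even)
```

Under these coefficients the premium for a new home is small at 1500 square feet and much larger at 3000, which is the pattern the significant Size:New term describes.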

In both cases, a larger home raises the predicted selling price, which matches common experience.

Do you think the model with interaction or the one without it represents the relationship of size and new to the outcome price? What makes you prefer one model over another?

Using adjusted R-squared to compare the models, the interaction model comes out ahead: its adjusted R-squared of .736 is slightly higher than the .717 of the model without interaction. I think both models do a sufficient job of capturing the fact that a larger home costs more. I prefer the interaction model because it supports my expectation that the effect of size depends on whether the house is new, which is reflected in the statistically significant interaction term, at least with these two predictors.