DACSS 603 HW#3

Question 1

For recent data in Jacksonville, Florida, on y = selling price of home (in dollars), x1 = size of home (in square feet), and x2 = lot size (in square feet), the prediction equation is ŷ = −10,536 + 53.8x1 + 2.84x2.

## x1 = size of home in sq ft
## x2 = lot size in sq ft

options(scipen = 100)

hw1_int <- -10536
hw1_x1 <- 53.8
hw1_x2 <- 2.84

hw1a <- hw1_int + (hw1_x1 * 1240) + (hw1_x2 * 18000)
hw1a_resid <- 145000 - hw1a

t1 <- c(1240, 1241, 1242)
t2 <- c(hw1_int + (hw1_x1 * t1) + (hw1_x2 * 18000))
t12 <- data.frame(t1, t2)
table_t12 <- knitr::kable(t12)

A particular home of 1240 square feet on a lot of 18,000 square feet sold for $145,000. Find the predicted selling price and the residual, and interpret.

The predicted selling price for this home is $107,296 and the residual is $37,704 (145,000 - 107,296). The residual is positive, so the model under-predicted: the house sold for $37,704 more than its predicted price.

For fixed lot size, how much is the house selling price predicted to increase for each square-foot increase in home size? Why?

Home size (sq ft)	Predicted price ($)
1240	107296.0
1241	107349.8
1242	107403.6

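Using the same values as in the question above, with lot size held at 18,000 sq ft and home size increased one square foot at a time from 1,240, the predicted selling price rises by $53.80 per square foot. This is the coefficient on x1: in a linear model, a one-unit increase in x1 with x2 held fixed changes the prediction by exactly that coefficient.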

According to this prediction equation, for fixed home size, how much would lot size need to increase to have the same impact as a one-square-foot increase in home size?

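53.8 / 2.84 ≈ 18.94 sq ft. Lot size would need to increase by about 18.94 square feet to have the same predicted impact as one additional square foot of home size.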
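The ratio above can be checked with a few lines of base R. This is only a sketch that reuses the two slope coefficients from the prediction equation; the variable names are made up for illustration.

```r
# Slope coefficients from the Question 1 prediction equation
b_home <- 53.8   # dollars per square foot of home size
b_lot  <- 2.84   # dollars per square foot of lot size

# Lot-size increase (sq ft) whose predicted effect equals
# one extra square foot of home size
lot_equiv <- b_home / b_lot
round(lot_equiv, 2)

# Sanity check: that many extra lot square feet adds b_home dollars
stopifnot(isTRUE(all.equal(lot_equiv * b_lot, b_home)))
```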

Question 2

(Data file: salary in alr4 R package). The data file concerns salary and other characteristics of all faculty in a small Midwestern college collected in the early 1980s for presentation in legal proceedings for which discrimination against women in salary was at issue. All persons in the data hold tenured or tenure track positions; temporary faculty are not included. The variables include degree, a factor with levels PhD and MS; rank, a factor with levels Asst, Assoc, and Prof; sex, a factor with levels Male and Female; Year, years in current rank; ysdeg, years since highest degree, and salary, academic year salary in dollars.

library(alr4)
library(dplyr)

hw2 <- salary

# Gender salary test (Welch two-sample t-test, the t.test() default)
hw2_male_salary <- hw2$salary[hw2$sex == "Male"]
hw2_female_salary <- hw2$salary[hw2$sex == "Female"]

hw2_saltest <- t.test(hw2_male_salary, hw2_female_salary)

#Multiple LM
hw2_lm <- lm(salary ~ degree + rank + sex + year + ysdeg, data = hw2)

#Gender Confidence Interval
hw2_sex_ci <- confint(hw2_lm, 'sexFemale', level = 0.95)
c11 <- round(hw2_sex_ci[1], 4)
c12 <- round(hw2_sex_ci[2], 4)

#Coefficients Summary
hw2_lm_summary <- summary(hw2_lm)

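# Level change for rank: make Assoc the baseline and refit.
# Note: this overwrites hw2_lm and hw2_lm_summary, so a coefficient table
# printed from hw2_lm_summary after this point uses Assoc (not Asst) as the reference.
hw2$rank <- relevel(hw2$rank, ref = "Assoc")
hw2_lm <- lm(salary ~ degree + rank + sex + year + ysdeg, data = hw2)
hw2_lm_summary <- summary(hw2_lm)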

# Regression without rank
hw2_lm2 <- lm(salary ~ degree + sex + year + ysdeg, data = hw2)
hw2_lm2_summary <- summary(hw2_lm2)

# Indicator for faculty whose highest degree was earned 15 or fewer years ago (hired by the new Dean)
hw2$rank <- relevel(hw2$rank, ref = "Asst")
hw2$ysdeg15 <- ifelse(hw2$ysdeg <=  15, 1, 0)

hw2_lm3 <- lm(salary ~ degree + sex + rank + ysdeg15, data = hw2)
hw2_lm3_summary <- summary(hw2_lm3)

Test the hypothesis that the mean salary for men and women is the same, without regard to any other variable but sex. Explain your findings.

H0: The mean salaries of men and women are equal.

Ha: The mean salaries of men and women are not equal.

Considering sex alone, I ran a Welch two-sample t-test on the salaries of the male and female faculty. With a t statistic of 1.7744 on 21.59 degrees of freedom, the p-value is 0.0901. At the 5% significance level, I cannot reject the null hypothesis that the mean salaries of men and women are equal.
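As a rough check, the two-sided p-value can be recomputed in base R from the reported t statistic and Welch degrees of freedom. The sketch below uses only those two reported numbers, not the raw data, and the variable names are illustrative.

```r
# Reported Welch t-test results from the text above
t_stat   <- 1.774438
df_welch <- 21.591032

# Two-sided p-value: twice the lower-tail area beyond -|t|
p_value <- 2 * pt(-abs(t_stat), df = df_welch)
round(p_value, 4)

# Decision at the 5% level (TRUE would mean "reject H0")
p_value < 0.05
```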

Run a multiple linear regression with salary as the outcome variable and everything else as predictors, including sex. Assuming no interactions between sex and the other predictors, obtain a 95% confidence interval for the difference in salary between males and females.

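The 95% confidence interval for the sexFemale coefficient (the adjusted female-minus-male difference in salary) is (-697.82, 3030.56). Because the interval contains zero, the adjusted difference is not statistically significant at the 5% level.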
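The interval can be approximately rebuilt from the sexFemale estimate and standard error reported in the coefficient table. The sketch below assumes 45 residual degrees of freedom (52 faculty minus 7 estimated coefficients); that count is an assumption not stated in the text, so treat the result as a consistency check rather than a replacement for confint().

```r
# sexFemale estimate and standard error from the coefficient table
est <- 1166.3731
se  <- 925.56888

# ASSUMPTION: residual df = 52 observations - 7 coefficients = 45
df_resid <- 45

# 95% interval: estimate +/- t critical value * standard error
t_crit <- qt(0.975, df = df_resid)
ci <- c(lower = est - t_crit * se, upper = est + t_crit * se)
round(ci, 2)
```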

Interpret your finding for each predictor variable; discuss (a) statistical significance, (b) interpretation of the coefficient / slope in relation to the outcome variable and other variables

	Estimate	Std. Error	t value	Pr(>\|t\|)
(Intercept)	21038.4085	1109.11624	18.968623	0.0000000
degreePhD	1388.6133	1018.74688	1.363060	0.1796454
rankAsst	-5292.3608	1145.39802	-4.620543	0.0000322
rankProf	5826.4032	1012.93301	5.752012	0.0000007
sexFemale	1166.3731	925.56888	1.260169	0.2141043
year	476.3090	94.91357	5.018345	0.0000087
ysdeg	-124.5743	77.48628	-1.607695	0.1148967

For degree - Holding a PhD rather than a Master's is not statistically significant (p = 0.180). Holding the other variables constant, a faculty member with a PhD is predicted to earn $1,388.61 more than one with a Master's.

For rank - Rank is statistically significant. The table above was printed with Associate as the baseline (rankAsst = -5292.36, rankProf = 5826.40). Re-expressed with Assistant as the baseline and holding the other variables constant, an Associate is predicted to earn $5,292.36 more than an Assistant, and a Professor $11,118.76 more.

For sex - Sex is not statistically significant (p = 0.214). The coefficient implies that female faculty are predicted to earn $1,166.37 more than male faculty with the same degree, rank, and years, but this difference is not distinguishable from zero.

For year - The number of years in the current rank is statistically significant (p < 0.001). Each additional year in the current rank is associated with a $476.31 increase in predicted salary, holding the other variables constant.

For ysdeg - The number of years since the highest degree is not statistically significant (p = 0.115). As modeled, each additional year since the highest degree is associated with a $124.57 decrease in predicted salary, holding the other variables constant.

Change the baseline category for the rank variable. Interpret the coefficients related to rank again.

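Using the Associate rank as the baseline and holding the other variables constant, an Assistant is predicted to earn $5,292.36 less than an Associate, while a Professor is predicted to earn $5,826.40 more than an Associate. Changing the baseline changes only what each rank is compared against; the fitted salaries are the same.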
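Changing the reference level only re-expresses the same rank differences. A small arithmetic sketch, using the two rank coefficients from the Associate-baseline table (variable names are illustrative), shows how they map to an Assistant baseline:

```r
# Rank coefficients with Assoc as the reference level (from the table)
asst_vs_assoc <- -5292.3608
prof_vs_assoc <-  5826.4032

# Same contrasts re-expressed with Asst as the reference level
assoc_vs_asst <- -asst_vs_assoc                 # Assoc minus Asst
prof_vs_asst  <- prof_vs_assoc - asst_vs_assoc  # Prof minus Asst

round(c(rankAssoc = assoc_vs_asst, rankProf = prof_vs_asst), 2)
```

The intercept shifts by the same amount in the opposite direction, so predicted salaries do not change.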

Finkelstein (1980), in a discussion of the use of regression in discrimination cases, wrote, “[a] variable may reflect a position or status bestowed by the employer, in which case if there is discrimination in the award of the position or status, the variable may be ‘tainted.’” Thus, for example, if discrimination is at work in promotion of faculty to higher ranks, using rank to adjust salaries before comparing the sexes may not be acceptable to the courts. Exclude the variable rank, refit, and summarize how your findings changed, if they did.

	Estimate	Std. Error	t value	Pr(>\|t\|)
(Intercept)	17183.5717	1147.94172	14.9690280	0.0000000
degreePhD	-3299.3488	1302.51952	-2.5330514	0.0147040
sexFemale	-1286.5443	1313.08854	-0.9797849	0.3322090
year	351.9686	142.48087	2.4702865	0.0171854
ysdeg	339.3990	80.62097	4.2098109	0.0001144

Without rank in the model, the statistically significant predictors are degree, years in current rank, and years since highest degree. The most notable change is that degree and years since highest degree are now significant, whereas neither was in the model that included rank. Their signs also flip: the PhD coefficient becomes negative (-3299.35) and the ysdeg coefficient becomes positive (339.40). The sex coefficient changes sign as well, from 1166.37 to -1286.54, but it remains non-significant (p = 0.332).

Everyone in this dataset was hired the year they earned their highest degree. It is also known that a new Dean was appointed 15 years ago, and everyone in the dataset who earned their highest degree 15 years ago or less than that has been hired by the new Dean. Some people have argued that the new Dean has been making offers that are a lot more generous to newly hired faculty than the previous one and that this might explain some of the variation in Salary. Create a new variable that would allow you to test this hypothesis and run another multiple regression model to test this. Select variables carefully to make sure there is no multicollinearity. Explain why multicollinearity would be a concern in this case and how you avoided it. Do you find support for the hypothesis that the people hired by the new Dean are making higher than those that were not?

library(alr4)
library(dplyr)

hw2 <- salary
plot(hw2$year, hw2$ysdeg)

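After plotting year against ysdeg, the two variables appear to be strongly correlated, so including both would make it hard to separate their effects. I therefore removed year from the model, since it might mask the effect of time since degree, which is the quantity of interest. The new indicator, ysdeg15 (1 if the highest degree was earned 15 or fewer years ago, 0 otherwise), is built directly from ysdeg, so ysdeg itself is also left out to avoid multicollinearity with the indicator. The fitted model uses degree, sex, rank, and ysdeg15.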
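The new-Dean indicator is a deterministic function of ysdeg, which is one reason the two should not appear in the same model. A toy sketch with hypothetical ysdeg values (not taken from the salary data) illustrates the construction:

```r
# HYPOTHETICAL years-since-degree values, for illustration only
ysdeg_example <- c(2, 15, 16, 30)

# 1 = degree earned 15 or fewer years ago (hired by the new Dean), 0 = earlier
hired_new_dean <- ifelse(ysdeg_example <= 15, 1, 0)
data.frame(ysdeg = ysdeg_example, ysdeg15 = hired_new_dean)

# The indicator carries information already contained in ysdeg,
# so the two are correlated by construction
cor_example <- cor(ysdeg_example, hired_new_dean)
```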

	Estimate	Std. Error	t value	Pr(>\|t\|)
(Intercept)	17585.646	1621.1719	10.8474898	0.0000000
degreePhD	1126.187	1018.3701	1.1058716	0.2745324
sexFemale	-829.240	997.5536	-0.8312737	0.4101134
rankAssoc	4825.252	1276.0371	3.7814356	0.0004482
rankProf	11925.702	1512.4142	7.8852089	0.0000000
ysdeg15	319.032	1303.7673	0.2447001	0.8077769

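The ysdeg15 coefficient is positive (319.03) but far from statistically significant (p = 0.808). After adjusting for degree, sex, and rank, there is not enough evidence that faculty hired by the new Dean earn more than those hired earlier.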

Question 3

(SMSS 13.7 & 13.8 combined, modified)

(Data file: house.selling.price in smss R package)

library(smss)

data(house.selling.price)
hw3_lm <- lm(Price ~ Size + New, house.selling.price)
hw3_lm_summary <- summary(hw3_lm)

hw3_lm1 <- lm(Price ~ Size, house.selling.price)
hw3_lm1_summary <- summary(hw3_lm1)
hw3_lm2 <- lm(Price ~ New, house.selling.price)
hw3_lm2_summary <- summary(hw3_lm2)

new3000 <- round(hw3_lm$coefficients[1] + (hw3_lm$coefficients[2] * 3000) + hw3_lm$coefficients[3], 2)
old3000 <- round(hw3_lm$coefficients[1] + (hw3_lm$coefficients[2] * 3000), 2)

# Interaction Model
hw3_lm_int <- lm(Price ~ Size + New + Size*New, house.selling.price)
hw3_lm_int_summary <- summary(hw3_lm_int)

# Coefficients: [1] intercept, [2] Size, [3] New, [4] Size:New
new3000_int <- round(hw3_lm_int$coefficients[1] + (hw3_lm_int$coefficients[2] * 3000) + hw3_lm_int$coefficients[3] + (hw3_lm_int$coefficients[4] * 3000), 2)
# Not-new homes have New = 0, so the New and Size:New terms drop out
old3000_int <- round(hw3_lm_int$coefficients[1] + (hw3_lm_int$coefficients[2] * 3000), 2)

new1500_int <- round(hw3_lm_int$coefficients[1] + (hw3_lm_int$coefficients[2] * 1500) + hw3_lm_int$coefficients[3] + (hw3_lm_int$coefficients[4] * 1500), 2)
old1500_int <- round(hw3_lm_int$coefficients[1] + (hw3_lm_int$coefficients[2] * 1500), 2)

Using the house.selling.price data, run and report regression results modeling y = selling price (in dollars) in terms of size of home (in square feet) and whether the home is new (1 = yes; 0 = no). (In other words, price is the outcome variable and size and new are the explanatory variables.)

	Estimate	Std. Error	t value	Pr(>\|t\|)
(Intercept)	-40230.8668	14696.139626	-2.737513	0.0073651
Size	116.1316	8.794993	13.204284	0.0000000
New	57736.2828	18653.040780	3.095275	0.0025701

Report and interpret the prediction equation, and form separate equations relating selling price to size for new and for not new homes. In particular, for each variable; discuss statistical significance and interpret the meaning of the coefficient.

	Estimate	Std. Error	t value	Pr(>\|t\|)
(Intercept)	-50926.2547	14896.372755	-3.418702	0.0009181
Size	126.5941	8.467517	14.950559	0.0000000

In the simple regression of price on size alone, the predicted price goes up by almost $127 for each additional square foot. Size is statistically significant in predicting the price of a house (p < 0.001).

	Estimate	Std. Error	t value	Pr(>\|t\|)
(Intercept)	138567.4	9503.741	14.580302	0.0000000
New	152396.2	28654.858	5.318338	0.0000007

In the simple regression of price on New alone, a new home is predicted to sell for $152,396 more than a home that is not new. Whether or not the house is new is statistically significant in predicting the price of a house (p < 0.001).
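The model with both Size and New (the first table in this question) also gives one line per group, with a shared slope and different intercepts. The sketch below plugs in the coefficients reported in that table; the function and variable names are made up for illustration.

```r
# Coefficients from the Price ~ Size + New table
b0     <- -40230.8668
b_size <-    116.1316
b_new  <-  57736.2828

# Prediction equation: new = 1 for a new home, 0 otherwise
predict_price <- function(size, new) b0 + b_size * size + b_new * new

# Separate lines: same slope, intercept shifted by b_new for new homes
intercept_not_new <- b0
intercept_new     <- b0 + b_new
round(c(not_new = intercept_not_new, new = intercept_new, slope = b_size), 2)

# Predicted prices for a 3000 square foot home
round(c(new = predict_price(3000, 1), not_new = predict_price(3000, 0)), 2)
```

Under this model the gap between new and not-new homes is the New coefficient at every size, because there is no interaction term.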

Find the predicted selling price for a home of 3000 square feet that is (i) new, (ii) not new.

For a new, 3000 square foot house, the predicted price will be $365,900.18.

For an older, 3000 square foot house, the predicted price will be $308,163.90.

Fit another model, this time with an interaction term allowing interaction between size and new, and report the regression results

	Estimate	Std. Error	t value	Pr(>\|t\|)
(Intercept)	-22227.80793	15521.109973	-1.432102	0.1553627
Size	104.43839	9.424079	11.082080	0.0000000
New	-78527.50235	51007.641896	-1.539524	0.1269661
Size:New	61.91588	21.685692	2.855149	0.0052716

For this model, the size of the house (p < 0.001) and the interaction between “newness” and size (p = 0.005) are statistically significant in predicting the price of a house. Unlike in the previous models, the New main effect on its own is not significant (p = 0.127).

Report the lines relating the predicted selling price to the size for homes that are (i) new, (ii) not new.

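Selling Price = -22227.81 + 104.44(Size) - 78527.50(New) + 61.92(Size × New)

(i) New (New = 1): Selling Price = -100755.31 + 166.35(Size)

(ii) Not new (New = 0): Selling Price = -22227.81 + 104.44(Size)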
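The interaction equation can be evaluated for each group by setting New to 1 or 0. The sketch below uses the coefficients from the interaction table; the helper names predict_int and new_gap are hypothetical and not part of the original analysis.

```r
# Coefficients from the Price ~ Size + New + Size:New table
b0     <- -22227.80793
b_size <-    104.43839
b_new  <- -78527.50235
b_int  <-     61.91588

# Prediction: for not-new homes (new = 0) the New and Size:New terms vanish
predict_int <- function(size, new) {
  b0 + b_size * size + b_new * new + b_int * size * new
}

# Predicted premium of a new home over a not-new home of the same size
new_gap <- function(size) predict_int(size, 1) - predict_int(size, 0)

sizes <- c(1500, 3000)
round(data.frame(
  size    = sizes,
  new     = predict_int(sizes, 1),
  not_new = predict_int(sizes, 0),
  gap     = new_gap(sizes)
), 2)
```

Because the premium is b_new + b_int × size, it grows by the interaction coefficient for every additional square foot.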

Find the predicted selling price for a home of 3000 square feet that is (i) new, (ii) not new.

For a new, 3000 square foot house, the predicted price will be $398,307.51.

For an older, 3000 square foot house, the predicted price will be $291,087.36 (intercept plus the Size term only, since New = 0 removes both the New and interaction terms).

Find the predicted selling price for a home of 1500 square feet that is (i) new, (ii) not new. Comparing to (F), explain how the difference in predicted selling prices changes as the size of home increases.

For a new, 1500 square foot house, the predicted price will be $148,776.10.

For an older, 1500 square foot house, the predicted price will be $134,429.78 (intercept plus the Size term only, since New = 0).

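Larger homes have higher predicted prices in both groups, and the gap between new and not-new homes widens as size increases. From the interaction model, the predicted premium for a new home is -78,527.50 + 61.92 × Size, which is about $14,346 at 1500 square feet and about $107,220 at 3000 square feet.

Do you think the model with interaction or the one without it represents the relationship of size and new to the outcome price? What makes you prefer one model over another?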

Using adjusted R-squared to compare the models, the interaction model does slightly better: its adjusted R-squared is 0.736, against 0.717 for the model without the interaction. Both models do a sufficient job of capturing the fact that larger homes cost more. I prefer the interaction model because it allows the effect of size to differ between new and not-new homes, which matches my expectation that size combined with “newness” makes a difference. The interaction term is statistically significant (p = 0.005), at least with these two variables as predictors.