DACSS 603 Homework 3

Homework 3 for DACSS 603

Molly Hackbarth
03-28-2022

Question 1

(SMSS 11.2, except part (d)) For recent data in Jacksonville, Florida, on y = selling price of home (in dollars), x1 = size of home (in square feet), and x2 = lot size (in square feet), the prediction equation is ŷ = −10,536 + 53.8x1 + 2.84x2.

Part A

A. A particular home of 1240 square feet on a lot of 18,000 square feet sold for $145,000. Find the predicted selling price and the residual, and interpret.

To find the predicted selling price of a home of 1,240 square feet (x1) on a lot of 18,000 square feet (x2), I use the given equation ŷ = −10,536 + 53.8(x1) + 2.84(x2). The predicted price is $107,296.

The residual is the observed selling price minus the predicted price: $145,000 − $107,296 = $37,704.

The positive residual tells us that the house sold for $37,704 more than the model predicted.

price <- -10536 + 53.8*1240 + 2.84*18000

residual <- 145000 - price

price
[1] 107296
residual
[1] 37704

Part B

B. For fixed lot size, how much is the house selling price predicted to increase for each square-foot increase in home size? Why?

For a fixed lot size, the predicted selling price increases by $53.80 for each additional square foot of home size. In the equation ŷ = −10,536 + 53.8(x1) + 2.84(x2), 53.8 is the coefficient of x1 (home size in square feet), so when x2 is held constant a one-unit increase in x1 changes ŷ by 53.8.

Part C

C. According to this prediction equation, for fixed home size, how much would lot size need to increase to have the same impact as a one-square-foot increase in home size?

In the equation ŷ = −10,536 + 53.8(x1) + 2.84(x2), a one-square-foot increase in home size raises the predicted price by $53.80 (the coefficient of x1), while each additional square foot of lot raises it by $2.84 (the coefficient of x2). For a fixed home size, the lot size increase with the same impact is therefore the ratio of the two coefficients, 53.8/2.84. The lot would need to increase by about 18.94 square feet to match the predicted effect of a one-square-foot increase in home size.

53.8/2.84
[1] 18.94366
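The partial slopes used in Parts B and C can also be checked numerically. The sketch below wraps the prediction equation in a helper, predict_price(), a name introduced here purely for illustration, and compares the effect of one extra square foot of home with the effect of 53.8/2.84 extra square feet of lot.

```r
# Prediction equation from the question; predict_price() is an illustrative helper
predict_price <- function(home_sqft, lot_sqft) {
  -10536 + 53.8 * home_sqft + 2.84 * lot_sqft
}

base <- predict_price(1240, 18000)

# Part B: one more square foot of home, lot size held fixed
home_effect <- predict_price(1241, 18000) - base

# Part C: lot increase that matches that effect, home size held fixed
lot_increase <- 53.8 / 2.84
lot_effect <- predict_price(1240, 18000 + lot_increase) - base

c(home_effect = home_effect, lot_effect = lot_effect)
```

Both effects should come out at about 53.8, up to floating-point rounding.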

Question 2

(ALR, 5.17, slightly modified) (Data file: salary in alr4 R package). The data file concerns salary and other characteristics of all faculty in a small Midwestern college collected in the early 1980s for presentation in legal proceedings for which discrimination against women in salary was at issue. All persons in the data hold tenured or tenure track positions; temporary faculty are not included. The variables include degree, a factor with levels PhD and MS; rank, a factor with levels Asst, Assoc, and Prof; sex, a factor with levels Male and Female; Year, years in current rank; ysdeg, years since highest degree, and salary, academic year salary in dollars.

Part A

A. Test the hypothesis that the mean salary for men and women is the same, without regard to any other variable but sex. Explain your findings.

The first thing I did was pull in the data. To compare the groups I used group_by() and summarise() to find the mean salary by sex, then ran a two-sample (Welch) t-test with t.test(). The mean salary was $24,696.79 for males and $21,357.14 for females. Although the sample means differ by $3,339.65, the p-value (.09) is higher than the 5% significance level, and the 95% confidence interval for the difference (−$7,247.15 to $567.85) includes 0. As a result, we cannot reject the null hypothesis that the mean salaries for men and women are the same.

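library(dplyr)

salary <- alr4::salary

# Mean salary by sex
mean_by_sex <- salary %>%
  group_by(sex) %>%
  summarise(average_salary = mean(salary))

# Split the data by sex for the two-sample t-test
salary2 <- salary %>% 
  filter(sex == "Female")

salary3 <- salary %>% 
  filter(sex == "Male")

# Welch two-sample t-test (the t.test() default)
mean2 <- t.test(salary2$salary, salary3$salary)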

mean2

    Welch Two Sample t-test

data:  salary2$salary and salary3$salary
t = -1.7744, df = 21.591, p-value = 0.09009
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -7247.1471   567.8539
sample estimates:
mean of x mean of y 
 21357.14  24696.79 
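As a rough cross-check, the confidence interval and p-value can be rebuilt from the numbers printed by t.test(). This is only a sketch: the inputs are the rounded values copied from the output above, so the results agree with the printed interval to within rounding rather than exactly.

```r
# Values copied from the t.test() output above (rounded as printed)
mean_female <- 21357.14
mean_male   <- 24696.79
t_stat      <- -1.7744
welch_df    <- 21.591

diff_means <- mean_female - mean_male   # female minus male
se_diff    <- diff_means / t_stat       # standard error implied by t
t_crit     <- qt(0.975, welch_df)       # two-sided 95% critical value

ci      <- diff_means + c(-1, 1) * t_crit * se_diff
p_value <- 2 * pt(-abs(t_stat), welch_df)

round(ci, 1)
round(p_value, 3)
```

Because the interval includes 0 and the p-value is above .05, both views lead to the same conclusion.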

Part B

B. Run a multiple linear regression with salary as the outcome variable and everything else as predictors, including sex. Assuming no interactions between sex and the other predictors, obtain a 95% confidence interval for the difference in salary between males and females.

To run the linear regression I used lm(), then confint() to find the 95% confidence intervals for all coefficients. The sexFemale coefficient is the difference in mean salary between females and males (female minus male), holding the other predictors constant. Its 95% confidence interval has a lower bound of −$697.82 and an upper bound of $3,030.56. Because the interval contains 0, the difference is not statistically significant at the 5% level.

all <-  lm(salary ~ sex + degree + rank + year + ysdeg, data = salary)

summary(all)

Call:
lm(formula = salary ~ sex + degree + rank + year + ysdeg, data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-4045.2 -1094.7  -361.5   813.2  9193.1 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 15746.05     800.18  19.678  < 2e-16 ***
sexFemale    1166.37     925.57   1.260    0.214    
degreePhD    1388.61    1018.75   1.363    0.180    
rankAssoc    5292.36    1145.40   4.621 3.22e-05 ***
rankProf    11118.76    1351.77   8.225 1.62e-10 ***
year          476.31      94.91   5.018 8.65e-06 ***
ysdeg        -124.57      77.49  -1.608    0.115    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2398 on 45 degrees of freedom
Multiple R-squared:  0.855, Adjusted R-squared:  0.8357 
F-statistic: 44.24 on 6 and 45 DF,  p-value: < 2.2e-16
confint(all)
                 2.5 %      97.5 %
(Intercept) 14134.4059 17357.68946
sexFemale    -697.8183  3030.56452
degreePhD    -663.2482  3440.47485
rankAssoc    2985.4107  7599.31080
rankProf     8396.1546 13841.37340
year          285.1433   667.47476
ysdeg        -280.6397    31.49105
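The sexFemale interval can be reproduced by hand from the estimate, its standard error, and the residual degrees of freedom in summary(all). A minimal sketch, using the rounded values as printed:

```r
# Values copied from the summary(all) output (rounded as printed)
estimate  <- 1166.37
std_error <- 925.57
resid_df  <- 45

# estimate +/- critical t value times the standard error
t_crit <- qt(0.975, resid_df)
ci_sex <- estimate + c(-1, 1) * t_crit * std_error

round(ci_sex, 2)
```

This should match the sexFemale row of confint(all) to within rounding.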

Part C

C. Interpret your finding for each predictor variable; discuss (a) statistical significance, (b) interpretation of the coefficient / slope in relation to the outcome variable and other variables

For sexFemale, the regression output shows that, holding the other predictors constant, females are predicted to earn $1,166.37 more than males. However, the p-value is .21, so the difference is not statistically significant.

For degreePhD, faculty with a PhD are predicted to earn $1,388.61 more than those with an MS, holding the other predictors constant. However, the p-value is .18, so the coefficient is not statistically significant.

For rankAssoc, Associate Professors are predicted to earn $5,292.36 more than Assistant Professors (the baseline rank), holding the other predictors constant. The p-value is 3.22e-05, which is statistically significant (below .001).

For rankProf, full Professors are predicted to earn $11,118.76 more than Assistant Professors, holding the other predictors constant. The p-value is 1.62e-10, which is statistically significant (below .001).

For year, each additional year in current rank is associated with a $476.31 increase in predicted salary, holding the other predictors constant. The p-value is 8.65e-06, which is statistically significant (below .001).

For ysdeg, each additional year since highest degree is associated with a $124.57 decrease in predicted salary, holding the other predictors constant. However, the p-value is .12, so the coefficient is not statistically significant.

Overall, the statistically significant predictors are rank and years in current rank: predicted salary is highest for full Professors with many years in rank. The coefficients for sex, degree, and years since highest degree are not statistically significant. The adjusted R-squared of 0.8357 suggests that the model explains roughly 84% of the variation in salary after adjusting for the number of predictors.

Part D

D. Change the baseline category for the rank variable. Interpret the coefficients related to rank again.

To change the baseline category of a factor I use the relevel() function, which re-orders the levels so that a specified level comes first. For this example I relevel() sex so that “Female” is the first level. lm() then treats “Female” as the baseline (0) and “Male” as the indicator (1). Note in the output that the slope associated with sex2Male is −$1,166.37 instead of positive, with a larger intercept.

sex2 <- relevel(salary$sex,"Female")

all2 <-  lm(salary ~ sex2 + degree + rank + year + ysdeg, data = salary)

summary(all2)

Call:
lm(formula = salary ~ sex2 + degree + rank + year + ysdeg, data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-4045.2 -1094.7  -361.5   813.2  9193.1 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 16912.42     816.44  20.715  < 2e-16 ***
sex2Male    -1166.37     925.57  -1.260    0.214    
degreePhD    1388.61    1018.75   1.363    0.180    
rankAssoc    5292.36    1145.40   4.621 3.22e-05 ***
rankProf    11118.76    1351.77   8.225 1.62e-10 ***
year          476.31      94.91   5.018 8.65e-06 ***
ysdeg        -124.57      77.49  -1.608    0.115    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2398 on 45 degrees of freedom
Multiple R-squared:  0.855, Adjusted R-squared:  0.8357 
F-statistic: 44.24 on 6 and 45 DF,  p-value: < 2.2e-16

For sex2Male, the regression output shows that, holding the other predictors constant, males are predicted to earn $1,166.37 less than females. However, the p-value is .21, so the difference is not statistically significant.

Here are a few observations about the results of releveling. The estimate and t-value for sex have the same magnitude as before but the opposite sign. The intercepts differ: it is $16,912.42 with Female as the baseline compared to $15,746.05 with Male as the baseline, a gap of exactly $1,166.37. This is because the intercept now describes the female reference group, so the male prediction is found by subtracting $1,166.37, whereas before the female prediction was found by adding $1,166.37 to the male intercept. The standard error and p-value for sex, and all the other coefficients, are unchanged.
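The same relevel() approach applies to rank, the factor named in the question; refitting with something like relevel(salary$rank, ref = "Prof") would be the direct route. As a minimal sketch that does not need the data, the estimates under a Prof baseline can be derived from the coefficients already reported, because changing the reference level only re-expresses the same fitted model. The choice of Prof as the new baseline is an assumption made here for illustration, and the standard errors and p-values of the new contrasts cannot be recovered this way and would need the refit.

```r
# Estimates from summary(all), where Asst is the baseline rank
intercept  <- 15746.05
rank_assoc <- 5292.36
rank_prof  <- 11118.76

# Re-expressed with Prof as the baseline (assumed choice of new reference)
prof_baseline <- c(
  intercept = intercept + rank_prof,   # reference group is now Prof
  rankAsst  = -rank_prof,              # Asst compared with Prof
  rankAssoc = rank_assoc - rank_prof   # Assoc compared with Prof
)

prof_baseline
```

Under this baseline, Assistant and Associate Professors are predicted to earn about $11,118.76 and $5,826.40 less than full Professors, holding the other predictors constant.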

Part E

E. Finkelstein (1980), in a discussion of the use of regression in discrimination cases, wrote, “[a] variable may reflect a position or status bestowed by the employer, in which case if there is discrimination in the award of the position or status, the variable may be ‘tainted.’” Thus, for example, if discrimination is at work in promotion of faculty to higher ranks, using rank to adjust salaries before comparing the sexes may not be acceptable to the courts. Exclude the variable rank, refit, and summarize how your findings changed, if they did.

To answer this question I removed the rank variable and refit using lm().

nonrank <-  lm(salary ~ sex + degree + year + ysdeg, data = salary)

summary(nonrank)

Call:
lm(formula = salary ~ sex + degree + year + ysdeg, data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-8146.9 -2186.9  -491.5  2279.1 11186.6 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 17183.57    1147.94  14.969  < 2e-16 ***
sexFemale   -1286.54    1313.09  -0.980 0.332209    
degreePhD   -3299.35    1302.52  -2.533 0.014704 *  
year          351.97     142.48   2.470 0.017185 *  
ysdeg         339.40      80.62   4.210 0.000114 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3744 on 47 degrees of freedom
Multiple R-squared:  0.6312,    Adjusted R-squared:  0.5998 
F-statistic: 20.11 on 4 and 47 DF,  p-value: 1.048e-09

Removing rank has a large impact on the results.

For sexFemale, holding the other predictors constant, females are now predicted to earn $1,286.54 less than males. However, the p-value is .33, so the difference is still not statistically significant.

For degreePhD, faculty with a PhD are predicted to earn $3,299.35 less than those with an MS, holding the other predictors constant. The p-value is .015, which is statistically significant at the 5% level.

For year, each additional year in current rank is associated with a $351.97 increase in predicted salary, holding the other predictors constant. The p-value is .017, which is statistically significant at the 5% level.

For ysdeg, each additional year since highest degree is associated with a $339.40 increase in predicted salary, holding the other predictors constant. The p-value is .0001, which is statistically significant (below .001).

Compared with the model that included rank, the coefficient estimates for sexFemale, degreePhD, and year have all decreased, with sexFemale and degreePhD turning negative. Only ysdeg has a higher estimate than before, and it went from statistically insignificant (.12) to statistically significant (.0001). degreePhD also became significant (.18 to .015). The sex coefficient changed sign but is not statistically significant in either model. The model fits less well without rank: the adjusted R-squared fell from 0.8357 to 0.5998.
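The shifts in the estimates are easier to see side by side. The sketch below simply tabulates the coefficients copied from the two summary() outputs; the object names are illustrative.

```r
# Estimates copied from summary(all) and summary(nonrank)
with_rank    <- c(sexFemale = 1166.37, degreePhD = 1388.61,
                  year = 476.31, ysdeg = -124.57)
without_rank <- c(sexFemale = -1286.54, degreePhD = -3299.35,
                  year = 351.97, ysdeg = 339.40)

comparison <- data.frame(
  with_rank,
  without_rank,
  change = without_rank - with_rank
)

comparison
```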

Part F

F. Everyone in this dataset was hired the year they earned their highest degree. It is also known that a new Dean was appointed 15 years ago, and everyone in the dataset who earned their highest degree 15 years ago or less than that has been hired by the new Dean. Some people have argued that the new Dean has been making offers that are a lot more generous to newly hired faculty than the previous one and that this might explain some of the variation in Salary.

Create a new variable that would allow you to test this hypothesis and run another multiple regression model to test this. Select variables carefully to make sure there is no multicollinearity. Explain why multicollinearity would be a concern in this case and how you avoided it. Do you find support for the hypothesis that the people hired by the new Dean are making higher than those that were not?

To test this hypothesis I created a dummy variable (edu) based on ysdeg: it is coded 1 if the person earned their highest degree 15 years ago or less (and so was hired by the new Dean) and 0 if it was 16 years ago or more.

Multicollinearity is a concern when two or more predictors are strongly correlated, because the model then cannot separate their effects and the standard errors of the coefficients become inflated. Here edu is computed directly from ysdeg, so including both would put nearly the same information into the model twice.

To avoid this I removed ysdeg, rank, and year. ysdeg was removed because it overlaps with the new dummy variable. Rank was removed because each promotion brings an increase in pay regardless of who the Dean is. Year, which is years in current rank, may also be strongly correlated with edu. In addition, year does not capture when someone was hired: an employee hired before the new Dean but recently promoted could show a year of 0 despite being hired more than 15 years ago.

salary$edu <- ifelse(salary$ysdeg <=15, 1, 0)

newdean <-  lm(salary ~ sex + degree + edu, data = salary)

summary(newdean)

Call:
lm(formula = salary ~ sex + degree + edu, data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-8260.4 -3557.7  -462.6  3563.2 12098.5 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)    28663       1155  24.821  < 2e-16 ***
sexFemale      -2716       1433  -1.896    0.064 .  
degreePhD      -1227       1372  -0.895    0.375    
edu            -7418       1306  -5.679 7.74e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4558 on 48 degrees of freedom
Multiple R-squared:  0.4416,    Adjusted R-squared:  0.4067 
F-statistic: 12.65 on 3 and 48 DF,  p-value: 3.231e-06

For edu, the regression output shows that, holding sex and degree constant, faculty hired by the new Dean are predicted to earn $7,418 less than those hired earlier. The p-value for the dummy variable is 7.74e-07, which is statistically significant (below .001).

We can therefore reject the null hypothesis that salaries are the same for the two groups, but the difference runs in the opposite direction to the claim. The model gives no support for the hypothesis that people hired by the new Dean are making more than those who were not. Because rank and year were left out of this model, the negative coefficient may partly reflect that earlier hires have had more time to be promoted.
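Since the claim about the new Dean is directional (new hires earn more), it can help to look at the one-sided p-value as well as the two-sided one that summary() reports. A minimal sketch, using the t value and residual degrees of freedom copied from the output above:

```r
# t value for edu and residual degrees of freedom from summary(newdean)
t_stat   <- -5.679
resid_df <- 48

# Two-sided p-value, as reported by summary()
p_two_sided <- 2 * pt(-abs(t_stat), resid_df)

# One-sided p-value for the alternative "edu coefficient > 0"
p_greater <- pt(t_stat, resid_df, lower.tail = FALSE)

c(p_two_sided = p_two_sided, p_greater = p_greater)
```

The one-sided p-value is close to 1, so the data give no evidence for a positive effect of being hired by the new Dean, even though the two-sided test is highly significant.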

Question 3

(SMSS 13.7 & 13.8 combined, modified) (Data file: house.selling.price in smss R package)

Part A

A. Using the house.selling.price data, run and report regression results modeling y = selling price (in dollars) in terms of size of home (in square feet) and whether the home is new (1 = yes; 0 = no). (In other words, price is the outcome variable and size and new are the explanatory variables.)

The first thing I did was pull in the data. Then I used lm() with price as the outcome variable and size and new as the explanatory variables.

data("house.selling.price")

ssn <- lm(Price ~ Size + New, data = house.selling.price)

summary(ssn)

Call:
lm(formula = Price ~ Size + New, data = house.selling.price)

Residuals:
    Min      1Q  Median      3Q     Max 
-205102  -34374   -5778   18929  163866 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -40230.867  14696.140  -2.738  0.00737 ** 
Size           116.132      8.795  13.204  < 2e-16 ***
New          57736.283  18653.041   3.095  0.00257 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 53880 on 97 degrees of freedom
Multiple R-squared:  0.7226,    Adjusted R-squared:  0.7169 
F-statistic: 126.3 on 2 and 97 DF,  p-value: < 2.2e-16

Overall, the adjusted R-squared suggests that the model explains about 71.7% of the variation in selling price.

The p-values for both Size and New are less than .05. Therefore we reject the null hypothesis that size has no association with selling price, holding New constant, and we likewise reject the null hypothesis that New has no association with selling price, holding size constant. Both variables are statistically significant predictors of selling price.

Part B

B. Report and interpret the prediction equation, and form separate equations relating selling price to size for new and for not new homes. In particular, for each variable; discuss statistical significance and interpret the meaning of the coefficient.

Based on the lm() output, the prediction equation is ŷ = -40230.87 + 116.13(x) + 57736.28(x1), where x is the size of the home in square feet and x1 equals 1 if the home is new and 0 if it is not.

If a home is new (x1 = 1), the prediction equation is ŷ = -40230.87 + 116.13(x) + 57736.28, which simplifies to ŷ = 17505.41 + 116.13(x).

If a home is not new (x1 = 0), the prediction equation is ŷ = -40230.87 + 116.13(x) + (57736.28*0), which simplifies to ŷ = -40230.87 + 116.13(x).

For size, each additional square foot is associated with a $116.13 increase in predicted selling price, holding New constant. The p-value is < 2e-16, which is statistically significant (below .001). Therefore we reject the null hypothesis that size has no association with selling price.

For new, a new home is predicted to sell for $57,736.28 more than a home of the same size that is not new. The p-value is .003, which is statistically significant (between .001 and .01). Therefore we reject the null hypothesis that new has no association with selling price.

Part C

C. Find the predicted selling price for a home of 3000 square feet that is (i) new, (ii) not new.

The predicted selling price for a home of 3000 square feet that is new is $365,895.41.

The predicted selling price for a home of 3000 square feet that is not new is $308,159.13.

new3 <- -40230.87 + 116.13*3000 + 57736.28

notnew3 <- -40230.87 + 116.13*3000

new3
[1] 365895.4
notnew3
[1] 308159.1

Part D

D. Fit another model, this time with an interaction term allowing interaction between size and new, and report the regression results

To fit the model with an interaction I again use lm(), adding the interaction term Size*New to the formula.

new <- lm(Price ~ Size + New + Size*New, house.selling.price)

summary(new)

Call:
lm(formula = Price ~ Size + New + Size * New, data = house.selling.price)

Residuals:
    Min      1Q  Median      3Q     Max 
-175748  -28979   -6260   14693  192519 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -22227.808  15521.110  -1.432  0.15536    
Size           104.438      9.424  11.082  < 2e-16 ***
New         -78527.502  51007.642  -1.540  0.12697    
Size:New        61.916     21.686   2.855  0.00527 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 52000 on 96 degrees of freedom
Multiple R-squared:  0.7443,    Adjusted R-squared:  0.7363 
F-statistic: 93.15 on 3 and 96 DF,  p-value: < 2.2e-16

Overall, the adjusted R-squared suggests that the model explains about 73.6% of the variation in selling price, up from 71.7% in the model with only size and new.

For Size:New, the coefficient is $61.92, a positive slope. The p-value is .005, which is statistically significant (between .001 and .01). Therefore we reject the null hypothesis that the effect of size on selling price is the same for new and not-new homes.

However, the variable New is now insignificant: its p-value went from .003 in the model without the interaction to .127 once Size:New was included. With the interaction in the model, the New coefficient only describes the difference between new and not-new homes at a size of 0 square feet.

Part E

E. Report the lines relating the predicted selling price to the size for homes that are (i) new, (ii) not new.

The updated prediction equation is ŷ = -22227.808 + 104.44(x1) + 61.916(x2) + -78527.502(x3), where x1 is the size of the home in square feet, x2 is the size of the home multiplied by whether the home is new (1) or not (0), and x3 is 1 if the home is new and 0 if it is not.

If a home is new (x3 = 1, so x2 = x1), the prediction equation is ŷ = -22227.808 + 104.44(x1) + -78527.502 + 61.916(x1), which simplifies to ŷ = -100755.31 + 166.356(x1).

If a home is not new (x3 = 0, so x2 = 0), the prediction equation is ŷ = -22227.808 + 104.44(x1) + -78527.502(0) + 61.916(0), which simplifies to ŷ = -22227.808 + 104.44(x1).

For size, the coefficient is the slope for homes that are not new: each additional square foot is associated with a $104.44 increase in predicted price. The p-value is < 2e-16, which is statistically significant (below .001). The coefficient has changed from $116.13 in the model without the interaction to $104.44 because it now applies only to homes that are not new. We still see a positive slope, although slightly smaller.

For new, the coefficient of −$78,527.50 is the predicted difference between a new and a not-new home at a size of 0 square feet, so it is not meaningful on its own. The p-value is .13, which is statistically insignificant. This is different from the previous model.

For Size:New, the coefficient is $61.92: each additional square foot adds $61.92 more to the predicted price of a new home than to that of a home that is not new, giving new homes a slope of about $166.36 per square foot. The p-value is .005, which is statistically significant (between .001 and .01).
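The two lines can be read off by collecting terms: the New coefficient shifts the intercept and the Size:New coefficient shifts the slope. A minimal sketch, using the coefficients from the output with Size rounded to 104.44 as in the text; the object names are illustrative.

```r
# Coefficients from the interaction model (Size rounded as in the text)
b_intercept <- -22227.808
b_size      <- 104.44
b_new       <- -78527.502
b_size_new  <- 61.916

# Intercept and slope of the price-versus-size line for each group
price_lines <- data.frame(
  home      = c("not new", "new"),
  intercept = c(b_intercept, b_intercept + b_new),
  slope     = c(b_size, b_size + b_size_new)
)

price_lines
```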

Part F

F. Find the predicted selling price for a home of 3000 square feet that is (i) new, (ii) not new.

The predicted selling price for a home of 3000 square feet that is new is $398,312.70.

The predicted selling price for a home of 3000 square feet that is not new is $291,092.20.

The predictions differ from the previous model because the interaction term lets the price per square foot depend on whether the home is new. With a positive Size:New coefficient of $61.92, new homes gain value faster with size, so at 3000 square feet the predicted price of a new home is higher than in the model without the interaction, while the predicted price of a home that is not new is lower.

newi <- -22227.808 + 104.44*(3000) + -78527.502 + 61.916* (3000*1)

oldi <- -22227.808 + 104.44*(3000) + (-78527.502*0) + 61.916*(3000*0)

newi
[1] 398312.7
oldi
[1] 291092.2

Part G

G. Find the predicted selling price for a home of 1500 square feet that is (i) new, (ii) not new. Comparing to (F), explain how the difference in predicted selling prices changes as the size of home increases.

The predicted selling price for a home of 1500 square feet that is new is $148,778.70.

The predicted selling price for a home of 1500 square feet that is not new is $134,432.20.

newi15 <- -22227.808 + 104.44*(1500) + -78527.502 + 61.916* (1500*1)

oldi15 <- -22227.808 + 104.44*(1500) + (-78527.502*0) + 61.916*(1500*0)

newi15
[1] 148778.7
oldi15
[1] 134432.2
oldi15/newi15 * 100
[1] 90.35716
oldi/newi * 100
[1] 73.08133

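For a 1500 square foot home, the predicted price of a home that is not new is about 90.36% of that of a new home, a difference of about $14,346.50.

For a 3000 square foot home, the predicted price of a home that is not new is only about 73.08% of that of a new home, a difference of about $107,220.50.

So as size increases, the gap between the predicted prices of new and not-new homes widens: it grows by $61.92 (the Size:New coefficient) for each additional square foot.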
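The widening gap can be written as a function of size: the difference between the new and not-new predictions is the New coefficient plus the Size:New coefficient times size. The sketch below uses a hypothetical helper, price_gap(), to evaluate it and to find the size at which the two predictions would be equal.

```r
# Coefficients from the interaction model
b_new      <- -78527.502
b_size_new <- 61.916

# Predicted price of a new home minus a not-new home of the same size
price_gap <- function(size_sqft) b_new + b_size_new * size_sqft

price_gap(c(1500, 3000))

# Size at which the model predicts the same price for both
break_even <- -b_new / b_size_new
break_even
```

Under this model the break-even size is about 1,268 square feet; below that, the model predicts a lower price for a new home than for one that is not new.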

Part H

H. Do you think the model with interaction or the one without it represents the relationship of size and new to the outcome price? What makes you prefer one model over another?

I believe the model with the interaction term better represents the relationship of size and new to price. The p-value for Size:New is .005, which is statistically significant, so we reject the null hypothesis that the effect of size on price is the same for new and not-new homes. The adjusted R-squared also rose from 0.7169 to 0.7363, and the residual standard error fell from 53,880 to 52,000.
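Another way to compare the two nested models is a partial F-test, which anova() would give directly from the fitted objects. As a rough sketch that works only from the printed R-squared values (so it is subject to rounding), the F statistic for adding the single interaction term can be computed by hand and compared with the square of the Size:New t value.

```r
# R-squared values copied from the two summary() outputs
r2_reduced    <- 0.7226   # Price ~ Size + New
r2_full       <- 0.7443   # Price ~ Size + New + Size:New
resid_df_full <- 96
extra_terms   <- 1

# Partial F statistic for the added interaction term
f_stat <- ((r2_full - r2_reduced) / extra_terms) /
  ((1 - r2_full) / resid_df_full)
p_value <- pf(f_stat, extra_terms, resid_df_full, lower.tail = FALSE)

c(f_stat = f_stat, t_squared = 2.855^2, p_value = p_value)
```

With one added term the F statistic should be close to the squared t value, so this test agrees with the significance of Size:New.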