DACSS 603: Homework 3

Homework # 3 questions and answers for DACSS 603: Introduction to Quantitative Analysis

Megan Georges
3/23/2022

Question 1

(SMSS 11.2, except part (d))

For recent data in Jacksonville, Florida, on y = selling price of home (in dollars), x1 = size of home (in square feet), and x2 = lot size (in square feet), the prediction equation is ŷ = −10,536 + 53.8x1 + 2.84x2.

a. A particular home of 1240 square feet on a lot of 18,000 square feet sold for $145,000. Find the predicted selling price and the residual, and interpret.

# Calculating predicted selling price
PSPfun <- function(a, b)
{-10536 + 53.8*a + 2.84*b}

PSPfun(1240, 18000)
[1] 107296
# Calculating residual
ResidualFUN <- function(real, prd){real - prd}

ResidualFUN(145000, 107296)
[1] 37704

The predicted selling price is 107,296 dollars and the actual selling price is 145,000 dollars. The residual is therefore 37,704 dollars, meaning the house sold for 37,704 dollars more than predicted. The homeowners made out well on this sale!

b. For fixed lot size, how much is the house selling price predicted to increase for each square-foot increase in home size? Why?

Using the prediction equation ŷ = −10,536 + 53.8x1 + 2.84x2, where x2 is lot size, the house selling price is predicted to increase by 53.8 dollars for each square-foot increase in home size, given that lot size is fixed. With lot size held fixed, the term 2.84x2 is a constant in the prediction equation, so the only change in the output comes from the home size term. The coefficient on home size is 53.8, so a one-square-foot increase in home size (a change of x1 = 1) increases the predicted price by 53.8 * 1 = 53.8 dollars.
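As a quick check (my addition, not required by the question), the prediction function from part (a) confirms this:

# Sanity check: one extra square foot of home size at fixed lot size
PSPfun(1241, 18000) - PSPfun(1240, 18000)
# should equal 53.8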

c. According to this prediction equation, for fixed home size, how much would lot size need to increase to have the same impact as a one-square-foot increase in home size?

# Calculating lot size needed for equal impact of 1 unit increase in home size

# 53.8(1) = 2.84x2
x2 <- 53.8/2.84
x2
[1] 18.94366

An increase in lot size of about 18.94 square feet would have the same impact on the predicted selling price as a one-square-foot increase in home size.
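This can be verified with the prediction function from part (a), a small sketch of my own:

# Verify: +18.94366 sq ft of lot has the same impact as +1 sq ft of home
PSPfun(1241, 18000) - PSPfun(1240, 18000)              # 53.8
PSPfun(1240, 18000 + 53.8/2.84) - PSPfun(1240, 18000)  # also 53.8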


Question 2

(ALR, 5.17, slightly modified)

(Data file: salary in alr4 R package). The data file concerns salary and other characteristics of all faculty in a small Midwestern college collected in the early 1980s for presentation in legal proceedings for which discrimination against women in salary was at issue. All persons in the data hold tenured or tenure track positions; temporary faculty are not included. The variables include degree, a factor with levels PhD and MS; rank, a factor with levels Asst, Assoc, and Prof; sex, a factor with levels Male and Female; Year, years in current rank; ysdeg, years since highest degree, and salary, academic year salary in dollars.

a. Test the hypothesis that the mean salary for men and women is the same, without regard to any other variable but sex. Explain your findings.

# Load the alr4 package, then load and preview the data
library(alr4)
data(salary)
head(salary)
   degree rank    sex year ysdeg salary
1 Masters Prof   Male   25    35  36350
2 Masters Prof   Male   13    22  35350
3 Masters Prof   Male   10    23  28200
4 Masters Prof Female    7    27  26775
5     PhD Prof   Male   19    30  33696
6 Masters Prof   Male   16    21  28516
# Testing hypothesis that mean salary for men and women is the same
lmSalSex <- lm(salary ~ sex, salary)
summary(lmSalSex)

Call:
lm(formula = salary ~ sex, data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-8602.8 -4296.6  -100.8  3513.1 16687.9 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)    24697        938  26.330   <2e-16 ***
sexFemale      -3340       1808  -1.847   0.0706 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5782 on 50 degrees of freedom
Multiple R-squared:  0.0639,    Adjusted R-squared:  0.04518 
F-statistic: 3.413 on 1 and 50 DF,  p-value: 0.0706

To start, the null hypothesis is that the mean salary for men and the mean salary for women are equal, and the alternative hypothesis is that they are not. I ran a regression with sex as the explanatory variable and salary as the outcome variable. The sexFemale coefficient is -3340, which means that, not accounting for any other variables, women in this sample make 3,340 dollars less than men on average. However, the p-value is 0.0706, above the .05 significance threshold, so we fail to reject the null hypothesis and cannot conclude that there is a difference between mean salaries for men and women.
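As a cross-check (my addition, not part of the required answer), the same comparison can be run as a two-sample t-test:

# Two-sample t-test of salary by sex
# var.equal = TRUE reproduces the regression t-test exactly;
# the default Welch test would give a slightly different p-value
t.test(salary ~ sex, data = salary, var.equal = TRUE)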

b. Run a multiple linear regression with salary as the outcome variable and everything else as predictors, including sex. Assuming no interactions between sex and the other predictors, obtain a 95% confidence interval for the difference in salary between males and females.

lmSalAll <- lm(salary ~ degree + rank + sex + year + ysdeg, salary)
confint(lmSalAll)
                 2.5 %      97.5 %
(Intercept) 14134.4059 17357.68946
degreePhD    -663.2482  3440.47485
rankAssoc    2985.4107  7599.31080
rankProf     8396.1546 13841.37340
sexFemale    -697.8183  3030.56452
year          285.1433   667.47476
ysdeg        -280.6397    31.49105

Assuming there is no interaction between sex and the other predictors, we can be 95% confident that the difference in mean salary for women compared to men, holding the other predictors fixed, falls between -697.8183 dollars and 3030.5645 dollars.
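The interval can also be reconstructed by hand from the summary output shown in part (c) below: the sexFemale estimate plus or minus the 97.5th percentile of a t distribution on 45 residual degrees of freedom times its standard error. A minimal sketch:

# Reconstruct the 95% CI for sexFemale by hand
est <- coef(summary(lmSalAll))["sexFemale", "Estimate"]
se  <- coef(summary(lmSalAll))["sexFemale", "Std. Error"]
est + c(-1, 1) * qt(0.975, df = df.residual(lmSalAll)) * se
# matches the confint() row: -697.8183 to 3030.5645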

c. Interpret your finding for each predictor variable; discuss (a) statistical significance, (b) interpretation of the coefficient / slope in relation to the outcome variable and other variables

summary(lmSalAll)

Call:
lm(formula = salary ~ degree + rank + sex + year + ysdeg, data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-4045.2 -1094.7  -361.5   813.2  9193.1 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 15746.05     800.18  19.678  < 2e-16 ***
degreePhD    1388.61    1018.75   1.363    0.180    
rankAssoc    5292.36    1145.40   4.621 3.22e-05 ***
rankProf    11118.76    1351.77   8.225 1.62e-10 ***
sexFemale    1166.37     925.57   1.260    0.214    
year          476.31      94.91   5.018 8.65e-06 ***
ysdeg        -124.57      77.49  -1.608    0.115    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2398 on 45 degrees of freedom
Multiple R-squared:  0.855, Adjusted R-squared:  0.8357 
F-statistic: 44.24 on 6 and 45 DF,  p-value: < 2.2e-16

When running a regression with salary as the outcome variable and all other variables as predictors, we have the following findings:

Degree: the degreePhD coefficient is 1388.61, predicting that faculty holding a PhD earn 1,388.61 dollars more than those holding a Masters, holding the other predictors fixed. With a p-value of .180, this is not statistically significant.

Rank: relative to the Assistant baseline, an Associate is predicted to earn 5,292.36 dollars more and a Professor 11,118.76 dollars more, other predictors held fixed. Both p-values are far below .05, so rank is a statistically significant predictor of salary.

Sex: the sexFemale coefficient is 1166.37, predicting that women earn 1,166.37 dollars more than men once the other variables are accounted for (note the sign flip from part (a)). With a p-value of .214, this is not statistically significant.

Year: each additional year in current rank predicts a salary increase of 476.31 dollars, with a p-value well below .05; this relationship is statistically significant.

Ysdeg: each additional year since highest degree predicts a salary decrease of 124.57 dollars, but with a p-value of .115 this is not statistically significant.

d. Change the baseline category for the rank variable. Interpret the coefficients related to rank again.

lmSalRank <- lm(salary ~ rank, salary)
summary(lmSalRank)

Call:
lm(formula = salary ~ rank, data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-5209.0 -1819.2  -417.8  1586.6  8386.1 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  17768.7      705.5   25.19  < 2e-16 ***
rankAssoc     5407.3     1066.6    5.07 6.09e-06 ***
rankProf     11890.3      972.4   12.23  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2993 on 49 degrees of freedom
Multiple R-squared:  0.7542,    Adjusted R-squared:  0.7442 
F-statistic: 75.17 on 2 and 49 DF,  p-value: 1.174e-15
salary$rank <- relevel(salary$rank, ref = "Prof")
lmSalRankNew <- lm(salary ~ rank, salary)
summary(lmSalRankNew)

Call:
lm(formula = salary ~ rank, data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-5209.0 -1819.2  -417.8  1586.6  8386.0 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  29659.0      669.3  44.316  < 2e-16 ***
rankAsst    -11890.3      972.4 -12.228  < 2e-16 ***
rankAssoc    -6483.0     1043.0  -6.216 1.09e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2993 on 49 degrees of freedom
Multiple R-squared:  0.7542,    Adjusted R-squared:  0.7442 
F-statistic: 75.17 on 2 and 49 DF,  p-value: 1.174e-15

When using Assistant as the baseline for the rank variable (the lowest of the three ranks: Assistant, Associate, and Professor), the coefficients show the expected increase in salary as rank increases. A rank of Associate predicts a salary 5,407.3 dollars higher, and a rank of Professor (the highest rank) a salary 11,890.3 dollars higher, than that of an Assistant. Both p-values are well below the .05 threshold, so the results are statistically significant.

When the baseline is changed to Professor, the coefficients look quite different but tell the same story. The ranks of Associate and Assistant are predicted to have salaries below that of a Professor: an Associate by 6,483 dollars and an Assistant by 11,890.3 dollars. The p-values remain well below .05, so this does not change the conclusion above that rank and salary have a statistically significant relationship.
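As a quick sanity check (my addition), releveling changes only the parameterization, not the underlying fit, so the two models should produce identical predictions:

# Releveling changes the reference category, not the fit
all.equal(fitted(lmSalRank), fitted(lmSalRankNew))
# should be TRUE; note both summaries report the same R-squared
# and residual standard error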

e. Finkelstein (1980), in a discussion of the use of regression in discrimination cases, wrote, “[a] variable may reflect a position or status bestowed by the employer, in which case if there is discrimination in the award of the position or status, the variable may be ‘tainted.’” Thus, for example, if discrimination is at work in promotion of faculty to higher ranks, using rank to adjust salaries before comparing the sexes may not be acceptable to the courts. Exclude the variable rank, refit, and summarize how your findings changed, if they did.

lmSalnoRank <- lm(salary ~ degree + sex + year + ysdeg, salary)
summary(lmSalnoRank)

Call:
lm(formula = salary ~ degree + sex + year + ysdeg, data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-8146.9 -2186.9  -491.5  2279.1 11186.6 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 17183.57    1147.94  14.969  < 2e-16 ***
degreePhD   -3299.35    1302.52  -2.533 0.014704 *  
sexFemale   -1286.54    1313.09  -0.980 0.332209    
year          351.97     142.48   2.470 0.017185 *  
ysdeg         339.40      80.62   4.210 0.000114 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3744 on 47 degrees of freedom
Multiple R-squared:  0.6312,    Adjusted R-squared:  0.5998 
F-statistic: 20.11 on 4 and 47 DF,  p-value: 1.048e-09

When removing the variable “rank”, the coefficient on sexFemale is -1286.54, compared to 1166.37 in the regression above that included rank. The new coefficient predicts that a woman’s salary would be 1,286.54 dollars less than a man’s when rank is excluded. However, the p-value is 0.332, so the result is not statistically significant. While the coefficient flipping to negative upon removal of rank is interesting, the lack of significance would likely prevent these results from holding up in court as evidence of discrimination on the basis of sex.

f. Everyone in this dataset was hired the year they earned their highest degree. It is also known that a new Dean was appointed 15 years ago, and everyone in the dataset who earned their highest degree 15 years ago or less than that has been hired by the new Dean. Some people have argued that the new Dean has been making offers that are a lot more generous to newly hired faculty than the previous one and that this might explain some of the variation in Salary. Create a new variable that would allow you to test this hypothesis and run another multiple regression model to test this. Select variables carefully to make sure there is no multicollinearity. Explain why multicollinearity would be a concern in this case and how you avoided it. Do you find support for the hypothesis that the people hired by the new Dean are making higher than those that were not?

# Create dummy variable for faculty the Dean hired
# Faculty with ysdeg equal to or less than 15 = 1
# Faculty with ysdeg greater than 15 = 0
library(dplyr)   # for the pipe
library(tibble)  # for add_column()
salary <- salary %>%
  add_column(hired = ifelse(salary$ysdeg <= 15, 1, 0))
lmHired <- lm(salary ~ hired, salary)
summary(lmHired)

Call:
lm(formula = salary ~ hired, data = salary)

Residuals:
   Min     1Q Median     3Q    Max 
 -8294  -3486  -1772   3829  10576 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  27469.4      913.4  30.073  < 2e-16 ***
hired        -7343.5     1291.8  -5.685 6.73e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4658 on 50 degrees of freedom
Multiple R-squared:  0.3926,    Adjusted R-squared:  0.3804 
F-statistic: 32.32 on 1 and 50 DF,  p-value: 6.734e-07
lmDean <- lm(salary ~ sex + rank + degree + hired, salary)
summary(lmDean)

Call:
lm(formula = salary ~ sex + rank + degree + hired, data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-6187.5 -1750.9  -438.9  1719.5  9362.9 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  29511.3      784.0  37.640  < 2e-16 ***
sexFemale     -829.2      997.6  -0.831    0.410    
rankAsst    -11925.7     1512.4  -7.885 4.37e-10 ***
rankAssoc    -7100.4     1297.0  -5.474 1.76e-06 ***
degreePhD     1126.2     1018.4   1.106    0.275    
hired          319.0     1303.8   0.245    0.808    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3023 on 46 degrees of freedom
Multiple R-squared:  0.7645,    Adjusted R-squared:  0.7389 
F-statistic: 29.87 on 5 and 46 DF,  p-value: 2.192e-13

First, I created a dummy variable called “hired”, coding those who earned their highest degree 15 or fewer years ago (and thus, per the problem setup, were hired by the new Dean) as 1 and everyone else as 0. Then I fit a new regression model with the variables sex, rank, degree, and hired, omitting year and ysdeg to prevent multicollinearity. Multicollinearity is a concern when predictors are highly correlated or otherwise related: the idea of regression is to observe how each variable partially affects the outcome while holding the other variables fixed, and we cannot reasonably change year, ysdeg, or hired individually while holding the other two fixed, since they tend to move together. In particular, hired is constructed directly from ysdeg, so the two cannot both be included. Likewise, year is highly correlated with ysdeg and hired because all three measure similar things (years in current rank, years since highest degree, and over/under 15 years since highest degree, respectively).
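A quick way to see this overlap empirically (a sketch of my own, using the hired column created above) is to inspect the pairwise correlations among the time-related variables:

# Pairwise correlations among the time-related predictors
cor(salary[, c("year", "ysdeg", "hired")])
# ysdeg and hired are strongly negatively correlated by construction,
# since hired is defined directly from ysdeg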

In the first of the two linear models displayed, I looked just at hired as a predictor and salary as the outcome. The coefficient of -7343.5 predicts that those hired by the new Dean make 7,343.5 dollars less than those hired earlier, a statistically significant result (p < .001). However, this fails to account for other key variables, and the adjusted R-squared is just .38, which shows that the model is not strongly predictive of salary.

The second model, which uses the variables described above (sex, rank, degree, and hired) as predictors and salary as the outcome, has an adjusted R-squared of about .74, suggesting it fits this data set considerably better.

Based on this regression model, those hired by the current Dean are predicted to make 319 dollars more than those hired before the current Dean’s appointment. Where salaries are concerned, this is a negligible amount. Furthermore, the p-value for the hired variable is .81, far above any conventional threshold, so the relationship between hired and salary is not statistically significant. Based on these factors, I would conclude that the findings do not indicate any favorable treatment (through more generous salary offers) by the Dean toward the faculty the Dean specifically hired.


Question 3

(SMSS 13.7 & 13.8 combined, modified)

(Data file: house.selling.price in smss R package)

a. Using the house.selling.price data, run and report regression results modeling y = selling price (in dollars) in terms of size of home (in square feet) and whether the home is new (1 = yes; 0 = no). (In other words, price is the outcome variable and size and new are the explanatory variables.)

# Load the smss package, then load and preview the data
library(smss)
data("house.selling.price")
head(house.selling.price)
  case Taxes Beds Baths New  Price Size
1    1  3104    4     2   0 279900 2048
2    2  1173    2     1   0 146500  912
3    3  3076    4     2   0 237700 1654
4    4  1608    3     2   0 200000 2068
5    5  1454    3     3   0 159900 1477
6    6  2997    3     2   1 499900 3153
lmPrice <- lm(Price ~ Size + New, house.selling.price)
summary(lmPrice)

Call:
lm(formula = Price ~ Size + New, data = house.selling.price)

Residuals:
    Min      1Q  Median      3Q     Max 
-205102  -34374   -5778   18929  163866 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -40230.867  14696.140  -2.738  0.00737 ** 
Size           116.132      8.795  13.204  < 2e-16 ***
New          57736.283  18653.041   3.095  0.00257 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 53880 on 97 degrees of freedom
Multiple R-squared:  0.7226,    Adjusted R-squared:  0.7169 
F-statistic: 126.3 on 2 and 97 DF,  p-value: < 2.2e-16

b. Report and interpret the prediction equation, and form separate equations relating selling price to size for new and for not new homes. In particular, for each variable; discuss statistical significance and interpret the meaning of the coefficient.

When regressing selling price on Size and New in the house selling price dataset, both a larger size and the designation as new predict a higher selling price. Each additional square foot is predicted to increase the selling price by 116.13 dollars, and a new house is predicted to sell for 57,736.28 dollars more than an otherwise comparable not-new house. The p-value for Size is less than 2e-16 and the p-value for New is about 0.003; both are below the .05 threshold, so both relationships to selling price are statistically significant in this model.

The prediction equations for selling price using size and newness as predictors is:

\(\hat{y}\)New = 17505.42 + 116.13x, where x = size of home (square feet)

\(\hat{y}\)NotNew = -40230.87 + 116.13x

The equation for not-new houses simply omits the premium that being new adds (57,736.28). Both are cases of the same full prediction equation, \(\hat{y}\) = -40230.87 + 116.13x + 57736.28y, where x = house size and y = newness status. Because New is a dummy variable, y = 0 for a not-new house and the 57,736.28 is not added to the predicted selling price. For the \(\hat{y}\)New equation, I added the value of being new to the intercept (-40230.87 + 57736.28 = 17505.42).
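As a check on that arithmetic (my addition), the new-home intercept can be pulled straight from the fitted model:

# Intercept of the new-home line: overall intercept plus the New coefficient
coef(lmPrice)["(Intercept)"] + coef(lmPrice)["New"]
# should return about 17505.42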

c. Find the predicted selling price for a home of 3000 square feet that is (i) new, (ii) not new.

# New
17505.42 + (116.13*3000)
[1] 365895.4
# Not new
-40230.87 + (116.13*3000)
[1] 308159.1

The predicted selling price of a house that is 3000 square feet in size will be 365895.4 dollars for a new house and 308159.1 dollars for a not new house.
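These figures can also be cross-checked with predict(), which avoids transcribing coefficients by hand (a sketch of my own):

# Cross-check part (c) with predict()
predict(lmPrice, newdata = data.frame(Size = 3000, New = c(1, 0)))
# should return roughly 365895 (new) and 308159 (not new)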

d. Fit another model, this time with an interaction term allowing interaction between size and new, and report the regression results

lmSizeNew <- lm(Price ~ Size*New, house.selling.price)
summary(lmSizeNew)

Call:
lm(formula = Price ~ Size * New, data = house.selling.price)

Residuals:
    Min      1Q  Median      3Q     Max 
-175748  -28979   -6260   14693  192519 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -22227.808  15521.110  -1.432  0.15536    
Size           104.438      9.424  11.082  < 2e-16 ***
New         -78527.502  51007.642  -1.540  0.12697    
Size:New        61.916     21.686   2.855  0.00527 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 52000 on 96 degrees of freedom
Multiple R-squared:  0.7443,    Adjusted R-squared:  0.7363 
F-statistic: 93.15 on 3 and 96 DF,  p-value: < 2.2e-16

e. Report the lines relating the predicted selling price to the size for homes that are (i) new, (ii) not new.

The predicted selling price, based on the new regression that includes interaction between Size and Newness, would look like:

\(\hat{y}\) = -22227.81 + 104.44x - 78527.50y + 61.92(xy), where x = size of house (square feet) and y = 1 for new, 0 for not new.

Setting y = 1 or y = 0 gives the two separate lines:

\(\hat{y}\)New = -100755.31 + 166.35x

\(\hat{y}\)NotNew = -22227.81 + 104.44x
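Again, the new-home line’s intercept and slope can be built from the fitted coefficients (my addition):

# New-home line: combine main effects with the interaction terms
coef(lmSizeNew)["(Intercept)"] + coef(lmSizeNew)["New"]  # about -100755.31
coef(lmSizeNew)["Size"] + coef(lmSizeNew)["Size:New"]    # about 166.35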

f. Find the predicted selling price for a home of 3000 square feet that is (i) new, (ii) not new.

# New house
-22227.81 + (104.44*3000) - 78527.50 + (61.92*3000)
[1] 398324.7
# Not new house
-22227.81 + (104.44*3000)
[1] 291092.2

The predicted selling price for a house of 3000 square feet is 398,324.7 dollars if it is new and 291,092.2 dollars if it is not new.
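These can be cross-checked against the interaction model with predict() (a sketch of my own):

# Cross-check part (f) with the interaction model
predict(lmSizeNew, newdata = data.frame(Size = 3000, New = c(1, 0)))
# should return roughly 398325 (new) and 291092 (not new)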

g. Find the predicted selling price for a home of 1500 square feet that is (i) new, (ii) not new. Comparing to (F), explain how the difference in predicted selling prices changes as the size of home increases.

# New house
-22227.81 + (104.44*1500) - 78527.50 + (61.92*1500)
[1] 148784.7
# Not new house
-22227.81 + (104.44*1500)
[1] 134432.2

The predicted selling price for a house of 1500 square feet that is new is 148,784.7 dollars and for one that is not new is 134,432.2 dollars. Compared to the previous question, where the house was double the size, the difference between the new and not-new predictions is much smaller here (about 14,352 dollars versus about 107,233 dollars). Because of the interaction term, the premium for being new grows by 61.92 dollars with each additional square foot, so the difference in predicted selling prices widens as home size increases.
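The widening gap follows directly from the model: the predicted new-versus-not-new difference is -78527.50 + 61.92 * Size. A quick computation (my addition):

# Predicted price gap (new minus not new) at each size
gap <- function(size) {-78527.50 + 61.92 * size}
gap(1500)  # about 14352.5, matching 148784.7 - 134432.2
gap(3000)  # about 107232.5, matching 398324.7 - 291092.2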

h. Do you think the model with interaction or the one without it represents the relationship of size and new to the outcome price? What makes you prefer one model over another?

It’s interesting that the interaction model has a large negative coefficient for the New variable (though that coefficient is not statistically significant on its own, and the interaction term makes up for it once the house surpasses about 1,268 square feet). The adjusted R-squared for the model with interaction is 0.7363, versus 0.7169 for the model without. The increase could partly reflect the additional (interaction) term, but it also suggests a slightly better fit to the data. Since the adjusted R-squared values are similar, my deciding factor is that the interaction term itself is statistically significant (p of about .005): the effect of size on price appears to differ for new versus not-new homes, so I prefer the model with interaction, which accounts for this.
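A formal way to compare the two nested models (my addition) is an F-test with anova(), which tests whether adding the interaction term significantly improves fit:

# F-test comparing the models without and with the interaction term
anova(lmPrice, lmSizeNew)
# with one added term, the F-test p-value equals the p-value on the
# Size:New coefficient (about .005)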