HW3

#Question 1

For recent data in Jacksonville, Florida, on y = selling price of home (in dollars), x1 = size of home (in square feet), and x2 = lot size (in square feet),

• the prediction equation is ŷ = −10,536 + 53.8x1 + 2.84x2.

##Part 1a.

A particular home of 1240 square feet on a lot of 18,000 square feet sold for $145,000. Find the predicted selling price and the residual, and interpret.

###Part 1a. ANSWER

The predicted selling price for a home of 1240 square feet on a lot of 18,000 square feet is $107,296. It is obtained by substituting x1 = 1240 and x2 = 18,000 into the prediction equation:

ŷ = −10,536 + 53.8x1 + 2.84x2 = −10,536 + 53.8(1240) + 2.84(18,000) = $107,296

The residual is y − ŷ = $145,000 − $107,296 = $37,704.

The residual is positive, so the home sold for $37,704 more than the multiple regression model predicts for a home of this size on a lot of this size. In other words, the model underpredicts this sale.
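The arithmetic above can be checked with a few lines of R. This is only a sketch: the coefficients are the ones given in the question, and `predict_price` is a hypothetical helper name introduced here for illustration.

```r
# Coefficients from the prediction equation given in the question
b0     <- -10536
b_size <- 53.8
b_lot  <- 2.84

# Hypothetical helper (not part of the original answer)
predict_price <- function(size, lot) b0 + b_size * size + b_lot * lot

y_hat       <- predict_price(size = 1240, lot = 18000)
residual_1a <- 145000 - y_hat   # observed minus predicted

cat("Predicted price:", y_hat, "\n")
cat("Residual:", residual_1a, "\n")
```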

##Part 1b.

For fixed lot size, how much is the house selling price predicted to increase for each square-foot increase in home size? Why?

###Part 1b. ANSWER

The coefficient of x1 (size of home) in the prediction equation is 53.8. With lot size held fixed, each one-square-foot increase in home size therefore increases the predicted selling price by $53.80.

##Part 1c.

According to this prediction equation, for fixed home size, how much would lot size need to increase to have the same impact as a one-square-foot increase in home size?

###Part 1c. ANSWER

A one-square-foot increase in home size, with lot size fixed, raises the predicted price by $53.80. To find the lot-size increase that has the same impact, first compute the predicted price when the home from Part 1a gains one square foot:

ŷ = −10,536 + 53.8(1241) + 2.84(18,000) = $107,349.80

Then hold home size at 1240 square feet and solve for the lot size Y that gives the same predicted price:

$107,349.80 = −10,536 + 53.8(1240) + 2.84Y

Y = 18,018.94 ≈ 18,019

Lot size needs to increase by about 19 square feet (53.8 / 2.84 ≈ 18.94) to have the same impact as a one-square-foot increase in home size.
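The same answer follows directly from the ratio of the two slopes, since the intercept and the starting values cancel. A minimal sketch, using only the coefficients from the question (the variable names are illustrative):

```r
b_size <- 53.8   # dollars per square foot of home size
b_lot  <- 2.84   # dollars per square foot of lot size

# Lot-size increase whose effect on predicted price equals one extra
# square foot of home size: b_lot * d = b_size  =>  d = b_size / b_lot
lot_increase <- b_size / b_lot
round(lot_increase, 2)
```

Because the model is linear with no interaction, this ratio does not depend on the particular home chosen.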

#Question 2

(ALR, 5.17, slightly modified)

(Data file: salary in alr4 R package). The data file concerns salary and other characteristics of all faculty in a small Midwestern college collected in the early 1980s for presentation in legal proceedings for which discrimination against women in salary was at issue. All persons in the data hold tenured or tenure track positions; temporary faculty are not included. The variables include degree, a factor with levels PhD and MS; rank, a factor with levels Asst, Assoc, and Prof; sex, a factor with levels Male and Female; Year, years in current rank; ysdeg, years since highest degree, and salary, academic year salary in dollars.

##Part 2a.

Test the hypothesis that the mean salary for men and women is the same, without regard to any other variable but sex. Explain your findings.

###Part 2a. ANSWER

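To examine this hypothesis I plotted salary by sex. The box plots suggest that salaries for men and women are not the same: the center of the women's salary distribution lies well below that of the men. A box plot displays medians rather than means and is a visual check only, so a formal test (for example a two-sample t-test, or a regression of salary on sex alone) is needed to decide whether the difference in means is statistically significant.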
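A formal version of this comparison could look like the sketch below. It assumes the alr4 package is installed and that its `salary` data frame has the columns `salary` and `sex` described in the question. No results are quoted here because the test was not part of the original answer.

```r
# Assumption: the alr4 package is installed; it ships the `salary` data frame.
has_alr4 <- requireNamespace("alr4", quietly = TRUE)

if (has_alr4) {
  data("salary", package = "alr4")

  # Welch two-sample t-test: H0 is that mean salary is equal for the two sexes
  print(t.test(salary ~ sex, data = salary))

  # Regression on sex alone: the sexFemale coefficient is the difference in
  # sample means, and its t-test matches the pooled-variance t-test
  print(summary(lm(salary ~ sex, data = salary)))
}
```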

##Part 2b.

Run a multiple linear regression with salary as the outcome variable and everything else as predictors, including sex. Assuming no interactions between sex and the other predictors, obtain a 95% confidence interval for the difference in salary between males and females.

###Part 2b. ANSWER


Call:
lm(formula = salary ~ degree + rank + year + ysdeg + sex, data = salary)

Coefficients:
(Intercept)    degreePhD    rankAssoc     rankProf         year  
    15746.0       1388.6       5292.4      11118.8        476.3  
      ysdeg    sexFemale  
     -124.6       1166.4


Call:
lm(formula = salary ~ degree + rank + year + ysdeg + sex, data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-4045.2 -1094.7  -361.5   813.2  9193.1 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 15746.05     800.18  19.678  < 2e-16 ***
degreePhD    1388.61    1018.75   1.363    0.180    
rankAssoc    5292.36    1145.40   4.621 3.22e-05 ***
rankProf    11118.76    1351.77   8.225 1.62e-10 ***
year          476.31      94.91   5.018 8.65e-06 ***
ysdeg        -124.57      77.49  -1.608    0.115    
sexFemale    1166.37     925.57   1.260    0.214    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2398 on 45 degrees of freedom
Multiple R-squared:  0.855, Adjusted R-squared:  0.8357 
F-statistic: 44.24 on 6 and 45 DF,  p-value: < 2.2e-16

The coefficient on sexFemale estimates the difference in mean salary between females and males with degree, rank, year, and ysdeg held fixed: females are predicted to earn $1,166.37 more. With 45 residual degrees of freedom, the 95% confidence interval is estimate ± t(0.975, 45) × standard error = 1166.37 ± 2.014 × 925.57, or approximately (−$698, $3,031). The interval contains zero, which agrees with the p-value of 0.214 (no significance code next to sexFemale). Assuming no interaction between sex and the other predictors, the adjusted difference in salary between males and females is not statistically significant at the 5% level.
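A 95% confidence interval for the sexFemale coefficient can be rebuilt from the numbers printed in the summary. The sketch below uses only the reported estimate, standard error, and residual degrees of freedom; the variable names are illustrative.

```r
est      <- 1166.37   # sexFemale estimate from the summary above
se       <- 925.57    # its standard error
df_resid <- 45        # residual degrees of freedom

t_crit <- qt(0.975, df_resid)          # two-sided 95% critical value
ci     <- est + c(-1, 1) * t_crit * se
round(ci, 1)
```

With the fitted model object available (called `fit` here as a placeholder), `confint(fit, "sexFemale")` gives the same interval without rounding error from the printed summary.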

##Part 2c.

Interpret your finding for each predictor variable; discuss (a) statistical significance,

###Part 2c. ANSWER

From the summary, degreePhD (p = 0.180), ysdeg (p = 0.115), and sexFemale (p = 0.214) have no significance code next to them, so none of these predictors is statistically significant at the 5% level once the other variables are in the model. rankAssoc (p = 3.22e-05), rankProf (p = 1.62e-10), and year (p = 8.65e-06) each have three stars (***), which means their p-values fall in [0, 0.001]. Rank and years in current rank are therefore highly statistically significant predictors of salary.

(b) interpretation of the coefficient / slope in relation to the outcome variable and other variables

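ANSWER Salary is the outcome variable, so each coefficient is a predicted change in salary (in dollars) with the other predictors held fixed. The intercept, 15,746.05, is the predicted salary for the baseline group with the numeric predictors at zero: a male Assistant Professor with an MS degree, 0 years in current rank, and 0 years since highest degree. Holding the other variables fixed, a PhD is associated with a salary $1,388.61 higher than an MS; Associate Professors earn $5,292.36 more and full Professors $11,118.76 more than Assistant Professors; each additional year in current rank adds $476.31; each additional year since highest degree lowers predicted salary by $124.57; and females are predicted to earn $1,166.37 more than males.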

##Part 2d.

Change the baseline category for the rank variable. Interpret the coefficients related to rank again.

###Part 2d. ANSWER

First I checked the level order of rank by printing the variable.

Code: > salary$rank Output: Levels: Asst Assoc Prof

The first level, Asst, is the baseline (reference) category. After changing the baseline to Prof and refitting the model, the coefficients compare as follows.

BEFORE

 [1] Prof  Prof  Prof  Prof  Prof  Prof  Prof  Prof  Prof  Prof  Prof 
[12] Assoc Prof  Assoc Prof  Prof  Prof  Assoc Assoc Prof  Asst  Assoc
[23] Prof  Prof  Assoc Prof  Assoc Prof  Assoc Assoc Asst  Assoc Asst 
[34] Assoc Assoc Assoc Asst  Asst  Asst  Asst  Asst  Asst  Assoc Asst 
[45] Asst  Asst  Asst  Asst  Asst  Asst  Asst  Asst 
Levels: Asst Assoc Prof


Call:
lm(formula = salary ~ degree + rank + year + ysdeg + sex, data = salary)

Coefficients:
(Intercept)    degreePhD    rankAssoc     rankProf         year  
    15746.0       1388.6       5292.4      11118.8        476.3  
      ysdeg    sexFemale  
     -124.6       1166.4

AFTER

 [1] Prof  Prof  Prof  Prof  Prof  Prof  Prof  Prof  Prof  Prof  Prof 
[12] Assoc Prof  Assoc Prof  Prof  Prof  Assoc Assoc Prof  Asst  Assoc
[23] Prof  Prof  Assoc Prof  Assoc Prof  Assoc Assoc Asst  Assoc Asst 
[34] Assoc Assoc Assoc Asst  Asst  Asst  Asst  Asst  Asst  Assoc Asst 
[45] Asst  Asst  Asst  Asst  Asst  Asst  Asst  Asst 
Levels: Prof Asst Assoc


Call:
lm(formula = salary ~ degree + rank + year + ysdeg + sex, data = salary)

Coefficients:
(Intercept)    degreePhD     rankAsst    rankAssoc         year  
    26864.8       1388.6     -11118.8      -5826.4        476.3  
      ysdeg    sexFemale  
     -124.6       1166.4

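INTERPRETATION The level placed in the reference category is left out of the coefficient table, and the remaining rank coefficients are differences from it. With Asst as the baseline, Associate Professors earn $5,292.4 more and full Professors $11,118.8 more than Assistant Professors, other variables held fixed. With Prof as the baseline, Assistant Professors earn $11,118.8 less and Associate Professors $5,826.4 less than full Professors. The intercept rises from 15,746.0 to 26,864.8 (by exactly the old rankProf coefficient) because it now refers to full Professors. The fitted model is the same, and the coefficients for degree, year, ysdeg, and sex do not change.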
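One way to change the baseline is `relevel()`. The sketch below uses a small toy factor with the same levels as `salary$rank` (the four values are illustrative only, not the data) and then shows how the rank coefficients under the new baseline follow from the ones reported above.

```r
# Toy factor with the same level order as salary$rank (illustrative values)
rank_f <- factor(c("Asst", "Assoc", "Prof", "Prof"),
                 levels = c("Asst", "Assoc", "Prof"))

# Move Prof to the front so it becomes the reference category
rank_prof_base <- relevel(rank_f, ref = "Prof")
levels(rank_prof_base)

# Rank effects relative to Asst, taken from the summary output (Asst = 0)
old_effects <- c(Asst = 0, Assoc = 5292.36, Prof = 11118.76)

# Re-expressing them relative to Prof is a shift by the Prof effect
shifted <- old_effects - old_effects[["Prof"]]
round(shifted, 1)
```

The shift reproduces the rankAsst and rankAssoc coefficients in the AFTER output, which is why only the intercept and the rank terms change.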

##Part 2e.

Finkelstein (1980), in a discussion of the use of regression in discrimination cases, wrote, “[a] variable may reflect a position or status bestowed by the employer, in which case if there is discrimination in the award of the position or status, the variable may be ‘tainted.’” Thus, for example, if discrimination is at work in promotion of faculty to higher ranks, using rank to adjust salaries before comparing the sexes may not be acceptable to the courts.

Exclude the variable rank, refit, and summarize how your findings changed, if they did.

###Part 2e. ANSWER

BEFORE


Call:
lm(formula = salary ~ degree + rank + year + ysdeg + sex, data = salary)

Coefficients:
(Intercept)    degreePhD     rankAsst    rankAssoc         year  
    26864.8       1388.6     -11118.8      -5826.4        476.3  
      ysdeg    sexFemale  
     -124.6       1166.4

AFTER


Call:
lm(formula = salary ~ degree + year + ysdeg + sex, data = salary)

Coefficients:
(Intercept)    degreePhD         year        ysdeg    sexFemale  
    17183.6      -3299.3        352.0        339.4      -1286.5

Removing rank changes every coefficient. The degreePhD coefficient changes sign, from +1,388.6 to −3,299.3. The year coefficient drops from 476.3 to 352.0. The ysdeg coefficient changes from −124.6 to +339.4, so years since highest degree now has a positive association with salary; plausibly, without rank in the model, ysdeg picks up part of the seniority effect that rank used to capture. Most relevant to the discrimination question, the sexFemale coefficient changes sign, from +1,166.4 to −1,286.5: once rank is excluded, females are predicted to earn about $1,287 less than males with the same degree, years in rank, and years since degree. Only the coefficients are shown here, so the statistical significance of these changes cannot be judged from this output alone.

##Part 2f.

Everyone in this dataset was hired the year they earned their highest degree. It is also known that a new Dean was appointed 15 years ago, and everyone in the dataset who earned their highest degree 15 years ago or less than that has been hired by the new Dean. Some people have argued that the new Dean has been making offers that are a lot more generous to newly hired faculty than the previous one and that this might explain some of the variation in Salary.

Create a new variable that would allow you to test this hypothesis and run another multiple regression model to test this. Select variables carefully to make sure there is no multicollinearity.

###Part 2f. ANSWER

To test whether the new Dean has been making more generous offers, I built a new indicator variable from ysdeg and added it to the linear model used before. Because everyone was hired the year they earned their highest degree, faculty with ysdeg of 15 or less were hired by the new Dean, and faculty with ysdeg of 16 or more were hired before. Comparing these two groups gives the before and after picture we are looking for.

The indicator replaces ysdeg in the model while the rest of the previous structure is kept. I coded it two ways, ysdegBEFORE (TRUE when ysdeg >= 16) and ysdegAFTER (TRUE when ysdeg <= 15), and fit each version separately.

DATA BEFORE >= 16 years


Call:
lm(formula = salary ~ degree + rank + year + ysdegBEFORE + sex, 
    data = salary)

Coefficients:
    (Intercept)        degreePhD         rankAsst        rankAssoc  
        26588.8            818.9         -11096.9          -6124.3  
           year  ysdegBEFORETRUE        sexFemale  
          434.8          -2163.5            907.1

DATA AFTER <= 15 years


Call:
lm(formula = salary ~ degree + rank + year + ysdegAFTER + sex, 
    data = salary)

Coefficients:
   (Intercept)       degreePhD        rankAsst       rankAssoc  
       24425.3           818.9        -11096.9         -6124.3  
          year  ysdegAFTERTRUE       sexFemale  
         434.8          2163.5           907.1

##Explain why multicollinearity would be a concern in this case and how you avoided it.

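Multicollinearity means high intercorrelation among two or more independent variables in a multiple regression model. It inflates standard errors and makes it hard to separate the effect of each predictor on the dependent variable, which can lead to misleading results. It is a concern here because the new indicator is built directly from ysdeg, so it is strongly related to ysdeg by construction, and because ysdegBEFORE and ysdegAFTER are exact complements of each other.

I avoided this by replacing ysdeg with the indicator rather than including both, and by fitting ysdegBEFORE and ysdegAFTER in separate models. Including both indicators at once makes them perfectly collinear with the intercept, and R reports the coefficient for ysdegAFTER as NA (not available):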


Call:
lm(formula = salary ~ degree + rank + year + ysdegBEFORE + ysdegAFTER + 
    sex, data = salary)

Coefficients:
    (Intercept)        degreePhD         rankAsst        rankAssoc  
        26588.8            818.9         -11096.9          -6124.3  
           year  ysdegBEFORETRUE   ysdegAFTERTRUE        sexFemale  
          434.8          -2163.5               NA            907.1
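
The NA above is what perfect collinearity looks like. The sketch below reproduces the mechanism on a handful of made-up ysdeg values (they are illustrative, not taken from the data): the two indicators always sum to one, so together with the intercept column the design matrix loses a rank.

```r
# Illustrative ysdeg values (made up for this sketch, not from the data)
ysdeg <- c(3, 10, 15, 16, 22, 30)

hired_by_new_dean <- as.numeric(ysdeg <= 15)  # 1 = hired by the new Dean
hired_before      <- as.numeric(ysdeg >= 16)  # exact complement

# The two indicators always add up to the intercept column of ones
all(hired_by_new_dean + hired_before == 1)

# Intercept plus both indicators: three columns, but only rank 2
X <- cbind(intercept = 1, hired_before, hired_by_new_dean)
qr(X)$rank
```

Keeping only one of the two indicators removes the redundancy, which is why the separate BEFORE and AFTER models report the same fit with the sign of the indicator coefficient flipped.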

##Do you find support for the hypothesis that the people hired by the new Dean are making higher than those that were not?

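The coefficient on ysdegAFTERTRUE is 2,163.5, so with degree, rank, year, and sex held fixed, faculty hired by the new Dean (highest degree earned 15 years ago or less) are predicted to earn about $2,164 more than those hired earlier. The direction of the estimate supports the hypothesis that the people hired by the new Dean are making more than those who were not. Standard errors and p-values are not shown in this output, so the statistical significance of the difference still needs to be checked from the model summary.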

#Question 3

(SMSS 13.7 & 13.8 combined, modified)

(Data file: house.selling.price in smss R package)

##Part 3a.

Using the house.selling.price data, run and report regression results modeling y = selling price (in dollars) in terms of size of home (in square feet) and whether the home is new (1 = yes; 0 = no). (In other words, price is the outcome variable and size and new are the explanatory variables.)

###Part 3a. ANSWER


Call:
lm(formula = Price ~ Size + case + Taxes + Beds + Baths + New, 
    data = house.selling.price)

Coefficients:
(Intercept)         Size         case        Taxes         Beds  
   25438.94        63.83      -453.82        38.36    -10894.83  
      Baths          New  
    1725.98     44548.63

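Note that the model shown here includes case, Taxes, Beds, and Baths in addition to Size and New; the model with only Size and New that the question asks for is reported in Part 3b. The coefficients are predicted changes in price with the other variables held fixed, not correlations. In this model a new home is predicted to sell for $44,548.63 more than a comparable home that is not new, and each additional square foot adds $63.83 to the predicted price. The Beds coefficient is negative (−10,894.83): with size, baths, and the other variables held fixed, an additional bedroom is associated with a lower predicted price. This output does not show standard errors, so statistical significance cannot be assessed from it.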

##Part 3b.

Report and interpret the prediction equation, and form separate equations relating selling price to size for new and for not new homes. In particular, for each variable; discuss statistical significance and interpret the meaning of the coefficient.

###Part 3b. ANSWER

The model with only Size and New is fit and summarized below. The summary is used to report the prediction equation, discuss statistical significance, and interpret the meaning of each coefficient.


Call:
lm(formula = Price ~ Size + New, data = house.selling.price)

Residuals:
    Min      1Q  Median      3Q     Max 
-205102  -34374   -5778   18929  163866 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -40230.867  14696.140  -2.738  0.00737 ** 
Size           116.132      8.795  13.204  < 2e-16 ***
New          57736.283  18653.041   3.095  0.00257 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 53880 on 97 degrees of freedom
Multiple R-squared:  0.7226,    Adjusted R-squared:  0.7169 
F-statistic: 126.3 on 2 and 97 DF,  p-value: < 2.2e-16

Prediction equation: E(price) = −40231 + 57736(New) + 116(Size), where New = 1 for new homes and New = 0 for homes that are not new.

Prediction Equation for New Homes E(price) = −40231 + 57736(1) + 116(Size) = 17505 + 116(Size)

Prediction Equation for Old Homes E(price) = −40231 + 57736(0) + 116(Size) = −40231 + 116(Size)

As the prediction equations show, the two lines have the same slope and differ only in intercept: at any given size, a new home is predicted to sell for $57,736 more than a home that is not new. To predict a price we need both the size of the home and whether it is new.

REPORT & INTERPRETATION

In terms of statistical significance, Size has three stars (p < 2e-16) and New has two stars (p = 0.00257), so both predictors are statistically significant at the 1% level, with the strongest evidence for Size. In terms of coefficients, the intercept (−40,230.87) is the predicted price when Size = 0 and New = 0, which is outside the range of the data and has no practical meaning. The Size slope means each additional square foot raises the predicted price by $116.13, for new and not-new homes alike. The New coefficient means a new home is predicted to sell for $57,736.28 more than a home of the same size that is not new.

The standard error measures how much a coefficient estimate would vary from sample to sample, so it tells us how precisely the coefficient is estimated. The Size slope has a small standard error relative to its estimate (8.795 against 116.132), while the New coefficient is estimated less precisely (18,653 against 57,736).

The t value is the estimate divided by its standard error and is used to test the null hypothesis that the coefficient is zero. The farther it is from zero, the stronger the evidence against the null hypothesis. Here both Size (t = 13.204) and New (t = 3.095) are far from zero, so we reject the null hypothesis for each and conclude that each predictor is related to price.

Pr(>|t|) is the p-value for that test. The closer it is to zero, the safer it is to reject the null hypothesis. Both p-values are below 0.01, so we can conclude that selling price is related to both the size of the home and whether it is new.

##Part 3c.

Find the predicted selling price for a home of 3000 square feet that is (i) new, (ii) not new.

###Part 3c. ANSWER

(i) New: Prediction Equation for New Homes

E(price) = −40231 + 57736(New) + 116(Size)

E(price) = −40231 + 57736(1) + 116(3000)

E(price) = −40231 + 57736 + 348,000 = $365,505

(ii) Not new: Prediction Equation for Old Homes

E(price) = −40231 + 57736(New) + 116(Size)

E(price) = −40231 + 57736(0) + 116(3000)

E(price) = −40231 + 348,000 = $307,769
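These two predictions can be reproduced with a small function. This is a sketch using the rounded coefficients from the prediction equation; `predict_main` is a hypothetical name introduced for illustration.

```r
# No-interaction model with the rounded coefficients used in the answer
predict_main <- function(size, new) {
  -40231 + 116 * size + 57736 * new
}

price_new     <- predict_main(3000, new = 1)
price_not_new <- predict_main(3000, new = 0)

cat("New:    ", price_new, "\n")
cat("Not new:", price_not_new, "\n")
cat("Gap:    ", price_new - price_not_new, "\n")
```

Calling `predict()` on the fitted model object with `newdata = data.frame(Size = 3000, New = c(1, 0))` would use the unrounded coefficients, so its results differ slightly from the hand calculation.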

##Part 3d.

Fit another model, this time with an interaction term allowing interaction between size and new, and report the regression results

###Part 3d. ANSWER

As mentioned in our lectures, an interaction term is the cross product of two variables (here Size and New). A quick look at Size against New gives an idea of the range of the two variables: the majority of new homes have between 1000 and 2000 square feet.

In the regression, we handle the interaction by adding the cross-product term Size*New to the regression equation. In an R formula, Size * New expands to Size + New + Size:New, so the model contains both main effects and the interaction.


Call:
lm(formula = Price ~ Size + Size * New, data = house.selling.price)

Residuals:
    Min      1Q  Median      3Q     Max 
-175748  -28979   -6260   14693  192519 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -22227.808  15521.110  -1.432  0.15536    
Size           104.438      9.424  11.082  < 2e-16 ***
New         -78527.502  51007.642  -1.540  0.12697    
Size:New        61.916     21.686   2.855  0.00527 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 52000 on 96 degrees of freedom
Multiple R-squared:  0.7443,    Adjusted R-squared:  0.7363 
F-statistic: 93.15 on 3 and 96 DF,  p-value: < 2.2e-16

E(price) = −22228 − 78528(New) + 104(Size) + 62(Size × New), where New = 1 for new homes and New = 0 for old homes.

##Part 3e.

Report the lines relating the predicted selling price to the size for homes that are (i) new, (ii) not new.

###Part 3e. ANSWER

New Homes E(price) = −22228 − 78528(1) + 104(Size) + 62(Size × 1) = −100756 + 166(Size)

Old Homes E(price) = −22228 + 104(Size)

The line for new homes has a one in place of “New”, and the line for old homes has a zero. For homes that are not new, the zero cancels the New and Size × New terms, which reduces the E(price) formula to the intercept and the Size term. For new homes, each additional square foot adds 104 + 62 = 166 dollars to the predicted price, compared with 104 dollars for homes that are not new.

##Part 3f.

Find the predicted selling price for a home of 3000 square feet that is (i) new, (ii) not new.

###Part 3f. ANSWER

New = 1 or 0 depending on whether we solve for new or not new.

(i) New: Prediction Equation for New Homes

E(price) = −22228 − 78528(New) + 104(Size) + 62(Size × New)

E(price) = −22228 − 78528(1) + 104(3000) + 62(3000 × 1)

E(price) = −22228 − 78528 + 312000 + 186000

E(price) = $397,244

(ii) Not new: Prediction Equation for Old Homes

E(price) = −22228 − 78528(New) + 104(Size) + 62(Size × New)

E(price) = −22228 − 78528(0) + 104(3000) + 62(3000 × 0)

E(price) = −22228 + 312000

E(price) = $289,772

##Part 3g.

Find the predicted selling price for a home of 1500 square feet that is (i) new, (ii) not new. Comparing to (F), explain how the difference in predicted selling prices changes as the size of home increases.

###Part 3g. ANSWER

(i) New: Prediction Equation for New Homes

E(price) = −22228 − 78528(New) + 104(Size) + 62(Size × New)

E(price) = −22228 − 78528(1) + 104(1500) + 62(1500 × 1)

E(price) = −22228 − 78528 + 156000 + 93000

E(price) = $148,244

(ii) Not new: Prediction Equation for Old Homes

E(price) = −22228 − 78528(New) + 104(Size) + 62(Size × New)

E(price) = −22228 − 78528(0) + 104(1500) + 62(1500 × 0)

E(price) = −22228 + 156000

E(price) = $133,772

The difference in predicted selling price between new and not-new homes grows as the home gets larger. At 3000 square feet the new home is predicted to sell for $107,472 more than the home that is not new ($397,244 − $289,772), while at 1500 square feet the difference is only $14,472 ($148,244 − $133,772), roughly seven times smaller. This is the interaction at work: the difference equals −78528 + 62(Size), so each additional square foot widens the gap by $62.
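The predictions for both sizes, and the gap between new and not-new homes, can be tabulated in one pass. A minimal sketch with the rounded coefficients from the interaction model; `predict_int` is a hypothetical helper name.

```r
# Interaction model with the rounded coefficients used in the answer
predict_int <- function(size, new) {
  -22228 + 104 * size - 78528 * new + 62 * size * new
}

sizes   <- c(1500, 3000)
new     <- predict_int(sizes, new = 1)
not_new <- predict_int(sizes, new = 0)
gap     <- new - not_new          # equals -78528 + 62 * size

data.frame(size = sizes, new = new, not_new = not_new, gap = gap)
```

Doubling the size from 1500 to 3000 square feet adds 1500 × 62 = 93,000 dollars to the gap, which is the difference between the two gaps reported above.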

##Part 3h.

Do you think the model with interaction or the one without it represents the relationship of size and new to the outcome price? What makes you prefer one model over another?

###Part 3h. ANSWER

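I think the model without the interaction represents the relationship of size and new to the outcome price. What makes me prefer it is that it is simpler to interpret and its New coefficient is statistically significant (p = 0.00257), whereas in the interaction model the New main effect is not (p = 0.127). A smaller p-value indicates stronger evidence of a relationship with the dependent variable (price), not a larger effect. That said, the interaction term itself is significant (p = 0.00527) and the adjusted R-squared rises from 0.7169 to 0.7363, so the interaction model fits the data somewhat better and the choice trades simplicity against fit.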
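One way to weigh the two models is a partial F-test for the single added interaction term. The sketch below rebuilds that test from the R-squared values and degrees of freedom printed in the two summaries, so it is subject to rounding in those printed values; the variable names are illustrative.

```r
r2_main      <- 0.7226   # R-squared, model without interaction
r2_int       <- 0.7443   # R-squared, model with interaction
df_resid_int <- 96       # residual df of the interaction model
n_added      <- 1        # one extra term (Size:New)

# Partial F statistic for the added term
f_stat  <- ((r2_int - r2_main) / n_added) / ((1 - r2_int) / df_resid_int)
p_value <- pf(f_stat, n_added, df_resid_int, lower.tail = FALSE)

c(F = f_stat, t_equivalent = sqrt(f_stat), p = p_value)
```

With one added term, the square root of the F statistic should agree with the t value reported for Size:New up to rounding. With both fitted objects available, `anova()` on the two models performs the same comparison directly.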