Haley Grace Henson

1. Import the dataset Apartments.xlsx

library(readxl)
mydata <- read_xlsx("./Apartments.xlsx")

Description:

  • Age: Age of an apartment in years
  • Distance: The distance from city center in km
  • Price: Price per m2
  • Parking: 0-No, 1-Yes
  • Balcony: 0-No, 1-Yes

2. Change categorical variables into factors. (0.5 p)

mydata$ParkingF <- factor(mydata$Parking,
                         levels = c("0","1"),
                         labels = c("No","Yes"))

mydata$BalconyF <- factor(mydata$Balcony,
                         levels = c("0","1"),
                         labels = c("No","Yes"))
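
A quick sanity check confirms the conversion (a minimal sketch; the expected levels follow from the labels above, output not shown):

levels(mydata$ParkingF)  # expected: "No"  "Yes"
levels(mydata$BalconyF)  # expected: "No"  "Yes"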

3. Test the hypothesis H0: Mu_Price = 1900 EUR. What can you conclude? (1 p)

head(mydata)
## # A tibble: 6 × 7
##     Age Distance Price Parking Balcony ParkingF BalconyF
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl> <fct>    <fct>   
## 1     7       28  1640       0       1 No       Yes     
## 2    18        1  2800       1       0 Yes      No      
## 3     7       28  1660       0       0 No       No      
## 4    28       29  1850       0       1 No       Yes     
## 5    18       18  1640       1       1 Yes      Yes     
## 6    28       12  1770       0       1 No       Yes
mean(mydata$Price)
## [1] 2018.941
sd(mydata$Price)
## [1] 377.8417
library(ggplot2)

ggplot(mydata, aes(x = Price)) +
  geom_histogram(binwidth = 100, colour = "black") +
  ylab("Frequency") +
  xlab("Price")

shapiro.test(mydata$Price)
## 
##  Shapiro-Wilk normality test
## 
## data:  mydata$Price
## W = 0.94017, p-value = 0.0006513
median(mydata$Price)
## [1] 1950
wilcox.test(mydata$Price,
            mu = 1900,
            alternative = "two.sided")
## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  mydata$Price
## V = 2328, p-value = 0.02844
## alternative hypothesis: true location is not equal to 1900

Since the Shapiro-Wilk test rejects normality of Price (p < 0.001), we first use the non-parametric Wilcoxon signed rank test: H0 is rejected at the 5% level (p = 0.02844).

t.test(mydata$Price,
       mu = 1900,
       alternative = "two.sided")
## 
##  One Sample t-test
## 
## data:  mydata$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
##  1937.443 2100.440
## sample estimates:
## mean of x 
##  2018.941

H0 is also rejected by the one-sample t-test (p = 0.004731).

Based on the above tests we can conclude that the average price per m2 differs from 1900 EUR.
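
As an optional extension, wilcox.test() can also report a confidence interval for the (pseudo)median by setting conf.int = TRUE (a sketch; output not shown):

wilcox.test(mydata$Price,
            mu = 1900,
            conf.int = TRUE)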

4. Estimate the simple regression function: Price = f(Age). Save the results in object fit1 and explain the estimate of the regression coefficient, the coefficient of correlation and the coefficient of determination. (2 p)

fit1 <- lm(Price ~ Age,
           data = mydata)

summary(fit1)
## 
## Call:
## lm(formula = Price ~ Age, data = mydata)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -623.9 -278.0  -69.8  243.5  776.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2185.455     87.043  25.108   <2e-16 ***
## Age           -8.975      4.164  -2.156    0.034 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared:  0.05302,    Adjusted R-squared:  0.04161 
## F-statistic: 4.647 on 1 and 83 DF,  p-value: 0.03401
corcoef <- sqrt(summary(fit1)$r.squared)
corcoef
## [1] 0.230255

Regression Coefficient: If Age increases by 1 year, then on average Price decreases by 8.975 EUR per m2.

Coefficient of Correlation: The square root of R-squared loses the sign, which must match the negative slope, so r = -0.230255. This indicates a weak negative linear relationship between apartment age and price.

Coefficient of Determination: The goodness of fit is 0.05302, which means that 5.3% of the variance of Price is explained by Age in the model.
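
Computing the correlation directly confirms the negative sign that sqrt(r.squared) drops:

cor(mydata$Price, mydata$Age)  # -0.230255, matching the negative slope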

5. Show the scatterplot matrix between Price, Age and Distance. Based on the matrix determine if there is a potential problem with multicollinearity. (0.5 p)

library(car)
## Loading required package: carData
scatterplotMatrix(mydata[ , c(-4,-5,-6,-7)],
                  smooth = FALSE)

Based on this matrix there appears to be no multicollinearity: the explanatory variables Age and Distance show no strong linear relationship with each other.
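
The pairwise correlations give a numeric counterpart to the scatterplot matrix (a minimal sketch; output not shown):

cor(mydata[, c("Price", "Age", "Distance")])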

6. Estimate the multiple regression function: Price = f(Age, Distance). Save it in an object named fit2.

fit2 <-  lm(Price ~ Age + Distance, 
           data = mydata)

summary(fit2)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -603.23 -219.94  -85.68  211.31  689.58 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2460.101     76.632   32.10  < 2e-16 ***
## Age           -7.934      3.225   -2.46    0.016 *  
## Distance     -20.667      2.748   -7.52 6.18e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared:  0.4396, Adjusted R-squared:  0.4259 
## F-statistic: 32.16 on 2 and 82 DF,  p-value: 4.896e-11

7. Check the multicollinearity with VIF statistics. Explain the findings. (1 p)

vif(fit2)
##      Age Distance 
## 1.001845 1.001845
mean(vif(fit2))
## [1] 1.001845

None of the VIFs are greater than 5 and the mean VIF is approximately 1, so there is no multicollinearity.
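
For intuition, a VIF can be reproduced by hand as 1 / (1 - R_j^2), where R_j^2 comes from regressing one explanatory variable on the other (a sketch; with only two predictors both VIFs coincide):

r2_age <- summary(lm(Age ~ Distance, data = mydata))$r.squared
1 / (1 - r2_age)  # should reproduce vif(fit2) for Age, about 1.0018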

8. Calculate standardized residuals and Cook's distances for model fit2. Remove any potentially problematic units (outliers or units with high influence). (2 p)

mydata$StdResiduals <- round(rstandard(fit2), 3)

head(mydata[order(mydata$StdResiduals),],3)
## # A tibble: 3 × 8
##     Age Distance Price Parking Balcony ParkingF BalconyF StdResiduals
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl> <fct>    <fct>           <dbl>
## 1     7        2  1760       0       1 No       Yes             -2.15
## 2    12       14  1650       0       1 No       Yes             -1.50
## 3    12       14  1650       0       0 No       No              -1.50
head(mydata[order(-mydata$StdResiduals),],3)
## # A tibble: 3 × 8
##     Age Distance Price Parking Balcony ParkingF BalconyF StdResiduals
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl> <fct>    <fct>           <dbl>
## 1     5       45  2180       1       1 Yes      Yes              2.58
## 2     2       11  2790       1       0 Yes      No               2.05
## 3    18        1  2800       1       0 Yes      No               1.78

In the above analysis no standardized residual is greater than 3 in absolute value, so we have no reason to treat any unit as an outlier.
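
The same check can be done programmatically; with the usual |z| > 3 cut-off this returns no rows:

mydata[abs(mydata$StdResiduals) > 3, ]  # empty: no outliers by this criterion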

mydata$CooksD <- round(cooks.distance(fit2), 3)

library(ggplot2)

ggplot(mydata, aes(x = CooksD)) +
  geom_histogram(bins = 50, colour = "black") +
  xlim(NA, 0.4) +
  ylab("Frequency") +
  xlab("Cook's Distance")
## Warning: Removed 1 row containing missing values or values outside the scale range (`geom_bar()`).

head(mydata[order(-mydata$CooksD),])
## # A tibble: 6 × 9
##     Age Distance Price Parking Balcony ParkingF BalconyF StdResiduals CooksD
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl> <fct>    <fct>           <dbl>  <dbl>
## 1     5       45  2180       1       1 Yes      Yes              2.58  0.32 
## 2    43       37  1740       0       0 No       No               1.44  0.104
## 3     2       11  2790       1       0 Yes      No               2.05  0.069
## 4     7        2  1760       0       1 No       Yes             -2.15  0.066
## 5    37        3  2540       1       1 Yes      Yes              1.58  0.061
## 6    40        2  2400       0       1 No       Yes              1.09  0.038

We can see a gap between 0.15 and 0.30, so the unit with Cook's distance 0.32 clearly has high influence. We will remove the high-influence units with the dplyr filter below, using a cut-off of 0.06 suggested by the histogram; a common rule-of-thumb check is sketched first.
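
A minimal sketch of the 4/n rule of thumb (this cut-off choice is an added assumption, not part of the assignment): with n = 85 it works out to about 0.047 and flags the same five units that the 0.06 filter removes.

cutoff <- 4 / nrow(mydata)        # about 0.047 with n = 85
mydata[mydata$CooksD > cutoff, ]  # the five high-influence units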

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
## 
##     recode
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
mydata <- mydata %>%
  filter(CooksD < 0.06)
ggplot(mydata, aes(x = CooksD)) +
  geom_histogram(bins = 50, colour = "black") +
  xlim(NA, 0.4) +
  ylab("Frequency") +
  xlab("Cook's Distance")
## Warning: Removed 1 row containing missing values or values outside the scale range (`geom_bar()`).

The new histogram shows the Cook's distances after the high-influence units have been removed.

fit2 <- lm(Price ~ Age + Distance,
           data = mydata)

9. Check for potential heteroskedasticity with a scatterplot between standardized residuals and standardized fitted values. Explain the findings. (1 p)

mydata$StdResiduals <- rstandard(fit2)  # recompute for the refitted model
mydata$StdFitted <- scale(fit2$fitted.values)


library(car)

scatterplot(y = mydata$StdResiduals, x = mydata$StdFitted,
            ylab = "Standardized Residuals",
            xlab = "Standardized Fitted",
            boxplots = FALSE,
            regLine = FALSE,
            smooth = FALSE)

For the most part this scatterplot suggests homoskedasticity; there is only a slight narrowing of the spread between -3 and -2. To check this formally we use the test below.

library(olsrr)
## 
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
## 
##     rivers
ols_test_breusch_pagan(fit2)
## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##               Data                
##  ---------------------------------
##  Response : Price 
##  Variables: fitted values of Price 
## 
##         Test Summary         
##  ----------------------------
##  DF            =    1 
##  Chi2          =    1.738591 
##  Prob > Chi2   =    0.1873174

The p-value of 0.187 is high (p > 0.05), so we cannot reject the null hypothesis that the error variance is constant; there is no apparent heteroskedasticity.
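
As a cross-check, the lmtest package offers the same test via bptest() (a sketch, assuming lmtest is installed; by default it uses the model's regressors rather than the fitted values, so the statistic can differ slightly):

library(lmtest)
bptest(fit2)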

10. Are standardized residuals distributed normally? Show the graph and formally test it. Explain the findings. (1 p)

hist(mydata$StdResiduals,
     xlab = "Standardized Residuals",
     ylab = "Frequency",
     main = "Standardized Residuals Histogram")

The standardized residuals look slightly right-skewed.
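
A Q-Q plot offers another visual check; points drifting away from the reference line indicate departures from normality (a minimal base-R sketch):

qqnorm(mydata$StdResiduals,
       main = "Normal Q-Q Plot of Standardized Residuals")
qqline(mydata$StdResiduals)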

shapiro.test(mydata$StdResiduals)
## 
##  Shapiro-Wilk normality test
## 
## data:  mydata$StdResiduals
## W = 0.93418, p-value = 0.0004761

The test gives a p-value < 0.001, so we reject the null hypothesis that the standardized residuals are normally distributed.

11. Estimate fit2 again without the excluded units and show the summary of the model. Explain all coefficients. (2 p)

fit2 <- lm(Price ~ Age + Distance, 
           data = mydata)

summary(fit2)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -411.50 -203.69  -45.24  191.11  492.56 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2502.467     75.024  33.356  < 2e-16 ***
## Age           -8.674      3.221  -2.693  0.00869 ** 
## Distance     -24.063      2.692  -8.939 1.57e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 256.8 on 77 degrees of freedom
## Multiple R-squared:  0.5361, Adjusted R-squared:  0.524 
## F-statistic: 44.49 on 2 and 77 DF,  p-value: 1.437e-13

Age Coefficient: Assuming all else equal, on average when Age increases by 1 year, Price decreases by 8.674 EUR per m2.

Distance Coefficient: Assuming all else equal, on average when Distance increases by 1 km, Price decreases by 24.063 EUR per m2.

These coefficients are both significant at alpha = 0.05.
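
For completeness, 95% confidence intervals for the refitted coefficients can be obtained with confint() (output not shown):

confint(fit2)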

12. Estimate the linear regression function Price = f(Age, Distance, Parking, Balcony). Be careful to correctly include the categorical variables. Save the object named fit3.

fit3 <- lm(Price ~ Age + Distance + ParkingF + BalconyF, 
           data = mydata)

summary(fit3)
## 
## Call:
## lm(formula = Price ~ Age + Distance + ParkingF + BalconyF, data = mydata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -390.93 -198.19  -53.64  186.73  518.34 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2393.316     93.930  25.480  < 2e-16 ***
## Age           -7.970      3.191  -2.498   0.0147 *  
## Distance     -21.961      2.830  -7.762 3.39e-11 ***
## ParkingFYes  128.700     60.801   2.117   0.0376 *  
## BalconyFYes    6.032     57.307   0.105   0.9165    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 252.7 on 75 degrees of freedom
## Multiple R-squared:  0.5623, Adjusted R-squared:  0.5389 
## F-statistic: 24.08 on 4 and 75 DF,  p-value: 7.764e-13

13. With function anova check if model fit3 fits the data better than model fit2. (1 p)

anova(fit2, fit3)
## Analysis of Variance Table
## 
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + ParkingF + BalconyF
##   Res.Df     RSS Df Sum of Sq      F Pr(>F)
## 1     77 5077362                           
## 2     75 4791128  2    286234 2.2403 0.1135

The p-value 0.1135 is > 0.05, so we cannot reject the null hypothesis that the difference between the coefficients of determination of the two models is 0. Adding ParkingF and BalconyF does not significantly improve the goodness of fit (fit3 is not significantly better than fit2).
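
The partial F-statistic can be reproduced by hand from the residual sums of squares in the table above, which makes the logic of the test explicit:

rss2 <- sum(residuals(fit2)^2)  # restricted model: Age + Distance
rss3 <- sum(residuals(fit3)^2)  # full model: + ParkingF + BalconyF
F_stat <- ((rss2 - rss3) / 2) / (rss3 / df.residual(fit3))
F_stat                                                # about 2.24
pf(F_stat, 2, df.residual(fit3), lower.tail = FALSE)  # about 0.1135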

14. Show the results of fit3 and explain the regression coefficients for both categorical variables. Can you write down the hypothesis which is being tested with the F-statistic, shown at the bottom of the output? (2 p)

summary(fit3)
## 
## Call:
## lm(formula = Price ~ Age + Distance + ParkingF + BalconyF, data = mydata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -390.93 -198.19  -53.64  186.73  518.34 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2393.316     93.930  25.480  < 2e-16 ***
## Age           -7.970      3.191  -2.498   0.0147 *  
## Distance     -21.961      2.830  -7.762 3.39e-11 ***
## ParkingFYes  128.700     60.801   2.117   0.0376 *  
## BalconyFYes    6.032     57.307   0.105   0.9165    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 252.7 on 75 degrees of freedom
## Multiple R-squared:  0.5623, Adjusted R-squared:  0.5389 
## F-statistic: 24.08 on 4 and 75 DF,  p-value: 7.764e-13

ParkingF: Holding all else equal, an apartment with a parking space is on average 128.70 EUR per m2 more expensive; the coefficient is significant at alpha = 0.05 (p = 0.0376).

BalconyF: This coefficient is not significant (p = 0.9165), so we cannot conclude that having a balcony has an effect on price.

H0: rho^2 = 0 (all partial regression coefficients are 0)

H1: rho^2 > 0 (at least one partial regression coefficient of the independent variables differs from 0)
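
As a sanity check, the same overall F-statistic is obtained by comparing fit3 against an intercept-only model (a sketch; output not shown):

fit0 <- lm(Price ~ 1, data = mydata)
anova(fit0, fit3)  # F should match the 24.08 in summary(fit3)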

15. Save fitted values and calculate the residual for apartment ID2. (1 p)

mydata$FittedValues3 <- fitted.values(fit3)
mydata[2,]
## # A tibble: 1 × 11
##     Age Distance Price Parking Balcony ParkingF BalconyF StdResiduals CooksD StdFitted[,1] FittedValues3
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl> <fct>    <fct>           <dbl>  <dbl>         <dbl>         <dbl>
## 1    18        1  2800       1       0 Yes      No               1.78   0.03          1.16         2357.
mydata$Residuals <- residuals(fit3)
mydata[2,]
## # A tibble: 1 × 12
##     Age Distance Price Parking Balcony ParkingF BalconyF StdResiduals CooksD StdFitted[,1] FittedValues3 Residuals
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl> <fct>    <fct>           <dbl>  <dbl>         <dbl>         <dbl>     <dbl>
## 1    18        1  2800       1       0 Yes      No               1.78   0.03          1.16         2357.      443.
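
The residual for apartment ID2 can be verified by hand as the observed price minus the fitted value:

mydata$Price[2] - mydata$FittedValues3[2]  # 2800 - 2357 = about 443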