HW 3

ALBA CRESPO

1. Import the dataset Apartments.xlsx

library(readxl)
mydata <- read_xlsx("./Apartments.xlsx")
head(mydata)

## # A tibble: 6 × 5
##     Age Distance Price Parking Balcony
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl>
## 1     7       28  1640       0       1
## 2    18        1  2800       1       0
## 3     7       28  1660       0       0
## 4    28       29  1850       0       1
## 5    18       18  1640       1       1
## 6    28       12  1770       0       1

Description:

Age: Age of an apartment in years
Distance: The distance from city center in km
Price: Price per m2
Parking: 0-No, 1-Yes
Balcony: 0-No, 1-Yes

2. Change categorical variables into factors. (0.5 p)

mydata$ParkingFactor <- factor(mydata$Parking, levels = c("0","1"), labels = c("No","Yes")) 
mydata$BalconyFactor <- factor(mydata$Balcony, levels = c("0","1"), labels = c("No","Yes"))
head(mydata)

## # A tibble: 6 × 7
##     Age Distance Price Parking Balcony ParkingFactor BalconyFactor
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl> <fct>         <fct>        
## 1     7       28  1640       0       1 No            Yes          
## 2    18        1  2800       1       0 Yes           No           
## 3     7       28  1660       0       0 No            No           
## 4    28       29  1850       0       1 No            Yes          
## 5    18       18  1640       1       1 Yes           Yes          
## 6    28       12  1770       0       1 No            Yes

3. Test the hypothesis H0: Mu_Price = 1900 eur. What can you conclude? (1 p)

t.test(mydata$Price, mu = 1900)

## 
##  One Sample t-test
## 
## data:  mydata$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
##  1937.443 2100.440
## sample estimates:
## mean of x 
##  2018.941

H0: Mu_Price = 1900 eur. H1: Mu_Price ≠ 1900 eur.

We reject null hypothesis (H0: Mu_Price = 1900 eur) at p = 0.005. Therefore, we can state average price is not 1900 EUR, differing from this value.

4. Estimate the simple regression function: Price = f(Age). Save results in object fit1 and explain the estimate of regression coefficient, coefficient of correlation and coefficient of determination. (2 p)

fit1 <- lm(Price ~ Age, data = mydata )
summary(fit1)

## 
## Call:
## lm(formula = Price ~ Age, data = mydata)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -623.9 -278.0  -69.8  243.5  776.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2185.455     87.043  25.108   <2e-16 ***
## Age           -8.975      4.164  -2.156    0.034 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared:  0.05302,    Adjusted R-squared:  0.04161 
## F-statistic: 4.647 on 1 and 83 DF,  p-value: 0.03401

H0: B1 = 0 -> No correlation or association between Price and Age of a building.

H1: B1 ≠ 0 -> Some correlation or association between Price and Age of a building.

We reject null hypothesis at p = 0.034. Hence, we state that there must be some correlation between price and age.

If age increases by one year, price, on average, goes down by 8.98 euros per m2.

Nevertheless, the model only explains linearly around a 5.3% (R2 = 0.053) of the variance in price, suggesting other variables do have a significant impact in the Price of the apartments.

5. Show the scateerplot matrix between Price, Age and Distance. Based on the matrix determine if there is potential problem with multicolinearity. (0.5 p)

library(car)

## Loading required package: carData

scatterplotMatrix(mydata[,c(3,1,2)])

Analyzing the scatterplot matrix, and the relationships between the variables, there are no apparent strong linear relation between any pair of variables. This lacking pattern suggests that multicollinearity is not a potential problem.

6. Estimate the multiple regression function: Price = f(Age, Distance). Save it in object named fit2.

fit2 <- lm(Price ~ Age + Distance, data = mydata)

7. Chech the multicolinearity with VIF statistics. Explain the findings. (1 p)

vif(fit2)

##      Age Distance 
## 1.001845 1.001845

The Variance Inflation Factor (VIF) near 1 among both variables, indicates a lack of multicollinearity between the variables studied, improving the regression analysis’ conditions, and therefore its validity. Moreover, this confirms the hypothesis made by looking the scatterplot matrix previously.

8. Calculate standardized residuals and Cooks Distances for model fit2. Remove any potentially problematic units (outliers or units with high influence). (2 p)

mydata$StdResid <- round(rstandard(fit2), 3)
mydata$CooksD <- round(cooks.distance(fit2), 3)
hist(mydata$StdResid, 
     xlab = "Standardized Residuals",
     ylab = "Frequency",
     main = "Histogram of Std. Residuals")

head(mydata[order(mydata$StdResid),],3)

## # A tibble: 3 × 9
##     Age Distance Price Parking Balcony ParkingFactor BalconyFactor StdResid CooksD
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl> <fct>         <fct>            <dbl>  <dbl>
## 1     7        2  1760       0       1 No            Yes              -2.15  0.066
## 2    12       14  1650       0       1 No            Yes              -1.50  0.013
## 3    12       14  1650       0       0 No            No               -1.50  0.013

hist(mydata$CooksD, 
     xlab = "Cooks Distance",
     ylab = "Frequency",
     main = "Histogram of Cooks Distances")

head(mydata[order(-mydata$CooksD),],3)

## # A tibble: 3 × 9
##     Age Distance Price Parking Balcony ParkingFactor BalconyFactor StdResid CooksD
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl> <fct>         <fct>            <dbl>  <dbl>
## 1     5       45  2180       1       1 Yes           Yes               2.58  0.32 
## 2    43       37  1740       0       0 No            No                1.44  0.104
## 3     2       11  2790       1       0 Yes           No                2.05  0.069

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following object is masked from 'package:car':
## 
##     recode

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

mydata <- mydata %>% 
  filter(!CooksD == 0.320)
fit2 <- lm(Price ~ Age + Distance, data = mydata)

The histograms and values of the Standardized Residuals of the observations, show that all of them ranged from -2.2 to 2.6. Therefore, there are no obvious outliers, since no level is above +3, nor below -3.

Moreover, checking the Cooks Distances and given the relative size of them in the dataset, a value of 0.320 stands out especially high, representing a unit with high influence. This observation is then removed to ensure that it does not bias the model.

9. Check for potential heteroskedasticity with scatterplot between standarized residuals and standrdized fitted values. Explain the findings. (1 p)

mydata$StdFitted <- scale(fit2$fitted.values)
mydata$CooksD <- round(cooks.distance(fit2), 3)
scatterplot(y = mydata$StdResid, x = mydata$StdFitted,
            ylab = "Standardized Residuals",
            xlab = "Standardized Fitted Values",
            boxplots = FALSE,
            regLine = FALSE,
            smooth = FALSE)

The examination of the scatterplot reveals a random dispersion of points, close to the horizontal zero line, with no evident pattern indicating variance changes. Additionally, the majority of values are compacted between -1 and 1 standardized fitted values, representing consistent variance. These observations suggest that no heteroskedasticity can be evidenced, supporting the assumption of homoskedasticity in the model.

However, the notable difference in the distribution of units between ranges (-3, -1) and (-1, 2), makes a Breusch Pagan test advisable, in order to confirm the absence of a pattern on the distribution and further assess the homoskedasticity assumption.

##extra

library(olsrr)

## 
## Attaching package: 'olsrr'

## The following object is masked from 'package:datasets':
## 
##     rivers

ols_test_breusch_pagan(fit2)

## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##               Data                
##  ---------------------------------
##  Response : Price 
##  Variables: fitted values of Price 
## 
##         Test Summary          
##  -----------------------------
##  DF            =    1 
##  Chi2          =    2.927455 
##  Prob > Chi2   =    0.08708469

H0: Variance is constant.

H1: Variance is not constant.

We cannot reject null hypothesis at p = 0.088, and therefore we assume homoskedasticity

10. Are standardized residuals ditributed normally? Show the graph and formally test it. Explain the findings. (1 p)

hist(mydata$StdResid, 
     xlab = "Standardized Residuals",
     ylab = "Frequency",
     main = "Histogram of Standardized Residuals")

The histogram of standardized residuals suggests that the distribution of values are not normally distributed, as the distribution of standardized residuals does not appear to conform to the typical normal distribution shape. Nevertheless, it would be recommended to perform a Shapiro-Wilk test to confirm this hypothesis.

shapiro.test(mydata$StdResid)

## 
##  Shapiro-Wilk normality test
## 
## data:  mydata$StdResid
## W = 0.94879, p-value = 0.002187

H0: Values are normally distributed.

H1: Values are not normally distributed.

We reject null hypothesis at p = 0.003. Then, we assume no normality for the standardized residuals.

11. Estimate the fit2 again without potentially excluded units and show the summary of the model. Explain all coefficients. (2 p)

# we already have the clean version
summary(fit2)

## 
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -604.92 -229.63  -56.49  192.97  599.35 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2456.076     73.931  33.221  < 2e-16 ***
## Age           -6.464      3.159  -2.046    0.044 *  
## Distance     -22.955      2.786  -8.240 2.52e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 276.1 on 81 degrees of freedom
## Multiple R-squared:  0.4838, Adjusted R-squared:  0.4711 
## F-statistic: 37.96 on 2 and 81 DF,  p-value: 2.339e-12

H0: B1 = 0 -> No correlation or association.

H1: B1 ≠ 0 -> Some correlation or association.

Price ~ Age: We reject null hypothesis at p = 0.044. Hence, we state that there must be some correlation between price and age. If the age of the apartment increases by one year, price, on average, goes down by 6.45 euros per m2 (p = 0.044). Assuming all other variables remain unchanged. This means that Distance does influence Price in a negative way. They have a negative relationship (coefficient = -6.464)

Price ~ Distance: We reject null hypothesis at p < 0.001. Hence, we state that there must be some correlation between price and distance. If distance from the city center increases by one kilometer, price, on average, goes down by 22.96 euros per m2 (p < 0.001). Assuming all other variables remain unchanged This means that Distance does influence Price in a negative way. They have a negative relationship (coefficient = -22.955)

This model explains linearly approximately a 48.38% of the variance in price, suggesting that while age and distance are two important factors, other variables do also impact the price of apartments.

12. Estimate the linear regression function Price = f(Age, Distance, Parking and Balcony). Be careful to correctly include categorical variables. Save the object named fit3.

fit3 <- lm(Price ~ Age + Distance + ParkingFactor + BalconyFactor, data = mydata)

13. With function anova check if model fit3 fits data better than model fit2. (1 p)

anova(fit2, fit3)

## Analysis of Variance Table
## 
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + ParkingFactor + BalconyFactor
##   Res.Df     RSS Df Sum of Sq      F  Pr(>F)  
## 1     81 6176767                              
## 2     79 5654480  2    522287 3.6485 0.03051 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

H0: Δρ2 = 0 -> Both models are equally good (change in coefficient determination is null).

H1: Δρ2 > 0 -> Second model is more statistically significant (positive change in coefficient of determination).

We reject null hypothesis at p = 0.031. Then we assume that the second model, including the two categorical variables, yields a statistically significant improvement from the first one.

14. Show the results of fit3 and explain regression coefficient for both categorical variables. Can you write down the hypothesis which is being tested with F-statistics, shown at the bottom of the output? (2 p)

summary(fit3)

## 
## Call:
## lm(formula = Price ~ Age + Distance + ParkingFactor + BalconyFactor, 
##     data = mydata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -473.21 -192.37  -28.89  204.17  558.77 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      2329.724     93.066  25.033  < 2e-16 ***
## Age                -5.821      3.074  -1.894  0.06190 .  
## Distance          -20.279      2.886  -7.026 6.66e-10 ***
## ParkingFactorYes  167.531     62.864   2.665  0.00933 ** 
## BalconyFactorYes  -15.207     59.201  -0.257  0.79795    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 267.5 on 79 degrees of freedom
## Multiple R-squared:  0.5275, Adjusted R-squared:  0.5035 
## F-statistic: 22.04 on 4 and 79 DF,  p-value: 3.018e-12

H0: B1 = 0 -> No correlation or association.

H1: B1 ≠ 0 -> Some correlation or association.

Price ~ ParkingFactorYes: If the apartment does have a parking, price, on average, goes up by 167.54 euros per m2 (p = 0.01). Assuming all other variables remain unchanged

Price ~ BalconyFactorYes: This variable effect is not statistically significant (p = 0.798). This suggests that balconies do not substantially impact the apartment’s price when other factors are considered.

F-STATISTIC:

The F-statistic tests a joint hypothesis about the significance of the whole model.

H0: All coefficients of the variables are 0 and therefore do not affect the price (ρ2 = 0).

H1: At least one of the variables affects the price of apartments(ρ2 > 0).

We reject null hypothesis at p < 0.001, and assume that one or more variables do have a significant effect on the apartment’s prices.

15. Save fitted values and calculate the residual for apartment ID2. (1 p)

mydata$Fitted <- fitted.values(fit3)
mydata$Residuals <- residuals(fit3)
mydata[2,c("Fitted", "Residuals")]

## # A tibble: 1 × 2
##   Fitted Residuals
##    <dbl>     <dbl>
## 1  2372.      428.

The fitted value for this apartment suggests an estimated price of 2372.197 eur. Nevertheless, the residual of 427.803 indicates an important difference between the estimate and the actual price.