Nicolás Ortiz Queirolo


1. Import the dataset Apartments.xlsx

library(readxl)

apartments_db <- read_xlsx("./Apartments.xlsx")
apartments_db <- as.data.frame(apartments_db)

Description:

  • Age: Age of an apartment in years
  • Distance: The distance from city center in km
  • Price: Price per m2
  • Parking: 0-No, 1-Yes
  • Balcony: 0-No, 1-Yes
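A quick structural check of the imported data (optional, but useful before any transformation) can confirm the variable types; a minimal sketch:

# Dimensions, variable types and first rows of the imported data
str(apartments_db)
head(apartments_db)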

2. What could be a possible research question given the data you analyze? (1 p)

The most reasonable research question, given the data at hand, is:

To what extent do apartment characteristics such as age, distance from the city center, and the presence of a parking space or balcony influence the price per square meter?

Alternative research questions could be:

  • Is the presence of a parking space associated with higher apartment prices per square meter? By how much?
  • Does the effect of an apartment’s age on its price vary depending on whether it has parking or not?

3. Change categorical variables into factors. (0.5 p)

apartments_db$ParkingFactor <- factor(apartments_db$Parking, 
                                      levels = c(0,1),
                                      labels = c("No","Yes"))

apartments_db$BalconyFactor <- factor(apartments_db$Balcony, 
                                      levels = c(0,1),
                                      labels = c("No","Yes"))
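As a quick optional check that the recoding worked, the original dummies can be cross-tabulated against the new factors:

# Every 0 should map to "No" and every 1 to "Yes"
table(apartments_db$Parking, apartments_db$ParkingFactor)
table(apartments_db$Balcony, apartments_db$BalconyFactor)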

4. Test the hypothesis \(H_0: \mu_{Price} = 1900\) EUR. What can you conclude? (1 p)

To perform this test, we first need to select the type of test that best fits the null hypothesis, and then check the parametric test assumptions to decide which variant of the test to perform.

  • Test selection: since the hypothesis concerns the population arithmetic mean, we will perform a one-sample t-test if the parametric assumptions hold, and a Wilcoxon signed-rank test otherwise.
  • Assumptions to check: (1) Numerical variable (2) The variable of interest is normally distributed

Assumption #1: The research variable (Price) is numeric

str(apartments_db$Price)
##  num [1:85] 1640 2800 1660 1850 1640 1770 1850 1970 2270 2570 ...

Conclusion: The variable is numerical

Assumption #2: The variable of interest is normally distributed

To check this assumption, a histogram will first be plotted and analyzed.

library(ggplot2)
ggplot(data = apartments_db, aes(x=Price)) +
  geom_histogram(binwidth = 100, colour="gray", fill = "blue") +
  ylab("Frequency")

At first glance, the histogram looks slightly skewed to the right. Since no firm conclusion can be drawn from the graph alone, a formal normality test will be conducted.
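A normal Q-Q plot (not required here, but a common companion to the histogram) could also be drawn before the formal test; a minimal sketch with ggplot2:

# Points should lie close to the reference line if Price is roughly normal
ggplot(data = apartments_db, aes(sample = Price)) +
  stat_qq() +
  stat_qq_line()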

library(rstatix)
shapiro_test(apartments_db$Price)
## # A tibble: 1 × 3
##   variable            statistic  p.value
##   <chr>                   <dbl>    <dbl>
## 1 apartments_db$Price     0.940 0.000651

Conclusion: We reject \(H_0\) with a P-value < 0.001, meaning that the Price variable is not normally distributed; we therefore need to perform a Wilcoxon signed-rank test instead of a t-test.

Wilcoxon Signed Rank Test

mu_price <- 1900

wilcox.test(apartments_db$Price,
            mu = mu_price,
            correct = FALSE)
## 
##  Wilcoxon signed rank test
## 
## data:  apartments_db$Price
## V = 2328, p-value = 0.02828
## alternative hypothesis: true location is not equal to 1900

Conclusion: We reject \(H_0\) with a P-value = 0.03, meaning that the location of the Price distribution differs from 1,900 EUR. To assess how large the deviation of the sample estimate from the hypothesized value is, the effect size will be calculated and interpreted.

library(effectsize)
es_wilcox <- effectsize(wilcox.test(apartments_db$Price,
                                    mu = mu_price,
                                    correct = FALSE))

interpret_rank_biserial(es_wilcox)
## r (rank biserial) |       95% CI | Interpretation
## -------------------------------------------------
## 0.27              | [0.04, 0.48] |         medium
## 
## - Deviation from a difference of 1900.
## - Interpretation rule: funder2019

Even though the null hypothesis was rejected, the effect size is medium: the deviation of the sample from the hypothesized value is neither small nor large.
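As a complementary, optional step, wilcox.test can also report the Hodges-Lehmann pseudomedian estimate with a confidence interval, which shows where the location of Price actually sits relative to 1,900 EUR; a sketch:

# conf.int = TRUE adds the pseudomedian estimate and a 95% CI
wilcox.test(apartments_db$Price,
            mu = mu_price,
            conf.int = TRUE,
            correct = FALSE)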


5. Estimate the simple regression function: Price = f(Age). Save results in object fit1 and explain the estimate of regression coefficient, coefficient of correlation and coefficient of determination. (1 p)

For this exercise, we will estimate the following simple regression function: \(\hat{Price}=b_0+b_1\times Age\)

fit1 <- lm(Price ~ Age,
           data = apartments_db)

summary(fit1)
## 
## Call:
## lm(formula = Price ~ Age, data = apartments_db)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -623.9 -278.0  -69.8  243.5  776.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2185.455     87.043  25.108   <2e-16 ***
## Age           -8.975      4.164  -2.156    0.034 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared:  0.05302,    Adjusted R-squared:  0.04161 
## F-statistic: 4.647 on 1 and 83 DF,  p-value: 0.03401
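# |r| = sqrt(R^2); the sign is taken from the slope (negative here)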
sqrt(summary(fit1)$r.squared)
## [1] 0.230255

The explanations of the estimates are as follows:

  • Estimate of the intercept (\(b_0\)): the estimated regression constant, i.e. the value of the dependent variable when all explanatory variables (here only Age) take the value zero. In this case, a brand-new apartment (Age = 0) is estimated to cost approximately 2,185 EUR per m2.
  • Estimate of the regression coefficient (\(b_1\)): when Age increases by one year, the estimated price decreases on average by 8.98 EUR per m2 [P-value = 0.03]
  • Coefficient of correlation: \(r \approx -0.23\), the square root of the coefficient of determination with the sign of the slope, indicating a weak negative linear relationship between Age and Price
  • Coefficient of determination: the proportion of the variability of the dependent variable explained by the explanatory variables, or in other words, how well the model fits the data. Here, only 5.3% of the variability in apartment prices is explained by the linear effect of Age. This figure is very low and suggests that more explanatory variables should be included in the analysis.
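To visualize the weak negative relationship described above, the fitted line can be drawn over the data; a minimal sketch:

# Scatterplot of Price against Age with the OLS line from fit1
ggplot(data = apartments_db, aes(x = Age, y = Price)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)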

6. Show the scatterplot matrix between Price, Age and Distance. Based on the matrix, determine if there is a potential problem with multicollinearity. (0.5 p)

library(car)

scatterplot_subset <- apartments_db[,c(3,1,2)]

scatterplotMatrix(scatterplot_subset,
                  smooth = FALSE)

Conclusion: Based on the scatterplot matrix, no multicollinearity is detected among the variables. This is also expected on conceptual grounds: Age and Distance measure different things, so neither explanatory variable should be a (near-)linear combination of the other.
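The visual impression can be backed up with the pairwise correlation matrix of the same three variables (an optional numerical check):

# Pairwise Pearson correlations; a value near +1 or -1 between the
# explanatory variables would signal potential multicollinearity
cor(scatterplot_subset)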


7. Estimate the multiple regression function: Price = f(Age, Distance). Save it in object named fit2.

fit2 <- lm(Price ~ Age + Distance,
           data = apartments_db)

summary(fit2)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = apartments_db)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -603.23 -219.94  -85.68  211.31  689.58 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2460.101     76.632   32.10  < 2e-16 ***
## Age           -7.934      3.225   -2.46    0.016 *  
## Distance     -20.667      2.748   -7.52 6.18e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared:  0.4396, Adjusted R-squared:  0.4259 
## F-statistic: 32.16 on 2 and 82 DF,  p-value: 4.896e-11

8. Check the multicollinearity with VIF statistics. Explain the findings. (0.5 p)

vif(fit2)
##      Age Distance 
## 1.001845 1.001845

Findings: As already suggested by the scatterplot matrix, both VIF values are below 5, so we can conclude that there is no multicollinearity among the variables included in the model. In other words, neither explanatory variable (Age or Distance) is close to being a linear combination of the other.
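The VIF values can also be reproduced by hand from their definition, \(VIF_j = 1/(1-R_j^2)\), where \(R_j^2\) comes from regressing one explanatory variable on the others; a sketch for Age:

# R^2 from regressing Age on Distance, plugged into the VIF formula
r2_age <- summary(lm(Age ~ Distance, data = apartments_db))$r.squared
1 / (1 - r2_age)  # should match vif(fit2) for Age, about 1.002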


9. Calculate standardized residuals and Cook's distances for model fit2. Remove any potentially problematic units (outliers or units with high influence). (1 p)

First, a histogram of the standardized residuals will be plotted to see whether any values fall outside ±3, which we would consider outliers.

apartments_db$SD_Residuals <- round(rstandard(fit2), 3)
hist(apartments_db$SD_Residuals,
     xlab = "Standardized Residuals",
     ylab = "Frequency",
     main = "Histogram of standardized residuals")

Conclusion: As all standardized residuals fall within ±3, no outliers are detected and there is no need to drop any units.

Now we will calculate Cook's distances to check whether there are units with high influence.

apartments_db$CD <- round(cooks.distance(fit2), 3)
hist(apartments_db$CD,
     xlab = "Cook's Distances",
     ylab = "Frequency",
     main = "Histogram of Cook's Distances")

Conclusion: A clear jump can be seen in the Cook's distances above 0.15. Since such high-influence units could distort the model and the conclusions drawn from it, those with a distance above 0.15 will be excluded.

To do this, we first identify the high-influence units, then filter them out, and finally check again to make sure they were removed.

# Identifying the units with the highest Cook's distances
head(apartments_db[order(-apartments_db$CD),],5)
##    Age Distance Price Parking Balcony ParkingFactor BalconyFactor SD_Residuals    CD
## 38   5       45  2180       1       1           Yes           Yes        2.577 0.320
## 55  43       37  1740       0       0            No            No        1.445 0.104
## 33   2       11  2790       1       0           Yes            No        2.051 0.069
## 53   7        2  1760       0       1            No           Yes       -2.152 0.066
## 22  37        3  2540       1       1           Yes           Yes        1.576 0.061

Only the first unit exceeds the cutoff, but it is simplest to exclude everything with a Cook's distance above 0.15.

# Filtering units
library(dplyr)
apartments_db <- apartments_db %>%
  filter(CD < 0.15)
# Checking that the high-influence unit was removed
head(apartments_db[order(-apartments_db$CD),],5)
##    Age Distance Price Parking Balcony ParkingFactor BalconyFactor SD_Residuals    CD
## 54  43       37  1740       0       0            No            No        1.445 0.104
## 33   2       11  2790       1       0           Yes            No        2.051 0.069
## 52   7        2  1760       0       1            No           Yes       -2.152 0.066
## 22  37        3  2540       1       1           Yes           Yes        1.576 0.061
## 38  40        2  2400       0       1            No           Yes        1.091 0.038

The unit with high influence has now been removed. The process will not be repeated: recomputing Cook's distances after each removal would always single out new "most influential" units, trapping us in an endless loop.
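As a side note, the 0.15 cutoff above was chosen visually from the histogram; a common rule of thumb instead flags units with a Cook's distance above 4/n, which here would have flagged more units. A sketch of that alternative (fit2 at this point is still the original, unfiltered fit):

# Rule-of-thumb alternative: flag units with Cook's distance above 4/n
cd_values <- cooks.distance(fit2)
which(cd_values > 4 / length(cd_values))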


10. Check for potential heteroskedasticity with a scatterplot between standardized residuals and standardized fitted values. Explain the findings. (0.5 p)

First, the model has to be refitted on the filtered data: fit2 was estimated on the original 85 observations, so assigning its fitted values to a data frame that now has 84 rows would raise a length-mismatch error, and the estimates themselves should reflect the cleaned data. The standardized residuals are recomputed from the refitted model as well, and the scatterplot is then drawn.

fit2 <- lm(Price ~ Age + Distance,
           data = apartments_db)

# Recompute the standardized residuals from the refitted model and
# standardize its fitted values
apartments_db$SD_Residuals <- round(rstandard(fit2), 3)
apartments_db$SD_Fitted <- scale(fit2$fitted.values)

scatterplot(y = apartments_db$SD_Residuals, x = apartments_db$SD_Fitted,
            xlab = "Standardized Fitted",
            ylab = "Standardized Residuals",
            boxplots = FALSE,
            regLine = FALSE,
            smooth = FALSE)

Findings: Although a concentration of points can be observed on the right-hand side of the graph, this is not sufficient to state that there is an "opening" (funnel) pattern. Since this is only a visual impression, a formal test will be conducted (even though it was not asked for), just to confirm the intuition.

#install.packages("olsrr")
library(olsrr)
ols_test_breusch_pagan(fit2)
## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##               Data                
##  ---------------------------------
##  Response : Price 
##  Variables: fitted values of Price 
## 
##         Test Summary          
##  -----------------------------
##  DF            =    1 
##  Chi2          =    2.927455 
##  Prob > Chi2   =    0.08708469

Indeed, we cannot reject \(H_0\), which means that the variance is constant and homoskedasticity can be assumed.
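The same conclusion can be cross-checked with the Breusch-Pagan test from the lmtest package (assuming it is installed), whose default version tests against the model's explanatory variables rather than its fitted values:

#install.packages("lmtest")
library(lmtest)
bptest(fit2)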


11. Are standardized residuals distributed normally? Show the graph and formally test it. Explain the findings. (0.5 p)

apartments_db$SD_Residuals <- round(rstandard(fit2), 3)
hist(apartments_db$SD_Residuals,
     xlab = "Standardized Residuals",
     ylab = "Frequency",
     main = "Histogram of standardized residuals",
     breaks = seq(from = -3, to = 3, by = 0.20))

shapiro.test(apartments_db$SD_Residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  apartments_db$SD_Residuals
## W = 0.95649, p-value = 0.006355

Findings: The histogram of standardized residuals (with the breaks adjusted) does not look normally distributed, showing more than one mode. The formal Shapiro-Wilk test confirms this: we reject \(H_0\) with a P-value = 0.006, meaning that the errors are not normally distributed. However, since the sample size is larger than 30, the central limit theorem allows us to rely on approximate normality of the estimators, so we can continue with the analysis.
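A normal Q-Q plot of the standardized residuals (an optional companion to the histogram) makes the departure from normality easier to see; a base-R sketch:

# Points should follow the reference line under normality
qqnorm(apartments_db$SD_Residuals)
qqline(apartments_db$SD_Residuals)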


12. Estimate fit2 again without the excluded units and show the summary of the model. Explain all coefficients. (1 p)

The model was already refitted after excluding the high-influence unit (in point 10), so the summary is shown and the coefficients explained directly.

summary(fit2)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = apartments_db)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -604.92 -229.63  -56.49  192.97  599.35 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2456.076     73.931  33.221  < 2e-16 ***
## Age           -6.464      3.159  -2.046    0.044 *  
## Distance     -22.955      2.786  -8.240 2.52e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 276.1 on 81 degrees of freedom
## Multiple R-squared:  0.4838, Adjusted R-squared:  0.4711 
## F-statistic: 37.96 on 2 and 81 DF,  p-value: 2.339e-12
  • Regression constant (\(b_0\)): when all explanatory variables (Age and Distance) equal zero, the estimated price is 2,456 EUR per m2. In other words, a brand-new apartment located right in the city center would cost about 2,456 EUR per m2.
  • Estimate of the regression coefficient (\(b_1\)): when Age increases by one year, the price of the apartment decreases on average by 6.46 EUR per m2, keeping all other variables constant [P-value = 0.04]
  • Estimate of the regression coefficient (\(b_2\)): when Distance from the city center increases by one km, the price of the apartment decreases on average by 22.96 EUR per m2, keeping all other variables constant [P-value < 0.001]
  • Coefficient of determination: 48.38% of the variability in apartment prices is explained by the linear effects of Age and Distance. More than half of the variability remains unexplained, but the fit is much better than in the first model, which lacked Distance.
  • F statistic (test of overall significance): this tests whether the model fits the data better than a model with no explanatory variables. We reject \(H_0\) with a P-value < 0.001, meaning the model is statistically significant and the explanatory variables jointly explain changes in the dependent variable (Price)
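To make the interpretation concrete, the fitted equation can be used for a prediction with hypothetical values, say a 10-year-old apartment 5 km from the center:

# Predicted price per m2 for a hypothetical unit (Age = 10, Distance = 5)
predict(fit2, newdata = data.frame(Age = 10, Distance = 5))
# = 2456.08 - 6.46*10 - 22.96*5, roughly 2277 EUR per m2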

13. Estimate the linear regression function Price = f(Age, Distance, Parking and Balcony). Be careful to correctly include categorical variables. Save the object named fit3. (0.5 p)

fit3 <- lm(Price ~ Age + Distance + ParkingFactor + BalconyFactor,
           data = apartments_db)

14. With the function anova, check if model fit3 fits the data better than model fit2. (0.5 p)

anova(fit2, fit3)
## Analysis of Variance Table
## 
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + ParkingFactor + BalconyFactor
##   Res.Df     RSS Df Sum of Sq      F  Pr(>F)  
## 1     81 6176767                              
## 2     79 5654480  2    522287 3.6485 0.03051 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Conclusion: We reject \(H_0\) with a P-value = 0.03, meaning that the difference between the models is statistically significant and we should choose fit3 over fit2.
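The partial F statistic in the table can be reproduced by hand from the two residual sums of squares, which makes explicit what anova() is testing; a sketch using the figures above:

# F = ((RSS_fit2 - RSS_fit3) / extra df) / (RSS_fit3 / residual df)
rss2 <- 6176767
rss3 <- 5654480
f_stat <- ((rss2 - rss3) / 2) / (rss3 / 79)
f_stat                                             # about 3.65, as in the table
pf(f_stat, df1 = 2, df2 = 79, lower.tail = FALSE)  # about 0.03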


15. Show the results of fit3 and explain regression coefficients for both categorical variables. Can you write down the hypothesis which is being tested with F-statistics, shown at the bottom of the output? (1 p)

summary(fit3)
## 
## Call:
## lm(formula = Price ~ Age + Distance + ParkingFactor + BalconyFactor, 
##     data = apartments_db)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -473.21 -192.37  -28.89  204.17  558.77 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      2329.724     93.066  25.033  < 2e-16 ***
## Age                -5.821      3.074  -1.894  0.06190 .  
## Distance          -20.279      2.886  -7.026 6.66e-10 ***
## ParkingFactorYes  167.531     62.864   2.665  0.00933 ** 
## BalconyFactorYes  -15.207     59.201  -0.257  0.79795    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 267.5 on 79 degrees of freedom
## Multiple R-squared:  0.5275, Adjusted R-squared:  0.5035 
## F-statistic: 22.04 on 4 and 79 DF,  p-value: 3.018e-12
  • Regression coefficient (\(b_3\)): holding all other variables constant, apartments with a parking space cost on average 167.53 EUR per m2 more than those without one [P-value = 0.01]
  • Regression coefficient (\(b_4\)): holding all other variables constant, apartments with a balcony cost on average 15.21 EUR per m2 less than those without one [P-value = 0.8]. Given this high P-value, \(H_0\) cannot be rejected, so the coefficient is assumed to be zero: the presence of a balcony has no statistically significant effect on the price
  • F-Statistic Hypothesis: The F statistic (the test of overall significance) tests whether the model as a whole is significant. More specifically, it checks whether any of the explanatory variables (Age, Distance, Parking or Balcony) actually explain the variation in apartment prices.
    • The null hypothesis states that all population regression coefficients are zero (\(\beta_1=\beta_2=\dots=\beta_k=0\)), which is equivalent to saying that rho squared is zero (\(\rho^2=0\)): none of the explanatory variables has any effect on the dependent variable. The alternative hypothesis is that at least one coefficient differs from zero, i.e. \(\rho^2>0\)
    • In this case, we reject the null hypothesis with a P-value < 0.001, meaning that the model as a whole is statistically significant and at least one of the variables is useful in predicting the price.
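The F statistic in the summary can likewise be recovered from the coefficient of determination via \(F = \frac{R^2/k}{(1-R^2)/(n-k-1)}\); a sketch with the figures above (n = 84 units, k = 4 explanatory variables):

# Overall F statistic computed from R^2 (figures from summary(fit3))
r2 <- 0.5275
k <- 4
n <- 84
(r2 / k) / ((1 - r2) / (n - k - 1))  # about 22, matching the output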

16. Save fitted values and calculate the residual for apartment ID2. (0.5 p)

# Saving fitted values (on the original scale, not standardized)
apartments_db$Fitted3 <- fit3$fitted.values

# Residual = observed price minus fitted price
apartments_db$Residual <- apartments_db$Price - apartments_db$Fitted3

apartments_db$ID <- seq(1, nrow(apartments_db))
apartments_db <- apartments_db %>% relocate(ID)

apartments_db[apartments_db$ID == 2,]$Residual

The residual for the apartment with ID 2 is its observed price minus the price fitted by fit3, in EUR per m2. Note that the fitted values must be kept on their original scale: standardizing them with scale() before subtracting would produce a meaningless "residual" on the order of the price itself.