HW 3

Leong Jia Qian

Student ID: h12443762

1. Import the dataset Apartments.xlsx

library(readxl)
aptdata <- read_xlsx("./Apartments.xlsx")

head(aptdata, 5)

## # A tibble: 5 × 5
##     Age Distance Price Parking Balcony
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl>
## 1     7       28  1640       0       1
## 2    18        1  2800       1       0
## 3     7       28  1660       0       0
## 4    28       29  1850       0       1
## 5    18       18  1640       1       1

Description:

Age: Age of an apartment in years
Distance: The distance from city center in km
Price: Price per m2
Parking: 0-No, 1-Yes
Balcony: 0-No, 1-Yes

2. What could be a possible research question given the data you analyze? (1 p)

Given the dataset above, a possible research question we could look into is to find out whether the proximity to the city center in Vienna (“distance”) have an impact on apartment prices.

3. Change categorical variables into factors. (0.5 p)

# Categorical variables include "Parking" and "Balcony"

aptdata$ParkingF <- factor(aptdata$Parking,
                           levels = c(1, 0),
                           labels = c("Yes", "No")) # Found that "Yes" has more units which should be the reference point

aptdata$BalconyF <- factor(aptdata$Balcony,
                           levels = c(0, 1),
                           labels = c("No", "Yes"))

summary(aptdata[c(-4, -5)])

##       Age           Distance         Price      ParkingF BalconyF
##  Min.   : 1.00   Min.   : 1.00   Min.   :1400   Yes:43   No :48  
##  1st Qu.:12.00   1st Qu.: 4.00   1st Qu.:1710   No :42   Yes:37  
##  Median :18.00   Median :12.00   Median :1950                    
##  Mean   :18.55   Mean   :14.22   Mean   :2019                    
##  3rd Qu.:24.00   3rd Qu.:20.00   3rd Qu.:2290                    
##  Max.   :45.00   Max.   :45.00   Max.   :2820

4. Test the hypothesis H0: Mu_Price = 1900 eur. What can you conclude? (1 p)

Defining hypotheses:

H0: Mu_Price equals to (=) 1900 Eur
H1: Mu_Price not equals to (=/) 1900 Eur

# Description summary of data 
library(psych)
describe(aptdata$Price)

##    vars  n    mean     sd median trimmed    mad  min  max range skew kurtosis    se
## X1    1 85 2018.94 377.84   1950 1990.29 429.95 1400 2820  1420 0.54    -0.69 40.98

Explanation of some parameters

Mean: The price per m2 for the sample of apartments in the Vienna region cost an average of 2019 EUR.
Median: 50% of the sample of apartments are priced at 1995 EUR per m2 or below, while the remaining 50% are priced above 1995 EUR per m2.
Interquartile range (Q3-Q1): The middle 50% of the sample of apartments are priced between 1710 to 2290 EUR.

# Normality testing
# Using graphic visualisation (Histogram)
library(ggplot2)

## 
## Attaching package: 'ggplot2'

## The following objects are masked from 'package:psych':
## 
##     %+%, alpha

ggplot(aptdata, aes(x = Price)) +
  geom_histogram(binwidth = 100, color = "black") +
  xlab("Price (in EUR)") +
  ylab("No. of Apartments")

# Using Shapiro Test
shapiro.test(aptdata$Price)

## 
##  Shapiro-Wilk normality test
## 
## data:  aptdata$Price
## W = 0.94017, p-value = 0.0006513

Assumption on normality

Looking at the histogram, the distribution appears somewhat asymmetric and slightly right-skewed, which may indicate that there are more apartments in the sample priced in the lower range (at around 1600 EUR) while only a few are priced at the higher end (above 2500 EUR). While this histogram may have shown that normality is not met to a certain extent, a Shapiro-Wilk test would provide a more accurate assessment.

Using the Shapiro test, we reject null hypothesis (H0) that the data is normally distributed as p-value is <0.0001. Hence, we can formally conclude that the data did not meet the normality requirement to continue with T-test. A non-parametric testing would be more appropriate.

Wilcoxon Signed Rank Test

As Price did not meet the normality requirement, non-parametric testing shall be utilised to determine whether Mu_Price = 1900 EUR

median(aptdata$Price)

## [1] 1950

wilcox.test(aptdata$Price,
            mu = 1900,
            correct = FALSE)

## 
##  Wilcoxon signed rank test
## 
## data:  aptdata$Price
## V = 2328, p-value = 0.02828
## alternative hypothesis: true location is not equal to 1900

library(effectsize)

## 
## Attaching package: 'effectsize'

## The following object is masked from 'package:psych':
## 
##     phi

effectsize(wilcox.test(aptdata$Price,
            mu = 1900,
            correct = FALSE))

## r (rank biserial) |       95% CI
## --------------------------------
## 0.27              | [0.04, 0.48]
## 
## - Deviation from a difference of 1900.

interpret_rank_biserial(0.27, rules = "funder2019")

## [1] "medium"
## (Rules: funder2019)

Conclusion:

Using the sample data of apartments through the Wilcoson Signed Rank Test, we reject null hypothesis (Mu = 1900) as the results show that the median price of apartments in the Vienna region has a moderate significant difference from 1900 EUR (p < 0.05, r = 0.27 - medium effect). Median is found to be at 1950 EUR per m2, which is not very far from 1900 EUR in our hypothesis.

5. Estimate the simple regression function: Price = f(Age). Save results in object fit1 and explain the estimate of regression coefficient, coefficient of correlation and coefficient of determination. (1 p)

# Graphical visualisation (Scatterplot)
library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:psych':
## 
##     logit

scatterplot(aptdata$Price ~ aptdata$Age,
            smooth = FALSE,
            boxplots = FALSE,
            ylab = "Price (in EUR)",
            xlab = "Age of Apartment (in years)")

Observations

From the scatterplot, we can see that there is a negative correlation relationship between “Price” and “Age”. This suggests that the older the apartment is, the price of apartment would relatively be lower.

# Regression model (1)
fit1 <- lm(Price ~ Age,
           data = aptdata)

summary(fit1)

## 
## Call:
## lm(formula = Price ~ Age, data = aptdata)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -623.9 -278.0  -69.8  243.5  776.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2185.455     87.043  25.108   <2e-16 ***
## Age           -8.975      4.164  -2.156    0.034 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared:  0.05302,    Adjusted R-squared:  0.04161 
## F-statistic: 4.647 on 1 and 83 DF,  p-value: 0.03401

# Coefficient of correlation
sqrt(summary(fit1)$r.squared)

## [1] 0.230255

Regression Results

Estimate of regression coefficient:

(Intercept): The price of a brand-new apartment in the Vienna region is estimated to be priced at 2185.46 EUR per m2.
Age: When the age of an apartment increases by 1 year, its price per m2 would decrease 8.98 EUR on average.

Coefficient of determination (r2)

5.302% of the variability of price per m2 in apartments located in the Vienna region can be explained by linear effect of age.

Coefficient of correlation [SQRT(r2)]

The coefficient of correlation is calculated to be at 0.23. This suggests that there is a moderate linear relationship between price per m2 and the age of the apartment.

6. Show the scatterplot matrix between Price, Age and Distance. Based on the matrix determine if there is potential problem with multicolinearity. (0.5 p)

# By scatterplot matrix
library(car)
scatterplotMatrix(aptdata[c("Price", "Age", "Distance")])

Analysis of scatterplot matrix

From the above scatterplot matrix between the dependent variable (Price) and its explanatory variables (Age, Distance), we can see that there is a negative linear relationship between Price and Age, as well as between Price and Distance. This suggests that price per m2 decreases if the age of the apartment is older and if it is located further away from the city center, assuming all other explanatory variables remain constant.

# By correlation matrix (in table form)
library(Hmisc)

## 
## Attaching package: 'Hmisc'

## The following object is masked from 'package:psych':
## 
##     describe

## The following objects are masked from 'package:base':
## 
##     format.pval, units

rcorr(as.matrix(aptdata[c("Price", "Age", "Distance")]))

##          Price   Age Distance
## Price     1.00 -0.23    -0.63
## Age      -0.23  1.00     0.04
## Distance -0.63  0.04     1.00
## 
## n= 85 
## 
## 
## P
##          Price  Age    Distance
## Price           0.0340 0.0000  
## Age      0.0340        0.6966  
## Distance 0.0000 0.6966

Analysis of the correlation matrix

As seen from the correlation matrix above, there is strong negative linear relationship between Price (dependent variable) and its explanatory variables (Age and Distance) as they have a Pearson correlation coefficient of -0.23 and -0.63 respectively.

Conclusion on multicollinearity

From the scatterplot matrix, it can be seen that there is no strong relationship between the explanatory variables (Age and Distance). This can also be supported from the correlation matrix which indicates a Pearson correlation coefficient of 0.04 between Age and Distance. Hence, we can conclude that there is no potential problem of multicolinearity.

7. Estimate the multiple regression function: Price = f(Age, Distance). Save it in object named fit2.

# Regression model (2)
fit2 <- lm(Price ~ Age + Distance,
           data = aptdata)

summary(fit2)

## 
## Call:
## lm(formula = Price ~ Age + Distance, data = aptdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -603.23 -219.94  -85.68  211.31  689.58 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2460.101     76.632   32.10  < 2e-16 ***
## Age           -7.934      3.225   -2.46    0.016 *  
## Distance     -20.667      2.748   -7.52 6.18e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared:  0.4396, Adjusted R-squared:  0.4259 
## F-statistic: 32.16 on 2 and 82 DF,  p-value: 4.896e-11

Regression Results

Estimate of regression coefficient:

(Intercept): The price of a brand-new apartment in the Vienna region is estimated to be priced at 2460.10 EUR per m2.
Age: When the age of an apartment increases by 1 year, its price per m2 would decrease 7.93 EUR on average.
Distance: When the distance from city center increases by 1km, the price per m2 of the apartment would decrease 20.67 EUR on average.

Coefficient of determination (r2)

43.96% of the variability of price per m2 in apartments located in the Vienna region can be explained by linear effect of age and distance.

8. Chech the multicolinearity with VIF statistics. Explain the findings. (0.5 p)

vif(fit2)

##      Age Distance 
## 1.001845 1.001845

The VIF statistics calculated above to check for the strength of correlation between the explanatory variables of Age and Distance falls well below the threshold of 5. It can be concluded that there is no problem with too strong multicolinearity between Age and Distance.

9. Calculate standardized residuals and Cooks Distances for model fit2. Remove any potentially problematic units (outliers or units with high influence). (1 p)

aptdata$StdResid <- round(rstandard(fit2), 3)
aptdata$CooksD <- round(cooks.distance(fit2), 3)

aptdata$ID <- seq(1, nrow(aptdata))
head(aptdata[order(aptdata$StdResid), c("ID", "StdResid")], 5)

## # A tibble: 5 × 2
##      ID StdResid
##   <int>    <dbl>
## 1    53    -2.15
## 2    13    -1.50
## 3    72    -1.50
## 4    20    -1.38
## 5    35    -1.26

head(aptdata[order(-aptdata$StdResid), c("ID", "StdResid")], 5)

## # A tibble: 5 × 2
##      ID StdResid
##   <int>    <dbl>
## 1    38     2.58
## 2    33     2.05
## 3     2     1.78
## 4    61     1.78
## 5    58     1.66

Test for outliers

In determining for outliers, a range between -3 and +3 is normally used. Based on the above observations, all the values fall well between the range. Hence, there is no outliers in our data.

hist(aptdata$CooksD,
     xlab = "Cooks Distance",
     ylab = "Frequency",
     main = "Histogram of Cooks Distance")

head(aptdata[order(-aptdata$CooksD), c("ID", "CooksD"), 8])

## # A tibble: 6 × 2
##      ID CooksD
##   <int>  <dbl>
## 1    38  0.32 
## 2    55  0.104
## 3    33  0.069
## 4    53  0.066
## 5    22  0.061
## 6    39  0.038

Test for units with high influence

From our test above, it can be seen that the unit with the largest impact that is significantly different from the rest is unit ID 38, scoring a Cook’s distance of 0.320. In addition, unit ID 55 seems to also have a slightly higher value as compared to the rest of the units. Hence, both unit ID 38 and 55 are to be removed from the data as shown below.

# To remove the units with high influence (i.e. CooksD > 0.3)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:Hmisc':
## 
##     src, summarize

## The following object is masked from 'package:car':
## 
##     recode

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

aptdata <- aptdata %>%
  filter(!ID %in% c(38, 55))

10. Check for potential heteroskedasticity with scatterplot between standardized residuals and standardized fitted values. Explain the findings. (0.5 p)

# Re-estimation of regression model (2)
fit2 <- lm(Price ~ Age + Distance,
           data = aptdata)

# Check for potential heteroskedasticity
aptdata$StdFitted <- scale(fit2$fitted.values)
aptdata$StdResid <- round(rstandard(fit2), 3)

library(car)
scatterplot(y = aptdata$StdResid, x = aptdata$StdFitted,
            ylab = "Standardised Residuals",
            xlab = "Standardised Fitted Values",
            boxplots = FALSE,
            regLine = FALSE,
            smooth = FALSE)

Observations

Based on the scatterplot graph between standardised residuals and standardised fitted values, the variability seems to slightly increase as we look from the negative values of standardised fitted values towards the positive values. Additionally, it can also be seen that there are more units distributed on the positive side of standardised fitted values. Hence, this could suggest that potential heteroskedasticity may be present in the dataset.

# Formal testing of heteroskedasticity
library(olsrr)

## 
## Attaching package: 'olsrr'

## The following object is masked from 'package:datasets':
## 
##     rivers

ols_test_breusch_pagan(fit2)

## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##               Data                
##  ---------------------------------
##  Response : Price 
##  Variables: fitted values of Price 
## 
##         Test Summary          
##  -----------------------------
##  DF            =    1 
##  Chi2          =    3.775135 
##  Prob > Chi2   =    0.05201969

Breusch Pagan Test - Findings

From the results above, we cannot reject the null hypothesis that the variance of errors is constant as p-value is above 0.05. Hence, we can formally verify that there is homoskedasticity in the variance of errors.

11. Are standardized residuals distributed normally? Show the graph and formally test it. Explain the findings. (0.5 p)

# Graphic Visualisation
hist(aptdata$StdResid,
     xlab = "Standardised residuals",
     ylab = "Frequency",
     main = "Histogram of standardised residuals")

# Using Shapiro Test
shapiro.test(aptdata$StdResid)

## 
##  Shapiro-Wilk normality test
## 
## data:  aptdata$StdResid
## W = 0.95952, p-value = 0.01044

Findings on normal distribution of standardised residuals

Based on the distribution of the standardised residuals through the histogram graph, we can see that the residuals appears to be somewhat asymmetrical and bimodal (having 2 peaks). Through the formal Shapiro-Wilks normality test, its p-value is calculated to be lesser than 0.05. Hence, we have to reject the null hypothesis that errors are normally distributed, which means the errors in the population are most likely not normally distributed.

Fortunately, our sample size is large enough which includes 85 apartments (including the units of high influence that we have removed earlier on) so the assumption of normal distribution of errors is not as important.

12. Estimate the fit2 again without potentially excluded units and show the summary of the model. Explain all coefficients. (1 p)

fit2 <- lm(Price ~ Age + Distance,
           data = aptdata)

summary(fit2)

## 
## Call:
## lm(formula = Price ~ Age + Distance, data = aptdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -627.27 -212.96  -46.23  205.05  578.98 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2490.112     76.189  32.684  < 2e-16 ***
## Age           -7.850      3.244  -2.420   0.0178 *  
## Distance     -23.945      2.826  -8.473 9.53e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 273.5 on 80 degrees of freedom
## Multiple R-squared:  0.4968, Adjusted R-squared:  0.4842 
## F-statistic: 39.49 on 2 and 80 DF,  p-value: 1.173e-12

# Coefficient of correlation
sqrt(summary(fit2)$r.squared)

## [1] 0.7048456

Explanation of regression results

Estimate of regression coefficient:

(Intercept): The price of a brand-new apartment in the Vienna region is estimated to be priced at 2490.11 EUR per m2.
Age: When the age of an apartment increases by 1 year, the price per m2 of the apartment would decrease 7.85 EUR on average.
Distance: When the distance from city center increases by 1km, the price per m2 of the apartment would decrease 23.95 EUR on average.

Coefficient of determination (r2)

49.68% of the variability of price per m2 in apartments located in the Vienna region can be explained by linear effect of age and distance.

Coefficient of correlation [SQRT(r2)]

The coefficient of correlation is calculated to be at 0.705. This suggests that there is a very strong linear relationship between price per m2, age of the apartment and distance from city center.

13. Estimate the linear regression function Price = f(Age, Distance, Parking and Balcony). Be careful to correctly include categorical variables. Save the object named fit3. (0.5 p)

fit3 <- lm(Price ~ Age + Distance + ParkingF + BalconyF,
           data = aptdata)

14. With function anova check if model fit3 fits data better than model fit2. (0.5 p)

anova(fit2, fit3)

## Analysis of Variance Table
## 
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + ParkingF + BalconyF
##   Res.Df     RSS Df Sum of Sq      F  Pr(>F)  
## 1     80 5982100                              
## 2     78 5458696  2    523404 3.7395 0.02813 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Conclusion

Using the ANOVA function to determine which model (model fit2 or fit3) fits the data better, the results above show that model fit3 fits the data better (p = 0.03) as compared to model fit2.

15. Show the results of fit3 and explain regression coefficient for both categorical variables. Can you write down the hypothesis which is being tested with F-statistics, shown at the bottom of the output? (1 p)

summary(fit3)

## 
## Call:
## lm(formula = Price ~ Age + Distance + ParkingF + BalconyF, data = aptdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -499.06 -194.33  -32.04  219.03  544.31 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2527.821     79.105  31.955  < 2e-16 ***
## Age           -7.197      3.148  -2.286  0.02499 *  
## Distance     -21.241      2.911  -7.296 2.14e-10 ***
## ParkingFNo  -168.921     62.166  -2.717  0.00811 ** 
## BalconyFYes   -6.985     58.745  -0.119  0.90566    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 264.5 on 78 degrees of freedom
## Multiple R-squared:  0.5408, Adjusted R-squared:  0.5173 
## F-statistic: 22.97 on 4 and 78 DF,  p-value: 1.449e-12

Explanation of coefficients for categorical variables in regression model 3

ParkingFNo: Given the values of the other explanatory variables, the group of apartments without a parking space have an average price per m2 that is 168.92 EUR less than the group of apartments with a parking space (p < 0.01)
BalconyFYes: Based on the regression results above, we did not find a significant difference between the average price per m2 of 2 equal apartments with exception that one apartment with balcony and the other without balcony (p = 0.906)

Hypothesis tested with F-statistics

F-statistics is to test whether our linear regression model that we are working on fits the data better than a model that contains no explanatory variables.

Hypotheses in F-statistics:

H0: The population coefficient of determination (p2) is equal to 0 (p2 = 0)
H1: The population coefficient of determination is greater than 0

From the regression result above, we reject null hypothesis at p < 0.001. This suggests that population determination coefficient is greater than 0, where at least part of the variability of the dependent variable can be explained by the linear effect of the explantory variable. Hence, we can conclue that the overall regression model is significant.

16. Save fitted values and calculate the residual for apartment ID2. (0.5 p)

aptdata$Fitted <- fitted.values(fit3)
aptdata$Residuals <- residuals(fit3)

head(aptdata[colnames(aptdata) %in% c("ID", "Price", "Fitted", "Residuals")], 3)

## # A tibble: 3 × 4
##   Price    ID Fitted Residuals
##   <dbl> <int>  <dbl>     <dbl>
## 1  1640     1  1707.     -66.8
## 2  2800     2  2377.     423. 
## 3  1660     3  1714.     -53.8

Based on the results of model fit3, the expected price per m2 for the apartment ID2 would be 2377 EUR. However, its actual price is listed higher than expected at 2800 EUR per m2. With these 2 values, the difference between the actual and estimated price per m2 (residual) for apartment ID2 is 423 EUR