Vladyslava Kovalenko

1. Import the dataset Apartments.xlsx

library(readxl)
mydata <- read_xlsx("./Apartments.xlsx")
mydata <- as.data.frame(mydata)
head(mydata)
##   Age Distance Price Parking Balcony
## 1   7       28  1640       0       1
## 2  18        1  2800       1       0
## 3   7       28  1660       0       0
## 4  28       29  1850       0       1
## 5  18       18  1640       1       1
## 6  28       12  1770       0       1

Description:

  • Age: Age of an apartment in years
  • Distance: The distance from city center in km
  • Price: Price per m2
  • Parking: 0-No, 1-Yes
  • Balcony: 0-No, 1-Yes

2. Change categorical variables into factors. (0.5 p)

mydata$Parking <- factor(mydata$Parking, 
                             levels = c(0, 1), 
                             labels = c("No", "Yes"))
mydata$Balcony <- factor(mydata$Balcony, 
                             levels = c(0, 1), 
                             labels = c("No", "Yes"))
head(mydata)
##   Age Distance Price Parking Balcony
## 1   7       28  1640      No     Yes
## 2  18        1  2800     Yes      No
## 3   7       28  1660      No      No
## 4  28       29  1850      No     Yes
## 5  18       18  1640     Yes     Yes
## 6  28       12  1770      No     Yes
summary(mydata)
##       Age           Distance         Price      Parking  Balcony 
##  Min.   : 1.00   Min.   : 1.00   Min.   :1400   No :42   No :48  
##  1st Qu.:12.00   1st Qu.: 4.00   1st Qu.:1710   Yes:43   Yes:37  
##  Median :18.00   Median :12.00   Median :1950                    
##  Mean   :18.55   Mean   :14.22   Mean   :2019                    
##  3rd Qu.:24.00   3rd Qu.:20.00   3rd Qu.:2290                    
##  Max.   :45.00   Max.   :45.00   Max.   :2820

3. Test the hypothesis H0: Mu_Price = 1900 eur. What can you conclude? (1 p)

t.test (mydata$Price,
        mu = 1900,
        alternative = "two.sided")
## 
##  One Sample t-test
## 
## data:  mydata$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
##  1937.443 2100.440
## sample estimates:
## mean of x 
##  2018.941

H0: Mu_Price = 1900 eur

H1: Mu_Price ≠ 1900 eur

We reject H0 (p-value = 0.0047 < 0.005).

This means there is statistically significant evidence that the average price per m2 differs from 1900 EUR. The sample mean (2018.941 EUR) lies above 1900 EUR, and the 95% confidence interval (1937.443 to 2100.440 EUR) does not contain 1900.
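
As a cross-check, the p-value and the confidence interval can be recomputed from the numbers printed by t.test(). This is only a sketch: the variable names below are illustrative, and the standard error is backed out from the rounded t statistic, so small rounding differences are expected.

```r
# Values copied from the t.test() output above
x_bar  <- 2018.941   # sample mean
mu_0   <- 1900       # hypothesised mean under H0
t_stat <- 2.9022     # reported t statistic
df     <- 84         # degrees of freedom (n - 1)

# Two-sided p-value: probability of a |t| at least this large under H0
p_value <- 2 * pt(-abs(t_stat), df)

# Standard error implied by the reported t statistic
se <- (x_bar - mu_0) / t_stat

# 95% confidence interval for the mean
ci <- x_bar + c(-1, 1) * qt(0.975, df) * se

print(p_value)
print(ci)
```

Both results should agree with the t.test() output up to rounding.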

Source(t.test): https://www.denis-statistika.si/_files/ugd/5f6427_497986749e164071812524365d91dd7c.pdf, p.7.

4. Estimate the simple regression function: Price = f(Age). Save results in object fit1 and explain the estimate of regression coefficient, coefficient of correlation and coefficient of determination. (2 p)

fit1 <- lm(Price ~ Age, 
          data = mydata)
summary(fit1)
## 
## Call:
## lm(formula = Price ~ Age, data = mydata)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -623.9 -278.0  -69.8  243.5  776.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2185.455     87.043  25.108   <2e-16 ***
## Age           -8.975      4.164  -2.156    0.034 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared:  0.05302,    Adjusted R-squared:  0.04161 
## F-statistic: 4.647 on 1 and 83 DF,  p-value: 0.03401
cor_coef <- sqrt(summary(fit1)$r.squared)
cor_coef
## [1] 0.230255

Price per m2 = 2185.455 + (-8.975)*Age

Intercept (2185.455): This is the estimated price when the age of the apartment is zero. It provides a baseline price of 2185.455 EUR for new apartments (age = 0 years).

Age (-8.975): This coefficient shows the average change in the price per m2 for each additional year of age. The negative value indicates that, on average, the price per m2 decreases by about 8.975 EUR for every additional year of age.

H0: β1 = 0 (Age has no effect on price per m2)

H1: β1 ≠ 0 (Age has an effect on price per m2)

We reject H0 (p-value = 0.034 < 0.04), so the effect of Age is statistically significant.

Multiple R-squared (0.05302): This value tells us that about 5.302% of the variability in apartment prices can be explained by the linear effect of the age of the apartment. This is a relatively low R-squared value, suggesting that the model does not explain a large proportion of the variation in prices.

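Correlation coefficient (-0.230255): The square root of R-squared (0.230255) gives only the magnitude of the correlation. In a simple regression the sign of the correlation equals the sign of the slope, which is negative here, so r = -0.230. According to the usual interpretation of the Pearson correlation coefficient, this is a weak negative linear relationship between Age and Price.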
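
The link between the slope, R-squared and the Pearson correlation can be illustrated on simulated data. The sketch below does not use Apartments.xlsx; the variables age and price are synthetic and their parameters are arbitrary assumptions.

```r
# Synthetic data with a negative true slope (not the apartment data)
set.seed(1)
age   <- runif(50, min = 1, max = 45)
price <- 2200 - 9 * age + rnorm(50, sd = 100)

fit <- lm(price ~ age)

# sqrt(R^2) gives the magnitude; the slope supplies the sign
r_from_r2 <- sign(coef(fit)[["age"]]) * sqrt(summary(fit)$r.squared)

# Pearson correlation computed directly
r_direct <- cor(price, age)

print(c(r_from_r2 = r_from_r2, r_direct = r_direct))
```

The two numbers coincide, which is why cor(mydata$Price, mydata$Age) would be an alternative way to obtain the signed coefficient in one step.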

5. Show the scatterplot matrix between Price, Age and Distance. Based on the matrix determine if there is a potential problem with multicollinearity. (0.5 p)

library(car)
## Loading required package: carData
scatterplotMatrix(mydata[ ,c(-4, -5)], 
                  smooth = FALSE) 

Age and Distance: The scatterplot between “Age” and “Distance” does not show a strong linear relationship.

Age and Price: The plot between “Age” and “Price” indicates a very mild downward trend. However, the relationship does not appear very strong.

Distance and Price: The relationship between “Distance” and “Price” shows a more pronounced negative trend, indicating that prices tend to decrease as the distance from the city center increases.

Multicollinearity concerns the relationship between the explanatory variables. Because “Age” and “Distance” show no strong linear relationship, the matrix does not point to a potential problem with multicollinearity.

6. Estimate the multiple regression function: Price = f(Age, Distance). Save it in object named fit2.

fit2 <- lm(Price ~ Age + Distance, 
           data = mydata)

summary(fit2)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -603.23 -219.94  -85.68  211.31  689.58 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2460.101     76.632   32.10  < 2e-16 ***
## Age           -7.934      3.225   -2.46    0.016 *  
## Distance     -20.667      2.748   -7.52 6.18e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared:  0.4396, Adjusted R-squared:  0.4259 
## F-statistic: 32.16 on 2 and 82 DF,  p-value: 4.896e-11

7. Check the multicollinearity with VIF statistics. Explain the findings. (1 p)

vif(fit2)
##      Age Distance 
## 1.001845 1.001845

Both VIF values are about 1.002, far below the common rule-of-thumb threshold of 5. A VIF close to 1 indicates that “Age” and “Distance” are almost uncorrelated, so multicollinearity is not a problem in fit2.
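
With only two explanatory variables, the VIF equals 1 / (1 - r^2), where r is the correlation between them. The sketch below shows this on synthetic data (the vectors age and distance are simulated, not taken from Apartments.xlsx) and then backs out the correlation implied by the reported VIF.

```r
# Synthetic, independently drawn predictors (not the apartment data)
set.seed(42)
age      <- runif(85, min = 1, max = 45)
distance <- runif(85, min = 1, max = 45)

# Auxiliary regression of one predictor on the other
r2_aux <- summary(lm(age ~ distance))$r.squared

# VIF computed by hand from the auxiliary R-squared
vif_manual <- 1 / (1 - r2_aux)
print(vif_manual)

# Correlation between Age and Distance implied by the reported VIF
reported_vif  <- 1.001845
implied_abs_r <- sqrt(1 - 1 / reported_vif)
print(implied_abs_r)
```

The implied absolute correlation is only about 0.04, which matches the lack of a visible pattern in the scatterplot matrix.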

8. Calculate standardized residuals and Cooks Distances for model fit2. Remove any potentially problematic units (outliers or units with high influence). (2 p)

mydata$StdResid <- round(rstandard(fit2), 3) 
mydata$CooksD <- round(cooks.distance(fit2), 3) 

hist(mydata$StdResid, 
     xlab = "Standardized residuals", 
     ylab = "Frequency", 
     main = "Histogram of standardized residuals")

Standardized residuals greater than 3 or less than -3 are considered potential outliers. There are no such values.

hist(mydata$CooksD, 
     xlab = "Cooks distance", 
     ylab = "Frequency", 
     main = "Histogram of Cooks distances")

The histogram shows a gap between 0.15 and 0.30, so units with a Cook's distance between 0.30 and 0.35 stand apart from the rest and should be removed.

head(mydata[order(-mydata$CooksD),], 5) 
##    Age Distance Price Parking Balcony StdResid CooksD
## 38   5       45  2180     Yes     Yes    2.577  0.320
## 55  43       37  1740      No      No    1.445  0.104
## 33   2       11  2790     Yes      No    2.051  0.069
## 53   7        2  1760      No     Yes   -2.152  0.066
## 22  37        3  2540     Yes     Yes    1.576  0.061

The unit to remove is apartment 38 (Cook's distance 0.320), which has a price of 2180 EUR per m2.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
## 
##     recode
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
mydata <- mydata %>%
  filter(Price != 2180) # drops apartment 38, the only unit with this price
fit2 <- lm(Price ~ Age + Distance, 
           data = mydata) # refit on the remaining 84 observations
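
Filtering on a price value works here because only one apartment has that price, but a unit can also be dropped by its position, and the diagnostics can then be recomputed so that they describe the refitted model. The following is a minimal sketch on synthetic data with one planted influential point; the data frame d and all parameter values are assumptions made for illustration.

```r
# Synthetic data (not the apartment data) with one planted influential unit
set.seed(7)
d <- data.frame(Age = runif(40, 1, 45), Distance = runif(40, 1, 30))
d$Price <- 2400 - 7 * d$Age - 20 * d$Distance + rnorm(40, sd = 100)
d <- rbind(d, data.frame(Age = 5, Distance = 90, Price = 3500))

fit <- lm(Price ~ Age + Distance, data = d)

# Identify the unit with the largest Cook's distance by position
worst <- unname(which.max(cooks.distance(fit)))
print(worst)

# Drop that row and refit
d_clean   <- d[-worst, ]
fit_clean <- lm(Price ~ Age + Distance, data = d_clean)

# Recompute diagnostics so they belong to the refitted model
d_clean$StdResid <- round(rstandard(fit_clean), 3)
d_clean$CooksD   <- round(cooks.distance(fit_clean), 3)
```

Diagnostics stored before a refit describe the earlier model, so recomputing them keeps later plots and tests aligned with the model being interpreted.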

9. Check for potential heteroskedasticity with a scatterplot between standardized residuals and standardized fitted values. Explain the findings. (1 p)

mydata$StdFitted <- scale(fit2$fitted.values)

library(car)
scatterplot(y = mydata$StdResid, x = mydata$StdFitted,
            ylab = "Standardized residuals",
            xlab = "Standardized fitted values",
            boxplots = FALSE,
            regLine = FALSE,
            smooth = FALSE)

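The spread of the standardized residuals looks roughly constant across the standardized fitted values, so the graph suggests homoskedasticity.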

#install.packages("olsrr")
library(olsrr)
## 
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
## 
##     rivers
ols_test_breusch_pagan(fit2)
## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##               Data                
##  ---------------------------------
##  Response : Price 
##  Variables: fitted values of Price 
## 
##         Test Summary          
##  -----------------------------
##  DF            =    1 
##  Chi2          =    2.927455 
##  Prob > Chi2   =    0.08708469

Prob > Chi2 = 0.08708469 is the p-value. Because it is above 0.05, we do not reject H0 of constant variance at the 5% level, so the test gives no evidence of heteroskedasticity.
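
The reported p-value follows directly from the chi-square statistic and its degrees of freedom. A short sketch using the numbers from the output above (the variable names are illustrative):

```r
# Values copied from the Breusch-Pagan output above
chi2 <- 2.927455
df   <- 1

# Upper-tail probability of the chi-square distribution
p_value <- pchisq(chi2, df = df, lower.tail = FALSE)
print(p_value)

# Decision at the 5% significance level
reject_h0 <- p_value < 0.05
print(reject_h0)
```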

10. Are standardized residuals distributed normally? Show the graph and formally test it. Explain the findings. (1 p)

hist(mydata$StdResid,
     main = "Distribution of standardized residuals",
     xlab = "Standardized residuals", # the x-axis shows the residual values
     ylab = "Frequency",              # the y-axis shows the counts
     breaks = seq(from = -5, to = 5, by = 1))

Visually, the standardized residuals appear approximately normally distributed.

shapiro.test(mydata$StdResid)
## 
##  Shapiro-Wilk normality test
## 
## data:  mydata$StdResid
## W = 0.94879, p-value = 0.002187

H0: Errors are normally distributed. H1: Errors are not normally distributed.

Formally, we reject H0 (p-value = 0.002 < 0.003), so the Shapiro-Wilk test detects a deviation from normality. However, with 84 observation units and a histogram that looks approximately normal, I treat the deviation as mild and continue under the assumption that the standardized residuals are approximately normally distributed.

11. Estimate the fit2 again without potentially excluded units and show the summary of the model. Explain all coefficients. (2 p)

fit2 <- lm(Price ~ Age + Distance, 
           data = mydata)
summary(fit2)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -604.92 -229.63  -56.49  192.97  599.35 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2456.076     73.931  33.221  < 2e-16 ***
## Age           -6.464      3.159  -2.046    0.044 *  
## Distance     -22.955      2.786  -8.240 2.52e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 276.1 on 81 degrees of freedom
## Multiple R-squared:  0.4838, Adjusted R-squared:  0.4711 
## F-statistic: 37.96 on 2 and 81 DF,  p-value: 2.339e-12

Price per m2 = 2456.076 + (-6.464) x Age + (-22.955) x Distance

Intercept (2456.076): This is the estimated price when the age of the apartment is zero and the distance to the city center is also 0. It provides a baseline price of 2456.076 EUR for new apartments (age = 0 years, distance = 0 km).

Age (-6.464): The coefficient for Age is -6.464, meaning that for each additional year, the price of the apartment decreases by about 6.464 EUR, holding Distance constant. The p-value for Age is 0.044, suggesting that the effect of Age on Price is statistically significant.

H0: β1 = 0 (Age has no effect on price per m2)

H1: β1 ≠ 0 (Age has an effect on price per m2)

We reject H0 (p-value = 0.044 < 0.05), so the effect of Age is statistically significant.

Distance (-22.955): The coefficient for Distance is -22.955, implying that for every additional kilometer from the center, the price decreases by 22.955 EUR, holding Age constant. This coefficient is also highly significant (p-value = 2.52e-12), showing a strong inverse relationship between Distance and Price.

H0: β2 = 0 (Distance has no effect on price per m2)

H1: β2 ≠ 0 (Distance has an effect on price per m2)

We reject H0 (p-value < 0.001), so the effect of Distance is statistically significant.

Multiple R-squared (0.4838): This value tells us that about 48.38% of the variability in apartment prices can be explained by the linear effect of the age of the apartment and the distance to the city center. This is a moderately high R-squared value: the model explains almost half of the variation in prices.

Multiple correlation coefficient (0.696): This is the square root of R-squared, sqrt(0.4838) ≈ 0.696. It measures the correlation between the observed and the fitted prices and is always non-negative. The value indicates a moderately strong linear relationship between Price and the two explanatory variables taken together.
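
For a model with several explanatory variables, the square root of R-squared is the multiple correlation coefficient, that is, the correlation between observed and fitted values. The sketch below checks this on synthetic data (simulated vectors, not Apartments.xlsx) and then takes the square root of the R-squared reported for fit2.

```r
# Synthetic data (not the apartment data)
set.seed(3)
n        <- 84
age      <- runif(n, min = 1, max = 45)
distance <- runif(n, min = 1, max = 45)
price    <- 2450 - 6 * age - 23 * distance + rnorm(n, sd = 275)

fit <- lm(price ~ age + distance)

# Multiple correlation coefficient from R-squared
multiple_r <- sqrt(summary(fit)$r.squared)

# Correlation between observed and fitted values
r_obs_fitted <- cor(price, fitted(fit))

print(c(multiple_r = multiple_r, r_obs_fitted = r_obs_fitted))

# Applied to the R-squared reported in the summary of fit2
reported_multiple_r <- sqrt(0.4838)
print(reported_multiple_r)
```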

12. Estimate the linear regression function Price = f(Age, Distance, Parking and Balcony). Be careful to correctly include categorical variables. Save the object named fit3.

fit3 <- lm(Price ~ Age + Distance + Parking + Balcony, 
           data = mydata) 

13. With function anova check if model fit3 fits data better than model fit2. (1 p)

anova(fit2, fit3)
## Analysis of Variance Table
## 
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Parking + Balcony
##   Res.Df     RSS Df Sum of Sq      F  Pr(>F)  
## 1     81 6176767                              
## 2     79 5654480  2    522287 3.6485 0.03051 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

H0: Δρ² = 0 (both models fit the data equally well)

H1: Δρ² > 0 (fit3 fits the data better than fit2)

We reject H0 at p-value < 0.04.

The p-value (0.03051) is significant at the 5% level, suggesting that adding Parking and Balcony to the model significantly improves the fit compared to just using Age and Distance.
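
The F statistic in the anova table can be reproduced from the two residual sums of squares. This is a sketch with the values copied from the table above; the variable names are illustrative.

```r
# Values copied from the anova(fit2, fit3) table above
rss_small <- 6176767   # residual sum of squares of fit2
rss_large <- 5654480   # residual sum of squares of fit3
df_small  <- 81        # residual degrees of freedom of fit2
df_large  <- 79        # residual degrees of freedom of fit3

# Number of added parameters (Parking and Balcony)
q <- df_small - df_large

# Partial F statistic: reduction in RSS per added parameter,
# relative to the residual variance of the larger model
f_stat <- ((rss_small - rss_large) / q) / (rss_large / df_large)

# Upper-tail probability of the F distribution
p_value <- pf(f_stat, df1 = q, df2 = df_large, lower.tail = FALSE)

print(f_stat)
print(p_value)
```

The results should match the F and Pr(>F) columns of the table up to rounding.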

14. Show the results of fit3 and explain regression coefficient for both categorical variables. Can you write down the hypothesis which is being tested with F-statistics, shown at the bottom of the output? (2 p)

summary(fit3)
## 
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = mydata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -473.21 -192.37  -28.89  204.17  558.77 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2329.724     93.066  25.033  < 2e-16 ***
## Age           -5.821      3.074  -1.894  0.06190 .  
## Distance     -20.279      2.886  -7.026 6.66e-10 ***
## ParkingYes   167.531     62.864   2.665  0.00933 ** 
## BalconyYes   -15.207     59.201  -0.257  0.79795    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 267.5 on 79 degrees of freedom
## Multiple R-squared:  0.5275, Adjusted R-squared:  0.5035 
## F-statistic: 22.04 on 4 and 79 DF,  p-value: 3.018e-12

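ParkingYes (167.531): Given the values of all other variables, apartments with parking have on average a price per m2 that is about 167.531 EUR higher than apartments without parking. The difference is statistically significant (p-value = 0.009).

BalconyYes (-15.207): Given the values of all other variables, apartments with a balcony have on average a price per m2 that is about 15.207 EUR lower than apartments without a balcony. This difference is not statistically significant (p-value = 0.798), so we cannot claim that a balcony affects the price.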

H0: None of the predictors (Age, Distance, ParkingYes, BalconyYes) have an effect on the dependent variable Price.

H1: At least one of the predictors has a non-zero effect on the dependent variable Price.
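
The overall F statistic can be approximated from R-squared, the number of explanatory terms and the residual degrees of freedom. The sketch below uses the rounded values from the summary of fit3, so the result differs slightly from the printed 22.04; the variable names are illustrative.

```r
# Values copied from summary(fit3) above
r2  <- 0.5275   # multiple R-squared (rounded)
k   <- 4        # number of regression coefficients without the intercept
df2 <- 79       # residual degrees of freedom (n - k - 1)

# Overall F statistic: explained variation per coefficient,
# relative to unexplained variation per residual degree of freedom
f_stat <- (r2 / k) / ((1 - r2) / df2)

# Upper-tail probability of the F distribution
p_value <- pf(f_stat, df1 = k, df2 = df2, lower.tail = FALSE)

print(f_stat)
print(p_value)
```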

15. Save fitted values and calculate the residual for apartment ID2. (1 p)

fitted_values <- fitted(fit3)
residuals <- residuals(fit3)

fitted_value_ID2 <- fitted_values[2]
residual_ID2 <- residuals[2]
print(paste("Fitted Value for Apartment ID2:", fitted_value_ID2))
## [1] "Fitted Value for Apartment ID2: 2372.19707577646"
print(paste("Residual for Apartment ID2:", residual_ID2))
## [1] "Residual for Apartment ID2: 427.802924223543"
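
The fitted value for apartment ID2 can also be checked by hand from the rounded coefficients of fit3 and the values of ID2 shown in the first data listing (Age = 18, Distance = 1, Parking = Yes, Balcony = No, Price = 2800). The vector names below are illustrative.

```r
# Rounded coefficients copied from summary(fit3)
b <- c(intercept   = 2329.724,
       age         = -5.821,
       distance    = -20.279,
       parking_yes = 167.531,
       balcony_yes = -15.207)

# Apartment ID2: intercept term, Age = 18, Distance = 1,
# Parking = Yes (1), Balcony = No (0)
x_id2 <- c(1, 18, 1, 1, 0)

# Fitted value is the sum of coefficient * value
fitted_id2 <- sum(b * x_id2)

# Residual = observed price - fitted price
residual_id2 <- 2800 - fitted_id2

print(fitted_id2)
print(residual_id2)
```

Because the coefficients are rounded to three decimals, the hand calculation agrees with fitted(fit3) and residuals(fit3) only up to rounding.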

Source (print(paste)): https://www.geeksforgeeks.org/printing-output-of-an-r-program/