Import the dataset Apartments.xlsx

library(readxl)

Apartments <- read_xlsx("Apartments.xlsx")

Apartments <- as.data.frame(Apartments)

Change categorical variables into factors.

Apartments$Parking <- factor(Apartments$Parking,
                        levels = c("0", "1"),
                        labels = c("no", "yes"))

# Note: Balcony is coded in reverse of Parking (0 = "yes", 1 = "no"),
# so the reference level of the Balcony factor is "yes"
Apartments$Balcony <- factor(Apartments$Balcony,
                             levels = c("0", "1"),
                             labels = c("yes", "no"))

Test the hypothesis H0: Mu_Price = 1990 EUR. What can you conclude?

t.test(Apartments$Price, 
       mu = 1990,
       alternative = "two.sided")
## 
##  One Sample t-test
## 
## data:  Apartments$Price
## t = 0.70618, df = 84, p-value = 0.482
## alternative hypothesis: true mean is not equal to 1990
## 95 percent confidence interval:
##  1937.443 2100.440
## sample estimates:
## mean of x 
##  2018.941

Because the p-value (0.482) is larger than alpha = 5%, we cannot reject the null hypothesis: the sample mean (2018.94) is not statistically significantly different from 1990. Note that failing to reject H0 is not the same as proving it true.

Estimate the simple regression function Price = f(Age). Save the results in an object named fit1 and explain the estimated regression coefficient, the coefficient of correlation and the coefficient of determination.
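A minimal sketch of this estimation, assuming the same Apartments data frame (output omitted; the slope is the average price change per additional year of age):

fit1 <- lm(Price ~ Age, data = Apartments)  # simple regression of Price on Age
summary(fit1)                               # regression coefficient (slope of Age)
cor(Apartments$Age, Apartments$Price)       # coefficient of correlation
summary(fit1)$r.squared                     # coefficient of determination (R-squared)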

Show the scatterplot matrix between Price, Age and Distance. Based on the matrix, determine whether there is a potential problem with multicollinearity.

library(car)  
## Loading required package: carData
scatterplot(Distance ~ Price,
            data = Apartments,
            main = "Scatterplot: Distance vs Price",
            xlab = "Price",
            ylab = "Distance",
            pch  = 19,
            col  = "blue",
            smooth=FALSE) 

I don't think there is a problem with multicollinearity; the variables do not show a strong linear relationship.

library(car)  

scatterplot(Age ~ Price,
            data = Apartments,
            main = "Scatterplot: Age vs Price",
            xlab = "Price",
            ylab = "Age",
            pch  = 19,
            col  = "blue",
            smooth=FALSE) 

Again, there is no sign of a strong linear relationship. Note that multicollinearity concerns the relationship between the predictors themselves, so the decisive pair is Age vs Distance, which is best seen in a full scatterplot matrix.
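One way to produce the requested matrix is car::scatterplotMatrix, sketched below on the same data; it shows all pairwise panels at once, including Age vs Distance:

scatterplotMatrix(~ Price + Age + Distance,
                  data = Apartments,
                  smooth = FALSE,
                  main = "Scatterplot matrix: Price, Age, Distance")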

Estimate the multiple regression function Price = f(Age, Distance). Save it in an object named fit2.

fit2 <- lm(Price ~ Age + Distance, data = Apartments)
summary(fit2)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -603.23 -219.94  -85.68  211.31  689.58 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2460.101     76.632   32.10  < 2e-16 ***
## Age           -7.934      3.225   -2.46    0.016 *  
## Distance     -20.667      2.748   -7.52 6.18e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared:  0.4396, Adjusted R-squared:  0.4259 
## F-statistic: 32.16 on 2 and 82 DF,  p-value: 4.896e-11

Check the multicollinearity with the VIF statistic. Explain the findings.

library(car)

vif(fit2)
##      Age Distance 
## 1.001845 1.001845

A VIF close to 1 means that almost none of a predictor's variance is explained by the other predictors (VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the remaining predictors), so there is no multicollinearity here.
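As a quick sanity check, the VIF for Age can be reproduced from this formula by regressing Age on Distance (a sketch on the same data):

# VIF_Age = 1 / (1 - R^2 of Age ~ Distance)
1 / (1 - summary(lm(Age ~ Distance, data = Apartments))$r.squared)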

Calculate standardized residuals and Cook's distances for model fit2. Remove any potentially problematic units (outliers or units with high influence).

r_std <- rstandard(fit2)
cooks <- cooks.distance(fit2)
head(r_std)
##          1          2          3          4          5          6 
## -0.6653487  1.7832876 -0.5937629  0.7543794 -1.0733987 -0.7775190
which(abs(r_std) > 2)
## 33 38 53 
## 33 38 53

Units with a standardized residual above 2 in absolute value are potential outliers; here these are units 33, 38 and 53.

which(abs(r_std) > 3)
## named integer(0)

Units with a standardized residual above 3 in absolute value would be strong outliers; there are none.
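The Cook's distances computed above should also be inspected. A minimal sketch, assuming the common 4/n rule of thumb as the cutoff for high influence:

# Flag units whose Cook's distance exceeds the 4/n rule of thumb
which(cooks > 4 / nrow(Apartments))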

Check for potential heteroskedasticity with a scatterplot of standardized residuals against standardized fitted values. Explain the findings.

r_std <- rstandard(fit2)         
std_fit <- scale(fitted(fit2))  
plot(std_fit, r_std,
     main = "Standardized Residuals vs Standardized Fitted Values",
     xlab = "Standardized Fitted Values",
     ylab = "Standardized Residuals",
     pch  = 19, col = "blue")
abline(h = 0, lty = 2, col = "red") 

This is an example of homoskedasticity; heteroskedasticity is not present. Homoskedasticity means that the residuals have roughly constant variance across all fitted values: here the points are scattered fairly evenly around the horizontal axis, with no funnel or fan shape.
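The visual impression could be supplemented with a formal test; one option is car's non-constant variance score test, sketched here:

ncvTest(fit2)  # H0: constant error variance (homoskedasticity)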

Are the standardized residuals distributed normally? Show a graph and test it formally. Explain the findings.

hist(r_std,
     breaks = 20,
     main = "Histogram of Standardized Residuals",
     xlab = "Standardized Residuals",
     col = "orange")

The histogram does not look clearly bell-shaped, which suggests a departure from normality; a formal test should confirm this, as sketched below.
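A minimal sketch of the formal check, assuming the Shapiro-Wilk test on the standardized residuals:

shapiro.test(r_std)  # H0: the residuals come from a normal distribution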

Estimate fit2 again without the potentially problematic units identified above and show the summary of the model. Explain all coefficients.
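A sketch of the refit, assuming the units flagged earlier (33, 38 and 53, with |standardized residual| > 2) are the ones excluded:

Apartments_clean <- Apartments[-c(33, 38, 53), ]  # drop flagged units
fit2_clean <- lm(Price ~ Age + Distance, data = Apartments_clean)
summary(fit2_clean)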

Estimate the linear regression function Price = f(Age, Distance, Parking, Balcony). Be careful to correctly include the categorical variables. Save the result in an object named fit3.

str(Apartments)               
## 'data.frame':    85 obs. of  5 variables:
##  $ Age     : num  7 18 7 28 18 28 14 18 22 25 ...
##  $ Distance: num  28 1 28 29 18 12 20 6 7 2 ...
##  $ Price   : num  1640 2800 1660 1850 1640 1770 1850 1970 2270 2570 ...
##  $ Parking : Factor w/ 2 levels "no","yes": 1 2 1 1 2 1 1 2 2 2 ...
##  $ Balcony : Factor w/ 2 levels "yes","no": 2 1 1 2 2 2 2 2 1 1 ...
# Parking and Balcony are already factors (see the str() output above),
# so these as.factor() calls are redundant but harmless
Apartments$Parking <- as.factor(Apartments$Parking)
Apartments$Balcony <- as.factor(Apartments$Balcony)
fit3 <- lm(Price ~ Age + Distance + Parking + Balcony, data = Apartments)
summary(fit3)
## 
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = Apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -459.92 -200.66  -57.48  260.08  594.37 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2301.667     94.271  24.415  < 2e-16 ***
## Age           -6.799      3.110  -2.186  0.03172 *  
## Distance     -18.045      2.758  -6.543 5.28e-09 ***
## Parkingyes   196.168     62.868   3.120  0.00251 ** 
## Balconyno      1.935     60.014   0.032  0.97436    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 273.7 on 80 degrees of freedom
## Multiple R-squared:  0.5004, Adjusted R-squared:  0.4754 
## F-statistic: 20.03 on 4 and 80 DF,  p-value: 1.849e-11

With the anova function, check whether model fit3 fits the data better than model fit2.

anova(fit2, fit3)
## Analysis of Variance Table
## 
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Parking + Balcony
##   Res.Df     RSS Df Sum of Sq      F  Pr(>F)  
## 1     82 6720983                              
## 2     80 5991088  2    729894 4.8732 0.01007 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Null hypothesis: Parking and Balcony have no effect, i.e. their regression coefficients are both zero, so the reduced and full models explain Price equally well. Alternative hypothesis: at least one of the two coefficients differs from zero, i.e. adding Parking and Balcony improves the fit.

The p-value (0.01) is smaller than 0.05, so we reject the null hypothesis in favour of the alternative: fit3 fits the data statistically significantly better than fit2.

Including the variables Parking and Balcony significantly improves the fit of the regression model for predicting Price. Model 2 (fit3) provides a statistically better explanation of the variability in Price than Model 1 (fit2).
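As a check, the partial F-statistic can be reproduced by hand from the RSS values in the anova table:

# F = ((RSS_reduced - RSS_full) / q) / (RSS_full / df_full)
((6720983 - 5991088) / 2) / (5991088 / 80)  # approx. 4.873, matching the output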

Show the results of fit3 and explain the regression coefficients of both categorical variables. Can you write down the hypothesis being tested by the F-statistic shown at the bottom of the output?

fit3 <- lm(Price ~ Age + Distance + Parking + Balcony, data = Apartments)
summary(fit3)
## 
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = Apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -459.92 -200.66  -57.48  260.08  594.37 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2301.667     94.271  24.415  < 2e-16 ***
## Age           -6.799      3.110  -2.186  0.03172 *  
## Distance     -18.045      2.758  -6.543 5.28e-09 ***
## Parkingyes   196.168     62.868   3.120  0.00251 ** 
## Balconyno      1.935     60.014   0.032  0.97436    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 273.7 on 80 degrees of freedom
## Multiple R-squared:  0.5004, Adjusted R-squared:  0.4754 
## F-statistic: 20.03 on 4 and 80 DF,  p-value: 1.849e-11

Explanation: by the ceteris paribus principle, for an apartment with all explanatory variables at zero or at their reference level (Age = 0, Distance = 0, no parking, with balcony), the expected price is approximately 2301.7.

If all else stays the same, every additional year of age decreases the price by approximately 6.8 units (the coefficient is -6.799).

Every added unit of distance decreases the price by about 18 units, holding the other variables constant.

Apartments with parking are on average about 196 units more expensive than otherwise comparable apartments without parking.

The Balcony effect is not statistically significant (p = 0.974): apartments without a balcony are estimated to cost only about 1.9 units more, which is indistinguishable from zero.

The F-statistic at the bottom tests H0: beta_Age = beta_Distance = beta_Parking = beta_Balcony = 0 (no explanatory variable has an effect) against H1: at least one coefficient differs from zero. With p = 1.849e-11 we reject H0, so the model as a whole is statistically significant.
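For illustration, the fitted model can also be used for prediction; the apartment below is entirely hypothetical (made-up Age, Distance and factor values):

new_apt <- data.frame(Age = 10, Distance = 5,
                      Parking = factor("yes", levels = c("no", "yes")),
                      Balcony = factor("yes", levels = c("yes", "no")))
predict(fit3, newdata = new_apt)  # expected price for this hypothetical unit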

Save the fitted values and calculate the residual for apartment ID2.

fit3 <- lm(Price ~ Age + Distance + Parking + Balcony, data = Apartments)
fitted_vals <- fitted(fit3)
residual_vals <- resid(fit3)
fitted_ID2   <- fitted_vals[2]
residual_ID2 <- residual_vals[2]

fitted_ID2
##        2 
## 2357.411
residual_ID2  
##        2 
## 442.5889
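
As a sanity check, the fitted value for ID2 can be reproduced by hand from the coefficient table above (apartment 2 has Age = 18, Distance = 1, Parking = "yes", Balcony = "yes"; small deviations are due to rounding):

# fitted = intercept + b_Age*18 + b_Distance*1 + b_Parkingyes*1 + b_Balconyno*0
2301.667 - 6.799 * 18 - 18.045 * 1 + 196.168 * 1 + 1.935 * 0  # approx. 2357.41
2800 - 2357.411                                               # residual approx. 442.59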