Task 3

Eseniya Ganina

Import the dataset Apartments.xlsx

library(readxl)
Apartments <- read_excel("~/Desktop/R Take Home Exam 2025/Task 3/Apartments.xlsx")

Description:

Age: Age of an apartment in years
Distance: The distance from city center in km
Price: Price per m2
Parking: 0-No, 1-Yes
Balcony: 0-No, 1-Yes

Change categorical variables into factors.

mydata <-force(Apartments)
mydata$Parking <- factor(mydata$Parking,
                         levels = c(0, 1),
                         labels = c("No", "Yes"))
mydata$Balcony <-factor(mydata$Balcony,
                        levels = c(0,1),
                        labels = c("No","Yes"))

Test the hypothesis H0: Mu_Price = 1900 eur. What can you conclude?

mean(mydata$Price)

## [1] 2018.941

sd(mydata$Price)

## [1] 377.8417

qt(p = 0.025, df = 84, lower.tail = FALSE)

## [1] 1.98861

qt(p = 0.025, df = 84, lower.tail = TRUE)

## [1] -1.98861

t.test(mydata$Price,
       mu = 1900,
       alternative = "two.sided")

## 
##  One Sample t-test
## 
## data:  mydata$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
##  1937.443 2100.440
## sample estimates:
## mean of x 
##  2018.941

Based on a sample data we can reject the H0 at p<0.005. The average price for the apartment has increased.

Estimate the simple regression function: Price = f(Age). Save results in object fit1 and explain the estimate of regression coefficient, coefficient of correlation and coefficient of determination.

fit1 <-lm(Price~Age,
          data=mydata)
coef(fit1)

## (Intercept)         Age 
## 2185.454892   -8.975058

cor(mydata$Price, mydata$Age)

## [1] -0.230255

summary(fit1)$r.squared

## [1] 0.05301737

summary(fit1)

## 
## Call:
## lm(formula = Price ~ Age, data = mydata)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -623.9 -278.0  -69.8  243.5  776.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2185.455     87.043  25.108   <2e-16 ***
## Age           -8.975      4.164  -2.156    0.034 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared:  0.05302,    Adjusted R-squared:  0.04161 
## F-statistic: 4.647 on 1 and 83 DF,  p-value: 0.03401

Regression coefficient: The intercept indicates that when Age is zero and all other factors remain unchanged, the average Price is about €2185.46. The coefficient for Age shows that for every additional year of Age, the Price decreases on average by €8.98, assuming everything else stays constant. Since the p-value is below 0.05, we can say with 95% confidence that these coefficients are statistically significant and not equal to zero, which means we reject the null hypothesis. Correlation coefficient: This value measures both the strength and the direction of the linear relationship between two variables. In this case, Age and Price have a weak negative relationship, as indicated by the correlation coefficient of –0.23. Coefficient of determination: The R-squared statistic shows that only 5.3% of the variation in Price can be explained by the Age variable.

Show the scateerplot matrix between Price, Age and Distance. Based on the matrix determine if there is potential problem with multicolinearity.

library(car)

## Loading required package: carData

scatterplotMatrix(mydata[ , c("Price","Age","Distance")], 
                  smooth = FALSE)

No potential problem with multicolinearity

Estimate the multiple regression function: Price = f(Age, Distance). Save it in object named fit2.

fit2 <-lm(Price~Age+Distance,
          data=mydata)
summary(fit2)

## 
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -603.23 -219.94  -85.68  211.31  689.58 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2460.101     76.632   32.10  < 2e-16 ***
## Age           -7.934      3.225   -2.46    0.016 *  
## Distance     -20.667      2.748   -7.52 6.18e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared:  0.4396, Adjusted R-squared:  0.4259 
## F-statistic: 32.16 on 2 and 82 DF,  p-value: 4.896e-11

Chech the multicolinearity with VIF statistics. Explain the findings.

library(car)
vif(fit2)

##      Age Distance 
## 1.001845 1.001845

mean(vif(fit2))

## [1] 1.001845

VIF statistics are below 5, while the average is equal to 1, which means that in the model fit2 there is no problem with multicolinearity.

Calculate standardized residuals and Cooks Distances for model fit2. Remove any potentially problematic units (outliers or units with high influence).

mydata$StdResid <- round(rstandard(fit2), 3)
shapiro.test(mydata$StdResid)

## 
##  Shapiro-Wilk normality test
## 
## data:  mydata$StdResid
## W = 0.95303, p-value = 0.003645

mydata$CooksD <- round(cooks.distance(fit2), 3)
hist(mydata$CooksD,
xlab = "Cooks distance",
ylab = "Frequency",
main = "Histogram of Cooks distances")

we will remove all observations that appear on the graph after this break, as they can be considered being influential points and contribute to the result

mydata$ID <- seq_len(nrow(mydata))
head(mydata[order(-mydata$CooksD),],)

## # A tibble: 6 × 8
##     Age Distance Price Parking Balcony StdResid CooksD    ID
##   <dbl>    <dbl> <dbl> <fct>   <fct>      <dbl>  <dbl> <int>
## 1     5       45  2180 Yes     Yes         2.58  0.32     38
## 2    43       37  1740 No      No          1.44  0.104    55
## 3     2       11  2790 Yes     No          2.05  0.069    33
## 4     7        2  1760 No      Yes        -2.15  0.066    53
## 5    37        3  2540 Yes     Yes         1.58  0.061    22
## 6    40        2  2400 No      Yes         1.09  0.038    39

head(mydata[order(mydata$StdResid),],)

## # A tibble: 6 × 8
##     Age Distance Price Parking Balcony StdResid CooksD    ID
##   <dbl>    <dbl> <dbl> <fct>   <fct>      <dbl>  <dbl> <int>
## 1     7        2  1760 No      Yes        -2.15  0.066    53
## 2    12       14  1650 No      Yes        -1.50  0.013    13
## 3    12       14  1650 No      No         -1.50  0.013    72
## 4    13        8  1800 No      No         -1.38  0.012    20
## 5    14       16  1660 No      Yes        -1.26  0.008    35
## 6    24        5  1830 Yes     No         -1.19  0.012    36

By looking at the standardized residuals we can remove all the variables that have value close to -3 or 3 (these values often used to detect outliers)

Removing outliers

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following object is masked from 'package:car':
## 
##     recode

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

drop_ids <-c(38,55,33,53,22)
mydata <- mydata %>% filter (!ID %in% drop_ids)
print(mydata)

## # A tibble: 80 × 8
##      Age Distance Price Parking Balcony StdResid CooksD    ID
##    <dbl>    <dbl> <dbl> <fct>   <fct>      <dbl>  <dbl> <int>
##  1     7       28  1640 No      Yes       -0.665  0.007     1
##  2    18        1  2800 Yes     No         1.78   0.03      2
##  3     7       28  1660 No      No        -0.594  0.006     3
##  4    28       29  1850 No      Yes        0.754  0.008     4
##  5    18       18  1640 Yes     Yes       -1.07   0.005     5
##  6    28       12  1770 No      Yes       -0.778  0.005     6
##  7    14       20  1850 No      Yes       -0.302  0.001     7
##  8    18        6  1970 Yes     Yes       -0.787  0.004     8
##  9    22        7  2270 Yes     No         0.455  0.001     9
## 10    25        2  2570 Yes     No         1.24   0.017    10
## # ℹ 70 more rows

Check for potential heteroskedasticity with scatterplot between standarized residuals and standrdized fitted values. Explain the findings.

fit2 <- lm(Price ~ Age + Distance, data = mydata)
mydata$StdResid <- round(rstandard(fit2), 3)
mydata$StdFittedValues <- scale(fit2$fitted.values)

library(car)
scatterplot(y = mydata$StdResid, x = mydata$StdFittedValues,
ylab = "Standardized residuals",
xlab = "Standardized fitted values",
boxplots = FALSE,
regLine = FALSE,
smooth = FALSE)

The points in the scatter plot of standardized residuals versus standardized fitted values should appear randomly scattered within a horizontal band, showing a consistent spread across all fitted values. Here variability doesn’t change so we can say that heteroskedastiity is not present.

Are standardized residuals ditributed normally? Show the graph and formally test it. Explain the findings.

standardized_residuals <- rstandard(fit2)
shapiro.test(standardized_residuals)

## 
##  Shapiro-Wilk normality test
## 
## data:  standardized_residuals
## W = 0.94156, p-value = 0.001168

hist(rstandard(fit2))

From the graph we can see that the distribution is not normally distributed and its slightly skewed to the right. Below the distribution is formally tested with Shapiro-Wilk test.

shapiro.test(mydata$StdResid)

## 
##  Shapiro-Wilk normality test
## 
## data:  mydata$StdResid
## W = 0.94154, p-value = 0.001166

Ho- distribution is normal N1- distribution is not normal According to results from Shapiro-Wilk test, p-value < 0,05 which means that we can reject the null hypothesis. Data is not normally distributed.

Estimate the fit2 again without potentially excluded units and show the summary of the model. Explain all coefficients.

fit2 <- lm (Price~ Age +Distance,
            data=mydata)
summary(fit2)

## 
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -411.50 -203.69  -45.24  191.11  492.56 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2502.467     75.024  33.356  < 2e-16 ***
## Age           -8.674      3.221  -2.693  0.00869 ** 
## Distance     -24.063      2.692  -8.939 1.57e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 256.8 on 77 degrees of freedom
## Multiple R-squared:  0.5361, Adjusted R-squared:  0.524 
## F-statistic: 44.49 on 2 and 77 DF,  p-value: 1.437e-13

The regression model shows that both Age and Distance have significant negative effects on Price. A newly built apartment located in the city center is expected to cost approximately €2,502.47 (Intercept) For each additional year of age, price decreases by about $8.67, and for each additional unit of distance, price decrease by $24.06. R²: 53.6% of variability of apartments is explained by linear effect of Age, Distance.

Estimate the linear regression function Price = f(Age, Distance, Parking and Balcony). Be careful to correctly include categorical variables. Save the object named fit3.

fit3 <- lm(Price~ Age + Distance + Parking + Balcony,
           data = mydata)

With function anova check if model fit3 fits data better than model fit2.

anova (fit2,fit3)

## Analysis of Variance Table
## 
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Parking + Balcony
##   Res.Df     RSS Df Sum of Sq      F Pr(>F)
## 1     77 5077362                           
## 2     75 4791128  2    286234 2.2403 0.1135

Show the results of fit3 and explain regression coefficient for both categorical variables. Can you write down the hypothesis which is being tested with F-statistics, shown at the bottom of the output?

summary(fit3)

## 
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = mydata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -390.93 -198.19  -53.64  186.73  518.34 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2393.316     93.930  25.480  < 2e-16 ***
## Age           -7.970      3.191  -2.498   0.0147 *  
## Distance     -21.961      2.830  -7.762 3.39e-11 ***
## ParkingYes   128.700     60.801   2.117   0.0376 *  
## BalconyYes     6.032     57.307   0.105   0.9165    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 252.7 on 75 degrees of freedom
## Multiple R-squared:  0.5623, Adjusted R-squared:  0.5389 
## F-statistic: 24.08 on 4 and 75 DF,  p-value: 7.764e-13

The regression coefficient for paring tells us that if an apartment has a parking space and everything else remains unchanged, then the price is 128.7 euros higher on average. Same for balcony, if the appartment has a balcony price decrease by 6.03 euros on average.

F hypothesis: Ho: the population coefficient of determination=0 H1: the population coefficient of determintaion >0

Save fitted values and claculate the residual for apartment ID2.

Fitted_ID2   <- fitted(fit3)[mydata$ID == 2]
Residual_ID2 <- resid(fit3)[mydata$ID == 2]
round(c(Fitted = Fitted_ID2, Residual = Residual_ID2), 3)

##   Fitted.2 Residual.2 
##   2356.597    443.403

A positive residual means the apartment’s actual price is higher than the model predicts. The actual price for the apartment is by 443.403 euros higher that the estimated value from the regression