Homework assignment 3 for the course Applied Data Analysis in Business with R

Author: Hui-Ju Huang

1. Import the dataset Apartments.xlsx

# install.packages("readxl")

library(readxl)

mydata <- read_xlsx("./Apartments.xlsx")
head(mydata)

## # A tibble: 6 × 5
##     Age Distance Price Parking Balcony
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl>
## 1     7       28  1640       0       1
## 2    18        1  2800       1       0
## 3     7       28  1660       0       0
## 4    28       29  1850       0       1
## 5    18       18  1640       1       1
## 6    28       12  1770       0       1

Description:

Age: Age of an apartment in years
Distance: The distance from city center in km
Price: Price per m2
Parking: 0-No, 1-Yes
Balcony: 0-No, 1-Yes

2. Change categorical variables into factors. (0.5 p)

mydata$ParkingF <- factor(mydata$Parking, 
                         levels = c(0, 1), 
                         labels = c("No", "Yes"))

mydata$BalconyF <- factor(mydata$Balcony, 
                         levels = c(0, 1), 
                         labels = c("No", "Yes"))
head(mydata)

## # A tibble: 6 × 7
##     Age Distance Price Parking Balcony ParkingF BalconyF
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl> <fct>    <fct>   
## 1     7       28  1640       0       1 No       Yes     
## 2    18        1  2800       1       0 Yes      No      
## 3     7       28  1660       0       0 No       No      
## 4    28       29  1850       0       1 No       Yes     
## 5    18       18  1640       1       1 Yes      Yes     
## 6    28       12  1770       0       1 No       Yes

library(pastecs)
round(stat.desc(mydata[c("Age", "Distance", "Price")]), 2)

##                  Age Distance     Price
## nbr.val        85.00    85.00     85.00
## nbr.null        0.00     0.00      0.00
## nbr.na          0.00     0.00      0.00
## min             1.00     1.00   1400.00
## max            45.00    45.00   2820.00
## range          44.00    44.00   1420.00
## sum          1577.00  1209.00 171610.00
## median         18.00    12.00   1950.00
## mean           18.55    14.22   2018.94
## SE.mean         1.05     1.23     40.98
## CI.mean.0.95    2.09     2.45     81.50
## var            93.96   129.44 142764.34
## std.dev         9.69    11.38    377.84
## coef.var        0.52     0.80      0.19

3. Test the hypothesis H0: Mu_Price = 1900 eur. What can you conclude? (1 p)

shapiro.test(mydata$Price)

## 
##  Shapiro-Wilk normality test
## 
## data:  mydata$Price
## W = 0.94017, p-value = 0.0006513

The Shapiro-Wilk test rejects the null hypothesis that Price is normally distributed (p < 0.001), so the normality assumption is not met. Therefore, the appropriate non-parametric alternative, the Wilcoxon signed rank test, must be performed. It tests the location (median) of Price rather than the mean.
The hypotheses can be formulated as follows:
- H0: Mu_Price = 1900 EUR
- H1: Mu_Price ≠ 1900 EUR
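The decision rule described above (check normality first, then pick the one-sample test) can be sketched as a small helper. This is only an illustration: `choose_location_test` is a hypothetical function name, and the prices are simulated, not the Apartments data.

```r
# Sketch: choose a one-sample location test based on a normality check.
# `choose_location_test` is a hypothetical helper, not part of the original analysis.
choose_location_test <- function(x, mu, alpha = 0.05) {
  normal_ok <- shapiro.test(x)$p.value >= alpha
  if (normal_ok) {
    t.test(x, mu = mu)                        # parametric: tests the mean
  } else {
    wilcox.test(x, mu = mu, correct = FALSE)  # non-parametric: tests the location
  }
}

set.seed(1)
skewed_prices <- rexp(85, rate = 1 / 2000)    # simulated, clearly non-normal values
result <- choose_location_test(skewed_prices, mu = 1900)
result$method                                 # name of the test that was selected
```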

wilcox.test(mydata$Price,
            mu = 1900,
            correct = FALSE)

## 
##  Wilcoxon signed rank test
## 
## data:  mydata$Price
## V = 2328, p-value = 0.02828
## alternative hypothesis: true location is not equal to 1900

The p-value, which is 0.028, supports the rejection of the null hypothesis, which means the median price is different from 1900 EUR.

library(effectsize)
effectsize(wilcox.test(mydata$Price, 
                       mu = 1900, 
                       correct = FALSE))

## r (rank biserial) |       95% CI
## --------------------------------
## 0.27              | [0.04, 0.48]
## 
## - Deviation from a difference of 1900.

interpret_rank_biserial(0.27, rules="funder2019")

## [1] "medium"
## (Rules: funder2019)

Conclusions:

Based on the sample data, we reject the null hypothesis: the median price of apartments differs from 1900 EUR (p = 0.028, r = 0.27, medium effect size).
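As a rough check, the rank-biserial effect size can be reproduced by hand from the reported V statistic. The sketch assumes that no apartment is priced at exactly 1900 EUR, so that all 85 units enter the ranking; that assumption is not verified in the output above.

```r
# Sketch: rank-biserial correlation from the Wilcoxon V statistic.
V <- 2328                          # sum of positive ranks reported by wilcox.test
n <- 85                            # assumption: no zero differences were dropped
total_ranks <- n * (n + 1) / 2     # sum of all ranks 1..n
negative_ranks <- total_ranks - V  # sum of negative ranks
r_rb <- (V - negative_ranks) / total_ranks
round(r_rb, 2)                     # agrees with the 0.27 reported by effectsize()
```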

4. Estimate the simple regression function: Price = f(Age). Save results in object fit1 and explain the estimate of regression coefficient, coefficient of correlation and coefficient of determination. (2 p)

fit1 <- lm(Price ~ Age, data = mydata)
summary(fit1)

## 
## Call:
## lm(formula = Price ~ Age, data = mydata)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -623.9 -278.0  -69.8  243.5  776.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2185.455     87.043  25.108   <2e-16 ***
## Age           -8.975      4.164  -2.156    0.034 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared:  0.05302,    Adjusted R-squared:  0.04161 
## F-statistic: 4.647 on 1 and 83 DF,  p-value: 0.03401

cor_coef <- sqrt(summary(fit1)$r.squared)
cor_coef

## [1] 0.230255

Regression Coefficient (Estimate):
- The regression coefficient for the Age variable is -8.975.
- This indicates that for each additional year of Age, the Price of the apartment is estimated to decrease on average by approximately 8.975 EUR.
- The coefficient is statistically significant at the 0.05 level (p-value = 0.034), suggesting that Age has a significant effect on Price.
Coefficient of Correlation (R):
- The square root of R-squared is 0.230255. Because the slope is negative, the correlation coefficient is -0.230, which indicates a weak negative linear relationship between Age and Price.
- This suggests that while there is some association between Age and Price, other factors not included in the model may also influence the relationship.
Coefficient of Determination (R-squared):
- The multiple R-squared value is 0.05302.
- This value represents the proportion of variance in the Price variable that is explained by the Age variable in the model.
- In other words, approximately 5.302% of the variability in Price can be accounted for by Age.
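In simple regression the square root of R-squared only gives the absolute value of the correlation; the sign comes from the slope. A minimal sketch of this, using the reported numbers and then a simulated data set (the simulated variables are illustrative assumptions, not the Apartments data):

```r
# Sketch: recover the signed correlation from R-squared and the slope.
r_squared <- 0.05302
slope <- -8.975
r <- sign(slope) * sqrt(r_squared)
round(r, 3)

# The same identity checked on simulated data.
set.seed(42)
age <- runif(85, 1, 45)
price <- 2200 - 9 * age + rnorm(85, sd = 370)
fit <- lm(price ~ age)
r_from_fit <- unname(sign(coef(fit)["age"]) * sqrt(summary(fit)$r.squared))
all.equal(r_from_fit, cor(age, price))
```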

5. Show the scatterplot matrix between Price, Age and Distance. Based on the matrix, determine if there is a potential problem with multicollinearity. (0.5 p)

library(car)

## Loading required package: carData

scatterplotMatrix(mydata[c("Price", "Age", "Distance")],
                  smooth = FALSE)

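There is a negative linear relationship between Price and Age and between Price and Distance. A potential multicollinearity problem would show up as a strong linear relationship between the two explanatory variables, Age and Distance.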

6. Estimate the multiple regression function: Price = f(Age, Distance). Save it in object named fit2.

fit2 <- lm(Price ~ Age + Distance, data = mydata)
summary(fit2)

## 
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -603.23 -219.94  -85.68  211.31  689.58 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2460.101     76.632   32.10  < 2e-16 ***
## Age           -7.934      3.225   -2.46    0.016 *  
## Distance     -20.667      2.748   -7.52 6.18e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared:  0.4396, Adjusted R-squared:  0.4259 
## F-statistic: 32.16 on 2 and 82 DF,  p-value: 4.896e-11

7. Check the multicollinearity with VIF statistics. Explain the findings. (1 p)

library(car)
vif(fit2) # Checking multicollinearity

##      Age Distance 
## 1.001845 1.001845

mean(vif(fit2))

## [1] 1.001845

All VIF statistics are below 5 and the average VIF statistic is 1.002, which is close to 1. Therefore, multicollinearity is not a problem in this model.
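With only two explanatory variables, the VIF is 1 / (1 - r^2), where r is the correlation between them, so the reported VIF can be turned back into a correlation. The sketch below does this and then checks the formula on simulated predictors (the simulated variables are illustrative assumptions):

```r
# Sketch: back out the Age-Distance correlation implied by the reported VIF.
vif_reported <- 1.001845
r2_between <- 1 - 1 / vif_reported   # R-squared of Age regressed on Distance
abs_r <- sqrt(r2_between)            # absolute correlation between the predictors
round(abs_r, 3)

# Check of the two-predictor formula on simulated data.
set.seed(7)
x1 <- rnorm(100)
x2 <- 0.5 * x1 + rnorm(100)
vif_manual <- 1 / (1 - summary(lm(x1 ~ x2))$r.squared)
all.equal(vif_manual, 1 / (1 - cor(x1, x2)^2))
```

A correlation this close to zero is consistent with the conclusion that Age and Distance are almost unrelated.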

8. Calculate standardized residuals and Cook's distances for model fit2. Remove any potentially problematic units (outliers or units with high influence). (2 p)

mydata$StdResid <- round(rstandard(fit2), 3)

hist(mydata$StdResid,
     xlab = "Standardized residuals",
     ylab = "Frequency",
     main = "Histogram of standardized residuals")

Since all the values are between -3 and +3, the thresholds commonly used to identify outliers, we have no problem with outliers.

mydata$CooksD <- round(cooks.distance(fit2), 3)

hist(mydata$CooksD,
     xlab = "Cooks distance",
     ylab = "Frequency",
     main = "Histogram of Cooks distances")

We can see that at least 1 unit has a value that is to some extent greater than the values of other units. This can be seen by a gap between 0.15 and 0.30 in the Cook’s distance histogram.

head(mydata[order(-mydata$CooksD), c("CooksD")], 10)

## # A tibble: 10 × 1
##    CooksD
##     <dbl>
##  1  0.32 
##  2  0.104
##  3  0.069
##  4  0.066
##  5  0.061
##  6  0.038
##  7  0.037
##  8  0.034
##  9  0.032
## 10  0.03

We print the 10 units with the highest Cook's distances. The most influential unit has a Cook's distance of 0.320. In addition, the four units with Cook's distances above 0.040 (0.104, 0.069, 0.066 and 0.061) stand out slightly from the rest, so we decide to remove them as well.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following object is masked from 'package:car':
## 
##     recode

## The following objects are masked from 'package:pastecs':
## 
##     first, last

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

mydata <- mydata %>%
  filter(!CooksD %in% c(0.320, 0.104, 0.069, 0.066, 0.061))
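Filtering on exact rounded values works here but is easy to break if the rounding changes. A possible alternative is to compare against a cutoff. The sketch below uses base R on simulated data with one planted influential unit; the 4/n cutoff is a common rule of thumb and an assumption here, not the visual gap criterion used in this homework.

```r
# Sketch: remove influential units with a threshold instead of exact matching.
set.seed(3)
d <- data.frame(Age = runif(85, 1, 45), Distance = runif(85, 1, 45))
d$Price <- 2500 - 8 * d$Age - 21 * d$Distance + rnorm(85, sd = 280)
d$Price[1] <- 4500                  # planted influential unit (hypothetical)

fit <- lm(Price ~ Age + Distance, data = d)
d$CooksD <- cooks.distance(fit)

cutoff <- 4 / nrow(d)               # rule-of-thumb cutoff (assumption)
d_clean <- d[d$CooksD <= cutoff, ]  # keep only units at or below the cutoff
nrow(d) - nrow(d_clean)             # number of removed units
```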

9. Check for potential heteroskedasticity with scatterplot between standardized residuals and standardized fitted values. Explain the findings. (1 p)

fit2.1 <- lm(Price ~ Age + Distance, data = mydata)

mydata$StdResid <- round(rstandard(fit2.1), 3)
mydata$StdFittedValues <- scale(fit2.1$fitted.values)

library(car)
scatterplot(y = mydata$StdResid, x = mydata$StdFittedValues,
            ylab = "Standardized residuals",
            xlab = "Standardized fitted values",
            boxplots = FALSE,
            regLine = FALSE,
            smooth = FALSE)

For the homoskedasticity assumption to hold, the points in the scatterplot of the standardized residuals against the standardized fitted values should be randomly distributed in a horizontal band of constant variability. A funnel shape would indicate heteroskedasticity, and curvature would indicate nonlinearity.
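The visual check can be complemented with a formal test. The sketch below computes a studentized Breusch-Pagan (Koenker) statistic by hand in base R; this test is an addition for illustration, not something used in the original analysis, and the data are simulated.

```r
# Sketch: studentized Breusch-Pagan test computed manually.
set.seed(11)
n <- 80
age <- runif(n, 1, 45)
distance <- runif(n, 1, 45)
price <- 2500 - 8 * age - 24 * distance + rnorm(n, sd = 250)

fit <- lm(price ~ age + distance)

# Auxiliary regression: squared residuals on the explanatory variables.
aux <- lm(residuals(fit)^2 ~ age + distance)

lm_stat <- n * summary(aux)$r.squared                   # LM statistic = n * R^2
p_value <- pchisq(lm_stat, df = 2, lower.tail = FALSE)  # df = number of predictors
c(statistic = lm_stat, p.value = p_value)
```

A small p-value would suggest that the residual variance depends on the explanatory variables, that is, heteroskedasticity.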

10. Are standardized residuals distributed normally? Show the graph and formally test it. Explain the findings. (1 p)

hist(mydata$StdResid,
     xlab = "Standardized residuals",
     ylab = "Frequency",
     main = "Histogram of standardized residuals")

shapiro.test(mydata$StdResid)

## 
##  Shapiro-Wilk normality test
## 
## data:  mydata$StdResid
## W = 0.94154, p-value = 0.001166

When we formally test the distribution with the Shapiro-Wilk normality test, the p-value (0.001) is lower than 0.05. Thus, the null hypothesis that the standardized residuals are normally distributed is rejected, which means that the errors in the population are most likely not normally distributed.

11. Estimate the fit2 again without potentially excluded units and show the summary of the model. Explain all coefficients. (2 p)

summary(fit2.1)

## 
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -411.50 -203.69  -45.24  191.11  492.56 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2502.467     75.024  33.356  < 2e-16 ***
## Age           -8.674      3.221  -2.693  0.00869 ** 
## Distance     -24.063      2.692  -8.939 1.57e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 256.8 on 77 degrees of freedom
## Multiple R-squared:  0.5361, Adjusted R-squared:  0.524 
## F-statistic: 44.49 on 2 and 77 DF,  p-value: 1.437e-13

The coefficient of determination is 0.5361. This coefficient indicates the proportion of the total variability of the dependent variable that can be explained by the linear effect of all explanatory variables: 53.61% of the variability of the price is explained by the linear effect of age and distance.
For each additional year in the age of an apartment, the price decreases on average by 8.674 EUR (p = 0.009), assuming that the other explanatory variables remain unchanged.
For each additional kilometer from the city center, the price decreases on average by 24.063 EUR (p < 0.001), assuming that the other explanatory variables remain unchanged.

sqrt(summary(fit2.1)$r.squared)

## [1] 0.732187

The multiple correlation coefficient is obtained by calculating the square root of the multiple coefficient of determination. Its value is 0.732, so the linear relationship between Price and the two explanatory variables, Age and Distance, is strong.

12. Estimate the linear regression function Price = f(Age, Distance, Parking and Balcony). Be careful to correctly include categorical variables. Save the object named fit3.

fit3 <- lm(Price ~ Age + Distance + ParkingF + BalconyF, data = mydata)

13. With function anova check if model fit3 fits data better than model fit2. (1 p)

anova(fit2.1, fit3)

## Analysis of Variance Table
## 
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + ParkingF + BalconyF
##   Res.Df     RSS Df Sum of Sq      F Pr(>F)
## 1     77 5077362                           
## 2     75 4791128  2    286234 2.2403 0.1135

The results show that the fit3 model does not fit the data significantly better than the fit2 model (p = 0.1135), so the simpler fit2 model is preferred.
- With a p-value of 0.1135, fit3 does not provide a significant improvement in fit compared to fit2 at the conventional significance level of 0.05.
- This suggests that including ParkingF and BalconyF in the model may not significantly improve the model’s ability to explain Price beyond Age and Distance alone.
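The F statistic in the ANOVA table can be reproduced from the residual sums of squares and degrees of freedom shown in the output. A minimal sketch using only those reported numbers:

```r
# Sketch: F test for nested models from the reported RSS and residual df.
rss_small <- 5077362; df_small <- 77  # fit2.1: Price ~ Age + Distance
rss_big   <- 4791128; df_big   <- 75  # fit3: adds ParkingF and BalconyF

df_diff <- df_small - df_big          # number of added coefficients
f_stat <- ((rss_small - rss_big) / df_diff) / (rss_big / df_big)
p_value <- pf(f_stat, df_diff, df_big, lower.tail = FALSE)
round(c(F = f_stat, p = p_value), 4)
```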

14. Show the results of fit3 and explain regression coefficient for both categorical variables. Can you write down the hypothesis which is being tested with F-statistics, shown at the bottom of the output? (2 p)

summary(fit3)

## 
## Call:
## lm(formula = Price ~ Age + Distance + ParkingF + BalconyF, data = mydata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -390.93 -198.19  -53.64  186.73  518.34 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2393.316     93.930  25.480  < 2e-16 ***
## Age           -7.970      3.191  -2.498   0.0147 *  
## Distance     -21.961      2.830  -7.762 3.39e-11 ***
## ParkingFYes  128.700     60.801   2.117   0.0376 *  
## BalconyFYes    6.032     57.307   0.105   0.9165    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 252.7 on 75 degrees of freedom
## Multiple R-squared:  0.5623, Adjusted R-squared:  0.5389 
## F-statistic: 24.08 on 4 and 75 DF,  p-value: 7.764e-13

We can see that the multiple coefficient of determination in the fit3 model has increased (from 0.5361 to 0.5623).
For each additional year in the age of an apartment, the price decreases on average by 7.970 EUR (p-value = 0.0147), assuming that the other explanatory variables remain unchanged.
For each additional kilometer from the city center, the price decreases on average by 21.961 EUR (p <0.001), assuming that the other explanatory variables remain unchanged.
Apartments with parking have on average a price that is 128.700 EUR higher than apartments without parking (p = 0.038), assuming that the other explanatory variables remain unchanged.
Apartments with a balcony have on average a price that is 6.032 EUR higher than apartments without a balcony, but this difference is not statistically significant (p = 0.917), holding all other variables constant.
Hypothesis tested with F-statistics:
- H0: all coefficients (Age, Distance, ParkingFYes, BalconyFYes) in the model are equal to zero, meaning that the independent variables do not have any effect on the dependent variable.
- H1: at least one coefficient is not equal to zero, indicating that at least one independent variable has a significant effect on the dependent variable.
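The overall F statistic at the bottom of the output is a function of R-squared, the number of explanatory variables and the sample size. The sketch below recomputes it from the rounded values in the summary, so small rounding differences from the printed 24.08 are expected.

```r
# Sketch: overall F statistic from R-squared.
r_squared <- 0.5623  # multiple R-squared of fit3 (rounded)
k <- 4               # number of slope coefficients
n <- 80              # 75 residual df + 5 estimated coefficients

f_stat <- (r_squared / k) / ((1 - r_squared) / (n - k - 1))
p_value <- pf(f_stat, k, n - k - 1, lower.tail = FALSE)
round(f_stat, 2)
```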

15. Save fitted values and calculate the residual for apartment ID2. (1 p)

mydata$StdFittedValues3 <- scale(fit3$fitted.values)

index_ID2 <- which(mydata$Age == 18 & mydata$Distance == 1 & mydata$Price == 2800 & mydata$Parking == 1 & mydata$Balcony == 0)

# Extract the data for apartment ID2
data_ID2 <- mydata[index_ID2, ]
fitted_value_ID2 <- predict(fit3, newdata = data_ID2)
residual_ID2 <- data_ID2$Price - fitted_value_ID2

print(paste("Fitted value for apartment ID2:", fitted_value_ID2))

## [1] "Fitted value for apartment ID2: 2356.59743503779"

print(paste("Residual for apartment ID2:", residual_ID2))

## [1] "Residual for apartment ID2: 443.402564962207"
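The fitted value and residual for apartment ID2 can also be checked by hand from the coefficients printed in the summary of fit3. Because the coefficients are rounded to three decimals, the result matches the output above only up to rounding.

```r
# Sketch: fitted value and residual for ID2 from the rounded coefficients.
b <- c(intercept = 2393.316, age = -7.970, distance = -21.961,
       parking_yes = 128.700, balcony_yes = 6.032)

# ID2: Age = 18, Distance = 1, Parking = Yes, Balcony = No
x_id2 <- c(1, 18, 1, 1, 0)

fitted_id2 <- sum(b * x_id2)        # predicted price
residual_id2 <- 2800 - fitted_id2   # actual price minus predicted price
round(c(fitted = fitted_id2, residual = residual_id2), 2)
```

The positive residual means that the actual price of ID2 is higher than the price predicted by the model.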

2024-05-12