library(readxl)
apartments <- read_excel("Apartments.xlsx")
head(apartments)
## # A tibble: 6 × 5
##     Age Distance Price Parking Balcony
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl>
## 1     7       28  1640       0       1
## 2    18        1  2800       1       0
## 3     7       28  1660       0       0
## 4    28       29  1850       0       1
## 5    18       18  1640       1       1
## 6    28       12  1770       0       1
Description:
What is the relationship between the age of an apartment and its price?
apartments$Balcony <- as.numeric(apartments$Balcony)
apartments$Parking <- as.numeric(apartments$Parking)
apartments$Balcony <- factor(apartments$Balcony,
levels=c(0,1),
labels = c("No", "Yes"))
apartments$Parking <- factor(apartments$Parking,
levels=c(0,1),
labels = c("No", "Yes"))
str(apartments)
## tibble [85 × 5] (S3: tbl_df/tbl/data.frame)
## $ Age : num [1:85] 7 18 7 28 18 28 14 18 22 25 ...
## $ Distance: num [1:85] 28 1 28 29 18 12 20 6 7 2 ...
## $ Price : num [1:85] 1640 2800 1660 1850 1640 1770 1850 1970 2270 2570 ...
## $ Parking : Factor w/ 2 levels "No","Yes": 1 2 1 1 2 1 1 2 2 2 ...
## $ Balcony : Factor w/ 2 levels "No","Yes": 2 1 1 2 2 2 2 2 1 1 ...
t_test <- t.test(apartments$Price, mu = 1900)
print(t_test)
##
## One Sample t-test
##
## data: apartments$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
## 1937.443 2100.440
## sample estimates:
## mean of x
## 2018.941
Based on the two-sided one-sample t-test, the following can be concluded:
The test statistic is t = 2.9022 with 84 degrees of freedom, resulting in a p-value of 0.004731, which is less than the conventional significance level of 0.05.
The sample mean price is 2018.941 EUR, which is higher than the hypothesized value of 1900 EUR.
The 95% confidence interval for the true mean price is (1937.443, 2100.440) EUR, which does not include 1900 EUR.
This indicates that there is statistically significant evidence that the true mean apartment price differs from 1900 EUR.
Because the sample mean lies above the hypothesized value, the evidence points to a mean price higher than 1900 EUR.
In practical terms, apartments in the sample cost about 119 EUR more on average than the hypothesized value of 1900 EUR.
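As a rough check on the output above, the test statistic can be recovered from the reported mean and confidence interval alone. The sketch below hard-codes the values printed by t.test(); the variable names are illustrative and not part of the original analysis.

```r
# Values copied from the t.test() output above
mean_price <- 2018.941
mu0 <- 1900
ci <- c(1937.443, 2100.440)
df <- 84

# Half-width of the 95% CI equals the critical t value times the standard error
se <- (diff(ci) / 2) / qt(0.975, df)

# t statistic and two-sided p-value
t_stat <- (mean_price - mu0) / se
p_value <- 2 * pt(-abs(t_stat), df)

round(c(se = se, t = t_stat, p = p_value), 4)
```

The recovered t statistic and p-value should agree with the printed output up to rounding, which also makes it clear that the reported p-value is two-sided.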
fit1 <- lm(Price ~ Age, data = apartments)
summary(fit1)
##
## Call:
## lm(formula = Price ~ Age, data = apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -623.9 -278.0 -69.8 243.5 776.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2185.455 87.043 25.108 <2e-16 ***
## Age -8.975 4.164 -2.156 0.034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared: 0.05302, Adjusted R-squared: 0.04161
## F-statistic: 4.647 on 1 and 83 DF, p-value: 0.03401
correlation <- cor(apartments$Age, apartments$Price)
print(paste("Correlation coefficient:", correlation))
## [1] "Correlation coefficient: -0.230255017823335"
Regression Coefficient:
The coefficient for Age is -8.975, which is statistically significant (p = 0.034 < 0.05).
This means that for each additional year of apartment age, the price decreases by approximately 8.98 EUR on average; that is, older apartments tend to be less expensive.
Coefficient of Correlation:
The correlation coefficient is -0.2303, which indicates a weak negative linear relationship.
The negative sign confirms the inverse relationship between Age and Price: as apartments get older, their prices tend to decrease.
Coefficient of Determination:
The Multiple R-squared value is 0.05302, meaning that only around 5.3% of the variability in apartment prices can be explained by differences in age.
This is a low value, suggesting that while age has a statistically significant effect on price, it’s not a strong predictor on its own.
The adjusted R-squared (0.04161) is slightly lower because it applies a penalty for the number of explanatory variables in the model.
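In a regression with a single predictor, the three quantities discussed above are tied together: R-squared is the square of the correlation coefficient, and the slope's t value follows from the same number. The sketch below uses only the correlation and sample size reported above; the variable names are illustrative.

```r
# Values copied from the output above
r <- -0.230255017823335  # correlation between Age and Price
n <- 85                  # number of apartments
k <- 1                   # number of predictors in fit1

# R-squared is the squared correlation in simple regression
r_squared <- r^2

# Adjusted R-squared penalizes for the number of predictors
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

# t value of the slope expressed through the correlation
t_slope <- r * sqrt(n - 2) / sqrt(1 - r_squared)

round(c(r_squared = r_squared, adj = adj_r_squared, t = t_slope), 4)
```

These should reproduce the Multiple R-squared, Adjusted R-squared, and t value for Age in the summary of fit1, up to rounding.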
#install.packages("GGally")
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
# Create scatterplot matrix for Price, Age, and Distance
pairs(apartments[, c("Price", "Age", "Distance")],
main = "Scatterplot Matrix of Price, Age, and Distance")
ggpairs(apartments, columns = c("Price", "Age", "Distance"))
fit2 <- lm(Price ~ Age + Distance, data = apartments)
library(car)
## Loading required package: carData
vif(fit2)
## Age Distance
## 1.001845 1.001845
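With exactly two predictors, the VIF is 1 / (1 - r^2), where r is the correlation between the two predictors, which is why both values above are identical. A short sketch, using only the reported VIF, backs out how weakly Age and Distance are correlated (variable names are illustrative):

```r
# VIF copied from the output above (same for both predictors)
vif_value <- 1.001845

# Invert VIF = 1 / (1 - r^2) to get the squared correlation between predictors
r_squared_predictors <- 1 - 1 / vif_value

# Absolute correlation between Age and Distance (the sign is not recoverable)
abs_r <- sqrt(r_squared_predictors)

round(c(r_squared = r_squared_predictors, abs_r = abs_r), 4)
```

A value this close to zero suggests that multicollinearity is not a concern in fit2.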
# Calculate standardized residuals
standardized_residuals <- rstandard(fit2)
# Calculate Cook's distances
cooks_distances <- cooks.distance(fit2)
# Cook's distance: Generally, values above 0.5 or 1 indicate high influence (threshold = 1)
# Standardized residuals: Values exceeding ±2 or ±3 are considered outliers (threshold = 2)
# Create a logical vector to identify problematic units
problematic_units <- abs(standardized_residuals) > 2 | cooks_distances > 1
# Remove problematic units from the dataset
cleaned_data <- apartments[!problematic_units, ]
# Fit the model again with the cleaned dataset
fit2_cleaned <- lm(Price ~ Age + Distance, data = cleaned_data)
# View the summary of the new regression model
summary(fit2_cleaned)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = cleaned_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -404.0 -230.9 -51.4 190.6 504.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2455.768 73.296 33.505 < 2e-16 ***
## Age -6.011 3.086 -1.948 0.055 .
## Distance -23.543 2.665 -8.834 2.05e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 262.6 on 79 degrees of freedom
## Multiple R-squared: 0.5179, Adjusted R-squared: 0.5057
## F-statistic: 42.44 on 2 and 79 DF, p-value: 3.042e-13
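Before dropping observations, it can help to list which units were flagged and by which criterion. The sketch below is only an illustration: it uses R's built-in mtcars data as a stand-in because Apartments.xlsx is not reproduced here, and the names demo_fit, diagnostics, and demo_cleaned are hypothetical. In the actual analysis the same steps would be applied to fit2 and apartments.

```r
# Stand-in model on built-in data (assumption: replace with fit2 / apartments)
demo_fit <- lm(mpg ~ wt + hp, data = mtcars)

# Collect both diagnostics side by side, one row per unit
diagnostics <- data.frame(
  unit = rownames(mtcars),
  std_resid = rstandard(demo_fit),
  cooks_d = cooks.distance(demo_fit)
)

# Same thresholds as above: |standardized residual| > 2 or Cook's distance > 1
flagged <- abs(diagnostics$std_resid) > 2 | diagnostics$cooks_d > 1

# Inspect the flagged units before removing them
diagnostics[flagged, ]

# Remove them and report how many units were dropped
demo_cleaned <- mtcars[!flagged, ]
nrow(mtcars) - nrow(demo_cleaned)
```

Printing the flagged rows documents the removal, which matters here because the cleaned model has 79 residual degrees of freedom, so three apartments were dropped.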
# Calculate standardized residuals
standardized_residuals <- rstandard(fit2)
# Calculate standardized fitted values (fitted values scaled to mean 0, sd 1)
standardized_fitted_values <- as.numeric(scale(fitted(fit2)))
# Create scatterplot
plot(standardized_fitted_values, standardized_residuals,
xlab = "Standardized Fitted Values",
ylab = "Standardized Residuals",
main = "Scatterplot to Check for Heteroskedasticity")
abline(h = 0, col = "plum", lty = 2) # Adds horizontal line at residual = 0
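The scatterplot is a visual check only. A formal complement, which is not part of the original analysis, is the studentized (Koenker) form of the Breusch-Pagan test: regress the squared residuals on the predictors and compare n times the R-squared of that auxiliary regression with a chi-squared distribution. The sketch below computes it by hand in base R on the built-in mtcars data as a stand-in; all object names are hypothetical.

```r
# Stand-in model on built-in data (assumption: replace with fit2 / apartments)
demo_fit <- lm(mpg ~ wt + hp, data = mtcars)

# Auxiliary regression: squared residuals on the same predictors
aux_data <- data.frame(
  sq_resid = residuals(demo_fit)^2,
  wt = mtcars$wt,
  hp = mtcars$hp
)
aux_fit <- lm(sq_resid ~ wt + hp, data = aux_data)

# LM statistic = n * R^2 of the auxiliary regression,
# chi-squared with df equal to the number of predictors (2 here)
lm_stat <- nrow(aux_data) * summary(aux_fit)$r.squared
bp_p_value <- pchisq(lm_stat, df = 2, lower.tail = FALSE)

round(c(LM = lm_stat, p = bp_p_value), 4)
```

A small p-value would point to heteroskedasticity; a large one is consistent with constant residual variance.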
# Get standardized residuals
std_residuals <- rstandard(fit1)
# Visual check using histogram
hist(std_residuals,
breaks = 20,
main = "Histogram of Standardized Residuals",
xlab = "Standardized Residuals",
freq = FALSE)
curve(dnorm(x), add = TRUE, col = "plum", lwd = 2)
# Q-Q plot
qqnorm(std_residuals)
qqline(std_residuals, col = "hotpink")
# Formal test for normality using Shapiro-Wilk test
shapiro_test <- shapiro.test(std_residuals)
print(shapiro_test)
##
## Shapiro-Wilk normality test
##
## data: std_residuals
## W = 0.94844, p-value = 0.001935
Visual Results:
Histogram of Standardized Residuals:
The histogram shows some departure from normality
Some irregularities in the distribution with multiple peaks
Q-Q Plot:
Points should follow the diagonal reference line if the residuals are perfectly normally distributed
There are noticeable deviations at both tails (below -1.5 and above +1.5)
The pattern shows a slight S-shape, with points deviating from the line at both extremes
This suggests heavier tails than would be expected in a normal distribution
Formal Test (Shapiro-Wilk):
W = 0.94844, p-value = 0.001935
Since the p-value (0.001935) is less than 0.05, we reject the null hypothesis of normality
This confirms that the residuals are not normally distributed
# Re-estimate the simple model (Price ~ Age) with the cleaned dataset
fit1_cleaned <- lm(Price ~ Age, data = cleaned_data)
# Show summary of the new model
summary(fit1_cleaned)
##
## Call:
## lm(formula = Price ~ Age, data = cleaned_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -619.26 -275.55 -66.69 244.81 780.74
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2164.000 91.680 23.604 <2e-16 ***
## Age -8.041 4.312 -1.865 0.0659 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 367.9 on 80 degrees of freedom
## Multiple R-squared: 0.04166, Adjusted R-squared: 0.02968
## F-statistic: 3.478 on 1 and 80 DF, p-value: 0.06586
Intercept (2164.000):
This is the estimated value of the apartment’s price when the age of the apartment is zero (Age = 0).
The t value (23.604) and Pr(>|t|) (< 2e-16) indicate that this intercept is statistically significant (p < 0.001). This means that the intercept is reliably different from zero.
Age (-8.041):
This is the estimated change in the apartment’s price for each one-year increase in the apartment’s age.
The value is -8.041 EUR.
The t value (-1.865) and Pr(>|t|) (0.0659) indicate that, at the 0.05 significance level, this is not statistically significant. However, it is significant at the 0.1 level. It is showing a trend, but with less certainty.
Residual standard error (367.9):
This value represents the typical amount, in EUR, by which the observed apartment prices deviate from the prices predicted by the model. The closer it is to 0, the more accurate the model’s predictions are.
Multiple R-squared (0.04166):
This value indicates the proportion of the variance in apartment prices that is explained by the age of the apartment.
Only about 4.17% of the variability in price is explained by age.
The adjusted R-squared, which adjusts for the number of explanatory variables in the model, is even lower at 0.02968.
F-statistic (3.478) and p-value (0.06586):
The F-statistic tests the overall significance of the regression model.
The associated p-value (0.06586) indicates that, at the 0.05 level, the overall model is not statistically significant.
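With a single predictor, the overall F test and the t test for the slope are the same test, which is why both give p = 0.066. The sketch below checks this from the numbers in the summary output and also recomputes both R-squared values from the F-statistic; the variable names are illustrative.

```r
# Values copied from the summary output above
t_age <- -1.865
f_stat <- 3.478
df_resid <- 80
n <- df_resid + 2  # one slope plus the intercept

# In simple regression the F-statistic equals the squared t value
t_age^2

# R-squared recovered from F with 1 numerator df
r_squared <- f_stat / (f_stat + df_resid)

# Adjusted R-squared for one predictor
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / df_resid

round(c(r_squared = r_squared, adj = adj_r_squared), 5)
```

Both values should agree with the Multiple R-squared and Adjusted R-squared lines of the summary output, up to rounding.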
fit3 <- lm(Price ~ Age + Distance + Parking + Balcony, data = apartments)
anova_result <- anova(fit2, fit3)
print(anova_result)
## Analysis of Variance Table
##
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Parking + Balcony
##   Res.Df     RSS Df Sum of Sq      F  Pr(>F)
## 1     82 6720983
## 2     80 5991088  2    729894 4.8732 0.01007 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual Degrees of Freedom:
Model 1 has 82 degrees of freedom, and Model 2 has 80.
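Residual Sum of Squares (RSS):
The RSS falls from 6720983 in Model 1 to 5991088 in Model 2, a reduction of 729894.
F-statistic:
The F-statistic is 4.8732 on 2 and 80 degrees of freedom. It tests whether the two added variables jointly reduce the RSS by more than chance alone would explain.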
P-value:
The p-value (0.01007) is less than 0.05, which means that the improvement in the model’s fit when adding “Parking” and “Balcony” is statistically significant.
Therefore, Model 2 (Price ~ Age + Distance + Parking + Balcony) fits the data significantly better than Model 1 (Price ~ Age + Distance).
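The F-statistic in the table can be reproduced by hand from the two residual sums of squares and their degrees of freedom. A minimal sketch using only the numbers printed above (variable names are illustrative):

```r
# Values copied from the ANOVA table above
rss1 <- 6720983; df1 <- 82  # Model 1: Price ~ Age + Distance
rss2 <- 5991088; df2 <- 80  # Model 2: adds Parking and Balcony

# Drop in RSS per added parameter, relative to Model 2's residual variance
f_stat <- ((rss1 - rss2) / (df1 - df2)) / (rss2 / df2)

# Upper-tail probability of the F distribution with 2 and 80 df
p_value <- pf(f_stat, df1 - df2, df2, lower.tail = FALSE)

round(c(F = f_stat, p = p_value), 5)
```

The result should match the table's F and Pr(>F) up to rounding of the printed RSS values.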
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -459.92 -200.66 -57.48 260.08 594.37
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2301.667 94.271 24.415 < 2e-16 ***
## Age -6.799 3.110 -2.186 0.03172 *
## Distance -18.045 2.758 -6.543 5.28e-09 ***
## ParkingYes 196.168 62.868 3.120 0.00251 **
## BalconyYes 1.935 60.014 0.032 0.97436
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 273.7 on 80 degrees of freedom
## Multiple R-squared: 0.5004, Adjusted R-squared: 0.4754
## F-statistic: 20.03 on 4 and 80 DF, p-value: 1.849e-11
Explanation of Coefficients:
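Intercept (2301.667):
The estimated price of an apartment with Age = 0 and Distance = 0, without parking and without a balcony.
Age (-6.799):
Holding the other variables constant, each additional year of age is associated with a price that is about 6.80 EUR lower on average (p = 0.032).
Distance (-18.045):
Holding the other variables constant, each additional unit of Distance is associated with a price that is about 18.05 EUR lower on average (p < 0.001).
ParkingYes (196.168):
Holding the other variables constant, apartments with parking cost about 196.17 EUR more on average than apartments without parking (p = 0.003).
BalconyYes (1.935):
Holding the other variables constant, apartments with a balcony cost about 1.94 EUR more on average than apartments without one, but this difference is not statistically significant (p = 0.974).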
Hypothesis Tested by the F-statistic:
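The F-statistic at the bottom of the output (20.03 on 4 and 80 degrees of freedom) tests the overall significance of the regression model.
Null Hypothesis (H0):
All slope coefficients (Age, Distance, ParkingYes, BalconyYes) are equal to zero.
Alternative Hypothesis (H1):
At least one slope coefficient is different from zero.
P-value (1.849e-11):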
The p-value associated with the F-statistic is extremely small, much less than 0.05.
This indicates strong evidence against the null hypothesis.
Therefore, we reject the null hypothesis and conclude that at least one of the predictors is significantly related to Price.
In short, the F test tells us whether the model, as a whole, is statistically significant.
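The overall F-statistic is a function of R-squared, the number of predictors, and the residual degrees of freedom. The sketch below recomputes it from the values in the summary output of fit3, where the F-statistic is reported as 20.03 on 4 and 80 DF; the variable names are illustrative.

```r
# Values copied from the summary output of fit3
r_squared <- 0.5004
k <- 4           # number of slope coefficients
df_resid <- 80   # residual degrees of freedom

# Explained variance per predictor relative to unexplained variance per residual df
f_stat <- (r_squared / k) / ((1 - r_squared) / df_resid)

# Upper-tail probability under H0 that all slopes are zero
p_value <- pf(f_stat, k, df_resid, lower.tail = FALSE)

c(F = f_stat, p = p_value)
```

Because R-squared is rounded to four decimals in the output, the recomputed F agrees with the printed value only approximately.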
fitted_values <- fitted(fit3)
residual_id2 <- apartments$Price[2] - fitted_values[2]
print(paste("Residual for Apartment ID 2:", residual_id2))
## [1] "Residual for Apartment ID 2: 442.58892757391"
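The residual is the observed price minus the fitted price, and the fitted price can be rebuilt by hand from the fit3 coefficients and the values of apartment 2 shown at the top (Age = 18, Distance = 1, Parking = Yes, Balcony = No, Price = 2800). The sketch below uses the rounded coefficients from the summary output, so it matches the exact residual only up to rounding; the variable names are illustrative.

```r
# Rounded coefficients copied from summary(fit3)
coefs <- c(intercept = 2301.667, age = -6.799, distance = -18.045,
           parking_yes = 196.168, balcony_yes = 1.935)

# Apartment 2: intercept term, Age = 18, Distance = 1, Parking = Yes, Balcony = No
apt2 <- c(1, 18, 1, 1, 0)

# Fitted price and residual (observed minus fitted)
fitted_2 <- sum(coefs * apt2)
residual_2 <- 2800 - fitted_2

round(c(fitted = fitted_2, residual = residual_2), 3)
```

The positive residual means the model underestimates the price of this apartment. The same number is also available directly as residuals(fit3)[2].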