mydata <- read.table("./Cost_of_Living_Index_Europe.csv",
                     header = TRUE,
                     sep = ",",
                     dec = ".",
                     check.names = FALSE)
mydata <- mydata[ , c("Rank",
                      "Country",
                      "Cost of Living Index",
                      "Rent Index",
                      "Groceries Index",
                      "Local Purchasing Power Index")]
knitr::kable(head(mydata, 10), caption = "Cost of living in Europe")
| Rank | Country | Cost of Living Index | Rent Index | Groceries Index | Local Purchasing Power Index |
|---|---|---|---|---|---|
| 2 | Albania | 35.50 | 8.47 | 29.32 | 30.19 |
| 5 | Armenia | 33.89 | 11.61 | 27.59 | 28.86 |
| 7 | Austria | 71.04 | 27.13 | 65.88 | 77.25 |
| 13 | Belarus | 30.89 | 9.81 | 27.24 | 31.78 |
| 14 | Belgium | 72.61 | 25.79 | 63.32 | 79.72 |
| 18 | Bosnia And Herzegovina | 36.12 | 6.82 | 31.14 | 44.10 |
| 21 | Bulgaria | 38.38 | 9.07 | 34.02 | 45.96 |
| 29 | Croatia | 48.94 | 13.38 | 41.98 | 47.55 |
| 31 | Cyprus | 59.03 | 22.36 | 49.20 | 57.31 |
| 32 | Czech Republic | 48.24 | 18.40 | 42.76 | 66.47 |
For my analysis, I chose a dataset containing information about different indexes related to the cost of living. The original dataset consisted of all the countries in the world, but I selected only the European countries for my analysis. The original dataset included six indexes, and I focused on four of them.
The ranking may appear unusual because it is still calculated in relation to all the countries in the world.
The dataset consists of 44 units (countries) and 6 variables, of which 1 is categorical (Country) and 5 are numerical (Rank, Cost of Living Index, Rent Index, Groceries Index and Local Purchasing Power Index).
#install.packages("tidyr")
library("tidyr")
mydata <- drop_na(mydata)
I used drop_na() to remove any rows with missing values. As it turns out, there were none.
mydata$`Ratio COL/Rent` <- round(
  mydata$`Cost of Living Index` / mydata$`Rent Index`, 2)
I introduced a new variable, the ratio, which compares the overall Cost of Living Index to the Rent Index. The ratio is rounded to two decimal places.
The higher the value of the ratio, the cheaper the rent is compared to the overall cost of living (rent takes up a smaller share of expenses). If the ratio has a lower value, the rent is relatively expensive compared to the overall cost of living (housing is a heavier burden).
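As a quick sanity check on this definition, the ratio can be reproduced by hand from the table values shown above, for example for Bosnia and Herzegovina:

round(36.12 / 6.82, 2)  # Cost of Living Index / Rent Index for Bosnia and Herzegovina -> 5.3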
mydata$Country <- as.character(mydata$Country)
mydata$Country[mydata$Country == "Bosnia And Herzegovina"] <- "BIH"
mydata <- mydata[order(-mydata$`Ratio COL/Rent`), ]
mydata$Country <- factor(mydata$Country, levels = mydata$Country)
With this chunk I renamed Bosnia and Herzegovina to BIH so the label fits better in the graph, sorted the countries by the ratio in descending order, and fixed the factor levels of Country so that ggplot keeps this order on the x-axis.
#install.packages("ggplot2")
library(ggplot2)
ggplot(mydata, aes(x = Country, y = `Ratio COL/Rent`)) +
  geom_col(fill = "pink") +
  xlab("Country") +
  ylab("Cost of Living / Rent (ratio)") +
  ggtitle("Cost of Living to Rent Ratio — Europe") +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 60, hjust = 1, size = 7))
This graph shows the ranking of European countries based on the ratio, calculated as the Cost of Living Index divided by the Rent Index, ordered from the highest to the lowest ratio.
The country with the highest ratio is BIH (Bosnia and Herzegovina), meaning that rent makes up only a small share of the overall cost of living compared to other countries. At the other end, Luxembourg has the lowest ratio, where rent represents a larger proportion of living costs, making housing comparatively more expensive.
summary(mydata[ , c("Cost of Living Index", "Rent Index", "Groceries Index",
                    "Local Purchasing Power Index", "Ratio COL/Rent")])
## Cost of Living Index Rent Index Groceries Index Local Purchasing Power Index Ratio COL/Rent
## Min. : 27.05 Min. : 6.06 Min. : 22.64 Min. : 28.86 Min. :1.340
## 1st Qu.: 35.97 1st Qu.:10.47 1st Qu.: 30.82 1st Qu.: 39.26 1st Qu.:2.600
## Median : 48.70 Median :16.24 Median : 42.13 Median : 56.01 Median :2.915
## Mean : 54.69 Mean :20.31 Mean : 47.83 Mean : 60.38 Mean :3.078
## 3rd Qu.: 71.22 3rd Qu.:26.25 3rd Qu.: 62.41 3rd Qu.: 80.36 3rd Qu.:3.425
## Max. :123.35 Max. :60.09 Max. :128.13 Max. :118.44 Max. :5.300
library(knitr)
num_vars <- c("Cost of Living Index",
              "Rent Index",
              "Groceries Index",
              "Local Purchasing Power Index")
stats_all <- sapply(mydata[, num_vars], function(x) {
  c(Mean = round(mean(x, na.rm = TRUE), 2),
    Median = round(median(x, na.rm = TRUE), 2),
    SD = round(sd(x, na.rm = TRUE), 2))})
stats_all <- t(stats_all)
stats_all <- as.data.frame(stats_all)
kable(stats_all, caption = "Descriptive Statistics for Selected Variables")
| | Mean | Median | SD |
|---|---|---|---|
| Cost of Living Index | 54.69 | 48.70 | 22.12 |
| Rent Index | 20.31 | 16.24 | 12.69 |
| Groceries Index | 47.83 | 42.13 | 22.29 |
| Local Purchasing Power Index | 60.38 | 56.01 | 24.06 |
The Cost of Living Index has a mean of 54.69 and a median of 48.70, suggesting a mild right-skew in the distribution.
The Rent Index has the lowest average (20.31) and also the smallest spread in absolute terms (SD = 12.69), which means that rent indices vary the least across countries compared to the other indices.
The Groceries Index has an average of 47.83, close to the Cost of Living Index, with a similar spread (SD = 22.29).
The Local Purchasing Power Index has the highest average (60.38) and a large spread (SD = 24.06), indicating that purchasing power varies greatly among countries.
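One way to put a number on these skewness impressions (a small optional check, assuming the e1071 package is installed) is a moment-based skewness statistic; positive values would confirm the right tails suggested by the mean/median gaps:

# Optional check: sample skewness of each index (positive = right-skewed)
library(e1071)
sapply(mydata[, num_vars], function(x) round(skewness(x, na.rm = TRUE), 2))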
library(ggplot2)
ggplot(mydata, aes(x = `Local Purchasing Power Index`, y = `Cost of Living Index`)) +
  geom_point(shape = 21, color = "darkolivegreen3", fill = NA, size = 2, stroke = 1) +
  scale_x_continuous(breaks = seq(0, max(mydata$`Local Purchasing Power Index`), by = 15),
                     position = "bottom") +
  geom_smooth(method = "lm", se = FALSE, color = "violetred3", linetype = "dashed") +
  xlab("Local Purchasing Power Index") +
  ylab("Cost of Living Index") +
  ggtitle("Link Between Purchasing Power and Cost of Living") +
  theme_bw()
The scatterplot shows the relationship between the Local Purchasing Power Index (x-axis) and the Cost of Living Index (y-axis), where each point represents one country.
The points form an upward trend, suggesting that higher purchasing power is associated with higher costs of living. Although people earn more, goods and services are also more expensive.
If the points were to form a downward trend, it would suggest that countries with higher local purchasing power tend to have a lower cost of living, meaning people in those countries can afford more with their income.
From this graph, we can conclude that purchasing power and the cost of living are positively related.
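To quantify the strength of this relationship, the Pearson correlation can be computed (its exact value is not reported above, so this is just a sketch of how to obtain it):

# Pearson correlation between purchasing power and cost of living
cor(mydata$`Local Purchasing Power Index`, mydata$`Cost of Living Index`,
    use = "complete.obs")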
library(ggplot2)
mydata$LPP_group <- cut(
  mydata$`Local Purchasing Power Index`,
  breaks = c(-Inf, 40, 70, Inf),
  labels = c("Low", "Medium", "High"))
ggplot(mydata, aes(x = LPP_group, y = `Rent Index`)) +
  geom_boxplot(fill = "steelblue4") +
  xlab("LPP group") +
  ylab("Rent Index") +
  scale_y_continuous(breaks = seq(0, max(mydata$`Rent Index`, na.rm = TRUE), by = 10)) +
  theme_light()
I then created a new categorical variable, LPP group, which divides countries into three groups based on their Local Purchasing Power Index: Low (values up to 40), Medium (values between 40 and 70), and High (values above 70).
In the next step, I plotted a boxplot of the Rent Index for each of these groups.
For each group (Low, Medium, High), the box shows the middle 50% of the Rent Index values. The line inside the box represents the median Rent Index for that group, while the “whiskers” show the range of most Rent Index values. Any dots outside the “whiskers” are outliers, meaning that rents in those countries are very different from the rest of their group.
I wanted to see whether there is any association between a country's purchasing power and its rent level. The median Rent Index rises from the Low to the High LPP group, which means that countries with higher purchasing power tend to have higher rents (the group medians are checked in the sketch below).
On the other hand, the Low LPP group also has the lowest median Rent Index, which means that rents are cheaper where people have less purchasing power.
Because the spread of each box (its height) reflects the variability of rents within that group, we can conclude that in countries with higher purchasing power (which also have higher rents), rents vary more than in those with lower rents.
Even though there are some outliers which highlight “exceptional” countries, they still follow the overall trend of the LPP groups.
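The group medians behind this reading can be extracted directly:

# Median Rent Index within each LPP group, to back the boxplot reading
aggregate(`Rent Index` ~ LPP_group, data = mydata, FUN = median)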
library(readxl)
library(knitr)
mydata <- read_xlsx("~/Documents/Magisterij /IMB /Bootcamp /Statistika/Take_home_exam/R Take Home Exam 2025-2/Task 2/Business School.xlsx")
mydata <- mydata[, c("Undergrad Degree", "Annual Salary", "MBA Grade")]
kable(head(mydata, 10), caption = "Business School data (first 10 rows)")
| Undergrad Degree | Annual Salary | MBA Grade |
|---|---|---|
| Business | 111000 | 90.2 |
| Computer Science | 107000 | 68.7 |
| Finance | 109000 | 83.3 |
| Business | 148000 | 88.7 |
| Finance | 255500 | 75.4 |
| Computer Science | 103500 | 82.1 |
| Engineering | 114500 | 66.9 |
| Engineering | 124000 | 76.8 |
| Business | 132500 | 72.3 |
| Finance | 99000 | 72.4 |
I imported the data set and selected the three variables relevant for this task: Undergrad Degree, Annual Salary, and MBA Grade.
The table above displays the first 10 rows of the data set to provide an overview of the data that will be analyzed later.
The Undergrad Degree column identifies the students’ field of undergraduate study.
The Annual Salary column reports their annual income (in USD).
The MBA Grade column shows their performance in the MBA program.
library(ggplot2)
ggplot(mydata, aes(x = `Undergrad Degree`)) +
  geom_bar(fill = "magenta4", color = "black", width = 0.6) +
  xlab("Undergraduate Degree") +
  ylab("Count of Students") +
  ggtitle("Distribution of Undergraduate Degrees (MBA Cohort)") +
  scale_y_continuous(breaks = seq(0, 35, 5)) +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
The bar chart shows the distribution of undergraduate degrees among MBA students. Business is the most common background, followed by Computer Science and Finance, while Art and Engineering are less represented.
library(knitr)
deg_counts <- sort(table(mydata$`Undergrad Degree`), decreasing = TRUE)
kable(
  as.data.frame(deg_counts),
  col.names = c("Undergraduate Degree", "Number of Students"),
  caption = "Counts of Undergraduate Degrees (sorted)")
| Undergraduate Degree | Number of Students |
|---|---|
| Business | 35 |
| Computer Science | 25 |
| Finance | 25 |
| Engineering | 9 |
| Art | 6 |
The table shows the distribution of undergraduate degrees among the MBA students and confirms what the graph has shown. The most common background is Business (35 students), followed by Computer Science and Finance (25 each). Engineering (9) and Art (6) are less frequent, which indicates that students who pursue an MBA most commonly come from a business background.
salary <- mydata$`Annual Salary`
mean_sal <- round(mean(salary), 2)
median_sal <- round(median(salary), 2)
sd_sal <- round(sd(salary), 2)
min_sal <- round(min(salary), 2)
max_sal <- round(max(salary), 2)
q1_sal <- round(quantile(salary, 0.25), 2)
q3_sal <- round(quantile(salary, 0.75), 2)
kable(data.frame(
  Statistic = c("Mean", "Median", "SD", "Min", "Q1", "Q3", "Max"),
  Value = c(mean_sal, median_sal, sd_sal, min_sal, q1_sal, q3_sal, max_sal)),
  caption = "Annual Salary — Descriptive Statistics")
| Statistic | Value |
|---|---|
| Mean | 109058.00 |
| Median | 103500.00 |
| SD | 41501.49 |
| Min | 20000.00 |
| Q1 | 87125.00 |
| Q3 | 124000.00 |
| Max | 340000.00 |
The average annual salary in the dataset is 109,058, with a median of 103,500, suggesting that salaries are slightly right-skewed (since the mean is higher than the median).
The salaries range from a minimum of 20,000 to a maximum of 340,000, indicating a wide variation.
The interquartile range (from Q1 = 87,125 to Q3 = 124,000) shows that the middle 50% of students earn within this range, while the relatively large standard deviation (41,501) confirms that there is large variability in salaries among students.
ggplot(mydata, aes(x = `Annual Salary`)) +
  geom_histogram(binwidth = diff(range(salary))/12, fill = "cornflowerblue", color = "navyblue") +
  xlab("Annual Salary") + ylab("Frequency") +
  ggtitle("Distribution of Annual Salary") +
  theme_minimal()
The histogram shows the distribution of annual salaries among MBA students. Most students earn between 80,000 and 130,000, with a peak around 100,000, while a few students earn much higher salaries, creating a right-skewed distribution.
This suggests that although the majority of salaries are clustered in the lower range, there are outliers with significantly higher incomes.
On the x-axis, salaries are displayed in scientific notation, where 1e+05 represents 1×10^5 = 100,000, 2e+05 is 200,000, and so on.
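If plain numbers are preferred over scientific notation, the axis labels can be reformatted (a sketch, assuming the scales package is installed):

# Same histogram, with salaries labelled as 100,000 instead of 1e+05
ggplot(mydata, aes(x = `Annual Salary`)) +
  geom_histogram(binwidth = diff(range(salary))/12, fill = "cornflowerblue", color = "navyblue") +
  scale_x_continuous(labels = scales::comma) +
  xlab("Annual Salary") + ylab("Frequency") +
  ggtitle("Distribution of Annual Salary") +
  theme_minimal()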
library(broom)
library(knitr)
library(dplyr)
tt <- t.test(mydata$`MBA Grade`, mu = 74, alternative = "two.sided")
tt_tidy <- tidy(tt) %>%
  rename(
    Estimate = estimate,
    Statistic = statistic,
    `P-value` = p.value,
    `Degrees of Freedom` = parameter,
    `CI Lower` = conf.low,
    `CI Upper` = conf.high,
    Method = method,
    Alternative = alternative)
tt_tidy$Alternative <- "Two-sided"
kable(tt_tidy, caption = "T-test results for MBA Grade (H₀: μ = 74)")
| Estimate | Statistic | P-value | Degrees of Freedom | CI Lower | CI Upper | Method | Alternative |
|---|---|---|---|---|---|---|---|
| 76.04055 | 2.658658 | 0.0091499 | 99 | 74.51764 | 77.56346 | One Sample t-test | Two-sided |
The one-sample t-test shows that the average MBA grade in the sample (76.04) is significantly higher than the hypothesized value of 74 (t(99) = 2.659, p = 0.009).
The 95% confidence interval for the mean ranges from 74.52 to 77.56, which does not include 74, supporting the conclusion that the true mean differs from the hypothesized value.
grade <- mydata$`MBA Grade`
mean_grade <- round(mean(grade, na.rm = TRUE), 3)
sd_grade <- round(sd(grade, na.rm = TRUE), 3)
t_stat <- round(unname(tt$statistic), 3)
df_val <- unname(tt$parameter)
p_val <- round(tt$p.value, 3)
ci_low <- round(tt$conf.int[1], 3)
ci_high <- round(tt$conf.int[2], 3)
cohen_d <- round((mean(grade, na.rm = TRUE) - 74) / sd(grade, na.rm = TRUE), 3)
summary_table <- data.frame(
  Statistic = c("Mean (MBA Grade)", "SD (MBA Grade)", "t statistic", "df",
                "p-value", "95% CI Lower", "95% CI Upper", "Cohen's d"),
  Value = c(mean_grade, sd_grade, t_stat, df_val, p_val, ci_low, ci_high, cohen_d)
)
kable(summary_table,
      caption = "Summary statistics and effect size for MBA Grade (vs μ = 74)")
| Statistic | Value |
|---|---|
| Mean (MBA Grade) | 76.041 |
| SD (MBA Grade) | 7.675 |
| t statistic | 2.659 |
| df | 99.000 |
| p-value | 0.009 |
| 95% CI Lower | 74.518 |
| 95% CI Upper | 77.563 |
| Cohen’s d | 0.266 |
The average MBA grade in the sample is 76.041 (SD = 7.675), which is higher than the hypothesized mean of 74. The t-test result (t(99)=2.659, p=0.009) indicates that this difference is statistically significant at the 1% level.
The 95% confidence interval [74.518, 77.563] does not include 74, supporting the conclusion.
However, the effect size (Cohen’s d = 0.266) suggests that the difference is small in practical terms.
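As a hand check, the t statistic is the standardized distance of the sample mean from the hypothesized value, t = (x̄ − μ₀) / (s / √n), which reproduces the 2.659 reported above:

# Hand check of the t statistic: t = (mean - mu0) / (sd / sqrt(n))
n <- sum(!is.na(grade))  # 100 students (df = 99)
(mean(grade, na.rm = TRUE) - 74) / (sd(grade, na.rm = TRUE) / sqrt(n))  # ~2.659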
library(readxl)
library(knitr)
library(kableExtra)
apartments <- read_excel("~/Documents/Magisterij /IMB /Bootcamp /Statistika/Take_home_exam/R Take Home Exam 2025-2/Task 3/Apartments.xlsx")
kable(head(apartments, 10), caption = "Overview of the Apartments Dataset") %>%
  kable_styling(full_width = FALSE,
                bootstrap_options = c("striped", "hover", "condensed", "responsive"))
| Age | Distance | Price | Parking | Balcony |
|---|---|---|---|---|
| 7 | 28 | 1640 | 0 | 1 |
| 18 | 1 | 2800 | 1 | 0 |
| 7 | 28 | 1660 | 0 | 0 |
| 28 | 29 | 1850 | 0 | 1 |
| 18 | 18 | 1640 | 1 | 1 |
| 28 | 12 | 1770 | 0 | 1 |
| 14 | 20 | 1850 | 0 | 1 |
| 18 | 6 | 1970 | 1 | 1 |
| 22 | 7 | 2270 | 1 | 0 |
| 25 | 2 | 2570 | 1 | 0 |
Description: The Apartments dataset contains 85 apartments in Ljubljana, described by five variables: Age (age of the apartment in years), Distance (distance from the city centre in km), Price (price per m² in EUR), and two 0/1 indicators, Parking and Balcony, recording whether the apartment has a parking space or a balcony.
library(knitr)
apartments$Parking <- factor(apartments$Parking,
                             levels = c("0", "1"),
                             labels = c("No", "Yes"))
apartments$Balcony <- factor(apartments$Balcony,
                             levels = c("0", "1"),
                             labels = c("No", "Yes"))
kable(head(apartments, 10), caption = "Apartments Dataset with Factors")
| Age | Distance | Price | Parking | Balcony |
|---|---|---|---|---|
| 7 | 28 | 1640 | No | Yes |
| 18 | 1 | 2800 | Yes | No |
| 7 | 28 | 1660 | No | No |
| 28 | 29 | 1850 | No | Yes |
| 18 | 18 | 1640 | Yes | Yes |
| 28 | 12 | 1770 | No | Yes |
| 14 | 20 | 1850 | No | Yes |
| 18 | 6 | 1970 | Yes | Yes |
| 22 | 7 | 2270 | Yes | No |
| 25 | 2 | 2570 | Yes | No |
library(broom)
library(knitr)
library(dplyr)
price <- apartments$Price
tt_price <- t.test(price, mu = 1900, alternative = "two.sided")
mean_price <- round(mean(price, na.rm = TRUE), 3)
sd_price <- round(sd(price, na.rm = TRUE), 3)
t_stat <- round(unname(tt_price$statistic), 3)
df_val <- unname(tt_price$parameter)
p_val <- round(tt_price$p.value, 3)
ci_low <- round(tt_price$conf.int[1], 3)
ci_high <- round(tt_price$conf.int[2], 3)
cohen_d <- round((mean(price, na.rm = TRUE) - 1900) / sd(price, na.rm = TRUE), 3)
summary_table <- data.frame(
  Statistic = c("Mean (Price)", "SD (Price)", "t statistic", "df", "p-value",
                "95% CI Lower", "95% CI Upper", "Cohen's d"),
  Value = c(mean_price, sd_price, t_stat, df_val, p_val, ci_low, ci_high, cohen_d))
tt_price_tidy <- tidy(tt_price) %>%
  rename(
    Estimate = estimate,
    Statistic = statistic,
    `P-value` = p.value,
    `Degrees of Freedom` = parameter,
    `CI Lower` = conf.low,
    `CI Upper` = conf.high,
    Method = method,
    Alternative = alternative)
tt_price_tidy$Alternative <- "Two-sided"
kable(tt_price_tidy, caption = "T-test results for Apartment Price (H₀: μ = 1900 EUR)")
| Estimate | Statistic | P-value | Degrees of Freedom | CI Lower | CI Upper | Method | Alternative |
|---|---|---|---|---|---|---|---|
| 2018.941 | 2.90223 | 0.0047306 | 84 | 1937.443 | 2100.44 | One Sample t-test | Two-sided |
kable(summary_table, caption = "Summary statistics and effect size for Apartment Price (vs μ = 1900 EUR)")
| Statistic | Value |
|---|---|
| Mean (Price) | 2018.941 |
| SD (Price) | 377.842 |
| t statistic | 2.902 |
| df | 84.000 |
| p-value | 0.005 |
| 95% CI Lower | 1937.443 |
| 95% CI Upper | 2100.440 |
| Cohen’s d | 0.315 |
The one-sample t-test shows that the mean apartment price per m² (2018.94€) is significantly higher than the hypothesized value of 1900€ (t(84) = 2.90, p = 0.005).
The 95% confidence interval [1937.44, 2100.44] does not include 1900, meaning we can reject the null hypothesis.
The effect size (Cohen’s d = 0.315) indicates a small-to-moderate practical difference.
To conclude, apartments in Ljubljana are, on average, priced noticeably higher than 1900€ per m².
fit1 <- lm(Price ~ Age, data = apartments)
summary(fit1)
##
## Call:
## lm(formula = Price ~ Age, data = apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -623.9 -278.0 -69.8 243.5 776.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2185.455 87.043 25.108 <2e-16 ***
## Age -8.975 4.164 -2.156 0.034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared: 0.05302, Adjusted R-squared: 0.04161
## F-statistic: 4.647 on 1 and 83 DF, p-value: 0.03401
reg_coef <- coef(fit1)[2]
cor_val <- cor(apartments$Price, apartments$Age)
r_squared <- summary(fit1)$r.squared
reg_coef
## Age
## -8.975058
cor_val
## [1] -0.230255
r_squared
## [1] 0.05301737
library(knitr)
summary_table <- data.frame(
  Statistic = c("Regression Coefficient (Age)", "Correlation (Price, Age)", "R-squared"),
  Value = c(reg_coef, cor_val, r_squared))
kable(summary_table, caption = "Regression results: Price as a function of Age")
| Statistic | Value |
|---|---|
| Regression Coefficient (Age) | -8.9750576 |
| Correlation (Price, Age) | -0.2302550 |
| R-squared | 0.0530174 |
Regression Coefficient (Age) = -8.975. This means that for every additional year of apartment age, the price per m² decreases by about 9 EUR, on average. The negative value shows that older apartments are generally cheaper.
Coefficient of Correlation (r) = –0.230. The correlation between Age and Price is negative. This confirms that older apartments have a lower price, but since the value is close to 0, the relationship is not strong.
Coefficient of Determination (R²) = 0.053. About 5.3% of the variation in apartment prices can be explained by the age of the apartment. This is a relatively small proportion, meaning other factors (distance, parking, balcony, etc.) likely explain price much better.
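To make the coefficient concrete, the fitted line can be used for a prediction, for example for a 20-year-old apartment (a worked example, not part of the original output):

# Predicted price per m2 for a 20-year-old apartment:
# 2185.455 - 8.975 * 20 = 2005.95 EUR
predict(fit1, newdata = data.frame(Age = 20))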
library(GGally)
ggpairs(apartments[, c("Price", "Age", "Distance")],
        title = "Scatterplot Matrix: Price, Age, and Distance")
Price vs Age Correlation = –0.230. This suggests that as apartments get older, the price per m² decreases slightly, but the relationship is weak.
Price vs Distance Correlation = –0.631. This is a clear relationship that shows that apartments further from the city center tend to be significantly cheaper.
Age vs Distance Correlation = 0.043. No meaningful relationship between age and distance, because the value is close to 0.
Multicollinearity is only a problem if the independent variables (Age and Distance) are strongly correlated with each other. Here, their correlation is only 0.043, so this is not a concern.
fit2 <- lm(Price ~ Age + Distance, data = apartments)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.23 -219.94 -85.68 211.31 689.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2460.101 76.632 32.10 < 2e-16 ***
## Age -7.934 3.225 -2.46 0.016 *
## Distance -20.667 2.748 -7.52 6.18e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared: 0.4396, Adjusted R-squared: 0.4259
## F-statistic: 32.16 on 2 and 82 DF, p-value: 4.896e-11
library(knitr)
library(broom)
fit2_tidy <- tidy(fit2)
fit2_glance <- glance(fit2)
summary_table <- data.frame(
  Term = fit2_tidy$term,
  Estimate = round(fit2_tidy$estimate, 3),
  `Std. Error` = round(fit2_tidy$std.error, 3),
  `t value` = round(fit2_tidy$statistic, 3),
  `p value` = round(fit2_tidy$p.value, 3),
  check.names = FALSE)  # keep the column names as written instead of Std..Error, t.value, ...
kable(summary_table, caption = "Multiple Regression Results: Price ~ Age + Distance")
| Term | Estimate | Std. Error | t value | p value |
|---|---|---|---|---|
| (Intercept) | 2460.101 | 76.632 | 32.103 | 0.000 |
| Age | -7.934 | 3.225 | -2.460 | 0.016 |
| Distance | -20.667 | 2.748 | -7.520 | 0.000 |
I created a table for easier representation of the results.
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
vif_values <- vif(fit2)
library(knitr)
kable(as.data.frame(vif_values),
      caption = "Variance Inflation Factor (VIF) for predictors in fit2")
| | vif_values |
|---|---|
| Age | 1.001845 |
| Distance | 1.001845 |
The VIF values for both Age and Distance are approximately 1 (1.0018), which indicates that the predictors are not correlated with each other. Since VIF values below 5 (and especially close to 1) suggest no multicollinearity, we conclude that there is no evidence of multicollinearity in this regression model.
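With only two predictors, the VIF can also be verified by hand from their pairwise correlation, since VIF = 1 / (1 − r²); with r = 0.043 this gives roughly 1.0018, matching the table:

# Hand check: with two predictors, VIF = 1 / (1 - r^2)
r <- cor(apartments$Age, apartments$Distance)
1 / (1 - r^2)  # ~1.0018, matching vif(fit2)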
library(car)
std_res <- rstandard(fit2)
cooks_d <- cooks.distance(fit2)
thr_res <- 3
thr_cook <- 4 / length(std_res)
problematic_idx <- which(abs(std_res) > thr_res | cooks_d > thr_cook)
apartments_clean <- if (length(problematic_idx)) apartments[-problematic_idx, ] else apartments
fit2_clean <- lm(Price ~ Age + Distance, data = apartments_clean)
cat("Removed", length(problematic_idx), "observations:",
if (length(problematic_idx)) paste(problematic_idx, collapse = ", ") else "(none)", "\n")
## Removed 5 observations: 22, 33, 38, 53, 55
cat("Cook's D threshold:", round(thr_cook, 5), "\n\n")
## Cook's D threshold: 0.04706
cat("Original model (fit2):\n")
## Original model (fit2):
print(summary(fit2))
##
## Call:
## lm(formula = Price ~ Age + Distance, data = apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.23 -219.94 -85.68 211.31 689.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2460.101 76.632 32.10 < 2e-16 ***
## Age -7.934 3.225 -2.46 0.016 *
## Distance -20.667 2.748 -7.52 6.18e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared: 0.4396, Adjusted R-squared: 0.4259
## F-statistic: 32.16 on 2 and 82 DF, p-value: 4.896e-11
cat("\nCleaned model (fit2_clean):\n")
##
## Cleaned model (fit2_clean):
print(summary(fit2_clean))
##
## Call:
## lm(formula = Price ~ Age + Distance, data = apartments_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -411.50 -203.69 -45.24 191.11 492.56
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2502.467 75.024 33.356 < 2e-16 ***
## Age -8.674 3.221 -2.693 0.00869 **
## Distance -24.063 2.692 -8.939 1.57e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 256.8 on 77 degrees of freedom
## Multiple R-squared: 0.5361, Adjusted R-squared: 0.524
## F-statistic: 44.49 on 2 and 77 DF, p-value: 1.437e-13
std_resid <- rstandard(fit2_clean)
std_fitted <- scale(fitted(fit2_clean))
plot(std_fitted, std_resid,
     xlab = "Standardized Fitted Values",
     ylab = "Standardized Residuals",
     main = "Heteroskedasticity")
abline(h = 0, col = "hotpink2", lwd = 2)
lines(lowess(std_fitted, std_resid), col = "lightskyblue", lwd = 2)
The scatterplot of standardized residuals versus standardized fitted values shows a curved and slightly spread-out pattern. This means the variation in residuals is not the same across all fitted values. In other words, there are signs of heteroskedasticity.
This is important because heteroskedasticity can make the estimated standard errors unreliable, which may lead to incorrect conclusions about which predictors are significant.
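A formal complement to the visual check (optional, assuming the lmtest package is installed) is the Breusch-Pagan test, where a small p-value would confirm heteroskedasticity:

# Breusch-Pagan test on the cleaned model; a small p-value indicates heteroskedasticity
library(lmtest)
bptest(fit2_clean)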
std_resid <- rstandard(fit2_clean)
par(mfrow = c(1, 2))
hist(std_resid,
     main = "Standardized Residuals",
     xlab = "Standardized Residuals",
     col = "plum1", border = "violetred",
     ylim = c(0, max(hist(std_resid, plot = FALSE)$counts) + 2))
qqnorm(std_resid, main = "Q-Q Plot of Standardized Residuals")
qqline(std_resid, col = "red", lwd = 2)
shapiro_test <- shapiro.test(std_resid)
library(knitr)
shapiro_table <- data.frame(
  Statistic = c("W Statistic", "p-value"),
  # unname() strips the "W" name so kable does not add a stray row-name column
  Value = c(round(unname(shapiro_test$statistic), 3), round(shapiro_test$p.value, 3)))
kable(shapiro_table, caption = "Shapiro-Wilk Test for Normality of Standardized Residuals")
| Statistic | Value |
|---|---|
| W Statistic | 0.942 |
| p-value | 0.001 |
To check whether the standardized residuals are normally distributed, I looked at both the graphs and a formal test.
The histogram is roughly bell-shaped, but not perfectly symmetric.
The Q-Q plot shows most points close to the red line, but the points at both ends deviate, which means the distribution is not perfectly normal.
The Shapiro-Wilk test gave a p-value of 0.001, which is below 0.05, so we reject the assumption of normality.
We can conclude that the residuals are not normally distributed: they fit well in the middle, but the tails deviate, and the formal test confirms this.
fit2_summary <- summary(fit2_clean)
coef_table <- as.data.frame(fit2_summary$coefficients)
coef_table$Variable <- rownames(coef_table)
rownames(coef_table) <- NULL
coef_table <- coef_table[, c("Variable", "Estimate", "Std. Error", "t value", "Pr(>|t|)")]
library(knitr)
kable(coef_table, caption = paste("Regression results for Price ~ Age + Distance"))
| Variable | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| (Intercept) | 2502.466701 | 75.023970 | 33.355562 | 0.0000000 |
| Age | -8.673848 | 3.220960 | -2.692939 | 0.0086876 |
| Distance | -24.063024 | 2.691783 | -8.939438 | 0.0000000 |
The regression table shows the estimates, their standard errors, t-values, and p-values.
Among the estimated coefficients, the intercept (2502.47) means that an apartment with age = 0 and distance = 0 would have a predicted price of around 2502 EUR per m². The Age coefficient (-8.67) shows that with every additional year of age, the price decreases by about 9 EUR. The Distance coefficient (-24.06) shows that for every extra kilometer from the city centre, the price drops by about 24 EUR.
The standard errors are relatively small, which means the estimates are quite precise.
The t-value is calculated as the estimate divided by its standard error, and it shows how strongly the coefficient differs from zero. Larger absolute t-values mean stronger evidence. In this case, Age has a t-value of -2.69 and Distance has -8.94, both of which are far from zero, so together with the low p-values this confirms that both predictors significantly affect apartment price.
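This can be reproduced directly from the coefficient table by dividing each estimate by its standard error:

# t value = estimate / standard error, reproducing the table above
coefs <- coef(summary(fit2_clean))
coefs[, "Estimate"] / coefs[, "Std. Error"]  # Age ~ -2.693, Distance ~ -8.939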
fit3 <- lm(Price ~ Age + Distance + Parking + Balcony, data = apartments)
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -459.92 -200.66 -57.48 260.08 594.37
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2301.667 94.271 24.415 < 2e-16 ***
## Age -6.799 3.110 -2.186 0.03172 *
## Distance -18.045 2.758 -6.543 5.28e-09 ***
## ParkingYes 196.168 62.868 3.120 0.00251 **
## BalconyYes 1.935 60.014 0.032 0.97436
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 273.7 on 80 degrees of freedom
## Multiple R-squared: 0.5004, Adjusted R-squared: 0.4754
## F-statistic: 20.03 on 4 and 80 DF, p-value: 1.849e-11
anova(fit2, fit3)
## Analysis of Variance Table
##
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Parking + Balcony
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 82 6720983
## 2 80 5991088 2 729894 4.8732 0.01007 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Fit3 fits the data better than Fit2: the partial F-test (ANOVA) gave a p-value of about 0.01, which is below 0.05. This means that adding Parking and Balcony explains significantly more of the variation in Price, so Fit3 is the preferred model.
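The partial F statistic in the ANOVA table can be reproduced by hand from the two residual sums of squares, F = ((RSS₁ − RSS₂) / Δdf) / (RSS₂ / df₂):

# Hand check of the partial F statistic using the RSS values from the ANOVA table
rss1 <- 6720983   # residual sum of squares of fit2
rss2 <- 5991088   # residual sum of squares of fit3
((rss1 - rss2) / 2) / (rss2 / 80)  # ~4.873, matching the table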
library(broom)
library(knitr)
fit3_table <- tidy(fit3)
fit3_table$estimate <- round(fit3_table$estimate, 3)
fit3_table$std.error <- round(fit3_table$std.error, 3)
fit3_table$statistic <- round(fit3_table$statistic, 3)
fit3_table$p.value <- round(fit3_table$p.value, 4)
kable(fit3_table, caption = "Regression Results for Model fit3")
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 2301.667 | 94.271 | 24.415 | 0.0000 |
| Age | -6.799 | 3.110 | -2.186 | 0.0317 |
| Distance | -18.045 | 2.758 | -6.543 | 0.0000 |
| ParkingYes | 196.168 | 62.868 | 3.120 | 0.0025 |
| BalconyYes | 1.935 | 60.014 | 0.032 | 0.9744 |
First categorical variable, Parking: apartments with parking cost on average about 196 EUR more than those without parking. This effect is statistically significant (p = 0.0025).
Second categorical variable, Balcony: apartments with a balcony cost on average about 2 EUR more than those without one. However, this effect is not significant (p = 0.97), meaning balconies have no real impact on price.
Hypothesis being tested with the F-statistic: H₀ (null hypothesis): none of the predictors (Age, Distance, Parking, Balcony) influence apartment prices. H₁ (alternative hypothesis): at least one predictor has an effect on apartment prices. Since the model's F-statistic is 20.03 with p = 1.849e-11, far below 0.05, we reject H₀ and conclude that the model as a whole is significant.
fitted_values <- fitted(fit3)
residuals_values <- resid(fit3)
ID2_fitted <- fitted_values[2]
ID2_residual <- residuals_values[2]
library(knitr)
ID2_results <- data.frame(
  Apartment = "ID2",
  Fitted_Value = round(unname(ID2_fitted), 2),   # unname() avoids a stray row name in the kable output
  Residual = round(unname(ID2_residual), 2))
kable(ID2_results, caption = "Fitted Value and Residual for Apartment ID2")
| Apartment | Fitted_Value | Residual |
|---|---|---|
| ID2 | 2357.41 | 442.59 |
For apartment ID2, the regression model predicts a fitted value of 2357.41 EUR. The actual price of this apartment is 2800 EUR, which is higher than the predicted value. The residual for ID2 is therefore 442.59 EUR, meaning that the model underestimates the price of this apartment by about 443 EUR.
This difference (residual) shows how far the observed value is from the predicted regression line. In this case, the positive residual indicates that the apartment is more expensive than what the model expects, given its age, distance, parking, and balcony.
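The residual can be verified by hand as observed minus fitted:

# Residual for ID2 = observed price - fitted value
apartments$Price[2] - fitted(fit3)[2]  # 2800 - 2357.41 = 442.59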