The dataset consists of 250 movies from the IMDb Top 250 list. Each observation corresponds to a single film, so the unit of analysis is the individual movie.
The dataset contains the following variables: Rank, Title, Release (release year), Runtime, Rated (age rating), and Ratings (IMDb rating):
mydata <- read.table("./IMDB_Movies.csv",
header = TRUE,
sep = ",",
dec = ".",
quote = "\"",
stringsAsFactors = FALSE,
fill = TRUE,
comment.char = "")
head(mydata)
## Rank Title Release Runtime Rated
## 1 1 The Shawshank Redemption 1994 2h 22m R
## 2 2 The Godfather 1972 2h 55m R
## 3 3 The Dark Knight 2008 2h 32m PG-13
## 4 4 The Godfather Part II 1974 3h 22m R
## 5 5 12 Angry Men 1957 1h 36m Approved
## 6 6 The Lord of the Rings: The Return of the King 2003 3h 21m PG-13
## Ratings
## 1 9.3
## 2 9.2
## 3 9.0
## 4 9.0
## 5 9.0
## 6 9.0
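The arguments above are exactly the defaults of read.csv(), so the import could equivalently be written more compactly (a minor alternative, same result):
# read.csv() defaults to header = TRUE, sep = ",", dec = ".", quote = "\"",
# fill = TRUE and comment.char = "", matching the read.table() call above
mydata <- read.csv("./IMDB_Movies.csv", stringsAsFactors = FALSE)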
First, I transformed the original Runtime variable (given in hours and minutes, e.g. “2h 22m”) into a numeric variable expressed in total minutes. I stored this as a new variable called Runtime_minutes. This makes future manipulations and calculations much easier, since runtimes are now stored as numbers rather than text. Finally, I deleted the original Runtime column because it was no longer needed.
# extract the hour component (text before "h") and the minute component
hours <- as.numeric(gsub("h.*", "", mydata$Runtime))
minutes <- as.numeric(gsub(".*h |m", "", mydata$Runtime))
# runtimes with no minute part (e.g. "2h") parse to NA; treat them as 0 minutes
minutes[is.na(minutes)] <- 0
mydata$Runtime_minutes <- hours * 60 + minutes
head(mydata)
## Rank Title Release Runtime Rated
## 1 1 The Shawshank Redemption 1994 2h 22m R
## 2 2 The Godfather 1972 2h 55m R
## 3 3 The Dark Knight 2008 2h 32m PG-13
## 4 4 The Godfather Part II 1974 3h 22m R
## 5 5 12 Angry Men 1957 1h 36m Approved
## 6 6 The Lord of the Rings: The Return of the King 2003 3h 21m PG-13
## Ratings Runtime_minutes
## 1 9.3 142
## 2 9.2 175
## 3 9.0 152
## 4 9.0 202
## 5 9.0 96
## 6 9.0 201
mydata$Runtime <- NULL
head(mydata)
## Rank Title Release Rated Ratings
## 1 1 The Shawshank Redemption 1994 R 9.3
## 2 2 The Godfather 1972 R 9.2
## 3 3 The Dark Knight 2008 PG-13 9.0
## 4 4 The Godfather Part II 1974 R 9.0
## 5 5 12 Angry Men 1957 Approved 9.0
## 6 6 The Lord of the Rings: The Return of the King 2003 PG-13 9.0
## Runtime_minutes
## 1 142
## 2 175
## 3 152
## 4 202
## 5 96
## 6 201
To improve data quality, I removed all observations where the variable Rated equals "Not Rated". These observations do not provide a meaningful classification and could bias the analysis. After removal, the dataset contains only movies with valid rating categories.
mydata <- mydata[mydata$Rated != "Not Rated", ]
nrow(mydata)
## [1] 221
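As an optional sanity check, the remaining categories of Rated can be inspected to confirm that "Not Rated" is gone:
# frequency table of the rating categories that remain after filtering
table(mydata$Rated)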
To simplify the dataset, I renamed Release to Year and Ratings to Rating. Using shorter and clearer variable names reduces the chance of mistakes and makes later analysis easier to read and interpret.
names(mydata)[names(mydata) == "Release"] <- "Year"
names(mydata)[names(mydata) == "Ratings"] <- "Rating"
head(mydata)
## Rank Title Year Rated Rating
## 1 1 The Shawshank Redemption 1994 R 9.3
## 2 2 The Godfather 1972 R 9.2
## 3 3 The Dark Knight 2008 PG-13 9.0
## 4 4 The Godfather Part II 1974 R 9.0
## 5 5 12 Angry Men 1957 Approved 9.0
## 6 6 The Lord of the Rings: The Return of the King 2003 PG-13 9.0
## Runtime_minutes
## 1 142
## 2 175
## 3 152
## 4 202
## 5 96
## 6 201
I created a new data frame called movies_post1990, which includes only movies released after 1990 (the condition Year > 1990 keeps films from 1991 onward). I want to use this dataset for further analysis because the movie industry before the 1990s was very different (for example in budgets and production).
movies_post1990 <- mydata[mydata$Year > 1990, ]
head(movies_post1990)
## Rank Title Year Rated Rating
## 1 1 The Shawshank Redemption 1994 R 9.3
## 3 3 The Dark Knight 2008 PG-13 9.0
## 6 6 The Lord of the Rings: The Return of the King 2003 PG-13 9.0
## 7 7 Schindler's List 1993 R 9.0
## 8 8 Pulp Fiction 1994 R 8.9
## 9 9 The Lord of the Rings: The Fellowship of the Ring 2001 PG-13 8.9
## Runtime_minutes
## 1 142
## 3 152
## 6 201
## 7 195
## 8 154
## 9 178
summary(movies_post1990$Runtime_minutes)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 80.0 111.0 129.5 130.5 148.2 201.0
summary(movies_post1990$Rating)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.000 8.100 8.250 8.323 8.500 9.300
library(psych)
describe(movies_post1990[, c("Runtime_minutes", "Rating")])
## vars n mean sd median trimmed mad min max range skew
## Runtime_minutes 1 132 130.45 26.30 129.50 129.13 27.43 80 201.0 121.0 0.39
## Rating 2 132 8.32 0.25 8.25 8.29 0.22 8 9.3 1.3 1.29
## kurtosis se
## Runtime_minutes -0.43 2.29
## Rating 1.57 0.02
I calculated the mode of the Year variable in the movies_post1990 dataset. The mode is the release year with the highest number of movies. Since base R's mode() function returns a variable's storage mode rather than the statistical mode, I counted how many movies were released in each year, sorted the years from most to least common, and selected the most frequent year.
year_mode <- as.numeric(names(sort(table(movies_post1990$Year),
decreasing = TRUE)[1]))
year_mode
## [1] 1995
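For reuse, the same logic can be wrapped in a small helper function (a sketch; get_mode is a hypothetical name, not a base R function):
# returns the most frequent value of a vector as a number
get_mode <- function(v) {
counts <- table(v)
as.numeric(names(counts)[which.max(counts)])
}
get_mode(movies_post1990$Year) # should also return 1995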
mean_rating <- mean(movies_post1990$Rating,
na.rm = TRUE)
mean_rating
## [1] 8.323485
median_runtime <- median(movies_post1990$Runtime_minutes,
na.rm = TRUE)
median_runtime
## [1] 129.5
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
ggplot(movies_post1990, aes(x = Rating)) +
geom_histogram(binwidth = 0.1, fill = "skyblue", color = "black") +
labs(title = "Distribution of Ratings (movies_post1990)",
x = "Rating", y = "Count")
ggplot(movies_post1990, aes(x = Runtime_minutes)) +
geom_histogram(binwidth = 10, fill = "lightgreen", color = "black") +
labs(title = "Distribution of Runtime (movies_post1990)",
x = "Runtime (minutes)", y = "Count")
Histogram of Ratings: ratings are concentrated between 8.0 and 8.5, with a long right tail (skew = 1.29 in the describe() output above).
Histogram of Runtime: runtimes are spread fairly symmetrically around the median of about 130 minutes (skew = 0.39).
ggplot(movies_post1990, aes(x = Runtime_minutes, y = Rating)) +
geom_point(alpha = 0.6, color = "blue") +
labs(title = "Scatterplot: Rating vs Runtime (movies_post1990)",
x = "Runtime (minutes)", y = "Rating")
The scatterplot shows the relationship between runtime and rating.
Movies are spread across runtimes of 80–200 minutes, while ratings
remain consistently high (8.0–9.3). There is no clear upward or downward
trend, meaning longer movies are not necessarily rated
higher.
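This visual impression can be quantified with a correlation coefficient (a quick check; a value near zero would support the absence of a linear trend):
# Pearson correlation between runtime and rating in the post-1990 subset
cor(movies_post1990$Runtime_minutes, movies_post1990$Rating)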
ggplot(movies_post1990, aes(y = Rating)) +
geom_boxplot(fill = "orange", color = "black") +
labs(title = "Boxplot of Ratings (movies_post1990)",
y = "Rating") + scale_y_continuous(breaks = seq(8, 9.5, 0.25))
The boxplot shows that most post-1990 movies have very high ratings, clustered between 8.1 and 8.5, with a median around 8.3. The interquartile range is narrow, so ratings vary little across films. One outlier above 9.0 represents an exceptionally rated film, The Shawshank Redemption (9.3).
library(readxl)
mba_data <- read_excel("./Business_School.xlsx")
head(mba_data)
## # A tibble: 6 × 9
## `Student ID` `Undergrad Degree` `Undergrad Grade` `MBA Grade`
## <dbl> <chr> <dbl> <dbl>
## 1 1 Business 68.4 90.2
## 2 2 Computer Science 70.2 68.7
## 3 3 Finance 76.4 83.3
## 4 4 Business 82.6 88.7
## 5 5 Finance 76.9 75.4
## 6 6 Computer Science 83.3 82.1
## # ℹ 5 more variables: `Work Experience` <chr>, `Employability (Before)` <dbl>,
## # `Employability (After)` <dbl>, Status <chr>, `Annual Salary` <dbl>
str(mba_data)
## tibble [100 × 9] (S3: tbl_df/tbl/data.frame)
## $ Student ID : num [1:100] 1 2 3 4 5 6 7 8 9 10 ...
## $ Undergrad Degree : chr [1:100] "Business" "Computer Science" "Finance" "Business" ...
## $ Undergrad Grade : num [1:100] 68.4 70.2 76.4 82.6 76.9 83.3 76 82.8 76 76.9 ...
## $ MBA Grade : num [1:100] 90.2 68.7 83.3 88.7 75.4 82.1 66.9 76.8 72.3 72.4 ...
## $ Work Experience : chr [1:100] "No" "Yes" "No" "No" ...
## $ Employability (Before): num [1:100] 252 101 401 287 275 254 117 219 152 228 ...
## $ Employability (After) : num [1:100] 276 119 462 342 347 313 163 304 211 286 ...
## $ Status : chr [1:100] "Placed" "Placed" "Placed" "Placed" ...
## $ Annual Salary : num [1:100] 111000 107000 109000 148000 255500 ...
library(ggplot2)
ggplot(mba_data, aes(x = `Undergrad Degree`)) +
geom_bar(fill = "wheat2", color = "black") +
geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.3) +
labs(title = "Distribution of Undergraduate Degrees",
x = "Undergraduate Degree", y = "Count") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
The bar chart shows the distribution of undergraduate degrees among the MBA students. The most common degree is Business, with 35 students.
summary(mba_data$`Annual Salary`)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 20000 87125 103500 109058 124000 340000
library(psych)
describe(mba_data$`Annual Salary`)
## vars n mean sd median trimmed mad min max range skew
## X1 1 100 109058 41501.49 103500 104600.2 25945.5 20000 340000 320000 2.22
## kurtosis se
## X1 9.41 4150.15
The Annual Salary variable has a mean of 109,058 and a median of 103,500. Since the mean exceeds the median, the distribution is right-skewed (skew = 2.22), pulled upward by a few high salaries. The middle half of students earn between 87,125 (1st quartile) and 124,000 (3rd quartile). Salaries range from a minimum of 20,000 to a maximum of 340,000, and the few extremely high values stretch the upper tail of the distribution.
library(ggplot2)
ggplot(mba_data, aes(x = `Annual Salary`)) +
geom_histogram(binwidth = 5000, fill = "limegreen", color = "black") +
labs(title = "Distribution of Annual Salary",
x = "Annual Salary",
y = "Count") +
scale_x_continuous(breaks = seq(0, 350000, 50000),
labels = function(x) format(x, big.mark = ",", scientific = FALSE)) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
The histogram shows that most MBA graduates earn between 80,000 and 130,000, with the distribution concentrated around 100,000. The shape is right-skewed, with a few students earning very high salaries (over 200,000 and up to 340,000).
We are testing whether the average MBA grade of current students is equal to the previous year's average of 74 (H0: μ = 74 vs. H1: μ ≠ 74).
t_test_result <- t.test(mba_data$`MBA Grade`,
mu = 74)
t_test_result
##
## One Sample t-test
##
## data: mba_data$`MBA Grade`
## t = 2.6587, df = 99, p-value = 0.00915
## alternative hypothesis: true mean is not equal to 74
## 95 percent confidence interval:
## 74.51764 77.56346
## sample estimates:
## mean of x
## 76.04055
We reject H0 at the 5% significance level (p = 0.009) and conclude that the average MBA grade of current students (76.04) differs from the previous year's average of 74; since the sample mean exceeds 74, grades are higher this year.
sd(mba_data$`MBA Grade`, na.rm = TRUE)
## [1] 7.675114
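As a cross-check, the t statistic can be reproduced by hand from the sample mean and standard deviation (grades is just a temporary name for the grade vector):
# manual one-sample t statistic: (sample mean - hypothesized mean) / standard error
grades <- mba_data$`MBA Grade`
(mean(grades) - 74) / (sd(grades) / sqrt(length(grades))) # ≈ 2.659, matching t.test() above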
We calculated the effect size using Hedges’ g, which adjusts Cohen’s d for small sample bias.
library(effsize)
##
## Attaching package: 'effsize'
## The following object is masked from 'package:psych':
##
## cohen.d
x <- mba_data$`MBA Grade`
y <- rep(74, length(x))
cohen_d_result <- cohen.d(x, y, paired = TRUE, hedges.correction = TRUE)
cohen_d_result
##
## Hedges's g
##
## g estimate: 0.3731353 (small)
## 95 percent confidence interval:
## lower upper
## 0.08689656 0.65937412
The effect size (Hedges’ g = 0.37) indicates a small effect,
suggesting that the improvement, while statistically significant, is
modest in practical terms.
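For reference, the textbook one-sample effect size can also be computed by hand (a sketch; the paired cohen.d() call above pools the SD of x with the zero-variance vector y, which is consistent with its estimate of 0.373 being about sqrt(2) times the value below):
# textbook one-sample Cohen's d: (mean - mu) / sd
d_manual <- (mean(x) - 74) / sd(x) # ≈ 0.266
# Hedges' small-sample correction factor J = 1 - 3 / (4 * df - 1)
J <- 1 - 3 / (4 * (length(x) - 1) - 1)
d_manual * J # ≈ 0.264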
library(readxl)
apart <- read_excel("./Apartments.xlsx")
head(apart)
## # A tibble: 6 × 5
## Age Distance Price Parking Balcony
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7 28 1640 0 1
## 2 18 1 2800 1 0
## 3 7 28 1660 0 0
## 4 28 29 1850 0 1
## 5 18 18 1640 1 1
## 6 28 12 1770 0 1
str(apart)
## tibble [85 × 5] (S3: tbl_df/tbl/data.frame)
## $ Age : num [1:85] 7 18 7 28 18 28 14 18 22 25 ...
## $ Distance: num [1:85] 28 1 28 29 18 12 20 6 7 2 ...
## $ Price : num [1:85] 1640 2800 1660 1850 1640 1770 1850 1970 2270 2570 ...
## $ Parking : num [1:85] 0 1 0 0 1 0 0 1 1 1 ...
## $ Balcony : num [1:85] 1 0 0 1 1 1 1 1 0 0 ...
Description: the dataset contains 85 apartments and five variables: Age (age of the apartment in years), Distance (distance from the city center in kilometers), Price (price in EUR per m²), and two binary indicators, Parking and Balcony (0 = no, 1 = yes).
# Parking and Balcony are coded 0/1; convert them to labeled factors
apart$Parking <- factor(apart$Parking, levels = c(0, 1), labels = c("No", "Yes"))
apart$Balcony <- factor(apart$Balcony, levels = c(0, 1), labels = c("No", "Yes"))
str(apart)
## tibble [85 × 5] (S3: tbl_df/tbl/data.frame)
## $ Age : num [1:85] 7 18 7 28 18 28 14 18 22 25 ...
## $ Distance: num [1:85] 28 1 28 29 18 12 20 6 7 2 ...
## $ Price : num [1:85] 1640 2800 1660 1850 1640 1770 1850 1970 2270 2570 ...
## $ Parking : Factor w/ 2 levels "No","Yes": 1 2 1 1 2 1 1 2 2 2 ...
## $ Balcony : Factor w/ 2 levels "No","Yes": 2 1 1 2 2 2 2 2 1 1 ...
table(apart$Parking)
##
## No Yes
## 42 43
table(apart$Balcony)
##
## No Yes
## 48 37
We are testing whether the average price per m² of apartments in Ljubljana is equal to the hypothesized market average of 1900 EUR (H0: μ = 1900 vs. H1: μ ≠ 1900).
t_test_result <- t.test(apart$Price,
mu = 1900)
t_test_result
##
## One Sample t-test
##
## data: apart$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
## 1937.443 2100.440
## sample estimates:
## mean of x
## 2018.941
We reject H0 at the 5% significance level (p = 0.005) and conclude that the average price per m² of apartments (2018.94 EUR) differs from the hypothesized average of 1900 EUR; the sample mean is higher.
fit1 <- lm(Price ~ Age, data = apart)
summary(fit1)
##
## Call:
## lm(formula = Price ~ Age, data = apart)
##
## Residuals:
## Min 1Q Median 3Q Max
## -623.9 -278.0 -69.8 243.5 776.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2185.455 87.043 25.108 <2e-16 ***
## Age -8.975 4.164 -2.156 0.034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared: 0.05302, Adjusted R-squared: 0.04161
## F-statistic: 4.647 on 1 and 83 DF, p-value: 0.03401
correlation_r <- sqrt(summary(fit1)$r.squared)
correlation_r
## [1] 0.230255
Explanation: in a simple linear regression, the square root of R² equals the absolute value of the Pearson correlation between the response and the predictor. Because the slope of Age is negative, the correlation between Price and Age is r = -0.23, a weak negative linear relationship.
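The sign can be confirmed directly with cor() (a quick check; for simple regression the magnitude equals sqrt(R²)):
# Pearson correlation between Price and Age; about -0.23, the negative
# of the sqrt(R-squared) value computed above
cor(apart$Price, apart$Age)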
pairs(~ Price + Age + Distance, data = apart)
The scatterplot matrix shows the pairwise relationships between Price, Age, and Distance.
Conclusion: based on the scatterplot matrix, there is no strong linear correlation between Age and Distance, so there is no evidence of problematic multicollinearity between these predictors.
fit2 <- lm(Price ~ Age + Distance, data = apart)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = apart)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.23 -219.94 -85.68 211.31 689.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2460.101 76.632 32.10 < 2e-16 ***
## Age -7.934 3.225 -2.46 0.016 *
## Distance -20.667 2.748 -7.52 6.18e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared: 0.4396, Adjusted R-squared: 0.4259
## F-statistic: 32.16 on 2 and 82 DF, p-value: 4.896e-11
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
vif(fit2)
## Age Distance
## 1.001845 1.001845
Both variance inflation factors are essentially 1, confirming that there is no problematic multicollinearity between Age and Distance.
# Step 1: Standardized residuals
standard_resid <- rstandard(fit2)
# Step 2: Cook's distances
cooks_dist <- cooks.distance(fit2)
# Step 3: Combine into a data frame to inspect
diagnostics <- data.frame(
ID = 1:nrow(apart),
StdResid = standard_resid,
CooksD = cooks_dist
)
# Step 4: Identify potential outliers and high influence points
# Outliers: standardized residuals > |2|
outliers <- diagnostics[abs(diagnostics$StdResid) > 2, ]
# Influential: Cook’s D > 4/n (rule of thumb)
cutoff <- 4 / nrow(apart)
influential <- diagnostics[diagnostics$CooksD > cutoff, ]
# View results
outliers
## ID StdResid CooksD
## 33 33 2.050586 0.06913379
## 38 38 2.576772 0.31973058
## 53 53 -2.151787 0.06625775
influential
## ID StdResid CooksD
## 22 22 1.575982 0.06086868
## 33 33 2.050586 0.06913379
## 38 38 2.576772 0.31973058
## 53 53 -2.151787 0.06625775
## 55 55 1.444768 0.10420445
We calculated standardized residuals and Cook's distances for the multiple regression model. Five observations (IDs: 22, 33, 38, 53, 55) were identified as outliers or highly influential, based on standardized residuals exceeding 2 in absolute value or Cook's distance above the 4/n rule-of-thumb cutoff.
problematic_ids <- union(outliers$ID, influential$ID)
apart_clean <- apart[-problematic_ids, ]
fit2_clean <- lm(Price ~ Age + Distance, data = apart_clean)
std_resid <- rstandard(fit2_clean)
std_fitted <- scale(fitted(fit2_clean)) # standardized fitted values
# Step 2: Scatterplot
plot(std_fitted, std_resid,
main = "Standardized Residuals vs Standardized Fitted Values",
xlab = "Standardized Fitted Values",
ylab = "Standardized Residuals",
pch = 19, col = "darkgray")
abline(h = 0, col = "blue", lwd = 2)
# Extract standardized residuals
std_resid <- rstandard(fit2_clean)
# Plot: Histogram + QQ Plot
hist(std_resid,
main = "Histogram of Standardized Residuals",
xlab = "Standardized Residuals",
col = "royalblue",
breaks = 10)
qqnorm(std_resid,
main = "Q-Q Plot of Standardized Residuals")
qqline(std_resid, col = "red", lwd = 2)
# Formal test: Shapiro-Wilk normality test
shapiro.test(std_resid)
##
## Shapiro-Wilk normality test
##
## data: std_resid
## W = 0.94156, p-value = 0.001168
Explanation (Shapiro-Wilk test): the test produced W = 0.942 and p = 0.001, below 0.05, so we reject the H0 of normality and conclude that the residuals deviate significantly from a normal distribution.
fit2_clean <- lm(Price ~ Age + Distance, data = apart_clean)
summary(fit2_clean)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = apart_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -411.50 -203.69 -45.24 191.11 492.56
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2502.467 75.024 33.356 < 2e-16 ***
## Age -8.674 3.221 -2.693 0.00869 **
## Distance -24.063 2.692 -8.939 1.57e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 256.8 on 77 degrees of freedom
## Multiple R-squared: 0.5361, Adjusted R-squared: 0.524
## F-statistic: 44.49 on 2 and 77 DF, p-value: 1.437e-13
Intercept (2502.47, p < 0.001): Represents the predicted price per m² of an apartment that is brand new (Age = 0) and located in the city center (Distance = 0). The baseline predicted price is about 2502 EUR/m².
Age (-8.67, p = 0.009): Holding distance constant, each additional year of apartment age reduces the price by about 8.67 EUR/m². This effect is statistically significant, confirming that older apartments are cheaper.
Distance (-24.06, p < 0.001): Holding age constant, each additional kilometer farther from the city center decreases the price by about 24.06 EUR/m². This effect is highly significant and stronger than age.
Model fit (R² = 0.536, Adjusted R² = 0.524): About 53% of the variation in apartment prices is explained by the model, which indicates a fairly good fit for cross-sectional housing data.
F-statistic (44.49, p < 0.001): The regression model as a whole is statistically significant, meaning that age and distance together explain a substantial share of price variation.
Both apartment age and distance from the city center significantly and negatively affect price per m².
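To illustrate how the fitted model is used, predict() gives the estimated price for a hypothetical apartment (inputs chosen purely for illustration):
# hypothetical apartment: 10 years old, 5 km from the city center
new_apt <- data.frame(Age = 10, Distance = 5)
predict(fit2_clean, newdata = new_apt, interval = "confidence")
# point estimate: 2502.47 - 8.674 * 10 - 24.063 * 5, about 2295 EUR/m²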
fit3 <- lm(Price ~ Age + Distance + Parking + Balcony, data = apart_clean)
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = apart_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -390.93 -198.19 -53.64 186.73 518.34
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2393.316 93.930 25.480 < 2e-16 ***
## Age -7.970 3.191 -2.498 0.0147 *
## Distance -21.961 2.830 -7.762 3.39e-11 ***
## ParkingYes 128.700 60.801 2.117 0.0376 *
## BalconyYes 6.032 57.307 0.105 0.9165
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 252.7 on 75 degrees of freedom
## Multiple R-squared: 0.5623, Adjusted R-squared: 0.5389
## F-statistic: 24.08 on 4 and 75 DF, p-value: 7.764e-13
Intercept (2393.32, p < 0.001): Baseline predicted price for an apartment that is brand new (Age = 0), located in the city center (Distance = 0), without parking and without a balcony.
Age (-7.97, p = 0.015): Each additional year of apartment age reduces the price by about 7.97 EUR/m², holding distance, parking, and balcony constant. Older apartments are cheaper, and the effect is statistically significant.
Distance (-21.96, p < 0.001): Each additional kilometer farther from the city center reduces the price by about 21.96 EUR/m², controlling for other factors. Strong negative effect, highly significant.
Parking (128.70, p = 0.038): Apartments with parking are on average 128.70 EUR/m² more expensive than those without, controlling for age, distance, and balcony. Parking has a positive and significant impact on prices.
Balcony (6.03, p = 0.916): Apartments with a balcony are on average only 6 EUR/m² more expensive, but the effect is not statistically significant (p ≈ 0.916). Balcony does not significantly affect apartment prices in this dataset.
Hypotheses for ANOVA comparison: H0: the coefficients of Parking and Balcony are both zero (the simpler model fits as well); H1: at least one of the two coefficients differs from zero.
anova(fit2_clean, fit3)
## Analysis of Variance Table
##
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Parking + Balcony
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 77 5077362
## 2 75 4791128 2 286234 2.2403 0.1135
We do not reject H0 at the 5% significance level (p = 0.1135) and conclude that adding Parking and Balcony together does not significantly improve the model compared to using only Age and Distance.
The ANOVA comparison between model fit2_clean (Age, Distance) and model fit3 (Age, Distance, Parking, Balcony) shows that the overall improvement of adding Parking and Balcony is not statistically significant (F(2,75) = 2.24, p = 0.1135).
This means that, taken together, Parking and Balcony do not significantly improve the model compared to using only Age and Distance.
However, from the regression results of fit3, we see that Parking
has a significant positive effect, while Balcony does not. The ANOVA
result reflects the joint test, which is diluted because Balcony
contributes very little.
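A hypothetical follow-up to isolate Parking's contribution would add Parking alone and repeat the comparison (a sketch, not run above):
# model with Parking but without Balcony
fit_parking <- lm(Price ~ Age + Distance + Parking, data = apart_clean)
anova(fit2_clean, fit_parking)
# given Parking's significant t-test in fit3, this single-variable F-test
# would be expected to come out significant, unlike the diluted joint test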
Interpretation of categorical variables (based on the summary(fit3) output above)
Apartments with parking are estimated to have prices about 128.7 EUR/m² higher than apartments without parking, holding all other variables constant. This effect is statistically significant at the 5% level.
Apartments with a balcony are estimated to have prices about 6.0
EUR/m² higher than those without a balcony, holding other variables
constant. This effect is not statistically significant (p ≫ 0.05), so we
cannot conclude balconies have an effect on price.
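The dummy coding can be made concrete with a small prediction example (hypothetical inputs; the gap between the two predictions equals the ParkingYes coefficient):
# two otherwise-identical hypothetical apartments, with and without parking
new_apts <- data.frame(Age = c(10, 10), Distance = c(5, 5),
Parking = factor(c("No", "Yes"), levels = c("No", "Yes")),
Balcony = factor(c("No", "No"), levels = c("No", "Yes")))
predict(fit3, newdata = new_apts)
# the predictions differ by the ParkingYes coefficient, about 128.7 EUR/m²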
Hypotheses for the F-statistic: H0: all slope coefficients are zero (β_Age = β_Distance = β_Parking = β_Balcony = 0); H1: at least one coefficient differs from zero.
With F(4, 75) = 24.08 and p < 0.001, we reject H0 at the 5% significance level and conclude that the regression model with Age, Distance, Parking, and Balcony provides a significantly better fit than an intercept-only model.
apart_clean$fitted_values <- fitted(fit3)
# Save residuals from fit3
apart_clean$residuals <- resid(fit3)
# Extract results for apartment with ID = 2
apart_clean[2, c("Price", "fitted_values", "residuals")]
## # A tibble: 1 × 3
## Price fitted_values residuals
## <dbl> <dbl> <dbl>
## 1 2800 2357. 443.
Interpretation: this apartment's actual price is 2800 EUR/m², while the model predicts about 2357 EUR/m². The positive residual of about 443 EUR/m² means the apartment is considerably more expensive than the model predicts given its age, distance, parking, and balcony.