TASK 1

IMDB - Top 250 Movies of All Time - DATA ANALYSIS

The dataset consists of 250 movies from the IMDb Top 250 list. Each observation corresponds to a single film, so the unit of analysis is the individual movie.

The dataset consists of the following variables:

  • Rank
    • Type: Ordinal
    • Description: Placement of the movie in IMDb Top 250
    • Units: None

  • Title
    • Type: Nominal
    • Description: Official movie title
    • Units: Not applicable

  • Release
    • Type: Interval
    • Description: Calendar year when the movie was released
    • Units: Years (e.g., 1994)

  • Runtime
    • Type: Ratio
    • Description: Length of the movie’s running time
    • Units: Hours and minutes (e.g., 2h 22m)

  • Rated
    • Type: Ordinal
    • Description: Content rating for audience suitability
    • Units: MPAA / certification categories

  • Ratings
    • Type: Interval
    • Description: IMDb average score given by users
    • Units: 1–10 scale (with decimals)

mydata <- read.table("./IMDB_Movies.csv", 
                     header = TRUE, 
                     sep = ",", 
                     dec = ".",
                     quote = "\"",
                     stringsAsFactors = FALSE,
                     fill = TRUE,
                     comment.char = "")
head(mydata)
##   Rank                                         Title Release Runtime    Rated
## 1    1                      The Shawshank Redemption    1994  2h 22m        R
## 2    2                                 The Godfather    1972  2h 55m        R
## 3    3                               The Dark Knight    2008  2h 32m    PG-13
## 4    4                         The Godfather Part II    1974  3h 22m        R
## 5    5                                  12 Angry Men    1957  1h 36m Approved
## 6    6 The Lord of the Rings: The Return of the King    2003  3h 21m    PG-13
##   Ratings
## 1     9.3
## 2     9.2
## 3     9.0
## 4     9.0
## 5     9.0
## 6     9.0

CREATING A NEW VARIABLE

First, I transformed the original Runtime variable (given in hours and minutes, e.g. “2h 22m”) into a numeric variable expressed in total minutes. I stored this as a new variable called Runtime_minutes. This makes future manipulations and calculations much easier, since runtimes are now stored as numbers rather than text. Finally, I deleted the original Runtime column because it was no longer needed.

# split "2h 22m" into an hour part (before "h") and a minute part (after "h")
hours <- as.numeric(gsub("h.*", "", mydata$Runtime))
minutes <- as.numeric(gsub(".*h |m", "", mydata$Runtime))
# runtimes given as whole hours have no minute part; count them as 0 minutes
minutes[is.na(minutes)] <- 0
mydata$Runtime_minutes <- hours * 60 + minutes

head(mydata)
##   Rank                                         Title Release Runtime    Rated
## 1    1                      The Shawshank Redemption    1994  2h 22m        R
## 2    2                                 The Godfather    1972  2h 55m        R
## 3    3                               The Dark Knight    2008  2h 32m    PG-13
## 4    4                         The Godfather Part II    1974  3h 22m        R
## 5    5                                  12 Angry Men    1957  1h 36m Approved
## 6    6 The Lord of the Rings: The Return of the King    2003  3h 21m    PG-13
##   Ratings Runtime_minutes
## 1     9.3             142
## 2     9.2             175
## 3     9.0             152
## 4     9.0             202
## 5     9.0              96
## 6     9.0             201
mydata$Runtime <- NULL

head(mydata)
##   Rank                                         Title Release    Rated Ratings
## 1    1                      The Shawshank Redemption    1994        R     9.3
## 2    2                                 The Godfather    1972        R     9.2
## 3    3                               The Dark Knight    2008    PG-13     9.0
## 4    4                         The Godfather Part II    1974        R     9.0
## 5    5                                  12 Angry Men    1957 Approved     9.0
## 6    6 The Lord of the Rings: The Return of the King    2003    PG-13     9.0
##   Runtime_minutes
## 1             142
## 2             175
## 3             152
## 4             202
## 5              96
## 6             201

DELETING UNITS WITH MISSING OR UNWANTED DATA

To improve data quality, I deleted all units where the variable Rated was equal to “Not Rated”. These observations do not provide a meaningful classification and could bias the analysis. By removing them, the dataset now only contains movies with valid rating categories.

mydata <- mydata[mydata$Rated != "Not Rated", ]

nrow(mydata)
## [1] 221

RENAMING VARIABLES FOR CLARITY

To simplify the dataset, I renamed Release to Year and Ratings to Rating. Using shorter and clearer variable names reduces the chance of mistakes and makes later analysis easier to read and interpret.

names(mydata)[names(mydata) == "Release"] <- "Year"
names(mydata)[names(mydata) == "Ratings"] <- "Rating"

head(mydata)
##   Rank                                         Title Year    Rated Rating
## 1    1                      The Shawshank Redemption 1994        R    9.3
## 2    2                                 The Godfather 1972        R    9.2
## 3    3                               The Dark Knight 2008    PG-13    9.0
## 4    4                         The Godfather Part II 1974        R    9.0
## 5    5                                  12 Angry Men 1957 Approved    9.0
## 6    6 The Lord of the Rings: The Return of the King 2003    PG-13    9.0
##   Runtime_minutes
## 1             142
## 2             175
## 3             152
## 4             202
## 5              96
## 6             201

CREATING A NEW DATA FRAME

I created a new data frame called movies_post1990, which includes only movies released after 1990 (the filter Year > 1990 keeps films from 1991 onward). I will use this subset for further analysis because the movie industry before the 1990s was very different (for example, in budgets and production).

movies_post1990 <- mydata[mydata$Year > 1990, ]

head(movies_post1990)
##   Rank                                             Title Year Rated Rating
## 1    1                          The Shawshank Redemption 1994     R    9.3
## 3    3                                   The Dark Knight 2008 PG-13    9.0
## 6    6     The Lord of the Rings: The Return of the King 2003 PG-13    9.0
## 7    7                                  Schindler's List 1993     R    9.0
## 8    8                                      Pulp Fiction 1994     R    8.9
## 9    9 The Lord of the Rings: The Fellowship of the Ring 2001 PG-13    8.9
##   Runtime_minutes
## 1             142
## 3             152
## 6             201
## 7             195
## 8             154
## 9             178

GENERAL DESCRIPTIVE STATISTICS OVERVIEW FOR Runtime_minutes AND Rating IN movies_post1990

summary(movies_post1990$Runtime_minutes)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    80.0   111.0   129.5   130.5   148.2   201.0
summary(movies_post1990$Rating)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.000   8.100   8.250   8.323   8.500   9.300
library(psych)
describe(movies_post1990[, c("Runtime_minutes", "Rating")])
##                 vars   n   mean    sd median trimmed   mad min   max range skew
## Runtime_minutes    1 132 130.45 26.30 129.50  129.13 27.43  80 201.0 121.0 0.39
## Rating             2 132   8.32  0.25   8.25    8.29  0.22   8   9.3   1.3 1.29
##                 kurtosis   se
## Runtime_minutes    -0.43 2.29
## Rating              1.57 0.02

SPECIFIC DESCRIPTIVE STATISTICS: MEAN, MEDIAN AND MODE IN movies_post1990

MODE (Year)

I calculated the mode of the Year variable in the movies_post1990 dataset. The mode is the release year with the highest number of movies in the dataset. Since base R has no function for the statistical mode (its built-in mode() returns an object's storage mode instead), I counted how many movies were released in each year, ordered the years from most common to least common, and selected the most frequent year as the mode.

year_mode <- as.numeric(names(sort(table(movies_post1990$Year), 
                                   decreasing = TRUE)[1]))

year_mode
## [1] 1995

MEAN (Rating)

mean_rating <- mean(movies_post1990$Rating,
                    na.rm = TRUE)

mean_rating
## [1] 8.323485

MEDIAN (Runtime_minutes)

median_runtime <- median(movies_post1990$Runtime_minutes,
                         na.rm = TRUE)

median_runtime
## [1] 129.5

GRAPHICAL VISUALIZATION OF SELECTED VARIABLES

HISTOGRAMS: Rating and Runtime_minutes

library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha
ggplot(movies_post1990, aes(x = Rating)) +
  geom_histogram(binwidth = 0.1, fill = "skyblue", color = "black") +
  labs(title = "Distribution of Ratings (movies_post1990)",
       x = "Rating", y = "Count")

ggplot(movies_post1990, aes(x = Runtime_minutes)) +
  geom_histogram(binwidth = 10, fill = "lightgreen", color = "black") +
  labs(title = "Distribution of Runtime (movies_post1990)",
       x = "Runtime (minutes)", y = "Count")

Histogram of Ratings:

  • The histogram shows that ratings are concentrated between 8.0 and 8.5, with fewer movies rated above 9.0. The distribution is skewed to the right.

Histogram of Runtime:

  • The histogram shows that most movies have runtimes between 100 and 150 minutes. The distribution appears fairly balanced, suggesting runtimes could be approximately normally distributed, centered around two hours.

SCATTERPLOT: Runtime_minutes vs Rating

ggplot(movies_post1990, aes(x = Runtime_minutes, y = Rating)) +
  geom_point(alpha = 0.6, color = "blue") +
  labs(title = "Scatterplot: Rating vs Runtime (movies_post1990)",
       x = "Runtime (minutes)", y = "Rating")

The scatterplot shows the relationship between runtime and rating. Movies are spread across runtimes of 80–200 minutes, while ratings remain consistently high (8.0–9.3). There is no clear upward or downward trend, meaning longer movies are not necessarily rated higher.
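
As a numeric complement to the visual impression, the linear correlation can be computed; a quick sketch (output omitted here), where a value near zero would support the conclusion of no clear trend:

# Pearson correlation between runtime and rating
cor(movies_post1990$Runtime_minutes, movies_post1990$Rating,
    use = "complete.obs")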

BOXPLOT: Distribution of Movie Ratings

ggplot(movies_post1990, aes(y = Rating)) +
  geom_boxplot(fill = "orange", color = "black") +
  labs(title = "Boxplot of Ratings (movies_post1990)",
       y = "Rating") + scale_y_continuous(breaks = seq(8, 9.5, 0.25))

The boxplot shows that most post-1990 movies have very high ratings, clustered between 8.1 and 8.5. The median rating is around 8.3, meaning a typical movie is highly rated. The small spread reflects little variation. One outlier above 9.0 represents an exceptionally rated film - The Shawshank Redemption (9.3).
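
The outlier can also be identified programmatically with the boxplot rule (values beyond the whiskers at 1.5 times the IQR); a quick sketch:

# ratings flagged as outliers by the boxplot rule, with their titles
out_vals <- boxplot.stats(movies_post1990$Rating)$out
movies_post1990[movies_post1990$Rating %in% out_vals, c("Title", "Rating")]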

TASK 2

library(readxl)

mba_data <- read_excel("./Business_School.xlsx")

head(mba_data)
## # A tibble: 6 × 9
##   `Student ID` `Undergrad Degree` `Undergrad Grade` `MBA Grade`
##          <dbl> <chr>                          <dbl>       <dbl>
## 1            1 Business                        68.4        90.2
## 2            2 Computer Science                70.2        68.7
## 3            3 Finance                         76.4        83.3
## 4            4 Business                        82.6        88.7
## 5            5 Finance                         76.9        75.4
## 6            6 Computer Science                83.3        82.1
## # ℹ 5 more variables: `Work Experience` <chr>, `Employability (Before)` <dbl>,
## #   `Employability (After)` <dbl>, Status <chr>, `Annual Salary` <dbl>
str(mba_data)
## tibble [100 × 9] (S3: tbl_df/tbl/data.frame)
##  $ Student ID            : num [1:100] 1 2 3 4 5 6 7 8 9 10 ...
##  $ Undergrad Degree      : chr [1:100] "Business" "Computer Science" "Finance" "Business" ...
##  $ Undergrad Grade       : num [1:100] 68.4 70.2 76.4 82.6 76.9 83.3 76 82.8 76 76.9 ...
##  $ MBA Grade             : num [1:100] 90.2 68.7 83.3 88.7 75.4 82.1 66.9 76.8 72.3 72.4 ...
##  $ Work Experience       : chr [1:100] "No" "Yes" "No" "No" ...
##  $ Employability (Before): num [1:100] 252 101 401 287 275 254 117 219 152 228 ...
##  $ Employability (After) : num [1:100] 276 119 462 342 347 313 163 304 211 286 ...
##  $ Status                : chr [1:100] "Placed" "Placed" "Placed" "Placed" ...
##  $ Annual Salary         : num [1:100] 111000 107000 109000 148000 255500 ...

DISTRIBUTION OF UNDERGRAD DEGREES

library(ggplot2)

ggplot(mba_data, aes(x = `Undergrad Degree`)) +
  geom_bar(fill = "wheat2", color = "black") +
  geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.3) +
  labs(title = "Distribution of Undergraduate Degrees",
       x = "Undergraduate Degree", y = "Count") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

The bar chart shows the distribution of undergraduate degrees among the MBA students. The most common degree is Business, with 35 students.

DESCRIPTIVE STATISTICS AND DISTRIBUTION OF ANNUAL SALARY

summary(mba_data$"Annual Salary")
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   20000   87125  103500  109058  124000  340000
library(psych)
describe(mba_data$"Annual Salary")
##    vars   n   mean       sd median  trimmed     mad   min    max  range skew
## X1    1 100 109058 41501.49 103500 104600.2 25945.5 20000 340000 320000 2.22
##    kurtosis      se
## X1     9.41 4150.15

The Annual Salary variable has a mean of 109,058 and a median of 103,500. The mean lies above the median and the skewness is 2.22, so the distribution is clearly right-skewed, pulled upward by a few high salaries. The middle half of students earn between 87,125 (1st quartile) and 124,000 (3rd quartile). Salaries range from a minimum of 20,000 to a maximum of 340,000, showing the presence of a few extremely high values that stretch the upper tail of the distribution.
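
To see which observations stretch the upper tail, the unusually high salaries can be listed directly; the 200,000 cutoff here is chosen purely for illustration, not as a formal outlier rule:

# inspect the extreme upper tail of the salary distribution
mba_data[mba_data$`Annual Salary` > 200000,
         c("Student ID", "Undergrad Degree", "Annual Salary")]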

HISTOGRAM OF ANNUAL SALARY

library(ggplot2)

ggplot(mba_data, aes(x = `Annual Salary`)) +
  geom_histogram(binwidth = 5000, fill = "limegreen", color = "black") +
  labs(title = "Distribution of Annual Salary",
       x = "Annual Salary",
       y = "Count") +
  scale_x_continuous(breaks = seq(0, 350000, 50000),
                     labels = function(x) format(x, big.mark = ",", scientific = FALSE)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

The histogram shows that most MBA graduates earn between 80,000 and 130,000, with the distribution concentrated around 100,000. The shape is clearly right-skewed, with a few students earning very high salaries (over 200,000 and up to 340,000).

HYPOTHESIS TESTING: MBA GRADE VS PREVIOUS YEAR AVERAGE

We are testing whether the average MBA grade of current students is equal to the previous year’s average of 74.

  • H0: μ = 74 (The mean MBA grade equals 74)
  • H1: μ ≠ 74 (The mean MBA grade is different from 74)
t_test_result <- t.test(mba_data$`MBA Grade`,
                        mu = 74)

t_test_result
## 
##  One Sample t-test
## 
## data:  mba_data$`MBA Grade`
## t = 2.6587, df = 99, p-value = 0.00915
## alternative hypothesis: true mean is not equal to 74
## 95 percent confidence interval:
##  74.51764 77.56346
## sample estimates:
## mean of x 
##  76.04055

We reject the H0 at the 0.05 significance level (p = 0.009) and conclude that the average MBA grade of current students (76.04) differs from the previous year's average of 74; since the sample mean is above 74, the current cohort performs better.

sd(mba_data$`MBA Grade`, na.rm = TRUE)
## [1] 7.675114
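
As a quick arithmetic check, the t statistic can be reproduced by hand from the mean, the hypothesized value, and the standard error (a sketch; output omitted):

# t = (sample mean - mu0) / (s / sqrt(n))
n <- sum(!is.na(mba_data$`MBA Grade`))
(mean(mba_data$`MBA Grade`, na.rm = TRUE) - 74) /
  (sd(mba_data$`MBA Grade`, na.rm = TRUE) / sqrt(n))   # matches t = 2.6587 above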

EFFECT SIZE

We calculated the effect size using Hedges’ g, which adjusts Cohen’s d for small sample bias.

library(effsize)
## 
## Attaching package: 'effsize'
## The following object is masked from 'package:psych':
## 
##     cohen.d
x <- mba_data$`MBA Grade`
y <- rep(74, length(x))

cohen_d_result <- cohen.d(x, y, paired = TRUE, hedges.correction = TRUE)

cohen_d_result
## 
## Hedges's g
## 
## g estimate: 0.3731353 (small)
## 95 percent confidence interval:
##      lower      upper 
## 0.08689656 0.65937412

The effect size (Hedges’ g = 0.37) indicates a small effect, suggesting that the improvement, while statistically significant, is modest in practical terms.
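
As a cross-check, the classical one-sample standardized mean difference (mean minus the hypothesized value, divided by the standard deviation) can be computed directly; effsize's paired formulation uses a different denominator, so the two estimates need not coincide exactly, but both fall in the "small" range:

# classical one-sample effect size: d = (mean - mu0) / sd, roughly 0.27 here
(mean(mba_data$`MBA Grade`, na.rm = TRUE) - 74) /
  sd(mba_data$`MBA Grade`, na.rm = TRUE)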

TASK 3

Import the dataset Apartments.xlsx

library(readxl)

apart <- read_excel("./Apartments.xlsx")

head(apart)
## # A tibble: 6 × 5
##     Age Distance Price Parking Balcony
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl>
## 1     7       28  1640       0       1
## 2    18        1  2800       1       0
## 3     7       28  1660       0       0
## 4    28       29  1850       0       1
## 5    18       18  1640       1       1
## 6    28       12  1770       0       1
str(apart)
## tibble [85 × 5] (S3: tbl_df/tbl/data.frame)
##  $ Age     : num [1:85] 7 18 7 28 18 28 14 18 22 25 ...
##  $ Distance: num [1:85] 28 1 28 29 18 12 20 6 7 2 ...
##  $ Price   : num [1:85] 1640 2800 1660 1850 1640 1770 1850 1970 2270 2570 ...
##  $ Parking : num [1:85] 0 1 0 0 1 0 0 1 1 1 ...
##  $ Balcony : num [1:85] 1 0 0 1 1 1 1 1 0 0 ...

Description:

  • Age: Age of an apartment in years
  • Distance: The distance from city center in km
  • Price: Price per m2
  • Parking: 0-No, 1-Yes
  • Balcony: 0-No, 1-Yes

Change categorical variables into factors.

str(apart)
## tibble [85 × 5] (S3: tbl_df/tbl/data.frame)
##  $ Age     : num [1:85] 7 18 7 28 18 28 14 18 22 25 ...
##  $ Distance: num [1:85] 28 1 28 29 18 12 20 6 7 2 ...
##  $ Price   : num [1:85] 1640 2800 1660 1850 1640 1770 1850 1970 2270 2570 ...
##  $ Parking : num [1:85] 0 1 0 0 1 0 0 1 1 1 ...
##  $ Balcony : num [1:85] 1 0 0 1 1 1 1 1 0 0 ...
# Parking and Balcony are coded 0/1; convert them to labeled factors
apart$Parking <- factor(apart$Parking, levels = c(0, 1), labels = c("No", "Yes"))
apart$Balcony <- factor(apart$Balcony, levels = c(0, 1), labels = c("No", "Yes"))

str(apart)
## tibble [85 × 5] (S3: tbl_df/tbl/data.frame)
##  $ Age     : num [1:85] 7 18 7 28 18 28 14 18 22 25 ...
##  $ Distance: num [1:85] 28 1 28 29 18 12 20 6 7 2 ...
##  $ Price   : num [1:85] 1640 2800 1660 1850 1640 1770 1850 1970 2270 2570 ...
##  $ Parking : Factor w/ 2 levels "No","Yes": 1 2 1 1 2 1 1 2 2 2 ...
##  $ Balcony : Factor w/ 2 levels "No","Yes": 2 1 1 2 2 2 2 2 1 1 ...
table(apart$Parking)
## 
##  No Yes 
##  42  43
table(apart$Balcony)
## 
##  No Yes 
##  48  37

Test the hypothesis H0: μ_Price = 1900 EUR. What can you conclude?

We are testing whether the average price per m² of apartments in Ljubljana is equal to the hypothesized market average of 1900 EUR.

  • H0: μ = 1900 (The mean price equals 1900 EUR)
  • H1: μ ≠ 1900 (The mean price is different from 1900 EUR)
t_test_result <- t.test(apart$Price,
                        mu = 1900)

t_test_result
## 
##  One Sample t-test
## 
## data:  apart$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
##  1937.443 2100.440
## sample estimates:
## mean of x 
##  2018.941

We reject the null hypothesis at the 0.05 significance level (p = 0.005) and conclude that the average price per m² of apartments (2018.94 EUR) differs from the hypothesized average of 1900 EUR.

Estimate the simple regression function Price = f(Age). Save the results in an object fit1 and explain the estimated regression coefficient, the coefficient of correlation, and the coefficient of determination.

fit1 <- lm(Price ~ Age, data = apart)

summary(fit1)
## 
## Call:
## lm(formula = Price ~ Age, data = apart)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -623.9 -278.0  -69.8  243.5  776.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2185.455     87.043  25.108   <2e-16 ***
## Age           -8.975      4.164  -2.156    0.034 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared:  0.05302,    Adjusted R-squared:  0.04161 
## F-statistic: 4.647 on 1 and 83 DF,  p-value: 0.03401
# sqrt(R^2) gives only the magnitude of r; the sign follows the slope (negative)
correlation_r <- -sqrt(summary(fit1)$r.squared)
correlation_r
## [1] -0.230255

Explanation:

  • The intercept (2185.46) represents the estimated price per m² for a new apartment (Age = 0).
  • The regression coefficient for Age is −8.98, meaning the price decreases by approximately 8.98 EUR/m² for each additional year of age.
  • The coefficient is statistically significant (p = 0.034 < 0.05).
  • The coefficient of determination (R²) is 0.053, meaning only 5.3% of the variability in apartment price is explained by age.
  • The correlation coefficient is approximately −0.23, indicating a weak negative linear relationship between age and price. A small prediction sketch follows below.
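
To make the estimates concrete, the fitted equation can be evaluated for an arbitrary illustrative value, Age = 20:

# predicted price per m2 for a 20-year-old apartment:
# 2185.455 - 8.975 * 20 = 2005.955, i.e. roughly 2006 EUR/m2
predict(fit1, newdata = data.frame(Age = 20))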

Show the scatterplot matrix between Price, Age and Distance. Based on the matrix, determine whether there is a potential problem with multicollinearity.

pairs(~ Price + Age + Distance, data = apart)

The plots show the following:

  • There is a weak negative relationship between Price and Age, as expected (older apartments tend to be cheaper).
  • There is a slight negative relationship between Price and Distance, suggesting that apartments farther from the center may also be cheaper.
  • Importantly, the scatterplot between Age and Distance does not show a strong linear pattern - the data points are fairly scattered.

Conclusion: Based on the scatterplot matrix, there is no strong linear correlation between Age and Distance, which means there is no evidence of problematic multicollinearity between these predictors.
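
A correlation matrix offers a numeric counterpart to the visual check (a sketch; output omitted, but a cor(Age, Distance) value near zero would confirm the impression):

# pairwise correlations among the numeric variables
cor(apart[, c("Price", "Age", "Distance")])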

Estimate the multiple regression function: Price = f(Age, Distance). Save it in object named fit2.

fit2 <- lm(Price ~ Age + Distance, data = apart)

summary(fit2)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = apart)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -603.23 -219.94  -85.68  211.31  689.58 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2460.101     76.632   32.10  < 2e-16 ***
## Age           -7.934      3.225   -2.46    0.016 *  
## Distance     -20.667      2.748   -7.52 6.18e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared:  0.4396, Adjusted R-squared:  0.4259 
## F-statistic: 32.16 on 2 and 82 DF,  p-value: 4.896e-11
  • The intercept (2460.10) is the estimated price per m² for an apartment that is 0 years old and located at the city center.
  • The coefficient for Age is −7.93, meaning that for each additional year of apartment age, the price decreases by approximately 7.93 EUR/m², holding Distance constant. This coefficient is statistically significant (p = 0.016).
  • The coefficient for Distance is −20.67, indicating that each additional kilometer from the city center is associated with a 20.67 EUR/m² decrease in price, holding Age constant. This is highly statistically significant (p < 0.001). A short prediction sketch follows below.
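
Using arbitrary illustrative values, Age = 10 and Distance = 5:

# 2460.101 - 7.934 * 10 - 20.667 * 5 is approximately 2277.4 EUR/m2
predict(fit2, newdata = data.frame(Age = 10, Distance = 5))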

Check the multicollinearity with VIF statistics. Explain the findings.

library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:psych':
## 
##     logit
vif(fit2)
##      Age Distance 
## 1.001845 1.001845
  • Both Age and Distance have VIF values of 1.0018, which are very close to 1.
  • This indicates that there is no multicollinearity between the predictors, and the model estimates are stable in this respect; the computation behind VIF is sketched below.
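
The VIF of a predictor equals 1/(1 − R²_j), where R²_j comes from regressing that predictor on the remaining ones; a minimal sketch for Age:

# VIF(Age) = 1 / (1 - R^2 of Age regressed on Distance); should match 1.001845
r2_age <- summary(lm(Age ~ Distance, data = apart))$r.squared
1 / (1 - r2_age)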

Calculate standardized residuals and Cook's distances for model fit2. Remove any potentially problematic units (outliers or units with high influence).

# Step 1: Standardized residuals
standard_resid <- rstandard(fit2)

# Step 2: Cook's distances
cooks_dist <- cooks.distance(fit2)

# Step 3: Combine into a data frame to inspect
diagnostics <- data.frame(
  ID = 1:nrow(apart),
  StdResid = standard_resid,
  CooksD = cooks_dist
)

# Step 4: Identify potential outliers and high influence points
# Outliers: standardized residuals > |2|
outliers <- diagnostics[abs(diagnostics$StdResid) > 2, ]

# Influential: Cook’s D > 4/n (rule of thumb)
cutoff <- 4 / nrow(apart)
influential <- diagnostics[diagnostics$CooksD > cutoff, ]

# View results
outliers
##    ID  StdResid     CooksD
## 33 33  2.050586 0.06913379
## 38 38  2.576772 0.31973058
## 53 53 -2.151787 0.06625775
influential
##    ID  StdResid     CooksD
## 22 22  1.575982 0.06086868
## 33 33  2.050586 0.06913379
## 38 38  2.576772 0.31973058
## 53 53 -2.151787 0.06625775
## 55 55  1.444768 0.10420445

We calculated standardized residuals and Cook’s Distances for the multiple regression model. Five observations (IDs: 22, 33, 38, 53, 55) were identified as outliers or highly influential, based on:

  • |Standardized residual| > 2 (outliers)
  • Cook’s Distance > 4/n (influence)
problematic_ids <- union(outliers$ID, influential$ID)
apart_clean <- apart[-problematic_ids, ]

Check for potential heteroskedasticity with a scatterplot of standardized residuals against standardized fitted values. Explain the findings.

fit2_clean <- lm(Price ~ Age + Distance, data = apart_clean)

std_resid <- rstandard(fit2_clean)
std_fitted <- scale(fitted(fit2_clean))  # standardized fitted values

# Step 2: Scatterplot
plot(std_fitted, std_resid,
     main = "Standardized Residuals vs Standardized Fitted Values",
     xlab = "Standardized Fitted Values",
     ylab = "Standardized Residuals",
     pch = 19, col = "darkgray")
abline(h = 0, col = "blue", lwd = 2)

  • The scatterplot of standardized residuals vs standardized fitted values does not show any clear pattern or funnel shape.
  • The residuals appear randomly scattered around the horizontal line at zero, which suggests that the assumption of homoskedasticity (constant variance) is met; a formal test is sketched below.
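
The Breusch-Pagan test is one formal complement to the visual check, assuming the lmtest package is installed (a sketch; output omitted):

# Breusch-Pagan test: H0 = constant error variance (homoskedasticity)
library(lmtest)
bptest(fit2_clean)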

Are the standardized residuals distributed normally? Show the graph and formally test it. Explain the findings.

# Extract standardized residuals
std_resid <- rstandard(fit2_clean)

# Plot: Histogram + QQ Plot
hist(std_resid,
     main = "Histogram of Standardized Residuals",
     xlab = "Standardized Residuals",
     col = "royalblue",
     breaks = 10)

qqnorm(std_resid,
       main = "Q-Q Plot of Standardized Residuals")
qqline(std_resid, col = "red", lwd = 2)

# Formal test: Shapiro-Wilk normality test
shapiro.test(std_resid)
## 
##  Shapiro-Wilk normality test
## 
## data:  std_resid
## W = 0.94156, p-value = 0.001168

Explanation:

  • The histogram appears approximately bell-shaped, though not perfectly normal.
  • The Q-Q plot shows that most points follow the 45° line, but there are noticeable deviations in the tails.

Shapiro-Wilk Test

  • H0: residuals are normally distributed
  • H1: residuals are not normally distributed

The Shapiro-Wilk test produced a test statistic of W = 0.942 with p = 0.001, which is below 0.05. We therefore reject the H0 of normality and conclude that the standardized residuals deviate from a normal distribution, mainly in the tails.

Estimate fit2 again without the excluded units and show the summary of the model. Explain all coefficients.

fit2_clean <- lm(Price ~ Age + Distance, data = apart_clean)

summary(fit2_clean)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = apart_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -411.50 -203.69  -45.24  191.11  492.56 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2502.467     75.024  33.356  < 2e-16 ***
## Age           -8.674      3.221  -2.693  0.00869 ** 
## Distance     -24.063      2.692  -8.939 1.57e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 256.8 on 77 degrees of freedom
## Multiple R-squared:  0.5361, Adjusted R-squared:  0.524 
## F-statistic: 44.49 on 2 and 77 DF,  p-value: 1.437e-13

Intercept (2502.47, p < 0.001): Represents the predicted price per m² of an apartment that is brand new (Age = 0) and located in the city center (Distance = 0). The baseline predicted price is about 2502 EUR/m².

Age (-8.67, p = 0.009): Holding distance constant, each additional year of apartment age reduces the price by about 8.67 EUR/m². This effect is statistically significant, confirming that older apartments are cheaper.

Distance (-24.06, p < 0.001): Holding age constant, each additional kilometer farther from the city center decreases the price by about 24.06 EUR/m². This effect is highly significant and stronger than age.

Model fit (R² = 0.536, Adjusted R² = 0.524): About 53% of the variation in apartment prices is explained by the model, which indicates a fairly good fit for cross-sectional housing data.

F-statistic (44.49, p < 0.001): The regression model as a whole is statistically significant, meaning that age and distance together explain a substantial share of price variation.

Both apartment age and distance from the city center significantly and negatively affect price per m².

Estimate the linear regression function Price = f(Age, Distance, Parking and Balcony). Be careful to correctly include categorical variables. Save the object named fit3.

fit3 <- lm(Price ~ Age + Distance + Parking + Balcony, data = apart_clean)

summary(fit3)
## 
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = apart_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -390.93 -198.19  -53.64  186.73  518.34 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2393.316     93.930  25.480  < 2e-16 ***
## Age           -7.970      3.191  -2.498   0.0147 *  
## Distance     -21.961      2.830  -7.762 3.39e-11 ***
## ParkingYes   128.700     60.801   2.117   0.0376 *  
## BalconyYes     6.032     57.307   0.105   0.9165    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 252.7 on 75 degrees of freedom
## Multiple R-squared:  0.5623, Adjusted R-squared:  0.5389 
## F-statistic: 24.08 on 4 and 75 DF,  p-value: 7.764e-13

Intercept (2393.32, p < 0.001): Baseline predicted price for an apartment that is brand new (Age = 0), located in the city center (Distance = 0), without parking and without a balcony.

Age (-7.97, p = 0.015): Each additional year of apartment age reduces the price by about 7.97 EUR/m², holding distance, parking, and balcony constant. Older apartments are cheaper, and the effect is statistically significant.

Distance (-21.96, p < 0.001): Each additional kilometer farther from the city center reduces the price by about 21.96 EUR/m², controlling for other factors. Strong negative effect, highly significant.

Parking (128.70, p = 0.038): Apartments with parking are on average 128.70 EUR/m² more expensive than those without, controlling for age, distance, and balcony. Parking has a positive and significant impact on prices.

Balcony (6.03, p = 0.916): Apartments with a balcony are on average only 6 EUR/m² more expensive, but the effect is not statistically significant (p ≈ 0.916). Balcony does not significantly affect apartment prices in this dataset.
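
The two factor coefficients come from R's default treatment (dummy) coding, in which the first level ("No") serves as the reference category; the coding can be inspected directly (a sketch; output omitted):

# the non-reference level ("Yes") becomes a 0/1 dummy column in the design matrix
contrasts(apart_clean$Parking)
head(model.matrix(fit3))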

With function anova check if model fit3 fits data better than model fit2.

Hypotheses for ANOVA comparison

  • H0: β_Parking = β_Balcony = 0 (Adding Parking and Balcony does not improve the model)
  • H1: At least one of β_Parking, β_Balcony ≠ 0 (At least one improves the model)
anova(fit2_clean, fit3)
## Analysis of Variance Table
## 
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Parking + Balcony
##   Res.Df     RSS Df Sum of Sq      F Pr(>F)
## 1     77 5077362                           
## 2     75 4791128  2    286234 2.2403 0.1135

We do not reject the H0 at the 0.05 significance level (p = 0.1135) and conclude that adding Parking and Balcony together does not significantly improve the model compared to using only Age and Distance.

  • The ANOVA comparison between model fit2_clean (Age, Distance) and model fit3 (Age, Distance, Parking, Balcony) shows that the overall improvement from adding Parking and Balcony is not statistically significant (F(2,75) = 2.24, p = 0.1135); the F value is reproduced by hand in the sketch after this list.

  • This means that, taken together, Parking and Balcony do not significantly improve the model compared to using only Age and Distance.

  • However, from the regression results of fit3, we see that Parking has a significant positive effect, while Balcony does not. The ANOVA result reflects the joint test, which is diluted because Balcony contributes very little.
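
Reproducing the partial F statistic from the two residual sums of squares in the ANOVA table:

# F = ((RSS_restricted - RSS_full) / q) / (RSS_full / df_full)
((5077362 - 4791128) / 2) / (4791128 / 75)   # = 2.2403, as reported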

Show the results of fit3 and explain regression coefficient for both categorical variables. Can you write down the hypothesis which is being tested with F-statistics, shown at the bottom of the output?

summary(fit3)
## 
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = apart_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -390.93 -198.19  -53.64  186.73  518.34 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2393.316     93.930  25.480  < 2e-16 ***
## Age           -7.970      3.191  -2.498   0.0147 *  
## Distance     -21.961      2.830  -7.762 3.39e-11 ***
## ParkingYes   128.700     60.801   2.117   0.0376 *  
## BalconyYes     6.032     57.307   0.105   0.9165    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 252.7 on 75 degrees of freedom
## Multiple R-squared:  0.5623, Adjusted R-squared:  0.5389 
## F-statistic: 24.08 on 4 and 75 DF,  p-value: 7.764e-13

Interpretation of categorical variables

  • ParkingYes = 128.700 (p = 0.0376)

Apartments with parking are estimated to have prices about 128.7 EUR/m² higher than apartments without parking, holding all other variables constant. This effect is statistically significant at the 5% level.

  • BalconyYes = 6.032 (p = 0.9165)

Apartments with a balcony are estimated to have prices about 6.0 EUR/m² higher than those without a balcony, holding other variables constant. This effect is not statistically significant (p ≫ 0.05), so we cannot conclude balconies have an effect on price.

Hypotheses for the F-statistic

  • H0: β_Age = β_Distance = β_Parking = β_Balcony = 0 (None of the predictors have any effect on Price)
  • H1: At least one β_j ≠ 0 (At least one predictor significantly affects Price)

We reject H0 at the 0.05 significance level (p < 0.001) and conclude that the regression model with Age, Distance, Parking, and Balcony provides a significantly better fit than an intercept-only model.

Save the fitted values and calculate the residual for apartment ID2.

apart_clean$fitted_values <- fitted(fit3)

# Save residuals from fit3
apart_clean$residuals <- resid(fit3)

# Extract results for the apartment with ID = 2
# (rows 1-21 keep their original positions: the removed IDs are all >= 22)
apart_clean[2, c("Price", "fitted_values", "residuals")]
## # A tibble: 1 × 3
##   Price fitted_values residuals
##   <dbl>         <dbl>     <dbl>
## 1  2800         2357.      443.

Interpretation

  • The model predicted a price of about 2356.60 EUR/m² for apartment ID2.
  • The actual price was 2800 EUR/m², which is higher than predicted.
  • The residual is +443.40, meaning the model underestimates the price for this apartment.