Task 1:

Analysis of a dataset on IMDB scores of movies

I found this data on the Kaggle website. It contains 28 variables for 5043 movies, spanning 100 years and 66 countries.

#Importing the dataset and libraries

#Load libraries
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(psych)
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
library(pastecs)
## 
## Attaching package: 'pastecs'
## The following objects are masked from 'package:dplyr':
## 
##     first, last
# Import the full dataset (5043 movies, 28 variables) 
movies <- read.csv("C:/Users/Žabica/Desktop/IMB/bootcamp/R Take Home Exam 2025/movie_metadata.csv", stringsAsFactors = FALSE)

head(movies)
##   color     director_name num_critic_for_reviews duration
## 1 Color     James Cameron                    723      178
## 2 Color    Gore Verbinski                    302      169
## 3 Color        Sam Mendes                    602      148
## 4 Color Christopher Nolan                    813      164
## 5             Doug Walker                     NA       NA
## 6 Color    Andrew Stanton                    462      132
##   director_facebook_likes actor_3_facebook_likes     actor_2_name
## 1                       0                    855 Joel David Moore
## 2                     563                   1000    Orlando Bloom
## 3                       0                    161     Rory Kinnear
## 4                   22000                  23000   Christian Bale
## 5                     131                     NA       Rob Walker
## 6                     475                    530  Samantha Morton
##   actor_1_facebook_likes     gross                          genres
## 1                   1000 760505847 Action|Adventure|Fantasy|Sci-Fi
## 2                  40000 309404152        Action|Adventure|Fantasy
## 3                  11000 200074175       Action|Adventure|Thriller
## 4                  27000 448130642                 Action|Thriller
## 5                    131        NA                     Documentary
## 6                    640  73058679         Action|Adventure|Sci-Fi
##      actor_1_name                                             movie_title
## 1     CCH Pounder                                                 Avatar 
## 2     Johnny Depp               Pirates of the Caribbean: At World's End 
## 3 Christoph Waltz                                                Spectre 
## 4       Tom Hardy                                  The Dark Knight Rises 
## 5     Doug Walker Star Wars: Episode VII - The Force Awakens             
## 6    Daryl Sabara                                            John Carter 
##   num_voted_users cast_total_facebook_likes         actor_3_name
## 1          886204                      4834            Wes Studi
## 2          471220                     48350       Jack Davenport
## 3          275868                     11700     Stephanie Sigman
## 4         1144337                    106759 Joseph Gordon-Levitt
## 5               8                       143                     
## 6          212204                      1873         Polly Walker
##   facenumber_in_poster
## 1                    0
## 2                    0
## 3                    1
## 4                    0
## 5                    0
## 6                    1
##                                                      plot_keywords
## 1                           avatar|future|marine|native|paraplegic
## 2     goddess|marriage ceremony|marriage proposal|pirate|singapore
## 3                              bomb|espionage|sequel|spy|terrorist
## 4 deception|imprisonment|lawlessness|police officer|terrorist plot
## 5                                                                 
## 6               alien|american civil war|male nipple|mars|princess
##                                        movie_imdb_link num_user_for_reviews
## 1 http://www.imdb.com/title/tt0499549/?ref_=fn_tt_tt_1                 3054
## 2 http://www.imdb.com/title/tt0449088/?ref_=fn_tt_tt_1                 1238
## 3 http://www.imdb.com/title/tt2379713/?ref_=fn_tt_tt_1                  994
## 4 http://www.imdb.com/title/tt1345836/?ref_=fn_tt_tt_1                 2701
## 5 http://www.imdb.com/title/tt5289954/?ref_=fn_tt_tt_1                   NA
## 6 http://www.imdb.com/title/tt0401729/?ref_=fn_tt_tt_1                  738
##   language country content_rating    budget title_year actor_2_facebook_likes
## 1  English     USA          PG-13 237000000       2009                    936
## 2  English     USA          PG-13 300000000       2007                   5000
## 3  English      UK          PG-13 245000000       2015                    393
## 4  English     USA          PG-13 250000000       2012                  23000
## 5                                        NA         NA                     12
## 6  English     USA          PG-13 263700000       2012                    632
##   imdb_score aspect_ratio movie_facebook_likes
## 1        7.9         1.78                33000
## 2        7.1         2.35                    0
## 3        6.8         2.35                85000
## 4        8.5         2.35               164000
## 5        7.1           NA                    0
## 6        6.6         2.35                24000

I will focus on a subset that meets the exam requirement (≥4 variables, mostly numeric plus ≥1 categorical):

  • imdb_score: IMDB score of the movie (numeric, average rating on a 0–10 scale),

  • duration: Duration in minutes (numeric),

  • budget: Production budget of the movie in dollars (numeric),

  • gross: Gross earnings of the movie in Dollars (numeric),

  • color: Film colorization (categorical, “Black and White” or “Color”).

#Data cleaning and manipulation

# Keep only variables we actually use later (gross, budget, duration, imdb_score, color, language)
movies_clean <-
  movies %>%
  select(gross, budget, duration, imdb_score, color, language) %>%
  
  # Remove rows with missing or unusable key values
  filter(
    !is.na(imdb_score),
    !is.na(duration),
    !is.na(budget),
    !is.na(gross),
    !is.na(color)) %>%
  # Create ROI (return on investment); protect against zero budgets
  mutate(roi = ifelse(budget > 0, gross / budget, NA_real_)) %>%
  # Rename for convenience
  rename(imdb = imdb_score) %>%
  # Recode color to a binary factor: Color = 0, Black and White = 1
  # (trim whitespace first: the raw file stores the label with a leading
  # space, " Black and White", which would otherwise recode to NA)
  mutate(
    color = trimws(color),
    ColorBinary = factor(
      color,
      levels = c("Color", "Black and White"),
      labels = c(0, 1)))

# Subset: English-language films (used later for comparison)
movies_eng <- movies_clean %>% filter(language == "English")

I removed rows with missing values in the key variables to ensure consistency.

I created a new variable, roi = gross/budget, a simple profitability measure.

I renamed imdb_score to imdb to simplify the code.

I recoded the categorical variable color into a binary factor ColorBinary (0 = Color, 1 = Black and White) so it can be used in grouped summaries and boxplots.

I prepared a subset of English-language films for later comparison.
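
As a quick sanity check on the recode (a minimal sketch; the exact counts depend on the rows kept after cleaning):

# Counts of levels 0/1 plus any values that failed to recode
table(movies_clean$ColorBinary, useNA = "ifany")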

#Descriptive statistics

summary(movies_clean[, c("imdb","duration","budget","gross","roi", "ColorBinary")])
##       imdb          duration         budget              gross          
##  Min.   :1.600   Min.   : 34.0   Min.   :2.180e+02   Min.   :      162  
##  1st Qu.:5.900   1st Qu.: 95.0   1st Qu.:1.000e+07   1st Qu.:  6844452  
##  Median :6.600   Median :106.0   Median :2.400e+07   Median : 27996968  
##  Mean   :6.464   Mean   :109.9   Mean   :4.520e+07   Mean   : 51068087  
##  3rd Qu.:7.200   3rd Qu.:120.0   3rd Qu.:5.000e+07   3rd Qu.: 65406486  
##  Max.   :9.300   Max.   :330.0   Max.   :1.222e+10   Max.   :760505847  
##       roi            ColorBinary
##  Min.   :   0.0000   0   :3757  
##  1st Qu.:   0.4513   1   :   0  
##  Median :   1.0708   NA's: 133  
##  Mean   :   6.2554              
##  3rd Qu.:   2.2286              
##  Max.   :7194.4855

Interpretations:

Mean and median for IMDB ratings: the mean is approximately 6.46 and the median 6.6, very close together, indicating ratings are fairly symmetric around 6–7. Most films are rated positively, but the mean is pulled down slightly by a few poorly rated movies.

Quartiles for duration: Q1 ≈ 95 minutes and Q3 ≈ 120 minutes, so the middle 50% of films run between 95 and 120 minutes.

Minimum and maximum for budget & gross: the minimum budget is only a few hundred dollars, while the maximum exceeds $12 billion (almost certainly a currency artifact, since budgets in this dataset are not all reported in USD). This huge gap creates extreme outliers and a heavily right-skewed distribution; a few blockbusters dominate the upper tail.

Spread of budget and gross: summary() does not print standard deviations, but the ranges and interquartile ranges are extremely large, confirming that the financial variables are highly dispersed, with big differences between typical films and blockbuster outliers (the describe() call below reports the SDs).

ROI: very skewed; most films sit near ROI ≈ 1, but some extreme high outliers indicate runaway hits.

Frequency for ColorBinary: the vast majority of films are Color (0 = 3757); Black & White is only a small share. (The 133 NA's in the summary above are films whose raw label did not match "Black and White" exactly, most likely because of the leading space in the raw file; the trimws() step in the cleaning code fixes this on a rerun.)
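
To back up the spread claims with actual standard deviations and skewness (a minimal sketch; the psych package is already loaded):

# Detailed descriptives: n, mean, sd, skew, kurtosis, se for each variable
psych::describe(movies_clean[, c("imdb", "duration", "budget", "gross", "roi")])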

#Graphs

##Histogram of IMDB ratings

library(forcats)

ggplot(movies_clean, aes(x = imdb)) +
  geom_histogram(binwidth = 0.5, fill = "darkgoldenrod1", color = "mediumseagreen") +
  labs(title = "Distribution of IMDB ratings",
       x = "IMDB rating", y = "Count")

Interpretation: asymmetric, slightly left-skewed, unimodal. Ratings concentrate between 6 and 8, with a thin tail stretching toward very low scores.

##Budget vs Gross

ggplot(movies_clean, aes(x = budget, y = gross)) +
  geom_point(alpha = 0.4) +
  labs(title = "Budget vs Gross revenue",
       x = "Budget (USD)", y = "Gross (USD)")

Interpretation: A clear positive relationship with large dispersion—higher budgets often earn more, but not always.
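
Because both variables are heavily right-skewed, a log-log version of the same plot can make the relationship easier to read (a sketch; the log transform is safe here since the cleaned data contains only positive budgets and grosses):

ggplot(movies_clean, aes(x = budget, y = gross)) +
  geom_point(alpha = 0.4) +
  scale_x_log10() +
  scale_y_log10() +
  labs(title = "Budget vs Gross revenue (log-log scale)",
       x = "Budget (USD, log scale)", y = "Gross (USD, log scale)")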

##Boxplot: ROI by Color vs Black & White

# Boxplot of ROI by ColorBinary (design-focused)
p95 <- quantile(movies_clean$roi, 0.95, na.rm = TRUE)

ggplot(movies_clean, aes(x = ColorBinary, y = roi, fill = ColorBinary)) +
  geom_boxplot(
    width = 0.6,
    notch = TRUE,                       # notched median
    outlier.shape = 21,                 # hollow circles for outliers
    outlier.size = 2,
    outlier.fill = "white",
    alpha = 0.8,
    color = "black"
  ) +
  stat_summary(fun = mean, geom = "point",
               shape = 23, size = 3, fill = "red") +  # red diamond mean
  coord_cartesian(ylim = c(0, p95)) +
  scale_fill_manual(
    values = c("0" = "#1f77b4",   # blue for Color
               "1" = "#ff7f0e")   # orange for Black & White
  ) +
  labs(
    title = "ROI by Film Color",
    subtitle = "Boxplot trimmed at 95th percentile",
    x = "Film color (0 = Color, 1 = Black & White)",
    y = "ROI (gross / budget)",
    caption = "◇ = mean • Notch ≈ 95% CI around median"
  ) +
  theme_minimal(base_size = 13) +
  theme(
    plot.title = element_text(face = "bold", size = 14, color = "navy"),
    plot.subtitle = element_text(size = 11, color = "grey30"),
    axis.title.x = element_text(margin = margin(t = 10)),
    axis.title.y = element_text(margin = margin(r = 10)),
    panel.grid.major.y = element_line(color = "grey85", linetype = "dashed"),
    panel.grid.minor = element_blank(),
    legend.position = "none"
  )

Interpretation: ROI distributions are highly skewed in both groups (long upper tails). Medians can be compared directly; trimming at the 95th percentile keeps the central pattern readable while acknowledging extreme successes.

(Design:

Color palette: blue for Color = 0, orange for B&W = 1 (clear, colorblind-safe).

Box style: notched, semi-transparent, black border.

Outliers: hollow circles with white fill to stand out.

Mean: red diamond marker.

Theme: theme_minimal() with custom title, subtitle, dashed gridlines.)

Task 2:

Business School dataset

#Importing the dataset and libraries

 #Libraries
library(readxl)
#library(dplyr)
#library(ggplot2)
#library(psych)     # describe()
#library(pastecs)   # stat.desc()
#library(forcats)   # fct_infreq, fct_reorder

# Import (first sheet)
bsdata <- readxl::read_excel("C:/Users/Žabica/Desktop/IMB/bootcamp/R Take Home Exam 2025/Task 2/Business School.xlsx")

head(bsdata)
## # A tibble: 6 × 9
##   `Student ID` `Undergrad Degree` `Undergrad Grade` `MBA Grade`
##          <dbl> <chr>                          <dbl>       <dbl>
## 1            1 Business                        68.4        90.2
## 2            2 Computer Science                70.2        68.7
## 3            3 Finance                         76.4        83.3
## 4            4 Business                        82.6        88.7
## 5            5 Finance                         76.9        75.4
## 6            6 Computer Science                83.3        82.1
## # ℹ 5 more variables: `Work Experience` <chr>, `Employability (Before)` <dbl>,
## #   `Employability (After)` <dbl>, Status <chr>, `Annual Salary` <dbl>

I loaded the Excel file into bsdata. In what follows I focus on three variables: Undergrad Degree (categorical), Annual Salary (numeric), and MBA Grade (numeric).

1.) Distribution of undergrad degrees (ggplot)

# Bar chart of undergrad degrees
library(forcats)


ggplot(bsdata, aes(x = fct_infreq(as.factor(`Undergrad Degree`)))) +
  geom_bar(fill = "yellow3", color = "violetred2", alpha = 0.9, width = 0.7) +
  labs(
    title = "Distribution of Undergrad Degrees",
    x = "Undergrad Degree",
    y = "Count"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(face = "bold"),
    axis.text.x = element_text(angle = 30, hjust = 1)
  )

# Which degree is most common?
deg_counts <- bsdata %>%
  count(`Undergrad Degree`, sort = TRUE)
deg_counts[1, ]  # top category and its frequency
## # A tibble: 1 × 2
##   `Undergrad Degree`     n
##   <chr>              <int>
## 1 Business              35

The most common degree is Business (n = 35), the dominant category.

2.a) Descriptive statistics of Annual Salary

# Descriptive statistics
summary(bsdata$`Annual Salary`)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   20000   87125  103500  109058  124000  340000
# Detailed set (includes cv, median, etc.)
round(pastecs::stat.desc(bsdata$`Annual Salary`), 2)
##      nbr.val     nbr.null       nbr.na          min          max        range 
## 1.000000e+02 0.000000e+00 0.000000e+00 2.000000e+04 3.400000e+05 3.200000e+05 
##          sum       median         mean      SE.mean CI.mean.0.95          var 
## 1.090580e+07 1.035000e+05 1.090580e+05 4.150150e+03 8.234800e+03 1.722373e+09 
##      std.dev     coef.var 
## 4.150149e+04 3.800000e-01

Interpretation: Because the mean is higher than the median, the distribution is right-skewed.

From the quartiles and IQR we can see that half of the cohort earns between 87,125 and 124,000.

Standard deviation: salaries typically deviate from the mean by about €41k.

The wide range and large SD confirm a heavy right tail and likely high-end outliers.

Average salary is estimated at €109k with a 95% CI of €101k–€117k.
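
The reported confidence interval can be reproduced by hand from the pieces in the stat.desc() output (a minimal sketch):

# 95% CI for the mean: mean ± t(0.975, n-1) * SE
n  <- sum(!is.na(bsdata$`Annual Salary`))
m  <- mean(bsdata$`Annual Salary`, na.rm = TRUE)
se <- sd(bsdata$`Annual Salary`, na.rm = TRUE) / sqrt(n)
m + c(-1, 1) * qt(0.975, df = n - 1) * se  # ≈ 100823 and 117293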

2.b) Annual Salary — histogram

# Choose a data-driven binwidth: 30 bins across the observed range
binw <- diff(range(bsdata$`Annual Salary`, na.rm = TRUE))/30

ggplot(bsdata, aes(x = `Annual Salary`)) +
  geom_histogram(binwidth = binw, fill = "#10B981", color = "black", alpha = 0.9) +
  labs(title = "Distribution of Annual Salary",
    x = "Annual Salary",
    y = "Count") +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(face = "bold"))

Interpretation: asymmetric, right-skewed, unimodal, with the peak around 90k–120k, consistent with the median and mean.

There is a long right tail with a small number of very high salaries (outliers).

A secondary bump around 140–160k may represent a subgroup.

3.) Hypothesis test

# Optional diagnostic visuals
ggplot(bsdata, aes(x = `MBA Grade`)) +
  geom_histogram(binwidth = 2, fill = "#F59E0B", color = "black", alpha = 0.9) +
  labs(title = "MBA Grade: distribution check", x = "MBA Grade", y = "Count") +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(face = "bold"))

qqnorm(bsdata$`MBA Grade`, main = "Q-Q plot: MBA Grade"); qqline(bsdata$`MBA Grade`, col = 2)

# One-sample two-sided t-test against last year's mean = 74
tt <- t.test(bsdata$`MBA Grade`, mu = 74, alternative = "two.sided")
tt
## 
##  One Sample t-test
## 
## data:  bsdata$`MBA Grade`
## t = 2.6587, df = 99, p-value = 0.00915
## alternative hypothesis: true mean is not equal to 74
## 95 percent confidence interval:
##  74.51764 77.56346
## sample estimates:
## mean of x 
##  76.04055
# Effect size: Cohen's d for one-sample test
m_mean <- mean(bsdata$`MBA Grade`, na.rm = TRUE)
m_sd   <- sd(bsdata$`MBA Grade`, na.rm = TRUE)
d      <- (m_mean - 74) / m_sd

list(mean = m_mean, sd = m_sd, cohen_d = d,
     conf.low = tt$conf.int[1], conf.high = tt$conf.int[2],
     t = unname(tt$statistic), df = unname(tt$parameter), p = tt$p.value)
## $mean
## [1] 76.04055
## 
## $sd
## [1] 7.675114
## 
## $cohen_d
## [1] 0.2658658
## 
## $conf.low
## [1] 74.51764
## 
## $conf.high
## [1] 77.56346
## 
## $t
## [1] 2.658658
## 
## $df
## [1] 99
## 
## $p
## [1] 0.009149922

Hypotheses: H0: μ = 74 vs H1: μ ≠ 74.

Since p = 0.0091 < α = 0.05, we reject H0: the mean MBA grade differs significantly from 74.

The current cohort's average MBA grade (≈ 76.0) is higher than last year's 74. However, the effect size is small (Cohen's d ≈ 0.27), so the improvement, while statistically significant, is modest in practical terms.

Task 3:

Apartments dataset

Import the dataset Apartments.xlsx

# Libraries
#library(dplyr)
#library(ggplot2)

# Optional tools we’ll use later
#library(psych)     # describe()
library(car)       # scatterplotMatrix(), vif()
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:psych':
## 
##     logit
## The following object is masked from 'package:dplyr':
## 
##     recode
# Import dataset
library(readxl)
apart <- read_excel("C:/Users/Žabica/Desktop/IMB/bootcamp/R data/Bootcamp1-15-9-25/Apartments.xlsx")
#View(apart)

Description:

  • Age: Age of an apartment in years
  • Distance: The distance from city center in km
  • Price: Price per m2 in EUR
  • Parking: 0-No, 1-Yes
  • Balcony: 0-No, 1-Yes

Change categorical variables into factors.

 apart <- apart %>%
  mutate(
    ParkingF = factor(Parking, levels = c(0,1), labels = c("No","Yes")),
    BalconyF = factor(Balcony, levels = c(0,1), labels = c("No","Yes")))

str(apart)
## tibble [85 × 7] (S3: tbl_df/tbl/data.frame)
##  $ Age     : num [1:85] 7 18 7 28 18 28 14 18 22 25 ...
##  $ Distance: num [1:85] 28 1 28 29 18 12 20 6 7 2 ...
##  $ Price   : num [1:85] 1640 2800 1660 1850 1640 1770 1850 1970 2270 2570 ...
##  $ Parking : num [1:85] 0 1 0 0 1 0 0 1 1 1 ...
##  $ Balcony : num [1:85] 1 0 0 1 1 1 1 1 0 0 ...
##  $ ParkingF: Factor w/ 2 levels "No","Yes": 1 2 1 1 2 1 1 2 2 2 ...
##  $ BalconyF: Factor w/ 2 levels "No","Yes": 2 1 1 2 2 2 2 2 1 1 ...

ParkingF and BalconyF are now factors (No/Yes), so models treat them as categorical. This also makes boxplots and model summaries easier to interpret. I kept the original 0/1 columns Parking and Balcony in case I need them later.
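
A quick cross-tabulation confirms the recode (a minimal sketch; the counts simply summarize the data):

table(apart$ParkingF, apart$BalconyF)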

Test the hypothesis H0: μ_Price = 1900 EUR. What can you conclude?

# One-sample t-test 
t_price <- t.test(apart$Price, mu = 1900, alternative = "two.sided")

print(t_price)
## 
##  One Sample t-test
## 
## data:  apart$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
##  1937.443 2100.440
## sample estimates:
## mean of x 
##  2018.941

Interpretation: H0: μ = 1900 vs H1: μ ≠ 1900.

Since p = 0.0047 < α = 0.05, I reject the null hypothesis at the 5% level. The mean price per m2 (≈ 2019) is significantly different from 1900, higher by about 119 on average.

Estimate the simple regression function: Price = f(Age). Save results in object fit1 and explain the estimate of regression coefficient, coefficient of correlation and coefficient of determination.

fit1 <- lm(Price ~ Age, data = apart)
summary(fit1)
## 
## Call:
## lm(formula = Price ~ Age, data = apart)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -623.9 -278.0  -69.8  243.5  776.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2185.455     87.043  25.108   <2e-16 ***
## Age           -8.975      4.164  -2.156    0.034 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared:  0.05302,    Adjusted R-squared:  0.04161 
## F-statistic: 4.647 on 1 and 83 DF,  p-value: 0.03401

Interpretation: each additional year of age is associated with an 8.98 lower price per m2 on average, a weak negative linear association that is nevertheless statistically significant at the 5% level (p = 0.034). The implied correlation coefficient is r = −√0.053 ≈ −0.23 (see the check below). About 5.3% of the variation in price is explained by age alone (R² = 0.053); the remaining ~94.7% is due to other factors.
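
The correlation coefficient can be read off the simple regression directly (a minimal sketch):

cor(apart$Price, apart$Age)       # ≈ -0.23
-sqrt(summary(fit1)$r.squared)    # same value; the sign comes from the negative slope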


Show the scatterplot matrix between Price, Age and Distance. Based on the matrix, determine if there is a potential problem with multicollinearity.

car::scatterplotMatrix(apart[, c("Price","Age","Distance")], smooth = FALSE)

Interpretation: the scatterplot matrix shows no strong pairwise association between the predictors Age and Distance, so multicollinearity does not appear to be a problem. The matrix also helps assess linear form and outliers in the relationships of Price with each predictor.

Estimate the multiple regression function: Price = f(Age, Distance). Save it in object named fit2.

fit2 <- lm(Price ~ Age + Distance, data = apart)
summary(fit2)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = apart)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -603.23 -219.94  -85.68  211.31  689.58 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2460.101     76.632   32.10  < 2e-16 ***
## Age           -7.934      3.225   -2.46    0.016 *  
## Distance     -20.667      2.748   -7.52 6.18e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared:  0.4396, Adjusted R-squared:  0.4259 
## F-statistic: 32.16 on 2 and 82 DF,  p-value: 4.896e-11

Interpretation: holding distance fixed, each extra year of age is associated with about 7.9 lower price on average (statistically significant). Holding age fixed, each extra km of distance lowers the expected price by about 20.7 (highly significant). About 44% of the variation in prices is explained by age and distance together. The overall model is significant, so there is strong evidence that at least one of age or distance helps explain price.

Check the multicollinearity with VIF statistics. Explain the findings.

vif_vals <- car::vif(fit2)
vif_vals
##      Age Distance 
## 1.001845 1.001845
mean(vif(fit2))
## [1] 1.001845

Interpretation: VIFs ≈ 1 and mean VIF ≈ 1 indicate no evidence of multicollinearity between age and distance. Coefficient estimates should be stable, and their standard errors are not inflated by correlation among the regressors.
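
The same number falls out of the definition VIF_j = 1/(1 − R²_j), where R²_j comes from regressing predictor j on the remaining predictors (a minimal sketch):

r2_age <- summary(lm(Age ~ Distance, data = apart))$r.squared
1 / (1 - r2_age)   # ≈ 1.0018, matching car::vif(fit2)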

Calculate standardized residuals and Cook's distances for model fit2. Remove any potentially problematic units (outliers or units with high influence).

# Standardized residuals and Cook's distances for fit2
apart$StdResid <- round(rstandard(fit2), 3)
apart$CooksD   <- cooks.distance(fit2)

# Histogram of standardized residuals
hist(apart$StdResid,
     col = "lightblue",
     border = "blue",
     xlab = "Standardized residuals",
     ylab = "Frequency",
     main = "Histogram of standardized residuals (fit2)")

Interpretation: residuals are centered near 0 with most values within ±2, indicating no major systematic bias. A mild right tail and one observation around +2.6 suggest a limited outlier presence. Under the 4/n rule for Cook's D (applied below), a unit whose Cook's D also exceeds the threshold is both an outlier and influential.

# Cook's Distance for fit2
apart$CooksD <- round(cooks.distance(fit2), 3)

# Histogram of Cook's Distance
hist(apart$CooksD,
     col = "magenta1",
     border = "magenta4",
     xlab = "Cook's Distance",
     ylab = "Frequency",
     main = "Histogram of Cook's Distances (fit2)")

# Show top cases with highest Cook's Distance
head(apart[order(-apart$CooksD), ])
## # A tibble: 6 × 9
##     Age Distance Price Parking Balcony ParkingF BalconyF StdResid CooksD
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl> <fct>    <fct>       <dbl>  <dbl>
## 1     5       45  2180       1       1 Yes      Yes          2.58  0.32 
## 2    43       37  1740       0       0 No       No           1.44  0.104
## 3     2       11  2790       1       0 Yes      No           2.05  0.069
## 4     7        2  1760       0       1 No       Yes         -2.15  0.066
## 5    37        3  2540       1       1 Yes      Yes          1.58  0.061
## 6    40        2  2400       0       1 No       Yes          1.09  0.038

Interpretation: there is one clear influential case (unit 38, Cook's D ≈ 0.32); it is worth checking whether there are more, which the next chunk does.

# --- First chunk: Calculate Cook's D and show top 10 ---
apart$ID <- seq_len(nrow(apart))
apart$CooksD <- cooks.distance(fit2)

head(apart[order(-apart$CooksD),
           c("Age","Distance","Price","StdResid","CooksD")], 10)
## # A tibble: 10 × 5
##      Age Distance Price StdResid CooksD
##    <dbl>    <dbl> <dbl>    <dbl>  <dbl>
##  1     5       45  2180     2.58 0.320 
##  2    43       37  1740     1.44 0.104 
##  3     2       11  2790     2.05 0.0691
##  4     7        2  1760    -2.15 0.0663
##  5    37        3  2540     1.58 0.0609
##  6    40        2  2400     1.09 0.0375
##  7     8        2  2820     1.66 0.0365
##  8     8       26  2300     1.57 0.0341
##  9    10        1  2810     1.60 0.0320
## 10    18        1  2800     1.78 0.0304

Interpretation: these are the candidate problematic units. Applying the D > 4/n rule, the cutoff for this dataset is 4/85 ≈ 0.047, so I will remove all cases with Cook's D above that threshold.
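
Rather than reading the IDs off the printout, the flagged units can be derived programmatically (a sketch, using the ID column created above):

threshold <- 4 / nrow(apart)         # 4/85 ≈ 0.047
apart$ID[apart$CooksD > threshold]   # IDs of the flagged cases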

#Remove flagged cases ---
library(dplyr)

# IDs flagged above (Cook's D > 4/n ≈ 0.047)
drop_ids <- c(38, 55, 33, 53, 22)

apart <- apart %>% filter(!ID %in% drop_ids) 

#View(apart)

Interpretation: the cases with D > 4/n ≈ 0.047 listed above are filtered out.

Check for potential heteroskedasticity with a scatterplot of standardized residuals against standardized fitted values. Explain the findings.

# Multiple regression model (already created as fit2, shown again for clarity)
fit2 <- lm(Price ~ Age + Distance, data = apart)

# Standardized residuals and standardized fitted values
apart$StdResid <- round(rstandard(fit2), 3)
apart$StdFitValues <- scale(fit2$fitted.values)

# Residuals vs fitted plot
library(car)
scatterplot(y = apart$StdResid,
            x = apart$StdFitValues,
            ylab = "Standardized residuals",
            xlab = "Standardized fitted values",
            boxplots = FALSE,
            regLine = FALSE,
            smooth = FALSE,
            main = "Standardized residuals vs fitted values (fit2)")

Interpretation: the scatter suggests heteroskedasticity, with residual spread increasing with the fitted values. I should therefore use heteroskedasticity-robust standard errors and verify that the results hold after removing the influential points identified above.
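
One way to do both (a minimal sketch, assuming the lmtest and sandwich packages are installed):

library(lmtest)
library(sandwich)

# Coefficient tests with HC3 heteroskedasticity-robust standard errors
coeftest(fit2, vcov = vcovHC(fit2, type = "HC3"))

# Formal check: Breusch-Pagan test (H0: constant error variance)
bptest(fit2)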

Are the standardized residuals distributed normally? Show a graph and formally test it. Explain the findings.

# Standardized residuals for fit2
apart$StdResid <- round(rstandard(fit2), 3)

# Histogram of standardized residuals
hist(apart$StdResid,
     col = "lightblue",
     border = "blue",
     xlab = "Standardized residuals",
     ylab = "Frequency",
     main = "Histogram of standardized residuals (fit2)")

Interpretation: Removing those IDs reduced the influence of a handful of extreme points, yielding residuals more consistent with model assumptions (normality and constant variance). Parameter estimates and standard errors should be less distorted.

# Shapiro-Wilk test of normality for standardized residuals
shapiro.test(apart$StdResid)
## 
##  Shapiro-Wilk normality test
## 
## data:  apart$StdResid
## W = 0.94154, p-value = 0.001166

Interpretation: even after dropping those IDs, the Shapiro-Wilk test still rejects normality (W = 0.94, p = 0.0012 < 0.05). I would have to repeat the procedure of finding and eliminating outliers.
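
A Q-Q plot is a useful visual complement to the formal test (a sketch, mirroring the pattern used for MBA Grade in Task 2):

qqnorm(apart$StdResid, main = "Q-Q plot: standardized residuals (fit2)")
qqline(apart$StdResid, col = 2)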

Estimate fit2 again without the excluded units and show the summary of the model. Explain all coefficients.

fit2 <- lm(Price ~ Age + Distance, data = apart)
summary(fit2)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = apart)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -411.50 -203.69  -45.24  191.11  492.56 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2502.467     75.024  33.356  < 2e-16 ***
## Age           -8.674      3.221  -2.693  0.00869 ** 
## Distance     -24.063      2.692  -8.939 1.57e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 256.8 on 77 degrees of freedom
## Multiple R-squared:  0.5361, Adjusted R-squared:  0.524 
## F-statistic: 44.49 on 2 and 77 DF,  p-value: 1.437e-13

Interpretation: holding distance fixed, each additional year of age is associated with about 8.67 lower price on average (significant at the 1% level). Holding age fixed, each extra km lowers the expected price by about 24.06 (highly significant). After removing the influential units, age and distance together explain about 53.6% of the variation in price (R² = 0.536), up from roughly 44% before.

sqrt(summary(fit2)$r.squared)
## [1] 0.732187

Interpretation: This is R, the multiple correlation between the observed response and the model’s fitted values. My result 0.732 means the model’s predictions correlate about 0.73 with actual prices. Squaring it gives back R² ≈ 0.536, i.e., the model explains ~53.6% of the variance.

Estimate the linear regression function Price = f(Age, Distance, Parking and Balcony). Be careful to correctly include categorical variables. Save the object named fit3.

fit3 <- lm(Price ~ Age + Distance + ParkingF + BalconyF, data = apart)
print(fit3)
## 
## Call:
## lm(formula = Price ~ Age + Distance + ParkingF + BalconyF, data = apart)
## 
## Coefficients:
## (Intercept)          Age     Distance  ParkingFYes  BalconyFYes  
##    2393.316       -7.970      -21.961      128.700        6.032

Interpretation: Age (−7.97): each additional year of age is associated with a ~8 unit lower price, other variables fixed. Distance (−21.96): each extra km lowers the price by ~22 units, other variables fixed. ParkingFYes (+128.7): apartments with parking are about 129 units more expensive than those without, other variables fixed. BalconyFYes (+6.03): a balcony adds ~6 units on average, other variables fixed.

With function anova check if model fit3 fits data better than model fit2.

anova(fit2, fit3)
## Analysis of Variance Table
## 
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + ParkingF + BalconyF
##   Res.Df     RSS Df Sum of Sq      F Pr(>F)
## 1     77 5077362                           
## 2     75 4791128  2    286234 2.2403 0.1135

Interpretation: since p = 0.11 > 0.05, there is no evidence that adding Parking and Balcony improves the fit beyond Age and Distance.

Show the results of fit3 and explain the regression coefficients for both categorical variables. Can you write down the hypothesis being tested by the F-statistic shown at the bottom of the output?

summary(fit3)
## 
## Call:
## lm(formula = Price ~ Age + Distance + ParkingF + BalconyF, data = apart)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -390.93 -198.19  -53.64  186.73  518.34 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2393.316     93.930  25.480  < 2e-16 ***
## Age           -7.970      3.191  -2.498   0.0147 *  
## Distance     -21.961      2.830  -7.762 3.39e-11 ***
## ParkingFYes  128.700     60.801   2.117   0.0376 *  
## BalconyFYes    6.032     57.307   0.105   0.9165    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 252.7 on 75 degrees of freedom
## Multiple R-squared:  0.5623, Adjusted R-squared:  0.5389 
## F-statistic: 24.08 on 4 and 75 DF,  p-value: 7.764e-13

Interpretation: apartments with parking are estimated to cost about 129 units more than those without, other variables fixed (significant at the 5% level). Having a balcony is associated with a ~6 unit higher price, other variables fixed, but this effect is not statistically significant (p = 0.92). The F-statistic at the bottom tests H0: β_Age = β_Distance = β_Parking = β_Balcony = 0 (no regressor explains price) against H1: at least one β ≠ 0; with F = 24.08 and p ≈ 7.8e-13, H0 is clearly rejected.

Save the fitted values and calculate the residual for apartment ID 2.

# Fitted value and residual for apartment with ID = 2
Fitted_ID2   <- fitted(fit3)[apart$ID == 2]
Residual_ID2 <- resid(fit3)[apart$ID == 2]

round(c(Fitted = Fitted_ID2, Residual = Residual_ID2), 3)
##   Fitted.2 Residual.2 
##   2356.597    443.403

Interpretation: A positive residual means the apartment’s actual price is higher than the model predicts. This apartment was sold for ≈ 2356.597 + 443.403 = 2800 (same units as Price), i.e., about 443 above the model’s expectation given its Age, Distance, Parking, and Balcony.
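
As a cross-check (a sketch), the observed price should equal fitted + residual exactly:

apart$Price[apart$ID == 2]   # 2800 = 2356.597 + 443.403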