#Importing the dataset and libraries
#Load libraries
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(psych)
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
library(pastecs)
##
## Attaching package: 'pastecs'
## The following objects are masked from 'package:dplyr':
##
## first, last
# Import the full dataset (5043 movies, 28 variables)
movies <- read.csv("C:/Users/Žabica/Desktop/IMB/bootcamp/R Take Home Exam 2025/movie_metadata.csv", stringsAsFactors = FALSE)
head(movies)
## color director_name num_critic_for_reviews duration
## 1 Color James Cameron 723 178
## 2 Color Gore Verbinski 302 169
## 3 Color Sam Mendes 602 148
## 4 Color Christopher Nolan 813 164
## 5 Doug Walker NA NA
## 6 Color Andrew Stanton 462 132
## director_facebook_likes actor_3_facebook_likes actor_2_name
## 1 0 855 Joel David Moore
## 2 563 1000 Orlando Bloom
## 3 0 161 Rory Kinnear
## 4 22000 23000 Christian Bale
## 5 131 NA Rob Walker
## 6 475 530 Samantha Morton
## actor_1_facebook_likes gross genres
## 1 1000 760505847 Action|Adventure|Fantasy|Sci-Fi
## 2 40000 309404152 Action|Adventure|Fantasy
## 3 11000 200074175 Action|Adventure|Thriller
## 4 27000 448130642 Action|Thriller
## 5 131 NA Documentary
## 6 640 73058679 Action|Adventure|Sci-Fi
## actor_1_name movie_title
## 1 CCH Pounder Avatar
## 2 Johnny Depp Pirates of the Caribbean: At World's End
## 3 Christoph Waltz Spectre
## 4 Tom Hardy The Dark Knight Rises
## 5 Doug Walker Star Wars: Episode VII - The Force Awakens
## 6 Daryl Sabara John Carter
## num_voted_users cast_total_facebook_likes actor_3_name
## 1 886204 4834 Wes Studi
## 2 471220 48350 Jack Davenport
## 3 275868 11700 Stephanie Sigman
## 4 1144337 106759 Joseph Gordon-Levitt
## 5 8 143
## 6 212204 1873 Polly Walker
## facenumber_in_poster
## 1 0
## 2 0
## 3 1
## 4 0
## 5 0
## 6 1
## plot_keywords
## 1 avatar|future|marine|native|paraplegic
## 2 goddess|marriage ceremony|marriage proposal|pirate|singapore
## 3 bomb|espionage|sequel|spy|terrorist
## 4 deception|imprisonment|lawlessness|police officer|terrorist plot
## 5
## 6 alien|american civil war|male nipple|mars|princess
## movie_imdb_link num_user_for_reviews
## 1 http://www.imdb.com/title/tt0499549/?ref_=fn_tt_tt_1 3054
## 2 http://www.imdb.com/title/tt0449088/?ref_=fn_tt_tt_1 1238
## 3 http://www.imdb.com/title/tt2379713/?ref_=fn_tt_tt_1 994
## 4 http://www.imdb.com/title/tt1345836/?ref_=fn_tt_tt_1 2701
## 5 http://www.imdb.com/title/tt5289954/?ref_=fn_tt_tt_1 NA
## 6 http://www.imdb.com/title/tt0401729/?ref_=fn_tt_tt_1 738
## language country content_rating budget title_year actor_2_facebook_likes
## 1 English USA PG-13 237000000 2009 936
## 2 English USA PG-13 300000000 2007 5000
## 3 English UK PG-13 245000000 2015 393
## 4 English USA PG-13 250000000 2012 23000
## 5 NA NA 12
## 6 English USA PG-13 263700000 2012 632
## imdb_score aspect_ratio movie_facebook_likes
## 1 7.9 1.78 33000
## 2 7.1 2.35 0
## 3 6.8 2.35 85000
## 4 8.5 2.35 164000
## 5 7.1 NA 0
## 6 6.6 2.35 24000
I will focus on a subset that meets the exam requirement (≥4 variables, mostly numeric plus ≥1 categorical); a quick type check of these columns follows the list:
imdb_score: IMDB score of the movie (numeric, average rating 0–10),
duration: Duration in minutes (numeric),
budget: Production budget of the movie in dollars (numeric),
gross: Gross earnings of the movie in dollars (numeric),
color: Film colorization (categorical, "Black and White" or "Color").
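Before cleaning, a quick structural check of just these columns (a minimal sketch using the column names listed above):
# Quick type check of the chosen columns (sketch)
str(movies[, c("imdb_score", "duration", "budget", "gross", "color")])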
#Data cleaning and manipulation
# Keep only variables we actually use later (gross, budget, duration, imdb_score, color, language)
movies_clean <-
movies %>%
select(gross, budget, duration, imdb_score, color, language) %>%
# Remove rows with missing or unusable key values
filter(
!is.na(imdb_score),
!is.na(duration),
!is.na(budget),
!is.na(gross),
!is.na(color)) %>%
# Create ROI (return on investment); protect against zero budgets
mutate(roi = ifelse(budget > 0, gross / budget, NA_real_)) %>%
# Rename for convenience
rename(imdb = imdb_score) %>%
# Recode color to a binary factor: Color = 0, Black and White = 1
mutate(
ColorBinary = factor(
color,
levels = c("Color", "Black and White"),
labels = c(0, 1)))
# 6) Subset: English-language films (used later for comparison)
movies_eng <- movies_clean %>% filter(language == "English")
I removed rows with missing values in key variables to ensure consistency.
I created a new variable roi = gross/budget, a simple profitability measure.
I renamed imdb_score to imdb to simplify code.
I recoded the categorical variable color into a binary factor ColorBinary (0 = Color, 1 = Black and White) so it can be used in grouped summaries and boxplots (see the sanity check below).
I prepared a subset of English-language films for comparison.
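As a quick sanity check on these steps (a minimal sketch, not required output), the key variables should contain no missing values and the ColorBinary coding can be inspected directly:
# Sanity check after cleaning (sketch)
colSums(is.na(movies_clean[, c("imdb", "duration", "budget", "gross")]))
table(movies_clean$ColorBinary, useNA = "ifany")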
#Descriptive statistics
summary(movies_clean[, c("imdb","duration","budget","gross","roi", "ColorBinary")])
## imdb duration budget gross
## Min. :1.600 Min. : 34.0 Min. :2.180e+02 Min. : 162
## 1st Qu.:5.900 1st Qu.: 95.0 1st Qu.:1.000e+07 1st Qu.: 6844452
## Median :6.600 Median :106.0 Median :2.400e+07 Median : 27996968
## Mean :6.464 Mean :109.9 Mean :4.520e+07 Mean : 51068087
## 3rd Qu.:7.200 3rd Qu.:120.0 3rd Qu.:5.000e+07 3rd Qu.: 65406486
## Max. :9.300 Max. :330.0 Max. :1.222e+10 Max. :760505847
## roi ColorBinary
## Min. : 0.0000 0 :3757
## 1st Qu.: 0.4513 1 : 0
## Median : 1.0708 NA's: 133
## Mean : 6.2554
## 3rd Qu.: 2.2286
## Max. :7194.4855
Interpretations:
Mean and median for IMDB ratings: the mean is approximately 6.5 and the median 6.6, very close together, which indicates that ratings are fairly symmetric around the 6–7 range. Most films are rated positively, and the mean is pulled down only slightly by a few poorly rated movies.
Quartiles for duration: Q1 ≈ 95 minutes and Q3 ≈ 120 minutes, so the middle 50% of movies run between 95 and 120 minutes.
Minimum and maximum for budget and gross: the minimum budget is only a few hundred dollars, while the maximum exceeds a billion dollars. This huge gap produces extreme outliers (both variables are highly right-skewed); a few blockbusters dominate the upper tail.
Standard deviation for budget and gross (computed below): the SDs are extremely large, confirming that the financial variables are widely spread out, with big differences between typical films and blockbuster outliers.
ROI: very skewed; most films are near ROI ≈ 1, but some very high outliers indicate runaway hits.
Frequency for ColorBinary: the vast majority of films are Color (0), while Black & White (1) films represent only a small share of the dataset.
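summary() does not report standard deviations, so here is a small sketch computing them for the financial variables discussed above:
# Standard deviations of budget and gross (sketch)
sapply(movies_clean[, c("budget", "gross")], sd, na.rm = TRUE)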
#Graphs
##Histogram of IMDB ratings
library(forcats)
ggplot(movies_clean, aes(x = imdb)) +
geom_histogram(binwidth = 0.5, fill = "darkgoldenrod1", color = "mediumseagreen") +
labs(title = "Distribution of IMDB ratings",
x = "IMDB rating", y = "Count")
Interpretation: Asymmetrical, slightly left-skewed, unimodal. Ratings concentrate between 6 and 8; the left tail thins out toward very low scores.
##Budget vs Gross
ggplot(movies_clean, aes(x = budget, y = gross)) +
geom_point(alpha = 0.4) +
labs(title = "Budget vs Gross revenue",
x = "Budget (USD)", y = "Gross (USD)")
Interpretation: A clear positive relationship with large
dispersion—higher budgets often earn more, but not always.
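To put a number on this relationship, a small sketch computing Pearson's correlation on complete observations:
# Strength of the budget-gross relationship (sketch)
cor(movies_clean$budget, movies_clean$gross, use = "complete.obs")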
# Boxplot of ROI by ColorBinary (design-focused)
p95 <- quantile(movies_clean$roi, 0.95, na.rm = TRUE)
ggplot(movies_clean, aes(x = ColorBinary, y = roi, fill = ColorBinary)) +
geom_boxplot(
width = 0.6,
notch = TRUE, # notched median
outlier.shape = 21, # hollow circles for outliers
outlier.size = 2,
outlier.fill = "white",
alpha = 0.8,
color = "black"
) +
stat_summary(fun = mean, geom = "point",
shape = 23, size = 3, fill = "red") + # red diamond mean
coord_cartesian(ylim = c(0, p95)) +
scale_fill_manual(
values = c("0" = "#1f77b4", # blue for Color
"1" = "#ff7f0e") # orange for Black & White
) +
labs(
title = "ROI by Film Color",
subtitle = "Boxplot trimmed at 95th percentile",
x = "Film color (0 = Color, 1 = Black & White)",
y = "ROI (gross / budget)",
caption = "◇ = mean • Notch ≈ 95% CI around median"
) +
theme_minimal(base_size = 13) +
theme(
plot.title = element_text(face = "bold", size = 14, color = "navy"),
plot.subtitle = element_text(size = 11, color = "grey30"),
axis.title.x = element_text(margin = margin(t = 10)),
axis.title.y = element_text(margin = margin(r = 10)),
panel.grid.major.y = element_line(color = "grey85", linetype = "dashed"),
panel.grid.minor = element_blank(),
legend.position = "none"
)
Interpretation: ROI distributions are highly skewed in both groups (long
upper tails). Medians can be compared directly; trimming at the 95th
percentile keeps the central pattern readable while acknowledging
extreme successes.
(Design:
Color palette: blue for Color = 0, orange for B&W = 1 (clear, colorblind-safe).
Box style: notched, semi-transparent, black border.
Outliers: hollow circles with white fill to stand out.
Mean: red diamond marker.
Theme: theme_minimal() with custom title, subtitle, dashed gridlines.)
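To back up the median comparison above, a quick sketch of the group medians and counts behind the boxplot:
# Group medians of ROI by ColorBinary (sketch)
movies_clean %>%
  group_by(ColorBinary) %>%
  summarise(n = n(), median_roi = median(roi, na.rm = TRUE))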
#Importing the dataset and libraries
#Libraries
library(readxl)
#library(dplyr)
#library(ggplot2)
#library(psych) # describe()
#library(pastecs) # stat.desc()
#library(forcats) # fct_infreq, fct_reorder
# Import (first sheet)
bsdata <- readxl::read_excel("C:/Users/Žabica/Desktop/IMB/bootcamp/R Take Home Exam 2025/Task 2/Business School.xlsx")
head(bsdata)
## # A tibble: 6 × 9
## `Student ID` `Undergrad Degree` `Undergrad Grade` `MBA Grade`
## <dbl> <chr> <dbl> <dbl>
## 1 1 Business 68.4 90.2
## 2 2 Computer Science 70.2 68.7
## 3 3 Finance 76.4 83.3
## 4 4 Business 82.6 88.7
## 5 5 Finance 76.9 75.4
## 6 6 Computer Science 83.3 82.1
## # ℹ 5 more variables: `Work Experience` <chr>, `Employability (Before)` <dbl>,
## # `Employability (After)` <dbl>, Status <chr>, `Annual Salary` <dbl>
I loaded the Excel file into bsdata (Business School). In the following I will focus on three variables: Undergrad Degree (categorical), Annual Salary (numeric), and MBA Grade (numeric).
# Bar chart of undergrad degrees
library(forcats)
ggplot(bsdata, aes(x = fct_infreq(as.factor(`Undergrad Degree`)))) +
geom_bar(fill = "yellow3" , color = "violetred2", alpha = 0.9, width = 0.7) +
labs(
title = "Distribution of Undergrad Degrees",
x = "Undergrad Degree",
y = "Count"
) +
theme_minimal(base_size = 12) +
theme(
plot.title = element_text(face = "bold"),
axis.text.x = element_text(angle = 30, hjust = 1)
)
# Which degree is most common?
deg_counts <- bsdata %>%
count(`Undergrad Degree`, sort = TRUE)
deg_counts[1, ] # top category and its frequency
## # A tibble: 1 × 2
## `Undergrad Degree` n
## <chr> <int>
## 1 Business 35
The most common undergraduate degree is Business, which is the dominant category.
# Descriptive statistics
summary(bsdata$`Annual Salary`)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 20000 87125 103500 109058 124000 340000
# Detailed set (includes cv, median, etc.)
round(pastecs::stat.desc(bsdata$`Annual Salary`), 2)
## nbr.val nbr.null nbr.na min max range
## 1.000000e+02 0.000000e+00 0.000000e+00 2.000000e+04 3.400000e+05 3.200000e+05
## sum median mean SE.mean CI.mean.0.95 var
## 1.090580e+07 1.035000e+05 1.090580e+05 4.150150e+03 8.234800e+03 1.722373e+09
## std.dev coef.var
## 4.150149e+04 3.800000e-01
Interpretation: Because the mean is higher than the median, the distribution is right-skewed.
From the quartiles and IQR we can see that half of the cohort earns between 87,125 and 124,000.
Standard deviation: salaries vary by roughly €41k around the mean.
The wide range and large SD confirm a heavy right tail and probable high-end outliers.
The average salary is estimated at €109k, with a 95% CI of roughly €101k–€117k (reconstructed below).
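The 95% CI quoted above follows from mean ± t(0.975, n−1) × SE; a minimal sketch reconstructing it from the data:
# Reconstruct the 95% CI for the mean salary (sketch)
n  <- sum(!is.na(bsdata$`Annual Salary`))
m  <- mean(bsdata$`Annual Salary`, na.rm = TRUE)
se <- sd(bsdata$`Annual Salary`, na.rm = TRUE) / sqrt(n)
m + c(-1, 1) * qt(0.975, df = n - 1) * se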
# Choose a data-driven binwidth: 30 bins across the observed range
binw <- diff(range(bsdata$`Annual Salary`, na.rm = TRUE))/30
ggplot(bsdata, aes(x = `Annual Salary`)) +
geom_histogram(binwidth = binw, fill = "#10B981", color = "black", alpha = 0.9) +
labs(title = "Distribution of Annual Salary",
x = "Annual Salary",
y = "Count") +
theme_minimal(base_size = 12) +
theme(plot.title = element_text(face = "bold"))
Interpretation: Asymmetrical, right-skewed, unimodal, with a peak around 90k–120k, which matches the median and mean.
There is a long right tail with a small number of very high salaries (outliers).
A secondary bump around 140–160k possibly represents a subgroup.
# Optional diagnostic visuals
ggplot(bsdata, aes(x = `MBA Grade`)) +
geom_histogram(binwidth = 2, fill = "#F59E0B", color = "black", alpha = 0.9) +
labs(title = "MBA Grade: distribution check", x = "MBA Grade", y = "Count") +
theme_minimal(base_size = 12) +
theme(plot.title = element_text(face = "bold"))
qqnorm(bsdata$`MBA Grade`, main = "Q-Q plot: MBA Grade"); qqline(bsdata$`MBA Grade`, col = 2)
# One-sample two-sided t-test against last year's mean = 74
tt <- t.test(bsdata$`MBA Grade`, mu = 74, alternative = "two.sided")
tt
##
## One Sample t-test
##
## data: bsdata$`MBA Grade`
## t = 2.6587, df = 99, p-value = 0.00915
## alternative hypothesis: true mean is not equal to 74
## 95 percent confidence interval:
## 74.51764 77.56346
## sample estimates:
## mean of x
## 76.04055
# Effect size: Cohen's d for one-sample test
m_mean <- mean(bsdata$`MBA Grade`, na.rm = TRUE)
m_sd <- sd(bsdata$`MBA Grade`, na.rm = TRUE)
d <- (m_mean - 74) / m_sd
list(mean = m_mean, sd = m_sd, cohen_d = d,
conf.low = tt$conf.int[1], conf.high = tt$conf.int[2],
t = unname(tt$statistic), df = unname(tt$parameter), p = tt$p.value)
## $mean
## [1] 76.04055
##
## $sd
## [1] 7.675114
##
## $cohen_d
## [1] 0.2658658
##
## $conf.low
## [1] 74.51764
##
## $conf.high
## [1] 77.56346
##
## $t
## [1] 2.658658
##
## $df
## [1] 99
##
## $p
## [1] 0.009149922
Hypotheses: H0: μ = 74 vs. H1: μ ≠ 74.
Since p = 0.0091 < α = 0.05, we can reject H0: the mean MBA grade differs significantly from 74.
The current cohort's average MBA grade (≈76) is higher than last year's 74, but the effect size (Cohen's d ≈ 0.27) is small, so the improvement is modest in practical terms.
# Libraries
#library(dplyr)
#library(ggplot2)
# Optional tools we’ll use later
#library(psych) # describe()
library(car) # scatterplotMatrix(), vif()
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
## The following object is masked from 'package:dplyr':
##
## recode
# Import dataset
library(readxl)
apart <- read_excel("C:/Users/Žabica/Desktop/IMB/bootcamp/R data/Bootcamp1-15-9-25/Apartments.xlsx")
#View(apart)
Description:
apart <- apart %>%
mutate(
ParkingF = factor(Parking, levels = c(0,1), labels = c("No","Yes")),
BalconyF = factor(Balcony, levels = c(0,1), labels = c("No","Yes")))
str(apart)
## tibble [85 × 7] (S3: tbl_df/tbl/data.frame)
## $ Age : num [1:85] 7 18 7 28 18 28 14 18 22 25 ...
## $ Distance: num [1:85] 28 1 28 29 18 12 20 6 7 2 ...
## $ Price : num [1:85] 1640 2800 1660 1850 1640 1770 1850 1970 2270 2570 ...
## $ Parking : num [1:85] 0 1 0 0 1 0 0 1 1 1 ...
## $ Balcony : num [1:85] 1 0 0 1 1 1 1 1 0 0 ...
## $ ParkingF: Factor w/ 2 levels "No","Yes": 1 2 1 1 2 1 1 2 2 2 ...
## $ BalconyF: Factor w/ 2 levels "No","Yes": 2 1 1 2 2 2 2 2 1 1 ...
ParkingF and BalconyF are now factors (No/Yes), so models treat them as categorical. This also makes boxplots and model summaries easier to interpret. I kept the original columns Parking and Balcony (0/1) in case I need them later.
# One-sample t-test
t_price <- t.test(apart$Price, mu = 1900, alternative = "two.sided")
print(t_price)
##
## One Sample t-test
##
## data: apart$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
## 1937.443 2100.440
## sample estimates:
## mean of x
## 2018.941
Interpretation: H0: μ = 1900 vs. H1: μ ≠ 1900.
Since p = 0.0047 < α = 0.05, I reject the null hypothesis at the 5% level. The mean apartment price (≈2019) is significantly different from 1900, higher by about 119 on average.
fit1 <- lm(Price ~ Age, data = apart)
summary(fit1)
##
## Call:
## lm(formula = Price ~ Age, data = apart)
##
## Residuals:
## Min 1Q Median 3Q Max
## -623.9 -278.0 -69.8 243.5 776.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2185.455 87.043 25.108 <2e-16 ***
## Age -8.975 4.164 -2.156 0.034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared: 0.05302, Adjusted R-squared: 0.04161
## F-statistic: 4.647 on 1 and 83 DF, p-value: 0.03401
Interpretation: Each additional year of age is on average associated with an 8.98 lower apartment price, a weak negative linear association (significant at the 5% level, p = 0.034). About 5.3% of the variation in price is explained by age alone; the other ~94.7% is due to other factors.
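To make the slope concrete, a small illustrative sketch: predicted prices at two hypothetical ages (0 and 10 years, values chosen only for illustration):
# Predicted price for a new apartment vs. one 10 years older (sketch)
predict(fit1, newdata = data.frame(Age = c(0, 10)))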
car::scatterplotMatrix(apart[, c("Price","Age","Distance")], smooth = FALSE)
Interpretation: The scatterplot matrix shows no strong correlation between the predictors, so multicollinearity is not a concern. It also helps assess linear form and outliers for Price, Age, and Distance.
fit2 <- lm(Price ~ Age + Distance, data = apart)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = apart)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.23 -219.94 -85.68 211.31 689.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2460.101 76.632 32.10 < 2e-16 ***
## Age -7.934 3.225 -2.46 0.016 *
## Distance -20.667 2.748 -7.52 6.18e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared: 0.4396, Adjusted R-squared: 0.4259
## F-statistic: 32.16 on 2 and 82 DF, p-value: 4.896e-11
Interpretation: Holding distance fixed, each extra year of age is associated with about a 7.9 lower price on average (statistically significant). Holding age fixed, each extra unit of distance lowers the expected price by about 20.7 (highly significant). About 44% of the variation in prices is explained by age and distance together. The overall model is significant, so there is strong evidence that at least one of age or distance helps explain the price.
vif_vals <- car::vif(fit2)
vif_vals
## Age Distance
## 1.001845 1.001845
mean(vif(fit2))
## [1] 1.001845
Interpretation: VIFs ~ 1 and mean VIF ~ 1 tell me there is no evidence of multicollinearity between age and distance. Coefficient estimates should be stable and their standard errors are unaffected by correlation among the regressors.
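As a reminder of where these numbers come from, the VIF for a predictor is 1/(1 − R²) from regressing that predictor on the remaining predictors; a manual check for Age (sketch):
# Manual VIF for Age via the auxiliary regression (sketch)
1 / (1 - summary(lm(Age ~ Distance, data = apart))$r.squared)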
# Standardized residuals and Cook's distances for fit2
apart$StdResid <- round(rstandard(fit2), 3)
apart$CooksD <- cooks.distance(fit2)
# Histogram of standardized residuals
hist(apart$StdResid,
col = "lightblue",
border = "blue",
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals (fit2)")
Interpretation: Residuals are centered near 0 with most values within ±2, indicating no major systematic bias. A mild right tail and roughly one observation near +3 suggest limited outlier presence. Using Cook's D with the 4/n rule, there are potentially problematic units: at least one outlier (StdResid ≈ +3). If Cook's D > 4/n for the same unit, then it is both an outlier and influential.
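The 4/n cutoff mentioned above can be made explicit (a small sketch using the Cook's distances already stored in apart):
# 4/n rule of thumb: compute the cutoff and flag rows above it (sketch)
cutoff <- 4 / nrow(apart)
cutoff
which(apart$CooksD > cutoff)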
# Cook's Distance for fit2
apart$CooksD <- round(cooks.distance(fit2), 3)
# Histogram of Cook's Distance
hist(apart$CooksD,
col = "magenta1",
border = "magenta4",
xlab = "Cook's Distance",
ylab = "Frequency",
main = "Histogram of Cook's Distances (fit2)")
# Show top cases with highest Cook's Distance
head(apart[order(-apart$CooksD), ])
## # A tibble: 6 × 9
## Age Distance Price Parking Balcony ParkingF BalconyF StdResid CooksD
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct> <dbl> <dbl>
## 1 5 45 2180 1 1 Yes Yes 2.58 0.32
## 2 43 37 1740 0 0 No No 1.44 0.104
## 3 2 11 2790 1 0 Yes No 2.05 0.069
## 4 7 2 1760 0 1 No Yes -2.15 0.066
## 5 37 3 2540 1 1 Yes Yes 1.58 0.061
## 6 40 2 2400 0 1 No Yes 1.09 0.038
Interpretation: There is one clearly influential case (ID 38, Cook's D ≈ 0.32) that stands out from the rest; it is worth inspecting this unit and checking whether other cases also exceed the threshold.
# --- First chunk: Calculate Cook's D and show top 10 ---
apart$ID <- seq_len(nrow(apart))
apart$CooksD <- cooks.distance(fit2)
head(apart[order(-apart$CooksD),
c("Age","Distance","Price","StdResid","CooksD")], 10)
## # A tibble: 10 × 5
## Age Distance Price StdResid CooksD
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 5 45 2180 2.58 0.320
## 2 43 37 1740 1.44 0.104
## 3 2 11 2790 2.05 0.0691
## 4 7 2 1760 -2.15 0.0663
## 5 37 3 2540 1.58 0.0609
## 6 40 2 2400 1.09 0.0375
## 7 8 2 2820 1.66 0.0365
## 8 8 26 2300 1.57 0.0341
## 9 10 1 2810 1.60 0.0320
## 10 18 1 2800 1.78 0.0304
Interpretation: These are the possible influential cases. Using the rule of thumb D > 4/n, with n = 85 the cutoff is 4/85 ≈ 0.047, so I will remove all cases above that threshold.
#Remove flagged cases ---
library(dplyr)
# Example: replace these with IDs flagged above
drop_ids <- c(38, 55, 33, 53, 22)
apart <- apart %>% filter(!ID %in% drop_ids)
#View(apart)
Interpretation: The previously flagged cases (Cook's D above the 4/n cutoff) are filtered out.
# Multiple regression model (already created as fit2, shown again for clarity)
fit2 <- lm(Price ~ Age + Distance, data = apart)
# Standardized residuals and standardized fitted values
apart$StdResid <- round(rstandard(fit2), 3)
apart$StdFitValues <- scale(fit2$fitted.values)
# Residuals vs fitted plot
library(car)
scatterplot(y = apart$StdResid,
x = apart$StdFitValues,
ylab = "Standardized residuals",
xlab = "Standardized fitted values",
boxplots = FALSE,
regLine = FALSE,
smooth = FALSE,
main = "Standardized residuals vs fitted values (fit2)")
Interpretation: The scatter suggests heteroskedasticity, with residual spread increasing with the fitted values. I should therefore use robust standard errors (a sketch follows) and check that the results hold after removing the few influential points I identified.
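One way to obtain heteroskedasticity-robust standard errors for fit2, assuming the sandwich and lmtest packages are available (a sketch, not rerun here):
# Robust (HC3) standard errors for fit2 (sketch; requires sandwich + lmtest)
library(sandwich)
library(lmtest)
coeftest(fit2, vcov = vcovHC(fit2, type = "HC3"))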
# Standardized residuals for fit2
apart$StdResid <- round(rstandard(fit2), 3)
# Histogram of standardized residuals
hist(apart$StdResid,
col = "lightblue",
border = "blue",
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals (fit2)")
Interpretation: Removing those IDs reduced the influence of a handful of
extreme points, yielding residuals more consistent with model
assumptions (normality and constant variance). Parameter estimates and
standard errors should be less distorted.
# Shapiro-Wilk test of normality for standardized residuals
shapiro.test(apart$StdResid)
##
## Shapiro-Wilk normality test
##
## data: apart$StdResid
## W = 0.94154, p-value = 0.001166
Interpretation: Even after dropping those IDs, the residuals are not normal enough to pass the Shapiro-Wilk test (p ≈ 0.001). I would have to repeat the procedure of identifying and removing outliers.
fit2 <- lm(Price ~ Age + Distance, data = apart)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = apart)
##
## Residuals:
## Min 1Q Median 3Q Max
## -411.50 -203.69 -45.24 191.11 492.56
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2502.467 75.024 33.356 < 2e-16 ***
## Age -8.674 3.221 -2.693 0.00869 **
## Distance -24.063 2.692 -8.939 1.57e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 256.8 on 77 degrees of freedom
## Multiple R-squared: 0.5361, Adjusted R-squared: 0.524
## F-statistic: 44.49 on 2 and 77 DF, p-value: 1.437e-13
Interpretation: After removing the influential cases, each additional year of age is on average associated with an 8.67 lower price, holding distance fixed (significant at the 1% level), and each extra unit of distance is associated with a 24.06 lower price (highly significant). Age and distance together now explain about 53.6% of the variation in price, up from about 44% before.
sqrt(summary(fit2)$r.squared)
## [1] 0.732187
Interpretation: This is R, the multiple correlation between the observed response and the model’s fitted values. My result 0.732 means the model’s predictions correlate about 0.73 with actual prices. Squaring it gives back R² ≈ 0.536, i.e., the model explains ~53.6% of the variance.
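A quick way to verify this equivalence (sketch): the correlation between the observed prices and the fitted values should match the value above.
# R as the correlation between observed and fitted values (sketch)
cor(apart$Price, fitted(fit2))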
fit3 <- lm(Price ~ Age + Distance + ParkingF + BalconyF, data = apart)
print(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + ParkingF + BalconyF, data = apart)
##
## Coefficients:
## (Intercept) Age Distance ParkingFYes BalconyFYes
## 2393.316 -7.970 -21.961 128.700 6.032
Interpretation: Age (−7.97): each additional year of age is associated with a price about 8 units lower, other variables fixed. Distance (−21.96): each extra unit of distance lowers the expected price by about 22 units, other variables fixed. ParkingFYes (+128.7): apartments with parking are about 129 units more expensive than those without, other variables fixed. BalconyFYes (+6.03): a balcony adds about 6 units on average, other variables fixed.
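To illustrate the parking effect, a small sketch predicting prices for two hypothetical apartments that are identical except for parking (Age = 20, Distance = 10, balcony present; values chosen only for illustration):
# Predicted prices: no parking vs. parking, all else equal (sketch)
newdat <- data.frame(Age = 20, Distance = 10,
                     ParkingF = factor(c("No", "Yes"), levels = c("No", "Yes")),
                     BalconyF = factor("Yes", levels = c("No", "Yes")))
predict(fit3, newdata = newdat)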
anova(fit2, fit3)
## Analysis of Variance Table
##
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + ParkingF + BalconyF
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 77 5077362
## 2 75 4791128 2 286234 2.2403 0.1135
Interpretation: Since p ≈ 0.11 > 0.05, I do not have evidence that adding Parking and Balcony jointly improves the fit beyond Age and Distance.
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + ParkingF + BalconyF, data = apart)
##
## Residuals:
## Min 1Q Median 3Q Max
## -390.93 -198.19 -53.64 186.73 518.34
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2393.316 93.930 25.480 < 2e-16 ***
## Age -7.970 3.191 -2.498 0.0147 *
## Distance -21.961 2.830 -7.762 3.39e-11 ***
## ParkingFYes 128.700 60.801 2.117 0.0376 *
## BalconyFYes 6.032 57.307 0.105 0.9165
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 252.7 on 75 degrees of freedom
## Multiple R-squared: 0.5623, Adjusted R-squared: 0.5389
## F-statistic: 24.08 on 4 and 75 DF, p-value: 7.764e-13
Interpretation: Apartments with parking are estimated to cost about 129 units more than those without, other variables fixed (statistically significant at the 5% level). Having a balcony is associated with an about 6 units higher price, other variables fixed, but this effect is not statistically significant.
# Fitted value and residual for apartment with ID = 2
Fitted_ID2 <- fitted(fit3)[apart$ID == 2]
Residual_ID2 <- resid(fit3)[apart$ID == 2]
round(c(Fitted = Fitted_ID2, Residual = Residual_ID2), 3)
## Fitted.2 Residual.2
## 2356.597 443.403
Interpretation: A positive residual means the apartment’s actual price is higher than the model predicts. This apartment was sold for ≈ 2356.597 + 443.403 = 2800 (same units as Price), i.e., about 443 above the model’s expectation given its Age, Distance, Parking, and Balcony.
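A quick consistency check (sketch): the observed price for ID 2 should equal the fitted value plus the residual reported above.
# Observed price for apartment ID 2 (sketch)
apart$Price[apart$ID == 2]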