Although not strictly necessary, I like to check my working directory before importing datasets to avoid path issues. I have also chosen to separate my own comments from the markdown output using banners of “#” characters. Again, this is not required, but I find it makes this R analysis easier to read.
getwd()
## [1] "/home/ll-main"
###################################################################################################
###################################################################################################
###################################################################################################
# #
# TTTTT AAAAA SSSSS K K OOOOO N N EEEEE #
# T A A S K K O O NN N E #
# T AAAAA SSSSS KKK O O N N N EEEE #
# T A A S K K O O N NN E #
# T A A SSSSS K K OOOOO N N EEEEE #
# #
###################################################################################################
###################################################################################################
###################################################################################################
The dataset I have chosen for Task 1 is called “car_prices”. I downloaded this .csv file from Kaggle at the following link: https://www.kaggle.com/datasets/syedanwarafridi/vehicle-sales-data?resource=download
The description given for this dataset is the following:
The “Vehicle Sales and Market Trends Dataset” provides a comprehensive collection of information pertaining to the sales transactions of various vehicles. This dataset encompasses details such as the year, make, model, trim, body type, transmission type, VIN (Vehicle Identification Number), state of registration, condition rating, odometer reading, exterior and interior colors, seller information, Manheim Market Report (MMR) values, selling prices, and sale dates.
car_prices <- read.csv("/home/ll-main/Downloads/car_prices.csv")
head(car_prices) #Inspect the first 6 rows
## year make model trim body transmission
## 1 2015 Kia Sorento LX SUV automatic
## 2 2015 Kia Sorento LX SUV automatic
## 3 2014 BMW 3 Series 328i SULEV Sedan automatic
## 4 2015 Volvo S60 T5 Sedan automatic
## 5 2014 BMW 6 Series Gran Coupe 650i Sedan automatic
## 6 2015 Nissan Altima 2.5 S Sedan automatic
## vin state condition odometer color interior
## 1 5xyktca69fg566472 ca 5 16639 white black
## 2 5xyktca69fg561319 ca 5 9393 white beige
## 3 wba3c1c51ek116351 ca 45 1331 gray black
## 4 yv1612tb4f1310987 ca 41 14282 white black
## 5 wba6b2c57ed129731 ca 43 2641 gray black
## 6 1n4al3ap1fn326013 ca 1 5554 gray black
## seller mmr sellingprice
## 1 kia motors america inc 20500 21500
## 2 kia motors america inc 20800 21500
## 3 financial services remarketing (lease) 31900 30000
## 4 volvo na rep/world omni 27500 27750
## 5 financial services remarketing (lease) 66000 67000
## 6 enterprise vehicle exchange / tra / rental / tulsa 15350 10900
## saledate
## 1 Tue Dec 16 2014 12:30:00 GMT-0800 (PST)
## 2 Tue Dec 16 2014 12:30:00 GMT-0800 (PST)
## 3 Thu Jan 15 2015 04:30:00 GMT-0800 (PST)
## 4 Thu Jan 29 2015 04:30:00 GMT-0800 (PST)
## 5 Thu Dec 18 2014 12:30:00 GMT-0800 (PST)
## 6 Tue Dec 30 2014 12:00:00 GMT-0800 (PST)
I now use the str() function to show the types of variables in this dataset. It includes 558,837 observations of 16 variables in total, as shown below. They are a mix of numeric and categorical variables describing vehicles with model years from 1982 to 2015.
The 16 variables and their types can be read from the str() output below:
str(car_prices)
## 'data.frame': 558837 obs. of 16 variables:
## $ year : int 2015 2015 2014 2015 2014 2015 2014 2014 2014 2014 ...
## $ make : chr "Kia" "Kia" "BMW" "Volvo" ...
## $ model : chr "Sorento" "Sorento" "3 Series" "S60" ...
## $ trim : chr "LX" "LX" "328i SULEV" "T5" ...
## $ body : chr "SUV" "SUV" "Sedan" "Sedan" ...
## $ transmission: chr "automatic" "automatic" "automatic" "automatic" ...
## $ vin : chr "5xyktca69fg566472" "5xyktca69fg561319" "wba3c1c51ek116351" "yv1612tb4f1310987" ...
## $ state : chr "ca" "ca" "ca" "ca" ...
## $ condition : int 5 5 45 41 43 1 34 2 42 3 ...
## $ odometer : int 16639 9393 1331 14282 2641 5554 14943 28617 9557 4809 ...
## $ color : chr "white" "white" "gray" "white" ...
## $ interior : chr "black" "beige" "black" "black" ...
## $ seller : chr "kia motors america inc" "kia motors america inc" "financial services remarketing (lease)" "volvo na rep/world omni" ...
## $ mmr : int 20500 20800 31900 27500 66000 15350 69000 11900 32100 26300 ...
## $ sellingprice: int 21500 21500 30000 27750 67000 10900 65000 9800 32250 17500 ...
## $ saledate : chr "Tue Dec 16 2014 12:30:00 GMT-0800 (PST)" "Tue Dec 16 2014 12:30:00 GMT-0800 (PST)" "Thu Jan 15 2015 04:30:00 GMT-0800 (PST)" "Thu Jan 29 2015 04:30:00 GMT-0800 (PST)" ...
I will now start with some data manipulations.
First, I will check for missing values in the whole dataset, count them by column and get a percentage of how many are missing for each variable.
# Count total missing values in the dataset
sum(is.na(car_prices))
## [1] 11964
# Count missing values by column
colSums(is.na(car_prices))
## year make model trim body transmission
## 0 0 0 0 0 0
## vin state condition odometer color interior
## 0 0 11820 94 0 0
## seller mmr sellingprice saledate
## 0 38 12 0
# Quick overview of missing data with percentages
colMeans(is.na(car_prices)) * 100
## year make model trim body transmission
## 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
## vin state condition odometer color interior
## 0.000000000 0.000000000 2.115106910 0.016820647 0.000000000 0.000000000
## seller mmr sellingprice saledate
## 0.000000000 0.006799836 0.002147317 0.000000000
This output shows that the dataset has 11,964 missing values in total, mostly in the condition column (~2.1%), with very small amounts in odometer, mmr, and sellingprice, while all other variables have no missing data.
I will now remove those observations from the dataset car_prices.
#install.packages("dplyr")
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Remove all rows with any NA values
car_prices_clean <- car_prices %>%
filter(!is.na(condition), !is.na(odometer), !is.na(mmr), !is.na(sellingprice))
# Check dimensions before and after
dim(car_prices) # original
## [1] 558837 16
dim(car_prices_clean) # cleaned
## [1] 546976 16
The output above shows that, after removing the observations with NA values, we are left with 16 variables (unchanged), while the new dataset “car_prices_clean” has 546,976 observations versus 558,837 in the original “car_prices” dataset.
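As a quick sanity check (a small addition of mine, not part of the original output), the column-wise NA counts can be recomputed on the cleaned dataset; every entry should now be 0.
# Confirm no missing values remain after filtering
colSums(is.na(car_prices_clean))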
I will now rename the variables to make them clearer for analysis:
car_prices_clean <- car_prices_clean %>%
rename(
year_model = year,
make_brand = make,
model_name = model,
sale_date = saledate,
price_mmr = mmr,
price_sold = sellingprice
)
names(car_prices_clean)
## [1] "year_model" "make_brand" "model_name" "trim" "body"
## [6] "transmission" "vin" "state" "condition" "odometer"
## [11] "color" "interior" "seller" "price_mmr" "price_sold"
## [16] "sale_date"
I will now create a new variable in the dataset car_prices_clean.
The variable will be called price_per_mile, which divides the selling price (price_sold) by the odometer reading. This provides a simple measure of how much buyers paid per mile of prior usage, giving a rough proxy of value relative to wear. I will only compute it for cars with an odometer reading greater than 0 miles and a selling price greater than 0.
car_prices_clean <- car_prices_clean %>%
mutate(price_per_mile = ifelse(odometer > 0 & price_sold > 0,
price_sold / odometer,
NA_real_)
)
# Quick summary of the new variable
summary(car_prices_clean$price_per_mile)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.08 0.27 7.43 0.65 37000.00
The median is 0.27 and the 1st and 3rd quartiles are 0.08 and 0.65, which means that most vehicles cost only a few cents per mile of usage. However, the mean is 7.43, far above the median, because a small number of cars have very high price-per-mile values. The maximum of 37,000 strongly suggests the presence of severe outliers — likely vehicles with very low mileage (odometer close to zero) but relatively high selling prices. This large gap between the median and mean confirms that the overall distribution is dominated by these extreme cases. Given this, I will now investigate the outliers more closely to understand their characteristics and whether they represent data entry issues or genuine but rare cases.
iqr_outlier_bounds <- function(x) {
Q1 <- quantile(x, 0.25, na.rm = TRUE)
Q3 <- quantile(x, 0.75, na.rm = TRUE)
IQR <- Q3 - Q1
lower <- max(0, Q1 - 1.5 * IQR) # ensure lower bound is never below 0
upper <- Q3 + 1.5 * IQR
return(c(lower = lower, upper = upper))
}
bounds_ppm <- iqr_outlier_bounds(car_prices_clean$price_per_mile)
bounds_ppm
## lower upper.75%
## 0.000000 1.512487
Using the 1.5 × IQR rule, the calculated lower bound would be negative (–0.78), which is not meaningful for this analysis since price_per_mile cannot be less than zero. Therefore, only the upper cutoff of ~1.51 is relevant for identifying outliers in this variable.
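To see how many observations the upper cutoff flags, a quick count can be run (a sketch of my own; the resulting number is not part of the original output).
# Count observations above the upper IQR bound; bounds_ppm[2] is used because
# quantile() carries its "75%" name into the element name ("upper.75%")
sum(car_prices_clean$price_per_mile > bounds_ppm[2], na.rm = TRUE)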
I will now plot a histogram to show the distribution of the variable price per mile.
#install.packages("ggplot2")
library(ggplot2)
# Histogram of price_per_mile
ggplot(car_prices_clean, aes(x = price_per_mile)) +
geom_histogram(binwidth = 0.1, fill = "skyblue", color = "black") +
coord_cartesian(xlim = c(0, 5)) + # zoom in to ignore extreme outliers
labs(title = "Histogram of Price per Mile",
x = "Price per Mile ($ / mile)",
y = "Count")
The histogram shows that most vehicles cost just a few cents per mile, with a heavily right-skewed distribution and a small number of extreme high price-per-mile outliers.
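Because the tail is so long, a log-scaled version of the same histogram may be easier to read (a sketch of mine, not one of the original plots).
# Same distribution on a log10 x-axis, which spreads out the right tail
ggplot(car_prices_clean, aes(x = price_per_mile)) +
  geom_histogram(bins = 50, fill = "skyblue", color = "black") +
  scale_x_log10() +
  labs(title = "Histogram of Price per Mile (log10 scale)",
       x = "Price per Mile ($ / mile, log scale)",
       y = "Count")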
I will now plot two more distributions, one for odometer and one for price_sold, to investigate those variables further. I will also add descriptive statistics for each to help explain the shape of the histogram above.
# Descriptive stats for odometer and selling price
odometer_stats <- car_prices_clean %>%
summarise(
n = n(),
mean = round(mean(odometer, na.rm = TRUE), 2),
median = round(median(odometer, na.rm = TRUE), 2),
sd = round(sd(odometer, na.rm = TRUE), 2),
min = round(min(odometer, na.rm = TRUE), 2),
max = round(max(odometer, na.rm = TRUE), 2)
)
price_stats <- car_prices_clean %>%
summarise(
n = n(),
mean = round(mean(price_sold, na.rm = TRUE), 2),
median = round(median(price_sold, na.rm = TRUE), 2),
sd = round(sd(price_sold, na.rm = TRUE), 2),
min = round(min(price_sold, na.rm = TRUE), 2),
max = round(max(price_sold, na.rm = TRUE), 2)
)
# Histogram of odometer
ggplot(car_prices_clean, aes(x = odometer)) +
geom_histogram(binwidth = 5000, fill = "red", color = "black") +
coord_cartesian(xlim = c(0, 200000)) + # zoom in to ignore extreme outliers
labs(title = "Histogram of Odometer",
x = "Odometer (miles)",
y = "Count")
# Histogram of selling price
ggplot(car_prices_clean, aes(x = price_sold)) +
geom_histogram(binwidth = 1000, fill = "green", color = "black") +
coord_cartesian(xlim = c(0, 100000)) + # zoom in for readability
labs(title = "Histogram of Selling Price",
x = "Selling Price ($)",
y = "Count")
print(odometer_stats)
## n mean median sd min max
## 1 546976 67263.26 51223 52793.56 1 999999
print(price_stats)
## n mean median sd min max
## 1 546976 13764.43 12300 9747.83 1 230000
What can be observed is the following:
Odometer
The odometer values range from 1 to 999,999 miles, with a mean of 67,263 and a median of 51,223. The histogram shows the highest concentration of vehicles between about 20,000 and 60,000 miles, producing a right-skewed distribution. The counts between roughly 60,000 and 100,000 miles stay fairly level, though, and then decrease gradually as mileage increases further.
Selling Price
Selling prices range from 1 to 230,000 dollars, with a mean of 13,764 and a median of 12,300. The histogram shows that most vehicles fall between roughly 5,000 and 20,000 dollars, with progressively fewer observations at higher prices, also forming a right-skewed distribution. The histogram also shows a minor peak at a lower price and a more pronounced one at a slightly higher price.
Relation to Price per Mile
Taken together, these patterns help explain the distribution of price per mile. Most vehicles, with moderate mileage and moderate selling prices, fall below 1 dollar per mile, where the median is 0.27. A smaller set of cases with unusually low mileage or unusually high selling prices account for the extreme right tail in the price per mile distribution.
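To put a number on “most vehicles”, the share of cars below one dollar per mile can be computed directly (my own sketch; the exact percentage is not part of the original output).
# Percentage of vehicles that sold for less than $1 per mile of usage
mean(car_prices_clean$price_per_mile < 1, na.rm = TRUE) * 100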
###################################################################################################
###################################################################################################
###################################################################################################
# #
# TTTTT AAAAA SSSSS K K TTTTT W W OOOOO #
# T A A S K K T W W O O #
# T AAAAA SSSSS KKK T W W W O O #
# T A A S K K T W W W O O #
# T A A SSSSS K K T W W OOOOO #
# #
###################################################################################################
###################################################################################################
###################################################################################################
I will now proceed with Task 2. First, I will import the dataset.
#install.packages(c("readxl","ggplot2","effectsize"))
library(readxl)
library(ggplot2)
library(effectsize)
#Load data
mba <- read_excel("/home/ll-main/Downloads/Task 2/Business School.xlsx")
names(mba) # inspect column names
## [1] "Student ID" "Undergrad Degree" "Undergrad Grade"
## [4] "MBA Grade" "Work Experience" "Employability (Before)"
## [7] "Employability (After)" "Status" "Annual Salary"
#I store a few column names in helper variables to make them a bit more "user friendly" for myself.
deg_col <- "Undergrad Degree"
sal_col <- "Annual Salary"
grade_col <- "MBA Grade"
I now see the column names printed above and will proceed with plotting and finding the most common degree.
# Plot distribution of undergrad degrees
ggplot(mba, aes(x = `Undergrad Degree`)) +
geom_bar() +
coord_flip() + # makes long labels readable
theme_minimal() +
labs(title = "Undergrad Degrees",
x = NULL, y = "Count")
# Which degree is most common?
most_common <- names(which.max(table(mba$`Undergrad Degree`)))
cat("Most common undergrad degree:", most_common, "\n")
## Most common undergrad degree: Business
As per the output, the most common undergrad degree is Business, followed by Finance and Computer Science.
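The full ranking behind this statement can also be printed as a sorted frequency table (a small addition of mine; the counts are not shown in the original output).
# Sorted frequency table of undergrad degrees, most common first
sort(table(mba$`Undergrad Degree`), decreasing = TRUE)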
Now I will investigate the distribution of annual salary and the descriptive statistics for it.
# Descriptive statistics
summary(mba$`Annual Salary`)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 20000 87125 103500 109058 124000 340000
sd(mba$`Annual Salary`, na.rm = TRUE) # standard deviation
## [1] 41501.49
# Histogram
ggplot(mba, aes(x = `Annual Salary`)) +
geom_histogram(binwidth = 5000, fill = "magenta", color = "black") +
theme_minimal() +
labs(title = "Distribution of Annual Salary",
x = "Annual Salary",
y = "Count")
The output shows that annual salaries ranged from 20,000 to 340,000, with a mean of 109,058 and a median of 103,500. The majority of salaries clustered between about 87,000 and 124,000. The distribution is right-skewed, with a few very high salaries acting as outliers and pulling the mean above the median.
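A boxplot shows the same skew and the high-salary outliers at a glance (my own addition, not one of the original plots).
# Boxplot of annual salary; points beyond the upper whisker are the outliers
ggplot(mba, aes(y = `Annual Salary`)) +
  geom_boxplot(fill = "magenta", color = "black") +
  theme_minimal() +
  labs(title = "Boxplot of Annual Salary", y = "Annual Salary")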
I will now test the hypothesis H0 : μMBA Grade = 74, using a one-sample t-test and Cohen’s d.
# Extract MBA grades
grades <- mba$`MBA Grade`
# One-sample t-test against 74
ttest <- t.test(grades, mu = 74)
print(ttest)
##
## One Sample t-test
##
## data: grades
## t = 2.6587, df = 99, p-value = 0.00915
## alternative hypothesis: true mean is not equal to 74
## 95 percent confidence interval:
## 74.51764 77.56346
## sample estimates:
## mean of x
## 76.04055
# Effect size: Cohen's d (vs constant)
cohend <- effectsize::cohens_d(grades, mu = 74)
print(cohend)
## Cohen's d | 95% CI
## ------------------------
## 0.27 | [0.07, 0.46]
##
## - Deviation from a difference of 74.
INTERPRETATION OF THE RESULTS: The sample mean MBA grade was 76.04, compared to the previous year’s average of 74. A one-sample t-test indicated this difference was statistically significant, t(99) = 2.66, p = 0.009, 95% CI [74.52, 77.56]. The effect size was Cohen’s d = 0.27, which reflects a small practical difference.
In other words, the current cohort performed slightly, but statistically significantly, better than last year’s cohort.
###################################################################################################
###################################################################################################
###################################################################################################
# #
# TTTTT AAAAA SSSSS K K TTTTT H H RRRR EEEEE EEEEE #
# T A A S K K T H H R R E E #
# T AAAAA SSSSS KKK T HHHHH RRRR EEEE EEEE #
# T A A S K K T H H R R E E #
# T A A SSSSS K K T H H R R EEEEE EEEEE #
# #
###################################################################################################
###################################################################################################
###################################################################################################
#Load data
apartment_data <- read_excel("/home/ll-main/Downloads/Task 3/Apartments.xlsx") # set correct path if different
str(apartment_data)
## tibble [85 × 5] (S3: tbl_df/tbl/data.frame)
## $ Age : num [1:85] 7 18 7 28 18 28 14 18 22 25 ...
## $ Distance: num [1:85] 28 1 28 29 18 12 20 6 7 2 ...
## $ Price : num [1:85] 1640 2800 1660 1850 1640 1770 1850 1970 2270 2570 ...
## $ Parking : num [1:85] 0 1 0 0 1 0 0 1 1 1 ...
## $ Balcony : num [1:85] 1 0 0 1 1 1 1 1 0 0 ...
Description: The dataset contains 85 apartments from Ljubljana with five variables: Age (age of the apartment in years), Distance (distance from the city center in km), Price (price per m² in €), and the 0/1 indicators Parking and Balcony. I convert the two indicators to labeled factors below.
# Convert 0/1 to factors with labels
apartment_data$Parking <- factor(apartment_data$Parking, levels = c(0, 1), labels = c("No", "Yes"))
apartment_data$Balcony <- factor(apartment_data$Balcony, levels = c(0, 1), labels = c("No", "Yes"))
#Check if it worked
table(apartment_data$Parking)
##
## No Yes
## 42 43
table(apartment_data$Balcony)
##
## No Yes
## 48 37
str(apartment_data[, c("Parking","Balcony")])
## tibble [85 × 2] (S3: tbl_df/tbl/data.frame)
## $ Parking: Factor w/ 2 levels "No","Yes": 1 2 1 1 2 1 1 2 2 2 ...
## $ Balcony: Factor w/ 2 levels "No","Yes": 2 1 1 2 2 2 2 2 1 1 ...
# One-sample t-test against 1900 for price
t_price <- t.test(apartment_data$Price, mu = 1900)
t_price
##
## One Sample t-test
##
## data: apartment_data$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
## 1937.443 2100.440
## sample estimates:
## mean of x
## 2018.941
# I used Cohen's d for effect size
d_price <- (mean(apartment_data$Price) - 1900) / sd(apartment_data$Price)
d_price
## [1] 0.314791
sd(apartment_data$Price)
## [1] 377.8417
The average apartment price per m² in the Ljubljana sample was 2018.94 € (SD = 377.84, n = 85). A one-sample t-test indicated that this was significantly higher than the benchmark value of 1900 €, t(84) = 2.90, p = 0.0047, 95% CI [1937.44, 2100.44]. The effect size was Cohen’s d = 0.31, suggesting a small-to-moderate practical difference from the benchmark.
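Since the effectsize package is already loaded from Task 2, the manual Cohen’s d can be cross-checked against the package function used there (a sketch of mine; I did not run this as part of the original analysis).
# Cross-check the manual Cohen's d with the effectsize package
effectsize::cohens_d(apartment_data$Price, mu = 1900)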
#Price as a function of Age
fit1 <- lm(Price ~ Age, data = apartment_data)
summary(fit1)
##
## Call:
## lm(formula = Price ~ Age, data = apartment_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -623.9 -278.0 -69.8 243.5 776.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2185.455 87.043 25.108 <2e-16 ***
## Age -8.975 4.164 -2.156 0.034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared: 0.05302, Adjusted R-squared: 0.04161
## F-statistic: 4.647 on 1 and 83 DF, p-value: 0.03401
#Correlation coefficient (r) between Price and Age
r_age <- cor(apartment_data$Price, apartment_data$Age)
r_age
## [1] -0.230255
#Coefficient of determination r square
r2_age <- r_age^2
r2_age
## [1] 0.05301737
The regression shows that the average price per m² goes down by about 9 € for every extra year of apartment age. The intercept (around 2185 €) just means that a brand-new apartment would be expected to cost about that much per m².
The correlation between age and price is about –0.23, which is a weak negative relationship: older apartments tend to be a bit cheaper.
The R² value is about 0.05, so age only explains around 5% of the differences in apartment prices — most of the variation comes from other factors.
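The weak negative relationship can also be visualized by plotting Price against Age with the fitted line from fit1 overlaid (my own sketch, not one of the original plots).
# Scatterplot of Price vs Age with the regression line from fit1
plot(apartment_data$Age, apartment_data$Price,
     xlab = "Age (years)", ylab = "Price (EUR per m2)",
     main = "Price vs Age", pch = 19, col = "steelblue")
abline(fit1, col = "red", lwd = 2)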
pairs(~ Price + Age + Distance,
data = apartment_data,
main = "Scatterplot Matrix")
The scatterplot matrix shows a clear negative relationship between Price and Distance and a weaker one between Price and Age. Multicollinearity, however, concerns the predictors themselves, and Age and Distance do not appear strongly related to each other, so I don’t expect it to be a problem here.
# Multiple regression: Price as a function of Age and Distance
fit2 <- lm(Price ~ Age + Distance, data = apartment_data)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = apartment_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.23 -219.94 -85.68 211.31 689.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2460.101 76.632 32.10 < 2e-16 ***
## Age -7.934 3.225 -2.46 0.016 *
## Distance -20.667 2.748 -7.52 6.18e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared: 0.4396, Adjusted R-squared: 0.4259
## F-statistic: 32.16 on 2 and 82 DF, p-value: 4.896e-11
The regression shows that both apartment age and distance from the city center have a negative effect on price per m². On average, each year of age lowers the price by about 8 €, and each kilometer farther from the city center lowers it by about 21 €. Together, age and distance explain about 44% of the differences in apartment prices.
# Regression of Age on Distance
fit_age_on_dist <- lm(Age ~ Distance, data = apartment_data)
r2_age <- summary(fit_age_on_dist)$r.squared
vif_age <- 1 / (1 - r2_age)
# Regression of Distance on Age
fit_dist_on_age <- lm(Distance ~ Age, data = apartment_data)
r2_dist <- summary(fit_dist_on_age)$r.squared
vif_dist <- 1 / (1 - r2_dist)
vif_age
## [1] 1.001845
vif_dist
## [1] 1.001845
Unfortunately, I ran into an issue with the car package on my computer due to some problems with R on Linux. I would normally have used library(car) and then vif(); to avoid further trouble, I computed the VIF values manually this time.
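For reference, here is a sketch of the car-based call (left unevaluated because of the installation issue just mentioned).
# Equivalent check with the car package (not run due to the install issue):
# install.packages("car")
# library(car)
# vif(fit2) # should return values close to 1 for both Age and Distance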
Interpretation: The VIF values for Age and Distance are both very close to 1. This means there is essentially no multicollinearity between the two predictors. In other words, Age and Distance are not strongly correlated with each other, so both can safely be included in the regression model.
#Standardized residuals
std_res <- rstandard(fit2)
#Cook's distances
cook_d <- cooks.distance(fit2)
#Combined into a data frame for presentation
resid_df <- data.frame(
id = 1:length(std_res),
std_res = std_res,
cook_d = cook_d
)
head(resid_df)
## id std_res cook_d
## 1 1 -0.6653487 0.007386569
## 2 2 1.7832876 0.030365432
## 3 3 -0.5937629 0.005882612
## 4 4 0.7543794 0.008299153
## 5 5 -1.0733987 0.005112584
## 6 6 -0.7775190 0.004900891
#Identify potential problems. I use "outliers", "influential" and "problematic" as object names here.
outliers <- which(abs(std_res) > 3) # standardized residuals > 3 in abs value
influential <- which(cook_d > (4 / nrow(apartment_data))) # rule of thumb
problematic <- union(outliers, influential)
problematic
## [1] 22 33 38 53 55
Some apartments stand out as unusual compared to the rest. Five units (IDs 22, 33, 38, 53, and 55) had either large residuals or high influence on the regression results. To avoid these outliers distorting the analysis, I remove them and continue with the cleaned dataset:
# Remove problematic cases
apartment_data_clean <- apartment_data[-c(22, 33, 38, 53, 55), ]
nrow(apartment_data) # original sample size
## [1] 85
nrow(apartment_data_clean) # cleaned sample size
## [1] 80
The output confirms 5 units were successfully removed when comparing the original vs. cleaned datasets.
# Refit the model with the cleaned data
fit2_clean <- lm(Price ~ Age + Distance, data = apartment_data_clean)
# Standardized residuals
std_res <- rstandard(fit2_clean)
# Standardized fitted values
std_fit <- scale(fitted(fit2_clean))[,1]
# Scatterplot
plot(std_fit, std_res,
xlab = "Standardized Fitted Values",
ylab = "Standardized Residuals",
main = "Residuals vs Fitted (Standardized)",
pch = 19, col = "steelblue")
abline(h = 0, lty = 2, col = "red")
I don’t believe there is any strong heteroskedasticity in this model; I don’t see an obvious cone shape, and the points seem fairly well clustered randomly around 0.
# Standardized residuals
std_res <- rstandard(fit2_clean)
# Histogram with normal curve
hist(std_res, breaks = 10, freq = FALSE,
main = "Histogram of Standardized Residuals",
xlab = "Standardized Residuals",
col = "skyblue", border = "white")
# Overlay what a standard normal curve would look like, to make the histogram easier to read
x <- seq(min(std_res), max(std_res), length = 100)
lines(x, dnorm(x, mean = 0, sd = 1), col = "red", lwd = 2)
# Formal test: Shapiro-Wilk
shapiro.test(std_res)
##
## Shapiro-Wilk normality test
##
## data: std_res
## W = 0.94156, p-value = 0.001168
The histogram of standardized residuals looks roughly bell-shaped and centered around zero, which is what we would expect if they were normally distributed. However, the Shapiro-Wilk test was significant (W = 0.94, p = 0.001), so we formally reject the null hypothesis that the residuals are normally distributed.
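A Q-Q plot is a useful complementary check here, since with n = 80 the Shapiro-Wilk test can flag even modest deviations from normality (my own addition, not one of the original plots).
# Q-Q plot of the standardized residuals against theoretical normal quantiles
qqnorm(std_res, main = "Normal Q-Q Plot of Standardized Residuals")
qqline(std_res, col = "red", lwd = 2)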
#Re-fit the model without problematic cases
fit2_clean <- lm(Price ~ Age + Distance, data = apartment_data_clean)
summary(fit2_clean)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = apartment_data_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -411.50 -203.69 -45.24 191.11 492.56
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2502.467 75.024 33.356 < 2e-16 ***
## Age -8.674 3.221 -2.693 0.00869 **
## Distance -24.063 2.692 -8.939 1.57e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 256.8 on 77 degrees of freedom
## Multiple R-squared: 0.5361, Adjusted R-squared: 0.524
## F-statistic: 44.49 on 2 and 77 DF, p-value: 1.437e-13
After removing the outliers, the model fits better. Both apartment age and distance from the city center significantly reduce the price per m². On average, each extra year of age lowers the price by around 9 €, and each additional kilometer from the center lowers it by about 24 €. Together, these two factors explain about 54% of the variation in prices, indicating a reasonably strong fit.
# Multiple regression with Age, Distance, Parking, and Balcony
fit3 <- lm(Price ~ Age + Distance + Parking + Balcony, data = apartment_data_clean)
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = apartment_data_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -390.93 -198.19 -53.64 186.73 518.34
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2393.316 93.930 25.480 < 2e-16 ***
## Age -7.970 3.191 -2.498 0.0147 *
## Distance -21.961 2.830 -7.762 3.39e-11 ***
## ParkingYes 128.700 60.801 2.117 0.0376 *
## BalconyYes 6.032 57.307 0.105 0.9165
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 252.7 on 75 degrees of freedom
## Multiple R-squared: 0.5623, Adjusted R-squared: 0.5389
## F-statistic: 24.08 on 4 and 75 DF, p-value: 7.764e-13
The extended model shows that both age and distance still lower the price per m², while parking raises it. On average, apartments with parking are about 129 € per m² more expensive, even after controlling for age and distance. Having a balcony does not make a meaningful difference. The model explains about 56% of the variation in prices, which is a reasonably strong fit, a bit better than before.
anova(fit2_clean, fit3)
## Analysis of Variance Table
##
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Parking + Balcony
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 77 5077362
## 2 75 4791128 2 286234 2.2403 0.1135
The ANOVA test shows that adding Parking and Balcony to the model does not significantly improve the fit compared to using only Age and Distance (p = 0.11). This means that, although Parking was significant on its own in fit3, the two predictors together don’t improve the overall model enough to be considered statistically better than fit2.
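The same comparison can be made with an information criterion (my own sketch; the AIC values are not part of the original output).
# Compare the two models by AIC; the lower value indicates the better
# trade-off between fit and complexity
AIC(fit2_clean, fit3)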
Looking again at the individual coefficients of fit3 (shown in the summary above):
ParkingYes (128.7): Apartments with parking cost on average about 129 € per m² more than apartments without parking, holding age and distance constant. This effect is statistically significant (p = 0.038).
BalconyYes (6.03): Apartments with a balcony cost on average about 6 € per m² more than those without, but this difference is not significant (p = 0.92).
Null hypothesis (H0): All regression coefficients (except the intercept) are equal to zero. In other words, Age, Distance, Parking, and Balcony together have no effect on apartment prices.
Alternative hypothesis (H1): At least one predictor has a non-zero coefficient (the model has predictive value).
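Given the overall F-test reported above, F(4, 75) = 24.08 with p = 7.764e-13, H0 is clearly rejected: the model as a whole has predictive value. The same p-value can be extracted programmatically (a sketch of mine).
# Extract the overall F statistic from summary(fit3) and compute its p-value
f <- summary(fit3)$fstatistic
pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)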
# Add fitted values and residuals
apartment_data_clean$fitted_fit3 <- fitted(fit3)
apartment_data_clean$residual_fit3 <- resid(fit3)
# Show results for the 2nd observation
apartment_data_clean[2, c("Price","fitted_fit3","residual_fit3")]
## # A tibble: 1 × 3
## Price fitted_fit3 residual_fit3
## <dbl> <dbl> <dbl>
## 1 2800 2357. 443.
For apartment 2, the actual price per m² was 2800 €, while the model predicted a price of about 2357 €. The residual is +443 €, meaning this apartment is more expensive than what the model expected by around 443 € per m². (That can be considered quite a lot on a student budget… ahahah)
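The fitted value can be reproduced by hand from the fit3 coefficients; per the str() output at the start of Task 3, observation 2 has Age = 18, Distance = 1, Parking = Yes, and Balcony = No (a worked sketch of mine).
# Rebuild the fitted value for observation 2 from the coefficients:
# 2393.316 - 7.970*18 - 21.961*1 + 128.700*1 + 6.032*0 is roughly 2357
b <- coef(fit3)
unname(b["(Intercept)"] + b["Age"]*18 + b["Distance"]*1 +
       b["ParkingYes"]*1 + b["BalconyYes"]*0)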