Although not strictly necessary, I like to check my working directory before importing datasets to avoid path issues. I have also chosen to separate my own comments from the markdown outputs using “#” characters; again, this is not required, but it makes the contents of this R analysis easier to read.

getwd()
## [1] "/home/ll-main"
###################################################################################################
###################################################################################################
###################################################################################################
#                                                                                                 #
#   TTTTT   AAAAA   SSSSS  K   K        OOOOO  N   N  EEEEE                                       #
#     T     A   A   S      K  K         O   O  NN  N  E                                           #
#     T     AAAAA   SSSSS  KKK           O   O  N N N  EEEE                                       #
#     T     A   A       S  K  K         O   O  N  NN  E                                           #
#     T     A   A   SSSSS  K   K        OOOOO  N   N  EEEEE                                       #
#                                                                                                 #
###################################################################################################
###################################################################################################
###################################################################################################

The dataset I have chosen for Task 1 is called “car_prices”. I downloaded this .csv file from Kaggle at the following link: https://www.kaggle.com/datasets/syedanwarafridi/vehicle-sales-data?resource=download

The description given for this dataset is the following:

The “Vehicle Sales and Market Trends Dataset” provides a comprehensive collection of information pertaining to the sales transactions of various vehicles. This dataset encompasses details such as the year, make, model, trim, body type, transmission type, VIN (Vehicle Identification Number), state of registration, condition rating, odometer reading, exterior and interior colors, seller information, Manheim Market Report (MMR) values, selling prices, and sale dates.

car_prices <- read.csv("/home/ll-main/Downloads/car_prices.csv")
head(car_prices) #Inspect the first 6 rows
##   year   make               model       trim  body transmission
## 1 2015    Kia             Sorento         LX   SUV    automatic
## 2 2015    Kia             Sorento         LX   SUV    automatic
## 3 2014    BMW            3 Series 328i SULEV Sedan    automatic
## 4 2015  Volvo                 S60         T5 Sedan    automatic
## 5 2014    BMW 6 Series Gran Coupe       650i Sedan    automatic
## 6 2015 Nissan              Altima      2.5 S Sedan    automatic
##                 vin state condition odometer color interior
## 1 5xyktca69fg566472    ca         5    16639 white    black
## 2 5xyktca69fg561319    ca         5     9393 white    beige
## 3 wba3c1c51ek116351    ca        45     1331  gray    black
## 4 yv1612tb4f1310987    ca        41    14282 white    black
## 5 wba6b2c57ed129731    ca        43     2641  gray    black
## 6 1n4al3ap1fn326013    ca         1     5554  gray    black
##                                               seller   mmr sellingprice
## 1                            kia motors america  inc 20500        21500
## 2                            kia motors america  inc 20800        21500
## 3             financial services remarketing (lease) 31900        30000
## 4                            volvo na rep/world omni 27500        27750
## 5             financial services remarketing (lease) 66000        67000
## 6 enterprise vehicle exchange / tra / rental / tulsa 15350        10900
##                                  saledate
## 1 Tue Dec 16 2014 12:30:00 GMT-0800 (PST)
## 2 Tue Dec 16 2014 12:30:00 GMT-0800 (PST)
## 3 Thu Jan 15 2015 04:30:00 GMT-0800 (PST)
## 4 Thu Jan 29 2015 04:30:00 GMT-0800 (PST)
## 5 Thu Dec 18 2014 12:30:00 GMT-0800 (PST)
## 6 Tue Dec 30 2014 12:00:00 GMT-0800 (PST)

I now use the str() function to show the types of variables in this dataset. It includes 558,837 observations of 16 variables in total, as shown below. They are a mix of numeric and categorical variables describing vehicles with model years ranging from 1982 to 2015.

The 16 variables and their types/information can be determined to be:

  1. year: Numeric (discrete) — represents the car’s model year. Could also be treated as categorical if I want to analyze by year groups.
  2. make: Categorical (nominal) — manufacturer/brand of the car (e.g., Kia, BMW, Volvo).
  3. model: Categorical (nominal) — specific product line or model of the car (e.g., Sorento, 3 Series).
  4. trim: Categorical (nominal) — specific sub-version or configuration of the model, also known as “trim level”. E.g. a base model vs Luxury trim for the same model car.
  5. body: Categorical (nominal) — type of vehicle body (SUV, Sedan, etc.).
  6. transmission: Categorical (nominal) — type of transmission (automatic, manual).
  7. vin: Categorical (identifier) — unique vehicle identification number. Can be treated as an ID for the specific vehicle sold.
  8. state: Categorical (nominal) — state abbreviation where the vehicle is listed.
  9. condition: Ordinal (integer scale) — a numeric code representing condition, but not a continuous measure; it likely follows an ordered scale of quality or wear.
  10. odometer: Numeric (continuous) — mileage of the car, measured in miles.
  11. color: Categorical (nominal) — exterior color of the car.
  12. interior: Categorical (nominal) — interior color of the car.
  13. seller: Categorical (nominal) — name of the seller or dealer.
  14. mmr: Numeric (continuous) — “Manheim Market Report” wholesale price estimate (dollar value).
  15. sellingprice: Numeric (continuous) — actual selling price of the car (dollar value).
  16. saledate: Categorical (date-time string) — recorded as character, but semantically a date/time variable.

The str() output is shown below:

str(car_prices)
## 'data.frame':    558837 obs. of  16 variables:
##  $ year        : int  2015 2015 2014 2015 2014 2015 2014 2014 2014 2014 ...
##  $ make        : chr  "Kia" "Kia" "BMW" "Volvo" ...
##  $ model       : chr  "Sorento" "Sorento" "3 Series" "S60" ...
##  $ trim        : chr  "LX" "LX" "328i SULEV" "T5" ...
##  $ body        : chr  "SUV" "SUV" "Sedan" "Sedan" ...
##  $ transmission: chr  "automatic" "automatic" "automatic" "automatic" ...
##  $ vin         : chr  "5xyktca69fg566472" "5xyktca69fg561319" "wba3c1c51ek116351" "yv1612tb4f1310987" ...
##  $ state       : chr  "ca" "ca" "ca" "ca" ...
##  $ condition   : int  5 5 45 41 43 1 34 2 42 3 ...
##  $ odometer    : int  16639 9393 1331 14282 2641 5554 14943 28617 9557 4809 ...
##  $ color       : chr  "white" "white" "gray" "white" ...
##  $ interior    : chr  "black" "beige" "black" "black" ...
##  $ seller      : chr  "kia motors america  inc" "kia motors america  inc" "financial services remarketing (lease)" "volvo na rep/world omni" ...
##  $ mmr         : int  20500 20800 31900 27500 66000 15350 69000 11900 32100 26300 ...
##  $ sellingprice: int  21500 21500 30000 27750 67000 10900 65000 9800 32250 17500 ...
##  $ saledate    : chr  "Tue Dec 16 2014 12:30:00 GMT-0800 (PST)" "Tue Dec 16 2014 12:30:00 GMT-0800 (PST)" "Thu Jan 15 2015 04:30:00 GMT-0800 (PST)" "Thu Jan 29 2015 04:30:00 GMT-0800 (PST)" ...
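
As noted in the variable list, saledate is stored as a character string. Below is a minimal sketch of how it could be parsed into a proper Date (assuming an English locale for the month abbreviations); I do not use this conversion in the rest of the analysis.

# Hypothetical parsing of saledate: extract e.g. "Dec 16 2014" and convert to Date.
# Not used further in this analysis; assumes an English locale.
sale_dates <- as.Date(substr(car_prices$saledate, 5, 15), format = "%b %d %Y")
head(sale_dates)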

I will now start with some data manipulations.

First, I will check for missing values in the whole dataset, count them by column and get a percentage of how many are missing for each variable.

# Count total missing values in the dataset
sum(is.na(car_prices))
## [1] 11964
# Count missing values by column
colSums(is.na(car_prices))
##         year         make        model         trim         body transmission 
##            0            0            0            0            0            0 
##          vin        state    condition     odometer        color     interior 
##            0            0        11820           94            0            0 
##       seller          mmr sellingprice     saledate 
##            0           38           12            0
# Quick overview of missing data with percentages
colMeans(is.na(car_prices)) * 100
##         year         make        model         trim         body transmission 
##  0.000000000  0.000000000  0.000000000  0.000000000  0.000000000  0.000000000 
##          vin        state    condition     odometer        color     interior 
##  0.000000000  0.000000000  2.115106910  0.016820647  0.000000000  0.000000000 
##       seller          mmr sellingprice     saledate 
##  0.000000000  0.006799836  0.002147317  0.000000000

This output shows that the dataset has 11,964 missing values in total, mostly in the condition column (~2.1%), with very small amounts in odometer, mmr, and sellingprice, while all other variables have no missing data.

I will now remove those observations from the dataset car_prices.

#install.packages("dplyr")
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# Remove all rows with any NA values
car_prices_clean <- car_prices %>%
  filter(!is.na(condition), !is.na(odometer), !is.na(mmr), !is.na(sellingprice))

# Check dimensions before and after
dim(car_prices)       # original
## [1] 558837     16
dim(car_prices_clean) # cleaned
## [1] 546976     16

The output above shows that after removing the observations with NA values we still have 16 variables (unchanged), while the new dataset “car_prices_clean” has 546,976 observations versus 558,837 observations in the original “car_prices” dataset.
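
As a side note, the number of dropped rows (558,837 − 546,976 = 11,861) is slightly smaller than the total number of missing values (11,964), because some rows contain more than one NA. A quick check of this, as a sketch:

# Rows dropped vs. total count of NA values (some rows have more than one NA)
nrow(car_prices) - nrow(car_prices_clean)
sum(is.na(car_prices))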

I will now rename the variables to make them clearer for analysis:

car_prices_clean <- car_prices_clean %>%
  rename(
    year_model   = year,
    make_brand   = make,
    model_name   = model,
    sale_date    = saledate,
    price_mmr    = mmr,
    price_sold   = sellingprice
  )
names(car_prices_clean)
##  [1] "year_model"   "make_brand"   "model_name"   "trim"         "body"        
##  [6] "transmission" "vin"          "state"        "condition"    "odometer"    
## [11] "color"        "interior"     "seller"       "price_mmr"    "price_sold"  
## [16] "sale_date"

I will now create a new variable in the dataset car_prices_clean.

The variable will be called price_per_mile, which divides the selling price (price_sold) by the odometer reading. This provides a simple measure of how much buyers paid per mile on the car, giving a rough proxy of value relative to usage. I will only include cars which have an odometer reading greater than 0 miles and prices of more than 0.

car_prices_clean <- car_prices_clean %>%
  mutate(price_per_mile = ifelse(odometer > 0 & price_sold > 0,
                                 price_sold / odometer,
                                 NA_real_)
  )

# Quick summary of the new variable
summary(car_prices_clean$price_per_mile)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     0.00     0.08     0.27     7.43     0.65 37000.00

The median is 0.27 and the 1st and 3rd quartiles are 0.08 and 0.65, which means that most vehicles cost only a few cents per mile of usage. However, the mean is 7.43, far above the median, because a small number of cars have very high price-per-mile values. The maximum of 37,000 strongly suggests the presence of severe outliers — likely vehicles with very low mileage (odometer close to zero) but relatively high selling prices. This large gap between the median and mean confirms that the overall distribution is dominated by these extreme cases. Given this, I will now investigate the outliers more closely to understand their characteristics and whether they represent data entry issues or genuine but rare cases.

iqr_outlier_bounds <- function(x) {
  Q1 <- quantile(x, 0.25, na.rm = TRUE)
  Q3 <- quantile(x, 0.75, na.rm = TRUE)
  IQR <- Q3 - Q1
  lower <- max(0, Q1 - 1.5 * IQR)  # ensure lower bound is never below 0
  upper <- Q3 + 1.5 * IQR
  return(c(lower = lower, upper = upper))
}

bounds_ppm <- iqr_outlier_bounds(car_prices_clean$price_per_mile)
bounds_ppm
##     lower upper.75% 
##  0.000000  1.512487

Using the 1.5 × IQR rule, the calculated lower bound would be negative (–0.78), which is not meaningful for this analysis since price_per_mile cannot be less than zero. Therefore, only the upper cutoff of ~1.51 is relevant for identifying outliers in this variable.
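
As a quick sanity check (a sketch only, not used later), I can count how many observations fall above this upper cutoff:

# Number of price_per_mile values above the upper IQR cutoff computed above
upper_cut <- unname(bounds_ppm[2])
sum(car_prices_clean$price_per_mile > upper_cut, na.rm = TRUE)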

I will now plot a histogram to show the distribution of the variable price per mile.

#install.packages("ggplot2")
library(ggplot2)

# Histogram of price_per_mile
ggplot(car_prices_clean, aes(x = price_per_mile)) +
  geom_histogram(binwidth = 0.1, fill = "skyblue", color = "black") +
  coord_cartesian(xlim = c(0, 5)) +  # zoom in to ignore extreme outliers
  labs(title = "Histogram of Price per Mile",
       x = "Price per Mile ($ / mile)",
       y = "Count")

The histogram shows that most vehicles cost just a few cents per mile, with a heavily right-skewed distribution and a small number of extreme high price-per-mile outliers.

I will now plot two more distributions, one for odometer and one for selling price, in order to investigate the variables odometer and price_sold further. I will also add descriptive statistics for each to investigate the origin of the histogram above.

# Descriptive stats for odometer and selling price
odometer_stats <- car_prices_clean %>%
  summarise(
    n      = n(),
    mean   = round(mean(odometer, na.rm = TRUE), 2),
    median = round(median(odometer, na.rm = TRUE), 2),
    sd     = round(sd(odometer, na.rm = TRUE), 2),
    min    = round(min(odometer, na.rm = TRUE), 2),
    max    = round(max(odometer, na.rm = TRUE), 2)
  )

price_stats <- car_prices_clean %>%
  summarise(
    n      = n(),
    mean   = round(mean(price_sold, na.rm = TRUE), 2),
    median = round(median(price_sold, na.rm = TRUE), 2),
    sd     = round(sd(price_sold, na.rm = TRUE), 2),
    min    = round(min(price_sold, na.rm = TRUE), 2),
    max    = round(max(price_sold, na.rm = TRUE), 2)
  )

# Histogram of odometer
ggplot(car_prices_clean, aes(x = odometer)) +
  geom_histogram(binwidth = 5000, fill = "red", color = "black") +
  coord_cartesian(xlim = c(0, 200000)) +  # zoom in to ignore extreme outliers
  labs(title = "Histogram of Odometer",
       x = "Odometer (miles)",
       y = "Count")

# Histogram of selling price
ggplot(car_prices_clean, aes(x = price_sold)) +
  geom_histogram(binwidth = 1000, fill = "green", color = "black") +
  coord_cartesian(xlim = c(0, 100000)) +  # zoom in for readability
  labs(title = "Histogram of Selling Price",
       x = "Selling Price ($)",
       y = "Count")

print(odometer_stats)
##        n     mean median       sd min    max
## 1 546976 67263.26  51223 52793.56   1 999999
print(price_stats)
##        n     mean median      sd min    max
## 1 546976 13764.43  12300 9747.83   1 230000

What can be observed is the following:

Odometer

The odometer values range from 1 to 999,999 miles, with a mean of 67,263 and a median of 51,223. The histogram shows the highest concentration of vehicles between about 20,000 and 60,000 miles, with steadily fewer observations as mileage increases, producing a right-skewed distribution. The counts between roughly 60,000 and 100,000 miles stay fairly level, though, and then decrease gradually as mileage increases further.

Selling Price

Selling prices range from 1 to 230,000 dollars, with a mean of 13,764 and a median of 12,300. The histogram shows that most vehicles fall between roughly 5,000 and 20,000 dollars, with progressively fewer observations at higher prices, also forming a right-skewed distribution. The histogram also shows a minor peak at a lower price and a more pronounced one at a slightly higher price.

Relation to Price per Mile

Taken together, these patterns help explain the distribution of price per mile. Most vehicles, with moderate mileage and moderate selling prices, fall below 1 dollar per mile, where the median is 0.27. A smaller set of cases with unusually low mileage or unusually high selling prices account for the extreme right tail in the price per mile distribution.
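
A simple sketch of this claim is to compute the share of vehicles below 1 $ per mile (not used in the main analysis):

# Proportion of vehicles with a price per mile below 1 dollar
mean(car_prices_clean$price_per_mile < 1, na.rm = TRUE)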

###################################################################################################
###################################################################################################
###################################################################################################
#                                                                                                 #
#   TTTTT   AAAAA   SSSSS  K   K        TTTTT   W   W   OOOOO                                     #
#     T     A   A   S      K  K           T     W   W   O   O                                     #
#     T     AAAAA   SSSSS  KKK            T     W W W   O   O                                     #
#     T     A   A       S  K  K           T     W W W   O   O                                     #
#     T     A   A   SSSSS  K   K          T     W   W   OOOOO                                     #
#                                                                                                 #
###################################################################################################
###################################################################################################
###################################################################################################

I will now proceed with task 2. First, I will import the dataset.

#install.packages(c("readxl","ggplot2","effectsize"))
library(readxl)
library(ggplot2)
library(effectsize)


#Load data
mba <- read_excel("/home/ll-main/Downloads/Task 2/Business School.xlsx")  
names(mba)  # inspect column names
## [1] "Student ID"             "Undergrad Degree"       "Undergrad Grade"       
## [4] "MBA Grade"              "Work Experience"        "Employability (Before)"
## [7] "Employability (After)"  "Status"                 "Annual Salary"
#Store the column names I use most often in short helper variables to make them a bit more "user friendly" for myself.
deg_col   <- "Undergrad Degree"
sal_col   <- "Annual Salary"
grade_col <- "MBA Grade"
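
These helper variables are not strictly required (the code below uses the backticked column names directly), but as a sketch they could be used for indexing like this:

# Example of indexing with the stored column-name strings
summary(mba[[sal_col]])
table(mba[[deg_col]])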

I now see the column names printed above and will proceed with plotting and finding the most common degree.

# Plot distribution of undergrad degrees
ggplot(mba, aes(x = `Undergrad Degree`)) +
  geom_bar() +
  coord_flip() +                 # makes long labels readable
  theme_minimal() +
  labs(title = "Undergrad Degrees",
       x = NULL, y = "Count")

# Which degree is most common?
most_common <- names(which.max(table(mba$`Undergrad Degree`)))
cat("Most common undergrad degree:", most_common, "\n")
## Most common undergrad degree: Business

As per the output, the most common undergrad degree is Business, followed by Finance and Computer Science.
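
To see the full ranking behind this statement, the frequency table can be sorted (a quick sketch):

# Frequency of each undergrad degree, from most to least common
sort(table(mba$`Undergrad Degree`), decreasing = TRUE)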

Now I will investigate the distribution of annual salary and the descriptive statistics for it.

# Descriptive statistics
summary(mba$`Annual Salary`)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   20000   87125  103500  109058  124000  340000
sd(mba$`Annual Salary`, na.rm = TRUE)  # standard deviation
## [1] 41501.49
# Histogram
ggplot(mba, aes(x = `Annual Salary`)) +
  geom_histogram(binwidth = 5000, fill = "magenta", color = "black") +
  theme_minimal() +
  labs(title = "Distribution of Annual Salary",
       x = "Annual Salary",
       y = "Count")

The output shows that annual salaries ranged from 20,000 to 340,000, with a mean of 109,058 and a median of 103,500. The majority of salaries clustered between about 87,000 and 124,000. The distribution is right-skewed, with a few very high salaries acting as outliers and pulling the mean above the median.
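
To quantify the right tail mentioned above, here is a sketch using the same 1.5 × IQR rule as in Task 1:

# Count salaries above Q3 + 1.5 * IQR (rough outlier check for the right tail)
q_sal <- quantile(mba$`Annual Salary`, c(0.25, 0.75), na.rm = TRUE)
sum(mba$`Annual Salary` > q_sal[2] + 1.5 * diff(q_sal), na.rm = TRUE)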

I will now test the hypothesis H0: μ_MBA Grade = 74. I will use a one-sample t-test and Cohen’s d for this.

# Extract MBA grades
grades <- mba$`MBA Grade`

# One-sample t-test against 74
ttest <- t.test(grades, mu = 74)
print(ttest)
## 
##  One Sample t-test
## 
## data:  grades
## t = 2.6587, df = 99, p-value = 0.00915
## alternative hypothesis: true mean is not equal to 74
## 95 percent confidence interval:
##  74.51764 77.56346
## sample estimates:
## mean of x 
##  76.04055
# Effect size: Cohen's d (vs constant)
cohend <- effectsize::cohens_d(grades, mu = 74)
print(cohend)
## Cohen's d |       95% CI
## ------------------------
## 0.27      | [0.07, 0.46]
## 
## - Deviation from a difference of 74.

INTERPRETATION OF THE RESULTS: The sample mean MBA grade was 76.04, compared to the previous year’s average of 74. A one-sample t-test indicated this difference was statistically significant, t(99) = 2.66, p = 0.009, 95% CI [74.52, 77.56]. The effect size was Cohen’s d = 0.27, which reflects a small practical difference.

In other words, the current cohort performed slightly (but in a statistically significant way) better than last year’s cohort.

###################################################################################################
###################################################################################################
###################################################################################################
#                                                                                                 #
#   TTTTT   AAAAA   SSSSS  K   K        TTTTT   H   H  RRRR   EEEEE  EEEEE                         #
#     T     A   A   S      K  K           T     H   H  R   R  E      E                             #
#     T     AAAAA   SSSSS  KKK            T     HHHHH  RRRR   EEEE   EEEE                          #
#     T     A   A       S  K  K           T     H   H  R  R   E      E                             #
#     T     A   A   SSSSS  K   K          T     H   H  R   R  EEEEE  EEEEE                         #
#                                                                                                 #
###################################################################################################
###################################################################################################
###################################################################################################

Import the dataset Apartments.xlsx

#Load data
apartment_data <- read_excel("/home/ll-main/Downloads/Task 3/Apartments.xlsx")  # set correct path if different
str(apartment_data)
## tibble [85 × 5] (S3: tbl_df/tbl/data.frame)
##  $ Age     : num [1:85] 7 18 7 28 18 28 14 18 22 25 ...
##  $ Distance: num [1:85] 28 1 28 29 18 12 20 6 7 2 ...
##  $ Price   : num [1:85] 1640 2800 1660 1850 1640 1770 1850 1970 2270 2570 ...
##  $ Parking : num [1:85] 0 1 0 0 1 0 0 1 1 1 ...
##  $ Balcony : num [1:85] 1 0 0 1 1 1 1 1 0 0 ...

Description: The dataset contains 85 apartments with five variables: Age (years), Distance (km from the city centre), Price (EUR per m²), and the binary indicators Parking and Balcony (0 = No, 1 = Yes).

Change categorical variables into factors.

# Convert 0/1 to factors with labels
apartment_data$Parking <- factor(apartment_data$Parking, levels = c(0, 1), labels = c("No", "Yes"))
apartment_data$Balcony <- factor(apartment_data$Balcony, levels = c(0, 1), labels = c("No", "Yes"))

#Check if it worked
table(apartment_data$Parking)
## 
##  No Yes 
##  42  43
table(apartment_data$Balcony)
## 
##  No Yes 
##  48  37
str(apartment_data[, c("Parking","Balcony")])
## tibble [85 × 2] (S3: tbl_df/tbl/data.frame)
##  $ Parking: Factor w/ 2 levels "No","Yes": 1 2 1 1 2 1 1 2 2 2 ...
##  $ Balcony: Factor w/ 2 levels "No","Yes": 2 1 1 2 2 2 2 2 1 1 ...

Test the hypothesis H0: Mu_Price = 1900 eur. What can you conclude?

# One-sample t-test against 1900 for price
t_price <- t.test(apartment_data$Price, mu = 1900)
t_price
## 
##  One Sample t-test
## 
## data:  apartment_data$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
##  1937.443 2100.440
## sample estimates:
## mean of x 
##  2018.941
# I used Cohen's d for effect size
d_price <- (mean(apartment_data$Price) - 1900) / sd(apartment_data$Price)
d_price
## [1] 0.314791
sd(apartment_data$Price)
## [1] 377.8417

The average apartment price per m² in the Ljubljana sample was 2018.94 € (SD = 377.84, n = 85). A one-sample t-test indicated that this was significantly higher than the benchmark value of 1900 €, t(84) = 2.90, p = 0.0047, 95% CI [1937.44, 2100.44]. The effect size was Cohen’s d = 0.31, suggesting a small-to-moderate practical difference from the benchmark.
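
For comparison, the same effect size could also be obtained with the effectsize package already loaded in Task 2 (a sketch; I report the manual calculation above):

# One-sample Cohen's d against the 1900 EUR benchmark, via effectsize
effectsize::cohens_d(apartment_data$Price, mu = 1900)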

Estimate the simple regression function: Price = f(Age). Save results in object fit1 and explain the estimate of regression coefficient, coefficient of correlation and coefficient of determination.

#Price as a function of Age
fit1 <- lm(Price ~ Age, data = apartment_data)
summary(fit1)
## 
## Call:
## lm(formula = Price ~ Age, data = apartment_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -623.9 -278.0  -69.8  243.5  776.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2185.455     87.043  25.108   <2e-16 ***
## Age           -8.975      4.164  -2.156    0.034 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared:  0.05302,    Adjusted R-squared:  0.04161 
## F-statistic: 4.647 on 1 and 83 DF,  p-value: 0.03401
#Correlation coefficient (r) between Price and Age
r_age <- cor(apartment_data$Price, apartment_data$Age)
r_age
## [1] -0.230255
#Coefficient of determination r square
r2_age <- r_age^2
r2_age
## [1] 0.05301737

The regression shows that the average price per m² goes down by about 9 € for every extra year of apartment age. The intercept (around 2185 €) just means that a brand-new apartment would be expected to cost about that much per m².

The correlation between age and price is about –0.23, which is a weak negative relationship: older apartments tend to be a bit cheaper.

The R² value is about 0.05, so age only explains around 5% of the differences in apartment prices — most of the variation comes from other factors.
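
To visualize this weak relationship, here is a sketch of a scatterplot with the fit1 regression line (ggplot2 is already loaded):

# Scatterplot of Price vs Age with the fitted simple regression line
ggplot(apartment_data, aes(x = Age, y = Price)) +
  geom_point(color = "steelblue") +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(title = "Price vs Age with fitted line",
       x = "Age (years)", y = "Price (EUR per m²)")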

Show the scatterplot matrix between Price, Age and Distance. Based on the matrix, determine if there is a potential problem with multicollinearity.

pairs(~ Price + Age + Distance,
      data = apartment_data,
      main = "Scatterplot Matrix")

The matrix shows a visible negative relationship between Price and Distance and a weaker one between Price and Age, but multicollinearity concerns the relationship between the predictors themselves. The Age–Distance panel shows no clear pattern, so I don’t expect a multicollinearity problem between Age and Distance.
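
A quick numerical check of the predictor–predictor relationship supports this impression (sketch):

# Correlation between the two predictors (relevant for multicollinearity)
cor(apartment_data$Age, apartment_data$Distance)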

Estimate the multiple regression function: Price = f(Age, Distance). Save it in object named fit2.

# Multiple regression: Price as a function of Age and Distance
fit2 <- lm(Price ~ Age + Distance, data = apartment_data)
summary(fit2)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = apartment_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -603.23 -219.94  -85.68  211.31  689.58 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2460.101     76.632   32.10  < 2e-16 ***
## Age           -7.934      3.225   -2.46    0.016 *  
## Distance     -20.667      2.748   -7.52 6.18e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared:  0.4396, Adjusted R-squared:  0.4259 
## F-statistic: 32.16 on 2 and 82 DF,  p-value: 4.896e-11

The regression shows that both apartment age and distance from the city center have a negative effect on price per m². On average, each year of age lowers the price by about 8 €, and each kilometer farther from the city center lowers it by about 21 €. Together, age and distance explain about 44% of the differences in apartment prices.

Check the multicollinearity with VIF statistics. Explain the findings.

# Regression of Age on Distance
fit_age_on_dist <- lm(Age ~ Distance, data = apartment_data)
r2_age <- summary(fit_age_on_dist)$r.squared
vif_age <- 1 / (1 - r2_age)

# Regression of Distance on Age
fit_dist_on_age <- lm(Distance ~ Age, data = apartment_data)
r2_dist <- summary(fit_dist_on_age)$r.squared
vif_dist <- 1 / (1 - r2_dist)

vif_age
## [1] 1.001845
vif_dist
## [1] 1.001845

Unfortunately, I ran into an issue with the car package on my computer due to problems with R on Linux. I would normally have used library(car) and then vif(); instead, I computed the VIFs manually this time to avoid those issues.
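
For reference, the equivalent call would look like this (a sketch, assuming the car package installs correctly; not run here):

# install.packages("car")
library(car)   # not run on my machine due to the installation issue
vif(fit2)      # variance inflation factors for Age and Distance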

Interpretation: The VIF values for Age and Distance are both very close to 1. This means there is essentially no multicollinearity between the two predictors. In other words, Age and Distance are not strongly correlated with each other, so both can safely be included in the regression model.

Calculate standardized residuals and Cook’s distances for model fit2. Remove any potentially problematic units (outliers or units with high influence).

#Standardized residuals
std_res <- rstandard(fit2)

#Cook's distances
cook_d <- cooks.distance(fit2)

#Combined into a data frame for presentation
resid_df <- data.frame(
  id = 1:length(std_res),
  std_res = std_res,
  cook_d = cook_d
)

head(resid_df)
##   id    std_res      cook_d
## 1  1 -0.6653487 0.007386569
## 2  2  1.7832876 0.030365432
## 3  3 -0.5937629 0.005882612
## 4  4  0.7543794 0.008299153
## 5  5 -1.0733987 0.005112584
## 6  6 -0.7775190 0.004900891
#Identify potential problems; I use 'outliers', 'influential' and 'problematic' as object names here
outliers <- which(abs(std_res) > 3)       # standardized residuals > 3 in abs value
influential <- which(cook_d > (4 / nrow(apartment_data)))  # rule of thumb

problematic <- union(outliers, influential)
problematic
## [1] 22 33 38 53 55

Some apartments stand out as unusual compared to the rest. Five units (IDs 22, 33, 38, 53, and 55) had either large residuals or high influence on the regression results. To avoid these outliers distorting the analysis, I remove them and continue with the cleaned dataset:

# Remove problematic cases
apartment_data_clean <- apartment_data[-c(22, 33, 38, 53, 55), ]
nrow(apartment_data)       # original sample size
## [1] 85
nrow(apartment_data_clean) # cleaned sample size
## [1] 80

The output confirms 5 units were successfully removed when comparing the original vs. cleaned datasets.

Check for potential heteroskedasticity with a scatterplot between standardized residuals and standardized fitted values. Explain the findings.

# Refit the model with the cleaned data
fit2_clean <- lm(Price ~ Age + Distance, data = apartment_data_clean)

# Standardized residuals
std_res <- rstandard(fit2_clean)

# Standardized fitted values
std_fit <- scale(fitted(fit2_clean))[,1]

# Scatterplot
plot(std_fit, std_res,
     xlab = "Standardized Fitted Values",
     ylab = "Standardized Residuals",
     main = "Residuals vs Fitted (Standardized)",
     pch = 19, col = "steelblue")
abline(h = 0, lty = 2, col = "red")

I don’t believe there is any strong heteroskedasticity in this model: I don’t see an obvious cone shape, and the residuals seem fairly randomly scattered around 0.
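
If a formal test is preferred, the Breusch–Pagan test from the lmtest package could complement the visual check (a sketch, assuming the package is available):

# install.packages("lmtest")
library(lmtest)
bptest(fit2_clean)   # H0: constant error variance (homoskedasticity)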

Are the standardized residuals distributed normally? Show the graph and formally test it. Explain the findings.

# Standardized residuals
std_res <- rstandard(fit2_clean)

# Histogram with normal curve
hist(std_res, breaks = 10, freq = FALSE,
     main = "Histogram of Standardized Residuals",
     xlab = "Standardized Residuals",
     col = "skyblue", border = "white")

# Overlay a standard normal curve to make the histogram easier to read
x <- seq(min(std_res), max(std_res), length = 100)
lines(x, dnorm(x, mean = 0, sd = 1), col = "red", lwd = 2)

# Formal test: Shapiro-Wilk
shapiro.test(std_res)
## 
##  Shapiro-Wilk normality test
## 
## data:  std_res
## W = 0.94156, p-value = 0.001168

The histogram of standardized residuals looks roughly bell-shaped and centered around zero, which is what we’d expect if they were normally distributed. However, the Shapiro-Wilk test was significant (W = 0.94, p = 0.001), so we formally reject the null hypothesis that the residuals are normally distributed, even though the departure from normality looks mild in the graph.
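
A normal Q-Q plot is another way to judge this visually (sketch):

# Q-Q plot of standardized residuals against theoretical normal quantiles
qqnorm(std_res, main = "Normal Q-Q Plot of Standardized Residuals")
qqline(std_res, col = "red", lwd = 2)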

Estimate fit2 again without the excluded units and show the summary of the model. Explain all coefficients.

#Re-fit the model without problematic cases
fit2_clean <- lm(Price ~ Age + Distance, data = apartment_data_clean)
summary(fit2_clean)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = apartment_data_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -411.50 -203.69  -45.24  191.11  492.56 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2502.467     75.024  33.356  < 2e-16 ***
## Age           -8.674      3.221  -2.693  0.00869 ** 
## Distance     -24.063      2.692  -8.939 1.57e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 256.8 on 77 degrees of freedom
## Multiple R-squared:  0.5361, Adjusted R-squared:  0.524 
## F-statistic: 44.49 on 2 and 77 DF,  p-value: 1.437e-13

After removing the outliers, the model fits better. Both apartment age and distance from the city center significantly reduce the price per m². On average, each extra year of age lowers the price by around 9 €, and each additional kilometer from the center lowers the price by about 24 €. Together, these two factors explain about 54% of the variation in prices, showing the model has reasonably strong power for explaining this.

Estimate the linear regression function Price = f(Age, Distance, Parking and Balcony). Be careful to correctly include categorical variables. Save the object named fit3.

# Multiple regression with Age, Distance, Parking, and Balcony
fit3 <- lm(Price ~ Age + Distance + Parking + Balcony, data = apartment_data_clean)
summary(fit3)
## 
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = apartment_data_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -390.93 -198.19  -53.64  186.73  518.34 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2393.316     93.930  25.480  < 2e-16 ***
## Age           -7.970      3.191  -2.498   0.0147 *  
## Distance     -21.961      2.830  -7.762 3.39e-11 ***
## ParkingYes   128.700     60.801   2.117   0.0376 *  
## BalconyYes     6.032     57.307   0.105   0.9165    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 252.7 on 75 degrees of freedom
## Multiple R-squared:  0.5623, Adjusted R-squared:  0.5389 
## F-statistic: 24.08 on 4 and 75 DF,  p-value: 7.764e-13

The extended model shows that both age and distance still lower the price per m², while parking raises it. On average, apartments with parking are about 129 € per m² more expensive, even after controlling for age and distance. Having a balcony does not make a meaningful difference. The model explains about 56% of the variation in prices, which is a reasonably strong fit, a bit better than before.

With the anova() function, check if model fit3 fits the data better than model fit2.

anova(fit2_clean, fit3)
## Analysis of Variance Table
## 
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Parking + Balcony
##   Res.Df     RSS Df Sum of Sq      F Pr(>F)
## 1     77 5077362                           
## 2     75 4791128  2    286234 2.2403 0.1135

The ANOVA test shows that adding Parking and Balcony to the model does not significantly improve the fit compared to using only Age and Distance (p = 0.11). This means that, although Parking was significant on its own in fit3, the two predictors together don’t improve the overall model enough to be considered statistically better than fit2.

Show the results of fit3 and explain regression coefficient for both categorical variables. Can you write down the hypothesis which is being tested with F-statistics, shown at the bottom of the output?

summary(fit3)
## 
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = apartment_data_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -390.93 -198.19  -53.64  186.73  518.34 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2393.316     93.930  25.480  < 2e-16 ***
## Age           -7.970      3.191  -2.498   0.0147 *  
## Distance     -21.961      2.830  -7.762 3.39e-11 ***
## ParkingYes   128.700     60.801   2.117   0.0376 *  
## BalconyYes     6.032     57.307   0.105   0.9165    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 252.7 on 75 degrees of freedom
## Multiple R-squared:  0.5623, Adjusted R-squared:  0.5389 
## F-statistic: 24.08 on 4 and 75 DF,  p-value: 7.764e-13

ParkingYes (128.7): Apartments with parking cost on average about 129 € per m² more than apartments without parking, when holding age and distance constant. This effect is statistically significant (p = 0.038).

BalconyYes (6.03): Apartments with a balcony cost on average about 6 € per m² more than those without, but this difference is not significant (p = 0.92).

Null hypothesis (H0): All regression coefficients (except the intercept) are equal to zero. In other words, Age, Distance, Parking, and Balcony together have no effect on apartment prices.

Alternative hypothesis (H1): At least one predictor has a non-zero coefficient (the model has predictive value).

Save the fitted values and calculate the residual for apartment ID 2.

# Add fitted values and residuals
apartment_data_clean$fitted_fit3   <- fitted(fit3)
apartment_data_clean$residual_fit3 <- resid(fit3)

# Show results for the 2nd observation
apartment_data_clean[2, c("Price","fitted_fit3","residual_fit3")]
## # A tibble: 1 × 3
##   Price fitted_fit3 residual_fit3
##   <dbl>       <dbl>         <dbl>
## 1  2800       2357.          443.

For apartment 2, the actual price per m² was 2800 €, while the model predicted a price of about 2357 €. The residual is +443 €, meaning this apartment is more expensive than what the model expected by around 443 € per m². (That can be considered quite a lot on a student budget… ahahah)
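
As a quick sketch, the residual can also be verified manually as observed minus fitted:

# Manual check: residual = observed price - fitted price for apartment 2
apartment_data_clean$Price[2] - apartment_data_clean$fitted_fit3[2]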