TASK 1 - MEDICAL INSURANCE DATASET

mydata <- read.csv("~/Downloads/R data/IMB Bootcamp/Medical Insurance Dataset.csv",
                   header = TRUE,
                   sep = ",",
                   dec = ".")

head(mydata)

##   age    sex    bmi children smoker    region   charges
## 1  19 female 27.900        0    yes southwest 16884.924
## 2  18   male 33.770        1     no southeast  1725.552
## 3  28   male 33.000        3     no southeast  4449.462
## 4  33   male 22.705        0     no northwest 21984.471
## 5  32   male 28.880        0     no northwest  3866.855
## 6  31 female 25.740        0     no southeast  3756.622

str(mydata)

## 'data.frame':    3038 obs. of  7 variables:
##  $ age     : int  19 18 28 33 32 31 46 37 37 60 ...
##  $ sex     : chr  "female" "male" "male" "male" ...
##  $ bmi     : num  27.9 33.8 33 22.7 28.9 ...
##  $ children: int  0 1 3 0 0 0 1 3 2 0 ...
##  $ smoker  : chr  "yes" "no" "no" "no" ...
##  $ region  : chr  "southwest" "southeast" "southeast" "northwest" ...
##  $ charges : num  16885 1726 4449 21984 3867 ...

EXPLANATION OF THE DATASET & THE SAMPLE STATISTICS.

Variables in the dataset:
AGE Numeric variable.
Ranges from 18 to 64 years old.
Mean is 39.09.
Median also at 39.

SEX Categorical variable: “male” or “female”.

BMI (Body Mass Index). Numeric variable.
Range: 15.96 (very underweight) to 53.13 (obese).
Average BMI ≈ 30.7 (borderline obese).

CHILDREN - also covered in the price of the insurance -> cost of which seen in CHARGES. Numeric variable.
Range: 0 to 5 children.
Average ≈ 1.099.

SMOKER Categorical (yes/no).

REGION Categorical (character) variable: “northwest”, “northeast”, “southeast”, “southwest”.

CHARGES Numeric variable.
Range: 1,122 to 63,770 USD.
Average ≈ 13,336 USD.

This dataset contains 3,038 individuals with variables describing their demographics (age, sex, number of children, region), health status (BMI, smoker or not), and the resulting insurance charges.

CREATION OF A NEW VARIABLE

mydata$cost_per_person <- mydata$charges / (mydata$children + 1)
head(mydata)

##   age    sex    bmi children smoker    region   charges cost_per_person
## 1  19 female 27.900        0    yes southwest 16884.924      16884.9240
## 2  18   male 33.770        1     no southeast  1725.552        862.7762
## 3  28   male 33.000        3     no southeast  4449.462       1112.3655
## 4  33   male 22.705        0     no northwest 21984.471      21984.4706
## 5  32   male 28.880        0     no northwest  3866.855       3866.8552
## 6  31 female 25.740        0     no southeast  3756.622       3756.6216

DELETING SOME UNITS DUE TO MISSING DATA

I will not be deleting any units, because I have a full dataset.

anyNA(mydata)

## [1] FALSE

DELETING COLUMNS

I will be deleting the column of REGION as I have no interest in this data.

mydata<-mydata[ , -6]
head(mydata)

##   age    sex    bmi children smoker   charges cost_per_person
## 1  19 female 27.900        0    yes 16884.924      16884.9240
## 2  18   male 33.770        1     no  1725.552        862.7762
## 3  28   male 33.000        3     no  4449.462       1112.3655
## 4  33   male 22.705        0     no 21984.471      21984.4706
## 5  32   male 28.880        0     no  3866.855       3866.8552
## 6  31 female 25.740        0     no  3756.622       3756.6216

RENAMING OF VARIABLES

I will be renaming “sex” into “gender”.

colnames(mydata)[2] <- "gender"
head(mydata)

##   age gender    bmi children smoker   charges cost_per_person
## 1  19 female 27.900        0    yes 16884.924      16884.9240
## 2  18   male 33.770        1     no  1725.552        862.7762
## 3  28   male 33.000        3     no  4449.462       1112.3655
## 4  33   male 22.705        0     no 21984.471      21984.4706
## 5  32   male 28.880        0     no  3866.855       3866.8552
## 6  31 female 25.740        0     no  3756.622       3756.6216

CREATING A NEW DATASET WHERE I AM ORDERING THE UNITS ACCORDING TO THEIR AGE

mydata_order <- mydata[order(mydata$"age"), ]
head(mydata_order)

##    age gender    bmi children smoker   charges cost_per_person
## 2   18   male 33.770        1     no  1725.552        862.7762
## 23  18   male 34.100        0     no  1137.011       1137.0110
## 32  18 female 26.315        0     no  2198.190       2198.1899
## 47  18 female 38.665        2     no  3393.356       1131.1188
## 51  18 female 35.625        0     no  2211.131       2211.1307
## 58  18   male 31.680        2    yes 34303.167      11434.3891

CONVERSION OF GENDER INTO A FACTOR

mydata$gender <- factor(mydata$gender,
levels = c("male", "female"),
labels = c("M", "F"))

head(mydata)

##   age gender    bmi children smoker   charges cost_per_person
## 1  19      F 27.900        0    yes 16884.924      16884.9240
## 2  18      M 33.770        1     no  1725.552        862.7762
## 3  28      M 33.000        3     no  4449.462       1112.3655
## 4  33      M 22.705        0     no 21984.471      21984.4706
## 5  32      M 28.880        0     no  3866.855       3866.8552
## 6  31      F 25.740        0     no  3756.622       3756.6216

DESCRIBING THE NUMERICAL DATA

#install.packages("psych")
library(psych)
describe(mydata[ ,c("age", "bmi", "children")])

##          vars    n  mean    sd median trimmed   mad   min   max range skew
## age         1 3038 39.09 14.12   39.0   38.86 19.27 18.00 64.00 46.00 0.07
## bmi         2 3038 30.70  6.09   30.4   30.55  6.26 15.96 53.13 37.17 0.27
## children    3 3038  1.10  1.22    1.0    0.94  1.48  0.00  5.00  5.00 0.96
##          kurtosis   se
## age         -1.26 0.26
## bmi         -0.09 0.11
## children     0.26 0.02

1. Age Mean age = 39.09 years, close to median (39.0) → distribution is fairly symmetric. Standard deviation = 14.12, so ages are spread around ±14 years from the mean. Minimum age = 18 (no minors).

BMI (Body Mass Index) Mean BMI = 30.70, median = 30.4 → slightly right-skewed but fairly symmetric. Standard deviation = 6.09, moderate variability. Minimum BMI = 15.96 (underweight case).
Children Mean = 1.10, median = 1 → most people have 1 child. Standard deviation = 1.22 → some variation, but not huge. Minimum = 0 → some individuals have no children.

FIGURING OUT THE STATISTICS OF GENDER DISTRIBUTION AND THE NUMBER OF SMOKERS VS. NON-SMOKERS

#install.packages("dplyr")
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

gender_count <- mydata %>% count(gender)

barplot(gender_count$n,
        names.arg = gender_count$gender,
        col = "darkorange1",
        main = "Gender Distribution",
        xlab = "Gender",
        ylab = "Frequency")

The number of females and males are almost the same.

library(dplyr)
smoker_count <- mydata %>% count(smoker)

barplot(smoker_count$n,
        names.arg = smoker_count$smoker,
        col = "olivedrab",
        main = "NUMBER OF SMOKERS VS. NON-SMOKERS",
        xlab = "Smoking",
        ylab = "Frequency")

There are a lot more non-smokers than smokers.

SHOWING A POSSIBLE CORRELATION BETWEEN COST PER PERSON AND NUMBER OF CHILDREN

It would be logical the cost per person would be lower with every additional child.

summary(mydata[c("cost_per_person", "children")])

##  cost_per_person    children    
##  Min.   :  768   Min.   :0.000  
##  1st Qu.: 2198   1st Qu.:0.000  
##  Median : 4320   Median :1.000  
##  Mean   : 8300   Mean   :1.099  
##  3rd Qu.:11455   3rd Qu.:2.000  
##  Max.   :63770   Max.   :5.000

The statistics show that the average insurance cost per person is $8,300, with a median of $4,320. The mean being much higher than the median indicates that the distribution is skewed by a small number of very expensive cases, reaching a maximum of about $63,770. Most individuals have relatively few dependents: the median is 1 child, the mean is about 1.1 children, and the maximum observed is 5 children.

SHOWING A POSSIBLE CORRELATION BETWEEN COST PER PERSON AND NUMBER OF CHILDREN WITH A SCATTER GRAPH

#install.packages("ggplot")
library(ggplot2)
ggplot(mydata, aes(x = children, y = cost_per_person)) +
  geom_jitter(width = 0.2, alpha = 0.5, color = "darkslategray") +
  geom_smooth(method = "lm", se = FALSE, color = "darkgoldenrod4") +
  labs(x = "Number of Children", y = "Cost per Person",
       title = "CORRELATION BETWEEN COST PER PERSON & NUMBER OF CHILDREN") +
  theme_minimal()

The scatterplot suggests a weak negative association between the number of children and insurance cost per person: on average, individuals with more children tend to have slightly lower costs. However, the relationship is not strong, as costs vary widely across all groups, especially among those with no children.

FIGURING OUT CORRELATION BETWEEN COST PER PERSON AND THE OTHER VARIABLES

model <- lm(cost_per_person ~ age + gender + bmi + children + smoker, data = mydata)
summary(model)

## 
## Call:
## lm(formula = cost_per_person ~ age + gender + bmi + children + 
##     smoker, data = mydata)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -10941  -2828  -1305   1225  38161 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -4381.195    595.384  -7.359 2.38e-13 ***
## age           183.292      7.316  25.054  < 2e-16 ***
## genderF        27.195    206.140   0.132    0.895    
## bmi           179.749     16.965  10.595  < 2e-16 ***
## children    -2878.691     84.454 -34.086  < 2e-16 ***
## smokeryes   15108.062    253.497  59.599  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5650 on 3032 degrees of freedom
## Multiple R-squared:  0.6408, Adjusted R-squared:  0.6402 
## F-statistic:  1082 on 5 and 3032 DF,  p-value: < 2.2e-16

Each additional year of age increases predicted cost by €183 (holding other factors constant). Being female increases cost by €27 compared to male, but not statistically significant (p ≈ 0.9). Each unit increase in BMI raises predicted cost by €180. Each additional child is associated with €2879 lower costs. Being a smoker increases predicted cost by €15,108 compared to non-smokers. Smoking has the largest effect in the model and is highly significant so I will explore its effect on price further.

FIGURING OUT CORRELATION BETWEEN COST PER PERSON AND SMOKING

library(ggplot2)

mydata$smoker_fact <- factor(mydata$smoker, levels = c("no", "yes"))

ggplot(mydata, aes(x = smoker, y = cost_per_person, fill = smoker)) +
  stat_summary(fun = "mean", geom = "bar", alpha = 0.5, width = 0.5, color = "black") +
  stat_summary(fun.data = mean_cl_normal, geom = "errorbar", width = 0.5, color = "darkred") +
  scale_fill_manual(values = c("no" = "chartreuse4", "yes" = "darkslateblue")) +
  scale_x_discrete(labels = c("no" = "No", "yes" = "Yes")) +
  labs(title = "Effect of Smoking on Insurance Cost per Person",
       x = "Smoker", y = "Mean Cost per Person") +
  theme_minimal()

The graph shows a large difference in insurance costs between smokers and non-smokers. On average, smokers incur costs around $20,000, which is about four times higher than non-smokers (around $5,000). Smoking is therefore a very strong predictor of insurance cost in this dataset.

TASK 2 - MBA STUDENTS

#install.packages("readxl")

library(readxl)

Business_School <- read_excel("~/Downloads/R data/IMB Bootcamp/Business School.xlsx")

head(Business_School)

## # A tibble: 6 × 9
##   `Student ID` `Undergrad Degree` `Undergrad Grade` `MBA Grade`
##          <dbl> <chr>                          <dbl>       <dbl>
## 1            1 Business                        68.4        90.2
## 2            2 Computer Science                70.2        68.7
## 3            3 Finance                         76.4        83.3
## 4            4 Business                        82.6        88.7
## 5            5 Finance                         76.9        75.4
## 6            6 Computer Science                83.3        82.1
## # ℹ 5 more variables: `Work Experience` <chr>, `Employability (Before)` <dbl>,
## #   `Employability (After)` <dbl>, Status <chr>, `Annual Salary` <dbl>

library(ggplot2)
ggplot(Business_School, aes(x = `Undergrad Degree`)) +
  geom_bar(colour = "white", fill = "chocolate") + 
  ylab("Frequency") +
  theme_minimal() +
  geom_text(stat = "count", 
            aes(label = ..count..), 
            vjust = -0.3)

## Warning: The dot-dot notation (`..count..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(count)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

The graph shows that the most common degree is the Undergrad degree of Business.

library(psych)
describe(Business_School$`Annual Salary`)

##    vars   n   mean       sd median  trimmed     mad   min    max  range skew
## X1    1 100 109058 41501.49 103500 104600.2 25945.5 20000 340000 320000 2.22
##    kurtosis      se
## X1     9.41 4150.15

library(ggplot2)
library(scales)

## 
## Attaching package: 'scales'

## The following objects are masked from 'package:psych':
## 
##     alpha, rescale

ggplot(Business_School, aes(x = `Annual Salary`)) +
  geom_histogram(binwidth = 5000, boundary = 10, closed = "left",
                 color = "darkred", fill = "antiquewhite2") +
  labs(title = "Annual Salary - Histogram",
       x = "Annual Salary", y = "Count") +
  scale_x_continuous(
    labels = label_comma()) +
  theme_minimal()

In this histogram graph we can see the average salary is around 100 000. We can also notice some outliers who have significantly higher salaries, but I decide not to remove them since they are not there because of a mistake in the dataset - they are just extraordinary.

shapiro.test(Business_School$`MBA Grade`)

## 
##  Shapiro-Wilk normality test
## 
## data:  Business_School$`MBA Grade`
## W = 0.98799, p-value = 0.5078

I did this test to check whether the MBA grades are normally distributed. The null hypothesis is: MBA grades are normally distributed and the alternative hypothesis is: MBA grades are not normally distributed. The p-value is above the typical threshold of 0.05 so I conclude to not reject the null hypothesis - the MBA grades are normally distributed.

mba_grades <- Business_School$`MBA Grade`


t_result <- t.test(mba_grades, 
                   mu = 74,
                   alternative = "two.sided")


print(t_result)

## 
##  One Sample t-test
## 
## data:  mba_grades
## t = 2.6587, df = 99, p-value = 0.00915
## alternative hypothesis: true mean is not equal to 74
## 95 percent confidence interval:
##  74.51764 77.56346
## sample estimates:
## mean of x 
##  76.04055

Here I preformed the t-test, where I specified the numerical value (MBA grade) and used the argument mu = 74, which corresponds to the null hypothesis. The results shows the average grade of the students is 76.0. The p-value is calculated as 0.0092 which is less than 0.05 so I can reject the hypothesis.

TASK 3 - APARTMENTS

Import the dataset Apartments.xlsx

library(readxl)
Apartments <- read_excel("~/Downloads/R Take Home Exam 2025/Task 3/Apartments.xlsx")
View(Apartments)

Description:

Age: Age of an apartment in years
Distance: The distance from city center in km
Price: Price per m2
Parking: 0-No, 1-Yes
Balcony: 0-No, 1-Yes

Change categorical variables into factors.

library(dplyr)
Apartments <- Apartments %>%
  mutate(
    ParkingF = factor(Parking, levels = c(0,1), labels = c("No","Yes")),
    BalconyF = factor(Balcony, levels = c(0,1), labels = c("No","Yes")))

str(Apartments)

## tibble [85 × 7] (S3: tbl_df/tbl/data.frame)
##  $ Age     : num [1:85] 7 18 7 28 18 28 14 18 22 25 ...
##  $ Distance: num [1:85] 28 1 28 29 18 12 20 6 7 2 ...
##  $ Price   : num [1:85] 1640 2800 1660 1850 1640 1770 1850 1970 2270 2570 ...
##  $ Parking : num [1:85] 0 1 0 0 1 0 0 1 1 1 ...
##  $ Balcony : num [1:85] 1 0 0 1 1 1 1 1 0 0 ...
##  $ ParkingF: Factor w/ 2 levels "No","Yes": 1 2 1 1 2 1 1 2 2 2 ...
##  $ BalconyF: Factor w/ 2 levels "No","Yes": 2 1 1 2 2 2 2 2 1 1 ...

Test the hypothesis H0: Mu_Price = 1900 eur. What can you conclude?

tt <- t.test(Apartments$Price, 
             mu = 1900,
             alternative = "two.sided") 
print(tt)

## 
##  One Sample t-test
## 
## data:  Apartments$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
##  1937.443 2100.440
## sample estimates:
## mean of x 
##  2018.941

I am rejecting the null hypothesis at the 5% level. The mean apartment price is significantly different from 1900 - higher by 119 on average.

Estimate the simple regression function: Price = f(Age). Save results in object fit1 and explain the estimate of regression coefficient, coefficient of correlation and coefficient of determination.

fit1 <- lm(Price ~ Age, data = Apartments)
summary(fit1)

## 
## Call:
## lm(formula = Price ~ Age, data = Apartments)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -623.9 -278.0  -69.8  243.5  776.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2185.455     87.043  25.108   <2e-16 ***
## Age           -8.975      4.164  -2.156    0.034 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared:  0.05302,    Adjusted R-squared:  0.04161 
## F-statistic: 4.647 on 1 and 83 DF,  p-value: 0.03401

Each additional year is on average associated with 8,98 lower apartment price - a weak negative linear association. The correlation is not highly significant. About 5.3% of the variation in price is explained by age alone; the other ~94.7% is due to other factors.

Show the scateerplot matrix between Price, Age and Distance. Based on the matrix determine if there is potential problem with multicolinearity.

library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

## The following object is masked from 'package:psych':
## 
##     logit

scatterplotMatrix(Apartments[c("Price", "Age", "Distance")],
                  smooth = FALSE)

The scatterplot matrix shows no multicollinearity.

Estimate the multiple regression function: Price = f(Age, Distance). Save it in object named fit2.

fit2 <- lm(Price ~ Age + Distance, 
           data = Apartments)
summary(fit2)

## 
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -603.23 -219.94  -85.68  211.31  689.58 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2460.101     76.632   32.10  < 2e-16 ***
## Age           -7.934      3.225   -2.46    0.016 *  
## Distance     -20.667      2.748   -7.52 6.18e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared:  0.4396, Adjusted R-squared:  0.4259 
## F-statistic: 32.16 on 2 and 82 DF,  p-value: 4.896e-11

Holding distance fixed, each extra year of age is associated with about 7.9 lower price on average (statistically significant). Holding age fixed, each extra unit of distance lowers expected price by about 20.7 (highly significant). About 44% of the variation in prices is explained by age and distance. I conclude there is strong evidence that at least one - age or sistance helps explain the price (overall model is significant).

Chech the multicolinearity with VIF statistics. Explain the findings.

library(car)
fit2 <- lm(Price ~ Age + Distance, 
           data = Apartments)
vif(fit2)

##      Age Distance 
## 1.001845 1.001845

mean(vif(fit2))

## [1] 1.001845

VIFs ~ 1 and mean VIF ~ 1 tell me there is no evidence of multicollinearity between age and distance. Coefficient estimates should be stable and their standard errors are unaffected by correlation among the regressors.

Calculate standardized residuals and Cooks Distances for model fit2. Remove any potentially problematic units (outliers or units with high influence).

Apartments$StdResid <- round(rstandard(fit1),3)
hist(Apartments$StdResid,
     col = "coral",
     border = "brown",
     xlab = "Standardised residuals",
     ylab = "Frequency",
     main = "Histogram of standardised residuals")

Upon visual inspection, we can see the residuals are slightly asymmetrically distributed to the right - most standardized residuals sit within about ±2, with a few a bit above 2.

Apartments$CooksD <- round(cooks.distance(fit2), 3)

hist(Apartments$CooksD,
     col = "lightpink",
     border = "hotpink",
     xlab = "Cooks Distance",
     ylab = "Frequency",
     main = "Histogram of Cooks Distances")

There is one clear outlier around 0.33, which is highly influential and worth inspecting if there are more.

head(Apartments[order(-Apartments$CooksD),])

## # A tibble: 6 × 9
##     Age Distance Price Parking Balcony ParkingF BalconyF StdResid CooksD
##   <dbl>    <dbl> <dbl>   <dbl>   <dbl> <fct>    <fct>       <dbl>  <dbl>
## 1     5       45  2180       1       1 Yes      Yes         0.108  0.32 
## 2    43       37  1740       0       0 No       No         -0.168  0.104
## 3     2       11  2790       1       0 Yes      No          1.72   0.069
## 4     7        2  1760       0       1 No       Yes        -0.995  0.066
## 5    37        3  2540       1       1 Yes      Yes         1.91   0.061
## 6    40        2  2400       0       1 No       Yes         1.61   0.038

It looks like number 38 is the far right outlier.

Apartments$ID <- seq_len(nrow(Apartments))
Apartments$CooksD <- cooks.distance(fit2)
head(Apartments[order(-Apartments$CooksD),
             c("Age","Distance","Price","StdResid","CooksD")], 10)

## # A tibble: 10 × 5
##      Age Distance Price StdResid CooksD
##    <dbl>    <dbl> <dbl>    <dbl>  <dbl>
##  1     5       45  2180    0.108 0.320 
##  2    43       37  1740   -0.168 0.104 
##  3     2       11  2790    1.72  0.0691
##  4     7        2  1760   -0.995 0.0663
##  5    37        3  2540    1.91  0.0609
##  6    40        2  2400    1.61  0.0375
##  7     8        2  2820    1.94  0.0365
##  8     8       26  2300    0.51  0.0341
##  9    10        1  2810    1.95  0.0320
## 10    18        1  2800    2.11  0.0304

Here we have the possible outliers. I will presume the rule of D > 4/n. For my dataset that means I will remove all the cases with D > 0,04.

library(dplyr)

drop_ids <- c(38, 55, 33, 53, 22)

Apartments <- Apartments %>%
  filter(!ID %in% drop_ids)

View(Apartments)

Previously mentioned cases with D > 0,04 filtered out.

Check for potential heteroskedasticity with scatterplot between standarized residuals and standrdized fitted values. Explain the findings.

fit2 <- lm(Price ~ Age + Distance, data = Apartments)

Apartments$StdResid <- round(rstandard(fit2), 3)
Apartments$StdFitValues <- scale(fit2$fitted.values)

library(car)
scatterplot(y = Apartments$StdResid,
            x = Apartments$StdFitValues,
            ylab = "Standardised residuals",
            xlab = "Standardised fitted values",
            boxplots = FALSE,
            regLine = FALSE,
            smooth = FALSE)

The scatter suggests heteroskedasticity increasing with fitted values. I will have to use robust SEs and check that results are robust after removing the few influential points I identified.

Are standardized residuals ditributed normally? Show the graph and formally test it. Explain the findings.

Apartments$StdResid <- round(rstandard(fit2),3)
hist(Apartments$StdResid,
     col = "coral",
     border = "brown",
     xlab = "Standardised residuals",
     ylab = "Frequency",
     main = "Histogram of standardised residuals")

Removing those IDs reduced the influence of a handful of extreme points, yielding residuals more consistent with model assumptions (normality and constant variance). Parameter estimates and standard errors should be less distorted.

shapiro.test(Apartments$StdResid)

## 
##  Shapiro-Wilk normality test
## 
## data:  Apartments$StdResid
## W = 0.94154, p-value = 0.001166

Even after dropping those IDs, residuals aren’t normal enough for a clean Shapiro pass. I would have to do the procedure of finding and eliminating the ouliers again.

Estimate the fit2 again without potentially excluded units and show the summary of the model. Explain all coefficients.

fit2 <- lm(Price ~ Age + Distance, 
           data = Apartments)
summary(fit2)

## 
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -411.50 -203.69  -45.24  191.11  492.56 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2502.467     75.024  33.356  < 2e-16 ***
## Age           -8.674      3.221  -2.693  0.00869 ** 
## Distance     -24.063      2.692  -8.939 1.57e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 256.8 on 77 degrees of freedom
## Multiple R-squared:  0.5361, Adjusted R-squared:  0.524 
## F-statistic: 44.49 on 2 and 77 DF,  p-value: 1.437e-13

Each additional year is on average associated with 8,67 lower apartment price - a weak negative linear association. The correlation is significant. About 5.3% of the variation in price is explained by age alone; the other ~94.7% is due to other factors.

sqrt(summary(fit2)$r.squared)

## [1] 0.732187

This is R, the multiple correlation between the observed response and the model’s fitted values. My result 0.732 means the model’s predictions correlate about 0.73 with actual prices. Squaring it gives back R² ≈ 0.536, i.e., the model explains ~53.6% of the variance.

Estimate the linear regression function Price = f(Age, Distance, Parking and Balcony). Be careful to correctly include categorical variables. Save the object named fit3.

fit3 <- lm(Price ~ Age + Distance + Parking + Balcony,
 data = Apartments)
print(fit3)

## 
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = Apartments)
## 
## Coefficients:
## (Intercept)          Age     Distance      Parking      Balcony  
##    2393.316       -7.970      -21.961      128.700        6.032

Age (−7.97): each 1-unit increase in Age is associated with ~8 units lower Price, other variables fixed.. Distance (−21.96): each 1-unit increase in Distance lowers Price by ~22 units, other variables fixed. Parking (+128.7): apartments with parking are about 129 units more expensive than those without, other variables fixed. Balcony (+6.03): a balcony adds ~6 units on average, other variables fixed.

With function anova check if model fit3 fits data better than model fit2.

anova(fit2, fit3)

## Analysis of Variance Table
## 
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Parking + Balcony
##   Res.Df     RSS Df Sum of Sq      F Pr(>F)
## 1     77 5077362                           
## 2     75 4791128  2    286234 2.2403 0.1135

Since p = 0.12 > 0.05, I do not have evidence that adding Parking and Balcony improves fit beyond Age and Distance.

Show the results of fit3 and explain regression coefficient for both categorical variables. Can you write down the hypothesis which is being tested with F-statistics, shown at the bottom of the output?

summary(fit3)

## 
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = Apartments)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -390.93 -198.19  -53.64  186.73  518.34 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2393.316     93.930  25.480  < 2e-16 ***
## Age           -7.970      3.191  -2.498   0.0147 *  
## Distance     -21.961      2.830  -7.762 3.39e-11 ***
## Parking      128.700     60.801   2.117   0.0376 *  
## Balcony        6.032     57.307   0.105   0.9165    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 252.7 on 75 degrees of freedom
## Multiple R-squared:  0.5623, Adjusted R-squared:  0.5389 
## F-statistic: 24.08 on 4 and 75 DF,  p-value: 7.764e-13

Apartments with parking are estimated to cost about 129 units more than those without parking, other variables fixed (statistically significant at 5%). Having a balcony is associated with ~6 units higher price, other variables fixed, but this effect is not statistically important.

Save fitted values and claculate the residual for apartment ID2.

Fitted_ID2   <- fitted(fit3)[Apartments$ID == 2]
Residual_ID2 <- resid(fit3)[Apartments$ID == 2]
round(c(Fitted = Fitted_ID2, Residual = Residual_ID2), 3)

##   Fitted.2 Residual.2 
##   2356.597    443.403

A positive residual means the apartment’s actual price is higher than the model predicts. This apartment was sold for ≈ 2356.597 + 443.403 = 2800 (same units as Price), i.e., about 443 above the model’s expectation given its Age, Distance, Parking, and Balcony.

HW1

2025-09-17