Task 1

insurance <- read.csv("insurance.csv")

1.1 Dataset explaination

The dataset contains 1,338 observations with 7 variables related to health insurance costs:

Variable Type Description
age int Age of the insured individual (numeric, continuous).
sex categorical Gender of the insured, with values “male” or “female”.
bmi float Body Mass Index, a continuous measure of body weight adjusted for height.
children int Number of dependents (discrete count variable).
smoker categorical Indicates whether the person is a smoker (“yes”/“no”).
region categorical Residential region in the U.S. (northeast, northwest, southeast, southwest)
charges float The insurance premium cost billed to the individual (continuous, response variable).

Charges is the main dependent variable of interest, while the others act as explanatory variables that may explain variance in insurance costs. Continuous predictors (age, bmi) capture quantitative effects, and categorical predictors (sex, smoker, region) represent group differences.

1.2 Data manipulation

insurance$sex <- factor(insurance$sex)
insurance$smoker <- factor(insurance$smoker)
insurance$region <- factor(insurance$region)

insurance$age_group <- cut(insurance$age, breaks=c(17,25,35,45,Inf), labels=c("18-25","26-35","36-45","46+"))
insurance$bmi_category <- cut(insurance$bmi, breaks=c(0,18.5,25,30,Inf), labels=c("underweight","normal","overweight","obese"))

Dataset was complete, hence no deletion of units was performed.

smokers <- insurance %>%
  filter(smoker == "yes")
non_smokers <- insurance %>%
  filter(smoker == "no")

1.3 Descriptive statistics

select_if(insurance, is.numeric) %>% 
  describe() %>% 
  mutate_if(is.numeric, ~ round(., 2)) %>%
  gt() %>% 
  tab_header(
    title = "Descriptive statistics"
  ) %>% 
  tab_options(
    column_labels.font.weight = "bold"
  ) %>% 
  cols_label(
    vars = "Variable",
    n = "N",
    mean = "Mean",
    sd = "SD",
    median = "Median",
    min = "Min",
    max = "Max",
    range = "Range",
    skew = "Skew",
    kurtosis = "Kurtosis",
    se = "SE"
  )
Descriptive statistics
Variable N Mean SD Median trimmed mad Min Max Range Skew Kurtosis SE
1 1338 39.21 14.05 39.00 39.01 17.79 18.00 64.00 46.00 0.06 -1.25 0.38
2 1338 30.66 6.10 30.40 30.50 6.20 15.96 53.13 37.17 0.28 -0.06 0.17
3 1338 1.09 1.21 1.00 0.94 1.48 0.00 5.00 5.00 0.94 0.19 0.03
4 1338 13270.42 12110.01 9382.03 11076.02 7440.81 1121.87 63770.43 62648.55 1.51 1.59 331.07

Here’s some descriptions for the above-mentioned table.

Age: Mean = 39.21 → The average age of insured individuals is about 39 years; Median = 39.00 → The midpoint age is 39, showing a fairly symmetric distribution (mean ≈ median); Range = 46.00 (min 18, max 64) → The youngest person is 18, the oldest 64; Skew = 0.06 → Very close to 0, so the distribution of ages is almost symmetric.

BMI: Mean = 30.66, Median = 30.40 → Average BMI is around 30, which is in the overweight range; SD = 6.10 → There’s moderate spread in BMI values; Range = 37.17 (min 15.96, max 53.13) → Values span from underweight to severely obese.

Children: Mean = 1.09, Median = 1.00 → On average, each insured has about 1 child; Range = 5.00 (min 0, max 5) → Some have no children, the maximum is 5.;

Charges (Insurance Cost): Mean = 13,270.42, Median = 9,382.03 → Average insurance charges are about $13k, but the median is much lower at $9k. This suggests a right-skewed distribution (some very high-cost individuals); SD = 12,110.01 → Extremely high variability in costs; Range = 62,648.55 (min 1,121.87, max 63,770.43) → Very wide spread, indicating large disparities in charges.

Main takeaway: Charges are highly skewed and variable, with extreme outliers (likely linked to smokers or people with health issues).

1.4 Variable distribution

Histogram

ggplot(insurance, aes(x=charges)) +
  geom_histogram(bins=50, fill="skyblue", color="black") +
  scale_x_continuous(labels=scales::comma) +
  theme_bw() +
  labs(title="Hist", x="Charges", y="N")


We can see that the distrubtion is highy right-skewed.Most individuals have charges below $20,000, with the highest concentration between roughly $1,000 and $10,000. A noticeable spread extends up to about $60,000+, but relatively few people fall in this range. There are some extreme cases with very high charges, which aligns with the earlier skewness statistic.

Cost - age relation

ggplot(insurance, aes(x=age, y=charges)) +
  geom_point(alpha=0.4, color="blue") +
  scale_y_continuous(labels = scales::comma)+
  geom_smooth(method="loess", color="red") +
  theme_bw() +
  labs(title="cost age", x="Age", y="Charges")

The red line indicates that charges tend to increase with age, particularly after age 20–30. The curve steepens slightly with increasing age, suggesting that older individuals tend to incur higher medical costs. Some individuals have very high charges regardless of their age, which could be due to significant medical conditions or other risk factors.

Cost - smoking relation

ggplot(insurance, aes(x=smoker, y=charges)) +
  geom_boxplot(fill = "lightblue") +
  scale_y_continuous(labels=scales::comma) +
  theme_minimal() +
  labs(title="Insurance costs in regards to smoking", x="Smoking", y="Charges")


Smokers tend to have a higher median charge than non-smokers. The box for smokers is shifted upwards compared to the non-smokers. Smokers show more variability in charges, with a wider IQR and several outliers above $40,000. Non-smokers have a more concentrated distribution, with most values falling within a lower charge range, but there are also a few outliers. Generally, we could say that smoking is associated with higher medical insurance charges, and the variability in charges for smokers is greater.

Task 2

2.1 Undergrad degrees distribution

df <- read_excel("Business School.xlsx")

ggplot(df, aes(x = `Undergrad Degree`)) +
  geom_histogram(stat = "count", fill = "skyblue") +
  theme_bw()


Business is the most common degree.

2.2 Annual Salary

describe(df$`Annual Salary`) %>% 
  mutate_if(is.numeric, ~ round(., 2)) %>%
  gt() %>% 
  tab_header(
    title = "Descriptive statistics"
  ) %>% 
  tab_options(
    column_labels.font.weight = "bold"
  ) %>% 
  cols_label(
    vars = "Variable",
    n = "N",
    mean = "Mean",
    sd = "SD",
    median = "Median",
    min = "Min",
    max = "Max",
    range = "Range",
    skew = "Skew",
    kurtosis = "Kurtosis",
    se = "SE"
  )
Descriptive statistics
Variable N Mean SD Median trimmed mad Min Max Range Skew Kurtosis SE
1 100 109058 41501.49 103500 104600.2 25945.5 20000 340000 320000 2.22 9.41 4150.15
ggplot(df, aes(x = `Annual Salary`)) +
  geom_histogram(bins = 30, fill = "skyblue", color = "black") +
  scale_x_continuous(labels = scales::comma) +
  theme_bw()


The distribution is right-skewed — most salaries are concentrated at lower to mid values, with a long tail extending toward very high salaries. A few individuals earn much higher salaries (e.g., beyond $200,000), which stand out compared to the majority.

2.3 Hypothesis testing

h0 = 74
t.test(x = df$`MBA Grade`, mu = h0, 
       alternative = "two.sided")
## 
##  One Sample t-test
## 
## data:  df$`MBA Grade`
## t = 2.6587, df = 99, p-value = 0.00915
## alternative hypothesis: true mean is not equal to 74
## 95 percent confidence interval:
##  74.51764 77.56346
## sample estimates:
## mean of x 
##  76.04055

The average MBA grade is significantly higher than 74, but the effect size is small (t ≈ 0.27), so that suggests the difference is not of in practical importance.

Task 3

Factor variables

df2 <- read_excel("Apartments.xlsx")
df2 <- df2 %>%
  mutate(
    Parking = factor(Parking, levels = c(0,1), labels = c("No","Yes")),
    Balcony = factor(Balcony, levels = c(0,1), labels = c("No","Yes"))
  )

Changes were made to variables Parking and Balcony.

Hypothesis testing

h0 = 1900
t.test(x = df2$Price, mu = h0, 
       alternative = "two.sided") #h1
## 
##  One Sample t-test
## 
## data:  df2$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
##  1937.443 2100.440
## sample estimates:
## mean of x 
##  2018.941

The one-sample t-test shows that the average apartment price (2018.94) is significantly higher than 1900. The confidence interval and p-value confirm this difference is statistically meaningful.

Simple regression function: Price = f(Age)

fit1 <- lm(df2$Price~df2$Age)
summary(fit1)
## 
## Call:
## lm(formula = df2$Price ~ df2$Age)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -623.9 -278.0  -69.8  243.5  776.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2185.455     87.043  25.108   <2e-16 ***
## df2$Age       -8.975      4.164  -2.156    0.034 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared:  0.05302,    Adjusted R-squared:  0.04161 
## F-statistic: 4.647 on 1 and 83 DF,  p-value: 0.03401

The regression results indicate that as apartment age increases by one year, price drops by about €8.98 on average. However, age alone has a limited influence, explaining only 5.3% of price variation.The regression analysis shows that each additional year of apartment age lowers the price by around €8.98, and this effect is statistically significant. Still, age explains only 5.3% of the variation in prices, so other factors contribute more strongly.

Scaterplot matrix between Price, Age and Distance

df2 %>% 
  select(Price, Age, Distance) %>%
  scatterplotMatrix(smooth = F)

The scatterplot matrix shows Price, Age, and Distance relationships. Price decreases as both Age and Distance increase, while Age and Distance themselves show little direct correlation.

Multiple regression function: Price = f(Age, Distance)

fit2 <- lm(Price~Age + Distance, data = df2)
summary(fit2)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = df2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -603.23 -219.94  -85.68  211.31  689.58 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2460.101     76.632   32.10  < 2e-16 ***
## Age           -7.934      3.225   -2.46    0.016 *  
## Distance     -20.667      2.748   -7.52 6.18e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared:  0.4396, Adjusted R-squared:  0.4259 
## F-statistic: 32.16 on 2 and 82 DF,  p-value: 4.896e-11

The multiple regression shows that both greater age and greater distance from the city center significantly reduce apartment prices, with distance having the stronger effect. This model explains about 43% of the price variation, a substantial improvement over the simple regression.

Multicolinearity with VIF statistics

vif(fit2)
##      Age Distance 
## 1.001845 1.001845

The VIF statistics for Age and Distance are both close to 1, indicating there is no multicollinearity between these predictors. This means Age and Distance provide independent information in the regression model.

Standardized residuals and Cooks Distances

rstandard_fit2 <- rstandard(fit2)
cooks_fit2 <- cooks.distance(fit2)

diag_df <- data.frame(
  id = 1:nrow(df2),
  Price = df2$Price,
  Age = df2$Age,
  Distance = df2$Distance,
  std_resid = rstandard_fit2,
  cooks_d = cooks_fit2
)

cooks_threshold <- 4 / nrow(df2)

problematic <- diag_df %>%
  filter(abs(std_resid) > 2 | cooks_d > cooks_threshold)

df2_clean <- df2[-problematic$id, ]

Heteroskedasticity

# standarized residuals and standardized fitted
std_resid <- rstandard(fit2)
std_fitted <- as.numeric(scale(fitted(fit2)))

diag_df <- data.frame(std_fitted = std_fitted,
                      std_resid = std_resid,
                      Age = df2$Age,
                      Distance = df2$Distance,
                      Price = df2$Price)

ggplot(diag_df, aes(x = std_fitted, y = std_resid)) +
  geom_point(alpha = 0.6) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  geom_smooth(method = "loess", se = FALSE) +
  labs(title = "standarized residuals and standardized fitted values",
       x = "standardized fitted values",
       y = "standarized residuals") +
  theme_minimal()

ols_test_breusch_pagan(fit2)
## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##               Data                
##  ---------------------------------
##  Response : Price 
##  Variables: fitted values of Price 
## 
##        Test Summary         
##  ---------------------------
##  DF            =    1 
##  Chi2          =    0.968106 
##  Prob > Chi2   =    0.325153

The Breusch-Pagan test indicates no evidence of heteroskedasticity, so the variance of the residuals is constant. The residuals vs. fitted plot shows no strong patterns, further supporting that model assumptions hold and the residual variance is stable.

Standardized residuals distribution

# Histogram
ggplot(data = data.frame(std_resid = std_resid), aes(x = std_resid)) +
  geom_histogram(aes(y = ..density..), bins = 30, alpha = 0.5, fill = "skyblue", color = "darkgrey") +
  geom_density() +
  labs(title = "Histogram of standardized residuals")+
  theme_bw()

shapiro.test(std_resid)
## 
##  Shapiro-Wilk normality test
## 
## data:  std_resid
## W = 0.95306, p-value = 0.00366

The histogram and Shapiro-Wilk test show the standardized residuals are not normally distributed (p = 0.0037). This suggests that the normality assumption for the regression residuals does not hold in the data.

Fit2 estimation

fit2 <- lm(Price ~ Age + Distance, data = df2_clean)
summary(fit2)
## 
## Call:
## lm(formula = Price ~ Age + Distance, data = df2_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -411.50 -203.69  -45.24  191.11  492.56 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2502.467     75.024  33.356  < 2e-16 ***
## Age           -8.674      3.221  -2.693  0.00869 ** 
## Distance     -24.063      2.692  -8.939 1.57e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 256.8 on 77 degrees of freedom
## Multiple R-squared:  0.5361, Adjusted R-squared:  0.524 
## F-statistic: 44.49 on 2 and 77 DF,  p-value: 1.437e-13

After cleaning the data, the updated regression model still finds that both age and distance significantly lower prices, with distance having the stronger effect. The fit improves, with age and distance together now explaining about 52% of the variation in apartment prices.

Linear regression function Price = f(Age, Distance, Parking and Balcony)

fit3 <- lm(Price ~ Age + Distance + Parking + Balcony, data = df2_clean)
summary(fit3)
## 
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = df2_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -390.93 -198.19  -53.64  186.73  518.34 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2393.316     93.930  25.480  < 2e-16 ***
## Age           -7.970      3.191  -2.498   0.0147 *  
## Distance     -21.961      2.830  -7.762 3.39e-11 ***
## ParkingYes   128.700     60.801   2.117   0.0376 *  
## BalconyYes     6.032     57.307   0.105   0.9165    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 252.7 on 75 degrees of freedom
## Multiple R-squared:  0.5623, Adjusted R-squared:  0.5389 
## F-statistic: 24.08 on 4 and 75 DF,  p-value: 7.764e-13

Adding parking to the model reveals it has a significant positive effect on apartment price, while balconies do not. The explained variance increases slightly, with age, distance, and parking now accounting for about 56% of price variation.

ANOVA test

anova(fit2, fit3)
## Analysis of Variance Table
## 
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Parking + Balcony
##   Res.Df     RSS Df Sum of Sq      F Pr(>F)
## 1     77 5077362                           
## 2     75 4791128  2    286234 2.2403 0.1135

The ANOVA test shows there is no significant improvement when parking and balcony are added to the model. Thus, the simpler model with just age and distance explains price almost as well as the extended version.The ANOVA test shows that including parking and balcony in the model does not significantly improve the explanation of apartment prices. The simpler model with age and distance is sufficient for predicting price.

Fit3 results

summary(fit3)
## 
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = df2_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -390.93 -198.19  -53.64  186.73  518.34 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2393.316     93.930  25.480  < 2e-16 ***
## Age           -7.970      3.191  -2.498   0.0147 *  
## Distance     -21.961      2.830  -7.762 3.39e-11 ***
## ParkingYes   128.700     60.801   2.117   0.0376 *  
## BalconyYes     6.032     57.307   0.105   0.9165    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 252.7 on 75 degrees of freedom
## Multiple R-squared:  0.5623, Adjusted R-squared:  0.5389 
## F-statistic: 24.08 on 4 and 75 DF,  p-value: 7.764e-13
#h0 - regression coefficients except for the intercept equal to 0
#h1 - at least one coefficient different from 0
#getting rid of h0, p = 7.764e-13

In this final regression model, age and distance significantly decrease apartment price, and having parking raises price, while balcony has no significant effect. The model explains about 54% of price variation, similar to the previous model.

Fitted values and the residual for apartment ID2.

fitted_fit3 <- fitted(fit3)
resid_fit3   <- resid(fit3)

fitted_id2 <- fitted_fit3[2]
resid_id2   <- resid_fit3[2]
data.frame(fitted = fitted_id2, residual = resid_id2)
##     fitted residual
## 2 2356.597 443.4026

The fitted value and residual for apartment ID2 show the predicted price and the difference from the actual price, indicating how well the model fits this apartment.

Additional fit4 model

fit4 <- lm(Price ~ Age + Distance + Parking , data = df2_clean)
anova(fit2, fit4)
## Analysis of Variance Table
## 
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Parking
##   Res.Df     RSS Df Sum of Sq      F  Pr(>F)  
## 1     77 5077362                              
## 2     76 4791836  1    285527 4.5285 0.03658 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This is just an additional analysis, shows what could be an improvement for the model. The ANOVA test shows that adding parking to the age and distance model provides a significant improvement in explaining apartment price. Parking is a meaningful additional predictor in this case.The ANOVA test demonstrates that including parking alongside age and distance significantly improves the model’s explanation of apartment price (p = 0.0366). Parking thus adds valuable predictive information to the regression.