insurance <- read.csv("insurance.csv")
The dataset contains 1,338 observations with 7 variables related to health insurance costs:
| Variable | Type | Description |
|---|---|---|
age |
int | Age of the insured individual (numeric, continuous). |
sex |
categorical | Gender of the insured, with values “male” or “female”. |
bmi |
float | Body Mass Index, a continuous measure of body weight adjusted for height. |
children |
int | Number of dependents (discrete count variable). |
smoker |
categorical | Indicates whether the person is a smoker (“yes”/“no”). |
region |
categorical | Residential region in the U.S. (northeast, northwest, southeast, southwest) |
charges |
float | The insurance premium cost billed to the individual (continuous, response variable). |
Charges is the main dependent variable of interest, while the others act as explanatory variables that may explain variance in insurance costs. Continuous predictors (age, bmi) capture quantitative effects, and categorical predictors (sex, smoker, region) represent group differences.
insurance$sex <- factor(insurance$sex)
insurance$smoker <- factor(insurance$smoker)
insurance$region <- factor(insurance$region)
insurance$age_group <- cut(insurance$age, breaks=c(17,25,35,45,Inf), labels=c("18-25","26-35","36-45","46+"))
insurance$bmi_category <- cut(insurance$bmi, breaks=c(0,18.5,25,30,Inf), labels=c("underweight","normal","overweight","obese"))
Dataset was complete, hence no deletion of units was performed.
smokers <- insurance %>%
filter(smoker == "yes")
non_smokers <- insurance %>%
filter(smoker == "no")
select_if(insurance, is.numeric) %>%
describe() %>%
mutate_if(is.numeric, ~ round(., 2)) %>%
gt() %>%
tab_header(
title = "Descriptive statistics"
) %>%
tab_options(
column_labels.font.weight = "bold"
) %>%
cols_label(
vars = "Variable",
n = "N",
mean = "Mean",
sd = "SD",
median = "Median",
min = "Min",
max = "Max",
range = "Range",
skew = "Skew",
kurtosis = "Kurtosis",
se = "SE"
)
| Descriptive statistics | ||||||||||||
| Variable | N | Mean | SD | Median | trimmed | mad | Min | Max | Range | Skew | Kurtosis | SE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1338 | 39.21 | 14.05 | 39.00 | 39.01 | 17.79 | 18.00 | 64.00 | 46.00 | 0.06 | -1.25 | 0.38 |
| 2 | 1338 | 30.66 | 6.10 | 30.40 | 30.50 | 6.20 | 15.96 | 53.13 | 37.17 | 0.28 | -0.06 | 0.17 |
| 3 | 1338 | 1.09 | 1.21 | 1.00 | 0.94 | 1.48 | 0.00 | 5.00 | 5.00 | 0.94 | 0.19 | 0.03 |
| 4 | 1338 | 13270.42 | 12110.01 | 9382.03 | 11076.02 | 7440.81 | 1121.87 | 63770.43 | 62648.55 | 1.51 | 1.59 | 331.07 |
Here’s some descriptions for the above-mentioned table.
Age: Mean = 39.21 → The average age of insured individuals is about 39 years; Median = 39.00 → The midpoint age is 39, showing a fairly symmetric distribution (mean ≈ median); Range = 46.00 (min 18, max 64) → The youngest person is 18, the oldest 64; Skew = 0.06 → Very close to 0, so the distribution of ages is almost symmetric.
BMI: Mean = 30.66, Median = 30.40 → Average BMI is around 30, which is in the overweight range; SD = 6.10 → There’s moderate spread in BMI values; Range = 37.17 (min 15.96, max 53.13) → Values span from underweight to severely obese.
Children: Mean = 1.09, Median = 1.00 → On average, each insured has about 1 child; Range = 5.00 (min 0, max 5) → Some have no children, the maximum is 5.;
Charges (Insurance Cost): Mean = 13,270.42, Median = 9,382.03 → Average insurance charges are about $13k, but the median is much lower at $9k. This suggests a right-skewed distribution (some very high-cost individuals); SD = 12,110.01 → Extremely high variability in costs; Range = 62,648.55 (min 1,121.87, max 63,770.43) → Very wide spread, indicating large disparities in charges.
Main takeaway: Charges are highly skewed and variable, with extreme outliers (likely linked to smokers or people with health issues).
ggplot(insurance, aes(x=charges)) +
geom_histogram(bins=50, fill="skyblue", color="black") +
scale_x_continuous(labels=scales::comma) +
theme_bw() +
labs(title="Hist", x="Charges", y="N")
We can see that the distrubtion is highy right-skewed.Most
individuals have charges below $20,000, with the highest concentration
between roughly $1,000 and $10,000. A noticeable spread extends up to
about $60,000+, but relatively few people fall in this range. There are
some extreme cases with very high charges, which aligns with the earlier
skewness statistic.
ggplot(insurance, aes(x=age, y=charges)) +
geom_point(alpha=0.4, color="blue") +
scale_y_continuous(labels = scales::comma)+
geom_smooth(method="loess", color="red") +
theme_bw() +
labs(title="cost age", x="Age", y="Charges")
The red line indicates that charges tend to increase with age, particularly after age 20–30. The curve steepens slightly with increasing age, suggesting that older individuals tend to incur higher medical costs. Some individuals have very high charges regardless of their age, which could be due to significant medical conditions or other risk factors.
ggplot(insurance, aes(x=smoker, y=charges)) +
geom_boxplot(fill = "lightblue") +
scale_y_continuous(labels=scales::comma) +
theme_minimal() +
labs(title="Insurance costs in regards to smoking", x="Smoking", y="Charges")
Smokers tend to have a higher median charge than non-smokers. The
box for smokers is shifted upwards compared to the non-smokers. Smokers
show more variability in charges, with a wider IQR and several outliers
above $40,000. Non-smokers have a more concentrated distribution, with
most values falling within a lower charge range, but there are also a
few outliers. Generally, we could say that smoking is associated with
higher medical insurance charges, and the variability in charges for
smokers is greater.
df <- read_excel("Business School.xlsx")
ggplot(df, aes(x = `Undergrad Degree`)) +
geom_histogram(stat = "count", fill = "skyblue") +
theme_bw()
Business is the most common degree.
describe(df$`Annual Salary`) %>%
mutate_if(is.numeric, ~ round(., 2)) %>%
gt() %>%
tab_header(
title = "Descriptive statistics"
) %>%
tab_options(
column_labels.font.weight = "bold"
) %>%
cols_label(
vars = "Variable",
n = "N",
mean = "Mean",
sd = "SD",
median = "Median",
min = "Min",
max = "Max",
range = "Range",
skew = "Skew",
kurtosis = "Kurtosis",
se = "SE"
)
| Descriptive statistics | ||||||||||||
| Variable | N | Mean | SD | Median | trimmed | mad | Min | Max | Range | Skew | Kurtosis | SE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 100 | 109058 | 41501.49 | 103500 | 104600.2 | 25945.5 | 20000 | 340000 | 320000 | 2.22 | 9.41 | 4150.15 |
ggplot(df, aes(x = `Annual Salary`)) +
geom_histogram(bins = 30, fill = "skyblue", color = "black") +
scale_x_continuous(labels = scales::comma) +
theme_bw()
The distribution is right-skewed — most salaries are concentrated
at lower to mid values, with a long tail extending toward very high
salaries. A few individuals earn much higher salaries (e.g., beyond
$200,000), which stand out compared to the majority.
h0 = 74
t.test(x = df$`MBA Grade`, mu = h0,
alternative = "two.sided")
##
## One Sample t-test
##
## data: df$`MBA Grade`
## t = 2.6587, df = 99, p-value = 0.00915
## alternative hypothesis: true mean is not equal to 74
## 95 percent confidence interval:
## 74.51764 77.56346
## sample estimates:
## mean of x
## 76.04055
The average MBA grade is significantly higher than 74, but the effect size is small (t ≈ 0.27), so that suggests the difference is not of in practical importance.
df2 <- read_excel("Apartments.xlsx")
df2 <- df2 %>%
mutate(
Parking = factor(Parking, levels = c(0,1), labels = c("No","Yes")),
Balcony = factor(Balcony, levels = c(0,1), labels = c("No","Yes"))
)
Changes were made to variables Parking and
Balcony.
h0 = 1900
t.test(x = df2$Price, mu = h0,
alternative = "two.sided") #h1
##
## One Sample t-test
##
## data: df2$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
## 1937.443 2100.440
## sample estimates:
## mean of x
## 2018.941
The one-sample t-test shows that the average apartment price (2018.94) is significantly higher than 1900. The confidence interval and p-value confirm this difference is statistically meaningful.
fit1 <- lm(df2$Price~df2$Age)
summary(fit1)
##
## Call:
## lm(formula = df2$Price ~ df2$Age)
##
## Residuals:
## Min 1Q Median 3Q Max
## -623.9 -278.0 -69.8 243.5 776.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2185.455 87.043 25.108 <2e-16 ***
## df2$Age -8.975 4.164 -2.156 0.034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared: 0.05302, Adjusted R-squared: 0.04161
## F-statistic: 4.647 on 1 and 83 DF, p-value: 0.03401
The regression results indicate that as apartment age increases by one year, price drops by about €8.98 on average. However, age alone has a limited influence, explaining only 5.3% of price variation.The regression analysis shows that each additional year of apartment age lowers the price by around €8.98, and this effect is statistically significant. Still, age explains only 5.3% of the variation in prices, so other factors contribute more strongly.
df2 %>%
select(Price, Age, Distance) %>%
scatterplotMatrix(smooth = F)
The scatterplot matrix shows Price, Age, and Distance relationships.
Price decreases as both Age and Distance increase, while Age and
Distance themselves show little direct correlation.
fit2 <- lm(Price~Age + Distance, data = df2)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = df2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.23 -219.94 -85.68 211.31 689.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2460.101 76.632 32.10 < 2e-16 ***
## Age -7.934 3.225 -2.46 0.016 *
## Distance -20.667 2.748 -7.52 6.18e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared: 0.4396, Adjusted R-squared: 0.4259
## F-statistic: 32.16 on 2 and 82 DF, p-value: 4.896e-11
The multiple regression shows that both greater age and greater distance from the city center significantly reduce apartment prices, with distance having the stronger effect. This model explains about 43% of the price variation, a substantial improvement over the simple regression.
vif(fit2)
## Age Distance
## 1.001845 1.001845
The VIF statistics for Age and Distance are both close to 1, indicating there is no multicollinearity between these predictors. This means Age and Distance provide independent information in the regression model.
rstandard_fit2 <- rstandard(fit2)
cooks_fit2 <- cooks.distance(fit2)
diag_df <- data.frame(
id = 1:nrow(df2),
Price = df2$Price,
Age = df2$Age,
Distance = df2$Distance,
std_resid = rstandard_fit2,
cooks_d = cooks_fit2
)
cooks_threshold <- 4 / nrow(df2)
problematic <- diag_df %>%
filter(abs(std_resid) > 2 | cooks_d > cooks_threshold)
df2_clean <- df2[-problematic$id, ]
# standarized residuals and standardized fitted
std_resid <- rstandard(fit2)
std_fitted <- as.numeric(scale(fitted(fit2)))
diag_df <- data.frame(std_fitted = std_fitted,
std_resid = std_resid,
Age = df2$Age,
Distance = df2$Distance,
Price = df2$Price)
ggplot(diag_df, aes(x = std_fitted, y = std_resid)) +
geom_point(alpha = 0.6) +
geom_hline(yintercept = 0, linetype = "dashed") +
geom_smooth(method = "loess", se = FALSE) +
labs(title = "standarized residuals and standardized fitted values",
x = "standardized fitted values",
y = "standarized residuals") +
theme_minimal()
ols_test_breusch_pagan(fit2)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## ---------------------------------
## Response : Price
## Variables: fitted values of Price
##
## Test Summary
## ---------------------------
## DF = 1
## Chi2 = 0.968106
## Prob > Chi2 = 0.325153
The Breusch-Pagan test indicates no evidence of heteroskedasticity, so the variance of the residuals is constant. The residuals vs. fitted plot shows no strong patterns, further supporting that model assumptions hold and the residual variance is stable.
# Histogram
ggplot(data = data.frame(std_resid = std_resid), aes(x = std_resid)) +
geom_histogram(aes(y = ..density..), bins = 30, alpha = 0.5, fill = "skyblue", color = "darkgrey") +
geom_density() +
labs(title = "Histogram of standardized residuals")+
theme_bw()
shapiro.test(std_resid)
##
## Shapiro-Wilk normality test
##
## data: std_resid
## W = 0.95306, p-value = 0.00366
The histogram and Shapiro-Wilk test show the standardized residuals are not normally distributed (p = 0.0037). This suggests that the normality assumption for the regression residuals does not hold in the data.
fit2 <- lm(Price ~ Age + Distance, data = df2_clean)
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = df2_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -411.50 -203.69 -45.24 191.11 492.56
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2502.467 75.024 33.356 < 2e-16 ***
## Age -8.674 3.221 -2.693 0.00869 **
## Distance -24.063 2.692 -8.939 1.57e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 256.8 on 77 degrees of freedom
## Multiple R-squared: 0.5361, Adjusted R-squared: 0.524
## F-statistic: 44.49 on 2 and 77 DF, p-value: 1.437e-13
After cleaning the data, the updated regression model still finds that both age and distance significantly lower prices, with distance having the stronger effect. The fit improves, with age and distance together now explaining about 52% of the variation in apartment prices.
fit3 <- lm(Price ~ Age + Distance + Parking + Balcony, data = df2_clean)
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = df2_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -390.93 -198.19 -53.64 186.73 518.34
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2393.316 93.930 25.480 < 2e-16 ***
## Age -7.970 3.191 -2.498 0.0147 *
## Distance -21.961 2.830 -7.762 3.39e-11 ***
## ParkingYes 128.700 60.801 2.117 0.0376 *
## BalconyYes 6.032 57.307 0.105 0.9165
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 252.7 on 75 degrees of freedom
## Multiple R-squared: 0.5623, Adjusted R-squared: 0.5389
## F-statistic: 24.08 on 4 and 75 DF, p-value: 7.764e-13
Adding parking to the model reveals it has a significant positive effect on apartment price, while balconies do not. The explained variance increases slightly, with age, distance, and parking now accounting for about 56% of price variation.
anova(fit2, fit3)
## Analysis of Variance Table
##
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Parking + Balcony
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 77 5077362
## 2 75 4791128 2 286234 2.2403 0.1135
The ANOVA test shows there is no significant improvement when parking and balcony are added to the model. Thus, the simpler model with just age and distance explains price almost as well as the extended version.The ANOVA test shows that including parking and balcony in the model does not significantly improve the explanation of apartment prices. The simpler model with age and distance is sufficient for predicting price.
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = df2_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -390.93 -198.19 -53.64 186.73 518.34
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2393.316 93.930 25.480 < 2e-16 ***
## Age -7.970 3.191 -2.498 0.0147 *
## Distance -21.961 2.830 -7.762 3.39e-11 ***
## ParkingYes 128.700 60.801 2.117 0.0376 *
## BalconyYes 6.032 57.307 0.105 0.9165
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 252.7 on 75 degrees of freedom
## Multiple R-squared: 0.5623, Adjusted R-squared: 0.5389
## F-statistic: 24.08 on 4 and 75 DF, p-value: 7.764e-13
#h0 - regression coefficients except for the intercept equal to 0
#h1 - at least one coefficient different from 0
#getting rid of h0, p = 7.764e-13
In this final regression model, age and distance significantly decrease apartment price, and having parking raises price, while balcony has no significant effect. The model explains about 54% of price variation, similar to the previous model.
fitted_fit3 <- fitted(fit3)
resid_fit3 <- resid(fit3)
fitted_id2 <- fitted_fit3[2]
resid_id2 <- resid_fit3[2]
data.frame(fitted = fitted_id2, residual = resid_id2)
## fitted residual
## 2 2356.597 443.4026
The fitted value and residual for apartment ID2 show the predicted price and the difference from the actual price, indicating how well the model fits this apartment.
fit4 <- lm(Price ~ Age + Distance + Parking , data = df2_clean)
anova(fit2, fit4)
## Analysis of Variance Table
##
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Parking
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 77 5077362
## 2 76 4791836 1 285527 4.5285 0.03658 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This is just an additional analysis, shows what could be an improvement for the model. The ANOVA test shows that adding parking to the age and distance model provides a significant improvement in explaining apartment price. Parking is a meaningful additional predictor in this case.The ANOVA test demonstrates that including parking alongside age and distance significantly improves the model’s explanation of apartment price (p = 0.0366). Parking thus adds valuable predictive information to the regression.