Insurance

Παρουσίαση του Dataset

Το dataset αφορά το κόστος ιατρικής ασφάλισης και περιλαμβάνει πληροφορίες για 1.338 άτομα στις Ηνωμένες Πολιτείες. Κάθε εγγραφή αντιστοιχεί σε ένα άτομο και περιλαμβάνει δημογραφικά και ιατρικά χαρακτηριστικά, καθώς και το ποσό που χρεώθηκε από την ασφαλιστική εταιρεία.

Οι μεταβλητές που περιλαμβάνονται είναι:

age: Ηλικία του ασφαλισμένου (σε έτη)
sex: Φύλο (male ή female)
bmi: Δείκτης Μάζας Σώματος (Body Mass Index)
children: Αριθμός παιδιών που καλύπτονται από το ασφαλιστικό πλάνο
smoker: Αν είναι καπνιστής (yes ή no)
region: Περιοχή κατοικίας (northeast, northwest, southeast, southwest)
charges: Ποσό σε δολάρια που χρεώθηκε (εξαρτημένη μεταβλητή)

Το dataset είναι ιδανικό για την εφαρμογή μοντέλων γραμμικής παλινδρόμησης, καθώς περιλαμβάνει τόσο αριθμητικές όσο και κατηγορικές μεταβλητές, χωρίς ελλιπή δεδομένα (NA).

# Εισαγωγή των δεδομένων
insurance <- read.csv("insurance.csv")

# Πρώτες γραμμές
head(insurance)

##   age    sex    bmi children smoker    region   charges
## 1  19 female 27.900        0    yes southwest 16884.924
## 2  18   male 33.770        1     no southeast  1725.552
## 3  28   male 33.000        3     no southeast  4449.462
## 4  33   male 22.705        0     no northwest 21984.471
## 5  32   male 28.880        0     no northwest  3866.855
## 6  31 female 25.740        0     no southeast  3756.622

# Δομή dataset
str(insurance)

## 'data.frame':    1338 obs. of  7 variables:
##  $ age     : int  19 18 28 33 32 31 46 37 37 60 ...
##  $ sex     : chr  "female" "male" "male" "male" ...
##  $ bmi     : num  27.9 33.8 33 22.7 28.9 ...
##  $ children: int  0 1 3 0 0 0 1 3 2 0 ...
##  $ smoker  : chr  "yes" "no" "no" "no" ...
##  $ region  : chr  "southwest" "southeast" "southeast" "northwest" ...
##  $ charges : num  16885 1726 4449 21984 3867 ...

# Ελέγχουμε για ελλιπείς τιμές
colSums(is.na(insurance))

##      age      sex      bmi children   smoker   region  charges 
##        0        0        0        0        0        0        0

ggplot(insurance, aes(x = charges)) +
  geom_histogram(fill = "steelblue", color = "white", bins = 30) +
  labs(title = "Κατανομή Χρεώσεων Ασφάλισης", x = "Χρέωση ($)", y = "Αριθμός Παρατηρήσεων") +
  theme_minimal()

Κατανομή Χρεώσεων Ασφάλισης (Histogram)

Η κατανομή των χρεώσεων ασφάλισης είναι ιδιαίτερα ασύμμετρη προς τα δεξιά (right-skewed). Η πλειοψηφία των παρατηρήσεων συγκεντρώνεται μεταξύ 2.000 και 15.000 δολαρίων, ενώ υπάρχουν λίγες εγγραφές με πολύ υψηλές τιμές, που λειτουργούν ως outliers. Αυτό υποδεικνύει ότι η μεταβλητή charges έχει μεγάλες διακυμάνσεις, πιθανώς λόγω διαφορών σε παράγοντες όπως καπνιστική συνήθεια, ηλικία ή BMI.

ggplot(insurance, aes(x = smoker, y = charges, fill = smoker)) +
  geom_boxplot() +
  labs(title = "Χρεώσεις ανά Καπνιστική Συνήθεια", x = "Καπνιστής", y = "Χρέωση ($)") +
  theme_minimal()

Χρεώσεις ανά Καπνιστική Συνήθεια (Boxplot)

Το γράφημα αποκαλύπτει έντονη διαφορά στις χρεώσεις μεταξύ καπνιστών και μη καπνιστών. Οι καπνιστές (yes) έχουν σημαντικά υψηλότερες μέσες και μέγιστες χρεώσεις, με το διάμεσο να ξεπερνά τα $35.000. Αντίθετα, οι μη καπνιστές (no) έχουν σαφώς χαμηλότερες τιμές, με τη διάμεσο να βρίσκεται περίπου στα $7.000–$8.000.

ggplot(insurance, aes(x = bmi, y = charges)) +
  geom_point(alpha = 0.5, color = "darkgreen") +
  labs(title = "BMI vs Χρέωση Ασφάλισης", x = "BMI", y = "Χρέωση ($)") +
  theme_minimal()

BMI vs Χρέωση Ασφάλισης (Scatterplot)

Το γράφημα παρουσιάζει τη σχέση μεταξύ του δείκτη μάζας σώματος (BMI) και των χρεώσεων ασφάλισης. Παρατηρούμε ότι:

Για BMI έως περίπου 30, οι χρεώσεις παραμένουν σχετικά χαμηλές και συμπυκνωμένες.

Από BMI > 30, αρχίζουν να εμφανίζονται υψηλότερες χρεώσεις, με αυξημένη διασπορά.

Ορισμένες από τις πιο υψηλές χρεώσεις παρατηρούνται σε άτομα με υψηλό BMI, αλλά δεν υπάρχει σαφής γραμμική τάση για όλο το δείγμα.

ggplot(insurance, aes(x = region, y = charges, fill = region)) +
  stat_summary(fun = mean, geom = "bar") +
  labs(title = "Μέση Χρέωση ανά Περιοχή", x = "Περιοχή", y = "Μέση Χρέωση ($)") +
  theme_minimal()

Μέση Χρέωση ανά Περιοχή (Bar Chart)

Το γράφημα δείχνει τη μέση χρέωση ασφάλισης για κάθε περιοχή κατοικίας. Παρατηρούμε ότι:

Οι υψηλότερες μέσες χρεώσεις εμφανίζονται στην περιοχή southeast, ακολουθούμενη από την northeast. Οι περιοχές northwest και southwest έχουν τις χαμηλότερες μέσες τιμές. Οι διαφορές υπάρχουν, αλλά δεν είναι τόσο έντονες όσο σε άλλες μεταβλητές (π.χ. smoker).

Συμπέρασμα: Η μεταβλητή region ενδέχεται να έχει ήπια επίδραση στο κόστος.

model_age <- lm(charges ~ age, data = insurance)
summary(model_age)

## 
## Call:
## lm(formula = charges ~ age, data = insurance)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -8059  -6671  -5939   5440  47829 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   3165.9      937.1   3.378 0.000751 ***
## age            257.7       22.5  11.453  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11560 on 1336 degrees of freedom
## Multiple R-squared:  0.08941,    Adjusted R-squared:  0.08872 
## F-statistic: 131.2 on 1 and 1336 DF,  p-value: < 2.2e-16

model_bmi <- lm(charges ~ bmi, data = insurance)
summary(model_bmi)

## 
## Call:
## lm(formula = charges ~ bmi, data = insurance)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -20956  -8118  -3757   4722  49442 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1192.94    1664.80   0.717    0.474    
## bmi           393.87      53.25   7.397 2.46e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11870 on 1336 degrees of freedom
## Multiple R-squared:  0.03934,    Adjusted R-squared:  0.03862 
## F-statistic: 54.71 on 1 and 1336 DF,  p-value: 2.459e-13

model_smoker <- lm(charges ~ smoker, data = insurance)
summary(model_smoker)

## 
## Call:
## lm(formula = charges ~ smoker, data = insurance)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -19221  -5042   -919   3705  31720 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   8434.3      229.0   36.83   <2e-16 ***
## smokeryes    23616.0      506.1   46.66   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7470 on 1336 degrees of freedom
## Multiple R-squared:  0.6198, Adjusted R-squared:  0.6195 
## F-statistic:  2178 on 1 and 1336 DF,  p-value: < 2.2e-16

insurance$sex <- as.factor(insurance$sex)
insurance$smoker <- as.factor(insurance$smoker)
insurance$region <- as.factor(insurance$region)

model_multi <- lm(charges ~ age + bmi + smoker + sex + region, data = insurance)
summary(model_multi)

## 
## Call:
## lm(formula = charges ~ age + bmi + smoker + sex + region, data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11852.6  -3010.9   -987.8   1515.8  29467.1 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -11556.96     985.63 -11.725   <2e-16 ***
## age                258.54      11.94  21.658   <2e-16 ***
## bmi                340.46      28.71  11.857   <2e-16 ***
## smokeryes        23862.91     414.82  57.526   <2e-16 ***
## sexmale           -111.57     334.26  -0.334   0.7386    
## regionnorthwest   -304.10     478.01  -0.636   0.5248    
## regionsoutheast  -1039.20     480.65  -2.162   0.0308 *  
## regionsouthwest   -916.44     479.72  -1.910   0.0563 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6087 on 1330 degrees of freedom
## Multiple R-squared:  0.7487, Adjusted R-squared:  0.7474 
## F-statistic:   566 on 7 and 1330 DF,  p-value: < 2.2e-16

model_age_smoker <- lm(charges ~ age + smoker, data = insurance)
summary(model_age_smoker)

## 
## Call:
## lm(formula = charges ~ age + smoker, data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16088.1  -2046.8  -1336.4   -212.7  28760.0 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2391.63     528.30  -4.527 6.52e-06 ***
## age           274.87      12.46  22.069  < 2e-16 ***
## smokeryes   23855.30     433.49  55.031  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6397 on 1335 degrees of freedom
## Multiple R-squared:  0.7214, Adjusted R-squared:  0.721 
## F-statistic:  1728 on 2 and 1335 DF,  p-value: < 2.2e-16

# Μοντέλο με BMI + Smoker
model_bmi_smoker <- lm(charges ~ bmi + smoker, data = insurance)
summary(model_bmi_smoker)

## 
## Call:
## lm(formula = charges ~ bmi + smoker, data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15992.7  -4600.2   -802.4   3636.2  30677.8 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3459.10     998.28  -3.465 0.000547 ***
## bmi           388.02      31.79  12.207  < 2e-16 ***
## smokeryes   23593.98     480.18  49.136  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7088 on 1335 degrees of freedom
## Multiple R-squared:  0.6579, Adjusted R-squared:  0.6574 
## F-statistic:  1284 on 2 and 1335 DF,  p-value: < 2.2e-16

model_sex <- lm(charges ~ sex, data = insurance)
summary(model_sex)

## 
## Call:
## lm(formula = charges ~ sex, data = insurance)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -12835  -8435  -3980   3476  51201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  12569.6      470.1  26.740   <2e-16 ***
## sexmale       1387.2      661.3   2.098   0.0361 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12090 on 1336 degrees of freedom
## Multiple R-squared:  0.003282,   Adjusted R-squared:  0.002536 
## F-statistic:   4.4 on 1 and 1336 DF,  p-value: 0.03613

model_smoker_sex <- lm(charges ~ smoker + sex, data = insurance)
summary(model_smoker_sex)

## 
## Call:
## lm(formula = charges ~ smoker + sex, data = insurance)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -19193  -5074   -909   3739  31682 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8466.04     303.54   27.89   <2e-16 ***
## smokeryes   23622.13     507.74   46.52   <2e-16 ***
## sexmale       -65.38     409.81   -0.16    0.873    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7473 on 1335 degrees of freedom
## Multiple R-squared:  0.6198, Adjusted R-squared:  0.6192 
## F-statistic:  1088 on 2 and 1335 DF,  p-value: < 2.2e-16

model_age_sex <- lm(charges ~ age + sex, data = insurance)
summary(model_age_sex)

## 
## Call:
## lm(formula = charges ~ age + sex, data = insurance)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -8821  -6947  -5511   5443  48203 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2343.62     994.35   2.357   0.0186 *  
## age           258.87      22.47  11.523   <2e-16 ***
## sexmale      1538.83     631.08   2.438   0.0149 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11540 on 1335 degrees of freedom
## Multiple R-squared:  0.09344,    Adjusted R-squared:  0.09209 
## F-statistic:  68.8 on 2 and 1335 DF,  p-value: < 2.2e-16

model_age_bmi_smoker_sex <- lm(charges ~ age + bmi + smoker + sex, data = insurance)
summary(model_age_bmi_smoker_sex)

## 
## Call:
## lm(formula = charges ~ age + bmi + smoker + sex, data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12364.7  -2972.2   -983.2   1475.8  29018.3 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -11633.49     947.27 -12.281   <2e-16 ***
## age            259.45      11.94  21.727   <2e-16 ***
## bmi            323.05      27.53  11.735   <2e-16 ***
## smokeryes    23833.87     414.19  57.544   <2e-16 ***
## sexmale       -109.04     334.66  -0.326    0.745    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6094 on 1333 degrees of freedom
## Multiple R-squared:  0.7475, Adjusted R-squared:  0.7467 
## F-statistic: 986.5 on 4 and 1333 DF,  p-value: < 2.2e-16

model_age_bmi_smoker <- lm(charges ~ age + bmi + smoker, data = insurance)
summary(model_age_bmi_smoker)

## 
## Call:
## lm(formula = charges ~ age + bmi + smoker, data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12415.4  -2970.9   -980.5   1480.0  28971.8 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -11676.83     937.57  -12.45   <2e-16 ***
## age            259.55      11.93   21.75   <2e-16 ***
## bmi            322.62      27.49   11.74   <2e-16 ***
## smokeryes    23823.68     412.87   57.70   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6092 on 1334 degrees of freedom
## Multiple R-squared:  0.7475, Adjusted R-squared:  0.7469 
## F-statistic:  1316 on 3 and 1334 DF,  p-value: < 2.2e-16

model_stats <- tibble(
  Model = c("Age + Smoker", "BMI + Smoker", "Age + BMI + Smoker", "Full Model"),
  R2 = c(0.721, 0.6574, 0.7475, 0.7487),
  RSE = c(6397, 7088, 6092, 6087)
)

model_stats

## # A tibble: 4 × 3
##   Model                 R2   RSE
##   <chr>              <dbl> <dbl>
## 1 Age + Smoker       0.721  6397
## 2 BMI + Smoker       0.657  7088
## 3 Age + BMI + Smoker 0.748  6092
## 4 Full Model         0.749  6087

Σχολιασμός:

Παρατηρείται ότι όσο προστίθενται περισσότερες μεταβλητές, το R² αυξάνεται, ενώ το SSE (ή Residual Standard Error) μειώνεται. Το μοντέλο με τις μεταβλητές Age + BMI + Smoker έχει σχεδόν το ίδιο R² με το πλήρες μοντέλο, αλλά με παρόμοιο RSE, γεγονός που υποδεικνύει ότι η προσθήκη των υπολοίπων μεταβλητών προσθέτει λίγη ή καθόλου επιπλέον ερμηνευτική ισχύ.

library(tibble)
library(ggplot2)

# Δημιουργία πίνακα με τα R-squared για όλα τα μοντέλα
data_r2_updated <- tibble(
  Model = c(
    "BMI", "Age", "Smoker",
    "BMI + Smoker", "Age + Smoker",
    "Smoker + Sex", "Age + Sex",
    "Age + BMI + Smoker",
    "Age + BMI + Smoker + Sex",
    "Full Model"
  ),
  R2 = c(
    0.039, 0.089, 0.620,
    0.657, 0.721,
    0.6198, 0.09344,
    0.748, 0.749, 0.749
  )
)

# Δημιουργία γραφήματος
ggplot(data_r2_updated, aes(x = reorder(Model, R2), y = R2, fill = Model)) +
  geom_col(width = 0.6, show.legend = FALSE) +
  geom_text(aes(label = round(R2, 3)), vjust = -0.5, size = 3.5) +
  labs(
    title = "Σύγκριση R-squared μεταξύ Γραμμικών Μοντέλων",
    x = "Μοντέλο Παλινδρόμησης",
    y = "R-squared"
  ) +
  ylim(0, 1) +
  theme_minimal(base_size = 13) +
  theme(axis.text.x = element_text(angle = 30, hjust = 1))

Σχολιασμός: Σύγκριση R-squared μεταξύ Μοντέλων

Το γράφημα απεικονίζει την ερμηνευτική ικανότητα διαφόρων γραμμικών μοντέλων ως προς το ποσοστό διακύμανσης στο κόστος ασφάλισης (charges) που εξηγείται από τις ανεξάρτητες μεταβλητές:

 Τα μοντέλα με μόνο μία μεταβλητή (BMI, Age, Sex) έχουν πολύ χαμηλά R² (< 0.1), με την sex να έχει το χαμηλότερο (0.003!).

 Το μοντέλο Smoker ξεχωρίζει με R² = 0.62, δείχνοντας ότι ο καπνιστής είναι ο πιο ισχυρός ατομικός παράγοντας.

Ο συνδυασμός Age + Smoker και BMI + Smoker αυξάνει σημαντικά την ερμηνευτική δύναμη του μοντέλου (R² = 0.721 και 0.657 αντίστοιχα).

Η προσθήκη περισσότερων μεταβλητών οδηγεί σε οριακή βελτίωση του R²: από 0.748 (3 μεταβλητές) σε 0.749 (Full Model).

 Συμπέρασμα: Μετά τις 3 κύριες μεταβλητές (Age, BMI, Smoker), η βελτίωση είναι αμελητέα

# Βάζουμε τα Residual Standard Errors για όλα τα μοντέλα
sse_df_updated <- tibble(
  Model = c(
    "BMI", "Age", "Sex",
    "Smoker", "Smoker + Sex", "Age + Sex",
    "BMI + Smoker", "Age + Smoker",
    "Age + BMI + Smoker", "Age + BMI + Smoker + Sex", "Full Model"
  ),
  RSE = c(
    summary(model_bmi)$sigma,
    summary(model_age)$sigma,
    summary(model_sex)$sigma,
    summary(model_smoker)$sigma,
    summary(model_smoker_sex)$sigma,
    summary(model_age_sex)$sigma,
    summary(model_bmi_smoker)$sigma,
    summary(model_age_smoker)$sigma,
    summary(model_age_bmi_smoker)$sigma,
    summary(lm(charges ~ age + bmi + smoker + sex, data = insurance))$sigma,
    summary(model_multi)$sigma
  )
)

# Γράφημα
ggplot(sse_df_updated, aes(x = reorder(Model, RSE), y = RSE)) +
  geom_col(fill = "steelblue") +
  geom_text(aes(label = round(RSE, 1)), vjust = -0.5, size = 3.5) +
  labs(
    title = "Σύγκριση Residual Standard Error (SSE) ανά Μοντέλο",
    x = "Μοντέλο Παλινδρόμησης",
    y = "Residual Standard Error"
  ) +
  theme_minimal(base_size = 13) +
  theme(axis.text.x = element_text(angle = 30, hjust = 1))

Σχολιασμός: Σύγκριση Residual Standard Error (SSE) μεταξύ Μοντέλων

Το Residual Standard Error (RSE) εκφράζει το μέσο μέγεθος του σφάλματος πρόβλεψης – όσο μικρότερη η τιμή, τόσο πιο ακριβές θεωρείται το μοντέλο.

🔹 Παρατηρήσεις από το γράφημα:

Το "Full Model" (που περιλαμβάνει όλες τις μεταβλητές) έχει τη μικρότερη τιμή RSE (~6086), υποδεικνύοντας την καλύτερη ακρίβεια πρόβλεψης.

 Τα μοντέλα "Age + BMI + Smoker" και "Age + BMI + Smoker + Sex" έχουν σχεδόν ίδια απόδοση με το πλήρες μοντέλο, αποδεικνύοντας ότι η μεταβλητή sex προσφέρει ελάχιστη επιπλέον βελτίωση.

 Το μοντέλο "Smoker" από μόνο του έχει σημαντικά χαμηλότερο RSE (~7470) σε σχέση με Age, BMI και Sex, επιβεβαιώνοντας ότι είναι η πιο προβλεπτική μεταβλητή.

Τα μοντέλα με μόνο μία μεταβλητή (Age, BMI, Sex) έχουν τις χειρότερες επιδόσεις, με RSE > 11.500.

 Το μοντέλο "Age + Smoker" (RSE ~6397) υπερέχει ξεκάθαρα έναντι απλούστερων διμεταβλητών μοντέλων, αποδεικνύοντας ότι ο συνδυασμός ισχυρής + μέτριας μεταβλητής προσφέρει καλύτερη πρόβλεψη.

Insurance_erg05

Anastasios Politis

2025-04-09

Παρουσίαση του Dataset

Οι μεταβλητές που περιλαμβάνονται είναι: