In this research, we examine a dataset of personal health care costs to identify trends, relationships, and insights that can help the healthcare and insurance industries make well-informed decisions.
Policymakers, insurers, and healthcare providers all need to understand the variables that affect medical costs. Our goal is to pinpoint the key factors driving the variation in healthcare charges. This information can support risk assessment, strategic decision-making, and potentially more effective, individualized healthcare interventions.
We employ a dataset that records each person’s age, sex, body mass index (BMI), number of children, smoking status, region, and medical charges. This dataset offers a wealth of information for investigating correlations and trends that could impact healthcare expenses.
Explore age-related patterns in medical charges to gain insights into how charges vary across different age groups.
Identify high-risk groups based on demographic and lifestyle factors, providing valuable information for risk assessment and targeted interventions.
Loading Data:
I started by loading all the packages I planned to use in this project. I then read the dataset into R with readr’s ‘read_csv’ function, since the data comes as a CSV file, and stored it in the ‘medinsurance’ variable.
invisible(library(Hmisc))
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
## format.pval, units
invisible(library(plotrix))
invisible(library(ggplot2))
invisible(library(gridExtra))
invisible(library(reshape2))
invisible(library(dplyr))
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:gridExtra':
##
## combine
## The following objects are masked from 'package:Hmisc':
##
## src, summarize
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
invisible(library(cluster))
invisible(library(fmsb))
invisible(library(plotly))
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:Hmisc':
##
## subplot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
invisible(library(corrplot))
## corrplot 0.92 loaded
invisible(library(factoextra))
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
invisible(library(tidyverse))
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ lubridate 1.9.3 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.0
## ✔ readr 2.1.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::combine() masks gridExtra::combine()
## ✖ plotly::filter() masks dplyr::filter(), stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ dplyr::src() masks Hmisc::src()
## ✖ dplyr::summarize() masks Hmisc::summarize()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
invisible(library(readr))
medinsurance <- read_csv("medinsurance.csv")
## Rows: 1338 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): sex, smoker, region
## dbl (4): age, bmi, children, charges
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Understanding the data:
After loading the data and the required packages, below we are using various functions to inspect and summarize the data provided to ensure adequate understanding of what we are working with.
head(medinsurance)
str(medinsurance)
## spc_tbl_ [1,338 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ age : num [1:1338] 19 18 28 33 32 31 46 37 37 60 ...
## $ sex : chr [1:1338] "female" "male" "male" "male" ...
## $ bmi : num [1:1338] 27.9 33.8 33 22.7 28.9 ...
## $ children: num [1:1338] 0 1 3 0 0 0 1 3 2 0 ...
## $ smoker : chr [1:1338] "yes" "no" "no" "no" ...
## $ region : chr [1:1338] "southwest" "southeast" "southeast" "northwest" ...
## $ charges : num [1:1338] 16885 1726 4449 21984 3867 ...
## - attr(*, "spec")=
## .. cols(
## .. age = col_double(),
## .. sex = col_character(),
## .. bmi = col_double(),
## .. children = col_double(),
## .. smoker = col_character(),
## .. region = col_character(),
## .. charges = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
summary(medinsurance)
## age sex bmi children
## Min. :18.00 Length:1338 Min. :15.96 Min. :0.000
## 1st Qu.:27.00 Class :character 1st Qu.:26.30 1st Qu.:0.000
## Median :39.00 Mode :character Median :30.40 Median :1.000
## Mean :39.21 Mean :30.66 Mean :1.095
## 3rd Qu.:51.00 3rd Qu.:34.69 3rd Qu.:2.000
## Max. :64.00 Max. :53.13 Max. :5.000
## smoker region charges
## Length:1338 Length:1338 Min. : 1122
## Class :character Class :character 1st Qu.: 4740
## Mode :character Mode :character Median : 9382
## Mean :13270
## 3rd Qu.:16640
## Max. :63770
As shown above, the dataset records the insurance charges billed to individuals in different regions, together with the factors that may influence those charges.
The integrity of the dataset is crucial for reliable analysis. To assess completeness, I therefore checked for missing values using the code below.
any_missing <- any(is.na(medinsurance))
any_missing
## [1] FALSE
The result stored in the variable ‘any_missing’ came out ‘FALSE’, as shown above, indicating that no missing values are present.
total_missing <-sum(is.na(medinsurance))
total_missing
## [1] 0
With that done, I checked the data types of my columns to see which contain numeric data, so that any missing values could be handled directly.
sapply(medinsurance, class)
## age sex bmi children smoker region
## "numeric" "character" "numeric" "numeric" "character" "character"
## charges
## "numeric"
DISTRIBUTION OF THE MEDICAL CHARGES
ggplot(medinsurance, aes(x = charges)) +
geom_histogram(binwidth = 1000, fill = "blue", color = "black", alpha = 0.7) +
labs(title = "Distribution of Medical Insurance Charges",
x = "Charges",
y = "Frequency")
The right-skewed nature of the distribution is visually evident in the histogram, with the tail extending towards the higher charges. This suggests the presence of outliers, or a group of individuals with exceptionally high medical charges in the 50,000-60,000 range.
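To double-check the skew, the same histogram can be drawn with charges on a log10 scale, under which a right-skewed distribution should look much more symmetric; a brief sketch:
# Sketch: charges on a log10 scale to confirm the right skew
ggplot(medinsurance, aes(x = charges)) +
geom_histogram(bins = 30, fill = "blue", color = "black", alpha = 0.7) +
scale_x_log10() +
labs(title = "Distribution of Charges (log10 scale)", x = "Charges (log10)", y = "Frequency")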
# Creating age categories
medinsurance$age_category <- cut(medinsurance$age, breaks = c(18, 35, 55, 100), labels = c("18-35", "36-55", "56 or older"), include.lowest = TRUE, right = FALSE)
# Counting the number of individuals in each age category
age_counts <- table(medinsurance$age_category)
# Calculating percentages
age_percentages <- prop.table(age_counts) * 100
# Creating a data frame for plotting
age_data <- data.frame(Age_Category = names(age_percentages), Percentage = as.numeric(age_percentages))
# Creating a pie chart with percentages
ggplot(age_data, aes(x = "", y = Percentage, fill = Age_Category)) +
geom_bar(stat = "identity", width = 1) +
geom_text(aes(label = paste0(round(Percentage, 1), "%")), position = position_stack(vjust = 0.5)) +
coord_polar("y") +
theme_void() +
labs(title = "Distribution of Age Categories")
The largest proportion of individuals falls within the “18-35” age range, making up 41.0% of the data set. The “36-55” age group closely follows, representing 40.9% of the population. Individuals aged “56 or older” constitute 18.1% of the dataset.
In the box plot representing BMI by age category, each box corresponds to a different age group. The length of the box indicates the spread of BMI values within each age category, with the bottom and top of the box representing the first and third quartiles, respectively.
ggplot(medinsurance, aes(x = age_category, y = bmi, fill = age_category)) +
geom_boxplot() +
labs(title = "BMI by Age Category",
x = "Age Category",
y = "BMI") +
theme_minimal()
As we can see above, the whiskers show the range of typical BMI values within each age group, and the points beyond the whiskers are flagged as outliers.
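The quartiles visualized in the box plot can also be computed numerically; a quick check with dplyr:
# Numeric counterpart of the box plot: BMI quartiles per age category
medinsurance %>%
group_by(age_category) %>%
summarise(q1 = quantile(bmi, 0.25), median = median(bmi), q3 = quantile(bmi, 0.75))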
# Subsetting numeric columns
numeric_columns <- medinsurance %>%
select(age, bmi, children, charges)
# Calculating Pearson correlation matrix
cor_matrix <- cor(numeric_columns, method = "pearson")
corrplot(cor_matrix, method = "number", type = "lower", addCoef.col = "black")
Looking at the plot, darker squares and higher coefficients indicate stronger positive correlations, while lighter squares and lower coefficients indicate weaker or negative correlations. Age shows a modest positive correlation with charges and a weaker one with BMI: as age increases, charges tend to increase, and BMI rises slightly as well.
# Scatter plot of age vs. charges
ggplot(medinsurance, aes(x = age, y = charges)) +
geom_point(alpha = 0.6, color = "blue") +
labs(title = "Scatter Plot of Age vs. Medical Charges",
x = "Age",
y = "Medical Charges") +
theme_minimal()
Our analysis reveals a positive correlation between age and medical charges. This implies that, on average, as individuals’ age increases, there is a corresponding increase in the associated medical charges. This aligns with the broader understanding in healthcare that older individuals tend to have higher healthcare needs and may be more susceptible to chronic conditions, contributing to elevated medical expenses.
# Calculating the correlation coefficient
correlation <- cor(medinsurance$age, medinsurance$charges)
cat("Correlation between Age and Charges:", correlation)
## Correlation between Age and Charges: 0.2990082
The correlation output supports the visuals in the scatter plot. The positive correlation coefficient suggests that as age increases, there is a tendency for medical charges to increase as well.
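To check that this correlation is statistically distinguishable from zero, a significance test can be added; a minimal sketch using the base cor.test function:
# Significance test for the Pearson correlation between age and charges
cor.test(medinsurance$age, medinsurance$charges, method = "pearson")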
# Grouped bar plot of charges by age category, gender, and smoking status
ggplot(medinsurance, aes(x = factor(age_category), y = charges, fill = smoker, group = sex)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Age, Gender, and Smoking Status vs. Medical Charges",
x = "Age Category",
y = "Medical Charges",
fill = "Smoking Status") +
theme_minimal()
The grouped bar plot reinforces the overall trend that medical charges tend to increase with age, and within each age category smokers account for a disproportionately large share of the charges.
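Because the bars above use stat = "identity", they stack the summed charges of all individuals in each group; comparing group means is often cleaner. A short sketch of that summary:
# Sketch: mean charges per age category and smoking status
medinsurance %>%
group_by(age_category, smoker) %>%
summarise(mean_charges = mean(charges), .groups = "drop")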
Objective: The objective of this section is to prepare the data for clustering analysis by exploring outliers, encoding categorical variables and performing standardization to ensure that all features contribute equally to the clustering process, hence preventing features with larger scales from dominating the distance calculations.
Box Plot:
I started by setting the graphical parameters used to lay the plots out in a grid. Each box plot provides a visual summary of the distribution of the respective variable.
# Box plots for 'age', 'bmi', 'children', and 'charges'
par(mfrow = c(1, 4)) # Set up a 1x4 grid for plots
# Box plot for 'age'
boxplot(medinsurance$age, main = "Age", ylab = "Age")
# Box plot for 'bmi'
boxplot(medinsurance$bmi, main = "BMI", ylab = "BMI")
# Box plot for 'children'
boxplot(medinsurance$children, main = "Children", ylab = "Children")
# Box plot for 'charges'
boxplot(medinsurance$charges, main = "Charges", ylab = "Charges")
When reading a box plot, we look at the spread of the box and the length of the whiskers. Points falling outside the whiskers may be flagged as outliers: observations that deviate significantly from the overall pattern of the data.
As the plots show, each variable has some extreme points that can be identified as outliers because they deviate from the general pattern. Below, I handled the outliers and plotted the distribution of each variable afterwards.
# Function to handle outliers and plot histogram for a variable
handle_and_plot <- function(variable, var_name) {
# Calculating the IQR
var_iqr <- IQR(variable)
# Defining the upper and lower bounds for outliers
lower_bound <- quantile(variable)[2] - 1.5 * var_iqr
upper_bound <- quantile(variable)[4] + 1.5 * var_iqr
# Identify outliers
outliers <- variable < lower_bound | variable > upper_bound
# Printing summary statistics
cat("Summary statistics for", var_name, ":\n")
print(summary(variable))
# Printing the number of outliers
cat("Number of outliers in", var_name, ":", sum(outliers), "\n")
# Handling outliers (winsorization: cap each outlier at the nearest bound)
variable[variable < lower_bound] <- lower_bound
variable[variable > upper_bound] <- upper_bound
# Plotting histogram
hist(variable, main = paste("Histogram for", var_name), xlab = var_name)
}
# Handling and plotting outliers for 'age'
handle_and_plot(medinsurance$age, "age")
## Summary statistics for age :
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18.00 27.00 39.00 39.21 51.00 64.00
## Number of outliers in age : 0
# Handling and plotting outliers for 'bmi'
handle_and_plot(medinsurance$bmi, "bmi")
## Summary statistics for bmi :
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 15.96 26.30 30.40 30.66 34.69 53.13
## Number of outliers in bmi : 9
# Handling and plotting outliers for 'children'
handle_and_plot(medinsurance$children, "children")
## Summary statistics for children :
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 1.000 1.095 2.000 5.000
## Number of outliers in children : 0
# Handling and plotting outliers for 'charges'
handle_and_plot(medinsurance$charges, "charges")
## Summary statistics for charges :
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1122 4740 9382 13270 16640 63770
## Number of outliers in charges : 139
# Encoding binary categorical variables as numeric codes (label encoding)
encoded_data <- medinsurance %>%
mutate(across(c(sex, smoker), ~as.factor(.))) %>%
mutate(across(c(sex, smoker), ~as.numeric(.)))
print(encoded_data[1:10, ])
## # A tibble: 10 × 8
## age sex bmi children smoker region charges age_category
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <fct>
## 1 19 1 27.9 0 2 southwest 16885. 18-35
## 2 18 2 33.8 1 1 southeast 1726. 18-35
## 3 28 2 33 3 1 southeast 4449. 18-35
## 4 33 2 22.7 0 1 northwest 21984. 18-35
## 5 32 2 28.9 0 1 northwest 3867. 18-35
## 6 31 1 25.7 0 1 southeast 3757. 18-35
## 7 46 1 33.4 1 1 southeast 8241. 36-55
## 8 37 1 27.7 3 1 northwest 7282. 36-55
## 9 37 2 29.8 2 1 northeast 6406. 36-55
## 10 60 1 25.8 0 1 northwest 28923. 56 or older
I converted the categorical variables into numeric codes, a form of label encoding. For two-level variables such as ‘sex’ (‘male’/‘female’) and ‘smoker’ (‘yes’/‘no’), this carries the same information as a one-hot encoding while keeping a single column per variable, and it puts the data in a format suitable for distance-based analysis and modeling.
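For comparison, a strict one-hot (dummy) encoding can be produced with base R’s model.matrix; a minimal sketch for the ‘smoker’ variable (the name ‘smoker_onehot’ is just for illustration):
# Sketch: strict one-hot (dummy) encoding for 'smoker' via model.matrix
# Dropping the intercept (- 1) yields one indicator column per level
smoker_onehot <- model.matrix(~ smoker - 1, data = medinsurance)
head(smoker_onehot) # columns 'smokerno' and 'smokeryes', exactly one 1 per row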
# Selecting numerical features
numerical_features <- encoded_data[, c("age", "bmi", "children", "charges")]
# Standardizing numerical features
scaled_numerical_features <- as.data.frame(scale(numerical_features))
# Combining scaled numerical features with the rest of the data (including one-hot encoded columns)
final_scaled_data <- cbind(encoded_data[, !names(encoded_data) %in% c("age", "bmi", "children", "charges")], scaled_numerical_features)
print(final_scaled_data[1:10, ])
## sex smoker region age_category age bmi children
## 1 1 2 southwest 18-35 -1.4382265 -0.4531506 -0.90827406
## 2 2 1 southeast 18-35 -1.5094011 0.5094306 -0.07873775
## 3 2 1 southeast 18-35 -0.7976553 0.3831636 1.58033487
## 4 2 1 northwest 18-35 -0.4417824 -1.3050431 -0.90827406
## 5 2 1 northwest 18-35 -0.5129570 -0.2924471 -0.90827406
## 6 1 1 southeast 18-35 -0.5841316 -0.8073542 -0.90827406
## 7 1 1 southeast 36-55 0.4834871 0.4553162 -0.07873775
## 8 1 1 northwest 36-55 -0.1570841 -0.4793879 1.58033487
## 9 2 1 northeast 36-55 -0.1570841 -0.1366631 0.75079856
## 10 1 1 northwest 56 or older 1.4799312 -0.7909559 -0.90827406
## charges
## 1 0.2984722
## 2 -0.9533327
## 3 -0.7284023
## 4 0.7195739
## 5 -0.7765118
## 6 -0.7856145
## 7 -0.4153450
## 8 -0.4945426
## 9 -0.5668047
## 10 1.2925434
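A quick sanity check that the standardization behaved as intended: each scaled column should have a mean of approximately 0 and a standard deviation of 1.
# Sanity check: scaled columns should have mean ~0 and sd 1
round(colMeans(scaled_numerical_features), 10)
sapply(scaled_numerical_features, sd)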
The assessment below checks whether the dataset is suitable for clustering, using the Hopkins statistic.
# Selecting only the numeric columns for clustering tendency analysis
columns_for_clustering <- final_scaled_data[, c("age", "bmi", "children", "charges")]
hopkins_result <- get_clust_tendency(columns_for_clustering, 2, graph = TRUE, seed = 1234)
# Extracting the Hopkins statistic from the result
hopkins_statistic <- hopkins_result$hopkins_stat
print(hopkins_statistic)
## [1] 0.6715526
print(hopkins_result$plot)
The Hopkins statistic comes out at roughly 0.672, above 0.5, indicating a moderate clustering tendency. As shown in the graphical representation, most sections display a clear gradient, pointing to a meaningful clustering tendency: the data points are not randomly distributed, and there is a discernible structure that clustering algorithms can exploit. I can therefore go ahead and apply clustering algorithms.
To identify the optimal number of clusters, the elbow method was applied to the standardized data set.
# calculating total within-cluster sum of squares (WSS) for different k values
calculate_wss <- function(data, kmax) {
wss_values <- numeric(length = kmax)
for (k in 1:kmax) {
kmeans_model <- kmeans(data, centers = k, nstart = 10)
wss_values[k] <- kmeans_model$tot.withinss
}
return(wss_values)
}
# maximum number of clusters to consider
k_max <- 10
# Extracting the columns of interest
clustering_data <- final_scaled_data[, c("age", "bmi", "children", "charges")]
# Checking for missing and infinite values
clustering_data <- na.omit(clustering_data)
# Calculating WSS for different values of k
wss_values <- calculate_wss(clustering_data, k_max)
# elbow chart
library(ggplot2)
ggplot(data.frame(k = 1:k_max, WSS = wss_values), aes(x = k, y = WSS)) +
geom_line(color = "blue") +
geom_point(color = "red") +
labs(title = "Elbow Method to Determine Optimal Number of Clusters",
x = "Number of Clusters (k)",
y = "Total Within-Cluster Sum of Squares (WSS)") +
theme_minimal()
In the above code, the numeric columns relevant to the clustering analysis were extracted from the standardized dataset. A function was then defined to compute the total within-cluster sum of squares (WSS) for a given number of clusters k using the k-means algorithm; it takes the numeric data and the maximum number of clusters as inputs and evaluates each candidate k for the elbow method.
Upon applying the elbow method to our dataset, the plot (number of clusters on the x-axis, WSS on the y-axis) showed the WSS decreasing sharply as the number of clusters increased, with a distinct elbow indicating diminishing returns beyond that point. The elbow suggests roughly 3-4 clusters; to settle the choice, I went on to the silhouette method.
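For reference, factoextra (already loaded above) can draw the same elbow plot in a single call; a brief alternative sketch:
# Alternative elbow plot via factoextra
fviz_nbclust(clustering_data, kmeans, method = "wss", k.max = 10) +
labs(title = "Elbow Method via fviz_nbclust")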
The Average Silhouette Width Method provides a quantitative measure to assess the appropriateness of different cluster configurations. It is a metric that quantifies how similar an object is to its own cluster compared to other clusters. The average silhouette width is computed for each potential number of clusters, helping to identify the configuration that maximizes cohesion within clusters and separation between clusters.
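Formally, for each observation $i$, let $a(i)$ be its mean distance to the other points in its own cluster and $b(i)$ the smallest mean distance to the points of any other cluster. The silhouette width is then
$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}},$$
which ranges from -1 (likely misassigned) through 0 (on a cluster boundary) to 1 (well matched to its own cluster).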
# Helper to calculate the average silhouette width for a given k
# (fviz_nbclust below performs this computation internally)
sil_width <- function(k) {
kmeans_model <- kmeans(clustering_data, centers = k, nstart = 25)
cluster_sil_width <- silhouette(kmeans_model$cluster, dist(clustering_data))
mean(cluster_sil_width[, "sil_width"])
}
# Creating a vector of candidate cluster numbers
k_values <- 2:10
# Plotting average silhouette width
fviz_nbclust(clustering_data, kmeans, method = "silhouette", k.max = 10) +
labs(title = "Average Silhouette Width Plot")
Upon running this code on the data set, the average silhouette width plot suggests that four clusters provide a configuration that maximizes cohesion within clusters and separation between clusters. This aligns with the outcome obtained using the elbow method, reinforcing the indication that a clustering solution with 4 clusters is suitable for our data set.
optimal_k <- 4
kmeans_model <- kmeans(clustering_data, centers = optimal_k, nstart = 10)
cat("Cluster Centers:\n")
## Cluster Centers:
print(kmeans_model$centers)
## age bmi children charges
## 1 0.94716854 0.04327431 -0.55450122 -0.05594291
## 2 0.05211979 -0.11479074 1.31900696 -0.21146267
## 3 -0.98400020 -0.24694467 -0.56576862 -0.65012653
## 4 0.04738105 0.76079694 0.04192208 2.23268973
# Adding cluster assignments to the original data
cluster_assignments <- kmeans_model$cluster
clustering_data$cluster <- cluster_assignments
# Visualizing the clustering results
km_clus <- fviz_cluster(
kmeans_model,
data = clustering_data[, c("age", "bmi", "children", "charges")], # the exact features used to fit the model (excluding the added cluster column)
ellipse.type = "convex",
geom = c("point"),
main = "K-means Clustering Results"
)
print(km_clus)
As we can see from the above, the data set has been partitioned into four distinct clusters with sizes of 160, 397, 368, and 413 data points, respectively. Each cluster exhibits a unique set of characteristics represented by the mean values of age, BMI, number of children, and insurance charges, and each data point is assigned to one of the four clusters, indicating the group to which it belongs. This information can guide targeted analyses for specific subsets of the population.
Cluster 1 is characterized by older individuals with roughly average BMI and charges and fewer children.
Cluster 2 consists of individuals with the largest numbers of children and slightly below-average charges.
Cluster 3 includes younger individuals with lower BMI and the lowest charges.
Cluster 4 represents individuals with high BMI and by far the highest charges.
The compactness of clusters is assessed through the within-cluster sum of squares. Smaller values suggest more homogeneity within clusters. Cluster 3 has the highest sum of squares, indicating greater variability within this group.
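The cluster sizes and within-cluster sums of squares discussed above can be read directly off the fitted k-means object:
# Cluster sizes and within-cluster sums of squares from the fitted model
kmeans_model$size
kmeans_model$withinss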
The overall goal of the analysis is to identify high-risk groups based on demographic and lifestyle factors. I used clustering analysis to group individuals into clusters based on the variables age, BMI, number of children, and charges. Below, I created scatter plots to visualize the relationships between these variables and charges across the different clusters.
# Scatter plots side by side
plot_age <- ggplot(clustering_data, aes(x = age, y = charges, color = factor(cluster))) +
geom_point(alpha = 0.7) +
labs(title = "Age vs. Charges",
x = "Age",
y = "Charges") +
scale_color_discrete(name = "Cluster") +
theme_minimal()
plot_bmi <- ggplot(clustering_data, aes(x = bmi, y = charges, color = factor(cluster))) +
geom_point(alpha = 0.7) +
labs(title = "BMI vs. Charges",
x = "BMI",
y = "Charges") +
scale_color_discrete(name = "Cluster") +
theme_minimal()
plot_children <- ggplot(clustering_data, aes(x = children, y = charges, color = factor(cluster))) +
geom_point(alpha = 0.7) +
labs(title = "Children vs. Charges",
x = "Children",
y = "Charges") +
scale_color_discrete(name = "Cluster") +
theme_minimal()
# Arrange the plots side by side
grid.arrange(plot_age, plot_bmi, plot_children, ncol = 3)
In the first scatter plot, cluster 4 shows the highest charges among older individuals, suggesting a potential high-risk group of older people. In the second, cluster 4 again shows higher charges for individuals with higher BMI, pointing to a high-risk group associated with elevated BMI. Likewise, in the last plot, cluster 4 shows higher charges for individuals with more children. Across all three plots the same pattern recurs: cluster 4 consistently stands out as the high-risk group.
Here we will be assessing how well each data point in a cluster is separated from other clusters. Silhouette scores measure the quality of clustering, with higher values indicating better-defined clusters.
# calculating silhouette scores and plotting silhouette plot
calculate_silhouette <- function(variable) {
# Subset the chosen variable together with the existing cluster labels
data_variable <- clustering_data[, c(variable, "cluster")]
kmeans_result <- kmeans(data_variable, centers = 4)
silhouette_scores <- silhouette(kmeans_result$cluster, dist(data_variable[, variable]))
# Plotting silhouette scores
fviz_silhouette(silhouette_scores, title = paste("Silhouette Plot for", variable))
}
calculate_silhouette("age")
## cluster size ave.sil.width
## 1 1 105 -0.28
## 2 2 287 0.01
## 3 3 480 0.39
## 4 4 466 0.20
The average silhouette width for cluster 1 is clearly negative (-0.28), indicating that its members are, on average, closer to neighboring clusters than to their own; this suggests overlap or poor separation. Cluster 2 sits essentially at zero (0.01), so its separation is weak. Cluster 3, at 0.39, is the best defined of the four, while cluster 4 (0.20) is only moderately separated.
calculate_silhouette("charges")
## cluster size ave.sil.width
## 1 1 164 0.72
## 2 2 367 0.71
## 3 3 399 -0.30
## 4 4 408 0.10
Clusters 1 and 2 have high positive silhouette widths (0.72 and 0.71), suggesting well-defined clusters. Cluster 3, however, is clearly negative (-0.30), indicating poor separation, and the low width for cluster 4 (0.10) suggests weak separation as well.
calculate_silhouette("bmi")
## cluster size ave.sil.width
## 1 1 426 0.06
## 2 2 297 -0.06
## 3 3 287 0.00
## 4 4 328 -0.01
The BMI clusters show very low to negative silhouette widths, indicating poor separation across the board.
calculate_silhouette("children")
## cluster size ave.sil.width
## 1 1 394 -0.02
## 2 2 504 0.02
## 3 3 360 -0.22
## 4 4 80 0.36
For this variable, only cluster 4 had a clearly positive silhouette width (0.36), indicative of a reasonably well-defined cluster. The rest range from near zero to negative, with cluster 3’s markedly negative width (-0.22) highlighting poor separation and likely overlap.
Overall, the variables charges and age achieved the highest silhouette scores and therefore contribute most to well-defined, homogeneous groups. Charges in particular emerges as the most influential variable for differentiating the high-risk groups.
features <- clustering_data[, c("age", "bmi", "charges", "children")]
# hierarchical clustering using Ward's method
hc <- hclust(dist(features), method = "ward.D2")
par(mar = c(5, 4, 4, 2) + 0.1)
plot(hc, main="Dendrogram - Hierarchical Clustering", xlab="Individuals", sub="", cex=0.8, col.lab="blue", hang=-1)
The hierarchical dendrogram visually represents the relationships between individuals in our dataset. The vertical lines in the dendrogram represent clusters, and the height at which they merge indicates the dissimilarity between those clusters. The choice of four clusters was made based on visual inspection and the practical considerations of forming meaningful and interpretable groups.
features <- clustering_data[, c("age", "bmi", "charges", "children")]
hc <- hclust(dist(features), method = "ward.D2")
num_clusters <- 4
clusters <- cutree(hc, k = num_clusters)
clustering_data$cluster <- as.factor(clusters)
# Visualize the dendrogram with borders around clusters
par(mar = c(5, 4, 4, 2) + 0.1)
plot(hc, main = "Dendrogram - Hierarchical Clustering", xlab = "Individuals", sub = "", cex = 0.8, col.lab = "blue", hang = -1)
# Highlighting clusters with borders
rect.hclust(hc, k = num_clusters, border = 'red')
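To gauge how closely the hierarchical (Ward) solution agrees with the earlier k-means partition, the two label vectors can be cross-tabulated; a short sketch (cluster numbers are arbitrary, so agreement shows up as one dominant count per row):
# Cross-tabulating k-means labels against hierarchical (Ward) labels
table(kmeans = cluster_assignments, hierarchical = clusters)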
Our project’s goal was to conduct a thorough clustering analysis to identify high-risk groups based on lifestyle and demographic characteristics. By applying both k-means and hierarchical clustering, we identified groups with distinct characteristics and gained important insights into the patterns present in our dataset. The clustering analysis revealed clusters of individuals sharing similar traits in terms of age, BMI, charges, and number of children; these demographic and lifestyle factors played a crucial role in defining the identified groups, and we found clusters exhibiting characteristics indicative of higher risk, whether age-related patterns, charging behaviors, or other lifestyle factors. Silhouette scores were used to quantitatively assess the quality of our clustering solutions, and variables such as ‘charges’ and ‘age’ consistently contributed to well-defined clusters, reinforcing their importance in risk assessment. Hierarchical clustering, using Ward’s method in particular, offered further insight into the relationships between individuals, and the red-bordered clusters in the dendrogram enhanced the visual representation of the distinct groups.