---
title: "Insurance-Analysis-using-R"
author: "Adebola Alaba"
date: "`r Sys.Date()`"
output:
flexdashboard::flex_dashboard:
orientation: rows
vertical_layout: fill
social: menu
source_code: embed
---
```{r setup, Import Libraries, message=FALSE, warning=FALSE, paged.print=FALSE, include=FALSE}
# **Library Import**
library(flexdashboard)
library(ggplot2) #EDA plot
library(ggpubr) #ggplot customization
library(corrplot) #Correlation Matrix Plot
library(dplyr) #Data manipulation
library(tidyverse) #Data Interaction
library(rmarkdown) #Knitting report to Word/pdf
library(e1071) #Skewness Check
library(forcats)
```
<!-- ============================ START OF SIDEBAR =================================== -->
Sidebar {.sidebar}
=======================================================================
### **Description**
This summary captures the application of statistical techniques on a dataset for accurate insights and inferences. The statistical technique used are Normality tests, non-parametric test, correlation. The Github repository for this project is available [here](https://github.com/adebola-alaba/Health-Insurance-Analysis-using-R).
Author: [Adebola Alaba](https://mavenanalytics.io/profile/siraug)
Data sources: [Yash Gupta](https://www.kaggle.com/datasets/yashgupta011/insurance), Kaggle
<!-- ============================ END OF SIDE BAR ================================= -->
<!-- ============================ START OF NEXT PAGE ============================== -->
# **Exploratory Data Analysis**
<!-- ============================ START OF ROW ============================== -->
```{r Import Dataset, warning=FALSE}
Insurance <- read.csv("insurance.csv")
#head(Insurance)
#sprintf("Dataset size: [%s]", toString(dim(Insurance)))
```
```{r Check for Null Values, warning=FALSE}
# Check for null values in the Insurance object
null <- is.null(Insurance)
#sprintf("Any null values? - %s", null)
```
<!-- ============================ END OF ROW ================================= -->
<!-- ============================ START OF ROW ============================== -->
Row {data-height=1000}
-----------------------------------------------------------------------
### **Regional Distribution**
```{r Regional Distribution}
ggplot(Insurance, aes(x = fct_infreq(region))) +
geom_bar(fill = "green",
color = "black") +
labs(x = "Region",
y = "Frequency",
title = "Count by Region")
```
### **Region by Charges**
```{r Region by Charges}
# Summarize total charges by region and sex
charges_summary <- Insurance %>%
group_by(region, sex) %>%
summarise(total_charges = sum(charges), .groups = "drop")
# Reorder region by total charges (sum over both sexes)
charges_summary <- charges_summary %>%
group_by(region) %>%
mutate(region_total = sum(total_charges)) %>%
ungroup() %>%
mutate(region = fct_reorder(region, region_total, .desc = TRUE))
# Plot
ggplot(charges_summary, aes(x = region, y = total_charges, fill = sex)) +
geom_bar(stat = "identity", position = "stack") +
#coord_flip() +
labs(title = "Total Charges by Region and Sex",
x = "Region", y = "Total Charges")
#regionbycharges <- ggplot(Insurance, aes(x = region, y = charges, fill = sex)) +
# geom_bar(data = subset(Insurance, sex == "female"), stat = "identity") +
#geom_bar(data = subset(Insurance, sex == "male"), stat = "identity") +
#coord_flip()
#regionbycharges
```
<!-- ============================ END OF ROW ================================= -->
<!-- ============================ START OF ROW ============================== -->
Row {data-height=1000}
-----------------------------------------------------------------------
### **Age Distribution**
```{r Age Distribution}
ggplot(Insurance, aes(x = age)) +
geom_density(aes(y = ..count..), fill = "yellow") +
geom_vline(aes(xintercept = mean(age)),
linetype = "dashed", linewidth = 0.6,
color = "#FC4E07")
```
<!-- ============================ END OF PAGE ================================= -->
<!-- ============================ START OF NEXT PAGE ============================== -->
# **Statistical Analysis**
<!-- ============================ START OF ROW ============================== -->
Row {data-height=500}
-----------------------------------------------------------------------
### **BMI Skewness**
```{r Skewness, warning=FALSE}
sprintf("Skewness: [%s]", toString(skewness(Insurance$bmi)))
```
### **Shapiro Wilk's Test**
```{r "Shapiro Wilk's Test for BMI Normality", warning=FALSE}
#Research Question: is the BMI variable normally distributed?
#H0: The BMI variable is normally distributed\
#HA: The BMI variable is not normally distributed\
#Confidence Interval: If the p-value is greater than 0.05, the null hypothesis will fail to be rejected and if it is lower than 0.05, the null hypothesis will be rejected.\
shapiro.test(Insurance$bmi)
```
### **Charges Skewness**
```{r Skewness_Charges, warning=FALSE}
sprintf("Skewness: [%s]", toString(skewness(Insurance$charges)))
```
<!-- ============================ END OF ROW ================================= -->
<!-- ============================ START OF ROW ============================== -->
Row {data-height=1000}
-----------------------------------------------------------------------
### **Skewness Check for BMI**
```{r Skewness Check for BMI, warning=FALSE}
ggplot(Insurance, aes(x=bmi)) +
geom_density(alpha=.3, fill="blue", color="blue", size=1)+
geom_vline(aes(xintercept=mean(bmi)), size=1, color ="black")+
ggtitle("Distribution density of BMI") +
theme(text = element_text(size = 15))
```
### **QQPlot for BMI Normality**
```{r QQPlot Check for BMI Normality, warning=FALSE}
qqnorm(Insurance$bmi, main = "Normal QQPlot of BMI",)
qqline(Insurance$bmi)
```
### **Histogram Plot for BMI Normality**
```{r Histogram for BMI Normality, warning=FALSE}
hist(Insurance$bmi, main = "Histogram of BMI", prob = TRUE, ylim = c(0, 0.07))
lines (density(Insurance$bmi))
```
### **BMI Boxplot Check**
```{r BMI Boxplot Check}
boxplot(Insurance$bmi,
ylab = "bmi",
main = "Boxplot of BMI",
col= "blue",
outcol="blue")
```
<!-- ============================ END OF ROW ================================= -->
<!-- ============================ START OF ROW ============================== -->
Row {data-height=1000}
-----------------------------------------------------------------------
### **Skewness Check for Charges**
```{r Skewness Check for Charges, warning=FALSE}
ggplot(Insurance, aes(x=charges)) +
geom_density(alpha=.3, fill="red", color="red", size=1)+
geom_vline(aes(xintercept=mean(charges)), size=1, color ="black")+
ggtitle("Distribution density of Charges") +
theme(text = element_text(size = 15))
```
### **QQPlot Check for charges Normality**
```{r QQPlot Check for charges Normality, warning=FALSE}
qqnorm(Insurance$charges, main = "Normal QQPlot of Charges",)
qqline(Insurance$charges)
```
### **Histogram Plot for charges Normality**
```{r Histogram Check for charges Normality, warning=FALSE}
hist(Insurance$charges, main = "Histogram of Charges", prob = TRUE)
lines (density(Insurance$charges))
```
### **Charges Boxplot Check**
```{r Charges Boxplot Check, warning=FALSE}
boxplot(Insurance$charges,
ylab = "charges",
main = "Boxplot of Insurance Charges",
col= "red",
outcol="red")
```
<!-- ============================ END OF PAGE ================================= -->
<!-- ============================ START OF NEXT PAGE ============================== -->
# **Non-Parametric Tests**
<!-- ============================ START OF ROW ============================== -->
Row {data-height=500}
-----------------------------------------------------------------------
### **Two Sample Independent T-test: Hypothesis One**
```{r}
library(grid)
grid.newpage()
grid.text("Hypothesis:
H0: The median BMI of smokers > non smokers
HA: The median BMI of smokers < non smokers
Inference:
The p-value of [0.5321] > significant level of 0.05,
there is enough evidence for the null hypothesis
to fail to be rejected.
The non-smokers mean bmi isn't within the ideal range.",
x = 0.5, y = 0.5,
gp = gpar(col = "black", fontsize = 16, fill = "lightblue"))
#grid.rect(x = 0.5, y = 0.5, width = 0.6, height = 0.2, gp = gpar(fill = "lightblue"))
#grid.text("This is a textbox", x = 0.5, y = 0.5)
```
### **SOLUTION TO HYPOTHESIS ONE**
```{r Hypothesis One, warning=FALSE}
smoker_bmi <- Insurance$bmi[Insurance$smoker == 'yes'] # extract bmi where smoker equals yes
non_smoker_bmi <- Insurance$bmi[Insurance$smoker == 'no'] # extract bmi where smoker equals no
wilcox.test(smoker_bmi,non_smoker_bmi, alternative = "less", conf.int = TRUE)
```
```{r Hypothesis OneA, warning=FALSE}
#To Confirm what the exact values are:
#cat("Mean & Median BMI by Smoking Status:\n\n")
group_by(Insurance,smoker) %>%
summarise(
median = median(bmi, na.rm = TRUE),
mean = mean(bmi, na.rm = TRUE))
```
<!-- ============================ END OF ROW ================================= -->
<!-- ============================ START OF ROW ============================== -->
Row {data-height=500}
-----------------------------------------------------------------------
### **Two Sample Independent T-test: Hypothesis Two**
```{r}
library(grid)
grid.newpage()
grid.text ("Hypothesis Formulation:
HO: The insurance claims of smokers and non-smokers are similar
HA: The insurance claims of smokers and non-smokers are not similar
Inference:
The p-value is below 0.05.
Therefore, there is no sufficient evidence to fail to reject
the claim that the charges of those who smoke,
and non-smokers are the same.
The null hypothesis is hereby rejected as result
shows that the claims are indeed different.",
x = 0.5, y = 0.5,
gp = gpar(col = "black", fontsize = 16, fill = "lightblue"))
#grid.rect(x = 0.5, y = 0.5, width = 0.6, height = 0.2, gp = gpar(fill = "lightblue"))
#grid.text("This is a textbox", x = 0.5, y = 0.5)
```
### **SOLUTION TO HYPOTHESIS TWO**
```{r Hypothesis Two, warning=FALSE}
charges_smoking <- Insurance$charges[Insurance$smoker == 'yes'] # extract charges where smoker equals yes
charges_no_smoking <- Insurance$charges[Insurance$smoker == 'no'] # extract charges where smoker equals no
# Calculate and print the sums
sum_smokers <- sum(charges_smoking)
sum_nonsmokers <- sum(charges_no_smoking)
#cat("Total charges for smokers: ", sum_smokers, "\n")
#cat("Total charges for non-smokers: ", sum_nonsmokers, "\n")
# Perform Wilcoxon test
wilcox.test(charges_smoking, charges_no_smoking, conf.int = TRUE)
```
```{r Hypothesis TwoA, warning=FALSE}
#To Confirm what the exact values are:
#cat("Total Charges by Smoking Status:\n\n")
Insurance %>%
group_by(smoker) %>%
summarise(total_charges = sum(charges), .groups = "drop")
```
<!-- ============================ END OF ROW ================================= -->
<!-- ============================ START OF ROW ============================== -->
Row {data-height=500}
-----------------------------------------------------------------------
### **Barplot of Region and Smoker Status,**
```{r barplot of Region and Smoker status, warning=FALSE}
#barplot of Region and Smoker status
ggplot(Insurance) +
aes(x = region, fill = smoker) +
geom_bar()
```
### **Pearson Chi-Squared Test**
```{r}
library(grid)
grid.newpage()
grid.text ("Hypothesis Formulation:
H0: Region and smoking status are independent\
HA: Region and smoking status are dependent\
Inference:
The chi-squared test shows that region and smoker
are independent as the p-value is greater than the
significance level of 0.05 thus
the null hypothesis fails to be rejected.",
x = 0.5, y = 0.5,
gp = gpar(col = "black", fontsize = 16, fill = "lightblue"))
#grid.rect(x = 0.5, y = 0.5, width = 0.6, height = 0.2, gp = gpar(fill = "lightblue"))
#grid.text("This is a textbox", x = 0.5, y = 0.5)
```
```{r Chi Squared Test: Table and Plot, warning=FALSE}
#Create a Contingency table of the variables Region and Smoker Status
region_smoker <- table(Insurance$region, Insurance$smoker)
#region_smoker
```
### **Chi Squared Test Result**
```{r Chi Squared Test, warning=FALSE}
chisq_reg_smk <- chisq.test(region_smoker)
chisq_reg_smk
```
### **Pearson Chi-Squared Test - Residual Plot**
```{r Residual Plot, warning=FALSE}
corrplot(chisq_reg_smk$residuals, is.cor = FALSE)
```
<!-- ============================ END OF PAGE ================================= -->
<!-- ============================ START OF NEXT PAGE ============================== -->
# **Correlation Test**
<!-- ============================ START OF ROW ============================== -->
Row {.column data-width=1000}
-----------------------------------------------------------------------
### **Correlation Hypothesis**
```{r}
library(grid)
grid.newpage()
grid.text ("Research Question:
Are any of the variables correlated?
Inference:
The results show that some variables are positively correlated.
Smoker and Charges variables have a correlation coefficient of 0.66
Age and Charges also have a correlation coefficient of 0.53.",
x = 0.5, y = 0.5,
gp = gpar(col = "black", fontsize = 14, fill = "lightblue"))
# Result Interpretation:
#- Strong negative correlation (-1) means that whenever x rises, y falls.\
#- 0 indicates no correlation exists between the two variables (x and y).\
#- A strong positive correlation of 1 means that y rises as x does.\
# Age, BMI, Children and Charges are all numerical while sex, smoker and region are categorical.\
# For this test, all categorical will be changed to numerical.
# **For sex:**\
# - male becomes 1\
# - female becomes 2\
# **For smoker:**\
# - no becomes 0\
# - yes becomes 1\
# **For region:**\
# - southeast becomes 1\
# - southwest becomes 2\
# - northeast becomes 3\
# - northwest becomes 4\
#grid.rect(x = 0.5, y = 0.5, width = 0.6, height = 0.2, gp = gpar(fill = "lightblue"))
#grid.text("This is a textbox", x = 0.5, y = 0.5)
```
```{r Correlation, warning=FALSE}
## Make a copy of the data
insurance_copy = Insurance
#head(insurance_copy)
## Replacing values for sex
insurance_copy['sex'][insurance_copy['sex'] == 'male'] <- 1
insurance_copy['sex'][insurance_copy['sex'] == 'female'] <- 2
## Replacing values for smoker
insurance_copy['smoker'][insurance_copy['smoker'] == 'no'] <- 0
insurance_copy['smoker'][insurance_copy['smoker'] == 'yes'] <- 1
## Replacing values for region
insurance_copy['region'][insurance_copy['region'] == 'southeast'] <- 1
insurance_copy['region'][insurance_copy['region'] == 'southwest'] <- 2
insurance_copy['region'][insurance_copy['region'] == 'northeast'] <- 3
insurance_copy['region'][insurance_copy['region'] == 'northwest'] <- 4
## Print Changes
#head(insurance_copy)
# check the datatype of the variables
#class(insurance_copy$age)
#class(insurance_copy$sex)
#class(insurance_copy$bmi)
#class(insurance_copy$children)
#class(insurance_copy$smoker)
#class(insurance_copy$region)
#class(insurance_copy$charges)
```
```{r Correlation Data, warning=FALSE}
## Converting variables to lists
age <- c(insurance_copy$age)
sex <- as.integer(c(insurance_copy$sex))
bmi <- c(insurance_copy$bmi)
children <- c(insurance_copy$children)
smoker <- as.integer(c(insurance_copy$smoker))
region <- as.integer(c(insurance_copy$region))
charges <- as.integer(c(insurance_copy$charges))
## binding the variables
insurance_cor <- cbind(age,sex,bmi,children,smoker,region,charges)
#head(insurance_cor)
```
### **Correlation Matrix**
```{r Correlation Matrix, warning=FALSE}
insurance_cor_result <- round (cor(insurance_cor, method = 'spearman'),3)
insurance_cor_result
```
<!-- ============================ END OF ROW ================================= -->
<!-- ============================ START OF ROW ============================== -->
Row {data-height=500}
-----------------------------------------------------------------------
### **Correlation Plot**
```{r Correlation Plot, warning=FALSE}
corrplot(insurance_cor_result, method = 'number')
```
<!-- ============================ END OF PAGE ================================= -->
<!-- ============================ START OF NEXT PAGE ============================== -->
# **Multiple Linear Regression**
<!-- ============================ START OF ROW ============================== -->
Row {data-height=500}
-----------------------------------------------------------------------
### **Research Question**
```{r}
library(grid)
grid.newpage()
grid.text ("Model Variables:
Predictors: Age and Smoker Response: Charges
Hypothesis Formulation:
H0: There is no relationship between the predictors and response variables\
HA: There is a relationship between the predictors and response variables\
Inference:
Statistical summary indicates that smoking status and charges are related.
The R-squared (multiple and adjusted) are within a good range at 0.72.
This suggests that the model is a good fit and
can possibly explain 72% of the total variability.",
x = 0.5, y = 0.5,
gp = gpar(col = "black", fontsize = 13, fill = "lightblue"))
```
### **Multiple Regression Model**
```{r Multiple Regression, warning=FALSE}
model <- lm(charges ~ smoker+age, data = Insurance)
summary(model)
### Confidence Interval of the Model Coefficient
confint(model)
```
<!-- ============================ END OF ROW ================================= -->
<!-- ============================ START OF ROW ============================== -->
Row {data-height=500}
-----------------------------------------------------------------------
### **Model Prediction**
```{r Multiple Linear Regression: Model Prediction, warning=FALSE}
new <- data.frame(age=c(35), smoker=c('yes'))
predict(model, newdata=new)
```
### **Prediction**
```{r}
library(grid)
grid.newpage()
grid.text ("Mathematically, multiple regression is represented as
y = a +(b1)(x1) +......+ (bn)(xn).
In this model, this means the charges for a 35-year-old smoker
will be calculated as
Charges = (-2391.63) + (23855.30) + (274.87*35) = $31,084.
The model accurately predicted the charges for a 35 year-old smoker.",
x = 0.5, y = 0.5,
gp = gpar(col = "black", fontsize = 14, fill = "lightblue"))
```