Exploratory Data Analysis

Row

Regional Distribution

Region by Charges

Row

Age Distribution

Statistical Analysis

Row

BMI Skewness

[1] "Skewness: [0.283410550996969]"

Shapiro Wilk’s Test


    Shapiro-Wilk normality test

data:  Insurance$bmi
W = 0.99389, p-value = 2.605e-05

Charges Skewness

[1] "Skewness: [1.51248251819953]"

Row

Skewness Check for BMI

QQPlot for BMI Normality

Histogram Plot for BMI Normality

BMI Boxplot Check

Row

Skewness Check for Charges

QQPlot Check for charges Normality

Histogram Plot for charges Normality

Charges Boxplot Check

Non-Parametric Tests

Row

Two Sample Independent T-test: Hypothesis One

SOLUTION TO HYPOTHESIS ONE


    Wilcoxon rank sum test with continuity correction

data:  smoker_bmi and non_smoker_bmi
W = 146228, p-value = 0.5321
alternative hypothesis: true location shift is less than 0
95 percent confidence interval:
     -Inf 0.759948
sample estimates:
difference in location 
            0.02508607 
# A tibble: 2 × 3
  smoker median  mean
  <chr>   <dbl> <dbl>
1 no       30.4  30.7
2 yes      30.4  30.7

Row

Two Sample Independent T-test: Hypothesis Two

SOLUTION TO HYPOTHESIS TWO


    Wilcoxon rank sum test with continuity correction

data:  charges_smoking and charges_no_smoking
W = 284133, p-value < 2.2e-16
alternative hypothesis: true location shift is not equal to 0
95 percent confidence interval:
 21826.63 25927.09
sample estimates:
difference in location 
              23922.22 
# A tibble: 2 × 2
  smoker total_charges
  <chr>          <dbl>
1 no          8974061.
2 yes         8781764.

Row

Barplot of Region and Smoker Status,

Pearson Chi-Squared Test

Chi Squared Test Result


    Pearson's Chi-squared test

data:  region_smoker
X-squared = 7.3435, df = 3, p-value = 0.06172

Pearson Chi-Squared Test - Residual Plot

Correlation Test

Row

Correlation Hypothesis

Correlation Matrix

            age    sex    bmi children smoker region charges
age       1.000  0.021  0.108    0.057 -0.025  0.008   0.534
sex       0.021  1.000 -0.045   -0.016 -0.076  0.016  -0.009
bmi       0.108 -0.045  1.000    0.016  0.002 -0.246   0.119
children  0.057 -0.016  0.016    1.000  0.017  0.024   0.133
smoker   -0.025 -0.076  0.002    0.017  1.000 -0.055   0.663
region    0.008  0.016 -0.246    0.024 -0.055  1.000  -0.006
charges   0.534 -0.009  0.119    0.133  0.663 -0.006   1.000

Row

Correlation Plot

Multiple Linear Regression

Row

Research Question

Multiple Regression Model


Call:
lm(formula = charges ~ smoker + age, data = Insurance)

Residuals:
     Min       1Q   Median       3Q      Max 
-16088.1  -2046.8  -1336.4   -212.7  28760.0 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -2391.63     528.30  -4.527 6.52e-06 ***
smokeryes   23855.30     433.49  55.031  < 2e-16 ***
age           274.87      12.46  22.069  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6397 on 1335 degrees of freedom
Multiple R-squared:  0.7214,    Adjusted R-squared:  0.721 
F-statistic:  1728 on 2 and 1335 DF,  p-value: < 2.2e-16
                 2.5 %     97.5 %
(Intercept) -3428.0191 -1355.2336
smokeryes   23004.9122 24705.6974
age           250.4371   299.3053

Row

Model Prediction

       1 
31084.17 

Prediction

---
title: "Insurance-Analysis-using-R"
author: "Adebola Alaba"
date: "`r Sys.Date()`"
output: 
  flexdashboard::flex_dashboard:
    orientation: rows
    vertical_layout: fill
    social: menu
    source_code: embed
---

```{r setup, Import Libraries, message=FALSE, warning=FALSE, paged.print=FALSE, include=FALSE}

# **Library Import**

library(flexdashboard)
library(ggplot2) #EDA plot
library(ggpubr) #ggplot customization
library(corrplot) #Correlation Matrix Plot
library(dplyr) #Data manipulation
library(tidyverse) #Data Interaction
library(rmarkdown) #Knitting report to Word/pdf
library(e1071) #Skewness Check
library(forcats)

```


<!-- ============================ START OF SIDEBAR =================================== -->




Sidebar {.sidebar}
=======================================================================

### **Description**

This summary captures the application of statistical techniques on a dataset for accurate insights and inferences. The statistical technique used are Normality tests, non-parametric test, correlation. The  Github repository for this project is available  [here](https://github.com/adebola-alaba/Health-Insurance-Analysis-using-R).

Author: [Adebola Alaba](https://mavenanalytics.io/profile/siraug)

Data sources: [Yash Gupta](https://www.kaggle.com/datasets/yashgupta011/insurance), Kaggle


<!-- ============================ END OF SIDE BAR ================================= -->




<!-- ============================ START OF NEXT PAGE ============================== -->


# **Exploratory Data Analysis**

<!-- ============================ START OF ROW ============================== -->

```{r Import Dataset, warning=FALSE}
Insurance <- read.csv("insurance.csv")
#head(Insurance)

#sprintf("Dataset size: [%s]", toString(dim(Insurance)))
```


```{r Check for Null Values, warning=FALSE}
# Check for null values in the Insurance object
null <- is.null(Insurance)
#sprintf("Any null values? - %s", null)
```

<!-- ============================ END OF ROW ================================= -->




<!-- ============================ START OF ROW ============================== -->


Row {data-height=1000}
-----------------------------------------------------------------------


### **Regional Distribution**

```{r Regional Distribution}
ggplot(Insurance, aes(x = fct_infreq(region))) + 
  geom_bar(fill = "green",
          color = "black") +
  labs(x = "Region",
       y = "Frequency",
       title = "Count by Region")

```



### **Region by Charges**

```{r Region by Charges}



# Summarize total charges by region and sex
charges_summary <- Insurance %>%
  group_by(region, sex) %>%
  summarise(total_charges = sum(charges), .groups = "drop")

# Reorder region by total charges (sum over both sexes)
charges_summary <- charges_summary %>%
  group_by(region) %>%
  mutate(region_total = sum(total_charges)) %>%
  ungroup() %>%
  mutate(region = fct_reorder(region, region_total, .desc = TRUE))

# Plot
ggplot(charges_summary, aes(x = region, y = total_charges, fill = sex)) + 
  geom_bar(stat = "identity", position = "stack") +
  #coord_flip() +
  labs(title = "Total Charges by Region and Sex",
       x = "Region", y = "Total Charges")


#regionbycharges <- ggplot(Insurance, aes(x = region, y = charges, fill = sex)) + 
 # geom_bar(data = subset(Insurance, sex == "female"), stat = "identity") + 
  #geom_bar(data = subset(Insurance, sex == "male"), stat = "identity") +
  #coord_flip()

#regionbycharges

```

<!-- ============================ END OF ROW ================================= -->




<!-- ============================ START OF ROW ============================== -->

Row {data-height=1000}
-----------------------------------------------------------------------



### **Age Distribution**

```{r Age Distribution}
ggplot(Insurance, aes(x = age)) + 
  geom_density(aes(y = ..count..), fill = "yellow") +
  geom_vline(aes(xintercept = mean(age)), 
             linetype = "dashed", linewidth = 0.6,
             color = "#FC4E07")
```


<!-- ============================ END OF PAGE ================================= -->




<!-- ============================ START OF NEXT PAGE ============================== -->


# **Statistical Analysis**


<!-- ============================ START OF ROW ============================== -->


Row {data-height=500}
-----------------------------------------------------------------------

### **BMI Skewness**

```{r Skewness, warning=FALSE}

sprintf("Skewness: [%s]", toString(skewness(Insurance$bmi)))
```



### **Shapiro Wilk's Test**

```{r "Shapiro Wilk's Test for BMI Normality", warning=FALSE}


#Research Question: is the BMI variable normally distributed?

#H0: The BMI variable is normally distributed\
#HA: The BMI variable is not normally distributed\

#Confidence Interval: If the p-value is greater than 0.05, the null hypothesis will fail to be rejected and if it is lower than 0.05, the null hypothesis will be rejected.\


shapiro.test(Insurance$bmi)
```



### **Charges Skewness**

```{r Skewness_Charges, warning=FALSE}

sprintf("Skewness: [%s]", toString(skewness(Insurance$charges)))
```



<!-- ============================ END OF ROW ================================= -->




<!-- ============================ START OF ROW ============================== -->




Row {data-height=1000}
-----------------------------------------------------------------------
### **Skewness Check for BMI**

```{r Skewness Check for BMI, warning=FALSE}
ggplot(Insurance, aes(x=bmi)) + 
    geom_density(alpha=.3, fill="blue", color="blue", size=1)+
    geom_vline(aes(xintercept=mean(bmi)), size=1, color ="black")+
    ggtitle("Distribution density of BMI") +
    theme(text = element_text(size = 15))
    
```


### **QQPlot for BMI Normality**

```{r QQPlot Check for BMI Normality, warning=FALSE}

qqnorm(Insurance$bmi, main = "Normal QQPlot of BMI",)
qqline(Insurance$bmi)

```


### **Histogram Plot for BMI Normality**

```{r Histogram for BMI Normality, warning=FALSE}

hist(Insurance$bmi, main = "Histogram of BMI", prob = TRUE, ylim = c(0, 0.07))
lines (density(Insurance$bmi))

```


### **BMI Boxplot Check**


```{r BMI Boxplot Check}

boxplot(Insurance$bmi,
        ylab = "bmi",
        main = "Boxplot of BMI",
        col= "blue",
        outcol="blue")
```


<!-- ============================ END OF ROW ================================= -->




<!-- ============================ START OF ROW ============================== -->




Row {data-height=1000}
-----------------------------------------------------------------------



### **Skewness Check for Charges**


```{r Skewness Check for Charges, warning=FALSE}

ggplot(Insurance, aes(x=charges)) + 
    geom_density(alpha=.3, fill="red", color="red", size=1)+
    geom_vline(aes(xintercept=mean(charges)), size=1, color ="black")+
    ggtitle("Distribution density of Charges") +
    theme(text = element_text(size = 15))

```



### **QQPlot Check for charges Normality**


```{r QQPlot Check for charges Normality, warning=FALSE}

qqnorm(Insurance$charges, main = "Normal QQPlot of Charges",)
qqline(Insurance$charges)

```


### **Histogram Plot for charges Normality**


```{r Histogram Check for charges Normality, warning=FALSE}

hist(Insurance$charges, main = "Histogram of Charges", prob = TRUE)
lines (density(Insurance$charges))

```


### **Charges Boxplot Check**

```{r Charges Boxplot Check, warning=FALSE}
boxplot(Insurance$charges,
        ylab = "charges",
        main = "Boxplot of Insurance Charges",
        col= "red",
        outcol="red")

```


<!-- ============================ END OF PAGE ================================= -->




<!-- ============================ START OF NEXT PAGE ============================== -->


# **Non-Parametric Tests**

<!-- ============================ START OF ROW ============================== -->


Row {data-height=500}
-----------------------------------------------------------------------

### **Two Sample Independent T-test: Hypothesis One**

```{r}
library(grid)

grid.newpage()
grid.text("Hypothesis:
H0: The median BMI of smokers > non smokers
HA: The median BMI of smokers <  non smokers

Inference: 
The p-value of [0.5321] > significant level of 0.05, 
there is enough evidence for the null hypothesis 
to fail to be rejected. 
The non-smokers mean bmi isn't within the ideal range.", 
          x = 0.5, y = 0.5,
          gp = gpar(col = "black", fontsize = 16, fill = "lightblue"))


#grid.rect(x = 0.5, y = 0.5, width = 0.6, height = 0.2, gp = gpar(fill = "lightblue"))
#grid.text("This is a textbox", x = 0.5, y = 0.5)

```

### **SOLUTION TO HYPOTHESIS ONE**


```{r Hypothesis One, warning=FALSE}

smoker_bmi <- Insurance$bmi[Insurance$smoker == 'yes'] # extract bmi where smoker equals yes
non_smoker_bmi <- Insurance$bmi[Insurance$smoker == 'no'] # extract bmi where smoker equals no


wilcox.test(smoker_bmi,non_smoker_bmi, alternative = "less", conf.int = TRUE)

```




```{r Hypothesis OneA, warning=FALSE}
#To Confirm what the exact values are:
#cat("Mean & Median BMI by Smoking Status:\n\n")

group_by(Insurance,smoker) %>%
  summarise(
            median = median(bmi, na.rm = TRUE),
            mean = mean(bmi, na.rm = TRUE))

```


<!-- ============================ END OF ROW ================================= -->




<!-- ============================ START OF ROW ============================== -->




Row {data-height=500}
-----------------------------------------------------------------------

### **Two Sample Independent T-test: Hypothesis Two**


```{r}
library(grid)

grid.newpage()
grid.text ("Hypothesis Formulation:
HO: The insurance claims of smokers and non-smokers are similar
HA: The insurance claims of smokers and non-smokers are not similar

Inference: 
The p-value is below 0.05. 
Therefore, there is no sufficient evidence to fail to reject 
the claim that the charges of those who smoke, 
and non-smokers are the same. 
The null hypothesis is hereby rejected as result 
shows that the claims are indeed different.", 

          x = 0.5, y = 0.5,
          gp = gpar(col = "black", fontsize = 16, fill = "lightblue"))


#grid.rect(x = 0.5, y = 0.5, width = 0.6, height = 0.2, gp = gpar(fill = "lightblue"))
#grid.text("This is a textbox", x = 0.5, y = 0.5)

```

### **SOLUTION TO HYPOTHESIS TWO**

```{r Hypothesis Two, warning=FALSE}
charges_smoking <- Insurance$charges[Insurance$smoker == 'yes'] # extract charges where smoker equals yes
charges_no_smoking <- Insurance$charges[Insurance$smoker == 'no'] # extract charges where smoker equals no

# Calculate and print the sums
sum_smokers <- sum(charges_smoking)
sum_nonsmokers <- sum(charges_no_smoking)

#cat("Total charges for smokers: ", sum_smokers, "\n")
#cat("Total charges for non-smokers: ", sum_nonsmokers, "\n")

# Perform Wilcoxon test
wilcox.test(charges_smoking, charges_no_smoking, conf.int = TRUE)

```

```{r Hypothesis TwoA, warning=FALSE}
#To Confirm what the exact values are:
#cat("Total Charges by Smoking Status:\n\n")

Insurance %>%
  group_by(smoker) %>%
  summarise(total_charges = sum(charges), .groups = "drop")


```



<!-- ============================ END OF ROW ================================= -->




<!-- ============================ START OF ROW ============================== -->




Row {data-height=500}
-----------------------------------------------------------------------

### **Barplot of Region and Smoker Status,**

```{r barplot of Region and Smoker status, warning=FALSE}
#barplot of Region and Smoker status

ggplot(Insurance) +
  aes(x = region, fill = smoker) +
  geom_bar()
```



### **Pearson Chi-Squared Test**

```{r}
library(grid)

grid.newpage()
grid.text ("Hypothesis Formulation:
H0: Region and smoking status are independent\
HA: Region and smoking status are dependent\

Inference: 
The chi-squared test shows that region and smoker 
are independent as the p-value is greater than the 
significance level of 0.05 thus 
the null hypothesis fails to be rejected.", 

          x = 0.5, y = 0.5,
          gp = gpar(col = "black", fontsize = 16, fill = "lightblue"))


#grid.rect(x = 0.5, y = 0.5, width = 0.6, height = 0.2, gp = gpar(fill = "lightblue"))
#grid.text("This is a textbox", x = 0.5, y = 0.5)

```


```{r Chi Squared Test: Table and Plot, warning=FALSE}

#Create a Contingency table of the variables Region and Smoker Status

region_smoker <- table(Insurance$region, Insurance$smoker)

#region_smoker

```


### **Chi Squared Test Result**

```{r Chi Squared Test, warning=FALSE}
chisq_reg_smk <- chisq.test(region_smoker)

chisq_reg_smk

```



### **Pearson Chi-Squared Test - Residual Plot**

```{r Residual Plot, warning=FALSE}
corrplot(chisq_reg_smk$residuals, is.cor = FALSE)
```








<!-- ============================ END OF PAGE ================================= -->




<!-- ============================ START OF NEXT PAGE ============================== -->



# **Correlation Test**


<!-- ============================ START OF ROW ============================== -->



Row {.column data-width=1000}
-----------------------------------------------------------------------

### **Correlation Hypothesis**

```{r}
library(grid)

grid.newpage()
grid.text ("Research Question: 

Are any of the variables correlated?

Inference:

The results show that some variables are positively correlated.

Smoker and Charges variables have a correlation coefficient of 0.66 

Age and Charges also have a correlation coefficient of 0.53.",

x = 0.5, y = 0.5,
          gp = gpar(col = "black", fontsize = 14, fill = "lightblue"))

# Result Interpretation:

#-   Strong negative correlation (-1) means that whenever x rises, y falls.\
#-   0 indicates no correlation exists between the two variables (x and y).\
#-   A strong positive correlation of 1 means that y rises as x does.\
# Age, BMI, Children and Charges are all numerical while sex, smoker and region are categorical.\

# For this test, all categorical will be changed to numerical.

# **For sex:**\
# - male becomes 1\
# - female becomes 2\

# **For smoker:**\
# - no becomes 0\
# - yes becomes 1\

# **For region:**\
# - southeast becomes 1\
# - southwest becomes 2\
# - northeast becomes 3\
# - northwest becomes 4\

          


#grid.rect(x = 0.5, y = 0.5, width = 0.6, height = 0.2, gp = gpar(fill = "lightblue"))
#grid.text("This is a textbox", x = 0.5, y = 0.5)

```



```{r Correlation, warning=FALSE}
## Make a copy of the data
insurance_copy = Insurance
#head(insurance_copy)

## Replacing values for sex
insurance_copy['sex'][insurance_copy['sex'] == 'male'] <- 1
insurance_copy['sex'][insurance_copy['sex'] == 'female'] <- 2

## Replacing values for smoker
insurance_copy['smoker'][insurance_copy['smoker'] == 'no'] <- 0
insurance_copy['smoker'][insurance_copy['smoker'] == 'yes'] <- 1

## Replacing values for region
insurance_copy['region'][insurance_copy['region'] == 'southeast'] <- 1
insurance_copy['region'][insurance_copy['region'] == 'southwest'] <- 2
insurance_copy['region'][insurance_copy['region'] == 'northeast'] <- 3
insurance_copy['region'][insurance_copy['region'] == 'northwest'] <- 4

## Print Changes
#head(insurance_copy)

# check the datatype of the variables
#class(insurance_copy$age)
#class(insurance_copy$sex)
#class(insurance_copy$bmi)
#class(insurance_copy$children)
#class(insurance_copy$smoker)
#class(insurance_copy$region)
#class(insurance_copy$charges)
```



```{r Correlation Data, warning=FALSE}
## Converting variables to lists
age <- c(insurance_copy$age)
sex <- as.integer(c(insurance_copy$sex))
bmi <- c(insurance_copy$bmi)
children <- c(insurance_copy$children)
smoker <- as.integer(c(insurance_copy$smoker))
region <- as.integer(c(insurance_copy$region))
charges <- as.integer(c(insurance_copy$charges))

## binding the variables
insurance_cor <- cbind(age,sex,bmi,children,smoker,region,charges)

#head(insurance_cor)
```




### **Correlation Matrix**

```{r Correlation Matrix, warning=FALSE}
insurance_cor_result <- round (cor(insurance_cor, method = 'spearman'),3)
insurance_cor_result
```




<!-- ============================ END OF ROW ================================= -->




<!-- ============================ START OF ROW ============================== -->




Row {data-height=500}
-----------------------------------------------------------------------

### **Correlation Plot**

```{r Correlation Plot, warning=FALSE}
corrplot(insurance_cor_result, method = 'number')
```







<!-- ============================ END OF PAGE ================================= -->




<!-- ============================ START OF NEXT PAGE ============================== -->



# **Multiple Linear Regression**


<!-- ============================ START OF ROW ============================== -->


Row {data-height=500}
-----------------------------------------------------------------------


### **Research Question**

```{r}
library(grid)

grid.newpage()
grid.text ("Model Variables:
Predictors: Age and Smoker      Response: Charges

Hypothesis Formulation:
H0: There is no relationship between the predictors and response variables\
HA: There is a relationship between the predictors and response variables\

Inference: 
Statistical summary indicates that smoking status and charges are related.
The R-squared (multiple and adjusted) are within a good range at 0.72. 
This suggests that the model is a good fit and 
can possibly explain 72% of the total variability.",

x = 0.5, y = 0.5,
          gp = gpar(col = "black", fontsize = 13, fill = "lightblue"))

```


### **Multiple Regression Model**

```{r Multiple Regression, warning=FALSE}
model <- lm(charges ~ smoker+age, data = Insurance)
summary(model)

### Confidence Interval of the Model Coefficient
confint(model)
```





<!-- ============================ END OF ROW ================================= -->




<!-- ============================ START OF ROW ============================== -->




Row {data-height=500}
-----------------------------------------------------------------------

### **Model Prediction**

```{r Multiple Linear Regression: Model Prediction, warning=FALSE}

new <- data.frame(age=c(35), smoker=c('yes'))

predict(model, newdata=new)

```





### **Prediction**

```{r}
library(grid)

grid.newpage()
grid.text ("Mathematically, multiple regression is represented as 

y = a +(b1)(x1) +......+ (bn)(xn).

In this model, this means the charges for a 35-year-old smoker
will be calculated as 

Charges = (-2391.63) + (23855.30) + (274.87*35) = $31,084.

The model accurately predicted the charges for a 35 year-old smoker.",

x = 0.5, y = 0.5,
          gp = gpar(col = "black", fontsize = 14, fill = "lightblue"))

```