Exploratory Data Analysis

Exploratory Data Analysis (EDA) is an approach to data analysis that mainly involves summarizing data using visualization methods. Prior to testing any hypothesis or developing model, EDA can be used to understand data, primarily the distribution of variables and relationship among them. EDA techniques are mostly graphical although some quantitative summaries of data may be presented.

The Data

Data for this project is taken from https://www.kaggle.com/mirichoi0218/insurance
It contains information about beneficiaries in an insurance plan, with features indicating characteristics of the patient as well as the total medical expenses charged to the plan for the calendar year.

Let’s load required libraries, read in the data and start summarizing it.

library(tidyverse)
library(gridExtra)
setwd("/Users/nirajanbudhathoki/OneDrive - Central Michigan University/INSU")
mydata <- read.csv('insurance.csv',stringsAsFactors = TRUE)

The head option, by default returns first six rows of the dataset. This default behavior can be changed. Another option is the “tail”, which returns the last rows.

head(mydata,n = 10)
##    age    sex    bmi children smoker    region   charges
## 1   19 female 27.900        0    yes southwest 16884.924
## 2   18   male 33.770        1     no southeast  1725.552
## 3   28   male 33.000        3     no southeast  4449.462
## 4   33   male 22.705        0     no northwest 21984.471
## 5   32   male 28.880        0     no northwest  3866.855
## 6   31 female 25.740        0     no southeast  3756.622
## 7   46 female 33.440        1     no southeast  8240.590
## 8   37 female 27.740        3     no northwest  7281.506
## 9   37   male 29.830        2     no northeast  6406.411
## 10  60 female 25.840        0     no northwest 28923.137

The str() function provides information on size of dataset, variables names and types, and few initial values.

str(mydata)
## 'data.frame':    1338 obs. of  7 variables:
##  $ age     : int  19 18 28 33 32 31 46 37 37 60 ...
##  $ sex     : Factor w/ 2 levels "female","male": 1 2 2 2 2 1 1 1 2 1 ...
##  $ bmi     : num  27.9 33.8 33 22.7 28.9 ...
##  $ children: int  0 1 3 0 0 0 1 3 2 0 ...
##  $ smoker  : Factor w/ 2 levels "no","yes": 2 1 1 1 1 1 1 1 1 1 ...
##  $ region  : Factor w/ 4 levels "northeast","northwest",..: 4 3 3 2 2 3 3 2 1 2 ...
##  $ charges : num  16885 1726 4449 21984 3867 ...

summary() returns frequency counts in each categories of factor (categorical) variables and important numerical summaries for numeric variables.

summary(mydata)
##       age            sex           bmi           children     smoker    
##  Min.   :18.00   female:662   Min.   :15.96   Min.   :0.000   no :1064  
##  1st Qu.:27.00   male  :676   1st Qu.:26.30   1st Qu.:0.000   yes: 274  
##  Median :39.00                Median :30.40   Median :1.000             
##  Mean   :39.21                Mean   :30.66   Mean   :1.095             
##  3rd Qu.:51.00                3rd Qu.:34.69   3rd Qu.:2.000             
##  Max.   :64.00                Max.   :53.13   Max.   :5.000             
##        region       charges     
##  northeast:324   Min.   : 1122  
##  northwest:325   1st Qu.: 4740  
##  southeast:364   Median : 9382  
##  southwest:325   Mean   :13270  
##                  3rd Qu.:16640  
##                  Max.   :63770

Check for missing values
The command below returns number of missing values, if present in the dataset.

sum(is.na(mydata))
## [1] 0

Missing values, if present, should be dealt carefully before any further exercises. The simplest solution is to drop them from dataset. However, if there are reasons and you want to take them into consideration, imputation techniques can be used to substitute them with reasonable values.

Let’s start visualization of variables and their relationship.

Plots involving one variable

Distribution of patients based on categorical features

Although the numerical summaries are already returned by the summary() function, let’s get some visual summaries.

1. Distribution by sex

ggplot(mydata,aes(x=sex))+
  geom_bar(width = 0.5)+
  labs(title = "Distribution by Sex", 
       x="Sex", y="Number of patients")

It seems that there are somewhat equal number of male and female patients. We can also get the exact numbers on the plot.

ggplot(mydata, aes(x = sex)) +
  geom_bar(width=0.5) +
  geom_text(aes(label = ..count..), stat = "count", vjust = 1.5, colour = "white")+
  labs(title = "Distribution by sex", 
       x="Sex", y="Number of patients")

2. Distribution by Smoking Habits

ggplot(mydata,aes(x=smoker))+
  geom_bar(width = 0.8,fill='lightblue',color='black')+
  geom_text(aes(label = ..count..), stat = "count", vjust = 1.5, colour = "black")+
  labs(title = "Distribution by smoking habits", 
       x="Smoker ",y="Number of patients")+
  theme_minimal()

3. Distribution by region of residence

ggplot(mydata,aes(x=region))+
  geom_bar(fill = "lightpink1")+
  geom_text(aes(label = ..count..), stat = "count", vjust = 1.5, colour = "black")+
  labs(title = "Distribution by Region of Residence", 
       x="Region",y="Number of patients")+
  coord_flip()

4. Distribution by number of children

ggplot(mydata,aes(x=children))+
  geom_bar(fill = "skyblue")+
  geom_text(aes(label = ..count..), stat = "count", vjust = 1.5, colour = "black")+
  theme_classic()+
  labs(title = "Distribution by number of children covered by insurance plan",
       x = "Number of children", y = "Number of patients")

Distribution of patients based on numeric features

1. Distribution by age
Numeric features may be presented visually in plots such as histogram and density plots.

ggplot(mydata,aes(x=age))+
  geom_histogram()+
  labs(title = "Distribution by age", 
        x="Age (Years)",y="Number of patients")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We can pick a better binwidth. Let’s try 5. Each bin(interval) has the width of 5 years.

ggplot(mydata, aes(x = age)) +
  geom_histogram(binwidth = 5,colour = "black")+
  labs(title = "Distribution by age", 
       x="Age",y="Number of patients")

Clearly, the data is not bell-shaped. A dataset is said to come from a normal distribution if it has a bell-shaped histogram. Many statistical test rely on the assumption that the data come from a normally distributed population. A Q-Q plot can be drawn to check the normality assumption.

ggplot(mydata, aes(sample = age)) +
  geom_qq() +
  geom_qq_line()

On the x-axis are theoretical quantiles from a normal distribution while sample quantiles are along the y-axis. If the plotted points match up along the straight lines, then we say that the quantiles match and hence the data comes from a normal distribution. The age data does not seem to be normally distributed.

2. Distribution by BMI and Charge
Likewise, we can get the plot of bmi and charges. This time let’s present data in density plots on the same row.

p1 = ggplot(mydata, aes(x = bmi)) +
  geom_density(fill = "royalblue2", alpha = .2)+
  labs(title = "BMI Distribution")
p2 = ggplot(mydata, aes(x = charges)) +
  geom_density(fill = "royalblue2", alpha = .2)+
  labs(title = "Charge Distribution")
grid.arrange(p1, p2, ncol=2)

While the bmi data seem to follow a normal distribution, charge does not. We can again check the QQ plot of these two.

p3 = ggplot(mydata, aes(sample = bmi)) +
  geom_qq() +
  geom_qq_line()+
  labs(title = "QQ Plot for BMI data")

p4 = ggplot(mydata, aes(sample = charges)) +
  geom_qq() +
  geom_qq_line()+
  labs(title = "QQ Plot for Charge data")

grid.arrange(p3, p4, ncol=2)

We have further evidence for non-normality of the charge data from QQ plot. For confirmation, we can use statistical test like Shapiro-Wilk for testing normality. The test is done with the null hypothesis that the data is normal vs alternative hypothesis that the data is not normal.

shapiro.test(mydata$charges)
## 
##  Shapiro-Wilk normality test
## 
## data:  mydata$charges
## W = 0.81469, p-value < 2.2e-16

Since the p-value is very low, we reject null hypothesis at 5% level of significance. We conclude that the charges data is significantly different from a normal distribution. In other words, we cannot assume normality.

For bmi data that looks normally distributed, we can build a histogram and superimpose a density curve to get a better picture.

ggplot(mydata, aes(x = bmi, y = ..density..)) +
  geom_histogram(fill = "slategray2", colour = "grey60", size = .2) +
  geom_density()+
  labs(title = "BMI Distribution")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Plots involving two variables

Both Categorical

1. Distribution by Sex Vs Smoking Habit

ggplot(mydata, aes(sex, fill = smoker))+
  geom_bar(position = "dodge")+
  labs(title = "Sex Vs Smoking Habit",
       x = "Sex", y = "Number of patients")

A simple way to get frequencies in each cateogries is to use cross-tabulation.

table(mydata$sex, mydata$smoker)
##         
##           no yes
##   female 547 115
##   male   517 159

2. Distribution by smoking habit in each region

ggplot(mydata) +
  geom_count(mapping = aes(x = smoker, y = region))+
  labs(title = "Distribution of Smoker by Region",
       x = "Smoking Habit", y = "Region")

We can get the actual counts accompanying above plot.

mydata %>% 
  count(smoker, region)
##   smoker    region   n
## 1     no northeast 257
## 2     no northwest 267
## 3     no southeast 273
## 4     no southwest 267
## 5    yes northeast  67
## 6    yes northwest  58
## 7    yes southeast  91
## 8    yes southwest  58

One Categorical and One Continuous

1. Distribution of bmi by sex

ggplot(mydata,aes(x=sex,y=bmi,color=sex))+
  geom_boxplot()+
  labs(title = "Distribution of BMI by sex",
       x = "Sex", y = "BMI")

Median BMI value for male looks slightly higher than for female. Let’s see the exact mean, median, and frequencies in each group.

mydata %>%
  group_by(sex) %>%
  summarise(Mean = mean(bmi),
            Median = median(bmi),
            SD = sd(bmi),
            N = n())
## # A tibble: 2 x 5
##   sex     Mean Median    SD     N
## * <fct>  <dbl>  <dbl> <dbl> <int>
## 1 female  30.4   30.1  6.05   662
## 2 male    30.9   30.7  6.14   676

Moreover, we see some outliers in the dataset. Outliers should always be treated appropriately before carrying out further statistical analysis. A simple and unfortunately most common solution is to throw away outliers. However, you might have compelling reasons to consider them into analysis. In such cases, imputation with a reasonable value or data transformation might be helpful.

2. Distribution of bmi by region

ggplot(mydata, aes(x = bmi)) +
  geom_histogram(binwidth = 5, fill = "lightblue", colour = "black") +
  labs(title = "Distribution of BMI by Region",
       x = "BMI", y = "Number of patients")+
  facet_grid(region ~ .)

BMI seems to be somewhat homogeneously distributed in each region. The histograms look bell-shaped.

3. Distribution of charges by smoking habit

Shown below is the density plot of charges by smoking habits. A vertical line is placed at the median values for each group.

med_charges <- mydata %>%
  group_by(smoker) %>%
  summarize(Median=median(charges))
ggplot(mydata, aes(charges))+
  geom_density(aes(fill=smoker),alpha=0.4)+
  geom_vline(data = med_charges, aes(xintercept = Median,color=smoker))+
  labs(title = "Distribution of Charges by Smoker")

4. Distribution of charges by region

The violin plot below displays “density” of the distribution, highlighting the areas where more points are found. Box plot inside show five-points summaries.

ggplot(mydata, aes(region, charges))+
  geom_violin()+
  geom_boxplot(width = 0.1)+
  labs(title = "Distribution of Charges by Region",
       x = "Region",
       y = "Charges")

5. Distribution of charges by number of children

ggplot(mydata, aes(x = children, y = charges))+
  geom_point(color = 'red',position = "jitter",alpha=0.4)+
  labs(title = "Distribution of charges by number of children",
       x = "Number of children",
       y = "Charges")

Most of the patients have three or fewer children covered by the insurance plan and some of them are highly charged. Patients with four or more children are often charged low.

Both Continuous

1. Distribution of charges vs age

ggplot(mydata,aes(x=age,y=charges))+
  geom_point()+
  labs(title = "Charges Vs Age",
       x = "Age", y = "Charges")

A slightly increasing trend is found on the plot of charges vs bmi. 2. Distribution of charges vs bmi

ggplot(mydata,aes(x = bmi, y = charges)) +
  geom_point(alpha=0.5)+
  labs(title = "Charges Vs BMI",
       x = "BMI", y = "Charges")

No clear trend in charges vs bmi plot. However, we can say that charges higher than $40,000 are all made to patients with BMI greater than 30 kg/m^2.

Plots involving three variables

1. Distribution of bmi in different region by sex

ggplot(mydata, aes(x = region, y = bmi, fill = sex)) +
  geom_col(position = "dodge")+
  scale_fill_brewer(palette = "Accent")+
  labs(title = "BMI in different Region by Sex",
       x = "Region", y = "BMI")

We can get summary statistics of each combination in the following way.

mydata %>%
group_by(sex,region) %>%
summarise(Mean = mean(bmi), N=n())
## `summarise()` has grouped output by 'sex'. You can override using the `.groups` argument.
## # A tibble: 8 x 4
## # Groups:   sex [2]
##   sex    region     Mean     N
##   <fct>  <fct>     <dbl> <int>
## 1 female northeast  29.3   161
## 2 female northwest  29.3   164
## 3 female southeast  32.7   175
## 4 female southwest  30.1   162
## 5 male   northeast  29.0   163
## 6 male   northwest  29.1   161
## 7 male   southeast  34.0   189
## 8 male   southwest  31.1   163

2. Distribution of charges in different region by sex

ggplot(mydata, aes(x = region, y = charges, fill = sex)) +
  geom_col() +
  scale_fill_brewer(palette = "Pastel1")+
  labs(title = "Charges in different Region by Sex",
       x = "Region", y = "Charges")+
  theme_light()

The highest charges for males are in the southeast region while the lowest are in the northwest. Again, we can get a summary table of these values for all combinations. Although median value would be a better representation than mean for data that are not normal such as charges, let’s compare mean for the sake of demonstration.

mydata %>%
group_by(sex,region) %>%
summarise(Mean = mean(charges), N=n())
## `summarise()` has grouped output by 'sex'. You can override using the `.groups` argument.
## # A tibble: 8 x 4
## # Groups:   sex [2]
##   sex    region      Mean     N
##   <fct>  <fct>      <dbl> <int>
## 1 female northeast 12953.   161
## 2 female northwest 12480.   164
## 3 female southeast 13500.   175
## 4 female southwest 11274.   162
## 5 male   northeast 13854.   163
## 6 male   northwest 12354.   161
## 7 male   southeast 15880.   189
## 8 male   southwest 13413.   163

3. Distribution of Charges Vs BMI by sex

ggplot(mydata,aes(x=bmi,y=charges,color=sex))+
  geom_point(alpha=0.6)+
  labs(title = "Charges Vs BMI by sex",
       x = "BMI", y = "Charges")

Just for illustration, let’s define charges greater than or equal to $30,000 as higher charges.

higher.charges <- mydata %>%
  filter(charges >= 30000)

Now, we can see the distribution of charges vs bmi for different sexes. We can also fit separate linear models to each sex.

ggplot(higher.charges,aes(x=bmi,y=charges,color=sex))+
  geom_point()+
  stat_smooth(method=lm)+
  labs(title = "Higher Charges Vs BMI by Sex",
       x = "BMI", y = "Charges >= 30000")
## `geom_smooth()` using formula 'y ~ x'

The question of appropriateness of a linear model is different. However, if we were to fit a linear model, for charges greater than or equal to $30,000, the rate of increase in charges with BMI is higher in female than in males.

Alternatively, we can use facets to have different plots for the two groups.

ggplot(higher.charges,aes(x=bmi,y=charges))+
  geom_point(alpha=0.5)+
  stat_smooth(method = lm)+
  facet_grid(.~sex)+
  labs(title = "Higher Charges Vs BMI by Sex",
       x = "BMI",
       y = "Charges >= 30000")
## `geom_smooth()` using formula 'y ~ x'

4. Plot correlations among continuous variables

library(corrplot)
## corrplot 0.84 loaded
corr<-cor(mydata[,c(1,3,7)])
corrplot(corr, type="upper", method="number", tl.pos="d")

Although there are positive correlation coefficients, no pair of variables are highly correlated.

Plots involving four variables

1. Distribution of Charges Vs BMI by sex in each region

ggplot(mydata,aes(x=bmi,y=charges,color=sex))+
  geom_point(alpha = 0.6)+
  facet_wrap(~region)+
  labs(title = "Distribution of Charges vs BMI by Sex in each Region",
       x = "BMI",
       y = "Charges")

The distribution looks similar in each region.

In an attempt to doing exploratory data analysis (EDA), we explored several relationship between variables in this study. Understanding these relationships are crucial before doing any hypothesis tests or statistical modeling.

References
[1] Chang, W. (2013). R graphics cookbook. O’Reilly.
[2] Wickham, H., & Grolemund, G. (2016). R for data science: Import, tidy, transform, visualize, and model data. O’Reilly.