Exploratory Data Analysis (EDA) is an approach to data analysis that mainly involves summarizing data using visualization methods. Prior to testing any hypothesis or developing model, EDA can be used to understand data, primarily the distribution of variables and relationship among them. EDA techniques are mostly graphical although some quantitative summaries of data may be presented.
Data for this project is taken from https://www.kaggle.com/mirichoi0218/insurance
It contains information about beneficiaries in an insurance plan, with features indicating characteristics of the patient as well as the total medical expenses charged to the plan for the calendar year.
Let’s load required libraries, read in the data and start summarizing it.
library(tidyverse)
library(gridExtra)
setwd("/Users/nirajanbudhathoki/OneDrive - Central Michigan University/INSU")
mydata <- read.csv('insurance.csv',stringsAsFactors = TRUE)
The head option, by default returns first six rows of the dataset. This default behavior can be changed. Another option is the “tail”, which returns the last rows.
head(mydata,n = 10)
## age sex bmi children smoker region charges
## 1 19 female 27.900 0 yes southwest 16884.924
## 2 18 male 33.770 1 no southeast 1725.552
## 3 28 male 33.000 3 no southeast 4449.462
## 4 33 male 22.705 0 no northwest 21984.471
## 5 32 male 28.880 0 no northwest 3866.855
## 6 31 female 25.740 0 no southeast 3756.622
## 7 46 female 33.440 1 no southeast 8240.590
## 8 37 female 27.740 3 no northwest 7281.506
## 9 37 male 29.830 2 no northeast 6406.411
## 10 60 female 25.840 0 no northwest 28923.137
The str() function provides information on size of dataset, variables names and types, and few initial values.
str(mydata)
## 'data.frame': 1338 obs. of 7 variables:
## $ age : int 19 18 28 33 32 31 46 37 37 60 ...
## $ sex : Factor w/ 2 levels "female","male": 1 2 2 2 2 1 1 1 2 1 ...
## $ bmi : num 27.9 33.8 33 22.7 28.9 ...
## $ children: int 0 1 3 0 0 0 1 3 2 0 ...
## $ smoker : Factor w/ 2 levels "no","yes": 2 1 1 1 1 1 1 1 1 1 ...
## $ region : Factor w/ 4 levels "northeast","northwest",..: 4 3 3 2 2 3 3 2 1 2 ...
## $ charges : num 16885 1726 4449 21984 3867 ...
summary() returns frequency counts in each categories of factor (categorical) variables and important numerical summaries for numeric variables.
summary(mydata)
## age sex bmi children smoker
## Min. :18.00 female:662 Min. :15.96 Min. :0.000 no :1064
## 1st Qu.:27.00 male :676 1st Qu.:26.30 1st Qu.:0.000 yes: 274
## Median :39.00 Median :30.40 Median :1.000
## Mean :39.21 Mean :30.66 Mean :1.095
## 3rd Qu.:51.00 3rd Qu.:34.69 3rd Qu.:2.000
## Max. :64.00 Max. :53.13 Max. :5.000
## region charges
## northeast:324 Min. : 1122
## northwest:325 1st Qu.: 4740
## southeast:364 Median : 9382
## southwest:325 Mean :13270
## 3rd Qu.:16640
## Max. :63770
Check for missing values
The command below returns number of missing values, if present in the dataset.
sum(is.na(mydata))
## [1] 0
Missing values, if present, should be dealt carefully before any further exercises. The simplest solution is to drop them from dataset. However, if there are reasons and you want to take them into consideration, imputation techniques can be used to substitute them with reasonable values.
Let’s start visualization of variables and their relationship.
Although the numerical summaries are already returned by the summary() function, let’s get some visual summaries.
1. Distribution by sex
ggplot(mydata,aes(x=sex))+
geom_bar(width = 0.5)+
labs(title = "Distribution by Sex",
x="Sex", y="Number of patients")
It seems that there are somewhat equal number of male and female patients. We can also get the exact numbers on the plot.
ggplot(mydata, aes(x = sex)) +
geom_bar(width=0.5) +
geom_text(aes(label = ..count..), stat = "count", vjust = 1.5, colour = "white")+
labs(title = "Distribution by sex",
x="Sex", y="Number of patients")
2. Distribution by Smoking Habits
ggplot(mydata,aes(x=smoker))+
geom_bar(width = 0.8,fill='lightblue',color='black')+
geom_text(aes(label = ..count..), stat = "count", vjust = 1.5, colour = "black")+
labs(title = "Distribution by smoking habits",
x="Smoker ",y="Number of patients")+
theme_minimal()
3. Distribution by region of residence
ggplot(mydata,aes(x=region))+
geom_bar(fill = "lightpink1")+
geom_text(aes(label = ..count..), stat = "count", vjust = 1.5, colour = "black")+
labs(title = "Distribution by Region of Residence",
x="Region",y="Number of patients")+
coord_flip()
4. Distribution by number of children
ggplot(mydata,aes(x=children))+
geom_bar(fill = "skyblue")+
geom_text(aes(label = ..count..), stat = "count", vjust = 1.5, colour = "black")+
theme_classic()+
labs(title = "Distribution by number of children covered by insurance plan",
x = "Number of children", y = "Number of patients")
1. Distribution by age
Numeric features may be presented visually in plots such as histogram and density plots.
ggplot(mydata,aes(x=age))+
geom_histogram()+
labs(title = "Distribution by age",
x="Age (Years)",y="Number of patients")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
We can pick a better binwidth. Let’s try 5. Each bin(interval) has the width of 5 years.
ggplot(mydata, aes(x = age)) +
geom_histogram(binwidth = 5,colour = "black")+
labs(title = "Distribution by age",
x="Age",y="Number of patients")
Clearly, the data is not bell-shaped. A dataset is said to come from a normal distribution if it has a bell-shaped histogram. Many statistical test rely on the assumption that the data come from a normally distributed population. A Q-Q plot can be drawn to check the normality assumption.
ggplot(mydata, aes(sample = age)) +
geom_qq() +
geom_qq_line()
On the x-axis are theoretical quantiles from a normal distribution while sample quantiles are along the y-axis. If the plotted points match up along the straight lines, then we say that the quantiles match and hence the data comes from a normal distribution. The age data does not seem to be normally distributed.
2. Distribution by BMI and Charge
Likewise, we can get the plot of bmi and charges. This time let’s present data in density plots on the same row.
p1 = ggplot(mydata, aes(x = bmi)) +
geom_density(fill = "royalblue2", alpha = .2)+
labs(title = "BMI Distribution")
p2 = ggplot(mydata, aes(x = charges)) +
geom_density(fill = "royalblue2", alpha = .2)+
labs(title = "Charge Distribution")
grid.arrange(p1, p2, ncol=2)
While the bmi data seem to follow a normal distribution, charge does not. We can again check the QQ plot of these two.
p3 = ggplot(mydata, aes(sample = bmi)) +
geom_qq() +
geom_qq_line()+
labs(title = "QQ Plot for BMI data")
p4 = ggplot(mydata, aes(sample = charges)) +
geom_qq() +
geom_qq_line()+
labs(title = "QQ Plot for Charge data")
grid.arrange(p3, p4, ncol=2)
We have further evidence for non-normality of the charge data from QQ plot. For confirmation, we can use statistical test like Shapiro-Wilk for testing normality. The test is done with the null hypothesis that the data is normal vs alternative hypothesis that the data is not normal.
shapiro.test(mydata$charges)
##
## Shapiro-Wilk normality test
##
## data: mydata$charges
## W = 0.81469, p-value < 2.2e-16
Since the p-value is very low, we reject null hypothesis at 5% level of significance. We conclude that the charges data is significantly different from a normal distribution. In other words, we cannot assume normality.
For bmi data that looks normally distributed, we can build a histogram and superimpose a density curve to get a better picture.
ggplot(mydata, aes(x = bmi, y = ..density..)) +
geom_histogram(fill = "slategray2", colour = "grey60", size = .2) +
geom_density()+
labs(title = "BMI Distribution")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
1. Distribution by Sex Vs Smoking Habit
ggplot(mydata, aes(sex, fill = smoker))+
geom_bar(position = "dodge")+
labs(title = "Sex Vs Smoking Habit",
x = "Sex", y = "Number of patients")
A simple way to get frequencies in each cateogries is to use cross-tabulation.
table(mydata$sex, mydata$smoker)
##
## no yes
## female 547 115
## male 517 159
2. Distribution by smoking habit in each region
ggplot(mydata) +
geom_count(mapping = aes(x = smoker, y = region))+
labs(title = "Distribution of Smoker by Region",
x = "Smoking Habit", y = "Region")
We can get the actual counts accompanying above plot.
mydata %>%
count(smoker, region)
## smoker region n
## 1 no northeast 257
## 2 no northwest 267
## 3 no southeast 273
## 4 no southwest 267
## 5 yes northeast 67
## 6 yes northwest 58
## 7 yes southeast 91
## 8 yes southwest 58
1. Distribution of bmi by sex
ggplot(mydata,aes(x=sex,y=bmi,color=sex))+
geom_boxplot()+
labs(title = "Distribution of BMI by sex",
x = "Sex", y = "BMI")
Median BMI value for male looks slightly higher than for female. Let’s see the exact mean, median, and frequencies in each group.
mydata %>%
group_by(sex) %>%
summarise(Mean = mean(bmi),
Median = median(bmi),
SD = sd(bmi),
N = n())
## # A tibble: 2 x 5
## sex Mean Median SD N
## * <fct> <dbl> <dbl> <dbl> <int>
## 1 female 30.4 30.1 6.05 662
## 2 male 30.9 30.7 6.14 676
Moreover, we see some outliers in the dataset. Outliers should always be treated appropriately before carrying out further statistical analysis. A simple and unfortunately most common solution is to throw away outliers. However, you might have compelling reasons to consider them into analysis. In such cases, imputation with a reasonable value or data transformation might be helpful.
2. Distribution of bmi by region
ggplot(mydata, aes(x = bmi)) +
geom_histogram(binwidth = 5, fill = "lightblue", colour = "black") +
labs(title = "Distribution of BMI by Region",
x = "BMI", y = "Number of patients")+
facet_grid(region ~ .)
BMI seems to be somewhat homogeneously distributed in each region. The histograms look bell-shaped.
3. Distribution of charges by smoking habit
Shown below is the density plot of charges by smoking habits. A vertical line is placed at the median values for each group.
med_charges <- mydata %>%
group_by(smoker) %>%
summarize(Median=median(charges))
ggplot(mydata, aes(charges))+
geom_density(aes(fill=smoker),alpha=0.4)+
geom_vline(data = med_charges, aes(xintercept = Median,color=smoker))+
labs(title = "Distribution of Charges by Smoker")
4. Distribution of charges by region
The violin plot below displays “density” of the distribution, highlighting the areas where more points are found. Box plot inside show five-points summaries.
ggplot(mydata, aes(region, charges))+
geom_violin()+
geom_boxplot(width = 0.1)+
labs(title = "Distribution of Charges by Region",
x = "Region",
y = "Charges")
5. Distribution of charges by number of children
ggplot(mydata, aes(x = children, y = charges))+
geom_point(color = 'red',position = "jitter",alpha=0.4)+
labs(title = "Distribution of charges by number of children",
x = "Number of children",
y = "Charges")
Most of the patients have three or fewer children covered by the insurance plan and some of them are highly charged. Patients with four or more children are often charged low.
1. Distribution of charges vs age
ggplot(mydata,aes(x=age,y=charges))+
geom_point()+
labs(title = "Charges Vs Age",
x = "Age", y = "Charges")
A slightly increasing trend is found on the plot of charges vs bmi. 2. Distribution of charges vs bmi
ggplot(mydata,aes(x = bmi, y = charges)) +
geom_point(alpha=0.5)+
labs(title = "Charges Vs BMI",
x = "BMI", y = "Charges")
No clear trend in charges vs bmi plot. However, we can say that charges higher than $40,000 are all made to patients with BMI greater than 30 kg/m^2.
1. Distribution of bmi in different region by sex
ggplot(mydata, aes(x = region, y = bmi, fill = sex)) +
geom_col(position = "dodge")+
scale_fill_brewer(palette = "Accent")+
labs(title = "BMI in different Region by Sex",
x = "Region", y = "BMI")
We can get summary statistics of each combination in the following way.
mydata %>%
group_by(sex,region) %>%
summarise(Mean = mean(bmi), N=n())
## `summarise()` has grouped output by 'sex'. You can override using the `.groups` argument.
## # A tibble: 8 x 4
## # Groups: sex [2]
## sex region Mean N
## <fct> <fct> <dbl> <int>
## 1 female northeast 29.3 161
## 2 female northwest 29.3 164
## 3 female southeast 32.7 175
## 4 female southwest 30.1 162
## 5 male northeast 29.0 163
## 6 male northwest 29.1 161
## 7 male southeast 34.0 189
## 8 male southwest 31.1 163
2. Distribution of charges in different region by sex
ggplot(mydata, aes(x = region, y = charges, fill = sex)) +
geom_col() +
scale_fill_brewer(palette = "Pastel1")+
labs(title = "Charges in different Region by Sex",
x = "Region", y = "Charges")+
theme_light()
The highest charges for males are in the southeast region while the lowest are in the northwest. Again, we can get a summary table of these values for all combinations. Although median value would be a better representation than mean for data that are not normal such as charges, let’s compare mean for the sake of demonstration.
mydata %>%
group_by(sex,region) %>%
summarise(Mean = mean(charges), N=n())
## `summarise()` has grouped output by 'sex'. You can override using the `.groups` argument.
## # A tibble: 8 x 4
## # Groups: sex [2]
## sex region Mean N
## <fct> <fct> <dbl> <int>
## 1 female northeast 12953. 161
## 2 female northwest 12480. 164
## 3 female southeast 13500. 175
## 4 female southwest 11274. 162
## 5 male northeast 13854. 163
## 6 male northwest 12354. 161
## 7 male southeast 15880. 189
## 8 male southwest 13413. 163
3. Distribution of Charges Vs BMI by sex
ggplot(mydata,aes(x=bmi,y=charges,color=sex))+
geom_point(alpha=0.6)+
labs(title = "Charges Vs BMI by sex",
x = "BMI", y = "Charges")
Just for illustration, let’s define charges greater than or equal to $30,000 as higher charges.
higher.charges <- mydata %>%
filter(charges >= 30000)
Now, we can see the distribution of charges vs bmi for different sexes. We can also fit separate linear models to each sex.
ggplot(higher.charges,aes(x=bmi,y=charges,color=sex))+
geom_point()+
stat_smooth(method=lm)+
labs(title = "Higher Charges Vs BMI by Sex",
x = "BMI", y = "Charges >= 30000")
## `geom_smooth()` using formula 'y ~ x'
The question of appropriateness of a linear model is different. However, if we were to fit a linear model, for charges greater than or equal to $30,000, the rate of increase in charges with BMI is higher in female than in males.
Alternatively, we can use facets to have different plots for the two groups.
ggplot(higher.charges,aes(x=bmi,y=charges))+
geom_point(alpha=0.5)+
stat_smooth(method = lm)+
facet_grid(.~sex)+
labs(title = "Higher Charges Vs BMI by Sex",
x = "BMI",
y = "Charges >= 30000")
## `geom_smooth()` using formula 'y ~ x'
4. Plot correlations among continuous variables
library(corrplot)
## corrplot 0.84 loaded
corr<-cor(mydata[,c(1,3,7)])
corrplot(corr, type="upper", method="number", tl.pos="d")
Although there are positive correlation coefficients, no pair of variables are highly correlated.
1. Distribution of Charges Vs BMI by sex in each region
ggplot(mydata,aes(x=bmi,y=charges,color=sex))+
geom_point(alpha = 0.6)+
facet_wrap(~region)+
labs(title = "Distribution of Charges vs BMI by Sex in each Region",
x = "BMI",
y = "Charges")
The distribution looks similar in each region.
In an attempt to doing exploratory data analysis (EDA), we explored several relationship between variables in this study. Understanding these relationships are crucial before doing any hypothesis tests or statistical modeling.
References
[1] Chang, W. (2013). R graphics cookbook. O’Reilly.
[2] Wickham, H., & Grolemund, G. (2016). R for data science: Import, tidy, transform, visualize, and model data. O’Reilly.