mydata <- read.table("./insurance.csv", header=TRUE, sep=",", dec=".")
head(mydata)
## age sex bmi children smoker region charges
## 1 19 female 27.900 0 yes southwest 16884.924
## 2 18 male 33.770 1 no southeast 1725.552
## 3 28 male 33.000 3 no southeast 4449.462
## 4 33 male 22.705 0 no northwest 21984.471
## 5 32 male 28.880 0 no northwest 3866.855
## 6 31 female 25.740 0 no southeast 3756.622
The unit of observation in this dataset is the primary beneficiary (person) of health insurance.
The sample size is 1338 units of observation.
Definition of variables:
The data was taken from the website Kaggle.com, more specifically from the link https://www.kaggle.com/datasets/mirichoi0218/insurance?resource=download
The main goal of the analysis is to compare medical charges between different groups of people (smokers and non-smokers, males and females…).
mydata_new <- mydata[c(-4, -6)] # Removing the columns "children" & "region"
mydata_new$sexF <- factor(mydata_new$sex,
labels = c(1, 0),
levels = c("male", "female")) # Adding a factor variable for "sex"
mydata_new$smokerF <- factor(mydata_new$smoker,
labels = c(1, 0),
levels = c("yes", "no")) # Adding a factor variable for "smoker"
mydata_new$ID <- seq.int(nrow(mydata_new)) # Adding a column with ID of each observation
mydata_new <- mydata_new[c(8, 1, 2, 6, 3, 4, 7, 5)] # Rearranging the column order
summary(mydata_new[c(-1, -3, -6)]) # Showing summary of variables without ID, sex and smoker columns
## age sexF bmi smokerF charges
## Min. :18.00 1:676 Min. :15.96 1: 274 Min. : 1122
## 1st Qu.:27.00 0:662 1st Qu.:26.30 0:1064 1st Qu.: 4740
## Median :39.00 Median :30.40 Median : 9382
## Mean :39.21 Mean :30.66 Mean :13270
## 3rd Qu.:51.00 3rd Qu.:34.69 3rd Qu.:16640
## Max. :64.00 Max. :53.13 Max. :63770
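The recoding of sex and smoker above depends on how factor() pairs each level with a label. The following is a minimal sketch on a hypothetical three-element vector (not the insurance data). It illustrates that the labels 1 and 0 are stored as text, so turning the factor back into a numeric 0/1 indicator requires going through as.character() first.

```r
# Hypothetical toy vector, used only to illustrate the levels/labels pairing
sex <- c("male", "female", "male")

# levels[i] is relabelled as labels[i]: "male" -> "1", "female" -> "0"
sexF <- factor(sex, levels = c("male", "female"), labels = c(1, 0))

levels(sexF)                              # labels are stored as character strings
codes <- as.integer(sexF)                 # internal position codes (1 or 2), not the 0/1 labels
dummy <- as.numeric(as.character(sexF))   # numeric 0/1 indicator recovered from the labels
```

Keeping this distinction in mind avoids accidentally treating the internal codes as the dummy values.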
library(rstatix) # Provides get_summary_stats()
get_summary_stats(mydata_new[-1]) # Summary statistics for numerical variables excluding "ID"
## # A tibble: 3 × 13
## variable n min max median q1 q3 iqr mad mean sd se ci
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 age 1338 18 64 39 27 51 24 17.8 39.2 14.0 0.384 0.754
## 2 bmi 1338 16.0 53.1 30.4 26.3 34.7 8.40 6.20 30.7 6.10 0.167 0.327
## 3 charges 1338 1122. 63770. 9382. 4740. 16640. 11900. 7441. 13270. 12110. 331. 649.
Parameter estimate interpretation for each numerical variable:
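The table above reports descriptive statistics for the three numerical variables (age, bmi and charges). The summary() output before it covers every variable except ID and the character columns sex and smoker, which are represented there by the factors sexF and smokerF. For example, the mean age is 39.2 years, close to the median of 39, whereas the mean of charges (13270 USD) is well above the median (9382 USD), which points to a right-skewed distribution of charges.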
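The se and ci columns of the table can be reproduced approximately from n and sd. The sketch below assumes that se is the standard error of the mean and that ci is the half-width of a 95% t-based confidence interval for the mean. It uses the rounded values printed for charges, so the results match the table only up to rounding.

```r
# Values read from the charges row of the table (rounded as printed)
n <- 1338
sd_charges <- 12110

# Standard error of the mean: sd / sqrt(n)
se <- sd_charges / sqrt(n)

# Assumed definition of ci: half-width of a 95% t-interval for the mean
ci <- qt(0.975, df = n - 1) * se

round(c(se = se, ci = ci))
```

Under these assumptions, the mean of charges would lie roughly within mean ± ci at the 95% confidence level.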
library(dplyr) # Provides %>%, group_by(), summarize() and rename()
mydata_new %>%
group_by(smoker) %>%
summarize(mean(charges)) %>%
rename(Smoker = smoker, "Mean of charges" = "mean(charges)") # Showing the mean of charges for smokers and non-smokers
## # A tibble: 2 × 2
## Smoker `Mean of charges`
## <chr> <dbl>
## 1 no 8434.
## 2 yes 32050.
We can see that the average charges for smokers (about 32050 USD) are much higher than those for non-smokers (about 8434 USD). This can also be seen in the histogram later in the file.
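To put a number on this gap, the two group means can be compared directly. This is a small sketch that uses the rounded means printed in the table; the variable names are illustrative only.

```r
# Group means as printed in the table (rounded)
mean_smoker     <- 32050
mean_non_smoker <- 8434

diff_means  <- mean_smoker - mean_non_smoker   # absolute gap in USD
ratio_means <- mean_smoker / mean_non_smoker   # relative gap

c(difference = diff_means, ratio = round(ratio_means, 1))
```

On these rounded figures, smokers are charged on average close to four times as much as non-smokers.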
mydata_new %>%
group_by(sex) %>%
summarize(mean(charges), median(charges)) %>%
rename(Sex = sex, "Mean of charges" = "mean(charges)", "Median of charges" = "median(charges)") # Showing the mean and median of charges for males and females
## # A tibble: 2 × 3
## Sex `Mean of charges` `Median of charges`
## <chr> <dbl> <dbl>
## 1 female 12570. 9413.
## 2 male 13957. 9370.
This table shows the mean and the median of charges for both sexes. The medians for the two genders are very similar, while the mean is slightly higher for males. This can also be seen in the boxplot presented later in the file.
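One rough way to read the table is to look at the gap between mean and median within each group, since a mean well above the median suggests a right-skewed distribution. The sketch below uses the rounded values from the table; the data frame and column names are illustrative only.

```r
# Rounded values from the table
stats <- data.frame(
  sex    = c("female", "male"),
  mean   = c(12570, 13957),
  median = c(9413, 9370)
)

# Gap between mean and median in USD, computed per group
stats$gap <- stats$mean - stats$median
stats
```

The gap is larger for males, which would be consistent with more very high charges among males, although the table alone cannot confirm this.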
library(ggplot2)
ggplot(mydata_new, aes(charges, fill = smoker)) +
geom_histogram(binwidth = 1500,
alpha = 0.8,
position = "identity") +
scale_fill_brewer(palette = "Set1") +
theme_linedraw() +
xlab("Medical Charges in USD") +
ylab("Frequency") +
ggtitle("Medical Charges in USD of smokers vs. non-smokers") # Plotting a histogram of charges for smokers and non-smokers
In this histogram we can see the frequency of charges in bins of 1500 USD, split between smokers and non-smokers. The plot shows that there were fewer smokers than non-smokers in the sample and that smokers pay higher charges on average than non-smokers. This was already visible in the table comparing mean charges for smokers and non-smokers.
ggplot(mydata_new,aes(sex, charges)) +
geom_boxplot(aes(fill=sex)) +
scale_fill_brewer(palette = "Spectral") +
theme_linedraw() +
xlab("Gender") +
ylab("Medical Charges in USD") +
ggtitle("Medical Charges in USD by Gender") # Plotting a boxplot of charges for each sex
In this boxplot we can see the distribution of charges compared between males and females. The median of charges is quite similar for the two genders. However, the third quartile (Q3) of charges is higher for males than for females. This is consistent with the mean of charges being a little higher for males than for females. The mean and median charges for each gender were also compared in the table above.
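The quantities a boxplot displays can also be computed directly. The sketch below uses only the first six observations copied from the head(mydata) output at the top of the file, so its results describe those six rows and not the full sample. It relies on quantile() with its default settings.

```r
# First six observations, copied from the head(mydata) output;
# results describe only these rows, not the full sample of 1338
sample_rows <- data.frame(
  sex     = c("female", "male", "male", "male", "male", "female"),
  charges = c(16884.924, 1725.552, 4449.462, 21984.471, 3866.855, 3756.622)
)

# Median and third quartile of charges within each sex
med_by_sex <- tapply(sample_rows$charges, sample_rows$sex, median)
q3_by_sex  <- tapply(sample_rows$charges, sample_rows$sex, quantile, probs = 0.75)

med_by_sex
q3_by_sex
```

Applying the same two tapply() calls to mydata_new$charges and mydata_new$sex would give the medians and third quartiles for the full data.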