Ravi Melwyn Aranha: S3852052
Last updated: 02 May, 2021
Data Details:
Problem statement: Check the difference between the insurance premium charges between male and females in the US
I will explore the data in R, generate summary statistics, visually explore data, formulate hypothesis, reject/fail-to-reject the null hypothesis and analyze the statistical results to solve the problem statement selected
Data: I collected the data from kaggle at https://www.kaggle.com/teertha/ushealthinsurancedataset and it is open source
This includes various demographic attributes of a person and the corresponding insurance charges
Demographics included are 7 in total and are : age, sex, bmi, children, smoker, region, charges
I will focus on the impact of sex on premium charges, and will focus on 2 variables ‘sex’ and ‘charges’
The dataset includes the charges for 1,338 people
Sex is binary and contains ‘male’ & ’female
This is considered a factor variable and will be converted to the same during data exploration and preparation
#Import csv data and visualize top 5 rows
insurance <- read.csv("D:/Material/Applied Analytics/Assignments/Assignment 2/insurance.csv")
insurance %>% head()#convert sex and other relevant factor variables to factor() type
insurance$sex <- as.factor(insurance$sex)
insurance$smoker <- as.factor(insurance$smoker)
insurance$region <- as.factor(insurance$region)
insurance[c("sex", "smoker", "region")] %>% str() #converted to factor## 'data.frame': 1338 obs. of 3 variables:
## $ sex : Factor w/ 2 levels "female","male": 1 2 2 2 2 1 1 1 2 1 ...
## $ smoker: Factor w/ 2 levels "no","yes": 2 1 1 1 1 1 1 1 1 1 ...
## $ region: Factor w/ 4 levels "northeast","northwest",..: 4 3 3 2 2 3 3 2 1 2 ...
#Check for missing values and quickly check statistical attributes for each column
colSums(is.na(insurance)) #No missing values## age sex bmi children smoker region charges
## 0 0 0 0 0 0 0
## age sex bmi children smoker
## Min. :18.00 female:662 Min. :15.96 Min. :0.000 no :1064
## 1st Qu.:27.00 male :676 1st Qu.:26.30 1st Qu.:0.000 yes: 274
## Median :39.00 Median :30.40 Median :1.000
## Mean :39.21 Mean :30.66 Mean :1.095
## 3rd Qu.:51.00 3rd Qu.:34.69 3rd Qu.:2.000
## Max. :64.00 Max. :53.13 Max. :5.000
## region charges
## northeast:324 Min. : 1122
## northwest:325 1st Qu.: 4740
## southeast:364 Median : 9382
## southwest:325 Mean :13270
## 3rd Qu.:16640
## Max. :63770
#Split dataset into male and female for descriptive statistics. Also, keep relevant variables only
insurance <- insurance[, c("sex", "charges")]
male_i <- insurance %>% filter(sex=='male');
female_i <- insurance %>% filter(sex=='female')
#Visualize the data for male and female sexes using histograms
par(mfrow = c(2,2));
male_i$charges %>% hist(main="Distribution of charges for male", xlab = "male charges", col = "blue")
female_i$charges %>% hist(main="Distribution of charges for female", xlab = "female charges", col = "red")#Visualize the data for male and female sexes using boxplots. The interpretation of the graphs is done in the following text block
par(mfrow = c(1,2))
male_i$charges %>% boxplot(main="boxplot of charges for male", ylab = "male charges", col = "blue")
female_i$charges %>% boxplot(main="boxplot of charges for female", ylab = "female charges", col = "red")#Find summary statistics all-up
insurance %>% summarise(Min = min(charges,na.rm = TRUE), Q1 = quantile(charges,probs = .25,na.rm = TRUE),
Median = median(charges, na.rm = TRUE), Q3 = quantile(charges,probs = .75,na.rm = TRUE),
Max = max(charges,na.rm = TRUE), Mean = mean(charges, na.rm = TRUE),
SD = sd(charges, na.rm = TRUE), n = n(), Missing = sum(is.na(charges))) -> table1
table1 <- cbind(sex="All", table1) #Add label for sex as 'All' (To be rbinded with sex-level stats next)
#Find summary statistics by sex
insurance %>% group_by(sex) %>% summarise(Min = min(charges,na.rm = TRUE), Q1 = quantile(charges,probs = .25,na.rm = TRUE),
Median = median(charges, na.rm = TRUE), Q3 = quantile(charges,probs = .75,na.rm = TRUE),
Max = max(charges,na.rm = TRUE), Mean = mean(charges, na.rm = TRUE),
SD = sd(charges, na.rm = TRUE), n = n(), Missing = sum(is.na(charges))) -> table2
table3 <- rbind(table1, table2)
knitr::kable(table3)| sex | Min | Q1 | Median | Q3 | Max | Mean | SD | n | Missing |
|---|---|---|---|---|---|---|---|---|---|
| All | 1121.874 | 4740.287 | 9382.033 | 16639.91 | 63770.43 | 13270.42 | 12110.01 | 1338 | 0 |
| female | 1607.510 | 4885.159 | 9412.962 | 14454.69 | 63770.43 | 12569.58 | 11128.70 | 662 | 0 |
| male | 1121.874 | 4619.134 | 9369.616 | 18989.59 | 62592.87 | 13956.75 | 12971.03 | 676 | 0 |
#Check for equal variances
leveneTest(charges ~ sex ,data = insurance) #unequal variance identified due to p < 0.05#Check for normality
par(mfrow = (c(2,2)))
male_i$charges %>% qqPlot(dist="norm", main = "male charges vs theory_quantiles")## [1] 658 623
## [1] 266 286
##
## Welch Two Sample t-test
##
## data: charges by sex
## t = -2.1009, df = 1313.4, p-value = 0.03584
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -2682.48932 -91.85535
## sample estimates:
## mean in group female mean in group male
## 12569.58 13956.75
Below is a list of any references i have used in the presentation.