MATH1324 - Assignment 2 Report

Exploration of the impact of a person’s sex on the insurance premium charges in the United States

Ravi Melwyn Aranha: S3852052

Last updated: 02 May, 2021

Introduction Cont.

Data Details:

Problem Statement & Data

Data Cont.

Data Import, Descriptive Statistics and Visualisation

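#Load the packages used in this report (dplyr for %>% and filter(), car for leveneTest() and qqPlot())
library(dplyr)
library(car)
#Import csv data and view the first 6 rows
insurance <- read.csv("D:/Material/Applied Analytics/Assignments/Assignment 2/insurance.csv")
insurance %>% head()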
#convert sex and other relevant factor variables to factor() type
insurance$sex <- as.factor(insurance$sex)
insurance$smoker <- as.factor(insurance$smoker)
insurance$region <- as.factor(insurance$region)
insurance[c("sex", "smoker", "region")] %>% str() #converted to factor
## 'data.frame':    1338 obs. of  3 variables:
##  $ sex   : Factor w/ 2 levels "female","male": 1 2 2 2 2 1 1 1 2 1 ...
##  $ smoker: Factor w/ 2 levels "no","yes": 2 1 1 1 1 1 1 1 1 1 ...
##  $ region: Factor w/ 4 levels "northeast","northwest",..: 4 3 3 2 2 3 3 2 1 2 ...
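
The three conversions above can also be written as a single step. The following is a minimal sketch on a small made-up data frame (`toy` and its values are hypothetical and are not taken from insurance.csv):

```r
# Hypothetical toy data standing in for insurance.csv (values are made up)
toy <- data.frame(
  sex     = c("female", "male", "male"),
  smoker  = c("yes", "no", "no"),
  region  = c("southwest", "southeast", "northwest"),
  charges = c(100, 200, 300),
  stringsAsFactors = FALSE
)

# Convert all categorical columns to factors in one step
factor_cols <- c("sex", "smoker", "region")
toy[factor_cols] <- lapply(toy[factor_cols], as.factor)

# Confirm the column types
str(toy[factor_cols])
```

Listing the column names once keeps the conversion and the later `str()` check in sync if more categorical columns are added.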

Descriptive Statistics and Visualization - Cont.

#Check for missing values and quickly check statistical attributes for each column
colSums(is.na(insurance)) #No missing values
##      age      sex      bmi children   smoker   region  charges 
##        0        0        0        0        0        0        0
summary(insurance)
##       age            sex           bmi           children     smoker    
##  Min.   :18.00   female:662   Min.   :15.96   Min.   :0.000   no :1064  
##  1st Qu.:27.00   male  :676   1st Qu.:26.30   1st Qu.:0.000   yes: 274  
##  Median :39.00                Median :30.40   Median :1.000             
##  Mean   :39.21                Mean   :30.66   Mean   :1.095             
##  3rd Qu.:51.00                3rd Qu.:34.69   3rd Qu.:2.000             
##  Max.   :64.00                Max.   :53.13   Max.   :5.000             
##        region       charges     
##  northeast:324   Min.   : 1122  
##  northwest:325   1st Qu.: 4740  
##  southeast:364   Median : 9382  
##  southwest:325   Mean   :13270  
##                  3rd Qu.:16640  
##                  Max.   :63770

Descriptive Statistics and Visualization - Cont.

#Split dataset into male and female for descriptive statistics. Also, keep relevant variables only
insurance <- insurance[, c("sex", "charges")]
male_i <- insurance %>% filter(sex=='male')
female_i <- insurance %>% filter(sex=='female')

#Visualize the data for male and female sexes using histograms
par(mfrow = c(2,2))
male_i$charges %>% hist(main="Distribution of charges for male", xlab = "male charges", col = "blue")
female_i$charges %>% hist(main="Distribution of charges for female", xlab = "female charges", col = "red")

Descriptive Statistics and Visualization - Cont.

#Visualize the data for male and female sexes using boxplots. The interpretation of the graphs is done in the following text block
par(mfrow = c(1,2))
male_i$charges %>% boxplot(main="boxplot of charges for male", ylab = "male charges", col = "blue")
female_i$charges %>% boxplot(main="boxplot of charges for female", ylab = "female charges", col = "red")

Descriptive Statistics and Visualization - Findings:

  1. Both male and female charges are skewed to the right with considerable tails
  2. Male charges appear to have more values around the 40000 range than female charges, as seen in the histograms
  3. Both males and females seem to have multiple outliers in the tails
  4. As seen in the boxplots, females have more outliers than males because the IQR is smaller for females
  5. There are no inconsistent values in either distribution (both lie within a range of roughly 1100 to 64000, which is plausible for a currency column)

Descriptive Statistics and Visualization - Cont. 2

#Find summary statistics all-up
insurance %>% summarise(Min = min(charges,na.rm = TRUE), Q1 = quantile(charges,probs = .25,na.rm = TRUE),
                                           Median = median(charges, na.rm = TRUE), Q3 = quantile(charges,probs = .75,na.rm = TRUE),
                                           Max = max(charges,na.rm = TRUE), Mean = mean(charges, na.rm = TRUE),
                                           SD = sd(charges, na.rm = TRUE), n = n(), Missing = sum(is.na(charges))) -> table1
table1 <- cbind(sex="All", table1) #Add label for sex as 'All' (To be rbinded with sex-level stats next)
#Find summary statistics by sex
insurance %>% group_by(sex) %>% summarise(Min = min(charges,na.rm = TRUE),  Q1 = quantile(charges,probs = .25,na.rm = TRUE),
                                           Median = median(charges, na.rm = TRUE), Q3 = quantile(charges,probs = .75,na.rm = TRUE),
                                           Max = max(charges,na.rm = TRUE), Mean = mean(charges, na.rm = TRUE),
                                           SD = sd(charges, na.rm = TRUE), n = n(), Missing = sum(is.na(charges))) -> table2
table3 <- rbind(table1, table2)
knitr::kable(table3)
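| sex    |      Min |       Q1 |   Median |       Q3 |      Max |     Mean |       SD |    n | Missing |
|:-------|---------:|---------:|---------:|---------:|---------:|---------:|---------:|-----:|--------:|
| All    | 1121.874 | 4740.287 | 9382.033 | 16639.91 | 63770.43 | 13270.42 | 12110.01 | 1338 |       0 |
| female | 1607.510 | 4885.159 | 9412.962 | 14454.69 | 63770.43 | 12569.58 | 11128.70 |  662 |       0 |
| male   | 1121.874 | 4619.134 | 9369.616 | 18989.59 | 62592.87 | 13956.75 | 12971.03 |  676 |       0 |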

Descriptive Statistics and Visualization - Findings 2:

  1. Males have a higher mean (13956.75) than females (12569.58). This suggests that males have higher charges on average, but further analysis is needed to confirm whether the difference is real
  2. Males have a wider interquartile range (4619.134 to 18989.59, IQR of about 14370) than females (4885.159 to 14454.69, IQR of about 9570), which leads to considerably fewer points being flagged as outliers for males
  3. The medians (male 9369.616, female 9412.962) are much closer to each other than the means, because heavy outliers in the right tail pull the means up
  4. The all-up statistics lie between the male and female values, with mean = 13270.42 and median = 9382.033 (total observations = 1338)
  5. The numbers of observations for males and females are not equal but are similar, at 676 and 662 respectively, so the two groups can be treated as comparably sized
  6. Since the sample size in each group is large (well above 30), the analysis can be carried out without requiring normality of the underlying data. If it were below 30, normality would have to be checked. I will still briefly check normality with Q-Q plots in a later section
  7. We will next move to hypothesis testing to do further analysis and get more concrete results
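
The link between IQR width and the number of flagged outliers can be made concrete with the conventional cutoff Q3 + 1.5 × IQR. The sketch below uses only the quartiles reported in the summary table; note that `boxplot()` works from hinges, which may differ slightly from these quantiles, so the limits are approximate:

```r
# Quartiles copied from the summary table in this report
quartiles <- data.frame(
  sex = c("female", "male"),
  Q1  = c(4885.159, 4619.134),
  Q3  = c(14454.69, 18989.59)
)

# Interquartile range and the conventional upper outlier cutoff
quartiles$IQR         <- quartiles$Q3 - quartiles$Q1
quartiles$upper_fence <- quartiles$Q3 + 1.5 * quartiles$IQR

print(quartiles)
```

The cutoff comes out at roughly 28809 for females and 40545 for males, so a charge of 35000, for example, would be flagged as an outlier in the female group but not in the male group.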

Hypothesis Testing

Hypothesis Testing - Cont.

  1. If the sample size is below 30, the samples are expected to be normally distributed. For larger samples, normality is not mandatory because the sampling distribution of the mean is approximately normal (central limit theorem). T-tests are also generally robust against minor departures from normality
    • As seen in the subsequent Q-Q plots, the male and female samples are not exactly normal: the tails deviate from the diagonal line. Fortunately, since the sample size is large, this is not a deal breaker
  2. The male and female samples are expected to be independent of each other. Since each observation belongs to exactly one of the two groups, which were separated earlier, the samples are treated as independent
  3. The standard two-sample t-test also assumes equal variance in the two samples (checked below using Levene's test). If the variances are unequal, the t-test has to be run with var.equal = FALSE (Welch's t-test)
    • As seen in the subsequent Levene's test, p < 0.05, so we cannot assume equal variance (i.e. we will use var.equal = FALSE in the t-test)
  4. If the above assumptions are satisfied, we can proceed with the appropriate t-test (next section)
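
The large-sample argument in assumption 1 can be illustrated with a short simulation. This is a sketch under stated assumptions: the population is a made-up log-normal distribution chosen only because it is right-skewed like the charges, and `skewness()` is a small helper defined here, not a function from the report:

```r
set.seed(1)  # for reproducibility

# Hypothetical right-skewed "charges" population (parameters are made up)
population <- rlnorm(100000, meanlog = 9, sdlog = 1)

# Simple moment-based skewness (0 for a symmetric distribution)
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

# Draw many samples of size 600 and record the mean of each one
sample_means <- replicate(2000, mean(sample(population, size = 600)))

skewness(population)    # strongly right-skewed
skewness(sample_means)  # much closer to 0
```

Although individual values are heavily skewed, the means of samples of about 600 observations are far more symmetric, which is what the t-test relies on.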

Hypothesis Testing - Cont.

#Check for equal variances
leveneTest(charges ~ sex ,data = insurance) #unequal variance identified due to p < 0.05
#Check for normality
par(mfrow = c(2,2))
male_i$charges %>% qqPlot(dist="norm", main = "male charges vs theory_quantiles")
## [1] 658 623
female_i$charges %>% qqPlot(dist="norm", main = "female charges vs theory_quantiles")
## [1] 266 286
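
To show what the equal-variance check measures, the sketch below computes a Levene-type test by hand in base R. It assumes the median-centred variant, which is the default of `leveneTest()` in the car package, and it runs on made-up data (`toy` is hypothetical), not on the insurance charges:

```r
set.seed(42)  # for reproducibility

# Hypothetical data: two groups with clearly different spread (made-up values)
toy <- data.frame(
  group = factor(rep(c("a", "b"), each = 300)),
  value = c(rnorm(300, mean = 10, sd = 1), rnorm(300, mean = 10, sd = 3))
)

# Absolute deviation of each value from its own group median
group_median <- ave(toy$value, toy$group, FUN = median)
toy$abs_dev  <- abs(toy$value - group_median)

# One-way ANOVA on the absolute deviations; its F test is the Levene statistic
levene_fit <- anova(lm(abs_dev ~ group, data = toy))
p_value    <- levene_fit[["Pr(>F)"]][1]
p_value
```

A small p-value means the typical distance from the group median differs between the groups, i.e. the spreads are unequal.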

Hypothesis Testing Cont.

t.test(data = insurance, charges ~ sex, var.equal = FALSE, alternative = "two.sided")
## 
##  Welch Two Sample t-test
## 
## data:  charges by sex
## t = -2.1009, df = 1313.4, p-value = 0.03584
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2682.48932   -91.85535
## sample estimates:
## mean in group female   mean in group male 
##             12569.58             13956.75
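
The Welch result can be approximately reproduced from the summary statistics alone, which shows where each number in the output comes from. This is a sketch using the rounded means, standard deviations and group sizes from the summary table, so the last digits may differ slightly from the `t.test()` output; the variable names are introduced here for illustration:

```r
# Summary statistics copied from the table in this report
mean_f <- 12569.58
sd_f   <- 11128.70
n_f    <- 662
mean_m <- 13956.75
sd_m   <- 12971.03
n_m    <- 676

# Squared standard error contributed by each group
v_f <- sd_f^2 / n_f
v_m <- sd_m^2 / n_m
se  <- sqrt(v_f + v_m)

# Welch t statistic (female minus male, matching the order used by t.test)
t_stat <- (mean_f - mean_m) / se

# Welch-Satterthwaite degrees of freedom
df_welch <- (v_f + v_m)^2 / (v_f^2 / (n_f - 1) + v_m^2 / (n_m - 1))

# Two-sided p-value and 95% confidence interval for the difference in means
p_value <- 2 * pt(-abs(t_stat), df_welch)
ci      <- (mean_f - mean_m) + c(-1, 1) * qt(0.975, df_welch) * se

round(c(t = t_stat, df = df_welch, p = p_value), 4)
round(ci, 2)
```

Because the interval for the female-minus-male difference lies entirely below zero, it agrees with the p-value of about 0.036 in rejecting equal means at the 5% level.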

Discussion

Discussion - Cont.

References

Below is a list of the references I have used in this presentation.