First I will download the dataset into R and transforming my data set into data.frame so I can analyse it in the following steps.
insurance <- read.csv("insurance.csv")
insurance <- as.data.frame(insurance)
Using head function to see first 6 rows:
head(insurance)
## age sex bmi children smoker region charges
## 1 19 female 27.900 0 yes southwest 16884.924
## 2 18 male 33.770 1 no southeast 1725.552
## 3 28 male 33.000 3 no southeast 4449.462
## 4 33 male 22.705 0 no northwest 21984.471
## 5 32 male 28.880 0 no northwest 3866.855
## 6 31 female 25.740 0 no southeast 3756.622
To show more or less rows/units, it can be written as such:
head(insurance,10)
## age sex bmi children smoker region charges
## 1 19 female 27.900 0 yes southwest 16884.924
## 2 18 male 33.770 1 no southeast 1725.552
## 3 28 male 33.000 3 no southeast 4449.462
## 4 33 male 22.705 0 no northwest 21984.471
## 5 32 male 28.880 0 no northwest 3866.855
## 6 31 female 25.740 0 no southeast 3756.622
## 7 46 female 33.440 1 no southeast 8240.590
## 8 37 female 27.740 3 no northwest 7281.506
## 9 37 male 29.830 2 no northeast 6406.411
## 10 60 female 25.840 0 no northwest 28923.137
I will do the same for last 10 digits to see if data was imported correctly:
tail(insurance,10)
## age sex bmi children smoker region charges
## 1329 23 female 24.225 2 no northeast 22395.744
## 1330 52 male 38.600 2 no southwest 10325.206
## 1331 57 female 25.740 2 no southeast 12629.166
## 1332 23 female 33.400 0 no southwest 10795.937
## 1333 52 female 44.700 3 no southwest 11411.685
## 1334 50 male 30.970 3 no northwest 10600.548
## 1335 18 female 31.920 0 no northeast 2205.981
## 1336 18 female 36.850 0 no southeast 1629.833
## 1337 21 female 25.800 0 no southwest 2007.945
## 1338 61 female 29.070 0 yes northwest 29141.360
Using str function to show the variables that were imported and what type of variable they are.
str(insurance)
## 'data.frame': 1338 obs. of 7 variables:
## $ age : int 19 18 28 33 32 31 46 37 37 60 ...
## $ sex : chr "female" "male" "male" "male" ...
## $ bmi : num 27.9 33.8 33 22.7 28.9 ...
## $ children: int 0 1 3 0 0 0 1 3 2 0 ...
## $ smoker : chr "yes" "no" "no" "no" ...
## $ region : chr "southwest" "southeast" "southeast" "northwest" ...
## $ charges : num 16885 1726 4449 21984 3867 ...
Function factor: this function is used when we have categorical variables. Additionally I changed the names of the values of those variables.
insurance$sex <- factor(insurance$sex,
levels = c("female", "male"),
labels = c("F", "M"))
insurance$smoker <- factor(insurance$smoker,
levels = c("no", "yes"),
labels = c("F", "T"))
insurance$region <- factor(insurance$region,
levels = c("southwest", "southeast", "northwest", "northeast"),
labels = c("SW", "SE", "NW", "NE"))
Using summary function to get the most important parameters for my variables. I will also take out the categorical variables.
summary(insurance[ , -c(2,5,6)])
## age bmi children charges
## Min. :18.00 Min. :15.96 Min. :0.000 Min. : 1122
## 1st Qu.:27.00 1st Qu.:26.30 1st Qu.:0.000 1st Qu.: 4740
## Median :39.00 Median :30.40 Median :1.000 Median : 9382
## Mean :39.21 Mean :30.66 Mean :1.095 Mean :13270
## 3rd Qu.:51.00 3rd Qu.:34.69 3rd Qu.:2.000 3rd Qu.:16640
## Max. :64.00 Max. :53.13 Max. :5.000 Max. :63770
I will use the sapply function to calculate standard deviation for each variable, to see how dispersed the data is from the mean for each of them.
sapply(insurance[ ,-c(2,5,6)], FUN=sd)
## age bmi children charges
## 14.049960 6.098187 1.205493 12110.011237
I have calculated standard deviation for each unit, now I will round them to 2 decimals using function round.
round(sapply(insurance[ , -c(2,5,6)], FUN = sd), 2)
## age bmi children charges
## 14.05 6.10 1.21 12110.01
Because the variables are not in the same units the standard deviations are not comparable between them. To be able to compare the variables dispersion of data, I need to calculate the coefficient of variability. I will use the stat.desc function to show me different parameters including cv. For that function I need to activate pastecs library which has already been installed using function install.package.
library(pastecs)
stat.desc(insurance[ , -c(2,5,6)])
## age bmi children charges
## nbr.val 1.338000e+03 1.338000e+03 1.338000e+03 1.338000e+03
## nbr.null 0.000000e+00 0.000000e+00 5.740000e+02 0.000000e+00
## nbr.na 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
## min 1.800000e+01 1.596000e+01 0.000000e+00 1.121874e+03
## max 6.400000e+01 5.313000e+01 5.000000e+00 6.377043e+04
## range 4.600000e+01 3.717000e+01 5.000000e+00 6.264855e+04
## sum 5.245900e+04 4.102763e+04 1.465000e+03 1.775582e+07
## median 3.900000e+01 3.040000e+01 1.000000e+00 9.382033e+03
## mean 3.920703e+01 3.066340e+01 1.094918e+00 1.327042e+04
## SE.mean 3.841024e-01 1.667142e-01 3.295616e-02 3.310675e+02
## CI.mean.0.95 7.535090e-01 3.270500e-01 6.465140e-02 6.494682e+02
## var 1.974014e+02 3.718788e+01 1.453213e+00 1.466524e+08
## std.dev 1.404996e+01 6.098187e+00 1.205493e+00 1.211001e+04
## coef.var 3.583531e-01 1.988751e-01 1.100989e+00 9.125566e-01
Rounding the results so they are easier to read.
round(stat.desc(insurance[ , -c(2,5,6)]),2)
## age bmi children charges
## nbr.val 1338.00 1338.00 1338.00 1338.00
## nbr.null 0.00 0.00 574.00 0.00
## nbr.na 0.00 0.00 0.00 0.00
## min 18.00 15.96 0.00 1121.87
## max 64.00 53.13 5.00 63770.43
## range 46.00 37.17 5.00 62648.55
## sum 52459.00 41027.62 1465.00 17755824.99
## median 39.00 30.40 1.00 9382.03
## mean 39.21 30.66 1.09 13270.42
## SE.mean 0.38 0.17 0.03 331.07
## CI.mean.0.95 0.75 0.33 0.06 649.47
## var 197.40 37.19 1.45 146652372.15
## std.dev 14.05 6.10 1.21 12110.01
## coef.var 0.36 0.20 1.10 0.91
From calculating the coefficient of variation, it can be seen that regarding BMI, the people in the sample had very similar BMI’s which makes sense for the research - I am assuming that the idea was to research how other factors might affect charges, as it is pretty intuitive that BMI (which is connected to health) would affect charges. Dispersion for charges and amount of children in very high.
Using function describe by from library psych, R will show me the descriptive statistics for the variable Charge, divided by gender. I am using this to see what is the affect of gender on charges.
#install.packages("psych")
library(psych)
describeBy(insurance$charges,insurance$sex)
##
## Descriptive statistics by group
## group: F
## vars n mean sd median trimmed mad min max range
## X1 1 662 12569.58 11128.7 9412.96 10455.16 7129.08 1607.51 63770.43 62162.92
## skew kurtosis se
## X1 1.72 2.71 432.53
## ------------------------------------------------------------
## group: M
## vars n mean sd median trimmed mad min max range
## X1 1 676 13956.75 12971.03 9369.62 11825.4 8121.53 1121.87 62592.87 61471
## skew kurtosis se
## X1 1.33 0.79 498.89
Calculating the arithmetic mean for charges per gender, for the purpose of calculating the affect size using Cohen’s d statistics.
tapply(insurance$charges, insurance$sex, mean)
## F M
## 12569.58 13956.75
Using Cohen’s d because it is a measure of the difference between two group means. It expresses the difference in means in units of the pooled standard deviation, making it independent of the original measurement scale.
library(effectsize)
##
## Attaching package: 'effectsize'
## The following object is masked from 'package:psych':
##
## phi
cohens_d(charges ~ sex, data = insurance)
## Cohen's d | 95% CI
## --------------------------
## -0.11 | [-0.22, -0.01]
##
## - Estimated using pooled SD.
Interpretation:
library(effectsize)
effectsize::interpret_cohens_d(-0.11, rules = "cohen1988")
## [1] "very small"
## (Rules: cohen1988)
Because the effect of gender on insurance prices is low, I will analyse the affect of smoking as well (using same procedure as before).
library(psych)
describeBy(insurance$charges,insurance$smoker)
##
## Descriptive statistics by group
## group: F
## vars n mean sd median trimmed mad min max range
## X1 1 1064 8434.27 5993.78 7345.41 7599.76 5477.15 1121.87 36910.61 35788.73
## skew kurtosis se
## X1 1.53 3.12 183.75
## ------------------------------------------------------------
## group: T
## vars n mean sd median trimmed mad min max
## X1 1 274 32050.23 11541.55 34456.35 31782.89 15167.19 12829.46 63770.43
## range skew kurtosis se
## X1 50940.97 0.13 -1.05 697.25
library(effectsize)
cohens_d(charges ~ smoker, data = insurance)
## Cohen's d | 95% CI
## --------------------------
## -3.16 | [-3.34, -2.98]
##
## - Estimated using pooled SD.
library(effectsize)
effectsize::interpret_cohens_d(-3,16, rules = "cohen1988")
## [1] "large"
## (Rules: cohen1988)
For smoking the result of Cohen’s d statistics is that the affect of smoking on health insurance charges is very significant, the affect size is large. Because this correlation/effect is significant I will present it graphically.
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
ggplot(insurance, aes(x = smoker, y = charges, color = smoker)) +
geom_jitter(width = 0.2, alpha = 0.4) +
labs(title = "Charges by smoking status") +
theme_minimal() +
theme(legend.position = "none")
The chart visually supports the data - smoking increses charges significantly.
For the last step I will compare BMI to Charges to see how BMI affects them.
str(insurance$bmi)
## num [1:1338] 27.9 33.8 33 22.7 28.9 ...
str(insurance$charges)
## num [1:1338] 16885 1726 4449 21984 3867 ...
Function str shows what type of variable it is, we see both are numerical, meaning it makes sense to put them in a scatterplot chart.
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
scatterplot(y = insurance$charges,
x = insurance$bmi,
ylab = "Charges",
xlab = "BMI",
smooth = FALSE)
From this scatterplot chart it is possible to see that there is a positive relationship between higher BMI and higher charges, meaning people are statistically more likely to be charged more by insurances if the have high body mass indexes.
The boxplots indicate that the highest number of values for charges is around 10.000 and for BMI around 31.
I will use another scatterplot to present the same data but this time divided by gender.
scatterplot(charges ~ bmi | sex,
ylab = "Charges",
xlab = "BMI",
smooth = FALSE,
data = insurance)
From this chart we can interpret that there is a positive relationship between BMI and insurance charges for both male and female patients, however the affect of BMI on Charges is a bit more significant for women than it is for men.
now I will do the same analysis but not divide it by gender but by the fact of smoking.
scatterplot(charges ~ bmi | smoker,
ylab = "Charges",
xlab = "BMI",
smooth = FALSE,
data = insurance)
From this chart it is possible to conclude that the relationship between BMI and Charges is positive, meaning the higher the BMI the higher your charges will be statistically. However it is clearly shown that the affect of higher BMI is much more significant with smoker than it is with non smokers.
install.packages("rmarkdown", repos = "https://cloud.r-project.org")
## Installing package into 'C:/Users/zalaa/AppData/Local/R/win-library/4.5'
## (as 'lib' is unspecified)
##
## There is a binary version available but the source version is later:
## binary source needs_compilation
## rmarkdown 2.29 2.30 FALSE
## installing the source package 'rmarkdown'