R Take Home Exam

First I will download the dataset into R and transforming my data set into data.frame so I can analyse it in the following steps.

insurance <- read.csv("insurance.csv")

insurance <- as.data.frame(insurance)

Using head function to see first 6 rows:

head(insurance)

##   age    sex    bmi children smoker    region   charges
## 1  19 female 27.900        0    yes southwest 16884.924
## 2  18   male 33.770        1     no southeast  1725.552
## 3  28   male 33.000        3     no southeast  4449.462
## 4  33   male 22.705        0     no northwest 21984.471
## 5  32   male 28.880        0     no northwest  3866.855
## 6  31 female 25.740        0     no southeast  3756.622

To show more or less rows/units, it can be written as such:

head(insurance,10)

##    age    sex    bmi children smoker    region   charges
## 1   19 female 27.900        0    yes southwest 16884.924
## 2   18   male 33.770        1     no southeast  1725.552
## 3   28   male 33.000        3     no southeast  4449.462
## 4   33   male 22.705        0     no northwest 21984.471
## 5   32   male 28.880        0     no northwest  3866.855
## 6   31 female 25.740        0     no southeast  3756.622
## 7   46 female 33.440        1     no southeast  8240.590
## 8   37 female 27.740        3     no northwest  7281.506
## 9   37   male 29.830        2     no northeast  6406.411
## 10  60 female 25.840        0     no northwest 28923.137

I will do the same for last 10 digits to see if data was imported correctly:

tail(insurance,10)

##      age    sex    bmi children smoker    region   charges
## 1329  23 female 24.225        2     no northeast 22395.744
## 1330  52   male 38.600        2     no southwest 10325.206
## 1331  57 female 25.740        2     no southeast 12629.166
## 1332  23 female 33.400        0     no southwest 10795.937
## 1333  52 female 44.700        3     no southwest 11411.685
## 1334  50   male 30.970        3     no northwest 10600.548
## 1335  18 female 31.920        0     no northeast  2205.981
## 1336  18 female 36.850        0     no southeast  1629.833
## 1337  21 female 25.800        0     no southwest  2007.945
## 1338  61 female 29.070        0    yes northwest 29141.360

Using str function to show the variables that were imported and what type of variable they are.

str(insurance)

## 'data.frame':    1338 obs. of  7 variables:
##  $ age     : int  19 18 28 33 32 31 46 37 37 60 ...
##  $ sex     : chr  "female" "male" "male" "male" ...
##  $ bmi     : num  27.9 33.8 33 22.7 28.9 ...
##  $ children: int  0 1 3 0 0 0 1 3 2 0 ...
##  $ smoker  : chr  "yes" "no" "no" "no" ...
##  $ region  : chr  "southwest" "southeast" "southeast" "northwest" ...
##  $ charges : num  16885 1726 4449 21984 3867 ...

Function factor: this function is used when we have categorical variables. Additionally I changed the names of the values of those variables.

insurance$sex <- factor(insurance$sex,
                        levels = c("female", "male"),
                        labels = c("F", "M"))

insurance$smoker <- factor(insurance$smoker,
                           levels = c("no", "yes"),
                           labels = c("F", "T"))

insurance$region <- factor(insurance$region,
                           levels = c("southwest", "southeast", "northwest", "northeast"),
                           labels = c("SW", "SE", "NW", "NE"))

Using summary function to get the most important parameters for my variables. I will also take out the categorical variables.

summary(insurance[ , -c(2,5,6)])

##       age             bmi           children        charges     
##  Min.   :18.00   Min.   :15.96   Min.   :0.000   Min.   : 1122  
##  1st Qu.:27.00   1st Qu.:26.30   1st Qu.:0.000   1st Qu.: 4740  
##  Median :39.00   Median :30.40   Median :1.000   Median : 9382  
##  Mean   :39.21   Mean   :30.66   Mean   :1.095   Mean   :13270  
##  3rd Qu.:51.00   3rd Qu.:34.69   3rd Qu.:2.000   3rd Qu.:16640  
##  Max.   :64.00   Max.   :53.13   Max.   :5.000   Max.   :63770

I will use the sapply function to calculate standard deviation for each variable, to see how dispersed the data is from the mean for each of them.

sapply(insurance[ ,-c(2,5,6)], FUN=sd)

##          age          bmi     children      charges 
##    14.049960     6.098187     1.205493 12110.011237

I have calculated standard deviation for each unit, now I will round them to 2 decimals using function round.

round(sapply(insurance[ , -c(2,5,6)], FUN = sd), 2)

##      age      bmi children  charges 
##    14.05     6.10     1.21 12110.01

Because the variables are not in the same units the standard deviations are not comparable between them. To be able to compare the variables dispersion of data, I need to calculate the coefficient of variability. I will use the stat.desc function to show me different parameters including cv. For that function I need to activate pastecs library which has already been installed using function install.package.

library(pastecs)

stat.desc(insurance[ , -c(2,5,6)])

##                       age          bmi     children      charges
## nbr.val      1.338000e+03 1.338000e+03 1.338000e+03 1.338000e+03
## nbr.null     0.000000e+00 0.000000e+00 5.740000e+02 0.000000e+00
## nbr.na       0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
## min          1.800000e+01 1.596000e+01 0.000000e+00 1.121874e+03
## max          6.400000e+01 5.313000e+01 5.000000e+00 6.377043e+04
## range        4.600000e+01 3.717000e+01 5.000000e+00 6.264855e+04
## sum          5.245900e+04 4.102763e+04 1.465000e+03 1.775582e+07
## median       3.900000e+01 3.040000e+01 1.000000e+00 9.382033e+03
## mean         3.920703e+01 3.066340e+01 1.094918e+00 1.327042e+04
## SE.mean      3.841024e-01 1.667142e-01 3.295616e-02 3.310675e+02
## CI.mean.0.95 7.535090e-01 3.270500e-01 6.465140e-02 6.494682e+02
## var          1.974014e+02 3.718788e+01 1.453213e+00 1.466524e+08
## std.dev      1.404996e+01 6.098187e+00 1.205493e+00 1.211001e+04
## coef.var     3.583531e-01 1.988751e-01 1.100989e+00 9.125566e-01

Rounding the results so they are easier to read.

round(stat.desc(insurance[ , -c(2,5,6)]),2)

##                   age      bmi children      charges
## nbr.val       1338.00  1338.00  1338.00      1338.00
## nbr.null         0.00     0.00   574.00         0.00
## nbr.na           0.00     0.00     0.00         0.00
## min             18.00    15.96     0.00      1121.87
## max             64.00    53.13     5.00     63770.43
## range           46.00    37.17     5.00     62648.55
## sum          52459.00 41027.62  1465.00  17755824.99
## median          39.00    30.40     1.00      9382.03
## mean            39.21    30.66     1.09     13270.42
## SE.mean          0.38     0.17     0.03       331.07
## CI.mean.0.95     0.75     0.33     0.06       649.47
## var            197.40    37.19     1.45 146652372.15
## std.dev         14.05     6.10     1.21     12110.01
## coef.var         0.36     0.20     1.10         0.91

From calculating the coefficient of variation, it can be seen that regarding BMI, the people in the sample had very similar BMI’s which makes sense for the research - I am assuming that the idea was to research how other factors might affect charges, as it is pretty intuitive that BMI (which is connected to health) would affect charges. Dispersion for charges and amount of children in very high.

Using function describe by from library psych, R will show me the descriptive statistics for the variable Charge, divided by gender. I am using this to see what is the affect of gender on charges.

#install.packages("psych")

library(psych)

describeBy(insurance$charges,insurance$sex)

## 
##  Descriptive statistics by group 
## group: F
##    vars   n     mean      sd  median  trimmed     mad     min      max    range
## X1    1 662 12569.58 11128.7 9412.96 10455.16 7129.08 1607.51 63770.43 62162.92
##    skew kurtosis     se
## X1 1.72     2.71 432.53
## ------------------------------------------------------------ 
## group: M
##    vars   n     mean       sd  median trimmed     mad     min      max range
## X1    1 676 13956.75 12971.03 9369.62 11825.4 8121.53 1121.87 62592.87 61471
##    skew kurtosis     se
## X1 1.33     0.79 498.89

Calculating the arithmetic mean for charges per gender, for the purpose of calculating the affect size using Cohen’s d statistics.

tapply(insurance$charges, insurance$sex, mean)

##        F        M 
## 12569.58 13956.75

Using Cohen’s d because it is a measure of the difference between two group means. It expresses the difference in means in units of the pooled standard deviation, making it independent of the original measurement scale.

library(effectsize)

## 
## Attaching package: 'effectsize'

## The following object is masked from 'package:psych':
## 
##     phi

cohens_d(charges ~ sex, data = insurance)

## Cohen's d |         95% CI
## --------------------------
## -0.11     | [-0.22, -0.01]
## 
## - Estimated using pooled SD.

Interpretation:

library(effectsize)

effectsize::interpret_cohens_d(-0.11, rules = "cohen1988")

## [1] "very small"
## (Rules: cohen1988)

Because the effect of gender on insurance prices is low, I will analyse the affect of smoking as well (using same procedure as before).

library(psych)

describeBy(insurance$charges,insurance$smoker)

## 
##  Descriptive statistics by group 
## group: F
##    vars    n    mean      sd  median trimmed     mad     min      max    range
## X1    1 1064 8434.27 5993.78 7345.41 7599.76 5477.15 1121.87 36910.61 35788.73
##    skew kurtosis     se
## X1 1.53     3.12 183.75
## ------------------------------------------------------------ 
## group: T
##    vars   n     mean       sd   median  trimmed      mad      min      max
## X1    1 274 32050.23 11541.55 34456.35 31782.89 15167.19 12829.46 63770.43
##       range skew kurtosis     se
## X1 50940.97 0.13    -1.05 697.25

library(effectsize)

cohens_d(charges ~ smoker, data = insurance)

## Cohen's d |         95% CI
## --------------------------
## -3.16     | [-3.34, -2.98]
## 
## - Estimated using pooled SD.

library(effectsize)

effectsize::interpret_cohens_d(-3,16, rules = "cohen1988")

## [1] "large"
## (Rules: cohen1988)

For smoking the result of Cohen’s d statistics is that the affect of smoking on health insurance charges is very significant, the affect size is large. Because this correlation/effect is significant I will present it graphically.

library(ggplot2)

## 
## Attaching package: 'ggplot2'

## The following objects are masked from 'package:psych':
## 
##     %+%, alpha

ggplot(insurance, aes(x = smoker, y = charges, color = smoker)) +
  geom_jitter(width = 0.2, alpha = 0.4) +
  labs(title = "Charges by smoking status") +
  theme_minimal() +
  theme(legend.position = "none")

The chart visually supports the data - smoking increses charges significantly.

For the last step I will compare BMI to Charges to see how BMI affects them.

str(insurance$bmi)

##  num [1:1338] 27.9 33.8 33 22.7 28.9 ...

str(insurance$charges)

##  num [1:1338] 16885 1726 4449 21984 3867 ...

Function str shows what type of variable it is, we see both are numerical, meaning it makes sense to put them in a scatterplot chart.

library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:psych':
## 
##     logit

scatterplot(y = insurance$charges,
            x = insurance$bmi,
            ylab = "Charges",
            xlab = "BMI",
            smooth = FALSE)

From this scatterplot chart it is possible to see that there is a positive relationship between higher BMI and higher charges, meaning people are statistically more likely to be charged more by insurances if the have high body mass indexes.

The boxplots indicate that the highest number of values for charges is around 10.000 and for BMI around 31.

I will use another scatterplot to present the same data but this time divided by gender.

scatterplot(charges ~ bmi | sex, 
            ylab = "Charges", 
            xlab = "BMI", 
            smooth = FALSE, 
            data = insurance)

From this chart we can interpret that there is a positive relationship between BMI and insurance charges for both male and female patients, however the affect of BMI on Charges is a bit more significant for women than it is for men.

now I will do the same analysis but not divide it by gender but by the fact of smoking.

scatterplot(charges ~ bmi | smoker, 
            ylab = "Charges", 
            xlab = "BMI", 
            smooth = FALSE, 
            data = insurance)

From this chart it is possible to conclude that the relationship between BMI and Charges is positive, meaning the higher the BMI the higher your charges will be statistically. However it is clearly shown that the affect of higher BMI is much more significant with smoker than it is with non smokers.

install.packages("rmarkdown", repos = "https://cloud.r-project.org")

## Installing package into 'C:/Users/zalaa/AppData/Local/R/win-library/4.5'
## (as 'lib' is unspecified)

## 
##   There is a binary version available but the source version is later:
##           binary source needs_compilation
## rmarkdown   2.29   2.30             FALSE

## installing the source package 'rmarkdown'

R Take Home Exam

2025-09-28