Homework 1

Ian Gošnak

Importing the data and using the function “head”

mydata <- read.table("./insurance.csv", header=TRUE, sep=",", dec=".")

head(mydata)

##   age    sex    bmi children smoker    region   charges
## 1  19 female 27.900        0    yes southwest 16884.924
## 2  18   male 33.770        1     no southeast  1725.552
## 3  28   male 33.000        3     no southeast  4449.462
## 4  33   male 22.705        0     no northwest 21984.471
## 5  32   male 28.880        0     no northwest  3866.855
## 6  31 female 25.740        0     no southeast  3756.622

A unit of observation in this dataset is the primary beneficiary (person) of a health insurance policy.

The sample size in this dataset is 1338 units of observation.
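The sample size can be checked by counting rows. The sketch below is only an illustration: it rebuilds the six rows printed by head() above in a data frame with the made-up name first_rows, so it counts 6 rows. Run on the full mydata object, the same call would be expected to return 1338 if the file matches the description.

```r
# Hypothetical mini data set: the six rows printed by head(mydata) above
first_rows <- data.frame(
  age      = c(19, 18, 28, 33, 32, 31),
  sex      = c("female", "male", "male", "male", "male", "female"),
  bmi      = c(27.900, 33.770, 33.000, 22.705, 28.880, 25.740),
  children = c(0, 1, 3, 0, 0, 0),
  smoker   = c("yes", "no", "no", "no", "no", "no"),
  region   = c("southwest", "southeast", "southeast",
               "northwest", "northwest", "southeast"),
  charges  = c(16884.924, 1725.552, 4449.462, 21984.471, 3866.855, 3756.622)
)

n_obs   <- nrow(first_rows)   # one row per primary beneficiary
n_vars  <- ncol(first_rows)   # number of variables
any_gap <- anyNA(first_rows)  # TRUE if any value is missing
```

With the real data, nrow(mydata) and anyNA(mydata) give the same checks.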

Definition of variables:

age: Age of primary beneficiary (in years)
sex: Insurance contractor gender, can be female or male
bmi: Body mass index (kg / m ^ 2), calculated as weight in kilograms divided by the square of height in meters
children: Number of children covered by health insurance / Number of dependents
smoker: Is the beneficiary a smoker or not
region: The beneficiary’s residential area in the US (northeast, southeast, southwest or northwest)
charges: Individual medical costs billed by health insurance (in USD)
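To make the bmi definition concrete, here is a small sketch of the calculation. The function name bmi_from and the example person (70 kg, 1.75 m) are made up for illustration and do not come from the dataset.

```r
# BMI = weight in kilograms divided by the square of height in meters
bmi_from <- function(weight_kg, height_m) {
  weight_kg / height_m^2
}

# Hypothetical person, not taken from the data
example_bmi <- bmi_from(weight_kg = 70, height_m = 1.75)
round(example_bmi, 2)
```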

The data was taken from the website Kaggle.com, more specifically from the link https://www.kaggle.com/datasets/mirichoi0218/insurance?resource=download

The main goal of the analysis is to compare medical charges between different groups of people (smokers and non-smokers, males and females…).

Manipulating the data

mydata_new <- mydata[c(-4, -6)] # Removing the columns "children" & "region"

mydata_new$sexF <- factor(mydata_new$sex,
                          labels = c(1, 0),
                          levels = c("male", "female")) # Adding a factor variable for "sex"

mydata_new$smokerF <- factor(mydata_new$smoker,
                             labels = c(1, 0),
                             levels = c("yes", "no")) # Adding a factor variable for "smoker"

mydata_new$ID <- seq.int(nrow(mydata_new)) # Adding a column with ID of each observation

mydata_new <- mydata_new[c(8, 1, 2, 6, 3, 4, 7, 5)] # Rearranging the column order
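A quick way to check a recoding like the one above is to cross-tabulate the original column against the new factor. The sketch below uses only the sex values from the six rows printed by head(). It also shows one point worth keeping in mind: the labels 1 and 0 are stored as text levels, so converting the factor to numbers needs as.character() first.

```r
# sex values from the six rows printed by head(mydata)
sex <- c("female", "male", "male", "male", "male", "female")

# Same recoding as above: "male" -> label 1, "female" -> label 0
sexF <- factor(sex, levels = c("male", "female"), labels = c(1, 0))

# Every "male" should land in column "1" and every "female" in column "0"
check <- table(sex, sexF)

# as.integer(sexF) would return the level positions (1 and 2), not the labels,
# so the labels have to be converted via character first
sex_numeric <- as.numeric(as.character(sexF))
```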

Descriptive statistics

summary(mydata_new[c(-1, -3, -6)]) # Showing summary of variables without ID, sex and smoker columns

##       age        sexF         bmi        smokerF     charges     
##  Min.   :18.00   1:676   Min.   :15.96   1: 274   Min.   : 1122  
##  1st Qu.:27.00   0:662   1st Qu.:26.30   0:1064   1st Qu.: 4740  
##  Median :39.00           Median :30.40            Median : 9382  
##  Mean   :39.21           Mean   :30.66            Mean   :13270  
##  3rd Qu.:51.00           3rd Qu.:34.69            3rd Qu.:16640  
##  Max.   :64.00           Max.   :53.13            Max.   :63770

library(rstatix) # Provides get_summary_stats()

get_summary_stats(mydata_new[-1]) # Summary statistics for numerical variables excluding "ID"

## # A tibble: 3 × 13
##   variable     n    min     max median     q1      q3      iqr     mad    mean       sd      se      ci
##   <fct>    <dbl>  <dbl>   <dbl>  <dbl>  <dbl>   <dbl>    <dbl>   <dbl>   <dbl>    <dbl>   <dbl>   <dbl>
## 1 age       1338   18      64     39     27      51      24      17.8     39.2    14.0    0.384   0.754
## 2 bmi       1338   16.0    53.1   30.4   26.3    34.7     8.40    6.20    30.7     6.10   0.167   0.327
## 3 charges   1338 1122.  63770.  9382.  4740.  16640.  11900.   7441.   13270.  12110.   331.    649.

Parameter estimate interpretation for each numerical variable:

age: The mean age in the sample of 1338 observations was 39.207 years. The first quartile (Q1) was 27, which means that about 25% of people were 27 or younger and about 75% were 27 or older.
bmi: The median BMI was 30.40, which means that half of the people had a higher BMI and half had a lower one. The minimum was 15.96, the lowest BMI recorded in the sample.
charges: The mean of charges was 13270.422 USD, which is the average amount of medical charges billed per person. The third quartile (Q3) was 16639.913 USD, which means that about 25% of people had higher charges and about 75% had lower ones.
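The quartile interpretation can be tried out on a small example. The sketch below uses only the six ages printed by head(), so the numbers differ from the full-sample values. With so few observations the share of values at or below Q1 cannot be exactly 25%, which is why the interpretations above say "about".

```r
# Ages from the six rows printed by head(mydata)
ages <- c(19, 18, 28, 33, 32, 31)

# Q1, median and Q3 with R's default quantile method (type 7)
q <- quantile(ages, probs = c(0.25, 0.5, 0.75))

# Share of observations at or below Q1 (roughly a quarter in large samples)
share_at_or_below_q1 <- mean(ages <= q[["25%"]])
```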

The previous table summarizes descriptive statistics for the numerical variables in my dataset. The summary() output before it covers all variables except ID and the original text columns sex and smoker, which are represented there by the factors sexF and smokerF.

library(dplyr) # Provides %>%, group_by(), summarize() and rename()

mydata_new %>% 
  group_by(smoker) %>% 
  summarize(mean(charges)) %>% 
  rename(Smoker = smoker, "Mean of charges" = "mean(charges)") # Showing the mean of charges for smokers and non-smokers

## # A tibble: 2 × 2
##   Smoker `Mean of charges`
##   <chr>              <dbl>
## 1 no                 8434.
## 2 yes               32050.

We can see that the average charges for smokers (about 32050 USD) are much higher than those for non-smokers (about 8434 USD). This can also be seen in the histogram later in the file.
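The same kind of group mean can also be computed in base R with tapply(). This is an alternative sketch, not the method used above, and it runs on only the six rows printed by head(), so its results are not the full-sample means.

```r
# charges and smoker values from the six rows printed by head(mydata)
charges <- c(16884.924, 1725.552, 4449.462, 21984.471, 3866.855, 3756.622)
smoker  <- c("yes", "no", "no", "no", "no", "no")

# Mean of charges within each smoker group
mean_by_smoker <- tapply(charges, smoker, mean)
```

On the full data the equivalent call would be tapply(mydata_new$charges, mydata_new$smoker, mean).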

mydata_new %>% 
  group_by(sex) %>% 
  summarize(mean(charges), median(charges)) %>% 
  rename(Sex = sex, "Mean of charges" = "mean(charges)", "Median of charges" = "median(charges)") # Showing the mean and median of charges for males and females

## # A tibble: 2 × 3
##   Sex    `Mean of charges` `Median of charges`
##   <chr>              <dbl>               <dbl>
## 1 female            12570.               9413.
## 2 male              13957.               9370.

This table shows the mean and the median of charges for both sexes. The medians are very similar (9413 for females and 9370 for males), while the mean is somewhat higher for males (13957 compared to 12570). This can also be seen in the boxplot presented later in the file.
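A mean can differ between groups while the medians stay close because the mean is sensitive to a few large values and the median is not. The sketch below illustrates this with the four male charges among the six rows printed by head(). The cut-off of 20000 is an arbitrary choice made only to drop the single largest value.

```r
# Charges of the four males among the six rows printed by head(mydata)
male_charges <- c(1725.552, 4449.462, 21984.471, 3866.855)

# Mean and median with all four values
with_large <- c(mean = mean(male_charges), median = median(male_charges))

# Same statistics after dropping the one large value (arbitrary cut-off)
smaller <- male_charges[male_charges < 20000]
without_large <- c(mean = mean(smaller), median = median(smaller))
```

Dropping one value moves the mean from about 8007 to about 3347, while the median only moves from about 4158 to about 3867.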

Graphical representation

library(ggplot2)

ggplot(mydata_new, aes(charges, fill = smoker)) +
  geom_histogram(binwidth = 1500, 
                 alpha = 0.8, 
                 position = "identity") +
  scale_fill_brewer(palette = "Set1") +
  theme_linedraw() + 
  xlab("Medical Charges in USD") +
  ylab("Frequency") +
  ggtitle("Medical Charges in USD of smokers vs. non-smokers") # Plotting a histogram of charges for smokers and non-smokers

In this histogram we can see the frequency of charges in bins of 1500 USD, split between smokers and non-smokers. From the picture it is visible that there were fewer smokers than non-smokers in the sample and that smokers tend to have higher charges than non-smokers. This was already visible in the table comparing mean charges for smokers and non-smokers.
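To see what binning by 1500 USD means, the counts behind a histogram can be computed without plotting. The sketch below uses base R's hist() on the six charges printed by head(). The bin edges starting at 0 are an assumption made for this example; ggplot2 picks its own bin alignment, so the bars in the plot above may sit on different edges.

```r
# Charges from the six rows printed by head(mydata)
charges <- c(16884.924, 1725.552, 4449.462, 21984.471, 3866.855, 3756.622)

# Bin edges every 1500 USD, starting at 0 (an assumption for illustration)
edges <- seq(0, 22500, by = 1500)

# plot = FALSE returns the bin counts instead of drawing the histogram
h <- hist(charges, breaks = edges, plot = FALSE)

# One row per bin: lower edge, upper edge and number of observations
bin_table <- data.frame(
  from  = head(h$breaks, -1),
  to    = tail(h$breaks, -1),
  count = h$counts
)
```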

ggplot(mydata_new,aes(sex, charges)) +
  geom_boxplot(aes(fill=sex)) +
  scale_fill_brewer(palette = "Spectral") +
  theme_linedraw() + 
  xlab("Gender") +
  ylab("Medical Charges in USD") +
  ggtitle("Medical Charges in USD by Gender") # Plotting a boxplot of charges for each sex

In this boxplot we can see the comparison of the distribution of charges between males and females. The median of charges is quite similar for the two genders. However, the third quartile (Q3) of charges is higher for males than for females, so the upper part of the distribution contains higher charges for males. This is consistent with the mean of charges being a little higher for males than for females. The mean and median charges for each gender were also compared in the table before.