The aim of this report is to analyse the effects of age and smoking status on health insurance premiums in the United States. The data set was simulated by Brett Lantz and features in his book Machine Learning with R. Lantz states that it was simulated from US Census Bureau demographic statistics, which lends it some credibility, although it is not observed data. There are seven variables in this data set, but this report focuses on age and smoking status. The two research questions investigated are:
How do the health insurance premiums of smokers differ from those who don’t smoke in the United States?
What is the impact of age on the health insurance premiums of citizens of the United States?
The report found that smokers’ premiums were consistently higher than non-smokers’ across the mean, the median and every quantile of charges. These differences were large and consistent enough to conclude that the health insurance premiums of smokers in the United States will be higher than those of non-smokers, assuming all other variables are kept constant. The means, medians, scatter plot and linear regression model also showed a consistent positive relationship between age and charges, supporting the conclusion that the cost of private health insurance in the US increases as a function of age.
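# stringsAsFactors = TRUE keeps 'sex', 'smoker' and 'region' as factors, matching the output in this report (R 4.0.0 and later read them as character by default).
Insurance <- read.csv("~/Desktop/Math1005 Project 1 Data/Insurance.csv", stringsAsFactors = TRUE)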
View(Insurance)
attach(Insurance)
summary(Insurance)
##       age            sex           bmi           children     smoker
##  Min.   :18.00   female:662   Min.   :15.96   Min.   :0.000   no :1064
##  1st Qu.:27.00   male  :676   1st Qu.:26.30   1st Qu.:0.000   yes: 274
##  Median :39.00                Median :30.40   Median :1.000
##  Mean   :39.21                Mean   :30.66   Mean   :1.095
##  3rd Qu.:51.00                3rd Qu.:34.69   3rd Qu.:2.000
##  Max.   :64.00                Max.   :53.13   Max.   :5.000
##        region       charges
##  northeast:324   Min.   : 1122
##  northwest:325   1st Qu.: 4740
##  southeast:364   Median : 9382
##  southwest:325   Mean   :13270
##                  3rd Qu.:16640
##                  Max.   :63770
There are seven variables in this data set: ‘age’, ‘sex’, ‘bmi’, ‘children’, ‘smoker’, ‘region’ and ‘charges’. ‘age’ refers to the age of each subject, ranging from 18 to 64. ‘sex’ refers to the gender of each subject, in this case either male or female. ‘bmi’ refers to the body mass index of each subject, ranging from 15.96 to 53.13. ‘children’ refers to how many children each subject has, ranging from 0 to 5. ‘smoker’ refers to whether or not the subject smokes. ‘region’ refers to the area in which the subject lives, either northwest, northeast, southwest or southeast. ‘charges’ refers to the health insurance premiums paid by each subject, ranging from 1122 USD to 63770 USD. There are 1338 rows of data in this set.
class(Insurance$age)
## [1] "integer"
The variable ‘age’ is classified as an integer. This is the correct classification because ages are recorded as whole numbers of years.
class(Insurance$sex)
## [1] "factor"
‘sex’ is classified as a factor. This is accurate because this variable contains no numerical data, only a category with two levels: male and female.
class(Insurance$bmi)
## [1] "numeric"
The variable ‘bmi’ is appropriately classified as numeric because it takes continuous, decimal values.
class(Insurance$children)
## [1] "integer"
The variable ‘children’ has been classified as an integer. This is the right classification because ‘children’ is a count and can only take whole-number values.
class(Insurance$smoker)
## [1] "factor"
‘smoker’ is classified as a factor. This is an accurate classification of this variable as it contains no numeric data, only a category with two levels: yes and no.
class(Insurance$region)
## [1] "factor"
The classification of the variable ‘region’ is correct. This variable contains no numeric data, only a category with four levels (northeast, northwest, southeast and southwest), and thus should be classified as a factor.
class(Insurance$charges)
## [1] "numeric"
The variable ‘charges’ has been classified as numeric. This is appropriate because this variable takes continuous, decimal values in USD.
All of the variables of this data set are classified correctly.
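The seven separate class() calls above can be collapsed into a single call. The following is a minimal sketch rather than the code used for this report: `toy` is a small hypothetical data frame standing in for `Insurance`, with invented values and only five of the seven columns.

```r
# Hypothetical stand-in for Insurance; values are invented for illustration.
toy <- data.frame(
  age     = c(19L, 33L, 52L),
  sex     = c("female", "male", "male"),
  bmi     = c(27.9, 22.7, 30.8),
  smoker  = c("yes", "no", "no"),
  charges = c(16884.92, 21984.47, 9500.10),
  stringsAsFactors = TRUE  # convert text columns to factors
)

# One call returns the class of every column as a named character vector.
col_classes <- sapply(toy, class)
print(col_classes)
```

Applied to the real data, `sapply(Insurance, class)` would be expected to return the same seven classes as the individual calls above.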
The quality and reliability of this data set are limited to a certain extent. It was found on kaggle.com, uploaded under the name Medical Cost Personal Data set. This data was originally simulated by Brett Lantz and is featured as an example in his book Machine Learning with R. In this book, Lantz claims that the data was simulated to reflect the real world conditions of the USA through the use of US Census Bureau demographic statistics. However, there is no detailed reference to the specific demographic data set on which the simulation was modelled. Further, the Age and Sex Composition in the United States data sets from 2010 to 2013 all show a higher population of women aged 18-64 than men, yet this data set contains slightly more men (676) than women (662). Nevertheless, the large sample size and the fact that the data was modelled on US Census Bureau demographic statistics support treating the data set as reasonably reliable for this analysis.
In the United States, as in much of the world, private health insurance premiums are individually personalised for each consumer. This is because some sectors of the population are more susceptible to certain illnesses. For example, hypertension is more prevalent in men than women in the population under the age of 45, and more common in women than men in those above 65 years of age (Hage et al., 2013). Further, those who smoke are likely to have weaker immune systems and have an increased risk of cancer (Australian Government Department of Health, 2019). Other contributing factors, including age and environment, can also affect the likelihood of illness or injury in certain segments of the population. Insurance companies use many different variables to estimate the likely health problems of their customers and so charge them premiums that more accurately reflect their risk. Some of the variables that can affect insurance premiums the most are: age, location, number of people on the plan and whether or not you smoke or use tobacco (Botkin, 2019). Thus, health insurance premiums are greatly affected by many different variables.
How do the health insurance premiums of smokers differ from those who don’t smoke in the United States?
Hypothesis: US citizens who smoke will pay more in health insurance premiums than those who don’t, assuming all other variables are constant.
mean(Insurance$charges)
## [1] 13270.42
mean(Insurance$charges[Insurance$smoker == "yes"])
## [1] 32050.23
mean(Insurance$charges[Insurance$smoker == "no"])
## [1] 8434.268
mean(Insurance$charges[Insurance$smoker == "yes"]) - mean(Insurance$charges[Insurance$smoker == "no"])
## [1] 23615.96
As shown above, the mean of the variable “charges”, which represents the health insurance premiums of various United States citizens, is substantially higher for those who smoke (32050.23 USD) than for those who don’t (8434.27 USD), and is also well above the mean of all charges (13270.42 USD). The difference between the means of smokers and non-smokers is 23615.96 USD. This suggests that there is an association between smoking and high insurance premiums.
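The group means above can also be obtained in one step with tapply(), which avoids repeating the subsetting expression for each group. This is a sketch on a hypothetical five-row data frame `toy` with invented values, not the Insurance data.

```r
# Hypothetical toy data; not the Insurance values.
toy <- data.frame(
  smoker  = factor(c("yes", "no", "no", "yes", "no")),
  charges = c(30000, 8000, 6000, 34000, 10000)
)

# Mean of charges within each level of smoker, in one call.
group_means <- tapply(toy$charges, toy$smoker, mean)
print(group_means)

# Difference in means, smokers minus non-smokers.
mean_gap <- group_means[["yes"]] - group_means[["no"]]
print(mean_gap)
```

Replacing `mean` with `median` in the tapply() call gives the group medians in the same way.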
median(Insurance$charges)
## [1] 9382.033
median(Insurance$charges[Insurance$smoker == "yes"])
## [1] 34456.35
median(Insurance$charges[Insurance$smoker == "no"])
## [1] 7345.405
median(Insurance$charges[Insurance$smoker == "yes"]) - median(Insurance$charges[Insurance$smoker == "no"])
## [1] 27110.94
The median of the charges of those who smoke is also notably higher than the median of the charges of non-smokers, as well as being higher than the median of all charges, regardless of smoking status. The difference between the median of the charges of smokers and the median of the charges of non-smokers is large: 27110.94 USD, as shown above. This shows a strong association between smoking and high health insurance premiums.
quantile(Insurance$charges)
## 0% 25% 50% 75% 100%
## 1121.874 4740.287 9382.033 16639.913 63770.428
quantile(Insurance$charges[Insurance$smoker == "yes"])
## 0% 25% 50% 75% 100%
## 12829.46 20826.24 34456.35 41019.21 63770.43
quantile(Insurance$charges[Insurance$smoker == "no"])
## 0% 25% 50% 75% 100%
## 1121.874 3986.439 7345.405 11362.887 36910.608
quantile(Insurance$charges[Insurance$smoker == "yes"]) - quantile(Insurance$charges[Insurance$smoker == "no"])
## 0% 25% 50% 75% 100%
## 11707.58 16839.81 27110.94 29656.32 26859.82
The quantile values of the charges of citizens of the United States who smoke are higher than the quantile values of the charges of those who don’t smoke. The quantile values of the charges of the smokers are also higher than the quantile values of all charges, except at the maximum, where the two are equal because the highest charge in the data set belongs to a smoker. The differences between the quantile values of the charges of smokers and non-smokers are all large, ranging from 11707.58 USD to 29656.32 USD. The most notable is the difference at the 75% quantile, which is 29656.32 USD. This shows a further consistent association between smoking and high insurance premiums.
var(Insurance$charges)
## [1] 146652372
var(Insurance$charges[Insurance$smoker == "yes"])
## [1] 133207311
var(Insurance$charges[Insurance$smoker == "no"])
## [1] 35925420
var(Insurance$charges[Insurance$smoker == "yes"]) - var(Insurance$charges[Insurance$smoker == "no"])
## [1] 97281891
sd(Insurance$charges)
## [1] 12110.01
sd(Insurance$charges[Insurance$smoker == "yes"])
## [1] 11541.55
sd(Insurance$charges[Insurance$smoker == "no"])
## [1] 5993.782
sd(Insurance$charges[Insurance$smoker == "yes"]) - sd(Insurance$charges[Insurance$smoker == "no"])
## [1] 5547.765
As the data analysis above shows, the variance and standard deviation of the charges of smokers are much higher than the variance and standard deviation of the charges of non-smokers. This suggests that the charges of those who smoke are spread more widely than the charges of those who don’t smoke.
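Because the variance is measured in squared USD, the standard deviation is usually the easier of the two to interpret, and a ratio of standard deviations is often more informative than a difference. The sketch below illustrates this on two hypothetical vectors with invented values; the variable names are assumptions made for the example.

```r
# Hypothetical toy charges; not the Insurance values.
smoker_charges     <- c(20000, 30000, 40000)
non_smoker_charges <- c(7000, 8000, 9000)

# The standard deviation is the square root of the variance, so it is
# expressed in the same unit as the data (USD) rather than USD squared.
sd_smoker     <- sd(smoker_charges)
sd_non_smoker <- sd(non_smoker_charges)
stopifnot(isTRUE(all.equal(sd_smoker, sqrt(var(smoker_charges)))))

# Ratio of the two standard deviations: how many times wider the spread is.
sd_ratio <- sd_smoker / sd_non_smoker
print(sd_ratio)
```

For the output reported above, the same ratio is 11541.55 / 5993.782, or roughly 1.93, meaning the spread of smokers’ charges is nearly twice that of non-smokers’.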
plot(x = Insurance$smoker, y = Insurance$charges, main = "Effects of Smoking Status on Health Insurance Premiums", xlab = "Smoking status", ylab = "Health Insurance Premiums (USD)")
The graph above shows a side-by-side comparison of the boxplots of charges of smokers and non-smokers, where “yes” refers to those who smoke and “no” refers to those who don’t. These boxplots clearly show that smokers’ health insurance premiums are generally higher than non-smokers’ health insurance premiums. Also shown is the wider spread of charges of those who smoke compared to those who don’t.
hist(Insurance$charges, main = "Distribution of Health Insurance Premiums", xlab = "Health Insurance Premiums (USD)", ylab = "Frequency Counts (People)")
hist(Insurance$charges[Insurance$smoker == "yes"], main = "Distribution of Health Insurance Premiums of Smokers", xlab = "Health Insurance premiums (USD)", ylab = "Frequency Counts (People)")
hist(Insurance$charges[Insurance$smoker == "no"], main = "Distribution of Health Insurance Premiums of Non-Smokers", xlab = "Health Insurance Premiums (USD)", ylab = "Frequency Counts (People)")
These histograms clearly show the difference in health insurance premiums between those who smoke and those who don’t. The second histogram, Distribution of Health Insurance Premiums of Smokers, shows that there is a high frequency of smokers paying between 15000 and 50000 USD in health insurance premiums, whilst the third histogram, Distribution of Health Insurance Premiums of Non-Smokers, shows a higher concentration of non-smokers paying from 0 to 15000 USD. When both of these histograms are compared to the first, it is shown that, in general, there is a high frequency of people paying between 0 and 15000 USD in health insurance premiums, which means that smokers make up the bulk of those paying above 15000 USD. Thus, smokers are more frequently paying higher health insurance premiums than non-smokers.
The mean, median, quantiles, boxplots and histograms all show an association between smoking and high insurance premiums. These differences are large and consistent enough to conclude that those who smoke are more likely to pay higher insurance premiums. Further, it can be deduced that the health insurance premiums of smokers in the United States will be higher than those of non-smokers, assuming all other variables are kept constant.
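The comparison above rests on descriptive statistics. One conventional way to check formally whether a difference in group means is statistically significant is Welch’s two-sample t-test, which is what t.test() runs by default. The sketch below was not part of the original analysis and uses a hypothetical ten-row data frame `toy` with invented values.

```r
# Hypothetical toy data; not the Insurance values.
toy <- data.frame(
  smoker  = factor(rep(c("no", "yes"), each = 5)),
  charges = c(6000, 7000, 8000, 9000, 10000,
              28000, 30000, 32000, 34000, 36000)
)

# Welch two-sample t-test (t.test default, var.equal = FALSE), which does
# not assume that the two groups have equal variances.
welch <- t.test(charges ~ smoker, data = toy)

# Group means and the p-value for the null hypothesis of equal means.
print(welch$estimate)
print(welch$p.value)
```

On the real data the equivalent call would be `t.test(charges ~ smoker, data = Insurance)`. Since charges are right-skewed, the result would be best read alongside the medians and quantiles rather than in place of them.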
What is the impact of age on the health insurance premiums of citizens of the United States?
Hypothesis: The cost of private health insurance in the US will increase as a function of age.
mean(Insurance$charges)
## [1] 13270.42
mean(Insurance$charges[Insurance$age < 40])
## [1] 10157.22
mean(Insurance$charges[Insurance$age >= 40])
## [1] 16430.51
mean(Insurance$charges[Insurance$age >= 40]) - mean(Insurance$charges[Insurance$age < 40])
## [1] 6273.295
median(Insurance$charges)
## [1] 9382.033
median(Insurance$charges[Insurance$age < 40])
## [1] 4749.061
median(Insurance$charges[Insurance$age >= 40])
## [1] 11657.92
median(Insurance$charges[Insurance$age >= 40]) - median(Insurance$charges[Insurance$age < 40])
## [1] 6908.856
The code above computes the mean and median values of the insurance charges in the data set. In this section, two groups are created based on the age of the buyer: group 1 for US citizens aged 18 to 39 and group 2 for those aged 40 to 64. The mean cost of private health insurance for group 1 is 10,160 USD (4 s.f.), which is about 3,113 USD (4 s.f.) below the mean for the whole sample, while individuals in group 2 paid 16,430 USD (4 s.f.) on average, 6,273 USD (4 s.f.) more than group 1 and 3,160 USD (4 s.f.) more than the overall mean. This suggests that the cost of private health insurance tends to increase with age. The median is also a critical figure in data science: mean values are at times pulled higher or lower by outliers, whereas the median minimises the effect of outliers and gives a better overview of the data. As shown above, the median of the whole data set is 9,382 USD (4 s.f.), compared with 4,749 USD (4 s.f.) for group 1 and 11,660 USD (4 s.f.) for group 2. Group 2’s median is 6,909 USD (4 s.f.) more than group 1’s, which means the difference in the cost of private health insurance is even larger than the mean values suggest. Both the means and medians of this data set show an association between older age and more expensive health insurance premiums.
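The two age groups described above can be stored as a factor with cut(), so that each group statistic comes from a single tapply() call. This is a sketch on a hypothetical six-row data frame `toy` with invented values; the column name `age_group` is an assumption made for the example.

```r
# Hypothetical toy data; not the Insurance values.
toy <- data.frame(
  age     = c(18, 25, 39, 40, 52, 64),
  charges = c(2000, 4000, 6000, 9000, 12000, 15000)
)

# right = FALSE makes the intervals [18, 40) and [40, 65), so age 39 falls
# in group 1 and age 40 in group 2, matching age < 40 and age >= 40.
toy$age_group <- cut(toy$age, breaks = c(18, 40, 65), right = FALSE,
                     labels = c("group 1", "group 2"))

# Median charges within each age group.
group_medians <- tapply(toy$charges, toy$age_group, median)
print(group_medians)
```

The same factor could be cut into narrower bands (for example, ten-year intervals) to see whether premiums rise steadily across the whole age range rather than only between two groups.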
hist(Insurance$charges, main = "Distribution of Health Insurance Premiums", xlab = "Health Insurance Premiums (USD)", ylab = "Frequency Counts (People)")
hist(Insurance$charges[Insurance$age < 40], main = "Distribution of Health Insurance Premiums of Group 1", xlab = "Health Insurance Premiums (USD)", ylab = "Frequency Counts (People)")
hist(Insurance$charges[Insurance$age >= 40], main = "Distribution of Health Insurance Premiums of Group 2", xlab = "Health Insurance Premiums (USD)", ylab = "Frequency Counts (People)")
Three histograms were generated above. The first is an overview of the distribution of the data, while the second and third give a closer look at the distributions in group 1 and group 2. Within the range [15,000, 65,000] there is no marked difference between the two distributions, while a notable increase in frequency count is observed in group 2 compared to group 1 within the range [0, 15,000]. On their own, the histograms do not show a clear correlation between age and health insurance premiums.
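Histograms of two groups are easier to compare when both use the same bin edges, and when proportions are used in place of raw counts if the groups differ in size. The sketch below illustrates the idea on two hypothetical vectors with invented values. The bin width of 10,000 USD is an assumption, and for the real data the breaks would need to extend past the maximum charge of 63770 USD.

```r
# Hypothetical toy charges for the two age groups; not the Insurance values.
group1 <- c(2000, 3000, 5000, 12000, 21000, 38000)
group2 <- c(7000, 9000, 11000, 13000, 26000, 47000)

# Shared bin edges so that bar k covers the same USD range in both plots.
common_breaks <- seq(0, 50000, by = 10000)

# plot = FALSE returns the bin counts without drawing anything.
h1 <- hist(group1, breaks = common_breaks, plot = FALSE)
h2 <- hist(group2, breaks = common_breaks, plot = FALSE)

# Proportions rather than raw counts, so groups of different size compare.
prop1 <- h1$counts / sum(h1$counts)
prop2 <- h2$counts / sum(h2$counts)
print(rbind(group1 = prop1, group2 = prop2))
```

Passing the same `breaks` vector to the plotting calls to hist() would draw the two distributions on identical bins.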
plot(Insurance$age, Insurance$charges, main="Effect of age on Health Insurance Premiums", xlab="Age", ylab="Health Insurance Premiums (USD)")
mod<-lm(charges~age)
abline(mod)
This scatter plot provides another overview of the distribution of the data set, with the cost of private health insurance as the dependent variable. Unlike the histograms, the scatter plot does not show frequency counts, but it is better suited to identifying outliers because every observation is plotted individually. The linear regression line suggests that as the independent variable (age) increases, the dependent variable (cost of private health insurance) also increases, exhibiting a positive relationship between the two variables. The upward slope of the regression line therefore supports the hypothesis that the cost of private health insurance increases as the age of the buyer increases.
summary(mod)
##
## Call:
## lm(formula = charges ~ age)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8059 -6671 -5939 5440 47829
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3165.9 937.1 3.378 0.000751 ***
## age 257.7 22.5 11.453 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11560 on 1336 degrees of freedom
## Multiple R-squared: 0.08941, Adjusted R-squared: 0.08872
## F-statistic: 131.2 on 1 and 1336 DF, p-value: < 2.2e-16
lm(charges~age)
##
## Call:
## lm(formula = charges ~ age)
##
## Coefficients:
## (Intercept) age
## 3165.9 257.7
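b1 <- 3165.9  # intercept from the model above
b2 <- 257.7   # slope for age from the model above
y <- b1 + b2 * 20  # predicted charges (USD) at age 20
y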
## [1] 8319.9
The linear regression model above demonstrates the relationship between the dependent and independent variables. The two coefficients given (the intercept, 3165.9, and the slope for age, 257.7) can be used to predict the cost of health insurance (y) at any given age (x). Above, the two coefficients were substituted into the linear regression equation (y = b1 + b2*(x)). At age 20 the predicted cost is 8319.9 USD, which agrees with the scatter plot above, where the regression line sits at roughly 8000 at age 20. This confirms that the equation reproduces the fitted line, but it does not by itself show that the model predicts individual premiums well. The slope means that predicted premiums rise by about 257.7 USD for each additional year of age, and it is statistically significant (p < 2e-16). However, the R-squared of 0.08941 indicates that age alone explains only about 9% of the variation in charges. Thus, this linear regression model demonstrates a positive correlation between age and health insurance premiums, while leaving most of the variation unexplained.
From the data analysis above, it can be concluded that as age increases, so too does the cost of health insurance. The histograms showed less compelling evidence of this, but the scatter plot, means, medians and linear regression model all show a consistent and statistically significant positive correlation between age and the cost of health insurance premiums. Thus, the cost of private health insurance in the US increases as a function of age.
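The hand calculation above can also be done with predict(), which uses the full-precision coefficients stored in the fitted model instead of rounded values typed in by hand. This is a sketch on a hypothetical five-row data frame `toy`, constructed so that its least-squares line is charges = 3000 + 250 * age; neither the data nor the coefficients come from the Insurance data.

```r
# Hypothetical toy data; not the Insurance values. The offsets sum to zero
# and are uncorrelated with age, so the fitted line is 3000 + 250 * age.
toy <- data.frame(age = c(20, 30, 40, 50, 60))
toy$charges <- 3000 + 250 * toy$age + c(100, -200, 200, -200, 100)

# Fit with data = so the model does not depend on attach().
toy_mod <- lm(charges ~ age, data = toy)

# Predicted charges at age 20, taken directly from the fitted model.
pred_20 <- predict(toy_mod, newdata = data.frame(age = 20))
print(unname(pred_20))

# Share of the variance in charges that is explained by age.
r_squared <- summary(toy_mod)$r.squared
print(r_squared)
```

With the model fitted in this report, `predict(mod, newdata = data.frame(age = 20))` would be expected to give a value close to the 8319.9 obtained by hand, differing only through rounding of the coefficients.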
Australian Government Department of Health (2019). What are the effects of smoking and tobacco? [online] Australian Government Department of Health. Available at: https://www.health.gov.au/health-topics/smoking-and-tobacco/about-smoking-and-tobacco/what-are-the-effects-of-smoking-and-tobacco [Accessed 17 Sep. 2020].
Botkin, K. (2019). 10 Factors That Affect Your Health Insurance Premium Costs. [online] Moneycrashers.com. Available at: https://www.moneycrashers.com/factors-health-insurance-premium-costs/ [Accessed 29 Sep. 2020].
Ganti, A. (2019). Median Definition. [online] Investopedia. Available at: https://www.investopedia.com/terms/m/median.asp [Accessed 1 Oct. 2020].
Hage, F.G., Mansur, S.J., Xing, D. and Oparil, S. (2013). Hypertension in women. Kidney International Supplements, [online] 3(4), pp.352–356. Available at: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4089575/ [Accessed 24 Sep. 2020].