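# stringsAsFactors = TRUE keeps sex, smoker, and region as factors on R >= 4.0,
# where read.csv() no longer converts strings by default; this matches the str() output below.
insurance <- read.csv("C:\\Users\\user\\Desktop\\myR\\insurance.csv", stringsAsFactors = TRUE)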
str(insurance)
## 'data.frame':    1338 obs. of  7 variables:
##  $ age     : int  19 18 28 33 32 31 46 37 37 60 ...
##  $ sex     : Factor w/ 2 levels "female","male": 1 2 2 2 2 1 1 1 2 1 ...
##  $ bmi     : num  27.9 33.8 33 22.7 28.9 ...
##  $ children: int  0 1 3 0 0 0 1 3 2 0 ...
##  $ smoker  : Factor w/ 2 levels "no","yes": 2 1 1 1 1 1 1 1 1 1 ...
##  $ region  : Factor w/ 4 levels "northeast","northwest",..: 4 3 3 2 2 3 3 2 1 2 ...
##  $ charges : num  16885 1726 4449 21984 3867 ...
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.1     v dplyr   1.0.6
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
summary(insurance)
##       age            sex           bmi           children     smoker    
##  Min.   :18.00   female:662   Min.   :15.96   Min.   :0.000   no :1064  
##  1st Qu.:27.00   male  :676   1st Qu.:26.30   1st Qu.:0.000   yes: 274  
##  Median :39.00                Median :30.40   Median :1.000             
##  Mean   :39.21                Mean   :30.66   Mean   :1.095             
##  3rd Qu.:51.00                3rd Qu.:34.69   3rd Qu.:2.000             
##  Max.   :64.00                Max.   :53.13   Max.   :5.000             
##        region       charges     
##  northeast:324   Min.   : 1122  
##  northwest:325   1st Qu.: 4740  
##  southeast:364   Median : 9382  
##  southwest:325   Mean   :13270  
##                  3rd Qu.:16640  
##                  Max.   :63770
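# summary() above reports no NA counts for any column. A minimal sketch of an
# explicit check, assuming missing values would be coded as NA (empty strings or
# placeholder codes would need a separate check). The object name
# missing_per_column is only a placeholder.
missing_per_column <- colSums(is.na(insurance))  # number of NA entries in each column
missing_per_column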
pairs(insurance)

#Introduction
#Underwriting is the process insurers use to evaluate the risk of insuring a home, car, driver, or a person's health or life. The dataset I have chosen contains 1338 rows and 7 variables: age, sex, BMI, number of children, smoker, region, and insurance charges. There are no missing or unidentified values in the dataset.
#I would like to find out how the explanatory variables (age, sex, BMI, number of children, smoker, and region) relate to the response variable, the insurance charges. This mirrors part of the underwriting process.
#The explanatory variables sex, smoker, and region are categorical. 'sex' has two levels, "female" and "male"; 'smoker' also has two levels, "yes" and "no"; 'region' has four levels, "northeast", "southeast", "northwest", and "southwest". Of the remaining explanatory variables, age is the age of the primary beneficiary, BMI is the body mass index, and children is the number of children or dependents covered by health insurance. The response variable 'charges' is the individual medical costs billed by health insurance.
#The three questions I would like to explore are:
#1. Is there a significant difference in insurance charges between smokers and non-smokers?
#2. Are there any significant differences in insurance charges between regions?
#3. Can a linear regression model describe the relationship between the response variable (insurance charges) and the explanatory variables (age, sex, BMI, number of children, smoker, and region)?
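# For question 3, one possible starting point is an additive linear model with
# all six explanatory variables. This is only a sketch: it assumes main effects
# with no interactions or transformations, and it is not necessarily the model
# the analysis should end with. The name full_model is a placeholder.
full_model <- lm(charges ~ age + sex + bmi + children + smoker + region,
                 data = insurance)
summary(full_model)  # coefficient table, R-squared, and overall F-test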
ggplot(insurance, aes(age, charges))+
  geom_point()

#There is no clear-cut relationship between age and charges, but charges appear to increase with age.
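# A rough numeric companion to the scatter plot: the Pearson correlation between
# age and charges. This is a sketch and captures only linear association, so it
# can understate a relationship that is not a single straight line. The name
# age_charges_cor is a placeholder.
age_charges_cor <- cor(insurance$age, insurance$charges)  # Pearson by default
age_charges_cor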
ggplot(insurance, aes(age, charges, color=smoker))+
  geom_point() 

#There appears to be a clear difference between smokers and non-smokers: smokers tend to have higher insurance charges. Smokers and non-smokers appear to be spread evenly across the age range.
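# One way to check the impression that smokers and non-smokers are spread evenly
# across ages is to compare simple age summaries per group. A sketch using dplyr
# (loaded with tidyverse above); similar summaries would support the impression
# but do not prove the distributions are identical. age_by_smoker is a placeholder name.
age_by_smoker <- insurance %>%
  group_by(smoker) %>%
  summarise(
    n          = n(),          # group size
    mean_age   = mean(age),
    median_age = median(age),
    sd_age     = sd(age)
  )
age_by_smoker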
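# interaction() builds the combined smoker-by-sex grouping and also works if the columns are character vectors
ggplot(insurance, aes(age, charges, color=interaction(smoker, sex, sep=":", lex.order=TRUE)))+
  geom_point()+
  labs(color="smoker:sex")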

ggplot(insurance, aes(bmi, charges, color=smoker))+
  geom_point()

#Charges again differ clearly between smokers and non-smokers. For smokers, charges appear to increase with BMI. For non-smokers, charges do not increase much even at higher BMI.
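# The different BMI trends for smokers and non-smokers can be expressed as an
# interaction term in a linear model. A sketch, assuming a straight-line
# relationship between BMI and charges within each group; bmi_smoker_model is a
# placeholder name.
bmi_smoker_model <- lm(charges ~ bmi * smoker, data = insurance)
# The bmi:smokeryes row estimates how much steeper the BMI slope is for smokers
# than for non-smokers (the reference level "no").
coef(summary(bmi_smoker_model))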
ggplot(insurance, aes(y=charges, fill=smoker))+
  geom_boxplot()

#Because the scatter plots above suggest a large difference in charges between smokers and non-smokers, I made a boxplot. Smokers have higher insurance charges than non-smokers. The non-smoker group also shows some outliers.
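# Question 1 could be tested formally with a two-sample t-test. A sketch, assuming
# Welch's version (the t.test() default, which does not require equal variances)
# is acceptable. The charges are right-skewed (mean 13270 vs. median 9382 in
# summary() above), so a rank-based test such as wilcox.test() would be a
# reasonable cross-check. smoker_test is a placeholder name.
smoker_test <- t.test(charges ~ smoker, data = insurance)
smoker_test  # prints the t statistic, p-value, confidence interval, and group means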
ggplot(insurance, aes(y=charges, fill=region))+
  geom_boxplot() 

#I was also curious about differences in charges between regions. The median charges of the regions seem similar, but the northeast and southeast seem to have somewhat higher charges and a wider spread.
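# To put numbers on the regional comparison, the sketch below tabulates the
# median and mean charges per region and then runs a one-way ANOVA for question 2.
# It assumes the usual ANOVA conditions are tolerable here; with skewed charges,
# kruskal.test() would be a rank-based alternative. region_summary and
# region_anova are placeholder names.
region_summary <- insurance %>%
  group_by(region) %>%
  summarise(
    n              = n(),
    median_charges = median(charges),  # the center line of each box in the boxplot
    mean_charges   = mean(charges)
  )
region_summary

region_anova <- aov(charges ~ region, data = insurance)
summary(region_anova)  # F-test for any difference in mean charges between regions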
ggplot(insurance, aes(x=region, fill=smoker))+    
  geom_bar(position="fill")

#I also wanted to compare the proportion of smokers between regions. The southeast has the highest proportion of smokers and the northeast the second highest. Given the boxplot above, the smoker proportion might explain why the southeast and northeast have slightly higher charges.
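# The proportions shown in the bar chart can be computed directly, and a
# chi-squared test of independence is one way to ask whether smoking status
# varies by region more than chance would suggest. This is a sketch; it does not
# by itself show that smoking explains the regional differences in charges.
# smoker_by_region and region_smoker_test are placeholder names.
smoker_by_region <- table(insurance$region, insurance$smoker)  # counts: region x smoker
prop.table(smoker_by_region, margin = 1)  # row proportions: share of smokers within each region
region_smoker_test <- chisq.test(smoker_by_region)
region_smoker_test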