Introduction

In this R Markdown document, we analyze an insurance-charges dataset, “dataoutliers.csv,” using several R packages. We explore and visualize the data, then fit linear regression, k-nearest neighbors, regression tree, and random forest models. Let’s start by loading the necessary libraries and the dataset.
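A minimal sketch of the loading step is shown here. The package list and the file path are assumptions, not the original chunk. Wrapping `library()` in `suppressPackageStartupMessages()` is one way to keep masking notices out of the knitted report.

```r
# Packages to attach; this list is an assumption and can be trimmed.
pkgs <- c("dplyr", "ggplot2", "MASS", "tree", "randomForest", "caret")

# Keep only the packages that are actually installed.
installed <- pkgs[vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]

# Attach them without printing startup messages.
for (p in installed) {
  suppressPackageStartupMessages(library(p, character.only = TRUE))
}

# Read the data if the file is present (the path is an assumption).
csv_path <- "dataoutliers.csv"
if (file.exists(csv_path)) {
  data <- read.csv(csv_path)
} else {
  message("File not found: ", csv_path)
}
```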

Loading the libraries prints the usual startup messages, condensed here. The packages attached are matlib, dplyr, ggplot2, ggExtra, car (with carData), MASS, survival, ISLR, randomForest, tree, glmnet (with Matrix), class, FNN, caret (with lattice), and DAAG. Five of them (matlib, ggExtra, ISLR, tree, and DAAG) raise a warning that they were built under R version 4.3.1.

Several functions are masked by packages loaded later: dplyr masks filter and lag from stats and intersect, setdiff, setequal, and union from base; car masks recode from dplyr; MASS masks select from dplyr; randomForest masks margin from ggplot2 and combine from dplyr; FNN masks knn and knn.cv from class; caret masks cluster from survival; and DAAG masks lung from survival, hills from MASS, and vif from car.

Question 1

What is the structure of the “data” dataframe after loading the dataset and removing the first column?
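As a hedged illustration of this step, the sketch below drops the first column and calls `str()`. The toy data frame stands in for the CSV: its `age`, `sex`, and `charges` values are the first four shown in the output below, while the `X` index values are hypothetical.

```r
# Toy stand-in for read.csv("dataoutliers.csv"); X values are hypothetical.
raw <- data.frame(
  X = 1:4,
  age = c(19L, 18L, 28L, 33L),
  sex = c("female", "male", "male", "male"),
  charges = c(16885, 1726, 4449, 21984),
  stringsAsFactors = FALSE
)

# Drop the first column (presumably a row index saved with the file).
data <- raw[, -1]

# Inspect column names, types, and the first few values.
str(data)
```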

## 'data.frame':    1144 obs. of  7 variables:
##  $ age     : int  19 18 28 33 32 31 46 37 37 25 ...
##  $ sex     : chr  "female" "male" "male" "male" ...
##  $ bmi     : num  27.9 33.8 33 22.7 28.9 ...
##  $ children: int  0 1 3 0 0 0 1 3 2 0 ...
##  $ smoker  : chr  "yes" "no" "no" "no" ...
##  $ region  : chr  "southwest" "southeast" "southeast" "northwest" ...
##  $ charges : num  16885 1726 4449 21984 3867 ...

Data Preprocessing

Before performing any analysis, we need to preprocess the data. We will convert categorical variables, such as “sex,” “smoker,” and “region,” into factors.
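The conversion can be sketched as follows, using a small made-up data frame in place of the real one. Only the three column names come from the text; the rows are illustrative.

```r
# Illustrative rows; only the column names come from the dataset.
data <- data.frame(
  sex = c("female", "male", "male"),
  smoker = c("yes", "no", "no"),
  region = c("southwest", "southeast", "southeast"),
  stringsAsFactors = FALSE
)

# Convert the character columns to factors in one step.
cat_cols <- c("sex", "smoker", "region")
data[cat_cols] <- lapply(data[cat_cols], factor)

str(data)
levels(data$smoker)
```

Factor levels are sorted alphabetically by default, so “no” becomes the reference level for “smoker.”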

Data Exploration and Visualization

Question 2

What does the boxplot reveal about the distribution of the variables in the dataset? Interpret the boxplot.
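One way to approach this question is sketched below with simulated values; the numbers are not the real data, and the original plotting call is not shown in this report. Each numeric variable gets its own panel because “charges” is on a much larger scale than “age” or “bmi.”

```r
set.seed(42)

# Simulated stand-in for the numeric columns (values are made up).
data <- data.frame(
  age = sample(18:64, 100, replace = TRUE),
  bmi = rnorm(100, mean = 30, sd = 6),
  charges = rlnorm(100, meanlog = 9, sdlog = 0.8)
)

# One boxplot per numeric column, side by side.
num_cols <- names(data)[vapply(data, is.numeric, logical(1))]
op <- par(mfrow = c(1, length(num_cols)))
for (col_name in num_cols) {
  boxplot(data[[col_name]], main = col_name)
}
par(op)

# Points more than 1.5 * IQR beyond the box are drawn as outliers.
outliers <- boxplot.stats(data$charges)$out
length(outliers)
```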

Question 3

Can you create a barplot showing the distribution of age levels in the dataset?
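A possible answer, assuming “age levels” means age bands, is sketched below. The band edges and the simulated ages are assumptions made for illustration.

```r
set.seed(42)

# Simulated ages in the 18 to 64 range (made up for illustration).
age <- sample(18:64, 200, replace = TRUE)

# Band edges are an assumption; right = FALSE gives [18,30), [30,40), ...
age_group <- cut(age, breaks = c(18, 30, 40, 50, 65), right = FALSE)

# Count records per band and draw the bar plot.
counts <- table(age_group)
barplot(counts,
        xlab = "Age group",
        ylab = "Number of records",
        main = "Records per age group")
```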

Data Correlation and Linear Regression

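Before moving on to linear regression, we compute the correlation and covariance matrices of the variables, with the categorical variables converted to numeric codes, and inspect the dummy-variable coding (contrasts) that R uses for “sex,” “smoker,” and “region.” The output below shows, in order, the correlation matrix, the covariance matrix, and the three contrast matrices.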
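The following sketch shows how matrices of this kind can be produced. It uses simulated data and maps factors to their integer codes, which is an assumption about how the original chunk handled the categorical columns.

```r
set.seed(42)
n <- 100

# Simulated stand-in data (values are made up).
data <- data.frame(
  age = sample(18:64, n, replace = TRUE),
  smoker = factor(sample(c("no", "yes"), n, replace = TRUE)),
  region = factor(sample(c("northeast", "northwest", "southeast", "southwest"),
                         n, replace = TRUE))
)
data$charges <- 250 * data$age + 12000 * (data$smoker == "yes") +
  rnorm(n, sd = 3000)

# cor() and cov() need numeric input, so replace factors by integer codes.
num_data <- data.frame(lapply(data, function(v) {
  if (is.factor(v)) as.integer(v) else v
}))

cor_mat <- cor(num_data)
cov_mat <- cov(num_data)

# Dummy coding that lm() will use for each factor.
contrasts(data$smoker)
contrasts(data$region)
```

Integer codes impose an arbitrary order on a multi-level factor such as “region,” so its correlations should be read with caution.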

##                  age          sex         bmi     children      smoker
## age       1.00000000 -0.019291793  0.13225698  0.036624589 -0.08523677
## sex      -0.01929179  1.000000000  0.02645352  0.005863853  0.01574338
## bmi       0.13225698  0.026453519  1.00000000 -0.003066000 -0.26742034
## children  0.03662459  0.005863853 -0.00306600  1.000000000  0.03194847
## smoker   -0.08523677  0.015743378 -0.26742034  0.031948468  1.00000000
## region    0.01403375  0.004804029  0.15713835  0.021501105 -0.04024818
## charges   0.50900646 -0.033443996 -0.08321813  0.128975169  0.58742088
##                region     charges
## age       0.014033749  0.50900646
## sex       0.004804029 -0.03344400
## bmi       0.157138348 -0.08321813
## children  0.021501105  0.12897517
## smoker   -0.040248177  0.58742088
## region    1.000000000 -0.08070953
## charges  -0.080709530  1.00000000
##                    age           sex           bmi      children        smoker
## age        191.9837717  -0.133680842  1.112836e+01   0.619343037 -3.469018e-01
## sex         -0.1336808   0.250108597  8.033934e-02   0.003579098  2.312648e-03
## bmi         11.1283649   0.080339341  3.687746e+01  -0.022723694 -4.770042e-01
## children     0.6193430   0.003579098 -2.272369e-02   1.489541080  1.145311e-02
## smoker      -0.3469018   0.002312648 -4.770042e-01   0.011453114  8.627691e-02
## region       0.2168842   0.002679735  1.064350e+00   0.029269069 -1.318607e-02
## charges  41624.2590987 -98.712791739 -2.982562e+03 929.016247443  1.018328e+03
##                 region       charges
## age       2.168842e-01  4.162426e+04
## sex       2.679735e-03 -9.871279e+01
## bmi       1.064350e+00 -2.982562e+03
## children  2.926907e-02  9.290162e+02
## smoker   -1.318607e-02  1.018328e+03
## region    1.244067e+00 -5.312974e+02
## charges  -5.312974e+02  3.483228e+07
##        male
## female    0
## male      1
##     yes
## no    0
## yes   1
##           northwest southeast southwest
## northeast         0         0         0
## northwest         1         0         0
## southeast         0         1         0
## southwest         0         0         1

Question 4

Create a scatter plot to visualize the relationship between “smoker” and “charges” variables. How are the charges distributed across different regions and sexes for smokers?
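A base-graphics sketch of such a plot follows, with one panel per region and points colored by sex. The data are simulated and the colors are arbitrary; the original report may well have used ggplot2 instead.

```r
set.seed(42)
n <- 200

# Simulated stand-in data (values are made up).
data <- data.frame(
  sex = factor(sample(c("female", "male"), n, replace = TRUE)),
  smoker = factor(sample(c("no", "yes"), n, replace = TRUE,
                         prob = c(0.8, 0.2))),
  region = factor(sample(c("northeast", "northwest", "southeast", "southwest"),
                         n, replace = TRUE))
)
data$charges <- 8000 + 12000 * (data$smoker == "yes") + rnorm(n, sd = 3000)

# One panel per region; jitter spreads the two smoker categories horizontally.
sex_cols <- c(female = "tomato", male = "steelblue")
op <- par(mfrow = c(2, 2))
for (r in levels(data$region)) {
  sub <- data[data$region == r, ]
  plot(jitter(as.integer(sub$smoker), amount = 0.15), sub$charges,
       col = sex_cols[as.character(sub$sex)], pch = 19,
       xlim = c(0.5, 2.5), xaxt = "n",
       xlab = "smoker", ylab = "charges", main = r)
  axis(1, at = 1:2, labels = levels(data$smoker))
}
legend("topleft", legend = names(sex_cols), col = sex_cols, pch = 19)
par(op)

# Median charges of smokers by region and sex, to read alongside the plot.
smokers <- data[data$smoker == "yes", ]
med <- aggregate(charges ~ region + sex, data = smokers, FUN = median)
med
```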

Question 5

Can you plot a density distribution of the “bmi” and “age” variables for different levels of “sex” and “region”?
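As a sketch of one possible answer, the code below overlays kernel density estimates of “bmi” for each level of “sex.” The data are simulated; the same loop could be reused for “age” or with “region” as the grouping factor.

```r
set.seed(42)
n <- 200

# Simulated stand-in data (values are made up).
data <- data.frame(
  bmi = rnorm(n, mean = 30, sd = 6),
  sex = factor(sample(c("female", "male"), n, replace = TRUE))
)

# One kernel density estimate per level of the grouping factor.
dens <- lapply(split(data$bmi, data$sex), density)

# Common axis limits so that every curve fits in the plot.
x_range <- range(unlist(lapply(dens, `[[`, "x")))
y_max <- max(unlist(lapply(dens, `[[`, "y")))

plot(NULL, xlim = x_range, ylim = c(0, y_max),
     xlab = "bmi", ylab = "Density", main = "bmi by sex")
cols <- c("tomato", "steelblue")
for (i in seq_along(dens)) {
  lines(dens[[i]], col = cols[i], lwd = 2)
}
legend("topright", legend = names(dens), col = cols, lwd = 2)
```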

Linear Regression

Let’s perform linear regression analysis on the dataset.
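The fitting step might look like the sketch below. The data are simulated, and the 80/20 split is inferred from the output that follows (905 residual degrees of freedom plus 10 coefficients gives 915 training rows out of 1144), not stated in the source.

```r
set.seed(42)
n <- 300

# Simulated stand-in data (values are made up).
data <- data.frame(
  age = sample(18:64, n, replace = TRUE),
  bmi = rnorm(n, mean = 30, sd = 6),
  smoker = factor(sample(c("no", "yes"), n, replace = TRUE,
                         prob = c(0.8, 0.2)))
)
data$charges <- 250 * data$age + 12000 * (data$smoker == "yes") +
  rnorm(n, sd = 3000)

# 80/20 train/test split (the proportion is an inference, see above).
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
traindata <- data[train_idx, ]
testdata <- data[-train_idx, ]

# Regress charges on all remaining columns.
fit <- lm(charges ~ ., data = traindata)
summary(fit)

# Root mean squared error on the held-out rows.
pred <- predict(fit, newdata = testdata)
rmse <- sqrt(mean((testdata$charges - pred)^2))
rmse
```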

## 
## Call:
## lm(formula = charges ~ ., data = traindata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3896.3 -1374.8  -912.1  -163.1 21529.3 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -1.785e+03  7.258e+02  -2.459  0.01412 *  
## X                1.471e-02  3.424e-01   0.043  0.96573    
## age              2.383e+02  8.035e+00  29.656  < 2e-16 ***
## sexmale         -4.070e+02  2.256e+02  -1.804  0.07152 .  
## bmi              2.648e+01  1.993e+01   1.328  0.18443    
## children         4.781e+02  9.420e+01   5.075 4.70e-07 ***
## smokeryes        1.282e+04  4.100e+02  31.261  < 2e-16 ***
## regionnorthwest -6.763e+02  3.202e+02  -2.112  0.03494 *  
## regionsoutheast -9.988e+02  3.248e+02  -3.075  0.00217 ** 
## regionsouthwest -1.314e+03  3.235e+02  -4.062 5.29e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3409 on 905 degrees of freedom
## Multiple R-squared:  0.673,  Adjusted R-squared:  0.6698 
## F-statistic:   207 on 9 and 905 DF,  p-value: < 2.2e-16

Note that the coefficient table includes X, the index column that Question 1 removed from “data,” so “traindata” appears to have been built from a copy that still contained it. Its coefficient is not significant (p = 0.966), and the strongest predictors are “age” and “smoker.”

Cross Validation

We will now use repeated 10-fold cross-validation to tune and evaluate a k-nearest neighbors model.
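A hedged sketch of this step with caret is given below. The data are simulated, and the number of repeats is reduced to 3 to keep the example quick, whereas the output below reports 24. With caret's default tuning grid for `method = "knn"`, the candidate values are k = 5, 7, and 9.

```r
if (requireNamespace("caret", quietly = TRUE)) {
  set.seed(42)
  n <- 200

  # Simulated stand-in training data (values are made up).
  traindata <- data.frame(
    age = sample(18:64, n, replace = TRUE),
    bmi = rnorm(n, mean = 30, sd = 6),
    smoker = factor(sample(c("no", "yes"), n, replace = TRUE,
                           prob = c(0.8, 0.2)))
  )
  traindata$charges <- 250 * traindata$age +
    12000 * (traindata$smoker == "yes") + rnorm(n, sd = 3000)

  # Repeated 10-fold cross-validation (3 repeats here for speed).
  ctrl <- caret::trainControl(method = "repeatedcv", number = 10, repeats = 3)

  # Center and scale the predictors, since k-NN relies on distances.
  knn_fit <- caret::train(charges ~ ., data = traindata, method = "knn",
                          preProcess = c("center", "scale"),
                          trControl = ctrl)
  print(knn_fit)
  print(knn_fit$bestTune)
}
```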

## k-Nearest Neighbors 
## 
## 915 samples
##   7 predictor
## 
## Pre-processing: centered (9), scaled (9) 
## Resampling: Cross-Validated (10 fold, repeated 24 times) 
## Summary of sample sizes: 823, 824, 823, 823, 823, 825, ... 
## Resampling results across tuning parameters:
## 
##   k  RMSE      Rsquared   MAE     
##   5  3866.355  0.5763352  2355.377
##   7  3768.733  0.5948774  2292.200
##   9  3726.869  0.6034070  2284.412
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 9.

Tree Regression

Next, we’ll perform regression using decision trees.
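The call shown in the output below can be sketched as follows on simulated data. The half-and-half split is inferred from the output (564 residual degrees of freedom plus 8 terminal nodes gives 572 training rows, half of 1144) and is otherwise an assumption.

```r
if (requireNamespace("tree", quietly = TRUE)) {
  set.seed(42)
  n <- 400

  # Simulated stand-in data (values are made up).
  data <- data.frame(
    age = sample(18:64, n, replace = TRUE),
    children = sample(0:5, n, replace = TRUE),
    smoker = factor(sample(c("no", "yes"), n, replace = TRUE,
                           prob = c(0.8, 0.2)))
  )
  data$charges <- 250 * data$age + 12000 * (data$smoker == "yes") +
    rnorm(n, sd = 3000)

  # Use half of the rows for training.
  train <- sample(seq_len(n), n / 2)

  # Fit the regression tree on the training rows only.
  tree_fit <- tree::tree(charges ~ ., data = data, subset = train)
  print(summary(tree_fit))
}
```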

## [1] "X"        "age"      "sex"      "bmi"      "children" "smoker"   "region"  
## [8] "charges"
## 
## Regression tree:
## tree(formula = charges ~ ., data = data, subset = train)
## Variables actually used in tree construction:
## [1] "smoker"   "age"      "children"
## Number of terminal nodes:  8 
## Residual mean deviance:  10710000 = 6.043e+09 / 564 
## Distribution of residuals:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -6262.0 -1509.0  -709.4     0.0   527.0 18580.0

Plotting Regression Tree
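A sketch of plotting a fitted tree and scoring it on held-out rows is shown below, again on simulated data. The two unlabeled values printed below may be mean squared errors of this kind, although the source does not say what they measure.

```r
if (requireNamespace("tree", quietly = TRUE)) {
  set.seed(42)
  n <- 400

  # Simulated stand-in data (values are made up).
  data <- data.frame(
    age = sample(18:64, n, replace = TRUE),
    smoker = factor(sample(c("no", "yes"), n, replace = TRUE,
                           prob = c(0.8, 0.2)))
  )
  data$charges <- 250 * data$age + 12000 * (data$smoker == "yes") +
    rnorm(n, sd = 3000)

  train <- sample(seq_len(n), n / 2)
  tree_fit <- tree::tree(charges ~ ., data = data, subset = train)

  # Draw the tree and label its splits and leaf predictions.
  plot(tree_fit)
  text(tree_fit, pretty = 0)

  # Mean squared error on the rows not used for training.
  pred <- predict(tree_fit, newdata = data[-train, ])
  tree_mse <- mean((data$charges[-train] - pred)^2)
  print(tree_mse)
}
```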

## [1] 13816035

## [1] 14083660

Random Forest

Finally, we will apply the random forest algorithm for regression.
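A minimal random forest sketch on simulated data follows. The split, the default hyperparameters, and the use of test-set mean squared error are assumptions, since the original chunk is not shown.

```r
if (requireNamespace("randomForest", quietly = TRUE)) {
  set.seed(42)
  n <- 400

  # Simulated stand-in data (values are made up).
  data <- data.frame(
    age = sample(18:64, n, replace = TRUE),
    bmi = rnorm(n, mean = 30, sd = 6),
    smoker = factor(sample(c("no", "yes"), n, replace = TRUE,
                           prob = c(0.8, 0.2)))
  )
  data$charges <- 250 * data$age + 12000 * (data$smoker == "yes") +
    rnorm(n, sd = 3000)

  train <- sample(seq_len(n), n / 2)

  # Fit the forest on the training rows and keep importance measures.
  rf_fit <- randomForest::randomForest(charges ~ ., data = data,
                                       subset = train, importance = TRUE)

  # Mean squared error on the held-out rows.
  pred <- predict(rf_fit, newdata = data[-train, ])
  rf_mse <- mean((data$charges[-train] - pred)^2)
  print(rf_mse)

  # Which predictors the forest relied on most.
  print(randomForest::importance(rf_fit))
}
```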

## [1] 11185731
