In this R Markdown document, we explore an insurance-charges dataset, “dataoutliers.csv”, and fit several regression models to it: linear regression, k-nearest neighbors, a regression tree, and a random forest. Let’s start by loading the necessary libraries and the dataset.
## Warning: package 'matlib' was built under R version 4.3.1
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## Warning: package 'ggExtra' was built under R version 4.3.1
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## Loading required package: MASS
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
## Loading required package: survival
## Warning: package 'ISLR' was built under R version 4.3.1
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
## Warning: package 'tree' was built under R version 4.3.1
## Loading required package: Matrix
## Loaded glmnet 4.1-7
##
## Attaching package: 'FNN'
## The following objects are masked from 'package:class':
##
## knn, knn.cv
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:survival':
##
## cluster
## Warning: package 'DAAG' was built under R version 4.3.1
##
## Attaching package: 'DAAG'
## The following object is masked from 'package:survival':
##
## lung
## The following object is masked from 'package:MASS':
##
## hills
## The following object is masked from 'package:car':
##
## vif
What is the structure of the “data” dataframe after loading the dataset and removing the first column?
## 'data.frame': 1144 obs. of 7 variables:
## $ age : int 19 18 28 33 32 31 46 37 37 25 ...
## $ sex : chr "female" "male" "male" "male" ...
## $ bmi : num 27.9 33.8 33 22.7 28.9 ...
## $ children: int 0 1 3 0 0 0 1 3 2 0 ...
## $ smoker : chr "yes" "no" "no" "no" ...
## $ region : chr "southwest" "southeast" "southeast" "northwest" ...
## $ charges : num 16885 1726 4449 21984 3867 ...
Before performing any analysis, we need to preprocess the data. We will convert categorical variables, such as “sex,” “smoker,” and “region,” into factors.
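The loading and preprocessing steps described above might look like the following minimal sketch. The file location is an assumption (the source names the file but not its path), and the fallback data frame is a synthetic stand-in so the code runs without the CSV; it is not the real dataset.

```r
# Assumed path: the source gives only the file name.
path <- "dataoutliers.csv"

if (file.exists(path)) {
  raw <- read.csv(path)
} else {
  # Fallback: small synthetic stand-in (NOT the real data) so the sketch runs.
  set.seed(42)
  n <- 50
  raw <- data.frame(
    X        = seq_len(n),
    age      = sample(18:64, n, replace = TRUE),
    sex      = sample(c("female", "male"), n, replace = TRUE),
    bmi      = round(rnorm(n, mean = 30, sd = 6), 1),
    children = sample(0:5, n, replace = TRUE),
    smoker   = sample(c("no", "yes"), n, replace = TRUE),
    region   = sample(c("northeast", "northwest", "southeast", "southwest"),
                      n, replace = TRUE),
    charges  = round(runif(n, 1000, 30000), 2)
  )
}

# Drop the first column (presumably a row index saved with the CSV).
data <- raw[, -1]

# Convert the categorical columns to factors.
cat_cols <- c("sex", "smoker", "region")
data[cat_cols] <- lapply(data[cat_cols], factor)

str(data)
```

Converting to factors matters later: `lm()` builds dummy variables from factors automatically, and tree-based functions expect factors rather than character columns.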
What does the boxplot reveal about the distribution of the variables in the dataset? Interpret the boxplot.
Can you create a barplot showing the distribution of age levels in the dataset?
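One way to produce the two plots asked about above is sketched here with base graphics. The data frame is a synthetic stand-in (not the real dataset), and drawing one box per panel is a design choice of this sketch rather than something the source specifies.

```r
# Synthetic stand-in for the numeric columns (NOT the real data).
set.seed(42)
n <- 300
data <- data.frame(
  age      = sample(18:64, n, replace = TRUE),
  bmi      = rnorm(n, mean = 30, sd = 6),
  children = sample(0:5, n, replace = TRUE),
  charges  = rlnorm(n, meanlog = 9, sdlog = 0.8)
)

# One panel per variable: charges is on a much larger scale than the
# other columns, so a shared axis would flatten their boxes.
op <- par(mfrow = c(1, 4))
bp <- lapply(names(data), function(v) boxplot(data[[v]], main = v))
par(op)

# Bar plot of the number of records at each age.
age_counts <- table(data$age)
barplot(age_counts, xlab = "age", ylab = "count", main = "Records per age")
```

When reading the boxplots, points beyond the whiskers lie more than 1.5 times the interquartile range from the box, which is the usual first check for outliers.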
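Now we examine how the variables relate to one another before fitting a linear regression. The output below shows, in order, the correlation matrix, the covariance matrix (with the factors “sex,” “smoker,” and “region” converted to numeric codes), and the dummy-coding (contrast) matrices R uses for each factor.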
## age sex bmi children smoker
## age 1.00000000 -0.019291793 0.13225698 0.036624589 -0.08523677
## sex -0.01929179 1.000000000 0.02645352 0.005863853 0.01574338
## bmi 0.13225698 0.026453519 1.00000000 -0.003066000 -0.26742034
## children 0.03662459 0.005863853 -0.00306600 1.000000000 0.03194847
## smoker -0.08523677 0.015743378 -0.26742034 0.031948468 1.00000000
## region 0.01403375 0.004804029 0.15713835 0.021501105 -0.04024818
## charges 0.50900646 -0.033443996 -0.08321813 0.128975169 0.58742088
## region charges
## age 0.014033749 0.50900646
## sex 0.004804029 -0.03344400
## bmi 0.157138348 -0.08321813
## children 0.021501105 0.12897517
## smoker -0.040248177 0.58742088
## region 1.000000000 -0.08070953
## charges -0.080709530 1.00000000
## age sex bmi children smoker
## age 191.9837717 -0.133680842 1.112836e+01 0.619343037 -3.469018e-01
## sex -0.1336808 0.250108597 8.033934e-02 0.003579098 2.312648e-03
## bmi 11.1283649 0.080339341 3.687746e+01 -0.022723694 -4.770042e-01
## children 0.6193430 0.003579098 -2.272369e-02 1.489541080 1.145311e-02
## smoker -0.3469018 0.002312648 -4.770042e-01 0.011453114 8.627691e-02
## region 0.2168842 0.002679735 1.064350e+00 0.029269069 -1.318607e-02
## charges 41624.2590987 -98.712791739 -2.982562e+03 929.016247443 1.018328e+03
## region charges
## age 2.168842e-01 4.162426e+04
## sex 2.679735e-03 -9.871279e+01
## bmi 1.064350e+00 -2.982562e+03
## children 2.926907e-02 9.290162e+02
## smoker -1.318607e-02 1.018328e+03
## region 1.244067e+00 -5.312974e+02
## charges -5.312974e+02 3.483228e+07
## male
## female 0
## male 1
## yes
## no 0
## yes 1
## northwest southeast southwest
## northeast 0 0 0
## northwest 1 0 0
## southeast 0 1 0
## southwest 0 0 1
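Matrices like the ones above can be obtained as in the sketch below, assuming the factors were converted to their integer level codes before calling `cor()` and `cov()`. The data are a synthetic stand-in, not the real dataset.

```r
# Synthetic stand-in for the insurance data (NOT the real data).
set.seed(42)
n <- 300
data <- data.frame(
  age      = sample(18:64, n, replace = TRUE),
  sex      = factor(sample(c("female", "male"), n, replace = TRUE)),
  bmi      = round(rnorm(n, mean = 30, sd = 6), 1),
  children = sample(0:5, n, replace = TRUE),
  smoker   = factor(sample(c("no", "yes"), n, replace = TRUE)),
  region   = factor(sample(c("northeast", "northwest", "southeast", "southwest"),
                           n, replace = TRUE))
)
data$charges <- 2000 + 240 * data$age + 13000 * (data$smoker == "yes") +
  rnorm(n, sd = 3000)

# data.matrix() replaces each factor by its integer level codes.
num <- data.matrix(data)

cor_mat <- cor(num)  # correlation matrix (unit diagonal)
cov_mat <- cov(num)  # covariance matrix (variances on the diagonal)

# Dummy coding used by lm() for each factor (treatment contrasts).
contrasts(data$sex)
contrasts(data$smoker)
contrasts(data$region)
```

Integer codes are meaningful for two-level factors such as smoker, but for a four-level nominal factor such as region the code order is alphabetical and arbitrary, so its correlation entries should not be interpreted.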
Create a scatter plot to visualize the relationship between “smoker” and “charges” variables. How are the charges distributed across different regions and sexes for smokers?
Can you plot a density distribution of “bmi” and “age” variables for different levels of “sex” and “region”?
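The scatter and density plots asked about above could be drawn with ggplot2, as in this sketch. The faceting layout, colours, and jitter settings are assumptions of the sketch, and the data are a synthetic stand-in rather than the real dataset.

```r
library(ggplot2)

# Synthetic stand-in for the insurance data (NOT the real data).
set.seed(42)
n <- 300
data <- data.frame(
  age    = sample(18:64, n, replace = TRUE),
  sex    = factor(sample(c("female", "male"), n, replace = TRUE)),
  bmi    = round(rnorm(n, mean = 30, sd = 6), 1),
  smoker = factor(sample(c("no", "yes"), n, replace = TRUE)),
  region = factor(sample(c("northeast", "northwest", "southeast", "southwest"),
                         n, replace = TRUE))
)
data$charges <- 2000 + 240 * data$age + 13000 * (data$smoker == "yes") +
  rnorm(n, sd = 3000)

# Charges by smoking status, coloured by sex, one panel per region.
# Jitter spreads the points horizontally because smoker is categorical.
p_scatter <- ggplot(data, aes(x = smoker, y = charges, colour = sex)) +
  geom_jitter(width = 0.2, alpha = 0.6) +
  facet_wrap(~ region)

# Density of bmi and of age, split by sex within each region.
p_bmi <- ggplot(data, aes(x = bmi, fill = sex)) +
  geom_density(alpha = 0.4) +
  facet_wrap(~ region)
p_age <- ggplot(data, aes(x = age, fill = sex)) +
  geom_density(alpha = 0.4) +
  facet_wrap(~ region)

print(p_scatter)
print(p_bmi)
print(p_age)
```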
## Linear Regression

Let’s perform linear regression analysis on the dataset.
##
## Call:
## lm(formula = charges ~ ., data = traindata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3896.3 -1374.8 -912.1 -163.1 21529.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.785e+03 7.258e+02 -2.459 0.01412 *
## X 1.471e-02 3.424e-01 0.043 0.96573
## age 2.383e+02 8.035e+00 29.656 < 2e-16 ***
## sexmale -4.070e+02 2.256e+02 -1.804 0.07152 .
## bmi 2.648e+01 1.993e+01 1.328 0.18443
## children 4.781e+02 9.420e+01 5.075 4.70e-07 ***
## smokeryes 1.282e+04 4.100e+02 31.261 < 2e-16 ***
## regionnorthwest -6.763e+02 3.202e+02 -2.112 0.03494 *
## regionsoutheast -9.988e+02 3.248e+02 -3.075 0.00217 **
## regionsouthwest -1.314e+03 3.235e+02 -4.062 5.29e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3409 on 905 degrees of freedom
## Multiple R-squared: 0.673, Adjusted R-squared: 0.6698
## F-statistic: 207 on 9 and 905 DF, p-value: < 2.2e-16
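A fit like the one summarized above can be sketched as follows. The summary lists `X` among the predictors with a p-value near 0.97; `X` appears to be the row-index column, so this sketch leaves it out. The 80/20 split is inferred from the output (915 of 1144 rows, about 80%), and the data are a synthetic stand-in, not the real dataset.

```r
# Synthetic stand-in for the insurance data, without the index column X
# (NOT the real data).
set.seed(42)
n <- 300
data <- data.frame(
  age      = sample(18:64, n, replace = TRUE),
  sex      = factor(sample(c("female", "male"), n, replace = TRUE)),
  bmi      = round(rnorm(n, mean = 30, sd = 6), 1),
  children = sample(0:5, n, replace = TRUE),
  smoker   = factor(sample(c("no", "yes"), n, replace = TRUE)),
  region   = factor(sample(c("northeast", "northwest", "southeast", "southwest"),
                           n, replace = TRUE))
)
data$charges <- 2000 + 240 * data$age + 480 * data$children +
  13000 * (data$smoker == "yes") + rnorm(n, sd = 3000)

# Assumed 80/20 train/test split.
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
traindata <- data[train_idx, ]
testdata  <- data[-train_idx, ]

# Regress charges on every other column; factors are dummy-coded by lm().
fit <- lm(charges ~ ., data = traindata)
summary(fit)

# Held-out root mean squared error.
pred <- predict(fit, newdata = testdata)
rmse <- sqrt(mean((testdata$charges - pred)^2))
rmse
```

In the summary above, the `smokeryes` coefficient of about 1.282e+04 means that, holding the other predictors fixed, smokers are estimated to incur roughly 12,800 more in charges than non-smokers.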
We will now fit a k-nearest neighbors regression with caret and use repeated 10-fold cross-validation (24 repeats) to choose the number of neighbors, k, by smallest RMSE.
## k-Nearest Neighbors
##
## 915 samples
## 7 predictor
##
## Pre-processing: centered (9), scaled (9)
## Resampling: Cross-Validated (10 fold, repeated 24 times)
## Summary of sample sizes: 823, 824, 823, 823, 823, 825, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 3866.355 0.5763352 2355.377
## 7 3768.733 0.5948774 2292.200
## 9 3726.869 0.6034070 2284.412
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 9.
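The resampling setup reported above can be reproduced roughly as in this sketch. The preprocessing and the 10-fold, 24-repeat scheme come from the output; `tuneLength = 3` is an assumption that yields caret's default grid of k = 5, 7, 9 for `method = "knn"`, matching the values shown. The data are a synthetic stand-in.

```r
library(caret)

# Synthetic stand-in for the training data (NOT the real data).
set.seed(42)
n <- 200
traindata <- data.frame(
  age      = sample(18:64, n, replace = TRUE),
  sex      = factor(sample(c("female", "male"), n, replace = TRUE)),
  bmi      = round(rnorm(n, mean = 30, sd = 6), 1),
  children = sample(0:5, n, replace = TRUE),
  smoker   = factor(sample(c("no", "yes"), n, replace = TRUE)),
  region   = factor(sample(c("northeast", "northwest", "southeast", "southwest"),
                           n, replace = TRUE))
)
traindata$charges <- 2000 + 240 * traindata$age +
  13000 * (traindata$smoker == "yes") + rnorm(n, sd = 3000)

# 10-fold cross-validation repeated 24 times, as in the output above.
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 24)

# Centering and scaling matter for kNN because distances depend on units.
knn_fit <- train(
  charges ~ .,
  data       = traindata,
  method     = "knn",
  preProcess = c("center", "scale"),
  trControl  = ctrl,
  tuneLength = 3
)

print(knn_fit)
knn_fit$bestTune  # k with the smallest cross-validated RMSE
```

In the output above, RMSE is still decreasing at k = 9, the largest value tried, so a wider grid (for example via `tuneGrid`) might find a better k.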
Next, we’ll perform regression using decision trees.
## [1] "X" "age" "sex" "bmi" "children" "smoker" "region"
## [8] "charges"
##
## Regression tree:
## tree(formula = charges ~ ., data = data, subset = train)
## Variables actually used in tree construction:
## [1] "smoker" "age" "children"
## Number of terminal nodes: 8
## Residual mean deviance: 10710000 = 6.043e+09 / 564
## Distribution of residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -6262.0 -1509.0 -709.4 0.0 527.0 18580.0
## [1] 13816035
## [1] 14083660
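A regression tree like the one above can be fit with the tree package as sketched here. The source does not label the two numbers printed after the summary; computing test-set mean squared error for an unpruned and a cross-validation-pruned tree is one plausible reading, not a confirmed one. The half-and-half split is inferred from the deviance denominator (564 + 8 terminal nodes = 572, half of 1144 rows). The data are a synthetic stand-in.

```r
library(tree)

# Synthetic stand-in for the insurance data (NOT the real data).
set.seed(42)
n <- 300
data <- data.frame(
  age      = sample(18:64, n, replace = TRUE),
  sex      = factor(sample(c("female", "male"), n, replace = TRUE)),
  bmi      = round(rnorm(n, mean = 30, sd = 6), 1),
  children = sample(0:5, n, replace = TRUE),
  smoker   = factor(sample(c("no", "yes"), n, replace = TRUE)),
  region   = factor(sample(c("northeast", "northwest", "southeast", "southwest"),
                           n, replace = TRUE))
)
data$charges <- 2000 + 240 * data$age + 480 * data$children +
  13000 * (data$smoker == "yes") + rnorm(n, sd = 3000)

# Assumed split: half of the rows for training.
train <- sample(seq_len(n), n / 2)

tree_fit <- tree(charges ~ ., data = data, subset = train)
summary(tree_fit)

# Test-set mean squared error of the unpruned tree.
pred_full <- predict(tree_fit, newdata = data[-train, ])
mse_full  <- mean((data$charges[-train] - pred_full)^2)

# Choose a tree size by cross-validated deviance, then prune.
# At least 2 leaves are kept so the pruned object is still a tree.
cv_out      <- cv.tree(tree_fit)
best_size   <- max(2, cv_out$size[which.min(cv_out$dev)])
pruned      <- prune.tree(tree_fit, best = best_size)
pred_pruned <- predict(pruned, newdata = data[-train, ])
mse_pruned  <- mean((data$charges[-train] - pred_pruned)^2)

c(full = mse_full, pruned = mse_pruned)
```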
Finally, we will apply the random forest algorithm for regression.
## [1] 11185731
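A random forest regression can be sketched as follows. The number printed above is not labeled in the source; a test-set mean squared error, as computed here, is one plausible reading. The split and the default forest settings are assumptions, and the data are a synthetic stand-in.

```r
library(randomForest)

# Synthetic stand-in for the insurance data (NOT the real data).
set.seed(42)
n <- 300
data <- data.frame(
  age      = sample(18:64, n, replace = TRUE),
  sex      = factor(sample(c("female", "male"), n, replace = TRUE)),
  bmi      = round(rnorm(n, mean = 30, sd = 6), 1),
  children = sample(0:5, n, replace = TRUE),
  smoker   = factor(sample(c("no", "yes"), n, replace = TRUE)),
  region   = factor(sample(c("northeast", "northwest", "southeast", "southwest"),
                           n, replace = TRUE))
)
data$charges <- 2000 + 240 * data$age + 480 * data$children +
  13000 * (data$smoker == "yes") + rnorm(n, sd = 3000)

# Assumed split: half of the rows for training.
train <- sample(seq_len(n), n / 2)

# Default number of trees and mtry; importance = TRUE stores
# permutation-based variable importance.
rf_fit <- randomForest(charges ~ ., data = data, subset = train,
                       importance = TRUE)

# Test-set mean squared error.
pred   <- predict(rf_fit, newdata = data[-train, ])
mse_rf <- mean((data$charges[-train] - pred)^2)
mse_rf

importance(rf_fit)
```

If the printed values are indeed test mean squared errors on the same split, the forest's 11185731 would be lower than both tree values (13816035 and 14083660), which is the usual pattern when averaging many trees.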