Email: abrahamzaweir@gmail.com
RPubs: https://rpubs.com/bram123
Dalam proyek ini saya memberikan anda dataset insurance.csv, informasi lanjut mengenai data ini dapat anda baca di Kaggle.
Tugas kalian adalah sebagai berikut:
## Warning: package 'DT' was built under R version 3.6.3
## Warning: package 'Metrics' was built under R version 3.6.3
The purpose of this chapter is to get an insight of insurance member behaviour such as: age, sex, region, smoker.
The age distribution of insurance member is relatively the same, except 18 and 19 y’o members which has higher population (above 60). I also make a group of member’s age on table below.
## Agecut total
## 1 [15,20] 166
## 2 (20,25] 140
## 3 (25,30] 138
## 4 (30,35] 130
## 5 (35,40] 127
## 6 (40,45] 137
## 7 (45,50] 144
## 8 (50,55] 140
## 9 (55,60] 125
## 10 (60,65] 91
## Warning: The titlefont attribute is deprecated. Use title = list(font
## = ...) instead.
The gender of insurance member is almost the same (male = 676 people & female = 662 people).
## Warning: The titlefont attribute is deprecated. Use title = list(font
## = ...) instead.
The region where insurance members are living in is evenly distributed.
A lot of members are non smoker (79.5% or 1064 person) and the rest are smoker.
## Warning: The titlefont attribute is deprecated. Use title = list(font
## = ...) instead.
Body mass index of members are normally distributed.
The distribution of individual medical costs billed by health insurance has a positive skew.
I made this density analysis to find the variables which affect the medical costs. From this analysis (you need to see the density chart), I gain some insights:
Sex is not make an impact to medical costs insurance because male and female has a same distribution/density against charges (look at Chart 4.1).
Region is not make a big impact to medical costs insurance because 4 regions almost have a same distribution/density against charges (look at Chart 4.2).
Smoker and non-smoker affect the medical costs insurance because the distribution/density have a big different (look at Chart 4.3).
Total number of children/dependents have a same density against charges, except zero number of children. So, if you’re not have a child it will cause an effect to medical costs insurance (look at Chart 4.4).
According to those insights, I can conclude that Smoke, total dependents, age, and BMI are the variables to predict the medical costs insurance.
Since smokers, total dependents, and BMI made an effect to medical costs insurance, I’m grouping those factors into 8 category shown on sub-chapter below. Eventually, I only have age or and BMI as an input variable(s) to my prediction equation. The equation will shown in sub-chapter, so you need to check it.
Cat - 1 : Smoker, have no dependent, BMI under 30.
Cat - 2 : Smoker, have no dependent, BMI over 30.
Cat - 3 : Smoker, have dependents, BMI under 30.
Cat - 4 : Smoker, have dependents, BMI over 30.
Cat - 5 : Non-smoker, have no dependent, BMI under 30.
Cat - 6 : Non-smoker, have no dependent, BMI over 30.
Cat - 7 : Non-smoker, have dependents, BMI under 30.
Cat - 8 : Non-smoker, have dependents, BMI over 30.
Smoker, have no dependent, BMI under 30.
Age vs Charges chart looks can be approached by using linear regression.
BMI vs Charges looks have a disordered correlation.
##
## Call:
## lm(formula = charges ~ age, data = ins_smoker_nochild_under30)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5740.7 -2372.9 -776.5 612.0 17119.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11970.56 1454.78 8.228 5.55e-11 ***
## age 252.39 36.25 6.962 5.70e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4276 on 52 degrees of freedom
## Multiple R-squared: 0.4824, Adjusted R-squared: 0.4725
## F-statistic: 48.46 on 1 and 52 DF, p-value: 5.696e-09
##
## Call:
## lm(formula = charges ~ age + bmi, data = ins_smoker_nochild_under30)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2539.3 -1754.2 -1059.9 -203.1 15678.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -956.74 5104.29 -0.187 0.8521
## age 251.20 34.35 7.312 1.75e-09 ***
## bmi 505.18 192.06 2.630 0.0112 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4051 on 51 degrees of freedom
## Multiple R-squared: 0.5442, Adjusted R-squared: 0.5264
## F-statistic: 30.45 on 2 and 51 DF, p-value: 1.985e-09
From those two summary, I decided to use linear regression with 2 variables (age and BMI) because it have a better R-Squared value.
So, the equation will be like:
-956.74 + (251.20 * AGE) + (505.18 * BMI)
Smoker, have no dependent, BMI over 30.
Age vs Charges chart looks can be approached by using linear regression.
BMI vs Charges looks have a disordered correlation.
##
## Call:
## lm(formula = charges ~ age, data = ins_smoker_nochild_over30)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19722 -2239 -1235 786 19807
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28983.17 1737.81 16.678 < 2e-16 ***
## age 306.74 43.38 7.071 2.05e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5361 on 59 degrees of freedom
## Multiple R-squared: 0.4587, Adjusted R-squared: 0.4495
## F-statistic: 50 on 1 and 59 DF, p-value: 2.05e-09
##
## Call:
## lm(formula = charges ~ age + bmi, data = ins_smoker_nochild_over30)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16667.0 -1505.1 -737.2 47.9 22684.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8120.10 5338.78 1.521 0.133702
## age 292.16 38.73 7.544 3.57e-10 ***
## bmi 614.01 150.40 4.082 0.000138 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4766 on 58 degrees of freedom
## Multiple R-squared: 0.5795, Adjusted R-squared: 0.565
## F-statistic: 39.97 on 2 and 58 DF, p-value: 1.225e-11
From those two summary, I decided to use linear regression with 2 variables (age and BMI) because it have a better R-Squared value.
So, the equation will be like:
8120.10 + (292.16 * AGE) + (614.01 * BMI)
Smoker, have dependents, BMI under 30.
Age vs Charges chart looks can be approached by using linear regression.
BMI vs Charges looks have a disordered correlation.
##
## Call:
## lm(formula = charges ~ age, data = ins_smoker_child_under30)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4651.4 -1487.2 -352.5 637.6 15865.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10842.61 1332.43 8.137 7.77e-12 ***
## age 274.71 33.18 8.280 4.19e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3199 on 73 degrees of freedom
## Multiple R-squared: 0.4843, Adjusted R-squared: 0.4772
## F-statistic: 68.56 on 1 and 73 DF, p-value: 4.194e-12
##
## Call:
## lm(formula = charges ~ age + bmi, data = ins_smoker_child_under30)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2741.8 -1070.1 -608.8 -5.2 15619.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2428.48 2769.81 0.877 0.3835
## age 259.48 31.33 8.282 4.56e-12 ***
## bmi 359.27 105.64 3.401 0.0011 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2989 on 72 degrees of freedom
## Multiple R-squared: 0.5557, Adjusted R-squared: 0.5433
## F-statistic: 45.02 on 2 and 72 DF, p-value: 2.075e-13
From those two summary, I decided to use linear regression with 2 variables (age and BMI) because it have a better R-Squared value.
So, the equation will be like:
2428.48 + (259.48 * AGE) + (359.27 * BMI)
Smoker, have dependents, BMI over 30.
Age vs Charges chart looks can be approached by using linear regression.
BMI vs Charges looks have a disordered correlation.
##
## Call:
## lm(formula = charges ~ age, data = ins_smoker_child_over30)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4354.3 -2271.4 -859.3 1174.1 18445.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 32648.20 1361.98 23.971 < 2e-16 ***
## age 241.21 31.86 7.572 4.89e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3723 on 82 degrees of freedom
## Multiple R-squared: 0.4115, Adjusted R-squared: 0.4043
## F-statistic: 57.34 on 1 and 82 DF, p-value: 4.889e-11
##
## Call:
## lm(formula = charges ~ age + bmi, data = ins_smoker_child_over30)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2212.2 -1334.7 -653.6 40.7 17621.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16021.03 3341.64 4.794 7.30e-06 ***
## age 253.72 27.69 9.162 3.81e-14 ***
## bmi 447.91 84.22 5.318 9.08e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3225 on 81 degrees of freedom
## Multiple R-squared: 0.5638, Adjusted R-squared: 0.553
## F-statistic: 52.35 on 2 and 81 DF, p-value: 2.553e-15
From those two summary, I decided to use linear regression with 2 variables (age and BMI) because it have a better R-Squared value.
So, the equation will be like:
16021.03 + (253.72 * AGE) + (447.91 * BMI)
** Non-smoker, have no dependent, BMI under 30.**
Age vs Charges chart looks can be approached by using linear regression.
BMI vs Charges looks have a disordered correlation.
##
## Call:
## lm(formula = charges ~ age, data = ins_nonsmoker_nochild_under30)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2428.1 -1440.0 -886.5 -391.6 21672.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3239.15 687.81 -4.709 4.43e-06 ***
## age 277.00 16.79 16.495 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3950 on 217 degrees of freedom
## Multiple R-squared: 0.5563, Adjusted R-squared: 0.5543
## F-statistic: 272.1 on 1 and 217 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = charges ~ age + bmi, data = ins_nonsmoker_nochild_under30)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2457.4 -1453.0 -852.6 -391.1 21715.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3791.02 2191.60 -1.730 0.0851 .
## age 276.56 16.91 16.354 <2e-16 ***
## bmi 22.40 84.44 0.265 0.7911
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3958 on 216 degrees of freedom
## Multiple R-squared: 0.5565, Adjusted R-squared: 0.5524
## F-statistic: 135.5 on 2 and 216 DF, p-value: < 2.2e-16
From those two summary, I decided to use linear regression with 1 variable (age) because it have a better R-Squared value.
So, the equation will be like:
-3239.15 + (277.00 * AGE)
** Non-smoker, have no dependent, BMI over 30.**
Age vs Charges chart looks can be approached by using linear regression.
BMI vs Charges looks have a disordered correlation.
##
## Call:
## lm(formula = charges ~ age, data = ins_nonsmoker_nochild_over30)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2595.7 -1666.9 -1096.7 -377.7 20412.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2155.79 635.34 -3.393 0.000809 ***
## age 254.01 14.66 17.329 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3861 on 238 degrees of freedom
## Multiple R-squared: 0.5579, Adjusted R-squared: 0.556
## F-statistic: 300.3 on 1 and 238 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = charges ~ age + bmi, data = ins_nonsmoker_nochild_over30)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2811.9 -1632.1 -1099.0 -396.8 20324.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -465.01 2319.97 -0.200 0.841
## age 254.86 14.71 17.321 <2e-16 ***
## bmi -48.90 64.53 -0.758 0.449
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3864 on 237 degrees of freedom
## Multiple R-squared: 0.5589, Adjusted R-squared: 0.5552
## F-statistic: 150.2 on 2 and 237 DF, p-value: < 2.2e-16
From those two summary, I decided to use linear regression with 1 variable (age) because it have a better R-Squared value.
So, the equation will be like:
-2155.79 + (254.01 * AGE)
** Non-smoker, have dependents, BMI under 30.**
Age vs Charges chart looks can be approached by using linear regression.
BMI vs Charges looks have a disordered correlation.
##
## Call:
## lm(formula = charges ~ age, data = ins_nonsmoker_child_under30)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3050 -2080 -1578 -808 22413
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -884.08 1029.59 -0.859 0.391
## age 247.85 25.87 9.581 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4975 on 281 degrees of freedom
## Multiple R-squared: 0.2462, Adjusted R-squared: 0.2436
## F-statistic: 91.8 on 1 and 281 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = charges ~ age + bmi, data = ins_nonsmoker_child_under30)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3255.8 -2178.9 -1528.4 -664.9 22367.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2835.49 2564.04 -1.106 0.270
## age 245.27 26.07 9.408 <2e-16 ***
## bmi 79.79 96.01 0.831 0.407
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4978 on 280 degrees of freedom
## Multiple R-squared: 0.2481, Adjusted R-squared: 0.2427
## F-statistic: 46.19 on 2 and 280 DF, p-value: < 2.2e-16
From those two summary, I decided to use linear regression with 1 variable (age) because it have a better R-Squared value.
So, the equation will be like:
-884.08 + (247.85 * AGE)
** Non-smoker, have dependents, BMI over 30.**
Age vs Charges chart looks can be approached by using linear regression.
BMI vs Charges looks have a disordered correlation.
##
## Call:
## lm(formula = charges ~ age, data = ins_nonsmoker_child_over30)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3138.6 -2283.0 -1647.7 -746.8 24235.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2161.36 1041.35 -2.076 0.0387 *
## age 282.54 24.23 11.660 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5250 on 320 degrees of freedom
## Multiple R-squared: 0.2982, Adjusted R-squared: 0.296
## F-statistic: 136 on 1 and 320 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = charges ~ age + bmi, data = ins_nonsmoker_child_over30)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3320.7 -2278.4 -1636.0 -711.3 24252.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -923.87 2648.40 -0.349 0.727
## age 283.07 24.28 11.658 <2e-16 ***
## bmi -35.83 70.50 -0.508 0.612
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5256 on 319 degrees of freedom
## Multiple R-squared: 0.2988, Adjusted R-squared: 0.2944
## F-statistic: 67.95 on 2 and 319 DF, p-value: < 2.2e-16
From those two summary, I decided to use linear regression with 1 variable (age) because it have a better R-Squared value.
So, the equation will be like:
-2161.36 + (282.54 * AGE)
From grouping the equation, I made a function to predict the charges. The function is look like:
predict <- function(x){
for(i in 1:nrow(x)){
if(x[i,"smoker"] == "yes" && x[i,"children"] == 0 && x[i,"bmi"] < 30){
x[i,"result"] = -956.74 + (251.20*x[i,"age"]) + (505.18*x[i,"bmi"])
} else if(x[i,"smoker"] == "yes" && x[i,"children"] == 0 && x[i,"bmi"] >= 30) {
x[i,"result"] = 8120.10 + (292.16*x[i,"age"]) + (614.01*x[i,"bmi"])
} else if(x[i,"smoker"] == "yes" && x[i,"children"] > 0 && x[i,"bmi"] < 30){
x[i,"result"] = 2428.48 + (259.48*x[i,"age"]) + (359.27*x[i,"bmi"])
} else if(x[i,"smoker"] == "yes" && x[i,"children"] > 0 && x[i,"bmi"] >= 30){
x[i,"result"] = 16021.03 + (253.72*x[i,"age"]) + (447.91*x[i,"bmi"])
} else if(x[i,"smoker"] == "no" && x[i,"children"] == 0 && x[i,"bmi"] < 30){
x[i,"result"] = -3239.15 + (277.00*x[i,"age"])
} else if(x[i,"smoker"] == "no" && x[i,"children"] == 0 && x[i,"bmi"] >= 30){
x[i,"result"] = -2155.79 + (254.01*x[i,"age"])
} else if(x[i,"smoker"] == "no" && x[i,"children"] > 0 && x[i,"bmi"] < 30){
x[i,"result"] = -884.08 + (247.85*x[i,"age"])
} else {
x[i,"result"] = -2161.36 + (282.54*x[i,"age"])
}
}
return(x)
}And the results of my prediction is shown on table below
The Root Mean Square Error of my prediction:
## [1] 4437.567
Mean error value of my prediction is about plus minus 4437.56
The Mean Absolute Percent Error of my prediction:
## [1] 0.2672371
Mean percentage error of my prediction is 26.7%
How much the insurance costs for this people?
Daniel, 27 years old, 34.21 BMI, have no dependent, non - smoker, and living in northeast.
Rani, 31 years old, 27.54 BMI, 3 dependents, smoker, and living in southeast.
Now, you can answer those question with this function:
predict_charge <- function(age, bmi, children, smoker){
if(smoker == "yes" && children == 0 && bmi < 30){
result = -956.74 + (251.20*age) + (505.18*bmi)
} else if(smoker == "yes" && children == 0 && bmi >= 30) {
result = 8120.10 + (292.16*age) + (614.01*bmi)
} else if(smoker == "yes" && children > 0 && bmi < 30){
result = 2428.48 + (259.48*age) + (359.27*bmi)
} else if(smoker == "yes" && children > 0 && bmi >= 30){
result = 16021.03 + (253.72*age) + (447.91*bmi)
} else if(smoker == "no" && children == 0 && bmi < 30){
result = -3239.15 + (277.00*age)
} else if(smoker == "no" && children == 0 && bmi >= 30){
result = -2155.79 + (254.01*age)
} else if(smoker == "no" && children > 0 && bmi < 30){
result = -884.08 + (247.85*age)
} else {
result = -2161.36 + (282.54*age)
}
return(result)
}The medical cost insurance for Daniel:
## [1] 4702.48
The medical cost insurance for Rani:
## [1] 20366.66
Suatu perusahaan di Amerika Serikat ingin mempekerjakan seseorang dari luar Amerika Serikat untuk posisi teknis, mereka perlu mengajukan aplikasi ke pemerintah Amerika Serikat untuk mendapatkan kartu hijau atau visa bagi pelamar asing. Untuk menunjukkan ekuitas bagi karyawan AS dan non-AS, perusahaan perlu menyatakan seberapa banyak mereka bersedia membayar karyawan ketika mereka mengajukan permohonan visa atau kartu hijau. Sementara itu, mereka perlu memberikan jumlah rata-rata, yang disebut “prevailing wage” seorang karyawan dengan keterampilan dan latar belakang serupa biasanya dibayar untuk posisi yang sama.
Perbedaan antara upah yang dibayar dan upah yang berlaku dapat menunjukkan apakah perusahaan AS bersedia membayar lebih banyak gaji kepada karyawan non-AS. Gaji lebih banyak untuk calon karyawan asing akan menarik. Selain itu, perlu diperhatikan bahwa untuk area dan pekerjaan yang berbeda, gaji dapat menunjukkan perbedaan. Oleh karena itu perlu untuk mencari tahu hubungan antara gaji, area dan posisi dapat membantu karyawan non-AS untuk memilih pekerjaan di AS.
Berdasarkan klasifikasi VISA yang mereka miliki disimpulkan bahwa ada lima jenis yang berbeda: “green card”, “H-1B”, “H-1B1 Chile”, “H- 1B1 Singapore” dan “E-3 Australia”. Untuk projek ini, silahkan anda memilih kelas VISA “H-1B” untuk melakukan data mentah pelamar yang berpenduduk tetap tahun 2018 atau 2019. Kalian dapat mendwonload Data asli yang dikumpulkan oleh Kantor Sertifikasi Tenaga Kerja Asing Departemen Tenaga Kerja AS