Reading the csv


launch <- read.csv("challenger.csv")
launch

Calculating the temp manually

b <- cov(launch$temperature, launch$distress_ct) / var(launch$temperature)
b

Estimating alpha manually

a
[1] 3.698413

Calculating the correlation with different methods, first default:

r
[1] -0.5111264

Calculating the correlation with different methods, spear method:

r
[1] -3.43453
cor(launch$temperature, launch$distress_ct)
[1] -0.5111264

computing the slope using correlation

r * (sd(launch$distress_ct) / sd(launch$temperature))
[1] -0.3194444

confirming the regression line using the lm function (not in text)

model

Call:
lm(formula = distress_ct ~ temperature, data = launch)

Coefficients:
(Intercept)  temperature  
    3.69841     -0.04754  
summary(model)

Call:
lm(formula = distress_ct ~ temperature, data = launch)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.5608 -0.3944 -0.0854  0.1056  1.8671 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  3.69841    1.21951   3.033  0.00633 **
temperature -0.04754    0.01744  -2.725  0.01268 * 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.5774 on 21 degrees of freedom
Multiple R-squared:  0.2613,    Adjusted R-squared:  0.2261 
F-statistic: 7.426 on 1 and 21 DF,  p-value: 0.01268

creating SMR function

reg <- function(y, x) {
  x <- as.matrix(x)
  x <- cbind(Intercept = 1, x)
  b <- solve(t(x) %*% x) %*% t(x) %*% y
  colnames(b) <- "estimate"
  print(b)
}

examine the launch data

str(launch)
'data.frame':   23 obs. of  4 variables:
 $ distress_ct         : int  0 1 0 0 0 0 0 0 1 1 ...
 $ temperature         : int  66 70 69 68 67 72 73 70 57 63 ...
 $ field_check_pressure: int  50 50 50 50 50 50 100 100 200 200 ...
 $ flight_num          : int  1 2 3 4 5 6 7 8 9 10 ...

test regression model with simple linear regression

reg(y = launch$distress_ct, x = launch[2])
               estimate
Intercept    3.69841270
temperature -0.04753968

use regression model with multiple regression

reg(y = launch$distress_ct, x = launch[2:4])
                         estimate
Intercept             3.527093383
temperature          -0.051385940
field_check_pressure  0.001757009
flight_num            0.014292843

confirming the multiple regression result using the lm function (not in text)

model

Call:
lm(formula = distress_ct ~ temperature + field_check_pressure + 
    flight_num, data = launch)

Coefficients:
         (Intercept)           temperature  field_check_pressure  
            3.527093             -0.051386              0.001757  
          flight_num  
            0.014293  
summary(model)

Call:
lm(formula = distress_ct ~ temperature + field_check_pressure + 
    flight_num, data = launch)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.65003 -0.24414 -0.11219  0.01279  1.67530 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)  
(Intercept)           3.527093   1.307024   2.699   0.0142 *
temperature          -0.051386   0.018341  -2.802   0.0114 *
field_check_pressure  0.001757   0.003402   0.517   0.6115  
flight_num            0.014293   0.035138   0.407   0.6887  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.565 on 19 degrees of freedom
Multiple R-squared:   0.36, Adjusted R-squared:  0.259 
F-statistic: 3.563 on 3 and 19 DF,  p-value: 0.03371

Predicting Medical Expenses

Step 2: Exploring and preparing the data —-

Summarize the charges variable

summary(insurance$expenses)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1122    4740    9382   13270   16640   63770 

histogram of insurace charges

Table of region

table(insurance$region)

northeast northwest southeast southwest 
      324       325       364       325 

exploring relationships among features: correlation matrix

cor(insurance[c("age", "bmi", "children", "expenses")])
               age        bmi   children   expenses
age      1.0000000 0.10934101 0.04246900 0.29900819
bmi      0.1093410 1.00000000 0.01264471 0.19857626
children 0.0424690 0.01264471 1.00000000 0.06799823
expenses 0.2990082 0.19857626 0.06799823 1.00000000

visualizing the relationships

Step 3: Training a model on the data

ins_model 

Call:
lm(formula = expenses ~ age + children + bmi + sex + smoker + 
    region, data = insurance)

Coefficients:
    (Intercept)              age         children              bmi  
       -11941.6            256.8            475.7            339.3  
        sexmale        smokeryes  regionnorthwest  regionsoutheast  
         -131.4          23847.5           -352.8          -1035.6  
regionsouthwest  
         -959.3  

Step 4:Evaluating model performance

summary(ins_model)

Call:
lm(formula = expenses ~ age + children + bmi + sex + smoker + 
    region, data = insurance)

Residuals:
     Min       1Q   Median       3Q      Max 
-11302.7  -2850.9   -979.6   1383.9  29981.7 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)     -11941.6      987.8 -12.089  < 2e-16 ***
age                256.8       11.9  21.586  < 2e-16 ***
children           475.7      137.8   3.452 0.000574 ***
bmi                339.3       28.6  11.864  < 2e-16 ***
sexmale           -131.3      332.9  -0.395 0.693255    
smokeryes        23847.5      413.1  57.723  < 2e-16 ***
regionnorthwest   -352.8      476.3  -0.741 0.458976    
regionsoutheast  -1035.6      478.7  -2.163 0.030685 *  
regionsouthwest   -959.3      477.9  -2.007 0.044921 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6062 on 1329 degrees of freedom
Multiple R-squared:  0.7509,    Adjusted R-squared:  0.7494 
F-statistic: 500.9 on 8 and 1329 DF,  p-value: < 2.2e-16

Step 5: improving model performance

adding a higher order age term

insurance$age2 <- insurance$age^2 

add an indicator for BMI >= 30

insurance$bmi30 <- ifelse(insurance$bmi >= 30, 1, 0)

creating final model

summary(ins_model2)

Call:
lm(formula = expenses ~ age + age2 + children + bmi + sex + bmi30 * 
    smoker + region, data = insurance)

Residuals:
     Min       1Q   Median       3Q      Max 
-17297.1  -1656.0  -1262.7   -727.8  24161.6 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)       139.0053  1363.1359   0.102 0.918792    
age               -32.6181    59.8250  -0.545 0.585690    
age2                3.7307     0.7463   4.999 6.54e-07 ***
children          678.6017   105.8855   6.409 2.03e-10 ***
bmi               119.7715    34.2796   3.494 0.000492 ***
sexmale          -496.7690   244.3713  -2.033 0.042267 *  
bmi30            -997.9355   422.9607  -2.359 0.018449 *  
smokeryes       13404.5952   439.9591  30.468  < 2e-16 ***
regionnorthwest  -279.1661   349.2826  -0.799 0.424285    
regionsoutheast  -828.0345   351.6484  -2.355 0.018682 *  
regionsouthwest -1222.1619   350.5314  -3.487 0.000505 ***
bmi30:smokeryes 19810.1534   604.6769  32.762  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4445 on 1326 degrees of freedom
Multiple R-squared:  0.8664,    Adjusted R-squared:  0.8653 
F-statistic: 781.7 on 11 and 1326 DF,  p-value: < 2.2e-16

making prediction with the regression model

cor(insurance$pred, insurance$expenses)
[1] 0.9307999

plotting the prediction against expenses

plot(insurance$pred, insurance$expenses)
abline(a = 0, b = 1, col = "red", lwd = 3, lty = 2)

predict(ins_model2,
        data.frame(age = 30, age2 = 30^2, children = 2,
                   bmi = 30, sex = "male", bmi30 = 1,
                   smoker = "no", region = "northeast"))
       1 
5973.774 
predict(ins_model2,
        data.frame(age = 30, age2 = 30^2, children = 2,
                   bmi = 30, sex = "female", bmi30 = 1,
                   smoker = "no", region = "northeast"))
       1 
6470.543 
predict(ins_model2,
        data.frame(age = 30, age2 = 30^2, children = 0,
                   bmi = 30, sex = "female", bmi30 = 1,
                   smoker = "no", region = "northeast"))
      1 
5113.34 

Understanding regression trees and model trees

Calculation SDR

# set up the data
tee <- c(1, 1, 1, 2, 2, 3, 4, 5, 5, 6, 6, 7, 7, 7, 7)
at1 <- c(1, 1, 1, 2, 2, 3, 4, 5, 5)
at2 <- c(6, 6, 7, 7, 7, 7)
bt1 <- c(1, 1, 1, 2, 2, 3, 4)
bt2 <- c(5, 5, 6, 6, 7, 7, 7, 7)

compute the SDR

sdr_a <- sd(tee) - (length(at1) / length(tee) * sd(at1) + length(at2) / length(tee) * sd(at2))
sdr_b <- sd(tee) - (length(bt1) / length(tee) * sd(bt1) + length(bt2) / length(tee) * sd(bt2))

Compare SDR for each split

sdr_a 
[1] 1.202815
sdr_b
[1] 1.392751
LS0tCnRpdGxlOiAiWW91c3NlZi1NTFIiCm91dHB1dDogaHRtbF9ub3RlYm9vawotLS0KCiMjIyBSZWFkaW5nIHRoZSBjc3YgCmBgYHtyfQoKbGF1bmNoIDwtIHJlYWQuY3N2KCJjaGFsbGVuZ2VyLmNzdiIpCmxhdW5jaApgYGAKCgojIyMgQ2FsY3VsYXRpbmcgdGhlIHRlbXAgbWFudWFsbHkgCmBgYHtyfQpiIDwtIGNvdihsYXVuY2gkdGVtcGVyYXR1cmUsIGxhdW5jaCRkaXN0cmVzc19jdCkgLyB2YXIobGF1bmNoJHRlbXBlcmF0dXJlKQpiCmBgYAoKCiMjIyBFc3RpbWF0aW5nIGFscGhhIG1hbnVhbGx5IApgYGB7cn0KYSA8LSBtZWFuKGxhdW5jaCRkaXN0cmVzc19jdCkgLSBiICogbWVhbihsYXVuY2gkdGVtcGVyYXR1cmUpCmEKCmBgYAoKIyMjIENhbGN1bGF0aW5nIHRoZSBjb3JyZWxhdGlvbiB3aXRoIGRpZmZlcmVudCBtZXRob2RzLCBmaXJzdCBkZWZhdWx0OiAKYGBge3J9CnIgPC0gY292KGxhdW5jaCR0ZW1wZXJhdHVyZSwgbGF1bmNoJGRpc3RyZXNzX2N0LCkgLwogICAgICAgKHNkKGxhdW5jaCR0ZW1wZXJhdHVyZSkgKiBzZChsYXVuY2gkZGlzdHJlc3NfY3QpKQpyCmBgYAojIyMgQ2FsY3VsYXRpbmcgdGhlIGNvcnJlbGF0aW9uIHdpdGggZGlmZmVyZW50IG1ldGhvZHMsIHNwZWFyIG1ldGhvZDogCmBgYHtyfQpyIDwtIGNvdihsYXVuY2gkdGVtcGVyYXR1cmUsIGxhdW5jaCRkaXN0cmVzc19jdCwgbWV0aG9kID0gInNwZWFybWFuIikgLwogICAgICAgKHNkKGxhdW5jaCR0ZW1wZXJhdHVyZSkgKiBzZChsYXVuY2gkZGlzdHJlc3NfY3QpKQpyCmBgYApgYGB7cn0KY29yKGxhdW5jaCR0ZW1wZXJhdHVyZSwgbGF1bmNoJGRpc3RyZXNzX2N0KQoKYGBgCgojIyMgY29tcHV0aW5nIHRoZSBzbG9wZSB1c2luZyBjb3JyZWxhdGlvbgoKCmBgYHtyfQpyICogKHNkKGxhdW5jaCRkaXN0cmVzc19jdCkgLyBzZChsYXVuY2gkdGVtcGVyYXR1cmUpKQpgYGAKCiMjIyBjb25maXJtaW5nIHRoZSByZWdyZXNzaW9uIGxpbmUgdXNpbmcgdGhlIGxtIGZ1bmN0aW9uIChub3QgaW4gdGV4dCkKCgpgYGB7cn0KbW9kZWwgPC0gbG0oZGlzdHJlc3NfY3QgfiB0ZW1wZXJhdHVyZSwgZGF0YSA9IGxhdW5jaCkKbW9kZWwKCmBgYAoKYGBge3J9CnN1bW1hcnkobW9kZWwpCmBgYAoKIyMjIGNyZWF0aW5nIFNNUiBmdW5jdGlvbiAKYGBge3J9CnJlZyA8LSBmdW5jdGlvbih5LCB4KSB7CiAgeCA8LSBhcy5tYXRyaXgoeCkKICB4IDwtIGNiaW5kKEludGVyY2VwdCA9IDEsIHgpCiAgYiA8LSBzb2x2ZSh0KHgpICUqJSB4KSAlKiUgdCh4KSAlKiUgeQogIGNvbG5hbWVzKGIpIDwtICJlc3RpbWF0ZSIKICBwcmludChiKQp9CmBgYAoKIyMjIGV4YW1pbmUgdGhlIGxhdW5jaCBkYXRhCgpgYGB7cn0Kc3RyKGxhdW5jaCkKCmBgYAoKIyMjIHRlc3QgcmVncmVzc2lvbiBtb2RlbCB3aXRoIHNpbXBsZSBsaW5lYXIgcmVncmVzc2lvbgpgYGB7cn0KcmVnKHkgPSBsYXVuY2gkZGlzdHJlc3NfY3QsIHggPSBsYXVuY2hbMl0pCgpgYGAKCgojIyMgdXNlIHJlZ3Jlc3Npb24gbW9kZWwgd2l0aCBtdWx0aXBsZSByZWdyZXNzaW9uCgpgYGB7cn0KcmVnKHkgPSBsYXVuY2gkZGlzdHJlc3NfY3QsIHggPSBsYXVuY2hbMjo0XSkKCmBgYAoKIyMjIGNvbmZpcm1pbmcgdGhlIG11bHRpcGxlIHJlZ3Jlc3Npb24gcmVzdWx0IHVzaW5nIHRoZSBsbSBmdW5jdGlvbiAobm90IGluIHRleHQpCmBgYHtyfQptb2RlbCA8LSBsbShkaXN0cmVzc19jdCB+IHRlbXBlcmF0dXJlICsgZmllbGRfY2hlY2tfcHJlc3N1cmUgKyBmbGlnaHRfbnVtLCBkYXRhID0gbGF1bmNoKQptb2RlbApgYGAKCmBgYHtyfQpzdW1tYXJ5KG1vZGVsKQpgYGAKCiMjIFByZWRpY3RpbmcgTWVkaWNhbCBFeHBlbnNlcwoKIyMjIFN0ZXAgMjogRXhwbG9yaW5nIGFuZCBwcmVwYXJpbmcgdGhlIGRhdGEgLS0tLQpgYGB7cn0KaW5zdXJhbmNlIDwtIHJlYWQuY3N2KCJpbnN1cmFuY2UuY3N2Iiwgc3RyaW5nc0FzRmFjdG9ycyA9IFRSVUUpCmluc3VyYW5jZQpgYGAKCiMjIyBTdW1tYXJpemUgdGhlIGNoYXJnZXMgdmFyaWFibGUgCgpgYGB7cn0Kc3VtbWFyeShpbnN1cmFuY2UkZXhwZW5zZXMpCmBgYAoKIyMjIGhpc3RvZ3JhbSBvZiBpbnN1cmFjZSBjaGFyZ2VzIApgYGB7cn0KaGlzdChpbnN1cmFuY2UkZXhwZW5zZXMpCmBgYAoKIyMjIFRhYmxlIG9mIHJlZ2lvbiAKCmBgYHtyfQp0YWJsZShpbnN1cmFuY2UkcmVnaW9uKQpgYGAKCiMjIyBleHBsb3JpbmcgcmVsYXRpb25zaGlwcyBhbW9uZyBmZWF0dXJlczogY29ycmVsYXRpb24gbWF0cml4CmBgYHtyfQpjb3IoaW5zdXJhbmNlW2MoImFnZSIsICJibWkiLCAiY2hpbGRyZW4iLCAiZXhwZW5zZXMiKV0pCmBgYAoKIyMjIHZpc3VhbGl6aW5nIHRoZSByZWxhdGlvbnNoaXBzIAoKYGBge3J9CnBhaXJzKGluc3VyYW5jZVtjKCJhZ2UiLCAiYm1pIiwgImNoaWxkcmVuIiwgImV4cGVuc2VzIildKQpgYGAKCiMjIyBTdGVwIDM6IFRyYWluaW5nIGEgbW9kZWwgb24gdGhlIGRhdGEgCgpgYGB7cn0KaW5zX21vZGVsIDwtIGxtKGV4cGVuc2VzIH4gYWdlICsgY2hpbGRyZW4gKyBibWkgKyBzZXggKyBzbW9rZXIgKyByZWdpb24sIAogICAgICAgICAgICAgICAgZGF0YSA9IGluc3VyYW5jZSkKaW5zX21vZGVsIApgYGAKIyMgU3RlcCA0OkV2YWx1YXRpbmcgbW9kZWwgcGVyZm9ybWFuY2UgCgpgYGB7cn0Kc3VtbWFyeShpbnNfbW9kZWwpCmBgYAoKIyMgU3RlcCA1OiBpbXByb3ZpbmcgbW9kZWwgcGVyZm9ybWFuY2UgCgojIyMgYWRkaW5nIGEgaGlnaGVyIG9yZGVyIGFnZSB0ZXJtIApgYGB7cn0KaW5zdXJhbmNlJGFnZTIgPC0gaW5zdXJhbmNlJGFnZV4yIApgYGAKIyMjIGFkZCBhbiBpbmRpY2F0b3IgZm9yIEJNSSA+PSAzMApgYGB7cn0KaW5zdXJhbmNlJGJtaTMwIDwtIGlmZWxzZShpbnN1cmFuY2UkYm1pID49IDMwLCAxLCAwKQpgYGAKCiMjIyBjcmVhdGluZyBmaW5hbCBtb2RlbCAKYGBge3J9Cmluc19tb2RlbDIgPC0gbG0oZXhwZW5zZXMgfiBhZ2UgKyBhZ2UyICsgY2hpbGRyZW4gKyBibWkgKyBzZXggKwogICAgICAgICAgICAgICAgICAgYm1pMzAqc21va2VyICsgcmVnaW9uLCBkYXRhID0gaW5zdXJhbmNlKQpzdW1tYXJ5KGluc19tb2RlbDIpCmBgYAoKIyMjIG1ha2luZyBwcmVkaWN0aW9uIHdpdGggdGhlIHJlZ3Jlc3Npb24gbW9kZWwgCmBgYHtyfQppbnN1cmFuY2UkcHJlZCA8LSBwcmVkaWN0KGluc19tb2RlbDIsIGluc3VyYW5jZSkKY29yKGluc3VyYW5jZSRwcmVkLCBpbnN1cmFuY2UkZXhwZW5zZXMpCmBgYAoKIyMjIHBsb3R0aW5nIHRoZSBwcmVkaWN0aW9uIGFnYWluc3QgZXhwZW5zZXMgCgpgYGB7cn0KcGxvdChpbnN1cmFuY2UkcHJlZCwgaW5zdXJhbmNlJGV4cGVuc2VzKQphYmxpbmUoYSA9IDAsIGIgPSAxLCBjb2wgPSAicmVkIiwgbHdkID0gMywgbHR5ID0gMikKYGBgCgpgYGB7cn0KcHJlZGljdChpbnNfbW9kZWwyLAogICAgICAgIGRhdGEuZnJhbWUoYWdlID0gMzAsIGFnZTIgPSAzMF4yLCBjaGlsZHJlbiA9IDIsCiAgICAgICAgICAgICAgICAgICBibWkgPSAzMCwgc2V4ID0gIm1hbGUiLCBibWkzMCA9IDEsCiAgICAgICAgICAgICAgICAgICBzbW9rZXIgPSAibm8iLCByZWdpb24gPSAibm9ydGhlYXN0IikpCmBgYAoKCgpgYGB7cn0KcHJlZGljdChpbnNfbW9kZWwyLAogICAgICAgIGRhdGEuZnJhbWUoYWdlID0gMzAsIGFnZTIgPSAzMF4yLCBjaGlsZHJlbiA9IDIsCiAgICAgICAgICAgICAgICAgICBibWkgPSAzMCwgc2V4ID0gImZlbWFsZSIsIGJtaTMwID0gMSwKICAgICAgICAgICAgICAgICAgIHNtb2tlciA9ICJubyIsIHJlZ2lvbiA9ICJub3J0aGVhc3QiKSkKCmBgYAoKYGBge3J9CnByZWRpY3QoaW5zX21vZGVsMiwKICAgICAgICBkYXRhLmZyYW1lKGFnZSA9IDMwLCBhZ2UyID0gMzBeMiwgY2hpbGRyZW4gPSAwLAogICAgICAgICAgICAgICAgICAgYm1pID0gMzAsIHNleCA9ICJmZW1hbGUiLCBibWkzMCA9IDEsCiAgICAgICAgICAgICAgICAgICBzbW9rZXIgPSAibm8iLCByZWdpb24gPSAibm9ydGhlYXN0IikpCmBgYAoKCiMjIFVuZGVyc3RhbmRpbmcgcmVncmVzc2lvbiB0cmVlcyBhbmQgbW9kZWwgdHJlZXMKCiMjIENhbGN1bGF0aW9uIFNEUiAKYGBge3J9CiMgc2V0IHVwIHRoZSBkYXRhCnRlZSA8LSBjKDEsIDEsIDEsIDIsIDIsIDMsIDQsIDUsIDUsIDYsIDYsIDcsIDcsIDcsIDcpCmF0MSA8LSBjKDEsIDEsIDEsIDIsIDIsIDMsIDQsIDUsIDUpCmF0MiA8LSBjKDYsIDYsIDcsIDcsIDcsIDcpCmJ0MSA8LSBjKDEsIDEsIDEsIDIsIDIsIDMsIDQpCmJ0MiA8LSBjKDUsIDUsIDYsIDYsIDcsIDcsIDcsIDcpCmBgYAoKIyMgY29tcHV0ZSB0aGUgU0RSIAoKYGBge3J9CnNkcl9hIDwtIHNkKHRlZSkgLSAobGVuZ3RoKGF0MSkgLyBsZW5ndGgodGVlKSAqIHNkKGF0MSkgKyBsZW5ndGgoYXQyKSAvIGxlbmd0aCh0ZWUpICogc2QoYXQyKSkKc2RyX2IgPC0gc2QodGVlKSAtIChsZW5ndGgoYnQxKSAvIGxlbmd0aCh0ZWUpICogc2QoYnQxKSArIGxlbmd0aChidDIpIC8gbGVuZ3RoKHRlZSkgKiBzZChidDIpKQoKYGBgCgoKIyMjIENvbXBhcmUgU0RSIGZvciBlYWNoIHNwbGl0IApgYGB7cn0Kc2RyX2EgCnNkcl9iCmBgYAoKCgoKCgo=