Part 1: Linear Regression

####In this section, I manually compute the slope and intercept of a simple linear regression model to demonstrate understanding of the mathematical foundation of OLS. I also calculate the correlation coefficient and verify the relationship between correlation and the regression slope. The lm() function is then used to confirm the results and obtain additional statistics such as R-squared and p-values. A custom regression function is implemented using the matrix formula, showing the algebraic structure behind multiple regression.

Understanding regression

## Example: Space Shuttle Launch Data
launch <- read.csv("challenger.csv")
# estimate beta manually
b <- cov(launch$temperature, launch$distress_ct) / var(launch$temperature)
b
# estimate alpha manually
a <- mean(launch$distress_ct) - b * mean(launch$temperature)
a
# calculate the correlation of launch data
r <- cov(launch$temperature, launch$distress_ct) /
       (sd(launch$temperature) * sd(launch$distress_ct))
r
cor(launch$temperature, launch$distress_ct)
# computing the slope using correlation
r * (sd(launch$distress_ct) / sd(launch$temperature))
# confirming the regression line using the lm function (not in text)
model <- lm(distress_ct ~ temperature, data = launch)
model
summary(model)
# creating a simple multiple regression function
reg <- function(y, x) {
  x <- as.matrix(x)
  x <- cbind(Intercept = 1, x)
  b <- solve(t(x) %*% x) %*% t(x) %*% y
  colnames(b) <- "estimate"
  print(b)
}
# examine the launch data
str(launch)
# test regression model with simple linear regression
reg(y = launch$distress_ct, x = launch[2])
# use regression model with multiple regression
reg(y = launch$distress_ct, x = launch[2:4])
# confirming the multiple regression result using the lm function (not in text)
model <- lm(distress_ct ~ temperature + field_check_pressure + flight_num, data = launch)
model

Predicting Medical Expenses

## Step 2: Exploring and preparing the data ----
insurance <- read.csv("insurance.csv", stringsAsFactors = TRUE)
str(insurance)
# summarize the charges variable
summary(insurance$expenses)
# histogram of insurance charges
hist(insurance$expenses)
# table of region
table(insurance$region)
# exploring relationships among features: correlation matrix
cor(insurance[c("age", "bmi", "children", "expenses")])
# visualing relationships among features: scatterplot matrix
pairs(insurance[c("age", "bmi", "children", "expenses")])
## Step 3: Training a model on the data ----
ins_model <- lm(expenses ~ age + children + bmi + sex + smoker + region,
                data = insurance)
ins_model <- lm(expenses ~ ., data = insurance) # this is equivalent to above

# see the estimated beta coefficients
ins_model

Step 4: Evaluating model performance

# see more detail about the estimated beta coefficients
summary(ins_model)

Step 5: Improving model performance

# add a higher-order "age" term
insurance$age2 <- insurance$age^2
# add an indicator for BMI >= 30
insurance$bmi30 <- ifelse(insurance$bmi >= 30, 1, 0)
# create final model
ins_model2 <- lm(expenses ~ age + age2 + children + bmi + sex +
                   bmi30*smoker + region, data = insurance)
summary(ins_model2)
# making predictions with the regression model
insurance$pred <- predict(ins_model2, insurance)
cor(insurance$pred, insurance$expenses)
plot(insurance$pred, insurance$expenses)
abline(a = 0, b = 1, col = "red", lwd = 3, lty = 2)
predict(ins_model2,
        data.frame(age = 30, age2 = 30^2, children = 2,
                   bmi = 30, sex = "male", bmi30 = 1,
                   smoker = "no", region = "northeast"))
predict(ins_model2,
        data.frame(age = 30, age2 = 30^2, children = 2,
                   bmi = 30, sex = "female", bmi30 = 1,
                   smoker = "no", region = "northeast"))
predict(ins_model2,
        data.frame(age = 30, age2 = 30^2, children = 0,
                   bmi = 30, sex = "female", bmi30 = 1,
                   smoker = "no", region = "northeast"))

Part 2: Regression Trees and Model Trees

####I begin with exploratory data analysis to understand variable distributions and relationships. A multiple linear regression model is built to estimate how demographic and health-related factors influence medical expenses. To improve the model, I add a quadratic age term, a BMI indicator variable, and an interaction between BMI and smoking status to capture nonlinear and combined effects. Model performance is evaluated using correlation and prediction accuracy

Understanding regression trees and model trees

Example: Calculating SDR

# set up the data
tee <- c(1, 1, 1, 2, 2, 3, 4, 5, 5, 6, 6, 7, 7, 7, 7)
at1 <- c(1, 1, 1, 2, 2, 3, 4, 5, 5)
at2 <- c(6, 6, 7, 7, 7, 7)
bt1 <- c(1, 1, 1, 2, 2, 3, 4)
bt2 <- c(5, 5, 6, 6, 7, 7, 7, 7)
# compute the SDR
sdr_a <- sd(tee) - (length(at1) / length(tee) * sd(at1) + length(at2) / length(tee) * sd(at2))
sdr_b <- sd(tee) - (length(bt1) / length(tee) * sd(bt1) + length(bt2) / length(tee) * sd(bt2))
# compare the SDR for each split
sdr_a
sdr_b

Exercise No 3: Estimating Wine Quality

####The dataset is split into training and testing sets to evaluate out-of-sample performance. A regression tree is trained using rpart, and performance is measured using correlation and Mean Absolute Error (MAE). To improve predictions, a Cubist model tree is implemented, combining decision rules with linear regression at the leaves. The results are compared to assess which approach provides better predictive accuracy.

Step 2: Exploring and preparing the data

wine <- read.csv("whitewines.csv")
# examine the wine data
str(wine)
# the distribution of quality ratings
hist(wine$quality)
# summary statistics of the wine data
summary(wine)
wine_train <- wine[1:3750, ]
wine_test <- wine[3751:4898, ]

Step 3: Training a model on the data

# regression tree using rpart
library(rpart)
m.rpart <- rpart(quality ~ ., data = wine_train)
# get basic information about the tree
m.rpart
# get more detailed information about the tree
summary(m.rpart)
install.packages("rpart.plot")
# use the rpart.plot package to create a visualization
library(rpart.plot)
# a basic decision tree diagram
rpart.plot(m.rpart, digits = 3)
# a few adjustments to the diagram
rpart.plot(m.rpart, digits = 4, fallen.leaves = TRUE, type = 3, extra = 101)

Step 4: Evaluate model performanc

# generate predictions for the testing dataset
p.rpart <- predict(m.rpart, wine_test)
# compare the distribution of predicted values vs. actual values
summary(p.rpart)
summary(wine_test$quality)
# compare the correlation
cor(p.rpart, wine_test$quality)
# function to calculate the mean absolute error
MAE <- function(actual, predicted) {
  mean(abs(actual - predicted))  
}
# mean absolute error between predicted and actual values
MAE(p.rpart, wine_test$quality)
# mean absolute error between actual values and mean value
mean(wine_train$quality) # result = 5.87
MAE(5.87, wine_test$quality)

Step 5: Improving model performance

#install.packages("plyr")
#install.packages("Cubist")
# train a Cubist Model Tree
library(Cubist)
m.cubist <- cubist(x = wine_train[-12], y = wine_train$quality)
# display basic information about the model tree
m.cubist
# display the tree itself
summary(m.cubist)
# generate predictions for the model
p.cubist <- predict(m.cubist, wine_test)
# summary statistics about the predictions
summary(p.cubist)
# correlation between the predicted and true values
cor(p.cubist, wine_test$quality)
# mean absolute error of predicted and true values
# (uses a custom function defined above)
MAE(wine_test$quality, p.cubist) 
---
title: "Chapter 6: Regression Methods"
output: html_notebook
---


#### Part 1: Linear Regression

####In this section, I manually compute the slope and intercept of a simple linear regression model to demonstrate understanding of the mathematical foundation of OLS. I also calculate the correlation coefficient and verify the relationship between correlation and the regression slope. The lm() function is then used to confirm the results and obtain additional statistics such as R-squared and p-values. A custom regression function is implemented using the matrix formula, showing the algebraic structure behind multiple regression.

## Understanding regression



```{r}
## Example: Space Shuttle Launch Data
launch <- read.csv("challenger.csv")
```


```{r}
# estimate beta manually
b <- cov(launch$temperature, launch$distress_ct) / var(launch$temperature)
b
```


```{r}
# estimate alpha manually
a <- mean(launch$distress_ct) - b * mean(launch$temperature)
a
```



```{r}
# calculate the correlation of launch data
r <- cov(launch$temperature, launch$distress_ct) /
       (sd(launch$temperature) * sd(launch$distress_ct))
r
```


```{r}
cor(launch$temperature, launch$distress_ct)
```




```{r}
# computing the slope using correlation
r * (sd(launch$distress_ct) / sd(launch$temperature))
```


```{r}
# confirming the regression line using the lm function (not in text)
model <- lm(distress_ct ~ temperature, data = launch)
model
```



```{r}
summary(model)
```


```{r}
# creating a simple multiple regression function
reg <- function(y, x) {
  x <- as.matrix(x)
  x <- cbind(Intercept = 1, x)
  b <- solve(t(x) %*% x) %*% t(x) %*% y
  colnames(b) <- "estimate"
  print(b)
}
```




```{r}
# examine the launch data
str(launch)
```



```{r}
# test regression model with simple linear regression
reg(y = launch$distress_ct, x = launch[2])
```


```{r}
# use regression model with multiple regression
reg(y = launch$distress_ct, x = launch[2:4])
```



```{r}
# confirming the multiple regression result using the lm function (not in text)
model <- lm(distress_ct ~ temperature + field_check_pressure + flight_num, data = launch)
model
```


## Predicting Medical Expenses

```{r}
## Step 2: Exploring and preparing the data ----
insurance <- read.csv("insurance.csv", stringsAsFactors = TRUE)
str(insurance)
```



```{r}
# summarize the charges variable
summary(insurance$expenses)
```


```{r}
# histogram of insurance charges
hist(insurance$expenses)
```



```{r}
# table of region
table(insurance$region)
```



```{r}
# exploring relationships among features: correlation matrix
cor(insurance[c("age", "bmi", "children", "expenses")])
```


```{r}
# visualing relationships among features: scatterplot matrix
pairs(insurance[c("age", "bmi", "children", "expenses")])
```




```{r}
## Step 3: Training a model on the data ----
ins_model <- lm(expenses ~ age + children + bmi + sex + smoker + region,
                data = insurance)
ins_model <- lm(expenses ~ ., data = insurance) # this is equivalent to above

# see the estimated beta coefficients
ins_model
```


## Step 4: Evaluating model performance

```{r}
# see more detail about the estimated beta coefficients
summary(ins_model)
```


## Step 5: Improving model performance



```{r}
# add a higher-order "age" term
insurance$age2 <- insurance$age^2
```



```{r}
# add an indicator for BMI >= 30
insurance$bmi30 <- ifelse(insurance$bmi >= 30, 1, 0)
```



```{r}
# create final model
ins_model2 <- lm(expenses ~ age + age2 + children + bmi + sex +
                   bmi30*smoker + region, data = insurance)
```


```{r}
summary(ins_model2)
```


```{r}
# making predictions with the regression model
insurance$pred <- predict(ins_model2, insurance)
cor(insurance$pred, insurance$expenses)
```



```{r}
plot(insurance$pred, insurance$expenses)
abline(a = 0, b = 1, col = "red", lwd = 3, lty = 2)
```




```{r}
predict(ins_model2,
        data.frame(age = 30, age2 = 30^2, children = 2,
                   bmi = 30, sex = "male", bmi30 = 1,
                   smoker = "no", region = "northeast"))
```


```{r}
predict(ins_model2,
        data.frame(age = 30, age2 = 30^2, children = 2,
                   bmi = 30, sex = "female", bmi30 = 1,
                   smoker = "no", region = "northeast"))
```


```{r}
predict(ins_model2,
        data.frame(age = 30, age2 = 30^2, children = 0,
                   bmi = 30, sex = "female", bmi30 = 1,
                   smoker = "no", region = "northeast"))
```



#### Part 2: Regression Trees and Model Trees

####I begin with exploratory data analysis to understand variable distributions and relationships. A multiple linear regression model is built to estimate how demographic and health-related factors influence medical expenses. To improve the model, I add a quadratic age term, a BMI indicator variable, and an interaction between BMI and smoking status to capture nonlinear and combined effects. Model performance is evaluated using correlation and prediction accuracy

## Understanding regression trees and model trees

## Example: Calculating SDR

```{r}
# set up the data
tee <- c(1, 1, 1, 2, 2, 3, 4, 5, 5, 6, 6, 7, 7, 7, 7)
at1 <- c(1, 1, 1, 2, 2, 3, 4, 5, 5)
at2 <- c(6, 6, 7, 7, 7, 7)
bt1 <- c(1, 1, 1, 2, 2, 3, 4)
bt2 <- c(5, 5, 6, 6, 7, 7, 7, 7)
```




```{r}
# compute the SDR
sdr_a <- sd(tee) - (length(at1) / length(tee) * sd(at1) + length(at2) / length(tee) * sd(at2))
sdr_b <- sd(tee) - (length(bt1) / length(tee) * sd(bt1) + length(bt2) / length(tee) * sd(bt2))
```



```{r}
# compare the SDR for each split
sdr_a
sdr_b
```



## Exercise No 3: Estimating Wine Quality

####The dataset is split into training and testing sets to evaluate out-of-sample performance. A regression tree is trained using rpart, and performance is measured using correlation and Mean Absolute Error (MAE). To improve predictions, a Cubist model tree is implemented, combining decision rules with linear regression at the leaves. The results are compared to assess which approach provides better predictive accuracy.


## Step 2: Exploring and preparing the data

```{r}
wine <- read.csv("whitewines.csv")
```



```{r}
# examine the wine data
str(wine)
```


```{r}
# the distribution of quality ratings
hist(wine$quality)
```


```{r}
# summary statistics of the wine data
summary(wine)
```



```{r}
wine_train <- wine[1:3750, ]
wine_test <- wine[3751:4898, ]
```



## Step 3: Training a model on the data

```{r}
# regression tree using rpart
library(rpart)
m.rpart <- rpart(quality ~ ., data = wine_train)
```


```{r}
# get basic information about the tree
m.rpart
```



```{r}
# get more detailed information about the tree
summary(m.rpart)
```


```{r}
install.packages("rpart.plot")
```


```{r}
# use the rpart.plot package to create a visualization
library(rpart.plot)
```


```{r}
# a basic decision tree diagram
rpart.plot(m.rpart, digits = 3)
```


```{r}
# a few adjustments to the diagram
rpart.plot(m.rpart, digits = 4, fallen.leaves = TRUE, type = 3, extra = 101)
```


## Step 4: Evaluate model performanc

```{r}
# generate predictions for the testing dataset
p.rpart <- predict(m.rpart, wine_test)
```


```{r}
# compare the distribution of predicted values vs. actual values
summary(p.rpart)
summary(wine_test$quality)
```


```{r}
# compare the correlation
cor(p.rpart, wine_test$quality)
```


```{r}
# function to calculate the mean absolute error
MAE <- function(actual, predicted) {
  mean(abs(actual - predicted))  
}
```



```{r}
# mean absolute error between predicted and actual values
MAE(p.rpart, wine_test$quality)
```


```{r}
# mean absolute error between actual values and mean value
mean(wine_train$quality) # result = 5.87
MAE(5.87, wine_test$quality)
```


## Step 5: Improving model performance

```{r}
#install.packages("plyr")
#install.packages("Cubist")
```


```{r}
# train a Cubist Model Tree
library(Cubist)
m.cubist <- cubist(x = wine_train[-12], y = wine_train$quality)
```


```{r}
# display basic information about the model tree
m.cubist
```



```{r}
# display the tree itself
summary(m.cubist)
```


```{r}
# generate predictions for the model
p.cubist <- predict(m.cubist, wine_test)
```


```{r}
# summary statistics about the predictions
summary(p.cubist)
```


```{r}
# correlation between the predicted and true values
cor(p.cubist, wine_test$quality)
```


```{r}
# mean absolute error of predicted and true values
# (uses a custom function defined above)
MAE(wine_test$quality, p.cubist) 
```



