I plan to use this as a reference for linear regression analysis in the future. Therefore, I included a lot of the information straight from your lecture slides.

A linear model is defined by an equation of the form \(y = \alpha + \beta x + \varepsilon\), where:
– y is the dependent/outcome variable
– x is an independent/predictor variable
– α (alpha) is the intercept, where the line crosses the y axis
– β (beta) is the slope, describing the change in y for a one-unit increase in x
– ε (epsilon) is an error term, a reminder that the predictions are not perfect
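As a quick illustration (not from the slides; the intercept, slope, and noise level below are made-up values), we can simulate data from this equation and recover α and β with lm():

# simulate y = alpha + beta*x + epsilon with hypothetical alpha = 2, beta = 0.5
set.seed(1)
x <- runif(100, 0, 10)
y <- 2 + 0.5 * x + rnorm(100, sd = 1)   # epsilon ~ Normal(0, 1)
coef(lm(y ~ x))                         # estimates should be close to 2 and 0.5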

Ordinary Least Squares estimation:
– The linear regression line doesn't pass through all data points; it cuts through the data somewhat evenly, with some predictions lower or higher than the line.
– Ordinary Least Squares (OLS) is a method for estimating the unknown parameters α (intercept) and β (slope) in a linear regression model.
– The slope and intercept are chosen so that they minimize the sum of squared errors.
– Errors are the vertical distances between the predicted y values (\(\hat{y}_i\)) and the actual y values (\(y_i\)).
– Errors are also known as residuals.
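In equation form, OLS chooses the intercept and slope to minimize the sum of squared residuals:

\[ \min_{\alpha,\,\beta} \sum_{i=1}^{n} \left( y_i - (\alpha + \beta x_i) \right)^2 \]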

Objective: To predict the value of an outcome (or response) variable based on the value of an input (or predictor) variable.

# Example: Space Shuttle Launch Data
launch <- read.csv("challenger.csv")
head(launch)
##   distress_ct temperature field_check_pressure flight_num
## 1           0          66                   50          1
## 2           1          70                   50          2
## 3           0          69                   50          3
## 4           0          68                   50          4
## 5           0          67                   50          5
## 6           0          72                   50          6

Use cov() and var() functions to estimate b and the mean() function to estimate a.
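For simple linear regression these estimates have closed forms, which is exactly what the cov(), var(), and mean() calls below compute:

\[ \hat{\beta} = \frac{\mathrm{Cov}(x, y)}{\mathrm{Var}(x)}, \qquad \hat{\alpha} = \bar{y} - \hat{\beta}\,\bar{x} \]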

However, estimating the regression equation manually is not ideal. Let’s do it anyway.

# estimate beta manually
b <- cov(launch$temperature, launch$distress_ct) / var(launch$temperature)
print(c("beta", b))
## [1] "beta"                "-0.0475396825396825"
# estimate alpha manually
a <- mean(launch$distress_ct) - b * mean(launch$temperature)
print(c("alpha", a))
## [1] "alpha"           "3.6984126984127"

R provides functions for performing this calculation automatically.

model <- lm(distress_ct ~ temperature, data = launch)
model
## 
## Call:
## lm(formula = distress_ct ~ temperature, data = launch)
## 
## Coefficients:
## (Intercept)  temperature  
##     3.69841     -0.04754
summary(model)
## 
## Call:
## lm(formula = distress_ct ~ temperature, data = launch)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.5608 -0.3944 -0.0854  0.1056  1.8671 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  3.69841    1.21951   3.033  0.00633 **
## temperature -0.04754    0.01744  -2.725  0.01268 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5774 on 21 degrees of freedom
## Multiple R-squared:  0.2613, Adjusted R-squared:  0.2261 
## F-statistic: 7.426 on 1 and 21 DF,  p-value: 0.01268

This is consistent with the manual calculation.

Correlations

Correlation between two variables quantifies how closely their relationship follows a straight line. This typically refers to Pearson's correlation coefficient.

Pearson's correlation coefficient ranges between -1 and +1. Extreme values indicate a strong linear relationship; values close to 0 indicate the absence of a linear relationship.
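The formula, which the code below computes both manually and with cor(), is:

\[ r = \frac{\mathrm{Cov}(x, y)}{\sigma_x\,\sigma_y}, \qquad \hat{\beta} = r \cdot \frac{\sigma_y}{\sigma_x} \]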

# calculate the correlation of launch data with or without cor() function
# Do it manually if you want to work harder
r <- cov(launch$temperature, launch$distress_ct) /
(sd(launch$temperature) * sd(launch$distress_ct))
r
## [1] -0.5111264
# using cor function (let R do the heavy lifting)
cor(launch$temperature, launch$distress_ct)
## [1] -0.5111264
# computing the slope (b) using correlation
r * (sd(launch$distress_ct) / sd(launch$temperature))
## [1] -0.04753968

Multiple Linear Regression

Use this when there is more than one predictor.
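In matrix form, the OLS solution for all coefficients at once is given by the normal equations, which the reg() function below implements directly:

\[ \hat{\boldsymbol{\beta}} = (X^{T} X)^{-1} X^{T} y \]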

# OLS via the normal equations: solve for b = (X'X)^-1 X'y
reg <- function(y, x) {
  x <- as.matrix(x)
  x <- cbind(Intercept = 1, x)            # prepend a column of 1s for the intercept
  b <- solve(t(x) %*% x) %*% t(x) %*% y   # (X'X)^-1 X'y
  colnames(b) <- "estimate"
  print(b)
}

# examine the launch data
str(launch)
## 'data.frame':    23 obs. of  4 variables:
##  $ distress_ct         : int  0 1 0 0 0 0 0 0 1 1 ...
##  $ temperature         : int  66 70 69 68 67 72 73 70 57 63 ...
##  $ field_check_pressure: int  50 50 50 50 50 50 100 100 200 200 ...
##  $ flight_num          : int  1 2 3 4 5 6 7 8 9 10 ...
# test regression model with simple linear regression
reg(y = launch$distress_ct, x = launch[2])
##                estimate
## Intercept    3.69841270
## temperature -0.04753968

For example, the response (y) is distress_ct and the three predictors are temperature, field_check_pressure, and flight_num.

# use regression model with multiple regression
reg(y = launch$distress_ct, x = launch[2:4])
##                          estimate
## Intercept             3.527093383
## temperature          -0.051385940
## field_check_pressure  0.001757009
## flight_num            0.014292843
# confirming the multiple regression result using the lm()
model <- lm(distress_ct ~ temperature + field_check_pressure +
flight_num, data = launch)
model
## 
## Call:
## lm(formula = distress_ct ~ temperature + field_check_pressure + 
##     flight_num, data = launch)
## 
## Coefficients:
##          (Intercept)           temperature  field_check_pressure  
##             3.527093             -0.051386              0.001757  
##           flight_num  
##             0.014293

The result: Beta coefficients for the multiple regression model.

plot(model)

# Request 2X2 plot layout
par(mfrow=c(2,2))
plot(model)

# reset plot layout
par(mfrow=c(1,1))

Part Two: Predicting Medical Expenses

Use patient data to estimate the average medical care expenses for population segments. The estimates can be used to create actuarial tables that set the price of yearly premiums higher or lower, depending on the expected treatment costs.

insurance<-read.csv("insurance.csv")
str(insurance)
## 'data.frame':    1338 obs. of  7 variables:
##  $ age     : int  19 18 28 33 32 31 46 37 37 60 ...
##  $ sex     : chr  "female" "male" "male" "male" ...
##  $ bmi     : num  27.9 33.8 33 22.7 28.9 25.7 33.4 27.7 29.8 25.8 ...
##  $ children: int  0 1 3 0 0 0 1 3 2 0 ...
##  $ smoker  : chr  "yes" "no" "no" "no" ...
##  $ region  : chr  "southwest" "southeast" "southeast" "northwest" ...
##  $ expenses: num  16885 1726 4449 21984 3867 ...
# summarize the expenses variable
summary(insurance$expenses)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1122    4740    9382   13270   16640   63770

Check for normality using a histogram

Linear regression doesn't strictly require normally distributed variables. However, the model often fits better when the dependent variable is roughly normal.

# histogram of insurance charges
hist(insurance$expenses)

The histogram of insurance expenses shows a right-skewed distribution: most people's yearly expenses fall toward the lower end of the range, with far fewer people in the long tail of very high costs.

Although this distribution is not ideal for a linear regression, knowing this weakness ahead of time may help us design a better-fitting model later on.

Regression models require that every feature is numeric, yet we have three factor-type features in our data frame:
– the sex variable is divided into male and female levels,
– smoker is divided into yes and no,
– the region variable has four levels.

How are the various factor-type feature variables distributed?

# table of region
table(insurance$region)
## 
## northeast northwest southeast southwest 
##       324       325       364       325
# table of sex
table(insurance$sex)
## 
## female   male 
##    662    676
# table of smoking habit
table(insurance$smoker)
## 
##   no  yes 
## 1064  274

A correlation matrix provides a quick overview of the relationships between each independent variable and the dependent variable, as well as the relationships among the independent variables.

cor(insurance[c("age", "bmi", "children", "expenses")])
##                age        bmi   children   expenses
## age      1.0000000 0.10934101 0.04246900 0.29900819
## bmi      0.1093410 1.00000000 0.01264471 0.19857626
## children 0.0424690 0.01264471 1.00000000 0.06799823
## expenses 0.2990082 0.19857626 0.06799823 1.00000000

A scatterplot will bring these numbers TO LIFE

It’ll make it much easier to visualize the relationships. My eyes are glazing over looking at that matrix.

# visualizing relationships among features: scatterplot matrix
pairs(insurance[c("age", "bmi", "children", "expenses")])

Ahh. That's better. I see an interesting correlation between expenses and age. As age increases, so do insurance expenses. However, there are three bands in the scatterplot, which may indicate that another, unaccounted-for factor is influencing this relationship.

At first glance, the straight lines seem weird when the children variable is involved. This makes sense, though, as the number of children is a discrete integer: you cannot have 1.67 children, but you can have 1 or 2.

A better scatterplot matrix you say? Don’t mind if I do.

# more informative scatterplot matrix
#install.packages("psych")
library(psych)
## Warning: package 'psych' was built under R version 4.2.2
pairs.panels(insurance[c("age", "bmi", "children", "expenses")])

In R, there are dozens of ways of doing the same thing. Packages are open source, so there are as many packages available as there are people willing to design them.

# install and load GGally for a better visual
# install.packages("GGally")
library(GGally)
## Warning: package 'GGally' was built under R version 4.2.2
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.2.2
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
ggpairs(insurance[c("age", "bmi", "children", "expenses")])

Step 3 – Training a Model on the Data

The following command fits a linear regression model relating the six independent variables to total medical expenses.

ins_model <- lm(expenses ~ age + children + bmi + sex + smoker + region,
data = insurance)
ins_model <- lm(expenses ~ ., data = insurance) # this is equivalent to above
# see the estimated beta coefficients
ins_model
## 
## Call:
## lm(formula = expenses ~ ., data = insurance)
## 
## Coefficients:
##     (Intercept)              age          sexmale              bmi  
##        -11941.6            256.8           -131.4            339.3  
##        children        smokeryes  regionnorthwest  regionsoutheast  
##           475.7          23847.5           -352.8          -1035.6  
## regionsouthwest  
##          -959.3

\[ \hat{Y} = -11941.6 + 256.8 \cdot age - 131.4 \cdot sexmale + 339.3 \cdot bmi + 475.7 \cdot children + 23847.5 \cdot smokeryes - 352.8 \cdot regionNW - 1035.6 \cdot regionSE - 959.3 \cdot regionSW \] I'm not sure why I just typed that out… it looks horrible. But there's your model.

Intercept Interpretation

Intercept is the predicted value of expenses when the independent variables are equal to zero.

Often the intercept is of little value alone because it is impossible to have values of zero for all features. – For example, since no person exists with age zero and BMI zero, the intercept has no real-world interpretation. – In practice, the intercept is often ignored

Beta Coefficient Interpretation

Beta coefficients indicate the estimated increase in expenses for an increase of one in each of the features, assuming all other values are held constant.

For instance, for each additional year of age, we would expect $256.80 higher medical expenses on average, assuming everything else is equal. – Each additional child results in an average of $475.70 in additional medical expenses each year, assuming everything else is equal. – Each unit increase in BMI is associated with an average increase of $339.30 in yearly medical expenses, all else equal.

Dummy Coding

Dummy coding allows a nominal feature to be treated as numeric by creating a binary variable, often called a dummy variable, for each category of the feature.

The dummy variable is set to 1 if the observation falls into the specified category or 0 otherwise. For instance: – The sex feature has two categories: male and female. – This will be split into two binary variables, which R names sexmale and sexfemale. – For observations where sex = male, then sexmale = 1 and sexfemale = 0; conversely, if sex = female, then sexmale = 0 and sexfemale = 1.

However, when there are more than two groups, things are a bit more complicated. R splits the four-category feature region into four dummy variables: regionnorthwest, regionsoutheast, regionsouthwest, and regionnortheast.
• When adding dummy variables to a regression model, one category is always left out to serve as the reference category.
• The estimates are then interpreted relative to the reference.
• In our model, R automatically held out the sexfemale, smokerno, and regionnortheast variables, making female non-smokers in the northeast region the reference group.
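As an optional sketch (not part of the assignment output), model.matrix() shows the dummy columns R builds behind the scenes, and relevel() on a factor changes which category is held out as the reference:

# inspect the dummy coding R creates for region (northeast is the reference level)
head(model.matrix(~ region, data = insurance))

# to make a different category the baseline, relevel the factor, e.g.:
# insurance$region <- relevel(factor(insurance$region), ref = "southeast")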

Some interpretations of the model:

ins_model
## 
## Call:
## lm(formula = expenses ~ ., data = insurance)
## 
## Coefficients:
##     (Intercept)              age          sexmale              bmi  
##        -11941.6            256.8           -131.4            339.3  
##        children        smokeryes  regionnorthwest  regionsoutheast  
##           475.7          23847.5           -352.8          -1035.6  
## regionsouthwest  
##          -959.3

– Males have $131.40 lower medical expenses each year relative to females. – Smokers cost an average of $23,847.50 more than non-smokers per year. – The coefficient for each of the three regions in the model is negative, which implies that the reference group, the northeast region, tends to have the highest average expenses.

The results of the linear regression model make logical sense: – Old age, smoking, and obesity tend to be linked to additional health issues, – Additional family member dependents may result in an increase in physician visits and preventive care such as vaccinations and yearly physical exams.

Great. Now our next question: how well is the model fitting the data? We must evaluate the model's performance.

Step 4 – Evaluating Model Performance

summary(ins_model)
## 
## Call:
## lm(formula = expenses ~ ., data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11302.7  -2850.9   -979.6   1383.9  29981.7 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -11941.6      987.8 -12.089  < 2e-16 ***
## age                256.8       11.9  21.586  < 2e-16 ***
## sexmale           -131.3      332.9  -0.395 0.693255    
## bmi                339.3       28.6  11.864  < 2e-16 ***
## children           475.7      137.8   3.452 0.000574 ***
## smokeryes        23847.5      413.1  57.723  < 2e-16 ***
## regionnorthwest   -352.8      476.3  -0.741 0.458976    
## regionsoutheast  -1035.6      478.7  -2.163 0.030685 *  
## regionsouthwest   -959.3      477.9  -2.007 0.044921 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6062 on 1329 degrees of freedom
## Multiple R-squared:  0.7509, Adjusted R-squared:  0.7494 
## F-statistic: 500.9 on 8 and 1329 DF,  p-value: < 2.2e-16

The residuals, coefficients, and multiple R-squared value can be used to evaluate the performance (fit) of the model.

Evaluation of Residuals

The Residuals section provides summary statistics for the errors in our predictions.
– Since a residual is equal to the true value minus the predicted value, the maximum error of 29981.7 suggests that the model under-predicted expenses by nearly $30,000 for at least one observation.
– 50% of errors fall between the 1Q and 3Q values (the first and third quartiles).
– The majority of predictions were between $2,850.90 over the true value and $1,383.90 under the true value.

Evaluation of Regression Coefficients

For each estimated regression coefficient, the p-value, denoted Pr(>|t|), provides an estimate of the probability that the true coefficient is zero given the value of the estimate.
• Small p-values suggest that the true coefficient is very unlikely to be zero, which means that the feature is extremely unlikely to have no relationship with the dependent variable.
• Note that some of the p-values have stars (***), which correspond to the footnotes indicating the significance level met by the estimate.
• This level is a threshold, chosen prior to building the model, used to indicate "real" findings as opposed to those due to chance alone; p-values less than the significance level are considered statistically significant.
• If the model had few such terms, it may be cause for concern, since this would indicate that the features used are not very predictive of the outcome.
• Here, our model has several highly significant variables, and they seem to be related to the outcome in logical ways.
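If you want the coefficient table programmatically rather than reading it off the printed summary, it can be pulled from the summary object (a small optional aside, using base R only):

# extract the coefficient table: estimates, std. errors, t values, and p-values
coef_table <- summary(ins_model)$coefficients
coef_table[, "Pr(>|t|)"]   # just the p-values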

Evaluation of the Multiple R-squared Value

Multiple R-squared value
– Also called the coefficient of determination.
– Provides a measure of how well our model as a whole explains the values of the dependent variable.
– The closer the value is to 1.0, the better the model explains the data.
• The multiple R-squared of 0.7509 (adjusted 0.7494) suggests our model explains roughly 75% of the variation in the dependent variable.
• The adjusted R-squared value corrects R-squared by penalizing models with a large number of independent variables, because models with more features always explain more variation.
• It is useful for comparing the performance of models with different numbers of explanatory variables.
• Our model is performing fairly well based on these indicators.
• It is not uncommon for regression models of real-world data to have fairly low R-squared values; a value of 0.75 is actually quite good.
• The size of some of the errors is a bit concerning, but not surprising given the nature of medical expense data.
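As an optional sanity check (not required), R-squared can be reproduced by hand as one minus the ratio of the residual sum of squares to the total sum of squares:

# R-squared = 1 - SSE/SST; should match the Multiple R-squared from summary()
sse <- sum(residuals(ins_model)^2)
sst <- sum((insurance$expenses - mean(insurance$expenses))^2)
1 - sse / sst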

Step 5 – Improving Model Performance

If we define the model differently, it might perform better.

A key difference between regression modeling and other ML approaches is that regression leaves feature selection and model specification to the user. Therefore, subject matter knowledge is important for optimizing model performance. If we have an idea of how a feature is related to the outcome, we can use this to inform the model specification and improve performance.

Model Specification – Adding Non-linear Relationships

In linear regression, the relationship between an independent variable and the dependent variable is assumed to be linear. However, this isn't always true.

The relationship between age and medical expenditure may not be constant across all age values; treatment can be disproportionately expensive for the elderly.

Typical regression equation: \(y = \alpha + \beta x\). Non-linear relationship: \(y = \alpha + \beta_1 x + \beta_2 x^2\)

The difference between these two models is that an additional beta will be estimated, which is intended to capture the effect of the x-squared term.
• This allows the impact of age to be measured as a function of age squared.
• To add the non-linear age term to the model, we simply need to create a new variable:

# add a higher-order "age" term
insurance$age2 <- insurance$age^2

Then, when we produce our improved model, we'll add both age and age2 to the lm() formula, as in expenses ~ age + age2. This will allow the model to separate the linear and non-linear impact of age on medical expenses.

Transformation – Converting a Numeric Variable to a Binary Indicator

Suppose we have a feature whose effect is not cumulative; rather, it has an effect only after a specific threshold has been reached.
– For instance, BMI may have zero impact on medical expenditures for individuals in the normal weight range, but it may be strongly related to higher costs for the obese (that is, BMI of 30 or above).
• We can model this relationship by creating a binary obesity indicator variable, that is, 1 if the BMI is at least 30 and 0 if less.
• The estimated beta for this binary feature would then indicate the average net impact on medical expenses for individuals with BMI of 30 or above, relative to those with BMI less than 30.
• We can use the ifelse() function to create the feature. For BMI greater than or equal to 30, we will return 1, otherwise 0:

# add an indicator for BMI >= 30
insurance$bmi30 <- ifelse(insurance$bmi >= 30, 1, 0)

We can then include the bmi30 variable in our improved model, either replacing the original bmi variable or in addition, depending on whether or not we think the effect of obesity occurs in addition to a separate linear BMI effect. Without a good reason to do otherwise, we'll include both in our final model.

Model Specification – Adding Interaction Effects

• What if certain features have a combined impact on the dependent variable? For instance, smoking and obesity may have harmful effects separately, but it is reasonable to assume that their combined effect may be worse than the sum of each one alone.
• When two features have a combined effect, this is known as an interaction.
• If we suspect that two variables interact, we can test this hypothesis by adding their interaction to the model.
• Interaction effects are specified using the R formula syntax. To have the obesity indicator (bmi30) and the smoking indicator (smoker) interact, we would write a formula in the form expenses ~ bmi30*smoker
• The * operator is shorthand that instructs R to model expenses ~ bmi30 + smokeryes + bmi30:smokeryes
• The : operator in the expanded form indicates that bmi30:smokeryes is the interaction between the two variables.
• Note that the expanded form automatically includes the bmi30 and smoker main effects as well as the interaction, as shown in the sketch below.
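A minimal sketch of that equivalence (just the interaction piece, before building the full final model):

# the * shorthand fits the main effects plus the interaction
lm(expenses ~ bmi30 * smoker, data = insurance)

# spelled out, this is equivalent to:
# lm(expenses ~ bmi30 + smoker + bmi30:smoker, data = insurance)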

Let's put it all together: add the non-linear term for age, create the indicator for obesity, and specify an interaction between obesity and smoking.

# create final model
ins_model2 <- lm(expenses ~ age + age2 + children + bmi + sex + 
                   bmi30*smoker + region, data = insurance)
summary(ins_model2)
## 
## Call:
## lm(formula = expenses ~ age + age2 + children + bmi + sex + bmi30 * 
##     smoker + region, data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17297.1  -1656.0  -1262.7   -727.8  24161.6 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       139.0053  1363.1359   0.102 0.918792    
## age               -32.6181    59.8250  -0.545 0.585690    
## age2                3.7307     0.7463   4.999 6.54e-07 ***
## children          678.6017   105.8855   6.409 2.03e-10 ***
## bmi               119.7715    34.2796   3.494 0.000492 ***
## sexmale          -496.7690   244.3713  -2.033 0.042267 *  
## bmi30            -997.9355   422.9607  -2.359 0.018449 *  
## smokeryes       13404.5952   439.9591  30.468  < 2e-16 ***
## regionnorthwest  -279.1661   349.2826  -0.799 0.424285    
## regionsoutheast  -828.0345   351.6484  -2.355 0.018682 *  
## regionsouthwest -1222.1619   350.5314  -3.487 0.000505 ***
## bmi30:smokeryes 19810.1534   604.6769  32.762  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4445 on 1326 degrees of freedom
## Multiple R-squared:  0.8664, Adjusted R-squared:  0.8653 
## F-statistic: 781.7 on 11 and 1326 DF,  p-value: < 2.2e-16

Make predictions and plot the improved regression model

insurance$pred <- predict(ins_model2, insurance)
cor(insurance$pred, insurance$expenses)
## [1] 0.9307999
plot(insurance$pred, insurance$expenses,  main="improved regression model")
abline(a = 0, b = 1, col = "red", lwd = 3, lty = 2)

Evaluate the improved model

summary(ins_model2)
## 
## Call:
## lm(formula = expenses ~ age + age2 + children + bmi + sex + bmi30 * 
##     smoker + region, data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17297.1  -1656.0  -1262.7   -727.8  24161.6 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       139.0053  1363.1359   0.102 0.918792    
## age               -32.6181    59.8250  -0.545 0.585690    
## age2                3.7307     0.7463   4.999 6.54e-07 ***
## children          678.6017   105.8855   6.409 2.03e-10 ***
## bmi               119.7715    34.2796   3.494 0.000492 ***
## sexmale          -496.7690   244.3713  -2.033 0.042267 *  
## bmi30            -997.9355   422.9607  -2.359 0.018449 *  
## smokeryes       13404.5952   439.9591  30.468  < 2e-16 ***
## regionnorthwest  -279.1661   349.2826  -0.799 0.424285    
## regionsoutheast  -828.0345   351.6484  -2.355 0.018682 *  
## regionsouthwest -1222.1619   350.5314  -3.487 0.000505 ***
## bmi30:smokeryes 19810.1534   604.6769  32.762  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4445 on 1326 degrees of freedom
## Multiple R-squared:  0.8664, Adjusted R-squared:  0.8653 
## F-statistic: 781.7 on 11 and 1326 DF,  p-value: < 2.2e-16

The R-squared value has improved from 0.75 to 0.87. That's a fantastic improvement. Our model now explains about 87% of the variation in medical treatment costs.

Make some predictions

# Predict Male
predict(ins_model2, data.frame(age = 30,
                               age2 = 30^2,
                               children = 2,
                               bmi = 30, 
                               sex = "male", 
                               bmi30 = 1, 
                               smoker = "no", 
                               region = "northeast"))
##        1 
## 5973.774
# Predict Female
predict(ins_model2, data.frame(age = 30,
                               age2 = 30^2,
                               children = 2,
                               bmi = 30, 
                               sex = "female", 
                               bmi30 = 1, 
                               smoker = "no", 
                               region = "northeast"))
##        1 
## 6470.543

Keeping all other values constant, we compared sex = male versus female. According to the model, the male prediction should be $496.77 (the sexmale coefficient) less than the female prediction.

Using the predict() function, we see that this is true: the predictions are consistent with the coefficient.
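A quick arithmetic check against the two predictions above:

# female prediction minus male prediction; matches the magnitude of the sexmale coefficient
6470.543 - 5973.774
## [1] 496.769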

predict(ins_model2, data.frame(age = 30,
                               age2 = 30^2,
                               children = 0,
                               bmi = 30, 
                               sex = "female", 
                               bmi30 = 1, 
                               smoker = "no", 
                               region = "northeast"))
##       1 
## 5113.34

The interaction between obesity and smoking suggests a massive effect

smokeryes:       estimate 13404.5952, std. error 439.9591, t = 30.468, p < 2e-16
bmi30:smokeryes: estimate 19810.1534, std. error 604.6769, t = 32.762, p < 2e-16

– Increased costs of over $13,404 for smoking alone. – Obese smokers spend another $19,810 per year. This suggests that smoking exacerbates diseases associated with obesity, thus driving insurance costs higher.
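Putting the relevant coefficients together as a rough back-of-the-envelope sum (holding everything else constant), an obese smoker is predicted to cost about

\[ -997.94\ (\text{bmi30}) + 13404.60\ (\text{smokeryes}) + 19810.15\ (\text{bmi30:smokeryes}) \approx \$32{,}216.81 \]

more per year than an otherwise similar non-obese non-smoker.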