
This analysis fits a stepwise linear regression model to predict systolic blood pressure. Seven numeric and categorical features are included: Age, Height, Weight, Gender, Hospital Location (CGH - County General Hospital, SMMC - St. Mary’s Medical Center, VA - VA Hospital), Self-Assessed Health Status (Excellent, Fair, Good, Poor), and Smoker. The regression takes the general form:

\[\begin{equation} \hat{Systolic} = \theta_1(Age) + \theta_2(Height) + \theta_3(Weight) + \cdots \\ + \theta_{k-2}(HealthFair) + \theta_{k-1}(HealthPoor) + \theta_k(Smoker) + b \end{equation}\]

This is just one of the five machine learning modeling guides you can find here.

Overview

This analysis addresses the following requirements:

  1. Load the patient self-evaluation dataset.

  2. Linear regression is used to predict continuous values. Use a linear regression model on:
  • Age
  • Gender
  • Height
  • Weight
  • Smoker
  • Location
  • SelfAssessedHealthStatus
  • Systolic blood pressure (target)

  3. Report the regression coefficients (thetas).

  4. Create a reduced model using stepwise regression.

Pre-Modeling

Load Required Packages

There are two ways to load the required packages.

  1. Install pacman using the following code.
#install.packages("pacman")
#library("pacman")
  2. Or use this function, which installs pacman only if it is missing. If it does not work for you, fall back to the code above.
#Check for and install required packages, using pacman
if (!require("pacman")) install.packages("pacman")
## Loading required package: pacman
pacman::p_load(base64enc, ggplot2, kableExtra)

Load Data

The dataset we will be loading appears as:

(Document preview image of patients.csv; the full table is reproduced in the Data section below.)

patients <- read.csv("C:/Users/G/Google Drive/aStThomas/6MachineLearning/Assignments/R/2 linear regression/patients.csv", header = TRUE)

Preview Data

Examine the data structure.

#Preview structure
str(patients)
## 'data.frame':    100 obs. of  10 variables:
##  $ Age                     : int  38 43 38 40 49 46 33 40 28 31 ...
##  $ Diastolic               : int  93 77 83 75 80 70 88 82 78 86 ...
##  $ Gender                  : Factor w/ 2 levels "'Female'","'Male'": 2 2 1 1 1 1 1 2 2 1 ...
##  $ Height                  : int  71 69 64 67 64 68 64 68 68 66 ...
##  $ LastName                : Factor w/ 100 levels "'Adams'","'Alexander'",..: 84 45 96 46 11 22 55 97 57 86 ...
##  $ Location                : Factor w/ 3 levels "'County General Hospital'",..: 1 3 2 3 1 2 3 3 2 1 ...
##  $ SelfAssessedHealthStatus: Factor w/ 4 levels "'Excellent'",..: 1 2 3 2 3 3 3 3 1 1 ...
##  $ Smoker                  : int  1 0 0 0 0 0 1 0 0 0 ...
##  $ Systolic                : int  124 109 125 117 122 121 130 115 115 118 ...
##  $ Weight                  : int  176 163 131 133 119 142 142 180 183 132 ...

Examine the top 5 rows.

#Preview top 5 rows
head(patients, n=5)
##   Age Diastolic   Gender Height   LastName                    Location
## 1  38        93   'Male'     71    'Smith'   'County General Hospital'
## 2  43        77   'Male'     69  'Johnson'               'VA Hospital'
## 3  38        83 'Female'     64 'Williams' 'St. Mary's Medical Center'
## 4  40        75 'Female'     67    'Jones'               'VA Hospital'
## 5  49        80 'Female'     64    'Brown'   'County General Hospital'
##   SelfAssessedHealthStatus Smoker Systolic Weight
## 1              'Excellent'      1      124    176
## 2                   'Fair'      0      109    163
## 3                   'Good'      0      125    131
## 4                   'Fair'      0      117    133
## 5                   'Good'      0      122    119

Preprocessing

Each attribute contains 100 observations; there are no missing values. Therefore, we do not need to fill missing values with mean/mode, or drop any columns/rows.
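A minimal sketch of that check (my addition; not part of the original workflow):

#Count missing values per column; all zeros confirms nothing needs imputing
colSums(is.na(patients))

#Number of fully observed rows; should equal 100
sum(complete.cases(patients))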

Now remove the unwanted Diastolic and LastName columns from the patients table.

patientsOriginal <- patients

df <- patients[-c(2, 5)] #drop columns 2 (Diastolic) and 5 (LastName)

Split the dataframe into categorical and numeric subsets.

df_categorical <- df[,c("Gender","Location","SelfAssessedHealthStatus", "Smoker")]
df_numeric <- df[,c("Age","Height", "Systolic", "Weight")]

One-hot encode categorical columns.

#Check for and install required packages, using pacman
if (!require("pacman")) install.packages("pacman")
pacman::p_load(dummies)

#Convert Location to one-hot
#sep inserts a separator into the generated column names
df_categorical <- dummy.data.frame(df_categorical, names=c("Location", "SelfAssessedHealthStatus", "Gender"), sep="-")

df_categorical <- df_categorical[-c(1)] #drop the Female dummy; the Male column alone encodes gender (Male = 1, Female = 0)
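The dummies package has since been archived on CRAN, so as a hedged alternative, base R’s model.matrix() can produce the same kind of dummy columns; a sketch under that assumption (note that model.matrix() drops the first level of each factor by default, which also sidesteps the singularities reported in the full model below):

#Expand the factors into dummy columns; [, -1] removes the intercept column
onehot <- model.matrix(~ Gender + Location + SelfAssessedHealthStatus, data = patients)[, -1]
df_categorical_alt <- cbind(as.data.frame(onehot), Smoker = patients$Smoker)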

Standardize the numeric columns, then recombine the two dataframes in order.

An additional interesting discussion on when to standardize is here.

scaled_numericdf <- scale(df_numeric)
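As a quick verification (my addition), each scaled column should now have mean approximately 0 and standard deviation 1:

round(colMeans(scaled_numericdf), 10) #means should all be ~0
apply(scaled_numericdf, 2, sd)        #standard deviations should all be 1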

Recombine the forked categorical and numeric dataframes using a column bind.

df <- cbind(scaled_numericdf, df_categorical)

Plot histograms

Plot histogram of numeric columns. For bin specification, see here.

For plotting multiples, see here.

#Check for and install required packages, using pacman
if (!require("pacman")) install.packages("pacman")
pacman::p_load(ggplot2, gridExtra)

AgePlot <- ggplot(data=df, aes(x=Age)) +
                  geom_histogram(bins = 20,
                  col="black",
                  fill="blue",
                  alpha = .3) +
                  labs(title="Histogram for Age", x="Age", y="Count")

HeightPlot <- ggplot(data=df, aes(x=Height)) +
                  geom_histogram(bins = 20,
                  col="black",
                  fill="blue",
                  alpha = .3) +
                  labs(title="Histogram for Height", x="Height", y="Count")

WeightPlot <- ggplot(data=df, aes(x=Weight)) +
                  geom_histogram(bins = 20,
                  col="black",
                  fill="blue",
                  alpha = .3) +
                  labs(title="Histogram for Weight", x="Weight", y="Count")

SystolicPlot <- ggplot(data=df, aes(x=Systolic)) +
                  geom_histogram(bins = 20,
                  col="black",
                  fill="green",
                  alpha = .3) +
                  labs(title="Histogram for Systolic", x="Systolic", y="Count")

grid.arrange(AgePlot, HeightPlot, WeightPlot, SystolicPlot, nrow=2)

The Weight distribution is a bit unusual; you would expect it to be roughly normal.

Optional: Explore the dataset using a scatterplot.
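A minimal sketch of that optional exploration (my addition), using base R’s pairs() on the numeric columns:

#Scatterplot matrix of every numeric attribute against every other
pairs(df_numeric, main = "Pairwise scatterplots of numeric attributes")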

Rename column headers for easier interpretation and reference.

names(df)[5] <- "Male"
names(df)[6] <- "Location1"
names(df)[7] <- "Location2"
names(df)[8] <- "Location3"
names(df)[9] <- "HealthExcellent"
names(df)[10] <- "HealthFair"
names(df)[11] <- "HealthGood"
names(df)[12] <- "HealthPoor"
names(df)[13] <- "Smoker"

Modeling

Having completed the pre-processing and data exploration phases, we now move onto building a multiple linear regression model.

# fit <- lm(Systolic ~ Age + Height + Weight + Male + Location1 + Location2 + Location3 + HealthExcellent + HealthFair + HealthGood + HealthPoor + Smoker, data = df)

#Finding the best model using Forward / Backward Stepwise Regression
full.model <- lm(Systolic ~ ., data = df) #the '.' means use all remaining columns

summary(full.model)
## 
## Call:
## lm(formula = Systolic ~ ., data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.2853 -0.4662 -0.1028  0.4681  1.7636 
## 
## Coefficients: (2 not defined because of singularities)
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -0.43112    0.33094  -1.303   0.1960    
## Age              0.08627    0.07201   1.198   0.2341    
## Height           0.19844    0.10729   1.850   0.0677 .  
## Weight          -0.05311    0.23105  -0.230   0.8187    
## Male            -0.22038    0.48649  -0.453   0.6516    
## Location1        0.25844    0.16881   1.531   0.1293    
## Location2        0.13084    0.19193   0.682   0.4972    
## Location3             NA         NA      NA       NA    
## HealthExcellent -0.06843    0.24970  -0.274   0.7847    
## HealthFair      -0.47823    0.28414  -1.683   0.0959 .  
## HealthGood       0.01892    0.24150   0.078   0.9377    
## HealthPoor            NA         NA      NA       NA    
## Smoker           1.44098    0.15581   9.249 1.15e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.702 on 89 degrees of freedom
## Multiple R-squared:  0.5569, Adjusted R-squared:  0.5071 
## F-statistic: 11.19 on 10 and 89 DF,  p-value: 3.894e-12

Here the model formula follows R’s general syntax: response ~ op1 term1 op2 term2 … opn termn

Optional: Explore the dataset using the squared error, with the code: plot_ss(x = df$Age, y = df$Systolic, showSquares = TRUE), which I found here.
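If plot_ss() (a helper from an external teaching script) is unavailable, the same squared-error quantity can be computed directly in base R; a minimal sketch, my addition:

#Sum of squared residuals for a one-predictor model of Systolic on Age
simple.fit <- lm(Systolic ~ Age, data = df)
sum(residuals(simple.fit)^2)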

Optimize model

We can optimize this model using stepwise regression, searching for a model that fits well while remaining parsimonious.

reduced.model <- step(full.model, direction = "backward")
## Start:  AIC=-60.4
## Systolic ~ Age + Height + Weight + Male + Location1 + Location2 + 
##     Location3 + HealthExcellent + HealthFair + HealthGood + HealthPoor + 
##     Smoker
## 
## 
## Step:  AIC=-60.4
## Systolic ~ Age + Height + Weight + Male + Location1 + Location2 + 
##     Location3 + HealthExcellent + HealthFair + HealthGood + Smoker
## 
## 
## Step:  AIC=-60.4
## Systolic ~ Age + Height + Weight + Male + Location1 + Location2 + 
##     HealthExcellent + HealthFair + HealthGood + Smoker
## 
##                   Df Sum of Sq    RSS     AIC
## - HealthGood       1     0.003 43.869 -62.397
## - Weight           1     0.026 43.892 -62.345
## - HealthExcellent  1     0.037 43.902 -62.320
## - Male             1     0.101 43.967 -62.174
## - Location2        1     0.229 44.095 -61.883
## - Age              1     0.707 44.573 -60.805
## <none>                         43.865 -60.404
## - Location1        1     1.155 45.021 -59.805
## - HealthFair       1     1.396 45.262 -59.271
## - Height           1     1.686 45.552 -58.633
## - Smoker           1    42.158 86.023   4.945
## 
## Step:  AIC=-62.4
## Systolic ~ Age + Height + Weight + Male + Location1 + Location2 + 
##     HealthExcellent + HealthFair + Smoker
## 
##                   Df Sum of Sq    RSS     AIC
## - Weight           1     0.026 43.895 -64.338
## - Male             1     0.101 43.969 -64.168
## - HealthExcellent  1     0.122 43.991 -64.119
## - Location2        1     0.234 44.102 -63.866
## - Age              1     0.722 44.590 -62.765
## <none>                         43.869 -62.397
## - Location1        1     1.159 45.028 -61.789
## - Height           1     1.683 45.552 -60.633
## - HealthFair       1     2.606 46.475 -58.626
## - Smoker           1    42.566 86.434   3.421
## 
## Step:  AIC=-64.34
## Systolic ~ Age + Height + Male + Location1 + Location2 + HealthExcellent + 
##     HealthFair + Smoker
## 
##                   Df Sum of Sq    RSS     AIC
## - HealthExcellent  1     0.163 44.058 -65.967
## - Location2        1     0.243 44.138 -65.786
## - Age              1     0.713 44.607 -64.727
## <none>                         43.895 -64.338
## - Male             1     1.104 44.999 -63.853
## - Location1        1     1.285 45.180 -63.452
## - Height           1     1.690 45.585 -62.560
## - HealthFair       1     2.586 46.481 -60.614
## - Smoker           1    42.687 86.581   1.591
## 
## Step:  AIC=-65.97
## Systolic ~ Age + Height + Male + Location1 + Location2 + HealthFair + 
##     Smoker
## 
##              Df Sum of Sq    RSS     AIC
## - Location2   1     0.331 44.389 -67.218
## - Age         1     0.701 44.759 -66.389
## <none>                    44.058 -65.967
## - Male        1     0.991 45.049 -65.743
## - Location1   1     1.330 45.387 -64.993
## - Height      1     1.565 45.623 -64.476
## - HealthFair  1     2.450 46.508 -62.555
## - Smoker      1    43.571 87.629   0.794
## 
## Step:  AIC=-67.22
## Systolic ~ Age + Height + Male + Location1 + HealthFair + Smoker
## 
##              Df Sum of Sq    RSS     AIC
## - Age         1     0.614 45.003 -67.843
## <none>                    44.389 -67.218
## - Location1   1     1.007 45.396 -66.976
## - Male        1     1.042 45.431 -66.898
## - Height      1     1.552 45.941 -65.782
## - HealthFair  1     2.710 47.099 -63.292
## - Smoker      1    43.243 87.632  -1.203
## 
## Step:  AIC=-67.84
## Systolic ~ Height + Male + Location1 + HealthFair + Smoker
## 
##              Df Sum of Sq    RSS     AIC
## <none>                    45.003 -67.843
## - Male        1     1.068 46.071 -67.498
## - Location1   1     1.133 46.137 -67.356
## - Height      1     1.748 46.752 -66.032
## - HealthFair  1     2.677 47.681 -64.065
## - Smoker      1    43.577 88.580  -2.126
summary(reduced.model)
## 
## Call:
## lm(formula = Systolic ~ Height + Male + Location1 + HealthFair + 
##     Smoker, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.2726 -0.5124 -0.0512  0.3933  1.8015 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.3623     0.1351  -2.681  0.00868 ** 
## Height        0.1983     0.1038   1.911  0.05905 .  
## Male         -0.3101     0.2076  -1.494  0.13864    
## Location1     0.2255     0.1465   1.539  0.12725    
## HealthFair   -0.4643     0.1963  -2.365  0.02010 *  
## Smoker        1.4404     0.1510   9.540 1.74e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6919 on 94 degrees of freedom
## Multiple R-squared:  0.5454, Adjusted R-squared:  0.5212 
## F-statistic: 22.56 on 5 and 94 DF,  p-value: 8.427e-15
plot(reduced.model)
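plot(reduced.model) includes a residuals-versus-leverage panel; influential observations can also be ranked numerically with Cook’s distance. A short sketch (my addition):

#Cook's distance per observation; values far above the rest deserve inspection
cd <- cooks.distance(reduced.model)
head(sort(cd, decreasing = TRUE), 5) #the five most influential rows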

AIC / BIC Model Comparison

When comparing models fitted by maximum likelihood to the same data, the smaller the AIC or BIC, the better the fit. For more information on model selection criteria see R Documentation and this.

ModelComparison_AIC <- AIC(full.model, reduced.model)

print(ModelComparison_AIC)
##               df      AIC
## full.model    12 225.3834
## reduced.model  7 217.9447
ModelComparison_BIC <- BIC(full.model, reduced.model)

print(ModelComparison_BIC)
##               df      BIC
## full.model    12 256.6455
## reduced.model  7 236.1809
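Because BIC penalizes complexity more heavily than AIC, the stepwise search can also be rerun with k = log(n) in place of the default AIC penalty of 2; a hedged sketch (my addition):

#BIC-based backward selection; trace = FALSE suppresses the step-by-step log
bic.model <- step(full.model, direction = "backward", k = log(nrow(df)), trace = FALSE)
summary(bic.model)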

Results

A full and a reduced regression model were constructed. Performance differed between the full model, which includes all variables, and the reduced model, which includes a limited set. The reduced model was built using the backward stepwise method to find the most useful predictors of the target Systolic variable. The reduced model achieved a lower p-value, a lower AIC, and a lower BIC.

This reduced model retained the following predictors: Height, Gender (Male), Hospital Location (County General Hospital), Health Status (Fair), and Smoker. Of these, Smoker and HealthFair are statistically significant at the 5% level.

\[\begin{equation} \hat{Systolic} = 0.20(Height) - 0.31(Male) + 0.23(CountyGeneralHospital) \\ - 0.46(HealthFair) + 1.44(Smoker) - 0.36 \end{equation}\]
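To generate a prediction from the reduced model, remember that the inputs and the target were standardized, so new values must be supplied in standardized units and the output converted back; a sketch with a purely hypothetical patient (my addition):

#Hypothetical patient: average height (0 on the standardized scale), male,
#seen at County General Hospital, health status not 'Fair', and a smoker
newpatient <- data.frame(Height = 0, Male = 1, Location1 = 1, HealthFair = 0, Smoker = 1)
pred_z <- predict(reduced.model, newdata = newpatient)

#Convert the standardized prediction back to mmHg using the original Systolic scale
pred_z * sd(patientsOriginal$Systolic) + mean(patientsOriginal$Systolic)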

Of note, there is an interesting discussion as to whether stepwise regression should ever be used. Some traditional statisticians say it never should be, since you’re leaving model building purely to the math and not using human intelligence to determine which features to include. Data miners, on the other hand, use an alternative methodology where you rely entirely on the math to select features; only afterward do you determine whether the included features are appropriate.

It’s unsettling because the Age of Enlightenment was founded on the scientific method; Isaac Newton himself adopted it. We shouldn’t readily step off this proven path, should we? But consider this case for why we should. When you decide which variables explain the dependent variable (here, Systolic), you introduce your own bias. The effect is that of wearing a pair of horse blinders: you dismiss factors you don’t believe have any impact and include only those you believe important. But correlations and causations sometimes surprise us. Herein lies my own case for ex post facto feature selection, and with it, stepwise regression.

R

#install.packages("pacman")
#library("pacman")

#Check for and install required packages, using pacman
if (!require("pacman")) install.packages("pacman")
pacman::p_load(base64enc, ggplot2, kableExtra)

patients <- read.csv("C:/Users/G/Google Drive/aStThomas/6MachineLearning/Assignments/R/2 linear regression/patients.csv", header = TRUE)

#Preview structure
str(patients)

#Preview top 5 rows
head(patients, n=5)

patientsOriginal <- patients

df <- patients[-c(2, 5)] #drop columns 2 (Diastolic) and 5 (LastName)

df_categorical <- df[,c("Gender","Location","SelfAssessedHealthStatus", "Smoker")]
df_numeric <- df[,c("Age","Height", "Systolic", "Weight")]

#Check for and install required packages, using pacman
if (!require("pacman")) install.packages("pacman")
pacman::p_load(dummies)

#Convert Location to one-hot
#sep inserts a separator into the generated column names
df_categorical <- dummy.data.frame(df_categorical, names=c("Location", "SelfAssessedHealthStatus", "Gender"), sep="-")

df_categorical <- df_categorical[-c(1)] #drop the Female dummy; the Male column alone encodes gender (Male = 1, Female = 0)

scaled_numericdf <- scale(df_numeric)

df <- cbind(scaled_numericdf, df_categorical)

#Check for and install required packages, using pacman
if (!require("pacman")) install.packages("pacman")
pacman::p_load(ggplot2, gridExtra)

AgePlot <- ggplot(data=df, aes(x=Age)) +
                  geom_histogram(bins = 20,
                  col="black",
                  fill="blue",
                  alpha = .3) +
                  labs(title="Histogram for Age", x="Age", y="Count")

HeightPlot <- ggplot(data=df, aes(x=Height)) +
                  geom_histogram(bins = 20,
                  col="black",
                  fill="blue",
                  alpha = .3) +
                  labs(title="Histogram for Height", x="Height", y="Count")

WeightPlot <- ggplot(data=df, aes(x=Weight)) +
                  geom_histogram(bins = 20,
                  col="black",
                  fill="blue",
                  alpha = .3) +
                  labs(title="Histogram for Weight", x="Weight", y="Count")

SystolicPlot <- ggplot(data=df, aes(x=Systolic)) +
                  geom_histogram(bins = 20,
                  col="black",
                  fill="green",
                  alpha = .3) +
                  labs(title="Histogram for Systolic", x="Systolic", y="Count")

grid.arrange(AgePlot, HeightPlot, WeightPlot, SystolicPlot, nrow=2)

names(df)[5] <- "Male"
names(df)[6] <- "Location1" #County General Hospital
names(df)[7] <- "Location2" #St. Mary's Medical Center
names(df)[8] <- "Location3" #VA Hospital
names(df)[9] <- "HealthExcellent"
names(df)[10] <- "HealthFair"
names(df)[11] <- "HealthGood"
names(df)[12] <- "HealthPoor"
names(df)[13] <- "Smoker"

# fit <- lm(Systolic ~ Age + Height + Weight + Male + Location1 + Location2 + Location3 + HealthExcellent + HealthFair + HealthGood + HealthPoor + Smoker, data = df)

#Finding the best model using Forward / Backward Stepwise Regression
full.model <- lm(Systolic ~ ., data = df) #the '.' means use all remaining columns

summary(full.model)

reduced.model <- step(full.model, direction = "backward")

summary(reduced.model)

plot(reduced.model)

ModelComparison_AIC <- AIC(full.model, reduced.model)

print(ModelComparison_AIC)

ModelComparison_BIC <- BIC(full.model, reduced.model)

print(ModelComparison_BIC)

Python

# coding: utf-8

In[1]:

import pandas as pd

import numpy as np

In[2]:

Load data.

np.genfromtxt is a NumPy alternative; I prefer pandas’ read_csv with an explicit file path. Because there is a header row, set header=0.

patients = pd.read_csv(r"C:\tmp\patients.csv", header=0) #raw string so the backslashes are not treated as escape codes

Back up patients, just in case we need it later.

patientsBackup = patients.copy() #.copy() so later changes to patients do not alter the backup

### Preview the Data

We want to preview the data to see what we’ll be working with. This will display any missing values, as well.

In[3]:

#quick description of the data
patients.info()

#top 3 rows
patients.head(3)

Each attribute contains 100 observations; there are no missing values. Therefore, we do not need to fill missing values with mean/mode, or drop any columns/rows.

In[4]:

#show a summary of the numerical attributes
patients.describe(); #the semicolon suppresses the displayed return value in a notebook

In[5]:

#Histogram visualization; hist() relies on matplotlib
import matplotlib.pyplot as plot

%matplotlib inline

patients.hist(bins=20, figsize=(16,8))

plot.show()

##### Cross Correlation Check

The function previously provided in the Hands-On Machine Learning book was deprecated.

In[6]:

from pandas.plotting import scatter_matrix

#Note: this is not the latest dataframe
attributes = ["Age", "Diastolic", "Height", "Smoker", "Weight"]

scatter_matrix(patients[attributes], figsize=(16, 8))

The text output below is expected; see https://pandas.pydata.org/pandas-docs/stable/visualization.html.

Nothing especially interesting.

### Data Adjustments

#### First split the matrix into y (dependent) and x (independent)

Remember, Python is 0-offset! The “3rd” entry is at position 2.

patientsY = Diastolic

patientsX = everything else, excluding LastName and Systolic

A perfect, clear example of splitting y and x was found here and here.

The final solution on selecting multiple columns was found here.

The independent variables consist of numeric, categorical and binary datatypes. Each will be processed individually.

In[7]:

#split the dependent variable from the independent variables

#Earlier attempts, kept commented out for reference:
#patientsY = patients[patients.columns[1]]
#patientsY = patients.iloc[:, 1:1] #selects nothing as written; iloc slices exclude the end position

The clearest is this:

patientsY = patients["Diastolic"]

#These selections either fail or were superseded:
#patientsX = patients["Age", "Gender"] #KeyError: single brackets take one label
#patientsX = patients.loc[:, "Age":"Gender"]

patientsX = patients[["Age", "Gender", "Height", "Location", "SelfAssessedHealthStatus", "Weight"]]

patientsXNumeric = patients[["Age", "Height", "Weight"]]

Smoker is not pulled in with the other categorical data, so this next line was added.

patientsXBinary = patients[["Smoker"]]

#### Standardize the Data, or mean removal and variance scaling

>Standardization of datasets is a common requirement for many machine learning estimators implemented in scikit-learn; they might behave badly if the individual features do not more or less look like standard normally distributed data: Gaussian with zero mean and unit variance.

>Basically, take a matrix and change it so that its mean is equal to 0 and its variance is 1.

It matters in our case because Weight has values so much higher than Age. After fitting, our interpretation of the model would otherwise be influenced more by Weight than Age, simply because Weight has larger values.

We don’t need to standardize the dependent y variable, so we split the matrix before standardizing the entire X matrix.

Here is the clearest example of normalizing and standardizing.

In[8]:

from sklearn import preprocessing

#Standardize the numeric columns
patientsXNumeric = patients[["Age", "Height", "Weight"]]

patientsXNumeric_scaled = preprocessing.scale(patientsXNumeric, axis=0)

Scaled data should have zero mean and unit variance.

In[9]:

#Mean
print("Mean:", patientsXNumeric_scaled.mean())

#Std
print("Std:", patientsXNumeric_scaled.std())

print("Length:", len(patientsXNumeric_scaled))

Standardizing stripped the column headers, because preprocessing.scale() returns a NumPy array; this is fixed further below.

##### One-hot encoding preparation

Perform one-hot encoding, where 1 = hot, 0 = cold. Each feature value gets its own binary column.

Excellent tutorial here.

Another one here.

In[10]:

#Return only object datatypes (the non-numeric columns here)
categories = patientsX.select_dtypes(include=[object])

As you will only be dealing with categorical features here, it’s better to filter them out. You can create a separate DataFrame consisting of only these features by running the following command. The method .copy() is used so that any changes made in the new DataFrame don’t get reflected in the original one.

categoriesX = patientsX.select_dtypes(include=[object]).copy()

categoriesX.head()

Let’s also check the column-wise distribution of null values:

In[11]:

print(categoriesX.isnull().sum())

print(patientsXBinary.isnull().sum())

No missing values. Good!

Next, count the distinct cases of each category.

In[12]:

#print(categoriesX["Location"].value_counts().count()) #superseded by the labeled prints below

print("Gender:", categoriesX["Gender"].value_counts().count())

print("Location:", categoriesX["Location"].value_counts().count())

print("SelfAssessedHealthStatus:", categoriesX["SelfAssessedHealthStatus"].value_counts().count())

print("Smoker:", patientsXBinary["Smoker"].value_counts().count())

There are not so many unique values that one-hot encoding would complicate the linear regression.

##### One-Hot Encoding

As said in this terrific one-hot tutorial:

>There are many libraries out there that support one-hot encoding, but the simplest one is pandas’ .get_dummies() method.

>There are mainly three important arguments here: the first is the DataFrame you want to encode, the second is the columns argument which lets you specify the columns to encode, and the third is the prefix argument which lets you specify the prefix for the new columns created after encoding.

LastName is not to be included in the linear regression.

In[13]:

categoriesX_onehot = categoriesX.copy()

categoriesX_onehot = pd.get_dummies(categoriesX, columns=["Gender", "Location", "SelfAssessedHealthStatus"], prefix=["Gender", "Location", "SelfAssessedHealthStatus"])

categoriesXBinary_onehot = pd.get_dummies(patientsXBinary, columns=["Smoker"], prefix=["Smoker"])

#Return results
print(categoriesX_onehot.head())

print(categoriesXBinary_onehot.head())

Now that one-hot encoding has split the categorical attributes into many dummy attributes, they must be concatenated back together. This can be done via pandas’ .concat() method. The axis argument is set to 1 because we want to merge on columns.

In[14]:

print("categoriesX_onehot is:", type(categoriesX_onehot))

print(categoriesX_onehot.shape)

print("categoriesXBinary_onehot is:", type(categoriesXBinary_onehot))

print(categoriesXBinary_onehot.shape)

print("patientsXNumeric_scaled is:", type(patientsXNumeric_scaled))

print(patientsXNumeric_scaled.shape)

patientsXNumeric_scaled is a NumPy array. I used this SO post to convert it to a dataframe.

In[15]:

patientsXNumeric_scaleddf = pd.DataFrame(patientsXNumeric_scaled)

In[16]:

print("patientsXNumeric_scaleddf is:", type(patientsXNumeric_scaleddf))

In[17]:

print(patientsXNumeric_scaleddf.head())

Looks better, but it still needs column names.

In[18]:

patientsXNumeric_scaleddf.columns = ["Age", "Height", "Weight"]

Now we bring all the columns back together as one dataframe.

In[29]:

#Bring them back together
#patientsXAll = pd.concat([categoriesX_onehot, patientsXNumeric, categoriesXBinary_onehot], axis=1) #superseded: this used the unscaled numerics

patientsXAll = pd.concat([categoriesX_onehot, patientsXNumeric_scaleddf, categoriesXBinary_onehot], axis=1)

#top 3 rows
print(patientsXAll.head(3))

Above, the numeric, one-hot encoded categorical, and binary columns have been concatenated into one dataframe.

##### Binning/Aggregating

None of the features (e.g. Age) require binning/aggregating.

### Build a Linear Regression Model

3. Use the variables Age, Gender, Height, Weight, Smoker, Location and SelfAssessedHealthStatus to build a linear regression model to predict the systolic blood pressure.

That is, Diastolic and LastName are not included as predictors.

In this assignment there is no need to split the dataset into training and testing (or training, validation and testing), but if you wanted to, this is an incredibly clear example of how.

In[30]:

#The MATLAB equivalent would be: mdl = fitlm(patientsXAll, patientsY)

import matplotlib.pyplot as plt

import numpy as np

from sklearn import linear_model

from sklearn.metrics import mean_squared_error, r2_score

#Create the linear regression object
regr = linear_model.LinearRegression()

#Train the model using the training sets
trainedmodel = regr.fit(patientsXAll, patientsY)

### Interpretation

4. What are the regression coefficients (thetas)?

In[31]:

#y-intercept (theta0)
yint = regr.intercept_

print("Y intercept:", yint)

#The coefficients (theta1 through thetan)
coefficients = regr.coef_

print("Coefficients:", coefficients)

5. How do you interpret those numbers?

In[32]:

print("Number of coefficients:", len(coefficients))

print("Number of columns:", len(patientsXAll.columns))

Just a quick check that the coefficient count matches the attribute count.

In[33]:

print(patientsXAll.columns)

print(coefficients)

A coefficient of 10 for a numeric, non-dummy attribute indicates that for every +1 standard deviation in the independent (exogenous) variable, the dependent (endogenous) variable increases by 10 units. That is, when Weight increases by 1 standard deviation, predicted diastolic increases by 1.76e-1, or 0.176.

A coefficient of 10 for a categorical, dummy attribute indicates that when the independent variable is 1 (TRUE), the dependent variable increases by 10 units relative to the baseline assumption. That is, when you smoke, predicted diastolic increases by 5.188 relative to the baseline of not smoking.

6. If you need to identify one outlier record, which record is a potential outlier? How do you reach this conclusion?

There are no outliers for categorical (dummy)/binary attributes; Gender, LastName, Location, SelfAssessedHealthStatus and Smoker are all irrelevant in the search for outliers. Hence, we are only interested in the remaining three numeric attributes: Age, Height and Weight. Let’s begin the search with a box plot.

In[34]:

import matplotlib.pyplot as plot

get_ipython().run_line_magic('matplotlib', 'inline')

patientsXNumeric_scaleddf.plot.box(figsize=(16,4))

This suggests Height has the largest absolute outlier, which is a minimum. We can now examine this with scatterplots.

In[35]:

plot.scatter(patientsXNumeric_scaleddf["Age"], patientsY)

In:

plot.scatter(patientsXNumeric_scaleddf["Height"], patientsY)

Here we see that same minimum outlier in Height.

In[36]:

plot.scatter(patientsXNumeric_scaleddf["Weight"], patientsY)

Find the numbers for those minimums and maximums.

In[37]:

patientsXNumeric_scaleddf.min()

In[38]:

patientsXNumeric_scaleddf.max()

So far, the single outlier record identified using the boxplot is the lowest Height value, -2.505.

In[39]:

patientsXNumeric_scaleddf["Height"].min()

## But what about Cook’s distance or leverage?

I tried statsmodels.stats.outliers_influence.OLSInfluence, but could not get it working; the above is all I can manage. Next time I may need to try MATLAB…

The feature I’d remove first is LastName, since through one-hot encoding this attribute renders 100 columns. For a dataset consisting of only 100 rows, that is far too many and would exhaust the available degrees of freedom.

Markdown

To view this entire document’s markdown code, click here.

Data

If you don’t have the dataset, copy the table below, paste it into Excel, and save it as a comma-separated file named patients.csv in your preferred directory.

Age Diastolic Gender Height LastName Location SelfAssessedHealthStatus Smoker Systolic Weight
38 93 ‘Male’ 71 ‘Smith’ ‘County General Hospital’ ‘Excellent’ 1 124 176
43 77 ‘Male’ 69 ‘Johnson’ ‘VA Hospital’ ‘Fair’ 0 109 163
38 83 ‘Female’ 64 ‘Williams’ ‘St. Mary’s Medical Center’ ‘Good’ 0 125 131
40 75 ‘Female’ 67 ‘Jones’ ‘VA Hospital’ ‘Fair’ 0 117 133
49 80 ‘Female’ 64 ‘Brown’ ‘County General Hospital’ ‘Good’ 0 122 119
46 70 ‘Female’ 68 ‘Davis’ ‘St. Mary’s Medical Center’ ‘Good’ 0 121 142
33 88 ‘Female’ 64 ‘Miller’ ‘VA Hospital’ ‘Good’ 1 130 142
40 82 ‘Male’ 68 ‘Wilson’ ‘VA Hospital’ ‘Good’ 0 115 180
28 78 ‘Male’ 68 ‘Moore’ ‘St. Mary’s Medical Center’ ‘Excellent’ 0 115 183
31 86 ‘Female’ 66 ‘Taylor’ ‘County General Hospital’ ‘Excellent’ 0 118 132
45 77 ‘Female’ 68 ‘Anderson’ ‘County General Hospital’ ‘Excellent’ 0 114 128
42 68 ‘Female’ 66 ‘Thomas’ ‘St. Mary’s Medical Center’ ‘Poor’ 0 115 137
25 74 ‘Male’ 71 ‘Jackson’ ‘VA Hospital’ ‘Poor’ 0 127 174
39 95 ‘Male’ 72 ‘White’ ‘VA Hospital’ ‘Excellent’ 1 130 202
36 79 ‘Female’ 65 ‘Harris’ ‘St. Mary’s Medical Center’ ‘Good’ 0 114 129
48 92 ‘Male’ 71 ‘Martin’ ‘VA Hospital’ ‘Good’ 1 130 181
32 95 ‘Male’ 69 ‘Thompson’ ‘St. Mary’s Medical Center’ ‘Excellent’ 1 124 191
27 79 ‘Female’ 69 ‘Garcia’ ‘VA Hospital’ ‘Fair’ 1 123 131
37 77 ‘Male’ 70 ‘Martinez’ ‘County General Hospital’ ‘Good’ 0 119 179
50 76 ‘Male’ 68 ‘Robinson’ ‘County General Hospital’ ‘Good’ 0 125 172
48 75 ‘Female’ 65 ‘Clark’ ‘VA Hospital’ ‘Excellent’ 0 121 133
39 79 ‘Female’ 64 ‘Rodriguez’ ‘VA Hospital’ ‘Fair’ 0 123 117
41 88 ‘Female’ 62 ‘Lewis’ ‘VA Hospital’ ‘Fair’ 0 114 137
44 90 ‘Female’ 66 ‘Lee’ ‘County General Hospital’ ‘Fair’ 1 128 146
28 96 ‘Female’ 65 ‘Walker’ ‘County General Hospital’ ‘Good’ 1 129 123
25 77 ‘Male’ 70 ‘Hall’ ‘VA Hospital’ ‘Poor’ 0 114 189
39 80 ‘Female’ 63 ‘Allen’ ‘VA Hospital’ ‘Excellent’ 0 113 143
25 76 ‘Female’ 63 ‘Young’ ‘County General Hospital’ ‘Good’ 0 125 114
36 83 ‘Male’ 68 ‘Hernandez’ ‘County General Hospital’ ‘Poor’ 0 120 166
30 89 ‘Male’ 67 ‘King’ ‘County General Hospital’ ‘Excellent’ 1 127 186
45 92 ‘Female’ 70 ‘Wright’ ‘VA Hospital’ ‘Excellent’ 1 134 126
40 83 ‘Female’ 66 ‘Lopez’ ‘VA Hospital’ ‘Poor’ 0 121 137
25 80 ‘Female’ 64 ‘Hill’ ‘St. Mary’s Medical Center’ ‘Excellent’ 0 115 138
47 84 ‘Male’ 70 ‘Scott’ ‘St. Mary’s Medical Center’ ‘Excellent’ 0 127 187
44 92 ‘Male’ 71 ‘Green’ ‘County General Hospital’ ‘Good’ 0 121 193
48 83 ‘Female’ 66 ‘Adams’ ‘VA Hospital’ ‘Excellent’ 0 127 137
44 90 ‘Male’ 71 ‘Baker’ ‘VA Hospital’ ‘Good’ 1 136 192
35 85 ‘Female’ 66 ‘Gonzalez’ ‘St. Mary’s Medical Center’ ‘Fair’ 0 117 118
33 90 ‘Male’ 66 ‘Nelson’ ‘St. Mary’s Medical Center’ ‘Good’ 1 124 180
38 74 ‘Female’ 63 ‘Carter’ ‘St. Mary’s Medical Center’ ‘Good’ 0 120 128
39 92 ‘Male’ 71 ‘Mitchell’ ‘County General Hospital’ ‘Fair’ 1 128 164
44 80 ‘Male’ 69 ‘Perez’ ‘VA Hospital’ ‘Excellent’ 0 116 183
44 89 ‘Male’ 70 ‘Roberts’ ‘VA Hospital’ ‘Good’ 1 132 169
37 96 ‘Male’ 70 ‘Turner’ ‘VA Hospital’ ‘Excellent’ 1 137 194
45 89 ‘Male’ 67 ‘Phillips’ ‘VA Hospital’ ‘Good’ 0 117 172
37 77 ‘Female’ 65 ‘Campbell’ ‘County General Hospital’ ‘Fair’ 0 116 135
30 81 ‘Male’ 68 ‘Parker’ ‘VA Hospital’ ‘Poor’ 0 119 182
39 76 ‘Female’ 62 ‘Evans’ ‘County General Hospital’ ‘Good’ 0 123 121
42 83 ‘Male’ 70 ‘Edwards’ ‘County General Hospital’ ‘Excellent’ 0 116 158
42 78 ‘Male’ 67 ‘Collins’ ‘County General Hospital’ ‘Good’ 1 124 179
49 95 ‘Male’ 68 ‘Stewart’ ‘County General Hospital’ ‘Poor’ 1 129 170
44 91 ‘Female’ 62 ‘Sanchez’ ‘St. Mary’s Medical Center’ ‘Good’ 1 130 136
43 91 ‘Female’ 64 ‘Morris’ ‘County General Hospital’ ‘Poor’ 1 132 135
47 86 ‘Female’ 66 ‘Rogers’ ‘VA Hospital’ ‘Excellent’ 0 117 147
50 89 ‘Male’ 72 ‘Reed’ ‘VA Hospital’ ‘Excellent’ 1 129 186
38 79 ‘Female’ 63 ‘Cook’ ‘VA Hospital’ ‘Excellent’ 0 118 124
41 74 ‘Female’ 66 ‘Morgan’ ‘St. Mary’s Medical Center’ ‘Good’ 0 120 134
45 82 ‘Male’ 70 ‘Bell’ ‘St. Mary’s Medical Center’ ‘Good’ 1 138 170
36 76 ‘Male’ 71 ‘Murphy’ ‘VA Hospital’ ‘Good’ 0 117 180
38 81 ‘Female’ 68 ‘Bailey’ ‘St. Mary’s Medical Center’ ‘Good’ 0 113 130
29 77 ‘Female’ 63 ‘Rivera’ ‘County General Hospital’ ‘Excellent’ 0 122 130
28 73 ‘Female’ 65 ‘Cooper’ ‘VA Hospital’ ‘Good’ 0 115 127
30 85 ‘Female’ 67 ‘Richardson’ ‘County General Hospital’ ‘Excellent’ 0 120 141
28 76 ‘Female’ 66 ‘Cox’ ‘County General Hospital’ ‘Good’ 0 117 111
29 80 ‘Female’ 68 ‘Howard’ ‘VA Hospital’ ‘Excellent’ 0 123 134
36 80 ‘Male’ 71 ‘Ward’ ‘St. Mary’s Medical Center’ ‘Good’ 0 123 189
45 79 ‘Female’ 70 ‘Torres’ ‘County General Hospital’ ‘Excellent’ 0 119 137
32 82 ‘Female’ 60 ‘Peterson’ ‘County General Hospital’ ‘Excellent’ 0 110 136
31 79 ‘Female’ 64 ‘Gray’ ‘VA Hospital’ ‘Excellent’ 0 121 130
48 82 ‘Female’ 64 ‘Ramirez’ ‘County General Hospital’ ‘Excellent’ 1 138 137
25 75 ‘Male’ 66 ‘James’ ‘County General Hospital’ ‘Good’ 0 125 186
40 91 ‘Female’ 64 ‘Watson’ ‘VA Hospital’ ‘Fair’ 1 122 127
39 74 ‘Male’ 72 ‘Brooks’ ‘St. Mary’s Medical Center’ ‘Excellent’ 0 120 176
41 78 ‘Female’ 65 ‘Kelly’ ‘St. Mary’s Medical Center’ ‘Poor’ 0 117 127
33 85 ‘Female’ 67 ‘Sanders’ ‘St. Mary’s Medical Center’ ‘Excellent’ 1 125 115
31 84 ‘Male’ 72 ‘Price’ ‘VA Hospital’ ‘Fair’ 1 124 178
35 75 ‘Female’ 64 ‘Bennett’ ‘County General Hospital’ ‘Fair’ 0 121 131
32 78 ‘Male’ 68 ‘Wood’ ‘St. Mary’s Medical Center’ ‘Poor’ 0 118 183
42 81 ‘Male’ 66 ‘Barnes’ ‘County General Hospital’ ‘Excellent’ 0 120 194
48 79 ‘Female’ 64 ‘Ross’ ‘VA Hospital’ ‘Good’ 0 118 126
34 85 ‘Male’ 68 ‘Henderson’ ‘St. Mary’s Medical Center’ ‘Good’ 0 118 186
39 79 ‘Male’ 69 ‘Coleman’ ‘VA Hospital’ ‘Excellent’ 0 122 188
28 82 ‘Male’ 69 ‘Jenkins’ ‘County General Hospital’ ‘Good’ 1 134 189
29 80 ‘Female’ 64 ‘Perry’ ‘St. Mary’s Medical Center’ ‘Good’ 0 131 120
32 80 ‘Female’ 63 ‘Powell’ ‘VA Hospital’ ‘Excellent’ 0 113 132
39 92 ‘Male’ 68 ‘Long’ ‘County General Hospital’ ‘Good’ 1 125 182
37 92 ‘Female’ 65 ‘Patterson’ ‘County General Hospital’ ‘Poor’ 1 135 120
49 96 ‘Female’ 63 ‘Hughes’ ‘County General Hospital’ ‘Good’ 1 128 123
31 87 ‘Female’ 66 ‘Flores’ ‘VA Hospital’ ‘Good’ 1 123 141
37 81 ‘Female’ 65 ‘Washington’ ‘St. Mary’s Medical Center’ ‘Good’ 0 122 129
38 90 ‘Male’ 68 ‘Butler’ ‘County General Hospital’ ‘Excellent’ 1 138 184
45 77 ‘Male’ 71 ‘Simmons’ ‘VA Hospital’ ‘Excellent’ 0 124 181
30 91 ‘Female’ 70 ‘Foster’ ‘St. Mary’s Medical Center’ ‘Fair’ 0 130 124
48 79 ‘Male’ 71 ‘Gonzales’ ‘County General Hospital’ ‘Good’ 0 123 174
48 73 ‘Female’ 66 ‘Bryant’ ‘County General Hospital’ ‘Excellent’ 0 129 134
25 99 ‘Male’ 69 ‘Alexander’ ‘County General Hospital’ ‘Good’ 1 128 171
44 92 ‘Male’ 69 ‘Russell’ ‘VA Hospital’ ‘Good’ 1 124 188
49 74 ‘Male’ 70 ‘Griffin’ ‘County General Hospital’ ‘Fair’ 0 119 186
45 93 ‘Male’ 68 ‘Diaz’ ‘County General Hospital’ ‘Good’ 1 136 172
48 86 ‘Male’ 66 ‘Hayes’ ‘County General Hospital’ ‘Fair’ 0 114 177


Videos

I’ve recorded a 45-minute video on how to take machine learning to the next level in an applied Wine Quality Prediction Project.

If you’re not ready for that and want a tutorial on the basics of machine learning, my 1.5-hour Overview of Machine Learning might be better. It will guide you through many of the general concepts, as well as some of the various models listed below.