
This analysis fits a stepwise linear regression model to predict systolic blood pressure. Seven numeric and categorical features are included: Age, Height, Weight, Gender, Hospital Location (CGH - County General Hospital, SMMC - St. Mary’s Medical Center, VA - VA Hospital), Self-Assessed Health Status (Excellent, Fair, Good, Poor), and Smoker. The regression takes the general form:

\[\begin{equation} \hat{Systolic} = \theta_1(Age) + \theta_2(Height) + \theta_3(Weight) + \cdots \\ + \theta_{k-2}(HealthFair) + \theta_{k-1}(HealthPoor) + \theta_k(Smoker) + b \end{equation}\]

This is just one of the five machine learning modeling guides you can find here.

Overview

This analysis addresses the following requirements:

  1. Load the patient self-evaluation dataset.

  2. Linear regression is used to predict continuous values. Use a linear regression model on:
  • Age
  • Gender
  • Height
  • Weight
  • Smoker
  • Location
  • SelfAssessedHealthStatus
  • Systolic blood pressure (target)

  3. Report the regression coefficients (thetas).

  4. Create a reduced model using stepwise regression.

Pre-Modeling

Load Required Packages

There are two ways to load the required packages.

  1. Install pacman using the following code.
#install.packages("pacman")
#library("pacman")
  2. Or use this function, which installs pacman only if it is missing. If it does not work for you, fall back to the code above.
#Check for and install required packages, using pacman
if (!require("pacman")) install.packages("pacman")
## Loading required package: pacman
pacman::p_load(base64enc, ggplot2, kableExtra)

Load Data

The dataset we will be loading appears as:

(Document preview image of patients.csv; the full table is reproduced in the Data section below.)

patients <- read.csv("C:/Users/G/Google Drive/aStThomas/6MachineLearning/Assignments/R/2 linear regression/patients.csv", header = TRUE)

Preview Data

Examine the data structure.

#Preview structure
str(patients)
## 'data.frame':    100 obs. of  10 variables:
##  $ Age                     : int  38 43 38 40 49 46 33 40 28 31 ...
##  $ Diastolic               : int  93 77 83 75 80 70 88 82 78 86 ...
##  $ Gender                  : Factor w/ 2 levels "'Female'","'Male'": 2 2 1 1 1 1 1 2 2 1 ...
##  $ Height                  : int  71 69 64 67 64 68 64 68 68 66 ...
##  $ LastName                : Factor w/ 100 levels "'Adams'","'Alexander'",..: 84 45 96 46 11 22 55 97 57 86 ...
##  $ Location                : Factor w/ 3 levels "'County General Hospital'",..: 1 3 2 3 1 2 3 3 2 1 ...
##  $ SelfAssessedHealthStatus: Factor w/ 4 levels "'Excellent'",..: 1 2 3 2 3 3 3 3 1 1 ...
##  $ Smoker                  : int  1 0 0 0 0 0 1 0 0 0 ...
##  $ Systolic                : int  124 109 125 117 122 121 130 115 115 118 ...
##  $ Weight                  : int  176 163 131 133 119 142 142 180 183 132 ...

Examine the top 5 rows.

#Preview top 5 rows
head(patients, n=5)
##   Age Diastolic   Gender Height   LastName                    Location
## 1  38        93   'Male'     71    'Smith'   'County General Hospital'
## 2  43        77   'Male'     69  'Johnson'               'VA Hospital'
## 3  38        83 'Female'     64 'Williams' 'St. Mary's Medical Center'
## 4  40        75 'Female'     67    'Jones'               'VA Hospital'
## 5  49        80 'Female'     64    'Brown'   'County General Hospital'
##   SelfAssessedHealthStatus Smoker Systolic Weight
## 1              'Excellent'      1      124    176
## 2                   'Fair'      0      109    163
## 3                   'Good'      0      125    131
## 4                   'Fair'      0      117    133
## 5                   'Good'      0      122    119

Preprocessing

Each attribute contains 100 observations; there are no missing values. Therefore, we do not need to fill missing values with mean/mode, or drop any columns/rows.
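A minimal sketch of that check (my addition; not part of the original workflow):

#Count missing values per column; all zeros confirms nothing needs imputing
colSums(is.na(patients))

#Number of fully observed rows; should equal 100
sum(complete.cases(patients))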

Now remove the unwanted Diastolic and LastName columns from the patients table.

patientsOriginal <- patients

df <- patients[-c(2, 5)] #drop columns 2 (Diastolic) and 5 (LastName)

Split the dataframe into categorical and numeric subsets.

df_categorical <- df[,c("Gender","Location","SelfAssessedHealthStatus", "Smoker")]
df_numeric <- df[,c("Age","Height", "Systolic", "Weight")]

One-hot encode categorical columns.

#Check for and install required packages, using pacman
if (!require("pacman")) install.packages("pacman")
pacman::p_load(dummies)

#Convert Location to one-hot
#sep inserts a separator into the generated column names
df_categorical <- dummy.data.frame(df_categorical, names=c("Location", "SelfAssessedHealthStatus", "Gender"), sep="-")

df_categorical <- df_categorical[-c(1)] #drop the Female dummy; the Male column alone encodes gender (Male = 1, Female = 0)
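The dummies package has since been archived on CRAN, so as a hedged alternative, base R’s model.matrix() can produce the same kind of dummy columns; a sketch under that assumption (note that model.matrix() drops the first level of each factor by default, which also sidesteps the singularities reported in the full model below):

#Expand the factors into dummy columns; [, -1] removes the intercept column
onehot <- model.matrix(~ Gender + Location + SelfAssessedHealthStatus, data = patients)[, -1]
df_categorical_alt <- cbind(as.data.frame(onehot), Smoker = patients$Smoker)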

Standardize the numeric columns, then recombine the two dataframes in order.

An additional interesting discussion on when to standardize is here.

scaled_numericdf <- scale(df_numeric)
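As a quick verification (my addition), each scaled column should now have mean approximately 0 and standard deviation 1:

round(colMeans(scaled_numericdf), 10) #means should all be ~0
apply(scaled_numericdf, 2, sd)        #standard deviations should all be 1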

Recombine the forked categorical and numeric dataframes using a column bind.

df <- cbind(scaled_numericdf, df_categorical)

Plot histograms

Plot histogram of numeric columns. For bin specification, see here.

For plotting multiples, see here.

#Check for and install required packages, using pacman
if (!require("pacman")) install.packages("pacman")
pacman::p_load(ggplot2, gridExtra)

AgePlot <- ggplot(data=df, aes(x=Age)) +
                  geom_histogram(bins = 20,
                  col="black",
                  fill="blue",
                  alpha = .3) +
                  labs(title="Histogram for Age", x="Age", y="Count")

HeightPlot <- ggplot(data=df, aes(x=Height)) +
                  geom_histogram(bins = 20,
                  col="black",
                  fill="blue",
                  alpha = .3) +
                  labs(title="Histogram for Height", x="Height", y="Count")

WeightPlot <- ggplot(data=df, aes(x=Weight)) +
                  geom_histogram(bins = 20,
                  col="black",
                  fill="blue",
                  alpha = .3) +
                  labs(title="Histogram for Weight", x="Weight", y="Count")

SystolicPlot <- ggplot(data=df, aes(x=Systolic)) +
                  geom_histogram(bins = 20,
                  col="black",
                  fill="green",
                  alpha = .3) +
                  labs(title="Histogram for Systolic", x="Systolic", y="Count")

grid.arrange(AgePlot, HeightPlot, WeightPlot, SystolicPlot, nrow=2)

The Weight distribution is a bit unusual; you would expect it to be roughly normal.

Optional: Explore the dataset using a scatterplot.
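A minimal sketch of that optional exploration (my addition), using base R’s pairs() on the numeric columns:

#Scatterplot matrix of every numeric attribute against every other
pairs(df_numeric, main = "Pairwise scatterplots of numeric attributes")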

Rename column headers for easier interpretation and reference.

names(df)[5] <- "Male"
names(df)[6] <- "Location1"
names(df)[7] <- "Location2"
names(df)[8] <- "Location3"
names(df)[9] <- "HealthExcellent"
names(df)[10] <- "HealthFair"
names(df)[11] <- "HealthGood"
names(df)[12] <- "HealthPoor"
names(df)[13] <- "Smoker"

Modeling

Having completed the pre-processing and data exploration phases, we now move onto building a multiple linear regression model.

# fit <- lm(Systolic ~ Age + Height + Weight + Male + Location1 + Location2 + Location3 + HealthExcellent + HealthFair + HealthGood + HealthPoor + Smoker, data = df)

#Finding the best model using Forward / Backward Stepwise Regression
full.model <- lm(Systolic ~ ., data = df) #the '.' means use all remaining columns

summary(full.model)
## 
## Call:
## lm(formula = Systolic ~ ., data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.2853 -0.4662 -0.1028  0.4681  1.7636 
## 
## Coefficients: (2 not defined because of singularities)
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -0.43112    0.33094  -1.303   0.1960    
## Age              0.08627    0.07201   1.198   0.2341    
## Height           0.19844    0.10729   1.850   0.0677 .  
## Weight          -0.05311    0.23105  -0.230   0.8187    
## Male            -0.22038    0.48649  -0.453   0.6516    
## Location1        0.25844    0.16881   1.531   0.1293    
## Location2        0.13084    0.19193   0.682   0.4972    
## Location3             NA         NA      NA       NA    
## HealthExcellent -0.06843    0.24970  -0.274   0.7847    
## HealthFair      -0.47823    0.28414  -1.683   0.0959 .  
## HealthGood       0.01892    0.24150   0.078   0.9377    
## HealthPoor            NA         NA      NA       NA    
## Smoker           1.44098    0.15581   9.249 1.15e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.702 on 89 degrees of freedom
## Multiple R-squared:  0.5569, Adjusted R-squared:  0.5071 
## F-statistic: 11.19 on 10 and 89 DF,  p-value: 3.894e-12

Here the model formula follows R’s general syntax: response ~ op1 term1 op2 term2 … opn termn

Optional: Explore the dataset using the squared error, with the code: plot_ss(x = df$Age, y = df$Systolic, showSquares = TRUE), which I found here.
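If plot_ss() (a helper from an external teaching script) is unavailable, the same squared-error quantity can be computed directly in base R; a minimal sketch, my addition:

#Sum of squared residuals for a one-predictor model of Systolic on Age
simple.fit <- lm(Systolic ~ Age, data = df)
sum(residuals(simple.fit)^2)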

Optimize model

We can optimize this model using stepwise regression, searching for a model that fits well while remaining parsimonious.

reduced.model <- step(full.model, direction = "backward")
## Start:  AIC=-60.4
## Systolic ~ Age + Height + Weight + Male + Location1 + Location2 + 
##     Location3 + HealthExcellent + HealthFair + HealthGood + HealthPoor + 
##     Smoker
## 
## 
## Step:  AIC=-60.4
## Systolic ~ Age + Height + Weight + Male + Location1 + Location2 + 
##     Location3 + HealthExcellent + HealthFair + HealthGood + Smoker
## 
## 
## Step:  AIC=-60.4
## Systolic ~ Age + Height + Weight + Male + Location1 + Location2 + 
##     HealthExcellent + HealthFair + HealthGood + Smoker
## 
##                   Df Sum of Sq    RSS     AIC
## - HealthGood       1     0.003 43.869 -62.397
## - Weight           1     0.026 43.892 -62.345
## - HealthExcellent  1     0.037 43.902 -62.320
## - Male             1     0.101 43.967 -62.174
## - Location2        1     0.229 44.095 -61.883
## - Age              1     0.707 44.573 -60.805
## <none>                         43.865 -60.404
## - Location1        1     1.155 45.021 -59.805
## - HealthFair       1     1.396 45.262 -59.271
## - Height           1     1.686 45.552 -58.633
## - Smoker           1    42.158 86.023   4.945
## 
## Step:  AIC=-62.4
## Systolic ~ Age + Height + Weight + Male + Location1 + Location2 + 
##     HealthExcellent + HealthFair + Smoker
## 
##                   Df Sum of Sq    RSS     AIC
## - Weight           1     0.026 43.895 -64.338
## - Male             1     0.101 43.969 -64.168
## - HealthExcellent  1     0.122 43.991 -64.119
## - Location2        1     0.234 44.102 -63.866
## - Age              1     0.722 44.590 -62.765
## <none>                         43.869 -62.397
## - Location1        1     1.159 45.028 -61.789
## - Height           1     1.683 45.552 -60.633
## - HealthFair       1     2.606 46.475 -58.626
## - Smoker           1    42.566 86.434   3.421
## 
## Step:  AIC=-64.34
## Systolic ~ Age + Height + Male + Location1 + Location2 + HealthExcellent + 
##     HealthFair + Smoker
## 
##                   Df Sum of Sq    RSS     AIC
## - HealthExcellent  1     0.163 44.058 -65.967
## - Location2        1     0.243 44.138 -65.786
## - Age              1     0.713 44.607 -64.727
## <none>                         43.895 -64.338
## - Male             1     1.104 44.999 -63.853
## - Location1        1     1.285 45.180 -63.452
## - Height           1     1.690 45.585 -62.560
## - HealthFair       1     2.586 46.481 -60.614
## - Smoker           1    42.687 86.581   1.591
## 
## Step:  AIC=-65.97
## Systolic ~ Age + Height + Male + Location1 + Location2 + HealthFair + 
##     Smoker
## 
##              Df Sum of Sq    RSS     AIC
## - Location2   1     0.331 44.389 -67.218
## - Age         1     0.701 44.759 -66.389
## <none>                    44.058 -65.967
## - Male        1     0.991 45.049 -65.743
## - Location1   1     1.330 45.387 -64.993
## - Height      1     1.565 45.623 -64.476
## - HealthFair  1     2.450 46.508 -62.555
## - Smoker      1    43.571 87.629   0.794
## 
## Step:  AIC=-67.22
## Systolic ~ Age + Height + Male + Location1 + HealthFair + Smoker
## 
##              Df Sum of Sq    RSS     AIC
## - Age         1     0.614 45.003 -67.843
## <none>                    44.389 -67.218
## - Location1   1     1.007 45.396 -66.976
## - Male        1     1.042 45.431 -66.898
## - Height      1     1.552 45.941 -65.782
## - HealthFair  1     2.710 47.099 -63.292
## - Smoker      1    43.243 87.632  -1.203
## 
## Step:  AIC=-67.84
## Systolic ~ Height + Male + Location1 + HealthFair + Smoker
## 
##              Df Sum of Sq    RSS     AIC
## <none>                    45.003 -67.843
## - Male        1     1.068 46.071 -67.498
## - Location1   1     1.133 46.137 -67.356
## - Height      1     1.748 46.752 -66.032
## - HealthFair  1     2.677 47.681 -64.065
## - Smoker      1    43.577 88.580  -2.126
summary(reduced.model)
## 
## Call:
## lm(formula = Systolic ~ Height + Male + Location1 + HealthFair + 
##     Smoker, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.2726 -0.5124 -0.0512  0.3933  1.8015 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.3623     0.1351  -2.681  0.00868 ** 
## Height        0.1983     0.1038   1.911  0.05905 .  
## Male         -0.3101     0.2076  -1.494  0.13864    
## Location1     0.2255     0.1465   1.539  0.12725    
## HealthFair   -0.4643     0.1963  -2.365  0.02010 *  
## Smoker        1.4404     0.1510   9.540 1.74e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6919 on 94 degrees of freedom
## Multiple R-squared:  0.5454, Adjusted R-squared:  0.5212 
## F-statistic: 22.56 on 5 and 94 DF,  p-value: 8.427e-15
plot(reduced.model)
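plot(reduced.model) includes a residuals-versus-leverage panel; influential observations can also be ranked numerically with Cook’s distance. A short sketch (my addition):

#Cook's distance per observation; values far above the rest deserve inspection
cd <- cooks.distance(reduced.model)
head(sort(cd, decreasing = TRUE), 5) #the five most influential rows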

AIC / BIC Model Comparison

When comparing models fitted by maximum likelihood to the same data, the smaller the AIC or BIC, the better the fit. For more information on model selection criteria see R Documentation and this.

ModelComparison_AIC <- AIC(full.model, reduced.model)

print(ModelComparison_AIC)
##               df      AIC
## full.model    12 225.3834
## reduced.model  7 217.9447
ModelComparison_BIC <- BIC(full.model, reduced.model)

print(ModelComparison_BIC)
##               df      BIC
## full.model    12 256.6455
## reduced.model  7 236.1809
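Because BIC penalizes complexity more heavily than AIC, the stepwise search can also be rerun with k = log(n) in place of the default AIC penalty of 2; a hedged sketch (my addition):

#BIC-based backward selection; trace = FALSE suppresses the step-by-step log
bic.model <- step(full.model, direction = "backward", k = log(nrow(df)), trace = FALSE)
summary(bic.model)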

Results

A full and a reduced regression model were constructed. Performance differed between the full model, which includes all variables, and the reduced model, which includes a limited set. The reduced model was built using the backward stepwise method to find the most useful predictors of the target Systolic variable. The reduced model achieved a lower p-value, a lower AIC, and a lower BIC.

This reduced model retained the following predictors: Height, Gender (Male), Hospital Location (County General Hospital), Health Status (Fair), and Smoker. Of these, Smoker and HealthFair are statistically significant at the 5% level.

\[\begin{equation} \hat{Systolic} = 0.20(Height) - 0.31(Male) + 0.23(CountyGeneralHospital) \\ - 0.46(HealthFair) + 1.44(Smoker) - 0.36 \end{equation}\]
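To generate a prediction from the reduced model, remember that the inputs and the target were standardized, so new values must be supplied in standardized units and the output converted back; a sketch with a purely hypothetical patient (my addition):

#Hypothetical patient: average height (0 on the standardized scale), male,
#seen at County General Hospital, health status not 'Fair', and a smoker
newpatient <- data.frame(Height = 0, Male = 1, Location1 = 1, HealthFair = 0, Smoker = 1)
pred_z <- predict(reduced.model, newdata = newpatient)

#Convert the standardized prediction back to mmHg using the original Systolic scale
pred_z * sd(patientsOriginal$Systolic) + mean(patientsOriginal$Systolic)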

Of note, there is an interesting discussion as to whether stepwise regression should ever be used. Some traditional statisticians say it never should be, since you’re leaving model building purely to the math and not using human intelligence to determine which features to include. Data miners, on the other hand, use an alternative methodology where you rely entirely on the math to select features; only afterward do you determine whether the included features are appropriate.

It’s unsettling because the Age of Enlightenment was founded on the scientific method; Isaac Newton himself adopted it. We shouldn’t readily step off this proven path, should we? But consider this case for why we should. When you decide which variables explain the dependent variable (here, Systolic), you introduce your own bias. The effect is that of wearing a pair of horse blinders: you dismiss factors you don’t believe have any impact and include only those you believe important. But correlations and causations sometimes surprise us. Herein lies my own case for ex post facto feature selection, and with it, stepwise regression.

R

#install.packages("pacman")
#library("pacman")

#Check for and install required packages, using pacman
if (!require("pacman")) install.packages("pacman")
pacman::p_load(base64enc, ggplot2, kableExtra)

patients <- read.csv("C:/Users/G/Google Drive/aStThomas/6MachineLearning/Assignments/R/2 linear regression/patients.csv", header = TRUE)

#Preview structure
str(patients)

#Preview top 5 rows
head(patients, n=5)

patientsOriginal <- patients

df <- patients[-c(2, 5)] #drop columns 2 (Diastolic) and 5 (LastName)

df_categorical <- df[,c("Gender","Location","SelfAssessedHealthStatus", "Smoker")]
df_numeric <- df[,c("Age","Height", "Systolic", "Weight")]

#Check for and install required packages, using pacman
if (!require("pacman")) install.packages("pacman")
pacman::p_load(dummies)

#Convert Location to one-hot
#sep inserts a separator into the generated column names
df_categorical <- dummy.data.frame(df_categorical, names=c("Location", "SelfAssessedHealthStatus", "Gender"), sep="-")

df_categorical <- df_categorical[-c(1)] #drop the Female dummy; the Male column alone encodes gender (Male = 1, Female = 0)

scaled_numericdf <- scale(df_numeric)

df <- cbind(scaled_numericdf, df_categorical)

#Check for and install required packages, using pacman
if (!require("pacman")) install.packages("pacman")
pacman::p_load(ggplot2, gridExtra)

AgePlot <- ggplot(data=df, aes(x=Age)) +
                  geom_histogram(bins = 20,
                  col="black",
                  fill="blue",
                  alpha = .3) +
                  labs(title="Histogram for Age", x="Age", y="Count")

HeightPlot <- ggplot(data=df, aes(x=Height)) +
                  geom_histogram(bins = 20,
                  col="black",
                  fill="blue",
                  alpha = .3) +
                  labs(title="Histogram for Height", x="Height", y="Count")

WeightPlot <- ggplot(data=df, aes(x=Weight)) +
                  geom_histogram(bins = 20,
                  col="black",
                  fill="blue",
                  alpha = .3) +
                  labs(title="Histogram for Weight", x="Weight", y="Count")

SystolicPlot <- ggplot(data=df, aes(x=Systolic)) +
                  geom_histogram(bins = 20,
                  col="black",
                  fill="green",
                  alpha = .3) +
                  labs(title="Histogram for Systolic", x="Systolic", y="Count")

grid.arrange(AgePlot, HeightPlot, WeightPlot, SystolicPlot, nrow=2)

names(df)[5] <- "Male"
names(df)[6] <- "Location1" #County General Hospital
names(df)[7] <- "Location2" #St. Mary's Medical Center
names(df)[8] <- "Location3" #VA Hospital
names(df)[9] <- "HealthExcellent"
names(df)[10] <- "HealthFair"
names(df)[11] <- "HealthGood"
names(df)[12] <- "HealthPoor"
names(df)[13] <- "Smoker"

# fit <- lm(Systolic ~ Age + Height + Weight + Male + Location1 + Location2 + Location3 + HealthExcellent + HealthFair + HealthGood + HealthPoor + Smoker, data = df)

#Finding the best model using Forward / Backward Stepwise Regression
full.model <- lm(Systolic ~ ., data = df) #the '.' means use all remaining columns

summary(full.model)

reduced.model <- step(full.model, direction = "backward")

summary(reduced.model)

plot(reduced.model)

ModelComparison_AIC <- AIC(full.model, reduced.model)

print(ModelComparison_AIC)

ModelComparison_BIC <- BIC(full.model, reduced.model)

print(ModelComparison_BIC)

Python

# coding: utf-8

In[1]:

import pandas as pd

import numpy as np

In[2]:

Load data.

np.genfromtxt is a NumPy alternative; I prefer pandas’ read_csv with an explicit file path. Because there is a header row, set header=0.

patients = pd.read_csv(r"C:\tmp\patients.csv", header=0) #raw string so the backslashes are not treated as escape codes

Back up patients, just in case we need it later.

patientsBackup = patients.copy() #.copy() so later changes to patients do not alter the backup

### Preview the Data

We want to preview the data to see what we’ll be working with. This will display any missing values, as well.

In[3]:

#quick description of the data
patients.info()

#top 3 rows
patients.head(3)

Each attribute contains 100 observations; there are no missing values. Therefore, we do not need to fill missing values with mean/mode, or drop any columns/rows.

In[4]:

#show a summary of the numerical attributes
patients.describe(); #the semicolon suppresses the displayed return value in a notebook

In[5]:

#Histogram visualization; hist() relies on matplotlib
import matplotlib.pyplot as plot

%matplotlib inline

patients.hist(bins=20, figsize=(16,8))

plot.show()

##### Cross Correlation Check

The function previously provided in the Hands-On Machine Learning book was deprecated.

In[6]:

from pandas.plotting import scatter_matrix

#Note: this is not the latest dataframe
attributes = ["Age", "Diastolic", "Height", "Smoker", "Weight"]

scatter_matrix(patients[attributes], figsize=(16, 8))

The text output below is expected; see https://pandas.pydata.org/pandas-docs/stable/visualization.html.

Nothing especially interesting.

### Data Adjustments

#### First split the matrix into y (dependent) and x (independent)

Remember, Python is 0-offset! The “3rd” entry is at position 2.

patientsY = Diastolic

patientsX = everything else, excluding LastName and Systolic

A perfect, clear example of splitting y and x was found here and here.

The final solution on selecting multiple columns was found here.

The independent variables consist of numeric, categorical and binary datatypes. Each will be processed individually.

In[7]:

#split the dependent variable from the independent variables

#Earlier attempts, kept commented out for reference:
#patientsY = patients[patients.columns[1]]
#patientsY = patients.iloc[:, 1:1] #selects nothing as written; iloc slices exclude the end position

The clearest is this:

patientsY = patients["Diastolic"]

#These selections either fail or were superseded:
#patientsX = patients["Age", "Gender"] #KeyError: single brackets take one label
#patientsX = patients.loc[:, "Age":"Gender"]

patientsX = patients[["Age", "Gender", "Height", "Location", "SelfAssessedHealthStatus", "Weight"]]

patientsXNumeric = patients[["Age", "Height", "Weight"]]

Smoker is not pulled in with the other categorical data, so this next line was added.

patientsXBinary = patients[["Smoker"]]

#### Standardize the Data, or mean removal and variance scaling

>Standardization of datasets is a common requirement for many machine learning estimators implemented in scikit-learn; they might behave badly if the individual features do not more or less look like standard normally distributed data: Gaussian with zero mean and unit variance.

>Basically, take a matrix and change it so that its mean is equal to 0 and its variance is 1.

It matters in our case because Weight has values so much higher than Age. After fitting, our interpretation of the model would otherwise be influenced more by Weight than Age, simply because Weight has larger values.

We don’t need to standardize the dependent y variable, so we split the matrix before standardizing the entire X matrix.

Here is the clearest example of normalizing and standardizing.

In[8]:

from sklearn import preprocessing

#Standardize the numeric columns
patientsXNumeric = patients[["Age", "Height", "Weight"]]

patientsXNumeric_scaled = preprocessing.scale(patientsXNumeric, axis=0)

Scaled data should have zero mean and unit variance.

In[9]:

#Mean
print("Mean:", patientsXNumeric_scaled.mean())

#Std
print("Std:", patientsXNumeric_scaled.std())

print("Length:", len(patientsXNumeric_scaled))

Standardizing stripped the column headers, because preprocessing.scale() returns a NumPy array; this is fixed further below.

##### One-hot encoding preparation

Perform one-hot encoding, where 1 = hot, 0 = cold. Each feature value gets its own binary column.

Excellent tutorial here.

Another one here.

In[10]:

#Return only object datatypes (the non-numeric columns here)
categories = patientsX.select_dtypes(include=[object])

As you will only be dealing with categorical features here, it’s better to filter them out. You can create a separate DataFrame consisting of only these features by running the following command. The method .copy() is used so that any changes made in the new DataFrame don’t get reflected in the original one.

categoriesX = patientsX.select_dtypes(include=[object]).copy()

categoriesX.head()

Let’s also check the column-wise distribution of null values:

In[11]:

print(categoriesX.isnull().sum())

print(patientsXBinary.isnull().sum())

No missing values. Good!

Next, count the distinct cases of each category.

In[12]:

#print(categoriesX["Location"].value_counts().count()) #superseded by the labeled prints below

print("Gender:", categoriesX["Gender"].value_counts().count())

print("Location:", categoriesX["Location"].value_counts().count())

print("SelfAssessedHealthStatus:", categoriesX["SelfAssessedHealthStatus"].value_counts().count())

print("Smoker:", patientsXBinary["Smoker"].value_counts().count())

There are not so many unique values that one-hot encoding would complicate the linear regression.

##### One-Hot Encoding

As said in this terrific one-hot tutorial:

>There are many libraries out there that support one-hot encoding, but the simplest one is pandas’ .get_dummies() method.

>There are mainly three important arguments here: the first is the DataFrame you want to encode, the second is the columns argument which lets you specify the columns to encode, and the third is the prefix argument which lets you specify the prefix for the new columns created after encoding.

LastName is not to be included in the linear regression.

In[13]:

categoriesX_onehot = categoriesX.copy()

categoriesX_onehot = pd.get_dummies(categoriesX, columns=["Gender", "Location", "SelfAssessedHealthStatus"], prefix=["Gender", "Location", "SelfAssessedHealthStatus"])

categoriesXBinary_onehot = pd.get_dummies(patientsXBinary, columns=["Smoker"], prefix=["Smoker"])

#Return results
print(categoriesX_onehot.head())

print(categoriesXBinary_onehot.head())

Now that one-hot encoding has split the categorical attributes into many dummy attributes, they must be concatenated back together. This can be done via pandas’ .concat() method. The axis argument is set to 1 because we want to merge on columns.

In[14]:

print("categoriesX_onehot is:", type(categoriesX_onehot))

print(categoriesX_onehot.shape)

print("categoriesXBinary_onehot is:", type(categoriesXBinary_onehot))

print(categoriesXBinary_onehot.shape)

print("patientsXNumeric_scaled is:", type(patientsXNumeric_scaled))

print(patientsXNumeric_scaled.shape)

patientsXNumeric_scaled is a NumPy array. I used this SO post to convert it to a dataframe.

In[15]:

patientsXNumeric_scaleddf = pd.DataFrame(patientsXNumeric_scaled)

In[16]:

print("patientsXNumeric_scaleddf is:", type(patientsXNumeric_scaleddf))

In[17]:

print(patientsXNumeric_scaleddf.head())

Looks better, but it still needs column names.

In[18]:

patientsXNumeric_scaleddf.columns = ["Age", "Height", "Weight"]

Now we bring all the columns back together as one dataframe.

In[29]:

#Bring them back together
#patientsXAll = pd.concat([categoriesX_onehot, patientsXNumeric, categoriesXBinary_onehot], axis=1) #superseded: this used the unscaled numerics

patientsXAll = pd.concat([categoriesX_onehot, patientsXNumeric_scaleddf, categoriesXBinary_onehot], axis=1)

#top 3 rows
print(patientsXAll.head(3))

Above, the numeric, one-hot encoded categorical, and binary columns have been concatenated into one dataframe.

##### Binning/Aggregating

None of the features (e.g. Age) require binning/aggregating.

### Build a Linear Regression Model

3. Use the variables Age, Gender, Height, Weight, Smoker, Location and SelfAssessedHealthStatus to build a linear regression model to predict the systolic blood pressure.

That is, Diastolic and LastName are not included as predictors.

In this assignment there is no need to split the dataset into training and testing (or training, validation and testing), but if you wanted to, this is an incredibly clear example of how.

In[30]:

#The MATLAB equivalent would be: mdl = fitlm(patientsXAll, patientsY)

import matplotlib.pyplot as plt

import numpy as np

from sklearn import linear_model

from sklearn.metrics import mean_squared_error, r2_score

#Create the linear regression object
regr = linear_model.LinearRegression()

#Train the model using the training sets
trainedmodel = regr.fit(patientsXAll, patientsY)

### Interpretation

4. What are the regression coefficients (thetas)?

In[31]:

#y-intercept (theta0)
yint = regr.intercept_

print("Y intercept:", yint)

#The coefficients (theta1 through thetan)
coefficients = regr.coef_

print("Coefficients:", coefficients)

5. How do you interpret those numbers?

In[32]:

print("Number of coefficients:", len(coefficients))

print("Number of columns:", len(patientsXAll.columns))

Just a quick check that the coefficient count matches the attribute count.

In[33]:

print(patientsXAll.columns)

print(coefficients)

A coefficient of 10 for a numeric, non-dummy attribute indicates that for every +1 standard deviation in the independent (exogenous) variable, the dependent (endogenous) variable increases by 10 units. That is, when Weight increases by 1 standard deviation, predicted diastolic increases by 1.76e-1, or 0.176.

A coefficient of 10 for a categorical, dummy attribute indicates that when the independent variable is 1 (TRUE), the dependent variable increases by 10 units relative to the baseline assumption. That is, when you smoke, predicted diastolic increases by 5.188 relative to the baseline of not smoking.

6. If you need to identify one outlier record, which record is a potential outlier? How do you reach this conclusion?

There are no outliers for categorical (dummy)/binary attributes; Gender, LastName, Location, SelfAssessedHealthStatus and Smoker are all irrelevant in the search for outliers. Hence, we are only interested in the remaining three numeric attributes: Age, Height and Weight. Let’s begin the search with a box plot.

In[34]:

import matplotlib.pyplot as plot

get_ipython().run_line_magic('matplotlib', 'inline')

patientsXNumeric_scaleddf.plot.box(figsize=(16,4))

This suggests Height has the largest absolute outlier, which is a minimum. We can now examine this with scatterplots.

In[35]:

plot.scatter(patientsXNumeric_scaleddf["Age"], patientsY)

In:

plot.scatter(patientsXNumeric_scaleddf["Height"], patientsY)

Here we see that same minimum outlier in Height.

In[36]:

plot.scatter(patientsXNumeric_scaleddf["Weight"], patientsY)

Find the numbers for those minimums and maximums.

In[37]:

patientsXNumeric_scaleddf.min()

In[38]:

patientsXNumeric_scaleddf.max()

So far, the single outlier record identified using the boxplot is the lowest Height value, -2.505.

In[39]:

patientsXNumeric_scaleddf["Height"].min()

## But what about Cook’s distance or leverage?

I tried statsmodels.stats.outliers_influence.OLSInfluence, but could not get it working; the above is all I can manage. Next time I may need to try MATLAB…

The feature I’d remove first is LastName, since through one-hot encoding this attribute renders 100 columns. For a dataset consisting of only 100 rows, that is far too many and would exhaust the available degrees of freedom.

Markdown

To view this entire document’s markdown code, click here.

Data

If you don’t have the dataset, copy the table below, paste it into Excel, and save it as a comma-separated file named patients.csv in your preferred directory.

Age Diastolic Gender Height LastName Location SelfAssessedHealthStatus Smoker Systolic Weight
38 93 ‘Male’ 71 ‘Smith’ ‘County General Hospital’ ‘Excellent’ 1 124 176
43 77 ‘Male’ 69 ‘Johnson’ ‘VA Hospital’ ‘Fair’ 0 109 163
38 83 ‘Female’ 64 ‘Williams’ ‘St. Mary’s Medical Center’ ‘Good’ 0 125 131
40 75 ‘Female’ 67 ‘Jones’ ‘VA Hospital’ ‘Fair’ 0 117 133
49 80 ‘Female’ 64 ‘Brown’ ‘County General Hospital’ ‘Good’ 0 122 119
46 70 ‘Female’ 68 ‘Davis’ ‘St. Mary’s Medical Center’ ‘Good’ 0 121 142
33 88 ‘Female’ 64 ‘Miller’ ‘VA Hospital’ ‘Good’ 1 130 142
40 82 ‘Male’ 68 ‘Wilson’ ‘VA Hospital’ ‘Good’ 0 115 180
28 78 ‘Male’ 68 ‘Moore’ ‘St. Mary’s Medical Center’ ‘Excellent’ 0 115 183
31 86 ‘Female’ 66 ‘Taylor’ ‘County General Hospital’ ‘Excellent’ 0 118 132
45 77 ‘Female’ 68 ‘Anderson’ ‘County General Hospital’ ‘Excellent’ 0 114 128
42 68 ‘Female’ 66 ‘Thomas’ ‘St. Mary’s Medical Center’ ‘Poor’ 0 115 137
25 74 ‘Male’ 71 ‘Jackson’ ‘VA Hospital’ ‘Poor’ 0 127 174
39 95 ‘Male’ 72 ‘White’ ‘VA Hospital’ ‘Excellent’ 1 130 202
36 79 ‘Female’ 65 ‘Harris’ ‘St. Mary’s Medical Center’ ‘Good’ 0 114 129
48 92 ‘Male’ 71 ‘Martin’ ‘VA Hospital’ ‘Good’ 1 130 181
32 95 ‘Male’ 69 ‘Thompson’ ‘St. Mary’s Medical Center’ ‘Excellent’ 1 124 191
27 79 ‘Female’ 69 ‘Garcia’ ‘VA Hospital’ ‘Fair’ 1 123 131
37 77 ‘Male’ 70 ‘Martinez’ ‘County General Hospital’ ‘Good’ 0 119 179
50 76 ‘Male’ 68 ‘Robinson’ ‘County General Hospital’ ‘Good’ 0 125 172
48 75 ‘Female’ 65 ‘Clark’ ‘VA Hospital’ ‘Excellent’ 0 121 133
39 79 ‘Female’ 64 ‘Rodriguez’ ‘VA Hospital’ ‘Fair’ 0 123 117
41 88 ‘Female’ 62 ‘Lewis’ ‘VA Hospital’ ‘Fair’ 0 114 137
44 90 ‘Female’ 66 ‘Lee’ ‘County General Hospital’ ‘Fair’ 1 128 146
28 96 ‘Female’ 65 ‘Walker’ ‘County General Hospital’ ‘Good’ 1 129 123
25 77 ‘Male’ 70 ‘Hall’ ‘VA Hospital’ ‘Poor’ 0 114 189
39 80 ‘Female’ 63 ‘Allen’ ‘VA Hospital’ ‘Excellent’ 0 113 143
25 76 ‘Female’ 63 ‘Young’ ‘County General Hospital’ ‘Good’ 0 125 114
36 83 ‘Male’ 68 ‘Hernandez’ ‘County General Hospital’ ‘Poor’ 0 120 166
30 89 ‘Male’ 67 ‘King’ ‘County General Hospital’ ‘Excellent’ 1 127 186
45 92 ‘Female’ 70 ‘Wright’ ‘VA Hospital’ ‘Excellent’ 1 134 126
40 83 ‘Female’ 66 ‘Lopez’ ‘VA Hospital’ ‘Poor’ 0 121 137
25 80 ‘Female’ 64 ‘Hill’ ‘St. Mary’s Medical Center’ ‘Excellent’ 0 115 138
47 84 ‘Male’ 70 ‘Scott’ ‘St. Mary’s Medical Center’ ‘Excellent’ 0 127 187
44 92 ‘Male’ 71 ‘Green’ ‘County General Hospital’ ‘Good’ 0 121 193
48 83 ‘Female’ 66 ‘Adams’ ‘VA Hospital’ ‘Excellent’ 0 127 137
44 90 ‘Male’ 71 ‘Baker’ ‘VA Hospital’ ‘Good’ 1 136 192
35 85 ‘Female’ 66 ‘Gonzalez’ ‘St. Mary’s Medical Center’ ‘Fair’ 0 117 118
33 90 ‘Male’ 66 ‘Nelson’ ‘St. Mary’s Medical Center’ ‘Good’ 1 124 180
38 74 ‘Female’ 63 ‘Carter’ ‘St. Mary’s Medical Center’ ‘Good’ 0 120 128
39 92 ‘Male’ 71 ‘Mitchell’ ‘County General Hospital’ ‘Fair’ 1 128 164
44 80 ‘Male’ 69 ‘Perez’ ‘VA Hospital’ ‘Excellent’ 0 116 183
44 89 ‘Male’ 70 ‘Roberts’ ‘VA Hospital’ ‘Good’ 1 132 169
37 96 ‘Male’ 70 ‘Turner’ ‘VA Hospital’ ‘Excellent’ 1 137 194
45 89 ‘Male’ 67 ‘Phillips’ ‘VA Hospital’ ‘Good’ 0 117 172
37 77 ‘Female’ 65 ‘Campbell’ ‘County General Hospital’ ‘Fair’ 0 116 135
30 81 ‘Male’ 68 ‘Parker’ ‘VA Hospital’ ‘Poor’ 0 119 182
39 76 ‘Female’ 62 ‘Evans’ ‘County General Hospital’ ‘Good’ 0 123 121
42 83 ‘Male’ 70 ‘Edwards’ ‘County General Hospital’ ‘Excellent’ 0 116 158
42 78 ‘Male’ 67 ‘Collins’ ‘County General Hospital’ ‘Good’ 1 124 179
49 95 ‘Male’ 68 ‘Stewart’ ‘County General Hospital’ ‘Poor’ 1 129 170
44 91 ‘Female’ 62 ‘Sanchez’ ‘St. Mary’s Medical Center’ ‘Good’ 1 130 136
43 91 ‘Female’ 64 ‘Morris’ ‘County General Hospital’ ‘Poor’ 1 132 135
47 86 ‘Female’ 66 ‘Rogers’ ‘VA Hospital’ ‘Excellent’ 0 117 147
50 89 ‘Male’ 72 ‘Reed’ ‘VA Hospital’ ‘Excellent’ 1 129 186
38 79 ‘Female’ 63 ‘Cook’ ‘VA Hospital’ ‘Excellent’ 0 118 124
41 74 ‘Female’ 66 ‘Morgan’ ‘St. Mary’s Medical Center’ ‘Good’ 0 120 134
45 82 ‘Male’ 70 ‘Bell’ ‘St. Mary’s Medical Center’ ‘Good’ 1 138 170
36 76 ‘Male’ 71 ‘Murphy’ ‘VA Hospital’ ‘Good’ 0 117 180
38 81 ‘Female’ 68 ‘Bailey’ ‘St. Mary’s Medical Center’ ‘Good’ 0 113 130
29 77 ‘Female’ 63 ‘Rivera’ ‘County General Hospital’ ‘Excellent’ 0 122 130
28 73 ‘Female’ 65 ‘Cooper’ ‘VA Hospital’ ‘Good’ 0 115 127
30 85 ‘Female’ 67 ‘Richardson’ ‘County General Hospital’ ‘Excellent’ 0 120 141
28 76 ‘Female’ 66 ‘Cox’ ‘County General Hospital’ ‘Good’ 0 117 111
29 80 ‘Female’ 68 ‘Howard’ ‘VA Hospital’ ‘Excellent’ 0 123 134
36 80 ‘Male’ 71 ‘Ward’ ‘St. Mary’s Medical Center’ ‘Good’ 0 123 189
45 79 ‘Female’ 70 ‘Torres’ ‘County General Hospital’ ‘Excellent’ 0 119 137
32 82 ‘Female’ 60 ‘Peterson’ ‘County General Hospital’ ‘Excellent’ 0 110 136
31 79 ‘Female’ 64 ‘Gray’ ‘VA Hospital’ ‘Excellent’ 0 121 130
48 82 ‘Female’ 64 ‘Ramirez’ ‘County General Hospital’ ‘Excellent’ 1 138 137
25 75 ‘Male’ 66 ‘James’ ‘County General Hospital’ ‘Good’ 0 125 186
40 91 ‘Female’ 64 ‘Watson’ ‘VA Hospital’ ‘Fair’ 1 122 127
39 74 ‘Male’ 72 ‘Brooks’ ‘St. Mary’s Medical Center’ ‘Excellent’ 0 120 176
41 78 ‘Female’ 65 ‘Kelly’ ‘St. Mary’s Medical Center’ ‘Poor’ 0 117 127
33 85 ‘Female’ 67 ‘Sanders’ ‘St. Mary’s Medical Center’ ‘Excellent’ 1 125 115
31 84 ‘Male’ 72 ‘Price’ ‘VA Hospital’ ‘Fair’ 1 124 178
35 75 ‘Female’ 64 ‘Bennett’ ‘County General Hospital’ ‘Fair’ 0 121 131
32 78 ‘Male’ 68 ‘Wood’ ‘St. Mary’s Medical Center’ ‘Poor’ 0 118 183
42 81 ‘Male’ 66 ‘Barnes’ ‘County General Hospital’ ‘Excellent’ 0 120 194
48 79 ‘Female’ 64 ‘Ross’ ‘VA Hospital’ ‘Good’ 0 118 126
34 85 ‘Male’ 68 ‘Henderson’ ‘St. Mary’s Medical Center’ ‘Good’ 0 118 186
39 79 ‘Male’ 69 ‘Coleman’ ‘VA Hospital’ ‘Excellent’ 0 122 188
28 82 ‘Male’ 69 ‘Jenkins’ ‘County General Hospital’ ‘Good’ 1 134 189
29 80 ‘Female’ 64 ‘Perry’ ‘St. Mary’s Medical Center’ ‘Good’ 0 131 120
32 80 ‘Female’ 63 ‘Powell’ ‘VA Hospital’ ‘Excellent’ 0 113 132
39 92 ‘Male’ 68 ‘Long’ ‘County General Hospital’ ‘Good’ 1 125 182
37 92 ‘Female’ 65 ‘Patterson’ ‘County General Hospital’ ‘Poor’ 1 135 120
49 96 ‘Female’ 63 ‘Hughes’ ‘County General Hospital’ ‘Good’ 1 128 123
31 87 ‘Female’ 66 ‘Flores’ ‘VA Hospital’ ‘Good’ 1 123 141
37 81 ‘Female’ 65 ‘Washington’ ‘St. Mary’s Medical Center’ ‘Good’ 0 122 129
38 90 ‘Male’ 68 ‘Butler’ ‘County General Hospital’ ‘Excellent’ 1 138 184
45 77 ‘Male’ 71 ‘Simmons’ ‘VA Hospital’ ‘Excellent’ 0 124 181
30 91 ‘Female’ 70 ‘Foster’ ‘St. Mary’s Medical Center’ ‘Fair’ 0 130 124
48 79 ‘Male’ 71 ‘Gonzales’ ‘County General Hospital’ ‘Good’ 0 123 174
48 73 ‘Female’ 66 ‘Bryant’ ‘County General Hospital’ ‘Excellent’ 0 129 134
25 99 ‘Male’ 69 ‘Alexander’ ‘County General Hospital’ ‘Good’ 1 128 171
44 92 ‘Male’ 69 ‘Russell’ ‘VA Hospital’ ‘Good’ 1 124 188
49 74 ‘Male’ 70 ‘Griffin’ ‘County General Hospital’ ‘Fair’ 0 119 186
45 93 ‘Male’ 68 ‘Diaz’ ‘County General Hospital’ ‘Good’ 1 136 172
48 86 ‘Male’ 66 ‘Hayes’ ‘County General Hospital’ ‘Fair’ 0 114 177


Videos

I’ve recorded a 45-minute video on how to take machine learning to the next level in an applied Wine Quality Prediction Project.

If you’re not ready for that and want a tutorial on the basics of machine learning, my 1.5-hour Overview of Machine Learning might be better. It will guide you through many of the general concepts, as well as some of the various models listed below.