This analysis fits a stepwise linear regression model to predict systolic blood pressure. Seven numeric and categorical features are included: Age, Height, Weight, Gender, Hospital Location (CGH: County General Hospital; SMMC: St. Mary's Medical Center; VA: VA Hospital), Self-Assessed Health Status (Excellent, Fair, Good, Poor), and Smoker. The regression takes the general form:
\[\begin{equation} \hat{Systolic} = \theta_1(Age) + \theta_2(Height) + \theta_3(Weight) + \dots \\ + \theta_{n-2}(HealthFair) + \theta_{n-1}(HealthPoor) + \theta_n(Smoker) + b \end{equation}\]
This is just one of the five machine learning modeling guides you can find here.
This analysis addresses the following requirements:
Load the patient self-evaluation dataset.
Identify the regression coefficients (thetas).
Create a reduced model using stepwise regression.
There are two ways to load the required packages.
# install.packages("pacman")
# library("pacman")
#Check for and install required packages, using pacman
if (!require("pacman")) install.packages("pacman")
## Loading required package: pacman
pacman::p_load(base64enc, ggplot2, kableExtra)
The dataset is loaded from a local CSV; a copy of the full table appears at the end of this post.
patients <- read.csv("C:/Users/G/Google Drive/aStThomas/6MachineLearning/Assignments/R/2 linear regression/patients.csv", header = TRUE)
Examine the data structure.
#Preview structure
str(patients)
## 'data.frame': 100 obs. of 10 variables:
## $ Age : int 38 43 38 40 49 46 33 40 28 31 ...
## $ Diastolic : int 93 77 83 75 80 70 88 82 78 86 ...
## $ Gender : Factor w/ 2 levels "'Female'","'Male'": 2 2 1 1 1 1 1 2 2 1 ...
## $ Height : int 71 69 64 67 64 68 64 68 68 66 ...
## $ LastName : Factor w/ 100 levels "'Adams'","'Alexander'",..: 84 45 96 46 11 22 55 97 57 86 ...
## $ Location : Factor w/ 3 levels "'County General Hospital'",..: 1 3 2 3 1 2 3 3 2 1 ...
## $ SelfAssessedHealthStatus: Factor w/ 4 levels "'Excellent'",..: 1 2 3 2 3 3 3 3 1 1 ...
## $ Smoker : int 1 0 0 0 0 0 1 0 0 0 ...
## $ Systolic : int 124 109 125 117 122 121 130 115 115 118 ...
## $ Weight : int 176 163 131 133 119 142 142 180 183 132 ...
Examine the top 5 rows.
#Preview top 5 rows
head(patients, n=5)
## Age Diastolic Gender Height LastName Location
## 1 38 93 'Male' 71 'Smith' 'County General Hospital'
## 2 43 77 'Male' 69 'Johnson' 'VA Hospital'
## 3 38 83 'Female' 64 'Williams' 'St. Mary's Medical Center'
## 4 40 75 'Female' 67 'Jones' 'VA Hospital'
## 5 49 80 'Female' 64 'Brown' 'County General Hospital'
## SelfAssessedHealthStatus Smoker Systolic Weight
## 1 'Excellent' 1 124 176
## 2 'Fair' 0 109 163
## 3 'Good' 0 125 131
## 4 'Fair' 0 117 133
## 5 'Good' 0 122 119
Each attribute contains 100 observations; there are no missing values. Therefore, we do not need to fill missing values with mean/mode, or drop any columns/rows.
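Before moving on, here is a quick two-line sketch of the check behind that claim:
sum(is.na(patients)) #total count of missing values; expected 0
colSums(is.na(patients)) #missing values per column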
Now remove the unwanted Diastolic and LastName columns from the patients table.
patientsOriginal <- patients
df <- patients[-c(2, 5)] #drop columns 2 (Diastolic) and 5 (LastName)
Split the dataframe into categorical and numeric subsets.
df_categorical <- df[,c("Gender","Location","SelfAssessedHealthStatus", "Smoker")]
df_numeric <- df[,c("Age","Height", "Systolic", "Weight")]
One-hot encode categorical columns.
#Check for and install required packages, using pacman
if (!require("pacman")) install.packages("pacman")
pacman::p_load(dummies)
#Convert the categorical columns to one-hot
#sep adds a separator in the header titles
df_categorical <- dummy.data.frame(df_categorical, names=c("Location", "SelfAssessedHealthStatus", "Gender"), sep="-")
df_categorical <- df_categorical[-c(1)] #drop the Female column; Male alone encodes gender (Male = 1, Female = 0)
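Note that the dummies package has since been archived on CRAN, so p_load(dummies) may fail on newer R installs. Here is a base-R sketch of equivalent one-hot encoding using model.matrix(); the contrasts.arg trick keeps every factor level rather than dropping one:
#Base-R alternative, starting from the raw (pre-encoding) categorical columns
cat_raw <- patients[, c("Gender", "Location", "SelfAssessedHealthStatus")]
oh <- model.matrix(~ . - 1, data = cat_raw, contrasts.arg = lapply(cat_raw, contrasts, contrasts = FALSE))
df_categorical_alt <- cbind(as.data.frame(oh), Smoker = patients$Smoker)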
Standardize the numeric columns, then recombine the two dataframes.
An additional interesting discussion on when to standardize is here.
scaled_numericdf <- scale(df_numeric)
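Under the hood, scale() computes column-wise z-scores; a quick sketch confirming this for Age:
#scale() standardizes each column as z = (x - mean(x)) / sd(x)
all.equal(as.numeric(scaled_numericdf[, "Age"]), (df_numeric$Age - mean(df_numeric$Age)) / sd(df_numeric$Age))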
Recombine the forked categorical and numeric dataframes using a column bind.
df <- cbind(scaled_numericdf, df_categorical)
Plot histograms of the numeric columns. For bin specification, see here.
For plotting multiples, see here.
#Check for and install required packages, using pacman
if (!require("pacman")) install.packages("pacman")
pacman::p_load(ggplot2, gridExtra)
AgePlot <- ggplot(data=df, aes(x=Age)) +
geom_histogram(bins = 20,
col="black",
fill="blue",
alpha = .3) +
labs(title="Histogram for Age", x="Age", y="Count")
HeightPlot <- ggplot(data=df, aes(x=Height)) +
geom_histogram(bins = 20,
col="black",
fill="blue",
alpha = .3) +
labs(title="Histogram for Height", x="Height", y="Count")
WeightPlot <- ggplot(data=df, aes(x=Weight)) +
geom_histogram(bins = 20,
col="black",
fill="blue",
alpha = .3) +
labs(title="Histogram for Weight", x="Weight", y="Count")
SystolicPlot <- ggplot(data=df, aes(x=Systolic)) +
geom_histogram(bins = 20,
col="black",
fill="green",
alpha = .3) +
labs(title="Histogram for Systolic", x="Systolic", y="Count")
grid.arrange(AgePlot, HeightPlot, WeightPlot, SystolicPlot, nrow=2)
The Weight distribution is a bit unusual: it appears bimodal (plausibly reflecting the gender split), where you would expect it to be roughly normal.
Optional: Explore the dataset using a scatterplot.
Rename column headers for easier interpretation and reference.
names(df)[5] <- "Male" #Gender: Male = 1, Female = 0
names(df)[6] <- "Location1" #County General Hospital
names(df)[7] <- "Location2" #St. Mary's Medical Center
names(df)[8] <- "Location3" #VA Hospital
names(df)[9] <- "HealthExcellent"
names(df)[10] <- "HealthFair"
names(df)[11] <- "HealthGood"
names(df)[12] <- "HealthPoor"
names(df)[13] <- "Smoker"
Having completed the pre-processing and data exploration phases, we now move onto building a multiple linear regression model.
# fit <- lm(Systolic ~ Age + Height + Weight + Male + Location1 + Location2 + Location3 + HealthExcellent + HealthFair + HealthGood + HealthPoor + Smoker, data = df)
#Finding the best model using backward stepwise regression
full.model <- lm(Systolic ~ ., data = df) #the dot means: use all remaining columns as predictors
summary(full.model)
##
## Call:
## lm(formula = Systolic ~ ., data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.2853 -0.4662 -0.1028 0.4681 1.7636
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.43112 0.33094 -1.303 0.1960
## Age 0.08627 0.07201 1.198 0.2341
## Height 0.19844 0.10729 1.850 0.0677 .
## Weight -0.05311 0.23105 -0.230 0.8187
## Male -0.22038 0.48649 -0.453 0.6516
## Location1 0.25844 0.16881 1.531 0.1293
## Location2 0.13084 0.19193 0.682 0.4972
## Location3 NA NA NA NA
## HealthExcellent -0.06843 0.24970 -0.274 0.7847
## HealthFair -0.47823 0.28414 -1.683 0.0959 .
## HealthGood 0.01892 0.24150 0.078 0.9377
## HealthPoor NA NA NA NA
## Smoker 1.44098 0.15581 9.249 1.15e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.702 on 89 degrees of freedom
## Multiple R-squared: 0.5569, Adjusted R-squared: 0.5071
## F-statistic: 11.19 on 10 and 89 DF, p-value: 3.894e-12
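The note "(2 not defined because of singularities)" is the dummy-variable trap: each one-hot group sums to a constant column, which duplicates the intercept, so lm() silently drops one dummy per group, here Location3 and HealthPoor. A quick sketch of the check, using the renamed df:
#Every patient has exactly one hospital, so the three location dummies always sum to 1
all(rowSums(df[, c("Location1", "Location2", "Location3")]) == 1)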
Where the model formula follows R's general form: response ~ op1 term1 op2 term2 … opn termn (each op is an operator such as + or -, and each term is a predictor).
Optional: Explore the dataset using the squared error, with the code: plot_ss(x = df$Age, y = df$Systolic, showSquares = TRUE), which I found here.
We can optimize this model using stepwise regression to find the best-fitting, yet still parsimonious, model.
reduced.model <- step(full.model, direction = "backward")
## Start: AIC=-60.4
## Systolic ~ Age + Height + Weight + Male + Location1 + Location2 +
## Location3 + HealthExcellent + HealthFair + HealthGood + HealthPoor +
## Smoker
##
##
## Step: AIC=-60.4
## Systolic ~ Age + Height + Weight + Male + Location1 + Location2 +
## Location3 + HealthExcellent + HealthFair + HealthGood + Smoker
##
##
## Step: AIC=-60.4
## Systolic ~ Age + Height + Weight + Male + Location1 + Location2 +
## HealthExcellent + HealthFair + HealthGood + Smoker
##
## Df Sum of Sq RSS AIC
## - HealthGood 1 0.003 43.869 -62.397
## - Weight 1 0.026 43.892 -62.345
## - HealthExcellent 1 0.037 43.902 -62.320
## - Male 1 0.101 43.967 -62.174
## - Location2 1 0.229 44.095 -61.883
## - Age 1 0.707 44.573 -60.805
## <none> 43.865 -60.404
## - Location1 1 1.155 45.021 -59.805
## - HealthFair 1 1.396 45.262 -59.271
## - Height 1 1.686 45.552 -58.633
## - Smoker 1 42.158 86.023 4.945
##
## Step: AIC=-62.4
## Systolic ~ Age + Height + Weight + Male + Location1 + Location2 +
## HealthExcellent + HealthFair + Smoker
##
## Df Sum of Sq RSS AIC
## - Weight 1 0.026 43.895 -64.338
## - Male 1 0.101 43.969 -64.168
## - HealthExcellent 1 0.122 43.991 -64.119
## - Location2 1 0.234 44.102 -63.866
## - Age 1 0.722 44.590 -62.765
## <none> 43.869 -62.397
## - Location1 1 1.159 45.028 -61.789
## - Height 1 1.683 45.552 -60.633
## - HealthFair 1 2.606 46.475 -58.626
## - Smoker 1 42.566 86.434 3.421
##
## Step: AIC=-64.34
## Systolic ~ Age + Height + Male + Location1 + Location2 + HealthExcellent +
## HealthFair + Smoker
##
## Df Sum of Sq RSS AIC
## - HealthExcellent 1 0.163 44.058 -65.967
## - Location2 1 0.243 44.138 -65.786
## - Age 1 0.713 44.607 -64.727
## <none> 43.895 -64.338
## - Male 1 1.104 44.999 -63.853
## - Location1 1 1.285 45.180 -63.452
## - Height 1 1.690 45.585 -62.560
## - HealthFair 1 2.586 46.481 -60.614
## - Smoker 1 42.687 86.581 1.591
##
## Step: AIC=-65.97
## Systolic ~ Age + Height + Male + Location1 + Location2 + HealthFair +
## Smoker
##
## Df Sum of Sq RSS AIC
## - Location2 1 0.331 44.389 -67.218
## - Age 1 0.701 44.759 -66.389
## <none> 44.058 -65.967
## - Male 1 0.991 45.049 -65.743
## - Location1 1 1.330 45.387 -64.993
## - Height 1 1.565 45.623 -64.476
## - HealthFair 1 2.450 46.508 -62.555
## - Smoker 1 43.571 87.629 0.794
##
## Step: AIC=-67.22
## Systolic ~ Age + Height + Male + Location1 + HealthFair + Smoker
##
## Df Sum of Sq RSS AIC
## - Age 1 0.614 45.003 -67.843
## <none> 44.389 -67.218
## - Location1 1 1.007 45.396 -66.976
## - Male 1 1.042 45.431 -66.898
## - Height 1 1.552 45.941 -65.782
## - HealthFair 1 2.710 47.099 -63.292
## - Smoker 1 43.243 87.632 -1.203
##
## Step: AIC=-67.84
## Systolic ~ Height + Male + Location1 + HealthFair + Smoker
##
## Df Sum of Sq RSS AIC
## <none> 45.003 -67.843
## - Male 1 1.068 46.071 -67.498
## - Location1 1 1.133 46.137 -67.356
## - Height 1 1.748 46.752 -66.032
## - HealthFair 1 2.677 47.681 -64.065
## - Smoker 1 43.577 88.580 -2.126
summary(reduced.model)
##
## Call:
## lm(formula = Systolic ~ Height + Male + Location1 + HealthFair +
## Smoker, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.2726 -0.5124 -0.0512 0.3933 1.8015
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.3623 0.1351 -2.681 0.00868 **
## Height 0.1983 0.1038 1.911 0.05905 .
## Male -0.3101 0.2076 -1.494 0.13864
## Location1 0.2255 0.1465 1.539 0.12725
## HealthFair -0.4643 0.1963 -2.365 0.02010 *
## Smoker 1.4404 0.1510 9.540 1.74e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6919 on 94 degrees of freedom
## Multiple R-squared: 0.5454, Adjusted R-squared: 0.5212
## F-statistic: 22.56 on 5 and 94 DF, p-value: 8.427e-15
plot(reduced.model)
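Because the predictors and Systolic were all standardized, the coefficients (and any prediction) are in standard-deviation units. Here is a minimal sketch of predicting for a hypothetical patient and converting back to the original mmHg scale; the patient's values below are made up:
#Hypothetical patient: 1 SD above mean height, male, County General, not in fair health, smoker
new_patient <- data.frame(Height = 1, Male = 1, Location1 = 1, HealthFair = 0, Smoker = 1)
pred_z <- predict(reduced.model, newdata = new_patient)
#Back-transform from standardized units to mmHg using the original Systolic mean and sd
pred_z * sd(patientsOriginal$Systolic) + mean(patientsOriginal$Systolic)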
When comparing models fitted by maximum likelihood to the same data, the smaller the AIC or BIC, the better the fit. For more information on model selection criteria see R Documentation and this.
ModelComparison_AIC <- AIC(full.model, reduced.model)
print(ModelComparison_AIC)
## df AIC
## full.model 12 225.3834
## reduced.model 7 217.9447
ModelComparison_BIC <- BIC(full.model, reduced.model)
print(ModelComparison_BIC)
## df BIC
## full.model 12 256.6455
## reduced.model 7 236.1809
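Note that AIC() reports 217.94 for the reduced model while step() printed -67.84. Both are consistent: step() uses extractAIC(), which drops additive constants from the log-likelihood, so the two scales differ by a constant but rank models identically:
extractAIC(reduced.model) #what step() reports: equivalent degrees of freedom, then n*log(RSS/n) + 2*edf
AIC(reduced.model) #full -2*logLik + 2*df, constant included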
A full and a reduced regression model were constructed. Performance varied between the full model, which includes all variables, and the reduced model, which includes a limited set. The reduced model was built using the backward stepwise method to find the most useful predictors of the target Systolic variable. The reduced model had a lower p-value, a lower AIC, and a lower BIC.
The reduced model retained the following predictors: Height, Gender, Hospital Location (County General Hospital), Health Status (Fair), and Smoker, of which HealthFair and Smoker are statistically significant at the 0.05 level.
\[\begin{equation} \hat{Systolic} = 0.20(Height) - 0.31(Male) + 0.23(CountyGeneralHospital) \\ - 0.46(HealthFair) + 1.44(Smoker) - 0.36 \end{equation}\]
Of note, there is an interesting discussion as to whether stepwise regression should ever be used. Some traditional statisticians say it should never be, since you're leaving model building purely to the math, and not using human intelligence to determine which features to include in the model. Data miners, on the other hand, use an alternative methodology: rely completely on the math to select features, and only afterwards determine whether the included features are appropriate.
It's unsettling because the Age of Enlightenment was founded on the scientific method; Isaac Newton himself adopted it. We shouldn't be readily willing to step off this proven path, should we? But consider why, in this case, we should. When you choose which variables explain the dependent variable (here, Systolic), you introduce your own bias. The effect is that of wearing horse blinders: you dismiss factors you don't believe have any impact and include only those you believe important. But correlations and causations sometimes surprise us. Herein lies my own case for ex post facto feature selection, and with it, stepwise regression.
### Bonus: A Parallel Workflow in Python
In[1]:
import pandas as pd
import numpy as np
In[2]:
# Load data
# genfromtxt is a NumPy alternative; I prefer pandas with an explicit file path.
# Because there is a header row, set header=0
patients = pd.read_csv(r"C:\tmp\patients.csv", header=0)  # raw string, since \t would otherwise be read as a tab
# Back up patients, just in case we need it later; .copy() makes the backup independent
patientsBackup = patients.copy()
### Preview the Data
We want to preview the data to see what we'll be working with. This will display any missing values, as well.
In[3]:
# quick description of the data
patients.info()
# top 3 rows
patients.head(3)
Each attribute contains 100 observations; there are no missing values. Therefore, we do not need to fill missing values with mean/mode, or drop any columns/rows.
In[4]:
# show a summary of the numerical attributes
patients.describe()  # note: describe is a method and needs parentheses; a trailing semicolon would suppress the notebook's output echo
In[5]:
# Histogram visualization
# DataFrame.hist() relies on matplotlib
import matplotlib.pyplot as plot
%matplotlib inline
patients.hist(bins=20, figsize=(16,8))
plot.show()
##### Cross Correlation Check
The corresponding function shown in the Hands-On Machine Learning book has since been deprecated.
In[6]:
from pandas.plotting import scatter_matrix
# Note: this is not the latest dataframe
attributes = ["Age", "Diastolic", "Height", "Smoker", "Weight"]
scatter_matrix(patients[attributes], figsize=(16, 8))
# the text output below is expected:
# https://pandas.pydata.org/pandas-docs/stable/visualization.html
Nothing especially interesting there.
### Data Adjustments
#### First, split the matrix into y (dependent) and X (independent)
Remember, Python is 0-offset! The “3rd” entry is at position 2.
patientsY = Diastolic
patientsX = everything else, excluding LastName and Systolic
Perfectly clear examples of splitting y and X are found here and here. The final solution for selecting multiple columns was found here.
The independent variables consist of numeric, categorical and binary datatypes. Each will be processed individually.
In[7]:
# Split the dependent variable from the independent variables.
# Two positional approaches, kept for reference (both superseded below):
# patientsY = patients[patients.columns[1]]
# patientsY = patients.iloc[:, 1:2]  # note: iloc[:, 1:1] selects nothing, since the end index is exclusive
# The clearest is this:
patientsY = patients["Diastolic"]
# Likewise for X; the first form raises a KeyError and is kept only for reference:
# patientsX = patients["Age", "Gender"]
# patientsX = patients.loc[:, "Age":"Gender"]
patientsX = patients[["Age", "Gender", "Height", "Location", "SelfAssessedHealthStatus", "Weight"]]
patientsXNumeric = patients[["Age", "Height", "Weight"]]
# Smoker is not pulled with the other categorical data, so this next line was added
patientsXBinary = patients[["Smoker"]]
#### Standardize the Data, or mean removal and variance scaling
>Standardization of datasets is a common requirement for many machine learning estimators implemented in scikit-learn; they might behave badly if the individual features do not more or less look like standard normally distributed data: Gaussian with zero mean and unit variance.
>Basically, take a matrix and change it so that its mean is equal to 0 and variance is 1
It matters in our case because Weight has values so much higher than Age. After fitting, our interpretation of the model would be influenced more by Weight than Age, simply because it has larger values.
We don't need to standardize the dependent y variable, so we split the matrix before standardizing the entire X matrix.
Here is the clearest example of normalizing and standardizing.
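In equation form, standardization maps each value to a z-score:
\[\begin{equation} z = \frac{x - \mu}{\sigma} \end{equation}\]
where μ and σ are the column's mean and standard deviation.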
In[8]:
from sklearn import preprocessing
# Standardize
patientsXNumeric = patients[["Age", "Height", "Weight"]]
patientsXNumeric_scaled = preprocessing.scale(patientsXNumeric, axis=0)
Scaled data should have zero mean and unit variance.
In[9]:
# Mean
print("Mean:", patientsXNumeric_scaled.mean())
# Std
print("Std:", patientsXNumeric_scaled.std())
print("Length:", len(patientsXNumeric_scaled))
# Standardizing stripped the column headers (preprocessing.scale() returns a NumPy array); this is fixed below.
##### One-hot encoding preparation
Perform one-hot encoding, where 1 = hot, 0 = cold. Each feature value gets its own binary column.
Excellent tutorial here.
Another one here.
In[10]:
# Return only object datatypes (the non-numeric columns here)
categories = patientsX.select_dtypes(include=[object])
# As we will only be dealing with categorical features here, it's better to filter them out.
# A separate DataFrame consisting of only these features can be created with the following command.
# The .copy() method ensures that changes made in the new DataFrame don't get reflected in the original.
categoriesX = patientsX.select_dtypes(include=[object]).copy()
categoriesX.head()
Let’s also check the column-wise distribution of null values:
In[11]:
print(categoriesX.isnull().sum())
print(patientsXBinary.isnull().sum())
No missing values. Good!
Next, count the distinct values of each category.
In[12]:
print("Gender:", categoriesX["Gender"].value_counts().count())
print("Location:", categoriesX["Location"].value_counts().count())
print("SelfAssessedHealthStatus:", categoriesX["SelfAssessedHealthStatus"].value_counts().count())
print("Smoker:", patientsXBinary["Smoker"].value_counts().count())
No category has so many unique values that one-hot encoding would complicate the linear regression.
##### One-Hot Encoding
As said in this terrific one-hot tutorial:
>There are many libraries out there that support one-hot encoding but the > simplest one is using pandas’ .get_dummies() method.
>
>There are mainly three arguments important here: the first one is the DataFrame you want to encode on, the second is the columns argument, which lets you specify the columns you want to do the encoding on, and the third is the prefix argument, which lets you specify the prefix for the new columns that will be created after encoding.
LastName is not to be included in the linear regression.
In[13]:
categoriesX_onehot = categoriesX.copy()
categoriesX_onehot = pd.get_dummies(categoriesX, columns=["Gender", "Location", "SelfAssessedHealthStatus"], prefix=["Gender", "Location", "SelfAssessedHealthStatus"])
categoriesXBinary_onehot = pd.get_dummies(patientsXBinary, columns=["Smoker"], prefix=["Smoker"])
# Return results
print(categoriesX_onehot.head())
print(categoriesXBinary_onehot.head())
Now that one-hot encoding has split the categorical attributes into many dummy attributes, they must be concatenated back together. This can be done via pandas' .concat() method. The axis argument is set to 1 as you want to merge on columns.
In[14]:
print("categoriesX_onehot is:", type(categoriesX_onehot))
print(categoriesX_onehot.shape)
print("categoriesXBinary_onehot is:", type(categoriesXBinary_onehot))
print(categoriesXBinary_onehot.shape)
print("patientsXNumeric_scaled is:", type(patientsXNumeric_scaled))
print(patientsXNumeric_scaled.shape)
patientsXNumeric_scaled is an array. I used this SO post to convert it to a dataframe.
In[15]:
patientsXNumeric_scaleddf = pd.DataFrame(patientsXNumeric_scaled)
In[16]:
print(“patientsXNumeric_scaleddf is:”, type(patientsXNumeric_scaleddf))
In[17]:
print(patientsXNumeric_scaleddf.head())
Looks better, but it still needs column names.
In[18]:
patientsXNumeric_scaleddf.columns = ["Age", "Height", "Weight"]
Now we bring all the columns back together as one dataframe.
In[29]:
# Bring them back together
# patientsXAll = pd.concat([categoriesX_onehot, patientsXNumeric, categoriesXBinary_onehot], axis=1)  # first attempt: used the unscaled numerics
patientsXAll = pd.concat([categoriesX_onehot, patientsXNumeric_scaleddf, categoriesXBinary_onehot], axis=1)
# top 3 rows
print(patientsXAll.head(3))
Above, the numeric, one-hot encoded categorical and binary columns have been concatenated into one dataframe.
##### Binning/Aggregating
None of the features (e.g. Age) requires binning or aggregating.
### Build a Linear Regression Model
3. Use variables Age, Gender, Height, Weight, Smoker, Location, SelfAssessedHealthStatus to build a linear regression model to predict the systolic blood pressure.
That does not include Diastolic or LastName in the prediction.
In this assignment, there is no need to split the dataset into training and testing (or training, validation and testing), but if you wanted to, this is an incredibly clear example of that.
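For reference, a minimal hold-out split in R, mirroring the dataframe built in the first half of this post; the 80/20 ratio and the seed value are arbitrary choices:
set.seed(42) #for reproducibility
train_idx <- sample(seq_len(nrow(df)), size = floor(0.8 * nrow(df)))
train <- df[train_idx, ]
test <- df[-train_idx, ]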
In[30]:
# mdl = fitlm(patientsXAll, patientsY)  # the MATLAB equivalent, for reference
import matplotlib.pyplot as plt
import numpy as np
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score
# Create the linear regression object
regr = linear_model.LinearRegression()
# Train the model using the full dataset
trainedmodel = regr.fit(patientsXAll, patientsY)
### Interpretation
4. What are the regression coefficients (thetas)?
In[31]:
# y-intercept (theta0)
yint = regr.intercept_
print("Y intercept:", yint)
# The coefficients (theta1 ... thetan)
coefficients = regr.coef_
print("Coefficients:", coefficients)
5. How do you interpret those numbers?
In[32]:
print("Number of coefficients:", len(coefficients))
print("Number of columns:", len(patientsXAll.columns))
Just a quick check that the coefficient count matches the attribute count.
In[33]:
print(patientsXAll.columns)
print(coefficients)
A coefficient of 10 for a numeric, non-dummy attribute indicates that for every +1 standard deviation in the independent (exogenous) variable, the dependent (endogenous) variable increases by 10 units. That is, when Weight increases by 1 standard deviation, predicted diastolic increases by 1.76e-1, or 0.176.
A coefficient of 10 for a categorical, dummy attribute indicates that when the independent variable is 1 (TRUE), the dependent variable increases by 10 units relative to the base assumption. That is, when you smoke, predicted diastolic increases by 5.188, relative to the baseline of not smoking.
6. If you need to identify one outlier record, which record is a potential outlier? How do you reach this conclusion?
There are no outliers among the categorical (dummy)/binary attributes: Gender, LastName, Location, SelfAssessedHealthStatus and Smoker are all irrelevant in the search for outliers. Hence, we are only interested in the three remaining numeric attributes: Age, Height and Weight. Let's begin the search with a box plot.
In[34]:
import matplotlib.pyplot as plot
%matplotlib inline
patientsXNumeric_scaleddf.plot.box(figsize=(16,4))
This suggests Height has the largest absolute outlier, which is a minimum. We can now examine this with scatterplots.
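As a reminder of what the boxplot flags: with the default whisker setting (1.5 × IQR), a point is drawn as an outlier when
\[\begin{equation} x < Q_1 - 1.5 \cdot IQR \quad \text{or} \quad x > Q_3 + 1.5 \cdot IQR, \qquad IQR = Q_3 - Q_1 \end{equation}\]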
In[35]:
plot.scatter(patientsXNumeric_scaleddf["Age"], patientsY)
In[ ]:
plot.scatter(patientsXNumeric_scaleddf["Height"], patientsY)
Here we see that same minimum outlier in Height.
In[36]:
plot.scatter(patientsXNumeric_scaleddf["Weight"], patientsY)
Find the numbers for those minimums and maximums.
In[37]:
patientsXNumeric_scaleddf.min()
In[38]:
patientsXNumeric_scaleddf.max()
So far, the single outlier record I identify using a boxplot is the lowest Height value, -2.505.
In[39]:
patientsXNumeric_scaleddf[“Height”].min()
## But what about Cook's distance or leverage?
statsmodels.stats.outliers_influence.OLSInfluence
But I could not get this working; the above is all I can manage. Next time I may need to try MATLAB…
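For what it's worth, Cook's distance is nearly a one-liner against the R model fitted in the first half of this post; a sketch, assuming reduced.model from above:
cooksd <- cooks.distance(reduced.model)
which.max(cooksd) #row index of the most influential record
plot(reduced.model, which = 4) #built-in Cook's distance plot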
The feature I'd remove first is LastName, since one-hot encoding it would render 100 columns. For a dataset of only 100 rows, that is far too many and would exhaust the available degrees of freedom.
To view this entire document’s markdown code, click here.
If you don't have the dataset, copy the table below, paste it into Excel, and save it as a comma-separated file named patients.csv in your preferred directory.
| Age | Diastolic | Gender | Height | LastName | Location | SelfAssessedHealthStatus | Smoker | Systolic | Weight |
|---|---|---|---|---|---|---|---|---|---|
| 38 | 93 | ‘Male’ | 71 | ‘Smith’ | ‘County General Hospital’ | ‘Excellent’ | 1 | 124 | 176 |
| 43 | 77 | ‘Male’ | 69 | ‘Johnson’ | ‘VA Hospital’ | ‘Fair’ | 0 | 109 | 163 |
| 38 | 83 | ‘Female’ | 64 | ‘Williams’ | ‘St. Mary’s Medical Center’ | ‘Good’ | 0 | 125 | 131 |
| 40 | 75 | ‘Female’ | 67 | ‘Jones’ | ‘VA Hospital’ | ‘Fair’ | 0 | 117 | 133 |
| 49 | 80 | ‘Female’ | 64 | ‘Brown’ | ‘County General Hospital’ | ‘Good’ | 0 | 122 | 119 |
| 46 | 70 | ‘Female’ | 68 | ‘Davis’ | ‘St. Mary’s Medical Center’ | ‘Good’ | 0 | 121 | 142 |
| 33 | 88 | ‘Female’ | 64 | ‘Miller’ | ‘VA Hospital’ | ‘Good’ | 1 | 130 | 142 |
| 40 | 82 | ‘Male’ | 68 | ‘Wilson’ | ‘VA Hospital’ | ‘Good’ | 0 | 115 | 180 |
| 28 | 78 | ‘Male’ | 68 | ‘Moore’ | ‘St. Mary’s Medical Center’ | ‘Excellent’ | 0 | 115 | 183 |
| 31 | 86 | ‘Female’ | 66 | ‘Taylor’ | ‘County General Hospital’ | ‘Excellent’ | 0 | 118 | 132 |
| 45 | 77 | ‘Female’ | 68 | ‘Anderson’ | ‘County General Hospital’ | ‘Excellent’ | 0 | 114 | 128 |
| 42 | 68 | ‘Female’ | 66 | ‘Thomas’ | ‘St. Mary’s Medical Center’ | ‘Poor’ | 0 | 115 | 137 |
| 25 | 74 | ‘Male’ | 71 | ‘Jackson’ | ‘VA Hospital’ | ‘Poor’ | 0 | 127 | 174 |
| 39 | 95 | ‘Male’ | 72 | ‘White’ | ‘VA Hospital’ | ‘Excellent’ | 1 | 130 | 202 |
| 36 | 79 | ‘Female’ | 65 | ‘Harris’ | ‘St. Mary’s Medical Center’ | ‘Good’ | 0 | 114 | 129 |
| 48 | 92 | ‘Male’ | 71 | ‘Martin’ | ‘VA Hospital’ | ‘Good’ | 1 | 130 | 181 |
| 32 | 95 | ‘Male’ | 69 | ‘Thompson’ | ‘St. Mary’s Medical Center’ | ‘Excellent’ | 1 | 124 | 191 |
| 27 | 79 | ‘Female’ | 69 | ‘Garcia’ | ‘VA Hospital’ | ‘Fair’ | 1 | 123 | 131 |
| 37 | 77 | ‘Male’ | 70 | ‘Martinez’ | ‘County General Hospital’ | ‘Good’ | 0 | 119 | 179 |
| 50 | 76 | ‘Male’ | 68 | ‘Robinson’ | ‘County General Hospital’ | ‘Good’ | 0 | 125 | 172 |
| 48 | 75 | ‘Female’ | 65 | ‘Clark’ | ‘VA Hospital’ | ‘Excellent’ | 0 | 121 | 133 |
| 39 | 79 | ‘Female’ | 64 | ‘Rodriguez’ | ‘VA Hospital’ | ‘Fair’ | 0 | 123 | 117 |
| 41 | 88 | ‘Female’ | 62 | ‘Lewis’ | ‘VA Hospital’ | ‘Fair’ | 0 | 114 | 137 |
| 44 | 90 | ‘Female’ | 66 | ‘Lee’ | ‘County General Hospital’ | ‘Fair’ | 1 | 128 | 146 |
| 28 | 96 | ‘Female’ | 65 | ‘Walker’ | ‘County General Hospital’ | ‘Good’ | 1 | 129 | 123 |
| 25 | 77 | ‘Male’ | 70 | ‘Hall’ | ‘VA Hospital’ | ‘Poor’ | 0 | 114 | 189 |
| 39 | 80 | ‘Female’ | 63 | ‘Allen’ | ‘VA Hospital’ | ‘Excellent’ | 0 | 113 | 143 |
| 25 | 76 | ‘Female’ | 63 | ‘Young’ | ‘County General Hospital’ | ‘Good’ | 0 | 125 | 114 |
| 36 | 83 | ‘Male’ | 68 | ‘Hernandez’ | ‘County General Hospital’ | ‘Poor’ | 0 | 120 | 166 |
| 30 | 89 | ‘Male’ | 67 | ‘King’ | ‘County General Hospital’ | ‘Excellent’ | 1 | 127 | 186 |
| 45 | 92 | ‘Female’ | 70 | ‘Wright’ | ‘VA Hospital’ | ‘Excellent’ | 1 | 134 | 126 |
| 40 | 83 | ‘Female’ | 66 | ‘Lopez’ | ‘VA Hospital’ | ‘Poor’ | 0 | 121 | 137 |
| 25 | 80 | ‘Female’ | 64 | ‘Hill’ | ‘St. Mary’s Medical Center’ | ‘Excellent’ | 0 | 115 | 138 |
| 47 | 84 | ‘Male’ | 70 | ‘Scott’ | ‘St. Mary’s Medical Center’ | ‘Excellent’ | 0 | 127 | 187 |
| 44 | 92 | ‘Male’ | 71 | ‘Green’ | ‘County General Hospital’ | ‘Good’ | 0 | 121 | 193 |
| 48 | 83 | ‘Female’ | 66 | ‘Adams’ | ‘VA Hospital’ | ‘Excellent’ | 0 | 127 | 137 |
| 44 | 90 | ‘Male’ | 71 | ‘Baker’ | ‘VA Hospital’ | ‘Good’ | 1 | 136 | 192 |
| 35 | 85 | ‘Female’ | 66 | ‘Gonzalez’ | ‘St. Mary’s Medical Center’ | ‘Fair’ | 0 | 117 | 118 |
| 33 | 90 | ‘Male’ | 66 | ‘Nelson’ | ‘St. Mary’s Medical Center’ | ‘Good’ | 1 | 124 | 180 |
| 38 | 74 | ‘Female’ | 63 | ‘Carter’ | ‘St. Mary’s Medical Center’ | ‘Good’ | 0 | 120 | 128 |
| 39 | 92 | ‘Male’ | 71 | ‘Mitchell’ | ‘County General Hospital’ | ‘Fair’ | 1 | 128 | 164 |
| 44 | 80 | ‘Male’ | 69 | ‘Perez’ | ‘VA Hospital’ | ‘Excellent’ | 0 | 116 | 183 |
| 44 | 89 | ‘Male’ | 70 | ‘Roberts’ | ‘VA Hospital’ | ‘Good’ | 1 | 132 | 169 |
| 37 | 96 | ‘Male’ | 70 | ‘Turner’ | ‘VA Hospital’ | ‘Excellent’ | 1 | 137 | 194 |
| 45 | 89 | ‘Male’ | 67 | ‘Phillips’ | ‘VA Hospital’ | ‘Good’ | 0 | 117 | 172 |
| 37 | 77 | ‘Female’ | 65 | ‘Campbell’ | ‘County General Hospital’ | ‘Fair’ | 0 | 116 | 135 |
| 30 | 81 | ‘Male’ | 68 | ‘Parker’ | ‘VA Hospital’ | ‘Poor’ | 0 | 119 | 182 |
| 39 | 76 | ‘Female’ | 62 | ‘Evans’ | ‘County General Hospital’ | ‘Good’ | 0 | 123 | 121 |
| 42 | 83 | ‘Male’ | 70 | ‘Edwards’ | ‘County General Hospital’ | ‘Excellent’ | 0 | 116 | 158 |
| 42 | 78 | ‘Male’ | 67 | ‘Collins’ | ‘County General Hospital’ | ‘Good’ | 1 | 124 | 179 |
| 49 | 95 | ‘Male’ | 68 | ‘Stewart’ | ‘County General Hospital’ | ‘Poor’ | 1 | 129 | 170 |
| 44 | 91 | ‘Female’ | 62 | ‘Sanchez’ | ‘St. Mary’s Medical Center’ | ‘Good’ | 1 | 130 | 136 |
| 43 | 91 | ‘Female’ | 64 | ‘Morris’ | ‘County General Hospital’ | ‘Poor’ | 1 | 132 | 135 |
| 47 | 86 | ‘Female’ | 66 | ‘Rogers’ | ‘VA Hospital’ | ‘Excellent’ | 0 | 117 | 147 |
| 50 | 89 | ‘Male’ | 72 | ‘Reed’ | ‘VA Hospital’ | ‘Excellent’ | 1 | 129 | 186 |
| 38 | 79 | ‘Female’ | 63 | ‘Cook’ | ‘VA Hospital’ | ‘Excellent’ | 0 | 118 | 124 |
| 41 | 74 | ‘Female’ | 66 | ‘Morgan’ | ‘St. Mary’s Medical Center’ | ‘Good’ | 0 | 120 | 134 |
| 45 | 82 | ‘Male’ | 70 | ‘Bell’ | ‘St. Mary’s Medical Center’ | ‘Good’ | 1 | 138 | 170 |
| 36 | 76 | ‘Male’ | 71 | ‘Murphy’ | ‘VA Hospital’ | ‘Good’ | 0 | 117 | 180 |
| 38 | 81 | ‘Female’ | 68 | ‘Bailey’ | ‘St. Mary’s Medical Center’ | ‘Good’ | 0 | 113 | 130 |
| 29 | 77 | ‘Female’ | 63 | ‘Rivera’ | ‘County General Hospital’ | ‘Excellent’ | 0 | 122 | 130 |
| 28 | 73 | ‘Female’ | 65 | ‘Cooper’ | ‘VA Hospital’ | ‘Good’ | 0 | 115 | 127 |
| 30 | 85 | ‘Female’ | 67 | ‘Richardson’ | ‘County General Hospital’ | ‘Excellent’ | 0 | 120 | 141 |
| 28 | 76 | ‘Female’ | 66 | ‘Cox’ | ‘County General Hospital’ | ‘Good’ | 0 | 117 | 111 |
| 29 | 80 | ‘Female’ | 68 | ‘Howard’ | ‘VA Hospital’ | ‘Excellent’ | 0 | 123 | 134 |
| 36 | 80 | ‘Male’ | 71 | ‘Ward’ | ‘St. Mary’s Medical Center’ | ‘Good’ | 0 | 123 | 189 |
| 45 | 79 | ‘Female’ | 70 | ‘Torres’ | ‘County General Hospital’ | ‘Excellent’ | 0 | 119 | 137 |
| 32 | 82 | ‘Female’ | 60 | ‘Peterson’ | ‘County General Hospital’ | ‘Excellent’ | 0 | 110 | 136 |
| 31 | 79 | ‘Female’ | 64 | ‘Gray’ | ‘VA Hospital’ | ‘Excellent’ | 0 | 121 | 130 |
| 48 | 82 | ‘Female’ | 64 | ‘Ramirez’ | ‘County General Hospital’ | ‘Excellent’ | 1 | 138 | 137 |
| 25 | 75 | ‘Male’ | 66 | ‘James’ | ‘County General Hospital’ | ‘Good’ | 0 | 125 | 186 |
| 40 | 91 | ‘Female’ | 64 | ‘Watson’ | ‘VA Hospital’ | ‘Fair’ | 1 | 122 | 127 |
| 39 | 74 | ‘Male’ | 72 | ‘Brooks’ | ‘St. Mary’s Medical Center’ | ‘Excellent’ | 0 | 120 | 176 |
| 41 | 78 | ‘Female’ | 65 | ‘Kelly’ | ‘St. Mary’s Medical Center’ | ‘Poor’ | 0 | 117 | 127 |
| 33 | 85 | ‘Female’ | 67 | ‘Sanders’ | ‘St. Mary’s Medical Center’ | ‘Excellent’ | 1 | 125 | 115 |
| 31 | 84 | ‘Male’ | 72 | ‘Price’ | ‘VA Hospital’ | ‘Fair’ | 1 | 124 | 178 |
| 35 | 75 | ‘Female’ | 64 | ‘Bennett’ | ‘County General Hospital’ | ‘Fair’ | 0 | 121 | 131 |
| 32 | 78 | ‘Male’ | 68 | ‘Wood’ | ‘St. Mary’s Medical Center’ | ‘Poor’ | 0 | 118 | 183 |
| 42 | 81 | ‘Male’ | 66 | ‘Barnes’ | ‘County General Hospital’ | ‘Excellent’ | 0 | 120 | 194 |
| 48 | 79 | ‘Female’ | 64 | ‘Ross’ | ‘VA Hospital’ | ‘Good’ | 0 | 118 | 126 |
| 34 | 85 | ‘Male’ | 68 | ‘Henderson’ | ‘St. Mary’s Medical Center’ | ‘Good’ | 0 | 118 | 186 |
| 39 | 79 | ‘Male’ | 69 | ‘Coleman’ | ‘VA Hospital’ | ‘Excellent’ | 0 | 122 | 188 |
| 28 | 82 | ‘Male’ | 69 | ‘Jenkins’ | ‘County General Hospital’ | ‘Good’ | 1 | 134 | 189 |
| 29 | 80 | ‘Female’ | 64 | ‘Perry’ | ‘St. Mary’s Medical Center’ | ‘Good’ | 0 | 131 | 120 |
| 32 | 80 | ‘Female’ | 63 | ‘Powell’ | ‘VA Hospital’ | ‘Excellent’ | 0 | 113 | 132 |
| 39 | 92 | ‘Male’ | 68 | ‘Long’ | ‘County General Hospital’ | ‘Good’ | 1 | 125 | 182 |
| 37 | 92 | ‘Female’ | 65 | ‘Patterson’ | ‘County General Hospital’ | ‘Poor’ | 1 | 135 | 120 |
| 49 | 96 | ‘Female’ | 63 | ‘Hughes’ | ‘County General Hospital’ | ‘Good’ | 1 | 128 | 123 |
| 31 | 87 | ‘Female’ | 66 | ‘Flores’ | ‘VA Hospital’ | ‘Good’ | 1 | 123 | 141 |
| 37 | 81 | ‘Female’ | 65 | ‘Washington’ | ‘St. Mary’s Medical Center’ | ‘Good’ | 0 | 122 | 129 |
| 38 | 90 | ‘Male’ | 68 | ‘Butler’ | ‘County General Hospital’ | ‘Excellent’ | 1 | 138 | 184 |
| 45 | 77 | ‘Male’ | 71 | ‘Simmons’ | ‘VA Hospital’ | ‘Excellent’ | 0 | 124 | 181 |
| 30 | 91 | ‘Female’ | 70 | ‘Foster’ | ‘St. Mary’s Medical Center’ | ‘Fair’ | 0 | 130 | 124 |
| 48 | 79 | ‘Male’ | 71 | ‘Gonzales’ | ‘County General Hospital’ | ‘Good’ | 0 | 123 | 174 |
| 48 | 73 | ‘Female’ | 66 | ‘Bryant’ | ‘County General Hospital’ | ‘Excellent’ | 0 | 129 | 134 |
| 25 | 99 | ‘Male’ | 69 | ‘Alexander’ | ‘County General Hospital’ | ‘Good’ | 1 | 128 | 171 |
| 44 | 92 | ‘Male’ | 69 | ‘Russell’ | ‘VA Hospital’ | ‘Good’ | 1 | 124 | 188 |
| 49 | 74 | ‘Male’ | 70 | ‘Griffin’ | ‘County General Hospital’ | ‘Fair’ | 0 | 119 | 186 |
| 45 | 93 | ‘Male’ | 68 | ‘Diaz’ | ‘County General Hospital’ | ‘Good’ | 1 | 136 | 172 |
| 48 | 86 | ‘Male’ | 66 | ‘Hayes’ | ‘County General Hospital’ | ‘Fair’ | 0 | 114 | 177 |
I’ve recorded a 45 minute video on how to bring machine learning to the next level in an applied Wine Quality Prediction Project.
If you're not ready for that and want a tutorial on the basics of Machine Learning, my 1.5 hour Overview of Machine Learning might be better. It will guide you through many of the general concepts, as well as some of the various models.