## Announcements

  1. Today is the last class.
  2. You have lab this week (a review for the final exam).
  3. I will be out of town on Thursday, so there will be no office hours, but you can reach me by email if you need me.
  4. I will send out information about presentations very soon.

## Today

  1. Supervised versus unsupervised learning
  2. Linear and recursive partition models
  3. Choosing the optimal number of explanatory variables for linear regression.

## 1. Supervised versus unsupervised learning

Two important purposes of learning from data:

### Supervised learning (aka statistical modeling)

  • We have a set of cases, \(i = 1, 2, 3, \ldots, n\), called the training data.
  • For each case, we have an input \({\mathbf X_i}\) consisting potentially of several variables measured on that case.
    • The subscript \(i\) in \({\mathbf X_i}\) means that we have one \({\mathbf X}\) for each case. The boldface \({\bf X}\) means that the input can be multi-dimensional, that is, consisting of multiple variables.
  • For each case, we have an output \(Y_i\).
  • We want to learn the overall pattern of the relationship between the inputs \({\mathbf X}\) and the outputs \(Y\), not just for our \(n\) training cases, but for potential cases that we have not encountered yet. These as yet unencountered cases are thought of as the testing data. We are going to represent the pattern with a function \(\hat{f}: \mathbf{X} \rightarrow Y\). I will sometimes use the word model instead of function.

For instance in the NCHS data table there is already a variable indicating whether or not a person has diabetes.

NCHS %>% select(diabetic, weight, sex, age, ethnicity, bmi) %>% head(3)
| diabetic | weight | sex | age | ethnicity | bmi |
|----------|--------|-----|-----|-----------|-----|
| 0 | 12.5 | female | 2 | Non-Hispanic Black | 14.89769 |
| 0 | 75.4 | male | 77 | Non-Hispanic White | 24.90421 |
| 0 | 32.9 | female | 10 | Non-Hispanic White | 17.63171 |

Building a model to explore or describe how other variables (weight? age? smoking? bmi?) are related to diabetes is supervised learning.

library(rpart)
library(statisticalModeling)
# Recursive partition model of diabetes status as a function of BMI
diabetes_model <- rpart(diabetic ~ bmi, data = NCHS)
fmodel(diabetes_model)

We see that at a BMI of around 25 the estimated chance of having diabetes jumps from about 2% to about 10%. A BMI of 25-30 is considered overweight.
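To check these numbers, we can evaluate the fitted model on either side of that threshold with evaluate_model() (introduced further below). A minimal sketch, where the BMI values 24 and 30 are just illustrative choices:

```r
# Sketch: for a 0/1 response, the rpart model's output is the proportion of
# diabetics in the corresponding leaf, i.e. an estimated probability of diabetes.
evaluate_model(diabetes_model, at = list(bmi = 24))   # 24 and 30 chosen for illustration
evaluate_model(diabetes_model, at = list(bmi = 30))
```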

### Unsupervised learning

  • In supervised learning, we have an output (response variable) \(Y\) which we want to generate from the input \(\mathbf{X}\). We train a function \(\hat{f}: \mathbf{X} \rightarrow Y\).
  • In unsupervised learning, there is no identified response variable. Instead of modeling the response as a function of \(\mathbf{X}\), we look for patterns within \(\mathbf{X}\).

For example, we can try to identify the six continents (six land clusters, excluding Antarctica) knowing only the latitude and longitude of the 4000 largest cities in the world. We use the k-means algorithm, in which the \(n\) observations are partitioned into \(k\) clusters, with each observation belonging to the cluster with the nearest mean.

BigCities <- WorldCities %>%
  arrange(desc(population)) %>%
  head(4000) %>%
  select(longitude,latitude)
head(BigCities,3)
| longitude | latitude |
|-----------|----------|
| 121.45806 | 31.22222 |
| -58.37723 | -34.61315 |
| 72.88261 | 19.07283 |
clusters <- BigCities %>%
  kmeans(centers = 6)
clusters[["centers"]]
| longitude | latitude |
|-----------|----------|
| 112.55362 | 27.241004 |
| 18.50008 | 32.803104 |
| 115.64993 | -3.703634 |
| -78.43341 | 11.035750 |
| 73.47268 | 27.557006 |
| 133.09077 | 38.136721 |
clusters[["size"]]
## [1]  430 1318  237  909  729  377
clusters[["cluster"]] %>% head(20)
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 
##  1  4  5  4  5  2  5  1  2  5  6  4  2  3  6  1  4  1  2  4
clusters <- clusters %>% 
  fitted("classes") %>%
  as.character()
head(clusters)
## [1] "1" "4" "5" "4" "5" "2"
BigCities <-  BigCities %>% mutate(cluster=clusters)
head(BigCities)
| longitude | latitude | cluster |
|-----------|----------|---------|
| 121.45806 | 31.22222 | 1 |
| -58.37723 | -34.61315 | 4 |
| 72.88261 | 19.07283 | 5 |
| -99.12766 | 19.42847 | 4 |
| 67.08220 | 24.90560 | 5 |
| 28.94966 | 41.01384 | 2 |
BigCities %>% ggplot(aes(x=longitude, y=latitude)) + geom_point(aes(color=cluster, shape=cluster))

Compare this plot to a map of the world’s continents.
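One caveat (not in the original notes): kmeans() starts from randomly chosen centers, so the cluster numbering, and sometimes the clusters themselves, can change from run to run. A minimal sketch for making the result reproducible is to fix the random seed first:

```r
set.seed(101)                       # any fixed seed; 101 is an arbitrary choice
repro_clusters <- BigCities %>%
  select(longitude, latitude) %>%   # drop the cluster column added above
  kmeans(centers = 6)
repro_clusters[["size"]]
```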

## 2. Linear and recursive partition models

We will focus on supervised learning, where the data has an outcome variable.

We will need to install and load a number of packages:

library(statisticalModeling)
library(rpart)
library(rpart.plot)
library(stats)
library(mosaicData)

### Continuous outcome variable (prediction)

The most basic form of statistical modeling is when you have a continuous (quantitative) outcome variable. Two popular forms of statistical modeling in this case are linear regression and the recursive partition model.

### Linear model

This is a parametric model. We assume that the response variable is related to the explanatory variables by a linear relationship.

Given a subset of our observations, we learn how to best approximate our outcomes as a linear combination of the explanatory variables.

For each case, we have an input \({\mathbf X_i}\) consisting potentially of several variables measured on that case.

For each case, we have an output \(Y_i\).

For example, consider the data table Runners from the ten-mile Cherry Blossom race.

Runners %>% head(3)
| age | net | gun | sex | year | previous | nruns | start_position |
|-----|-----|-----|-----|------|----------|-------|----------------|
| 31 | 85.35 | 86.88 | M | 2005 | 0 | 4 | calm |
| 32 | 91.75 | 92.53 | M | 2006 | 1 | 4 | eager |
| 33 | NA | 92.40 | M | 2007 | 2 | 4 | NA |

Suppose we wish to model a runner’s net time \(Y_i\) as a function of age and sex. Here \(Y_i\) is the net time and \(\mathbf X_i\) is the vector (age, sex) for the \(i\)th runner.

The formula \({Y_i}\) ~ \({\mathbf X_i}\) is used to describe the form of relationships among variables.

It is important to check whether a linear model is appropriate. At a minimum, you should make a scatter plot of the response against each continuous explanatory variable, for each value of the categorical variables.

Runners %>% ggplot(aes(x=age,y=net)) + geom_point() + geom_smooth(method="lm") +facet_grid(sex~.)

Here we see that a linear model is appropriate.

The syntax of the linear model function lm() is: lm(formula, data).

For example:

lm(net~age + sex, Runners)
## 
## Call:
## lm(formula = net ~ age + sex, data = Runners)
## 
## Coefficients:
## (Intercept)          age         sexM  
##     82.7591       0.3117     -12.1512

We see that for a given age and sex, the expected net running time is \[\widehat{net}= 82.7591 + 0.3117\,(age) - 12.1512\,(sexM),\] where \(sexM\) is 1 for male runners and 0 for female runners, and \(\widehat{net}\) is the expected (fitted) net running time.

We can see from the formula that, on average, women run this 10-mile race about 12 minutes slower than men of the same age, and that runners slow with age at a rate of about 0.3 minutes per year.

To see whether the slopes and y-intercept are statistically significant we use the function summary().

lm(net~age + sex, Runners) %>% summary()
## 
## Call:
## lm(formula = net ~ age + sex, data = Runners)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -90.925  -9.318  -0.823   8.063  85.045 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  82.759116   0.395270  209.37   <2e-16 ***
## age           0.311674   0.009556   32.62   <2e-16 ***
## sexM        -12.151242   0.210166  -57.82   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.49 on 18156 degrees of freedom
##   (6175 observations deleted due to missingness)
## Multiple R-squared:  0.1701, Adjusted R-squared:   0.17 
## F-statistic:  1861 on 2 and 18156 DF,  p-value: < 2.2e-16

The three stars to the right of the slopes and intercept indicate that they are significantly different from zero (i.e. statistically significant). We conclude that age and sex do affect running times in the 10-mile race.

One advantage of a linear model is that it is simple and interpretable. A disadvantage is that it is restrictive (i.e. a straight line may not be a good fit to your data, in which case you shouldn’t use it).

Notice that, according to the model, a zero-year-old woman can run a 10-miler in 83 minutes on average. This shows why you should not extrapolate beyond the range of reasonable explanatory values.

Using matrices, we can write this linear relationship between the expected outcome and the explanatory variables as \(\widehat{Y_i}=A{\mathbf X_i}\), where \(A\) is the row vector of coefficients and \(\mathbf X_i\) includes a constant 1 for the intercept.
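A minimal sketch of what this looks like in R: model.matrix() extracts the matrix whose \(i\)th row is \(\mathbf X_i\) (a leading 1 for the intercept, age, and the 0/1 indicator sexM), and multiplying it by the coefficient vector reproduces the fitted values.

```r
mod <- lm(net ~ age + sex, data = Runners)
X <- model.matrix(mod)          # one row (1, age, sexM) per case that lm() kept
head(X, 3)
head(X %*% coef(mod), 3)        # same values as head(fitted(mod), 3)
```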

We can visualize the output of the model using the function fmodel().

lm(net~age + sex, Runners) %>% fmodel()

Use the function evaluate_model() to see the model output at specified values of the inputs. The syntax is evaluate_model(model, at), where at is a named list giving the values at which to hold the explanatory variables.

For example:

lm(net~age + sex, Runners) %>%
  evaluate_model(at=list(age=20, sex="M"))

| age | sex | model_output |
|-----|-----|--------------|
| 20 | M | 76.84136 |

This says that, on average, 20-year-old males ran the 10 miles in 76.84 minutes, which matches the fitted equation: \(82.7591 + 0.3117 \times 20 - 12.1512 \approx 76.84\).

We can explore the effect size of age or sex on net running time using the function effect_size(). The syntax is:

effect_size(model, formula, to, step, ...), where the right side of formula is the variable with respect to which the effect size is calculated (for example, ~age to get the effect size of age on the model’s response variable), step is the size of the change to make in that explanatory variable, and ... gives values of the variables you wish to hold fixed.

For example:

lm(net~age + sex, Runners) %>%
  effect_size(~age, step=1, sex="M")

| slope | age | to:age | sex |
|-------|-----|--------|-----|
| 0.3116741 | 40 | 41 | M |

This says that men get about 0.31 minutes slower with each passing year.
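We can ask the same question about sex. Since sex is categorical, effect_size() reports the change in model output when sex switches levels rather than a slope. A hedged sketch, holding age at 40 (an arbitrary choice):

```r
lm(net ~ age + sex, Runners) %>%
  effect_size(~sex, age = 40)   # age = 40 is an arbitrary value to hold fixed
```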

### Recursive partition model (regression trees)

This is a nonparametric model, making no assumptions about the relationship between your variables (for example, we don’t assume the response variable is a linear function of the explanatory variables as in linear regression).

The main characteristic of the model is that the data are recursively partitioned so that observations with similar response values are grouped together.

After the partition is completed, a constant value of the response variable is predicted within each group (these are the leaves of the tree).

For example, consider the net running times in the 10-mile race as a function of age and sex:

rpart(net~ age + sex,  data=Runners) %>% rpart.plot::prp(type=3)

We get significantly different net times by partitioning the data first by sex. The subdivided data can then be further partitioned at the ages, for men and women, at which the net time changes significantly.
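Because each leaf predicts a single constant, the tree’s fitted values take only a handful of distinct values, one per leaf. A minimal sketch to see this (not from the original notes):

```r
tree_model <- rpart(net ~ age + sex, data = Runners)
sort(unique(predict(tree_model)))   # one predicted net time per leaf
```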

You can also visualize the model output as a graph:

rpart(net~ age + sex,  data=Runners) %>% fmodel()

You can see from the tree or the graph that at around age 50 there is a drop in performance (an increase in net time) for both men and women.

There is a complexity parameter, cp, that adjusts the threshold for how much a split must improve the fit before it is made. The default value is cp = 0.01. Smaller cp values correspond to more complex trees.

I suggest you just use the default value unless you have a good reason not to. Smaller cp values can lead to overfitting your model, which means that it works great for your training data but not for new data.

rpart(net~ age + sex,  data=Runners, cp=.002) %>% rpart.plot::prp(type=3)

Advantages of the recursive partition model are that it is easy to interpret and that it makes no assumptions about the form of the relationship between your variables.

### Categorical outcome variable (classification)

  • Logistic regression is the linear regression analog for a categorical response variable. You can read about it here:

https://www.r-bloggers.com/how-to-perform-a-logistic-regression-in-r/

  • The recursive partition model can be used for either a categorical or a continuous response variable. Here is an example:

head(Runners)
| age | net | gun | sex | year | previous | nruns | start_position |
|-----|-----|-----|-----|------|----------|-------|----------------|
| 31 | 85.35 | 86.88 | M | 2005 | 0 | 4 | calm |
| 32 | 91.75 | 92.53 | M | 2006 | 1 | 4 | eager |
| 33 | NA | 92.40 | M | 2007 | 2 | 4 | NA |
| 34 | NA | 89.28 | M | 2008 | 3 | 4 | NA |
| 25 | 67.33 | 67.48 | M | 2001 | 0 | 3 | eager |
| 26 | 70.45 | 70.70 | M | 2002 | 1 | 3 | eager |

Let’s explore the relationship between the explanatory variables age and net and the response variable start_position.

rpart(start_position~net + age,data=Runners ) %>% rpart.plot::prp(type=3)

It doesn’t appear that age is a factor.

The partitioning works in the same way, except that now each split increases the purity of the groups. For example, the first split in the example above removes most of the eager runners.
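To see what the classification tree predicts, a minimal sketch (refitting the same model as above; not part of the original notes):

```r
class_tree <- rpart(start_position ~ net + age, data = Runners)
head(predict(class_tree), 3)                  # class probabilities within each leaf
head(predict(class_tree, type = "class"), 3)  # the predicted start_position label
```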

Sometimes it makes sense just to roll the dice and see what comes up. In the context of modeling, this means throwing a big set of potential explanatory variables into a model and seeing if the process of model training finds something of interest. (Only the rpart() architecture provides an opportunity to automatically choose a subset of the explanatory variables. lm() will put every variable you give it into the model.)

rpart(start_position~.,data=Runners ) %>% rpart.plot::prp(type=3)

It appears that only net and gun are important variables. gun is the time (in min.) from when the starter’s gun was fired to when the runner finished the race.

Being able to explore which variables are important in this way makes recursive partition modeling very popular.

## 3. Choosing the optimal number of explanatory variables for linear regression

In linear regression we measure how well our regression plane fits our data using the coefficient of determination \[R^2=1-SSE/SST,\] where SSE is the sum of squared errors and SST is the total sum of squares.
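As a check, we can compute \(R^2\) for the model above directly from these definitions (a minimal sketch; mod$model is the data frame of cases that lm() actually used, after dropping missing values):

```r
mod <- lm(net ~ age + sex, data = Runners)
SSE <- sum(resid(mod)^2)                             # sum of squared errors
SST <- sum((mod$model$net - mean(mod$model$net))^2)  # total sum of squares
1 - SSE/SST       # matches the Multiple R-squared reported by summary(mod)
```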

Here is a picture for simple linear regression.

Here is a picture for when you have two explanatory variables.

The regression surface is the linear combination of your explanatory variables that best fits your data (i.e. minimizes the SSE). Adding additional explanatory variables can never make the SSE worse, only better, because their coefficients could simply be set to zero. In practice, however, the fitted coefficient of a useless variable won’t be exactly zero, which can make it seem like your added variable is significant.

The adjusted coefficient of determination, adjusted \(R^2\), imposes a penalty for adding additional explanatory variables and is a better gauge of whether adding an explanatory variable is helping the fit. Still, it isn’t perfect.
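For reference, the usual adjustment is \[R^2_{\mathrm{adj}} = 1 - (1 - R^2)\,\frac{n-1}{n-p-1},\] where \(n\) is the number of cases used and \(p\) is the number of explanatory variables. The factor \(\frac{n-1}{n-p-1}\) grows with \(p\), so each added variable must reduce the SSE by enough to overcome the penalty.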

Here is the summary from before:

lm(net~age + sex, Runners) %>% summary()
## 
## Call:
## lm(formula = net ~ age + sex, data = Runners)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -90.925  -9.318  -0.823   8.063  85.045 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  82.759116   0.395270  209.37   <2e-16 ***
## age           0.311674   0.009556   32.62   <2e-16 ***
## sexM        -12.151242   0.210166  -57.82   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.49 on 18156 degrees of freedom
##   (6175 observations deleted due to missingness)
## Multiple R-squared:  0.1701, Adjusted R-squared:   0.17 
## F-statistic:  1861 on 2 and 18156 DF,  p-value: < 2.2e-16

Now I add a new variable to Runners which is age plus some random noise.

newRunners <- Runners %>% mutate(new = age + rnorm(nrow(Runners)))
head(newRunners)

| age | net | gun | sex | year | previous | nruns | start_position | new |
|-----|-----|-----|-----|------|----------|-------|----------------|-----|
| 31 | 85.35 | 86.88 | M | 2005 | 0 | 4 | calm | 30.99919 |
| 32 | 91.75 | 92.53 | M | 2006 | 1 | 4 | eager | 30.67126 |
| 33 | NA | 92.40 | M | 2007 | 2 | 4 | NA | 33.20056 |
| 34 | NA | 89.28 | M | 2008 | 3 | 4 | NA | 33.21155 |
| 25 | 67.33 | 67.48 | M | 2001 | 0 | 3 | eager | 26.40780 |
| 26 | 70.45 | 70.70 | M | 2002 | 1 | 3 | eager | 25.76742 |
lm(net~age + sex + new, newRunners) %>% summary()
## 
## Call:
## lm(formula = net ~ age + sex + new, data = newRunners)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -90.985  -9.293  -0.823   8.086  84.787 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  82.76241    0.39526 209.387  < 2e-16 ***
## age           0.46482    0.10018   4.640 3.51e-06 ***
## sexM        -12.15011    0.21016 -57.814  < 2e-16 ***
## new          -0.15322    0.09978  -1.536    0.125    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.48 on 18155 degrees of freedom
##   (6175 observations deleted due to missingness)
## Multiple R-squared:  0.1702, Adjusted R-squared:  0.1701 
## F-statistic:  1241 on 3 and 18155 DF,  p-value: < 2.2e-16

Notice that the multiple \(R^2\) barely changed and the coefficient of new is not statistically significant (p = 0.125). To test how good your model really is, you should do cross-validation, where you test how well the model fits new data it was not trained on.

Most basically, you want to do something like this:

Generate a random TRUE or FALSE for each case in newRunners:

newRunnersTF <- rnorm(nrow(newRunners)) > 0   # TRUE for roughly half the cases, at random
head(newRunnersTF)
## [1] FALSE  TRUE FALSE FALSE FALSE  TRUE

Build the model net ~ age + sex + new with the training cases:

train_model <- lm(net ~ age + sex + new, data = subset(newRunners, newRunnersTF))
summary(train_model)
## 
## Call:
## lm(formula = net ~ age + sex + new, data = subset(newRunners, 
##     newRunnersTF))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -90.712  -9.270  -0.754   8.042  76.611 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  82.90738    0.54941 150.902   <2e-16 ***
## age           0.28291    0.14089   2.008   0.0447 *  
## sexM        -12.04029    0.29395 -40.960   <2e-16 ***
## new           0.02214    0.14042   0.158   0.8747    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.44 on 9238 degrees of freedom
##   (3082 observations deleted due to missingness)
## Multiple R-squared:  0.1675, Adjusted R-squared:  0.1672 
## F-statistic: 619.6 on 3 and 9238 DF,  p-value: < 2.2e-16

Evaluate the model on the training cases and on the testing cases:

Preds_train <- evaluate_model(train_model, data = subset(newRunners, newRunnersTF))
head(Preds_train,3)

|    | age | net | gun | sex | year | previous | nruns | start_position | new | model_output |
|----|-----|-----|-----|-----|------|----------|-------|----------------|-----|--------------|
| 2  | 32 | 91.75 | 92.53 | M | 2006 | 1 | 4 | eager | 30.67126 | 80.59919 |
| 6  | 26 | 70.45 | 70.70 | M | 2002 | 1 | 3 | eager | 25.76742 | 78.79317 |
| 10 | 29 | 69.32 | 69.68 | M | 2004 | 2 | 4 | eager | 28.43851 | 79.70103 |

Preds_test <- evaluate_model(train_model, data = subset(newRunners, !newRunnersTF))
head(Preds_test,3)

|    | age | net | gun | sex | year | previous | nruns | start_position | new | model_output |
|----|-----|-----|-----|-----|------|----------|-------|----------------|-----|--------------|
| 1  | 31 | 85.35 | 86.88 | M | 2005 | 0 | 4 | calm | 30.99919 | 80.32354 |
| 3  | 33 | NA | 92.40 | M | 2007 | 2 | 4 | NA | 33.20056 | 80.93809 |
| 4  | 34 | NA | 89.28 | M | 2008 | 3 | 4 | NA | 33.21155 | 81.22124 |

Compare the mean squared error (MSE = SSE/n) on the training and testing data:

with(data = Preds_train, mean((model_output - net)^2,na.rm=TRUE))
## [1] 180.5936
with(data = Preds_test, mean((model_output - net)^2,na.rm=TRUE))
## [1] 183.1258

You can see that the MSE went up on the testing data compared to the training data. This indicates that the added variable new does not have any additional predictive value.
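A more direct check (a sketch, not part of the original notes) is to fit the model both with and without new on the training cases and compare their mean squared errors on the same test cases; if new carried real signal, the larger model’s test MSE should be clearly lower:

```r
base_model <- lm(net ~ age + sex,       data = subset(newRunners, newRunnersTF))
full_model <- lm(net ~ age + sex + new, data = subset(newRunners, newRunnersTF))
test_cases <- subset(newRunners, !newRunnersTF)
mean((predict(base_model, newdata = test_cases) - test_cases$net)^2, na.rm = TRUE)
mean((predict(full_model, newdata = test_cases) - test_cases$net)^2, na.rm = TRUE)
```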

This is the idea behind cross-validation, discussed in DataCamp’s Statistical Modeling course that you did for homework.