## Announcements
Two important purposes of learning from data:

- Make predictions. Given new inputs, e.g. data about an event, predict what the result of the event will be: weather forecasting, credit card fraud, success in college, …. In statistics this is sometimes thought of as “analyzing data from an observational study.” (supervised learning)
- Find structure in a mass of data. (unsupervised learning)
For instance, in the `NCHS` data table there is already a variable indicating whether or not a person has diabetes.
NCHS %>% select(diabetic, weight, sex, age, ethnicity, bmi) %>% head(3)
diabetic | weight | sex | age | ethnicity | bmi |
---|---|---|---|---|---|
0 | 12.5 | female | 2 | Non-Hispanic Black | 14.89769 |
0 | 75.4 | male | 77 | Non-Hispanic White | 24.90421 |
0 | 32.9 | female | 10 | Non-Hispanic White | 17.63171 |
Building a model to explore or describe how other variables (weight? age? smoking? bmi?) are related to diabetes is supervised learning.
library(rpart)
library(statisticalModeling)
diabetes_model <- rpart(diabetic ~ bmi, data = NCHS)   # recursive partitioning model of diabetes vs. BMI
fmodel(diabetes_model)
We see that at around a BMI of 25 there is a jump in the chance of having diabetes from about 2% to 10%. A BMI of 25-30 is considered overweight.
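To read the fitted chances directly off the tree rather than from the plot, here is a minimal sketch; it assumes `diabetic` is coded 0/1, and the exact values depend on the NCHS data.

```r
# Sketch: predicted chance of diabetes at a few BMI values around the jump
# (assumes diabetic is coded 0/1; exact values depend on the NCHS data)
diabetes_model <- rpart(diabetic ~ bmi, data = NCHS)
predict(diabetes_model, newdata = data.frame(bmi = c(20, 25, 30, 35)))
```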
For example, we can try to identify the six continents (six land clusters, excluding Antarctica) knowing only the latitude and longitude of the 4000 largest cities in the world. We use the k-means algorithm, in which n observations are partitioned into k clusters, with each observation belonging to the cluster with the closest mean.
BigCities <- WorldCities %>%
arrange(desc(population)) %>%
head(4000) %>%
select(longitude,latitude)
head(BigCities,3)
longitude | latitude |
---|---|
121.45806 | 31.22222 |
-58.37723 | -34.61315 |
72.88261 | 19.07283 |
clusters <- BigCities %>%
kmeans(centers=6)
clusters[["centers"]]
longitude | latitude |
---|---|
112.55362 | 27.241004 |
18.50008 | 32.803104 |
115.64993 | -3.703634 |
-78.43341 | 11.035750 |
73.47268 | 27.557006 |
133.09077 | 38.136721 |
clusters[["size"]]
## [1] 430 1318 237 909 729 377
clusters[["cluster"]] %>% head(20)
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## 1 4 5 4 5 2 5 1 2 5 6 4 2 3 6 1 4 1 2 4
clusters <- clusters %>%
fitted("classes") %>%
as.character()
head(clusters)
## [1] "1" "4" "5" "4" "5" "2"
BigCities <- BigCities %>% mutate(cluster=clusters)
head(BigCities)
longitude | latitude | cluster |
---|---|---|
121.45806 | 31.22222 | 1 |
-58.37723 | -34.61315 | 4 |
72.88261 | 19.07283 | 5 |
-99.12766 | 19.42847 | 4 |
67.08220 | 24.90560 | 5 |
28.94966 | 41.01384 | 2 |
BigCities %>% ggplot(aes(x=longitude, y=latitude)) + geom_point(aes(color=cluster, shape=cluster))
Compare to a map of the world: the six clusters should roughly pick out the continents.
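For reference, here is a minimal sketch that draws a world-map outline with ggplot2; it assumes the `maps` package, which `map_data()` relies on, is installed.

```r
# Sketch: a world-map outline to compare against the cluster plot above
library(ggplot2)
World <- map_data("world")   # requires the 'maps' package
ggplot(World, aes(x = long, y = lat, group = group)) +
  geom_polygon(fill = "grey90", color = "grey60")
```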
We will focus on supervised learning, where the data has an outcome variable.
We will need to install and load a number of packages:
library(statisticalModeling)
library(rpart)
library(rpart.plot)
library(stats)
library(mosaicData)
The most basic form of statistical modeling is when you have a continuous (quantitative) outcome variable. Two popular forms of statistical modeling in this case are linear regression and the recursive partitioning model.
Linear regression is a parametric model: we assume that the response variable is related to the explanatory variables by a linear relationship.
Given a subset of our observations, we learn how to best approximate our outcomes as a linear combination of the explanatory variables.
For each case, we have an input \({\mathbf X_i}\) consisting potentially of several variables measured on that case.
For each case, we have an output \(Y_i\).
For example, consider the data table `Runners` from the ten-mile Cherry Blossom race.
Runners %>% head(3)
age | net | gun | sex | year | previous | nruns | start_position |
---|---|---|---|---|---|---|---|
31 | 85.35 | 86.88 | M | 2005 | 0 | 4 | calm |
32 | 91.75 | 92.53 | M | 2006 | 1 | 4 | eager |
33 | NA | 92.40 | M | 2007 | 2 | 4 | NA |
Suppose we wish to model a runner’s `net` time, \(Y_i\), as a function of `age` and `sex`. Here \(Y_i\) is the `net` time and \({\mathbf X_i}\) is the vector (`age`, `sex`) for the \(i\)th runner.
The formula \({Y_i}\) ~ \({\mathbf X_i}\) is used to describe the form of relationships among variables.
It is important to check whether a linear model is appropriate. At a minimum, you should make a scatter plot of the response against each continuous explanatory variable, for each value of the categorical variables.
Runners %>% ggplot(aes(x=age,y=net)) + geom_point() + geom_smooth(method="lm") +facet_grid(sex~.)
Here we see that a linear model is appropriate.
The syntax of the linear model function `lm()` is `lm(formula, data)`.
For example:
lm(net~age + sex, Runners)
##
## Call:
## lm(formula = net ~ age + sex, data = Runners)
##
## Coefficients:
## (Intercept) age sexM
## 82.7591 0.3117 -12.1512
We see that for a given age and sex, the expected (fitted) net running time is \[\widehat{net}= 82.7591 + 0.3117\,(\mathrm{age}) - 12.1512\,(\mathrm{sexM}),\] where \(\mathrm{sexM}\) is 1 for male runners and 0 for female runners.
We can see from the formula that, on average, women run roughly 12 minutes slower than men of the same age, and that runners get slower with age at a rate of about 0.3 minutes per year over the 10-mile race.
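As a quick arithmetic check of the fitted equation, plug in age 40 for each sex using the coefficients reported above.

```r
# Predicted net times from the fitted equation, computed by hand
# (sexM is 1 for male runners and 0 for female runners)
82.7591 + 0.3117 * 40 - 12.1512 * 0   # 40-year-old woman: about 95.2 minutes
82.7591 + 0.3117 * 40 - 12.1512 * 1   # 40-year-old man:   about 83.1 minutes
```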
To see whether the slopes and the y-intercept are statistically significant we use the function `summary()`.
lm(net~age + sex, Runners) %>% summary()
##
## Call:
## lm(formula = net ~ age + sex, data = Runners)
##
## Residuals:
## Min 1Q Median 3Q Max
## -90.925 -9.318 -0.823 8.063 85.045
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 82.759116 0.395270 209.37 <2e-16 ***
## age 0.311674 0.009556 32.62 <2e-16 ***
## sexM -12.151242 0.210166 -57.82 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.49 on 18156 degrees of freedom
## (6175 observations deleted due to missingness)
## Multiple R-squared: 0.1701, Adjusted R-squared: 0.17
## F-statistic: 1861 on 2 and 18156 DF, p-value: < 2.2e-16
The three stars to the right of the slopes and intercept indicate that they are significantly different from zero (i.e. statistically significant). We conclude that age and sex do affect running times in the 10-mile race.
One of the advantages of a linear model is that it is very simple and interpretable. A disadvantage is that it is restrictive (i.e. a straight line may not be a good fit for your data, in which case you shouldn’t use it).
Notice that according to the model a zero-year-old woman can run the 10-miler in 83 minutes on average. This shows why you should not extrapolate beyond the range of reasonable explanatory values.
Using matrices, we can write this linear relationship between the expected outcome and the explanatory variables as \(\widehat{Y_i}=A{\mathbf X_i}\).
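For the model above, with the usual convention that \({\mathbf X_i}\) contains a leading 1 for the intercept, this reads \[\widehat{Y_i} = \begin{pmatrix} 82.76 & 0.31 & -12.15 \end{pmatrix} \begin{pmatrix} 1 \\ \mathrm{age}_i \\ \mathrm{sexM}_i \end{pmatrix} = 82.76 + 0.31\,\mathrm{age}_i - 12.15\,\mathrm{sexM}_i.\]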
We can visualize the output of the model using the function `fmodel()`.
lm(net~age + sex, Runners) %>% fmodel()
Use the function `evaluate_model()` to see the model output for several values of the inputs. The syntax is `evaluate_model(model, at)`, where `at` is a named list giving specific values at which to hold the explanatory variables.
For example:
lm(net~age + sex, Runners) %>%
evaluate_model(at=list(age=20, sex="M"))
age | sex | model_output |
---|---|---|
20 | M | 76.84136 |
This says that, on average, 20-year-old males ran about 76.84 minutes for the 10 miles.
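As another example, here is a sketch evaluating the same model for a 60-year-old female runner; it assumes female runners are coded `"F"` in `Runners`.

```r
# Sketch: the same model evaluated for a 60-year-old woman
lm(net ~ age + sex, Runners) %>%
  evaluate_model(at = list(age = 60, sex = "F"))
```

From the coefficients above this should come out to roughly 82.76 + 0.31 × 60 ≈ 101.5 minutes.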
We can explore the effect size of age or sex on net running time using the function `effect_size()`. The syntax is `effect_size(model, formula, to, step, ...)`, where the right side of `formula` is the variable with respect to which the effect size is to be calculated (for example, `~age` to get the effect size of `age` on the model’s response variable), `step` is the size of the change to make in that explanatory variable, and `...` gives values of variables you wish to hold fixed.
For example:
lm(net~age + sex, Runners) %>%
effect_size(~age, step=1, sex="M")
slope | age | to:age | sex |
---|---|---|---|
0.3116741 | 40 | 41 | M |
This says that men get about 0.31 minutes slower with every passing year of age.
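You can also ask for the effect size of the categorical variable `sex`. Here is a sketch holding `age` fixed; for a categorical input the output should report a change between the sex levels rather than a slope.

```r
# Sketch: effect of switching sex, holding age fixed at 40
lm(net ~ age + sex, Runners) %>%
  effect_size(~sex, age = 40)
```

Since `sex` enters this model additively, the change is just the `sexM` coefficient, about 12 minutes in magnitude.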
The recursive partitioning model is a nonparametric model: it makes no assumptions about the form of the relationship between your variables (for example, we don’t assume the response variable is a linear function of the explanatory variables, as in linear regression).
The main characteristic of the model is that the data are recursively partitioned so that observations with similar response values are grouped together.
After the partition is completed, a constant value of the response variable is predicted within each group (these are the leaves of the tree).
For example, consider the running times, `net`, in the 10-mile race as a function of age and sex:
rpart(net~ age + sex, data=Runners) %>% rpart.plot::prp(type=3)
The data are partitioned first by sex, which yields significantly different `net` times. Each subdivided group is then partitioned further at particular ages, for men and for women, where the `net` value changes significantly.
You can also visualize this decision tree as a graph:
rpart(net~ age + sex, data=Runners) %>% fmodel()
You can see from the tree or the graph that at around age 50 there is a drop in running time performance for men and women.
There is a complexity parameter, `cp`, that adjusts the threshold for how much a split must improve the fit before it is made. The default value in `rpart()` is `cp = 0.01`, and smaller `cp` values correspond to more complex trees.
I suggest you just use the default value unless you have a good reason not to. Smaller `cp` values can lead to overfitting your model, which means that it works great for your training data but not for new data.
rpart(net~ age + sex, data=Runners, cp=.002) %>% rpart.plot::prp(type=3)
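To see the effect of `cp` numerically rather than from the plots, here is a sketch that counts the nodes in the default tree versus the `cp = 0.002` tree; the exact counts depend on the `Runners` data.

```r
# Sketch: compare tree complexity at the default cp (0.01) and at cp = 0.002
tree_default <- rpart(net ~ age + sex, data = Runners)
tree_complex <- rpart(net ~ age + sex, data = Runners, cp = 0.002)
nrow(tree_default$frame)   # number of nodes in the default tree
nrow(tree_complex$frame)   # number of nodes in the more complex tree
```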
Advantages of the recursive partitioning model are that it is easy to interpret and that it makes no assumptions about the form of the relationship between your variables.
https://www.r-bloggers.com/how-to-perform-a-logistic-regression-in-r/
head(Runners)
age | net | gun | sex | year | previous | nruns | start_position |
---|---|---|---|---|---|---|---|
31 | 85.35 | 86.88 | M | 2005 | 0 | 4 | calm |
32 | 91.75 | 92.53 | M | 2006 | 1 | 4 | eager |
33 | NA | 92.40 | M | 2007 | 2 | 4 | NA |
34 | NA | 89.28 | M | 2008 | 3 | 4 | NA |
25 | 67.33 | 67.48 | M | 2001 | 0 | 3 | eager |
26 | 70.45 | 70.70 | M | 2002 | 1 | 3 | eager |
Let’s explore the relationship between the explanatory variables `age` and `net` and the response variable `start_position`.
rpart(start_position~net + age,data=Runners ) %>% rpart.plot::prp(type=3)
It doesn’t appear that age is a factor.
The partitioning works in the same way, except that now each split increases the purity of the groups (the fraction of cases in a group that belong to a single class). For example, the first split in the tree above separates out most of the eager runners.
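One way to see how pure the groups end up is to cross-tabulate the tree’s predicted class against the actual `start_position`. Here is a sketch, restricted to complete cases so the rows line up.

```r
# Sketch: how pure are the leaves of the start_position tree?
runners_cc <- Runners %>%
  filter(!is.na(start_position), !is.na(net), !is.na(age))
position_tree <- rpart(start_position ~ net + age, data = runners_cc)
table(predicted = predict(position_tree, type = "class"),
      actual    = runners_cc$start_position)
```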
Sometimes it makes sense just to roll the dice and see what comes up. In the context of modeling, this means throwing a big set of potential explanatory variables into a model and seeing if the process of model training finds something of interest. (Only the rpart() architecture provides an opportunity to automatically choose a subset of the explanatory variables. lm() will put every variable you give it into the model.)
rpart(start_position~.,data=Runners ) %>% rpart.plot::prp(type=3)
It appears that only `net` and `gun` are important variables. `gun` is the time (in minutes) from when the starter’s gun was fired to when the runner finished the race.
Being able to explore which variables are important in this way makes recursive partitioning very popular.
In linear regression we measure how good our regression plane fits our data using the coefficient of determination \[R^2=1-SSE/SST\] where SSE is the sum of squared errors and SST is the total sum of squares.
Here is a picture for simple linear regression.
Here is a picture for when you have two explanatory variables.
The regression surface is the linear combination of your explanatory variables that best fits your data (i.e. minimizes the SSE). Adding an explanatory variable can never increase the SSE, only decrease it, because its coefficient could always be set to zero. In practice the fitted coefficient usually isn’t exactly zero, which can make the added variable seem significant even when it isn’t.
The adjusted coefficient of determination, Adjusted \(R^2\), applies a penalty for adding explanatory variables and is a better gauge of whether an added explanatory variable is actually helping the fit. Still, it isn’t perfect.
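One standard form of the penalty, with \(n\) cases and \(p\) explanatory variables, is \[\mathrm{Adjusted}\ R^2 = 1 - (1 - R^2)\,\frac{n-1}{n-p-1}.\] With \(R^2 = 0.1701\), \(n = 18159\), and \(p = 2\) from the summary below, this gives approximately 0.17, matching the reported adjusted value.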
This is from before.
lm(net~age + sex, Runners) %>% summary()
##
## Call:
## lm(formula = net ~ age + sex, data = Runners)
##
## Residuals:
## Min 1Q Median 3Q Max
## -90.925 -9.318 -0.823 8.063 85.045
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 82.759116 0.395270 209.37 <2e-16 ***
## age 0.311674 0.009556 32.62 <2e-16 ***
## sexM -12.151242 0.210166 -57.82 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.49 on 18156 degrees of freedom
## (6175 observations deleted due to missingness)
## Multiple R-squared: 0.1701, Adjusted R-squared: 0.17
## F-statistic: 1861 on 2 and 18156 DF, p-value: < 2.2e-16
Now I add a new variable to Runners which is age plus some random noise.
newRunners <- Runners %>% mutate(new=age+ rnorm(nrow(Runners)))
head(newRunners)
age | net | gun | sex | year | previous | nruns | start_position | new |
---|---|---|---|---|---|---|---|---|
31 | 85.35 | 86.88 | M | 2005 | 0 | 4 | calm | 30.99919 |
32 | 91.75 | 92.53 | M | 2006 | 1 | 4 | eager | 30.67126 |
33 | NA | 92.40 | M | 2007 | 2 | 4 | NA | 33.20056 |
34 | NA | 89.28 | M | 2008 | 3 | 4 | NA | 33.21155 |
25 | 67.33 | 67.48 | M | 2001 | 0 | 3 | eager | 26.40780 |
26 | 70.45 | 70.70 | M | 2002 | 1 | 3 | eager | 25.76742 |
lm(net~age + sex + new, newRunners) %>% summary()
##
## Call:
## lm(formula = net ~ age + sex + new, data = newRunners)
##
## Residuals:
## Min 1Q Median 3Q Max
## -90.985 -9.293 -0.823 8.086 84.787
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 82.76241 0.39526 209.387 < 2e-16 ***
## age 0.46482 0.10018 4.640 3.51e-06 ***
## sexM -12.15011 0.21016 -57.814 < 2e-16 ***
## new -0.15322 0.09978 -1.536 0.125
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.48 on 18155 degrees of freedom
## (6175 observations deleted due to missingness)
## Multiple R-squared: 0.1702, Adjusted R-squared: 0.1701
## F-statistic: 1241 on 3 and 18155 DF, p-value: < 2.2e-16
Notice that the `new` variable is not statistically significant, yet the multiple \(R^2\) still crept up slightly (0.1702 versus 0.1701). To test how good your model really is, you should do cross-validation, where you test how well your model predicts new data.
Most basically, you want to do something like this:
Generate a random TRUE or FALSE for each case in `newRunners`:
newRunnersTF<- rnorm(nrow(newRunners)) > 0
head(newRunnersTF)
## [1] FALSE TRUE FALSE FALSE FALSE TRUE
Build the base model `net ~ age + sex + new` with the training cases:
train_model <- lm(net ~ age + sex + new, data = subset(newRunners, newRunnersTF))
summary(train_model)
##
## Call:
## lm(formula = net ~ age + sex + new, data = subset(newRunners,
## newRunnersTF))
##
## Residuals:
## Min 1Q Median 3Q Max
## -90.712 -9.270 -0.754 8.042 76.611
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 82.90738 0.54941 150.902 <2e-16 ***
## age 0.28291 0.14089 2.008 0.0447 *
## sexM -12.04029 0.29395 -40.960 <2e-16 ***
## new 0.02214 0.14042 0.158 0.8747
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.44 on 9238 degrees of freedom
## (3082 observations deleted due to missingness)
## Multiple R-squared: 0.1675, Adjusted R-squared: 0.1672
## F-statistic: 619.6 on 3 and 9238 DF, p-value: < 2.2e-16
Evaluate the model on the training cases and on the testing cases:
Preds_train <- evaluate_model(train_model, data = subset(newRunners, newRunnersTF))
head(Preds_train,3)
age | net | gun | sex | year | previous | nruns | start_position | new | model_output | |
---|---|---|---|---|---|---|---|---|---|---|
2 | 32 | 91.75 | 92.53 | M | 2006 | 1 | 4 | eager | 30.67126 | 80.59919 |
6 | 26 | 70.45 | 70.70 | M | 2002 | 1 | 3 | eager | 25.76742 | 78.79317 |
10 | 29 | 69.32 | 69.68 | M | 2004 | 2 | 4 | eager | 28.43851 | 79.70103 |
Preds_test <- evaluate_model(train_model, data = subset(newRunners, !newRunnersTF))
head(Preds_test,3)
age | net | gun | sex | year | previous | nruns | start_position | new | model_output | |
---|---|---|---|---|---|---|---|---|---|---|
1 | 31 | 85.35 | 86.88 | M | 2005 | 0 | 4 | calm | 30.99919 | 80.32354 |
3 | 33 | NA | 92.40 | M | 2007 | 2 | 4 | NA | 33.20056 | 80.93809 |
4 | 34 | NA | 89.28 | M | 2008 | 3 | 4 | NA | 33.21155 | 81.22124 |
Compare the mean squared error (MSE) on the training and testing data:
with(data = Preds_train, mean((model_output - net)^2,na.rm=TRUE))
## [1] 180.5936
with(data = Preds_test, mean((model_output - net)^2,na.rm=TRUE))
## [1] 183.1258
You can see that the MSE (= SSE/n) went up on the testing data compared to the training data. This indicates that the `new` variable doesn’t have any additional predictive value.
This is the idea behind cross-validation, discussed in DataCamp’s statistical modeling course that you did for homework.
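The comparison is more convincing if you repeat the random split several times and average the test error. Here is a minimal sketch using the same functions as above.

```r
# Sketch: repeat the train/test split 10 times and average the test MSE,
# so the conclusion doesn't rest on the luck of a single random split
test_mse <- replicate(10, {
  in_train <- rnorm(nrow(newRunners)) > 0
  mod   <- lm(net ~ age + sex + new, data = subset(newRunners, in_train))
  preds <- evaluate_model(mod, data = subset(newRunners, !in_train))
  with(preds, mean((model_output - net)^2, na.rm = TRUE))
})
mean(test_mse)
```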