Pete Rose

Pete Rose of the Cincinnati Reds set a National League record of hitting safely in $44$ consecutive games. How likely a such a long sequence of safe hits is to be observed? If you were a bookmaker, what odds would you offer on such an event? This means that he safely reached first base after hitting the ball into fair territory, without the benefit of an error or a fielder’s choice at least once at every of those 44 games. There are a couple of facts we know about him: - Rose was a $300$ hitter, he hits safely 3 times out of 10 attempts - Each at bat is assumed to be independent, i.e., the current at bat doesn’t affect the outcome of the next. So $P(\text{leit}) = 0.3$

Assuming he comes to bat $4$ times each game, what probability might reasonably be associated with that hitting streak? First we define notation. We use $A_i$ to denote an event of hitting safely at game $i$, then \[ \begin{aligned} & P( {\rm Rose \; Hits \; Safely \; in \;44 \; consecutive \; games} ) = \\ & P ( A_1 \; \text{and} \; A_2 \ldots \text{and} \; A_{44} ) = P ( A_1 ) P ( A_2 ) \ldots P ( A_{44} ) \end{aligned} \] We now need to find $P(A_i) \ldots$ where $P (A_i ) = 1 - P ( \text{not} \; A_i )$

\[\begin{aligned} P ( A_1 ) & = 1 - P ( {\rm not} \; A_1 ) \\ & = 1 - P ( {\rm Rose \; makes \; 4 \; outs } ) \\ & = 1 - ( 0.7)^4 = 0.76 \end{aligned}\]

pa = 1 - ( 0.7)^4 %>% print()

## [1] 0.2401

phit = pa^44 %>% print()

## [1] 5.666091e-06

(1-phit)/phit

## [1] 176487.5

For the winning streak, then we have $(0.76)^{44} = 0.0000057$, a very low probability. In terms of odds, there are three basic inferences

This means that the odds for a particular player as good as Pete Rose starting a hitting streak today are $176,487$ to $1$
This doesn’t mean that the run of $44$ won’t be beaten by some player at some time: the Law of Very Large Numbers
Joe DiMaggio’s record is 56. He is a 325better, thus we have $(0.792)^{56} = 2.13 \times 10^{-6}$ or 455,962 to $1$. It’s going to be hard to beat.

Apple Watch

The Apple Watch Series 4 can perform a single-lead ECG and detect atrial fibrillation. The software can correctly identify 98% of cases of atrial fibrillation (true positives) and 99% of cases of non-atrial fibrillation (true negatives).

Predicted	atrial fibrillation	no atrial fibrillation	Total
atrial fibrillation	1960	980	2940
no atrial fibrillation	40	97020	97060
Total	2000	98000	100000

However, what is the probability of a person having atrial fibrillation when atrial fibrillation is identified by the Apple Watch Series 4? We use Bayes theorem to answer this question.

\[ p(\text{atrial fibrillation}\mid \text{atrial fibrillation is identified }) = \frac{0.01960}{ 0.02940} = 0.6667 \] The conditional probability of having atrial fibrillation when the Apple Watch Series 4 detects atrial fibrillation is about 67%.

In people younger than 55, Apple Watch’s positive predictive value is just 19.6 percent. That means in this group – which constitutes more than 90 percent of users of wearable devices like the Apple Watch – the app incorrectly diagnoses atrial fibrillation 79.4 percent of the time. (You can try the calculation yourself using this Bayesian calculator: enter 0.001 for prevalence, 0.98 for sensitivity, and 0.996 for specificity).

The electrocardiogram app becomes more reliable in older individuals: The positive predictive value is 76 percent among users between the ages of 60 and 64, 91 percent among those aged 70 to 74, and 96 percent for those older than 85.

We can see how the posterior $p(\text{atrial fibrillation}\mid \text{atrial fibrillation is identified })$ depends on the prior $p(\text{atrial fibrillation})$

prior = seq(0.001,0.1,by=0.0001)
posterior = 0.98*prior/(0.98*prior + 0.01*(1-prior))
plot(prior,posterior, type='l',lwd=2)

Abraham Wald and Airplane Safety

Many lives were saved by analysis of conditional probabilities performed by Abraham Wald during the Second World War. He was analyzing damages on the US planes that came back from bombarding missions in Germany. Somebody suggested to analyze distribution of the hits over different parts of the plane. The idea was to find a pattern in the damages and design a reinforcement strategy.

After examining hundreds of damaged airplanes, researchers came up with the following table

Location	Number of Planes
Engine	53
Cockpit	65
Fuel system	96
Wings, fuselage, etc.	434

We can convert those counts to probabilities

Location	Number of Planes
Engine	0.08
Cockpit	0.1
Fuel system	0.15
Wings, fuselage, etc.	0.67

We can conclude the the most likely area to be damaged on the returned planes was the wings and fuselage. \[ p(\mbox{hit on wings or fuselage } \mid \mbox{returns safely}) = 0.67 \] Wald realized that analyzing damages only on survived planes is not the right approach. Instead, he suggested that it is essential to calculate the inverse probability \[ p(\mbox{returns safely} \mid \mbox{hit on wings or fuselage }) = ? \] To calculate that, he interviewed many engineers and pilots, he performed a lot field experiments. He analyzed likely attack angles. He studied the properties of a shrapnel cloud from a flak gun. He suggested to the army that they fire thousands of dummy bullets at a plane sitting on the tarmac. Wald constructed a ‘probability model’ carefull to reconstruct an estimate for the joint probabilities. Table below shows the results.

Hit	Returned	Shut Down
Engine	53	57
Cockpit	65	46
Fuel system	96	16
Wings, fuselage, etc.	434	33

Which allows us to estimate joint probabilities, for example \[ p(\mbox{outcome = returns safely} , \mbox{hit = engine }) = 53/800 = 0.066 \] We also can calculate the conditional probabilities now \[ p(\mbox{outcome = returns safely} \mid \mbox{hit = wings or fuselage }) = \dfrac{434}{434+33} = 0.9293362 \]

Should we reinforce wings or fuselage? Which part of the airplane does need ot be reinforced?

\[ p(\mbox{outcome = returns safely} \mid \mbox{hit = engine }) = \dfrac{53}{53+57} = 0.48 \]

Google Stock Returns

goog = read.csv("data/GOOG2019.csv")
rgoog = goog$Adj.Close[2:251]/goog$Adj.Close[1:250] - 1
sp = read.csv("data/SP2019.csv");   rsp = sp$Adj.Close[2:251]/sp$Adj.Close[1:250] - 1
plot(rsp, rgoog, pch=21, bg="lightblue", ylab="GOOG return", xlab="SP500 return")
abline(lm(rgoog~rsp))

var_goog = mean((rgoog - mean(rgoog))^2)
var_sp = mean((rsp - mean(rsp))^2)
cov = mean((rgoog - mean(rgoog))*(rsp - mean(rsp))); print(cov)

## [1] 7.998625e-05

cor = cov/(sqrt(var_goog)*sqrt(var_sp)); print(cor)

## [1] 0.6673029

Normal Distribution

Consider observations of daily log-returns of a Google stock for 2019 Daily log-return on day $t$ is calculated by taking a logarithm of the ratio of price at close of day $t$ and at close of day $t-1$ \[ y_t = \log\left(\dfrac{P_t}{P_{t-1}}\right) \] For example on January 3 of 2017, the open price is 778.81 and close price was 786.140, then the log-return is $\log(786.140/778.81) = -0.0094$. It was empirically observed that log-returns follow a Normal distribution. This observation is a basis for Black-Scholes model with is used to evaluate future returns of a stock.

p = read.csv("data/GOOG2019.csv")$Adj.Close; n = length(p) 
r = log(p[2:n]/p[1:(n-1)]) 
hist(r, breaks=30, col="lightblue")

Observations on the far right correspond to the days when positive news was released and on the far left correspond to bad news. Typically, those are days when the quarterly earnings reports are released.

To estimate the expected value $\mu$ (return) and standard deviation $\sigma$ (a measure of risk), we simply calculate their sample counterparts \[ \bar{x} = \frac{1}{n} \sum_{i=1}^n x_i, ~\mathrm{ and }~ s^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x} )^2 \] The empirical (or sample) values $\bar x$ and $s^2$ are called sample mean and sample variance. Here simply vie them as our best guess about the mean and variance of the normal distribution model then our probabilistic model for next day’s return is then given by \[ R \sim N(\bar x, s^2). \]

This method of estimating the parameters of the distribution is called the method of moments, meaning that we estimate the variance and the mean (first two moments of a distribution) by simply calculating those from the observed sample.

Say we are interested in investing into Google and would like to calculated the expected return of our investment as well as risk associated with this investment We assume that behavior of the returns in the future will be the same as in 2019.

n = length(r) 
rbar = sum(r)/n; print(rbar)

## [1] 0.0009798198

s2 = sum((r-rbar)^2)/(n-1); print(s2)

## [1] 0.0002309921

x = seq(-0.08,0.08, length.out = 200) 
hist(r, breaks=30, col="lightblue", freq = F) 
lines(x,dnorm(x,rbar,sqrt(s2)), col="red", lwd=2)

Now, assume, I invest all my portfolio into Google. I can predict my annual return to be $251 \times 9.7981977\times 10^{-4}$ = 0.2459348 and risk (volatility) of my investment is $\sqrt{s^2}$ = 0.0151984% a year.

I can predict the risk of loosing 3% or more in one day using my model is 1.93%.

pnorm(log(1-0.03), rbar, sqrt(s2))*100

## [1] 1.929316

Stock market crash 1987

sp = read.csv("data/SPMonthly.csv")$Adj.Close; n = length(sp) 
spret = sp[602:n]/sp[601:(n-1)]-1 # Calculate  1977-1987 returns 
mean(spret)

## [1] 0.0120242

sd(spret)

## [1] 0.04300243

Prior to the October, 1987 crash SP500 monthly returns were 1.2% with a risk/volatility of 4.3%. The question is how extreme was the 1987 crash of $-21.76$%? \[ X \sim N \left(1.2, 4.3^2 \right ) \] This probability distribution can be standardized to yield \[ Z =\frac{X-\mu}{\sigma} = \frac{X - 1.2}{4.3} \sim N(0,1) . \] Now,, we calculate the observed $Z$, given the outcome of the crash event \[ Z = \frac{-0.2176 - 0.012}{0.043} = -5.27 \] That’s is $5$-sigma event in terms of the distribution of $X$. Meaning that -0.2176 is 5 standard diviations away from the mean. Under a normal model that is equivalent to $P(X < -0.2176)$

pnorm(-21.76, 1.2,4.3)

## [1] 4.659267e-08

EPL Model (Poisson Distribution)

Consider the problem of modeling soccer scores in the English Premier League (EPL) games. We use data from Betfair, a website, which posts odds on many football games. The goal is to calculate odds for the possible scores in a match. \[ 0-0, \; 1-0, \; 0-1, \; 1-1, \; 2-0, \ldots \] Another question we might ask, is what’s the odds of a team winning?

This is given by $P\left ( X> Y \right )$. The odds’s of a draw are given by $P \left ( X = Y \right )$?

Professional sports betters rely on sophisticated statistical models to predict the outcomes. Instead, we present a simple, but useful model for predicting outcomes of EPL games. We follow the methodology given in @spiegelhalter2009one.

First, load the data and then model the number of goals scored using Poisson distribution.

df = read.csv("data/epl.csv") %>% select(home_team_name,away_team_name,home_score,guest_score)
head(df)

##   home_team_name       away_team_name home_score guest_score
## 1        Arsenal            Liverpool          3           4
## 2    Bournemouth    Manchester United          1           3
## 3        Burnley              Swansea          0           1
## 4        Chelsea             West Ham          2           1
## 5 Crystal Palace West Bromwich Albion          0           1
## 6        Everton            Tottenham          1           1

Let’s look at the empirical distribution across the number of goals scored by Manchester United

team_name="Manchester United" 
team_for  = c(df[df$home_team_name==team_name,"home_score"],df[df$away_team_name==team_name,"guest_score"]) 
n = length(team_for) 
for_byscore = table(team_for)/n 
barplot(for_byscore, col="coral", main="Histogram of Goals Scored by MU")

Hence the historical data fits closely to a Poisson distribution, the parameter $\lambda$ describes the average number of goals scored and we calculate it by calculating the sample mean, the maximum likelihood estimate. A Bayesian method where we assume that $\lambda$ has a Gamma prior is also available. This lets you incorporate outside information into the predictive model.

lambda_for = mean(team_for) 
barplot(rbind(dpois(0:4, lambda = lambda_for),for_byscore),beside = T, col=c("aquamarine3","coral"), xlab="Goals", ylab="probability", main="Histogram vs Poisson Model Prediction of Goals Scored by MU") 
legend("topright", c("Poisson","MU"), pch=15, col=c("aquamarine3", "coral"), bty="n")

Now we will use Poisson model and Monter Carlo simulations to predict possible outcomes of the MU vs Hull games. First we estimate the rate parameter for goals by MU lmb_mu and goals by Hull lmb_h. Each team played a home and away game with every other team, thus 38 total games was played by all teams. We calculate the average by dividing total number of goals scored by the number of games

sumdf = df %>% 
  group_by(home_team_name) %>% 
  summarise(Goals_For_Home = sum(home_score)) %>%
  full_join(df %>% 
              group_by(away_team_name) %>% 
              summarise(Goals_For_Away = sum(guest_score)), by = c("home_team_name" = "away_team_name")
            ) %>%
  full_join(df %>% 
              group_by(home_team_name) %>% 
              summarise(Goals_Against_Home = sum(guest_score))
            ) %>%
  full_join(df %>% 
              group_by(away_team_name) %>%
              summarise(Goals_Against_Away = sum(home_score)), by = c("home_team_name" = "away_team_name")
            ) %>%
  rename(Team=home_team_name)

## Joining, by = "home_team_name"

knitr::kable(sumdf[sumdf$Team %in% c("Manchester United", "Hull"),])

Team	Goals_For_Home	Goals_For_Away	Goals_Against_Home	Goals_Against_Away
Hull	28	9	35	45
Manchester United	26	28	12	17

lmb_mu = (26+28)/38; print(lmb_mu)

## [1] 1.421053

lmb_h = (28+9)/38; print(lmb_h)

## [1] 0.9736842

Now we simulate 100 games between the teams

mu = rpois(100,lmb_mu)
hull = rpois(100,lmb_h)
sum(mu>hull)

## [1] 47

sum(mu==hull)

## [1] 33

table(mu,hull)

##    hull
## mu   0  1  2  3  4
##   0  9  9  2  1  0
##   1  7 15  5  0  0
##   2  9  7  8  1  1
##   3  7  3  6  1  1
##   4  1  2  1  1  0
##   5  2  1  0  0  0

From our simulaiton that 47 number of times MU wins and 33 there is a draw. The actual outcome was 0-0 (Hull at MU) and 0-1 (Mu at Hull). Thus our model fives a reasonable prediction.

The model can be improved by calculating different averages for home and away games. For example, Hull does much better at home games compared to away games. Further, we can include the characteristics of the opponent team to account for interactions between attack strengh (number of scored) and defence weakness of the opponent. Now we modify our value of expected goals for each of the teams by calculating \[ \hat \lambda = \lambda \times \text{Defense weakness} \]

Let’s model the MU at Hull game. The average away goals for MU $28/19 = 1.4736842$ and the defence weakness of Hull is $36/19 = 1.8421053$, thus the adjusted expected number of goals to be scored by MU is 2.7922438. Similarly, the adjusted number of the goals Hull is expected to score is $28/19 \times 17/19 = 1.3185596$

As a result of the simulation, we obtain

set.seed(1)
x = rpois(100, 28/19*35/19)
y = rpois(100,28/19*17/19)
table(x,y)

##    y
## x    0  1  2  3  4  5
##   0  1  3  0  0  0  0
##   1  3  5  6  1  1  0
##   2  4 16  7  3  0  0
##   3  6  7  2  3  0  0
##   4  4  7  5  2  1  0
##   5  2  3  1  2  0  2
##   6  1  0  0  1  0  0
##   7  1  0  0  0  0  0

image(z=table(x,y),x = 0:7,y = 0:5, xlab="MU Score", ylab="Hull Score")

sum(x>y)

## [1] 67

A model is only as good as its predictions. Let’s see how our model did in out-of-sample prediction,

Man U wins 67 games out of 100, we should bet when odds ratio is below 67 to 100.
Most likely outcome is 1-2 (16 games out of 100)
The actual outcome was 0-1 (they played on August 27, 2016)
In out simulation 0-1 was the fourth most probable outcome (8 games out of 100).

Sampling Error

Say we are interested in predicting if voters in the state of Michigan is to support the democratic or republican candidate at the next presidential elections. We will use the simulated data for our analysis. Let’s simulate voting preferences of the 8 million Michigan voters assuming that 52% of voters are to support the republican candidate

mi_voters = rbinom(8000000, 1, 0.52)

Now, we survey 100 randomly selected voters in Michigan

survey_sample = sample(mi_voters,size = 100, replace = F)
p = mean(survey_sample)

## [1] 0.45

The sample mean of 0.45 allows us to say something about the population mean, the proportion of republican voters we would get if we sampled enough in Michigan. Note, the central limit theorem provides us details of how quickly the sample mean tends to the popular mean. It only applies when observations are independent and identically distributed samples. We will illustrate this theorem empirically. We simulate 1000 surveys and calculate the sample mean for each survey. We will see that a Normal distribution fits the proportion distribution from the simulated surveys. Theoretically, we can find the means and standard deviation from the central limit theorem.

prep = replicate(2000, mean(sample(mi_voters,size = 100,replace = F)))
hist(prep, breaks = 30, freq = F, main="Histogram of voting proportions")
p = seq(0.3,0.7,length.out = 500)
lines(p, dnorm(p,mean(prep),sd(prep)), col="red",lwd=3)

We see that the red bell-curve is a good model for the distribution over the means calculated from samples. In fact, the central limit theorem says that sample proportions do follow normal distribution. More formally, if $x_1,x_2,\ldots,x_n$ is a sequence of independent random variables having mean $\mu$ and variance $\sigma^2$ coming from the same distribution then the sample mean follows a normal distribution \[ \bar{X} := \frac{1}{n}\sum_{i=1}^n X_i \sim N(\mu, {\sigma^2 \over n}). \] We sometimes use the notation $\hat \mu$ to denote an estimate. So we have $\hat \mu = \bar x$. This statistical property allows us to quantify uncertainty about the sample mean and say something about the true value of the mean $\mu$, in terms of a probabilistic interval statement.

In practice, we do not know mean $\mu$ and variance $\sigma^2$. Put simply, we wish to estimate $\bar x$ and $s^2$ calculated from the sample \[ s^2 = \frac{1}{n-1}\sum_{i=1}^n(x_i - \bar x)^2. \] We can think of $s^2$ as an estimate of $\hat\sigma^2$. The estimated standard deviation of $\bar x$ is then \[ s_{\bar{x} } = \frac{s}{\sqrt{n}}, \] and we call it the standard error, standard deviation of the error.

Google Seach Algorithm AB Testing

Let’s look at another example and test effectiveness of Google’s new search algorithm. We measure effectiveness by the number of users who clicked on one of the search results. As users send the search requests, they will be randomly processed with Algo 1 or Algo 2. We wait until 2500 search requests were processed by each of the algorithms and calculate the following table based on how often people clicked through

Google Search Algorithm
	Algo1	Algo2
success	1755	1818
failure	745	682
total	2500	2500

The probability of success is estimated to be $\hat{p}_1 = 0.702$ for the current algorithm and $\hat{p}_2 = 0.727$ for the new algorithm. We can calculate the 95% confidence interval or 95% Bayesian credible region for both estimated proportions Is the new algorithm

For Algo 1:

p1 = 1755/2500; p2 = 1818/2500
p1 + c(-1.96,1.96)*sqrt(p1*(1-p1)/2500)

## [1] 0.6840707 0.7199293

p2 + c(-1.96,1.96)*sqrt(p2*(1-p2)/2500)

## [1] 0.7097404 0.7446596

Given that the intervals do not overlap, there is enough evidence that algorithms are different, and the new Algo 1 is indeed more efficient.

We will get a slightly more precise estimation of uncertainty if we calculate confidence interval for the difference of the proportions. Since $p_1$ and $p_2$ both follow Normal distribution, thus their difference is also normally distributed \[ p_1 - p_2 \sim N(\hat p_1 - \hat p_2, s_1^2/n + s_2^2/n). \] Applying this formula for the Google search algorithm experiment, we calculate the 95% confidence interval for the difference

p1 - p2  + c(-1.96,1.96)*sqrt(p1*(1-p1)/2500 + p2*(1-p2)/2500)

## [1] -0.0502259431 -0.0001740569

The confidence interval for the difference does contain 0, and thus we cannot say that we are confident that algorithms are different!

More generally, if the number of observations in two groups are different, say $n_1$ and $n_2$ then the \[ s_{ \bar{X}_1 - \bar{X}_2 } = \sqrt{ \frac{ s^2_{ \bar{X}_1 }}{n_1} + \frac{ s^2_{ \bar{X}_2 }}{n_2} } \] or for proportions, we compute \[ s_{ \hat{p}_1 - \hat{p}_2 } = \sqrt{ \frac{ \hat{p}_1 (1- \hat{p}_1)}{n_1} + \frac{ \hat{p}_2 (1- \hat{p}_2)}{n_2} }. \]

Pfizer’s Viagra

We consider another example that involves pharmaceutical company Pfizer. Pfizer introduced Viagra in early 1998. During $1998$ of the $6$ millionViagra users $77$ died from coronary problems such as heart attacks.Pfizer claimed that this rate is no more than the general population.A clinical study found $11$ out of $1,500,000$ men who were not onViagra died of coronary problems during the same length of time as the$77$ Viagra users who died in $1998$.The question is,Let’s calculate the significance Interval. A 95% confidence interval fora difference in proportions $p_1 - p_2$ is \[ ( \hat{p}_1 - \hat{p}_2 ) \pm 1.96 \sqrt{ \frac{ \hat{p}_1 ( 1 - \hat{p}_1 ) }{ n_1 } + \frac{ \hat{p}_2 ( 1 - \hat{p}_2 ) }{ n_2 } } \]

Can do a confidence interval or a $Z$-score test.
With Viagra, $\hat{p}_1 = 77/6000000 = 0.00001283$ and without Viagra $\hat{p}_2 = 11/1500000 = 0.00000733$.
Need to test whether these are equal.

With a $95$% confidence interval for $( p_1 - p_2 )$ you get an interval \[{( 0.00000549 , 0.0000055 )}\] This doesn’t contain zero.

The evidence is that the proportion is higher.

Measured very accurately as $n$ is large even though $p$ is small.
With testing might use a one-sided test and an $\alpha$ of $0.01$.

Difference of proportions:

prop.test(x=c(11,77), n=c(1500000,6000000), alternative='greater',conf.level=.95)

## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  c(11, 77) out of c(1500000, 6e+06)
## X-squared = 2.6428, df = 1, p-value = 0.948
## alternative hypothesis: greater
## 95 percent confidence interval:
##  -1.027715e-05  1.000000e+00
## sample estimates:
##       prop 1       prop 2 
## 7.333333e-06 1.283333e-05

The p-value for the Null is $1-0.948 =0.052$.

Lord Rayleigh

Lord Rayleigh won the Nobel Prize for . This discovery occurred when he noticed a small discrepancy between two sets of measurements on nitrogen gas that he had extracted from the air and one he had made in the lab.

First, he removed all oxygen from a sample of air. He measured the density of the remaining gas in a fixed volume at constant temperature and pressure.
Second, he prepared the same volume of pure nitrogen by the chemical decomposition of nitrous oxide ($N_2 O$) and nitric oxide $NO$.

Let’s look at the Lord Rayleigh’s data

d = data.frame(air = c(2.31017, 2.30986, 2.31010, 2.31001, 2.31024, 2.31010, 2.31028,2.31028), 
               chemical = c(2.30143, 2.29890, 2.29816, 2.30182, 2.29869, 2.29940, 2.29849, 2.29889))
knitr::kable(d)

air	chemical
2.31017	2.30143
2.30986	2.29890
2.31010	2.29816
2.31001	2.30182
2.31024	2.29869
2.31010	2.29940
2.31028	2.29849
2.31028	2.29889

c(mean(d$air), mean(d$chemical))

## [1] 2.310130 2.299473

c(sd(d$air), sd(d$chemical))

## [1] 0.0001453076 0.0013791897

Let’s calculate the CIs for both

mean(d$air) + 1.96*sd(d$air)*c(-1,1)

## [1] 2.309845 2.310415

mean(d$chemical) + 1.96*sd(d$chemical)*c(-1,1)

## [1] 2.296769 2.302176

We also can use built-in function

t.test(d$air,d$chemical,var.equal=T)

## 
##  Two Sample t-test
## 
## data:  d$air and d$chemical
## t = 21.736, df = 14, p-value = 3.47e-12
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.009605877 0.011709123
## sample estimates:
## mean of x mean of y 
##  2.310130  2.299473

Linear Regression (Housing Data)

To demonstrate linear regression we develop a model that predicts the price of a house, given its square footage. This is a simplified version of a regression model used by Zillow, a house pricing site. First, we look at property sales data where we know the price and some observed characteristics. Let’s look at the scatter plot of living areas measure in squared feet and the price measured in thousands of dollars.

Second, we build a decision model that predicts price as a function of the observed characteristics. Now, we need to decide on what characteristics do we use. There are many factors or variables that affect the price of a house, for example location. Some other basics ones include size, number of rooms, and parking.

Let’s run a simple linear regression on size, measured in square foots. The value that we seek to predict is called the dependent variable, and we denote this by $y$ which is the price of house measured in thousand of dollars. The variable that we use to construct our prediction is the predictor variable, and we denote it with $x$. What’s does the data look like?

d = read.csv("data/SaratogaHouses.csv")
d$price = d$price/1000; d$livingArea = d$livingArea/1000
plot(d$livingArea,d$price, 'p', pch=21, bg="lightblue")

We use lm function to draw a line

l = lm(price ~ livingArea, data=d)
plot(d$livingArea, d$price, 'p', pch=21, bg="lightblue", xlab="Living Area", ylab="Price") 
abline(l, col="red", lwd=3)

We can use coef funciton to find out the slope and the intercept of this line.

coef(l)

## (Intercept)  livingArea 
##    13.43939   113.12254

The intercept value is in units of $y$ ($1000). The slope is in units of $y$ per units of $x$ ($1000/1 sq. feet).

In general, adding statistical noise, the simple linear regression model is then given by the follwing equation \[ y = \beta_0 + \beta_1 x + \epsilon \; \; {\rm where} \; \epsilon \sim N(0, \sigma^2) \] Here error $\epsilon$ term indicates that the relation between $x$ and $y$ is not excatly a line and for a given $x$, the value of $y$ can be deviating from the line. We will write this as $y \mid x \sim N\left(\beta_0 + \beta_1 x, \sigma^2 \right)$

In our housing example, we find that $\beta_0 = 13.439394$ and $\beta_1 =113.1225418$. Thus, every $1000$ sqft increase the price of the hous by $1.3439394^{4} and the price of a lot of land without a house is 113.1225418. The magnitude of $\beta_1$ shows the economic significance of house’s square footage.

We can now predict the price of a house when we only know that size: take the value off the regression line. For example, given a house size of $x = 2.2$. Predicted Price: \[ \hat{y} = 13.439394 + 113.1225418 \times 2.2 = 262.3089861 \]

Now we can use our training data set to compare prices predicted by the model and actual prices. We will do it by clculating an average squared difference, also called a mean squared error \[ MSE = \dfrac{1}{n}\sum_{i=1}^{n} (\hat y_i - y_i)^2 \]

Or in code

yhat = coef(l)[1] + coef(l)[2]*d$livingArea
y = d$price
mse = sum((yhat-y)^2)/length(y)
mse

## [1] 4769.913

Sometimes it is more convenient to use root mean squared error (RMSE) = $\sqrt{4769.912967} = 69.0645565$ which is measured in the same units as variable $y$, price in our case.

Log-tranformation and J&J data

Consider an example, when the relations are multiplicative.

Let’s load and plot Johnson and Johnson’s quarterly earnings

library(lubridate)
jj = read.csv("data/JNJ.csv")
jj$Date = lubridate::ymd(jj$Date) # convert from string to date format
plot(jj$Date,jj$Adj.Close, type='l', lwd = 3, xlab="Date", ylab="Earnings")

jj = jj %>%arrange(Date) # sort by date
jj$Date0 = 12*(lubridate::year(jj$Date)-lubridate::year(jj$Date[1])) + 
  (lubridate::month(jj$Date)-lubridate::month(jj$Date[1])) # months since the start of the data
model = lm(jj$Open ~ jj$Date0)
modellog = lm(log(jj$Open) ~ jj$Date0)
plot(jj$Date0,jj$Open, type='l', col="red", ylab="JNJ Open", xlab="Month")
abline(model, col="red")
par(new=T)
plot(log(jj$Open), type='l', xlab="", ylab="", xaxt='n', yaxt='n')
abline(modellog)
axis(side = 4)
legend("topleft", legend=c("Original", "log-Scale"), lty=1, col=c("red", "black"), lwd=3, bty="n", y.intersp = 1,cex=1)

summary(model)

## 
## Call:
## lm(formula = jj$Open ~ jj$Date0)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.061 -11.115  -2.037   8.347  51.989 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -23.013493   1.203666  -19.12   <2e-16 ***
## jj$Date0      0.193127   0.003612   53.47   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.49 on 576 degrees of freedom
## Multiple R-squared:  0.8323, Adjusted R-squared:  0.832 
## F-statistic:  2859 on 1 and 576 DF,  p-value: < 2.2e-16

summary(modellog)

## 
## Call:
## lm(formula = log(jj$Open) ~ jj$Date0)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.90371 -0.31006 -0.07729  0.33070  0.78297 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -9.307e-02  3.238e-02  -2.874   0.0042 ** 
## jj$Date0     9.143e-03  9.716e-05  94.103   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3898 on 576 degrees of freedom
## Multiple R-squared:  0.9389, Adjusted R-squared:  0.9388 
## F-statistic:  8855 on 1 and 576 DF,  p-value: < 2.2e-16

Just by visual inspection we can see that a straight line will not be a good fit for this data. The relation between the time and earnings is multiplicative. There are many business and natural science examples where multiplicative relation holds. To model multiplicative or power relations we can still use the apparatus of linear models. However, we need to apply log-transformation to both input and output variables. $\log$ is the natural $\log_e$ with base $e=2.718\ldots$ and that $\log ( a b) = \log a + \log b$ and $log ( a^b ) = b \log a$.

Suppose that we have an equation: \[y = A e^{\beta_1 x},\] where $A = e^{\beta_0}$. This is equivalent to $\log(y) = a + \beta_1 x$.

Taking logs of the original equation gives \[\begin{aligned} \log y &= \log A + \beta_1 x\\ \log y & = \beta_0 + \beta_1 x\end{aligned}\]

This model is called the exponential model. Any process that is exponentially growing will lead to such a model. For example growth of population with time or compounding interest in savings account.

Log-tranformation and the Mammals Data

mammals = read.csv("data/mammals.csv")
attach(mammals)
head(mammals)

##                      Mammal  Brain     Body
## 1          African_elephant 5712.0 6654.000
## 2 African_giant_pouched_rat    6.6    1.000
## 3                Arctic_Fox   44.5    3.385
## 4    Arctic_ground_squirrel    5.7    0.920
## 5            Asian_elephant 4603.0 2547.000
## 6                    Baboon  179.5   10.550

tail(mammals)

##                   Mammal Brain  Body
## 57                Tenrec   2.6 0.900
## 58            Tree_hyrax  12.3 2.000
## 59            Tree_shrew   2.5 0.104
## 60                Vervet  58.0 4.190
## 61         Water_opossum   3.9 3.500
## 62 Yellow-bellied_marmot  17.0 4.050

### Linear model
model = lm(Brain~Body)

Residual Diagnostics

The residuals show that you need a transformation .…

layout(matrix(1:2, ncol = 2))
plot(Brain~Body, data=mammals, pch=21, bg="lightblue")
abline(model, col="red", lwd=3)
plot(residuals(model)~mammals$Brain, bg="lightblue")

Residual Plots

par(mar=c(4,4.5,3,3), bty='n')
plot(model, bg="lightblue", pch=21)

log-log model

### Log-Log linear model
model = lm(log(Brain)~log(Body))
summary(model)

## 
## Call:
## lm(formula = log(Brain) ~ log(Body))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.75335 -0.53911 -0.09239  0.42068  2.61153 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.18328    0.10682   20.44   <2e-16 ***
## log(Body)    0.74320    0.03166   23.48   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7722 on 60 degrees of freedom
## Multiple R-squared:  0.9018, Adjusted R-squared:  0.9002 
## F-statistic: 551.2 on 1 and 60 DF,  p-value: < 2.2e-16

par(mar=c(4.5,4.5,3,3))
layout(matrix(1:2, ncol = 2))
plot(log(Brain)~log(Body), pch=21, bg="lightblue")
abline(model, col="red", lwd=3)
plot(log(Brain),residuals(model), pch=21, bg="lightblue")

That’s better!

$4$ in $1$ Residuals: log-log model

par(mar=c(4,4.5,3,3))
plot(model, bg="lightblue", pch=21, bty='n')

log-log Model

\[{ {\rm log(Body)} = 2.18 + 0.74 \; {\rm log(Brain)} \; . }\] The coefficients are highly significant $R^2 = 90$%

Outliers

res = rstudent(model)
outliers = order(res,decreasing = T)[1:10]
cbind(mammals[outliers,],
      residual = res[outliers], 
      Fit = exp(model$fitted.values[outliers]))

##             Mammal  Brain   Body  residual        Fit
## 11      Chinchilla   64.0  0.425 3.7848652   4.699002
## 34             Man 1320.0 62.000 2.6697886 190.672827
## 50   Rhesus_monkey  179.0  6.800 2.1221002  36.889735
## 6           Baboon  179.5 10.550 1.6651361  51.128826
## 42      Owl_monkey   15.5  0.480 1.4589815   5.143815
## 10      Chimpanzee  440.0 52.160 1.2734358 167.690600
## 27 Ground_squirrel    4.0  0.101 1.2020391   1.615101
## 43    Patas_monkey  115.0 10.000 1.1133552  49.134284
## 60          Vervet   58.0  4.190 1.0617853  25.740367
## 3       Arctic_Fox   44.5  3.385 0.9205806  21.966124

There is a residual value of 3.78 extreme outlier.

It corresponds to the Chinchilla.

This suggests that the Chinchilla is a master race of supreme intelligence!

Inference

NO!!! I checked and there was a data entry error.

The brain weight is given as 64 grams and should only be 6.4 grams.
The next largest residual corresponds to mankind

In this example the log-log transformation used seems to achieve two important goals, namely linearity and constant variance.

Boston Housing

We will demonstrate several more advanced predictive models using Boston housing dataset. This price dataset contains the factors that contribute to the median value of owner-occupied homes in $1000’s.

By constructing a regression model we can quantify the effects of the following measure factors (excluding MEDV as the response):

variable	description
CRIM	per capita crime rate by town
ZN	proportion of residential land zoned for lots over 25,000 sq.ft
INDUS	proportion of non-retail business acres per town
CHAS	Charles River dummy variable (1 if tract bounds river; else 0)
NOX	nitric oxides concentration (parts per 10 million)
RM	average number of rooms per dwelling
AGE	proportion of owner-occupied units built prior to 1940
DIS	weighted distances to five Boston employment centres
RAD	index of accessibility to radial highways
TAX	full-value property-tax rate per $10,000
PTRATIO	pupil-teacher ratio by town
B	$1000(\mbox{Bk} - 0.63)^2$ where Bk is the proportion of blacks by town
LSTAT	% lower status of the population
MEDV	Median value of owner-occupied homes in $1000’s

library(tree)
library(MASS)
data(Boston)
attach(Boston)
knitr::kable(Boston[1:5,1:7])

crim	zn	indus	nox	rm	age
0.00632	18	2.31	0.538	6.575	65.2
0.02731	0	7.07	0.469	6.421	78.9
0.02729	0	7.07	0.469	7.185	61.1
0.03237	0	2.18	0.458	6.998	45.8
0.06905	0	2.18	0.458	7.147	54.2

dim(Boston)

## [1] 506  14

First, we split data into training and test subsets

test_index = sample(1:nrow(Boston), size = as.integer(0.2*nrow(Boston)))
testing = Boston[test_index,]
training = Boston[-test_index,]
nrow(training)

## [1] 405

Let’s look at the correlation of each independent variable with the dependent variable and near zero variance

knitr::kable(cor(Boston,Boston$medv))

crim	-0.3883046
zn	0.3604453
indus	-0.4837252
chas	0.1752602
nox	-0.4273208
rm	0.6953599
age	-0.3769546
dis	0.2499287
rad	-0.3816262
tax	-0.4685359
ptratio	-0.5077867
black	0.3334608
lstat	-0.7376627
medv	1.0000000

knitr::kable(t(summarise_if(Boston,is.numeric, sd)/summarise_if(Boston,is.numeric, mean)))

crim	2.3803761
zn	2.0523759
indus	0.6160087
chas	3.6720281
nox	0.2089034
rm	0.1117992
age	0.4104834
dis	0.5548581
rad	0.9118115
tax	0.4128412
ptratio	0.1173060
black	0.2559616
lstat	0.5643741
medv	0.4081651

plot(Boston[1:200,c("medv","lstat","rm","age", "crim")], pch=21, bg="lightblue", cex=0.5)

Linear Regression

#Try linear model using all features
fit.lm <- lm(medv~lstat,data = training)

#predict on test set
pred.lm <- predict(fit.lm, newdata = testing)

plot(testing$medv,pred.lm, pch=21, bg="lightblue"); abline(a=0,b=1)

# Root-mean squared error
rmse.lm  =  sqrt(sum((pred.lm - testing$medv)^2)/
                  length(testing$medv))
mape.lm = mean(abs(pred.lm - testing$medv)/testing$medv)

c(RMSE = rmse.lm, MAPE = mape.lm, R2 = summary(fit.lm)$r.squared)

##      RMSE      MAPE        R2 
## 7.2137599 0.2756965 0.5552640

par(mfrow=c(3,1), mar=c(4,4,0,0), bty='n')
plot(medv~lstat, data=Boston, pch=21, bg="lightblue")
plot(log(medv)~lstat, data=Boston, pch=21, bg="lightblue")
plot(log(medv)~log(lstat), data=Boston, pch=21, bg="lightblue")

#Try linear model using all features
fit.lm <- lm(log(medv)~log(lstat),data = training)

#predict on test set
pred.lm <- exp(predict(fit.lm, newdata = testing))

plot(testing$medv,pred.lm, pch=21, bg="lightblue"); abline(a=0,b=1)

# Root-mean squared error
rmse.lm  =  sqrt(sum((pred.lm - testing$medv)^2)/
                  length(testing$medv))
mape.lm = mean(abs(pred.lm - testing$medv)/testing$medv)

c(RMSE = rmse.lm, MAPE = mape.lm, R2 = summary(fit.lm)$r.squared)

##      RMSE      MAPE        R2 
## 5.5252182 0.2036122 0.6738844

#Try linear model using all features
fit.lm <- lm(log(medv)~lstat,data = training) 

#predict on test set
pred.lm <- exp(predict(fit.lm, newdata = testing))

plot(testing$medv,pred.lm, pch=21, bg="lightblue"); abline(a=0,b=1)

# Root-mean squared error
rmse.lm  =  sqrt(sum((pred.lm - testing$medv)^2)/
                  length(testing$medv))
mape.lm = mean(abs(pred.lm - testing$medv)/testing$medv)

c(RMSE = rmse.lm, MAPE = mape.lm, R2 = summary(fit.lm)$r.squared)

##      RMSE      MAPE        R2 
## 6.5361788 0.2234829 0.6614223

Multiple Linear Regression

#Try simple linear model using selected features
fit.lm2 <- lm(formula = log(medv) ~ crim + chas + nox + rm + dis + ptratio + 
                rad + black + log(lstat), data = training)

pred.lm2 <- exp(predict(fit.lm2, newdata = testing))

# Root-mean squared error
rmse.lm2 <- sqrt(sum((pred.lm2 - testing$medv)^2)/
                   length(testing$medv))
mape.lm2 = mean(abs(pred.lm2 - testing$medv)/testing$medv)

c(RMSE = rmse.lm2, MAPE = mape.lm2, R2 = summary(fit.lm2)$r.squared)

##      RMSE      MAPE        R2 
## 4.2148008 0.1510489 0.7833539

# Plot of predicted price vs actual price
plot(pred.lm2,testing$medv, xlab = "Predicted Price", ylab = "Actual Price",pch=21,bg="lightblue")
abline(a=0,b=1)

summary(fit.lm2)

## 
## Call:
## lm(formula = log(medv) ~ crim + chas + nox + rm + dis + ptratio + 
##     rad + black + log(lstat), data = training)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.71608 -0.09696 -0.00954  0.09395  0.81543 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.7921376  0.2308976  20.754  < 2e-16 ***
## crim        -0.0108393  0.0013508  -8.024 1.18e-14 ***
## chas         0.1124884  0.0389586   2.887 0.004098 ** 
## nox         -0.8090206  0.1605993  -5.038 7.19e-07 ***
## rm           0.0469227  0.0182935   2.565 0.010686 *  
## dis         -0.0457327  0.0072537  -6.305 7.73e-10 ***
## ptratio     -0.0345566  0.0054580  -6.331 6.61e-10 ***
## rad          0.0057448  0.0018024   3.187 0.001550 ** 
## black        0.0004136  0.0001149   3.601 0.000358 ***
## log(lstat)  -0.4056284  0.0252214 -16.083  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1874 on 395 degrees of freedom
## Multiple R-squared:  0.7834, Adjusted R-squared:  0.7784 
## F-statistic: 158.7 on 9 and 395 DF,  p-value: < 2.2e-16

# Diagnostics plots
layout(matrix(c(1,2,3,4),2,2))
par(mar=c(4,5,2,0), bty='n')
plot(fit.lm2, pch=21, bg="lightblue", cex=0.7)

Lasso and Elastic-Net

library(glmnet)
mat = model.matrix(log(medv) ~ crim + chas + nox + rm + dis + ptratio + 
                rad + black + log(lstat), data = training)
mattest = model.matrix(log(medv) ~ crim + chas + nox + rm + dis + ptratio + 
                rad + black + log(lstat), data = testing)
fit.glm <- glmnet(mat,log(training$medv))
plot(fit.glm)

We will use cross-validaiton to choose the best penalty size

cvfit <- cv.glmnet(mat,log(training$medv))
plot(cvfit)

cvfit$lambda.min

## [1] 0.000484624

coef(cvfit, s = "lambda.min")

## 11 x 1 sparse Matrix of class "dgCMatrix"
##                        s1
## (Intercept)  4.7622337038
## (Intercept)  .           
## crim        -0.0106952601
## chas         0.1114781908
## nox         -0.7768003749
## rm           0.0470269254
## dis         -0.0444852877
## ptratio     -0.0338991864
## rad          0.0053855587
## black        0.0004076927
## log(lstat)  -0.4057575693

Let’s define a function that calculates model performance

modelmetrics = function(y,yhat) {
  ssreg = sum((y-yhat)^2)
  sstot = sum((y-mean(y))^2)
  rmse  =  sqrt(ssreg/length(y))
  mape = mean(abs((y-yhat)/y))
  r2 = 1 - ssreg/sstot
  knitr::kable(t(c(RMSE = rmse, MAPE = mape, R2 = r2)),digits = 2)
}

Lets look at the predictions

pred.lasso = exp(predict(cvfit, newx = mattest, s = "lambda.min"))
plot(testing$medv,pred.lasso, pch=21, bg="lightblue"); abline(a=0,b=1,lwd=3)

modelmetrics(testing$medv,pred.lasso)

RMSE	MAPE	R2
4.22	0.15	0.83

Now, let’s try ridge regression. The glmnet library soles the following optimisation problem \[ \min_{\beta_0,\beta} \frac{1}{N} \sum_{i=1}^{N} w_i l(y_i,\beta_0+\beta^T x_i) + \lambda\left[(1-\alpha)\|\beta\|_2^2/2 + \alpha \|\beta\|_1\right], \] The value of $\alpha=1$ is assumed by default. To get the ridge regression fit, we simply set $\alpha=0$

wts <-  c(rep(1,50), rep(2,50))
cvfit <- cv.glmnet(mat,log(training$medv), alpha=0)
plot(cvfit)

cvfit$lambda.min

## [1] 0.03263531

coef(cvfit, s = "lambda.min")

## 11 x 1 sparse Matrix of class "dgCMatrix"
##                        s1
## (Intercept)  4.2760342514
## (Intercept)  .           
## crim        -0.0094728276
## chas         0.1175579865
## nox         -0.5910244766
## rm           0.0712810232
## dis         -0.0311888700
## ptratio     -0.0304784923
## rad          0.0028441011
## black        0.0004275498
## log(lstat)  -0.3506923535

pred.ridge = exp(predict(cvfit, newx = mattest, s = "lambda.min"))
plot(testing$medv,pred.ridge, pch=21, bg="lightblue"); abline(a=0,b=1,lwd=3)

modelmetrics(testing$medv,pred.ridge)

RMSE	MAPE	R2
4.15	0.15	0.84

KNN Model

library(kknn) ## knn library
attach(Boston)
ind = order(testing$lstat)
RMSE = NULL

kk = c(2,10,20,50,100,300)
par(mfrow=c(3,2), mar=c(3,3,2,0), bty='n')
for(i in kk){
  near = kknn(medv~lstat,training,testing,k=i,kernel = "rectangular")
  aux = sqrt(mean((testing$medv-near$fitted)^2))
  RMSE = c(RMSE,aux)
  plot(testing$lstat,testing$medv,main=paste("k=",i),pch=21,bg="lightblue",cex=0.8,col="darkgray")
  lines(testing$lstat[ind],near$fitted[ind],col=2,lwd=2)
}

Model performance

modelmetrics(testing$medv, near$fitted)

RMSE	MAPE	R2
8.67	0.35	0.29

Now let’s plot the relation between the number of neighbors and model performance (out-of-sample)

plot(log(1/kk),RMSE,type="b",xlab="Complexity(log(1/k))",col="blue",ylab="RMSE",lwd=2,cex.lab=1.2,pch=21,bg="lightblue")

Tree-Based Models

First big tree size

big.tree = tree(medv~lstat,data=training,mindev=.0001)
print(length(unique(big.tree$where)))

## [1] 56

plot(big.tree,type="uniform")
text(big.tree,col="blue",label=c("yval"),cex=.8)

Let’s see how well we did on the out-of-sample data

boston.fit = predict(big.tree,testing) #get training fitted values

Model performance

modelmetrics(testing$medv, boston.fit)

RMSE	MAPE	R2
6.16	0.24	0.64

Not so bad, comparable to LM. Do we overfit? Let’s see if a smaller tree will do any better.

boston.tree=prune.tree(big.tree,best=7)
cat('pruned tree size: \n')

## pruned tree size:

print(length(unique(boston.tree$where)))

## [1] 7

#plot the tree and the fits.
par(mfrow=c(1,2))
plot(boston.tree,type="uniform")
text(boston.tree,col="blue",label=c("yval"),cex=.8)
boston.fit = predict(boston.tree,testing) #get training fitted values
plot(testing$medv,boston.fit, pch=21, bg="lightblue")

Model performance

modelmetrics(testing$medv, boston.fit)

RMSE	MAPE	R2
5.83	0.22	0.68

We have a simpler model (less leafes) that predicts better. The larger model did overfit! Let’s plit our predictions

plot(testing$lstat,testing$medv,cex=.5,pch=21, bg="lightblue") #plot data
oo=order(testing$lstat)
lines(testing$lstat[oo],boston.fit[oo],col='red',lwd=3) #step function fit

cvals=c(9.725,4.65,3.325,5.495,16.085,19.9) #cutpoints from tree
for(i in 1:length(cvals)) abline(v=cvals[i],col='magenta',lty=2) #cutpoints

Let’s Fit a regression tree to mev~dis+lstat. Now our we have a two dimensional input and the partitions will be rectangles (possible open) The tree is plotted as well as the corresponding partition of the two-dimensional. Again, let’s start with a big tree

big.tree = tree(medv~lstat+dis,training,mindev=.0001)
cat('first big tree size: \n')

## first big tree size:

print(length(unique(big.tree$where)))

## [1] 60

Then prune it down to one with 7 leaves

boston.tree=prune.tree(big.tree,best=7)
cat('pruned tree size: \n')

## pruned tree size:

print(length(unique(boston.tree$where)))

## [1] 7

qplot(lstat,dis,data=Boston, colour = medv) +
  scale_color_gradient(low="blue", high="red")

Plot tree and partition in x.

par(mfrow=c(1,2))
plot(boston.tree,type="u")
text(boston.tree,col="blue",label=c("yval"),cex=.8)
partition.tree(boston.tree)
lines(training$lstat,training$dis, type='p', pch=21, bg="lightblue", col=as.integer(training$medv))

pv=seq(from=.01,to=.99,by=.05)
x1q = quantile(testing$lstat,probs=pv)
x2q = quantile(testing$dis,probs=pv)
xx = expand.grid(x1q,x2q) #matrix with two columns using all combinations of x1q and x2q
dfpred = data.frame(dis=xx[,2],lstat=xx[,1])
lmedpred = predict(boston.tree,dfpred)

#make perspective plot

par(mfrow=c(1,1))
persp(x1q,x2q,matrix(lmedpred,ncol=length(x2q),byrow=T),
      theta=150,xlab='dis',ylab='lstat',zlab='medv',
      zlim=c(min(training$medv),1.1*max(training$medv)))

tree.boston=tree(medv~.,training,mindev=.01)
cv.boston=cv.tree(tree.boston, ); 
plot(cv.boston$size,cv.boston$dev, type='l')

prune.boston=prune.tree(tree.boston,best=6)
#summary(prune.boston)
#cv.tree(tree.boston,,prune.tree)$dev
plot(prune.boston)
text(prune.boston,pretty=0)

boston.fit = predict(prune.boston,testing) #get training fitted values
plot(testing$medv,boston.fit, pch=21, bg="lightblue")

Model performance

modelmetrics(testing$medv, boston.fit)

RMSE	MAPE	R2
4.48	0.2	0.81

The best model thus far!

library(randomForest)

set.seed(1)
bag.boston=randomForest(medv~.,data=training,mtry=13,importance=TRUE)
bag.boston

## 
## Call:
##  randomForest(formula = medv ~ ., data = training, mtry = 13,      importance = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 13
## 
##           Mean of squared residuals: 11.48287
##                     % Var explained: 85.47

yhat.bag = predict(bag.boston,newdata=testing)

par(mfrow=c(1,1))
plot(yhat.bag, testing$medv, pch=21, bg="lightblue")
abline(0,1)

Model performance

modelmetrics(testing$medv, yhat.bag)

RMSE	MAPE	R2
2.96	0.12	0.92

bag.boston=randomForest(medv~.,data=training,mtry=13,ntree=25)
yhat.bag = predict(bag.boston,newdata=testing)
modelmetrics(testing$medv, yhat.bag)

RMSE	MAPE	R2
3.08	0.12	0.91

rf.boston=randomForest(medv~.,data=training,mtry=6,importance=TRUE)
yhat.rf = predict(rf.boston,newdata=testing)
modelmetrics(testing$medv, yhat.rf)

RMSE	MAPE	R2
2.87	0.12	0.92

importance(rf.boston)

##           %IncMSE IncNodePurity
## crim    14.560967    1932.52197
## zn       4.284978      86.77127
## indus    9.605498    1267.96102
## chas     2.114661      86.29355
## nox     17.798989    1880.16365
## rm      37.452825    8700.64525
## age     12.740484     638.15264
## dis     18.415476    2062.13520
## rad      6.213532     192.69777
## tax      8.876818     620.15756
## ptratio 13.588925    1212.24552
## black    8.164535     552.84349
## lstat   36.354017   12256.23324

par(mfrow=c(1,1))
varImpPlot(rf.boston,pch=21,bg="lightblue")

Boosting

library(gbm)
set.seed(1)
boost.boston=gbm(medv~.,data=training,distribution="gaussian",n.trees=5000,interaction.depth=4)
summary(boost.boston)

##             var    rel.inf
## lstat     lstat 40.2917987
## rm           rm 23.3439045
## dis         dis 10.9281748
## nox         nox  5.6387294
## crim       crim  4.8134756
## age         age  4.5566143
## black     black  3.5027374
## ptratio ptratio  2.9804523
## tax         tax  1.2842744
## chas       chas  0.8298594
## indus     indus  0.7111797
## rad         rad  0.6165401
## zn           zn  0.5022594

plot(boost.boston,i="rm", bty='n')
plot(boost.boston,i="lstat", bty='n')

The marginal effect, the other variables are integrated out.

yhat.boost=predict(boost.boston,newdata=testing,n.trees=5000)
modelmetrics(testing$medv, yhat.boost)

RMSE	MAPE	R2
2.76	0.11	0.93

boost.boston=gbm(medv~.,data=training,distribution="gaussian",n.trees=5000,interaction.depth=4,shrinkage=0.2,verbose=F)
yhat.boost=predict(boost.boston,newdata=testing,n.trees=5000)
modelmetrics(testing$medv, yhat.boost)

RMSE	MAPE	R2
3	0.13	0.92

Neural Network with Keras library

Now let’s train the model

library(keras)

model <- keras_model_sequential() %>%
  layer_dense(units = 64, activation = "relu", input_shape = 13) %>%
  layer_dense(units = 64, activation = "relu") %>%
  layer_dense(units = 32, activation = "relu") %>%
  layer_dense(units = 1) 
summary(model)

## Model: "sequential"
## ________________________________________________________________________________
## Layer (type)                        Output Shape                    Param #     
## ================================================================================
## dense_3 (Dense)                     (None, 64)                      896         
## ________________________________________________________________________________
## dense_2 (Dense)                     (None, 64)                      4160        
## ________________________________________________________________________________
## dense_1 (Dense)                     (None, 32)                      2080        
## ________________________________________________________________________________
## dense (Dense)                       (None, 1)                       33          
## ================================================================================
## Total params: 7,169
## Trainable params: 7,169
## Non-trainable params: 0
## ________________________________________________________________________________

model %>% compile(loss = "mse",optimizer = "RMSprop")
history = model %>% fit(x = as.matrix(scale(training[,1:13])), y = training$medv,verbose = 2, epochs=20, validation_data = list(as.matrix(scale(testing[,1:13])),testing$medv))
yhat = model %>% predict( as.matrix(scale(testing[,1:13])))
plot(history)

plot(testing$medv,yhat, pch=21, bg='lightblue')
abline(0,1,col="red",lwd=2)

modelmetrics(testing$medv, yhat)

RMSE	MAPE	R2
3.89	0.18	0.86

Gaussian Process

library(laGP)
library(ggplot2)

x = scale(as.matrix(Boston[,13:13]))
y = scale(Boston$medv)
dest <- darg(NULL, x)
gest <- garg(list(mle = TRUE), y)
gpi <- newGPsep(x, y, d=dest$start, g=gest$start, dK=TRUE)
mle <- mleGPsep(gpi, param="both", tmin=c(dest$min, gest$min), tmax=c(dest$max, gest$max))
p <- predGPsep(gpi, x)
qplot(y, p$mean)

modelmetrics(y,p$mean)

RMSE	MAPE	R2
0.56	2.82	0.68

qplot(y, p$mean) + 
geom_point(shape=21, fill="lightblue", size=2) +
geom_ribbon(aes(ymin=p$mean+2*diag(p$Sigma),ymax=p$mean-2*diag(p$Sigma)),alpha=0.3) +
xlab("y") + ylab(expression(hat(y))) + theme_classic()

Dimensionality Reduction: Partial Least Squares (PLS)

library(pls)
library(randomForest)
N = nrow(Boston)
# train PLS
plsr_fit <-plsr(medv~., data=Boston, scale=T, validation="CV")
summary(plsr_fit)

## Data:    X dimension: 506 13 
##  Y dimension: 506 1
## Fit method: kernelpls
## Number of components considered: 13
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           9.206    6.534    5.074    4.989    4.922    4.880    4.852
## adjCV        9.206    6.532    5.069    4.981    4.912    4.869    4.844
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV       4.843    4.842    4.841     4.841     4.842     4.842     4.842
## adjCV    4.835    4.833    4.832     4.832     4.833     4.833     4.833
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       45.86    56.78    63.91    69.70    75.47    78.72    81.95    84.62
## medv    49.93    70.64    72.30    73.29    73.77    73.92    74.01    74.06
##       9 comps  10 comps  11 comps  12 comps  13 comps
## X       90.16     92.40     96.22     98.16    100.00
## medv    74.06     74.06     74.06     74.06     74.06

# use 7 components
plsr_7 <-plsr(formula = medv ~ ., ncomp = 7, data=Boston, scale = T)
PLS_Score <- scores(plsr_7)

# random forest
rf <- randomForest(x=PLS_Score, y=Boston$medv, importance=TRUE, proximity=TRUE)

# goodness of fit
pls_yhat = predict(plsr_7,Boston,ncomp=7)
rf_yhat = rf$predicted

v = mean((Boston$medv - mean(Boston$medv))^2)
r21 = 1 - mean((pls_yhat-Boston$medv)^2)/v
r22 = 1 - mean((rf_yhat-Boston$medv)^2)/v

par(mfrow=c(1,2))
plot(Boston$medv, pls_yhat, 
     xlab='y', ylab = 'yhat', 
     main=paste("PLS Fit, ", "R-squared =", round(r21,2)),
     pch=21,bg="lightblue",
     xlim=c(-5,50),ylim=c(-5,50))
abline(0,1,col="red",lwd=2)
modelmetrics(Boston$medv,pls_yhat)

RMSE	MAPE	R2
4.68	0.16	0.74

plot(Boston$medv, rf_yhat, 
     xlab='y', ylab = 'yhat',
     main=paste("PLS+RF Fit, ", "R-squared =", round(r22,2)),
     pch=21,bg="lightblue",
     xlim=c(-5,50),ylim=c(-5,50))
abline(0,1,col="red",lwd=2)

modelmetrics(Boston$medv,rf_yhat)

RMSE	MAPE	R2
4.05	0.14	0.81

What if we combine PLS and neural network?

model <- keras_model_sequential() %>%
  layer_dense(units = 64, activation = "relu", input_shape = 7) %>%
  layer_dense(units = 64, activation = "relu") %>%
  layer_dense(units = 32, activation = "relu") %>%
  layer_dense(units = 1) 
summary(model)

## Model: "sequential_1"
## ________________________________________________________________________________
## Layer (type)                        Output Shape                    Param #     
## ================================================================================
## dense_7 (Dense)                     (None, 64)                      512         
## ________________________________________________________________________________
## dense_6 (Dense)                     (None, 64)                      4160        
## ________________________________________________________________________________
## dense_5 (Dense)                     (None, 32)                      2080        
## ________________________________________________________________________________
## dense_4 (Dense)                     (None, 1)                       33          
## ================================================================================
## Total params: 6,785
## Trainable params: 6,785
## Non-trainable params: 0
## ________________________________________________________________________________

model %>% compile(loss = "mse",optimizer = "RMSprop")
history = model %>% fit(x = as.matrix(scale(PLS_Score)), y = Boston$medv,verbose = 2, epochs=100, validation_data = list(as.matrix(scale(PLS_Score)),Boston$medv))
yhat = model %>% predict( as.matrix(scale(PLS_Score)))
plot(history)

## `geom_smooth()` using formula 'y ~ x'

plot(Boston$medv, yhat, 
     xlab='y', ylab = 'yhat',
     main=paste("PLS+DL Fit"),
     pch=21,bg="lightblue",
     xlim=c(-5,50),ylim=c(-5,50))
abline(0,1,col="red",lwd=2)

modelmetrics(Boston$medv, yhat)

RMSE	MAPE	R2
2.27	0.09	0.94

Logistic regression

dat=Boston
dat$medv=1*(dat$medv>25)+0*(dat$medv<=25)
dat$medv=as.factor(dat$medv)
mod_logit=glm(medv~.,data=dat,family=binomial())
summary(mod_logit)

## 
## Call:
## glm(formula = medv ~ ., family = binomial(), data = dat)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.3498  -0.2806  -0.0932  -0.0006   3.3781  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  5.312511   4.876070   1.090 0.275930    
## crim        -0.011101   0.045322  -0.245 0.806503    
## zn           0.010917   0.010834   1.008 0.313626    
## indus       -0.110452   0.058740  -1.880 0.060060 .  
## chas         0.966337   0.808960   1.195 0.232266    
## nox         -6.844521   4.483514  -1.527 0.126861    
## rm           1.886872   0.452692   4.168 3.07e-05 ***
## age          0.003491   0.011133   0.314 0.753853    
## dis         -0.589016   0.164013  -3.591 0.000329 ***
## rad          0.318042   0.082623   3.849 0.000118 ***
## tax         -0.010826   0.004036  -2.682 0.007314 ** 
## ptratio     -0.353017   0.122259  -2.887 0.003884 ** 
## black       -0.002264   0.003826  -0.592 0.554105    
## lstat       -0.367355   0.073020  -5.031 4.88e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 563.52  on 505  degrees of freedom
## Residual deviance: 209.11  on 492  degrees of freedom
## AIC: 237.11
## 
## Number of Fisher Scoring iterations: 7

# diagnostic plot
par(mfrow=c(2,2), mar=c(3,3,0,0), bty='n')
plot(mod_logit,pch=21,bg="lightblue")

Multinomial

dat2=Boston
dat2$medv=0*(dat2$medv<=17)+1*(dat2$medv<=25 & dat2$medv>17)+2*(dat2$medv>25)
dat2$medv=as.factor(dat2$medv)
library(nnet)
mod_multinom=multinom(medv~.,data=dat2)

## # weights:  45 (28 variable)
## initial  value 555.897818 
## iter  10 value 338.289401
## iter  20 value 309.716007
## iter  30 value 211.736278
## iter  40 value 203.333103
## iter  50 value 202.890685
## iter  60 value 202.858211
## final  value 202.858175 
## converged

summary(mod_multinom)

## Call:
## multinom(formula = medv ~ ., data = dat2)
## 
## Coefficients:
##   (Intercept)       crim         zn       indus     chas       nox         rm
## 1    24.19217 -0.2231462 0.04670231  0.10160200 0.343424 -12.91234 -0.1602033
## 2    29.51905 -0.0643685 0.05704767 -0.01761023 1.311033 -18.77936  1.7631714
##           age        dis       rad          tax    ptratio       black
## 1 -0.03926022 -0.6648017 0.1826890 -0.006564069 -0.3770017 0.005904329
## 2 -0.03695548 -1.2169084 0.4612001 -0.017781470 -0.7175312 0.001405510
##        lstat
## 1 -0.2409420
## 2 -0.5899765
## 
## Std. Errors:
##   (Intercept)       crim         zn      indus      chas        nox        rm
## 1  0.09595305 0.07342447 0.04186604 0.05732693 0.7108412 0.17673798 0.3344770
## 2  0.02845394 0.06391464 0.04316645 0.08044716 1.0467007 0.01608296 0.4633493
##          age       dis        rad         tax    ptratio       black      lstat
## 1 0.01544144 0.2200741 0.05386458 0.003231373 0.09660977 0.002003785 0.04387560
## 2 0.01846981 0.2649990 0.08985403 0.005039059 0.13750043 0.004055685 0.08066911
## 
## Residual Deviance: 405.7164 
## AIC: 461.7164

par(mfrow=c(1,1))
pred=predict(mod_multinom,dat2)
plot(dat2$medv,pred,xlab="true class",ylab="predicted class")

SVM

library(e1071)
mod_svm=svm(medv~.,dat)
summary(mod_svm)

## 
## Call:
## svm(formula = medv ~ ., data = dat)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
## 
## Number of Support Vectors:  159
## 
##  ( 86 73 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  0 1

par(mfrow=c(1,1))
plot(mod_svm,data=dat,rm~lstat,pch=21,bg="lightblue")

41000 Business Statistics: Examples

Pete Rose

Apple Watch

Abraham Wald and Airplane Safety

Google Stock Returns

Normal Distribution

Stock market crash 1987

EPL Model (Poisson Distribution)

Sampling Error

Google Seach Algorithm AB Testing

Pfizer’s Viagra

Lord Rayleigh

Linear Regression (Housing Data)

Log-tranformation and J&J data

Log-tranformation and the Mammals Data

Boston Housing

Linear Regression

Multiple Linear Regression

Lasso and Elastic-Net

KNN Model

Tree-Based Models

Boosting

Neural Network with Keras library

Gaussian Process

Dimensionality Reduction: Partial Least Squares (PLS)

Logistic regression

Multinomial

SVM