Pete Rose of the Cincinnati Reds set a National League record of hitting safely in \(44\) consecutive games. How likely a such a long sequence of safe hits is to be observed? If you were a bookmaker, what odds would you offer on such an event? This means that he safely reached first base after hitting the ball into fair territory, without the benefit of an error or a fielder’s choice at least once at every of those 44 games. There are a couple of facts we know about him: - Rose was a \(300\) hitter, he hits safely 3 times out of 10 attempts - Each at bat is assumed to be independent, i.e., the current at bat doesn’t affect the outcome of the next. So \(P(\text{leit}) = 0.3\)
Assuming he comes to bat \(4\) times each game, what probability might reasonably be associated with that hitting streak? First we define notation. We use \(A_i\) to denote an event of hitting safely at game \(i\), then \[ \begin{aligned} & P( {\rm Rose \; Hits \; Safely \; in \;44 \; consecutive \; games} ) = \\ & P ( A_1 \; \text{and} \; A_2 \ldots \text{and} \; A_{44} ) = P ( A_1 ) P ( A_2 ) \ldots P ( A_{44} ) \end{aligned} \] We now need to find \(P(A_i) \ldots\) where \(P (A_i ) = 1 - P ( \text{not} \; A_i )\)
\[\begin{aligned} P ( A_1 ) & = 1 - P ( {\rm not} \; A_1 ) \\ & = 1 - P ( {\rm Rose \; makes \; 4 \; outs } ) \\ & = 1 - ( 0.7)^4 = 0.76 \end{aligned}\]
pa = 1 - ( 0.7)^4 %>% print()
## [1] 0.2401
phit = pa^44 %>% print()
## [1] 5.666091e-06
(1-phit)/phit
## [1] 176487.5
For the winning streak, then we have \((0.76)^{44} = 0.0000057\), a very low probability. In terms of odds, there are three basic inferences
The Apple Watch Series 4 can perform a single-lead ECG and detect atrial fibrillation. The software can correctly identify 98% of cases of atrial fibrillation (true positives) and 99% of cases of non-atrial fibrillation (true negatives).
Predicted | atrial fibrillation | no atrial fibrillation | Total |
---|---|---|---|
atrial fibrillation | 1960 | 980 | 2940 |
no atrial fibrillation | 40 | 97020 | 97060 |
Total | 2000 | 98000 | 100000 |
However, what is the probability of a person having atrial fibrillation when atrial fibrillation is identified by the Apple Watch Series 4? We use Bayes theorem to answer this question.
\[ p(\text{atrial fibrillation}\mid \text{atrial fibrillation is identified }) = \frac{0.01960}{ 0.02940} = 0.6667 \] The conditional probability of having atrial fibrillation when the Apple Watch Series 4 detects atrial fibrillation is about 67%.
In people younger than 55, Apple Watch’s positive predictive value is just 19.6 percent. That means in this group – which constitutes more than 90 percent of users of wearable devices like the Apple Watch – the app incorrectly diagnoses atrial fibrillation 79.4 percent of the time. (You can try the calculation yourself using this Bayesian calculator: enter 0.001 for prevalence, 0.98 for sensitivity, and 0.996 for specificity).
The electrocardiogram app becomes more reliable in older individuals: The positive predictive value is 76 percent among users between the ages of 60 and 64, 91 percent among those aged 70 to 74, and 96 percent for those older than 85.
We can see how the posterior \(p(\text{atrial fibrillation}\mid \text{atrial fibrillation is identified })\) depends on the prior \(p(\text{atrial fibrillation})\)
prior = seq(0.001,0.1,by=0.0001)
posterior = 0.98*prior/(0.98*prior + 0.01*(1-prior))
plot(prior,posterior, type='l',lwd=2)
Many lives were saved by analysis of conditional probabilities performed by Abraham Wald during the Second World War. He was analyzing damages on the US planes that came back from bombarding missions in Germany. Somebody suggested to analyze distribution of the hits over different parts of the plane. The idea was to find a pattern in the damages and design a reinforcement strategy.
After examining hundreds of damaged airplanes, researchers came up with the following table
Location | Number of Planes |
---|---|
Engine | 53 |
Cockpit | 65 |
Fuel system | 96 |
Wings, fuselage, etc. | 434 |
We can convert those counts to probabilities
Location | Number of Planes |
---|---|
Engine | 0.08 |
Cockpit | 0.1 |
Fuel system | 0.15 |
Wings, fuselage, etc. | 0.67 |
We can conclude the the most likely area to be damaged on the returned planes was the wings and fuselage. \[ p(\mbox{hit on wings or fuselage } \mid \mbox{returns safely}) = 0.67 \] Wald realized that analyzing damages only on survived planes is not the right approach. Instead, he suggested that it is essential to calculate the inverse probability \[ p(\mbox{returns safely} \mid \mbox{hit on wings or fuselage }) = ? \] To calculate that, he interviewed many engineers and pilots, he performed a lot field experiments. He analyzed likely attack angles. He studied the properties of a shrapnel cloud from a flak gun. He suggested to the army that they fire thousands of dummy bullets at a plane sitting on the tarmac. Wald constructed a ‘probability model’ carefull to reconstruct an estimate for the joint probabilities. Table below shows the results.
Hit | Returned | Shut Down |
---|---|---|
Engine | 53 | 57 |
Cockpit | 65 | 46 |
Fuel system | 96 | 16 |
Wings, fuselage, etc. | 434 | 33 |
Which allows us to estimate joint probabilities, for example \[ p(\mbox{outcome = returns safely} , \mbox{hit = engine }) = 53/800 = 0.066 \] We also can calculate the conditional probabilities now \[ p(\mbox{outcome = returns safely} \mid \mbox{hit = wings or fuselage }) = \dfrac{434}{434+33} = 0.9293362 \]
Should we reinforce wings or fuselage? Which part of the airplane does need ot be reinforced?
\[ p(\mbox{outcome = returns safely} \mid \mbox{hit = engine }) = \dfrac{53}{53+57} = 0.48 \]
goog = read.csv("data/GOOG2019.csv")
rgoog = goog$Adj.Close[2:251]/goog$Adj.Close[1:250] - 1
sp = read.csv("data/SP2019.csv"); rsp = sp$Adj.Close[2:251]/sp$Adj.Close[1:250] - 1
plot(rsp, rgoog, pch=21, bg="lightblue", ylab="GOOG return", xlab="SP500 return")
abline(lm(rgoog~rsp))
var_goog = mean((rgoog - mean(rgoog))^2)
var_sp = mean((rsp - mean(rsp))^2)
cov = mean((rgoog - mean(rgoog))*(rsp - mean(rsp))); print(cov)
## [1] 7.998625e-05
cor = cov/(sqrt(var_goog)*sqrt(var_sp)); print(cor)
## [1] 0.6673029
Consider observations of daily log-returns of a Google stock for 2019 Daily log-return on day \(t\) is calculated by taking a logarithm of the ratio of price at close of day \(t\) and at close of day \(t-1\) \[ y_t = \log\left(\dfrac{P_t}{P_{t-1}}\right) \] For example on January 3 of 2017, the open price is 778.81 and close price was 786.140, then the log-return is \(\log(786.140/778.81) = -0.0094\). It was empirically observed that log-returns follow a Normal distribution. This observation is a basis for Black-Scholes model with is used to evaluate future returns of a stock.
p = read.csv("data/GOOG2019.csv")$Adj.Close; n = length(p)
r = log(p[2:n]/p[1:(n-1)])
hist(r, breaks=30, col="lightblue")
Observations on the far right correspond to the days when positive news was released and on the far left correspond to bad news. Typically, those are days when the quarterly earnings reports are released.
To estimate the expected value \(\mu\) (return) and standard deviation \(\sigma\) (a measure of risk), we simply calculate their sample counterparts \[ \bar{x} = \frac{1}{n} \sum_{i=1}^n x_i, ~\mathrm{ and }~ s^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x} )^2 \] The empirical (or sample) values \(\bar x\) and \(s^2\) are called sample mean and sample variance. Here simply vie them as our best guess about the mean and variance of the normal distribution model then our probabilistic model for next day’s return is then given by \[ R \sim N(\bar x, s^2). \]
This method of estimating the parameters of the distribution is called the method of moments, meaning that we estimate the variance and the mean (first two moments of a distribution) by simply calculating those from the observed sample.
Say we are interested in investing into Google and would like to calculated the expected return of our investment as well as risk associated with this investment We assume that behavior of the returns in the future will be the same as in 2019.
n = length(r)
rbar = sum(r)/n; print(rbar)
## [1] 0.0009798198
s2 = sum((r-rbar)^2)/(n-1); print(s2)
## [1] 0.0002309921
x = seq(-0.08,0.08, length.out = 200)
hist(r, breaks=30, col="lightblue", freq = F)
lines(x,dnorm(x,rbar,sqrt(s2)), col="red", lwd=2)
Now, assume, I invest all my portfolio into Google. I can predict my annual return to be \(251 \times 9.7981977\times 10^{-4}\) = 0.2459348 and risk (volatility) of my investment is \(\sqrt{s^2}\) = 0.0151984% a year.
I can predict the risk of loosing 3% or more in one day using my model is 1.93%.
pnorm(log(1-0.03), rbar, sqrt(s2))*100
## [1] 1.929316
sp = read.csv("data/SPMonthly.csv")$Adj.Close; n = length(sp)
spret = sp[602:n]/sp[601:(n-1)]-1 # Calculate 1977-1987 returns
mean(spret)
## [1] 0.0120242
sd(spret)
## [1] 0.04300243
Prior to the October, 1987 crash SP500 monthly returns were 1.2% with a risk/volatility of 4.3%. The question is how extreme was the 1987 crash of \(-21.76\)%? \[ X \sim N \left(1.2, 4.3^2 \right ) \] This probability distribution can be standardized to yield \[ Z =\frac{X-\mu}{\sigma} = \frac{X - 1.2}{4.3} \sim N(0,1) . \] Now,, we calculate the observed \(Z\), given the outcome of the crash event \[ Z = \frac{-0.2176 - 0.012}{0.043} = -5.27 \] That’s is \(5\)-sigma event in terms of the distribution of \(X\). Meaning that -0.2176 is 5 standard diviations away from the mean. Under a normal model that is equivalent to \(P(X < -0.2176)\)
pnorm(-21.76, 1.2,4.3)
## [1] 4.659267e-08
Consider the problem of modeling soccer scores in the English Premier League (EPL) games. We use data from Betfair, a website, which posts odds on many football games. The goal is to calculate odds for the possible scores in a match. \[ 0-0, \; 1-0, \; 0-1, \; 1-1, \; 2-0, \ldots \] Another question we might ask, is what’s the odds of a team winning?
This is given by \(P\left ( X> Y \right )\). The odds’s of a draw are given by \(P \left ( X = Y \right )\)?
Professional sports betters rely on sophisticated statistical models to predict the outcomes. Instead, we present a simple, but useful model for predicting outcomes of EPL games. We follow the methodology given in @spiegelhalter2009one.
First, load the data and then model the number of goals scored using Poisson distribution.
df = read.csv("data/epl.csv") %>% select(home_team_name,away_team_name,home_score,guest_score)
head(df)
## home_team_name away_team_name home_score guest_score
## 1 Arsenal Liverpool 3 4
## 2 Bournemouth Manchester United 1 3
## 3 Burnley Swansea 0 1
## 4 Chelsea West Ham 2 1
## 5 Crystal Palace West Bromwich Albion 0 1
## 6 Everton Tottenham 1 1
Let’s look at the empirical distribution across the number of goals scored by Manchester United
team_name="Manchester United"
team_for = c(df[df$home_team_name==team_name,"home_score"],df[df$away_team_name==team_name,"guest_score"])
n = length(team_for)
for_byscore = table(team_for)/n
barplot(for_byscore, col="coral", main="Histogram of Goals Scored by MU")
Hence the historical data fits closely to a Poisson distribution, the parameter \(\lambda\) describes the average number of goals scored and we calculate it by calculating the sample mean, the maximum likelihood estimate. A Bayesian method where we assume that \(\lambda\) has a Gamma prior is also available. This lets you incorporate outside information into the predictive model.
lambda_for = mean(team_for)
barplot(rbind(dpois(0:4, lambda = lambda_for),for_byscore),beside = T, col=c("aquamarine3","coral"), xlab="Goals", ylab="probability", main="Histogram vs Poisson Model Prediction of Goals Scored by MU")
legend("topright", c("Poisson","MU"), pch=15, col=c("aquamarine3", "coral"), bty="n")
Now we will use Poisson model and Monter Carlo simulations to predict possible outcomes of the MU vs Hull games. First we estimate the rate parameter for goals by MU
lmb_mu
and goals by Hull lmb_h
. Each team played a home and away game with every other team, thus 38 total games was played by all teams. We calculate the average by dividing total number of goals scored by the number of games
sumdf = df %>%
group_by(home_team_name) %>%
summarise(Goals_For_Home = sum(home_score)) %>%
full_join(df %>%
group_by(away_team_name) %>%
summarise(Goals_For_Away = sum(guest_score)), by = c("home_team_name" = "away_team_name")
) %>%
full_join(df %>%
group_by(home_team_name) %>%
summarise(Goals_Against_Home = sum(guest_score))
) %>%
full_join(df %>%
group_by(away_team_name) %>%
summarise(Goals_Against_Away = sum(home_score)), by = c("home_team_name" = "away_team_name")
) %>%
rename(Team=home_team_name)
## Joining, by = "home_team_name"
knitr::kable(sumdf[sumdf$Team %in% c("Manchester United", "Hull"),])
Team | Goals_For_Home | Goals_For_Away | Goals_Against_Home | Goals_Against_Away |
---|---|---|---|---|
Hull | 28 | 9 | 35 | 45 |
Manchester United | 26 | 28 | 12 | 17 |
lmb_mu = (26+28)/38; print(lmb_mu)
## [1] 1.421053
lmb_h = (28+9)/38; print(lmb_h)
## [1] 0.9736842
Now we simulate 100 games between the teams
mu = rpois(100,lmb_mu)
hull = rpois(100,lmb_h)
sum(mu>hull)
## [1] 47
sum(mu==hull)
## [1] 33
table(mu,hull)
## hull
## mu 0 1 2 3 4
## 0 9 9 2 1 0
## 1 7 15 5 0 0
## 2 9 7 8 1 1
## 3 7 3 6 1 1
## 4 1 2 1 1 0
## 5 2 1 0 0 0
From our simulaiton that 47 number of times MU wins and 33 there is a draw. The actual outcome was 0-0 (Hull at MU) and 0-1 (Mu at Hull). Thus our model fives a reasonable prediction.
The model can be improved by calculating different averages for home and away games. For example, Hull does much better at home games compared to away games. Further, we can include the characteristics of the opponent team to account for interactions between attack strengh (number of scored) and defence weakness of the opponent. Now we modify our value of expected goals for each of the teams by calculating \[ \hat \lambda = \lambda \times \text{Defense weakness} \]
Let’s model the MU at Hull game. The average away goals for MU \(28/19 = 1.4736842\) and the defence weakness of Hull is \(36/19 = 1.8421053\), thus the adjusted expected number of goals to be scored by MU is 2.7922438. Similarly, the adjusted number of the goals Hull is expected to score is \(28/19 \times 17/19 = 1.3185596\)
As a result of the simulation, we obtain
set.seed(1)
x = rpois(100, 28/19*35/19)
y = rpois(100,28/19*17/19)
table(x,y)
## y
## x 0 1 2 3 4 5
## 0 1 3 0 0 0 0
## 1 3 5 6 1 1 0
## 2 4 16 7 3 0 0
## 3 6 7 2 3 0 0
## 4 4 7 5 2 1 0
## 5 2 3 1 2 0 2
## 6 1 0 0 1 0 0
## 7 1 0 0 0 0 0
image(z=table(x,y),x = 0:7,y = 0:5, xlab="MU Score", ylab="Hull Score")
sum(x>y)
## [1] 67
A model is only as good as its predictions. Let’s see how our model did in out-of-sample prediction,
Say we are interested in predicting if voters in the state of Michigan is to support the democratic or republican candidate at the next presidential elections. We will use the simulated data for our analysis. Let’s simulate voting preferences of the 8 million Michigan voters assuming that 52% of voters are to support the republican candidate
mi_voters = rbinom(8000000, 1, 0.52)
Now, we survey 100 randomly selected voters in Michigan
survey_sample = sample(mi_voters,size = 100, replace = F)
p = mean(survey_sample)
## [1] 0.45
The sample mean of 0.45 allows us to say something about the population mean, the proportion of republican voters we would get if we sampled enough in Michigan. Note, the central limit theorem provides us details of how quickly the sample mean tends to the popular mean. It only applies when observations are independent and identically distributed samples. We will illustrate this theorem empirically. We simulate 1000 surveys and calculate the sample mean for each survey. We will see that a Normal distribution fits the proportion distribution from the simulated surveys. Theoretically, we can find the means and standard deviation from the central limit theorem.
prep = replicate(2000, mean(sample(mi_voters,size = 100,replace = F)))
hist(prep, breaks = 30, freq = F, main="Histogram of voting proportions")
p = seq(0.3,0.7,length.out = 500)
lines(p, dnorm(p,mean(prep),sd(prep)), col="red",lwd=3)
We see that the red bell-curve is a good model for the distribution over the means calculated from samples. In fact, the central limit theorem says that sample proportions do follow normal distribution. More formally, if \(x_1,x_2,\ldots,x_n\) is a sequence of independent random variables having mean \(\mu\) and variance \(\sigma^2\) coming from the same distribution then the sample mean follows a normal distribution \[ \bar{X} := \frac{1}{n}\sum_{i=1}^n X_i \sim N(\mu, {\sigma^2 \over n}). \] We sometimes use the notation \(\hat \mu\) to denote an estimate. So we have \(\hat \mu = \bar x\). This statistical property allows us to quantify uncertainty about the sample mean and say something about the true value of the mean \(\mu\), in terms of a probabilistic interval statement.
In practice, we do not know mean \(\mu\) and variance \(\sigma^2\). Put simply, we wish to estimate \(\bar x\) and \(s^2\) calculated from the sample \[ s^2 = \frac{1}{n-1}\sum_{i=1}^n(x_i - \bar x)^2. \] We can think of \(s^2\) as an estimate of \(\hat\sigma^2\). The estimated standard deviation of \(\bar x\) is then \[ s_{\bar{x} } = \frac{s}{\sqrt{n}}, \] and we call it the standard error, standard deviation of the error.
Let’s look at another example and test effectiveness of Google’s new search algorithm. We measure effectiveness by the number of users who clicked on one of the search results. As users send the search requests, they will be randomly processed with Algo 1 or Algo 2. We wait until 2500 search requests were processed by each of the algorithms and calculate the following table based on how often people clicked through
Algo1 | Algo2 | |
---|---|---|
success | 1755 | 1818 |
failure | 745 | 682 |
total | 2500 | 2500 |
The probability of success is estimated to be \(\hat{p}_1 = 0.702\) for the current algorithm and \(\hat{p}_2 = 0.727\) for the new algorithm. We can calculate the 95% confidence interval or 95% Bayesian credible region for both estimated proportions Is the new algorithm
For Algo 1:
p1 = 1755/2500; p2 = 1818/2500
p1 + c(-1.96,1.96)*sqrt(p1*(1-p1)/2500)
## [1] 0.6840707 0.7199293
p2 + c(-1.96,1.96)*sqrt(p2*(1-p2)/2500)
## [1] 0.7097404 0.7446596
Given that the intervals do not overlap, there is enough evidence that algorithms are different, and the new Algo 1 is indeed more efficient.
We will get a slightly more precise estimation of uncertainty if we calculate confidence interval for the difference of the proportions. Since \(p_1\) and \(p_2\) both follow Normal distribution, thus their difference is also normally distributed \[ p_1 - p_2 \sim N(\hat p_1 - \hat p_2, s_1^2/n + s_2^2/n). \] Applying this formula for the Google search algorithm experiment, we calculate the 95% confidence interval for the difference
p1 - p2 + c(-1.96,1.96)*sqrt(p1*(1-p1)/2500 + p2*(1-p2)/2500)
## [1] -0.0502259431 -0.0001740569
The confidence interval for the difference does contain 0, and thus we cannot say that we are confident that algorithms are different!
More generally, if the number of observations in two groups are different, say \(n_1\) and \(n_2\) then the \[ s_{ \bar{X}_1 - \bar{X}_2 } = \sqrt{ \frac{ s^2_{ \bar{X}_1 }}{n_1} + \frac{ s^2_{ \bar{X}_2 }}{n_2} } \] or for proportions, we compute \[ s_{ \hat{p}_1 - \hat{p}_2 } = \sqrt{ \frac{ \hat{p}_1 (1- \hat{p}_1)}{n_1} + \frac{ \hat{p}_2 (1- \hat{p}_2)}{n_2} }. \]
We consider another example that involves pharmaceutical company Pfizer. Pfizer introduced Viagra in early 1998. During \(1998\) of the \(6\) millionViagra users \(77\) died from coronary problems such as heart attacks.Pfizer claimed that this rate is no more than the general population.A clinical study found \(11\) out of \(1,500,000\) men who were not onViagra died of coronary problems during the same length of time as the\(77\) Viagra users who died in \(1998\).The question is,Let’s calculate the significance Interval. A 95% confidence interval fora difference in proportions \(p_1 - p_2\) is \[ ( \hat{p}_1 - \hat{p}_2 ) \pm 1.96 \sqrt{ \frac{ \hat{p}_1 ( 1 - \hat{p}_1 ) }{ n_1 } + \frac{ \hat{p}_2 ( 1 - \hat{p}_2 ) }{ n_2 } } \]
Can do a confidence interval or a \(Z\)-score test.
With Viagra, \(\hat{p}_1 = 77/6000000 = 0.00001283\) and without Viagra \(\hat{p}_2 = 11/1500000 = 0.00000733\).
Need to test whether these are equal.
With a \(95\)% confidence interval for \(( p_1 - p_2 )\) you get an interval \[{( 0.00000549 , 0.0000055 )}\] This doesn’t contain zero.
The evidence is that the proportion is higher.
Measured very accurately as \(n\) is large even though \(p\) is small.
With testing might use a one-sided test and an \(\alpha\) of \(0.01\).
Difference of proportions:
prop.test(x=c(11,77), n=c(1500000,6000000), alternative='greater',conf.level=.95)
##
## 2-sample test for equality of proportions with continuity correction
##
## data: c(11, 77) out of c(1500000, 6e+06)
## X-squared = 2.6428, df = 1, p-value = 0.948
## alternative hypothesis: greater
## 95 percent confidence interval:
## -1.027715e-05 1.000000e+00
## sample estimates:
## prop 1 prop 2
## 7.333333e-06 1.283333e-05
The p-value for the Null is \(1-0.948 =0.052\).
Lord Rayleigh won the Nobel Prize for . This discovery occurred when he noticed a small discrepancy between two sets of measurements on nitrogen gas that he had extracted from the air and one he had made in the lab.
First, he removed all oxygen from a sample of air. He measured the density of the remaining gas in a fixed volume at constant temperature and pressure.
Second, he prepared the same volume of pure nitrogen by the chemical decomposition of nitrous oxide (\(N_2 O\)) and nitric oxide \(NO\).
Let’s look at the Lord Rayleigh’s data
d = data.frame(air = c(2.31017, 2.30986, 2.31010, 2.31001, 2.31024, 2.31010, 2.31028,2.31028),
chemical = c(2.30143, 2.29890, 2.29816, 2.30182, 2.29869, 2.29940, 2.29849, 2.29889))
knitr::kable(d)
air | chemical |
---|---|
2.31017 | 2.30143 |
2.30986 | 2.29890 |
2.31010 | 2.29816 |
2.31001 | 2.30182 |
2.31024 | 2.29869 |
2.31010 | 2.29940 |
2.31028 | 2.29849 |
2.31028 | 2.29889 |
c(mean(d$air), mean(d$chemical))
## [1] 2.310130 2.299473
c(sd(d$air), sd(d$chemical))
## [1] 0.0001453076 0.0013791897
Let’s calculate the CIs for both
mean(d$air) + 1.96*sd(d$air)*c(-1,1)
## [1] 2.309845 2.310415
mean(d$chemical) + 1.96*sd(d$chemical)*c(-1,1)
## [1] 2.296769 2.302176
We also can use built-in function
t.test(d$air,d$chemical,var.equal=T)
##
## Two Sample t-test
##
## data: d$air and d$chemical
## t = 21.736, df = 14, p-value = 3.47e-12
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.009605877 0.011709123
## sample estimates:
## mean of x mean of y
## 2.310130 2.299473
To demonstrate linear regression we develop a model that predicts the price of a house, given its square footage. This is a simplified version of a regression model used by Zillow, a house pricing site. First, we look at property sales data where we know the price and some observed characteristics. Let’s look at the scatter plot of living areas measure in squared feet and the price measured in thousands of dollars.
Second, we build a decision model that predicts price as a function of the observed characteristics. Now, we need to decide on what characteristics do we use. There are many factors or variables that affect the price of a house, for example location. Some other basics ones include size, number of rooms, and parking.
Let’s run a simple linear regression on size, measured in square foots. The value that we seek to predict is called the dependent variable, and we denote this by \(y\) which is the price of house measured in thousand of dollars. The variable that we use to construct our prediction is the predictor variable, and we denote it with \(x\). What’s does the data look like?
d = read.csv("data/SaratogaHouses.csv")
d$price = d$price/1000; d$livingArea = d$livingArea/1000
plot(d$livingArea,d$price, 'p', pch=21, bg="lightblue")
We use lm
function to draw a line
l = lm(price ~ livingArea, data=d)
plot(d$livingArea, d$price, 'p', pch=21, bg="lightblue", xlab="Living Area", ylab="Price")
abline(l, col="red", lwd=3)
We can use coef
funciton to find out the slope and the intercept of this line.
coef(l)
## (Intercept) livingArea
## 13.43939 113.12254
The intercept value is in units of \(y\) ($1000). The slope is in units of \(y\) per units of \(x\) ($1000/1 sq. feet).
In general, adding statistical noise, the simple linear regression model is then given by the follwing equation \[ y = \beta_0 + \beta_1 x + \epsilon \; \; {\rm where} \; \epsilon \sim N(0, \sigma^2) \] Here error \(\epsilon\) term indicates that the relation between \(x\) and \(y\) is not excatly a line and for a given \(x\), the value of \(y\) can be deviating from the line. We will write this as \(y \mid x \sim N\left(\beta_0 + \beta_1 x, \sigma^2 \right)\)
In our housing example, we find that \(\beta_0 = 13.439394\) and \(\beta_1 =113.1225418\). Thus, every \(1000\) sqft increase the price of the hous by $1.3439394^{4} and the price of a lot of land without a house is 113.1225418. The magnitude of \(\beta_1\) shows the economic significance of house’s square footage.
We can now predict the price of a house when we only know that size: take the value off the regression line. For example, given a house size of \(x = 2.2\). Predicted Price: \[ \hat{y} = 13.439394 + 113.1225418 \times 2.2 = 262.3089861 \]
Now we can use our training data set to compare prices predicted by the model and actual prices. We will do it by clculating an average squared difference, also called a mean squared error \[ MSE = \dfrac{1}{n}\sum_{i=1}^{n} (\hat y_i - y_i)^2 \]
Or in code
yhat = coef(l)[1] + coef(l)[2]*d$livingArea
y = d$price
mse = sum((yhat-y)^2)/length(y)
mse
## [1] 4769.913
Sometimes it is more convenient to use root mean squared error (RMSE) = \(\sqrt{4769.912967} = 69.0645565\) which is measured in the same units as variable \(y\), price in our case.
Consider an example, when the relations are multiplicative.
Let’s load and plot Johnson and Johnson’s quarterly earnings
library(lubridate)
jj = read.csv("data/JNJ.csv")
jj$Date = lubridate::ymd(jj$Date) # convert from string to date format
plot(jj$Date,jj$Adj.Close, type='l', lwd = 3, xlab="Date", ylab="Earnings")
jj = jj %>%arrange(Date) # sort by date
jj$Date0 = 12*(lubridate::year(jj$Date)-lubridate::year(jj$Date[1])) +
(lubridate::month(jj$Date)-lubridate::month(jj$Date[1])) # months since the start of the data
model = lm(jj$Open ~ jj$Date0)
modellog = lm(log(jj$Open) ~ jj$Date0)
plot(jj$Date0,jj$Open, type='l', col="red", ylab="JNJ Open", xlab="Month")
abline(model, col="red")
par(new=T)
plot(log(jj$Open), type='l', xlab="", ylab="", xaxt='n', yaxt='n')
abline(modellog)
axis(side = 4)
legend("topleft", legend=c("Original", "log-Scale"), lty=1, col=c("red", "black"), lwd=3, bty="n", y.intersp = 1,cex=1)
summary(model)
##
## Call:
## lm(formula = jj$Open ~ jj$Date0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.061 -11.115 -2.037 8.347 51.989
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -23.013493 1.203666 -19.12 <2e-16 ***
## jj$Date0 0.193127 0.003612 53.47 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.49 on 576 degrees of freedom
## Multiple R-squared: 0.8323, Adjusted R-squared: 0.832
## F-statistic: 2859 on 1 and 576 DF, p-value: < 2.2e-16
summary(modellog)
##
## Call:
## lm(formula = log(jj$Open) ~ jj$Date0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.90371 -0.31006 -0.07729 0.33070 0.78297
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.307e-02 3.238e-02 -2.874 0.0042 **
## jj$Date0 9.143e-03 9.716e-05 94.103 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3898 on 576 degrees of freedom
## Multiple R-squared: 0.9389, Adjusted R-squared: 0.9388
## F-statistic: 8855 on 1 and 576 DF, p-value: < 2.2e-16
Just by visual inspection we can see that a straight line will not be a good fit for this data. The relation between the time and earnings is multiplicative. There are many business and natural science examples where multiplicative relation holds. To model multiplicative or power relations we can still use the apparatus of linear models. However, we need to apply log-transformation to both input and output variables. \(\log\) is the natural \(\log_e\) with base \(e=2.718\ldots\) and that \(\log ( a b) = \log a + \log b\) and \(log ( a^b ) = b \log a\).
Suppose that we have an equation: \[y = A e^{\beta_1 x},\] where \(A = e^{\beta_0}\). This is equivalent to \(\log(y) = a + \beta_1 x\).
Taking logs of the original equation gives \[\begin{aligned} \log y &= \log A + \beta_1 x\\ \log y & = \beta_0 + \beta_1 x\end{aligned}\]
This model is called the exponential model. Any process that is exponentially growing will lead to such a model. For example growth of population with time or compounding interest in savings account.
mammals = read.csv("data/mammals.csv")
attach(mammals)
head(mammals)
## Mammal Brain Body
## 1 African_elephant 5712.0 6654.000
## 2 African_giant_pouched_rat 6.6 1.000
## 3 Arctic_Fox 44.5 3.385
## 4 Arctic_ground_squirrel 5.7 0.920
## 5 Asian_elephant 4603.0 2547.000
## 6 Baboon 179.5 10.550
tail(mammals)
## Mammal Brain Body
## 57 Tenrec 2.6 0.900
## 58 Tree_hyrax 12.3 2.000
## 59 Tree_shrew 2.5 0.104
## 60 Vervet 58.0 4.190
## 61 Water_opossum 3.9 3.500
## 62 Yellow-bellied_marmot 17.0 4.050
### Linear model
model = lm(Brain~Body)
Residual Diagnostics
The residuals show that you need a transformation .…
layout(matrix(1:2, ncol = 2))
plot(Brain~Body, data=mammals, pch=21, bg="lightblue")
abline(model, col="red", lwd=3)
plot(residuals(model)~mammals$Brain, bg="lightblue")
Residual Plots
par(mar=c(4,4.5,3,3), bty='n')
plot(model, bg="lightblue", pch=21)
log-log model
### Log-Log linear model
model = lm(log(Brain)~log(Body))
summary(model)
##
## Call:
## lm(formula = log(Brain) ~ log(Body))
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.75335 -0.53911 -0.09239 0.42068 2.61153
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.18328 0.10682 20.44 <2e-16 ***
## log(Body) 0.74320 0.03166 23.48 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7722 on 60 degrees of freedom
## Multiple R-squared: 0.9018, Adjusted R-squared: 0.9002
## F-statistic: 551.2 on 1 and 60 DF, p-value: < 2.2e-16
par(mar=c(4.5,4.5,3,3))
layout(matrix(1:2, ncol = 2))
plot(log(Brain)~log(Body), pch=21, bg="lightblue")
abline(model, col="red", lwd=3)
plot(log(Brain),residuals(model), pch=21, bg="lightblue")
That’s better!
\(4\) in \(1\) Residuals: log-log model
par(mar=c(4,4.5,3,3))
plot(model, bg="lightblue", pch=21, bty='n')
log-log Model
\[{ {\rm log(Body)} = 2.18 + 0.74 \; {\rm log(Brain)} \; . }\] The coefficients are highly significant \(R^2 = 90\)%
Outliers
res = rstudent(model)
outliers = order(res,decreasing = T)[1:10]
cbind(mammals[outliers,],
residual = res[outliers],
Fit = exp(model$fitted.values[outliers]))
## Mammal Brain Body residual Fit
## 11 Chinchilla 64.0 0.425 3.7848652 4.699002
## 34 Man 1320.0 62.000 2.6697886 190.672827
## 50 Rhesus_monkey 179.0 6.800 2.1221002 36.889735
## 6 Baboon 179.5 10.550 1.6651361 51.128826
## 42 Owl_monkey 15.5 0.480 1.4589815 5.143815
## 10 Chimpanzee 440.0 52.160 1.2734358 167.690600
## 27 Ground_squirrel 4.0 0.101 1.2020391 1.615101
## 43 Patas_monkey 115.0 10.000 1.1133552 49.134284
## 60 Vervet 58.0 4.190 1.0617853 25.740367
## 3 Arctic_Fox 44.5 3.385 0.9205806 21.966124
There is a residual value of 3.78 extreme outlier.
It corresponds to the Chinchilla.
This suggests that the Chinchilla is a master race of supreme intelligence!
Inference
NO!!! I checked and there was a data entry error.
The brain weight is given as 64 grams and should only be 6.4 grams.
The next largest residual corresponds to mankind
In this example the log-log transformation used seems to achieve two important goals, namely linearity and constant variance.
We will demonstrate several more advanced predictive models using Boston housing dataset. This price dataset contains the factors that contribute to the median value of owner-occupied homes in $1000’s.
By constructing a regression model we can quantify the effects of the following measure factors (excluding MEDV as the response):
variable | description |
---|---|
CRIM | per capita crime rate by town |
ZN | proportion of residential land zoned for lots over 25,000 sq.ft |
INDUS | proportion of non-retail business acres per town |
CHAS | Charles River dummy variable (1 if tract bounds river; else 0) |
NOX | nitric oxides concentration (parts per 10 million) |
RM | average number of rooms per dwelling |
AGE | proportion of owner-occupied units built prior to 1940 |
DIS | weighted distances to five Boston employment centres |
RAD | index of accessibility to radial highways |
TAX | full-value property-tax rate per $10,000 |
PTRATIO | pupil-teacher ratio by town |
B | \(1000(\mbox{Bk} - 0.63)^2\) where Bk is the proportion of blacks by town |
LSTAT | % lower status of the population |
MEDV | Median value of owner-occupied homes in $1000’s |
library(tree)
library(MASS)
data(Boston)
attach(Boston)
knitr::kable(Boston[1:5,1:7])
crim | zn | indus | chas | nox | rm | age |
---|---|---|---|---|---|---|
0.00632 | 18 | 2.31 | 0 | 0.538 | 6.575 | 65.2 |
0.02731 | 0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 |
0.02729 | 0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 |
0.03237 | 0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 |
0.06905 | 0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 |
dim(Boston)
## [1] 506 14
First, we split data into training and test subsets
test_index = sample(1:nrow(Boston), size = as.integer(0.2*nrow(Boston)))
testing = Boston[test_index,]
training = Boston[-test_index,]
nrow(training)
## [1] 405
Let’s look at the correlation of each independent variable with the dependent variable and near zero variance
knitr::kable(cor(Boston,Boston$medv))
crim | -0.3883046 |
zn | 0.3604453 |
indus | -0.4837252 |
chas | 0.1752602 |
nox | -0.4273208 |
rm | 0.6953599 |
age | -0.3769546 |
dis | 0.2499287 |
rad | -0.3816262 |
tax | -0.4685359 |
ptratio | -0.5077867 |
black | 0.3334608 |
lstat | -0.7376627 |
medv | 1.0000000 |
knitr::kable(t(summarise_if(Boston,is.numeric, sd)/summarise_if(Boston,is.numeric, mean)))
crim | 2.3803761 |
zn | 2.0523759 |
indus | 0.6160087 |
chas | 3.6720281 |
nox | 0.2089034 |
rm | 0.1117992 |
age | 0.4104834 |
dis | 0.5548581 |
rad | 0.9118115 |
tax | 0.4128412 |
ptratio | 0.1173060 |
black | 0.2559616 |
lstat | 0.5643741 |
medv | 0.4081651 |
plot(Boston[1:200,c("medv","lstat","rm","age", "crim")], pch=21, bg="lightblue", cex=0.5)
#Try linear model using all features
fit.lm <- lm(medv~lstat,data = training)
#predict on test set
pred.lm <- predict(fit.lm, newdata = testing)
plot(testing$medv,pred.lm, pch=21, bg="lightblue"); abline(a=0,b=1)
# Root-mean squared error
rmse.lm = sqrt(sum((pred.lm - testing$medv)^2)/
length(testing$medv))
mape.lm = mean(abs(pred.lm - testing$medv)/testing$medv)
c(RMSE = rmse.lm, MAPE = mape.lm, R2 = summary(fit.lm)$r.squared)
## RMSE MAPE R2
## 7.2137599 0.2756965 0.5552640
par(mfrow=c(3,1), mar=c(4,4,0,0), bty='n')
plot(medv~lstat, data=Boston, pch=21, bg="lightblue")
plot(log(medv)~lstat, data=Boston, pch=21, bg="lightblue")
plot(log(medv)~log(lstat), data=Boston, pch=21, bg="lightblue")
#Try linear model using all features
fit.lm <- lm(log(medv)~log(lstat),data = training)
#predict on test set
pred.lm <- exp(predict(fit.lm, newdata = testing))
plot(testing$medv,pred.lm, pch=21, bg="lightblue"); abline(a=0,b=1)
# Root-mean squared error
rmse.lm = sqrt(sum((pred.lm - testing$medv)^2)/
length(testing$medv))
mape.lm = mean(abs(pred.lm - testing$medv)/testing$medv)
c(RMSE = rmse.lm, MAPE = mape.lm, R2 = summary(fit.lm)$r.squared)
## RMSE MAPE R2
## 5.5252182 0.2036122 0.6738844
#Try linear model using all features
fit.lm <- lm(log(medv)~lstat,data = training)
#predict on test set
pred.lm <- exp(predict(fit.lm, newdata = testing))
plot(testing$medv,pred.lm, pch=21, bg="lightblue"); abline(a=0,b=1)
# Root-mean squared error
rmse.lm = sqrt(sum((pred.lm - testing$medv)^2)/
length(testing$medv))
mape.lm = mean(abs(pred.lm - testing$medv)/testing$medv)
c(RMSE = rmse.lm, MAPE = mape.lm, R2 = summary(fit.lm)$r.squared)
## RMSE MAPE R2
## 6.5361788 0.2234829 0.6614223
#Try simple linear model using selected features
fit.lm2 <- lm(formula = log(medv) ~ crim + chas + nox + rm + dis + ptratio +
rad + black + log(lstat), data = training)
pred.lm2 <- exp(predict(fit.lm2, newdata = testing))
# Root-mean squared error
rmse.lm2 <- sqrt(sum((pred.lm2 - testing$medv)^2)/
length(testing$medv))
mape.lm2 = mean(abs(pred.lm2 - testing$medv)/testing$medv)
c(RMSE = rmse.lm2, MAPE = mape.lm2, R2 = summary(fit.lm2)$r.squared)
## RMSE MAPE R2
## 4.2148008 0.1510489 0.7833539
# Plot of predicted price vs actual price
plot(pred.lm2,testing$medv, xlab = "Predicted Price", ylab = "Actual Price",pch=21,bg="lightblue")
abline(a=0,b=1)
summary(fit.lm2)
##
## Call:
## lm(formula = log(medv) ~ crim + chas + nox + rm + dis + ptratio +
## rad + black + log(lstat), data = training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.71608 -0.09696 -0.00954 0.09395 0.81543
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.7921376 0.2308976 20.754 < 2e-16 ***
## crim -0.0108393 0.0013508 -8.024 1.18e-14 ***
## chas 0.1124884 0.0389586 2.887 0.004098 **
## nox -0.8090206 0.1605993 -5.038 7.19e-07 ***
## rm 0.0469227 0.0182935 2.565 0.010686 *
## dis -0.0457327 0.0072537 -6.305 7.73e-10 ***
## ptratio -0.0345566 0.0054580 -6.331 6.61e-10 ***
## rad 0.0057448 0.0018024 3.187 0.001550 **
## black 0.0004136 0.0001149 3.601 0.000358 ***
## log(lstat) -0.4056284 0.0252214 -16.083 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1874 on 395 degrees of freedom
## Multiple R-squared: 0.7834, Adjusted R-squared: 0.7784
## F-statistic: 158.7 on 9 and 395 DF, p-value: < 2.2e-16
# Diagnostics plots
layout(matrix(c(1,2,3,4),2,2))
par(mar=c(4,5,2,0), bty='n')
plot(fit.lm2, pch=21, bg="lightblue", cex=0.7)
library(glmnet)
mat = model.matrix(log(medv) ~ crim + chas + nox + rm + dis + ptratio +
rad + black + log(lstat), data = training)
mattest = model.matrix(log(medv) ~ crim + chas + nox + rm + dis + ptratio +
rad + black + log(lstat), data = testing)
fit.glm <- glmnet(mat,log(training$medv))
plot(fit.glm)
We will use cross-validaiton to choose the best penalty size
cvfit <- cv.glmnet(mat,log(training$medv))
plot(cvfit)
cvfit$lambda.min
## [1] 0.000484624
coef(cvfit, s = "lambda.min")
## 11 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 4.7622337038
## (Intercept) .
## crim -0.0106952601
## chas 0.1114781908
## nox -0.7768003749
## rm 0.0470269254
## dis -0.0444852877
## ptratio -0.0338991864
## rad 0.0053855587
## black 0.0004076927
## log(lstat) -0.4057575693
Let’s define a function that calculates model performance
modelmetrics = function(y,yhat) {
ssreg = sum((y-yhat)^2)
sstot = sum((y-mean(y))^2)
rmse = sqrt(ssreg/length(y))
mape = mean(abs((y-yhat)/y))
r2 = 1 - ssreg/sstot
knitr::kable(t(c(RMSE = rmse, MAPE = mape, R2 = r2)),digits = 2)
}
Lets look at the predictions
pred.lasso = exp(predict(cvfit, newx = mattest, s = "lambda.min"))
plot(testing$medv,pred.lasso, pch=21, bg="lightblue"); abline(a=0,b=1,lwd=3)
modelmetrics(testing$medv,pred.lasso)
RMSE | MAPE | R2 |
---|---|---|
4.22 | 0.15 | 0.83 |
Now, let’s try ridge regression. The glmnet
library soles the following optimisation problem \[
\min_{\beta_0,\beta} \frac{1}{N} \sum_{i=1}^{N} w_i l(y_i,\beta_0+\beta^T x_i) + \lambda\left[(1-\alpha)\|\beta\|_2^2/2 + \alpha \|\beta\|_1\right],
\] The value of \(\alpha=1\) is assumed by default. To get the ridge regression fit, we simply set \(\alpha=0\)
wts <- c(rep(1,50), rep(2,50))
cvfit <- cv.glmnet(mat,log(training$medv), alpha=0)
plot(cvfit)
cvfit$lambda.min
## [1] 0.03263531
coef(cvfit, s = "lambda.min")
## 11 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 4.2760342514
## (Intercept) .
## crim -0.0094728276
## chas 0.1175579865
## nox -0.5910244766
## rm 0.0712810232
## dis -0.0311888700
## ptratio -0.0304784923
## rad 0.0028441011
## black 0.0004275498
## log(lstat) -0.3506923535
pred.ridge = exp(predict(cvfit, newx = mattest, s = "lambda.min"))
plot(testing$medv,pred.ridge, pch=21, bg="lightblue"); abline(a=0,b=1,lwd=3)
modelmetrics(testing$medv,pred.ridge)
RMSE | MAPE | R2 |
---|---|---|
4.15 | 0.15 | 0.84 |
library(kknn) ## knn library
attach(Boston)
ind = order(testing$lstat)
RMSE = NULL
kk = c(2,10,20,50,100,300)
par(mfrow=c(3,2), mar=c(3,3,2,0), bty='n')
for(i in kk){
near = kknn(medv~lstat,training,testing,k=i,kernel = "rectangular")
aux = sqrt(mean((testing$medv-near$fitted)^2))
RMSE = c(RMSE,aux)
plot(testing$lstat,testing$medv,main=paste("k=",i),pch=21,bg="lightblue",cex=0.8,col="darkgray")
lines(testing$lstat[ind],near$fitted[ind],col=2,lwd=2)
}
Model performance
modelmetrics(testing$medv, near$fitted)
RMSE | MAPE | R2 |
---|---|---|
8.67 | 0.35 | 0.29 |
Now let’s plot the relation between the number of neighbors and model performance (out-of-sample)
plot(log(1/kk),RMSE,type="b",xlab="Complexity(log(1/k))",col="blue",ylab="RMSE",lwd=2,cex.lab=1.2,pch=21,bg="lightblue")
First big tree size
big.tree = tree(medv~lstat,data=training,mindev=.0001)
print(length(unique(big.tree$where)))
## [1] 56
plot(big.tree,type="uniform")
text(big.tree,col="blue",label=c("yval"),cex=.8)
Let’s see how well we did on the out-of-sample data
boston.fit = predict(big.tree,testing) #get training fitted values
Model performance
modelmetrics(testing$medv, boston.fit)
RMSE | MAPE | R2 |
---|---|---|
6.16 | 0.24 | 0.64 |
Not so bad, comparable to LM. Do we overfit? Let’s see if a smaller tree will do any better.
boston.tree=prune.tree(big.tree,best=7)
cat('pruned tree size: \n')
## pruned tree size:
print(length(unique(boston.tree$where)))
## [1] 7
#plot the tree and the fits.
par(mfrow=c(1,2))
plot(boston.tree,type="uniform")
text(boston.tree,col="blue",label=c("yval"),cex=.8)
boston.fit = predict(boston.tree,testing) #get training fitted values
plot(testing$medv,boston.fit, pch=21, bg="lightblue")
Model performance
modelmetrics(testing$medv, boston.fit)
RMSE | MAPE | R2 |
---|---|---|
5.83 | 0.22 | 0.68 |
We have a simpler model (less leafes) that predicts better. The larger model did overfit! Let’s plit our predictions
plot(testing$lstat,testing$medv,cex=.5,pch=21, bg="lightblue") #plot data
oo=order(testing$lstat)
lines(testing$lstat[oo],boston.fit[oo],col='red',lwd=3) #step function fit
cvals=c(9.725,4.65,3.325,5.495,16.085,19.9) #cutpoints from tree
for(i in 1:length(cvals)) abline(v=cvals[i],col='magenta',lty=2) #cutpoints
Let’s Fit a regression tree to mev~dis+lstat
. Now our we have a two dimensional input and the partitions will be rectangles (possible open) The tree is plotted as well as the corresponding partition of the two-dimensional. Again, let’s start with a big tree
big.tree = tree(medv~lstat+dis,training,mindev=.0001)
cat('first big tree size: \n')
## first big tree size:
print(length(unique(big.tree$where)))
## [1] 60
Then prune it down to one with 7 leaves
boston.tree=prune.tree(big.tree,best=7)
cat('pruned tree size: \n')
## pruned tree size:
print(length(unique(boston.tree$where)))
## [1] 7
qplot(lstat,dis,data=Boston, colour = medv) +
scale_color_gradient(low="blue", high="red")
Plot tree and partition in x.
par(mfrow=c(1,2))
plot(boston.tree,type="u")
text(boston.tree,col="blue",label=c("yval"),cex=.8)
partition.tree(boston.tree)
lines(training$lstat,training$dis, type='p', pch=21, bg="lightblue", col=as.integer(training$medv))
pv=seq(from=.01,to=.99,by=.05)
x1q = quantile(testing$lstat,probs=pv)
x2q = quantile(testing$dis,probs=pv)
xx = expand.grid(x1q,x2q) #matrix with two columns using all combinations of x1q and x2q
dfpred = data.frame(dis=xx[,2],lstat=xx[,1])
lmedpred = predict(boston.tree,dfpred)
#make perspective plot
par(mfrow=c(1,1))
persp(x1q,x2q,matrix(lmedpred,ncol=length(x2q),byrow=T),
theta=150,xlab='dis',ylab='lstat',zlab='medv',
zlim=c(min(training$medv),1.1*max(training$medv)))
tree.boston=tree(medv~.,training,mindev=.01)
cv.boston=cv.tree(tree.boston, );
plot(cv.boston$size,cv.boston$dev, type='l')
prune.boston=prune.tree(tree.boston,best=6)
#summary(prune.boston)
#cv.tree(tree.boston,,prune.tree)$dev
plot(prune.boston)
text(prune.boston,pretty=0)
boston.fit = predict(prune.boston,testing) #get training fitted values
plot(testing$medv,boston.fit, pch=21, bg="lightblue")
Model performance
modelmetrics(testing$medv, boston.fit)
RMSE | MAPE | R2 |
---|---|---|
4.48 | 0.2 | 0.81 |
The best model thus far!
library(randomForest)
set.seed(1)
bag.boston=randomForest(medv~.,data=training,mtry=13,importance=TRUE)
bag.boston
##
## Call:
## randomForest(formula = medv ~ ., data = training, mtry = 13, importance = TRUE)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 13
##
## Mean of squared residuals: 11.48287
## % Var explained: 85.47
yhat.bag = predict(bag.boston,newdata=testing)
par(mfrow=c(1,1))
plot(yhat.bag, testing$medv, pch=21, bg="lightblue")
abline(0,1)
Model performance
modelmetrics(testing$medv, yhat.bag)
RMSE | MAPE | R2 |
---|---|---|
2.96 | 0.12 | 0.92 |
bag.boston=randomForest(medv~.,data=training,mtry=13,ntree=25)
yhat.bag = predict(bag.boston,newdata=testing)
modelmetrics(testing$medv, yhat.bag)
RMSE | MAPE | R2 |
---|---|---|
3.08 | 0.12 | 0.91 |
rf.boston=randomForest(medv~.,data=training,mtry=6,importance=TRUE)
yhat.rf = predict(rf.boston,newdata=testing)
modelmetrics(testing$medv, yhat.rf)
RMSE | MAPE | R2 |
---|---|---|
2.87 | 0.12 | 0.92 |
importance(rf.boston)
## %IncMSE IncNodePurity
## crim 14.560967 1932.52197
## zn 4.284978 86.77127
## indus 9.605498 1267.96102
## chas 2.114661 86.29355
## nox 17.798989 1880.16365
## rm 37.452825 8700.64525
## age 12.740484 638.15264
## dis 18.415476 2062.13520
## rad 6.213532 192.69777
## tax 8.876818 620.15756
## ptratio 13.588925 1212.24552
## black 8.164535 552.84349
## lstat 36.354017 12256.23324
par(mfrow=c(1,1))
varImpPlot(rf.boston,pch=21,bg="lightblue")
library(gbm)
set.seed(1)
boost.boston=gbm(medv~.,data=training,distribution="gaussian",n.trees=5000,interaction.depth=4)
summary(boost.boston)
## var rel.inf
## lstat lstat 40.2917987
## rm rm 23.3439045
## dis dis 10.9281748
## nox nox 5.6387294
## crim crim 4.8134756
## age age 4.5566143
## black black 3.5027374
## ptratio ptratio 2.9804523
## tax tax 1.2842744
## chas chas 0.8298594
## indus indus 0.7111797
## rad rad 0.6165401
## zn zn 0.5022594
plot(boost.boston,i="rm", bty='n')
plot(boost.boston,i="lstat", bty='n')
The marginal effect, the other variables are integrated out.
yhat.boost=predict(boost.boston,newdata=testing,n.trees=5000)
modelmetrics(testing$medv, yhat.boost)
RMSE | MAPE | R2 |
---|---|---|
2.76 | 0.11 | 0.93 |
boost.boston=gbm(medv~.,data=training,distribution="gaussian",n.trees=5000,interaction.depth=4,shrinkage=0.2,verbose=F)
yhat.boost=predict(boost.boston,newdata=testing,n.trees=5000)
modelmetrics(testing$medv, yhat.boost)
RMSE | MAPE | R2 |
---|---|---|
3 | 0.13 | 0.92 |
Now let’s train the model
library(keras)
model <- keras_model_sequential() %>%
layer_dense(units = 64, activation = "relu", input_shape = 13) %>%
layer_dense(units = 64, activation = "relu") %>%
layer_dense(units = 32, activation = "relu") %>%
layer_dense(units = 1)
summary(model)
## Model: "sequential"
## ________________________________________________________________________________
## Layer (type) Output Shape Param #
## ================================================================================
## dense_3 (Dense) (None, 64) 896
## ________________________________________________________________________________
## dense_2 (Dense) (None, 64) 4160
## ________________________________________________________________________________
## dense_1 (Dense) (None, 32) 2080
## ________________________________________________________________________________
## dense (Dense) (None, 1) 33
## ================================================================================
## Total params: 7,169
## Trainable params: 7,169
## Non-trainable params: 0
## ________________________________________________________________________________
model %>% compile(loss = "mse",optimizer = "RMSprop")
history = model %>% fit(x = as.matrix(scale(training[,1:13])), y = training$medv,verbose = 2, epochs=20, validation_data = list(as.matrix(scale(testing[,1:13])),testing$medv))
yhat = model %>% predict( as.matrix(scale(testing[,1:13])))
plot(history)
plot(testing$medv,yhat, pch=21, bg='lightblue')
abline(0,1,col="red",lwd=2)
modelmetrics(testing$medv, yhat)
RMSE | MAPE | R2 |
---|---|---|
3.89 | 0.18 | 0.86 |
library(laGP)
library(ggplot2)
x = scale(as.matrix(Boston[,13:13]))
y = scale(Boston$medv)
dest <- darg(NULL, x)
gest <- garg(list(mle = TRUE), y)
gpi <- newGPsep(x, y, d=dest$start, g=gest$start, dK=TRUE)
mle <- mleGPsep(gpi, param="both", tmin=c(dest$min, gest$min), tmax=c(dest$max, gest$max))
p <- predGPsep(gpi, x)
qplot(y, p$mean)
modelmetrics(y,p$mean)
RMSE | MAPE | R2 |
---|---|---|
0.56 | 2.82 | 0.68 |
qplot(y, p$mean) +
geom_point(shape=21, fill="lightblue", size=2) +
geom_ribbon(aes(ymin=p$mean+2*diag(p$Sigma),ymax=p$mean-2*diag(p$Sigma)),alpha=0.3) +
xlab("y") + ylab(expression(hat(y))) + theme_classic()
library(pls)
library(randomForest)
N = nrow(Boston)
# train PLS
plsr_fit <-plsr(medv~., data=Boston, scale=T, validation="CV")
summary(plsr_fit)
## Data: X dimension: 506 13
## Y dimension: 506 1
## Fit method: kernelpls
## Number of components considered: 13
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 9.206 6.534 5.074 4.989 4.922 4.880 4.852
## adjCV 9.206 6.532 5.069 4.981 4.912 4.869 4.844
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 4.843 4.842 4.841 4.841 4.842 4.842 4.842
## adjCV 4.835 4.833 4.832 4.832 4.833 4.833 4.833
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 45.86 56.78 63.91 69.70 75.47 78.72 81.95 84.62
## medv 49.93 70.64 72.30 73.29 73.77 73.92 74.01 74.06
## 9 comps 10 comps 11 comps 12 comps 13 comps
## X 90.16 92.40 96.22 98.16 100.00
## medv 74.06 74.06 74.06 74.06 74.06
# use 7 components
plsr_7 <-plsr(formula = medv ~ ., ncomp = 7, data=Boston, scale = T)
PLS_Score <- scores(plsr_7)
# random forest
rf <- randomForest(x=PLS_Score, y=Boston$medv, importance=TRUE, proximity=TRUE)
# goodness of fit
pls_yhat = predict(plsr_7,Boston,ncomp=7)
rf_yhat = rf$predicted
v = mean((Boston$medv - mean(Boston$medv))^2)
r21 = 1 - mean((pls_yhat-Boston$medv)^2)/v
r22 = 1 - mean((rf_yhat-Boston$medv)^2)/v
par(mfrow=c(1,2))
plot(Boston$medv, pls_yhat,
xlab='y', ylab = 'yhat',
main=paste("PLS Fit, ", "R-squared =", round(r21,2)),
pch=21,bg="lightblue",
xlim=c(-5,50),ylim=c(-5,50))
abline(0,1,col="red",lwd=2)
modelmetrics(Boston$medv,pls_yhat)
RMSE | MAPE | R2 |
---|---|---|
4.68 | 0.16 | 0.74 |
plot(Boston$medv, rf_yhat,
xlab='y', ylab = 'yhat',
main=paste("PLS+RF Fit, ", "R-squared =", round(r22,2)),
pch=21,bg="lightblue",
xlim=c(-5,50),ylim=c(-5,50))
abline(0,1,col="red",lwd=2)
modelmetrics(Boston$medv,rf_yhat)
RMSE | MAPE | R2 |
---|---|---|
4.05 | 0.14 | 0.81 |
What if we combine PLS and neural network?
model <- keras_model_sequential() %>%
layer_dense(units = 64, activation = "relu", input_shape = 7) %>%
layer_dense(units = 64, activation = "relu") %>%
layer_dense(units = 32, activation = "relu") %>%
layer_dense(units = 1)
summary(model)
## Model: "sequential_1"
## ________________________________________________________________________________
## Layer (type) Output Shape Param #
## ================================================================================
## dense_7 (Dense) (None, 64) 512
## ________________________________________________________________________________
## dense_6 (Dense) (None, 64) 4160
## ________________________________________________________________________________
## dense_5 (Dense) (None, 32) 2080
## ________________________________________________________________________________
## dense_4 (Dense) (None, 1) 33
## ================================================================================
## Total params: 6,785
## Trainable params: 6,785
## Non-trainable params: 0
## ________________________________________________________________________________
model %>% compile(loss = "mse",optimizer = "RMSprop")
history = model %>% fit(x = as.matrix(scale(PLS_Score)), y = Boston$medv,verbose = 2, epochs=100, validation_data = list(as.matrix(scale(PLS_Score)),Boston$medv))
yhat = model %>% predict( as.matrix(scale(PLS_Score)))
plot(history)
## `geom_smooth()` using formula 'y ~ x'
plot(Boston$medv, yhat,
xlab='y', ylab = 'yhat',
main=paste("PLS+DL Fit"),
pch=21,bg="lightblue",
xlim=c(-5,50),ylim=c(-5,50))
abline(0,1,col="red",lwd=2)
modelmetrics(Boston$medv, yhat)
RMSE | MAPE | R2 |
---|---|---|
2.27 | 0.09 | 0.94 |
dat=Boston
dat$medv=1*(dat$medv>25)+0*(dat$medv<=25)
dat$medv=as.factor(dat$medv)
mod_logit=glm(medv~.,data=dat,family=binomial())
summary(mod_logit)
##
## Call:
## glm(formula = medv ~ ., family = binomial(), data = dat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.3498 -0.2806 -0.0932 -0.0006 3.3781
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.312511 4.876070 1.090 0.275930
## crim -0.011101 0.045322 -0.245 0.806503
## zn 0.010917 0.010834 1.008 0.313626
## indus -0.110452 0.058740 -1.880 0.060060 .
## chas 0.966337 0.808960 1.195 0.232266
## nox -6.844521 4.483514 -1.527 0.126861
## rm 1.886872 0.452692 4.168 3.07e-05 ***
## age 0.003491 0.011133 0.314 0.753853
## dis -0.589016 0.164013 -3.591 0.000329 ***
## rad 0.318042 0.082623 3.849 0.000118 ***
## tax -0.010826 0.004036 -2.682 0.007314 **
## ptratio -0.353017 0.122259 -2.887 0.003884 **
## black -0.002264 0.003826 -0.592 0.554105
## lstat -0.367355 0.073020 -5.031 4.88e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 563.52 on 505 degrees of freedom
## Residual deviance: 209.11 on 492 degrees of freedom
## AIC: 237.11
##
## Number of Fisher Scoring iterations: 7
# diagnostic plot
par(mfrow=c(2,2), mar=c(3,3,0,0), bty='n')
plot(mod_logit,pch=21,bg="lightblue")
dat2=Boston
dat2$medv=0*(dat2$medv<=17)+1*(dat2$medv<=25 & dat2$medv>17)+2*(dat2$medv>25)
dat2$medv=as.factor(dat2$medv)
library(nnet)
mod_multinom=multinom(medv~.,data=dat2)
## # weights: 45 (28 variable)
## initial value 555.897818
## iter 10 value 338.289401
## iter 20 value 309.716007
## iter 30 value 211.736278
## iter 40 value 203.333103
## iter 50 value 202.890685
## iter 60 value 202.858211
## final value 202.858175
## converged
summary(mod_multinom)
## Call:
## multinom(formula = medv ~ ., data = dat2)
##
## Coefficients:
## (Intercept) crim zn indus chas nox rm
## 1 24.19217 -0.2231462 0.04670231 0.10160200 0.343424 -12.91234 -0.1602033
## 2 29.51905 -0.0643685 0.05704767 -0.01761023 1.311033 -18.77936 1.7631714
## age dis rad tax ptratio black
## 1 -0.03926022 -0.6648017 0.1826890 -0.006564069 -0.3770017 0.005904329
## 2 -0.03695548 -1.2169084 0.4612001 -0.017781470 -0.7175312 0.001405510
## lstat
## 1 -0.2409420
## 2 -0.5899765
##
## Std. Errors:
## (Intercept) crim zn indus chas nox rm
## 1 0.09595305 0.07342447 0.04186604 0.05732693 0.7108412 0.17673798 0.3344770
## 2 0.02845394 0.06391464 0.04316645 0.08044716 1.0467007 0.01608296 0.4633493
## age dis rad tax ptratio black lstat
## 1 0.01544144 0.2200741 0.05386458 0.003231373 0.09660977 0.002003785 0.04387560
## 2 0.01846981 0.2649990 0.08985403 0.005039059 0.13750043 0.004055685 0.08066911
##
## Residual Deviance: 405.7164
## AIC: 461.7164
par(mfrow=c(1,1))
pred=predict(mod_multinom,dat2)
plot(dat2$medv,pred,xlab="true class",ylab="predicted class")
library(e1071)
mod_svm=svm(medv~.,dat)
summary(mod_svm)
##
## Call:
## svm(formula = medv ~ ., data = dat)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
##
## Number of Support Vectors: 159
##
## ( 86 73 )
##
##
## Number of Classes: 2
##
## Levels:
## 0 1
par(mfrow=c(1,1))
plot(mod_svm,data=dat,rm~lstat,pch=21,bg="lightblue")