Notes from Statistical Modeling in R I & II. Walk through on how to fit recursive partitioning models and utilize the statisticalModeling package.
RCODE IN PROGRESS
In this lecture, statistical models are defined as representation of reality with a purpose using mathematical properties. Models can be used to train data and explain variation.
Load the Mosaic library and use the mean function
library(ggplot2)
library(rpart)
library(dplyr)
library(rpart.plot)
library(statisticalModeling)
library(mosaic)
data(Gestation)
mean(gestation ~ smoke, data=Gestation, na.rm=TRUE)
0 1 2 3
279.6357 277.9792 281.3263 282.0700
Gestation$smoke[Gestation$smoke==2]=1
Gestation$smoke[Gestation$smoke==3]=0
Gestation$smoke <- as.factor(Gestation$smoke)
Designing & training model, evaluating, testing, interpreting
## -process by comparing models and evaluating
## -select model architecture to represent data
## -recursive partitioning (using rpart)
## -fit model
Evaluating Models
## linear models tells a linear story, while recursive follows the data pattern
## predictions will be different between the 2 architectures
build linear or recursive model: workflow
model1 <- lm()
construct an example of data to predict from
ex_values <- data.frame(x1=, x2=, data=df) ### predict ex_values using the evaluate_model function in the statisticalModeling package for better formating
mod1 <- lm(gestation~ wt.1+smoke+race+age,data=Gestation, na.action=na.omit)
values <- data.frame(wt.1=115, smoke="1", race=1, age=30, data=Gestation)
head(evaluate_model(mod1, values))
Chosing explanatory variables
### measuring how closely a model predicts y values:
### prediction error = $\sum{\hat{y}-y}$, mean squared errors gives us a value to evaluate
compute mean predictions of two models, subtract and square to see how much the predictions differ by; means would normally be compared between nested models.
((mean(predict(mod1), na.rm=TRUE) - (mean(predict(mod2), na.rm=TRUE))) ^2)
[1] 0.07270266
mod1_diff <- with(Gestation, gestation-(predict(mod1, newdata=Gestation)))
mod2_diff <- with(Gestation, gestation-(predict(mod2, newdata=Gestation)))
mean(mod1_diff^2, na.rm=TRUE)
[1] 247.0878
mean(mod2_diff^2, na.rm=TRUE)
[1] 244.1467
Examining MSE
mean(Gestation$gestation - predict(mod1, newdata=Gestation)^2, na.rm=TRUE)
[1] -77608.6
mean(Gestation$gestation - predict(mod2, newdata=Gestation)^2, na.rm=TRUE)
[1] -77763.2
Cross-validation: divide into train/test sets
use train data to build both models
calculate model outputs using the predict function
compare model outputs to actual data with test set
the tidy way; model_output within the evaluate_model saves outpout values from the dataframe
output <- evaluate_model(mod1, data=Gestation)
with(data=output, mean((Gestation$gestation - model_output)^2, na.rm=TRUE))
[1] 247.0878
Create a new variable that randomly assigns 75% of rows to train and the rest to test
row_count <- nrow(Gestation)
shuffled_rows <- sample(row_count)
train <- Gestation[head(shuffled_rows,floor(row_count*0.75)),]
train$train <- 1
train <- train[,-2:-23]
Gestation <- Gestation %>% left_join(train)
Gestation$train[is.na(Gestation$train)] <-0
prop.table(table(Gestation$train))
0 1
0.25 0.75
fit model to train and evaluate on test
mod1 <- lm(gestation~ wt.1+smoke+race+age,data=subset(Gestation, train==1), na.action=na.omit)
output <- evaluate_model(mod1, data = subset(Gestation, train==0))
Repeated random trials: cross validation varies by trial, therefore, repeating the procedure is a best practice
Compute the in sample error with training set.
in_sample_error <- with(evaluate_model(mod1, data=subset(Gestation, train==1)), mean((gestation-model_output)^2, na.rm=TRUE))
in_sample_error
[1] 246.556
Use the cv function to iterate over n trails
trials <- cv_pred_error(mod1, ntrials = 100)
mean(trials$mse)
[1] 249.5662
find confidence interval on trails compared to in_sample error
mosaic::t.test(~mse, mu=in_sample_error, data=trials)
One Sample t-test
data: mse
t = 46.877, df = 99, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 246.556
95 percent confidence interval:
249.4388 249.6937
sample estimates:
mean of x
249.5662
Comparing 2 models
trials <- cv_pred_error(mod1, mod2) # mod2 is the rpart model
names(trials)
[1] "mse" "model"
t.test(mse~model, data=trials)
Welch Two Sample t-test
data: mse by model
t = -6.9307, df = 4.201, p-value = 0.001895
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-25.64725 -11.17166
sample estimates:
mean in group mod1 mean in group mod2
249.9818 268.3912
Prediction error for categorical outcome
Work flow
Calculate error rate with true and false possitive
Return probabiliies instead of 0, 1
For each case extract the prob that the model assigned for each outcome, likelihood, and multiply each case together to get total likelihood of observations
process
- fit null model 1 predictors with a constant value
data("ptitanic", package="rpart.plot")
ptitanic$x <- 1
null <- rpart(survived~x, data=ptitanic)
- evaluate null model with evaluate_model function
null_out <-evaluate_model(null, data=ptitanic, type="class")
with(data=null_out, mean(survived !=model_output, na.rm=TRUE))
[1] 0.381971
null_out$rand_guess <- mosaic::shuffle(ptitanic$survived)
prop.table(table(null_out$rand_guess))
died survived
0.618029 0.381971
- compute error rate #### lower error rate value determines if null model does better than guessing at random
with (data=null_out, mean(survived !=rand_guess, na.rm=T))
[1] 0.487395
adding predictors and computing error rate (error rate drops with predictors compared to the null model)
mod <- rpart(survived~., data=ptitanic)
mod_out <- evaluate_model(mod, data=ptitanic, type="class")
with(data=mod_out, mean(survived != model_output, na.rm=T))
[1] 0.1749427
Exploring data for relationships:
Model covariates:
build two models, 1 w/ covariate and w/o
use evaluate_model to get predictions
subtract predicted values w/ & w/o holding covariate constant
mod1 <- lm(survived~pclass, data = ptitanic)
mod2 <- lm(survived~pclass+sex, data = ptitanic)
evaluate_model(mod1, pclass="1st")
evaluate_model(mod2, pclass="1st", sex="female") # with covariate
Effect sizes: How much is the change, or strength of association
using effect_size function from statisticalModeling
mod1 <- lm(gestation~smoke, data=Gestation)
mod2 <- lm(gestation~smoke+wt.1+age, data=Gestation)
effect_size(mod1, ~smoke)
effect_size(mod2, ~smoke+wt.1+age)
Part 2:
Using effect_size function in statisticalModeling to identify values to center
using fmodel function
build rpart model
mod3 <- rpart(gestation~smoke+wt.1+race, data=Gestation)
fmodel(mod3, ~wt.1+smoke+race, data=Gestation, race=c(5:7 )) # specify facets

Categorical response using evaluate_model
for model_output = 1,0
ptitanic$survived_f[ptitanic$survived=="survived"]=1
ptitanic$survived_f[ptitanic$survived=="died"]=0
ptitanic$survived_f <- as.factor(ptitanic$survived_f)
mod <- rpart(survived_f~pclass+sex, data=ptitanic, method="class")
eval_out1 <- evaluate_model(mod, type="class", pclass=c("1st","3rd"), sex="female")
eval_out1
for probabilities of y change type=“prop”
eval_out2 <- evaluate_model(mod, type="prob", pclass=c("1st","3rd"), sex="female")
eval_out2
Interaction workflow:
examine if interactions help model, use cross validation
build lm model and plot w/ & w/o interaction
mod1 <- lm(gestation~smoke+wt.1+race, data=Gestation, na.action =na.exclude)
mod2 <- lm(gestation~smoke*wt.1*race, data=Gestation, na.action =na.exclude)
fmodel(mod1) + ggplot2::ylab("survived")

fmodel(mod2) + ggplot2::ylab("survived")

cross validate & compare prediction errors
res <- cv_pred_error(mod1, mod2)
names(res)
[1] "mse" "model"
t.test(mse~model, data=res)
Welch Two Sample t-test
data: mse by model
t = -2.6061, df = 4.8822, p-value = 0.04901
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-2.710891798 -0.008725852
sample estimates:
mean in group mod1 mean in group mod2
249.3320 250.6918
Total & Partial Change:
to compare effect sizes, make sure units are on the same level
ex. price of car = age of car + mileage of car
effect size of age is dollars to year or dollars/year unit level
effect size of miles is dollars to mile, dollars/mile unit level
covert units to same scale by multiplying slopes, and taking an average value of miles per year (10k)
slope(dollars/mile) * 10000(miles/year)=slope(dollars/year
sometimes it is not possible to change the units of one variable w/o affecting the other.
partial change: impact on the response of changing 1 unit holding all other inputs constant
evaluate model using evaluate_model function from statisticalModeling, assign values to explanatory variables
evaluate_model(x=, z=)
R squared: Coefficient of determination
fraction of variation in the response variable accounted for by the model
R squared not as useful for predicting of inferece since the value changes with models and values depends on the context of data
alternatives;
prediction purpose = cross validated prediction error or MSE from cross validation
inference purpose = effect sizes & confidence intervals
calculate R-squared for a model
mod1<-lm() out <- evaluate_model(mod1, data=dd) ## R-squared variance of model response prediction / var(y) with(out, var(model_output)/ var(y)
plot MSE from both models in a boxplot
cv_pred_error function from statisticalModeling subsets MSE prediction error
boxplot(mse~model, data=cv_pred_error(model_1, model_2))
Degrees of Freedom
difference between df in the data and df in the model is the residual degrees of freedom
## models with several residual degrees of freedom are not reliable
length(unique(x)) # gives us the number of levels in a continuous variable
cross validation, can also run several trials
cv_results <- cv_pred_error(mod1, mode2, ntrials=n)
bootstrapping and precision (resampling method)
Confidence intervals are a representation of precision
bootstrap when data is not sufficient to study precision
idea = use 1 dataset to simulate many datasets as if they were sampled from the population
bootstraping for qunatifying many results on an effect size
work flow for constructing 1 resampled data set
mod <- lm() effect_size(mod, ~x) # construct a resampling of the df and save df by selecting row numbers at random from the original dataset trial_indices <- sample(1:nrow(df), replace=T) trial_data <- df[trial_indices,]
train the model to the resampled dataset
trial_mod <- lm(, data=trial_data) # compute effect size effect_size(trial_mod, ~x)
construct 100 resampled data sets using the ensemble function from the statisticalModeling package
mod <- lm() trial <- ensemble(mod, nreps=100) trial_effect_size <- effect_size(trial, ~x) # compute SD of the 100 effect sizes sd(trial_effect_size\(slope) # plot effect sizes hist(trial_effect_sizes\)slope)
Modeling exponential change over time
mod <- lm(salary~year) # a linear model fails to capture the data ## plotting can show this failutre fmodel(mod, data=df) +geom_point(data=df) # take log transformation of y
Confidence & Collinearity
---
title: "Statistical Modeling in R - Data Camp notes"
author: "Kevin L."
date: "July 26, 2018"
output: html_notebook
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# Notes from Statistical Modeling in R I & II. Walk through on how to fit recursive partitioning models and utilize the statisticalModeling package.

## RCODE IN PROGRESS

## In this lecture, statistical models are defined as representation of reality with a purpose using mathematical properties. Models can be used to train data and explain variation.

### Load the Mosaic library and use the mean function
```{r, message=FALSE, warning=FALSE}
library(ggplot2)
library(rpart)
library(dplyr)
library(rpart.plot)
library(statisticalModeling)
library(mosaic)
data(Gestation)
mean(gestation ~ smoke, data=Gestation, na.rm=TRUE)
Gestation$smoke[Gestation$smoke==2]=1
Gestation$smoke[Gestation$smoke==3]=0
Gestation$smoke <- as.factor(Gestation$smoke)
```

# Designing & training model, evaluating, testing, interpreting
	## -process by comparing models and evaluating
	## -select model architecture to represent data
	## -recursive partitioning (using rpart)
	## -fit model 
# Evaluating Models
	## linear models tells a linear story, while recursive follows the data pattern
	## predictions will be different between the 2 architectures
# build linear or recursive model: workflow

## model1 <- lm()
### construct an example of data to predict from
ex_values <- data.frame(x1=, x2=, data=df)
### predict ex_values using the  evaluate_model function in the statisticalModeling package for better formating

```{r, message=FALSE, warning=FALSE}
mod1 <- lm(gestation~ wt.1+smoke+race+age,data=Gestation, na.action=na.omit)
values <- data.frame(wt.1=115, smoke="1", race=1, age=30, data=Gestation)
head(evaluate_model(mod1, values))
```


## Goal of evaluating model: extrapolattion - finding data outside the range of the data used to train model
### Evaluate model using the evaluate_model function gives output on training data
```{r, message=FALSE, warning=FALSE}
mod2 <- rpart(gestation~wt.1+smoke+race+age,data=Gestation, method="anova", cp=.005, na.action = na.rpart)

fmodel(mod1)
fmodel(mod2)
# predict values 
# use the evaluate_model function in the statisticalModeling package for better formating
head(evaluate_model(mod1))
head(evaluate_model(mod2))

rpart.plot(mod2, yesno=2)
```






## Chosing explanatory variables
	### measuring how closely a model predicts y values:
	### prediction error = $\sum{\hat{y}-y}$, mean squared errors gives us a value to evaluate
### compute mean predictions of two models, subtract and square to see how much the predictions differ by; means would normally be compared between nested models.

```{r, message=FALSE, warning=FALSE}
((mean(predict(mod1), na.rm=TRUE) - (mean(predict(mod2), na.rm=TRUE))) ^2)
mod1_diff <- with(Gestation, gestation-(predict(mod1, newdata=Gestation)))
mod2_diff <- with(Gestation, gestation-(predict(mod2, newdata=Gestation)))
mean(mod1_diff^2, na.rm=TRUE)
mean(mod2_diff^2, na.rm=TRUE)
```


## Examining MSE
```{r, message=FALSE, warning=FALSE}
mean(Gestation$gestation - predict(mod1, newdata=Gestation)^2, na.rm=TRUE)
mean(Gestation$gestation - predict(mod2, newdata=Gestation)^2, na.rm=TRUE)
```

## Cross-validation: divide into train/test sets
### use train data to build both models
### calculate model outputs using the predict function
### compare model outputs to actual data with test set
#### the tidy way; model_output within the evaluate_model saves outpout values from the dataframe
```{r, message=FALSE, warning=FALSE}
output <- evaluate_model(mod1, data=Gestation)
with(data=output, mean((Gestation$gestation - model_output)^2, na.rm=TRUE))
```

### Create a new variable that randomly assigns 75% of rows to train and the rest to test
```{r, message=FALSE, warning=FALSE}
row_count <- nrow(Gestation)
shuffled_rows <- sample(row_count)
train <- Gestation[head(shuffled_rows,floor(row_count*0.75)),]
train$train <- 1
train <- train[,-2:-23]
Gestation <- Gestation %>% left_join(train)
Gestation$train[is.na(Gestation$train)] <-0
prop.table(table(Gestation$train))
```
### fit model to train and evaluate on test
```{r, message=FALSE, warning=FALSE}
mod1 <- lm(gestation~ wt.1+smoke+race+age,data=subset(Gestation, train==1), na.action=na.omit)
output <- evaluate_model(mod1, data = subset(Gestation, train==0))
```



### Repeated random trials: cross validation varies by trial, therefore, repeating the procedure is a best practice
#### Compute the in sample error with training set.
```{r, message=FALSE, warning=FALSE}
in_sample_error <- with(evaluate_model(mod1, data=subset(Gestation, train==1)), mean((gestation-model_output)^2, na.rm=TRUE))
in_sample_error
```

### Use the cv function to iterate over n trails
```{r, message=FALSE, warning=FALSE}
trials <- cv_pred_error(mod1, ntrials = 100) 
mean(trials$mse)
```

# find confidence interval on trails compared to in_sample error
```{r, message=FALSE, warning=FALSE}
mosaic::t.test(~mse, mu=in_sample_error, data=trials)
```

### Comparing 2 models
```{r, message=FALSE, warning=FALSE}
trials <- cv_pred_error(mod1, mod2) # mod2 is the rpart model
names(trials)
t.test(mse~model, data=trials)
```




## Prediction error for categorical outcome
### Work flow
#### Calculate error rate with true and false possitive
#### Return probabiliies instead of 0, 1
#### For each case extract the prob that the model assigned for each outcome, likelihood, and multiply each case together to get total likelihood of observations

### process
* 1. fit null model 1 predictors with a constant value
```{r, message=FALSE, warning=FALSE}
data("ptitanic", package="rpart.plot")
ptitanic$x <- 1
null <- rpart(survived~x, data=ptitanic)
```

* 2.  evaluate null model with evaluate_model function
```{r, message=FALSE, warning=FALSE}
null_out <-evaluate_model(null, data=ptitanic, type="class")
```

* 3. calculate error rate
```{r, message=FALSE, warning=FALSE}
with(data=null_out, mean(survived !=model_output, na.rm=TRUE))
```

* 4. generate random guess
```{r, message=FALSE, warning=FALSE}
null_out$rand_guess <- mosaic::shuffle(ptitanic$survived)
prop.table(table(null_out$rand_guess))
```

* 5. compute error rate
#### lower error rate value determines if null model does better than guessing at random 
```{r, message=FALSE, warning=FALSE}
with (data=null_out, mean(survived !=rand_guess, na.rm=T))
```

 
## adding predictors and computing error rate (error rate drops with predictors compared to the null model)
```{r, message=FALSE, warning=FALSE}
mod <- rpart(survived~., data=ptitanic)
mod_out <- evaluate_model(mod, data=ptitanic, type="class")
with(data=mod_out, mean(survived != model_output, na.rm=T))
```


## Exploring data for relationships:
### ploting data with prp (rpart.plot package), plots as a tree with a format specified by "type"
```{r, message=FALSE, warning=FALSE}
prp(mod, type=3)
```


## Model covariates:
### build two models, 1 w/ covariate and w/o
###use evaluate_model to get predictions
### subtract predicted values w/ & w/o holding covariate constant

```{r, message=FALSE, warning=FALSE}
mod1 <- lm(survived~pclass, data = ptitanic)
mod2 <- lm(survived~pclass+sex, data = ptitanic)

evaluate_model(mod1, pclass="1st") 
evaluate_model(mod2, pclass="1st", sex="female") # with covariate

```



### Effect sizes: How much is the change, or strength of association
#### using effect_size function from statisticalModeling
```{r, message=FALSE, warning=FALSE}
mod1 <- lm(gestation~smoke, data=Gestation)
mod2 <- lm(gestation~smoke+wt.1+age, data=Gestation)
effect_size(mod1, ~smoke)
effect_size(mod2, ~smoke+wt.1+age)
```





Part 2:

## Using effect_size function in statisticalModeling to identify values to center
### using fmodel function
##### build rpart model
```{r, message=FALSE, warning=FALSE}
mod3 <- rpart(gestation~smoke+wt.1+race, data=Gestation)

fmodel(mod3, ~wt.1+smoke+race, data=Gestation,  race=c(5:7 )) # specify facets
```



### Categorical response using evaluate_model
#### for model_output = 1,0
```{r, message=FALSE, warning=FALSE}
ptitanic$survived_f[ptitanic$survived=="survived"]=1
ptitanic$survived_f[ptitanic$survived=="died"]=0
ptitanic$survived_f <- as.factor(ptitanic$survived_f)
mod <- rpart(survived_f~pclass+sex, data=ptitanic, method="class")
eval_out1 <- evaluate_model(mod, type="class", pclass=c("1st","3rd"), sex="female")
eval_out1

```

#### for probabilities of y change type="prop"
```{r, message=FALSE, warning=FALSE}
eval_out2 <- evaluate_model(mod, type="prob", pclass=c("1st","3rd"), sex="female")
eval_out2
```


### Interaction workflow:
#### examine if interactions help model, use cross validation
#### build lm model and plot w/ & w/o interaction
```{r, message=FALSE, warning=FALSE}
mod1 <- lm(gestation~smoke+wt.1+race, data=Gestation, na.action =na.exclude)
mod2 <- lm(gestation~smoke*wt.1*race, data=Gestation, na.action =na.exclude)
fmodel(mod1) + ggplot2::ylab("survived")
fmodel(mod2) + ggplot2::ylab("survived")


```


#### cross validate & compare prediction errors
```{r, message=FALSE, warning=FALSE}
res <- cv_pred_error(mod1, mod2)
names(res)
t.test(mse~model, data=res)
```




### Total & Partial Change:
## to compare effect sizes, make sure units are on the same level
### ex. price of car = age of car + mileage of car
### effect size of age is dollars to year or dollars/year unit level
### effect size of miles is dollars to mile, dollars/mile unit level
### covert units to same scale by multiplying slopes, and taking an average value of miles per year (10k)
### slope(dollars/mile) * 10000(miles/year)=slope(dollars/year
### sometimes it is not possible to change the units of one variable w/o affecting the other.
### partial change: impact on the response of changing 1 unit holding all other inputs constant
### total change: impact on the response of changing 1 input while letting others be freely estimated
#### leave out important variable from model to estimate effect size of another variable

## evaluate model using evaluate_model function from statisticalModeling, assign values to explanatory variables
evaluate_model(x=, z=)


# R squared: Coefficient of determination
## fraction of variation in the response variable accounted for by the model
## R squared not as useful for predicting of inferece since the value changes with models and values depends on the context of data
### alternatives;
#### prediction purpose = cross validated prediction error or MSE from cross validation
#### inference purpose =  effect sizes & confidence intervals

### calculate R-squared for a model
mod1<-lm()
out <- evaluate_model(mod1, data=dd)
## R-squared variance of model response prediction / var(y)
with(out, var(model_output)/ var(y)

# plot MSE from both models in a boxplot
## cv_pred_error function from statisticalModeling subsets MSE prediction error
boxplot(mse~model, data=cv_pred_error(model_1, model_2))

# Degrees of Freedom
## difference between df in the data and df in the model is the residual degrees of freedom
	## models with several residual degrees of freedom are not reliable

### length(unique(x)) # gives us the number of levels in a continuous variable

### cross validation, can also run several trials
cv_results <- cv_pred_error(mod1, mode2, ntrials=n)



# bootstrapping and precision (resampling method)
## Confidence intervals are a representation of precision
## bootstrap when data is not sufficient to study precision
### idea = use 1 dataset to simulate many datasets as if they were sampled from the population
### bootstraping for qunatifying many results on an effect size

# work flow for constructing 1 resampled data set
mod <- lm()
effect_size(mod, ~x)
# construct a resampling of the df and save df by selecting row numbers at random from the original dataset
trial_indices <- sample(1:nrow(df), replace=T)
trial_data <- df[trial_indices,]

# train the model to the resampled dataset
trial_mod <- lm(, data=trial_data)
# compute effect size
effect_size(trial_mod, ~x)

# construct 100 resampled data sets using the ensemble function from the statisticalModeling package
mod <- lm()
trial <- ensemble(mod, nreps=100)
trial_effect_size <- effect_size(trial, ~x)
# compute SD of the 100 effect sizes
sd(trial_effect_size$slope)
# plot effect sizes
hist(trial_effect_sizes$slope)



# Scales & Transformations:
## working with logs, interpretationp; proportional change
## box-cox
## rank transform, replace value c(2, 4, 2000, 1) as c(2, 3, 4, 1)

## ex model
lm(log(y)~x) # implies exponential relationship
lm(log(y)~log(x)) approximates linear relationship


# Modeling exponential change over time
mod <- lm(salary~year) # a linear model fails to capture the data
## plotting can show this failutre
fmodel(mod, data=df) +geom_point(data=df)
# take log transformation of y

# modeling predictions of log transforms by comparing cross validation
mod1 <- lm(y~x)
mod2 <- lm(log(y)~x)
# compute model values
pred1 <- evaluate_model(mod1, data=df)
pred2 <- evaluate_model(mod2, data=df)
# transform model output
pred2$mod_y <- exp(pred2$model_output)
# compute MSE
mean((pred1$model_output - df$y)^2, na.rm=T)
mean((pred1$mod_y - df$y)^2, na.rm=T)

## confidence intervals on log transforms
mod <- lm(log(y)~x)
# create bootstrap replications
bootstrap_mod <- ensemble(mod, nreps=1000, data=df)
# compute effect size
y_effect <- effect_size(bootstrap_mod, ~x)
# change slope to percent change
y_effect$per_change <- 100 * (exp(y_effect$slope) -1)
# compute confidence intervals
with(y_effect, mean(per_change) + c(-2, 2) * sd(per_change))



# Confidence & Collinearity
## transform rsquared into variance inflation factor
### vif= variance in spread from 1 sample to another
vif <- 1/(1-R2)
# standard error inflation factor
sqrt(vif)
## use the collinearity function in statisticalModeling
collinearity(~x+z, data=df)
collinearity(~x, data=df)






