Some code has been adapted from broom package vignette.
lmfit <- lm(mpg ~ wt, mtcars)
lmfit
Call:
lm(formula = mpg ~ wt, data = mtcars)
Coefficients:
(Intercept) wt
37.285 -5.344
The formula notation can be read as “mpg explained by wt”. Let’s see the scatterplot between these two variables
ggplot(mtcars, aes(wt,mpg)) +
geom_point()
We can get more information out of the model
summary(lmfit)
Call:
lm(formula = mpg ~ wt, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-4.5432 -2.3647 -0.1252 1.4096 6.8727
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.2851 1.8776 19.858 < 2e-16 ***
wt -5.3445 0.5591 -9.559 1.29e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.046 on 30 degrees of freedom
Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10
summary(lmfit)$r.squared
[1] 0.7528328
coefficients(lmfit)
(Intercept) wt
37.285126 -5.344472
head(residuals(lmfit))
Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive Hornet Sportabout Valiant
-2.2826106 -0.9197704 -2.0859521 1.2973499 -0.2001440 -0.6932545
Let’s extract necessary information tidy way with help of broom package functions
broom::tidy(lmfit)
broom::augment(lmfit) %>% head()
broom::glance(lmfit)
Let’s generate a dummy data frame
test <- data_frame(wt=c(3.5,5,2.5), dummy=c(10,15,20))
test
Let’s predict mpg values based on new wt values.
predict(lmfit,test)
1 2 3
18.57948 10.56277 23.92395
# let's keep them together
test$prediction <-predict(lmfit,test)
test
Let’s visualize this concept
ggplot(mtcars, aes(wt,mpg)) +
geom_point()+
geom_point(data=test, aes(wt,prediction), col="red")
RMSE is used to measure quality of regression
RMSE
This code is from DataCamp
pred <- predict(lmfit)
pred
Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive Hornet Sportabout Valiant Duster 360
23.282611 21.919770 24.885952 20.102650 18.900144 18.793255 18.205363
Merc 240D Merc 230 Merc 280 Merc 280C Merc 450SE Merc 450SL Merc 450SLC
20.236262 20.450041 18.900144 18.900144 15.533127 17.350247 17.083024
Cadillac Fleetwood Lincoln Continental Chrysler Imperial Fiat 128 Honda Civic Toyota Corolla Toyota Corona
9.226650 8.296712 8.718926 25.527289 28.653805 27.478021 24.111004
Dodge Challenger AMC Javelin Camaro Z28 Pontiac Firebird Fiat X1-9 Porsche 914-2 Lotus Europa
18.472586 18.926866 16.762355 16.735633 26.943574 25.847957 29.198941
Ford Pantera L Ferrari Dino Maserati Bora Volvo 142E
20.343151 22.480940 18.205363 22.427495
rmse <- sqrt(sum( (mtcars$mpg - pred) ^ 2) / nrow(mtcars))
rmse
[1] 2.949163
# OR
# Apply predict() to mtcars:
mtcars_est <- predict(lmfit,mtcars)
# Calculate difference between the predicted and the true values: res
res <- mtcars$mpg - mtcars_est
# Calculate RMSE, assign it to rmse and print it
rmse <- sqrt(sum((res^2))/nrow(mtcars))
rmse
[1] 2.949163
The code is taken from Hadley Wickham’s talk and Many Models chapter in R for Data Science book. Some code had been modified by alper yilmaz.
gapminder
gapminder <- gapminder %>% mutate(year1950 = year - 1950)
Remember Hans Rosling’s animation showing the relationship between GDP and Life Expectancy? Here’s the animation re-generated with ggplot
Gapminder Animation
Below is Hadley Wickham’s line graph showing all countries for all years.
gapminder %>%
ggplot(aes(year, lifeExp, group = country)) +
geom_line(alpha = 1/3)
There’s overall increasing trend. But some countries don’t follow the pattern, how can we find those countries.
For single country we can pull the data and draw the trend
tr <- filter(gapminder, country == "Turkey")
tr %>%
ggplot(aes(year, lifeExp)) +
geom_line() +
ggtitle("Full data = ")
tr_mod <- lm(lifeExp ~ year, data = tr)
tr %>%
add_predictions(tr_mod) %>%
ggplot(aes(year, pred)) +
geom_line() +
ggtitle("Linear trend + ")
tr %>%
add_residuals(tr_mod) %>%
ggplot(aes(year, resid)) +
geom_hline(yintercept = 0, colour = "white", size = 3) +
geom_line() +
ggtitle("Remaining pattern")
But this is not feasible to do for many countries.
Visualization can surprise you, but it doesn’t scale well. Modeling scales well, but it can’t surprise you. - Hadley Wickham (by way of John D. Cook)
Please refer to list columns section and following couple sections in R4DS book to understand list column concept better.
by_country <- gapminder %>%
group_by(continent, country) %>%
nest()
by_country
# str(by_country) # does not help
by_country$data[[1]] # can reach the contents of list column
## how do we do this with subset indexing?
# by_country$data[country=="Turkey"]
tr_data <-by_country %>%
filter(country == "Turkey")
tr_data$data[[1]]
by_country %>%
filter(country == "Turkey") %>%
unnest(data)
country_model <- function(df){ lm(lifeExp ~ year1950, data = df) }
models <- by_country %>%
mutate( mod = map(data, country_model) )
# try without function
models
models %>% filter(continent == "Africa")
As you might remember the output of summary() or glance() are not in tidy format. Below is the summary of model belonging to first country.
summary( models$mod[[1]] )
Call:
lm(formula = lifeExp ~ year1950, data = df)
Residuals:
Min 1Q Median 3Q Max
-1.5447 -0.9905 -0.2757 0.8847 1.6868
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 29.35664 0.69898 42.00 1.40e-12 ***
year1950 0.27533 0.02045 13.46 9.84e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.223 on 10 degrees of freedom
Multiple R-squared: 0.9477, Adjusted R-squared: 0.9425
F-statistic: 181.2 on 1 and 10 DF, p-value: 9.835e-08
The broom package comes to rescue, especially when dealing with many models. Let’s extract r squared values from models and attach to tibble.
models %>%
mutate(glance = mod %>% map(broom::glance),
rsq = glance %>% map_dbl("r.squared"))
# OR
# models %>%
# mutate(rsq = mod %>% map(broom::glance) %>% map_dbl("r.squared"))
Let’s extract all model related data and keep as list column within same tibble.
models <- models %>%
mutate(
glance = mod %>% map(broom::glance),
rsq = glance %>% map_dbl("r.squared"),
tidy = mod %>% map(broom::tidy),
augment = mod %>% map(broom::augment)
)
models
As we saw earlier, we can access the contents of each list columns if we had to
models$glance[[1]]
models$tidy[[1]]
models$augment[[1]]
The tibble still behaves same, dplyr verbs or ggplot can be used with it
models %>% arrange(desc(rsq))
models %>% filter(continent == "Africa")
models %>%
ggplot(aes(rsq, reorder(country, rsq))) +
geom_point(aes(colour = continent))
unnest(models, data)
unnest(models, glance, .drop = TRUE )
unnest(models, tidy)
models %>%
unnest(tidy) %>%
select(continent, country, term, estimate, rsq) %>%
spread(term, estimate) %>%
ggplot(aes(`(Intercept)`, year1950)) +
geom_point(aes(colour = continent, size = rsq)) +
geom_smooth(se = FALSE) +
xlab("Life Expectancy (1950)") +
ylab("Yearly Improvement") +
scale_size_area()
unnest(models, augment)
models %>%
unnest(augment) %>%
ggplot(aes(year1950, .resid)) +
geom_line(aes(group = country, alpha = 1/3)) +
geom_smooth(se = FALSE) +
geom_hline(yintercept = 0, colour = "white") +
facet_wrap(~continent)
Let’s find the countries with bad fit
bad_fit <- models %>%
unnest(glance, .drop = TRUE) %>%
filter(r.squared < 0.25)
gapminder %>%
semi_join(bad_fit, by = "country") %>%
ggplot(aes(year, lifeExp, colour = country)) +
geom_line()
Please refer to two part series of blog posts by David Robinson, part1 and part2
It will be about calculation of correlations with text analysis.
Last Updated 2017-12-07