But I Regress:

Using R to model data in different ways

Data Preprocessing

This next section just deals with data filtering and cleanup to make it more usable for the analysis we’re about to perform.

dataset <- read.csv("HEALTH_PHMC_24102017160155249.csv") # Importing the data
names(dataset) # See different features that the dataset has
 [1] "VAR"        "Variable"   "UNIT"       "Measure"    "COU"       
 [6] "Country"    "YEA"        "Year"       "Value"      "Flag.Codes"
[11] "Flags"     
dataset <- dplyr::filter(dataset, # Keep only rows reported in this measure
                         Measure == "Million US$ at exchange rate")
dataset <- dplyr::select(dataset, year = Year, country = Country,
                         value = Value) # Keep and rename the columns we need
# Sum the values across all countries for each year
total.revenue <- aggregate(dataset$value, by=list(dataset$year), FUN=sum)
total.revenue <- total.revenue[1:35,] # Keep the first 35 years
total.revenue[1] <- 1:35 # Replace the calendar year with an index from 1 to 35
names(total.revenue) <- c("year", "value") # names columns for easier reference
names(total.revenue)
[1] "year"  "value"

Linear Regression

Linear regression is one of the most basic forms of regression. It uses the equation \[ y = mx + b \] where \(m\) is the slope of the line and \(b\) is where the line crosses the \(y\) axis (also known as the \(y\)-intercept). For a line through exactly two points, \(m\) can be found by the equation \[ m = \frac{y_{2} - y_{1}}{x_{2} - x_{1}} \] where \(x_{1}\), \(y_{1}\) are the coordinates of point 1 and \(x_{2}\), \(y_{2}\) are the coordinates of point 2. With more than two points, usually no single line passes through all of them, so linear regression picks the \(m\) and \(b\) that minimize the sum of squared vertical distances between the line and the data (least squares). R does this for us. It is a very basic form of machine learning.

# Below we fit a linear model with lm().
regressor <- lm(formula = value ~ year, data = total.revenue)
y.test <- data.frame("year" = 1:35, "value" = 0)
y.pred <- predict(regressor, newdata = y.test)
y.test[2] <- NULL
y.pred <- data.frame("year" = 1:35, "value" = y.pred)

Now we can use ggplot2, a toolset available to \(R\) analysts based on “The Grammar of Graphics” concept, to plot these two values.

ggplot(data = total.revenue) + aes(x = year, y = value) +
  geom_point(color = "Blue") + 
  geom_point(data = y.pred, aes(x = year, y = value), 
             color = "Green") + 
  ggtitle("Linear Regression Model")

Notice how the green dots and the blue dots follow a similar trend, but only really come into contact with each other for a few years? The model we just made is okay, but it could be better. Our group used linear regression for our presentation because it is better suited to predicting future years than the other models we looked at: a straight line simply continues the trend past the last observed year, whereas the tree-based models further down return the same constant for every year beyond the data.

Polynomial Regression

Polynomial regression is very similar to linear regression. But instead of using the equation \(y = mx + b\), we use the equation \[ y = \sum_{i=0}^{n}a_{i}x^{i} \] This may look slightly complicated, but really what it’s saying is \[ y = a_{0}x^{0}+a_{1}x^{1} + a_{2}x^{2} + ... +a_{n}x^{n} \] The best part about this kind of regression is that just like linear regression, we can have \(R\) do it for us. And since our data is ready to go, we can jump right in!

total.poly <- total.revenue
for (i in 2:4){ # This generates some polynomials for us. Fun!
  total.poly[, i+1] <- total.poly[, 1]^i
}
polyreg <- lm(formula = value ~ ., data = total.poly)
poly.test <- total.poly
poly.test[2] <- 0
poly.pred <- predict(polyreg, newdata = poly.test)
poly.pred <- data.frame("year" = 1:35, "value" = poly.pred)
ggplot(data = total.revenue) + aes(x = year, y = value) +
  geom_point(color = "Blue") + 
  geom_point(data = poly.pred, aes(x = year, y = value), 
             color = "Green") + 
  ggtitle("Polynomial Regression Model")

That looks a bit better, doesn’t it? Now, because we are choosing the form of the model ourselves (here, the degree of the polynomial), there is a danger in using this form (and many other forms) of regression. It is called overfitting, and can be a serious issue.
Overfitting means the model follows the data it was fitted on too closely: it captures random noise along with the underlying trend, so it matches the existing points well but predicts new data poorly. Above, we made an educated guess that the data could fit a polynomial regression model, picked a degree of 4, and let R calculate the coefficients; a higher degree would hug the points even more tightly without necessarily being a better model. This is discussed further in statistics, but for now know that it can happen when modeling.

Support Vector Regression

Support Vector Regression models the data by looking for a function that stays within a maximum deviation \(\epsilon\) of each actual data point while being as flat as possible (Smola and Scholkopf 2003, 2). That is to say, errors smaller than \(\epsilon\) are ignored, and only points that fall outside this margin are penalized. In theory, this is pretty deep, but in practice it is very easy.
For this analysis in \(R\) we need to install the package e1071. After that it’s done in a very similar way to linear and polynomial regressions.

svreg = svm(formula = value ~ year,
                data = total.revenue,
                type = 'eps-regression',
                kernel = 'radial')
svr.pred <- predict(svreg, newdata = y.test)
svr.pred <- data.frame("year" = 1:35, "value" = svr.pred)
ggplot(data = total.revenue) + aes(x = year, y = value) +
  geom_point(color = "Blue") + 
  geom_point(data = svr.pred, aes(x = year, y = value), 
             color = "Green") + ggtitle("SVR Model")

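We’re getting even closer, and because deviations smaller than \(\epsilon\) are ignored, the model is less inclined to chase every small wiggle in the data, which can reduce the danger of overfitting.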

Decision Tree / Random Forest Regression

Decision Trees

Now we’re moving on to using decision trees to make our models. Basically, a decision tree is a type of machine learning that utilizes “… a tree (and a type of directed, acyclic graph) in which the nodes represent decisions (a square box), random transitions (a circular box) or terminal nodes, and the edges or branches are binary (yes/no, true/false) representing possible paths from one node to another”(Lan 2017). It can be used for regression (which is what we’ll be using it for today), or classification (basically, sorting data into like groups).
This can be done in \(R\) quite easily as well, in a similar fashion to what we did above. First, we must install and import the rpart library, and then it’s the same old story:

tree.reg = rpart(formula = value ~ year,
                  data = total.revenue,
                  control = rpart.control(minsplit = 1))
tree.pred <- predict(tree.reg, newdata = y.test)
tree.pred <- data.frame("year" = 1:35, "value" = tree.pred)
ggplot(data = total.revenue) + aes(x = year, y = value) +
  geom_point(color = "Blue") + 
  geom_point(data = tree.pred, aes(x = year, y = value), 
             color = "Green") + ggtitle("Decision Tree Regression Model")

Well, that’s pretty awful, isn’t it? Above we set minsplit = 1. In rpart, minsplit is the minimum number of observations a node must contain before a split is attempted (the default is 20), not a number of branches. With this setting the tree predicted a straight line at the average here, so let’s try again with a different value.

two.tree.reg = rpart(formula = value ~ year,
                  data = total.revenue,
                  control = rpart.control(minsplit = 2))
two.tree.pred <- predict(two.tree.reg, newdata = y.test)
two.tree.pred <- data.frame("year" = 1:35, "value" = two.tree.pred)
ggplot(data = total.revenue) + aes(x = year, y = value) +
  geom_point(color = "Blue") + 
  geom_point(data = two.tree.pred, aes(x = year, y = value), 
             color = "Green") + 
  ggtitle("Decision Tree Regression Model V2")

Better, but it’s still not the best model. The reason for this is that we’re using a single tree to decide our regressor. Decision tree regression splits the years into groups of neighboring data points (the “clusters”) and predicts the average value of each group. With the 35 data points that we have, it only finds 4 clusters, so the model is just a handful of flat steps. The way to combat this is to use random forest regression.

Taking a look at the decision tree

plot(two.tree.reg, main = "Decision Tree V2")

What this plot shows is how the computer grouped the data in the decision tree. Each split divides the years at a threshold, and each terminal node (the end of a “branch”) is a different cluster of data points.

Random Forests

Random forest regression uses the same idea as decision trees, but it builds many separate decision trees and then averages their predictions into one decision forest. The random aspect is that each tree is grown on a different random resample of the data (and, when there are several predictors, considers only a random subset of them at each split). To do this in \(R\), simply install and load the package randomForest.

forest.reg = randomForest(x = total.revenue[-2],
                         y = total.revenue$value,
                         ntree = 500)
forest.pred <- predict(forest.reg, newdata = y.test)
forest.pred <- data.frame("year" = 1:35, "value" = forest.pred)
ggplot(data = total.revenue) + aes(x = year, y = value) +
  geom_point(color = "Blue") + 
  geom_point(data = forest.pred, aes(x = year, y = value), 
             color = "Green") + ggtitle("Random Forest Regression Model")

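This is by far the closest fitting model we have gotten, and from this we can make good predictions for any year in the range we have data for. Increasing ntree makes the forest’s predictions more stable, although the improvement levels off after enough trees. Keep in mind that a close fit to the points we trained on is not, by itself, proof of a better model (see overfitting above). The best part of this is that it allows us to do very little work on the front end to get a well-fitting model.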

Conclusion

The above models are just a few of the tools available to any data analyst. There are other ways to do regression, but hopefully this will provide you with enough information to whet your curiosity. Feel free to check out the source R code at my github.

References

Lan, Haihan. 2017. “Decision Trees and Random Forests for Classification and Regression.” https://medium.com/towards-data-science/decision-trees-and-random-forests-for-classification-and-regression-pt-1-dbb65a458df.

OECD. 2017. “Pharmaceutical Market.” http://dx.doi.org/10.1787/data-00545-en.

Smola, Alex J., and Bernhard Scholkopf. 2003. “A Tutorial on Support Vector Regression.” http://alex.smola.org/papers/2003/SmoSch03b.pdf.

---
title: "Regression Examples"
author: Daniel Brown
bibliography: ref.bib
output: html_notebook
---

# But I Regress:
## Using R to model data in different ways
## Table of Contents
[Data Preprocessing](#Data-Preprocessing)  
[Linear Regression](#Linear-Regression)  
[Polynomial Regression](#Polynomial-Regression)  
[Support Vector Regression](#Support-Vector-Regression)  
[Decision Tree / Random Forest Regression](#Decision-Tree-/-Random-Forest-Regression)  
[Decision Trees](#Decision-Trees)  
[Random Forests](#Random-Forests)  
[Conclusion](#Conclusion)  
[References](#References)  

*Note:* all data was retrieved from the OECD[@OECD17] website.
```{r echo=FALSE}
library(tidyverse)
library(randomForest)
library(e1071)
library(rpart)
```
## Data Preprocessing
This next section just deals with data filtering and cleanup to make it more usable for the analysis we're about to perform.
```{r}
dataset <- read.csv("HEALTH_PHMC_24102017160155249.csv") # Importing the data
names(dataset) # See different features that the dataset has
dataset <- dplyr::filter(dataset, # Keep only rows reported in this measure
                         Measure == "Million US$ at exchange rate")
dataset <- dplyr::select(dataset, year = Year, country = Country,
                         value = Value) # Keep and rename the columns we need

# Sum the values across all countries for each year
total.revenue <- aggregate(dataset$value, by=list(dataset$year), FUN=sum)
total.revenue <- total.revenue[1:35,] # Keep the first 35 years
total.revenue[1] <- 1:35 # Replace the calendar year with an index from 1 to 35
names(total.revenue) <- c("year", "value") # names columns for easier reference
names(total.revenue)
```
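The same filter, select, and sum-per-year steps can also be written as a single *dplyr* pipeline. The following is only a sketch on a made-up four-row data frame (`toy.raw` and its values are hypothetical, not the OECD data), and it assumes *dplyr* is installed.

```{r}
library(dplyr)

# Hypothetical stand-in for the raw file, using the same column names.
toy.raw <- data.frame(
  Year = c(2000, 2000, 2001, 2001),
  Country = c("A", "B", "A", "B"),
  Measure = "Million US$ at exchange rate",
  Value = c(1, 2, 3, 4)
)

toy.total <- toy.raw %>%
  filter(Measure == "Million US$ at exchange rate") %>%     # keep one measure
  select(year = Year, country = Country, value = Value) %>% # keep and rename columns
  group_by(year) %>%                                        # one group per year
  summarise(value = sum(value))                             # total across countries

toy.total
```

The result has one row per year, which is the same shape that `aggregate` produces above.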
## Linear Regression
Linear regression is one of the most basic forms of regression.
It uses the equation
\[
y = mx + b
\]
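where $m$ is the slope of the line and $b$ is where the line crosses
the $y$ axis (also known as the $y$-intercept).
For a line through exactly two points, $m$ can be found by the equation
\[
m = \frac{y_{2} - y_{1}}{x_{2} - x_{1}}
\]
where $x_{1}$, $y_{1}$ are the coordinates of point 1 and $x_{2}$, $y_{2}$
are the coordinates of point 2.
With more than two points, usually no single line passes through all of them, so linear regression picks the $m$ and $b$ that minimize the sum of squared vertical distances between the line and the data (least squares). R does this for us. It is a very basic form of
**machine learning**.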
```{r}
# Below we fit a linear model with lm().
regressor <- lm(formula = value ~ year, data = total.revenue)
y.test <- data.frame("year" = 1:35, "value" = 0)
y.pred <- predict(regressor, newdata = y.test)
y.test[2] <- NULL
y.pred <- data.frame("year" = 1:35, "value" = y.pred)
```
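For a single predictor, the least-squares slope and intercept that `lm` reports have a simple closed form. As a rough check on made-up data (the `toy` data frame is hypothetical), the slope is the covariance of `year` and `value` divided by the variance of `year`, and the intercept makes the line pass through the point of means:

```{r}
# Hypothetical toy data, not the OECD data set.
set.seed(1)
toy <- data.frame(year = 1:10)
toy$value <- 3 + 2 * toy$year + rnorm(10)

# Closed-form least-squares estimates for one predictor.
m <- cov(toy$year, toy$value) / var(toy$year) # slope
b <- mean(toy$value) - m * mean(toy$year)     # intercept

toy.fit <- lm(value ~ year, data = toy)

# coef() returns the intercept first, then the slope.
all.equal(unname(coef(toy.fit)), c(b, m))
```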
Now we can use *ggplot2*, a toolset available to $R$ analysts based on "The Grammar of Graphics" concept, to plot these two values.

```{r}
ggplot(data = total.revenue) + aes(x = year, y = value) +
  geom_point(color = "Blue") + 
  geom_point(data = y.pred, aes(x = year, y = value), 
             color = "Green") + 
  ggtitle("Linear Regression Model")
```

Notice how the green dots and the blue dots follow a similar trend, but only really come into contact with each other for a few years? The model we just made is okay, but it could be better. Our group used linear regression for our presentation because it is better suited to predicting *future* years than the other models we looked at: a straight line simply continues the trend past the last observed year, whereas the tree-based models further down return the same constant for every year beyond the data.
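
To illustrate that point about future predictions, here is a small sketch on made-up data (not the OECD set) that asks a linear model and a decision tree for five years beyond the training range. It borrows `rpart`, which is introduced later in this notebook, and the `toy.*` names are hypothetical.

```{r}
library(rpart)

# Hypothetical data: an upward trend over 40 "years" plus noise.
set.seed(2)
toy.trend <- data.frame(year = 1:40)
toy.trend$value <- 3 + 2 * toy.trend$year + rnorm(40)

toy.lm <- lm(value ~ year, data = toy.trend)
toy.tree <- rpart(value ~ year, data = toy.trend)

future <- data.frame(year = 41:45) # years the models never saw
toy.lm.future <- predict(toy.lm, newdata = future)
toy.tree.future <- predict(toy.tree, newdata = future)

toy.lm.future   # keeps following the fitted line
toy.tree.future # the same leaf average repeated for every future year
```

Neither behavior is automatically right: a straight line can also extrapolate badly if the trend changes.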

## Polynomial Regression
Polynomial regression is very similar to linear regression. But instead of using the equation $y = mx + b$, we use the equation 
\[
y = \sum_{i=0}^{n}a_{i}x^{i}
\]
This may look slightly complicated, but really what it's saying is
\[
y = a_{0}x^{0}+a_{1}x^{1} + a_{2}x^{2} + ... +a_{n}x^{n}
\]
The best part about this kind of regression is that just like linear regression, we can have $R$ do it for us. And since our data is ready to go, we can jump right in!

```{r}
total.poly <- total.revenue
for (i in 2:4){ # This generates some polynomials for us. Fun!
  total.poly[, i+1] <- total.poly[, 1]^i
}
polyreg <- lm(formula = value ~ ., data = total.poly)
poly.test <- total.poly
poly.test[2] <- 0

poly.pred <- predict(polyreg, newdata = poly.test)
poly.pred <- data.frame("year" = 1:35, "value" = poly.pred)

ggplot(data = total.revenue) + aes(x = year, y = value) +
  geom_point(color = "Blue") + 
  geom_point(data = poly.pred, aes(x = year, y = value), 
             color = "Green") + 
  ggtitle("Polynomial Regression Model")
```
That looks a bit better, doesn't it?
Now, because we are choosing the form of the model ourselves (here, the degree of the polynomial), there is a danger in using this form (and many other forms) of regression. It is called **overfitting**, and can be a serious issue.  
Overfitting means the model follows the data it was fitted on *too closely*: it captures random noise along with the underlying trend, so it matches the existing points well but predicts new data poorly. Above, we made an educated guess that the data could fit a polynomial regression model, picked a degree of 4, and let R calculate the coefficients; a higher degree would hug the points even more tightly without necessarily being a better model. This is discussed further in statistics, but for now know that it can happen when modeling.
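
One common way to see overfitting is to hold some points back and compare the error on the points the model was fitted to with the error on the held-back points. The sketch below does this on simulated data (everything named `sim.*` is hypothetical) and uses `poly()` to build the polynomial terms instead of the manual loop above.

```{r}
# Simulated data: a smooth curve plus noise (not the OECD data).
set.seed(42)
sim.data <- data.frame(x = seq(0, 1, length.out = 31))
sim.data$y <- sin(2 * pi * sim.data$x) + rnorm(31, sd = 0.3)

# Fit on the odd-numbered rows, hold back the even-numbered rows.
train.rows <- seq(1, 31, by = 2)
sim.train <- sim.data[train.rows, ]
sim.holdout <- sim.data[-train.rows, ]

# Root mean squared error between actual and predicted values.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Compare polynomial degrees 1, 3, and 9.
sim.errors <- t(sapply(c(1, 3, 9), function(d) {
  fit <- lm(y ~ poly(x, d), data = sim.train)
  c(degree = d,
    train = rmse(sim.train$y, fitted(fit)),
    holdout = rmse(sim.holdout$y, predict(fit, newdata = sim.holdout)))
}))
sim.errors
```

The training error can only go down as the degree grows, because each higher-degree model contains the lower-degree ones. The holdout column is the one to watch: if it stops improving or gets worse while the training error keeps shrinking, the extra terms are fitting noise.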

## Support Vector Regression
Support Vector Regression models the data by looking for a function that stays within a maximum deviation $\epsilon$ of each actual data point while being as flat as possible [@ajs03, p. 2]. That is to say, errors smaller than $\epsilon$ are ignored, and only points that fall outside this margin are penalized. In theory, this is pretty deep, but in practice it is very easy.  
For this analysis in $R$ we need to install the package *e1071*. After that it's done in a very similar way to linear and polynomial regressions.
```{r}
svreg = svm(formula = value ~ year,
                data = total.revenue,
                type = 'eps-regression',
                kernel = 'radial')
svr.pred <- predict(svreg, newdata = y.test)

svr.pred <- data.frame("year" = 1:35, "value" = svr.pred)

ggplot(data = total.revenue) + aes(x = year, y = value) +
  geom_point(color = "Blue") + 
  geom_point(data = svr.pred, aes(x = year, y = value), 
             color = "Green") + ggtitle("SVR Model")
```
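We're getting even closer, and because deviations smaller than $\epsilon$ are ignored, the model is less inclined to chase every small wiggle in the data, which can reduce the danger of overfitting.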
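
The role of $\epsilon$ can be made concrete with the $\epsilon$-insensitive loss described in the tutorial cited above: residuals inside the margin cost nothing, and larger ones are charged only for the part that sticks out. The helper `eps.loss` below is a hypothetical illustration, not a function from *e1071*.

```{r}
# Epsilon-insensitive loss: zero inside the margin, linear outside it.
eps.loss <- function(residual, epsilon = 0.1) {
  pmax(0, abs(residual) - epsilon)
}

# Two residuals inside the margin, two outside it.
eps.loss(c(-0.05, 0.05, 0.30, -1.00))
```

In `svm()`, the width of the margin is set with the `epsilon` argument, which defaults to 0.1.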

## Decision Tree / Random Forest Regression
### Decision Trees
Now we're moving on to using decision trees to make our models. Basically, a decision tree is a type of machine learning that utilizes "... a tree (and a type of directed, acyclic graph) in which the nodes represent decisions (a square box), random transitions (a circular box) or terminal nodes, and the edges or branches are binary (yes/no, true/false) representing possible paths from one node to another"[@hl17]. It can be used for regression (which is what we'll be using it for today), or classification (basically, sorting data into like groups).  
This can be done in $R$ quite easily as well, in a similar fashion to what we did above. First, we must install and import the *rpart* library, and then it's the same old story:

```{r}
tree.reg = rpart(formula = value ~ year,
                  data = total.revenue,
                  control = rpart.control(minsplit = 1))
tree.pred <- predict(tree.reg, newdata = y.test)

tree.pred <- data.frame("year" = 1:35, "value" = tree.pred)

ggplot(data = total.revenue) + aes(x = year, y = value) +
  geom_point(color = "Blue") + 
  geom_point(data = tree.pred, aes(x = year, y = value), 
             color = "Green") + ggtitle("Decision Tree Regression Model")
```
Well, that's pretty awful, isn't it? Above we set *minsplit = 1*. In *rpart*, `minsplit` is the minimum number of observations a node must contain before a split is attempted (the default is 20), not a number of branches. With this setting the tree predicted a straight line at the average here, so let's try again with a different value.
```{r}
two.tree.reg = rpart(formula = value ~ year,
                  data = total.revenue,
                  control = rpart.control(minsplit = 2))
two.tree.pred <- predict(two.tree.reg, newdata = y.test)

two.tree.pred <- data.frame("year" = 1:35, "value" = two.tree.pred)

ggplot(data = total.revenue) + aes(x = year, y = value) +
  geom_point(color = "Blue") + 
  geom_point(data = two.tree.pred, aes(x = year, y = value), 
             color = "Green") + 
  ggtitle("Decision Tree Regression Model V2")
```
Better, but it's still not the best model. The reason for this is that we're using a single tree to decide our regressor. Decision tree regression splits the years into groups of neighboring data points (the "clusters") and predicts the average value of each group. With the 35 data points that we have, it only finds 4 clusters, so the model is just a handful of flat steps. The way to combat this is to use **random forest** regression.  
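
As an aside on the `minsplit` values used in the two trees above, `rpart.control()` returns the settings it will use as a plain list, which makes it easy to inspect them. This is only a quick look at the package's control settings, assuming *rpart* is installed.

```{r}
library(rpart)

# Default settings.
rpart.control()$minsplit
rpart.control()$minbucket

# minbucket is derived from minsplit unless it is set explicitly.
rpart.control(minsplit = 2)$minbucket
rpart.control(minsplit = 1)$minbucket
```

`minsplit` is the minimum number of observations a node needs before a split is attempted, and `minbucket` is the minimum number allowed in a terminal node. The complexity parameter `cp` also limits how far a tree grows.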

#### Taking a look at the decision tree
```{r}
plot(two.tree.reg, main = "Decision Tree V2")
```
What this plot shows is how the computer grouped the data in the decision tree.
Each split divides the years at a threshold, and each terminal node (the end of a "branch") is a different cluster of data points.
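
To see that a regression tree really does predict the average of each group, the sketch below fits a tree to made-up data (the `toy.*` names are hypothetical) and compares its predictions with the mean of the training values in each terminal node. It relies on the `where` component of the fitted object, which records the terminal node each training row ended up in.

```{r}
library(rpart)

# Hypothetical data: a curved trend plus noise.
set.seed(7)
toy.curve <- data.frame(year = 1:40)
toy.curve$value <- toy.curve$year^2 + rnorm(40, sd = 20)

toy.fit <- rpart(value ~ year, data = toy.curve)

leaf <- toy.fit$where                             # terminal node of each row
leaf.means <- tapply(toy.curve$value, leaf, mean) # average value per node
toy.pred <- predict(toy.fit, newdata = toy.curve)

# Each prediction should equal the mean of its own terminal node.
all.equal(as.numeric(leaf.means[as.character(leaf)]), as.numeric(toy.pred))
length(unique(toy.pred))                          # number of distinct "steps"
```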

### Random Forests  
Random forest regression uses the same idea as decision trees, but it builds many separate decision trees and then averages their predictions into one decision forest. The random aspect is that each tree is grown on a different random resample of the data (and, when there are several predictors, considers only a random subset of them at each split). To do this in $R$, simply install and load the package *randomForest*.
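
Before handing the job to the package, the core idea can be sketched by hand: grow each tree on a random resample of the rows (a bootstrap sample) and average the predictions. This is a simplified illustration on made-up data using *rpart* trees, not a reproduction of everything *randomForest* does internally, and the `toy.*` and `bag.*` names are hypothetical.

```{r}
library(rpart)

# Hypothetical data: a curved trend plus noise.
set.seed(123)
toy.bag <- data.frame(year = 1:40)
toy.bag$value <- toy.bag$year^2 + rnorm(40, sd = 20)

n.trees <- 50

# One column of predictions per tree, each tree fitted to a bootstrap sample.
bag.preds <- sapply(seq_len(n.trees), function(i) {
  boot.rows <- sample(nrow(toy.bag), replace = TRUE)
  fit <- rpart(value ~ year, data = toy.bag[boot.rows, ])
  predict(fit, newdata = toy.bag)
})

# The "forest" prediction is the average over all trees.
bag.mean <- rowMeans(bag.preds)
head(bag.mean)
```

The *randomForest* call that follows takes care of the resampling and averaging in a single step.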

```{r}
forest.reg = randomForest(x = total.revenue[-2],
                         y = total.revenue$value,
                         ntree = 500)
forest.pred <- predict(forest.reg, newdata = y.test)

forest.pred <- data.frame("year" = 1:35, "value" = forest.pred)

ggplot(data = total.revenue) + aes(x = year, y = value) +
  geom_point(color = "Blue") + 
  geom_point(data = forest.pred, aes(x = year, y = value), 
             color = "Green") + ggtitle("Random Forest Regression Model")
```
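This is by far the closest fitting model we have gotten, and from this we can make good predictions for any year in the range we have data for. Increasing *ntree* makes the forest's predictions more stable, although the improvement levels off after enough trees. Keep in mind that a close fit to the points we trained on is not, by itself, proof of a better model (see overfitting above). The best part of this is that it allows us to do very little work on the front end to get a well-fitting model.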
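
One way to judge how many trees are enough is the out-of-bag error that *randomForest* records as trees are added. For a regression forest the fitted object carries an `mse` vector with one entry per tree count. The sketch below uses made-up data (the `toy.*` names are hypothetical).

```{r}
library(randomForest)

# Hypothetical data: a curved trend plus noise.
set.seed(99)
toy.rf.data <- data.frame(year = 1:40)
toy.rf.data$value <- toy.rf.data$year^2 + rnorm(40, sd = 20)

toy.forest <- randomForest(x = toy.rf.data["year"],
                           y = toy.rf.data$value,
                           ntree = 500)

# Out-of-bag mean squared error after 10, 100, and 500 trees.
toy.forest$mse[c(10, 100, 500)]

plot(toy.forest) # error against number of trees
```

If the curve has flattened well before the last tree, adding more trees mostly costs computing time.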

## Conclusion
The above models are just a few of the tools available to any data analyst. There are other ways to do regression, but hopefully this will provide you with enough information to whet your curiosity. Feel free to check out the source R code on my [GitHub](https://github.com/danielbrownjr/regression).  
[![License](https://img.shields.io/badge/LICENSE-MIT-red.svg)](./License)[![Facebook](https://img.shields.io/badge/facebook-Daniel-red.svg?style=social)](https://www.facebook.com/chaseafterstart2006)[![twitter](https://img.shields.io/badge/twitter-chaseafterstart-red.svg?style=social)](https://twitter.com/ChaseAfterStart)[![Linkedin](https://img.shields.io/badge/Linkedin-Daniel-red.svg?style=social)](http://tiny.cc/danielbrown)  

# References