---
title: "Linear Regression and Converting Numeric Predictors to Factors"
author:  Miles Porter
date: May 27, 2020
output: html_notebook
---

# Introduction

When building linear regression models in R, a predictor can be specified either as numeric or as a "factor".  Factor variables are typically used when a predictor takes values from a discrete set with no implied order; examples include gender and color ("red", "green", "blue", etc.).  In some cases, though, the right choice is not obvious.  Consider a model with a parameter "number_of_bedrooms" whose dataset only contains values from 0 to 8.  Should "number_of_bedrooms" be a numeric value, or should it be a factor?

The purpose of this document is to compare what happens when a simple model with one numeric predictor is built using numeric vs factor types.

##  The Data

The data in this example come from the Abalone dataset in the [University of California, Irvine machine learning repository](https://archive.ics.uci.edu/ml/index.php).  In this exploration we will use just one response variable, weight.whole, and one predictor, rings.  In abalone, rings are found in cross-sections of the animal's shell; the number of rings plus 1.5 gives the animal's age in years.

The following cell will load the data from the UCI repository:

```{r}
rm(list=ls())
library(tidyverse)

# read the dataset into a data frame
abalone <- read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data", header=FALSE)
```
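
Since the file has no header row, read.csv() assigns the default column names V1 through V9.  A quick look at what was loaded:

```{r}
# with header=FALSE the columns arrive unnamed, as V1 through V9
dim(abalone)
head(abalone, 3)
```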

The code below will pull out the specific columns of interest.  It also restricts the dataset to abalone with between 3 and 15 rings; the reason for this will be discussed later in the document.

```{r}
# add column names
names(abalone) <- c("sex", "length", "diameter", "height", "weight.whole",
    "weight.shucked", "weight.viscera", "weight.shell", "rings")

# keep only the response and predictor, and restrict to 3-15 rings
abalone <- abalone %>%
  select(weight.whole, rings) %>%
  filter(rings <= 15, rings > 2)

```
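
As a quick sanity check, the ring counts remaining in the data should now run from 3 to 15:

```{r}
# confirm the filter worked: ring counts should span 3 through 15
table(abalone$rings)
```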

For the purposes of comparison, we will create training and testing datasets.  First, we will create two copies of the original data.  In the first copy, we will leave the "rings" variable as a numeric data type; in the second, we will convert it to a factor.  The training sets will consist of 75% of the data, and the test sets will be the remainder.

```{r}

## 75% of the sample size
smp_size <- floor(0.75 * nrow(abalone))

## set the seed to make your partition reproducible
set.seed(8888)

# second copy of the data, with rings converted to a factor
abalone_factor <- abalone
abalone_factor$rings <- as.factor(abalone_factor$rings)

# draw the same random 75% of row indices for both copies
train_ind <- sample(seq_len(nrow(abalone)), size = smp_size)

num_train <- abalone[train_ind, ]
num_test <- abalone[-train_ind, ]
factor_train <- abalone_factor[train_ind,]
factor_test <- abalone_factor[-train_ind,]
```

If we compare the first few records of the training sets, they appear identical, except for the type of the rings variable.

```{r}
head(num_train)
head(factor_train)
```
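
The difference becomes visible if we inspect the column types directly:

```{r}
# rings is an integer in one copy and a 13-level factor in the other
str(num_train$rings)
str(factor_train$rings)
```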

The contrasts() function in R helps to visualize how factor variables are actually encoded.  In this case, "dummy" variables are used to store the value of the predictor.

```{r}
contrasts(factor_train$rings)[1:10,]
```

The matrix above gives us an indication of how rings will be encoded as dummy variables.  What we are looking at is a portion of the possible combinations of dummy variables for the different values of "rings".  It is important to note that this matrix just shows the possibilities and how they are encoded; it is not actual observed data.

In the matrix above, the first row is the "base" case: an abalone with 3 rings (the smallest ring count left in the data) has 0 for all of its dummy variables.  The second row shows the encoding for an abalone with 4 rings: the rings4 dummy is 1, and the remaining dummy variables are 0.  When we display the dataframe or tibble in R, we don't see this encoding scheme...  but it is there and will be used as part of the linear regression model.
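
To see the encoding lm() will actually use, we can build the design matrix directly.  Here is a quick look using model.matrix(), which applies the same treatment contrasts shown above:

```{r}
# the design matrix lm() builds from the factor version of rings:
# an intercept column plus one dummy column per non-base level
head(model.matrix(~ rings, data = factor_train))
```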

For a model that uses the numeric data type (and not a factor), no dummy variables are created.  As a result, the model has just one coefficient and an intercept.  However, if rings is used as a factor, the model has 12 coefficients plus an intercept (one coefficient for each of the 13 levels beyond the base).

The following code builds a model for the numeric and the factor variations.

```{r}
mdl1 <- lm(weight.whole~rings, data=num_train)
mdl2 <- lm(weight.whole~rings, data=factor_train)
```
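
A quick check of the parameter counts confirms the difference in model size:

```{r}
# numeric model: intercept + 1 slope = 2 parameters
# factor model:  intercept + 12 dummies = 13 parameters
length(coef(mdl1))
length(coef(mdl2))
```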

The following is a summary of model 1.  Note that this model, which uses a numeric data type for the single predictor, has only one coefficient and an intercept.

```{r}
summary(mdl1)
```

Let's compare that to model 2:

```{r}
summary(mdl2)
```

There are several interesting things to note in the above.  First, the single slope coefficient of model 1 is statistically significant at the 95% level, whereas in model 2 a number of coefficients are not.  The other thing to notice is that for both models the F-test indicates some predictive power (i.e. not all of the coefficients are 0).
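
It is also worth pausing on what the factor model is really doing.  With R's default treatment contrasts, a single-factor lm() simply fits the mean response within each level: the intercept is the mean whole weight of the base group (3 rings), and each ringsK coefficient is the difference between group K's mean and the base mean.  A quick check against the group means:

```{r}
# mean whole weight per ring count in the training data
factor_train %>%
  group_by(rings) %>%
  summarise(mean_weight = mean(weight.whole)) %>%
  head()

# the intercept should match the 3-ring group mean, and
# intercept + rings4 should match the 4-ring group mean
coef(mdl2)[1]
coef(mdl2)[1] + coef(mdl2)["rings4"]
```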

So, how accurate are these models?  The following code computes the mean squared error for each model on the holdout test sets we set aside earlier.

```{r}

# predictions on the held-out test sets
p1 <- predict(mdl1, newdata=num_test, type="response")
p2 <- predict(mdl2, newdata=factor_test, type="response")

# mean squared prediction error on the test sets
msp1 <- mean((num_test$weight.whole - p1) ^ 2)
msp2 <- mean((factor_test$weight.whole - p2) ^ 2)

cat(paste("\nModel 1 (numeric) Adj R Squared:", summary(mdl1)$adj.r.squared))
cat(paste("\nMSE Model 1:", msp1))

cat(paste("\n\nModel 2 (factor) Adj R Squared:", summary(mdl2)$adj.r.squared))
cat(paste("\nMSE Model 2:", msp2))
```

```{r}
# scatter of the raw data, with the numeric fit (black) and the
# factor model's per-level predictions (red) overlaid
x <- seq(3, 15)
y <- predict(mdl2, data.frame(rings=as.factor(x)))
plot(abalone$rings, abalone$weight.whole)
abline(coef=mdl1$coefficients)
lines(x, y, col="red")
```

At first glance, model 2 (in red) looks great...  however, there is a danger.  Model 2 fits only the data it was trained on; in fact, it has been "overfit" to the extreme.  Consider trying to predict the weight of an abalone with 20 rings, or one with only 2.  Model 1 will work fine in both cases.  Model 2, however, "blows up": it has no dummy variable for fewer than 3 or more than 15 rings, so predict() fails with a "new levels" error.
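
A minimal demonstration (the failing call is wrapped in try() so the notebook keeps running):

```{r}
# model 1 happily extrapolates to unseen ring counts...
predict(mdl1, data.frame(rings = c(2, 20)))

# ...but model 2 errors out, because 2 and 20 were never
# levels of the rings factor during training
try(predict(mdl2, data.frame(rings = factor(c(2, 20)))))
```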

# Conclusion

In many cases, coding numeric variables as factors can provide a better fit.  However, doing so limits the model to predicting only those values of the predictor that have corresponding dummy variables.

This concludes this writeup on converting numeric types to factor variables.
