This unit is about covariate creation. Covariates are sometimes called predictors and sometimes called features; they're the variables we include in our model to predict the outcome. There are two levels of covariate, or feature, creation. The first level is taking the raw data that we have and turning it into predictors. Raw data often takes the form of an image, a text file, or a website, and it is very hard to build a predictive model around that kind of information until we have summarized it in some useful way as quantitative or qualitative variables. So what we want to do is take the raw data and turn it into features or covariates: variables that describe the data as fully as possible while providing some compression and making it easier to fit standard machine-learning algorithms. The point is that it's very hard to plug an email itself into a prediction function, because most prediction functions are built around a small number of variables and a quantitative model, and that simply doesn't work for free text, for example.

In that context, the first thing we need to do is create some features, and those features are just variables that describe the raw data. In the case of an email, we might think of different ways we could describe it.
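For instance, we could count how long the email is, how often the word "you" appears, or the average length of runs of capital letters (the quantity that the capitalAve column in the spam dataset below measures). Here is a minimal sketch of that level-1 step, using a made-up email string purely for illustration:

rawEmail<-"Dear Jeff, YOU have WON a FREE prize. Claim YOUR money now!" # made-up raw data
numChars<-nchar(rawEmail) # length of the email
countMatches<-function(pattern, text){
  m<-gregexpr(pattern, text)[[1]]
  sum(m>0) # gregexpr returns -1 when there is no match
}
numYou<-countMatches("you", tolower(rawEmail)) # frequency of the word "you"
chars<-strsplit(rawEmail, "")[[1]]
runs<-rle(grepl("[A-Z]", chars)) # runs of consecutive capital letters
capitalAve<-mean(runs$lengths[runs$values]) # average run length of capital letters
c(numChars=numChars, numYou=numYou, capitalAve=capitalAve)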

So, all in all, going from raw data to covariates usually involves a lot of thinking about:

  1. the structure of the data that we have
  2. what the right way is to extract the most useful information in the fewest number of variables that still capture everything we care about

As a concrete example, let's load the spam dataset from the kernlab package and create a new covariate that is the square of capitalAve (the average run length of capital letters in an email):
library(kernlab)
data(spam)
spam$capitalAveSq<-spam$capitalAve^2 # new covariate: the square of capitalAve (average run length of capital letters)
head(spam$capitalAveSq) # check that the new variable was created
## [1] 14.10754 26.15300 96.45204 12.51037 12.51037  9.00000
summary(spam$capitalAveSq)#getting summary statistics
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##       1.0       2.5       5.2    1033.5      13.7 1215506.2

The next stage, the second level of covariate creation, is transforming tidy covariates. For example, we calculated capitalAve, the average run length of capital letters, but that raw average might not be the quantity that relates best to the outcome we care about; it might be the average squared or cubed, or some other function of it. So the second stage is transforming the variables we already have into more useful variables.
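For instance, sticking with the spam data loaded above, we might also try a cubic term or a log transform of capitalAve; the names capitalAveCubed and capitalAveLog below are made up just for this sketch:

spam$capitalAveCubed<-spam$capitalAve^3 # cubic term
spam$capitalAveLog<-log10(spam$capitalAve+1) # log transform; +1 guards against taking the log of zero
head(spam[,c("capitalAve","capitalAveSq","capitalAveCubed","capitalAveLog")])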

And so the idea is that we have to think very carefully about how to pick the right features that explain most of what’s happening in our raw data.

So, some potential examples can be:

i. Text files: it might be the frequency of words or the frequency of phrases. There's a cool site, Google Ngrams, which shows the frequency with which different phrases appear in books going back in time.

ii. Images: it might be edges, corners, blobs, and ridges.

iii. Websites: it might be the number and type of images, where the buttons are, colors, and videos. This is a hugely important area of web development called A/B testing (known in statistics as randomized trials), which is basically showing different versions of a website with different values of these features and seeing which one produces more clicks or gets more people to buy products.

iv. People: features might be height, weight, hair color, and so on.

Basically, any summary of the raw data we can compute is a potential feature. Choosing the right covariates for a particular problem often takes quite a bit of scientific thinking and business acumen: the more knowledge we have of the system, the better job we'll do at feature extraction. In general it's a good idea to have a really clear understanding of why a given piece of data should be useful for predicting the outcome we care about.

Thus, we should keep the following things in mind during the second phase of covariate creation:

  1. It is more necessary for some methods (regression, support vector machines) than for others (classification trees).
  2. It should be done only on the training set.
  3. The best approach is through exploratory analysis (plotting/tables); see the sketch after this list.
  4. New covariates should be added to the data frames we use for model fitting.
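As a quick sketch of what such an exploratory check might look like on the spam data loaded earlier (the cutoff of 5 below is arbitrary and chosen only for illustration):

boxplot(log10(spam$capitalAve+1)~spam$type, ylab="log10(capitalAve + 1)") # does the covariate separate spam from nonspam?
table(spam$type, spam$capitalAve>5) # quick tabular check against an arbitrary cutoff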

Let's load an example dataset and go through what we have mentioned so far. We load the ISLR (Introduction to Statistical Learning with Applications in R) package and the caret package, and use the Wage dataset. We divide the data into training and testing sets, with 70% of the data in the training set. The outcome variable is wage.

library(ISLR)
library(caret)
data(Wage)
inTrain<-createDataPartition(y=Wage$wage, p=0.7, list=FALSE)
training<-Wage[inTrain, ]
testing<-Wage[-inTrain, ]

One idea that's very common when building machine learning algorithms is to turn covariates that are qualitative, or factor, variables into what are called dummy variables. For example, in the Wage dataset, job class (information or industrial) can be a predictor, but we first have to convert it into indicator (dummy) variables. Let's check the job class variable:

head(training$jobclass)
## [1] 1. Industrial  2. Information 2. Information 2. Information 2. Information
## [6] 2. Information
## Levels: 1. Industrial 2. Information
table(training$jobclass)
## 
##  1. Industrial 2. Information 
##           1083           1019

As mentioned, the jobs are marked either as 'Industrial' or 'Information'. In the training set there are 1083 'Industrial' cases and 1019 'Information' cases. We have to convert this factor into dummy variables to be able to use it as a predictor.

dummies<-dummyVars(wage ~ jobclass, data = training)
head(predict(dummies, newdata=training))
##        jobclass.1. Industrial jobclass.2. Information
## 231655                      1                       0
## 86582                       0                       1
## 155159                      0                       1
## 11443                       0                       1
## 376662                      0                       1
## 228963                      0                       1

We get two variables out: the first column is an indicator for the Industrial job class and the second for Information. In each row, a 1 means the person belongs to that class and a 0 means they do not.
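Following note 4 above, these dummy columns would then be added to the data frame we use for model fitting. A minimal sketch (trainingDummies is a name made up here, so the original training object is left untouched for the steps below):

trainingDummies<-cbind(training, predict(dummies, newdata=training)) # append the two indicator columns
dim(trainingDummies) # two more columns than the original training set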

In addition, some variables have essentially no variability in them. For example, if we created a feature for emails indicating whether the email contains any letters at all, almost every single email would have at least one letter, so that variable would nearly always be TRUE. The nearZeroVar function in caret identifies variables that have very little variability and will likely not be good predictors. We apply it to the training data frame.

nsv<-nearZeroVar(training, saveMetrics = TRUE)
nsv
##            freqRatio percentUnique zeroVar   nzv
## year        1.030030    0.33301618   FALSE FALSE
## age         1.069444    2.90199810   FALSE FALSE
## maritl      3.197802    0.23786870   FALSE FALSE
## race        8.446602    0.19029496   FALSE FALSE
## education   1.434783    0.23786870   FALSE FALSE
## region      0.000000    0.04757374    TRUE  TRUE
## jobclass    1.062807    0.09514748   FALSE FALSE
## health      2.605489    0.09514748   FALSE FALSE
## health_ins  2.279251    0.09514748   FALSE FALSE
## logwage     1.050633   19.64795433   FALSE FALSE
## wage        1.050633   19.64795433   FALSE FALSE

We get, among other diagnostics, the percentage of unique values, and the last column, 'nzv', shows whether or not a variable has near zero variance. 'FALSE' means the variable has reasonable variability, while 'TRUE' means it has almost none. Any variable marked TRUE by this function is a good candidate to toss out of the model. For example, 'region' in this dataset takes only a single value (its 'zeroVar' flag is also TRUE), so its 'nzv' result is 'TRUE' and there is no need to include it in our prediction algorithm.
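To actually drop the flagged variables, nearZeroVar can also be called without saveMetrics, in which case it returns the column positions of the near-zero-variance variables. A minimal sketch (trainingFiltered is just an illustrative name):

nzvCols<-nearZeroVar(training) # indices of near-zero-variance columns (region, in this case)
trainingFiltered<-if(length(nzvCols)>0) training[,-nzvCols,drop=FALSE] else training
dim(trainingFiltered) # one column fewer than training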

The other thing we can do: if we use linear regression or generalized linear regression as our prediction algorithm, which we'll talk about in a future lecture, the idea is to fit basically straight lines through the data. Sometimes we want to fit curvy lines instead, and one way to do that is with basis functions, which are available in the splines package. The bs function creates a polynomial (B-spline) basis. Here we pass it a single variable from the training set, age, and ask for three degrees of freedom, corresponding to a third-degree polynomial fit. We get a three-column matrix out, i.e., three new variables.

library(splines)
bsBasis<-bs(training$age, df=3)
head(bsBasis)
##              1          2           3
## [1,] 0.0000000 0.00000000 0.000000000
## [2,] 0.2368501 0.02537679 0.000906314
## [3,] 0.4308138 0.29109043 0.065560908
## [4,] 0.3625256 0.38669397 0.137491189
## [5,] 0.3063341 0.42415495 0.195763821
## [6,] 0.4403553 0.25969672 0.051051492
summary(bsBasis)
##        1                2                3          
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:0.2936   1st Qu.:0.1482   1st Qu.:0.01719  
##  Median :0.3867   Median :0.2755   Median :0.05800  
##  Mean   :0.3524   Mean   :0.2611   Mean   :0.10317  
##  3rd Qu.:0.4308   3rd Qu.:0.3867   3rd Qu.:0.15079  
##  Max.   :0.4444   Max.   :0.4444   Max.   :1.00000

Loosely speaking, the first column corresponds to the age values themselves (they are scaled for computational purposes), the second column to something like age squared, and the third to something like age cubed, allowing for a cubic relationship between age and wage. If we include these covariates in the model instead of just the age variable when fitting a linear regression, we allow for a curvy model fit.

Let’s see an example:

lm1<-lm(wage~bsBasis, data=training) # linear model of wage on the spline basis of age
plot(training$age, training$wage, pch=19, cex=0.5) # raw data: age (x) vs wage (y)
points(training$age, predict(lm1, newdata=training),col="red", pch=19, cex=0.5) # fitted curve in red

We plot the age data versus the wage data, with age on the x axis and wage on the y axis, and we can see that there's a curvilinear relationship between the two variables. We then overlay the predicted values from our linear model, which includes the polynomial terms, and get a curve fit through the data rather than just a straight line. So one way to generate new variables is by allowing more flexibility in how we model specific variables.

Then on the test set we have to predict those same variables. This is an idea that is critically important in machine learning: when we create new covariates, we have to create them on the test data set using the exact same procedure, fitted on the training set, that we used there. Note that this can introduce some bias, because the basis was estimated on the training data rather than the test data, but it is the right thing to do.

summary(predict(bsBasis, age=testing$age))
##        1                2                3          
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:0.2936   1st Qu.:0.1482   1st Qu.:0.01719  
##  Median :0.3867   Median :0.2755   Median :0.05800  
##  Mean   :0.3524   Mean   :0.2611   Mean   :0.10317  
##  3rd Qu.:0.4308   3rd Qu.:0.3867   3rd Qu.:0.15079  
##  Max.   :0.4444   Max.   :0.4444   Max.   :1.00000
head(predict(bsBasis, age=testing$age))
##              1          2           3
## [1,] 0.0000000 0.00000000 0.000000000
## [2,] 0.2368501 0.02537679 0.000906314
## [3,] 0.4308138 0.29109043 0.065560908
## [4,] 0.3625256 0.38669397 0.137491189
## [5,] 0.3063341 0.42415495 0.195763821
## [6,] 0.4403553 0.25969672 0.051051492
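As a side note, one way to avoid doing this bookkeeping by hand is to put the basis expansion directly into the model formula; the splines package then remembers the training-set knots, so predict() applies the same basis to new data (lm2 is just an illustrative name):

library(splines)
lm2<-lm(wage~bs(age, df=3), data=training) # basis expansion inside the formula
head(predict(lm2, newdata=testing)) # the training-set basis is applied to the test data automatically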

Notes:

  1. Level 1 feature creation (raw data to covariates)
  2. Level 2 feature creation (covariates to new covariates)
  3. Preprocessing with caret