Credit to Jared Lander for his Machine Learning with R training.

Creating a test dataset called Boros in R

boros <- tibble::tribble(
  ~Boro,           ~Pop,    ~Size, ~Random,
  'Manhattan',     1600000, 23,    7,
  'Brooklyn',      2600000, 78,    24,
  'Queens',        2330000, 104,   pi,
  'Bronx',         1455000, 42,    21,
  'Staten Island', 475000,  60,    3
)

boros

## # A tibble: 5 x 4
##   Boro              Pop  Size Random
##   <chr>           <dbl> <dbl>  <dbl>
## 1 Manhattan     1600000    23   7   
## 2 Brooklyn      2600000    78  24   
## 3 Queens        2330000   104   3.14
## 4 Bronx         1455000    42  21   
## 5 Staten Island  475000    60   3

The Build function

We use build.x to generate the matrix of predictors (the X matrix) from a formula. For now the formula has only a right-hand side: explanatory variables and no response variable.

The dataset boros is used below with the single explanatory variable Pop. Notice that the intercept is a column of 1s.

build.x(~Pop, data=boros)

##   (Intercept)     Pop
## 1           1 1600000
## 2           1 2600000
## 3           1 2330000
## 4           1 1455000
## 5           1  475000
## attr(,"assign")
## [1] 0 1
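As a cross-check, the same matrix can be sketched with base R's model.matrix. It is used here only as a stand-in for build.x, on the assumption that both return the same columns for a simple numeric formula. The data frame boros_df is a base R copy of boros introduced for this illustration.

```r
# Base R copy of the boros data (a data.frame instead of a tibble)
boros_df <- data.frame(
  Boro   = c("Manhattan", "Brooklyn", "Queens", "Bronx", "Staten Island"),
  Pop    = c(1600000, 2600000, 2330000, 1455000, 475000),
  Size   = c(23, 78, 104, 42, 60),
  Random = c(7, 24, pi, 21, 3)
)

# model.matrix() stands in for build.x() in this sketch
X <- model.matrix(~ Pop, data = boros_df)

colnames(X)                    # "(Intercept)" "Pop"
all(X[, "(Intercept)"] == 1)   # TRUE: the intercept is a column of ones
```

The intercept is not a single number at this stage. It is a column of ones whose coefficient the model estimates later.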

Now with two variables, Pop and Size

build.x(~Pop+Size, data = boros)

##   (Intercept)     Pop Size
## 1           1 1600000   23
## 2           1 2600000   78
## 3           1 2330000  104
## 4           1 1455000   42
## 5           1  475000   60
## attr(,"assign")
## [1] 0 1 2

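Now with interactions between two or more variables, using the * and : formula operators. Inside a formula these are not arithmetic multiplication and division: * adds the main effects plus their interaction, while : adds the interaction term only.

Two variables with * (main effects plus interaction)

## Pop*Size expands to Pop + Size + Pop:Size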
build.x(~Pop*Size, data = boros)

##   (Intercept)     Pop Size  Pop:Size
## 1           1 1600000   23  36800000
## 2           1 2600000   78 202800000
## 3           1 2330000  104 242320000
## 4           1 1455000   42  61110000
## 5           1  475000   60  28500000
## attr(,"assign")
## [1] 0 1 2 3

Two variables with : (interaction term only)

## Pop:Size keeps only the interaction column, the product of Pop and Size (: is not division)
build.x(~Pop:Size, data = boros)

##   (Intercept)  Pop:Size
## 1           1  36800000
## 2           1 202800000
## 3           1 242320000
## 4           1  61110000
## 5           1  28500000
## attr(,"assign")
## [1] 0 1
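The relationship between the two operators can be checked with a short sketch. It uses base R's model.matrix as a stand-in for build.x (an assumption: formula expansion is expected to be the same) and a hypothetical base R data frame boros_df holding the same numbers. The last call shows that real arithmetic inside a formula has to be wrapped in I().

```r
# Numeric columns of the boros data as a base R data.frame
boros_df <- data.frame(
  Pop  = c(1600000, 2600000, 2330000, 1455000, 475000),
  Size = c(23, 78, 104, 42, 60)
)

X_star  <- model.matrix(~ Pop * Size, data = boros_df)
X_plus  <- model.matrix(~ Pop + Size + Pop:Size, data = boros_df)
X_colon <- model.matrix(~ Pop:Size, data = boros_df)

# Pop * Size is shorthand for Pop + Size + Pop:Size
identical(colnames(X_star), colnames(X_plus))

# The Pop:Size column is the element-wise product of the two variables
all.equal(unname(X_colon[, "Pop:Size"]), boros_df$Pop * boros_df$Size)

# Actual division needs I(), which protects arithmetic from formula parsing
X_ratio <- model.matrix(~ I(Pop / Size), data = boros_df)
colnames(X_ratio)
```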

Three variables with * (all main effects, two-way interactions, and the three-way interaction)

## crossing 3 variables
build.x(~Pop*Size*Random, data = boros)

##   (Intercept)     Pop Size    Random  Pop:Size Pop:Random Size:Random
## 1           1 1600000   23  7.000000  36800000   11200000    161.0000
## 2           1 2600000   78 24.000000 202800000   62400000   1872.0000
## 3           1 2330000  104  3.141593 242320000    7319911    326.7256
## 4           1 1455000   42 21.000000  61110000   30555000    882.0000
## 5           1  475000   60  3.000000  28500000    1425000    180.0000
##   Pop:Size:Random
## 1       257600000
## 2      4867200000
## 3       761270732
## 4      1283310000
## 5        85500000
## attr(,"assign")
## [1] 0 1 2 3 4 5 6 7

Notice that every output so far includes the intercept by default. What if we want to suppress the intercept?

Removing the intercept using -1

### Default output first, for comparison; adding -1 to the formula (second call) removes the intercept
build.x(~Pop+Size, data = boros)

##   (Intercept)     Pop Size
## 1           1 1600000   23
## 2           1 2600000   78
## 3           1 2330000  104
## 4           1 1455000   42
## 5           1  475000   60
## attr(,"assign")
## [1] 0 1 2

build.x(~Pop+Size-1, data = boros)

##       Pop Size
## 1 1600000   23
## 2 2600000   78
## 3 2330000  104
## 4 1455000   42
## 5  475000   60
## attr(,"assign")
## [1] 1 2

Notice that the second output does not have the intercept. This is done by putting -1 at the end of the formula.
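R formulas also accept 0 as a way to drop the intercept. The sketch below compares the two spellings with base R's model.matrix, used as a stand-in for build.x on the assumption that it parses formulas the same way. boros_df is a hypothetical base R copy of the numeric columns.

```r
boros_df <- data.frame(
  Pop  = c(1600000, 2600000, 2330000, 1455000, 475000),
  Size = c(23, 78, 104, 42, 60)
)

X_minus1 <- model.matrix(~ Pop + Size - 1, data = boros_df)
X_plus0  <- model.matrix(~ 0 + Pop + Size, data = boros_df)

# Both spellings drop the "(Intercept)" column
colnames(X_minus1)
colnames(X_plus0)
all.equal(X_minus1, X_plus0, check.attributes = FALSE)
```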

Using build.x function with the categorical variable Boro

build.x(~Boro, data = boros)

##   (Intercept) BoroBrooklyn BoroManhattan BoroQueens BoroStaten Island
## 1           1            0             1          0                 0
## 2           1            1             0          0                 0
## 3           1            0             0          1                 0
## 4           1            0             0          0                 0
## 5           1            0             0          0                 1
## attr(,"assign")
## [1] 0 1 1 1 1
## attr(,"contrasts")
## attr(,"contrasts")$Boro
## [1] "contr.treatment"

## Displaying the content of the variable Boro
boros$Boro

## [1] "Manhattan"     "Brooklyn"      "Queens"        "Bronx"        
## [5] "Staten Island"

Notice that new columns are created in the output. Each name is a concatenation of the variable name Boro and one level of the categorical variable, here a borough name.

Comparing the content of the variable Boro with the output of build.x, we can see that “Bronx” does not appear as a column. The function has selected “Bronx” as the baseline because it comes first in alphabetical order, so it was dropped from the output. Row 4 (Bronx) is therefore all zeros apart from the intercept.
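The alphabetical choice of baseline comes from how R orders factor levels. The sketch below, using base R's model.matrix as a stand-in for build.x, shows one way to pick a different baseline with relevel. It is an illustration under that assumption, not something the training prescribes, and boros_df is a hypothetical base R copy of the Boro column.

```r
boros_df <- data.frame(
  Boro = c("Manhattan", "Brooklyn", "Queens", "Bronx", "Staten Island")
)

# Factor levels are sorted, so "Bronx" comes first and becomes the baseline
levels(factor(boros_df$Boro))

# Move "Manhattan" to the front to make it the baseline instead
boros_df$Boro <- relevel(factor(boros_df$Boro), ref = "Manhattan")
X <- model.matrix(~ Boro, data = boros_df)

colnames(X)   # BoroManhattan is now the dropped level; BoroBronx appears
```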

How do we force the function not to drop any level, so that the baseline is kept? By setting the argument contrasts = FALSE.

build.x(~Boro, data = boros, contrasts = FALSE)

##   (Intercept) BoroBronx BoroBrooklyn BoroManhattan BoroQueens
## 1           1         0            0             1          0
## 2           1         0            1             0          0
## 3           1         0            0             0          1
## 4           1         1            0             0          0
## 5           1         0            0             0          0
##   BoroStaten Island
## 1                 0
## 2                 0
## 3                 0
## 4                 0
## 5                 1
## attr(,"assign")
## [1] 0 1 1 1 1 1
## attr(,"contrasts")
## attr(,"contrasts")$Boro
##               Bronx Brooklyn Manhattan Queens Staten Island
## Bronx             1        0         0      0             0
## Brooklyn          0        1         0      0             0
## Manhattan         0        0         1      0             0
## Queens            0        0         0      1             0
## Staten Island     0        0         0      0             1

This time we have all levels of the categorical variable listed.
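A base R sketch of the same idea passes an identity contrast matrix to model.matrix. This is a common idiom and is assumed here to mirror what contrasts = FALSE does in build.x. It also shows a side effect worth knowing: with the intercept and all five indicator columns, the columns are linearly dependent.

```r
boros_df <- data.frame(
  Boro = factor(c("Manhattan", "Brooklyn", "Queens", "Bronx", "Staten Island"))
)

# Identity matrix: one indicator column per level, none dropped
full_dummies <- contrasts(boros_df$Boro, contrasts = FALSE)

X_all <- model.matrix(~ Boro, data = boros_df,
                      contrasts.arg = list(Boro = full_dummies))

ncol(X_all)                      # 6: intercept plus five indicator columns
all(rowSums(X_all[, -1]) == 1)   # each row has exactly one indicator set
qr(X_all)$rank                   # 5: the indicators sum to the intercept column
```

Because of that dependency, keeping every level is usually paired with dropping the intercept or with a regularized model.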

Combining categorical and numerical variables in the formula and dropping the intercept at the same time.

build.x(~Boro+Pop-1, data = boros, contrasts = FALSE)

##   BoroBronx BoroBrooklyn BoroManhattan BoroQueens BoroStaten Island
## 1         0            0             1          0                 0
## 2         0            1             0          0                 0
## 3         0            0             0          1                 0
## 4         1            0             0          0                 0
## 5         0            0             0          0                 1
##       Pop
## 1 1600000
## 2 2600000
## 3 2330000
## 4 1455000
## 5  475000
## attr(,"assign")
## [1] 1 1 1 1 1 2
## attr(,"contrasts")
## attr(,"contrasts")$Boro
##               Bronx Brooklyn Manhattan Queens Staten Island
## Bronx             1        0         0      0             0
## Brooklyn          0        1         0      0             0
## Manhattan         0        0         1      0             0
## Queens            0        0         0      1             0
## Staten Island     0        0         0      0             1

Note: Every value is stored, including the zeros, which wastes memory and processing time once the matrix gets large.

Dropping the zeros from storage and computation to improve performance with sparse = TRUE (sparse matrices)

build.x(~Boro+Pop-1, data = boros, contrasts = FALSE, sparse = TRUE)

## 5 x 6 sparse Matrix of class "dgCMatrix"
##   BoroBronx BoroBrooklyn BoroManhattan BoroQueens BoroStaten Island
## 1         .            .             1          .                 .
## 2         .            1             .          .                 .
## 3         .            .             .          1                 .
## 4         1            .             .          .                 .
## 5         .            .             .          .                 1
##       Pop
## 1 1600000
## 2 2600000
## 3 2330000
## 4 1455000
## 5  475000

The output is a sparse matrix: the zeros are not stored (they print as dots), only the non-zero entries are.
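To see how much is actually stored, the sketch below builds a comparable matrix with sparse.model.matrix from the Matrix package. This is used as a stand-in, on the assumption that it is close to what build.x returns with sparse = TRUE, and boros_df is a hypothetical base R copy of the data.

```r
library(Matrix)

boros_df <- data.frame(
  Boro = factor(c("Manhattan", "Brooklyn", "Queens", "Bronx", "Staten Island")),
  Pop  = c(1600000, 2600000, 2330000, 1455000, 475000)
)

# Without an intercept, the factor keeps one indicator column per level
X_sparse <- sparse.model.matrix(~ Boro + Pop - 1, data = boros_df)

class(X_sparse)        # a sparse matrix class from the Matrix package
prod(dim(X_sparse))    # total number of cells in the matrix
nnzero(X_sparse)       # number of non-zero cells
```

Here only the five indicator ones and the five Pop values are non-zero. The saving grows with the number of levels, since each extra level adds a column that is almost entirely zeros.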

Understanding operations behind linear regression

Roland Mfondoum

July 21, 2018

Supervised learning

Building the X matrices using the build.x function

Preparing for this training by loading necessary packages and dataset
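A minimal setup sketch, assuming build.x comes from the useful package and tribble from tibble. The check for missing packages is an addition for convenience and is not part of the original training material.

```r
# Packages assumed by this post: 'useful' for build.x(), 'tibble' for tribble()
pkgs <- c("useful", "tibble")

# Find which of them are not installed yet
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]

if (length(missing_pkgs) > 0) {
  message("Install first: ", paste(missing_pkgs, collapse = ", "))
} else {
  library(useful)
  library(tibble)
}
```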