Introduction

The goal is to explore a data set and convert two categorical variables into binary dummies.

We’ll start by loading a data set and explore the data

corolla.df <- read.csv("ToyotaCorolla.csv", header = TRUE)     # load data
dim(corolla.df)                         # find the dimension of data frame
## [1] 1436   39

The dim() function tells us that there are 1436 records or observations and 39 variables or attributes.

head(corolla.df)                        # show the first six rows
##   Id                                         Model Price Age_08_04 Mfg_Month
## 1  1 TOYOTA Corolla 2.0 D4D HATCHB TERRA 2/3-Doors 13500        23        10
## 2  2 TOYOTA Corolla 2.0 D4D HATCHB TERRA 2/3-Doors 13750        23        10
## 3  3 TOYOTA Corolla 2.0 D4D HATCHB TERRA 2/3-Doors 13950        24         9
## 4  4 TOYOTA Corolla 2.0 D4D HATCHB TERRA 2/3-Doors 14950        26         7
## 5  5   TOYOTA Corolla 2.0 D4D HATCHB SOL 2/3-Doors 13750        30         3
## 6  6   TOYOTA Corolla 2.0 D4D HATCHB SOL 2/3-Doors 12950        32         1
##   Mfg_Year    KM Fuel_Type HP Met_Color  Color Automatic   CC Doors Cylinders
## 1     2002 46986    Diesel 90         1   Blue         0 2000     3         4
## 2     2002 72937    Diesel 90         1 Silver         0 2000     3         4
## 3     2002 41711    Diesel 90         1   Blue         0 2000     3         4
## 4     2002 48000    Diesel 90         0  Black         0 2000     3         4
## 5     2002 38500    Diesel 90         0  Black         0 2000     3         4
## 6     2002 61000    Diesel 90         0  White         0 2000     3         4
##   Gears Quarterly_Tax Weight Mfr_Guarantee BOVAG_Guarantee Guarantee_Period ABS
## 1     5           210   1165             0               1                3   1
## 2     5           210   1165             0               1                3   1
## 3     5           210   1165             1               1                3   1
## 4     5           210   1165             1               1                3   1
## 5     5           210   1170             1               1                3   1
## 6     5           210   1170             0               1                3   1
##   Airbag_1 Airbag_2 Airco Automatic_airco Boardcomputer CD_Player Central_Lock
## 1        1        1     0               0             1         0            1
## 2        1        1     1               0             1         1            1
## 3        1        1     0               0             1         0            0
## 4        1        1     0               0             1         0            0
## 5        1        1     1               0             1         0            1
## 6        1        1     1               0             1         0            1
##   Powered_Windows Power_Steering Radio Mistlamps Sport_Model Backseat_Divider
## 1               1              1     0         0           0                1
## 2               0              1     0         0           0                1
## 3               0              1     0         0           0                1
## 4               0              1     0         0           0                1
## 5               1              1     0         1           0                1
## 6               1              1     0         1           0                1
##   Metallic_Rim Radio_cassette Parking_Assistant Tow_Bar
## 1            0              0                 0       0
## 2            0              0                 0       0
## 3            0              0                 0       0
## 4            0              0                 0       0
## 5            0              0                 0       0
## 6            0              0                 0       0
View(corolla.df)                        # show all the data in a new tab 

Perform Summary Statistics

summary(corolla.df)                     # find summary statistics for each column
##        Id            Model               Price         Age_08_04    
##  Min.   :   1.0   Length:1436        Min.   : 4350   Min.   : 1.00  
##  1st Qu.: 361.8   Class :character   1st Qu.: 8450   1st Qu.:44.00  
##  Median : 721.5   Mode  :character   Median : 9900   Median :61.00  
##  Mean   : 721.6                      Mean   :10731   Mean   :55.95  
##  3rd Qu.:1081.2                      3rd Qu.:11950   3rd Qu.:70.00  
##  Max.   :1442.0                      Max.   :32500   Max.   :80.00  
##    Mfg_Month         Mfg_Year          KM          Fuel_Type        
##  Min.   : 1.000   Min.   :1998   Min.   :     1   Length:1436       
##  1st Qu.: 3.000   1st Qu.:1998   1st Qu.: 43000   Class :character  
##  Median : 5.000   Median :1999   Median : 63390   Mode  :character  
##  Mean   : 5.549   Mean   :2000   Mean   : 68533                     
##  3rd Qu.: 8.000   3rd Qu.:2001   3rd Qu.: 87021                     
##  Max.   :12.000   Max.   :2004   Max.   :243000                     
##        HP          Met_Color         Color             Automatic      
##  Min.   : 69.0   Min.   :0.0000   Length:1436        Min.   :0.00000  
##  1st Qu.: 90.0   1st Qu.:0.0000   Class :character   1st Qu.:0.00000  
##  Median :110.0   Median :1.0000   Mode  :character   Median :0.00000  
##  Mean   :101.5   Mean   :0.6748                      Mean   :0.05571  
##  3rd Qu.:110.0   3rd Qu.:1.0000                      3rd Qu.:0.00000  
##  Max.   :192.0   Max.   :1.0000                      Max.   :1.00000  
##        CC            Doors         Cylinders     Gears       Quarterly_Tax   
##  Min.   : 1300   Min.   :2.000   Min.   :4   Min.   :3.000   Min.   : 19.00  
##  1st Qu.: 1400   1st Qu.:3.000   1st Qu.:4   1st Qu.:5.000   1st Qu.: 69.00  
##  Median : 1600   Median :4.000   Median :4   Median :5.000   Median : 85.00  
##  Mean   : 1577   Mean   :4.033   Mean   :4   Mean   :5.026   Mean   : 87.12  
##  3rd Qu.: 1600   3rd Qu.:5.000   3rd Qu.:4   3rd Qu.:5.000   3rd Qu.: 85.00  
##  Max.   :16000   Max.   :5.000   Max.   :4   Max.   :6.000   Max.   :283.00  
##      Weight     Mfr_Guarantee    BOVAG_Guarantee  Guarantee_Period
##  Min.   :1000   Min.   :0.0000   Min.   :0.0000   Min.   : 3.000  
##  1st Qu.:1040   1st Qu.:0.0000   1st Qu.:1.0000   1st Qu.: 3.000  
##  Median :1070   Median :0.0000   Median :1.0000   Median : 3.000  
##  Mean   :1072   Mean   :0.4095   Mean   :0.8955   Mean   : 3.815  
##  3rd Qu.:1085   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.: 3.000  
##  Max.   :1615   Max.   :1.0000   Max.   :1.0000   Max.   :36.000  
##       ABS            Airbag_1         Airbag_2          Airco       
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:1.0000   1st Qu.:1.0000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :1.0000   Median :1.0000   Median :1.0000   Median :1.0000  
##  Mean   :0.8134   Mean   :0.9708   Mean   :0.7228   Mean   :0.5084  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##  Automatic_airco   Boardcomputer      CD_Player       Central_Lock   
##  Min.   :0.00000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.00000   Median :0.0000   Median :0.0000   Median :1.0000  
##  Mean   :0.05641   Mean   :0.2946   Mean   :0.2187   Mean   :0.5801  
##  3rd Qu.:0.00000   3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:1.0000  
##  Max.   :1.00000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##  Powered_Windows Power_Steering       Radio          Mistlamps    
##  Min.   :0.000   Min.   :0.0000   Min.   :0.0000   Min.   :0.000  
##  1st Qu.:0.000   1st Qu.:1.0000   1st Qu.:0.0000   1st Qu.:0.000  
##  Median :1.000   Median :1.0000   Median :0.0000   Median :0.000  
##  Mean   :0.562   Mean   :0.9777   Mean   :0.1462   Mean   :0.257  
##  3rd Qu.:1.000   3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:1.000  
##  Max.   :1.000   Max.   :1.0000   Max.   :1.0000   Max.   :1.000  
##   Sport_Model     Backseat_Divider  Metallic_Rim    Radio_cassette  
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:1.0000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.0000   Median :1.0000   Median :0.0000   Median :0.0000  
##  Mean   :0.3001   Mean   :0.7702   Mean   :0.2047   Mean   :0.1455  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:0.0000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##  Parking_Assistant     Tow_Bar      
##  Min.   :0.000000   Min.   :0.0000  
##  1st Qu.:0.000000   1st Qu.:0.0000  
##  Median :0.000000   Median :0.0000  
##  Mean   :0.002786   Mean   :0.2779  
##  3rd Qu.:0.000000   3rd Qu.:1.0000  
##  Max.   :1.000000   Max.   :1.0000

If we wanted to find the average price of a Corolla we can use the mean() function and apply it to the Price attribute.

mean(corolla.df$Price)                  # find the average price of Corolla
## [1] 10730.82

If we wanted to find the average price of a 2004 Corolla, we can use the mean() function with a conditional.

mean(corolla.df[corolla.df$Mfg_Year=='2004', 'Price'])
## [1] 22324.17

Convert the attribute Fuel Type to binary dummies

xtotal <- model.matrix(~ 0 + Fuel_Type + Color, data = corolla.df)
xtotal <- as.data.frame(xtotal)
t(t(names(xtotal)))                     # check the names of the dummy variables
##       [,1]             
##  [1,] "Fuel_TypeCNG"   
##  [2,] "Fuel_TypeDiesel"
##  [3,] "Fuel_TypePetrol"
##  [4,] "ColorBlack"     
##  [5,] "ColorBlue"      
##  [6,] "ColorGreen"     
##  [7,] "ColorGrey"      
##  [8,] "ColorRed"       
##  [9,] "ColorSilver"    
## [10,] "ColorViolet"    
## [11,] "ColorWhite"     
## [12,] "ColorYellow"
head(xtotal)
##   Fuel_TypeCNG Fuel_TypeDiesel Fuel_TypePetrol ColorBlack ColorBlue ColorGreen
## 1            0               1               0          0         1          0
## 2            0               1               0          0         0          0
## 3            0               1               0          0         1          0
## 4            0               1               0          1         0          0
## 5            0               1               0          1         0          0
## 6            0               1               0          0         0          0
##   ColorGrey ColorRed ColorSilver ColorViolet ColorWhite ColorYellow
## 1         0        0           0           0          0           0
## 2         0        0           1           0          0           0
## 3         0        0           0           0          0           0
## 4         0        0           0           0          0           0
## 5         0        0           0           0          0           0
## 6         0        0           0           0          1           0

Partition data

In data mining, data is partitioned in two or three sets: training, validation, and test sets. The purpose of partitioning data is to keep a portion of data out of analysis in training a model, and use it later for verification.

# partitioning into training (50%), validation (30%), test(20%)
# randomly sample 50% of the row Ids for training
train.rows <- sample(rownames(corolla.df), dim(corolla.df)[1]*0.5)

# sample 30% of the row Ids into the validation set, drawing only from records
# not already in the training set
# use setdiff() to find records not already in the training set
valid.rows <- sample(setdiff(rownames(corolla.df), train.rows),
                      dim(corolla.df)[1]*0.3)

# assign the remaining 20% row Ids serve as test
test.rows <- setdiff(rownames(corolla.df), union(train.rows, valid.rows))

# create the 3 data frames by collecting all columns from the appropriate rows
train.data <- corolla.df[train.rows, ]
valid.data <- corolla.df[valid.rows, ]
test.data <- corolla.df[test.rows, ]