The goal is to explore a data set and convert two categorical variables into binary dummies.
We’ll start by loading a data set and explore the data
corolla.df <- read.csv("ToyotaCorolla.csv", header = TRUE) # load data
dim(corolla.df) # find the dimension of data frame
## [1] 1436 39
The dim() function tells us that there are 1436 records or observations and 39 variables or attributes.
head(corolla.df) # show the first six rows
## Id Model Price Age_08_04 Mfg_Month
## 1 1 TOYOTA Corolla 2.0 D4D HATCHB TERRA 2/3-Doors 13500 23 10
## 2 2 TOYOTA Corolla 2.0 D4D HATCHB TERRA 2/3-Doors 13750 23 10
## 3 3 TOYOTA Corolla 2.0 D4D HATCHB TERRA 2/3-Doors 13950 24 9
## 4 4 TOYOTA Corolla 2.0 D4D HATCHB TERRA 2/3-Doors 14950 26 7
## 5 5 TOYOTA Corolla 2.0 D4D HATCHB SOL 2/3-Doors 13750 30 3
## 6 6 TOYOTA Corolla 2.0 D4D HATCHB SOL 2/3-Doors 12950 32 1
## Mfg_Year KM Fuel_Type HP Met_Color Color Automatic CC Doors Cylinders
## 1 2002 46986 Diesel 90 1 Blue 0 2000 3 4
## 2 2002 72937 Diesel 90 1 Silver 0 2000 3 4
## 3 2002 41711 Diesel 90 1 Blue 0 2000 3 4
## 4 2002 48000 Diesel 90 0 Black 0 2000 3 4
## 5 2002 38500 Diesel 90 0 Black 0 2000 3 4
## 6 2002 61000 Diesel 90 0 White 0 2000 3 4
## Gears Quarterly_Tax Weight Mfr_Guarantee BOVAG_Guarantee Guarantee_Period ABS
## 1 5 210 1165 0 1 3 1
## 2 5 210 1165 0 1 3 1
## 3 5 210 1165 1 1 3 1
## 4 5 210 1165 1 1 3 1
## 5 5 210 1170 1 1 3 1
## 6 5 210 1170 0 1 3 1
## Airbag_1 Airbag_2 Airco Automatic_airco Boardcomputer CD_Player Central_Lock
## 1 1 1 0 0 1 0 1
## 2 1 1 1 0 1 1 1
## 3 1 1 0 0 1 0 0
## 4 1 1 0 0 1 0 0
## 5 1 1 1 0 1 0 1
## 6 1 1 1 0 1 0 1
## Powered_Windows Power_Steering Radio Mistlamps Sport_Model Backseat_Divider
## 1 1 1 0 0 0 1
## 2 0 1 0 0 0 1
## 3 0 1 0 0 0 1
## 4 0 1 0 0 0 1
## 5 1 1 0 1 0 1
## 6 1 1 0 1 0 1
## Metallic_Rim Radio_cassette Parking_Assistant Tow_Bar
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
View(corolla.df) # show all the data in a new tab
summary(corolla.df) # find summary statistics for each column
## Id Model Price Age_08_04
## Min. : 1.0 Length:1436 Min. : 4350 Min. : 1.00
## 1st Qu.: 361.8 Class :character 1st Qu.: 8450 1st Qu.:44.00
## Median : 721.5 Mode :character Median : 9900 Median :61.00
## Mean : 721.6 Mean :10731 Mean :55.95
## 3rd Qu.:1081.2 3rd Qu.:11950 3rd Qu.:70.00
## Max. :1442.0 Max. :32500 Max. :80.00
## Mfg_Month Mfg_Year KM Fuel_Type
## Min. : 1.000 Min. :1998 Min. : 1 Length:1436
## 1st Qu.: 3.000 1st Qu.:1998 1st Qu.: 43000 Class :character
## Median : 5.000 Median :1999 Median : 63390 Mode :character
## Mean : 5.549 Mean :2000 Mean : 68533
## 3rd Qu.: 8.000 3rd Qu.:2001 3rd Qu.: 87021
## Max. :12.000 Max. :2004 Max. :243000
## HP Met_Color Color Automatic
## Min. : 69.0 Min. :0.0000 Length:1436 Min. :0.00000
## 1st Qu.: 90.0 1st Qu.:0.0000 Class :character 1st Qu.:0.00000
## Median :110.0 Median :1.0000 Mode :character Median :0.00000
## Mean :101.5 Mean :0.6748 Mean :0.05571
## 3rd Qu.:110.0 3rd Qu.:1.0000 3rd Qu.:0.00000
## Max. :192.0 Max. :1.0000 Max. :1.00000
## CC Doors Cylinders Gears Quarterly_Tax
## Min. : 1300 Min. :2.000 Min. :4 Min. :3.000 Min. : 19.00
## 1st Qu.: 1400 1st Qu.:3.000 1st Qu.:4 1st Qu.:5.000 1st Qu.: 69.00
## Median : 1600 Median :4.000 Median :4 Median :5.000 Median : 85.00
## Mean : 1577 Mean :4.033 Mean :4 Mean :5.026 Mean : 87.12
## 3rd Qu.: 1600 3rd Qu.:5.000 3rd Qu.:4 3rd Qu.:5.000 3rd Qu.: 85.00
## Max. :16000 Max. :5.000 Max. :4 Max. :6.000 Max. :283.00
## Weight Mfr_Guarantee BOVAG_Guarantee Guarantee_Period
## Min. :1000 Min. :0.0000 Min. :0.0000 Min. : 3.000
## 1st Qu.:1040 1st Qu.:0.0000 1st Qu.:1.0000 1st Qu.: 3.000
## Median :1070 Median :0.0000 Median :1.0000 Median : 3.000
## Mean :1072 Mean :0.4095 Mean :0.8955 Mean : 3.815
## 3rd Qu.:1085 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.: 3.000
## Max. :1615 Max. :1.0000 Max. :1.0000 Max. :36.000
## ABS Airbag_1 Airbag_2 Airco
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:1.0000 1st Qu.:1.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :1.0000 Median :1.0000 Median :1.0000 Median :1.0000
## Mean :0.8134 Mean :0.9708 Mean :0.7228 Mean :0.5084
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## Automatic_airco Boardcomputer CD_Player Central_Lock
## Min. :0.00000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.00000 Median :0.0000 Median :0.0000 Median :1.0000
## Mean :0.05641 Mean :0.2946 Mean :0.2187 Mean :0.5801
## 3rd Qu.:0.00000 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:1.0000
## Max. :1.00000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## Powered_Windows Power_Steering Radio Mistlamps
## Min. :0.000 Min. :0.0000 Min. :0.0000 Min. :0.000
## 1st Qu.:0.000 1st Qu.:1.0000 1st Qu.:0.0000 1st Qu.:0.000
## Median :1.000 Median :1.0000 Median :0.0000 Median :0.000
## Mean :0.562 Mean :0.9777 Mean :0.1462 Mean :0.257
## 3rd Qu.:1.000 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:1.000
## Max. :1.000 Max. :1.0000 Max. :1.0000 Max. :1.000
## Sport_Model Backseat_Divider Metallic_Rim Radio_cassette
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:1.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :1.0000 Median :0.0000 Median :0.0000
## Mean :0.3001 Mean :0.7702 Mean :0.2047 Mean :0.1455
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## Parking_Assistant Tow_Bar
## Min. :0.000000 Min. :0.0000
## 1st Qu.:0.000000 1st Qu.:0.0000
## Median :0.000000 Median :0.0000
## Mean :0.002786 Mean :0.2779
## 3rd Qu.:0.000000 3rd Qu.:1.0000
## Max. :1.000000 Max. :1.0000
If we wanted to find the average price of a Corolla we can use the mean() function and apply it to the Price attribute.
mean(corolla.df$Price) # find the average price of Corolla
## [1] 10730.82
If we wanted to find the average price of a 2004 Corolla, we can use the mean() function with a conditional.
mean(corolla.df[corolla.df$Mfg_Year=='2004', 'Price'])
## [1] 22324.17
xtotal <- model.matrix(~ 0 + Fuel_Type + Color, data = corolla.df)
xtotal <- as.data.frame(xtotal)
t(t(names(xtotal))) # check the names of the dummy variables
## [,1]
## [1,] "Fuel_TypeCNG"
## [2,] "Fuel_TypeDiesel"
## [3,] "Fuel_TypePetrol"
## [4,] "ColorBlack"
## [5,] "ColorBlue"
## [6,] "ColorGreen"
## [7,] "ColorGrey"
## [8,] "ColorRed"
## [9,] "ColorSilver"
## [10,] "ColorViolet"
## [11,] "ColorWhite"
## [12,] "ColorYellow"
head(xtotal)
## Fuel_TypeCNG Fuel_TypeDiesel Fuel_TypePetrol ColorBlack ColorBlue ColorGreen
## 1 0 1 0 0 1 0
## 2 0 1 0 0 0 0
## 3 0 1 0 0 1 0
## 4 0 1 0 1 0 0
## 5 0 1 0 1 0 0
## 6 0 1 0 0 0 0
## ColorGrey ColorRed ColorSilver ColorViolet ColorWhite ColorYellow
## 1 0 0 0 0 0 0
## 2 0 0 1 0 0 0
## 3 0 0 0 0 0 0
## 4 0 0 0 0 0 0
## 5 0 0 0 0 0 0
## 6 0 0 0 0 1 0
In data mining, data is partitioned in two or three sets: training, validation, and test sets. The purpose of partitioning data is to keep a portion of data out of analysis in training a model, and use it later for verification.
# partitioning into training (50%), validation (30%), test(20%)
# randomly sample 50% of the row Ids for training
train.rows <- sample(rownames(corolla.df), dim(corolla.df)[1]*0.5)
# sample 30% of the row Ids into the validation set, drawing only from records
# not already in the training set
# use setdiff() to find records not already in the training set
valid.rows <- sample(setdiff(rownames(corolla.df), train.rows),
dim(corolla.df)[1]*0.3)
# assign the remaining 20% row Ids serve as test
test.rows <- setdiff(rownames(corolla.df), union(train.rows, valid.rows))
# create the 3 data frames by collecting all columns from the appropriate rows
train.data <- corolla.df[train.rows, ]
valid.data <- corolla.df[valid.rows, ]
test.data <- corolla.df[test.rows, ]