UMM Kaggle : Cleaning Data

1. Outline

This document explains the structure of the Google Merchandise Store (GMS) data set, what we need to do and discover, and the direction we should take.

2. Basic Infos

Above all, what we need to predict is the sum of all transactions PER USER. The primary key for identifying users is fullVisitorId.
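To make the prediction target concrete, here is a minimal sketch of summing revenue per user on a toy table. The column name transactionRevenue, the toy values, and the choice to treat a missing revenue as 0 are assumptions made for illustration only.

```r
# Toy session-level data; the values are made up for illustration.
sessions <- data.frame(
  fullVisitorId = c("001", "001", "002", "003"),
  transactionRevenue = c(10, 25, NA, 40),  # assumed column name
  stringsAsFactors = FALSE
)

# Treat sessions without a purchase as zero revenue (an assumption).
sessions$transactionRevenue[is.na(sessions$transactionRevenue)] <- 0

# One row per user: the sum of all of that user's transactions.
per_user <- aggregate(transactionRevenue ~ fullVisitorId,
                      data = sessions, FUN = sum)
print(per_user)
```

Keeping fullVisitorId as a character column avoids losing digits or leading zeros when the IDs are long.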

2-1. Training set

903653 rows
12 columns
colnames : channelGrouping, date, device, fullVisitorId, geoNetwork, sessionId, socialEngagementType, totals, trafficSource, visitId, visitNumber, visitStartTime

Among these columns, device, geoNetwork, totals and trafficSource are in JSON (JavaScript Object Notation) format. They therefore need to be converted to an ordinary vector (column) format with the jsonlite package. According to the Data section of the Kaggle competition page, the trafficSource column was censored to remove target leakage, so we will treat that column as an exception.

2-2. Test set

804684 rows
12 columns
colnames : channelGrouping, date, device, fullVisitorId, geoNetwork, sessionId, socialEngagementType, totals, trafficSource, visitId, visitNumber, visitStartTime

As in the training set, device, geoNetwork, totals and trafficSource are in JSON (JavaScript Object Notation) format, and trafficSource is censored in the same way.

3. Data Cleaning and Preprocessing

3-1. Flattening JSON formats

There are four JSON-format variables (device, geoNetwork, trafficSource and totals) in the training and test sets. They should be converted into a suitable flat format.

We wrote a treating_json function for convenience.

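library(jsonlite)  # provides fromJSON()

# Flatten a character vector of JSON objects into a data frame
treating_json <- function(x){
        x <- paste(x, collapse = ",")      # join the row-wise JSON objects
        x <- paste("[", x, "]")            # wrap them into a single JSON array
        x <- fromJSON(x, flatten=TRUE)     # parse; nested fields become columns
        return(x)
}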

With the treating_json function, we can flatten our JSON-format data into ordinary columns.

library(dplyr)  # provides bind_cols(), select() and na_if()
newtrain <- bind_cols(train,
                      treating_json(train$device),
                      treating_json(train$geoNetwork),
                      treating_json(train$trafficSource),
                      treating_json(train$totals))
newtrain <- dplyr::select(newtrain, -device, -geoNetwork, -trafficSource, -totals)

newtest <- bind_cols(test,
                     treating_json(test$device),
                     treating_json(test$geoNetwork),
                     treating_json(test$trafficSource),
                     treating_json(test$totals))
newtest <- dplyr::select(newtest, -device, -geoNetwork, -trafficSource, -totals)
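To see what the flattening step produces, here is a minimal sketch on a two-row toy table. The toy JSON strings and their field names are hypothetical stand-ins for the real data, and treating_json is redefined compactly so the block runs on its own.

```r
library(jsonlite)
library(dplyr)

# Same idea as treating_json above, written in one expression.
treating_json <- function(x) {
  fromJSON(paste0("[", paste(x, collapse = ","), "]"), flatten = TRUE)
}

# Hypothetical toy rows: each JSON column holds one object per row.
toy <- data.frame(
  fullVisitorId = c("001", "002"),
  device = c('{"browser": "Chrome", "isMobile": false}',
             '{"browser": "Safari", "isMobile": true}'),
  totals = c('{"visits": "1", "hits": "4"}',
             '{"visits": "1", "hits": "2", "transactionRevenue": "37860000"}'),
  stringsAsFactors = FALSE
)

# Append the flattened fields, then drop the raw JSON columns.
flat <- bind_cols(toy, treating_json(toy$device), treating_json(toy$totals))
flat <- select(flat, -device, -totals)
str(flat)
```

A field that is absent from a row's JSON object (transactionRevenue in the first toy row) comes out as NA, which is one source of the NAs treated in the next section.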

3-2. Cleaning Data and Treating `NA`s

First, we had to remove meaningless columns, that is, columns that contain only one distinct value. Below is the procedure.
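The code for this step is not reproduced here, so the following is only a minimal sketch of one way to drop single-valued columns, shown on a hypothetical toy data frame rather than on newtrain itself.

```r
# Hypothetical toy data: two columns are constant, two are not.
toy <- data.frame(
  fullVisitorId = c("001", "002", "003"),
  socialEngagementType = rep("Not Socially Engaged", 3),
  visits = c("1", "1", "1"),
  browser = c("Chrome", "Safari", "Chrome"),
  stringsAsFactors = FALSE
)

# TRUE for every column that holds at most one distinct value.
# unique() counts NA as a value, so a column such as c(TRUE, NA) is kept.
is_constant <- vapply(toy, function(col) length(unique(col)) <= 1, logical(1))

# Keep only the informative columns.
toy_clean <- toy[, !is_constant, drop = FALSE]
print(names(toy_clean))
```

When applying this to real data, the same set of columns should be dropped from both the training and the test set so that their layouts stay aligned.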

Some missing values are also encoded as placeholder strings in the country, region, metro, city, campaign and colnames(newtrain)[22] columns. We have to convert them into proper NA values.

# Placeholder strings that actually mean "missing"
NAs <- c("not available in demo dataset", "(not provided)",
         "(not set)", "unknown.unknown", "(none)")
# Column positions that may contain the placeholder strings
for (i in c(8,9,12,13,14,15,16,17,18,19,20,21,22)){
        for (na_string in NAs){
                # Replace each placeholder with a real NA in both sets
                newtrain[[i]] <- na_if(newtrain[[i]], na_string)
                newtest[[i]] <- na_if(newtest[[i]], na_string)
        }
}

After this procedure, we counted how many NAs were in the training and test sets.

sum(is.na(newtrain)); sum(is.na(newtest))

## [1] 13021855

## [1] 9700465

There is a huge number of NAs in both sets.

Let’s look at the same counts per column in a plot.

library(ggplot2)

# Count the NAs in each column of the training set
numberNAs_train <- data.frame(colnames(newtrain))
for(i in 1:ncol(newtrain)){
        numberNAs_train[i,2] <- sum(is.na(newtrain[[i]]))
}
colnames(numberNAs_train) <- c("Columns", "NumberOfNAs")
# Horizontal bar chart, columns ordered by their NA count
ggplot(numberNAs_train, aes(x=reorder(Columns, NumberOfNAs), y=NumberOfNAs)) +
        geom_bar(stat="identity", fill="orange", size=3) +
        coord_flip()

# Same count and plot for the test set
numberNAs_test <- data.frame(colnames(newtest))
for(i in 1:ncol(newtest)){
        numberNAs_test[i,2] <- sum(is.na(newtest[[i]]))
}
colnames(numberNAs_test) <- c("Columns", "NumberOfNAs")
ggplot(numberNAs_test, aes(x=reorder(Columns, NumberOfNAs), y=NumberOfNAs)) +
        geom_bar(stat="identity", fill="orange", size=3) +
        coord_flip()
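The per-column counting loop can also be written as a single vectorized call. This is a minimal sketch on a hypothetical toy data frame. The name numberNAs_toy is introduced here only for illustration.

```r
# Hypothetical toy data with a different number of NAs per column.
toy <- data.frame(
  city = c("Seoul", NA, NA),
  browser = c("Chrome", "Safari", NA),
  hits = c(1, 2, 3),
  stringsAsFactors = FALSE
)

# is.na() gives a logical matrix; colSums() counts the TRUEs per column.
na_counts <- colSums(is.na(toy))

# Same two-column layout as the tables built in the loops above.
numberNAs_toy <- data.frame(Columns = names(na_counts),
                            NumberOfNAs = as.vector(na_counts))

# Sort so that the columns with the most NAs come first.
numberNAs_toy <- numberNAs_toy[order(-numberNAs_toy$NumberOfNAs), ]
print(numberNAs_toy)
```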

The plots show the number of NA values per column. The ten columns with the most NAs contain 903652, 892707, 892138, 882193, 882193, 882193, 882193, 882092, 869292 and 865347 NAs in newtrain, and 750893, 750870, 750870, 750870, 750870, 750822, 737423, 728927, 609860 and 569361 NAs in newtest. We need to impute these values before going further. We will impute them with the value 0, except for a few columns where an NA carries a specific meaning.

An NA in isTrueDirect means FALSE, whereas an NA in adwordsClickInfo.isVideoAd means TRUE. We imputed the NAs in these two columns accordingly.

newtrain$isTrueDirect <- replace(newtrain$isTrueDirect,
                                 is.na(newtrain$isTrueDirect),
                                 "FALSE")
newtrain$adwordsClickInfo.isVideoAd <- replace(newtrain$adwordsClickInfo.isVideoAd,
                                               is.na(newtrain$adwordsClickInfo.isVideoAd),
                                               "TRUE")

newtest$isTrueDirect <- replace(newtest$isTrueDirect,
                                is.na(newtest$isTrueDirect),
                                "FALSE")
newtest$adwordsClickInfo.isVideoAd <- replace(newtest$adwordsClickInfo.isVideoAd,
                                              is.na(newtest$adwordsClickInfo.isVideoAd),
                                              "TRUE")

For the other columns, imputing NAs with 0 causes no problems for explanation or interpretation, so we did the following.

for(i in 1:ncol(newtrain)){
        newtrain[,i] <- replace(newtrain[,i], is.na(newtrain[,i]),0)
}
for(i in 1:ncol(newtest)){
        newtest[,i] <- replace(newtest[,i], is.na(newtest[,i]),0)
}

Also, for convenience, we converted data types of some columns.
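The conversion code itself is not reproduced here, and which columns were converted is not stated. The following is therefore only a minimal sketch of typical conversions, on a hypothetical toy data frame. It assumes that date is stored as a yyyymmdd string, that visitStartTime is a POSIX timestamp in seconds, and that hits and transactionRevenue are count and revenue fields stored as strings.

```r
# Hypothetical toy rows; column names and formats are assumptions.
toy <- data.frame(
  date = c("20160902", "20170126"),
  visitStartTime = c(1472830385, 1485450000),
  hits = c("4", "0"),
  transactionRevenue = c("0", "37860000"),
  stringsAsFactors = FALSE
)

# Parse the yyyymmdd string into a Date.
toy$date <- as.Date(toy$date, format = "%Y%m%d")

# Convert seconds since 1970-01-01 into a date-time (UTC).
toy$visitStartTime <- as.POSIXct(toy$visitStartTime,
                                 origin = "1970-01-01", tz = "UTC")

# Counts and revenue arrive as strings after flattening; make them numeric.
toy$hits <- as.integer(toy$hits)
toy$transactionRevenue <- as.numeric(toy$transactionRevenue)

str(toy)
```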

Now we write these new data sets to new .csv files.

write.csv(newtrain, file = "newtrain.csv")
write.csv(newtest, file = "newtest.csv")


Baescott

2018 12 7
