Many data analysts claim that 80% of their job is extracting data from its native setting, loading it into the analysis tool that will be used, and transforming it so that it can be analyzed. These procedures are variously called ETL (extract, transform, and load), data wrangling, data munging, or just cleaning data, among others. Whether that 80% number is accurate or not, there is no doubt that getting data ready to analyze takes significant time, resources, and expertise. In this project, we will go through some hands-on practice with ETL.
We will use the TECA dataset extensively for ETL. TECA is a fictional company that owns over one hundred and fifty convenience stores and gas stations throughout the middle of the country, from Wyoming to Minnesota and from Arkansas to Alabama, with many states in between. The particular dataset we'll be using is a random selection of customer transactions from these convenience stores for the years 2017-2019. These transactions include, for example, purchases of fuel, lottery tickets, soda, and candy. TECA is a large dataset consisting of 1,050,000 observations/rows and 24 variables/columns.
Next, let’s load the packages we are going to need. We are going to load one package called tidyverse, which is actually a collection of eight packages. To load a package, we call the function library() and put the package name inside the parentheses. If you have never installed the tidyverse, you will need to do that first; I won’t install it here since I already have it installed.
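The calls would look something like the following (the install line is commented out because it only needs to run once per machine):

# install.packages("tidyverse") #Only needed the first time you use the tidyverse
library(tidyverse) #Load the tidyverse packages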
Next, let’s load the TECA dataset. The file is called mooc1_s.csv, and we will read it into a dataframe called mooc1 using the function read.csv(). First, we point to the file. You can put the whole path here, or, if your working directory is already set to where this file is, you can just put in the file name and the extension. My file is in the same folder as this .rmd file, so I can just list its name here. Finally, I also passed the argument stringsAsFactors=TRUE to the function. This is not generally a good idea, but it does help when first examining a dataset, because it converts all the columns with character inputs into factors, i.e., categorical variables. So, let’s run this now. It may take a few minutes since it is a large file.
mooc1 <- read.csv(file = "mooc1_s.csv", stringsAsFactors=TRUE)
dim(mooc1) #dimensions of the dataset
## [1] 1050000 24
We can see this big dataset has 1,050,000 rows and 24 columns.
Now that we have our data loaded, let’s look at it to get a sense of what we have. We will use four functions to examine the data:
* str(mooc1) – shows the structure of the dataframe, including the data type of each column
* summary(mooc1) – gives descriptive statistics for each column
* head(mooc1, n=10) – prints the first 10 rows of the dataframe
* tail(mooc1, n=10) – prints the last 10 rows of the dataframe
In this section we will discuss the different types of data we are going to observe in this particular dataset, along with a few types we will not observe here but that are equally important to know about.
Let’s start with the str() function and learn a bit about the different data formats that R uses.
## 'data.frame': 1050000 obs. of 24 variables:
## $ unique_id : int 1161545 324500 2389914 1441536 1269552 2209421 1901742 2876118 2772808 1978627 ...
## $ transaction_id : Factor w/ 1047815 levels "20161008|101|3|2|898139",..: 538213 203077 1017291 292709 825017 1025331 160752 935694 239659 522233 ...
## $ unformatted_date: Factor w/ 1095 levels "2017-01-01","2017-01-02",..: 557 212 1062 293 855 1071 170 978 245 539 ...
## $ customer_id : num NA NA 5362 NA 7795 ...
## $ product_id : int 3437 6509 18234 9055 10299 18980 3055 5628 10468 5388 ...
## $ product_name : Factor w/ 9230 levels "$1 MEDIUM FTN OWASSO",..: 2463 5992 4623 5340 5464 2941 4858 4307 9140 4937 ...
## $ category_id : int 53 39 135 138 142 1 206 152 152 145 ...
## $ category_name : Factor w/ 224 levels "1% Milk","2% Milk",..: 174 112 121 75 65 74 34 44 44 107 ...
## $ parent_id : int 5 140 307 175 215 78 11 233 233 232 ...
## $ parent_name : Factor w/ 84 levels "Automotive (35)",..: 70 42 77 49 23 28 58 13 13 39 ...
## $ product_count : int 243 52 1 1575 59 68 199 11 11 9 ...
## $ site_id : int 197 131 336 297 257 157 250 210 482 365 ...
## $ site_name : Factor w/ 189 levels "101 Arab","1056 Rockwell City",..: 83 56 136 123 107 70 104 88 188 142 ...
## $ address : Factor w/ 189 levels "1 Short St","100 Community Dr",..: 162 141 32 7 4 2 169 25 112 82 ...
## $ city : Factor w/ 170 levels "Adamsville","Adrian",..: 20 120 83 47 126 24 48 33 9 151 ...
## $ zip : int 36507 68761 81136 65203 63662 72617 80022 64730 73006 50477 ...
## $ latitude : num 40.1 42.1 37.7 39 37.5 ...
## $ longitude : num -92.4 -98 -105.9 -92.3 -90 ...
## $ site_status : Factor w/ 2 levels "ACTIVE","CLOSED": 1 1 1 1 1 1 1 1 1 1 ...
## $ revenue : num 1.69 11.96 -5 1 2.79 ...
## $ gross_profit : num 0.51 4.96 0 0.05 1.41 ...
## $ costs : num 1.18 7 0 0.95 1.38 ...
## $ units : num 1 4 5 1 1 ...
## $ gp_margin : num 0.302 0.415 0 0.05 0.504 ...
Let’s take a look at the first few rows of the dataset with the head() function.
## unique_id transaction_id unformatted_date customer_id product_id
## 1 1161545 20180417|433|2|1|4888093 2018-07-11 NA 3437
## 2 324500 20170507|361|1|2|2198879 2017-07-31 NA 6509
## 3 2389914 20190904|667|3|1|6216813 2019-11-28 5362.030 18234
## 4 1441536 20170727|562|2|2|3827487 2017-10-20 NA 9055
## 5 1269552 20190209|497|2|2|3338196 2019-05-05 7795.258 10299
## 6 2209421 20190913|387|1|100|5532047 2019-12-07 NA 18980
## 7 1901742 20170326|4923|3|1|3310847 2017-06-19 NA 3055
## 8 2876118 20190612|453|2|2|5160330 2019-09-05 1750.850 5628
## 9 2772808 20170609|986|1|1|1953037 2017-09-02 NA 10468
## 10 1978627 20180330|80|3|3|2643450 2018-06-23 NA 5388
## product_name category_id
## 1 DORITO SMPLY WHT CHDDR 2.5OZ 53
## 2 NESTLE TH VNL SAND 6OZ 39
## 3 LOTTERY WIN 135
## 4 MO HAPPY CAMPER 158 138
## 5 MONSTER REHAB ORNGADE 15.5OZ 142
## 6 G87E10S 1
## 7 MARLB SPCL BLND BLK BX 206
## 8 LARGE FTN 152
## 9 X-LARGE FTN 152
## 10 MEDIUM COFFEE 145
## category_name parent_id parent_name
## 1 Salty Snacks -tort/corn Chips (152) 5 Salty Snacks
## 2 Ice Cream-novelties (124) 140 Ice Cream
## 3 Lottery-win (270) 307 Store Services (92)
## 4 Gaming-lottery (scratch Tickt) (272) 175 Lottery (360)
## 5 Energy non-carb FS (279) 215 Energy non-carb
## 6 Fuel Premuim 78 Fuel
## 7 Cig-off Inv-premium (441) 11 Off Invoice Cigs (104)
## 8 Cold Bev-retail (311) 233 Cold Dispensed Beverage
## 9 Cold Bev-retail (311) 233 Cold Dispensed Beverage
## 10 Hot Disp Bev-retail (301) 232 Hot Dispensed Beverage
## product_count site_id site_name address
## 1 243 197 433 Bay Minette 701 Mcmeans Ave
## 2 52 131 361 Oakdale 500 Hohman St
## 3 1 336 667 Hooper 11701 State Highway 17
## 4 1575 297 562 Columbia 101 S Providence Rd
## 5 59 257 497 Patton 1000 State Route 51
## 6 68 157 387 Big Flat 100 Community Dr
## 7 199 250 4923 Commerce City 7351 Magnolia St
## 8 11 210 453 Butler 114 W Ohio St
## 9 11 482 986 Apache 302 E Evans Ave
## 10 9 365 80 Swaledale 202 5Th St S
## city zip latitude longitude site_status revenue gross_profit
## 1 Bay Minette 36507 40.1474 -92.3793 ACTIVE 1.69 0.510000
## 2 Oakdale 68761 42.0694 -97.9698 ACTIVE 11.96 4.960000
## 3 Hooper 81136 37.7451 -105.8810 ACTIVE -5.00 0.000000
## 4 Columbia 65203 38.9509 -92.3362 ACTIVE 1.00 0.050000
## 5 Patton 63662 37.5003 -90.0134 ACTIVE 2.79 1.406670
## 6 Big Flat 72617 36.0018 -92.4051 ACTIVE 15.80 2.136600
## 7 Commerce City 80022 39.8303 -104.9110 ACTIVE 6.17 1.486150
## 8 Butler 64730 38.2588 -94.3340 ACTIVE 1.00 0.221618
## 9 Apache 73006 34.8936 -98.3630 ACTIVE 1.00 1.000000
## 10 Swaledale 50477 42.9779 -93.3154 ACTIVE 1.59 1.590000
## costs units gp_margin
## 1 1.180000 1.000 0.3017751
## 2 7.000000 4.000 0.4147157
## 3 0.000000 5.000 0.0000000
## 4 0.950000 1.000 0.0500000
## 5 1.383330 1.000 0.5041828
## 6 13.663400 7.122 0.1352278
## 7 4.683850 1.000 0.2408671
## 8 0.778382 1.000 0.2216180
## 9 0.000000 1.000 1.0000000
## 10 0.000000 1.000 1.0000000
We can also have a look at the last 10 rows of the dataset.
## unique_id transaction_id unformatted_date customer_id
## 1049991 769782 20190507|200|3|3|5458756 2019-07-31 NA
## 1049992 1548163 20190511|667|2|2|5896773 2019-08-04 2373.894
## 1049993 2519995 20170315|228|1|100|4057751 2017-06-08 NA
## 1049994 603476 20170412|32|1|100|2275541 2017-07-06 7849.586
## 1049995 1785526 20181008|973|3|3|4891897 2019-01-01 NA
## 1049996 578832 20170509|28|2|1|2224306 2017-08-02 NA
## 1049997 913658 20170210|362|2|2|4062730 2017-05-06 NA
## 1049998 1891118 20170213|2926|2|2|2226767 2017-05-09 NA
## 1049999 1334676 20170810|520|6|2|4317475 2017-11-03 3012.562
## 1050000 1534863 20181002|667|1|1|5351842 2018-12-26 NA
## product_id product_name category_id
## 1049991 16179 VUSE ALTO MEN POD 2ML CT 339
## 1049992 11153 APPLE FRITTER FCS 270
## 1049993 19020 No Lead Plus 1
## 1049994 19018 Mid Grade Blend E15 1
## 1049995 5366 KOOLEE 158
## 1049996 651 CAMEL TK RYL 85 BX 161
## 1049997 2396 MM ORNG 15.2OZ 260
## 1049998 10846 SUB RST BEEF CHDR FCS 274
## 1049999 17877 HOSTESS HOHO 3OZ 3CT 60
## 1050000 15842 NAKED JC BLU SMTHIE 15.2OZ 260
## category_name parent_id parent_name
## 1049991 Cartridges 337 E-Cigarettes
## 1049992 Bakery Food 269 Bakery
## 1049993 Fuel Premuim 78 Fuel
## 1049994 Fuel Premuim 78 Fuel
## 1049995 Carbonated Drinks 234 Frozen Dispensed Beverage
## 1049996 Cig-premium (331) 8 Cigarettes
## 1049997 Pk Bev-juice No Tax (78) 151 Juice No Tax (311)
## 1049998 Cold Food 273 Cold Sandwiches
## 1049999 Cakes & pastrie 262 Bread & Cakes
## 1050000 Pk Bev-juice No Tax (78) 151 Juice No Tax (311)
## product_count site_id site_name address
## 1049991 142 64 200 Cedar Falls 221 W 6Th St
## 1049992 112 336 667 Hooper 11701 State Highway 17
## 1049993 68 91 228 Brandon 704 Main St
## 1049994 68 121 32 Shell Rock 203 S Walnut St
## 1049995 10 476 973 Lafayette 603 S Public Rd
## 1049996 435 108 28 Sioux Rapids 114 Thomas St
## 1049997 95 132 362 Norfork 13862 Highway 5 S
## 1049998 34 114 2926 Commerce City 8096 Brighton Rd
## 1049999 217 274 520 Atlantic 1630 E 7Th St
## 1050000 95 336 667 Hooper 11701 State Highway 17
## city zip latitude longitude site_status revenue gross_profit
## 1049991 Cedar Falls 50613 42.5336 -92.4478 ACTIVE 13.49 4.848890
## 1049992 Hooper 81136 37.7451 -105.8810 ACTIVE 1.89 1.002000
## 1049993 Brandon 52210 42.3147 -92.0057 ACTIVE 12.00 1.636800
## 1049994 Shell Rock 50670 42.7116 -92.5839 ACTIVE 12.00 1.565400
## 1049995 Lafayette 80026 39.9932 -105.0910 ACTIVE 1.39 0.821617
## 1049996 Sioux Rapids 50585 42.8925 -95.1503 ACTIVE 7.15 1.349090
## 1049997 Norfork 72658 36.2088 -92.2863 ACTIVE 1.89 0.770000
## 1049998 Commerce City 80022 39.8437 -104.9210 ACTIVE 4.49 2.786000
## 1049999 Atlantic 50022 41.4048 -94.9895 ACTIVE 1.99 1.205000
## 1050000 Hooper 81136 37.7451 -105.8810 ACTIVE 3.49 1.552140
## costs units gp_margin
## 1049991 8.641110 1.000 0.3594433
## 1049992 0.888000 1.000 0.5301587
## 1049993 10.363200 5.456 0.1364000
## 1049994 10.434600 5.218 0.1304500
## 1049995 0.568383 1.000 0.5910914
## 1049996 5.800910 1.000 0.1886839
## 1049997 1.120000 1.000 0.4074074
## 1049998 1.704000 1.000 0.6204900
## 1049999 0.785000 1.000 0.6055276
## 1050000 1.937860 1.000 0.4447393
Now, we will go through the details of each of the columns in the dataset. First, here is a brief description of the types of data we normally expect to see in a dataframe.
Int – unique_id is an integer. This means that it is a number with no decimal points.
Number – customer_id is a number (num). This is like an integer but might have a decimal point.
Character – this is just a string or set of letters. We don’t have any of these in the dataset right now because we converted them all to Factors when we imported the data.
Factor – city is a factor. When we imported the data into R, we told it to convert characters or strings to factors. A factor is a set of unique categories or levels that R can work with. It is convenient for viewing and understanding data; for example, we can see that there are 170 unique cities in this column. Factors, however, are not good for manipulating data, since weird things can happen when we need to modify a factor. Thus, it is generally better to keep data in the character format and convert it to a factor only when needed for analysis.
date: we can convert a character or factor to a date. It is nice to do this because we can then perform operations on dates, such as extracting the month of the year. Thus, later we will convert unformatted_date to the date format.
NA is not a data type but a stand-in used when data is not available. It means “not available” and indicates that an entry should exist but does not; in other words, the value is missing.
NULL is similar to NA but indicates that a value does not exist at all or is not measurable; in other words, the entry should not be there. We don’t have any of these in the data right now.
Now, we will go through column by column to examine what we have.
Unique_id is an integer, unique identifier for each row that has no duplicates and no nulls.
Transaction_id seems to be an identifier for transactions. There are duplicates, since there are fewer factor levels than rows, and no nulls.
unformatted_date clearly contains dates, but they are not formatted as a date type. We will need to change that.
customer_id has a lot of missing values but still lists thousands of customers. These are loyalty customers using their loyalty card/number when they make a purchase.
product_id & product_name show that there are over 9,000 different products that are sold.
category_id & category_name represent 224 different categories of products. The most commonly purchased are “Fuel Premuim” and “Cold Bev-retail (311)”.
parent_id & parent_name represent 84 parent categories for the products. The most common, as we might suspect by now, are “Fuel” and “Cold Dispensed Beverage.”
product_count. It is not clear what this means, but it could be the number of different products for each parent category.
site_id, site_name, and address represent 189 different stores.
city: the stores are spread across 170 different cities.
zip is the zip code for the store. It is curious that we don’t have the state. It would be nice to obtain that data.
latitude & longitude: these are the coordinates to map each of the stores
site_status has two levels “ACTIVE” and “CLOSED”. We can’t really tell what this means at this point.
revenue is a very useful number. It shows how much revenue is earned for each product sold in a transaction. We will see below that it contains some negative values, and we will investigate those when we check this column for mistakes.
costs appears to be the cost of the product that was sold.
gross_profit then must be the financial concept of the profit remaining for the company after direct costs are subtracted. Thus, gross profit is just revenue minus costs.
gp_margin must then be gross profit margin. This is another financial concept calculated as revenue minus costs, which equals gross_profit, divided by revenue. Gross profit margin represents the percentage of profit for each dollar of revenue. It is an important metric for measuring the profitability of the overall business and individual product categories. We will use these measures a lot in the class.
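As a quick sanity check of that definition (this check is not part of the original walkthrough), we could compare the stored column to the formula directly:

head(mooc1$gp_margin) #Stored gross profit margin
head((mooc1$revenue - mooc1$costs) / mooc1$revenue) #Should match the stored gp_margin values for most rows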
Note that gross_profit, costs, and gp_margin have several thousand NAs. Using slice_sample() repeatedly, or other methods, we can see that these are abnormal and fairly rare transactions like prepaid fuel, money orders, coupons, etc.
units appears to list how many units of that product were sold. The summary() function reveals a lot of information about each of the columns.
## unique_id transaction_id unformatted_date
## Min. : 6 20170313|254|2|1|2067734: 3 2017-11-25: 1404
## 1st Qu.: 728240 20181207|877|2|2|4894428: 3 2017-10-14: 1386
## Median :1455571 20190502|381|2|2|4595573: 3 2017-11-11: 1350
## Mean :1455677 20161008|248|2|1|1726302: 2 2017-09-02: 1343
## 3rd Qu.:2184373 20161008|372|2|1|1625787: 2 2017-08-26: 1341
## Max. :2909475 20161008|520|3|1|3537099: 2 2017-11-16: 1338
## (Other) :1049985 (Other) :1041838
## customer_id product_id product_name category_id
## Min. : 1 Min. : 2 G87E10S : 96450 Min. : 1.0
## 1st Qu.: 3349 1st Qu.: 4036 LARGE FTN : 29811 1st Qu.: 58.0
## Median : 5436 Median : 9103 MEDIUM FTN : 28475 Median :158.0
## Mean : 5106 Mean : 9728 POP DEP SINGLE: 24959 Mean :161.2
## 3rd Qu.: 6672 3rd Qu.:16289 No Lead Plus : 22137 3rd Qu.:255.0
## Max. :13336 Max. :19544 PrepayFuel : 18410 Max. :371.0
## NA's :782896 (Other) :829758
## category_name parent_id
## Fuel Premuim :204294 Min. : 2.0
## Cold Bev-retail (311) : 72455 1st Qu.: 78.0
## Pk Bev-csd/pop (71) : 58697 Median :185.0
## Pk Bev-energy (carbonated) Fs (79) : 44895 Mean :175.4
## Gaming-lottery (scratch Tickt) (272): 34638 3rd Qu.:244.0
## Cig-off Inv-premium (441) : 33862 Max. :372.0
## (Other) :601159
## parent_name product_count site_id
## Fuel :204294 Min. : 1.0 Min. : 2.0
## Cold Dispensed Beverage: 84244 1st Qu.: 37.0 1st Qu.:111.0
## Juice/tonics : 65673 Median : 68.0 Median :222.0
## Pop (587) : 58697 Mean : 185.1 Mean :231.2
## Energy carbonated : 45005 3rd Qu.: 230.0 3rd Qu.:336.0
## Candy : 41573 Max. :1575.0 Max. :485.0
## (Other) :550514
## site_name address
## 561 Gardendale : 13729 890 Odum Road : 13729
## 670 Alamosa : 12033 505 3Rd St : 12033
## 391 Hamilton : 10642 1706 Military Street South: 10642
## 158 Russellville: 10594 13675 Hwy 43 : 10594
## 497 Patton : 10234 1000 State Route 51 : 10234
## 370 Dunning : 10174 403 Jewett Ave : 10174
## (Other) :982594 (Other) :982594
## city zip latitude longitude
## Columbia : 28359 Min. :35005 Min. :33.06 Min. :-108.47
## Aurora : 23678 1st Qu.:50274 1st Qu.:36.53 1st Qu.: -98.44
## Rogers : 16854 Median :64769 Median :39.83 Median : -94.76
## Commerce City: 15062 Mean :59806 Mean :39.37 Mean : -96.42
## Gardendale : 13729 3rd Qu.:73040 3rd Qu.:41.86 3rd Qu.: -92.41
## Mobile : 12818 Max. :82718 Max. :46.65 Max. : -90.01
## (Other) :939500
## site_status revenue gross_profit costs
## ACTIVE:986019 Min. : -1680 Min. :-169.320 Min. :-475.610
## CLOSED: 63981 1st Qu.: 1 1st Qu.: 0.639 1st Qu.: 0.690
## Median : 2 Median : 1.055 Median : 1.286
## Mean : 41 Mean : 1.693 Mean : 6.417
## 3rd Qu.: 7 3rd Qu.: 1.840 3rd Qu.: 5.219
## Max. :4500000 Max. : 500.000 Max. : 949.050
## NA's :21599 NA's :21599
## units gp_margin
## Min. :-200.00 Min. :-2859.000
## 1st Qu.: 1.00 1st Qu.: 0.128
## Median : 1.00 Median : 0.349
## Mean : 3.44 Mean : -0.416
## 3rd Qu.: 2.00 3rd Qu.: 0.542
## Max. : 999.00 Max. : 52.639
## NA's :61003 NA's :21599
From the output of the summary() function we can see that customer_id has over 700,000 NAs.
The output also shows that some transaction_ids are repeated two or three times. This means that some transactions must include more than one product.
The most frequently purchased product is “G87E10S”. The next most common product sold is a large fountain drink.
The most commonly purchased categories are “Fuel Premuim” and “Cold Bev-retail (311)”.
The store with the most transactions is “561 Gardendale”, located at 890 Odum Road.
The city with the most purchases in our dataset is Columbia.
Since the median for units is 1 and the mean is higher, it is likely that most products are sold as 1 unit, but that a few products have significantly more than one unit. Just looking at the head() or tail() output seems to verify this: fuel has multiple units every time, which would be gallons of fuel, while soda and other items are mainly 1 unit. Additionally, there are a few missing values for units. While the head and tail functions don’t provide any examples, using slice_sample() shows that prepaid fuel and “pop deposits” appear to be the reason. A pop deposit likely means the customer is returning a glass pop/soda bottle and receiving a refund.
There are no missing values in the revenue column, but there are some negative values. Examining the head() and tail() output does not reveal much about the negative amounts, so let’s use the function slice_sample(). This displays a random sample of the data for the number of rows we specify; let’s look at a sample of 100. You may have to rerun this to get a different sample, but eventually you will see that lottery payouts, cigarette discounts, and store discounts seem to be the primary reasons for negative revenue.
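The sampling step is not shown above; a sketch, assuming the tidyverse is loaded:

mooc1 %>% slice_sample(n = 100) #Draw a random sample of 100 rows; rerun to get a different sample
mooc1 %>% filter(revenue < 0) %>% slice_sample(n = 100) #Or sample only the negative-revenue rows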
Next, we will open the .rds file. An .rds file is an R data file; this one is just the dataframe from the last video saved in rds format. Saving it in rds format is useful because it takes up less room on your hard drive and preserves all of the column formats.
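The read step would look something like the line below; the file name here is a placeholder for whatever the saved file is actually called:

mooc1 <- read_rds("mooc1_saved.rds") #Hypothetical file name; read the saved .rds file back into R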
Previously, we examined each column of our dataset called “mooc1”. In this section, we will clean, or tidy, that dataset. There are multiple ways that a dataset can be messy, including but not limited to the following:
* Columns that don’t have names in the header
* Multiple variables stored in one column
* Variables stored in both rows and columns
* Outliers, which are extreme values, either very large or very small
* Mistakes
* Columns that have the wrong format or data type
* Missing values
* Missing columns, i.e., the need to join or merge additional data into the current dataset
Relatively speaking, our data is quite tidy. For example, each row is already one observation and each column is already one variable. However, there are a few things that we can tidy up.
Let’s start with fixing column data formats. In particular, we want to turn the unformatted_date column into a date object class so that we can perform operations on it.
To do this conversion, we will use functions from the package lubridate. Thus, first, load the lubridate package.
## Warning: package 'lubridate' was built under R version 4.0.5
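For reference, the load call that produced the warning above is simply:

library(lubridate) #Load lubridate for working with dates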
The first step in changing unformatted_date to a date object is to see what format the date is in. We can do this very easily by looking at the “Environment” tab or by using the str() function. To look at just one column from a dataframe, we put the name of the dataframe first, followed by a dollar sign, followed by the name of the column. Let’s use str() and head() on this column; we can place them both in the same code block and run them together.
While this is not the prettiest output, these two functions show that unformatted_date has a four-digit year followed by a dash, a two-digit month followed by a dash, and then a two-digit day.
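The str() call itself is not shown above; it would presumably be the line below (the head() call and its output appear a few lines further down):

str(mooc1$unformatted_date) #Check the structure of just this one column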
## Factor w/ 1095 levels "2017-01-01","2017-01-02",..: 557 212 1062 293 855 1071 170 978 245 539 ...
The unformatted_date column is currently stored as a factor (categorical data), not as a date.
head(mooc1$unformatted_date, n=100) #Check out the first 100 entries in the *unformatted_date* column.
## [1] 2018-07-11 2017-07-31 2019-11-28 2017-10-20 2019-05-05 2019-12-07
## [7] 2017-06-19 2019-09-05 2017-09-02 2018-06-23 2019-05-10 2019-08-01
## [13] 2017-07-30 2019-07-28 2017-08-29 2017-10-25 2019-08-21 2019-12-07
## [19] 2018-10-10 2019-04-30 2017-01-15 2018-08-21 2019-01-01 2019-08-08
## [25] 2017-09-16 2018-08-29 2019-11-06 2019-03-05 2017-10-26 2018-11-08
## [31] 2017-03-31 2017-03-14 2017-01-20 2018-03-25 2019-09-03 2017-06-08
## [37] 2018-11-07 2018-01-28 2017-12-04 2018-04-02 2018-06-23 2017-05-26
## [43] 2018-01-10 2019-11-16 2018-03-26 2017-04-18 2017-02-28 2018-08-23
## [49] 2017-04-22 2018-02-07 2017-11-07 2019-05-01 2019-02-24 2017-01-19
## [55] 2018-07-06 2018-11-08 2018-08-07 2019-01-29 2018-03-13 2017-07-12
## [61] 2019-01-25 2017-05-29 2019-02-27 2019-09-09 2018-02-09 2018-07-24
## [67] 2018-04-30 2019-01-01 2017-10-19 2017-01-19 2018-10-03 2018-04-13
## [73] 2019-03-22 2018-12-06 2019-09-07 2017-09-20 2017-06-07 2019-11-19
## [79] 2018-08-24 2017-07-10 2019-11-24 2019-12-01 2019-06-16 2018-01-18
## [85] 2017-06-27 2017-12-09 2017-09-10 2019-12-10 2018-07-09 2019-08-27
## [91] 2018-09-04 2018-08-07 2018-06-20 2017-07-17 2017-09-11 2017-12-23
## [97] 2018-08-11 2017-09-29 2018-12-01 2017-10-06
## 1095 Levels: 2017-01-01 2017-01-02 2017-01-03 2017-01-04 ... 2019-12-31
Lubridate allows us to easily convert the date once we know the text layout, using the correct function that matches that text layout. In this case, the correct function is ymd()—which we might guess stands for year, month, and day. Other examples would be mdy() for month-day-year, dmy() day-month-year, etc. Using this function, we can create a new column called date using the “$” symbol.
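A sketch of that conversion, assuming the new column is named date as described:

mooc1$date <- ymd(mooc1$unformatted_date) #Convert the factor to a proper Date column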
Next, let’s create two more columns: a year column and a month column. Once we have created the formatted date column, we can easily create these. To create a column for the year, we just use the year() function. Looking at the Environment tab, we can see that year has been created and is a number.
Next, create the month column. This is done the same way, except that we have the option to see the labels of the months (Jan, Feb, Mar) instead of the numbers, so take that option with label=TRUE.
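The two new columns would be created along these lines:

mooc1$year <- year(mooc1$date) #Extract the year as a number
mooc1$month <- month(mooc1$date, label = TRUE) #Extract the month as a labeled factor (Jan, Feb, ...)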
Finally, let’s look at the data to make sure these new columns worked. It is always a good idea to look at the data after manipulating it, to make sure the code did what we thought it was going to do. We can run the slice_sample() function to do this.
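The check itself might look like this:

slice_sample(mooc1, n = 10) #View a random sample of rows to confirm date, year, and month were created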
## unique_id transaction_id unformatted_date customer_id product_id
## 1 550490 20170822|222|1|1|2717796 2017-11-15 4064.10 16008
## 2 879117 20180807|251|3|2|2247743 2018-10-31 NA 36
## 3 2015272 20170825|961|2|1|2214387 2017-11-18 NA 18973
## 4 2787054 20190130|535|1|1|3694151 2019-04-25 10015.11 10468
## 5 2382569 20190706|453|1|1|5216393 2019-09-29 NA 18234
## 6 2601362 20180727|146|2|2|6017393 2018-10-20 NA 2179
## 7 1426595 20171226|561|5|4|7007729 2018-03-21 NA 3689
## 8 2811140 20180906|382|2|2|3706799 2018-11-30 6543.66 5365
## 9 39048 20171028|466|3|2|2792184 2018-01-21 NA 3309
## 10 611742 20170329|50|2|1|2831983 2017-06-22 NA 12300
## product_name category_id category_name
## 1 GOLD PEAK LMNDE TEA 18.5OZ 251 Iced Tea
## 2 PAYDAY 3.4OZ KS 267 Healthy Candy
## 3 G85E10S 1 Fuel Premuim
## 4 X-LARGE FTN 152 Cold Bev-retail (311)
## 5 LOTTERY WIN 135 Lottery-win (270)
## 6 HOT DOG FCS 280 Roller Grill Food
## 7 FUNYUN REG 2.375OZ 53 Salty Snacks -tort/corn Chips (152)
## 8 LARGE COFFEE 145 Hot Disp Bev-retail (301)
## 9 L&M 100 BX 207 Cig-off Inv-discount (442)
## 10 SKOAL LC WNTRGRN RL 334 Tobacco-round Can Snuff (960)
## parent_id parent_name product_count site_id site_name
## 1 150 Juice/tonics 166 85 222 Pilot Mound
## 2 331 Candy 230 102 251 Albert City
## 3 78 Fuel 68 467 961 Burlington
## 4 233 Cold Dispensed Beverage 11 280 535 Lowden
## 5 307 Store Services (92) 1 210 453 Butler
## 6 279 Roller Grill 9 33 146 Bella Vista
## 7 5 Salty Snacks 243 296 561 Gardendale
## 8 232 Hot Dispensed Beverage 9 152 382 Gentry
## 9 11 Off Invoice Cigs (104) 27 222 466 Centralia
## 10 321 Smokeless (951) 363 260 50 Andalusia
## address city zip latitude longitude site_status
## 1 102 Pilot St Pilot Mound 50223 42.1620 -94.0176 ACTIVE
## 2 114 Main St Albert City 50510 42.7819 -94.9476 ACTIVE
## 3 211 S Main St Burlington 82411 44.4447 -108.4310 ACTIVE
## 4 308 Washington Ave Lowden 52255 41.8580 -90.9264 ACTIVE
## 5 114 W Ohio St Butler 64730 38.2588 -94.3340 ACTIVE
## 6 1750 Forest Hills Blvd Bella Vista 72715 36.4649 -94.3020 ACTIVE
## 7 890 Odum Road Gardendale 35071 39.8873 -94.8933 ACTIVE
## 8 899 W Main St Gentry 72734 36.2672 -94.4929 ACTIVE
## 9 104 W Sneed St Centralia 65240 39.2103 -92.1384 ACTIVE
## 10 540 West Bypass Andalusia 36420 41.2039 -94.4171 ACTIVE
## revenue gross_profit costs units gp_margin date year month
## 1 3.00 1.00000 2.00000 2.000 0.33333333 2017-11-15 2017 Nov
## 2 1.99 0.91700 1.07300 1.000 0.46080402 2018-10-31 2018 Oct
## 3 5.00 0.63870 4.36130 2.129 0.12774000 2017-11-18 2017 Nov
## 4 1.79 1.79000 0.00000 1.000 1.00000000 2019-04-25 2019 Apr
## 5 -3.00 0.00000 0.00000 3.000 0.00000000 2019-09-29 2019 Sep
## 6 1.34 0.57800 0.76200 1.000 0.43134328 2018-10-20 2018 Oct
## 7 1.35 0.17000 1.18000 1.000 0.12592593 2018-03-21 2018 Mar
## 8 1.00 1.00000 0.00000 1.000 1.00000000 2018-11-30 2018 Nov
## 9 4.17 0.26222 3.90778 1.000 0.06288249 2018-01-21 2018 Jan
## 10 12.68 3.54266 9.13734 2.000 0.27938959 2017-06-22 2017 Jun
We will now load the second .rds file.
When tidying your data, it’s important to make sure there are very few mistakes, outliers, and missing data. Let’s focus on one variable, revenue, since revenue is critical to any business, and check for and deal with these issues. First, let’s make sure that revenue has no missing values. We can check that with the summary() function.
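The call behind the output below is presumably just:

summary(mooc1$revenue) #Five-number summary plus the mean; any NAs would be reported here too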
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1680 1 2 41 7 4500000
There are no missing values, so let’s check for any mistakes. One type of mistake would be if there are non-numbers in this column, since we would expect revenue to be a number. We can check this with the str() function or by looking in the Environment window in RStudio.
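For reference, that call would be:

str(mooc1$revenue) #Confirm the column is numeric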
## num [1:1050000] 1.69 11.96 -5 1 2.79 ...
Luckily, everything is a number here.
Next, let’s see if there are any really high or low values, any outliers. First, let’s check for extreme values using the summary() function. So, let’s look back up at the summary() function results.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1680 1 2 41 7 4500000
This does show some potential problems. First, there are some negative amounts. That seems strange for revenue. However, we previously learned that lottery payouts and cigarette discounts seem to be the primary reason for negative revenue. That makes sense and does not seem like a mistake.
However, a second problem is that there seem to be some really high revenue numbers. First, the mean is much higher than the median, which indicates that a relatively few observations are really high. Second, the maximum revenue is extremely high. It seems pretty unlikely that any individual product, even if many units are bought, is going to generate revenue of 4,500,000!
So, we need to investigate if these high amounts are errors. Again, there are many ways to do this, but one way is just to sort the dataset on revenue with the highest revenue coming first. So, there are really three things we need to do.
First, we need to tell R which dataframe we are working with. Second, we need to tell R to sort this dataframe with the highest revenue coming first. Third, we want to just look at the first 25 rows.
While we could do this all in one statement with a bunch of nested parentheses, let’s start using the tidyverse to make things a little easier; it is really useful for working with dataframes. Here we are going to use the pipe symbol, %>%, which allows us to chain different tasks together like a train, instead of putting everything into one statement.
So, first, we say which dataframe we are working on. Next, we use the arrange() function with the desc() function nested inside to get descending order. Finally, we run the head() function with n=25. We could have done this all as one statement without using the %>% symbol, but that is sometimes more confusing.
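A sketch of that chain:

mooc1 %>% arrange(desc(revenue)) %>% head(n = 25) #Sort by revenue, highest first, and view the top 25 rows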
## unique_id transaction_id unformatted_date customer_id product_id
## 1 206905 20170919|840|3|1|3439349 2017-12-13 NA 17628
## 2 90931 20171017|865|1|1|3104362 2018-01-10 NA 17628
## 3 1613713 20180318|878|2|2|4048051 2018-06-11 NA 17628
## 4 159545 20180119|840|2|1|3617466 2018-04-14 NA 17628
## 5 232546 20180129|873|2|1|2472746 2018-04-24 5219.362 17628
## 6 442516 20171223|838|3|2|3043467 2018-03-18 NA 17628
## 7 197375 20171227|991|1|1|2144611 2018-03-22 NA 17628
## 8 1625557 20180219|881|5|1|6265910 2018-05-15 NA 17628
## 9 1635069 20170719|881|2|2|5624176 2017-10-12 NA 17628
## 10 1598185 20170901|863|1|1|4618493 2017-11-25 NA 17628
## 11 471717 20170904|876|1|2|2928779 2017-11-27 3820.074 17628
## 12 347583 20170714|860|1|2|3262062 2017-10-07 NA 17628
## 13 293589 20180307|991|1|1|2191536 2018-05-31 NA 17628
## 14 88717 20170720|981|3|1|2402113 2017-10-13 NA 17628
## 15 52783 20180202|873|2|2|2476587 2018-04-28 6033.016 17628
## 16 1598932 20170826|863|2|2|4604148 2017-11-19 5500.852 17628
## 17 1656405 20171025|887|1|2|4632713 2018-01-18 NA 17628
## 18 253500 20180119|840|1|1|3616496 2018-04-14 NA 17628
## 19 188498 20170916|840|2|1|3434192 2017-12-10 6600.666 17628
## 20 1854890 20180317|1837|3|3|1309896 2018-06-10 NA 17628
## 21 1624887 20171207|881|2|1|6074119 2018-03-02 NA 17628
## 22 1634113 20170714|881|3|1|5606866 2017-10-07 NA 17628
## 23 290645 20171229|834|3|1|3147892 2018-03-24 NA 17628
## 24 1859702 20171020|1837|1|1|931408 2018-01-13 NA 17628
## 25 316138 20170919|876|2|2|2953468 2017-12-13 NA 17628
## product_name category_id category_name
## 1 OK LUCKY RED 7S 467 138 Gaming-lottery (scratch Tickt) (272)
## 2 OK LUCKY RED 7S 467 138 Gaming-lottery (scratch Tickt) (272)
## 3 OK LUCKY RED 7S 467 138 Gaming-lottery (scratch Tickt) (272)
## 4 OK LUCKY RED 7S 467 138 Gaming-lottery (scratch Tickt) (272)
## 5 OK LUCKY RED 7S 467 138 Gaming-lottery (scratch Tickt) (272)
## 6 OK LUCKY RED 7S 467 138 Gaming-lottery (scratch Tickt) (272)
## 7 OK LUCKY RED 7S 467 138 Gaming-lottery (scratch Tickt) (272)
## 8 OK LUCKY RED 7S 467 138 Gaming-lottery (scratch Tickt) (272)
## 9 OK LUCKY RED 7S 467 138 Gaming-lottery (scratch Tickt) (272)
## 10 OK LUCKY RED 7S 467 138 Gaming-lottery (scratch Tickt) (272)
## 11 OK LUCKY RED 7S 467 138 Gaming-lottery (scratch Tickt) (272)
## 12 OK LUCKY RED 7S 467 138 Gaming-lottery (scratch Tickt) (272)
## 13 OK LUCKY RED 7S 467 138 Gaming-lottery (scratch Tickt) (272)
## 14 OK LUCKY RED 7S 467 138 Gaming-lottery (scratch Tickt) (272)
## 15 OK LUCKY RED 7S 467 138 Gaming-lottery (scratch Tickt) (272)
## 16 OK LUCKY RED 7S 467 138 Gaming-lottery (scratch Tickt) (272)
## 17 OK LUCKY RED 7S 467 138 Gaming-lottery (scratch Tickt) (272)
## 18 OK LUCKY RED 7S 467 138 Gaming-lottery (scratch Tickt) (272)
## 19 OK LUCKY RED 7S 467 138 Gaming-lottery (scratch Tickt) (272)
## 20 OK LUCKY RED 7S 467 138 Gaming-lottery (scratch Tickt) (272)
## 21 OK LUCKY RED 7S 467 138 Gaming-lottery (scratch Tickt) (272)
## 22 OK LUCKY RED 7S 467 138 Gaming-lottery (scratch Tickt) (272)
## 23 OK LUCKY RED 7S 467 138 Gaming-lottery (scratch Tickt) (272)
## 24 OK LUCKY RED 7S 467 138 Gaming-lottery (scratch Tickt) (272)
## 25 OK LUCKY RED 7S 467 138 Gaming-lottery (scratch Tickt) (272)
## parent_id parent_name product_count site_id site_name
## 1 175 Lottery (360) 1575 382 840 Albany
## 2 175 Lottery (360) 1575 399 865 Lane
## 3 175 Lottery (360) 1575 410 878 Forgan
## 4 175 Lottery (360) 1575 382 840 Albany
## 5 175 Lottery (360) 1575 405 873 Turpin
## 6 175 Lottery (360) 1575 380 838 Oxford
## 7 175 Lottery (360) 1575 485 991 Binger
## 8 175 Lottery (360) 1575 413 881 Erick
## 9 175 Lottery (360) 1575 413 881 Erick
## 10 175 Lottery (360) 1575 397 863 Atoka
## 11 175 Lottery (360) 1575 408 876 Bokchito
## 12 175 Lottery (360) 1575 394 860 Cherokee
## 13 175 Lottery (360) 1575 485 991 Binger
## 14 175 Lottery (360) 1575 480 981 Fort Cobb
## 15 175 Lottery (360) 1575 405 873 Turpin
## 16 175 Lottery (360) 1575 397 863 Atoka
## 17 175 Lottery (360) 1575 418 887 Greenfield
## 18 175 Lottery (360) 1575 382 840 Albany
## 19 175 Lottery (360) 1575 382 840 Albany
## 20 175 Lottery (360) 1575 55 1837 Winfield
## 21 175 Lottery (360) 1575 413 881 Erick
## 22 175 Lottery (360) 1575 413 881 Erick
## 23 175 Lottery (360) 1575 377 834 Burlington
## 24 175 Lottery (360) 1575 55 1837 Winfield
## 25 175 Lottery (360) 1575 408 876 Bokchito
## address city zip latitude longitude site_status revenue
## 1 109 N Wade St Albany 74721 33.8818 -96.1620 CLOSED 4500000
## 2 11820 E Highway 3 Lane 74555 34.2980 -95.9867 ACTIVE 3000000
## 3 115 S Broadway St Forgan 73938 36.9063 -100.5350 ACTIVE 2000000
## 4 109 N Wade St Albany 74721 33.8818 -96.1620 CLOSED 2000000
## 5 419 S Main St Turpin 73950 36.8634 -100.8780 ACTIVE 2000000
## 6 92 Plaza Lane Oxford 36203 35.9938 -94.5681 ACTIVE 2000000
## 7 224 W Main St Binger 73009 35.3086 -98.3442 CLOSED 1000000
## 8 119 W 3Rd St Erick 73645 35.2141 -99.8677 ACTIVE 1000000
## 9 119 W 3Rd St Erick 73645 35.2141 -99.8677 ACTIVE 1000000
## 10 409 E Court St Atoka 74525 34.3854 -96.1240 ACTIVE 1000000
## 11 118 W Main St Bokchito 74726 34.0169 -96.1456 ACTIVE 1000000
## 12 1780 S Grand Ave Cherokee 73728 36.7396 -98.3596 ACTIVE 500000
## 13 224 W Main St Binger 73009 35.3086 -98.3442 CLOSED 500000
## 14 408 N Hazlett St Fort Cobb 73038 35.0990 -98.4358 ACTIVE 500000
## 15 419 S Main St Turpin 73950 36.8634 -100.8780 ACTIVE 500000
## 16 409 E Court St Atoka 74525 34.3854 -96.1240 ACTIVE 500000
## 17 100 Madison St Greenfield 73043 35.7276 -98.3789 ACTIVE 500000
## 18 109 N Wade St Albany 74721 33.8818 -96.1620 CLOSED 500000
## 19 109 N Wade St Albany 74721 33.8818 -96.1620 CLOSED 500000
## 20 2575 Us Hwy 43 Winfield 35594 36.5326 -98.1534 ACTIVE 500000
## 21 119 W 3Rd St Erick 73645 35.2141 -99.8677 ACTIVE 500000
## 22 119 W 3Rd St Erick 73645 35.2141 -99.8677 ACTIVE 500000
## 23 513 Main St Burlington 73722 36.9003 -98.4228 ACTIVE 500000
## 24 2575 Us Hwy 43 Winfield 35594 36.5326 -98.1534 ACTIVE 500000
## 25 118 W Main St Bokchito 74726 34.0169 -96.1456 ACTIVE 500000
## gross_profit costs units gp_margin date year month
## 1 2.7 42.3 9 0.06 2017-12-13 2017 Dec
## 2 1.8 28.2 6 0.06 2018-01-10 2018 Jan
## 3 1.2 18.8 4 0.06 2018-06-11 2018 Jun
## 4 1.2 18.8 4 0.06 2018-04-14 2018 Apr
## 5 1.2 18.8 4 0.06 2018-04-24 2018 Apr
## 6 1.2 18.8 4 0.06 2018-03-18 2018 Mar
## 7 0.6 9.4 2 0.06 2018-03-22 2018 Mar
## 8 0.6 9.4 2 0.06 2018-05-15 2018 May
## 9 0.6 9.4 2 0.06 2017-10-12 2017 Oct
## 10 0.6 9.4 2 0.06 2017-11-25 2017 Nov
## 11 0.6 9.4 2 0.06 2017-11-27 2017 Nov
## 12 0.3 4.7 1 0.06 2017-10-07 2017 Oct
## 13 0.3 4.7 1 0.06 2018-05-31 2018 May
## 14 0.3 4.7 1 0.06 2017-10-13 2017 Oct
## 15 0.3 4.7 1 0.06 2018-04-28 2018 Apr
## 16 0.3 4.7 1 0.06 2017-11-19 2017 Nov
## 17 0.3 4.7 1 0.06 2018-01-18 2018 Jan
## 18 0.3 4.7 1 0.06 2018-04-14 2018 Apr
## 19 0.3 4.7 1 0.06 2017-12-10 2017 Dec
## 20 0.3 4.7 1 0.06 2018-06-10 2018 Jun
## 21 0.3 4.7 1 0.06 2018-03-02 2018 Mar
## 22 0.3 4.7 1 0.06 2017-10-07 2017 Oct
## 23 0.3 4.7 1 0.06 2018-03-24 2018 Mar
## 24 0.3 4.7 1 0.06 2018-01-13 2018 Jan
## 25 0.3 4.7 1 0.06 2017-12-13 2017 Dec
Looking at these high-revenue rows, the first few rows show that there is clearly a problem with product number 17628, the product called “OK LUCKY RED 7S 467”. Looking at the category name and the parent name, we see that these are a particular type of scratch-off lottery ticket. These products make up all 25 of the highest-revenue rows, and the revenue does not match up with the profit or the costs. So, clearly a mistake has been made.
So, next, we need to check if all of these particular products have bad revenue. Let’s create a new dataframe so we can easily see all of them. Let’s name this dataframe “outliers.”
First, we name the new dataframe. Next, we use the assignment operator. Then, we tell R to use mooc1 to create this dataframe, since mooc1 is the dataframe we are taking the new dataframe from. After that, we use the handy filter() function. Filter is a really useful function from the tidyverse: it selects only the rows of the dataframe that meet the condition we specify. Let’s pick product_id == 17628, which, of course, is the product id that belongs to “OK LUCKY RED 7S 467”. Notice that we need to use two equals signs here. This tells R that we want to test whether product_id equals 17628, not to assign that number. That is all we need to do to create this dataframe.
outliers <- mooc1 %>% filter(product_id == 17628) #Create a new dataframe to look at all
#"OK LUCKY RED 7S 467" rows
So, finally, we will need to eliminate these outliers, since it is unclear what mistake was made here and these revenue amounts do not represent reality. There are several ways to do this, but for our purposes, let’s just delete all of these rows. We can use filter() again to do this. First, let’s call the dataframe something new, since we are eliminating some data; this is just good practice in case we make a mistake. Second, use the filter function, just the way we did before, but this time with the not-equal operator, !=.
mooc1_clean <- mooc1 %>% filter(product_id != 17628) #Create a new dataframe with the
#"OK LUCKY RED 7S 467" rows removed
Finally, let’s check the distribution of the revenue column now using the summary() function again.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1680.00 1.49 2.49 7.75 6.51 999.00
This seems much more reasonable. If we were to examine the highest revenue now, we would see that the highest amounts are purchases of hundreds of gallons of diesel fuel and hundreds of lottery tickets, which makes sense.
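That follow-up check is not shown here, but it would look something like:

mooc1_clean %>% arrange(desc(revenue)) %>% head(n = 10) #The highest revenues should now be large fuel and lottery purchases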
Let’s load the final .rds file and a dataset with the necessary state information for the US.
Another really important task in ETL is to merge or join data together. For some reason, we have no states in our dataset. We have cities and zip codes but not states. It would be nice to add that data. Luckily, we have created a state database that you can use that has all of the zip codes listed in this dataset, along with the states. Let’s open this dataset and join it to our other dataframe.
Let’s use the functions str(), summary(), head(), and tail() to examine this new dataframe.
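Written out, and assuming the new dataframe is named mooc1_states (the name used later in this section), the calls would be:

str(mooc1_states) #Structure and column types
summary(mooc1_states) #Descriptive statistics for each column
head(mooc1_states, n = 10) #First 10 rows
tail(mooc1_states, n = 10) #Last 10 rows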
## 'data.frame': 442 obs. of 4 variables:
## $ postal_code : chr "36420" "35611" "35954" "36830" ...
## $ state_province : chr "Alabama" "Alabama" "Alabama" "Alabama" ...
## $ state_province_code: chr "AL" "AL" "AL" "AL" ...
## $ country_name : chr "USA" "USA" "USA" "USA" ...
## postal_code state_province state_province_code country_name
## Length:442 Length:442 Length:442 Length:442
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## postal_code state_province state_province_code country_name
## 1 36420 Alabama AL USA
## 2 35611 Alabama AL USA
## 3 35954 Alabama AL USA
## 4 36830 Alabama AL USA
## 5 36507 Alabama AL USA
## 6 35022 Alabama AL USA
## 7 35209 Alabama AL USA
## 8 35210 Alabama AL USA
## 9 35235 Alabama AL USA
## 10 35206 Alabama AL USA
## postal_code state_province state_province_code country_name
## 433 36206 Alabama AL USA
## 434 35016 Alabama AL USA
## 435 51333-7917 Iowa IA USA
## 436 72633-9998 Arkansas AR USA
## 437 80614-9990 Colorado CO USA
## 438 80229-9998 Colorado CO USA
## 439 80015-9998 Colorado CO USA
## 440 80017-9998 Colorado CO USA
## 441 80113-9997 Colorado CO USA
## 442 80026-9998 Colorado CO USA
We notice a few things here.
postal_code: the first column appears to contain zip codes. There are two problems here. First, this column is called postal_code and not zip. We don’t have a postal_code column in our mooc1 dataframe; instead we have a column called zip. This matters because in order to join our two tables and get states into our dataframe, we need one column to link them, that is, one column that is the same in both tables. The only thing the two tables have in common is zip code. It looks like these two columns represent the same thing, zip codes, but we need to account for the different name. Second, this column is in a different format from the zip column: the column in the states dataframe is a character variable, while zip is an integer. So, we need to fix that as well.
state_province: this is the column we want to put into our mooc1 dataframe.
state_province_code: this seems to be redundant with state_province, so we could bring it in, but we don’t really have to.
country_name: this column is useless since every value is USA. We could confirm this by turning it into a factor, using the str() function, or looking at the Environment tab, but we won’t do that here.
So, our first step is to put the postal code column into the same format as the zip column, which is integer (number would work as well). To do this we will use the as.integer() function. First, we tell R which column we are going to create. Then, we apply the function to the old column. To be safe, let’s write the result to a new column called postal_code2.
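A sketch of that conversion (postal_code2 is the column name used in the next step):

mooc1_states$postal_code2 <- as.integer(mooc1_states$postal_code) #Try to convert the character zip codes to integers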
## Warning: NAs introduced by coercion
But wait, we get a warning message that NAs were introduced by coercion. Let’s use the summary() function to look at the three columns: postal_code, postal_code2, and zip.
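Those three checks would be something like:

summary(mooc1_states$postal_code) #Still a character column
summary(mooc1_states$postal_code2) #The new integer column, with NAs
summary(mooc1$zip) #The zip column we want to match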
## Length Class Mode
## 442 character character
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 35005 35215 35806 35832 36330 36867 345
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 35005 50274 64769 59806 73040 82718
We see that, indeed, over 300 NAs were created. What is going on? If we scroll back up to our original view of the mooc1_states dataframe using the tail() function, we can see why R made the column a character type in the first place: some zip codes include a dash and four extra digits, and R cannot coerce those values to numbers, so it produced NAs. If we look at mooc1 using summary(), we don’t see those in zip; in fact, all of its values are five-digit numbers. The extra four digits are called “plus-four codes” and are used to make zip codes more specific. We do not need them, and mooc1 does not have them.
So, it looks like we are going to need to take off the dash and the extra digits. Let’s do that with the str_sub() function. This function subsets a string to just the characters we want; since we only want the first through fifth characters, we pass 1 and 5. Run the tail() function and compare postal_code and postal_code2 to see that this worked.
mooc1_states$postal_code2 <- mooc1_states$postal_code %>% str_sub(1,5) #Keep only the first five characters
#of postal_code in the states dataframe
tail(mooc1_states, n=10)
## postal_code state_province state_province_code country_name postal_code2
## 433 36206 Alabama AL USA 36206
## 434 35016 Alabama AL USA 35016
## 435 51333-7917 Iowa IA USA 51333
## 436 72633-9998 Arkansas AR USA 72633
## 437 80614-9990 Colorado CO USA 80614
## 438 80229-9998 Colorado CO USA 80229
## 439 80015-9998 Colorado CO USA 80015
## 440 80017-9998 Colorado CO USA 80017
## 441 80113-9997 Colorado CO USA 80113
## 442 80026-9998 Colorado CO USA 80026
Sure enough, it looks like it did, and we can move on.
Now that we have only the five digits in postal_code2, we can retry the as.integer() function. After doing that, let’s run the summary() function to make sure it worked. Ok, it did: both zip and postal_code2 are now in an integer format.
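A sketch of those two steps:

mooc1_states$postal_code2 <- as.integer(mooc1_states$postal_code2) #Retry the conversion now that the plus-four suffixes are gone
summary(mooc1_states) #Confirm postal_code2 is now numeric with no NAs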
## postal_code state_province state_province_code country_name
## Length:442 Length:442 Length:442 Length:442
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## postal_code2
## Min. :35005
## 1st Qu.:50408
## Median :58063
## Mean :58656
## 3rd Qu.:72943
## Max. :82725
Next, let’s change the name of the postal_code2 column to zip. We could handle the different names through the join function itself, which we will use next, but it will be good practice to do it this way. This is done quite easily with the rename() function from the tidyverse. First, we define a new dataframe; it is probably safe to use the old dataframe name here, since this doesn’t change the data in any way. Next, we tell R which dataframe we are manipulating. Then we use the rename() function with the new name first.
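A sketch of that rename (the new name goes on the left, the old name on the right):

mooc1_states <- mooc1_states %>% rename(zip = postal_code2) #Rename postal_code2 to zip so it matches mooc1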
Now, let’s clean up the dataframe before we merge it. First, let’s drop postal_code and country_name since we don’t need them. To do this, let’s introduce another really useful function from Tidyverse, the select() function. This function selects which columns we want and don’t want for a new dataframe. First, give a name to the new dataframe. We can use mooc1_states and rewrite it. Then, tell R which dataframe to manipulate, in this case, mooc1_states, of course.
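One way to drop those two columns, sketched with select():

mooc1_states <- mooc1_states %>% select(-postal_code, -country_name) #Drop the columns we no longer need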
Now, let’s use the select() function to reorder the columns and drop state_province_code. First, just put them in the order you want inside the select() function. Then, just omit the ones you don’t want. Thus, we drop state_province_code just by not mentioning it in the select() function.
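And the reorder-and-drop step might look like:

mooc1_states <- mooc1_states %>% select(zip, state_province) #Keep only zip and state_province, in that order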
Finally, we get to merge our two data frames. Merging or joining is really tricky business. It is really important to think about what you want before merging, and then to make sure you get what you want after merging. One key way to do this is to 1) know the level of aggregation in your data frames and 2) check the number of rows of your dataframes before and after you merge, to make sure you did what you thought you did.
What does “level of aggregation” mean? By this we just mean what each row of your data represents. For example, each row of mooc1 represents one or more of the same product from one transaction. Thus, the level of aggregation is “product-transaction”. On the other hand, each row in mooc1_states is a unique zip code. So, the level of aggregation is zip code.
Once we know the level of aggregation of our two dataframes, we can ask the critical question: What level of aggregation do we want when we merge these dataframes? We want to just add state names to rows that already exist in mooc1. So, our final dataframe should have exactly the number of rows that mooc1 has right now, which is 1,049,961. Keep that in mind.
So, that does not sound so difficult; what can go wrong? Well, a lot, but let’s focus on these potential problems:
* First, maybe mooc1_states has duplicate zip codes. If this is true, then the join might create duplicate rows in mooc1. We could check this with n_distinct(), but we already have, and all zip codes are unique.
* The next problem is using the wrong join method. There are multiple ways to join tables, and using the wrong method will cause problems.
* Using the wrong method could add rows that we don’t want, because they are in the state dataframe but not in the mooc1 dataframe. Alternatively, using the wrong method could drop rows from the mooc1 dataframe because they are not in the state dataframe. Remember, we want to keep everything in the mooc1 dataframe and add just the states from the state dataframe.
Here is the documentation from the package dplyr for mutating joins, which are joins that add new columns (https://dplyr.tidyverse.org/reference/mutate-joins.html). The mutating joins add columns from the second dataframe to the first dataframe, matching rows based on the keys supplied in the function. If a row in the first dataframe matches multiple rows in the second dataframe, all the matching rows in the second dataframe will be returned, once for each matching row in the first dataframe. The four types are inner_join(), which keeps only rows that match in both dataframes; left_join(), which keeps all rows in the first dataframe; right_join(), which keeps all rows in the second dataframe; and full_join(), which keeps all rows in both dataframes.
So, which one of these methods do we want? Well, let’s use the wrong one first, just to illustrate why it is important to use the correct join method. Let’s use full_join(), which keeps everything from both dataframes, whether the rows match up or not. First, let’s check the size of mooc1 before and after using dim(). This is how many rows we want to have when we are done.
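The before check would simply be:

dim(mooc1) #Number of rows and columns before the join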
## [1] 1049961 27
We want to add one column to this table, since we are adding state_province. We do not add zip, because we are using zip to merge with mooc1, so the two zip columns will simply become one column. Call the new dataframe mooc1_etl. We use the full_join() function, which takes as arguments the two dataframes and the column we are joining on. Look at the Environment tab: we can see that the state_province column from mooc1_states is now in the new dataframe.
mooc1_etl <- full_join(mooc1, mooc1_states, by='zip') #Join the tables together using `full_join()`
#(the wrong way for our case)
Next, we use the dim() function to check the number of rows of the new table.
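Assuming the joined dataframe is named mooc1_etl as above, the check is:

dim(mooc1_etl) #Check the number of rows and columns after the full join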
## [1] 1050222 28
Unfortunately, we have too many rows. What does this mean, and why did it happen? Well, full_join() keeps all rows in both tables, even if they don’t match, and there must have been some zip codes in the states table that were not in the mooc1 table. This is a problem, and we do not want these extra rows.
Let’s run the summary() function on unique_id.
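The call behind the output below would be:

summary(mooc1_etl$unique_id) #NAs here indicate rows that came only from the states table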
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 6 728269 1455575 1455701 2184398 2909475 261
Sure enough, when we look at the summary of unique_id we see over 200 NAs. These appear because the full join added zip codes from mooc1_states that have no matching rows in the mooc1 table; in other words, there were over 200 zip codes in mooc1_states that were not in mooc1.
So, now let’s do this correctly. Which join type do we actually want? We want left_join(). This keeps all rows from the first dataframe and only the matching rows from the second dataframe. Then, let’s run the dim() function and the summary() function to make sure we get the right number of rows and columns and that we have no NAs in unique_id.
mooc1_etl <- left_join(mooc1, mooc1_states, by='zip') # Using `left_join()`
#(the correct way for our case)
dim(mooc1_etl)
## [1] 1049961 28
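And the companion check on unique_id (same presumed dataframe name):

summary(mooc1_etl$unique_id) #This time there should be no NAs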
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6 728269 1455575 1455701 2184398 2909475
Sure enough, when we do this, we get the same number of rows before and after, one additional column after, and no NAs.
Finally, let’s export this table. We can export the dataframe as an ‘rds’ file that is native to R. Again, this is good practice for several reasons. First, while we could continue to import the file from the csv version every time, this takes longer than an rds file. Second, by exporting the file midway through analysis, we can always come back and read in that file and not have to go back and redo all of the manipulation that we already did. This is not hard to do, but it does waste time. Finally, saving to rds rather than csv preserves the format of the columns. For example, the date column will be correctly formatted as the Date format, and we won’t have to go back and fix that again. We can also compress the file.
write_rds(mooc1_etl, 'mooc1_etl_done.rds', compress = 'gz') #Export mooc1_etl for use later
#while retaining the column formats
This project on ETL covered many of the key skills needed to load, transform, and manipulate data. These skills will serve you well in solving business problems with data.