Many data analysts claim that 80% of their job is extracting data from its native setting, loading it into the analysis tool that will be used, and transforming it so that it can be analyzed. These procedures are variously called ETL (extract, transform, and load), data wrangling, data munging, or just cleaning data, among others. Whether that 80% number is accurate or not, there is no doubt that getting data ready to analyze takes significant time, resources, and expertise. In this project, we will go through some hands-on practice with ETL.
We will use the TECA dataset extensively for ETL. TECA is a fictional company that owns over one hundred and fifty convenience stores and gas stations throughout the middle of the country, from Wyoming to Minnesota and from Arkansas to Alabama, with many states in between. The particular dataset we'll be using is a random selection of customer transactions from these convenience stores for the years 2017-2019. These transactions include, for example, purchases of fuel, lottery tickets, soda, and candy. TECA is a large dataset consisting of 1,050,000 observations/rows and 24 variables/columns.
Next, let’s load the packages we are going to need. We are going to load one package called tidyverse, which is actually a collection of eight packages. To load a package, we call the function library() and put the package name inside the parentheses. If you have never installed the tidyverse, you will need to do that first; I won’t install it here since I already have it installed.
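The calls would look something like the following (the install line is commented out because it only needs to run once per machine):

# install.packages("tidyverse") #Only needed the first time you use the tidyverse
library(tidyverse) #Load the tidyverse packages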
Next, let’s load the TECA dataset. The file is called mooc1_s.csv, and we will read it into a dataframe called mooc1 using the function read.csv(). First, we point to the file. You can put the whole path here, or, if your working directory is already set to where this file is, you can just put in the file name and the extension. My file is in the same folder as this .rmd file, so I can just list its name here. Finally, I also passed the argument stringsAsFactors=TRUE to the function. This is not generally a good idea, but it does help when first examining a dataset, because it converts all the columns with character inputs into factors, i.e., categorical variables. So, let’s run this now. It may take a few minutes since it is a large file.
mooc1 <- read.csv(file = "mooc1_s.csv", stringsAsFactors=TRUE)
dim(mooc1) #dimensions of the dataset
## [1] 1050000 24
We can see this big dataset has 1,050,000 rows and 24 columns.
Now that we have our data loaded, let’s look at it to get a sense of what we have. We will use four functions to examine the data:
* str(mooc1) – shows the structure of the dataframe, including the data type of each column
* summary(mooc1) – gives descriptive statistics for each column
* head(mooc1, n=10) – prints the first 10 rows of the dataframe
* tail(mooc1, n=10) – prints the last 10 rows of the dataframe
In this section we will discuss the different types of data we are going to observe in this particular dataset, along with a few types we will not observe here but that are equally important to know about.
Let’s start with the str() function and learn a bit about the different data formats that R uses.
## 'data.frame': 1050000 obs. of 24 variables:
## $ unique_id : int 1161545 324500 2389914 1441536 1269552 2209421 1901742 2876118 2772808 1978627 ...
## $ transaction_id : Factor w/ 1047815 levels "20161008|101|3|2|898139",..: 538213 203077 1017291 292709 825017 1025331 160752 935694 239659 522233 ...
## $ unformatted_date: Factor w/ 1095 levels "2017-01-01","2017-01-02",..: 557 212 1062 293 855 1071 170 978 245 539 ...
## $ customer_id : num NA NA 5362 NA 7795 ...
## $ product_id : int 3437 6509 18234 9055 10299 18980 3055 5628 10468 5388 ...
## $ product_name : Factor w/ 9230 levels "$1 MEDIUM FTN OWASSO",..: 2463 5992 4623 5340 5464 2941 4858 4307 9140 4937 ...
## $ category_id : int 53 39 135 138 142 1 206 152 152 145 ...
## $ category_name : Factor w/ 224 levels "1% Milk","2% Milk",..: 174 112 121 75 65 74 34 44 44 107 ...
## $ parent_id : int 5 140 307 175 215 78 11 233 233 232 ...
## $ parent_name : Factor w/ 84 levels "Automotive (35)",..: 70 42 77 49 23 28 58 13 13 39 ...
## $ product_count : int 243 52 1 1575 59 68 199 11 11 9 ...
## $ site_id : int 197 131 336 297 257 157 250 210 482 365 ...
## $ site_name : Factor w/ 189 levels "101 Arab","1056 Rockwell City",..: 83 56 136 123 107 70 104 88 188 142 ...
## $ address : Factor w/ 189 levels "1 Short St","100 Community Dr",..: 162 141 32 7 4 2 169 25 112 82 ...
## $ city : Factor w/ 170 levels "Adamsville","Adrian",..: 20 120 83 47 126 24 48 33 9 151 ...
## $ zip : int 36507 68761 81136 65203 63662 72617 80022 64730 73006 50477 ...
## $ latitude : num 40.1 42.1 37.7 39 37.5 ...
## $ longitude : num -92.4 -98 -105.9 -92.3 -90 ...
## $ site_status : Factor w/ 2 levels "ACTIVE","CLOSED": 1 1 1 1 1 1 1 1 1 1 ...
## $ revenue : num 1.69 11.96 -5 1 2.79 ...
## $ gross_profit : num 0.51 4.96 0 0.05 1.41 ...
## $ costs : num 1.18 7 0 0.95 1.38 ...
## $ units : num 1 4 5 1 1 ...
## $ gp_margin : num 0.302 0.415 0 0.05 0.504 ...
Let’s take a look at the first few rows of the dataset with the head() function.
## unique_id transaction_id unformatted_date customer_id product_id
## 1 1161545 20180417|433|2|1|4888093 2018-07-11 NA 3437
## 2 324500 20170507|361|1|2|2198879 2017-07-31 NA 6509
## 3 2389914 20190904|667|3|1|6216813 2019-11-28 5362.030 18234
## 4 1441536 20170727|562|2|2|3827487 2017-10-20 NA 9055
## 5 1269552 20190209|497|2|2|3338196 2019-05-05 7795.258 10299
## 6 2209421 20190913|387|1|100|5532047 2019-12-07 NA 18980
## 7 1901742 20170326|4923|3|1|3310847 2017-06-19 NA 3055
## 8 2876118 20190612|453|2|2|5160330 2019-09-05 1750.850 5628
## 9 2772808 20170609|986|1|1|1953037 2017-09-02 NA 10468
## 10 1978627 20180330|80|3|3|2643450 2018-06-23 NA 5388
## product_name category_id
## 1 DORITO SMPLY WHT CHDDR 2.5OZ 53
## 2 NESTLE TH VNL SAND 6OZ 39
## 3 LOTTERY WIN 135
## 4 MO HAPPY CAMPER 158 138
## 5 MONSTER REHAB ORNGADE 15.5OZ 142
## 6 G87E10S 1
## 7 MARLB SPCL BLND BLK BX 206
## 8 LARGE FTN 152
## 9 X-LARGE FTN 152
## 10 MEDIUM COFFEE 145
## category_name parent_id parent_name
## 1 Salty Snacks -tort/corn Chips (152) 5 Salty Snacks
## 2 Ice Cream-novelties (124) 140 Ice Cream
## 3 Lottery-win (270) 307 Store Services (92)
## 4 Gaming-lottery (scratch Tickt) (272) 175 Lottery (360)
## 5 Energy non-carb FS (279) 215 Energy non-carb
## 6 Fuel Premuim 78 Fuel
## 7 Cig-off Inv-premium (441) 11 Off Invoice Cigs (104)
## 8 Cold Bev-retail (311) 233 Cold Dispensed Beverage
## 9 Cold Bev-retail (311) 233 Cold Dispensed Beverage
## 10 Hot Disp Bev-retail (301) 232 Hot Dispensed Beverage
## product_count site_id site_name address
## 1 243 197 433 Bay Minette 701 Mcmeans Ave
## 2 52 131 361 Oakdale 500 Hohman St
## 3 1 336 667 Hooper 11701 State Highway 17
## 4 1575 297 562 Columbia 101 S Providence Rd
## 5 59 257 497 Patton 1000 State Route 51
## 6 68 157 387 Big Flat 100 Community Dr
## 7 199 250 4923 Commerce City 7351 Magnolia St
## 8 11 210 453 Butler 114 W Ohio St
## 9 11 482 986 Apache 302 E Evans Ave
## 10 9 365 80 Swaledale 202 5Th St S
## city zip latitude longitude site_status revenue gross_profit
## 1 Bay Minette 36507 40.1474 -92.3793 ACTIVE 1.69 0.510000
## 2 Oakdale 68761 42.0694 -97.9698 ACTIVE 11.96 4.960000
## 3 Hooper 81136 37.7451 -105.8810 ACTIVE -5.00 0.000000
## 4 Columbia 65203 38.9509 -92.3362 ACTIVE 1.00 0.050000
## 5 Patton 63662 37.5003 -90.0134 ACTIVE 2.79 1.406670
## 6 Big Flat 72617 36.0018 -92.4051 ACTIVE 15.80 2.136600
## 7 Commerce City 80022 39.8303 -104.9110 ACTIVE 6.17 1.486150
## 8 Butler 64730 38.2588 -94.3340 ACTIVE 1.00 0.221618
## 9 Apache 73006 34.8936 -98.3630 ACTIVE 1.00 1.000000
## 10 Swaledale 50477 42.9779 -93.3154 ACTIVE 1.59 1.590000
## costs units gp_margin
## 1 1.180000 1.000 0.3017751
## 2 7.000000 4.000 0.4147157
## 3 0.000000 5.000 0.0000000
## 4 0.950000 1.000 0.0500000
## 5 1.383330 1.000 0.5041828
## 6 13.663400 7.122 0.1352278
## 7 4.683850 1.000 0.2408671
## 8 0.778382 1.000 0.2216180
## 9 0.000000 1.000 1.0000000
## 10 0.000000 1.000 1.0000000
We can also have a look at the last 10 rows of the dataset.
## unique_id transaction_id unformatted_date customer_id
## 1049991 769782 20190507|200|3|3|5458756 2019-07-31 NA
## 1049992 1548163 20190511|667|2|2|5896773 2019-08-04 2373.894
## 1049993 2519995 20170315|228|1|100|4057751 2017-06-08 NA
## 1049994 603476 20170412|32|1|100|2275541 2017-07-06 7849.586
## 1049995 1785526 20181008|973|3|3|4891897 2019-01-01 NA
## 1049996 578832 20170509|28|2|1|2224306 2017-08-02 NA
## 1049997 913658 20170210|362|2|2|4062730 2017-05-06 NA
## 1049998 1891118 20170213|2926|2|2|2226767 2017-05-09 NA
## 1049999 1334676 20170810|520|6|2|4317475 2017-11-03 3012.562
## 1050000 1534863 20181002|667|1|1|5351842 2018-12-26 NA
## product_id product_name category_id
## 1049991 16179 VUSE ALTO MEN POD 2ML CT 339
## 1049992 11153 APPLE FRITTER FCS 270
## 1049993 19020 No Lead Plus 1
## 1049994 19018 Mid Grade Blend E15 1
## 1049995 5366 KOOLEE 158
## 1049996 651 CAMEL TK RYL 85 BX 161
## 1049997 2396 MM ORNG 15.2OZ 260
## 1049998 10846 SUB RST BEEF CHDR FCS 274
## 1049999 17877 HOSTESS HOHO 3OZ 3CT 60
## 1050000 15842 NAKED JC BLU SMTHIE 15.2OZ 260
## category_name parent_id parent_name
## 1049991 Cartridges 337 E-Cigarettes
## 1049992 Bakery Food 269 Bakery
## 1049993 Fuel Premuim 78 Fuel
## 1049994 Fuel Premuim 78 Fuel
## 1049995 Carbonated Drinks 234 Frozen Dispensed Beverage
## 1049996 Cig-premium (331) 8 Cigarettes
## 1049997 Pk Bev-juice No Tax (78) 151 Juice No Tax (311)
## 1049998 Cold Food 273 Cold Sandwiches
## 1049999 Cakes & pastrie 262 Bread & Cakes
## 1050000 Pk Bev-juice No Tax (78) 151 Juice No Tax (311)
## product_count site_id site_name address
## 1049991 142 64 200 Cedar Falls 221 W 6Th St
## 1049992 112 336 667 Hooper 11701 State Highway 17
## 1049993 68 91 228 Brandon 704 Main St
## 1049994 68 121 32 Shell Rock 203 S Walnut St
## 1049995 10 476 973 Lafayette 603 S Public Rd
## 1049996 435 108 28 Sioux Rapids 114 Thomas St
## 1049997 95 132 362 Norfork 13862 Highway 5 S
## 1049998 34 114 2926 Commerce City 8096 Brighton Rd
## 1049999 217 274 520 Atlantic 1630 E 7Th St
## 1050000 95 336 667 Hooper 11701 State Highway 17
## city zip latitude longitude site_status revenue gross_profit
## 1049991 Cedar Falls 50613 42.5336 -92.4478 ACTIVE 13.49 4.848890
## 1049992 Hooper 81136 37.7451 -105.8810 ACTIVE 1.89 1.002000
## 1049993 Brandon 52210 42.3147 -92.0057 ACTIVE 12.00 1.636800
## 1049994 Shell Rock 50670 42.7116 -92.5839 ACTIVE 12.00 1.565400
## 1049995 Lafayette 80026 39.9932 -105.0910 ACTIVE 1.39 0.821617
## 1049996 Sioux Rapids 50585 42.8925 -95.1503 ACTIVE 7.15 1.349090
## 1049997 Norfork 72658 36.2088 -92.2863 ACTIVE 1.89 0.770000
## 1049998 Commerce City 80022 39.8437 -104.9210 ACTIVE 4.49 2.786000
## 1049999 Atlantic 50022 41.4048 -94.9895 ACTIVE 1.99 1.205000
## 1050000 Hooper 81136 37.7451 -105.8810 ACTIVE 3.49 1.552140
## costs units gp_margin
## 1049991 8.641110 1.000 0.3594433
## 1049992 0.888000 1.000 0.5301587
## 1049993 10.363200 5.456 0.1364000
## 1049994 10.434600 5.218 0.1304500
## 1049995 0.568383 1.000 0.5910914
## 1049996 5.800910 1.000 0.1886839
## 1049997 1.120000 1.000 0.4074074
## 1049998 1.704000 1.000 0.6204900
## 1049999 0.785000 1.000 0.6055276
## 1050000 1.937860 1.000 0.4447393
Now, we will go through the details of each of the columns in the dataset. First, here is a brief description of the types of data we normally expect to see in a dataframe.
Int – unique_id is an integer. This means that it is a number with no decimal points.
Number – customer_id is a number (num). This is like an integer but might have a decimal point.
Character – this is just a string or set of letters. We don’t have any of these in the dataset right now because we converted them all to Factors when we imported the data.
Factor – city is a factor. When we imported the data into R, we told it to convert characters or strings to factors. A factor is a set of unique categories or levels that R can work with. It is convenient for viewing and understanding data; for example, we can see that there are 170 unique cities in this column. Factors, however, are not good for manipulating data, since weird things can happen when we need to modify a factor. Thus, it is generally better to keep data in the character format and convert it to a factor only when needed for analysis.
date: we can convert a character or factor to a date. It is nice to do this because we can then perform operations on dates, such as extracting the month of the year. Thus, later we will convert unformatted_date to the date format.
NA is not a data type but a stand-in used when data is not available. It means “not available” and indicates that an entry should exist but does not; in other words, the value is missing.
NULL is similar to NA but indicates that a value does not exist at all or is not measurable; in other words, the entry should not be there. We don’t have any of these in the data right now.
Now, we will go through column by column to examine what we have.
Unique_id is an integer, unique identifier for each row that has no duplicates and no nulls.
Transaction_id seems to be an identifier for transactions. There are duplicates, since there are fewer factor levels than rows, and no nulls.
unformatted_date clearly contains dates, but they are not formatted as a date type. We will need to change that.
customer_id has a lot of missing values but still lists thousands of customers. These are loyalty customers using their loyalty card/number when they make a purchase.
product_id & product_name show that there are over 9,000 different products that are sold.
category_id & category_name represent 224 different categories of products. The most commonly purchased are “Fuel Premuim” and “Cold Bev-retail (311)”.
parent_id & parent_name represent 84 parent categories for the products. The most common, as we might suspect by now, are “Fuel” and “Cold Dispensed Beverage.”
product_count. It is not clear what this means, but it could be the number of different products for each parent category.
site_id, site_name, and address represent 189 different stores.
city: the stores are spread across 170 different cities.
zip is the zip code for the store. It is curious that we don’t have the state. It would be nice to obtain that data.
latitude & longitude: these are the coordinates to map each of the stores
site_status has two levels “ACTIVE” and “CLOSED”. We can’t really tell what this means at this point.
revenue is a very useful number. It shows how much revenue is earned for each product sold in a transaction. We will see below that it contains some negative values, and we will investigate those when we check this column for mistakes.
costs appears to be the cost of the product that was sold.
gross_profit then must be the financial concept of the profit remaining for the company after direct costs are subtracted. Thus, gross profit is just revenue minus costs.
gp_margin must then be gross profit margin. This is another financial concept calculated as revenue minus costs, which equals gross_profit, divided by revenue. Gross profit margin represents the percentage of profit for each dollar of revenue. It is an important metric for measuring the profitability of the overall business and individual product categories. We will use these measures a lot in the class.
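As a quick sanity check of that definition (this check is not part of the original walkthrough), we could compare the stored column to the formula directly:

head(mooc1$gp_margin) #Stored gross profit margin
head((mooc1$revenue - mooc1$costs) / mooc1$revenue) #Should match the stored gp_margin values for most rows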
Note that gross_profit, costs, and gp_margin have several thousand NAs. Using slice_sample() repeatedly, or other methods, we can see that these are abnormal and fairly rare transactions like prepaid fuel, money orders, coupons, etc.
units appears to list how many units of that product were sold. The summary() function reveals a lot of information about each of the columns.
## unique_id transaction_id unformatted_date
## Min. : 6 20170313|254|2|1|2067734: 3 2017-11-25: 1404
## 1st Qu.: 728240 20181207|877|2|2|4894428: 3 2017-10-14: 1386
## Median :1455571 20190502|381|2|2|4595573: 3 2017-11-11: 1350
## Mean :1455677 20161008|248|2|1|1726302: 2 2017-09-02: 1343
## 3rd Qu.:2184373 20161008|372|2|1|1625787: 2 2017-08-26: 1341
## Max. :2909475 20161008|520|3|1|3537099: 2 2017-11-16: 1338
## (Other) :1049985 (Other) :1041838
## customer_id product_id product_name category_id
## Min. : 1 Min. : 2 G87E10S : 96450 Min. : 1.0
## 1st Qu.: 3349 1st Qu.: 4036 LARGE FTN : 29811 1st Qu.: 58.0
## Median : 5436 Median : 9103 MEDIUM FTN : 28475 Median :158.0
## Mean : 5106 Mean : 9728 POP DEP SINGLE: 24959 Mean :161.2
## 3rd Qu.: 6672 3rd Qu.:16289 No Lead Plus : 22137 3rd Qu.:255.0
## Max. :13336 Max. :19544 PrepayFuel : 18410 Max. :371.0
## NA's :782896 (Other) :829758
## category_name parent_id
## Fuel Premuim :204294 Min. : 2.0
## Cold Bev-retail (311) : 72455 1st Qu.: 78.0
## Pk Bev-csd/pop (71) : 58697 Median :185.0
## Pk Bev-energy (carbonated) Fs (79) : 44895 Mean :175.4
## Gaming-lottery (scratch Tickt) (272): 34638 3rd Qu.:244.0
## Cig-off Inv-premium (441) : 33862 Max. :372.0
## (Other) :601159
## parent_name product_count site_id
## Fuel :204294 Min. : 1.0 Min. : 2.0
## Cold Dispensed Beverage: 84244 1st Qu.: 37.0 1st Qu.:111.0
## Juice/tonics : 65673 Median : 68.0 Median :222.0
## Pop (587) : 58697 Mean : 185.1 Mean :231.2
## Energy carbonated : 45005 3rd Qu.: 230.0 3rd Qu.:336.0
## Candy : 41573 Max. :1575.0 Max. :485.0
## (Other) :550514
## site_name address
## 561 Gardendale : 13729 890 Odum Road : 13729
## 670 Alamosa : 12033 505 3Rd St : 12033
## 391 Hamilton : 10642 1706 Military Street South: 10642
## 158 Russellville: 10594 13675 Hwy 43 : 10594
## 497 Patton : 10234 1000 State Route 51 : 10234
## 370 Dunning : 10174 403 Jewett Ave : 10174
## (Other) :982594 (Other) :982594
## city zip latitude longitude
## Columbia : 28359 Min. :35005 Min. :33.06 Min. :-108.47
## Aurora : 23678 1st Qu.:50274 1st Qu.:36.53 1st Qu.: -98.44
## Rogers : 16854 Median :64769 Median :39.83 Median : -94.76
## Commerce City: 15062 Mean :59806 Mean :39.37 Mean : -96.42
## Gardendale : 13729 3rd Qu.:73040 3rd Qu.:41.86 3rd Qu.: -92.41
## Mobile : 12818 Max. :82718 Max. :46.65 Max. : -90.01
## (Other) :939500
## site_status revenue gross_profit costs
## ACTIVE:986019 Min. : -1680 Min. :-169.320 Min. :-475.610
## CLOSED: 63981 1st Qu.: 1 1st Qu.: 0.639 1st Qu.: 0.690
## Median : 2 Median : 1.055 Median : 1.286
## Mean : 41 Mean : 1.693 Mean : 6.417
## 3rd Qu.: 7 3rd Qu.: 1.840 3rd Qu.: 5.219
## Max. :4500000 Max. : 500.000 Max. : 949.050
## NA's :21599 NA's :21599
## units gp_margin
## Min. :-200.00 Min. :-2859.000
## 1st Qu.: 1.00 1st Qu.: 0.128
## Median : 1.00 Median : 0.349
## Mean : 3.44 Mean : -0.416
## 3rd Qu.: 2.00 3rd Qu.: 0.542
## Max. : 999.00 Max. : 52.639
## NA's :61003 NA's :21599
From the output of the summary() function we can see that customer_id has over 700,000 NAs.
The output also shows that some transaction_ids are repeated two or three times. This means that some transactions must include more than one product.
The most frequently purchased product is “G87E10S”. The next most common product sold is a large fountain drink.
The most commonly purchased categories are “Fuel Premuim” and “Cold Bev-retail (311)”.
The store with the most transactions is “561 Gardendale”, located at 890 Odum Road.
The city with the most purchases in our dataset is Columbia.
Since the median for units is 1 and the mean is higher, it is likely that most products are sold as 1 unit, but that a few products have significantly more than one unit. Just looking at the head() or tail() output seems to verify this: fuel has multiple units every time, which would be gallons of fuel, while soda and other items are mainly 1 unit. Additionally, there are a few missing values for units. While the head and tail functions don’t provide any examples, using slice_sample() shows that prepaid fuel and “pop deposits” appear to be the reason. A pop deposit likely means the customer is returning a glass pop/soda bottle and receiving a refund.
There are no missing values in the revenue column, but there are some negative values. Examining the head() and tail() output does not reveal much about the negative amounts, so let’s use the function slice_sample(). This displays a random sample of the data for the number of rows we specify; let’s look at a sample of 100. You may have to rerun this to get a different sample, but eventually you will see that lottery payouts, cigarette discounts, and store discounts seem to be the primary reasons for negative revenue.
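The sampling step is not shown above; a sketch, assuming the tidyverse is loaded:

mooc1 %>% slice_sample(n = 100) #Draw a random sample of 100 rows; rerun to get a different sample
mooc1 %>% filter(revenue < 0) %>% slice_sample(n = 100) #Or sample only the negative-revenue rows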
Next, we will open the .rds file. An .rds file is an R data file; this one is just the dataframe from the last video saved in rds format. Saving it in rds format is useful because it takes up less room on your hard drive and preserves all of the column formats.
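The read step would look something like the line below; the file name here is a placeholder for whatever the saved file is actually called:

mooc1 <- read_rds("mooc1_saved.rds") #Hypothetical file name; read the saved .rds file back into R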
Previously, we examined each column of our dataset called “mooc1”. In this section, we will clean, or tidy, that dataset. There are multiple ways that a dataset can be messy, including but not limited to the following:
* Columns that don’t have names in the header
* Multiple variables stored in one column
* Variables stored in both rows and columns
* Outliers, which are extreme values, either very large or very small
* Mistakes
* Columns that have the wrong format or data type
* Missing values
* Missing columns, i.e., the need to join or merge additional data into the current dataset
Relatively speaking, our data is quite tidy. For example, each row is already one observation and each column is already one variable. However, there are a few things that we can tidy up.
Let’s start with fixing column data formats. In particular, we want to turn the unformatted_date column into a date object class so that we can perform operations on it.
To do this conversion, we will use functions from the package lubridate. Thus, first, load the lubridate package.
## Warning: package 'lubridate' was built under R version 4.0.5
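For reference, the load call that produced the warning above is simply:

library(lubridate) #Load lubridate for working with dates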
The first step in changing unformatted_date to a date object is to see what format the date is in. We can do this very easily by looking at the “Environment” tab or by using the str() function. To look at just one column from a dataframe, we put the name of the dataframe first, followed by a dollar sign, followed by the name of the column. Let’s use str() and head() on this column; we can place them both in the same code block and run them together.
While this is not the prettiest output, these two functions show that unformatted_date has a four-digit year followed by a dash, a two-digit month followed by a dash, and then a two-digit day.
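The str() call itself is not shown above; it would presumably be the line below (the head() call and its output appear a few lines further down):

str(mooc1$unformatted_date) #Check the structure of just this one column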
## Factor w/ 1095 levels "2017-01-01","2017-01-02",..: 557 212 1062 293 855 1071 170 978 245 539 ...
The unformatted_date column is currently stored as a factor (categorical data), not as a date.
head(mooc1$unformatted_date, n=100) #Check out the first 100 entries in the *unformatted_date* column.
## [1] 2018-07-11 2017-07-31 2019-11-28 2017-10-20 2019-05-05 2019-12-07
## [7] 2017-06-19 2019-09-05 2017-09-02 2018-06-23 2019-05-10 2019-08-01
## [13] 2017-07-30 2019-07-28 2017-08-29 2017-10-25 2019-08-21 2019-12-07
## [19] 2018-10-10 2019-04-30 2017-01-15 2018-08-21 2019-01-01 2019-08-08
## [25] 2017-09-16 2018-08-29 2019-11-06 2019-03-05 2017-10-26 2018-11-08
## [31] 2017-03-31 2017-03-14 2017-01-20 2018-03-25 2019-09-03 2017-06-08
## [37] 2018-11-07 2018-01-28 2017-12-04 2018-04-02 2018-06-23 2017-05-26
## [43] 2018-01-10 2019-11-16 2018-03-26 2017-04-18 2017-02-28 2018-08-23
## [49] 2017-04-22 2018-02-07 2017-11-07 2019-05-01 2019-02-24 2017-01-19
## [55] 2018-07-06 2018-11-08 2018-08-07 2019-01-29 2018-03-13 2017-07-12
## [61] 2019-01-25 2017-05-29 2019-02-27 2019-09-09 2018-02-09 2018-07-24
## [67] 2018-04-30 2019-01-01 2017-10-19 2017-01-19 2018-10-03 2018-04-13
## [73] 2019-03-22 2018-12-06 2019-09-07 2017-09-20 2017-06-07 2019-11-19
## [79] 2018-08-24 2017-07-10 2019-11-24 2019-12-01 2019-06-16 2018-01-18
## [85] 2017-06-27 2017-12-09 2017-09-10 2019-12-10 2018-07-09 2019-08-27
## [91] 2018-09-04 2018-08-07 2018-06-20 2017-07-17 2017-09-11 2017-12-23
## [97] 2018-08-11 2017-09-29 2018-12-01 2017-10-06
## 1095 Levels: 2017-01-01 2017-01-02 2017-01-03 2017-01-04 ... 2019-12-31
Lubridate allows us to easily convert the date once we know the text layout, using the correct function that matches that text layout. In this case, the correct function is ymd()—which we might guess stands for year, month, and day. Other examples would be mdy() for month-day-year, dmy() day-month-year, etc. Using this function, we can create a new column called date using the “$” symbol.
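A sketch of that conversion, assuming the new column is named date as described:

mooc1$date <- ymd(mooc1$unformatted_date) #Convert the factor to a proper Date column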
Next, let’s create two more columns: a year column and a month column. Once we have created the formatted date column, we can easily create these. To create a column for the year, we just use the year() function. Looking at the Environment tab, we can see that year has been created and is a number.
Next, create the month column. This is done the same way, except that we have the option to see the labels of the months (Jan, Feb, Mar) instead of the numbers, so take that option with label=TRUE.
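The two new columns would be created along these lines:

mooc1$year <- year(mooc1$date) #Extract the year as a number
mooc1$month <- month(mooc1$date, label = TRUE) #Extract the month as a labeled factor (Jan, Feb, ...)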
Finally, let’s look at the data to make sure these new columns worked. It is always a good idea to look at the data after manipulating it, to make sure the code did what we thought it was going to do. We can run the slice_sample() function to do this.
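The check itself might look like this:

slice_sample(mooc1, n = 10) #View a random sample of rows to confirm date, year, and month were created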
## unique_id transaction_id unformatted_date customer_id product_id
## 1 550490 20170822|222|1|1|2717796 2017-11-15 4064.10 16008
## 2 879117 20180807|251|3|2|2247743 2018-10-31 NA 36
## 3 2015272 20170825|961|2|1|2214387 2017-11-18 NA 18973
## 4 2787054 20190130|535|1|1|3694151 2019-04-25 10015.11 10468
## 5 2382569 20190706|453|1|1|5216393 2019-09-29 NA 18234
## 6 2601362 20180727|146|2|2|6017393 2018-10-20 NA 2179
## 7 1426595 20171226|561|5|4|7007729 2018-03-21 NA 3689
## 8 2811140 20180906|382|2|2|3706799 2018-11-30 6543.66 5365
## 9 39048 20171028|466|3|2|2792184 2018-01-21 NA 3309
## 10 611742 20170329|50|2|1|2831983 2017-06-22 NA 12300
## product_name category_id category_name
## 1 GOLD PEAK LMNDE TEA 18.5OZ 251 Iced Tea
## 2 PAYDAY 3.4OZ KS 267 Healthy Candy
## 3 G85E10S 1 Fuel Premuim
## 4 X-LARGE FTN 152 Cold Bev-retail (311)
## 5 LOTTERY WIN 135 Lottery-win (270)
## 6 HOT DOG FCS 280 Roller Grill Food
## 7 FUNYUN REG 2.375OZ 53 Salty Snacks -tort/corn Chips (152)
## 8 LARGE COFFEE 145 Hot Disp Bev-retail (301)
## 9 L&M 100 BX 207 Cig-off Inv-discount (442)
## 10 SKOAL LC WNTRGRN RL 334 Tobacco-round Can Snuff (960)
## parent_id parent_name product_count site_id site_name
## 1 150 Juice/tonics 166 85 222 Pilot Mound
## 2 331 Candy 230 102 251 Albert City
## 3 78 Fuel 68 467 961 Burlington
## 4 233 Cold Dispensed Beverage 11 280 535 Lowden
## 5 307 Store Services (92) 1 210 453 Butler
## 6 279 Roller Grill 9 33 146 Bella Vista
## 7 5 Salty Snacks 243 296 561 Gardendale
## 8 232 Hot Dispensed Beverage 9 152 382 Gentry
## 9 11 Off Invoice Cigs (104) 27 222 466 Centralia
## 10 321 Smokeless (951) 363 260 50 Andalusia
## address city zip latitude longitude site_status
## 1 102 Pilot St Pilot Mound 50223 42.1620 -94.0176 ACTIVE
## 2 114 Main St Albert City 50510 42.7819 -94.9476 ACTIVE
## 3 211 S Main St Burlington 82411 44.4447 -108.4310 ACTIVE
## 4 308 Washington Ave Lowden 52255 41.8580 -90.9264 ACTIVE
## 5 114 W Ohio St Butler 64730 38.2588 -94.3340 ACTIVE
## 6 1750 Forest Hills Blvd Bella Vista 72715 36.4649 -94.3020 ACTIVE
## 7 890 Odum Road Gardendale 35071 39.8873 -94.8933 ACTIVE
## 8 899 W Main St Gentry 72734 36.2672 -94.4929 ACTIVE
## 9 104 W Sneed St Centralia 65240 39.2103 -92.1384 ACTIVE
## 10 540 West Bypass Andalusia 36420 41.2039 -94.4171 ACTIVE
## revenue gross_profit costs units gp_margin date year month
## 1 3.00 1.00000 2.00000 2.000 0.33333333 2017-11-15 2017 Nov
## 2 1.99 0.91700 1.07300 1.000 0.46080402 2018-10-31 2018 Oct
## 3 5.00 0.63870 4.36130 2.129 0.12774000 2017-11-18 2017 Nov
## 4 1.79 1.79000 0.00000 1.000 1.00000000 2019-04-25 2019 Apr
## 5 -3.00 0.00000 0.00000 3.000 0.00000000 2019-09-29 2019 Sep
## 6 1.34 0.57800 0.76200 1.000 0.43134328 2018-10-20 2018 Oct
## 7 1.35 0.17000 1.18000 1.000 0.12592593 2018-03-21 2018 Mar
## 8 1.00 1.00000 0.00000 1.000 1.00000000 2018-11-30 2018 Nov
## 9 4.17 0.26222 3.90778 1.000 0.06288249 2018-01-21 2018 Jan
## 10 12.68 3.54266 9.13734 2.000 0.27938959 2017-06-22 2017 Jun
We will now load the second .rds file.
When tidying your data, it’s important to make sure there are very few mistakes, outliers, and missing data. Let’s focus on one variable, revenue, since revenue is critical to any business, and check for and deal with these issues. First, let’s make sure that revenue has no missing values. We can check that with the summary() function.
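The call behind the output below is presumably just:

summary(mooc1$revenue) #Five-number summary plus the mean; any NAs would be reported here too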
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1680 1 2 41 7 4500000
There are no missing values, so let’s check for any mistakes. One type of mistake would be if there are non-numbers in this column, since we would expect revenue to be a number. We can check this with the str() function or by looking in the Environment window in RStudio.
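For reference, that call would be:

str(mooc1$revenue) #Confirm the column is numeric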
## num [1:1050000] 1.69 11.96 -5 1 2.79 ...
Luckily, everything is a number here.
Next, let’s see if there are any really high or low values, any outliers. First, let’s check for extreme values using the summary() function. So, let’s look back up at the summary() function results.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1680 1 2 41 7 4500000
This does show some potential problems. First, there are some negative amounts. That seems strange for revenue. However, we previously learned that lottery payouts and cigarette discounts seem to be the primary reason for negative revenue. That makes sense and does not seem like a mistake.
However, a second problem is that there seem to be some really high revenue numbers. First, the mean is much higher than the median, which indicates that a relatively few observations are really high. Second, the maximum revenue is extremely high. It seems pretty unlikely that any individual product, even if many units are bought, is going to generate revenue of 4,500,000!
So, we need to investigate if these high amounts are errors. Again, there are many ways to do this, but one way is just to sort the dataset on revenue with the highest revenue coming first. So, there are really three things we need to do.
First, we need to tell R which dataframe we are working with. Second, we need to tell R to sort this dataframe with the highest revenue coming first. Third, we want to just look at the first 25 rows.
While we could do this all in one statement with a bunch of nested parentheses, let’s start using the tidyverse to make things a little easier; it is really useful for working with dataframes. Here we are going to use the pipe symbol, %>%, which allows us to chain different tasks together like a train, instead of putting everything into one statement.
So, first, we say which dataframe we are working on. Next, we use the arrange() function with the desc() function nested inside to get descending order. Finally, we run the head() function with n=25. We could have done this all as one statement without using the %>% symbol, but that is sometimes more confusing.
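A sketch of that chain:

mooc1 %>% arrange(desc(revenue)) %>% head(n = 25) #Sort by revenue, highest first, and view the top 25 rows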
## unique_id transaction_id unformatted_date customer_id product_id
## 1 206905 20170919|840|3|1|3439349 2017-12-13 NA 17628
## 2 90931 20171017|865|1|1|3104362 2018-01-10 NA 17628
## 3 1613713 20180318|878|2|2|4048051 2018-06-11 NA 17628
## 4 159545 20180119|840|2|1|3617466 2018-04-14 NA 17628
## 5 232546 20180129|873|2|1|2472746 2018-04-24 5219.362 17628
## 6 442516 20171223|838|3|2|3043467 2018-03-18 NA 17628
## 7 197375 20171227|991|1|1|2144611 2018-03-22 NA 17628
## 8 1625557 20180219|881|5|1|6265910 2018-05-15 NA 17628
## 9 1635069 20170719|881|2|2|5624176 2017-10-12 NA 17628
## 10 1598185 20170901|863|1|1|4618493 2017-11-25 NA 17628
## 11 471717 20170904|876|1|2|2928779 2017-11-27 3820.074 17628
## 12 347583 20170714|860|1|2|3262062 2017-10-07 NA 17628
## 13 293589 20180307|991|1|1|2191536 2018-05-31 NA 17628
## 14 88717 20170720|981|3|1|2402113 2017-10-13 NA 17628
## 15 52783 20180202|873|2|2|2476587 2018-04-28 6033.016 17628
## 16 1598932 20170826|863|2|2|4604148 2017-11-19 5500.852 17628
## 17 1656405 20171025|887|1|2|4632713 2018-01-18 NA 17628
## 18 253500 20180119|840|1|1|3616496 2018-04-14 NA 17628
## 19 188498 20170916|840|2|1|3434192 2017-12-10 6600.666 17628
## 20 1854890 20180317|1837|3|3|1309896 2018-06-10 NA 17628
## 21 1624887 20171207|881|2|1|6074119 2018-03-02 NA 17628
## 22 1634113 20170714|881|3|1|5606866 2017-10-07 NA 17628
## 23 290645 20171229|834|3|1|3147892 2018-03-24 NA 17628
## 24 1859702 20171020|1837|1|1|931408 2018-01-13 NA 17628
## 25 316138 20170919|876|2|2|2953468 2017-12-13 NA 17628
## product_name category_id category_name
## 1 OK LUCKY RED 7S 467 138 Gaming-lottery (scratch Tickt) (272)
## 2 OK LUCKY RED 7S 467 138 Gaming-lottery (scratch Tickt) (272)
## 3 OK LUCKY RED 7S 467 138 Gaming-lottery (scratch Tickt) (272)
## 4 OK LUCKY RED 7S 467 138 Gaming-lottery (scratch Tickt) (272)
## 5 OK LUCKY RED 7S 467 138 Gaming-lottery (scratch Tickt) (272)
## 6 OK LUCKY RED 7S 467 138 Gaming-lottery (scratch Tickt) (272)
## 7 OK LUCKY RED 7S 467 138 Gaming-lottery (scratch Tickt) (272)
## 8 OK LUCKY RED 7S 467 138 Gaming-lottery (scratch Tickt) (272)
## 9 OK LUCKY RED 7S 467 138 Gaming-lottery (scratch Tickt) (272)
## 10 OK LUCKY RED 7S 467 138 Gaming-lottery (scratch Tickt) (272)
## 11 OK LUCKY RED 7S 467 138 Gaming-lottery (scratch Tickt) (272)
## 12 OK LUCKY RED 7S 467 138 Gaming-lottery (scratch Tickt) (272)
## 13 OK LUCKY RED 7S 467 138 Gaming-lottery (scratch Tickt) (272)
## 14 OK LUCKY RED 7S 467 138 Gaming-lottery (scratch Tickt) (272)
## 15 OK LUCKY RED 7S 467 138 Gaming-lottery (scratch Tickt) (272)
## 16 OK LUCKY RED 7S 467 138 Gaming-lottery (scratch Tickt) (272)
## 17 OK LUCKY RED 7S 467 138 Gaming-lottery (scratch Tickt) (272)
## 18 OK LUCKY RED 7S 467 138 Gaming-lottery (scratch Tickt) (272)
## 19 OK LUCKY RED 7S 467 138 Gaming-lottery (scratch Tickt) (272)
## 20 OK LUCKY RED 7S 467 138 Gaming-lottery (scratch Tickt) (272)
## 21 OK LUCKY RED 7S 467 138 Gaming-lottery (scratch Tickt) (272)
## 22 OK LUCKY RED 7S 467 138 Gaming-lottery (scratch Tickt) (272)
## 23 OK LUCKY RED 7S 467 138 Gaming-lottery (scratch Tickt) (272)
## 24 OK LUCKY RED 7S 467 138 Gaming-lottery (scratch Tickt) (272)
## 25 OK LUCKY RED 7S 467 138 Gaming-lottery (scratch Tickt) (272)
## parent_id parent_name product_count site_id site_name
## 1 175 Lottery (360) 1575 382 840 Albany
## 2 175 Lottery (360) 1575 399 865 Lane
## 3 175 Lottery (360) 1575 410 878 Forgan
## 4 175 Lottery (360) 1575 382 840 Albany
## 5 175 Lottery (360) 1575 405 873 Turpin
## 6 175 Lottery (360) 1575 380 838 Oxford
## 7 175 Lottery (360) 1575 485 991 Binger
## 8 175 Lottery (360) 1575 413 881 Erick
## 9 175 Lottery (360) 1575 413 881 Erick
## 10 175 Lottery (360) 1575 397 863 Atoka
## 11 175 Lottery (360) 1575 408 876 Bokchito
## 12 175 Lottery (360) 1575 394 860 Cherokee
## 13 175 Lottery (360) 1575 485 991 Binger
## 14 175 Lottery (360) 1575 480 981 Fort Cobb
## 15 175 Lottery (360) 1575 405 873 Turpin
## 16 175 Lottery (360) 1575 397 863 Atoka
## 17 175 Lottery (360) 1575 418 887 Greenfield
## 18 175 Lottery (360) 1575 382 840 Albany
## 19 175 Lottery (360) 1575 382 840 Albany
## 20 175 Lottery (360) 1575 55 1837 Winfield
## 21 175 Lottery (360) 1575 413 881 Erick
## 22 175 Lottery (360) 1575 413 881 Erick
## 23 175 Lottery (360) 1575 377 834 Burlington
## 24 175 Lottery (360) 1575 55 1837 Winfield
## 25 175 Lottery (360) 1575 408 876 Bokchito
## address city zip latitude longitude site_status revenue
## 1 109 N Wade St Albany 74721 33.8818 -96.1620 CLOSED 4500000
## 2 11820 E Highway 3 Lane 74555 34.2980 -95.9867 ACTIVE 3000000
## 3 115 S Broadway St Forgan 73938 36.9063 -100.5350 ACTIVE 2000000
## 4 109 N Wade St Albany 74721 33.8818 -96.1620 CLOSED 2000000
## 5 419 S Main St Turpin 73950 36.8634 -100.8780 ACTIVE 2000000
## 6 92 Plaza Lane Oxford 36203 35.9938 -94.5681 ACTIVE 2000000
## 7 224 W Main St Binger 73009 35.3086 -98.3442 CLOSED 1000000
## 8 119 W 3Rd St Erick 73645 35.2141 -99.8677 ACTIVE 1000000
## 9 119 W 3Rd St Erick 73645 35.2141 -99.8677 ACTIVE 1000000
## 10 409 E Court St Atoka 74525 34.3854 -96.1240 ACTIVE 1000000
## 11 118 W Main St Bokchito 74726 34.0169 -96.1456 ACTIVE 1000000
## 12 1780 S Grand Ave Cherokee 73728 36.7396 -98.3596 ACTIVE 500000
## 13 224 W Main St Binger 73009 35.3086 -98.3442 CLOSED 500000
## 14 408 N Hazlett St Fort Cobb 73038 35.0990 -98.4358 ACTIVE 500000
## 15 419 S Main St Turpin 73950 36.8634 -100.8780 ACTIVE 500000
## 16 409 E Court St Atoka 74525 34.3854 -96.1240 ACTIVE 500000
## 17 100 Madison St Greenfield 73043 35.7276 -98.3789 ACTIVE 500000
## 18 109 N Wade St Albany 74721 33.8818 -96.1620 CLOSED 500000
## 19 109 N Wade St Albany 74721 33.8818 -96.1620 CLOSED 500000
## 20 2575 Us Hwy 43 Winfield 35594 36.5326 -98.1534 ACTIVE 500000
## 21 119 W 3Rd St Erick 73645 35.2141 -99.8677 ACTIVE 500000
## 22 119 W 3Rd St Erick 73645 35.2141 -99.8677 ACTIVE 500000
## 23 513 Main St Burlington 73722 36.9003 -98.4228 ACTIVE 500000
## 24 2575 Us Hwy 43 Winfield 35594 36.5326 -98.1534 ACTIVE 500000
## 25 118 W Main St Bokchito 74726 34.0169 -96.1456 ACTIVE 500000
## gross_profit costs units gp_margin date year month
## 1 2.7 42.3 9 0.06 2017-12-13 2017 Dec
## 2 1.8 28.2 6 0.06 2018-01-10 2018 Jan
## 3 1.2 18.8 4 0.06 2018-06-11 2018 Jun
## 4 1.2 18.8 4 0.06 2018-04-14 2018 Apr
## 5 1.2 18.8 4 0.06 2018-04-24 2018 Apr
## 6 1.2 18.8 4 0.06 2018-03-18 2018 Mar
## 7 0.6 9.4 2 0.06 2018-03-22 2018 Mar
## 8 0.6 9.4 2 0.06 2018-05-15 2018 May
## 9 0.6 9.4 2 0.06 2017-10-12 2017 Oct
## 10 0.6 9.4 2 0.06 2017-11-25 2017 Nov
## 11 0.6 9.4 2 0.06 2017-11-27 2017 Nov
## 12 0.3 4.7 1 0.06 2017-10-07 2017 Oct
## 13 0.3 4.7 1 0.06 2018-05-31 2018 May
## 14 0.3 4.7 1 0.06 2017-10-13 2017 Oct
## 15 0.3 4.7 1 0.06 2018-04-28 2018 Apr
## 16 0.3 4.7 1 0.06 2017-11-19 2017 Nov
## 17 0.3 4.7 1 0.06 2018-01-18 2018 Jan
## 18 0.3 4.7 1 0.06 2018-04-14 2018 Apr
## 19 0.3 4.7 1 0.06 2017-12-10 2017 Dec
## 20 0.3 4.7 1 0.06 2018-06-10 2018 Jun
## 21 0.3 4.7 1 0.06 2018-03-02 2018 Mar
## 22 0.3 4.7 1 0.06 2017-10-07 2017 Oct
## 23 0.3 4.7 1 0.06 2018-03-24 2018 Mar
## 24 0.3 4.7 1 0.06 2018-01-13 2018 Jan
## 25 0.3 4.7 1 0.06 2017-12-13 2017 Dec
Looking at these high-revenue rows, the first few rows show that there is clearly a problem with product number 17628, the product called “OK LUCKY RED 7S 467”. Looking at the category name and the parent name, we see that these are a particular type of scratch-off lottery ticket. These products make up all 25 of the highest-revenue rows, and the revenue does not match up with the profit or the costs. So, clearly a mistake has been made.
So, next, we need to check if all of these particular products have bad revenue. Let’s create a new dataframe so we can easily see all of them. Let’s name this dataframe “outliers.”
First, we name the new dataframe. Next, we use the assignment operator. Then, we tell R to use mooc1 to create this dataframe, since mooc1 is the dataframe we are taking the new dataframe from. After that, we use the handy filter() function. Filter is a really useful function from the tidyverse: it selects only the rows of the dataframe that meet the condition we specify. Let’s pick product_id == 17628, which, of course, is the product id that belongs to “OK LUCKY RED 7S 467”. Notice that we need to use two equals signs here. This tells R that we want to test whether product_id equals 17628, not to assign that number. That is all we need to do to create this dataframe.
outliers <- mooc1 %>% filter(product_id == 17628) #Create a new dataframe to look at all
#"OK LUCKY RED 7S 467" rows
So, finally, we will need to eliminate these outliers, since it is unclear what mistake was made here and these revenue amounts do not represent reality. There are several ways to do this, but for our purposes, let’s just delete all of these rows. We can use filter() again to do this. First, let’s call the dataframe something new, since we are eliminating some data; this is just good practice in case we make a mistake. Second, use the filter function, just the way we did before, but this time with the not-equal operator, !=.
mooc1_clean <- mooc1 %>% filter(product_id != 17628) #Create a new dataframe with the
#"OK LUCKY RED 7S 467" rows removed
Finally, let’s check the distribution of the revenue column now using the summary() function again.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1680.00 1.49 2.49 7.75 6.51 999.00
This seems much more reasonable. If we were to examine the highest revenue now, we would see that the highest amounts are purchases of hundreds of gallons of diesel fuel and hundreds of lottery tickets, which makes sense.
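That follow-up check is not shown here, but it would look something like:

mooc1_clean %>% arrange(desc(revenue)) %>% head(n = 10) #The highest revenues should now be large fuel and lottery purchases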
Let’s load the final .rds file and a dataset with the necessary state information for the US.
Another really important task in ETL is to merge or join data together. For some reason, we have no states in our dataset. We have cities and zip codes but not states. It would be nice to add that data. Luckily, we have created a state database that you can use that has all of the zip codes listed in this dataset, along with the states. Let’s open this dataset and join it to our other dataframe.
Let’s use the functions str(), summary(), head(), and tail() to examine this new dataframe.
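Written out, and assuming the new dataframe is named mooc1_states (the name used later in this section), the calls would be:

str(mooc1_states) #Structure and column types
summary(mooc1_states) #Descriptive statistics for each column
head(mooc1_states, n = 10) #First 10 rows
tail(mooc1_states, n = 10) #Last 10 rows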
## 'data.frame': 442 obs. of 4 variables:
## $ postal_code : chr "36420" "35611" "35954" "36830" ...
## $ state_province : chr "Alabama" "Alabama" "Alabama" "Alabama" ...
## $ state_province_code: chr "AL" "AL" "AL" "AL" ...
## $ country_name : chr "USA" "USA" "USA" "USA" ...
## postal_code state_province state_province_code country_name
## Length:442 Length:442 Length:442 Length:442
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## postal_code state_province state_province_code country_name
## 1 36420 Alabama AL USA
## 2 35611 Alabama AL USA
## 3 35954 Alabama AL USA
## 4 36830 Alabama AL USA
## 5 36507 Alabama AL USA
## 6 35022 Alabama AL USA
## 7 35209 Alabama AL USA
## 8 35210 Alabama AL USA
## 9 35235 Alabama AL USA
## 10 35206 Alabama AL USA
## postal_code state_province state_province_code country_name
## 433 36206 Alabama AL USA
## 434 35016 Alabama AL USA
## 435 51333-7917 Iowa IA USA
## 436 72633-9998 Arkansas AR USA
## 437 80614-9990 Colorado CO USA
## 438 80229-9998 Colorado CO USA
## 439 80015-9998 Colorado CO USA
## 440 80017-9998 Colorado CO USA
## 441 80113-9997 Colorado CO USA
## 442 80026-9998 Colorado CO USA
We notice a few things here.
postal_code: the first column appears to contain zip codes. There are two problems here. First, this column is called postal_code and not zip. We don’t have a postal_code column in our mooc1 dataframe; instead we have a column called zip. This matters because in order to join our two tables and get states into our dataframe, we need one column to link them, that is, one column that is the same in both tables. The only thing the two tables have in common is zip code. It looks like these two columns represent the same thing, zip codes, but we need to account for the different name. Second, this column is in a different format from the zip column: the column in the states dataframe is a character variable, while zip is an integer. So, we need to fix that as well.
state_province: this is the column we want to put into our mooc1 dataframe.
state_province_code: this seems to be redundant with state_province, so we could bring it in, but we don’t really have to.
country_name: this column is useless since every value is USA. We could confirm this by turning it into a factor, using the str() function, or looking at the Environment tab, but we won’t do that here.
So, our first step is to put the postal code column into the same format as the zip column, which is integer (number would work as well). To do this we will use the as.integer() function. First, we tell R which column we are going to create. Then, we apply the function to the old column. To be safe, let’s write the result to a new column called postal_code2.
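A sketch of that conversion (postal_code2 is the column name used in the next step):

mooc1_states$postal_code2 <- as.integer(mooc1_states$postal_code) #Try to convert the character zip codes to integers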
## Warning: NAs introduced by coercion
But wait, we get a warning message that NAs were introduced by coercion. Let’s use the summary() function to look at the three columns: postal_code, postal_code2, and zip.
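Those three checks would be something like:

summary(mooc1_states$postal_code) #Still a character column
summary(mooc1_states$postal_code2) #The new integer column, with NAs
summary(mooc1$zip) #The zip column we want to match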
## Length Class Mode
## 442 character character
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 35005 35215 35806 35832 36330 36867 345
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 35005 50274 64769 59806 73040 82718
We see that, indeed, over 300 NAs were created. What is going on? If we scroll back up to our original view of the mooc1_states dataframe using the tail() function, we can see why R made the column a character type in the first place: some zip codes include a dash and four extra digits, and R cannot coerce those values to numbers, so it produced NAs. If we look at mooc1 using summary(), we don’t see those in zip; in fact, all of its values are five-digit numbers. The extra four digits are called “plus-four codes” and are used to make zip codes more specific. We do not need them, and mooc1 does not have them.
So, it looks like we are going to need to take off the dash and the extra digits. Let’s do that with the str_sub() function. This function subsets a string to just the characters we want; since we only want the first through fifth characters, we pass 1 and 5. Run the tail() function and compare postal_code and postal_code2 to see that this worked.
mooc1_states$postal_code2 <- mooc1_states$postal_code %>% str_sub(1,5) #Keep only the first five characters
#of postal_code in the states dataframe
tail(mooc1_states, n=10)
## postal_code state_province state_province_code country_name postal_code2
## 433 36206 Alabama AL USA 36206
## 434 35016 Alabama AL USA 35016
## 435 51333-7917 Iowa IA USA 51333
## 436 72633-9998 Arkansas AR USA 72633
## 437 80614-9990 Colorado CO USA 80614
## 438 80229-9998 Colorado CO USA 80229
## 439 80015-9998 Colorado CO USA 80015
## 440 80017-9998 Colorado CO USA 80017
## 441 80113-9997 Colorado CO USA 80113
## 442 80026-9998 Colorado CO USA 80026
Sure enough, it looks like it did, and we can move on.
Now that we have only the five digits in postal_code2, we can retry the as.integer() function. After doing that, let’s run the summary() function to make sure it worked. Ok, it did: both zip and postal_code2 are now in an integer format.
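A sketch of those two steps:

mooc1_states$postal_code2 <- as.integer(mooc1_states$postal_code2) #Retry the conversion now that the plus-four suffixes are gone
summary(mooc1_states) #Confirm postal_code2 is now numeric with no NAs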
## postal_code state_province state_province_code country_name
## Length:442 Length:442 Length:442 Length:442
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## postal_code2
## Min. :35005
## 1st Qu.:50408
## Median :58063
## Mean :58656
## 3rd Qu.:72943
## Max. :82725
Next, let’s change the name of the postal_code2 column to zip. We could handle the different names through the join function itself, which we will use next, but it will be good practice to do it this way. This is done quite easily with the rename() function from the tidyverse. First, we define a new dataframe; it is probably safe to use the old dataframe name here, since this doesn’t change the data in any way. Next, we tell R which dataframe we are manipulating. Then we use the rename() function with the new name first.
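A sketch of that rename (the new name goes on the left, the old name on the right):

mooc1_states <- mooc1_states %>% rename(zip = postal_code2) #Rename postal_code2 to zip so it matches mooc1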
Now, let’s clean up the dataframe before we merge it. First, let’s drop postal_code and country_name since we don’t need them. To do this, let’s introduce another really useful function from Tidyverse, the select() function. This function selects which columns we want and don’t want for a new dataframe. First, give a name to the new dataframe. We can use mooc1_states and rewrite it. Then, tell R which dataframe to manipulate, in this case, mooc1_states, of course.
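One way to drop those two columns, sketched with select():

mooc1_states <- mooc1_states %>% select(-postal_code, -country_name) #Drop the columns we no longer need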
Now, let’s use the select() function to reorder the columns and drop state_province_code. First, just put them in the order you want inside the select() function. Then, just omit the ones you don’t want. Thus, we drop state_province_code just by not mentioning it in the select() function.
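And the reorder-and-drop step might look like:

mooc1_states <- mooc1_states %>% select(zip, state_province) #Keep only zip and state_province, in that order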
Finally, we get to merge our two data frames. Merging or joining is really tricky business. It is really important to think about what you want before merging, and then to make sure you get what you want after merging. One key way to do this is to 1) know the level of aggregation in your data frames and 2) check the number of rows of your dataframes before and after you merge, to make sure you did what you thought you did.
What does “level of aggregation” mean? By this we just mean what each row of your data represents. For example, each row of mooc1 represents one or more of the same product from one transaction. Thus, the level of aggregation is “product-transaction”. On the other hand, each row in mooc1_states is a unique zip code. So, the level of aggregation is zip code.
Once we know the level of aggregation of our two dataframes, we can ask the critical question: What level of aggregation do we want when we merge these dataframes? We want to just add state names to rows that already exist in mooc1. So, our final dataframe should have exactly the number of rows that mooc1 has right now, which is 1,049,961. Keep that in mind.
So, that does not sound so difficult; what can go wrong? Well, a lot, but let’s focus on these potential problems:
* First, maybe mooc1_states has duplicate zip codes. If this is true, then the join might create duplicate rows in mooc1. We could check this with n_distinct(), but we already have, and all zip codes are unique.
* The next problem is using the wrong join method. There are multiple ways to join tables, and using the wrong method will cause problems.
* Using the wrong method could add rows that we don’t want, because they are in the state dataframe but not in the mooc1 dataframe. Alternatively, using the wrong method could drop rows from the mooc1 dataframe because they are not in the state dataframe. Remember, we want to keep everything in the mooc1 dataframe and add just the states from the state dataframe.
Here is the documentation from the package dplyr for mutating joins, which are joins that add new columns (https://dplyr.tidyverse.org/reference/mutate-joins.html). The mutating joins add columns from the second dataframe to the first dataframe, matching rows based on the keys supplied in the function. If a row in the first dataframe matches multiple rows in the second dataframe, all the matching rows in the second dataframe will be returned, once for each matching row in the first dataframe. The four types are inner_join(), which keeps only rows that match in both dataframes; left_join(), which keeps all rows in the first dataframe; right_join(), which keeps all rows in the second dataframe; and full_join(), which keeps all rows in both dataframes.
So, which one of these methods do we want? Well, let’s use the wrong one first, just to illustrate why it is important to use the correct join method. Let’s use full_join(), which keeps everything from both dataframes, whether the rows match up or not. First, let’s check the size of mooc1 before and after using dim(). This is how many rows we want to have when we are done.
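The before check would simply be:

dim(mooc1) #Number of rows and columns before the join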
## [1] 1049961 27
We want to add one column to this table, since we are adding state_province. We do not add zip, because we are using zip to merge with mooc1, so the two zip columns will simply become one column. Call the new dataframe mooc1_etl. We use the full_join() function, which takes as arguments the two dataframes and the column we are joining on. Look at the Environment tab: we can see that the state_province column from mooc1_states is now in the new dataframe.
mooc1_etl <- full_join(mooc1, mooc1_states, by='zip') #Join the tables together using `full_join()`
#(the wrong way for our case)
Next, we use the dim() function to check the number of rows of the new table.
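Assuming the joined dataframe is named mooc1_etl as above, the check is:

dim(mooc1_etl) #Check the number of rows and columns after the full join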
## [1] 1050222 28
Unfortunately, we have too many rows. What does this mean, and why did it happen? Well, full_join() keeps all rows in both tables, even if they don’t match, and there must have been some zip codes in the states table that were not in the mooc1 table. This is a problem, and we do not want these extra rows.
Let’s run the summary() function on unique_id.
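The call behind the output below would be:

summary(mooc1_etl$unique_id) #NAs here indicate rows that came only from the states table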
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 6 728269 1455575 1455701 2184398 2909475 261
Sure enough, when we look at the summary of unique_id we see over 200 NAs. These appear because the full join added zip codes from mooc1_states that have no matching rows in the mooc1 table; in other words, there were over 200 zip codes in mooc1_states that were not in mooc1.
So, now let’s do this correctly. Which join type do we actually want? We want left_join(). This keeps all rows from the first dataframe and only the matching rows from the second dataframe. Then, let’s run the dim() function and the summary() function to make sure we get the right number of rows and columns and that we have no NAs in unique_id.
mooc1_etl <- left_join(mooc1, mooc1_states, by='zip') # Using `left_join()`
#(the correct way for our case)
dim(mooc1_etl)
## [1] 1049961 28
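And the companion check on unique_id (same presumed dataframe name):

summary(mooc1_etl$unique_id) #This time there should be no NAs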
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6 728269 1455575 1455701 2184398 2909475
Sure enough, when we do this, we get the same number of rows before and after, one additional column after, and no NAs.
Finally, let’s export this table. We can export the dataframe as an ‘rds’ file that is native to R. Again, this is good practice for several reasons. First, while we could continue to import the file from the csv version every time, this takes longer than an rds file. Second, by exporting the file midway through analysis, we can always come back and read in that file and not have to go back and redo all of the manipulation that we already did. This is not hard to do, but it does waste time. Finally, saving to rds rather than csv preserves the format of the columns. For example, the date column will be correctly formatted as the Date format, and we won’t have to go back and fix that again. We can also compress the file.
write_rds(mooc1_etl, 'mooc1_etl_done.rds', compress = 'gz') #Export mooc1_etl for use later
#while retaining the column formats
This project on ETL covered many of the key skills needed to load, transform, and manipulate data. These skills will serve you well in solving business problems with data.