Cleaning Zip Code Formatting

Purpose

We needed to verify a set of data was correctly reporting web access by zip code. The dataset listed the state of access and the first three digits of the zip code as collected by the client’s service.

Upon examination of the data, we noticed that a number of reported zip codes did not match the state of origin. We also noticed that many of the reported zip codes had one or two digits instead of the full three. Lastly, a small number of zip codes were actually Canadian (which uses alpha-numerical characters in the first three digits of their zip codes.)

The project proceeded with adding preceeding zeros to the client’s dataset where applicable, assuming that a program like Excel dropped them. Later, we confirmed whether or not the zip codes matched the state reported. This was finally visualized using Tableau.

As part of the non-confidential work, this code will clean a zipcode database available from the federal government at http://federalgovernmentzipcodes.us/

The file is labeled free-zipcode-database.csv

It provides entire zip codes, the municipality they are associated with and latitude - longitude coordinates. For our purposes, we only needed the first three digits of the zip code and the state from which it came.

Making a leading 3 digit codex for US zip codes

The file is located in the current working directory.

USzips = read.csv("free-zipcode-database.csv", header = TRUE, sep = ",", colClasses = "character")
head(USzips)

##   RecordNumber Zipcode ZipCodeType                City State
## 1            1   00704    STANDARD         PARC PARQUE    PR
## 2            2   00704    STANDARD PASEO COSTA DEL SUR    PR
## 3            3   00704    STANDARD       SECT LANAUSSE    PR
## 4            4   00704    STANDARD     URB EUGENE RICE    PR
## 5            5   00704    STANDARD        URB GONZALEZ    PR
## 6            6   00704    STANDARD      URB LA FABRICA    PR
##     LocationType   Lat   Long Xaxis Yaxis Zaxis WorldRegion Country
## 1 NOT ACCEPTABLE 17.96 -66.22  0.38 -0.87  0.30        <NA>      US
## 2 NOT ACCEPTABLE 17.96 -66.22  0.38 -0.87  0.30        <NA>      US
## 3 NOT ACCEPTABLE 17.96 -66.22  0.38 -0.87  0.30        <NA>      US
## 4 NOT ACCEPTABLE 17.96 -66.22  0.38 -0.87  0.30        <NA>      US
## 5 NOT ACCEPTABLE 17.96 -66.22  0.38 -0.87  0.30        <NA>      US
## 6 NOT ACCEPTABLE 17.96 -66.22  0.38 -0.87  0.30        <NA>      US
##              LocationText                     Location Decommisioned
## 1         Parc Parque, PR         NA-US-PR-PARC PARQUE         false
## 2 Paseo Costa Del Sur, PR NA-US-PR-PASEO COSTA DEL SUR         false
## 3       Sect Lanausse, PR       NA-US-PR-SECT LANAUSSE         false
## 4     Urb Eugene Rice, PR     NA-US-PR-URB EUGENE RICE         false
## 5        Urb Gonzalez, PR        NA-US-PR-URB GONZALEZ         false
## 6      Urb La Fabrica, PR      NA-US-PR-URB LA FABRICA         false
##   TaxReturnsFiled EstimatedPopulation TotalWages Notes
## 1                                                     
## 2                                                     
## 3                                                     
## 4                                                     
## 5                                                     
## 6

names(USzips)

##  [1] "RecordNumber"        "Zipcode"             "ZipCodeType"        
##  [4] "City"                "State"               "LocationType"       
##  [7] "Lat"                 "Long"                "Xaxis"              
## [10] "Yaxis"               "Zaxis"               "WorldRegion"        
## [13] "Country"             "LocationText"        "Location"           
## [16] "Decommisioned"       "TaxReturnsFiled"     "EstimatedPopulation"
## [19] "TotalWages"          "Notes"

dim(USzips)

## [1] 81831    20

The only columns of interest are presently “Zipcode” and “State”. So let’s only look at those.

longZips = USzips[,c("Zipcode", "State")]
head(longZips)

##   Zipcode State
## 1   00704    PR
## 2   00704    PR
## 3   00704    PR
## 4   00704    PR
## 5   00704    PR
## 6   00704    PR

Now to extract the first 3 digits

shortZips = longZips
shortZips$Zip3 = substring(longZips$Zipcode, 1,3)
head(shortZips)

##   Zipcode State Zip3
## 1   00704    PR  007
## 2   00704    PR  007
## 3   00704    PR  007
## 4   00704    PR  007
## 5   00704    PR  007
## 6   00704    PR  007

For each distinct 3 digit zip, there needs to be an associated state. But there doesn’t need to be one hundred 007’s for example, just one.

shortZips = shortZips[,c(2:3)]
zipList = unique(shortZips)
head(zipList)

##     State Zip3
## 1      PR  007
## 26     PR  006
## 113    NJ  070
## 115    NJ  076
## 122    NJ  078
## 123    NJ  079

dim(zipList)

## [1] 939   2

Save the file as csv.

write.csv(zipList, "zipsFirst3.csv", row.names = FALSE)

Now that this csv file has been created, it can be imported into Excel (the zipcodes must be imported as text) and sorted or filtered as desired. It can also be merged with the client’s file in Tableau. But just as importantly, we have a reference for future use of geographic locations for the first 3 digits of US zip codes!

Correcting the Client’s Data

I will now load an anonymized subset of the client’s data. I did some additional preparation by changing some of the data descriptions, sorting the data by zip code, and changing the different versions of ‘NA’ and the three Canadian zip codes to actually say ‘NA’.

client.data = read.csv("anonData.csv", header = TRUE, sep = ",", colClasses = "character")
head(client.data)

##   Category.1 Category.Type  Gender         State Merge Zip.First.3
## 1         A8           T25       F    Washington    00           0
## 2         A5           T24       F        Oregon    00           0
## 3         A6           T24       F          Utah    00           0
## 4         A3           T28       F    Washington    00           0
## 5         A7           T24 Not Set Massachusetts     0          10
## 6         A7           T01 Not Set Massachusetts     0          10
##   Total.Events     Month    Region
## 1            1  7/1/2016      West
## 2            3 10/1/2016      West
## 3            6 10/1/2016      West
## 4            1 11/1/2016      West
## 5            1  1/1/2015 Northeast
## 6            1  3/1/2015 Northeast

names(client.data)

## [1] "Category.1"    "Category.Type" "Gender"        "State"        
## [5] "Merge"         "Zip.First.3"   "Total.Events"  "Month"        
## [9] "Region"

dim(client.data)

## [1] 49191     9

summary(client.data)

##   Category.1        Category.Type         Gender         
##  Length:49191       Length:49191       Length:49191      
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##     State              Merge           Zip.First.3       
##  Length:49191       Length:49191       Length:49191      
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##  Total.Events          Month              Region         
##  Length:49191       Length:49191       Length:49191      
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character

We can see from the summary that the lowest zip code is ‘0’. As a side note, no US zip codes lead with ‘000’ so this data set has more than one kind of data that will be cleaned. Naturally, this is trusting the assumption that the client’s data provider did indeed provide the first three digits of all the web logged zip codes.

Now, having identified the zip codes which have only one and two digits, I will add leading zeros. Lines 1:4 need two preceeding zeros. Lines 5:667 need one preceeding zero. I added the correct number of zeros in Excel.

zipX = client.data[c(1:667),c(5:6)]
zipX[,3] = paste(zipX[,1], zipX[,2], sep = '')
head(zipX)

##   Merge Zip.First.3  V3
## 1    00           0 000
## 2    00           0 000
## 3    00           0 000
## 4    00           0 000
## 5     0          10 010
## 6     0          10 010

$V3 is the correctly formatted zip code for rows 1:667.

Now we update the original column with the correct zip code data.

fixed.data = client.data
fixed.data[c(1:667),"Zip.First.3"] = zipX$V3
head(fixed.data)

##   Category.1 Category.Type  Gender         State Merge Zip.First.3
## 1         A8           T25       F    Washington    00         000
## 2         A5           T24       F        Oregon    00         000
## 3         A6           T24       F          Utah    00         000
## 4         A3           T28       F    Washington    00         000
## 5         A7           T24 Not Set Massachusetts     0         010
## 6         A7           T01 Not Set Massachusetts     0         010
##   Total.Events     Month    Region
## 1            1  7/1/2016      West
## 2            3 10/1/2016      West
## 3            6 10/1/2016      West
## 4            1 11/1/2016      West
## 5            1  1/1/2015 Northeast
## 6            1  3/1/2015 Northeast

Now we can remove the unwanted column “Merge”, column 5 and save the data.

fixed.data = fixed.data[, c(1:4, 6:9)]
head(fixed.data)

##   Category.1 Category.Type  Gender         State Zip.First.3 Total.Events
## 1         A8           T25       F    Washington         000            1
## 2         A5           T24       F        Oregon         000            3
## 3         A6           T24       F          Utah         000            6
## 4         A3           T28       F    Washington         000            1
## 5         A7           T24 Not Set Massachusetts         010            1
## 6         A7           T01 Not Set Massachusetts         010            1
##       Month    Region
## 1  7/1/2016      West
## 2 10/1/2016      West
## 3 10/1/2016      West
## 4 11/1/2016      West
## 5  1/1/2015 Northeast
## 6  3/1/2015 Northeast

write.csv(fixed.data, "updated.client.data.csv")

Perfect! Now the data may be analyzed in Tableau!