We needed to verify a set of data was correctly reporting web access by zip code. The dataset listed the state of access and the first three digits of the zip code as collected by the client’s service.
Upon examination of the data, we noticed that a number of reported zip codes did not match the state of origin. We also noticed that many of the reported zip codes had one or two digits instead of the full three. Lastly, a small number of zip codes were actually Canadian (which uses alpha-numerical characters in the first three digits of their zip codes.)
The project proceeded with adding preceeding zeros to the client’s dataset where applicable, assuming that a program like Excel dropped them. Later, we confirmed whether or not the zip codes matched the state reported. This was finally visualized using Tableau.
As part of the non-confidential work, this code will clean a zipcode database available from the federal government at http://federalgovernmentzipcodes.us/
The file is labeled free-zipcode-database.csv
It provides entire zip codes, the municipality they are associated with and latitude - longitude coordinates. For our purposes, we only needed the first three digits of the zip code and the state from which it came.
The file is located in the current working directory.
USzips = read.csv("free-zipcode-database.csv", header = TRUE, sep = ",", colClasses = "character")
head(USzips)
## RecordNumber Zipcode ZipCodeType City State
## 1 1 00704 STANDARD PARC PARQUE PR
## 2 2 00704 STANDARD PASEO COSTA DEL SUR PR
## 3 3 00704 STANDARD SECT LANAUSSE PR
## 4 4 00704 STANDARD URB EUGENE RICE PR
## 5 5 00704 STANDARD URB GONZALEZ PR
## 6 6 00704 STANDARD URB LA FABRICA PR
## LocationType Lat Long Xaxis Yaxis Zaxis WorldRegion Country
## 1 NOT ACCEPTABLE 17.96 -66.22 0.38 -0.87 0.30 <NA> US
## 2 NOT ACCEPTABLE 17.96 -66.22 0.38 -0.87 0.30 <NA> US
## 3 NOT ACCEPTABLE 17.96 -66.22 0.38 -0.87 0.30 <NA> US
## 4 NOT ACCEPTABLE 17.96 -66.22 0.38 -0.87 0.30 <NA> US
## 5 NOT ACCEPTABLE 17.96 -66.22 0.38 -0.87 0.30 <NA> US
## 6 NOT ACCEPTABLE 17.96 -66.22 0.38 -0.87 0.30 <NA> US
## LocationText Location Decommisioned
## 1 Parc Parque, PR NA-US-PR-PARC PARQUE false
## 2 Paseo Costa Del Sur, PR NA-US-PR-PASEO COSTA DEL SUR false
## 3 Sect Lanausse, PR NA-US-PR-SECT LANAUSSE false
## 4 Urb Eugene Rice, PR NA-US-PR-URB EUGENE RICE false
## 5 Urb Gonzalez, PR NA-US-PR-URB GONZALEZ false
## 6 Urb La Fabrica, PR NA-US-PR-URB LA FABRICA false
## TaxReturnsFiled EstimatedPopulation TotalWages Notes
## 1
## 2
## 3
## 4
## 5
## 6
names(USzips)
## [1] "RecordNumber" "Zipcode" "ZipCodeType"
## [4] "City" "State" "LocationType"
## [7] "Lat" "Long" "Xaxis"
## [10] "Yaxis" "Zaxis" "WorldRegion"
## [13] "Country" "LocationText" "Location"
## [16] "Decommisioned" "TaxReturnsFiled" "EstimatedPopulation"
## [19] "TotalWages" "Notes"
dim(USzips)
## [1] 81831 20
The only columns of interest are presently “Zipcode” and “State”. So let’s only look at those.
longZips = USzips[,c("Zipcode", "State")]
head(longZips)
## Zipcode State
## 1 00704 PR
## 2 00704 PR
## 3 00704 PR
## 4 00704 PR
## 5 00704 PR
## 6 00704 PR
Now to extract the first 3 digits
shortZips = longZips
shortZips$Zip3 = substring(longZips$Zipcode, 1,3)
head(shortZips)
## Zipcode State Zip3
## 1 00704 PR 007
## 2 00704 PR 007
## 3 00704 PR 007
## 4 00704 PR 007
## 5 00704 PR 007
## 6 00704 PR 007
For each distinct 3 digit zip, there needs to be an associated state. But there doesn’t need to be one hundred 007’s for example, just one.
shortZips = shortZips[,c(2:3)]
zipList = unique(shortZips)
head(zipList)
## State Zip3
## 1 PR 007
## 26 PR 006
## 113 NJ 070
## 115 NJ 076
## 122 NJ 078
## 123 NJ 079
dim(zipList)
## [1] 939 2
Save the file as csv.
write.csv(zipList, "zipsFirst3.csv", row.names = FALSE)
Now that this csv file has been created, it can be imported into Excel (the zipcodes must be imported as text) and sorted or filtered as desired. It can also be merged with the client’s file in Tableau. But just as importantly, we have a reference for future use of geographic locations for the first 3 digits of US zip codes!
I will now load an anonymized subset of the client’s data. I did some additional preparation by changing some of the data descriptions, sorting the data by zip code, and changing the different versions of ‘NA’ and the three Canadian zip codes to actually say ‘NA’.
client.data = read.csv("anonData.csv", header = TRUE, sep = ",", colClasses = "character")
head(client.data)
## Category.1 Category.Type Gender State Merge Zip.First.3
## 1 A8 T25 F Washington 00 0
## 2 A5 T24 F Oregon 00 0
## 3 A6 T24 F Utah 00 0
## 4 A3 T28 F Washington 00 0
## 5 A7 T24 Not Set Massachusetts 0 10
## 6 A7 T01 Not Set Massachusetts 0 10
## Total.Events Month Region
## 1 1 7/1/2016 West
## 2 3 10/1/2016 West
## 3 6 10/1/2016 West
## 4 1 11/1/2016 West
## 5 1 1/1/2015 Northeast
## 6 1 3/1/2015 Northeast
names(client.data)
## [1] "Category.1" "Category.Type" "Gender" "State"
## [5] "Merge" "Zip.First.3" "Total.Events" "Month"
## [9] "Region"
dim(client.data)
## [1] 49191 9
summary(client.data)
## Category.1 Category.Type Gender
## Length:49191 Length:49191 Length:49191
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
## State Merge Zip.First.3
## Length:49191 Length:49191 Length:49191
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
## Total.Events Month Region
## Length:49191 Length:49191 Length:49191
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
We can see from the summary that the lowest zip code is ‘0’. As a side note, no US zip codes lead with ‘000’ so this data set has more than one kind of data that will be cleaned. Naturally, this is trusting the assumption that the client’s data provider did indeed provide the first three digits of all the web logged zip codes.
Now, having identified the zip codes which have only one and two digits, I will add leading zeros. Lines 1:4 need two preceeding zeros. Lines 5:667 need one preceeding zero. I added the correct number of zeros in Excel.
zipX = client.data[c(1:667),c(5:6)]
zipX[,3] = paste(zipX[,1], zipX[,2], sep = '')
head(zipX)
## Merge Zip.First.3 V3
## 1 00 0 000
## 2 00 0 000
## 3 00 0 000
## 4 00 0 000
## 5 0 10 010
## 6 0 10 010
$V3 is the correctly formatted zip code for rows 1:667.
Now we update the original column with the correct zip code data.
fixed.data = client.data
fixed.data[c(1:667),"Zip.First.3"] = zipX$V3
head(fixed.data)
## Category.1 Category.Type Gender State Merge Zip.First.3
## 1 A8 T25 F Washington 00 000
## 2 A5 T24 F Oregon 00 000
## 3 A6 T24 F Utah 00 000
## 4 A3 T28 F Washington 00 000
## 5 A7 T24 Not Set Massachusetts 0 010
## 6 A7 T01 Not Set Massachusetts 0 010
## Total.Events Month Region
## 1 1 7/1/2016 West
## 2 3 10/1/2016 West
## 3 6 10/1/2016 West
## 4 1 11/1/2016 West
## 5 1 1/1/2015 Northeast
## 6 1 3/1/2015 Northeast
Now we can remove the unwanted column “Merge”, column 5 and save the data.
fixed.data = fixed.data[, c(1:4, 6:9)]
head(fixed.data)
## Category.1 Category.Type Gender State Zip.First.3 Total.Events
## 1 A8 T25 F Washington 000 1
## 2 A5 T24 F Oregon 000 3
## 3 A6 T24 F Utah 000 6
## 4 A3 T28 F Washington 000 1
## 5 A7 T24 Not Set Massachusetts 010 1
## 6 A7 T01 Not Set Massachusetts 010 1
## Month Region
## 1 7/1/2016 West
## 2 10/1/2016 West
## 3 10/1/2016 West
## 4 11/1/2016 West
## 5 1/1/2015 Northeast
## 6 3/1/2015 Northeast
write.csv(fixed.data, "updated.client.data.csv")
Perfect! Now the data may be analyzed in Tableau!