TASK ONE: SITE NAMES AND ADDRESSES


Goal: In this process there will be three general goals that are accomplished by proofing the site data and attaching “flags” to the data representing any modifications made to or problems with the current file. Goals include:

  1. Ensure that there is a single site definition for each site.
  2. Ensure that previous site definitions can be seamlessly connected to previously used site codes.
  3. Ensure that there is only one address associated with each site and that this address is as accurate as possible.

Task One, location of file:

~/siteDataCleaning/data2beCleaned/[city]siteDefinition.csv

Task One, file metadata (i.e., field definitions):

  • site: The site code uses the convention: [first 4 letter of last name][first 3 letters of first name][2-letter state abbrv.][1 or 2]. For example, Bob Reitsma’s site in Maryland is REITBOBMD1. The 1 or 2 at the end of each site code represents whether a site has moved. For example, if Bob moved his previous house would be REITBOBMD1 and his new house would be REITBOBMD2. If there is a bad site code, include that code in the oldSite field and apply the correct site code in this field.

  • oldSite: This is ANY previous site code definition. In some cases the old site may include a misspelling. If this is the case, ensure that the misspelled name and correctly spelled name remain in the data file and that both are associated with the current site code.

  • lastName The last name of the participant. If there are multiple spellings of the last name, include all in the modified file.

  • firstName: The first name of the participant. If there are multiple spellings of the first name, include all in the cleaned file.

  • street: The number and street address of the participant associated with a given site.

  • city: Obvious

  • state: Two-letter state abbreviation.

  • zip: The five-digit zip code of the participant

  • flag: This field is used to provide information for the current stage in data cleaning. Flag codes include:
    • 0: Record has NOT been worked on yet
    • 1: Record is correct
    • 2: There was a problem with this record and a correct record is available or has been provided
    • 3: There was a problem with this record and a correct record IS NOT available or WAS NOT provided
  • notes: This field is used to provide a brief description of any problems associated with a record.

Task One, cleaning process:

For each record (i.e., row in the spreadsheet):

  1. Check that site follows the correct naming convention and ensure that all fields are uniquely associated with the appropriate site field. There may be multiple different oldSite’s associated with a single site field – this is acceptable. If no modification of the record is necessary, enter “1” as the flag code.

  2. If you discover a problem with any of these fields, search for a solution to the problem. For example, you may find that there are multiple zip codes associated with a single site. If this is the case, enter the address into Google and see what the zip code should be.

  3. If you are unnable to provide a correct record, enter “3” as the flag code.
  4. If you can provide a correct record, insert a blank row and provide the corrected record in that row. Enter “1” in the flag code field of the correct record and “2” in the flag code field of the problem record.
  5. If there was a problem with a record, add a brief note in the notes field describing the nature of the problem.

  6. Save file as a “.csv” (Microsoft will give you a warning message when doing so – ignore the warning)

4. Task One, example:


TASK TWO: ERROR CHECKING LATITUDE AND LONGITUDE DATA


Goal: To determine a site’s position along the rural-to-urban gradient, it is necessary that the spatial location of the site is provided as accurately as possible. In some cases, latitude and longitudes that we have on record have been found to not be representative of the site’s location. In other cases, we have multiple latitude and longitudes on file and need to choose the one closest to the site. Here, you will compare provided latitude and longitudes (or those that are absent) with the location of sites on Google Maps to determine the best approximation of the spatial location of Nestwatch sites.

1. Task two, location of file:

~/siteDataCleaning/data2beCleaned/[city]SiteLocation.csv

2. Task two, file metadata (i.e., field definitions):

  • site: The site code uses the convention: [first 4 letter of last name][first 3 letters of first name][2-letter state abbrv.][1 or 2]. For example, Bob Reitsma’s site in Maryland is REITBOBMD1. The 1 or 2 at the end of each site code represents whether a site has moved. For example, if Bob moved his old house would be REITBOBMD1 and his new house would be REITBOBMD2.

  • longitude: Ensure that longitude values are provided in decimal degrees

  • latitude: Ensure that latitude values are provided in decimal degrees

  • flag: This field is used to provide information for the current stage in data cleaning. Flag codes include:
    • 0: Record has NOT been worked on yet
    • 1: Spatial location of the site is estimated to be correct
    • 2: There was a problem with this record and a correct record is available or has been provided
    • 3: Site record has NOT been modified due to an unresolved issue
  • notes: This field is used to provide any data cleaning notes relavent to fixes that were necessary.

3. Task two, cleaning process:

For each record (i.e., row in the spreadsheet), check the latitude and longitude of the site. To do so, open the file ~/siteDataCleaning/data2beCleaned/[city]siteDefinition.csv alongside the site location file. Use Google Maps to navigate to the street address of the site. Google’s current geocoding isn’t perfect, but is rather good and will let you know if a site’s address is in the right ballpark. Use the following criteria to flag sites (providing a description of any problems in the notes field):

  • If the address of a site projected on Google Maps is within 100 m or so of the latitude and longitude on the spatial data spreadsheet, you can assign field code “1” and this record does not require modification.

  • If a site has multiple latitude and longitudes associated with it and one was deemed “close to” the site in Google Maps, those that are not close to the site can be flagged as “2”.

  • If a site has only one latitude and longitude on record and it is not projected close to the site address in Google Maps, assign the flag code “3” and do not proceed further with this site. Likewise, if a site has multiple latitude and longitudes on record and none are projected close to the site address in Google Maps, assign the flag code “3” to each of the records.

4. Task two, example: