2021 UK Postal Code Matching Report

This report details the results of Gallup’s ability to match the verbatim postal codes collected across three waves of the 2021 Gallup World Poll in the UK to the DG Regio reference file, for the purposes of urbanisation coding. This analysis was conducted by Andrew Dugan.

As will be seen, we were able to successfully match about 70% of all respondents across the three waves. This is somewhat below our goal of 90% (for 2022), but a significant improvement over previous attempts to match the UK data to the reference file. This document explains the steps needed to match this data more easily; it is not a straightforward match-and-merge operation.

First, this document provides important (though perhaps not exciting) information about the UK postal code structure. Understanding how this string can be decomposed, and what its constituent parts signify, also helps with troubleshooting non-matching records, as will be seen below.

Introduction: Background Information on the UK Postal Code format (provides useful information for dealing with matching problems)

This report reviews Gallup’s ability to match the verbatim postal-code information collected in 2021 in the three World Poll waves conducted in the United Kingdom. DG-Regio (Lewis) provided a reference file which matches all UK postal codes to his urbanicity variable.

The main obstacle Gallup has had in matching our postal codes with his file boils down to formatting. The first step of this report is to format Lewis’s file to eliminate any extra space (not counting the customary space in between the two major portions of the UK postal code) that can sometimes appear when dealing with text data.

Below is a table which simply shows the first ten entries of the DG Regio file (note: some extraneous columns were dropped from the below table for the sake of space).

Our primary column of interest is the first one in Table 1, named “POSTCODE.” It’s worth emphasizing that the postcode is rendered in the traditional format, with its two meaningful constituent components (source: https://ideal-postcodes.co.uk/guides/uk-postcode-format).

uk.ref%>%
  dplyr::select(-GISCO_ID, -NSI_CODE)%>%
  head(n=10)%>%
  kbl(caption="Table 1. Example data from Lewis's reference file for the UK")%>%
  kable_styling(font_size=14, bootstrap_options = c("striped"))%>%
  kable_classic_2(full_width =F)
Table 1. Example data from Lewis’s reference file for the UK
POSTCODE CNTR_ID PC_CNTR LAT_NAT LAU_LATIN DGURBA_FINAL_2018
AB10 1QD UK UK_AB10 1QD Aberdeen City 1
AB10 1QQ UK UK_AB10 1QQ Aberdeen City 1
AB10 1QT UK UK_AB10 1QT Aberdeen City 1
AB10 1QX UK UK_AB10 1QX Aberdeen City 1
AB10 1RG UK UK_AB10 1RG Aberdeen City 1
AB10 1RJ UK UK_AB10 1RJ Aberdeen City 1
AB10 1RL UK UK_AB10 1RL Aberdeen City 1
AB10 1RP UK UK_AB10 1RP Aberdeen City 1
AB10 1RX UK UK_AB10 1RX Aberdeen City 1
AB10 1RY UK UK_AB10 1RY Aberdeen City 1

Understanding the structure of the postcode can be helpful in dealing with partial matches, both in our current 2021 dataset and in the future. Take the first postcode in Table 1, which reads:

AB10 1QD

The first portion of the postcode (i.e. everything before the space) is known as the outward code (sometimes called the ‘outcode’). It can be 2-4 characters long, but will always begin with a letter. This portion of the post code is the least specific part in terms of the geographic or location-based information it reveals, though it will at least help determine which country of the United Kingdom a person is located in. For instance, codes beginning with “BT” indicate an individual lives in Northern Ireland. While there are approximately 1.7 million postal codes in the UK, there are only about 3,000 unique outward codes.

The last string of alphanumeric characters (everything after the space) is the inward code. It is ALWAYS three characters long. It will also begin with a number, such as “1QD” in the above example. This portion of the code is more helpful in terms of providing an individual’s location, but people may also be more reluctant to provide it. There are 4,001 unique inward codes.

The postcode can be decomposed further, including:

  • POSTCODE AREA: This is the longest initial string of letters in a post code. In the above example, it corresponds to “AB,” which would indicate Aberdeen, Scotland.

  • SECTOR: Note this is going to be very important in matching our POSTAL_CODE_SAMPLE information. This represents the outward code and the first character of the inward code (which, recall, is ALWAYS a number). It can be 3-5 characters long. I believe many of our partial codes are the SECTOR, as this leaves off the telling last two characters. Still, these could be useful. There are 11,226 unique Sector codes (according to the source cited above).
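
The decomposition described above can be sketched in a few lines of base R. This is a minimal illustration rather than code from the original analysis: `parse_postcode` is a hypothetical helper name, and it assumes its input is a complete, well-formed postcode (a partial code would be split incorrectly).

```r
# Hypothetical helper (not part of the original analysis): split a full,
# well-formed UK postcode into the components described above.
parse_postcode <- function(x) {
  x <- toupper(gsub("\\s+", "", x))               # drop whitespace, upper-case
  n <- nchar(x)
  outward <- substr(x, 1, n - 3)                  # everything before the inward code
  inward  <- substr(x, n - 2, n)                  # inward code is always 3 characters
  area    <- sub("^([A-Z]+).*$", "\\1", outward)  # leading run of letters
  sector  <- paste0(outward, substr(inward, 1, 1))
  data.frame(outward, inward, area, sector, stringsAsFactors = FALSE)
}

parse_postcode(c("AB10 1QD", "e4 7sb"))
```

Because the inward code has a fixed length of three, the split works from the right-hand end of the string, which is why the customary space is not needed to find the boundary.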

Having gained a better understanding of the UK postal code system, we now turn to formatting the reference file (provided by Lewis). The issue is that all of the postcodes in column 1 of Lewis’s file are set to have a length of 8 characters (this includes spaces). This includes a postcode such as “E4 7SB,” which has only six characters if one counts the customary space (and five if that is excluded). In the reference file, though, this code is rendered as “E4  7SB,” i.e. it has two spaces. Our first transformation is to remove the extra spaces, meaning the two parts of every code will be separated by one space only. This is accomplished with the str_squish function from the stringr package in R.

After making this alteration, we then take stock of the total number of characters in the reference file postcodes (including the space in between the outcode and incode). Postcodes will have between 6-8 characters; the vast majority will have 7-8 characters.

uk.ref$post.code<-str_squish(uk.ref$POSTCODE)

uk.ref$post.code.length<-str_length(uk.ref$post.code)



post.code.length.count<-uk.ref%>%
  group_by(post.code.length)%>%
  tally()

post.code.length.count$char.count<-factor(post.code.length.count$post.code.length, levels=c(6,7,8), labels=c("Six characters", "Seven characters", "Eight characters"))


ggbarplot(post.code.length.count, x="char.count", y="n", fill="steelblue", color="steelblue", label=TRUE, lab.pos="out",ylim=c(0,900000),
          xlab="",ylab="",
          title="UK Postal Codes: Number of characters in reference file,\nincluding the space in between the outcode and incode")

Finally, for the purposes of easier matching, we create a new variable in the reference file “post.code.no.space.” As the name suggests, this simply provides the postcode characters without the space in between the outcode and incode. Crucially, all alphabetic characters in this string are expressed as capital letters. We also create a variable measuring the length of this new variable, which will be useful in troubleshooting the merge process. The below count of this new length variable shows, as we would expect, all UK postcodes now fall between 5-7 characters.

uk.ref$post.code.no.space<-
  str_replace_all(uk.ref$post.code, " ", "")

###Variable measuring length of the postcode 

uk.ref$post.code.no.space.length<-str_length(uk.ref$post.code.no.space)

table(uk.ref$post.code.no.space.length)
## 
##      5      6      7 
##  44595 857633 856993

MERGE OPERATION 1: UK Gallup World Poll, Waves 1 & 2

As the first two 2021 Gallup World Poll waves in the UK contain, for the purposes of this analysis, the same variables (i.e. they do not include the NUTS 2 region, which is included in Wave 3), we will first combine the waves together. We will also subset on the variables of interest, or those that could be useful in matching back to the original file, including:

  • WPID
  • WP5
  • REGION_GBR
  • WP15450 (Live in the city of Birmingham?)
  • WP15451 (Live in the City of London?)
  • POSTAL_CODE
  • POSTAL_CODE_DUB
  • POSTAL_CODE_REPORTING
  • POSTAL_CODE_SAMPLE
df1.reduced<-df1%>%
  dplyr::select(WPID, WP5,REGION_GBR, WP15450,  WP15451, POSTAL_CODE, POSTAL_CODE_DUB, WP22007, POSTAL_CODE_REPORTING, POSTAL_CODE_SAMPLE,WT)
                
df2.reduced<-df2%>%
  dplyr::select(WPID, WP5,REGION_GBR, WP15450,  WP15451, POSTAL_CODE, POSTAL_CODE_DUB, WP22007, POSTAL_CODE_REPORTING, POSTAL_CODE_SAMPLE, WT)

###Combine the two waves#################

df.two.waves<-rbind(df1.reduced,df2.reduced)


df<-df.two.waves

First, we turn to POSTAL_CODE_SAMPLE, which is the postal code of the respondent as recorded via our sample file. This is only recorded for individuals contacted via landline; for other individuals it is blank. It’s important to recode these blank entries as NA in order to be able to work with this variable. The first section of code in the below chunk makes this change.

Next, we consider respondents who are coded as “98” or “99” in the POSTAL_CODE_REPORTING variable – meaning we received a DK/Refused response. For some of these respondents, we can instead insert the code in POSTAL_CODE_SAMPLE into POSTAL_CODE_REPORTING. The second line of code below makes this change.

df$POSTAL_CODE_SAMPLE[df$POSTAL_CODE_SAMPLE == ""]<-NA

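###Now assigning POSTAL_CODE_SAMPLE to all 98 or 99 values in POSTAL_CODE_REPORTING
###(the same index is used on both sides so each respondent receives their own sample code)

dk.with.sample<-which((df$POSTAL_CODE_REPORTING == "98.00" | df$POSTAL_CODE_REPORTING == "99.00") & !is.na(df$POSTAL_CODE_SAMPLE))

df$POSTAL_CODE_REPORTING[dk.with.sample]<-df$POSTAL_CODE_SAMPLE[dk.with.sample]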

Now it is important to determine how many respondents are coded 98/99 in POSTAL_CODE_REPORTING and for whom we have no further information. There are 97 such respondents in Waves 1 and 2, out of a total of 2,000. It should be noted, though, that 17 of these individuals said they lived in either London or Birmingham. However, the urbanization codes of these respondents were not imputed, given Lewis’s reluctance to do so. Theoretically, we could.

df.dk<-df%>%
  dplyr::filter(POSTAL_CODE_REPORTING == "99.00"|POSTAL_CODE_REPORTING == "98.00")

df.dk$no.code<-1

df.dk$no.code<-factor(df.dk$no.code, levels=c(1), labels=c("Respondent has no zip code"))

df.dk$lives.in.city<-0
df.dk$lives.in.city[df.dk$WP15451 == 1 | df.dk$WP15450 == 1]<-1

df.dk$lives.in.city<-factor(df.dk$lives.in.city, levels=c(0,1), labels=c("Does not live in London or Birmingham", "Lives in London or Birmingham"))

From here, we will exclude the DK/Refused respondents from the merge analysis. This brings the combined sample down to 1,903 respondents.

We now have a number of formatting issues. Many of the postal codes in the Gallup World Poll file are entered using lower-case letters. For ease of merging, we can use the “str_to_upper” function from R’s stringr package to capitalize all alphabetic characters. In general, our interviewers did NOT record the customary space in between the outcode and incode observed in the reference file, though in a few cases the interviewer did (it is not clear why we see this inconsistency).

In any case, we will transform POSTAL_CODE_REPORTING to include upper-case letters only and no spaces. This creates a new variable, and the main one of interest here, called “gwp.post.code.” (Note that similar operations are done on POSTAL_CODE_SAMPLE, creating a new variable POSTAL_CODE_SAMPLE2.)
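
The formatting step described above is not shown in this report, so the following is only a minimal base R sketch of what it might look like. `compress_postcode` is a hypothetical helper name, and treating empty strings as missing is an assumption made here for convenience.

```r
# Hypothetical helper: upper-case a verbatim postal code and strip every
# whitespace character, mirroring the "post.code.no.space" reference format.
compress_postcode <- function(x) {
  x <- toupper(gsub("[[:space:]]+", "", x))
  x[!is.na(x) & x == ""] <- NA_character_   # assumption: empty strings count as missing
  x
}

compress_postcode(c("ab10 1qd", " Ab10  1QQ ", ""))
```

Applying the same helper to both the reporting and the sample variables keeps the two fields directly comparable with the compressed reference postcodes.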

Once this is done, we take stock of the length of each of our post codes, and this metric reveals some issues. Recall that the compressed postal code in the reference file (i.e. without the space) will be between 5-7 characters. However, in the Gallup World Poll file the compressed character length ranges between 2-25 characters. Obviously, we will need to examine these carefully, but one quick solution we can implement is to replace the “gwp.post.code” variable with “POSTAL_CODE_SAMPLE2” if the length is below 5 or above 7. However, this only alters a few records.

## # A tibble: 10 x 1
##    POSTAL_CODE_SAMPLE2
##    <chr>              
##  1 TS146              
##  2 BH214              
##  3 <NA>               
##  4 TW33S              
##  5 EC1N2              
##  6 GU167              
##  7 CH632              
##  8 <NA>               
##  9 B169A              
## 10 <NA>
Table 2. Bizarre post codes in UK Waves 1 and 2
gwp.post.code.length gwp.post.code POSTAL_CODE_SAMPLE2
25 NOVEMBERNOVEMBER1006DELTA NA
19 PEPPEROSCOR52ALFAAS NA
13 M204TANGOALFA NA
10 PO3266AEEX S636B
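
The length-based fallback described above can be sketched as follows, using a handful of codes that appear in the tables of this report as toy data. The regular expression is a simplified shape check added here as an assumption (it ignores special cases such as “GIR 0AA”), and the `looks.complete` column is hypothetical; a code that passes it is well-formed but not necessarily present in the reference file.

```r
# Toy data standing in for the World Poll file (codes taken from this report)
gwp <- data.frame(
  gwp.post.code       = c("KY16DY", "PO3266AEEX", "M204TANGOALFA", "EC1N2"),
  POSTAL_CODE_SAMPLE2 = c(NA, "S636B", NA, "EC1N2"),
  stringsAsFactors = FALSE
)

# Swap in the sample code when the verbatim code has an implausible length
len <- nchar(gwp$gwp.post.code)
bad_length <- is.na(len) | len < 5 | len > 7
use_sample <- bad_length & !is.na(gwp$POSTAL_CODE_SAMPLE2)
gwp$gwp.post.code[use_sample] <- gwp$POSTAL_CODE_SAMPLE2[use_sample]

# Simplified shape of a full, compressed postcode (an assumption, see above)
full_pattern <- "^[A-Z]{1,2}[0-9][A-Z0-9]?[0-9][A-Z]{2}$"
gwp$looks.complete <- grepl(full_pattern, gwp$gwp.post.code)
gwp
```

Records that fail the shape check but have a plausible length are good candidates for the sector-level matching discussed later, since they are often partial codes.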

We can now attempt to match the reference file to the Gallup World Poll file. The below code executes the merge and then creates a categorical variable called “degree.of.urbanisation.” It has 4 levels – “densely-populated area” (urban); “intermediate” (peri-urban); “thinly-populated areas,” (rural) and then a final code to indicate the record did NOT match with anything in the reference file. (Note the name of the urbanisation categories are taken from here: https://ec.europa.eu/regional_policy/sources/docgener/work/2014_01_new_urban.pdf).
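
Since the merge code itself is not reproduced in this report, here is a hedged base R sketch of the kind of exact-match join described above. The toy reference rows reuse postcodes from Table 1; the `matched` flag (1 = exact match, 0 = no match) is an assumption about how the real merge file is coded.

```r
# Toy reference rows (postcodes and codes as shown in Table 1)
ref <- data.frame(
  post.code.no.space = c("AB101QD", "AB101QQ"),
  DGURBA_FINAL_2018  = c(1, 1),
  stringsAsFactors = FALSE
)
gwp <- data.frame(gwp.post.code = c("AB101QD", "KY16DY"), stringsAsFactors = FALSE)

# Left join: keep every respondent, attach a code where the postcode matches
idx <- match(gwp$gwp.post.code, ref$post.code.no.space)
gwp$DGURBA_FINAL_2018 <- ref$DGURBA_FINAL_2018[idx]
gwp$matched <- as.integer(!is.na(idx))   # assumed coding: 1 = matched, 0 = not

gwp$degree.of.urbanisation <- factor(
  ifelse(is.na(gwp$DGURBA_FINAL_2018), 9, gwp$DGURBA_FINAL_2018),
  levels = c(1, 2, 3, 9),
  labels = c("Densely populated area", "Intermediate area",
             "Thinly-populated area", "Record not matched")
)
table(gwp$degree.of.urbanisation)
```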

Table 2 below shows the match rate. The good news: we were able to match a majority of the records, approximately 60%. This is a significant improvement over previous attempts, which were only able to match a handful of records.

Table 2. UK 2021 Waves 1-2: Records Matched to Reference file by Urbanisation Category, Attempt 1
degree.of.urbanisation counts Prop
Densely populated area 724 38.0
Intermediate area 341 17.9
Thinly-populated area 76 4.0
Record not matched 762 40.0

The bad news? We still have 762 records that are not matched (not to mention the 97 respondents who are DK/Refused and have been removed from the data file for the purposes of these analyses).

We do have another trick up our sleeve: using, where possible, the sample postal code for the respondent if the respondent did not match via the gwp.post.code variable. However, an analysis of the sample postal codes reveals that they are all 5 characters long. The output below shows an example of what these codes look like for those respondents who did NOT match via their “gwp.post.code” entry (the output also shows this variable) and who also have data in the POSTAL_CODE_SAMPLE2 field.

Table 3. UK 2021 Waves 1-2: Example of POSTAL_CODE_SAMPLE2 data among respondents who did NOT match to reference file
gwp.post.code POSTAL_CODE_SAMPLE2
KY16DY BH214
EC1N2 EC1N2
GU167 GU167
G811UM B169A
D499PU TD74D
BT794FP DA74J
HU17GV PO202
LU66XQ NR203
TS21YM B461A
G344ST DE217
EH221 EH221
SS94P SS94P
NR14FO BA227
0AD8XP SW100
BT100 BT100

These postal codes look like the five-character Sector portion of the full postal code, which was explained above. Can we ask our sample providers to please provide the full postal code, rather than this version?

Nonetheless, the sector can be used to match records, though there will be some uncertainty in the process (and thus it is something we would want to clear with Lewis). Here’s the issue: the (Lewis-provided) reference file contains only full postal codes, so we must derive a sector variable from it. This is easily done by removing the final 2 characters from the postal code, as the sector consists of the outward code plus the first digit of the incode and, recall, all incodes are precisely 3 characters long. The postal codes within a sector do not always share the same degree of urbanisation. Of the approximately 11,000 sector codes in the reference file, 10,549 have 90% or more of their postal codes categorized in the same way for the “degree of urbanisation.” 10,891 sector codes (nearly all of them) have at least 70% of their underlying postal codes in the same degree of urbanisation category.

We will now try to match by Sector codes, only using those sector codes where the cross-urbanization of the lower-level postal codes was no more than 30%; put another way, AT LEAST 70% of the postal codes in the sector were coded the same way. We will of course be coding the dominant category for each sector.

The below code goes through this process and then shows the results in Table 4. Using Sector matching, we now are able to match about 70% of the records in Waves 1 and 2.

uk.ref$sector.original<-str_sub(uk.ref$post.code, 1, -3)


uk.ref<-uk.ref%>%
  mutate(sector=str_replace_all(sector.original, " ", ""))



uk.ref$urbanicity<-factor(uk.ref$DGURBA_FINAL_2018)

sector.by.urbanicity<-uk.ref%>%
  group_by(sector, urbanicity)%>%
  tally()%>%
  drop_na()%>%
  mutate(pct=round((n/sum(n))*100,1))%>%
  dplyr::select(-n)


sector.sum<-sector.by.urbanicity %>%
  group_by(sector)%>%
  dplyr::summarise(pct.max=max(pct))

sector.by.urbanicity.majority<-sector.by.urbanicity%>%
  dplyr::filter(pct>70)



sector.by.urbanicity.majority$matched<-0



merge1<-merge1%>%
  left_join(sector.by.urbanicity.majority, by=c("matched" ="matched", "POSTAL_CODE_SAMPLE2" = "sector"))


merge1$DGURBA_FINAL_2018[merge1$urbanicity == 1]<-1

merge1$DGURBA_FINAL_2018[merge1$urbanicity == 2]<-2

merge1$DGURBA_FINAL_2018[merge1$urbanicity == 3]<-3


merge1$degree.of.urbanisation<-ifelse(is.na(merge1$DGURBA_FINAL_2018), 9, merge1$DGURBA_FINAL_2018)

merge1$degree.of.urbanisation<-factor(merge1$degree.of.urbanisation, levels=c(1,2,3,9),
                                      labels=c('Densely populated area', 'Intermediate area', 'Thinly-populated area', 'Record not matched'))


####Table of last results


  merge1%>%
  group_by(degree.of.urbanisation)%>%
  tally()%>%
  rename(counts=n)%>%
  mutate(Prop=round((counts/1903)*100,1))%>%
  kbl(caption="Table 4. UK 2021 Waves 1-2: Updated records match, after Using postal codes")%>%
  kable_styling(font_size=14, bootstrap_options = c("striped"))%>%
  kable_classic_2(full_width =T)
Table 4. UK 2021 Waves 1-2: Updated records match, after Using postal codes
degree.of.urbanisation counts Prop
Densely populated area 797 41.9
Intermediate area 419 22.0
Thinly-populated area 104 5.5
Record not matched 583 30.6

We can now employ this same strategy, attempting to match SECTOR values, but on the original variable (as many respondents gave incomplete postal codes). Doing so helps marginally, bringing down the number of unmatched respondents from 583 to 559.
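
As an illustration of sector matching on partial verbatim codes, the base R sketch below builds a dominant-category lookup from toy reference data and then looks up sector-like codes directly. The toy postcodes and their urbanisation codes are invented for the example, and the threshold is written as “at least 70%” to follow the wording of the text.

```r
# Toy reference data (postcodes and codes are illustrative, not from the real file)
ref <- data.frame(
  post.code.no.space = c("EH221AA", "EH221AB", "EH221AD", "EH221AE",
                         "SS94PA", "SS94PB"),
  urbanicity = c(2, 2, 2, 3, 1, 2),
  stringsAsFactors = FALSE
)
# Sector = compressed postcode minus its final two characters
ref$sector <- substr(ref$post.code.no.space, 1, nchar(ref$post.code.no.space) - 2)

# Share of each urbanisation code within each sector
shares <- prop.table(table(ref$sector, ref$urbanicity), margin = 1)
dominant <- data.frame(
  sector     = rownames(shares),
  urbanicity = as.integer(colnames(shares)[apply(shares, 1, which.max)]),
  pct.max    = apply(shares, 1, max) * 100,
  stringsAsFactors = FALSE
)
dominant <- dominant[dominant$pct.max >= 70, ]   # keep sectors meeting the 70% rule

# Respondents who gave only a sector-like partial code
partial <- c("EH221", "SS94", "BT100")
partial_codes <- dominant$urbanicity[match(partial, dominant$sector)]
partial_codes
```

In this toy case only “EH221” receives a code: “SS94” is split evenly between two categories and so fails the threshold, and “BT100” is absent from the toy reference data.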

We have one last validity check on the remaining, unmatched postal codes. There is an R package called PostcodesioR, a wrapper around the postcodes.io service, which is built on open data from the UK’s Office for National Statistics. It has a look-up tool that will validate a suspected postal code. I created a function that loops through this look-up tool. However, none of the post codes came back as valid.
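
The looping function mentioned above is not reproduced in this report. A minimal sketch of one way to structure it is shown below; `check_codes` and `stub_validator` are hypothetical names, and the validator is passed in as an argument so the example runs offline with a stand-in.

```r
# Hypothetical wrapper: apply a validator to each code, returning NA when the
# lookup itself fails (for example, because of a network error).
check_codes <- function(codes, validator) {
  vapply(codes, function(code) {
    tryCatch(isTRUE(validator(code)), error = function(e) NA)
  }, logical(1), USE.NAMES = FALSE)
}

# Offline stand-in validator so the sketch runs without internet access
known <- c("AB101QD", "AB101QQ")
stub_validator <- function(code) code %in% known

check_codes(c("AB101QD", "KY16DY"), stub_validator)

# With the package installed and a connection available, the stub could be
# swapped for PostcodesioR::postcode_validation (assumed to return TRUE/FALSE).
```

Returning NA on error keeps a failed lookup distinct from a code that was checked and found to be invalid.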

MERGE OPERATION 2: UK Gallup World Poll, Wave 3

Now we look at UK Gallup World Poll Wave 3, which includes NUTS information. Wave 3 also has 1,000 respondents.

Quickly looking over key statistics, we see that:

  • 64 respondents are coded as 98/99 in POSTAL_CODE_REPORTING
  • However NONE of these respondents have data in POSTAL_CODE_SAMPLE, so we will not be able to utilize this information.
  • We do have WP22547, corresponding to the NUTS 2 regions of the UK (40 categories). Of the 64 respondents with no postal code information at all, 18 of them also did not provide this information (see below table).
  • Another 7 of the DK/Refused postal code respondents said they lived in either London or Birmingham, via question items WP15450 and WP15451.
## # A tibble: 1 x 1
##       n
##   <int>
## 1    64
Table 5. UK 2021 Wave 3: WP22547 (NUTS 2 region) Respondents who said DK/Refused to Postal Code
WP22547_NUTS2 n
Northumberland and Tyne and Wear 2
Greater Manchester 2
Lancashire 1
Merseyside 1
East Yorkshire and Northern Lincolnshire 1
North Yorkshire 1
South Yorkshire 1
West Yorkshire 2
Leicestershire, Rutland and Northamptonshire 5
Herefordshire, Worcestershire and Warwickshire 1
Shropshire and Staffordshire 1
East Anglia 3
Essex 1
Inner London — West 1
Inner London — East 1
Outer London — East and North East 3
Outer London — South 2
Berkshire, Buckinghamshire and Oxfordshire 2
Surrey, East and West Sussex 2
Hampshire and Isle of Wight 2
Kent 1
Gloucestershire, Wiltshire and Bristol/Bath area 2
Dorset and Somerset 1
West Wales and The Valleys 1
East Wales 1
Eastern Scotland 1
West Central Scotland 1
Northern Ireland 3
(DK) 9
(Refused) 9

We now move forward with the matching process, as originally conducted above. Again, we need to reformat the data in the Gallup World Poll file. The code below goes through this familiar process. In the first place, we will remove the DK/Refused records from this file. This takes the n size of the file down from 1,000 to 936.

We next implement the same functions used in the previous analysis to format the POSTAL_CODE_REPORTING and POSTAL_CODE_SAMPLE variables. Again, by “formatting,” this refers to ensuring that all alphabetic characters are uppercase, in accordance with the reference file as well as how the UK represents its postal codes. The Wave 3 file, in particular, had virtually no capital letters in the verbatim postal codes – it would be great to talk to our vendors about improving on this front.

We then need to “compress,” these codes – meaning we will take out the customary space in between the two portions of the postal code (as well as any other extraneous space). This will help with the matching process.

In performing these transformations, two new variables are created: “gwp.post.code” (the main variable we will use for matching) and “POSTAL_CODE_SAMPLE2.” This is done out of an abundance of caution; it isn’t necessary to create new variables, as we could simply write over the existing ones.

Once we have formatted and compressed the postal code verbatim responses into the variable “gwp.post.code,” we then examine how many characters each of these strings has. Again, a normal UK postal code will have between 5-7 characters (not counting the space in between the outcode and incode).

However, the length of our formatted/compressed variable falls across a much broader range, as Chart 1 shows. 104 respondents have postal codes that are between 2-4 characters long; another 8 have codes that are 8 or more. The good news is that some of these obviously problematic codes can be switched out for the POSTAL_CODE_SAMPLE2 value: 40 respondents out of the 112, to be exact. This transformation is also performed in the below code.

We can now proceed to matching the postal code fields in the reference file and Gallup World Poll Wave 3. Our first attempt is a simple match of the respondents’ (formatted and compressed) verbatim response to the reference file. The results are not terrible, but nor are they terribly impressive: 56.2 per cent of the 936 records (again, we have removed the DK/Refused here) were matched by their verbatim response. The other 43.8 per cent of respondents could NOT be matched on this basis alone.

Notably, these results are very similar to Waves 1 and 2, and perhaps suggest this is the baseline we might expect for this process in the future if no improvements are made in how we record or ask about postal codes.

Table 6. UK 2021 Wave 3: Records Matched to Reference file by Urbanisation Category, Attempt 1
degree.of.urbanisation counts Prop
Densely populated area 348 37.2
Intermediate area 148 15.8
Thinly-populated area 30 3.2
Record not matched 410 43.8

Among the 410 respondents in Wave 3 whose postal code we could not match, 171 have data in the POSTAL_CODE_SAMPLE2 field. Again, the POSTAL_CODE_SAMPLE data is expressed as the SECTOR portion of the UK postal code, meaning it represents most, but not ALL, of the information contained in these important digits.

Like last time, we will attempt to match the sector codes in the Gallup World Poll file to the sector codes in the reference file – though, again, Gallup had to “impute” (for lack of a better word) the sector codes from the reference file, as that file contains ONLY the full digits.

A similar standard as before will be imposed here: we will merge in sector codes where at least 70% of the underlying postal codes in any given sector have the same urbanisation code. Respondents who live in such a sector will be given the urbanisation code which corresponds with the dominant code.

Performing this operation improves our match rates: the percentage of records we were unable to match has fallen to 33.3% from 43.8%, about a ten percentage point improvement.

Table 7. UK 2021 Wave 3: Updated records match, after Using postal codes
degree.of.urbanisation counts Prop
Densely populated area 394 42.1
Intermediate area 195 20.8
Thinly-populated area 35 3.7
Record not matched 312 33.3

We can now employ this same strategy, attempting to match SECTOR values, but on the original variable (as many respondents gave incomplete postal codes). Doing so helps marginally, bringing down the number of unmatched respondents from 312 to 289.

Finally there is the question of what the NUTS 2 data can tell us about those respondents who have non-matching codes. While DG Regio does have urbanization codes by NUTS region, Lewis has indicated that he does not want to use such high-level (from his perspective) information. So this is not providing much assistance.

Further Troubleshooting

Combining all waves, this analysis was able to match about 70% of respondents to the reference file (setting aside the DK/Refused respondents, who numbered 161).

Table 8. UK 2021 ALL WAVES: Total records matched
degree.of.urbanisation counts Prop
Densely populated area 1191 42.0
Intermediate area 614 21.6
Thinly-populated area 139 4.9
Record not matched 895 31.5

While this is a good starting point, there are further troubleshooting methods we could employ for the 2022 data to get closer to the 90% mark we are aiming for. They will, though, come at a cost, namely time. We have a number of tools that allow us to confidently make approximate matches for partial or slightly incorrect codes, but this requires a case-by-case review. The inspection of any given case should not take long, but the cumulative effect could be cumbersome.

Here are some strategies we can consider in the future:

  • In some instances, an invalid post code provides enough information to “fill in the blanks.” One such example (in Wave 3) is the postcode “BT394JA.” It is invalid, according to the reference file; the R package also assesses it to be invalid via the postcode_lookup function.

But, using the same R package, we can use a function called “postcode_autocomplete” to find the closest alternatives to this code. I tried this with the aforementioned invalid code. Without showing all of the output, it becomes evident that there is no post code beginning with “BT39 4.” Likely it is “BT39 0JA,” which is in Northern Ireland. This actually matches the respondent’s location via WP22547. Furthermore, ALL postal codes beginning with “BT39” have the same code in Lewis’s file (2). While there is no quick, systematic way to do this review, I do believe we could confidently code respondents on the basis of this partial information, using the tools I have highlighted.

  • Respondents who are in London can be coded as 1.

  • As we did with the “sector,” the UK post code can be broken into smaller units which, depending on the location, may be useful in coding urbanization. This will take some time, as it is not always clear which part of the postal code a respondent is giving when they give an incomplete postal code.

  • On the front end, more careful recording from our interviewers. In a few instances, there appears to be evidence that respondents were trying to avoid confusion by using a word that started with the letter of their postal code: “November,” for instance, to signify the letter ‘N,’ which can be easily mistaken for ‘M.’ However, the interviewer recorded the full word “November.” No full words should be recorded when receiving a person’s postal code.

  • Other strategies may emerge as we look through the 2022 data. But while this seemed like a more challenging country, the results are actually not bad at all – about 70% of UK respondents were successfully coded by matching to the DG Regio reference file!
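
The “BT39” reasoning in the strategies above (coding a respondent from the outward code when every postcode under it shares one urbanisation code) could be checked systematically. The base R sketch below uses toy reference rows with invented postcodes and codes; `outward_code` is a hypothetical helper name.

```r
# Toy reference rows; postcodes and codes are illustrative only
ref <- data.frame(
  post.code = c("BT39 0JA", "BT39 9AB", "AB10 1QD", "AB10 9ZZ"),
  DGURBA_FINAL_2018 = c(2, 2, 1, 3),
  stringsAsFactors = FALSE
)
ref$outward <- sub(" .*$", "", ref$post.code)   # everything before the space

# Hypothetical helper: return the single urbanisation code shared by every
# postcode under an outward code, or NA when the codes differ or none exist.
outward_code <- function(outcode, ref) {
  codes <- unique(ref$DGURBA_FINAL_2018[ref$outward == outcode])
  if (length(codes) == 1) codes else NA_real_
}

outward_code("BT39", ref)   # all toy BT39 rows share one code
outward_code("AB10", ref)   # mixed codes in the toy data, so NA
```

A lookup like this would only assign a code when the outward code is unambiguous, which keeps the approach conservative; whether that level of inference is acceptable would still need to be agreed with Lewis.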

Please let me know if you have any questions.