This is the second tutorial of my R administrative records data cleaning series. This first tutorial is available here. The raw data were downloaded from the United States Department of State–Bureau of Consular Affairs.

Figure 1

I’ll be working with monthly immigrant statistics between March 2017 and September 2022.

Figure 2

Except for being in .pdf, the titles from month to month are inconsistent. I need to be extra careful these inconsistencies.

Figure 3

Figure 4

Sys.time()

## [1] "2022-11-11 23:58:08 EST"

file_list <- list.files(path = getwd(), pattern = ".txt", full.names = F)
file_list <- file_list[grepl("iv_", file_list, ignore.case = F)]

Create a List

nlist <- vector("list", length(file_list))
names(nlist) <- substr(file_list, 1, 9) # name each component of in the list as the .txt files to be imported. It is important to drop the .txt suffix
head(names(nlist)) # take a look at the names of the first several list components

## [1] "iv_012018" "iv_012019" "iv_012020" "iv_012021" "iv_012022" "iv_022018"

CLEANING: Remove Non-observation Rows

This step involves some non-programming tasks. After the list has been filled with the monthly immigrant visa issuances data, I need to see the content of the non-observations within each list component

for (n in 1: length(nlist)) {
  nlist[n] <- read.csv(file_list[n], sep = ";")
  nlist[[n]] <- (nlist[[n]])[!grepl("page ", nlist[[n]], ignore.case = T)]
  nlist[[n]] <- (nlist[[n]])[!grepl("Place of Birth", nlist[[n]], ignore.case = T)]# keyword updated after first round of cleaning
  nlist[[n]] <- (nlist[[n]])[!grepl("Immigrant Visa Issuances", nlist[[n]], ignore.case = T)]
  nlist[[n]] <- (nlist[[n]])[!grepl("Foreign State of Chargeability", nlist[[n]], ignore.case = T)]
  nlist[[n]] <- (nlist[[n]])[!grepl("FY 20", nlist[[n]], ignore.case = T)]
  nlist[[n]] <- (nlist[[n]])[!grepl("FY20", nlist[[n]], ignore.case = T)] # keyword added after first round of cleaning
  nlist[[n]] <- (nlist[[n]])[!grepl("*FORMERLY KNOWN", nlist[[n]], ignore.case = T)]# keyword added after first round of cleaning
  nlist[[n]] <- (nlist[[n]])[!grepl("grand total", nlist[[n]], ignore.case = T)]
  #nlist[[n]] <- gsub(",", "", nlist[[n]]) # This will also remove the commas in the country names. This causes a manageable amount of manual editions later for a few country names.
  nlist[[n]] <- data.frame(nlist[[n]])
  names(nlist[[n]]) <- "V1" # assign the column name for the only column in the data frames
}
# Uncomment the following to view the full data frame
#View(nlist[[1]])
head(nlist[[1]])

##                  V1
## 1 Afghanistan CR1 3
## 2 Afghanistan DV1 2
## 3 Afghanistan DV2 1
## 4 Afghanistan DV3 1
## 5 Afghanistan F24 1
## 6 Afghanistan F25 1

tail(nlist[[1]])

##                   V1
## 2436  Zimbabwe F41 1
## 2437  Zimbabwe F42 1
## 2438  Zimbabwe FX1 1
## 2439  Zimbabwe FX3 1
## 2440  Zimbabwe IR3 1
## 2441 Zimbabwe IR5 10

List to Dataframe

Because the structure of each component in the list is the same, it is relatively easy to combine the rows into one data frame. Once the list components are combined, extract the time information as new columns.

Sys.time()

## [1] "2022-11-11 23:58:08 EST"

df <- do.call(what = "rbind", lapply(nlist, as.data.frame))
tail(df) # the time information is stored in row names

##                            V1
## iv_122021.1486  Zimbabwe E1 1
## iv_122021.1487  Zimbabwe E3 1
## iv_122021.1488  Zimbabwe F4 1
## iv_122021.1489  Zimbabwe FX 1
## iv_122021.1490 Zimbabwe IR1 3
## iv_122021.1491 Zimbabwe IR2 1

# retain the useful time information from row names
df$year <- as.numeric(substr(rownames(df), 6, 9))
df$month <- as.numeric(substr(rownames(df), 4, 5))
rownames(df) <- seq_len(nrow(df)) # re-assign the row names from 1 to nrow()
Sys.time()

## [1] "2022-11-11 23:58:08 EST"

Extract Country Names, Visa Class, and Issuances

This is a challenging task. The challenge comes from the uneven length of country names. The following is an illustration of the various lengths and expressions of country names.

df[c(1, 326, 463, 525, 551), 1]

## [1] "Afghanistan CR1 3"                       
## [2] "Cabo Verde F33 3"                        
## [3] "China - mainland born T53 1"             
## [4] "Congo, Democratic Republic of the FX2 10"
## [5] "Cote d'Ivoire CR1 5"

Observe the shared patterns across the rows. In this data frame, the content after the last space is the issuance number. The content between the second to last and the last space is the visa category. The content before the second to last space is the country name.

strings <- scan(text = df$V1, what = " ") 
head(strings)

## [1] "Afghanistan" "CR1"         "3"           "Afghanistan" "DV1"        
## [6] "2"

length(strings) # 432875, which row each text string used to belong was not preserved

## [1] 423409

strings <- strsplit(df$V1, split = " ", fixed = TRUE)
length(strings) # 128264, which row each text string used to belong was preserved. length(strings) = nrow(df)

## [1] 126622

head(strings)

## [[1]]
## [1] "Afghanistan" "CR1"         "3"          
## 
## [[2]]
## [1] "Afghanistan" "DV1"         "2"          
## 
## [[3]]
## [1] "Afghanistan" "DV2"         "1"          
## 
## [[4]]
## [1] "Afghanistan" "DV3"         "1"          
## 
## [[5]]
## [1] "Afghanistan" "F24"         "1"          
## 
## [[6]]
## [1] "Afghanistan" "F25"         "1"

# Because strsplit() preserves the original row information, I choose strsplit() 
issuances <- NA
visa <- NA
nationality <- vector("list", length(strings)) # length(strings) = nrow(df)
nationality_conc <- NA # create this column to save the concatenated strings for country names

Write a loop to handle the text extraction tasks

Sys.time()

## [1] "2022-11-11 23:58:09 EST"

for (i in 1:length(strings)) {
  visa[i] <- (strings[[i]])[length(strings[[i]])-1] # content after the second to last space is the visa class
  issuances[i] <- (strings[[i]])[length(strings[[i]])] # content after the last space is the visa issuance number
  nationality[[i]] <- (strings[[i]])[1: (length(strings[[i]])-2)] # everthing before the second to last space is the country name
  # concatenate strings using collapse = " " not sep = " "
  nationality_conc[i] <- paste(nationality[[i]], collapse = " ")
}
Sys.time()

## [1] "2022-11-11 23:58:09 EST"

nationality_conc <- toupper(nationality_conc)
df$nationality <- nationality_conc 
df$visa <- visa
issuances <- gsub(",", "", issuances)
df$issuances <- as.numeric(issuances)
df <- df[,names(df)!="V1"] # drop V1 after all useful information has been extracted
df <- df[order(df$year, df$month),] # ascending order first by year then by month

Quality Check

The following commands are strongly recommended to examine the resultant data frame. For country names and visa types, use table() to see the names and frequencies. If there’s anything strange, such as a non-country name appeared in the nationality column, then we may need to revise the keywords in the CLEANING step in the beginning. In most cases, I’d have a few findings that make me revise the keywords.

# A quick glance at the visa class tabulation. Note this is not the aggregated visa issuances.
head(table(df$visa))

## 
##  AM AM1 AM2 AM3 B21 B22 
##  17  27  27  12   3  14

head(table(df$nationality))

## 
## AFGHANISTAN     ALBANIA     ALGERIA     ANDORRA      ANGOLA    ANGUILLA 
##        1015        1105         700           1         227          32

summary(df$issuances)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    1.00    3.00   18.84   10.00 5009.00

# Run the following if there're NAs in the summary statistics of issuances
df[is.na(df$issuances),]

## [1] year        month       nationality visa        issuances  
## <0 rows> (or 0-length row.names)

The inconsistencies in some of the country names need further attention. Otherwise, the data frame is ready to be saved.

table(df$nationality[grepl("CHINA", df$nationality)])

## 
##                 CHINA CHINA - MAINLAND BORN   CHINA - TAIWAN BORN 
##                     7                  1932                  1231 
##   CHINA-MAINLAND BORN     CHINA-TAIWAN BORN 
##                    46                    27

table(df$nationality[grepl("SWAZILAND|ESWATINI", df$nationality)]) # Swaziland changed to Eswatini in 2018

## 
##  ESWATINI ESWATINI* SWAZILAND 
##        19         2         9

# I will address the country names in a separate script. Now, I can save the data frame.
#write.csv(df, "iv.csv", row.names = F)

Figure 5: Final Product

Data Cleaning .pdf(.txt) to .csv

Tiangeng Lu

Created on 11/11/2022, last compiled on 11/11/2022

Create a List

CLEANING: Remove Non-observation Rows

List to Dataframe

Extract Country Names, Visa Class, and Issuances

Quality Check