This is the second tutorial of my R administrative records data cleaning series. This first tutorial is available here. The raw data were downloaded from the United States Department of State–Bureau of Consular Affairs.
Figure 1
I’ll be working with monthly immigrant statistics between March 2017 and September 2022.
Figure 2
Except for being in .pdf, the titles from month to month are inconsistent. I need to be extra careful these inconsistencies.
Figure 3
Figure 4
Sys.time()
## [1] "2022-11-11 23:58:08 EST"
file_list <- list.files(path = getwd(), pattern = ".txt", full.names = F)
file_list <- file_list[grepl("iv_", file_list, ignore.case = F)]
nlist <- vector("list", length(file_list))
names(nlist) <- substr(file_list, 1, 9) # name each component of in the list as the .txt files to be imported. It is important to drop the .txt suffix
head(names(nlist)) # take a look at the names of the first several list components
## [1] "iv_012018" "iv_012019" "iv_012020" "iv_012021" "iv_012022" "iv_022018"
This step involves some non-programming tasks. After the list has been filled with the monthly immigrant visa issuances data, I need to see the content of the non-observations within each list component
for (n in 1: length(nlist)) {
nlist[n] <- read.csv(file_list[n], sep = ";")
nlist[[n]] <- (nlist[[n]])[!grepl("page ", nlist[[n]], ignore.case = T)]
nlist[[n]] <- (nlist[[n]])[!grepl("Place of Birth", nlist[[n]], ignore.case = T)]# keyword updated after first round of cleaning
nlist[[n]] <- (nlist[[n]])[!grepl("Immigrant Visa Issuances", nlist[[n]], ignore.case = T)]
nlist[[n]] <- (nlist[[n]])[!grepl("Foreign State of Chargeability", nlist[[n]], ignore.case = T)]
nlist[[n]] <- (nlist[[n]])[!grepl("FY 20", nlist[[n]], ignore.case = T)]
nlist[[n]] <- (nlist[[n]])[!grepl("FY20", nlist[[n]], ignore.case = T)] # keyword added after first round of cleaning
nlist[[n]] <- (nlist[[n]])[!grepl("*FORMERLY KNOWN", nlist[[n]], ignore.case = T)]# keyword added after first round of cleaning
nlist[[n]] <- (nlist[[n]])[!grepl("grand total", nlist[[n]], ignore.case = T)]
#nlist[[n]] <- gsub(",", "", nlist[[n]]) # This will also remove the commas in the country names. This causes a manageable amount of manual editions later for a few country names.
nlist[[n]] <- data.frame(nlist[[n]])
names(nlist[[n]]) <- "V1" # assign the column name for the only column in the data frames
}
# Uncomment the following to view the full data frame
#View(nlist[[1]])
head(nlist[[1]])
## V1
## 1 Afghanistan CR1 3
## 2 Afghanistan DV1 2
## 3 Afghanistan DV2 1
## 4 Afghanistan DV3 1
## 5 Afghanistan F24 1
## 6 Afghanistan F25 1
tail(nlist[[1]])
## V1
## 2436 Zimbabwe F41 1
## 2437 Zimbabwe F42 1
## 2438 Zimbabwe FX1 1
## 2439 Zimbabwe FX3 1
## 2440 Zimbabwe IR3 1
## 2441 Zimbabwe IR5 10
Because the structure of each component in the list is the same, it is relatively easy to combine the rows into one data frame. Once the list components are combined, extract the time information as new columns.
Sys.time()
## [1] "2022-11-11 23:58:08 EST"
df <- do.call(what = "rbind", lapply(nlist, as.data.frame))
tail(df) # the time information is stored in row names
## V1
## iv_122021.1486 Zimbabwe E1 1
## iv_122021.1487 Zimbabwe E3 1
## iv_122021.1488 Zimbabwe F4 1
## iv_122021.1489 Zimbabwe FX 1
## iv_122021.1490 Zimbabwe IR1 3
## iv_122021.1491 Zimbabwe IR2 1
# retain the useful time information from row names
df$year <- as.numeric(substr(rownames(df), 6, 9))
df$month <- as.numeric(substr(rownames(df), 4, 5))
rownames(df) <- seq_len(nrow(df)) # re-assign the row names from 1 to nrow()
Sys.time()
## [1] "2022-11-11 23:58:08 EST"
This is a challenging task. The challenge comes from the uneven length of country names. The following is an illustration of the various lengths and expressions of country names.
df[c(1, 326, 463, 525, 551), 1]
## [1] "Afghanistan CR1 3"
## [2] "Cabo Verde F33 3"
## [3] "China - mainland born T53 1"
## [4] "Congo, Democratic Republic of the FX2 10"
## [5] "Cote d'Ivoire CR1 5"
Observe the shared patterns across the rows. In this data frame, the content after the last space is the issuance number. The content between the second to last and the last space is the visa category. The content before the second to last space is the country name.
strings <- scan(text = df$V1, what = " ")
head(strings)
## [1] "Afghanistan" "CR1" "3" "Afghanistan" "DV1"
## [6] "2"
length(strings) # 432875, which row each text string used to belong was not preserved
## [1] 423409
strings <- strsplit(df$V1, split = " ", fixed = TRUE)
length(strings) # 128264, which row each text string used to belong was preserved. length(strings) = nrow(df)
## [1] 126622
head(strings)
## [[1]]
## [1] "Afghanistan" "CR1" "3"
##
## [[2]]
## [1] "Afghanistan" "DV1" "2"
##
## [[3]]
## [1] "Afghanistan" "DV2" "1"
##
## [[4]]
## [1] "Afghanistan" "DV3" "1"
##
## [[5]]
## [1] "Afghanistan" "F24" "1"
##
## [[6]]
## [1] "Afghanistan" "F25" "1"
# Because strsplit() preserves the original row information, I choose strsplit()
issuances <- NA
visa <- NA
nationality <- vector("list", length(strings)) # length(strings) = nrow(df)
nationality_conc <- NA # create this column to save the concatenated strings for country names
Write a loop to handle the text extraction tasks
Sys.time()
## [1] "2022-11-11 23:58:09 EST"
for (i in 1:length(strings)) {
visa[i] <- (strings[[i]])[length(strings[[i]])-1] # content after the second to last space is the visa class
issuances[i] <- (strings[[i]])[length(strings[[i]])] # content after the last space is the visa issuance number
nationality[[i]] <- (strings[[i]])[1: (length(strings[[i]])-2)] # everthing before the second to last space is the country name
# concatenate strings using collapse = " " not sep = " "
nationality_conc[i] <- paste(nationality[[i]], collapse = " ")
}
Sys.time()
## [1] "2022-11-11 23:58:09 EST"
nationality_conc <- toupper(nationality_conc)
df$nationality <- nationality_conc
df$visa <- visa
issuances <- gsub(",", "", issuances)
df$issuances <- as.numeric(issuances)
df <- df[,names(df)!="V1"] # drop V1 after all useful information has been extracted
df <- df[order(df$year, df$month),] # ascending order first by year then by month
The following commands are strongly recommended to examine the
resultant data frame. For country names and visa types, use
table()
to see the names and frequencies. If there’s
anything strange, such as a non-country name appeared in the nationality
column, then we may need to revise the keywords in the
CLEANING
step in the beginning. In most cases, I’d have a
few findings that make me revise the keywords.
# A quick glance at the visa class tabulation. Note this is not the aggregated visa issuances.
head(table(df$visa))
##
## AM AM1 AM2 AM3 B21 B22
## 17 27 27 12 3 14
head(table(df$nationality))
##
## AFGHANISTAN ALBANIA ALGERIA ANDORRA ANGOLA ANGUILLA
## 1015 1105 700 1 227 32
summary(df$issuances)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 1.00 3.00 18.84 10.00 5009.00
# Run the following if there're NAs in the summary statistics of issuances
df[is.na(df$issuances),]
## [1] year month nationality visa issuances
## <0 rows> (or 0-length row.names)
The inconsistencies in some of the country names need further attention. Otherwise, the data frame is ready to be saved.
table(df$nationality[grepl("CHINA", df$nationality)])
##
## CHINA CHINA - MAINLAND BORN CHINA - TAIWAN BORN
## 7 1932 1231
## CHINA-MAINLAND BORN CHINA-TAIWAN BORN
## 46 27
table(df$nationality[grepl("SWAZILAND|ESWATINI", df$nationality)]) # Swaziland changed to Eswatini in 2018
##
## ESWATINI ESWATINI* SWAZILAND
## 19 2 9
# I will address the country names in a separate script. Now, I can save the data frame.
#write.csv(df, "iv.csv", row.names = F)
Figure 5: Final Product