A handy function for reading CSV files

I have a very large CSV file (>1,000,000KB), which wouldn’t read in using the basic functions

DIR = "G:/Work_folders/Limitations on achieving Target 12/Data/"
aid <- read.csv(paste(DIR, "AidDataCore_ResearchRelease_Level1_v3.0/AidDataCoreFull_ResearchRelease_Level1_v3.0.csv", sep=""), header=T)

## Warning in scan(file = file, what = what, sep = sep, quote = quote, dec =
## dec, : EOF within quoted string

This is not marked as an error message, but it turns out that R has stopped reading the CSV.

dim(aid)

## [1] 9960   68

The file should in fact be 1,561,039 lines long.

Online help suggested that to avoid the warning, I should disable quoting.

aid <- read.csv(paste(DIR, "AidDataCore_ResearchRelease_Level1_v3.0/AidDataCoreFull_ResearchRelease_Level1_v3.0.csv", sep=""), header=T, quote="")

## Error in read.table(file = file, header = header, sep = sep, quote = quote, : more columns than column names

At this point, it seems likely that there are commas in some of the columns that are not contained within quotes. The file is too big to open in excel in order to remove commas, and after trying to read a few different chunks of the file in, I was fairly sure the problem occurred in multiple places.

aid <- read.csv(paste(DIR, "AidDataCore_ResearchRelease_Level1_v3.0/AidDataCoreFull_ResearchRelease_Level1_v3.0.csv", sep=""), header=T, quote="", skip=100, nrows=1000)

## Error in read.table(file = file, header = header, sep = sep, quote = quote, : more columns than column names

aid <- read.csv(paste(DIR, "AidDataCore_ResearchRelease_Level1_v3.0/AidDataCoreFull_ResearchRelease_Level1_v3.0.csv", sep=""), header=T, quote="", skip=1000, nrows=10000)

## Error in read.table(file = file, header = header, sep = sep, quote = quote, : more columns than column names

So I posted the problem on stackoverflow, and the recommendation was to load tidyverse and the readr library in order to use a more sophisticated function for reading in CSVs

library(tidyverse)

## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr

## Conflicts with tidy packages ----------------------------------------------

## filter(): dplyr, stats
## lag():    dplyr, stats

library(readr)

This function gives a huge amount of information on the content of the file as it is read in….

aid <- readr::read_csv(paste(DIR, "AidDataCore_ResearchRelease_Level1_v3.0/AidDataCoreFull_ResearchRelease_Level1_v3.0.csv", sep=""),
                       col_names=T)

## Warning: Duplicated column names deduplicated: 'finance_t' =>
## 'finance_t_1' [55]

## Parsed with column specification:
## cols(
##   .default = col_character(),
##   aiddata_id = col_integer(),
##   aiddata_2_id = col_integer(),
##   year = col_integer(),
##   crs_bi_multi = col_double(),
##   aiddata_sector_code = col_double(),
##   aiddata_purpose_code = col_double(),
##   crs_sector_code = col_double(),
##   crs_purpose_code = col_double(),
##   coalesced_purpose_code = col_double(),
##   commitment_amount_usd_constant = col_integer(),
##   total_project_cost = col_double(),
##   crs_trade = col_double(),
##   crs_climate = col_double(),
##   crs_biodiversity = col_double(),
##   crs_gender = col_double(),
##   crs_environment = col_double(),
##   pdgg = col_double(),
##   channel_code = col_double(),
##   associated_financing = col_double(),
##   future_ds_principal_amount_usd_nominal = col_double()
##   # ... with 16 more columns
## )

## See spec(...) for full column specifications.

## Warning in rbind(names(probs), probs_f): number of columns of result is not
## a multiple of vector length (arg 1)

## Warning: 373137 parsing failures.
## row # A tibble: 5 x 5 col     row                            col   expected     actual expected   <int>                          <chr>      <chr>      <chr> actual 1 25755 commitment_amount_usd_constant an integer 2570705706 file 2 65370 commitment_amount_usd_constant an integer 2925287772 row 3 66899 commitment_amount_usd_constant an integer 2153599288 col 4 66905 commitment_amount_usd_constant an integer 2153599288 expected 5 67036 commitment_amount_usd_constant an integer 3994911186 actual # ... with 1 more variables: file <chr>
## ... ................. ... ............................................................ ........ ............................................................ ...... ............................................................ .... ............................................................ ... ............................................................ ... ............................................................ ........ ............................................................ ...... .......................................
## See problems(...) for more details.

It also allows you to inspect all the columns when the file has been read in:

spec(aid)

## cols(
##   aiddata_id = col_integer(),
##   aiddata_2_id = col_integer(),
##   year = col_integer(),
##   donor = col_character(),
##   donor_iso = col_character(),
##   donor_region = col_character(),
##   implementing_agency = col_character(),
##   financing_agency = col_character(),
##   crs_bi_multi = col_double(),
##   recipient = col_character(),
##   recipient_iso = col_character(),
##   recipient_region = col_character(),
##   start_date = col_character(),
##   end_date = col_character(),
##   commitment_date = col_character(),
##   planned_start_date = col_character(),
##   planned_end_date = col_character(),
##   title = col_character(),
##   short_description = col_character(),
##   long_description = col_character(),
##   donor_project_id = col_character(),
##   donor_secondary_id = col_character(),
##   aiddata_sector_code = col_double(),
##   aiddata_sector_name = col_character(),
##   aiddata_purpose_code = col_double(),
##   aiddata_purpose_name = col_character(),
##   aiddata_activity_codes = col_character(),
##   aiddata_activity_names = col_character(),
##   flow_name = col_character(),
##   crs_sector_code = col_double(),
##   crs_sector_name = col_character(),
##   crs_purpose_code = col_double(),
##   crs_purpose_name = col_character(),
##   coalesced_purpose_code = col_double(),
##   coalesced_purpose_name = col_character(),
##   commitment_amount_usd_constant = col_integer(),
##   total_project_cost = col_double(),
##   crs_trade = col_double(),
##   crs_climate = col_double(),
##   crs_biodiversity = col_double(),
##   crs_gender = col_double(),
##   crs_environment = col_double(),
##   crs_desertification = col_character(),
##   pdgg = col_double(),
##   channel_code = col_double(),
##   finance_t = col_character(),
##   associated_financing = col_double(),
##   future_ds_principal_amount_usd_nominal = col_double(),
##   future_ds_interest_amount_usd_nominal = col_double(),
##   received_amount_usd_nominal = col_double(),
##   irtc_amount_usd_nominal = col_double(),
##   untied_amount_usd_nominal = col_double(),
##   tied_amount_usd_nominal = col_double(),
##   partial_tied_amount_usd_nominal = col_double(),
##   finance_t_1 = col_character(),
##   arrears_interest_amount_usd_nominal = col_double(),
##   arrears_principal_amount_usd_nominal = col_double(),
##   initial_report = col_double(),
##   ftc = col_double(),
##   repay_type = col_double(),
##   outstanding_amount_usd_nominal = col_double(),
##   interest_amount_usd_nominal = col_double(),
##   expert_commitment_amount_usd_nominal = col_integer(),
##   export_credit_amount_usd_nominal = col_double(),
##   expert_extended_amount_usd_nominal = col_integer(),
##   additional_info = col_character(),
##   source = col_character(),
##   source_detail = col_character()
## )

problems(aid)

## # A tibble: 373,137 x 5
##       row                            col   expected       actual
##     <int>                          <chr>      <chr>        <chr>
##  1  25755 commitment_amount_usd_constant an integer   2570705706
##  2  65370 commitment_amount_usd_constant an integer   2925287772
##  3  66899 commitment_amount_usd_constant an integer   2153599288
##  4  66905 commitment_amount_usd_constant an integer   2153599288
##  5  67036 commitment_amount_usd_constant an integer   3994911186
##  6 105240                     aiddata_id an integer 906000131240
##  7 105241                     aiddata_id an integer 906000131243
##  8 105242                     aiddata_id an integer 906000131261
##  9 105243                     aiddata_id an integer 906000131264
## 10 105244                     aiddata_id an integer 906000131267
## # ... with 373,127 more rows, and 1 more variables: file <chr>

Solving the problem involved telling R what type of data to expect in each column. By working on one issue at a time, I was able to add in just a small amount of code to tell R what the contents of each column were, and finally read in the CSV in with no problems.

aid <- readr::read_csv(paste(DIR, "AidDataCore_ResearchRelease_Level1_v3.0/AidDataCoreFull_ResearchRelease_Level1_v3.0.csv", sep=""),
                       col_names=T, col_types=cols(aiddata_id="d", aiddata_2_id="d", commitment_amount_usd_constant="n",
                                                   expert_commitment_amount_usd_nominal="n", expert_extended_amount_usd_nominal="n"))

## Warning: Duplicated column names deduplicated: 'finance_t' =>
## 'finance_t_1' [55]

The only issue now flagged is that there are two columns with the same name, which R has handily fixed for me anyway. And the data are now usable!

dim(aid)

## [1] 1561039      68

A handy function for reading CSV files

Louise Mair

6th December 2017