I have a very large CSV file (>1,000,000KB), which wouldn’t read in using the basic functions
DIR = "G:/Work_folders/Limitations on achieving Target 12/Data/"
aid <- read.csv(paste(DIR, "AidDataCore_ResearchRelease_Level1_v3.0/AidDataCoreFull_ResearchRelease_Level1_v3.0.csv", sep=""), header=T)
## Warning in scan(file = file, what = what, sep = sep, quote = quote, dec =
## dec, : EOF within quoted string
This is not marked as an error message, but it turns out that R has stopped reading the CSV.
dim(aid)
## [1] 9960 68
The file should in fact be 1,561,039 lines long.
Online help suggested that to avoid the warning, I should disable quoting.
aid <- read.csv(paste(DIR, "AidDataCore_ResearchRelease_Level1_v3.0/AidDataCoreFull_ResearchRelease_Level1_v3.0.csv", sep=""), header=T, quote="")
## Error in read.table(file = file, header = header, sep = sep, quote = quote, : more columns than column names
At this point, it seems likely that there are commas in some of the columns that are not contained within quotes. The file is too big to open in excel in order to remove commas, and after trying to read a few different chunks of the file in, I was fairly sure the problem occurred in multiple places.
aid <- read.csv(paste(DIR, "AidDataCore_ResearchRelease_Level1_v3.0/AidDataCoreFull_ResearchRelease_Level1_v3.0.csv", sep=""), header=T, quote="", skip=100, nrows=1000)
## Error in read.table(file = file, header = header, sep = sep, quote = quote, : more columns than column names
aid <- read.csv(paste(DIR, "AidDataCore_ResearchRelease_Level1_v3.0/AidDataCoreFull_ResearchRelease_Level1_v3.0.csv", sep=""), header=T, quote="", skip=1000, nrows=10000)
## Error in read.table(file = file, header = header, sep = sep, quote = quote, : more columns than column names
So I posted the problem on stackoverflow, and the recommendation was to load tidyverse and the readr library in order to use a more sophisticated function for reading in CSVs
library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag(): dplyr, stats
library(readr)
This function gives a huge amount of information on the content of the file as it is read in….
aid <- readr::read_csv(paste(DIR, "AidDataCore_ResearchRelease_Level1_v3.0/AidDataCoreFull_ResearchRelease_Level1_v3.0.csv", sep=""),
col_names=T)
## Warning: Duplicated column names deduplicated: 'finance_t' =>
## 'finance_t_1' [55]
## Parsed with column specification:
## cols(
## .default = col_character(),
## aiddata_id = col_integer(),
## aiddata_2_id = col_integer(),
## year = col_integer(),
## crs_bi_multi = col_double(),
## aiddata_sector_code = col_double(),
## aiddata_purpose_code = col_double(),
## crs_sector_code = col_double(),
## crs_purpose_code = col_double(),
## coalesced_purpose_code = col_double(),
## commitment_amount_usd_constant = col_integer(),
## total_project_cost = col_double(),
## crs_trade = col_double(),
## crs_climate = col_double(),
## crs_biodiversity = col_double(),
## crs_gender = col_double(),
## crs_environment = col_double(),
## pdgg = col_double(),
## channel_code = col_double(),
## associated_financing = col_double(),
## future_ds_principal_amount_usd_nominal = col_double()
## # ... with 16 more columns
## )
## See spec(...) for full column specifications.
## Warning in rbind(names(probs), probs_f): number of columns of result is not
## a multiple of vector length (arg 1)
## Warning: 373137 parsing failures.
## row # A tibble: 5 x 5 col row col expected actual expected <int> <chr> <chr> <chr> actual 1 25755 commitment_amount_usd_constant an integer 2570705706 file 2 65370 commitment_amount_usd_constant an integer 2925287772 row 3 66899 commitment_amount_usd_constant an integer 2153599288 col 4 66905 commitment_amount_usd_constant an integer 2153599288 expected 5 67036 commitment_amount_usd_constant an integer 3994911186 actual # ... with 1 more variables: file <chr>
## ... ................. ... ............................................................ ........ ............................................................ ...... ............................................................ .... ............................................................ ... ............................................................ ... ............................................................ ........ ............................................................ ...... .......................................
## See problems(...) for more details.
It also allows you to inspect all the columns when the file has been read in:
spec(aid)
## cols(
## aiddata_id = col_integer(),
## aiddata_2_id = col_integer(),
## year = col_integer(),
## donor = col_character(),
## donor_iso = col_character(),
## donor_region = col_character(),
## implementing_agency = col_character(),
## financing_agency = col_character(),
## crs_bi_multi = col_double(),
## recipient = col_character(),
## recipient_iso = col_character(),
## recipient_region = col_character(),
## start_date = col_character(),
## end_date = col_character(),
## commitment_date = col_character(),
## planned_start_date = col_character(),
## planned_end_date = col_character(),
## title = col_character(),
## short_description = col_character(),
## long_description = col_character(),
## donor_project_id = col_character(),
## donor_secondary_id = col_character(),
## aiddata_sector_code = col_double(),
## aiddata_sector_name = col_character(),
## aiddata_purpose_code = col_double(),
## aiddata_purpose_name = col_character(),
## aiddata_activity_codes = col_character(),
## aiddata_activity_names = col_character(),
## flow_name = col_character(),
## crs_sector_code = col_double(),
## crs_sector_name = col_character(),
## crs_purpose_code = col_double(),
## crs_purpose_name = col_character(),
## coalesced_purpose_code = col_double(),
## coalesced_purpose_name = col_character(),
## commitment_amount_usd_constant = col_integer(),
## total_project_cost = col_double(),
## crs_trade = col_double(),
## crs_climate = col_double(),
## crs_biodiversity = col_double(),
## crs_gender = col_double(),
## crs_environment = col_double(),
## crs_desertification = col_character(),
## pdgg = col_double(),
## channel_code = col_double(),
## finance_t = col_character(),
## associated_financing = col_double(),
## future_ds_principal_amount_usd_nominal = col_double(),
## future_ds_interest_amount_usd_nominal = col_double(),
## received_amount_usd_nominal = col_double(),
## irtc_amount_usd_nominal = col_double(),
## untied_amount_usd_nominal = col_double(),
## tied_amount_usd_nominal = col_double(),
## partial_tied_amount_usd_nominal = col_double(),
## finance_t_1 = col_character(),
## arrears_interest_amount_usd_nominal = col_double(),
## arrears_principal_amount_usd_nominal = col_double(),
## initial_report = col_double(),
## ftc = col_double(),
## repay_type = col_double(),
## outstanding_amount_usd_nominal = col_double(),
## interest_amount_usd_nominal = col_double(),
## expert_commitment_amount_usd_nominal = col_integer(),
## export_credit_amount_usd_nominal = col_double(),
## expert_extended_amount_usd_nominal = col_integer(),
## additional_info = col_character(),
## source = col_character(),
## source_detail = col_character()
## )
problems(aid)
## # A tibble: 373,137 x 5
## row col expected actual
## <int> <chr> <chr> <chr>
## 1 25755 commitment_amount_usd_constant an integer 2570705706
## 2 65370 commitment_amount_usd_constant an integer 2925287772
## 3 66899 commitment_amount_usd_constant an integer 2153599288
## 4 66905 commitment_amount_usd_constant an integer 2153599288
## 5 67036 commitment_amount_usd_constant an integer 3994911186
## 6 105240 aiddata_id an integer 906000131240
## 7 105241 aiddata_id an integer 906000131243
## 8 105242 aiddata_id an integer 906000131261
## 9 105243 aiddata_id an integer 906000131264
## 10 105244 aiddata_id an integer 906000131267
## # ... with 373,127 more rows, and 1 more variables: file <chr>
Solving the problem involved telling R what type of data to expect in each column. By working on one issue at a time, I was able to add in just a small amount of code to tell R what the contents of each column were, and finally read in the CSV in with no problems.
aid <- readr::read_csv(paste(DIR, "AidDataCore_ResearchRelease_Level1_v3.0/AidDataCoreFull_ResearchRelease_Level1_v3.0.csv", sep=""),
col_names=T, col_types=cols(aiddata_id="d", aiddata_2_id="d", commitment_amount_usd_constant="n",
expert_commitment_amount_usd_nominal="n", expert_extended_amount_usd_nominal="n"))
## Warning: Duplicated column names deduplicated: 'finance_t' =>
## 'finance_t_1' [55]
The only issue now flagged is that there are two columns with the same name, which R has handily fixed for me anyway. And the data are now usable!
dim(aid)
## [1] 1561039 68