An R Learner’s Diary: Data Import

Data Frames: Basics

A data frame is a rectangular collection of data with the variables in the columns and the observations in the rows (with the first row showing the variables’ names). When R does its data analysis, the data is usually from a data frame. In most cases, data frames are created when data are imported into R from some external source, such as an Excel or CSV file. But let’s start with an internally created data frame

Creating a data frame of fake data

year <- 1990:2000     # Creates a vector of 11 years
population <- rnorm(mean = 500, sd = 10, n = 11)    # Creates fake population data from a normal distribution with mean 500 and standard deviation 10 for 11 years
pop <- data.frame(year, population)   # Combines the fake data into a data frame
class(pop)      # "data.frame", as expected

## [1] "data.frame"

str(object = pop)     # More info about pop, including number of variables and observations, types of the variables

## 'data.frame':    11 obs. of  2 variables:
##  $ year      : int  1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 ...
##  $ population: num  509 489 496 512 509 ...

pop   # Displays the data frame as a table

##    year population
## 1  1990   509.3787
## 2  1991   489.4361
## 3  1992   495.7954
## 4  1993   512.2718
## 5  1994   508.5369
## 6  1995   494.9487
## 7  1996   508.2640
## 8  1997   479.2874
## 9  1998   517.6816
## 10 1999   496.3785
## 11 2000   488.9805

head(x = pop, n = 3)    # Displays the first three rows

##   year population
## 1 1990   509.3787
## 2 1991   489.4361
## 3 1992   495.7954

tail(x = pop, n = 3)    # Displays the last three rows

##    year population
## 9  1998   517.6816
## 10 1999   496.3785
## 11 2000   488.9805

Creating a `tibble` of data

A tibble is a data frame that is presented in a different way. We could make a tibble out of fake data.

# First, the `tidyverse` package will need to be loaded.
library(tidyverse)
pop.tibble <- tibble(year, population)
class(pop.tibble)      # "tbl_df"     "tbl"        "data.frame"

## [1] "tbl_df"     "tbl"        "data.frame"

str(object = pop.tibble)

## tibble [11 x 2] (S3: tbl_df/tbl/data.frame)
##  $ year      : int [1:11] 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 ...
##  $ population: num [1:11] 509 489 496 512 509 ...

pop.tibble   # Displays the tibble

## # A tibble: 11 x 2
##     year population
##    <int>      <dbl>
##  1  1990       509.
##  2  1991       489.
##  3  1992       496.
##  4  1993       512.
##  5  1994       509.
##  6  1995       495.
##  7  1996       508.
##  8  1997       479.
##  9  1998       518.
## 10  1999       496.
## 11  2000       489.

We will see below how to import data from files and store it in a tibble.

Doing some work with a data frame

plot(pop)     # Scatter plot of the data

popmodel <- lm(formula = population ~ year, data = pop)     # Runs a regression of population on year
summary(popmodel)     # Summary of the results of the regression

## 
## Call:
## lm(formula = population ~ year, data = pop)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19.4397  -7.0192  -0.9886   8.3132  19.6345 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1856.770   2338.066   0.794    0.448
## year          -0.680      1.172  -0.580    0.576
## 
## Residual standard error: 12.29 on 9 degrees of freedom
## Multiple R-squared:  0.03606,    Adjusted R-squared:  -0.07104 
## F-statistic: 0.3367 on 1 and 9 DF,  p-value: 0.576

Needless to say, this is a teeny-tiny part of what R can do with data. And in most cases the data are “imported” into R from an external file.

Importing data in an external file into a data frame

In a recent project, I downloaded a large number of Excel files from a web site and prepared them for use in data analysis. This was my first such project, and it was a slow, frustrating, but highly educational. The discussion below tries to describe the things I learned.

I will be using the readr package, which is part of the tidyverse package.

setwd("C:/Users/uroy/Documents/R/learning")

I read the data in a .csv file into R using read_csv():

(pop <- read_csv(file = "temp/1.csv", na = c("-", ".")))

## # A tibble: 36 x 9
##    Region             Region_Ab `1951` `1961` `1971` `1981` `1991` `2001` `2011`
##    <chr>              <chr>      <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
##  1 Andaman and Nicob~ AN            31     64    115    189    281    356    381
##  2 Andhra Pradesh     AP         31115  35983  43503  53551  66508  76210  84581
##  3 Arunachal Pradesh  AR            NA    337    468    632    865   1098   1384
##  4 Assam              AS          8029  10837  14625  18041  22414  26656  31206
##  5 Bihar              BR         29085  34841  42126  52303  64531  82999 104099
##  6 Chandigarh         CH            24    120    257    452    642    901   1055
##  7 Chhattisgarh       CT          7457   9154  11637  14010  17615  20834  25545
##  8 Dadra and Nagar H~ DN            42     58     74    104    138    220    344
##  9 Daman and Diu      DD            49     37     63     79    102    158    243
## 10 Delhi              DL          1744   2659   4066   6220   9421  13851  16788
## # ... with 26 more rows

The resulting R object is a data frame or tibble named pop.

Note that it is important to tell read_csv() which symbols are used in the data source file to represent data that are “not available” or NA.

Also, the skip = 3 option in read_csv would tell R to skip the first three lines.

Instead of specifying the name of my data file in the read_csv() command above, I could have used my computer’s clipboard. I could have opened the “temp/1.csv” file in Microsoft Excel, used the mouse to select the data in the file and then pressed Ctrl+V to copy the selected data to the clipboard. Then, back in R, I could have run the command above but with file = "clipboard" instead of file = "temp/1.csv".

See this for further details on data import.

For instructions on how to import data from non-CSV files, see this. The foreign package can also be used: see this.

Tidying up the data

In this case, my data turned out to be not tidy data. The data on a single variable was spread across multiple columns, as in table4a and table 4b here. So, I used the pivot_longer() command to make the data in pop tidy.

(pop <- pop %>% pivot_longer(-(Region:Region_Ab), names_to = "year", values_to = "pop"))

## # A tibble: 252 x 4
##    Region                      Region_Ab year    pop
##    <chr>                       <chr>     <chr> <dbl>
##  1 Andaman and Nicobar Islands AN        1951     31
##  2 Andaman and Nicobar Islands AN        1961     64
##  3 Andaman and Nicobar Islands AN        1971    115
##  4 Andaman and Nicobar Islands AN        1981    189
##  5 Andaman and Nicobar Islands AN        1991    281
##  6 Andaman and Nicobar Islands AN        2001    356
##  7 Andaman and Nicobar Islands AN        2011    381
##  8 Andhra Pradesh              AP        1951  31115
##  9 Andhra Pradesh              AP        1961  35983
## 10 Andhra Pradesh              AP        1971  43503
## # ... with 242 more rows

All variables/columns other than those listed between Region and Region_Ab are merged. The variable names/column headings of these gathered variables/columns become the data for a new variable called year. And the data in the gathered variables/columns form the data in a new variable called pop.

For some of the data source files, however, the data was tidy to begin with and the pivot_longer() command was not needed.

liabs_NCDC <- read_csv("temp/129_NCDC.csv", na = c("-", "."))

For more on the pivot_longer() and, its twin, pivot_wider() functions that are often used to tidy the data, see this.

Parsing the data into the appropriate data type

Although I now had the data in tidy form – that is, with each variable in one column and each observation in one row – I was faced by inappropriately parsed data. In pop, the “year” variable is classified as character data, and I want it to be integer data. This conversion was done with the following parsing commands:

pop$year <- parse_integer(pop$year)

For more on parsing, see this.

Two other problems loom. They are discussed below.

Renaming a variable in a data frame

Now, I knew that I would eventually need to merge all my data frames – such as pop and liabs_NCDC – into one giant data frame. I also knew that I would have to merge these data objects by “Region”, “Region_Ab”, and the variable that represented the calendar year. Unfortunately, while pop had a variable named year, liabs_NCDC spelled the corresponding variable as Year:

# The names of the variables in a data frame or tibble can be obtained thus:
names(pop)

## [1] "Region"    "Region_Ab" "year"      "pop"

names(liabs_NCDC)

## [1] "Region"     "Year"       "liabs_NCDC"

That would simply not do, given that R is case sensitive. So, the following fix was applied:

names(liabs_NCDC) <- c("Region", "year", "liabs_NCDC")

Note that Year has now become year, as needed.

A more elegant solution may be the following command, especially when the data frame contains a large number of variables:

names(liabs_NCDC)[which(names(liabs_NCDC) == "Year")] = "year"

Here, names(liabs_NCDC) == "Year" is a vector of the logical type with FALSE at all the place indexes in names(liabs_NCDC) for which the condition is not satisfied and TRUE otherwise. Then which(names(liabs_NCDC) == "Year") is a vector of the place indexes in names(liabs_NCDC) == "Year" that show TRUE. Finally, names(liabs_NCDC)[which(names(liabs_NCDC) == "Year")] subsets the variable name that needed changing.

Correcting Character Data

One variable in all my data source files was “Region”, which represented the various Indian states for which data was collected on various economically significant matters, such as population or “pop”. It so happened that the state names were inconsistently spelled. So, the following corrections were necessary:

liabs_NCDC$Region[which( liabs_NCDC$Region == "ALL STATES")] = "INDIA"
liabs_NCDC$Region[which( liabs_NCDC$Region == "Uttaranchal")] = "Uttarakhand"
liabs_NCDC$Region[which( liabs_NCDC$Region == "NCT Delhi")] = "Delhi"
liabs_NCDC$Region[which( liabs_NCDC$Region == "Telengana")] = "Telangana"

Belatedly adding a missing variable

Note from the names() command above that the data frame pop has a variable named “Region_Ab” denoting abbreviated region names whereas liabs_NCDC does not. This would be a problem when the eventual merging of pop and liabs_NCDC would be done. So, I used a data frame called region_ab, which contained the “Region” and “Region_Ab” variables that pop has, and used the left_join(liabs_NCDC, Region_Ab) command to add the necessary “Region_Ab” variable to liabs_NCDC.

region_ab <- read_csv("temp/region_ab.csv")
(liabs_NCDC <- liabs_NCDC %>% left_join(region_ab))

## # A tibble: 827 x 4
##    Region             year liabs_NCDC Region_Ab
##    <chr>             <dbl>      <dbl> <chr>    
##  1 Andhra Pradesh     1991       0.34 AP       
##  2 Arunachal Pradesh  1991      NA    AR       
##  3 Assam              1991       0.36 AS       
##  4 Bihar              1991       0.18 BR       
##  5 Chhattisgarh       1991      NA    CT       
##  6 Goa                1991      NA    GA       
##  7 Gujarat            1991       0.21 GJ       
##  8 Haryana            1991       0.19 HR       
##  9 Himachal Pradesh   1991       0.15 HP       
## 10 Jammu and Kashmir  1991      NA    JK       
## # ... with 817 more rows

For more on the left_join() function, see this. It adds a new variable/column to liabs_NCDC without adding new observations/rows. The added column comes from region_ab. The left_join() function knows which variable liabs_NCDC and region_ab have in common, and uses that common variable to line things up.

On the various ways of joining data frames/tables/tibbles, see this.

Assuming that pop has all “Region” observations that are in liabs_NCDC, here’s another way to create region_ab:

region_ab <- pop %>% group_by(Region) %>% summarise(Region_Ab = Region_Ab[1])

Unfortunately, this would not have worked in this case because the Indian state of Telangana exists in liabs_NCDC but not in pop. Check:

unique(liabs_NCDC$Region)[which(!(unique(liabs_NCDC$Region) %in% unique(pop$Region)))]

## [1] "Telangana"

A major source of frustration

I had to run the above commands again and again for nearly 275 data files! I felt that there had to be a way to make this repetitive process automatic, perhaps by means of a function. I imagined that I could generate a list of all the data files in the temp subdirectory and tell the function to apply the clean-up commands to all files in the list seriatim. Unfortunately, I just could not make it work. So, I had to run all the clean-up commands separately for all 275 or so files in the temp subdirectory! Massive frustration!

Merging multiple data frames

Now, finally, pop and liabs_NCDC can be merged, and I do so using the full_join() command.

(alldata <- list(pop,liabs_NCDC) %>% reduce(full_join))

## # A tibble: 988 x 5
##    Region                      Region_Ab  year   pop liabs_NCDC
##    <chr>                       <chr>     <dbl> <dbl>      <dbl>
##  1 Andaman and Nicobar Islands AN         1951    31         NA
##  2 Andaman and Nicobar Islands AN         1961    64         NA
##  3 Andaman and Nicobar Islands AN         1971   115         NA
##  4 Andaman and Nicobar Islands AN         1981   189         NA
##  5 Andaman and Nicobar Islands AN         1991   281         NA
##  6 Andaman and Nicobar Islands AN         2001   356         NA
##  7 Andaman and Nicobar Islands AN         2011   381         NA
##  8 Andhra Pradesh              AP         1951 31115         NA
##  9 Andhra Pradesh              AP         1961 35983         NA
## 10 Andhra Pradesh              AP         1971 43503         NA
## # ... with 978 more rows

For more on the merge() and reduce() functions, see this.

Now, I am on my way. In the above command, only two data frames have been merged. But there is nothing stopping me from inserting 200 data frame names in list().

A final frustration

I could indeed write the names of 200 data frames in the list() command above, and I did. But I kept wishing for a better way. I imagined doing this: use a command like my_list_of_files_to_merge <- ls() to create an object containing the names of all data frames in the current global environment, and then use the command alldata <- list(my_list_of_files_to_merge) %>% reduce(full_join) to complete the project. Although this sounded reasonable in theory, I just could not make it work.

What is the capital of Japan?

: London

: Paris

: Tokyo

An R Learner’s Diary: Data Import

Udayan Roy

2020-10-25

Data Frames: Basics

Creating a data frame of fake data

Creating a `tibble` of data

Doing some work with a data frame

Importing data in an external file into a data frame

Tidying up the data

Parsing the data into the appropriate data type

Renaming a variable in a data frame

Correcting Character Data

Belatedly adding a missing variable

A major source of frustration

Merging multiple data frames

A final frustration

What is the capital of Japan?

An R Learner’s Diary: Data Import

Udayan Roy

2020-10-25

Data Frames: Basics

Creating a data frame of fake data

Creating a tibble of data

Doing some work with a data frame

Importing data in an external file into a data frame

Tidying up the data

Parsing the data into the appropriate data type

Renaming a variable in a data frame

Correcting Character Data

Belatedly adding a missing variable

A major source of frustration

Merging multiple data frames

A final frustration

What is the capital of Japan?

Creating a `tibble` of data