A data frame is a rectangular collection of data with the variables in the columns and the observations in the rows (with the first row showing the variables’ names). When R does its data analysis, the data is usually from a data frame. In most cases, data frames are created when data are imported into R from some external source, such as an Excel or CSV file. But let’s start with an internally created data frame
year <- 1990:2000 # Creates a vector of 11 years
population <- rnorm(mean = 500, sd = 10, n = 11) # Creates fake population data from a normal distribution with mean 500 and standard deviation 10 for 11 years
pop <- data.frame(year, population) # Combines the fake data into a data frame
class(pop) # "data.frame", as expected
## [1] "data.frame"
str(object = pop) # More info about pop, including number of variables and observations, types of the variables
## 'data.frame': 11 obs. of 2 variables:
## $ year : int 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 ...
## $ population: num 509 489 496 512 509 ...
pop # Displays the data frame as a table
## year population
## 1 1990 509.3787
## 2 1991 489.4361
## 3 1992 495.7954
## 4 1993 512.2718
## 5 1994 508.5369
## 6 1995 494.9487
## 7 1996 508.2640
## 8 1997 479.2874
## 9 1998 517.6816
## 10 1999 496.3785
## 11 2000 488.9805
head(x = pop, n = 3) # Displays the first three rows
## year population
## 1 1990 509.3787
## 2 1991 489.4361
## 3 1992 495.7954
tail(x = pop, n = 3) # Displays the last three rows
## year population
## 9 1998 517.6816
## 10 1999 496.3785
## 11 2000 488.9805
tibble of dataA tibble is a data frame that is presented in a different way. We could make a tibble out of fake data.
# First, the `tidyverse` package will need to be loaded.
library(tidyverse)
pop.tibble <- tibble(year, population)
class(pop.tibble) # "tbl_df" "tbl" "data.frame"
## [1] "tbl_df" "tbl" "data.frame"
str(object = pop.tibble)
## tibble [11 x 2] (S3: tbl_df/tbl/data.frame)
## $ year : int [1:11] 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 ...
## $ population: num [1:11] 509 489 496 512 509 ...
pop.tibble # Displays the tibble
## # A tibble: 11 x 2
## year population
## <int> <dbl>
## 1 1990 509.
## 2 1991 489.
## 3 1992 496.
## 4 1993 512.
## 5 1994 509.
## 6 1995 495.
## 7 1996 508.
## 8 1997 479.
## 9 1998 518.
## 10 1999 496.
## 11 2000 489.
We will see below how to import data from files and store it in a tibble.
plot(pop) # Scatter plot of the data
popmodel <- lm(formula = population ~ year, data = pop) # Runs a regression of population on year
summary(popmodel) # Summary of the results of the regression
##
## Call:
## lm(formula = population ~ year, data = pop)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.4397 -7.0192 -0.9886 8.3132 19.6345
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1856.770 2338.066 0.794 0.448
## year -0.680 1.172 -0.580 0.576
##
## Residual standard error: 12.29 on 9 degrees of freedom
## Multiple R-squared: 0.03606, Adjusted R-squared: -0.07104
## F-statistic: 0.3367 on 1 and 9 DF, p-value: 0.576
Needless to say, this is a teeny-tiny part of what R can do with data. And in most cases the data are “imported” into R from an external file.
In a recent project, I downloaded a large number of Excel files from a web site and prepared them for use in data analysis. This was my first such project, and it was a slow, frustrating, but highly educational. The discussion below tries to describe the things I learned.
I will be using the readr package, which is part of the tidyverse package.
setwd("C:/Users/uroy/Documents/R/learning")
I read the data in a .csv file into R using read_csv():
(pop <- read_csv(file = "temp/1.csv", na = c("-", ".")))
## # A tibble: 36 x 9
## Region Region_Ab `1951` `1961` `1971` `1981` `1991` `2001` `2011`
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Andaman and Nicob~ AN 31 64 115 189 281 356 381
## 2 Andhra Pradesh AP 31115 35983 43503 53551 66508 76210 84581
## 3 Arunachal Pradesh AR NA 337 468 632 865 1098 1384
## 4 Assam AS 8029 10837 14625 18041 22414 26656 31206
## 5 Bihar BR 29085 34841 42126 52303 64531 82999 104099
## 6 Chandigarh CH 24 120 257 452 642 901 1055
## 7 Chhattisgarh CT 7457 9154 11637 14010 17615 20834 25545
## 8 Dadra and Nagar H~ DN 42 58 74 104 138 220 344
## 9 Daman and Diu DD 49 37 63 79 102 158 243
## 10 Delhi DL 1744 2659 4066 6220 9421 13851 16788
## # ... with 26 more rows
The resulting R object is a data frame or tibble named pop.
Note that it is important to tell read_csv() which symbols are used in the data source file to represent data that are “not available” or NA.
Also, the skip = 3 option in read_csv would tell R to skip the first three lines.
Instead of specifying the name of my data file in the read_csv() command above, I could have used my computer’s clipboard. I could have opened the “temp/1.csv” file in Microsoft Excel, used the mouse to select the data in the file and then pressed Ctrl+V to copy the selected data to the clipboard. Then, back in R, I could have run the command above but with file = "clipboard" instead of file = "temp/1.csv".
See this for further details on data import.
For instructions on how to import data from non-CSV files, see this. The foreign package can also be used: see this.
In this case, my data turned out to be not tidy data. The data on a single variable was spread across multiple columns, as in table4a and table 4b here. So, I used the pivot_longer() command to make the data in pop tidy.
(pop <- pop %>% pivot_longer(-(Region:Region_Ab), names_to = "year", values_to = "pop"))
## # A tibble: 252 x 4
## Region Region_Ab year pop
## <chr> <chr> <chr> <dbl>
## 1 Andaman and Nicobar Islands AN 1951 31
## 2 Andaman and Nicobar Islands AN 1961 64
## 3 Andaman and Nicobar Islands AN 1971 115
## 4 Andaman and Nicobar Islands AN 1981 189
## 5 Andaman and Nicobar Islands AN 1991 281
## 6 Andaman and Nicobar Islands AN 2001 356
## 7 Andaman and Nicobar Islands AN 2011 381
## 8 Andhra Pradesh AP 1951 31115
## 9 Andhra Pradesh AP 1961 35983
## 10 Andhra Pradesh AP 1971 43503
## # ... with 242 more rows
All variables/columns other than those listed between Region and Region_Ab are merged. The variable names/column headings of these gathered variables/columns become the data for a new variable called year. And the data in the gathered variables/columns form the data in a new variable called pop.
For some of the data source files, however, the data was tidy to begin with and the pivot_longer() command was not needed.
liabs_NCDC <- read_csv("temp/129_NCDC.csv", na = c("-", "."))
For more on the pivot_longer() and, its twin, pivot_wider() functions that are often used to tidy the data, see this.
Although I now had the data in tidy form – that is, with each variable in one column and each observation in one row – I was faced by inappropriately parsed data. In pop, the “year” variable is classified as character data, and I want it to be integer data. This conversion was done with the following parsing commands:
pop$year <- parse_integer(pop$year)
For more on parsing, see this.
Two other problems loom. They are discussed below.
Now, I knew that I would eventually need to merge all my data frames – such as pop and liabs_NCDC – into one giant data frame. I also knew that I would have to merge these data objects by “Region”, “Region_Ab”, and the variable that represented the calendar year. Unfortunately, while pop had a variable named year, liabs_NCDC spelled the corresponding variable as Year:
# The names of the variables in a data frame or tibble can be obtained thus:
names(pop)
## [1] "Region" "Region_Ab" "year" "pop"
names(liabs_NCDC)
## [1] "Region" "Year" "liabs_NCDC"
That would simply not do, given that R is case sensitive. So, the following fix was applied:
names(liabs_NCDC) <- c("Region", "year", "liabs_NCDC")
Note that Year has now become year, as needed.
A more elegant solution may be the following command, especially when the data frame contains a large number of variables:
names(liabs_NCDC)[which(names(liabs_NCDC) == "Year")] = "year"
Here, names(liabs_NCDC) == "Year" is a vector of the logical type with FALSE at all the place indexes in names(liabs_NCDC) for which the condition is not satisfied and TRUE otherwise. Then which(names(liabs_NCDC) == "Year") is a vector of the place indexes in names(liabs_NCDC) == "Year" that show TRUE. Finally, names(liabs_NCDC)[which(names(liabs_NCDC) == "Year")] subsets the variable name that needed changing.
One variable in all my data source files was “Region”, which represented the various Indian states for which data was collected on various economically significant matters, such as population or “pop”. It so happened that the state names were inconsistently spelled. So, the following corrections were necessary:
liabs_NCDC$Region[which( liabs_NCDC$Region == "ALL STATES")] = "INDIA"
liabs_NCDC$Region[which( liabs_NCDC$Region == "Uttaranchal")] = "Uttarakhand"
liabs_NCDC$Region[which( liabs_NCDC$Region == "NCT Delhi")] = "Delhi"
liabs_NCDC$Region[which( liabs_NCDC$Region == "Telengana")] = "Telangana"
Note from the names() command above that the data frame pop has a variable named “Region_Ab” denoting abbreviated region names whereas liabs_NCDC does not. This would be a problem when the eventual merging of pop and liabs_NCDC would be done. So, I used a data frame called region_ab, which contained the “Region” and “Region_Ab” variables that pop has, and used the left_join(liabs_NCDC, Region_Ab) command to add the necessary “Region_Ab” variable to liabs_NCDC.
region_ab <- read_csv("temp/region_ab.csv")
(liabs_NCDC <- liabs_NCDC %>% left_join(region_ab))
## # A tibble: 827 x 4
## Region year liabs_NCDC Region_Ab
## <chr> <dbl> <dbl> <chr>
## 1 Andhra Pradesh 1991 0.34 AP
## 2 Arunachal Pradesh 1991 NA AR
## 3 Assam 1991 0.36 AS
## 4 Bihar 1991 0.18 BR
## 5 Chhattisgarh 1991 NA CT
## 6 Goa 1991 NA GA
## 7 Gujarat 1991 0.21 GJ
## 8 Haryana 1991 0.19 HR
## 9 Himachal Pradesh 1991 0.15 HP
## 10 Jammu and Kashmir 1991 NA JK
## # ... with 817 more rows
For more on the left_join() function, see this. It adds a new variable/column to liabs_NCDC without adding new observations/rows. The added column comes from region_ab. The left_join() function knows which variable liabs_NCDC and region_ab have in common, and uses that common variable to line things up.
On the various ways of joining data frames/tables/tibbles, see this.
Assuming that pop has all “Region” observations that are in liabs_NCDC, here’s another way to create region_ab:
region_ab <- pop %>% group_by(Region) %>% summarise(Region_Ab = Region_Ab[1])
Unfortunately, this would not have worked in this case because the Indian state of Telangana exists in liabs_NCDC but not in pop. Check:
unique(liabs_NCDC$Region)[which(!(unique(liabs_NCDC$Region) %in% unique(pop$Region)))]
## [1] "Telangana"
I had to run the above commands again and again for nearly 275 data files! I felt that there had to be a way to make this repetitive process automatic, perhaps by means of a function. I imagined that I could generate a list of all the data files in the temp subdirectory and tell the function to apply the clean-up commands to all files in the list seriatim. Unfortunately, I just could not make it work. So, I had to run all the clean-up commands separately for all 275 or so files in the temp subdirectory! Massive frustration!
Now, finally, pop and liabs_NCDC can be merged, and I do so using the full_join() command.
(alldata <- list(pop,liabs_NCDC) %>% reduce(full_join))
## # A tibble: 988 x 5
## Region Region_Ab year pop liabs_NCDC
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 Andaman and Nicobar Islands AN 1951 31 NA
## 2 Andaman and Nicobar Islands AN 1961 64 NA
## 3 Andaman and Nicobar Islands AN 1971 115 NA
## 4 Andaman and Nicobar Islands AN 1981 189 NA
## 5 Andaman and Nicobar Islands AN 1991 281 NA
## 6 Andaman and Nicobar Islands AN 2001 356 NA
## 7 Andaman and Nicobar Islands AN 2011 381 NA
## 8 Andhra Pradesh AP 1951 31115 NA
## 9 Andhra Pradesh AP 1961 35983 NA
## 10 Andhra Pradesh AP 1971 43503 NA
## # ... with 978 more rows
For more on the merge() and reduce() functions, see this.
Now, I am on my way. In the above command, only two data frames have been merged. But there is nothing stopping me from inserting 200 data frame names in list().
I could indeed write the names of 200 data frames in the list() command above, and I did. But I kept wishing for a better way. I imagined doing this: use a command like my_list_of_files_to_merge <- ls() to create an object containing the names of all data frames in the current global environment, and then use the command alldata <- list(my_list_of_files_to_merge) %>% reduce(full_join) to complete the project. Although this sounded reasonable in theory, I just could not make it work.
: London
: Paris
: Tokyo