This document covers:

  1. How to import data
  2. How to verify data imports
  3. How to use subsetting tools we’ve already seen to manipulate that data

Importing Data

Importing data into R is easy. Most of the time.

To begin this exercise, please download Sample Data.csv from the course Moodle site. Note where you download it.

Using Packages

Almost all the time, you will use the foreign package. This is just a package of functions that allow you to read datasets in a variety of file formats into R. Since the foreign package is pre-installed in R, we do not have to download and install it, as we would for other packages (and will, later on).

Packages have to be loaded into R’s memory. (Back in the day, in the 1980s, that’s how almost all computers worked—I’m not that old, but my kindergarten had Commodore 64s.) This requires the library() command, which we use very simply:

library(foreign)

Normally, in this class, I will provide datasets as .csv, Comma-Separated Values, or as .dta, Stata data files. You don’t have to know much about these right now; right now, all you need to know is that, if the data file ends with .csv, you will use the command read.csv(). If it ends with .dta, you will use the command read.dta(). Generally speaking, most data ‘in the wild’ is also available in one of those two formats. If it isn’t (as with Excel files, .xls or .xlsx), you can use Excel or another spreadsheet program to convert them to .csv format.

We then have to declare our “working directory”, which is the path to the folder on your machine where the file you want to importlives. (Obviously, this is my path—you will have to modify the code appropriately.) RStudio has a help file specifically on this.

setwd("/Users/paulmusgrave/Dropbox/0005 Teaching/2016 03 UMASS Energy/Handouts")

Note the setwd command, which sets the working directory. This is the folder into which you’ve saved your Sample Data.csv file. You can set the working directory manually (under “Misc” in the default R program on Mac or under “Session” in RStudio), but it is a good idea for you to do this through your R code. I also recommend you create a directory for this class so that all of your course material (and data) lives in the same place.

Actually Importing Data

This is a lot of setup for a very easy command:

data <- read.csv("Sample Data.csv",stringsAsFactors = FALSE)

(Don’t worry about what the stringsAsFactors option is doing right now; for now, just know that you should include it in importing .csv files.)

For a .dta, you would use read.dta().

You should always then look at the data to see if it is what you expected:

head(data)
##                  Country GDPPCRd Energy.Use.Rd Section
## 1                Albania    7428           654       A
## 2                Bahamas   28460          2156       A
## 3 Bosnia and Herzegovina    7390          1602       A
## 4               Cambodia    1879           352       A
## 5                Comoros     990            61       A
## 6                Denmark   32018          3323       A

That’s a little ugly (I created this file in another program which treats column headers a little differently). Just to tidy things up, I’ll use the colnames() command to reset the column names:

colnames(data) <- c("Country","GDPPC","Energy","Set")

And now, check to make sure that the names have changed (and that we haven’t flipped anything!).

head(data)
##                  Country GDPPC Energy Set
## 1                Albania  7428    654   A
## 2                Bahamas 28460   2156   A
## 3 Bosnia and Herzegovina  7390   1602   A
## 4               Cambodia  1879    352   A
## 5                Comoros   990     61   A
## 6                Denmark 32018   3323   A

Viewing Our Data

A quick review of indexing: note that data[,1] and data$Country will return the same values; type these into your command line to verify:

data[,1]
data$Country

Note that we could have also used:

head(data[,1])
## [1] "Albania"                "Bahamas"               
## [3] "Bosnia and Herzegovina" "Cambodia"              
## [5] "Comoros"                "Denmark"
head(data$Country)
## [1] "Albania"                "Bahamas"               
## [3] "Bosnia and Herzegovina" "Cambodia"              
## [5] "Comoros"                "Denmark"

Or even

tail(data[,1])
## [1] "Eritrea"      "Gambia"       "Haiti"        "Ireland"     
## [5] "Korea, South" "Tanzania"
tail(data$Country)
## [1] "Eritrea"      "Gambia"       "Haiti"        "Ireland"     
## [5] "Korea, South" "Tanzania"

If you want to view the record for a given unit, you could do something like this:

data[5,]
##   Country GDPPC Energy Set
## 5 Comoros   990     61   A

which returns the fifth record in whatever order you’re using.

Using Subsetting Creatively

You could also do something more advanced but much more useful:

data[data$Country=="Denmark",]
##   Country GDPPC Energy Set
## 6 Denmark 32018   3323   A

which you should read from the inside out as instructing R to find the record(s) that match the condition that the Country field is equal to the text string “Denmark” and then return only those rows for which that statement is logically valid. Similarly, if we wanted to find only rich countries, we could use an arbitrary but sensible cutoff like:

data[data$GDPPC>=25000,]
##                  Country GDPPC Energy Set
## 2                Bahamas 28460   2156   A
## 6                Denmark 32018   3323   A
## 8                Germany 32176   3872   A
## 10                Israel 25251   2876   A
## 11                Kuwait 45627  11403   A
## 18           Switzerland 38319   3484   A
## 20         United States 41188   7058   A
## 30                 Italy 26730   2739   A
## 31            Luxembourg 67963   7939   A
## 46                Canada 34527   7434   B
## 49               Finland 30442   6228   B
## 51               Iceland 34100  16905   B
## 56                Norway 47175   5832   B
## 71        France (1963-) 29161   3917   C
## 72                Greece 25303   2609   C
## 74                 Japan 29626   3702   C
## 79                 Qatar 65894  14626   C
## 81                 Spain 27083   2777   C
## 108              Belgium 32400   5289   D
## 109               Brunei 45156   7972   D
## 111       Cyprus (1975-) 25790   2327   D
## 112    Equatorial Guinea 31519   2732   D
## 119          Netherlands 36521   4730   D
## 127               Sweden 32300   4884   D
## 128 United Arab Emirates 45203   8698   D
## 129            Australia 33989   5739   E
## 144            Singapore 46271   5510   E
## 147       United Kingdom 32470   3189   E
## 149              Austria 34705   3797   E
## 157              Ireland 36274   3230   E
## 158         Korea, South 25300   4660   E

And we could even use this sort of subsetting to create ranges of data that we could use in calculations:

mean(data[data$GDPPC>=25000,]$Energy)
## [1] 5536.677
mean(data[data$GDPPC<25000,]$Energy)
## [1] 1412.82