This document covers:
Importing data into R is easy. Most of the time.
To begin this exercise, please download Sample Data.csv from the course Moodle site. Note where you download it.
Almost all the time, you will use the foreign package. This is just a package of functions that allow you to read datasets in a variety of file formats into R. Since the foreign package is pre-installed in R, we do not have to download and install it, as we would for other packages (and will, later on).
Packages have to be loaded into R’s memory. (Back in the day, in the 1980s, that’s how almost all computers worked—I’m not that old, but my kindergarten had Commodore 64s.) This requires the library() command, which we use very simply:
library(foreign)
Normally, in this class, I will provide datasets as .csv, Comma-Separated Values, or as .dta, Stata data files. You don’t have to know much about these right now; right now, all you need to know is that, if the data file ends with .csv, you will use the command read.csv(). If it ends with .dta, you will use the command read.dta(). Generally speaking, most data ‘in the wild’ is also available in one of those two formats. If it isn’t (as with Excel files, .xls or .xlsx), you can use Excel or another spreadsheet program to convert them to .csv format.
We then have to declare our “working directory”, which is the path to the folder on your machine where the file you want to importlives. (Obviously, this is my path—you will have to modify the code appropriately.) RStudio has a help file specifically on this.
setwd("/Users/paulmusgrave/Dropbox/0005 Teaching/2016 03 UMASS Energy/Handouts")
Note the setwd command, which sets the working directory. This is the folder into which you’ve saved your Sample Data.csv file. You can set the working directory manually (under “Misc” in the default R program on Mac or under “Session” in RStudio), but it is a good idea for you to do this through your R code. I also recommend you create a directory for this class so that all of your course material (and data) lives in the same place.
This is a lot of setup for a very easy command:
data <- read.csv("Sample Data.csv",stringsAsFactors = FALSE)
(Don’t worry about what the stringsAsFactors option is doing right now; for now, just know that you should include it in importing .csv files.)
For a .dta, you would use read.dta().
You should always then look at the data to see if it is what you expected:
head(data)
## Country GDPPCRd Energy.Use.Rd Section
## 1 Albania 7428 654 A
## 2 Bahamas 28460 2156 A
## 3 Bosnia and Herzegovina 7390 1602 A
## 4 Cambodia 1879 352 A
## 5 Comoros 990 61 A
## 6 Denmark 32018 3323 A
That’s a little ugly (I created this file in another program which treats column headers a little differently). Just to tidy things up, I’ll use the colnames() command to reset the column names:
colnames(data) <- c("Country","GDPPC","Energy","Set")
And now, check to make sure that the names have changed (and that we haven’t flipped anything!).
head(data)
## Country GDPPC Energy Set
## 1 Albania 7428 654 A
## 2 Bahamas 28460 2156 A
## 3 Bosnia and Herzegovina 7390 1602 A
## 4 Cambodia 1879 352 A
## 5 Comoros 990 61 A
## 6 Denmark 32018 3323 A
A quick review of indexing: note that data[,1] and data$Country will return the same values; type these into your command line to verify:
data[,1]
data$Country
Note that we could have also used:
head(data[,1])
## [1] "Albania" "Bahamas"
## [3] "Bosnia and Herzegovina" "Cambodia"
## [5] "Comoros" "Denmark"
head(data$Country)
## [1] "Albania" "Bahamas"
## [3] "Bosnia and Herzegovina" "Cambodia"
## [5] "Comoros" "Denmark"
Or even
tail(data[,1])
## [1] "Eritrea" "Gambia" "Haiti" "Ireland"
## [5] "Korea, South" "Tanzania"
tail(data$Country)
## [1] "Eritrea" "Gambia" "Haiti" "Ireland"
## [5] "Korea, South" "Tanzania"
If you want to view the record for a given unit, you could do something like this:
data[5,]
## Country GDPPC Energy Set
## 5 Comoros 990 61 A
which returns the fifth record in whatever order you’re using.
You could also do something more advanced but much more useful:
data[data$Country=="Denmark",]
## Country GDPPC Energy Set
## 6 Denmark 32018 3323 A
which you should read from the inside out as instructing R to find the record(s) that match the condition that the Country field is equal to the text string “Denmark” and then return only those rows for which that statement is logically valid. Similarly, if we wanted to find only rich countries, we could use an arbitrary but sensible cutoff like:
data[data$GDPPC>=25000,]
## Country GDPPC Energy Set
## 2 Bahamas 28460 2156 A
## 6 Denmark 32018 3323 A
## 8 Germany 32176 3872 A
## 10 Israel 25251 2876 A
## 11 Kuwait 45627 11403 A
## 18 Switzerland 38319 3484 A
## 20 United States 41188 7058 A
## 30 Italy 26730 2739 A
## 31 Luxembourg 67963 7939 A
## 46 Canada 34527 7434 B
## 49 Finland 30442 6228 B
## 51 Iceland 34100 16905 B
## 56 Norway 47175 5832 B
## 71 France (1963-) 29161 3917 C
## 72 Greece 25303 2609 C
## 74 Japan 29626 3702 C
## 79 Qatar 65894 14626 C
## 81 Spain 27083 2777 C
## 108 Belgium 32400 5289 D
## 109 Brunei 45156 7972 D
## 111 Cyprus (1975-) 25790 2327 D
## 112 Equatorial Guinea 31519 2732 D
## 119 Netherlands 36521 4730 D
## 127 Sweden 32300 4884 D
## 128 United Arab Emirates 45203 8698 D
## 129 Australia 33989 5739 E
## 144 Singapore 46271 5510 E
## 147 United Kingdom 32470 3189 E
## 149 Austria 34705 3797 E
## 157 Ireland 36274 3230 E
## 158 Korea, South 25300 4660 E
And we could even use this sort of subsetting to create ranges of data that we could use in calculations:
mean(data[data$GDPPC>=25000,]$Energy)
## [1] 5536.677
mean(data[data$GDPPC<25000,]$Energy)
## [1] 1412.82