Large data objects will usually be read as values from external files rather than entered during an R session at the keyboard. R input facilities are simple, and their requirements are fairly strict and even rather inflexible. The designers of R clearly presume that you will be able to modify your input files using other tools, such as file editors or Perl, to fit in with the requirements of R. Generally this is very simple.
If variables are to be held mainly in data frames, as we strongly suggest they should be, an entire data frame can be read directly with the read.table() function. There is also a more primitive input function, scan(), that can be called directly.
For more details on importing data into R and also exporting data, see the R Data Import/Export manual.
7.1 The read.table() function
To read an entire data frame directly, the external file will normally have a special form.
- The first line of the file should have a name for each variable in the data frame.
- Each additional line of the file has as its first item a row label and the values for each variable.
If the file has one fewer item in its first line than in its second, this arrangement is presumed to be in force.
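By default numeric items (except row labels) are read as numeric variables. Non-numeric variables are read as character vectors in R 4.0.0 and later, and as factors in earlier versions. This can be changed if necessary, for example with the stringsAsFactors argument of read.table().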
The function read.table() can then be used to read the data frame directly
HousePrice <- read.table("houses.data")
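As a minimal sketch of the file layout described above, the following writes a small file to a temporary location and reads it back. The file contents, the column names, and the use of tempfile() are illustrative assumptions; they are not the actual contents of houses.data.

```r
# Hypothetical file in the form described above: the header line has one
# fewer item than the data lines, so the first column supplies row labels.
lines <- c(
  "Price Floor Rooms",
  "01 52.00 111.0 5",
  "02 54.75 128.0 5",
  "03 57.50 101.0 4"
)
path <- tempfile(fileext = ".data")  # temporary stand-in for houses.data
writeLines(lines, path)

# No header argument is needed: read.table() infers it from the layout.
HousePrice <- read.table(path)
print(HousePrice)
print(rownames(HousePrice))  # row labels taken from the first column
```

The row labels are kept as text rather than converted to numbers, which is why leading zeros in a label column of this kind survive the import.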
Often you will want to omit the row labels and use the default labels instead. In this case the file may omit the row label column.
The data frame may then be read as
HousePrice <- read.table("houses.data", header=TRUE)
where the header=TRUE option specifies that the first line is a line of headings, and hence, by implication from the form of the file, that no explicit row labels are given.
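A short sketch of this second layout may help. The file contents and column names below are made up for illustration. Because the default treatment of non-numeric columns has differed between R versions, the sketch also passes stringsAsFactors = TRUE explicitly in a second call to request factors.

```r
# Hypothetical file without a row label column: the header line and the
# data lines all have the same number of items.
lines <- c(
  "Price Floor Heat",
  "52.00 111.0 no",
  "54.75 128.0 no",
  "57.50 101.0 yes"
)
path <- tempfile(fileext = ".data")  # temporary stand-in for houses.data
writeLines(lines, path)

HousePrice <- read.table(path, header = TRUE)
print(rownames(HousePrice))  # default labels "1", "2", "3"
str(HousePrice)

# Ask for factors explicitly rather than relying on the version default.
HousePriceF <- read.table(path, header = TRUE, stringsAsFactors = TRUE)
print(levels(HousePriceF$Heat))
```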
7.2 The scan() function
Suppose the data vectors are of equal length and are to be read in parallel. Further suppose that there are three vectors, the first of mode character and the remaining two of mode numeric, and the file is input.dat. The first step is to use scan() to read in the three vectors as a list, as follows
inp <- scan("input.dat", list("",0,0))
The second argument is a dummy list structure that establishes the mode of the three vectors to be read. The result, held in inp, is a list whose components are the three vectors read in. To separate the data items into three separate vectors, use assignments like
label <- inp[[1]]; x <- inp[[2]]; y <- inp[[3]]
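The two steps above can be tried end to end with a small file. This is only a sketch: the three data lines and the temporary file standing in for input.dat are assumptions made for the example.

```r
# Hypothetical three-column file: one character field, two numeric fields.
path <- tempfile(fileext = ".dat")  # temporary stand-in for input.dat
writeLines(c("a 1.5 10", "b 2.5 20", "c 3.5 30"), path)

# The dummy list gives the mode of each field: character, numeric, numeric.
inp <- scan(path, list("", 0, 0))

# Pull the components out by position.
label <- inp[[1]]; x <- inp[[2]]; y <- inp[[3]]
print(label)
print(x)
print(y)
```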
More conveniently, the dummy list can have named components, in which case the names can be used to access the vectors read in. For example
inp <- scan("input.dat", list(id="", x=0, y=0))
If you wish to access the variables separately they may either be re-assigned to variables in the working frame:
label <- inp$id; x <- inp$x; y <- inp$y
or the list may be attached at position 2 of the search path.
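A hedged sketch of the second option follows, again with a made-up file in place of input.dat. The variable mean_x is introduced only for the example. Note that objects of the same name already in the workspace would take precedence over the attached components, so this is best tried in a fresh session.

```r
# Hypothetical input file with an id field and two numeric fields.
path <- tempfile(fileext = ".dat")  # temporary stand-in for input.dat
writeLines(c("a 1.5 10", "b 2.5 20"), path)

# Named components in the dummy list become names in the result.
inp <- scan(path, list(id = "", x = 0, y = 0), quiet = TRUE)
print(names(inp))

# Attach the list at position 2 so id, x and y are visible by name.
attach(inp, pos = 2)
mean_x <- mean(x)
print(mean_x)

# Detach again so the list does not linger on the search path.
detach(inp)
```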
If the second argument is a single value and not a list, a single vector is read in, all components of which must be of the same mode as the dummy value.
X <- matrix(scan("light.dat", 0),ncol=5,byrow=TRUE)
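To illustrate how a single numeric vector read by scan() is reshaped into a matrix, here is a small sketch. The ten values and the temporary file are assumptions made for the example and do not reflect the contents of the real data file.

```r
# Hypothetical file of ten numbers laid out as two rows of five.
path <- tempfile(fileext = ".dat")  # temporary stand-in for the data file
writeLines(c("1 2 3 4 5", "6 7 8 9 10"), path)

# A single dummy value (0) makes scan() return one numeric vector;
# byrow = TRUE fills the matrix in the order the values appear in the file.
X <- matrix(scan(path, 0), ncol = 5, byrow = TRUE)
print(X)
print(dim(X))
```

Without byrow = TRUE the matrix would be filled column by column, so the rows of X would no longer match the lines of the file.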
There are more elaborate input facilities available and these are detailed in the manuals.
7.3 Accessing builtin datasets
Around 100 datasets are supplied with R (in package datasets), and others are available in packages (including the recommended packages supplied with R). To see the list of datasets currently available use
data()
All the datasets supplied with R are available directly by name. However, many packages still use the obsolete convention in which the data() function was also used to load datasets into R, for example
data(infert)
and this can still be used with the standard packages (as in this example). In most cases this will load an R object of the same name. However, in a few cases it loads several objects, so see the on-line help for the object to see what to expect.
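Both ways of reaching a built-in dataset can be compared in a short sketch. It assumes only that the datasets package is attached, as it is in a default R session; the variable name loaded is introduced for the example.

```r
# The dataset is available directly by name, without calling data().
print(dim(infert))

# The older convention: data() copies the object into the workspace and
# invisibly returns the names of the datasets it was asked for.
loaded <- data(infert)
print(loaded)
print(exists("infert", envir = globalenv(), inherits = FALSE))
```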
7.3.1 Loading data from other R packages
To access data from a particular package, use the package argument, for example
data(package="rpart")
data(Puromycin, package="datasets")
If a package has been attached by library, its datasets are automatically included in the search.
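The two uses of the package argument can be sketched as follows. The example sticks to the datasets package so that it does not depend on any other package being installed; the names listing and items are introduced for illustration.

```r
# With only a package argument, data() lists that package's datasets
# without loading any of them. The result has a matrix component "results".
listing <- data(package = "datasets")
items <- listing$results[, "Item"]
print(head(items))

# With a dataset name as well, the named dataset is loaded from that package.
data(Puromycin, package = "datasets")
print(names(Puromycin))
```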
User-contributed packages can be a rich source of datasets.