1 Data structures

1.1 Vectors

1.1.1 Generating regular sequences

1.2 Matrices
1.3 Arrays
1.4 Data frames

1.4.1 Specifying elements of a data frame

1.5 Factors
1.6 Lists

2 Data input

2.1 Entering data from the keyboard
2.2 Importing data from delimited text files
2.3 Importing data from Excel
2.4 Importing data from MySQL

2.4.1 RODBC
2.4.2 Getting RODBC working
2.4.3 Using R package RMySQL

2.5 Accessing builtin datasets
2.6 Loading data from other R packages

3 Useful functions for working with data objects

4 References

1 Data structures

R has a wide variety of objects for holding data, including scalars, vectors, matrices, arrays, data frames, and lists.

R data structures

1.1 Vectors

Vectors are one-dimensional arrays that can hold numeric data, character data, or logical data. The combine function c() is used to form the vector. Here are examples of each type of vector:

a <- c(1, 2, 5, 3, 6, -2, 4)
a

## [1]  1  2  5  3  6 -2  4

b <- c("one", "two", "three")
b

## [1] "one"   "two"   "three"

c <- c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE)
c

## [1]  TRUE  TRUE  TRUE FALSE  TRUE FALSE

Here, a is numeric vector, b is a character vector, and c is a logical vector.

e <- c(1, 3, 5)
f <- c(2, 4, 6)
c(e, f)

## [1] 1 3 5 2 4 6

1.1.1 Generating regular sequences

The colon operator : used in the last statement is used to generate a sequence of numbers. For example

a <- c(1:6)
a

## [1] 1 2 3 4 5 6

a <- 1:6
a

## [1] 1 2 3 4 5 6

is equivalent to

a <- c(1, 2, 3, 4, 5, 6)
a

## [1] 1 2 3 4 5 6

a <- 6:1
a

## [1] 6 5 4 3 2 1

The function seq() is a more general facility for generating sequences. For example

seq(-5, 5, by = .2)

##  [1] -5.0 -4.8 -4.6 -4.4 -4.2 -4.0 -3.8 -3.6 -3.4 -3.2 -3.0 -2.8 -2.6 -2.4
## [15] -2.2 -2.0 -1.8 -1.6 -1.4 -1.2 -1.0 -0.8 -0.6 -0.4 -0.2  0.0  0.2  0.4
## [29]  0.6  0.8  1.0  1.2  1.4  1.6  1.8  2.0  2.2  2.4  2.6  2.8  3.0  3.2
## [43]  3.4  3.6  3.8  4.0  4.2  4.4  4.6  4.8  5.0

seq(from = -5, to = 5, length.out = 51)

##  [1] -5.0 -4.8 -4.6 -4.4 -4.2 -4.0 -3.8 -3.6 -3.4 -3.2 -3.0 -2.8 -2.6 -2.4
## [15] -2.2 -2.0 -1.8 -1.6 -1.4 -1.2 -1.0 -0.8 -0.6 -0.4 -0.2  0.0  0.2  0.4
## [29]  0.6  0.8  1.0  1.2  1.4  1.6  1.8  2.0  2.2  2.4  2.6  2.8  3.0  3.2
## [43]  3.4  3.6  3.8  4.0  4.2  4.4  4.6  4.8  5.0

seq.int(-5, 5, by = 2)

## [1] -5 -3 -1  1  3  5

seq(10)

##  [1]  1  2  3  4  5  6  7  8  9 10

seq(length.out= 10)

##  [1]  1  2  3  4  5  6  7  8  9 10

is equivalent to

1:10

##  [1]  1  2  3  4  5  6  7  8  9 10

A related function is rep() which can be used for replicating an object in various complicated ways. The simplest form is

 a1 <- rep(x, times = 5)

which will put five copies of x end-to-end in a1. Another useful version is

 a2 <- rep(x, each = 5)

which repeats each element of x five times before moving on to the next.

For example

rep(1:3, times = 5)

##  [1] 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3

rep(1:3, each = 5)

##  [1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3

You can refer to elements of a vector using a numeric vector of positions within brackets.

For example, a[c(2, 4)] refers to the 2th and 4th element of vector a.
Here are additional examples:

a <- c(1, 2, 5, 3, 5, -2, 5)
a[c(2, 4)]

## [1] 2 3

a[5]

## [1] 5

a[c(1, 3, 5)]

## [1] 1 5 5

a[2:6]

## [1]  2  5  3  5 -2

a[-1]

## [1]  2  5  3  5 -2  5

a[-c(1, 3, 5)]

## [1]  2  3 -2  5

which(a==5)

## [1] 3 5 7

1.2 Matrices

A matrix is a two-dimensional array where each element has the same mode (numeric, character, or logical). Matrices are created with the matrix function . The general format is

matrix(vector, nrow = number_of_rows, ncol = number_of_columns,
       byrow = logical_value, 
       dimnames = list(char_vector_rownames, char_vector_colnames))

where vector contains the elements for the matrix, nrow and ncol specify the row and column dimensions, and dimnames contains optional row and column labels stored in character vectors. The option byrow indicates whether the matrix should be filled in by row (byrow=TRUE) or by column (byrow=FALSE). The default is by column. The following listing demonstrates the matrix function .

 y <- matrix(1:20, nrow=5, ncol=4)
 y

##      [,1] [,2] [,3] [,4]
## [1,]    1    6   11   16
## [2,]    2    7   12   17
## [3,]    3    8   13   18
## [4,]    4    9   14   19
## [5,]    5   10   15   20

 cells <- c(1,26,24,68)
 rnames <- c("R1", "R2")
 cnames <- c("C1", "C2")
 matrix(cells, nrow=2, ncol=2, byrow=TRUE, dimnames=list(rnames, cnames))

##    C1 C2
## R1  1 26
## R2 24 68

 matrix(cells, nrow=2, ncol=2, byrow=FALSE, dimnames=list(rnames, cnames))

##    C1 C2
## R1  1 24
## R2 26 68

Another way to create a matrix is to give the appropriate values to the dim attribute. For examples

x <- 1:15
dim(x)

## NULL

dim(x) <- c(5, 3)
x

##      [,1] [,2] [,3]
## [1,]    1    6   11
## [2,]    2    7   12
## [3,]    3    8   13
## [4,]    4    9   14
## [5,]    5   10   15

x <- matrix(1:15,5,3)
x

##      [,1] [,2] [,3]
## [1,]    1    6   11
## [2,]    2    7   12
## [3,]    3    8   13
## [4,]    4    9   14
## [5,]    5   10   15

dim(x) <- c(3, 5)
x

##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    4    7   10   13
## [2,]    2    5    8   11   14
## [3,]    3    6    9   12   15

Random matrix

matrix(rnorm(24),6,4)

##            [,1]       [,2]       [,3]       [,4]
## [1,] -1.1132946  0.4269011  0.4083483 -1.1517921
## [2,] -1.7783563 -0.4838834  1.0144093  0.1171856
## [3,] -1.0865778 -0.6550947  0.7400840  1.1979374
## [4,]  1.5422933  0.2837415  1.6786151  1.1635374
## [5,]  1.7875962  0.2856816 -0.6179048 -0.6974258
## [6,]  0.7060166  0.2504176 -1.1966705 -0.3744002

Some special matrix

x <- 1:6
diag(x)

##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    1    0    0    0    0    0
## [2,]    0    2    0    0    0    0
## [3,]    0    0    3    0    0    0
## [4,]    0    0    0    4    0    0
## [5,]    0    0    0    0    5    0
## [6,]    0    0    0    0    0    6

x <- rep(1,6)
diag(x)

##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    1    0    0    0    0    0
## [2,]    0    1    0    0    0    0
## [3,]    0    0    1    0    0    0
## [4,]    0    0    0    1    0    0
## [5,]    0    0    0    0    1    0
## [6,]    0    0    0    0    0    1

x <- 1
diag(x, nrow=6)

##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    1    0    0    0    0    0
## [2,]    0    1    0    0    0    0
## [3,]    0    0    1    0    0    0
## [4,]    0    0    0    1    0    0
## [5,]    0    0    0    0    1    0
## [6,]    0    0    0    0    0    1

x <- matrix(1:36,6,6)
x

##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    1    7   13   19   25   31
## [2,]    2    8   14   20   26   32
## [3,]    3    9   15   21   27   33
## [4,]    4   10   16   22   28   34
## [5,]    5   11   17   23   29   35
## [6,]    6   12   18   24   30   36

diag(x)

## [1]  1  8 15 22 29 36

matrix(rep(0,20),4,5)

##      [,1] [,2] [,3] [,4] [,5]
## [1,]    0    0    0    0    0
## [2,]    0    0    0    0    0
## [3,]    0    0    0    0    0
## [4,]    0    0    0    0    0

matrix(0,4,5)

##      [,1] [,2] [,3] [,4] [,5]
## [1,]    0    0    0    0    0
## [2,]    0    0    0    0    0
## [3,]    0    0    0    0    0
## [4,]    0    0    0    0    0

matrix(1,4,5)

##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    1    1    1    1
## [2,]    1    1    1    1    1
## [3,]    1    1    1    1    1
## [4,]    1    1    1    1    1

Identify rows, columns, or elements of a matrix by using subscripts and brackets. X[i,] refers to the ith row of matrix X, X[,j] refers to jth column, and X[i, j] refers to the ijth element, respectively. The subscripts i and j can be numeric vectors in order to select multiple rows or columns, as shown in the following listing.

 x <- matrix(1:10, nrow=2)
 x

##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    3    5    7    9
## [2,]    2    4    6    8   10

 x[2,]

## [1]  2  4  6  8 10

 x[,2]

## [1] 3 4

 x[1,4]

## [1] 7

 x[1, c(4,5)]

## [1] 7 9

Indexing can also be used to suppress one or several rows or columns using negative values. For example, x[-1, ] will suppress the first row, while x[-c(1, 5), ] will do the same for the 1st and 15th rows. For example

x <- matrix(1:20, nrow=5)
x

##      [,1] [,2] [,3] [,4]
## [1,]    1    6   11   16
## [2,]    2    7   12   17
## [3,]    3    8   13   18
## [4,]    4    9   14   19
## [5,]    5   10   15   20

x[-c(1, 5), ]

##      [,1] [,2] [,3] [,4]
## [1,]    2    7   12   17
## [2,]    3    8   13   18
## [3,]    4    9   14   19

x[-1,]

##      [,1] [,2] [,3] [,4]
## [1,]    2    7   12   17
## [2,]    3    8   13   18
## [3,]    4    9   14   19
## [4,]    5   10   15   20

x[,-1]

##      [,1] [,2] [,3]
## [1,]    6   11   16
## [2,]    7   12   17
## [3,]    8   13   18
## [4,]    9   14   19
## [5,]   10   15   20

x[,-c(1, 3)]

##      [,1] [,2]
## [1,]    6   16
## [2,]    7   17
## [3,]    8   18
## [4,]    9   19
## [5,]   10   20

1.3 Arrays

Arrays are similar to matrices but can have more than two dimensions. They’re created with an array function of the following form:

array(vector, dimensions, dimnames)

where vector contains the data for the array, dimensions is a numeric vector giving the maximal index for each dimension, and dimnames is an optional list of dimension labels. For example

array(1:24, dim = c(3, 4, 2))

## , , 1
## 
##      [,1] [,2] [,3] [,4]
## [1,]    1    4    7   10
## [2,]    2    5    8   11
## [3,]    3    6    9   12
## 
## , , 2
## 
##      [,1] [,2] [,3] [,4]
## [1,]   13   16   19   22
## [2,]   14   17   20   23
## [3,]   15   18   21   24

dim1 <- c("A1", "A2")
dim2 <- c("B1", "B2", "B3")
dim3 <- c("C1", "C2", "C3", "C4")
z <- array(1:24, c(2, 3, 4), dimnames=list(dim1, dim2, dim3))
z

## , , C1
## 
##    B1 B2 B3
## A1  1  3  5
## A2  2  4  6
## 
## , , C2
## 
##    B1 B2 B3
## A1  7  9 11
## A2  8 10 12
## 
## , , C3
## 
##    B1 B2 B3
## A1 13 15 17
## A2 14 16 18
## 
## , , C4
## 
##    B1 B2 B3
## A1 19 21 23
## A2 20 22 24

z[,,1];z[,,2];z[,,3];z[,,4]

##    B1 B2 B3
## A1  1  3  5
## A2  2  4  6

##    B1 B2 B3
## A1  7  9 11
## A2  8 10 12

##    B1 B2 B3
## A1 13 15 17
## A2 14 16 18

##    B1 B2 B3
## A1 19 21 23
## A2 20 22 24

z[1,,]

##    C1 C2 C3 C4
## B1  1  7 13 19
## B2  3  9 15 21
## B3  5 11 17 23

z[,2,]

##    C1 C2 C3 C4
## A1  3  9 15 21
## A2  4 10 16 22

z[1,2,3]

## [1] 15

1.4 Data frames

A data frame is more general than a matrix in that different columns can contain different modes of data (numeric, character, etc.). It’s similar to the datasets you’d typically see in SAS, SPSS, and Stata. Data frames are the most common data structure you’ll deal with in R. A data frame is created with the data.frame() function:

mydata <- data.frame(col1, col2, col3,...)

where col1, col2, col3, … are column vectors of any type (such as character, numeric, or logical). Names for each column can be provided with the names function. The following listing makes this clear.

 patientID <- c(1, 2, 3, 4)
 age <- c(25, 34, 28, 52)
 diabetes <- c("Type1", "Type2", "Type1", "Type1")
 status <- c("Poor", "Improved", "Excellent", "Poor")
 patientdata <- data.frame(patientID, age, diabetes, status)
 patientdata

patientID	age	diabetes	status
1	25	Type1	Poor
2	34	Type2	Improved
3	28	Type1	Excellent
4	52	Type1	Poor

NOTE: Each column must have only one mode, but you can put columns of different modes together to form the data frame.

1.4.1 Specifying elements of a data frame

There are several ways to identify the elements of a data frame. You can use the subscript notation you used before (for example, with matrices) or you can specify column names. The following listing demonstrates these approaches.

 patientdata[1:2,]

patientID	age	diabetes	status
1	25	Type1	Poor
2	34	Type2	Improved

 patientdata[,1:2]

patientID	age
1	25
2	34
3	28
4	52

 patientdata[,-1]

age	diabetes	status
25	Type1	Poor
34	Type2	Improved
28	Type1	Excellent
52	Type1	Poor

 patientdata[,c("diabetes", "status")]

diabetes	status
Type1	Poor
Type2	Improved
Type1	Excellent
Type1	Poor

patientdata[,"age"]

## [1] 25 34 28 52

patientdata$age

## [1] 25 34 28 52

The $ is used to indicate a particular variable from a given data frame.

1.5 Factors

Variables can be described as nominal, ordinal, or continuous.
- Nominal variables are categorical, without an implied order. Diabetes (Type1, Type2) is an example of a nominal variable.
- Ordinal variables imply order but not amount. Status (poor, improved, excellent) is a good example of an ordinal variable.
- Continuous variables can take on any value within some range, and both order and amount are implied. For example, Age in years is a continuous variable.
Categorical (nominal) and ordered categorical (ordinal) variables in R are called factors.
The function factor() stores the categorical values as a vector of integers in the range [1... k] (where k is the number of unique values in the nominal variable), and an internal vector of character strings (the original values) mapped to these integers.

For example, assume that you have the vector

diabetes <- c("Type1", "Type2", "Type1", "Type1")
diabetes <- factor(diabetes)
diabetes

## [1] Type1 Type2 Type1 Type1
## Levels: Type1 Type2

The statement diabetes <- factor(diabetes) stores this vector as (1, 2, 1, 1) and associates it with 1=Type1 and 2=Type2 internally (the assignment is alphabetical). Any analyses performed on the vector diabetes will treat the variable as nominal and select the statistical methods appropriate for this level of measurement.

For vectors representing ordinal variables, the parameter ordered=TRUE is required in the function factor(). For example

status <- c("Poor", "Improved", "Excellent", "Poor")
status <- factor(status, ordered=TRUE)
status

## [1] Poor      Improved  Excellent Poor     
## Levels: Excellent < Improved < Poor

the statement status <- factor(status, ordered=TRUE) will encode the vector as (3, 2, 1, 3) and associate these values internally as 1=Excellent, 2=Improved, and 3=Poor. Additionally, any analyses performed on this vector will treat the variable as ordinal and select the statistical methods appropriately.

The following listing demonstrates how specifying factors and ordered factors impact data analyses.

 patientID <- c(1, 2, 3, 4)
 age <- c(25, 34, 28, 52)
 diabetes <- c("Type1", "Type2", "Type1", "Type1")
 status <- c("Poor", "Improved", "Excellent", "Poor")
 diabetes <- factor(diabetes) 
 status <- factor(status, order=TRUE)
 patientdata <- data.frame(patientID, age, diabetes, status)
 patientdata

patientID	age	diabetes	status
1	25	Type1	Poor
2	34	Type2	Improved
3	28	Type1	Excellent
4	52	Type1	Poor

 str(patientdata)

## 'data.frame':    4 obs. of  4 variables:
##  $ patientID: num  1 2 3 4
##  $ age      : num  25 34 28 52
##  $ diabetes : Factor w/ 2 levels "Type1","Type2": 1 2 1 1
##  $ status   : Ord.factor w/ 3 levels "Excellent"<"Improved"<..: 3 2 1 3

 summary(patientdata)

##    patientID         age         diabetes       status 
##  Min.   :1.00   Min.   :25.00   Type1:3   Excellent:1  
##  1st Qu.:1.75   1st Qu.:27.25   Type2:1   Improved :1  
##  Median :2.50   Median :31.00             Poor     :2  
##  Mean   :2.50   Mean   :34.75                          
##  3rd Qu.:3.25   3rd Qu.:38.50                          
##  Max.   :4.00   Max.   :52.00

NOTE: The function str(object) provides information on an object in R. The summary() function treats the variables differently. It provides the minimum, maximum, mean, and quartiles for the continuous variable age, and frequency counts for the categorical variables diabetes and status.

1.6 Lists

A list is an ordered collection of objects (components). A list allows you to gather a variety of (possibly unrelated) objects under one name. For example, a list may contain a combination of vectors, matrices, data frames, and even other lists. You create a list using the list() function:

mylist <- list(object1, object2, ...)

where the objects are any of the structures seen so far. Optionally, you can name the objects in a list:

mylist <- list(name1=object1, name2=object2,...)

The following listing shows an example.

 g <- "My First List"
 h <- c(25, 26, 18, 39)
 j <- matrix(1:10, nrow=5)
 k <- c("one", "two", "three")
 mylist <- list(title=g, ages=h, j, k)
 mylist

## $title
## [1] "My First List"
## 
## $ages
## [1] 25 26 18 39
## 
## [[3]]
##      [,1] [,2]
## [1,]    1    6
## [2,]    2    7
## [3,]    3    8
## [4,]    4    9
## [5,]    5   10
## 
## [[4]]
## [1] "one"   "two"   "three"

Elements of a list can be accessed by indicating a component number or a name within double brackets. For example:

mylist[[1]]

## [1] "My First List"

mylist[['title']]

## [1] "My First List"

mylist$title

## [1] "My First List"

mylist[[2]]

## [1] 25 26 18 39

mylist[['ages']]

## [1] 25 26 18 39

mylist$ages

## [1] 25 26 18 39

mylist[[3]]

##      [,1] [,2]
## [1,]    1    6
## [2,]    2    7
## [3,]    3    8
## [4,]    4    9
## [5,]    5   10

mylist[[4]]

## [1] "one"   "two"   "three"

2 Data input

As a data analyst, we face typically the data that comes to us from a variety of sources and in a variety of formats. The task is to import the data into our tools, analyze the data, and report on the results. R provides a wide range of tools for importing data. The definitive guide for importing data in R is the R Data Import/Export manual available at http://cran.r-project.org/doc/manuals/R-data.pdf.

2.1 Entering data from the keyboard

Perhaps the simplest method of data entry is from the keyboard. For example:

mymatrix=matrix(c(1,3,5,7,9,11,13,15,17),3,3)
mymatrix

##      [,1] [,2] [,3]
## [1,]    1    7   13
## [2,]    3    9   15
## [3,]    5   11   17

 patientID <- c(1, 2, 3, 4)
 age <- c(25, 34, 28, 52)
 diabetes <- c("Type1", "Type2", "Type1", "Type1")
 status <- c("Poor", "Improved", "Excellent", "Poor")
 patientdata <- data.frame(patientID, age, diabetes, status)
 patientdata

patientID	age	diabetes	status
1	25	Type1	Poor
2	34	Type2	Improved
3	28	Type1	Excellent
4	52	Type1	Poor

2.2 Importing data from delimited text files

The delimited text files can be imported by using read.table() , a function that reads a file in table format and saves it as a data frame. Here’s the syntax:

mydataframe <- read.table(file, header=logical_value, sep="delimiter"...)

where file is a delimited ASCII file , header is a logical value indicating whether the first row contains variable names (TRUE or FALSE). sep specifies the delimiter separating data values. For example

grades <- read.table("studentgrades.txt", header=TRUE, sep=",")
grades

StudentID	First	Last	Math	Science	Social.Studies
11	Bob	Smith	90	80	67
12	Jane	Weary	75	NA	80
10	Dan	Thornton	65	75	70
40	Mary	O’Leary	90	95	92

The variants of read.table are useful, for example, read.csv.

grades <- read.csv("studentgrades.csv", header=TRUE)##Usefull
grades

StudentID	First	Last	Math	Science	Social.Studies
11	Bob	Smith	90	80	67
12	Jane	Weary	75	NA	80
10	Dan	Thornton	65	75	70
40	Mary	O’Leary	90	95	92

2.3 Importing data from Excel

Excel 2007 uses an XLSX file format, which is essentially a zipped set of XML files. The XLSX package can be used to access spreadsheets in this format. Be sure to download and install it before first use. The read.xlsx() function imports a worksheet from an XLSX file into a data frame. The simplest format is read.xlsx(file, n) where file is the path to an Excel 2007 workbook and n is the number of the worksheet to be imported. For example,
```
#install.packages('xlsx')
library(xlsx)
mydataframe <- read.xlsx("studentgrades.xlsx", 1)
mydataframe
```
StudentID First Last Math Science Social.Studies NA.

11 Bob Smith 90 80 67 NA

12 Jane Weary 75 NA 80 NA

10 Dan Thornton 65 75 70 NA

40 Mary O’Leary 90 95 92 NA
Or get the detailed usage from xlsx index
Install following packages sequentially
```
```r
install.packages("rJava") 
install.packages("xlsxjars") 
install.packages("xlsx") 
library(rJava) 
library(xlsxjars) 
library(xlsx)
```
```
- If there are still errors about Java, you may need to download Java, and install Java into your local fold “C:\Program Files”, which is defalt.
- Remember to setup environment variables. On cmd command line, type
```
set JAVA_HOME=C:\Program Files\Java\jdk1.8.0_171 Path=C:\Program Files\Java\jdk1.8.0_211\bin
```
- Or setup environment variables manually by adding path C:\Program Files\Java\jdk1.8.0_211\bin

StudentID	First	Last	Math	Science	Social.Studies	NA.
11	Bob	Smith	90	80	67	NA
12	Jane	Weary	75	NA	80	NA
10	Dan	Thornton	65	75	70	NA
40	Mary	O’Leary	90	95	92	NA

2.4 Importing data from MySQL

2.4.1 RODBC

The R package for accessing databases through ODBC is the RODBC package. Microsoft and Simba Technologies jointly developed ODBC in the late 1990s based on a design from the SQL Access Group. In ODBC, different data sources are labeled by database source names (DSNs).

2.4.2 Getting RODBC working

Before using RODBC, it is required to configure the ODBC connection. After configuring R to communicate with your database, it is ready to use RODBC.
1. Install MySQL and ODBC connector to your database, if you donot have. The details can be found here.
2. Install the RODBC package in R. Get the detailed usage from RODBC index
3. Configure the ODBC connection to your database.
Installing the R package RODBC. A quick way to install the RODBC package is to use the function install.packages:

 install.packages("RODBC")
 library(RODBC)
 help(package="RODBC")

Using RODBC. Connecting to a database in R is similor to connecting to a file.
- connect to a database with function odbcConnect().
- execute any database queries.
- close the connection with function close().
Establishing a connection by using the function odbcConnect():

odbcConnect(dsn, uid = "", pwd = "", ...)

For example:

library(RODBC)
con<-odbcConnect('mydata',uid="root",pwd='lx5609')
sqlTables(con) # List the table-like objects accessible from an ODBC connection.

TABLE_CAT	TABLE_NAME	TABLE_TYPE	REMARKS
mysql	bmw_stock	TABLE
mysql	columns_priv	TABLE	Column privileges
mysql	db	TABLE	Database privileges
mysql	engine_cost	TABLE
mysql	event	TABLE	Events
mysql	func	TABLE	User defined functions
mysql	general_log	TABLE	General log
mysql	gtid_executed	TABLE
mysql	help_category	TABLE	help categories
mysql	help_keyword	TABLE	help keywords
mysql	help_relation	TABLE	keyword-topic relation
mysql	help_topic	TABLE	help topics
mysql	innodb_index_stats	TABLE
mysql	innodb_table_stats	TABLE
mysql	ndb_binlog_index	TABLE
mysql	patient	TABLE
mysql	patientdata	TABLE
mysql	plugin	TABLE	MySQL plugins
mysql	proc	TABLE	Stored Procedures
mysql	procs_priv	TABLE	Procedure privileges
mysql	proxies_priv	TABLE	User proxy privileges
mysql	server_cost	TABLE
mysql	servers	TABLE	MySQL Foreign Servers table
mysql	slave_master_info	TABLE	Master Information
mysql	slave_relay_log_info	TABLE	Relay Log Information
mysql	slave_worker_info	TABLE	Worker Information
mysql	slow_log	TABLE	Slow log
mysql	tables_priv	TABLE	Table privileges
mysql	time_zone	TABLE	Time zones
mysql	time_zone_leap_second	TABLE	Leap seconds information for time zones
mysql	time_zone_name	TABLE	Time zone names
mysql	time_zone_transition	TABLE	Time zone transitions
mysql	time_zone_transition_type	TABLE	Time zone transition types
mysql	usarrests	TABLE
mysql	user	TABLE	Users and global privileges

odbcGetInfo(con)  #Request information on an ODBC connection.

##              DBMS_Name               DBMS_Ver        Driver_ODBC_Ver 
##                "MySQL"               "5.7.26"                "03.51" 
##       Data_Source_Name            Driver_Name             Driver_Ver 
##               "mydata"         "myodbc5a.dll"           "05.02.0007" 
##               ODBC_Ver            Server_Name 
##           "03.80.0000" "127.0.0.1 via TCP/IP"

Write or update a table in an ODBC database by using the function sqlSave. For example, write the data in R into MySQL.

 data(USArrests)
 head(USArrests)

	Murder	Assault	UrbanPop	Rape
Alabama	13.2	236	58	21.2
Alaska	10.0	263	48	44.5
Arizona	8.1	294	80	31.0
Arkansas	8.8	190	50	19.5
California	9.0	276	91	40.6
Colorado	7.9	204	78	38.7

 sqlDrop(con,"USArrests") # Delete the USArrests data from MySQL.
 sqlSave(con,USArrests, rownames = "state", addPK = TRUE)
 rm(USArrests)

Fetch a table (or view) from the underlying database by using the function sqlFetch(). For example:

 result = sqlFetch( con,'USArrests' )
 head(result)

state	murder	assault	urbanpop	rape
Alabama	13.2	236	58	21.2
Alaska	10.0	263	48	44.5
Arizona	8.1	294	80	31.0
Arkansas	8.8	190	50	19.5
California	9.0	276	91	40.6
Colorado	7.9	204	78	38.7

 close( con )

2.4.3 Using R package RMySQL

R package RMySQL is powerful to read, write and delete data.

help(package="RMySQL")

Or get the detailed usage from RMySQL index
Use the function dbConnect to establish a connection. For example

 #install.packages("RMySQL")
 library(RMySQL)

## Loading required package: DBI

 #con <- dbConnect(RMySQL::MySQL(), dbname = "mydata", username="root", password="lx5609",
 #        host="127.0.0.1", port=3306)
 con <- dbConnect(MySQL(), dbname = "mysql", username="root", password="lx5609")
 dbGetInfo(con)

## $host
## [1] "localhost"
## 
## $user
## [1] "root"
## 
## $dbname
## [1] "mysql"
## 
## $conType
## [1] "localhost via TCP/IP"
## 
## $serverVersion
## [1] "5.7.26"
## 
## $protocolVersion
## [1] 10
## 
## $threadId
## [1] 194
## 
## $rsId
## list()

 dbListTables(con)

##  [1] "bmw_stock"                 "columns_priv"             
##  [3] "db"                        "engine_cost"              
##  [5] "event"                     "func"                     
##  [7] "general_log"               "gtid_executed"            
##  [9] "help_category"             "help_keyword"             
## [11] "help_relation"             "help_topic"               
## [13] "innodb_index_stats"        "innodb_table_stats"       
## [15] "ndb_binlog_index"          "patient"                  
## [17] "patientdata"               "plugin"                   
## [19] "proc"                      "procs_priv"               
## [21] "proxies_priv"              "server_cost"              
## [23] "servers"                   "slave_master_info"        
## [25] "slave_relay_log_info"      "slave_worker_info"        
## [27] "slow_log"                  "tables_priv"              
## [29] "time_zone"                 "time_zone_leap_second"    
## [31] "time_zone_name"            "time_zone_transition"     
## [33] "time_zone_transition_type" "usarrests"                
## [35] "user"

Use the function dbWriteTable to write or update a table. For example, write the BMW stock data into MySQL.
- Download and save data

 #install.packages('quantmod')
 library(quantmod)
 bmw_stock = new.env()
 getSymbols('BMW.DE', env = bmw_stock, src = 'yahoo',
            from = as.Date("2016-01-01"), to = as.Date("2018-12-31"))
 BMW = bmw_stock$BMW.DE
 write.csv(BMW,'BMW_stock.csv',row.names = index(BMW))

Weite the BMW_stock.csv data into MySQL.

  BMW_STOCK <- read.csv("BMW_stock.csv", head = T)
  head(BMW_STOCK)

Date	BMW.DE.Open	BMW.DE.High	BMW.DE.Low	BMW.DE.Close	BMW.DE.Volume	BMW.DE.Adjusted
2016/1/4	94.71	94.71	91.76	92.25	2376296	77.20447
2016/1/5	93.42	93.50	90.63	91.82	1533191	76.84460
2016/1/6	91.00	91.00	87.14	88.78	2301978	74.30040
2016/1/7	85.30	86.00	83.52	85.44	2812427	71.50513
2016/1/8	86.16	87.14	83.41	83.44	1969156	69.83134
2016/1/11	83.11	85.59	82.88	83.14	1832479	69.58026

  dbRemoveTable(con, "bmw_stock")#Delete the data from MySQL.

## [1] TRUE

  dbWriteTable(con,"BMW_STOCK",BMW_STOCK)

## [1] TRUE

  dbListTables(con)

##  [1] "bmw_stock"                 "columns_priv"             
##  [3] "db"                        "engine_cost"              
##  [5] "event"                     "func"                     
##  [7] "general_log"               "gtid_executed"            
##  [9] "help_category"             "help_keyword"             
## [11] "help_relation"             "help_topic"               
## [13] "innodb_index_stats"        "innodb_table_stats"       
## [15] "ndb_binlog_index"          "patient"                  
## [17] "patientdata"               "plugin"                   
## [19] "proc"                      "procs_priv"               
## [21] "proxies_priv"              "server_cost"              
## [23] "servers"                   "slave_master_info"        
## [25] "slave_relay_log_info"      "slave_worker_info"        
## [27] "slow_log"                  "tables_priv"              
## [29] "time_zone"                 "time_zone_leap_second"    
## [31] "time_zone_name"            "time_zone_transition"     
## [33] "time_zone_transition_type" "usarrests"                
## [35] "user"

If there is something wrong, please see this website.
Weite the BMW_stock.xlsx data into MySQL.

  BMW_STOCK <- read.xlsx("BMW_stock.xlsx", sheetIndex = 1,  head = T,colIndex = 1:7)
  head(BMW_STOCK)

Date	BMW.DE.Open	BMW.DE.High	BMW.DE.Low	BMW.DE.Close	BMW.DE.Volume	BMW.DE.Adjusted
2016-01-04	94.71	94.71	91.76	92.25	2376296	77.20447
2016-01-05	93.42	93.50	90.63	91.82	1533191	76.84460
2016-01-06	91.00	91.00	87.14	88.78	2301978	74.30040
2016-01-07	85.30	86.00	83.52	85.44	2812427	71.50513
2016-01-08	86.16	87.14	83.41	83.44	1969156	69.83134
2016-01-11	83.11	85.59	82.88	83.14	1832479	69.58026

  dbRemoveTable(con, "bmw_stock")#Delete the data from MySQL.

## [1] TRUE

  dbWriteTable(con,"BMW_STOCK",BMW_STOCK)

## [1] TRUE

  dbListTables(con)

##  [1] "bmw_stock"                 "columns_priv"             
##  [3] "db"                        "engine_cost"              
##  [5] "event"                     "func"                     
##  [7] "general_log"               "gtid_executed"            
##  [9] "help_category"             "help_keyword"             
## [11] "help_relation"             "help_topic"               
## [13] "innodb_index_stats"        "innodb_table_stats"       
## [15] "ndb_binlog_index"          "patient"                  
## [17] "patientdata"               "plugin"                   
## [19] "proc"                      "procs_priv"               
## [21] "proxies_priv"              "server_cost"              
## [23] "servers"                   "slave_master_info"        
## [25] "slave_relay_log_info"      "slave_worker_info"        
## [27] "slow_log"                  "tables_priv"              
## [29] "time_zone"                 "time_zone_leap_second"    
## [31] "time_zone_name"            "time_zone_transition"     
## [33] "time_zone_transition_type" "usarrests"                
## [35] "user"

Fetch a table (or view) from the underlying database by using the dbReadTable function. For example:

 result = dbReadTable(con,"BMW_STOCK")
 head(result)

Date	BMW.DE.Open	BMW.DE.High	BMW.DE.Low	BMW.DE.Close	BMW.DE.Volume	BMW.DE.Adjusted
2016-01-04	94.71	94.71	91.76	92.25	2376296	77.20447
2016-01-05	93.42	93.50	90.63	91.82	1533191	76.84460
2016-01-06	91.00	91.00	87.14	88.78	2301978	74.30040
2016-01-07	85.30	86.00	83.52	85.44	2812427	71.50513
2016-01-08	86.16	87.14	83.41	83.44	1969156	69.83134
2016-01-11	83.11	85.59	82.88	83.14	1832479	69.58026

Remember to disconnect con
```
dbDisconnect(con)
```
```
## [1] TRUE
```

2.5 Accessing builtin datasets

Around 100 datasets are supplied with R (in package datasets), and others are available in packages (including the recommended packages supplied with R). To see the list of datasets currently available use

data()

All the datasets supplied with R are available directly by name. For example

data(infert)
head(infert)

education	age	parity	induced	case	spontaneous	stratum	pooled.stratum
0-5yrs	26	6	1	1	2	1	3
0-5yrs	42	1	1	1	0	2	1
0-5yrs	39	6	2	1	0	3	4
0-5yrs	34	4	2	1	0	4	2
6-11yrs	35	3	1	1	1	5	32
6-11yrs	36	4	2	1	1	6	36

str(infert)

## 'data.frame':    248 obs. of  8 variables:
##  $ education     : Factor w/ 3 levels "0-5yrs","6-11yrs",..: 1 1 1 1 2 2 2 2 2 2 ...
##  $ age           : num  26 42 39 34 35 36 23 32 21 28 ...
##  $ parity        : num  6 1 6 4 3 4 1 2 1 2 ...
##  $ induced       : num  1 1 2 2 1 2 0 0 0 0 ...
##  $ case          : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ spontaneous   : num  2 0 0 0 1 1 0 0 1 0 ...
##  $ stratum       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ pooled.stratum: num  3 1 4 2 32 36 6 22 5 19 ...

2.6 Loading data from other R packages

Access data from a particular package by using function data(). For example

#install.packages('rpart')
#data(package="rpart")#To see the list of datasets in R package rpart.
data(car90, package="rpart")
car90

data(car90, package="rpart")
dim(car90)

## [1] 111  34

car90[1:15,1:8]

	Country	Disp	Disp2	Eng.Rev	Front.Hd	Frt.Leg.Room	Frt.Shld	Gear.Ratio
Acura Integra	Japan	112	1.8	2935	3.5	41.5	53.0	3.26
Acura Legend	Japan	163	2.7	2505	2.0	41.5	55.5	2.95
Audi 100	Germany	141	2.3	2775	2.5	41.5	56.5	3.27
Audi 80	Germany	121	2.0	2835	4.0	42.0	52.5	3.25
BMW 325i	Germany	152	2.5	2625	2.0	42.0	52.0	3.02
BMW 535i	Germany	209	3.5	2285	3.0	42.0	54.5	2.80
Buick Century	USA	151	2.5	NA	4.0	42.0	56.5	NA
Buick Electra	USA	231	3.8	NA	6.0	42.0	58.5	NA
Buick Le Sabre	USA	231	3.8	NA	5.0	41.0	59.0	NA
Buick Regal	NA	189	3.1	NA	5.5	41.0	58.0	NA
Buick Riviera	USA	231	3.8	NA	3.5	41.5	58.0	NA
Cadillac Brougham	USA	307	5.0	NA	5.0	43.0	60.5	NA
Cadillac De Ville	USA	273	4.5	NA	6.0	42.0	59.5	NA
Cadillac Eldorado	USA	273	4.5	NA	3.0	42.0	58.0	NA
Cadillac Seville	NA	NA	NA	NA	3.0	42.0	57.5	NA

There is a collection Rdatasets listing over 1300 datasets that were originally distributed alongside the statistical software environment R and some of its add-on packages.

3 Useful functions for working with data objects

Table Useful functions for working with data objects

Function	Purpose
length(object)	Number of elements/components.
dim(object)	Dimensions of an object.
str(object)	Structure of an object.
class(object)	Class or type of an object.
mode(object)	How an object is stored.
names(object)	Names of components in an object.
c(object, object,…)	Combines objects into a vector.
cbind(object, object, …)	Combines objects as columns.
rbind(object, object, …)	Combines objects as rows.
object	Prints the object.
head(object)	Lists the first part of the object.
tail(object)	Lists the last part of the object.
hist(x)	histogram of the frequencies of x.
barplot(x)	histogram of the values of x.
boxplot(x)	“box-and-whiskers”" plot.
mean(x)	Mean of x.
median(x)	Median of x.
sd(x)	Standard deviation of x.
var(x)	Variance or covariance of x.
quantile(x, probs = $\tau$ )	The $\tau$ th quantile of x.
range(x)	Range of x.
list()	Construct, coerce and check for both kinds of R lists.
sum(x)	Sum of x.
diff(x, lag = n)	Lagged Differences.
min(x)	Maximum of x.
max(x)	Minimum of x.
scale(x, center = TURE, scale = TRUE)	Scaling and centering of matrix-like objects x.
ls()	List objects.
rm(object, object, …)	Deletes one or more objects. The statement `rm(list = ls())` will remove most objects from the working environment.