c() - to concatenate and create vectors

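To create a vector of numbers, we use the function c() (short for concatenate). Any numbers inside the parentheses are joined together, and the result can be saved to a named object with either <- or =.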

x <- c(1,2,3,4,5)
x
## [1] 1 2 3 4 5
x = c(1,6,2)
x
## [1] 1 6 2
y = c(1,4,3)
length() - to check the number of elements

We can tell R to add two sets of numbers together. It will then add the first number from x to the first number from y, and so on. However, x and y should be the same length. We can check their lengths using the length() function.

length(x)
## [1] 3
length(y)
## [1] 3
x+y
## [1]  2 10  5
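
The same-length requirement can be illustrated with a small sketch. The vectors a, b, and d below are made up for this example and are not part of the session above. When the lengths differ, R does not stop: it recycles the shorter vector, and it warns only when the longer length is not a multiple of the shorter one.

```r
# Illustrative vectors (hypothetical names, not used elsewhere in this tutorial)
a <- c(1, 2, 3, 4)
b <- c(10, 20)
d <- c(10, 20, 30, 40)

# Same length: element-by-element addition
same_len <- a + d      # 11 22 33 44

# Different lengths: b is recycled to 10 20 10 20 before adding
recycled <- a + b      # 11 22 13 24

same_len
recycled
```

Because recycling happens silently in the second case, checking length() first is a cheap way to catch mismatched inputs.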
ls() - function used for listing

The ls() function allows us to look at a list of all of the objects, such as data and functions, that we have saved so far.

ls()
## [1] "x" "y"
rm() - function used for deleting

The rm() function can be used to delete any objects that we no longer want.

rm(x,y)
ls()
## character(0)
ls() and rm() mixed together

It’s also possible to remove all objects at once:

rm(list=ls())
matrix()

The matrix() function can be used to create a matrix of numbers. Before we use the matrix() function, we can learn more about it:

?matrix

The matrix() function takes a number of inputs, but for now we focus on the first three: the data (the entries in the matrix), the number of rows, and the number of columns. First, we create a simple matrix.

x <- matrix(data = c(1,2,3,4), nrow = 2, ncol = 2)
x
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4

We could just as well omit typing data=, nrow=, and ncol= in the matrix() command above: that is, we could just type

x <- matrix(c(1,2,3,4),2,2)
x
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4

As this example illustrates, by default R creates matrices by successively filling in columns. Alternatively, the byrow=TRUE option can be used to populate the matrix in order of the rows.

matrix(c(1,2,3,4), nrow = 2, ncol = 2, byrow = TRUE)
##      [,1] [,2]
## [1,]    1    2
## [2,]    3    4
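
The difference between the two fill orders is easier to see on a non-square matrix. The following is a minimal sketch using made-up object names (v, by_col, by_row) that do not appear in the session above.

```r
v <- 1:6
by_col <- matrix(v, nrow = 2, ncol = 3)                 # default: fill columns first
by_row <- matrix(v, nrow = 2, ncol = 3, byrow = TRUE)   # fill rows first

by_col[1, ]   # 1 3 5
by_row[1, ]   # 1 2 3

# Filling by row gives the transpose of filling the transposed shape by column
all(by_row == t(matrix(v, nrow = 3, ncol = 2)))   # TRUE
```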
sqrt() and ^ - square root and power functions

The sqrt() function returns the square root of each element of a vector or matrix.

sqrt(x)
##          [,1]     [,2]
## [1,] 1.000000 1.732051
## [2,] 1.414214 2.000000

The command x^2 raises each element of x to the power 2. Any powers are possible, including fractional or negative powers.

x^2
##      [,1] [,2]
## [1,]    1    9
## [2,]    4   16
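
As a sketch of the fractional and negative cases, the block below rebuilds the same 2 x 2 matrix under a separate, illustrative name (m). One point worth checking is that a negative power is still applied element by element; it is not the matrix inverse, which base R computes with solve().

```r
m <- matrix(c(1, 2, 3, 4), nrow = 2, ncol = 2)

half_power <- m^0.5    # fractional power: same as sqrt(m)
neg_power  <- m^-1     # negative power: element-wise reciprocal 1/m

all.equal(half_power, sqrt(m))   # TRUE
all.equal(neg_power, 1 / m)      # TRUE

# Note: m^-1 is NOT the matrix inverse; solve(m) computes that
m_inverse <- solve(m)
m %*% m_inverse                  # identity matrix, up to rounding
```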
rnorm() and cor() - vector of random normal variables and correlation

The rnorm() function generates a vector of random normal variables, with first argument n the sample size. Each time we call this function, we will get a different answer. By default, rnorm() creates standard normal random variables with a mean of 0 and a standard deviation of 1. However, the mean and standard deviation can be altered using the mean and sd arguments. Here we create two correlated sets of numbers, x and y, and use the cor() function to compute the correlation between them.

x <- rnorm(50)
y = x + rnorm(50, mean = 50, sd = .1)
cor(x,y)
## [1] 0.9922666
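
To see what the mean and sd arguments do, and why the correlation above is so close to 1, here is a rough sketch. The object names and the sample size of 1000 are arbitrary choices for illustration. No output is shown because the numbers change on every run.

```r
n <- 1000                       # sample size chosen for illustration
z <- rnorm(n)                   # standard normal: mean 0, sd 1

shifted <- rnorm(n, mean = 50, sd = 0.1)
mean(shifted)                   # close to 50
sd(shifted)                     # close to 0.1

# Adding small noise keeps the correlation high; large noise lowers it
low_noise  <- z + rnorm(n, mean = 50, sd = 0.1)
high_noise <- z + rnorm(n, mean = 50, sd = 5)
cor(z, low_noise)
cor(z, high_noise)
```

The mean of the added noise shifts the values but does not affect the correlation; only the noise's standard deviation relative to that of z matters.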
set.seed() - reproduce random numbers

Sometimes we want our code to reproduce the exact same set of random numbers; we can use the set.seed() function to do this.

set.seed(1303)
rnorm(50)
##  [1] -1.1439763145  1.3421293656  2.1853904757  0.5363925179  0.0631929665
##  [6]  0.5022344825 -0.0004167247  0.5658198405 -0.5725226890 -1.1102250073
## [11] -0.0486871234 -0.6956562176  0.8289174803  0.2066528551 -0.2356745091
## [16] -0.5563104914 -0.3647543571  0.8623550343 -0.6307715354  0.3136021252
## [21] -0.9314953177  0.8238676185  0.5233707021  0.7069214120  0.4202043256
## [26] -0.2690521547 -1.5103172999 -0.6902124766 -0.1434719524 -1.0135274099
## [31]  1.5732737361  0.0127465055  0.8726470499  0.4220661905 -0.0188157917
## [36]  2.6157489689 -0.6931401748 -0.2663217810 -0.7206364412  1.3677342065
## [41]  0.2640073322  0.6321868074 -1.3306509858  0.0268888182  1.0406363208
## [46]  1.3120237985 -0.0300020767 -0.2500257125  0.0234144857  1.6598706557
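
A quick way to confirm this behaviour is to draw twice from the same seed and compare. The sketch below uses illustrative object names; the seed value itself is arbitrary.

```r
set.seed(1303)
first_draw <- rnorm(5)

set.seed(1303)          # reset the generator to the same state
second_draw <- rnorm(5)

identical(first_draw, second_draw)   # TRUE

third_draw <- rnorm(5)  # no reset: the stream continues with new values
identical(first_draw, third_draw)    # FALSE
```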
mean() and var() - vector mean and variance

The mean() and var() functions compute the mean and variance of a vector of numbers. Applying sqrt() to the output of var() gives the standard deviation. Alternatively, we can simply use the sd() function.

set.seed(3)
y <- rnorm(100)
mean(y)
## [1] 0.01103557
var(y)
## [1] 0.7328675

Standard deviation

sqrt(var(y))
## [1] 0.8560768
sd(y)
## [1] 0.8560768
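
For reference, var() computes the sample variance, which divides by n - 1 rather than n. The sketch below recomputes it by hand on a vector named v (an illustrative name) and checks the result against the built-in functions.

```r
set.seed(3)
v <- rnorm(100)

n <- length(v)
manual_var <- sum((v - mean(v))^2) / (n - 1)   # sample variance, divisor n - 1
manual_sd  <- sqrt(manual_var)

all.equal(manual_var, var(v))   # TRUE
all.equal(manual_sd, sd(v))     # TRUE
```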

Graphics

Plot

The plot() function is the primary way to plot data in R. For instance, plot(x,y) produces a scatterplot of the numbers in x versus the numbers in y. There are many additional options that can be passed in to the plot() function. For example, passing in the argument xlab will result in a label on the x-axis.

x <- rnorm(100)
y <- rnorm(100)
plot(x,y)

plot(x,y,xlab="this is the x-axis",ylab="this is the y-axis",
     main="Plot of X vs Y")

pdf() and jpeg() - Save a plot in a specific format

We will often want to save the output of an R plot. The command that we use to do this will depend on the file type that we would like to create. For instance, to create a pdf, we use the pdf() function, and to create a jpeg, we use the jpeg() function.

pdf("Figure.pdf")
plot(x,y, col = "green")
dev.off() #Indicates to R that we are done creating the plot
## quartz_off_screen 
##                 2
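
The jpeg() route follows the same open, plot, close pattern. The sketch below makes a few assumptions: it writes to a temporary directory (out_file is a hypothetical path), and it uses the cairo back end behind a capabilities() check, since jpeg support depends on how R was built.

```r
# Hypothetical output location: a temporary directory, so nothing is overwritten
out_file <- file.path(tempdir(), "Figure.jpeg")

# jpeg support and the cairo back end depend on how R was built
if (all(capabilities(c("jpeg", "cairo")))) {
  jpeg(out_file, type = "cairo")     # open the jpeg device
  plot(rnorm(100), rnorm(100), col = "green")
  dev.off()                          # close the device so the file is written
}

file.exists(out_file)
```

Until dev.off() is called, the file may be empty or incomplete, so the closing call is not optional.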
seq() - number sequence

The function seq() can be used to create a sequence of numbers. For instance, seq(a,b) makes a vector of integers between a and b. There are many other options: for instance, seq(0,1,length=10) makes a sequence of 10 numbers that are equally spaced between 0 and 1. Typing 3:11 is a shorthand for seq(3,11) for integer arguments.

x <- seq(1,10)
x
##  [1]  1  2  3  4  5  6  7  8  9 10
x <- 1:10
x
##  [1]  1  2  3  4  5  6  7  8  9 10
x <- seq(-pi,pi,length = 50)
x
##  [1] -3.14159265 -3.01336438 -2.88513611 -2.75690784 -2.62867957 -2.50045130
##  [7] -2.37222302 -2.24399475 -2.11576648 -1.98753821 -1.85930994 -1.73108167
## [13] -1.60285339 -1.47462512 -1.34639685 -1.21816858 -1.08994031 -0.96171204
## [19] -0.83348377 -0.70525549 -0.57702722 -0.44879895 -0.32057068 -0.19234241
## [25] -0.06411414  0.06411414  0.19234241  0.32057068  0.44879895  0.57702722
## [31]  0.70525549  0.83348377  0.96171204  1.08994031  1.21816858  1.34639685
## [37]  1.47462512  1.60285339  1.73108167  1.85930994  1.98753821  2.11576648
## [43]  2.24399475  2.37222302  2.50045130  2.62867957  2.75690784  2.88513611
## [49]  3.01336438  3.14159265
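
The two most common ways of calling seq() fix either the step size or the number of points. In the sketch below (object names are illustrative), the full argument name length.out is used; the length= spelling in the text works because R allows argument names to be abbreviated.

```r
by_step   <- seq(0, 1, by = 0.25)         # fix the step size
by_length <- seq(0, 1, length.out = 5)    # fix the number of points

by_step      # 0.00 0.25 0.50 0.75 1.00
by_length    # 0.00 0.25 0.50 0.75 1.00

# The colon operator matches seq() for integer endpoints
identical(3:11, seq(3, 11))               # TRUE
```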
contour() - Sophisticated plots

The contour() function produces a contour plot in order to represent three-dimensional data; it is like a topographical map. It takes three arguments:

* A vector of the x values (the first dimension),
* A vector of the y values (the second dimension), and
* A matrix whose elements correspond to the z value (the third dimension) for each pair of (x,y) coordinates.

y <- x
f <- outer(x,y, function(x,y) cos(y)/(1+x^2))
contour(x,y,f)

y <- x
f <- outer(x,y, function(x,y) cos(y)/(1+x^2))
contour(x,y,f)
contour(x,y,f,nlevels = 45,add = T)

fa <- (f-t(f))/2
contour(x,y,fa,nlevels=15)
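
To make the role of outer() and of fa more concrete, the sketch below repeats the construction on a small 5-point grid. The names gx, gy, gf, and gfa are illustrative stand-ins for x, y, f, and fa, chosen so the objects above are not overwritten.

```r
gx <- seq(-1, 1, length.out = 5)   # small illustrative grid
gy <- gx

# outer() evaluates the function for every (gx[i], gy[j]) pair
gf <- outer(gx, gy, function(x, y) cos(y) / (1 + x^2))
dim(gf)                             # 5 5

# Entry [i, j] uses the i-th x value and the j-th y value
all.equal(gf[2, 3], cos(gy[3]) / (1 + gx[2]^2))   # TRUE

# (f - t(f)) / 2 is the antisymmetric part: transposing flips the sign
gfa <- (gf - t(gf)) / 2
all.equal(t(gfa), -gfa)             # TRUE
```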

image() - similar to contour()

The image() function works the same way as contour(), except that it produces a color-coded plot whose colors depend on the z value. This is known as a heatmap.

image(x,y,fa)

persp() - 3D plots

The persp() function produces a three-dimensional plot. The arguments theta and phi control the angles at which the plot is viewed.

persp(x,y,fa)

persp(x,y,fa,theta = 30)

persp(x,y,fa,theta = 30,phi = 20)

persp(x,y,fa,theta = 30,phi = 70)

persp(x,y,fa,theta = 30,phi = 40)

Indexing data

We often wish to examine part of a set of data. Suppose that our data is stored in the matrix A.

A <- matrix(1:16,4,4)
A
##      [,1] [,2] [,3] [,4]
## [1,]    1    5    9   13
## [2,]    2    6   10   14
## [3,]    3    7   11   15
## [4,]    4    8   12   16

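Typing A[2,3] selects the element in the second row and third column. The first number after the open bracket always refers to the row, and the second number always refers to the column.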

A[2,3]
## [1] 10

We can also select multiple rows and columns at a time, by providing vectors as the indices.

A[c(1,3),c(2,4)]
##      [,1] [,2]
## [1,]    5   13
## [2,]    7   15
A[1:3,2:4]
##      [,1] [,2] [,3]
## [1,]    5    9   13
## [2,]    6   10   14
## [3,]    7   11   15
A[1:2,]
##      [,1] [,2] [,3] [,4]
## [1,]    1    5    9   13
## [2,]    2    6   10   14
A[,1:2]
##      [,1] [,2]
## [1,]    1    5
## [2,]    2    6
## [3,]    3    7
## [4,]    4    8
A[1,]
## [1]  1  5  9 13

The use of a negative sign - in the index tells R to keep all rows or columns except those indicated in the index.

A[-c(1,3),]
##      [,1] [,2] [,3] [,4]
## [1,]    2    6   10   14
## [2,]    4    8   12   16
A[-c(1,3),-c(1,3,4)]
## [1] 6 8
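
Notice that the last result is printed as a plain vector rather than a 2 x 1 matrix: by default R drops dimensions of length one when indexing. If the matrix shape matters, the drop = FALSE argument keeps it. A minimal sketch, using a copy of the matrix under the illustrative name B:

```r
B <- matrix(1:16, 4, 4)          # same values as A above, under a separate name

row_vec <- B[1, ]                 # a single row is returned as a plain vector
row_mat <- B[1, , drop = FALSE]   # drop = FALSE keeps it as a 1 x 4 matrix

dim(row_vec)                      # NULL
dim(row_mat)                      # 1 4

sub_mat <- B[-c(1, 3), -c(1, 3, 4), drop = FALSE]
dim(sub_mat)                      # 2 1
```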
dim() - Matrix dimension

The dim() function outputs the number of rows followed by the number of columns of a given matrix.

dim(A)
## [1] 4 4

Loading data

read.table() and write.table()

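The read.table() function imports a data set into R, and write.table() exports one. For comma-separated files, read.csv() is a convenient variant. The header=T option tells R that the first line of the file contains the variable names, and na.strings="?" tells R to treat a question mark as a missing value. Once the data has been loaded, the fix() function opens it in a spreadsheet-like editor window.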

#Auto <- read.table("Auto.data")
Auto <- read.csv("~/IntroToStatisticalLearningR-/data/Auto.csv",header=T,na.strings ="?")
fix(Auto)
dim() - data frame dimension
dim(Auto)
## [1] 397   9
Auto[1:4,]
##   mpg cylinders displacement horsepower weight acceleration year origin
## 1  18         8          307        130   3504         12.0   70      1
## 2  15         8          350        165   3693         11.5   70      1
## 3  18         8          318        150   3436         11.0   70      1
## 4  16         8          304        150   3433         12.0   70      1
##                        name
## 1 chevrolet chevelle malibu
## 2         buick skylark 320
## 3        plymouth satellite
## 4             amc rebel sst
na.omit() - Remove rows with missing values (NA) from data frames
Auto <- na.omit(Auto)
names() - Column names
names(Auto)
## [1] "mpg"          "cylinders"    "displacement" "horsepower"   "weight"      
## [6] "acceleration" "year"         "origin"       "name"
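
The interaction between na.strings and na.omit() can be tried without the Auto file. The sketch below writes a tiny, made-up CSV to a temporary file (the contents and the names csv_path, toy, and toy_clean are hypothetical) and then reads it back the same way.

```r
# Write a small illustrative CSV in which "?" marks a missing value
csv_path <- tempfile(fileext = ".csv")
writeLines(c("mpg,horsepower,name",
             "18,130,car a",
             "25,?,car b",
             "16,150,car c"), csv_path)

toy <- read.csv(csv_path, header = TRUE, na.strings = "?")
sum(is.na(toy$horsepower))   # 1: the "?" was read as NA
is.numeric(toy$horsepower)   # TRUE: the column stays numeric

toy_clean <- na.omit(toy)    # drop every row that contains an NA
dim(toy_clean)               # 2 3
```

Without na.strings = "?", the question mark would be read as text and the whole horsepower column would no longer be numeric.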

Additional Graphical and Numerical Summaries

plot() - used for scatterplots
plot(Auto$cylinders, Auto$mpg)

To avoid using the “$” to refer to columns within a data frame, we can use attach():

attach(Auto)
plot(cylinders, mpg)

as.factor() - Turn quantitative variables into qualitative ones

If R is treating a variable as quantitative but we would like it to be treated as qualitative, we can convert it with as.factor().

cylinders <- as.factor(cylinders)
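
Note that the assignment above creates a new object called cylinders in the workspace; the column stored inside Auto is left unchanged. If the data frame itself should hold the factor, the column can be converted in place. The sketch below shows this on a small made-up data frame (cars_toy is a hypothetical name), and uses with() as one alternative to attach().

```r
cars_toy <- data.frame(
  cylinders = c(4, 8, 6, 4, 8),
  mpg       = c(30, 15, 20, 28, 14)
)

is.factor(cars_toy$cylinders)          # FALSE: stored as numbers

# Convert the column inside the data frame itself
cars_toy$cylinders <- as.factor(cars_toy$cylinders)

is.factor(cars_toy$cylinders)          # TRUE
levels(cars_toy$cylinders)             # "4" "6" "8"

# with() evaluates an expression using the columns, without attach()
with(cars_toy, tapply(mpg, cylinders, mean))
```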

If the variable plotted on the x-axis is categorical, plot() will automatically produce boxplots instead.

plot(cylinders, mpg)

plot(cylinders, mpg, col = "red")

plot(cylinders, mpg, col = "red", varwidth = T)

plot(cylinders, mpg, col = "red", varwidth = T, horizontal = T)

plot(cylinders, mpg, col = "red", varwidth = T, xlab = "cylinders ", ylab="MPG")

hist() - Histograms
hist(mpg)

hist(mpg, col = 2)

hist(mpg, col = 2, breaks = 15)

pairs() - scatterplot matrix

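The pairs() function creates a scatterplot matrix, i.e. a scatterplot for every pair of variables in a given data set. We can also produce scatterplots for just a subset of the variables.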

pairs(Auto)

pairs(~ mpg + displacement + horsepower + weight + acceleration, Auto)

identify() - Identifying the value for a particular variable for points on a plot.

We pass in three arguments to identify(): the x-axis variable, the y-axis variable, and the variable whose values we would like to see printed for each point. Clicking on a given point in the plot will then cause R to print the value of the variable of interest. Right-clicking on the plot will exit the identify() function.

plot(horsepower, mpg)
identify(horsepower, mpg, name)

## integer(0)
summary() - numerical summary

The summary() function produces a numerical summary of each variable in a particular data set.

summary(Auto)
##       mpg          cylinders      displacement     horsepower        weight    
##  Min.   : 9.00   Min.   :3.000   Min.   : 68.0   Min.   : 46.0   Min.   :1613  
##  1st Qu.:17.00   1st Qu.:4.000   1st Qu.:105.0   1st Qu.: 75.0   1st Qu.:2225  
##  Median :22.75   Median :4.000   Median :151.0   Median : 93.5   Median :2804  
##  Mean   :23.45   Mean   :5.472   Mean   :194.4   Mean   :104.5   Mean   :2978  
##  3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:275.8   3rd Qu.:126.0   3rd Qu.:3615  
##  Max.   :46.60   Max.   :8.000   Max.   :455.0   Max.   :230.0   Max.   :5140  
##                                                                                
##   acceleration        year           origin                      name    
##  Min.   : 8.00   Min.   :70.00   Min.   :1.000   amc matador       :  5  
##  1st Qu.:13.78   1st Qu.:73.00   1st Qu.:1.000   ford pinto        :  5  
##  Median :15.50   Median :76.00   Median :1.000   toyota corolla    :  5  
##  Mean   :15.54   Mean   :75.98   Mean   :1.577   amc gremlin       :  4  
##  3rd Qu.:17.02   3rd Qu.:79.00   3rd Qu.:2.000   amc hornet        :  4  
##  Max.   :24.80   Max.   :82.00   Max.   :3.000   chevrolet chevette:  4  
##                                                  (Other)           :365

For qualitative variables such as name, R will list the number of observations that fall in each category. We can also produce a summary of just a single variable.

summary(mpg)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.00   17.00   22.75   23.45   29.00   46.60
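
The difference between the quantitative and qualitative cases can be reproduced on two small vectors. The values and names below (scores, groups) are invented for illustration and have nothing to do with the Auto data.

```r
scores <- c(1, 2, 3, 4, 100)              # illustrative numeric vector
groups <- factor(c("a", "b", "a", "c"))   # illustrative qualitative variable

num_summary <- summary(scores)
num_summary[["Median"]]                   # 3
num_summary[["Mean"]]                     # 22

cat_summary <- summary(groups)            # counts per category
cat_summary[["a"]]                        # 2

# quantile() gives the same quartiles that summary() reports
quantile(scores, c(0.25, 0.75))
```

The gap between the median and the mean here comes from the single large value, which is one reason summary() reports both.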