R Labs from the ISL Book (James, Witten, Hastie & Tibshirani)

Chapter 2 - Statistical Learning

Let's get the basics going. For example:
c() - creates a vector by concatenating all of its arguments into a single object.

We can also assign values using = rather than <-.
?funcname opens the help page for the function funcname, with additional information about it.

x <- c(1,3,2,5)
x

## [1] 1 3 2 5

x = c(1,6,2)
y = c(1,4,3)

# How to open a help page for a function (no parentheses after the name)
#?c

length() - returns the number of elements in a vector

length (x)

## [1] 3

length (y)

## [1] 3

x+y

## [1]  2 10  5
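
The addition above works element by element because x and y have the same length. As a small sketch (the vectors a and b are made up for illustration and are not part of the lab), the code below repeats the sum and shows what R does when the lengths differ: it recycles the shorter vector and issues a warning, which is suppressed here to keep the output clean.

```r
# Hypothetical vectors for illustration (not from the lab)
a <- c(1, 6, 2)
b <- c(1, 4, 3)

# Element-wise sum: 1+1, 6+4, 2+3
s <- a + b
print(s)

# A simple guard against accidental recycling
stopifnot(length(a) == length(b))

# With unequal lengths R recycles the shorter vector (and normally warns):
# 1+10, 2+20, 3+10
r <- suppressWarnings(c(1, 2, 3) + c(10, 20))
print(r)
```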

ls() - allows us to look at a list of all of the objects, such as data and functions, that we have saved so far.
rm() - can be used to delete any that we don’t want.

ls()

## [1] "x" "y"

rm(x,y)
ls()

## character(0)

# Removing all at once
#rm(list=ls())

Creating matrices: matrix() - creates a matrix of numbers. By default R fills the matrix column by column. The byrow=TRUE option fills it row by row instead.

x=matrix(data=c(1,2,3,4) , nrow=2, ncol =2)
x

##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4

# Same result as above without naming the arguments; they must then appear in the order given in the help file
x=matrix (c(1,2,3,4) ,2,2)

# The calls above fill the matrix by column (the default); byrow=TRUE fills it by row
matrix (c(1,2,3,4) ,2,2,byrow =TRUE)

##      [,1] [,2]
## [1,]    1    2
## [2,]    3    4
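
To make the column-versus-row fill order a bit more concrete, here is a minimal sketch on a non-square matrix (the names by_col and by_row are just for illustration). It also checks that filling by row gives the transpose of the column-filled matrix with the dimensions swapped.

```r
# Fill the same six values by column (the default) and by row
by_col <- matrix(1:6, nrow = 2, ncol = 3)
by_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)

print(by_col)  # first column is 1, 2
print(by_row)  # first row is 1, 2, 3

# Filling a 2 x 3 matrix by row is the same as transposing
# a 3 x 2 matrix that was filled by column
same <- identical(by_row, t(matrix(1:6, nrow = 3, ncol = 2)))
print(same)
```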

Some more examples of functions:
* rnorm() generates a vector of random normal variables; its first argument n is the sample size.
* set.seed() lets our code reproduce the exact same set of random numbers; it takes an (arbitrary) integer argument.
* sqrt(), cor(), mean(), var() and sd() are simple functions that calculate the square root, Pearson correlation coefficient, mean (average), variance and standard deviation, respectively.

sqrt(x)

##          [,1]     [,2]
## [1,] 1.000000 1.732051
## [2,] 1.414214 2.000000

x^2

##      [,1] [,2]
## [1,]    1    9
## [2,]    4   16

x=rnorm (50)
y=x+rnorm (50, mean=50, sd=.1)
cor(x,y)

## [1] 0.991476

set.seed (1303)
rnorm (50)

##  [1] -1.1439763145  1.3421293656  2.1853904757  0.5363925179  0.0631929665
##  [6]  0.5022344825 -0.0004167247  0.5658198405 -0.5725226890 -1.1102250073
## [11] -0.0486871234 -0.6956562176  0.8289174803  0.2066528551 -0.2356745091
## [16] -0.5563104914 -0.3647543571  0.8623550343 -0.6307715354  0.3136021252
## [21] -0.9314953177  0.8238676185  0.5233707021  0.7069214120  0.4202043256
## [26] -0.2690521547 -1.5103172999 -0.6902124766 -0.1434719524 -1.0135274099
## [31]  1.5732737361  0.0127465055  0.8726470499  0.4220661905 -0.0188157917
## [36]  2.6157489689 -0.6931401748 -0.2663217810 -0.7206364412  1.3677342065
## [41]  0.2640073322  0.6321868074 -1.3306509858  0.0268888182  1.0406363208
## [46]  1.3120237985 -0.0300020767 -0.2500257125  0.0234144857  1.6598706557
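
A quick way to see what set.seed() buys us is to draw twice from the same seed and compare. This is only a sketch; the variable names first, second and third are made up here.

```r
# Same seed, same draws
set.seed(1303)
first <- rnorm(5)

set.seed(1303)
second <- rnorm(5)

# No re-seeding here, so this continues the random stream
third <- rnorm(5)

print(identical(first, second))  # TRUE: the seed was reset before both draws
print(identical(first, third))   # FALSE: third comes from further along the stream
```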

set.seed (3)
y=rnorm (100)
mean(y)

## [1] 0.01103557

var(y)

## [1] 0.7328675

sqrt(var(y))

## [1] 0.8560768

sd(y)

## [1] 0.8560768

seq() - creates a sequence of numbers.
seq(a,b) - makes a vector of integers between a and b.
seq(0,1,length=10) - makes a sequence of 10 numbers that are equally spaced between 0 and 1.

# Typing 3:11 is a shorthand for seq(3,11) for integer arguments.

x=seq (1 ,10)
x

##  [1]  1  2  3  4  5  6  7  8  9 10

x=1:10
x

##  [1]  1  2  3  4  5  6  7  8  9 10

x=seq(-pi ,pi ,length =50)
x

##  [1] -3.14159265 -3.01336438 -2.88513611 -2.75690784 -2.62867957
##  [6] -2.50045130 -2.37222302 -2.24399475 -2.11576648 -1.98753821
## [11] -1.85930994 -1.73108167 -1.60285339 -1.47462512 -1.34639685
## [16] -1.21816858 -1.08994031 -0.96171204 -0.83348377 -0.70525549
## [21] -0.57702722 -0.44879895 -0.32057068 -0.19234241 -0.06411414
## [26]  0.06411414  0.19234241  0.32057068  0.44879895  0.57702722
## [31]  0.70525549  0.83348377  0.96171204  1.08994031  1.21816858
## [36]  1.34639685  1.47462512  1.60285339  1.73108167  1.85930994
## [41]  1.98753821  2.11576648  2.24399475  2.37222302  2.50045130
## [46]  2.62867957  2.75690784  2.88513611  3.01336438  3.14159265
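
The seq(0,1,length=10) form mentioned above is not run in the lab, so here is a small sketch of it. It spells the argument out as length.out (length= works through R's partial argument matching) and checks that the spacing is constant; the names s and steps are only for illustration.

```r
# Ten equally spaced numbers from 0 to 1
s <- seq(0, 1, length.out = 10)
print(s)
print(length(s))

# Ten points leave nine gaps, so each step should be 1/9
steps <- diff(s)
print(all.equal(steps, rep(1 / 9, 9)))

# For integer endpoints, a:b is shorthand for seq(a, b)
print(identical(3:11, seq(3, 11)))
```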

Graphics

plot() - the primary way to plot data in R. E.g. plot(x,y) produces a scatterplot of the numbers in x versus the numbers in y.
pdf() - opens a pdf file that subsequent plots are written to (an alternative to copying and pasting the plot window by hand).
jpeg() - does the same for a jpeg file.
dev.off() - indicates to R that we are done creating the plot.
contour() - produces a contour plot, used to represent three-dimensional data. It takes a vector of x values, a vector of y values, and a matrix of z values with one entry for each (x,y) pair.
image() - works the same way as contour(), except that it produces a color-coded plot whose colors depend on the z value.
persp() - an alternative to contour()/image() that produces a three-dimensional plot. The arguments theta and phi control the angles at which the plot is viewed.

x=rnorm (100)
y=rnorm (100)
plot(x,y)

plot(x,y,xlab="this is the x-axis",ylab="this is the y-axis",
main="Plot of X vs Y / Scatterplot")

# File names and color names must not contain stray spaces
pdf("Figure.pdf")
plot(x,y,col="green")
dev.off()

## png 
##   2
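
The open-device, plot, close-device pattern can be sketched in a self-contained way as follows. The file name figure_sketch.pdf and the use of tempdir() are assumptions made for this example, so that nothing is written into the working directory.

```r
# Write the figure to a temporary location (hypothetical file name)
out_file <- file.path(tempdir(), "figure_sketch.pdf")

pdf(out_file)                        # open the pdf device
plot(1:10, (1:10)^2, col = "green",  # everything drawn now goes to the file
     xlab = "x", ylab = "x squared")
invisible(dev.off())                 # close the device so the file is complete

print(file.exists(out_file))
```

Until dev.off() is called the file may be incomplete, which is why the lab always pairs pdf() or jpeg() with a closing dev.off().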

jpeg ("Figure.jpeg")


x=seq(-pi ,pi ,length =50)
x

##  [1] -3.14159265 -3.01336438 -2.88513611 -2.75690784 -2.62867957
##  [6] -2.50045130 -2.37222302 -2.24399475 -2.11576648 -1.98753821
## [11] -1.85930994 -1.73108167 -1.60285339 -1.47462512 -1.34639685
## [16] -1.21816858 -1.08994031 -0.96171204 -0.83348377 -0.70525549
## [21] -0.57702722 -0.44879895 -0.32057068 -0.19234241 -0.06411414
## [26]  0.06411414  0.19234241  0.32057068  0.44879895  0.57702722
## [31]  0.70525549  0.83348377  0.96171204  1.08994031  1.21816858
## [36]  1.34639685  1.47462512  1.60285339  1.73108167  1.85930994
## [41]  1.98753821  2.11576648  2.24399475  2.37222302  2.50045130
## [46]  2.62867957  2.75690784  2.88513611  3.01336438  3.14159265

y=x
f=outer(x,y,function (x,y)cos(y)/(1+x^2))
contour (x,y,f)
contour (x,y,f,nlevels =45, add=T)
fa=(f-t(f))/2
contour (x,y,fa,nlevels =15)

image(x,y,fa)
persp(x,y,fa)
persp(x,y,fa ,theta =30)
persp(x,y,fa ,theta =30, phi =20)
persp(x,y,fa ,theta =30, phi =70)
persp(x,y,fa ,theta =30, phi =40)
dev.off ()

## png 
##   2
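
It may help to spell out what outer() and fa are in the code above. Under the lab's definitions, outer(x,y,function(x,y) ...) builds a matrix whose (i,j) entry is the function evaluated at x[i] and y[j], and fa=(f-t(f))/2 is the antisymmetric part of f. The sketch below checks both claims numerically; the indices i and j are arbitrary choices for illustration.

```r
x <- seq(-pi, pi, length.out = 50)
y <- x

# f[i, j] = cos(y[j]) / (1 + x[i]^2)
f <- outer(x, y, function(x, y) cos(y) / (1 + x^2))

# Spot-check one entry (arbitrary indices)
i <- 3
j <- 7
print(all.equal(f[i, j], cos(y[j]) / (1 + x[i]^2)))

# fa is antisymmetric: transposing flips the sign, so the diagonal is zero
fa <- (f - t(f)) / 2
print(all.equal(fa, -t(fa)))
print(all(diag(fa) == 0))
```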

Indexing Data

We often wish to examine part of a set of data. Suppose that our data is stored in the matrix A. The examples below show different ways of selecting:

* the element in the second row and the third column.
* multiple rows and columns at a time, by providing vectors as the indices.
* all columns of a given set of rows, or all rows of a given set of columns (R treats a single row or column of a matrix as a vector).
* all rows or columns except those specified (a negative sign - in the index does this).

dim() - outputs the number of rows followed by the number of columns of a given matrix.

A=matrix (1:16 ,4 ,4)
A

##      [,1] [,2] [,3] [,4]
## [1,]    1    5    9   13
## [2,]    2    6   10   14
## [3,]    3    7   11   15
## [4,]    4    8   12   16

A[2,3]

## [1] 10

A[c(1,3) ,c(2,4) ]

##      [,1] [,2]
## [1,]    5   13
## [2,]    7   15

A[1:3 ,2:4]

##      [,1] [,2] [,3]
## [1,]    5    9   13
## [2,]    6   10   14
## [3,]    7   11   15

A[1:2 ,]

##      [,1] [,2] [,3] [,4]
## [1,]    1    5    9   13
## [2,]    2    6   10   14

A[ ,1:2]

##      [,1] [,2]
## [1,]    1    5
## [2,]    2    6
## [3,]    3    7
## [4,]    4    8

A[1,]

## [1]  1  5  9 13

A[5]

## [1] 5

A[-c(1,3) ,]

##      [,1] [,2] [,3] [,4]
## [1,]    2    6   10   14
## [2,]    4    8   12   16

A[-c(1,3) ,-c(1,3,4)]

## [1] 6 8

dim(A)

## [1] 4 4
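
Two of the results above are easy to misread: A[5] uses a single index, and A[1,] comes back as a plain vector. The sketch below illustrates both, assuming the same 4 x 4 matrix; drop=FALSE is a standard argument of the [ operator that is not used in the lab itself.

```r
A <- matrix(1:16, 4, 4)

# A single index counts down the columns (column-major order),
# so the 5th element is the first entry of column 2
print(A[5] == A[1, 2])

# Selecting one row drops the matrix structure and returns a vector
row_vec <- A[1, ]
print(is.null(dim(row_vec)))

# drop = FALSE keeps the result as a 1 x 4 matrix
row_mat <- A[1, , drop = FALSE]
print(dim(row_mat))
```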

rm(Auto)

## Warning in rm(Auto): object 'Auto' not found

Loading Data

read.table() - imports a data set into R (reads a text file as a data frame).
write.table() - exports data.
fix() - opens the data in a spreadsheet-like window. However, the window must be closed before further R commands can be entered.
read.csv() - loads a file in csv format.
names() - lists the variable names.

In the calls below, header=T tells R that the first line of the file contains the variable names, and na.strings="?" tells R to treat a question mark as a missing value.

There are various ways to deal with missing data. In this case, only five of the rows contain missing observations, so we choose to use the na.omit() function to simply remove these rows.

#Auto=read.table("Auto.data")
#fix(Auto)

Auto=read.table("D:/Boston College/MS AE Courses/Spring 2018 - Big Data Econometrics/DataSets/Auto.data", header =T,na.strings ="?")
#fix(Auto)

Auto=read.csv("D:/Boston College/MS AE Courses/Spring 2018 - Big Data Econometrics/DataSets/Auto.csv", header =T,na.strings ="?")

#fix(Auto)
dim(Auto)

## [1] 397   9

Auto [1:4 ,]

##   mpg cylinders displacement horsepower weight acceleration year origin
## 1  18         8          307        130   3504         12.0   70      1
## 2  15         8          350        165   3693         11.5   70      1
## 3  18         8          318        150   3436         11.0   70      1
## 4  16         8          304        150   3433         12.0   70      1
##                        name
## 1 chevrolet chevelle malibu
## 2         buick skylark 320
## 3        plymouth satellite
## 4             amc rebel sst

Auto=na.omit(Auto)
dim(Auto)

## [1] 392   9

names(Auto)

## [1] "mpg"          "cylinders"    "displacement" "horsepower"  
## [5] "weight"       "acceleration" "year"         "origin"      
## [9] "name"

str(Auto)

## 'data.frame':    392 obs. of  9 variables:
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : int  8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : int  130 165 150 150 140 198 220 215 225 190 ...
##  $ weight      : int  3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ year        : int  70 70 70 70 70 70 70 70 70 70 ...
##  $ origin      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ name        : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
##  - attr(*, "na.action")=Class 'omit'  Named int [1:5] 33 127 331 337 355
##   .. ..- attr(*, "names")= chr [1:5] "33" "127" "331" "337" ...
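
Since the Auto files live at a local path, here is a self-contained sketch of the same na.strings and na.omit() steps on a tiny made-up data set. The csv text, the column values and the names toy, raw and clean are all invented for illustration; read.csv() reads from a string here through its text argument instead of from a file.

```r
# A tiny made-up csv where "?" marks a missing value
csv_text <- "mpg,horsepower,name
18,130,car a
25,?,car b
22,95,car c
?,110,car d"

# Without na.strings the "?" forces horsepower to be read as text
raw <- read.csv(text = csv_text, header = TRUE)
print(is.numeric(raw$horsepower))

# With na.strings = "?" the column stays numeric and "?" becomes NA
toy <- read.csv(text = csv_text, header = TRUE, na.strings = "?")
print(is.numeric(toy$horsepower))

# Count the rows with at least one missing value, then drop them
print(sum(!complete.cases(toy)))
clean <- na.omit(toy)
print(dim(clean))
```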

Data Descriptives

plot() - produces a scatterplot of two quantitative variables, and a box plot if the variable on the x-axis is categorical.
Simply typing the variable names will produce an error message, because R does not know to look in the Auto data set for those variables. To refer to a variable, we must type the data set and the variable name joined with a $ symbol.
Alternatively, we can use the attach() function in order to tell R to make the variables in this data frame available by name.
as.factor() - converts a quantitative variable into a qualitative one.
hist() - plots a histogram.
pairs() - creates a scatterplot matrix, i.e. a scatterplot for every pair of variables in a given data set. We can also produce a scatterplot matrix for just a subset of the variables.
identify() - used in conjunction with plot(), identify() provides an interactive method for identifying the value of a particular variable for points on a plot.
The numbers printed under the identify() call correspond to the rows of the selected points.
summary() - produces a numerical summary of each variable in a particular data set. For qualitative variables such as name, R will list the number of observations that fall in each category.

Before exiting R, we may want to save a record of all of the commands that we typed in the most recent session; this can be accomplished using the savehistory() function. Next time we enter R, we can load that history using the loadhistory() function.

#plot(cylinders , mpg)

plot(Auto$cylinders , Auto$mpg )

attach (Auto)
plot(cylinders , mpg)

cylinders =as.factor (cylinders )

plot(cylinders , mpg)

plot(cylinders , mpg , col ="red")

plot(cylinders , mpg , col ="red", varwidth =T)

plot(cylinders , mpg , col ="red", varwidth =T,horizontal =T)

plot(cylinders , mpg , col ="red", varwidth =T, xlab="cylinders",ylab="MPG")

hist(mpg)

hist(mpg ,col =2)

hist(mpg ,col =2, breaks =15)

pairs(Auto)

pairs(~mpg + displacement + horsepower + weight + acceleration , Auto)

plot(horsepower ,mpg)
identify(horsepower ,mpg ,name)

## integer(0)

boxplot(Auto$mpg~Auto$cylinders)
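
One thing worth noting about the attach() route used above: cylinders = as.factor(cylinders) creates a new variable in the workspace and leaves Auto$cylinders itself unchanged. A possible alternative, sketched here on a small made-up data frame (cars and its values are hypothetical, not the Auto data), is to convert the column inside the data frame and use with() to refer to its variables by name.

```r
# Small made-up data frame for illustration
cars <- data.frame(
  cylinders = c(4, 4, 6, 8, 8, 8),
  mpg       = c(30, 28, 21, 15, 14, 16)
)

# Convert the column in place so the data frame itself holds the factor
cars$cylinders <- as.factor(cars$cylinders)
print(levels(cars$cylinders))

# For a factor, table() and summary() count observations per category
counts <- table(cars$cylinders)
print(counts)

# with() evaluates an expression using the columns by name, without attach()
med <- with(cars, tapply(mpg, cylinders, median))
print(med)
```

Because the column is now a factor inside the data frame, plot(cars$cylinders, cars$mpg) would draw box plots rather than a scatterplot.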

summary(Auto)

##       mpg          cylinders      displacement     horsepower   
##  Min.   : 9.00   Min.   :3.000   Min.   : 68.0   Min.   : 46.0  
##  1st Qu.:17.00   1st Qu.:4.000   1st Qu.:105.0   1st Qu.: 75.0  
##  Median :22.75   Median :4.000   Median :151.0   Median : 93.5  
##  Mean   :23.45   Mean   :5.472   Mean   :194.4   Mean   :104.5  
##  3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:275.8   3rd Qu.:126.0  
##  Max.   :46.60   Max.   :8.000   Max.   :455.0   Max.   :230.0  
##                                                                 
##      weight      acceleration        year           origin     
##  Min.   :1613   Min.   : 8.00   Min.   :70.00   Min.   :1.000  
##  1st Qu.:2225   1st Qu.:13.78   1st Qu.:73.00   1st Qu.:1.000  
##  Median :2804   Median :15.50   Median :76.00   Median :1.000  
##  Mean   :2978   Mean   :15.54   Mean   :75.98   Mean   :1.577  
##  3rd Qu.:3615   3rd Qu.:17.02   3rd Qu.:79.00   3rd Qu.:2.000  
##  Max.   :5140   Max.   :24.80   Max.   :82.00   Max.   :3.000  
##                                                                
##                  name    
##  amc matador       :  5  
##  ford pinto        :  5  
##  toyota corolla    :  5  
##  amc gremlin       :  4  
##  amc hornet        :  4  
##  chevrolet chevette:  4  
##  (Other)           :365

summary(mpg)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.00   17.00   22.75   23.45   29.00   46.60


Priyanka Gagneja

January 20, 2018
