To create a vector of numbers, we use the function c() (for concatenate):
x <- c(1,2,3,4,5)
x
## [1] 1 2 3 4 5
x = c(1,6,2)
x
## [1] 1 6 2
y = c(1,4,3)
We can tell R to add two sets of numbers together. It will then add the first number from x to the first number from y, and so on. However, x and y should be the same length.
length(x)
## [1] 3
length(y)
## [1] 3
x+y
## [1] 2 10 5
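The same-length requirement matters because R does not stop when the lengths differ; it recycles the shorter vector instead. A minimal sketch of that behaviour, using made-up vectors a, b, and d that are not part of the lab:

```r
a <- c(1, 2, 3, 4)
b <- c(10, 20)        # length 2 divides length 4
d <- c(10, 20, 30)    # length 3 does not divide length 4

# b is silently recycled to 10, 20, 10, 20
a + b
## [1] 11 22 13 24

# d is recycled too, but R warns because 4 is not a multiple of 3;
# the warning is suppressed here only to keep the example quiet
res <- suppressWarnings(a + d)
res
## [1] 11 22 33 14
```

Checking length() first, as above, is a cheap way to avoid results that were recycled by accident.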
The ls() function allows us to look at a list of all of the objects, such as data and functions, that we have saved so far.
ls()
## [1] "x" "y"
The rm() function can be used to delete any objects that we no longer want.
rm(x,y)
ls()
## character(0)
It’s also possible to remove all objects at once:
rm(list=ls())
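Because rm(list=ls()) clears the whole workspace, it can be safer to try ls() and rm() on a separate environment first. The sketch below assumes a hypothetical environment called scratch; both functions accept an environment to work on.

```r
scratch <- new.env()                       # hypothetical scratch environment
assign("u", c(1, 6, 2), envir = scratch)
assign("v", c(1, 4, 3), envir = scratch)

ls(scratch)                                # names of the objects stored in scratch
## [1] "u" "v"

rm("u", envir = scratch)                   # remove a single object
after_one <- ls(scratch)

rm(list = ls(scratch), envir = scratch)    # remove everything that is left
after_all <- ls(scratch)
after_all
## character(0)
```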
The matrix() function can be used to create a matrix of numbers. Before we use the matrix() function, we can learn more about it:
?matrix
The matrix() function takes a number of inputs, but for now we focus on the first three: the data (the entries in the matrix), the number of rows, and the number of columns. First, we create a simple matrix.
x <- matrix(data = c(1,2,3,4), nrow = 2, ncol = 2)
x
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
We could just as well omit typing data=, nrow=, and ncol= in the matrix() command above, because the arguments are then matched by position. That is, we could just type:
x <- matrix(c(1,2,3,4),2,2)
x
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
Alternatively, the byrow=TRUE option can be used to populate the matrix in order of the rows.
matrix(c(1,2,3,4), nrow = 2, ncol = 2, byrow = TRUE)
##      [,1] [,2]
## [1,]    1    2
## [2,]    3    4
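One way to see what byrow=TRUE does is to compare it with the default column-wise fill. The sketch below uses a made-up 2 x 3 example; the names by_row and by_col are illustrative only.

```r
by_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
by_col <- matrix(1:6, nrow = 3, ncol = 2)   # default: fill column by column

by_row
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6

# Filling by row gives the transpose of the column-wise fill
# of the same data with the dimensions swapped
same <- identical(by_row, t(by_col))
same
## [1] TRUE
```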
The sqrt() function returns the square root of each element of a vector or matrix.
sqrt(x)
##          [,1]     [,2]
## [1,] 1.000000 1.732051
## [2,] 1.414214 2.000000
The ^ operator raises each element to the given power. Any powers are possible, including fractional or negative powers.
x^2
##      [,1] [,2]
## [1,]    1    9
## [2,]    4   16
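Fractional and negative powers are also applied element by element. The following sketch, with illustrative names m, half, recip, and inverse, shows that a power of 0.5 matches sqrt() and that a power of -1 gives element-wise reciprocals rather than the matrix inverse.

```r
m <- matrix(c(1, 2, 3, 4), nrow = 2, ncol = 2)

half <- m^0.5        # fractional power: same as sqrt(m)
recip <- m^-1        # negative power: 1 / each element
inverse <- solve(m)  # the actual matrix inverse, shown for contrast

isTRUE(all.equal(half, sqrt(m)))
## [1] TRUE
isTRUE(all.equal(recip, 1 / m))
## [1] TRUE
isTRUE(all.equal(recip, inverse))
## [1] FALSE
```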
The rnorm() function generates a vector of random normal variables, whose first argument n is the sample size. Each time we call this function, we will get a different answer. By default, rnorm() creates standard normal random variables with a mean of 0 and a standard deviation of 1. However, the mean and standard deviation can be altered using the mean and sd arguments. Here we create two correlated sets of numbers, x and y, and use the cor() function to compute the correlation between them.
x <- rnorm(50)
y = x + rnorm(50, mean = 50, sd = .1)
cor(x,y)
## [1] 0.9922666
Sometimes we want our code to reproduce the exact same set of random numbers; we can use the set.seed() function to do this.
set.seed(1303)
rnorm(50)
## [1] -1.1439763145 1.3421293656 2.1853904757 0.5363925179 0.0631929665
## [6] 0.5022344825 -0.0004167247 0.5658198405 -0.5725226890 -1.1102250073
## [11] -0.0486871234 -0.6956562176 0.8289174803 0.2066528551 -0.2356745091
## [16] -0.5563104914 -0.3647543571 0.8623550343 -0.6307715354 0.3136021252
## [21] -0.9314953177 0.8238676185 0.5233707021 0.7069214120 0.4202043256
## [26] -0.2690521547 -1.5103172999 -0.6902124766 -0.1434719524 -1.0135274099
## [31] 1.5732737361 0.0127465055 0.8726470499 0.4220661905 -0.0188157917
## [36] 2.6157489689 -0.6931401748 -0.2663217810 -0.7206364412 1.3677342065
## [41] 0.2640073322 0.6321868074 -1.3306509858 0.0268888182 1.0406363208
## [46] 1.3120237985 -0.0300020767 -0.2500257125 0.0234144857 1.6598706557
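To see the effect of set.seed() directly, we can draw twice from the same seed and once more without resetting it. This is a small sketch; the names first, second, and third are illustrative.

```r
set.seed(1303)
first <- rnorm(5)

set.seed(1303)       # resetting the seed restarts the random number stream
second <- rnorm(5)

third <- rnorm(5)    # no reset: the stream continues, so the values differ

identical(first, second)
## [1] TRUE
identical(first, third)
## [1] FALSE
```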
The mean() and var() functions compute the mean and variance of a vector of numbers. Applying sqrt() to the output of var() will give the standard deviation. Or we can simply use the sd() function.
set.seed(3)
y <- rnorm(100)
mean(y)
## [1] 0.01103557
var(y)
## [1] 0.7328675
The standard deviation can be computed in either of two ways:
sqrt(var(y))
## [1] 0.8560768
sd(y)
## [1] 0.8560768
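For reference, var() computes the sample variance, which divides by n - 1 rather than n. The sketch below recomputes it by hand and compares the result with var() and sd(); the names manual_var and manual_sd are illustrative.

```r
set.seed(3)
y <- rnorm(100)

n <- length(y)
manual_var <- sum((y - mean(y))^2) / (n - 1)   # sample variance, n - 1 denominator
manual_sd <- sqrt(manual_var)

all.equal(manual_var, var(y))
## [1] TRUE
all.equal(manual_sd, sd(y))
## [1] TRUE
```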
The plot() function is the primary way to plot data in R. For instance, plot(x,y) produces a scatterplot of the numbers in x versus the numbers in y. There are many additional options that can be passed in to the plot() function. For example, passing in the argument xlab will result in a label on the x-axis.
x <- rnorm(100)
y <- rnorm(100)
plot(x,y)
plot(x, y, xlab = "this is the x-axis", ylab = "this is the y-axis",
     main = "Plot of X vs Y")
We will often want to save the output of an R plot. The command that we use to do this will depend on the file type that we would like to create. For instance, to create a PDF, we use the pdf() function, and to create a JPEG, we use the jpeg() function.
pdf("Figure.pdf")
plot(x,y, col = "green")
dev.off() #Indicates to R that we are done creating the plot
## quartz_off_screen
## 2
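The open-plot-close pattern can be checked without cluttering the working directory by writing to a temporary file. The sketch below is one possible variant: the path out_file and the width and height (in inches) are illustrative choices, not values from the lab.

```r
set.seed(1)
x <- rnorm(100)
y <- rnorm(100)

out_file <- tempfile(fileext = ".pdf")   # hypothetical output path
pdf(out_file, width = 6, height = 4)     # open the PDF device; size in inches
plot(x, y, col = "green")                # drawn into the file, not on screen
invisible(dev.off())                     # close the device so the file is written

file.exists(out_file)
## [1] TRUE
```

Wrapping dev.off() in invisible() only hides the device name that it would otherwise print.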
The function seq() can be used to create a sequence of numbers. For instance, seq(a,b) makes a vector of integers between a and b. There are many other options: for instance, seq(0,1,length=10) makes a sequence of 10 numbers that are equally spaced between 0 and 1. Typing 3:11 is a shorthand for seq(3,11) for integer arguments.
x <- seq(1,10)
x
## [1] 1 2 3 4 5 6 7 8 9 10
x <- 1:10
x
## [1] 1 2 3 4 5 6 7 8 9 10
x <- seq(-pi,pi,length = 50)
x
## [1] -3.14159265 -3.01336438 -2.88513611 -2.75690784 -2.62867957 -2.50045130
## [7] -2.37222302 -2.24399475 -2.11576648 -1.98753821 -1.85930994 -1.73108167
## [13] -1.60285339 -1.47462512 -1.34639685 -1.21816858 -1.08994031 -0.96171204
## [19] -0.83348377 -0.70525549 -0.57702722 -0.44879895 -0.32057068 -0.19234241
## [25] -0.06411414 0.06411414 0.19234241 0.32057068 0.44879895 0.57702722
## [31] 0.70525549 0.83348377 0.96171204 1.08994031 1.21816858 1.34639685
## [37] 1.47462512 1.60285339 1.73108167 1.85930994 1.98753821 2.11576648
## [43] 2.24399475 2.37222302 2.50045130 2.62867957 2.75690784 2.88513611
## [49] 3.01336438 3.14159265
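The seq() function can fix either the step size (by) or the number of points (length.out); the length= used above is a partial match for length.out. A short sketch with illustrative names:

```r
by_step <- seq(0, 1, by = 0.25)          # fix the step size
by_count <- seq(0, 1, length.out = 5)    # fix the number of points

by_step
## [1] 0.00 0.25 0.50 0.75 1.00

all.equal(by_step, by_count)
## [1] TRUE

identical(3:11, seq(3, 11))              # the colon shorthand gives the same vector
## [1] TRUE
```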
The contour() function produces a contour plot in order to represent three-dimensional data; it is like a topographical map. It takes three arguments:
* A vector of the x values (the first dimension),
* A vector of the y values (the second dimension), and
* A matrix whose elements correspond to the z value (the third dimension) for each pair of (x,y) coordinates.
y <- x
f <- outer(x,y, function(x,y) cos(y)/(1+x^2))
contour(x,y,f)
y <- x
f <- outer(x,y, function(x,y) cos(y)/(1+x^2))
contour(x,y,f)
contour(x,y,f,nlevels = 45,add = TRUE) # add = TRUE draws on top of the existing plot
fa <- (f-t(f))/2
contour(x,y,fa,nlevels=15)
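The z matrix passed to contour() is built by outer(), which evaluates the function at every pair of grid values. The sketch below checks this on a tiny made-up grid (xs, ys, and g are illustrative names) and also confirms that (f-t(f))/2 is antisymmetric, which is what fa holds above.

```r
xs <- c(0, 1, 2)                         # small illustrative grid
ys <- c(0, pi)
g <- function(x, y) cos(y) / (1 + x^2)

z <- outer(xs, ys, g)                    # z[i, j] equals g(xs[i], ys[j])
dim(z)
## [1] 3 2

# Spot check one entry: cos(pi) / (1 + 1^2) = -0.5
entry_ok <- isTRUE(all.equal(z[2, 2], g(xs[2], ys[2])))

# On a square grid, (sq - t(sq)) / 2 equals minus its own transpose
sq <- outer(xs, xs, g)
anti <- (sq - t(sq)) / 2
anti_ok <- isTRUE(all.equal(anti, -t(anti)))
```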
The image() function works the same way as contour(), except that it produces a color-coded plot whose colors depend on the z value. This is known as a heatmap.
image(x,y,fa)
Alternatively, persp() can be used to produce a three-dimensional plot. The arguments theta and phi control the angles at which the plot is viewed.
persp(x,y,fa)
persp(x,y,fa,theta =30)
persp(x,y,fa,theta =30,phi =20)
persp(x,y,fa,theta =30,phi =70)
persp(x,y,fa,theta =30,phi =40)
We often wish to examine part of a set of data. Suppose that our data is stored in the matrix A.
A <- matrix(1:16,4,4)
A
##      [,1] [,2] [,3] [,4]
## [1,]    1    5    9   13
## [2,]    2    6   10   14
## [3,]    3    7   11   15
## [4,]    4    8   12   16
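To select a single element, we give its row index followed by its column index. For example, A[2,3] selects the element in the second row and third column.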
A[2,3]
## [1] 10
We can also select multiple rows and columns at a time, by providing vectors as the indices.
A[c(1,3),c(2,4)]
##      [,1] [,2]
## [1,]    5   13
## [2,]    7   15
A[1:3,2:4]
##      [,1] [,2] [,3]
## [1,]    5    9   13
## [2,]    6   10   14
## [3,]    7   11   15
A[1:2,]
##      [,1] [,2] [,3] [,4]
## [1,]    1    5    9   13
## [2,]    2    6   10   14
A[,1:2]
##      [,1] [,2]
## [1,]    1    5
## [2,]    2    6
## [3,]    3    7
## [4,]    4    8
A[1,]
## [1] 1 5 9 13
The use of a negative sign - in the index tells R to keep all rows or columns except those indicated in the index.
A[-c(1,3),]
##      [,1] [,2] [,3] [,4]
## [1,]    2    6   10   14
## [2,]    4    8   12   16
A[-c(1,3),-c(1,3,4)]
## [1] 6 8
dim(A)
## [1] 4 4
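Note that A[-c(1,3),-c(1,3,4)] above came back as a plain vector, because only one column was left. If the matrix shape should be kept, the drop = FALSE argument of the indexing operator can be used, as in this short sketch (v and m are illustrative names):

```r
A <- matrix(1:16, 4, 4)

v <- A[-c(1, 3), -c(1, 3, 4)]                 # one column left: result is a vector
m <- A[-c(1, 3), -c(1, 3, 4), drop = FALSE]   # keep the 2 x 1 matrix shape

dim(v)
## NULL
dim(m)
## [1] 2 1
```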
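We now read in the Auto data set with read.csv(). The option header=T tells R that the first line of the file contains the variable names, and na.strings="?" tells R to treat any question mark as a missing value. The fix() function then opens the data in a spreadsheet-like editor window.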
#Auto <- read.table("Auto.data")
Auto <- read.csv("~/IntroToStatisticalLearningR-/data/Auto.csv",header=T,na.strings ="?")
fix(Auto)
dim(Auto)
## [1] 397 9
Auto[1:4,]
##   mpg cylinders displacement horsepower weight acceleration year origin
## 1  18         8          307        130   3504         12.0   70      1
## 2  15         8          350        165   3693         11.5   70      1
## 3  18         8          318        150   3436         11.0   70      1
## 4  16         8          304        150   3433         12.0   70      1
##                        name
## 1 chevrolet chevelle malibu
## 2         buick skylark 320
## 3        plymouth satellite
## 4             amc rebel sst
Auto <- na.omit(Auto)
names(Auto)
## [1] "mpg" "cylinders" "displacement" "horsepower" "weight"
## [6] "acceleration" "year" "origin" "name"
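The effect of na.strings and na.omit() can be seen on a tiny example. The sketch below reads a made-up three-row CSV from a string (these are not rows of the real Auto data); the names csv_text, toy, and clean are illustrative.

```r
csv_text <- "mpg,horsepower\n18,130\n25,?\n16,150"   # made-up rows
toy <- read.csv(text = csv_text, header = TRUE, na.strings = "?")

toy$horsepower            # the "?" entry was read as NA, so the column stays numeric
## [1] 130  NA 150

clean <- na.omit(toy)     # drop every row that contains an NA
dim(clean)
## [1] 2 2
```

Without na.strings = "?", the question mark would make R read the whole column as text instead of numbers.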
plot(Auto$cylinders, Auto$mpg)
To avoid using $ to refer to columns within a data frame, we can use attach(), which makes the columns available by name:
attach(Auto)
plot(cylinders, mpg)
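As an alternative that is not used in this lab, with() makes the columns visible for a single expression only, so nothing stays attached afterwards. The sketch below uses the mtcars data set that ships with R, purely as a stand-in for Auto; the name avg_mpg is illustrative.

```r
# mtcars is a built-in data set, used here only as a stand-in for Auto
avg_mpg <- with(mtcars, mean(mpg))   # mpg is looked up inside mtcars

all.equal(avg_mpg, mean(mtcars$mpg))
## [1] TRUE
```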
If a variable is being treated as quantitative but should be treated as qualitative, we can convert it with as.factor():
cylinders <- as.factor(cylinders)
If the variable plotted on the x-axis is categorical, plot() will automatically produce boxplots instead of a scatterplot.
plot(cylinders, mpg)
plot(cylinders, mpg, col = "red")
plot(cylinders, mpg, col = "red", varwidth = TRUE)
plot(cylinders, mpg, col = "red", varwidth = TRUE, horizontal = TRUE)
plot(cylinders, mpg, col = "red", varwidth = TRUE, xlab = "cylinders", ylab = "MPG")
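One point worth keeping in mind: after attach(), an assignment such as cylinders <- as.factor(cylinders) creates a separate variable in the workspace and leaves the data frame itself unchanged. A minimal sketch of converting the column inside the data frame instead, again using the built-in mtcars data as a stand-in for Auto (the name cars is illustrative):

```r
cars <- mtcars                      # copy of a data set that ships with R
cars$cyl <- as.factor(cars$cyl)     # convert the column inside the data frame

is.factor(cars$cyl)
## [1] TRUE
levels(cars$cyl)
## [1] "4" "6" "8"

is.factor(mtcars$cyl)               # the original data set is untouched
## [1] FALSE
```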
hist(mpg)
hist(mpg, col = 2)
hist(mpg, col = 2, breaks = 15)
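When breaks is a single number, hist() treats it as a suggestion for the number of bins rather than an exact count. The sketch below inspects the bins without drawing anything; vals is made-up data and h is an illustrative name.

```r
set.seed(1)
vals <- rnorm(200)                           # made-up data for illustration

h <- hist(vals, breaks = 15, plot = FALSE)   # compute the bins without drawing

n_bins <- length(h$breaks) - 1               # bins actually used; may differ from 15
all_counted <- sum(h$counts) == length(vals) # every observation lands in one bin
all_counted
## [1] TRUE
```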
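The pairs() function creates a scatterplot matrix, i.e. a scatterplot for every pair of variables in a given data set. We can also pass a formula to produce scatterplots for just a subset of the variables.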
pairs(Auto)
pairs(~ mpg + displacement + horsepower + weight + acceleration, Auto)
The identify() function provides an interactive way to label points on a plot. We pass in three arguments to identify(): the x-axis variable, the y-axis variable, and the variable whose values we would like to see printed for each point. Then clicking on a given point in the plot will cause R to print the value of the variable of interest. Right-clicking on the plot will exit the identify() function.
plot(horsepower ,mpg)
identify(horsepower, mpg, name)
## integer(0)
The summary() function produces a numerical summary of each variable in a particular data set.
summary(Auto)
##       mpg          cylinders      displacement     horsepower        weight
##  Min.   : 9.00   Min.   :3.000   Min.   : 68.0   Min.   : 46.0   Min.   :1613
##  1st Qu.:17.00   1st Qu.:4.000   1st Qu.:105.0   1st Qu.: 75.0   1st Qu.:2225
##  Median :22.75   Median :4.000   Median :151.0   Median : 93.5   Median :2804
##  Mean   :23.45   Mean   :5.472   Mean   :194.4   Mean   :104.5   Mean   :2978
##  3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:275.8   3rd Qu.:126.0   3rd Qu.:3615
##  Max.   :46.60   Max.   :8.000   Max.   :455.0   Max.   :230.0   Max.   :5140
##
##   acceleration        year           origin                      name
##  Min.   : 8.00   Min.   :70.00   Min.   :1.000   amc matador       :  5
##  1st Qu.:13.78   1st Qu.:73.00   1st Qu.:1.000   ford pinto        :  5
##  Median :15.50   Median :76.00   Median :1.000   toyota corolla    :  5
##  Mean   :15.54   Mean   :75.98   Mean   :1.577   amc gremlin       :  4
##  3rd Qu.:17.02   3rd Qu.:79.00   3rd Qu.:2.000   amc hornet        :  4
##  Max.   :24.80   Max.   :82.00   Max.   :3.000   chevrolet chevette:  4
##                                                  (Other)           :365
For qualitative variables such as name, R will list the number of observations that fall in each category. We can also produce a summary of just a single variable.
summary(mpg)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    9.00   17.00   22.75   23.45   29.00   46.60
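The two kinds of summary can be compared on small made-up inputs. In the sketch below, v, s, and f are illustrative names; individual entries of a numeric summary can be pulled out by name, and a factor is summarised by its category counts.

```r
v <- c(2, 4, 6, 8, 100)            # made-up numeric data
s <- summary(v)
s[["Median"]]
## [1] 6
s[["Mean"]]
## [1] 24

f <- factor(c("a", "b", "a"))      # made-up qualitative data
counts <- summary(f)               # number of observations per category
counts[["a"]]
## [1] 2
```

The large gap between the median and the mean of v reflects the single outlier at 100, which is the kind of thing a quick summary() makes visible.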