RStudio is an Integrated Development Environment (IDE) for using the R language. It has a lot of useful tools for writing code, managing projects, viewing plots, and more.
RStudio is highly customizable, and allows you to change appearances and general layout without much fuss. For instance, to rearrange the order of panes, you can do the following:
RStudio > Preferences (Mac)
Tools > Options (Windows)
There are a lot of other features to take advantage of in RStudio, and many of them are bound to shortcuts. You can see those by going to Tools > Keyboard Shortcuts Help or pressing option + shift + K (alt + shift + K on Windows). You can also find a handy cheat sheet here.
Projects are an RStudio feature to help keep your code and working environments contained and organized, which comes in handy when you start to have multiple projects. To start a new project, you can click on the dropdown in the upper-righthand corner of RStudio and choose to begin a new project.
Even if you’re not using the RStudio projects feature, it’s still a good idea to keep work for any given project in a single directory (folder). You can make a new folder in Finder or File Explorer. Once you have that, you can set your working directory in R like this:
setwd("PATH/TO/PROJECT")
You can also see your current working directory by using this:
getwd()
You can use R to create variables. Variables can contain all kinds of information, from single values to entire data structures. We create variables using <-, called the assignment operator.
new_int <- 4
new_int
## [1] 4
Variables will behave just like the values they represent. When we run cos() on new_int, it returns the same value as cos(4).
cos(new_int)
## [1] -0.6536436
cos(4)
## [1] -0.6536436
Functions let us run the same piece of code on inputs that change. They can save us a lot of typing: a common rule of thumb says that if you have to copy and paste the same code three times, you should write a function instead. Let’s try writing a simple function to show how this can work.
new_fun <- function(x) {
  my_int <- x
  your_int <- my_int * 2
  cat("My integer is", my_int, "and your integer is", your_int)
}
Now it’s ready to be run!
new_fun(4)
## My integer is 4 and your integer is 8
new_fun(8)
## My integer is 8 and your integer is 16
new_fun(87732)
## My integer is 87732 and your integer is 175464
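Note that new_fun() only prints its message with cat(); it does not hand back a value that other code can reuse. As a minimal sketch (double_int is a hypothetical name, not part of the lesson above), a function can return its result instead:

```r
# Hypothetical helper: returns the doubled value instead of printing it
double_int <- function(x) {
  your_int <- x * 2
  your_int  # the last evaluated expression is the return value
}

result <- double_int(4)
result + 1  # the returned value can be used in further calculations
```

Returning a value rather than printing it is usually the more flexible choice, because the result can be stored in a variable or passed to another function.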
You may have noticed that we have a few new things in our “Environment” pane in RStudio. These variables and functions make up our working environment: the data that R is holding in active memory. This environment isn’t necessarily persistent, so it may not last between R sessions. Often, you will be working on a project over multiple days or on multiple computers, so it’s useful to save that working environment as it exists. You can save (and load) your environment like this:
save.image("environment.RData")
load("environment.RData")
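save.image() writes every object in the environment. When only one object matters, saveRDS() and readRDS() are a lighter alternative. A small sketch, using a temporary file so nothing in your project is overwritten (the object names here are made up for illustration):

```r
scores <- c(90, 85, 77)                  # illustrative object
rds_path <- tempfile(fileext = ".rds")   # throwaway file location

saveRDS(scores, rds_path)        # write a single object to disk
restored <- readRDS(rds_path)    # read it back under any name we like

identical(scores, restored)
```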
There are some functions and datasets built into R already. Let’s explore some of them using a built-in dataset, mtcars.
data(mtcars)
mtcars
We can find out some things about the basic structure of our data.
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
dim(mtcars)
## [1] 32 11
nrow(mtcars)
## [1] 32
ncol(mtcars)
## [1] 11
summary(mtcars)
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
## drat wt qsec vs
## Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
## Median :3.695 Median :3.325 Median :17.71 Median :0.0000
## Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
## Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
## am gear carb
## Min. :0.0000 Min. :3.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
## Median :0.0000 Median :4.000 Median :2.000
## Mean :0.4062 Mean :3.688 Mean :2.812
## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :1.0000 Max. :5.000 Max. :8.000
We can use specific parts of the data, too, such as the mpg variable. Then we can find out more about that with some built-in functions.
mtcars$mpg
## [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2
## [15] 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4
## [29] 15.8 19.7 15.0 21.4
length(mtcars$mpg)
## [1] 32
mean(mtcars$mpg)
## [1] 20.09062
median(mtcars$mpg)
## [1] 19.2
prod(mtcars$mpg)
## [1] 1.264241e+41
sum(mtcars$mpg)
## [1] 642.9
sqrt(mtcars$mpg)
## [1] 4.582576 4.582576 4.774935 4.626013 4.324350 4.254409 3.781534
## [8] 4.939636 4.774935 4.381780 4.219005 4.049691 4.159327 3.898718
## [15] 3.224903 3.224903 3.834058 5.692100 5.513620 5.822371 4.636809
## [22] 3.937004 3.898718 3.646917 4.381780 5.224940 5.099020 5.513620
## [29] 3.974921 4.438468 3.872983 4.626013
var(mtcars$mpg)
## [1] 36.3241
You can use ?function_name or help(function_name) to view a help page, and ??function_name to search all help pages.
?var
??var
help(var)
People in different disciplines may write functions or create datasets that will be useful to others in their field. They can bundle these functions and datasets into packages that can be downloaded and used by others. There is a simple way to install packages with R:
install.packages("tidyverse")
If you want to install multiple packages at once, you can do that too.
install.packages(c("igraph", "sna"))
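Installing a package that is already present just repeats the download. One possible pattern, sketched here with an illustrative package list, is to check which packages are missing first. The install call itself is left commented out so the snippet does not touch the network:

```r
# Packages we would like to have; "stats" ships with R
wanted <- c("stats", "tidyverse")

# requireNamespace() returns TRUE when a package can be loaded
installed <- vapply(wanted, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- wanted[!installed]
missing_pkgs

# Uncomment to install whatever is missing:
# install.packages(missing_pkgs)
```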
Vectors are one of the basic data structures in R. In short, they are groups of values of a single data type. We can make them with the c() function.
coat <- c("calico", "tortoiseshell", "tabby")
weight <- c(2.1, 5.0, 3.2)
likes_string <- c(TRUE, FALSE, TRUE)
When we say they are of a single data type, we are referring to the “atomic” data types in R. Let’s see the five you are most likely to meet:
typeof(coat)
## [1] "character"
typeof(weight)
## [1] "double"
typeof(likes_string)
## [1] "logical"
typeof(1 + 1i)
## [1] "complex"
typeof(1L)
## [1] "integer"
Atomic vectors must be made of exactly one of these five data types (or of a rarely used sixth type, raw). Let’s see what happens when we try to mix them up.
test <- c(0, 2, 4)
typeof(test)
## [1] "double"
test <- c("0", "2", "4")
typeof(test)
## [1] "character"
test <- c(0, 2, "4")
typeof(test)
## [1] "character"
When we tried to mix numeric and character data types, the entire test vector became a character vector. This is called type coercion. Type coercion follows this pattern: Logical -> Integer -> Double (numeric) -> Complex -> Character
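Each step of that coercion chain can be checked directly. The short sketch below combines neighbouring types with c() and asks typeof() what the result is:

```r
typeof(c(TRUE, 1L))   # logical + integer   -> "integer"
typeof(c(1L, 2.5))    # integer + double    -> "double"
typeof(c(2.5, 1i))    # double + complex    -> "complex"
typeof(c(1i, "a"))    # complex + character -> "character"
```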
We can force vectors to go in the opposite direction, but this sometimes doesn’t work. Other times, it produces unexpected behaviors.
as.numeric(likes_string)
## [1] 1 0 1
as.numeric(test)
## [1] 0 2 4
as.logical(test)
## [1] NA NA NA
as.logical(as.numeric(test))
## [1] FALSE TRUE TRUE
Notice that test had to be made into a numeric vector before it could be made into a logical vector. Also notice that it was converted to FALSE, TRUE, TRUE. That’s because any number other than 0 defaults to TRUE when it is forced into a logical format.
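When a value cannot be converted, R fills in NA rather than stopping. A small illustration with made-up values (the warning that as.numeric() normally issues is suppressed here to keep the output short):

```r
mixed <- c("1", "two", "3")

# "two" cannot be read as a number, so it becomes NA
converted <- suppressWarnings(as.numeric(mixed))
converted

# as.logical() understands strings such as "TRUE" and "false", but not "yes"
as.logical(c("TRUE", "false", "yes"))
```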
We can add to existing vectors with c(). Because test is a character vector, the new value is coerced to a character string as well.
test <- c(test, 8)
test
## [1] "0" "2" "4" "8"
We can create series of numbers easily using the colon operator, :
1:10
## [1] 1 2 3 4 5 6 7 8 9 10
10:1
## [1] 10 9 8 7 6 5 4 3 2 1
We can also create sequences of numbers using functions like rep() and seq().
rep(8, 80)
## [1] 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
## [36] 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
## [71] 8 8 8 8 8 8 8 8 8 8
seq(1, 10, by = 0.1)
## [1] 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0 2.1 2.2 2.3
## [15] 2.4 2.5 2.6 2.7 2.8 2.9 3.0 3.1 3.2 3.3 3.4 3.5 3.6 3.7
## [29] 3.8 3.9 4.0 4.1 4.2 4.3 4.4 4.5 4.6 4.7 4.8 4.9 5.0 5.1
## [43] 5.2 5.3 5.4 5.5 5.6 5.7 5.8 5.9 6.0 6.1 6.2 6.3 6.4 6.5
## [57] 6.6 6.7 6.8 6.9 7.0 7.1 7.2 7.3 7.4 7.5 7.6 7.7 7.8 7.9
## [71] 8.0 8.1 8.2 8.3 8.4 8.5 8.6 8.7 8.8 8.9 9.0 9.1 9.2 9.3
## [85] 9.4 9.5 9.6 9.7 9.8 9.9 10.0
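Both functions take further arguments that are worth knowing. The sketch below shows a few common ones; the values are arbitrary:

```r
rep(c("a", "b"), times = 3)   # repeat the whole vector: a b a b a b
rep(c("a", "b"), each = 3)    # repeat each element:     a a a b b b
seq(0, 1, length.out = 5)     # five evenly spaced values from 0 to 1
seq_len(4)                    # 1 2 3 4
```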
Vectors are interesting (and powerful) because we can perform vectorized operations on the entire structure at once.
seq_example <- seq(1, 10, by = 0.1)
seq_example * 2
## [1] 2.0 2.2 2.4 2.6 2.8 3.0 3.2 3.4 3.6 3.8 4.0 4.2 4.4 4.6
## [15] 4.8 5.0 5.2 5.4 5.6 5.8 6.0 6.2 6.4 6.6 6.8 7.0 7.2 7.4
## [29] 7.6 7.8 8.0 8.2 8.4 8.6 8.8 9.0 9.2 9.4 9.6 9.8 10.0 10.2
## [43] 10.4 10.6 10.8 11.0 11.2 11.4 11.6 11.8 12.0 12.2 12.4 12.6 12.8 13.0
## [57] 13.2 13.4 13.6 13.8 14.0 14.2 14.4 14.6 14.8 15.0 15.2 15.4 15.6 15.8
## [71] 16.0 16.2 16.4 16.6 16.8 17.0 17.2 17.4 17.6 17.8 18.0 18.2 18.4 18.6
## [85] 18.8 19.0 19.2 19.4 19.6 19.8 20.0
seq_example - 1
## [1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5 1.6
## [18] 1.7 1.8 1.9 2.0 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9 3.0 3.1 3.2 3.3
## [35] 3.4 3.5 3.6 3.7 3.8 3.9 4.0 4.1 4.2 4.3 4.4 4.5 4.6 4.7 4.8 4.9 5.0
## [52] 5.1 5.2 5.3 5.4 5.5 5.6 5.7 5.8 5.9 6.0 6.1 6.2 6.3 6.4 6.5 6.6 6.7
## [69] 6.8 6.9 7.0 7.1 7.2 7.3 7.4 7.5 7.6 7.7 7.8 7.9 8.0 8.1 8.2 8.3 8.4
## [86] 8.5 8.6 8.7 8.8 8.9 9.0
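Vectorized operations also work between two vectors, element by element. A brief sketch with made-up values, including what happens when the lengths differ:

```r
a <- c(1, 2, 3, 4)
b <- c(10, 20, 30, 40)

a + b             # element by element: 11 22 33 44
a * b             # 10 40 90 160

# A shorter vector is recycled to match the longer one
a + c(100, 200)   # 101 202 103 204
```

Recycling is convenient, but it can hide mistakes when the two vectors were meant to have the same length.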
Matrixes are vectors with two dimensions: rows and columns. (Structures with more than two dimensions are called arrays.) Like vectors, they need to be all of a single data type, and we can perform operations on the entire structure.
m <- matrix(1:100, nrow = 10, ncol = 10)
m
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 1 11 21 31 41 51 61 71 81 91
## [2,] 2 12 22 32 42 52 62 72 82 92
## [3,] 3 13 23 33 43 53 63 73 83 93
## [4,] 4 14 24 34 44 54 64 74 84 94
## [5,] 5 15 25 35 45 55 65 75 85 95
## [6,] 6 16 26 36 46 56 66 76 86 96
## [7,] 7 17 27 37 47 57 67 77 87 97
## [8,] 8 18 28 38 48 58 68 78 88 98
## [9,] 9 19 29 39 49 59 69 79 89 99
## [10,] 10 20 30 40 50 60 70 80 90 100
m * 2
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 2 22 42 62 82 102 122 142 162 182
## [2,] 4 24 44 64 84 104 124 144 164 184
## [3,] 6 26 46 66 86 106 126 146 166 186
## [4,] 8 28 48 68 88 108 128 148 168 188
## [5,] 10 30 50 70 90 110 130 150 170 190
## [6,] 12 32 52 72 92 112 132 152 172 192
## [7,] 14 34 54 74 94 114 134 154 174 194
## [8,] 16 36 56 76 96 116 136 156 176 196
## [9,] 18 38 58 78 98 118 138 158 178 198
## [10,] 20 40 60 80 100 120 140 160 180 200
By default, matrix() fills by column, but we can force it to fill by row.
m2 <- matrix(1:100, nrow = 10, ncol = 10, byrow = TRUE)
m2
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 1 2 3 4 5 6 7 8 9 10
## [2,] 11 12 13 14 15 16 17 18 19 20
## [3,] 21 22 23 24 25 26 27 28 29 30
## [4,] 31 32 33 34 35 36 37 38 39 40
## [5,] 41 42 43 44 45 46 47 48 49 50
## [6,] 51 52 53 54 55 56 57 58 59 60
## [7,] 61 62 63 64 65 66 67 68 69 70
## [8,] 71 72 73 74 75 76 77 78 79 80
## [9,] 81 82 83 84 85 86 87 88 89 90
## [10,] 91 92 93 94 95 96 97 98 99 100
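On a smaller matrix the relationship between the two fill orders is easier to see. In this sketch (the object names are illustrative), transposing a column-filled matrix with t() gives the same result as filling by row:

```r
by_col <- matrix(1:6, nrow = 2)                # fills down columns
by_row <- matrix(1:6, nrow = 3, byrow = TRUE)  # fills across rows

dim(by_col)                    # 2 rows, 3 columns
identical(t(by_col), by_row)   # transposing the first gives the second
```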
Data frames are probably the most common data structure used by R programmers. They are a rectangular data format, and under the hood, they are typically lists of equal-length vectors. Let’s make one with some of the vectors we made earlier.
cats <- data.frame(coat, weight, likes_string)
cats
Data frames are very easy to write out to local files, like a csv, and very easy to read in from a csv.
write.csv(cats, "./data/cats.csv")
cats <- read.csv("./data/cats.csv")
We can take a look at some of the individual variables using $ as a selector.
cats$weight
## [1] 2.1 5.0 3.2
cats$coat
## [1] calico tortoiseshell tabby
## Levels: calico tabby tortoiseshell
Let’s also take a look at the overall structure of the data.
dim(cats)
## [1] 3 4
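dim() reports four columns rather than three because write.csv() stores row names by default, and read.csv() reads them back as an extra column named X. A minimal sketch of avoiding that, using a temporary file and a made-up data frame called pets:

```r
pets <- data.frame(coat = c("calico", "tabby"), weight = c(2.1, 3.2))
csv_path <- tempfile(fileext = ".csv")

write.csv(pets, csv_path, row.names = FALSE)  # skip the row-name column
pets_again <- read.csv(csv_path)

dim(pets_again)     # same shape as pets: 2 rows, 2 columns
names(pets_again)
```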
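In versions of R before 4.0.0, read.csv() converted strings to factors by default, which is why coat is shown as a factor above. Factors are another data structure we won’t be talking about today. They are useful for categorical variables, but they can be tricky to use. It’s often easier to keep such columns as simple character vectors when we read in data, which we can request explicitly: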
cats <- read.csv("./data/cats.csv", stringsAsFactors = FALSE)
We can still perform vectorized operations with the vectors within a data frame. R will stop with an error when an operation doesn’t make sense, however (such as when we try to add a number to a character string).
cats$weight + 2
## [1] 4.1 7.0 5.2
paste("My cat is", cats$coat)
## [1] "My cat is calico" "My cat is tortoiseshell"
## [3] "My cat is tabby"
cats$weight + cats$coat
## Error in cats$weight + cats$coat: non-numeric argument to binary operator
We’ve already dealt with lists, because data frames are a special kind of list. Regular lists are very flexible, and can contain all kinds of data and data structures. Lists can also be hierarchical (lists of lists), allowing for more complex data structures to exist in R.
l <- list(vec = 1:100,
          mat = matrix(rnorm(100), ncol = 10, nrow = 10),
          string = "hello there",
          df = mtcars)
l
## $vec
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
## [18] 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
## [35] 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51
## [52] 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68
## [69] 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85
## [86] 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
##
## $mat
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.575588924 0.74519188 -0.4848754 0.1391231 0.59832946
## [2,] -0.923604354 2.00808102 -0.5031293 -0.2036388 0.59768599
## [3,] 1.511388177 -0.57797740 -0.1669421 -0.4956906 -0.06778529
## [4,] 0.392417153 0.20389733 0.3094613 -2.3713088 -0.98997156
## [5,] -0.397300972 -1.11493599 0.6721045 -1.2796279 -0.65519626
## [6,] -0.058858268 0.32262185 -0.7223474 1.4647199 0.74440140
## [7,] 2.014444817 -0.74660385 1.0927833 -0.3606918 0.78803904
## [8,] -0.430927027 -0.82017654 1.4842737 1.2202376 -0.60802090
## [9,] 0.636391131 -0.03128202 -1.4893464 -0.7897873 0.86556613
## [10,] -0.009490206 0.64082276 0.1812507 0.4429181 -0.78985218
## [,6] [,7] [,8] [,9] [,10]
## [1,] -2.303274813 0.3322283 -0.4777402 0.4685432 -0.16472839
## [2,] 0.008986901 0.5059366 0.2922390 1.0071837 -0.36041851
## [3,] 0.938740503 1.0031328 1.9672819 -0.3881994 1.12834773
## [4,] -1.342101768 -2.0495210 0.6181051 -0.3480759 -0.06612591
## [5,] 0.502619988 -1.2447343 -0.7863848 0.7319942 -2.37559243
## [6,] -0.487657540 0.8306734 0.5107479 0.4565645 -1.43070524
## [7,] -0.890396014 -0.9886024 1.1963613 0.1637245 0.65048655
## [8,] 1.935472400 0.4704066 -1.6089219 1.3782464 -0.85741649
## [9,] -2.322661747 0.1474557 -0.1928753 1.9395275 -1.00434968
## [10,] 0.831285911 0.9282722 -0.9248248 -0.4861869 1.01932721
##
## $string
## [1] "hello there"
##
## $df
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
We can access the elements of a list using the $ like we did with data frames, but we can also use square brackets [] to do so. Notice the difference between single and double brackets.
l$vec
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
## [18] 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
## [35] 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51
## [52] 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68
## [69] 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85
## [86] 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
l["vec"]
## $vec
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
## [18] 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
## [35] 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51
## [52] 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68
## [69] 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85
## [86] 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
l[["vec"]]
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
## [18] 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
## [35] 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51
## [52] 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68
## [69] 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85
## [86] 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
l["vec"] * 2
## Error in l["vec"] * 2: non-numeric argument to binary operator
l[["vec"]] * 2
## [1] 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34
## [18] 36 38 40 42 44 46 48 50 52 54 56 58 60 62 64 66 68
## [35] 70 72 74 76 78 80 82 84 86 88 90 92 94 96 98 100 102
## [52] 104 106 108 110 112 114 116 118 120 122 124 126 128 130 132 134 136
## [69] 138 140 142 144 146 148 150 152 154 156 158 160 162 164 166 168 170
## [86] 172 174 176 178 180 182 184 186 188 190 192 194 196 198 200
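The error above comes from what each kind of bracket returns. A compact sketch on a smaller, made-up list:

```r
example_list <- list(vec = 1:3, string = "hello there")

class(example_list["vec"])     # "list": a smaller list containing vec
class(example_list[["vec"]])   # "integer": the vector itself

# Single brackets can keep several elements at once
names(example_list[c("vec", "string")])
```

Single brackets always return a list, which is why arithmetic on l["vec"] fails, while double brackets pull out the element itself.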
We can even use the $ selector to drill down into the hierarchy.
l$df
l$df$mpg
## [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2
## [15] 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4
## [29] 15.8 19.7 15.0 21.4
Just to prove a point, let’s take a look at the type of our cats data frame.
typeof(cats)
## [1] "list"
There are different ways to subset data structures based on the type of structure. We’ll look at vectors, matrixes, and data frames.
Let’s use the seq_example we made before.
seq_example
## [1] 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0 2.1 2.2 2.3
## [15] 2.4 2.5 2.6 2.7 2.8 2.9 3.0 3.1 3.2 3.3 3.4 3.5 3.6 3.7
## [29] 3.8 3.9 4.0 4.1 4.2 4.3 4.4 4.5 4.6 4.7 4.8 4.9 5.0 5.1
## [43] 5.2 5.3 5.4 5.5 5.6 5.7 5.8 5.9 6.0 6.1 6.2 6.3 6.4 6.5
## [57] 6.6 6.7 6.8 6.9 7.0 7.1 7.2 7.3 7.4 7.5 7.6 7.7 7.8 7.9
## [71] 8.0 8.1 8.2 8.3 8.4 8.5 8.6 8.7 8.8 8.9 9.0 9.1 9.2 9.3
## [85] 9.4 9.5 9.6 9.7 9.8 9.9 10.0
We can use square brackets to get just the first ten elements, like this:
seq_example[1:10]
## [1] 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9
Or, we can get the elements that match conditions we set up, like this:
seq_example[seq_example < 4]
## [1] 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0 2.1 2.2 2.3 2.4 2.5 2.6
## [18] 2.7 2.8 2.9 3.0 3.1 3.2 3.3 3.4 3.5 3.6 3.7 3.8 3.9
seq_example[seq_example <= 4]
## [1] 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0 2.1 2.2 2.3 2.4 2.5 2.6
## [18] 2.7 2.8 2.9 3.0 3.1 3.2 3.3 3.4 3.5 3.6 3.7 3.8 3.9 4.0
seq_example[seq_example != 3]
## [1] 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0 2.1 2.2 2.3
## [15] 2.4 2.5 2.6 2.7 2.8 2.9 3.1 3.2 3.3 3.4 3.5 3.6 3.7 3.8
## [29] 3.9 4.0 4.1 4.2 4.3 4.4 4.5 4.6 4.7 4.8 4.9 5.0 5.1 5.2
## [43] 5.3 5.4 5.5 5.6 5.7 5.8 5.9 6.0 6.1 6.2 6.3 6.4 6.5 6.6
## [57] 6.7 6.8 6.9 7.0 7.1 7.2 7.3 7.4 7.5 7.6 7.7 7.8 7.9 8.0
## [71] 8.1 8.2 8.3 8.4 8.5 8.6 8.7 8.8 8.9 9.0 9.1 9.2 9.3 9.4
## [85] 9.5 9.6 9.7 9.8 9.9 10.0
We can add multiple conditions, using & for “and”, and | for “or”.
seq_example[seq_example < 4 & seq_example > 2]
## [1] 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9 3.0 3.1 3.2 3.3 3.4 3.5 3.6 3.7
## [18] 3.8 3.9
seq_example[seq_example < 4 | seq_example > 8]
## [1] 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0 2.1 2.2 2.3
## [15] 2.4 2.5 2.6 2.7 2.8 2.9 3.0 3.1 3.2 3.3 3.4 3.5 3.6 3.7
## [29] 3.8 3.9 8.1 8.2 8.3 8.4 8.5 8.6 8.7 8.8 8.9 9.0 9.1 9.2
## [43] 9.3 9.4 9.5 9.6 9.7 9.8 9.9 10.0
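One caveat when using == or != on decimal numbers: some values cannot be stored exactly, so an exact comparison can miss a value that looks equal when printed. The comparisons above behave as expected, but the following sketch (with made-up values) shows the problem and one common workaround, comparing within a small tolerance:

```r
x <- c(0.1 + 0.2, 0.3)

x == 0.3                 # FALSE TRUE: 0.1 + 0.2 is not stored as exactly 0.3
abs(x - 0.3) < 1e-8      # TRUE TRUE: compare within a small tolerance
x[abs(x - 0.3) < 1e-8]   # both elements are kept
```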
Matrixes are just multi-dimensional vectors, so we can use much of the same notation to subset them. We can identify elements by element number, column, and/or row.
m
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 1 11 21 31 41 51 61 71 81 91
## [2,] 2 12 22 32 42 52 62 72 82 92
## [3,] 3 13 23 33 43 53 63 73 83 93
## [4,] 4 14 24 34 44 54 64 74 84 94
## [5,] 5 15 25 35 45 55 65 75 85 95
## [6,] 6 16 26 36 46 56 66 76 86 96
## [7,] 7 17 27 37 47 57 67 77 87 97
## [8,] 8 18 28 38 48 58 68 78 88 98
## [9,] 9 19 29 39 49 59 69 79 89 99
## [10,] 10 20 30 40 50 60 70 80 90 100
To get the 87th element:
m[87]
## [1] 87
Specifying columns and rows:
m[1,1]
## [1] 1
m[1,]
## [1] 1 11 21 31 41 51 61 71 81 91
m[,1]
## [1] 1 2 3 4 5 6 7 8 9 10
m[1:3, 5:7]
## [,1] [,2] [,3]
## [1,] 41 51 61
## [2,] 42 52 62
## [3,] 43 53 63
We can also use conditions like we do with vectors:
m[m >= 45]
## [1] 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61
## [18] 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78
## [35] 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
## [52] 96 97 98 99 100
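A condition returns the matching values but not where they sit in the matrix. If the positions matter, which() with arr.ind = TRUE reports the row and column of each match. A small sketch on a made-up 2 x 3 matrix:

```r
small <- matrix(1:6, nrow = 2)

small[small >= 5]                          # values only: 5 6
pos <- which(small >= 5, arr.ind = TRUE)   # row and column of each match
pos
```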
Square brackets work for data frames too.
mtcars
mtcars[1]
mtcars[1:2]
mtcars[1,]
mtcars[,1]
## [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2
## [15] 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4
## [29] 15.8 19.7 15.0 21.4
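These forms look similar but return different kinds of objects. The sketch below uses class() to show which ones give back a data frame and which give back a plain vector:

```r
class(mtcars[1])                   # "data.frame": one-column data frame
class(mtcars[[1]])                 # "numeric": the column as a vector
class(mtcars[, 1])                 # "numeric": same as above
class(mtcars[, 1, drop = FALSE])   # "data.frame": keep the data frame shape
mtcars[1:3, c("mpg", "hp")]        # rows 1-3 of two columns, chosen by name
```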
If we were interested in finding cars with mpg > 20, we could do so in several ways. Here’s one:
mtcars$mpg > 20
## [1] TRUE TRUE TRUE TRUE FALSE FALSE FALSE TRUE TRUE FALSE FALSE
## [12] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE FALSE
## [23] FALSE FALSE FALSE TRUE TRUE TRUE FALSE FALSE FALSE TRUE
which(mtcars$mpg > 20)
## [1] 1 2 3 4 8 9 18 19 20 21 26 27 28 32
which() gives us the indexes of the vector that match our condition. We can use that with square bracket notation to extract a subset of our data.
mtcars_efficient <- mtcars[which(mtcars$mpg > 20),]
mtcars_efficient
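Two other common ways to get the same rows are sketched below: passing the logical vector straight to the brackets, and using subset(). One difference to keep in mind is that which() silently drops NA results, while a logical index containing NA produces rows of NA. mtcars$mpg has no missing values, so all three agree here.

```r
by_which   <- mtcars[which(mtcars$mpg > 20), ]
by_logical <- mtcars[mtcars$mpg > 20, ]    # logical vector used directly
by_subset  <- subset(mtcars, mpg > 20)     # column names without mtcars$

nrow(by_which)                       # 14 cars
all.equal(by_which, by_logical)
all.equal(by_which, by_subset)
```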
Or, we could use a function like ifelse() to add a new column to our existing data frame using $.
mtcars$efficient <- ifelse(mtcars$mpg > 20, TRUE, FALSE)
mtcars
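Since the comparison mtcars$mpg > 20 already produces TRUE and FALSE, ifelse() is most useful when the two outcomes should be something else, such as text labels. A short sketch (the label text is made up):

```r
is_efficient <- mtcars$mpg > 20   # already TRUE/FALSE, no ifelse() needed
label <- ifelse(mtcars$mpg > 20, "efficient", "thirsty")

table(label)   # how many cars fall under each label
```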
We can recode some of our data using square brackets and the assignment operator. Let’s use our matrix from before to experiment.
m[1:3, 1:2] <- 8000
m
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 8000 8000 21 31 41 51 61 71 81 91
## [2,] 8000 8000 22 32 42 52 62 72 82 92
## [3,] 8000 8000 23 33 43 53 63 73 83 93
## [4,] 4 14 24 34 44 54 64 74 84 94
## [5,] 5 15 25 35 45 55 65 75 85 95
## [6,] 6 16 26 36 46 56 66 76 86 96
## [7,] 7 17 27 37 47 57 67 77 87 97
## [8,] 8 18 28 38 48 58 68 78 88 98
## [9,] 9 19 29 39 49 59 69 79 89 99
## [10,] 10 20 30 40 50 60 70 80 90 100
m[m > 90] <- NA
m
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] NA NA 21 31 41 51 61 71 81 NA
## [2,] NA NA 22 32 42 52 62 72 82 NA
## [3,] NA NA 23 33 43 53 63 73 83 NA
## [4,] 4 14 24 34 44 54 64 74 84 NA
## [5,] 5 15 25 35 45 55 65 75 85 NA
## [6,] 6 16 26 36 46 56 66 76 86 NA
## [7,] 7 17 27 37 47 57 67 77 87 NA
## [8,] 8 18 28 38 48 58 68 78 88 NA
## [9,] 9 19 29 39 49 59 69 79 89 NA
## [10,] 10 20 30 40 50 60 70 80 90 NA
If we want to recode a specific value, we can do that too.
m[m == 31] <- 85467
m
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] NA NA 21 85467 41 51 61 71 81 NA
## [2,] NA NA 22 32 42 52 62 72 82 NA
## [3,] NA NA 23 33 43 53 63 73 83 NA
## [4,] 4 14 24 34 44 54 64 74 84 NA
## [5,] 5 15 25 35 45 55 65 75 85 NA
## [6,] 6 16 26 36 46 56 66 76 86 NA
## [7,] 7 17 27 37 47 57 67 77 87 NA
## [8,] 8 18 28 38 48 58 68 78 88 NA
## [9,] 9 19 29 39 49 59 69 79 89 NA
## [10,] 10 20 30 40 50 60 70 80 90 NA
== doesn’t work for NA values, though, because any comparison with NA returns NA rather than TRUE or FALSE. Instead, there’s a special function called is.na().
m[m == NA] <- 0
m
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] NA NA 21 85467 41 51 61 71 81 NA
## [2,] NA NA 22 32 42 52 62 72 82 NA
## [3,] NA NA 23 33 43 53 63 73 83 NA
## [4,] 4 14 24 34 44 54 64 74 84 NA
## [5,] 5 15 25 35 45 55 65 75 85 NA
## [6,] 6 16 26 36 46 56 66 76 86 NA
## [7,] 7 17 27 37 47 57 67 77 87 NA
## [8,] 8 18 28 38 48 58 68 78 88 NA
## [9,] 9 19 29 39 49 59 69 79 89 NA
## [10,] 10 20 30 40 50 60 70 80 90 NA
m[is.na(m)] <- 0
m
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 0 0 21 85467 41 51 61 71 81 0
## [2,] 0 0 22 32 42 52 62 72 82 0
## [3,] 0 0 23 33 43 53 63 73 83 0
## [4,] 4 14 24 34 44 54 64 74 84 0
## [5,] 5 15 25 35 45 55 65 75 85 0
## [6,] 6 16 26 36 46 56 66 76 86 0
## [7,] 7 17 27 37 47 57 67 77 87 0
## [8,] 8 18 28 38 48 58 68 78 88 0
## [9,] 9 19 29 39 49 59 69 79 89 0
## [10,] 10 20 30 40 50 60 70 80 90 0
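The same behaviour can be seen on a plain vector. This sketch, with made-up values, shows why == NA never selects anything and how is.na() and na.rm help:

```r
NA == NA                # NA, not TRUE
x <- c(4, NA, 7, NA)

x == NA                 # NA NA NA NA: no usable TRUE/FALSE values
is.na(x)                # FALSE TRUE FALSE TRUE
sum(is.na(x))           # 2 missing values
mean(x, na.rm = TRUE)   # 5.5, ignoring the missing values
```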
R has a plotting system built right in that is useful for some basic plots, such as a scatter plot.
plot(main = "MTCARS", x = mtcars$mpg, y = mtcars$hp,
     col = ifelse(mtcars$efficient, "blue", "red"))
legend("topright", title = "Efficient", legend = c(TRUE, FALSE),
       col = c("blue", "red"), pch = 1)
There are packages that have more extensive plotting capabilities, such as ggplot2, which has become a standard plotting package in the past few years.
library(ggplot2)
ggplot(mtcars, aes(x = mpg, y = hp)) +
  geom_point(aes(color = efficient)) +
  labs(title = "MTCARS")
There are many other packages that are used for more specialized graphics, such as network graphs.
For now, let’s clean up our working environment. We can do that with rm()
rm(cats)
If we want to clean the environment entirely, we can do so like this:
rm(list = ls())
Let’s practice some with network data.
library(tidyr)
library(sna)
## Loading required package: statnet.common
##
## Attaching package: 'statnet.common'
## The following object is masked from 'package:base':
##
## order
## Loading required package: network
## network: Classes for Relational Data
## Version 1.13.0 created on 2015-08-31.
## copyright (c) 2005, Carter T. Butts, University of California-Irvine
## Mark S. Handcock, University of California -- Los Angeles
## David R. Hunter, Penn State University
## Martina Morris, University of Washington
## Skye Bender-deMoll, University of Washington
## For citation information, type citation("network").
## Type help("network-package") to get started.
## sna: Tools for Social Network Analysis
## Version 2.4 created on 2016-07-23.
## copyright (c) 2005, Carter T. Butts, University of California-Irvine
## For citation information, type citation("sna").
## Type help(package="sna") to get started.
net <- read.csv("./data/friendship.csv")
If we’d like to turn this data frame into an adjacency matrix (not necessarily a matrix like we discussed before), we can do so with a function called spread() from the tidyr package. (In recent versions of tidyr, spread() still works but has been superseded by pivot_wider().)
net_matrix <- spread(data = net, key = V2, value = V3)
net_matrix
We’ll have to remove the first column, though, which we can do like this:
net_matrix <- net_matrix[,-1]
net_matrix
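For comparison, the same reshaping can be sketched in base R without tidyr. The edge list below is hypothetical; it only mimics the three-column shape (V1, V2, V3) used above, since the contents of friendship.csv are not shown here. Unlike spread(), this approach produces a true matrix.

```r
# Hypothetical edge list in the same three-column shape (from, to, tie)
edges <- data.frame(
  V1 = c("ann", "ann", "bob", "cat"),
  V2 = c("bob", "cat", "cat", "ann"),
  V3 = c(1, 1, 1, 1),
  stringsAsFactors = FALSE
)

people <- sort(unique(c(edges$V1, edges$V2)))
adj <- matrix(0, nrow = length(people), ncol = length(people),
              dimnames = list(people, people))

# Index by (row name, column name) pairs to fill in the ties
adj[cbind(edges$V1, edges$V2)] <- edges$V3
adj
```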
Now we’ll use the gplot() function from the sna package to plot the network this matrix describes.
gplot(net_matrix, displaylabels = TRUE)
If we want to gather this back into a three-column data frame, we can do so with the gather() function from tidyr. (In recent versions of tidyr, gather() still works but has been superseded by pivot_longer().) First we’ll make a copy as a new variable.
net_tidy <- net_matrix
net_tidy
Then we’ll add our row names back as a new column.
net_tidy$V1 <- rownames(net_tidy)
net_tidy
Now we’ll gather the wide data into a long format.
net_tidy <- gather(net_tidy, key = V2, value = V3, 1:21)
net_tidy
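When the adjacency data is held in a true matrix with row and column names, base R offers a similar wide-to-long conversion. A minimal sketch on a made-up 2 x 2 matrix, naming the columns V1, V2 and V3 to match the data above:

```r
adj <- matrix(c(0, 1, 1, 0), nrow = 2,
              dimnames = list(c("ann", "bob"), c("ann", "bob")))

# as.table() followed by as.data.frame() gives one row per matrix cell
long <- as.data.frame(as.table(adj), stringsAsFactors = FALSE)
names(long) <- c("V1", "V2", "V3")
long
```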
Much of this is inspired by and borrowed from lessons by Software Carpentry.