Using RStudio

RStudio is an Integrated Development Environment (IDE) for using the R language. It has a lot of useful tools for writing code, managing projects, viewing plots, and more.

RStudio is highly customizable and lets you change its appearance and general layout without much fuss. For instance, to rearrange the order of panes, open the options dialog and look for the Pane Layout section:

  • RStudio > Preferences (Mac)
  • Tools > Options (Windows)

There are a lot of other features to take advantage of in RStudio, and many of them are bound to keyboard shortcuts. You can see those by going to Tools > Keyboard Shortcuts Help or pressing option + shift + K (Alt + Shift + K on Windows). A handy cheat sheet for the RStudio IDE is also available.

Projects and Working Directories

Projects are an RStudio feature that helps keep your code and working environments contained and organized, which comes in handy once you have multiple projects. To start a new project, you can click on the dropdown in the upper right-hand corner of RStudio and choose to begin a new project.

Even if you’re not using the RStudio projects feature, it’s still a good idea to keep work for any given project in a single directory (folder). You can make a new folder in Finder or File Explorer. Once you have that, you can set your working directory in R like this:

setwd("PATH/TO/PROJECT")

You can also see your current working directory by using this:

getwd()
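
As a small sketch of how these two functions fit together, the example below creates a throwaway folder, switches into it, and then switches back. The folder name my_project and the use of tempdir() are assumptions made only so the example is safe to run anywhere; in practice you would point at your own project folder.

```r
# Hypothetical project folder inside R's temporary directory
project_dir <- file.path(tempdir(), "my_project")
dir.create(project_dir, showWarnings = FALSE)

# setwd() invisibly returns the previous working directory, so we can keep it
old_dir <- setwd(project_dir)
basename(getwd())

# Switch back to where we started
setwd(old_dir)
```

Building the path with file.path() avoids hard-coding slashes, which differ between operating systems.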

R Basics

Creating Variables

You can use R to create variables. Variables can contain all kinds of information, from single values to entire data structures. We create variables using <-, called the assignment operator.

new_int <- 4 
new_int
## [1] 4

Variables will behave just like the values they represent. When we run cos() on new_int, it returns the same value as cos(4).

cos(new_int) 
## [1] -0.6536436
cos(4)
## [1] -0.6536436

Functions

Functions let us run the same piece of code on inputs that change. They can save us a lot of typing: a useful rule of thumb says that if you have to copy and paste the same code three times, you should write a function instead. Let’s try writing a simple function to show how this can work.

new_fun <- function(x) { 
  my_int <- x 
  your_int <- my_int * 2 
  cat("My integer is", my_int, "and your integer is", your_int)
}

Now it’s ready to be run!

new_fun(4)
## My integer is 4 and your integer is 8
new_fun(8)
## My integer is 8 and your integer is 16
new_fun(87732)
## My integer is 87732 and your integer is 175464
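
The function above only prints text. Often we want a function to hand a value back so that it can be stored or used in further calculations. A minimal sketch, using a hypothetical helper called double_it():

```r
# Hypothetical helper: returns the doubled value instead of printing it.
# The value of the last expression in the body is what the function returns.
double_it <- function(x) {
  x * 2
}

result <- double_it(4)
result + 1
## [1] 9

# Because * is vectorized, the function also works on a whole vector
double_it(c(1, 2, 3))
## [1] 2 4 6
```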

Working Environments

You may have noticed that we have a few new things in our “Environment” pane in RStudio. These variables and functions comprise our working environment, data that R has held in our active memory. This environment isn’t necessarily persistent, so it doesn’t last between R sessions. Often, you will be working on a project over multiple days or on multiple computers, so it’s useful to save that working environment as it exists. You can save (and load) your environment like this:

save.image("environment.RData")

load("environment.RData")
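
If you only need to keep one object rather than the whole environment, saveRDS() and readRDS() are an alternative. The sketch below is an illustration under assumptions: the scores vector is made-up data, and the file is written to a temporary location so nothing is left behind.

```r
scores <- c(90, 85, 77)                  # made-up example data
rds_path <- tempfile(fileext = ".rds")   # temporary file path

saveRDS(scores, rds_path)   # write a single object to disk
rm(scores)                  # remove it from the environment

# readRDS() returns the object, so we choose the name it comes back under
restored <- readRDS(rds_path)
restored
## [1] 90 85 77
```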

Exploring data

There are some functions and datasets built into R already. Let’s explore a bit using a built-in dataset, mtcars.

data(mtcars)
mtcars

We can find out some things about the basic structure of our data.

str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
dim(mtcars)
## [1] 32 11
nrow(mtcars)
## [1] 32
ncol(mtcars)
## [1] 11
summary(mtcars)
##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000

We can use specific parts of the data, too, such as the mpg variable. Then we can find out more about that with some built-in functions.

mtcars$mpg
##  [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2
## [15] 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4
## [29] 15.8 19.7 15.0 21.4
length(mtcars$mpg)
## [1] 32
mean(mtcars$mpg)
## [1] 20.09062
median(mtcars$mpg)
## [1] 19.2
prod(mtcars$mpg)
## [1] 1.264241e+41
sum(mtcars$mpg)
## [1] 642.9
sqrt(mtcars$mpg)
##  [1] 4.582576 4.582576 4.774935 4.626013 4.324350 4.254409 3.781534
##  [8] 4.939636 4.774935 4.381780 4.219005 4.049691 4.159327 3.898718
## [15] 3.224903 3.224903 3.834058 5.692100 5.513620 5.822371 4.636809
## [22] 3.937004 3.898718 3.646917 4.381780 5.224940 5.099020 5.513620
## [29] 3.974921 4.438468 3.872983 4.626013
var(mtcars$mpg)
## [1] 36.3241

Finding Help

You can use ?function_name or help(function_name) to view a help page, and ??function_name to search all help pages.

?var
??var
help(var)

Packages

People in different disciplines may write functions or create datasets that will be useful to others in their field. They can bundle these functions and datasets into packages that can be downloaded and used by others. There is a simple way to install packages with R:

install.packages("tidyverse")

If you want to install multiple packages at once, you can do that too.

install.packages(c("igraph", "sna"))
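
Installing a package only needs to happen once per computer, while loading it with library() happens in every session. One way to avoid reinstalling is to check first. The helper is_installed() below is a hypothetical name used for illustration; requireNamespace() is the base R function doing the work.

```r
# Hypothetical helper: TRUE if the package can be found on this computer
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

is_installed("stats")   # stats ships with every R installation
## [1] TRUE

# Install only when the package is missing, then load it for this session
# (commented out so the sketch does not download anything when run)
# if (!is_installed("tidyverse")) install.packages("tidyverse")
# library(tidyverse)
```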

Data Structures and Types [1]

Vectors

Vectors are one of the basic data structures in R. In short, they are groups of values of a single data type. We can make them with the c() function.

coat <- c("calico", "tortoiseshell", "tabby")

weight <- c(2.1, 5.0, 3.2)

likes_string <- c(TRUE, FALSE, TRUE)

Data Types

When we say they are of a single data type, we are referring to the “atomic” data types in R. There are five you will meet regularly (a sixth, raw, is rarely needed). Let’s see those:

typeof(coat)
## [1] "character"
typeof(weight)
## [1] "double"
typeof(likes_string)
## [1] "logical"
typeof(1 + 1i)
## [1] "complex"
typeof(1L)
## [1] "integer"

Vectors must be made of one of these five data types. Let’s see what happens when we try to mix them up.

test <- c(0, 2, 4)
typeof(test)
## [1] "double"
test <- c("0", "2", "4")
typeof(test)
## [1] "character"
test <- c(0, 2, "4")
typeof(test)
## [1] "character"

When we tried to mix numeric and character data types, the entire test vector became a character vector. This is called type coercion. When types are mixed, R converts everything to the type furthest along this sequence: Logical -> Integer -> Double (numeric) -> Complex -> Character.

We can force vectors to go in the opposite direction, but this sometimes doesn’t work. Other times, it produces unexpected behaviors.

as.numeric(likes_string)
## [1] 1 0 1
as.numeric(test)
## [1] 0 2 4
as.logical(test)
## [1] NA NA NA
as.logical(as.numeric(test))
## [1] FALSE  TRUE  TRUE

Notice that test had to be made into a numeric vector before it could be made into a logical vector. Also notice that it was converted to FALSE, TRUE, TRUE. That’s because any number other than 0 defaults to TRUE when it is forced into a logical format.
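
Coercion also happens quietly inside arithmetic, which can be handy. The short sketch below re-creates likes_string so it can be run on its own and shows logical values being treated as 1 and 0.

```r
likes_string <- c(TRUE, FALSE, TRUE)

sum(likes_string)     # TRUE counts as 1, FALSE as 0
## [1] 2
mean(likes_string)    # proportion of TRUE values
## [1] 0.6666667

# Mixing types moves the result along the coercion sequence
typeof(c(TRUE, 2L))   # logical and integer give integer
## [1] "integer"
typeof(c(2L, 3.5))    # integer and double give double
## [1] "double"
```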

Other Vector Behaviors

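We can add to existing vectors with c(). Because test is currently a character vector, the new value is coerced to a character string as well.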

test <- c(test, 8)
test
## [1] "0" "2" "4" "8"

We can create series of numbers easily using a :

1:10
##  [1]  1  2  3  4  5  6  7  8  9 10
10:1
##  [1] 10  9  8  7  6  5  4  3  2  1

We can also create sequences of numbers using functions like rep() and seq().

rep(8, 80)
##  [1] 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
## [36] 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
## [71] 8 8 8 8 8 8 8 8 8 8
seq(1, 10, by = 0.1)
##  [1]  1.0  1.1  1.2  1.3  1.4  1.5  1.6  1.7  1.8  1.9  2.0  2.1  2.2  2.3
## [15]  2.4  2.5  2.6  2.7  2.8  2.9  3.0  3.1  3.2  3.3  3.4  3.5  3.6  3.7
## [29]  3.8  3.9  4.0  4.1  4.2  4.3  4.4  4.5  4.6  4.7  4.8  4.9  5.0  5.1
## [43]  5.2  5.3  5.4  5.5  5.6  5.7  5.8  5.9  6.0  6.1  6.2  6.3  6.4  6.5
## [57]  6.6  6.7  6.8  6.9  7.0  7.1  7.2  7.3  7.4  7.5  7.6  7.7  7.8  7.9
## [71]  8.0  8.1  8.2  8.3  8.4  8.5  8.6  8.7  8.8  8.9  9.0  9.1  9.2  9.3
## [85]  9.4  9.5  9.6  9.7  9.8  9.9 10.0

Vectors are interesting (and powerful) because we can perform vectorized operations on the entire structure at once.

seq_example <- seq(1, 10, by = 0.1)
seq_example * 2
##  [1]  2.0  2.2  2.4  2.6  2.8  3.0  3.2  3.4  3.6  3.8  4.0  4.2  4.4  4.6
## [15]  4.8  5.0  5.2  5.4  5.6  5.8  6.0  6.2  6.4  6.6  6.8  7.0  7.2  7.4
## [29]  7.6  7.8  8.0  8.2  8.4  8.6  8.8  9.0  9.2  9.4  9.6  9.8 10.0 10.2
## [43] 10.4 10.6 10.8 11.0 11.2 11.4 11.6 11.8 12.0 12.2 12.4 12.6 12.8 13.0
## [57] 13.2 13.4 13.6 13.8 14.0 14.2 14.4 14.6 14.8 15.0 15.2 15.4 15.6 15.8
## [71] 16.0 16.2 16.4 16.6 16.8 17.0 17.2 17.4 17.6 17.8 18.0 18.2 18.4 18.6
## [85] 18.8 19.0 19.2 19.4 19.6 19.8 20.0
seq_example - 1
##  [1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5 1.6
## [18] 1.7 1.8 1.9 2.0 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9 3.0 3.1 3.2 3.3
## [35] 3.4 3.5 3.6 3.7 3.8 3.9 4.0 4.1 4.2 4.3 4.4 4.5 4.6 4.7 4.8 4.9 5.0
## [52] 5.1 5.2 5.3 5.4 5.5 5.6 5.7 5.8 5.9 6.0 6.1 6.2 6.3 6.4 6.5 6.6 6.7
## [69] 6.8 6.9 7.0 7.1 7.2 7.3 7.4 7.5 7.6 7.7 7.8 7.9 8.0 8.1 8.2 8.3 8.4
## [86] 8.5 8.6 8.7 8.8 8.9 9.0
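
Vectorized operations also work between two vectors, element by element. The sketch below uses two small made-up vectors, a and b, and shows what happens when the lengths differ.

```r
a <- c(1, 2, 3, 4)
b <- c(10, 20, 30, 40)

a + b           # element-wise addition
## [1] 11 22 33 44
a * b           # element-wise multiplication
## [1]  10  40  90 160

# When one vector is shorter, R recycles it: here 0, 100, 0, 100
a + c(0, 100)
## [1]   1 102   3 104
```

Recycling is convenient, but it can hide mistakes when the lengths differ by accident, so it is worth checking length() when results look odd.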

Matrixes

Matrixes are vectors with two dimensions: rows and columns. Like vectors, they need to be all of a single data type, and we can perform operations on the entire structure.

m <- matrix(1:100, nrow = 10, ncol = 10)
m
##       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
##  [1,]    1   11   21   31   41   51   61   71   81    91
##  [2,]    2   12   22   32   42   52   62   72   82    92
##  [3,]    3   13   23   33   43   53   63   73   83    93
##  [4,]    4   14   24   34   44   54   64   74   84    94
##  [5,]    5   15   25   35   45   55   65   75   85    95
##  [6,]    6   16   26   36   46   56   66   76   86    96
##  [7,]    7   17   27   37   47   57   67   77   87    97
##  [8,]    8   18   28   38   48   58   68   78   88    98
##  [9,]    9   19   29   39   49   59   69   79   89    99
## [10,]   10   20   30   40   50   60   70   80   90   100
m * 2
##       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
##  [1,]    2   22   42   62   82  102  122  142  162   182
##  [2,]    4   24   44   64   84  104  124  144  164   184
##  [3,]    6   26   46   66   86  106  126  146  166   186
##  [4,]    8   28   48   68   88  108  128  148  168   188
##  [5,]   10   30   50   70   90  110  130  150  170   190
##  [6,]   12   32   52   72   92  112  132  152  172   192
##  [7,]   14   34   54   74   94  114  134  154  174   194
##  [8,]   16   36   56   76   96  116  136  156  176   196
##  [9,]   18   38   58   78   98  118  138  158  178   198
## [10,]   20   40   60   80  100  120  140  160  180   200

By default, matrix() fills by column, but we can ask it to fill by row with byrow = TRUE.

m2 <- matrix(1:100, nrow = 10, ncol = 10, byrow = TRUE)
m2
##       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
##  [1,]    1    2    3    4    5    6    7    8    9    10
##  [2,]   11   12   13   14   15   16   17   18   19    20
##  [3,]   21   22   23   24   25   26   27   28   29    30
##  [4,]   31   32   33   34   35   36   37   38   39    40
##  [5,]   41   42   43   44   45   46   47   48   49    50
##  [6,]   51   52   53   54   55   56   57   58   59    60
##  [7,]   61   62   63   64   65   66   67   68   69    70
##  [8,]   71   72   73   74   75   76   77   78   79    80
##  [9,]   81   82   83   84   85   86   87   88   89    90
## [10,]   91   92   93   94   95   96   97   98   99   100
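
The relationship between the two fill orders can be checked on a smaller example. In this sketch, by_col and by_row are made-up names; t() is the base R function that transposes a matrix (swaps rows and columns).

```r
by_col <- matrix(1:6, nrow = 2, ncol = 3)                # filled by column
by_row <- matrix(1:6, nrow = 3, ncol = 2, byrow = TRUE)  # filled by row

dim(by_col)
## [1] 2 3

# Transposing the column-filled matrix gives the row-filled one
t(by_col)
##      [,1] [,2]
## [1,]    1    2
## [2,]    3    4
## [3,]    5    6
all(t(by_col) == by_row)
## [1] TRUE
```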

Data Frames

Data frames are probably the most common data structure used by R programmers. They are a rectangular data format, and under the hood, they are typically lists of equal-length vectors. Let’s make one with some of the vectors we made earlier.

cats <- data.frame(coat, weight, likes_string)

cats

Data frames are very easy to write out to local files, such as a CSV, and very easy to read back in. (The data folder used below must already exist in your working directory.)

write.csv(cats, "./data/cats.csv")

cats <- read.csv("./data/cats.csv")

We can take a look at some of the individual variables using $ as a selector.

cats$weight
## [1] 2.1 5.0 3.2
cats$coat
## [1] calico        tortoiseshell tabby        
## Levels: calico tabby tortoiseshell

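Let’s also take a look at the overall structure of the data. Notice that there are four columns rather than three: write.csv() saved the row names by default, and read.csv() read them back in as an extra column named X.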

dim(cats)
## [1] 3 4

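Right now, coat is a factor, another data structure we won’t be talking about today. (This is the default in R versions before 4.0.0; from R 4.0.0 on, read.csv() keeps text columns as character vectors unless you ask for factors.) Factors are useful for categorical variables, but they can be tricky to use. It’s often easier to keep them as simple character vectors when we read in data.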

cats <- read.csv("./data/cats.csv", stringsAsFactors = FALSE)
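
A full round trip through a CSV file can be sketched as follows, with row.names = FALSE so the row names are not written out as an extra column. The data frame cats_demo and the temporary file path are assumptions made so the example runs without a data folder.

```r
# Made-up copy of the cats data so the example is self-contained
cats_demo <- data.frame(coat = c("calico", "tortoiseshell", "tabby"),
                        weight = c(2.1, 5.0, 3.2),
                        likes_string = c(TRUE, FALSE, TRUE))
csv_path <- tempfile(fileext = ".csv")   # temporary file path

write.csv(cats_demo, csv_path, row.names = FALSE)
cats_back <- read.csv(csv_path, stringsAsFactors = FALSE)

dim(cats_back)           # three columns, no extra row-name column
## [1] 3 3
typeof(cats_back$coat)   # text stays as character
## [1] "character"
```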

We can still perform vectorized operations with the vectors within a data frame. R will stop with an error when this won’t work, however (such as when we try to add a number to a character string).

cats$weight + 2
## [1] 4.1 7.0 5.2
paste("My cat is", cats$coat)
## [1] "My cat is calico"        "My cat is tortoiseshell"
## [3] "My cat is tabby"
cats$weight + cats$coat
## Error in cats$weight + cats$coat: non-numeric argument to binary operator

Lists

We’ve already dealt with lists, because data frames are a special kind of list. Regular lists are very flexible, and can contain all kinds of data and data structures. Lists can also be hierarchical (lists of lists), allowing for more complex data structures to exist in R.

l <- list(vec = 1:100, 
          mat = matrix(rnorm(100), ncol = 10, nrow = 10), 
          string = "hello there",
          df = mtcars)
l
## $vec
##   [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
##  [18]  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34
##  [35]  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51
##  [52]  52  53  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68
##  [69]  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84  85
##  [86]  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100
## 
## $mat
##               [,1]        [,2]       [,3]       [,4]        [,5]
##  [1,]  0.575588924  0.74519188 -0.4848754  0.1391231  0.59832946
##  [2,] -0.923604354  2.00808102 -0.5031293 -0.2036388  0.59768599
##  [3,]  1.511388177 -0.57797740 -0.1669421 -0.4956906 -0.06778529
##  [4,]  0.392417153  0.20389733  0.3094613 -2.3713088 -0.98997156
##  [5,] -0.397300972 -1.11493599  0.6721045 -1.2796279 -0.65519626
##  [6,] -0.058858268  0.32262185 -0.7223474  1.4647199  0.74440140
##  [7,]  2.014444817 -0.74660385  1.0927833 -0.3606918  0.78803904
##  [8,] -0.430927027 -0.82017654  1.4842737  1.2202376 -0.60802090
##  [9,]  0.636391131 -0.03128202 -1.4893464 -0.7897873  0.86556613
## [10,] -0.009490206  0.64082276  0.1812507  0.4429181 -0.78985218
##               [,6]       [,7]       [,8]       [,9]       [,10]
##  [1,] -2.303274813  0.3322283 -0.4777402  0.4685432 -0.16472839
##  [2,]  0.008986901  0.5059366  0.2922390  1.0071837 -0.36041851
##  [3,]  0.938740503  1.0031328  1.9672819 -0.3881994  1.12834773
##  [4,] -1.342101768 -2.0495210  0.6181051 -0.3480759 -0.06612591
##  [5,]  0.502619988 -1.2447343 -0.7863848  0.7319942 -2.37559243
##  [6,] -0.487657540  0.8306734  0.5107479  0.4565645 -1.43070524
##  [7,] -0.890396014 -0.9886024  1.1963613  0.1637245  0.65048655
##  [8,]  1.935472400  0.4704066 -1.6089219  1.3782464 -0.85741649
##  [9,] -2.322661747  0.1474557 -0.1928753  1.9395275 -1.00434968
## [10,]  0.831285911  0.9282722 -0.9248248 -0.4861869  1.01932721
## 
## $string
## [1] "hello there"
## 
## $df
##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

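We can access the elements of a list using the $ like we did with data frames, but we can also use square brackets [] to do so. Notice the difference between single and double brackets: single brackets return a smaller list that still contains the element, while double brackets return the element itself.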

l$vec
##   [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
##  [18]  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34
##  [35]  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51
##  [52]  52  53  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68
##  [69]  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84  85
##  [86]  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100
l["vec"]
## $vec
##   [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
##  [18]  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34
##  [35]  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51
##  [52]  52  53  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68
##  [69]  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84  85
##  [86]  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100
l[["vec"]]
##   [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
##  [18]  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34
##  [35]  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51
##  [52]  52  53  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68
##  [69]  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84  85
##  [86]  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100
l["vec"] * 2
## Error in l["vec"] * 2: non-numeric argument to binary operator
l[["vec"]] * 2
##   [1]   2   4   6   8  10  12  14  16  18  20  22  24  26  28  30  32  34
##  [18]  36  38  40  42  44  46  48  50  52  54  56  58  60  62  64  66  68
##  [35]  70  72  74  76  78  80  82  84  86  88  90  92  94  96  98 100 102
##  [52] 104 106 108 110 112 114 116 118 120 122 124 126 128 130 132 134 136
##  [69] 138 140 142 144 146 148 150 152 154 156 158 160 162 164 166 168 170
##  [86] 172 174 176 178 180 182 184 186 188 190 192 194 196 198 200

We can even use the $ selector to drill down into the hierarchy.

l$df
l$df$mpg
##  [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2
## [15] 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4
## [29] 15.8 19.7 15.0 21.4

Just to prove a point, let’s take a look at the type of our cats data frame.

typeof(cats)
## [1] "list"
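
Because a data frame is a list of columns, functions that work on lists work on data frames too. The sketch below uses a made-up data frame, cats_demo, so it can be run on its own.

```r
cats_demo <- data.frame(coat = c("calico", "tortoiseshell", "tabby"),
                        weight = c(2.1, 5.0, 3.2),
                        likes_string = c(TRUE, FALSE, TRUE),
                        stringsAsFactors = FALSE)

is.list(cats_demo)    # a data frame is a list under the hood
## [1] TRUE
class(cats_demo)      # with a class attribute that makes it a data frame
## [1] "data.frame"
length(cats_demo)     # list length is the number of columns
## [1] 3

# sapply() visits each list element (column) and reports its type
col_types <- sapply(cats_demo, typeof)
col_types
```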

Subsetting Data

There are different ways to subset data structures based on the type of structure. We’ll look at vectors, matrixes, and data frames.

Vectors

Let’s use the seq_example we made before.

seq_example
##  [1]  1.0  1.1  1.2  1.3  1.4  1.5  1.6  1.7  1.8  1.9  2.0  2.1  2.2  2.3
## [15]  2.4  2.5  2.6  2.7  2.8  2.9  3.0  3.1  3.2  3.3  3.4  3.5  3.6  3.7
## [29]  3.8  3.9  4.0  4.1  4.2  4.3  4.4  4.5  4.6  4.7  4.8  4.9  5.0  5.1
## [43]  5.2  5.3  5.4  5.5  5.6  5.7  5.8  5.9  6.0  6.1  6.2  6.3  6.4  6.5
## [57]  6.6  6.7  6.8  6.9  7.0  7.1  7.2  7.3  7.4  7.5  7.6  7.7  7.8  7.9
## [71]  8.0  8.1  8.2  8.3  8.4  8.5  8.6  8.7  8.8  8.9  9.0  9.1  9.2  9.3
## [85]  9.4  9.5  9.6  9.7  9.8  9.9 10.0

We can use square brackets to get just the first ten elements, like this:

seq_example[1:10]
##  [1] 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9

Or, we can get the elements that match conditions we set up, like this:

seq_example[seq_example < 4]
##  [1] 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0 2.1 2.2 2.3 2.4 2.5 2.6
## [18] 2.7 2.8 2.9 3.0 3.1 3.2 3.3 3.4 3.5 3.6 3.7 3.8 3.9
seq_example[seq_example <= 4]
##  [1] 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0 2.1 2.2 2.3 2.4 2.5 2.6
## [18] 2.7 2.8 2.9 3.0 3.1 3.2 3.3 3.4 3.5 3.6 3.7 3.8 3.9 4.0
seq_example[seq_example != 3]
##  [1]  1.0  1.1  1.2  1.3  1.4  1.5  1.6  1.7  1.8  1.9  2.0  2.1  2.2  2.3
## [15]  2.4  2.5  2.6  2.7  2.8  2.9  3.1  3.2  3.3  3.4  3.5  3.6  3.7  3.8
## [29]  3.9  4.0  4.1  4.2  4.3  4.4  4.5  4.6  4.7  4.8  4.9  5.0  5.1  5.2
## [43]  5.3  5.4  5.5  5.6  5.7  5.8  5.9  6.0  6.1  6.2  6.3  6.4  6.5  6.6
## [57]  6.7  6.8  6.9  7.0  7.1  7.2  7.3  7.4  7.5  7.6  7.7  7.8  7.9  8.0
## [71]  8.1  8.2  8.3  8.4  8.5  8.6  8.7  8.8  8.9  9.0  9.1  9.2  9.3  9.4
## [85]  9.5  9.6  9.7  9.8  9.9 10.0

We can combine multiple conditions, using & for “and” and | for “or”.

seq_example[seq_example < 4 & seq_example > 2]
##  [1] 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9 3.0 3.1 3.2 3.3 3.4 3.5 3.6 3.7
## [18] 3.8 3.9
seq_example[seq_example < 4 | seq_example > 8]
##  [1]  1.0  1.1  1.2  1.3  1.4  1.5  1.6  1.7  1.8  1.9  2.0  2.1  2.2  2.3
## [15]  2.4  2.5  2.6  2.7  2.8  2.9  3.0  3.1  3.2  3.3  3.4  3.5  3.6  3.7
## [29]  3.8  3.9  8.1  8.2  8.3  8.4  8.5  8.6  8.7  8.8  8.9  9.0  9.1  9.2
## [43]  9.3  9.4  9.5  9.6  9.7  9.8  9.9 10.0
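
Two more subsetting patterns are worth knowing: negative indexes, which drop elements, and %in%, which tests membership in a set of values. A short sketch on a made-up vector x:

```r
x <- c(10, 20, 30, 40, 50)

x[-1]                  # everything except the first element
## [1] 20 30 40 50
x[-(1:2)]              # everything except the first two elements
## [1] 30 40 50
x[x %in% c(20, 40)]    # elements whose value is 20 or 40
## [1] 20 40
x[!(x > 20)]           # ! negates a condition
## [1] 10 20
```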

Matrixes

Matrixes are just multi-dimensional vectors, so we can use much of the same notation to subset them. We can identify elements by element number, column, and/or row.

m
##       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
##  [1,]    1   11   21   31   41   51   61   71   81    91
##  [2,]    2   12   22   32   42   52   62   72   82    92
##  [3,]    3   13   23   33   43   53   63   73   83    93
##  [4,]    4   14   24   34   44   54   64   74   84    94
##  [5,]    5   15   25   35   45   55   65   75   85    95
##  [6,]    6   16   26   36   46   56   66   76   86    96
##  [7,]    7   17   27   37   47   57   67   77   87    97
##  [8,]    8   18   28   38   48   58   68   78   88    98
##  [9,]    9   19   29   39   49   59   69   79   89    99
## [10,]   10   20   30   40   50   60   70   80   90   100

To get the 87th element:

m[87]
## [1] 87

Specifying columns and rows:

m[1,1]
## [1] 1
m[1,]
##  [1]  1 11 21 31 41 51 61 71 81 91
m[,1]
##  [1]  1  2  3  4  5  6  7  8  9 10
m[1:3, 5:7]
##      [,1] [,2] [,3]
## [1,]   41   51   61
## [2,]   42   52   62
## [3,]   43   53   63

We can also use conditions like we do with vectors:

m[m >= 45]
##  [1]  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61
## [18]  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78
## [35]  79  80  81  82  83  84  85  86  87  88  89  90  91  92  93  94  95
## [52]  96  97  98  99 100
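
Notice that selecting a single row or column, or subsetting by a condition, returns a plain vector rather than a matrix. If you need to keep the matrix shape, the drop = FALSE argument of [ does that. A sketch on a small made-up matrix, small_m:

```r
small_m <- matrix(1:12, nrow = 3, ncol = 4)

row_vec <- small_m[1, ]                 # dimensions are dropped
is.matrix(row_vec)
## [1] FALSE

row_mat <- small_m[1, , drop = FALSE]   # dimensions are kept
is.matrix(row_mat)
## [1] TRUE
dim(row_mat)
## [1] 1 4
```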

Data Frames

Square brackets work for data frames too. A single index selects columns, while the [row, column] form works as it does for matrixes.

mtcars
mtcars[1]
mtcars[1:2]
mtcars[1,]
mtcars[,1]
##  [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2
## [15] 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4
## [29] 15.8 19.7 15.0 21.4

If we were interested in finding cars with mpg > 20, we could do so in several ways. Here’s one:

mtcars$mpg > 20
##  [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE
## [12] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE
## [23] FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE
which(mtcars$mpg > 20)
##  [1]  1  2  3  4  8  9 18 19 20 21 26 27 28 32

which() gives us the indexes of the elements that match our condition. We can use that with square bracket notation to extract a subset of our data.

mtcars_efficient <- mtcars[which(mtcars$mpg > 20),]
mtcars_efficient

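Or, we could use a function like ifelse() to add a new column to our existing data frame using $. (The comparison mtcars$mpg > 20 already returns TRUE and FALSE on its own; ifelse() becomes more useful when you want other labels, such as "high" and "low".)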

mtcars$efficient <- ifelse(mtcars$mpg > 20, TRUE, FALSE)
mtcars
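
Rows and columns can also be selected in one step by putting a logical condition in the row position and column names in the column position. The sketch below works on a fresh copy of the built-in data, stored under the made-up name car_data, and also shows subset(), a base R function that does the same job.

```r
car_data <- datasets::mtcars   # fresh copy of the built-in dataset

# Rows where mpg > 20, keeping only the mpg and hp columns
efficient_cars <- car_data[car_data$mpg > 20, c("mpg", "hp")]
nrow(efficient_cars)
## [1] 14

# subset() expresses the same selection without repeating the data frame name
efficient_cars2 <- subset(car_data, mpg > 20, select = c(mpg, hp))
nrow(efficient_cars2)
## [1] 14
```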

Recoding Data

We can recode some of our data using square brackets and the assignment operator. Let’s use our matrix from before to experiment.

m[1:3, 1:2] <- 8000
m
##       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
##  [1,] 8000 8000   21   31   41   51   61   71   81    91
##  [2,] 8000 8000   22   32   42   52   62   72   82    92
##  [3,] 8000 8000   23   33   43   53   63   73   83    93
##  [4,]    4   14   24   34   44   54   64   74   84    94
##  [5,]    5   15   25   35   45   55   65   75   85    95
##  [6,]    6   16   26   36   46   56   66   76   86    96
##  [7,]    7   17   27   37   47   57   67   77   87    97
##  [8,]    8   18   28   38   48   58   68   78   88    98
##  [9,]    9   19   29   39   49   59   69   79   89    99
## [10,]   10   20   30   40   50   60   70   80   90   100
m[m > 90] <- NA
m
##       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
##  [1,]   NA   NA   21   31   41   51   61   71   81    NA
##  [2,]   NA   NA   22   32   42   52   62   72   82    NA
##  [3,]   NA   NA   23   33   43   53   63   73   83    NA
##  [4,]    4   14   24   34   44   54   64   74   84    NA
##  [5,]    5   15   25   35   45   55   65   75   85    NA
##  [6,]    6   16   26   36   46   56   66   76   86    NA
##  [7,]    7   17   27   37   47   57   67   77   87    NA
##  [8,]    8   18   28   38   48   58   68   78   88    NA
##  [9,]    9   19   29   39   49   59   69   79   89    NA
## [10,]   10   20   30   40   50   60   70   80   90    NA

If we want to recode a specific value, we can do that too.

m[m == 31] <- 85467
m
##       [,1] [,2] [,3]  [,4] [,5] [,6] [,7] [,8] [,9] [,10]
##  [1,]   NA   NA   21 85467   41   51   61   71   81    NA
##  [2,]   NA   NA   22    32   42   52   62   72   82    NA
##  [3,]   NA   NA   23    33   43   53   63   73   83    NA
##  [4,]    4   14   24    34   44   54   64   74   84    NA
##  [5,]    5   15   25    35   45   55   65   75   85    NA
##  [6,]    6   16   26    36   46   56   66   76   86    NA
##  [7,]    7   17   27    37   47   57   67   77   87    NA
##  [8,]    8   18   28    38   48   58   68   78   88    NA
##  [9,]    9   19   29    39   49   59   69   79   89    NA
## [10,]   10   20   30    40   50   60   70   80   90    NA

== doesn’t work for NA values, though, because any comparison with NA returns NA rather than TRUE or FALSE. Instead, there’s a special function called is.na().

m[m == NA] <- 0
m
##       [,1] [,2] [,3]  [,4] [,5] [,6] [,7] [,8] [,9] [,10]
##  [1,]   NA   NA   21 85467   41   51   61   71   81    NA
##  [2,]   NA   NA   22    32   42   52   62   72   82    NA
##  [3,]   NA   NA   23    33   43   53   63   73   83    NA
##  [4,]    4   14   24    34   44   54   64   74   84    NA
##  [5,]    5   15   25    35   45   55   65   75   85    NA
##  [6,]    6   16   26    36   46   56   66   76   86    NA
##  [7,]    7   17   27    37   47   57   67   77   87    NA
##  [8,]    8   18   28    38   48   58   68   78   88    NA
##  [9,]    9   19   29    39   49   59   69   79   89    NA
## [10,]   10   20   30    40   50   60   70   80   90    NA
m[is.na(m)] <- 0
m
##       [,1] [,2] [,3]  [,4] [,5] [,6] [,7] [,8] [,9] [,10]
##  [1,]    0    0   21 85467   41   51   61   71   81     0
##  [2,]    0    0   22    32   42   52   62   72   82     0
##  [3,]    0    0   23    33   43   53   63   73   83     0
##  [4,]    4   14   24    34   44   54   64   74   84     0
##  [5,]    5   15   25    35   45   55   65   75   85     0
##  [6,]    6   16   26    36   46   56   66   76   86     0
##  [7,]    7   17   27    37   47   57   67   77   87     0
##  [8,]    8   18   28    38   48   58   68   78   88     0
##  [9,]    9   19   29    39   49   59   69   79   89     0
## [10,]   10   20   30    40   50   60   70   80   90     0
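
Replacing missing values with 0 is not always appropriate, since it changes sums and averages. Many summary functions can skip missing values instead through their na.rm argument. A short sketch on a made-up vector, values:

```r
values <- c(4, NA, 10)

NA == NA                     # comparisons with NA are themselves NA
## [1] NA
is.na(values)                # TRUE where a value is missing
## [1] FALSE  TRUE FALSE
sum(is.na(values))           # count the missing values
## [1] 1

mean(values)                 # one NA makes the whole result NA
## [1] NA
mean(values, na.rm = TRUE)   # drop missing values before averaging
## [1] 7
```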

Plotting

R has a built-in plotting system that is useful for basic plots, such as a scatter plot.

plot(main = "MTCARS", x = mtcars$mpg, y = mtcars$hp, 
     col = ifelse(mtcars$efficient, "blue", "red"))
legend("topright", title = "Efficient", legend = c(TRUE, FALSE), col = c("blue", "red"), pch = 1)
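
Base R plots can also be written to a file by opening a graphics device, drawing, and closing the device. The sketch below uses pdf() and a temporary file path; both the path and the axis labels are assumptions chosen for illustration.

```r
plot_path <- tempfile(fileext = ".pdf")   # temporary file path

pdf(plot_path)                            # open a PDF graphics device
plot(x = mtcars$mpg, y = mtcars$hp, main = "MTCARS",
     xlab = "Miles per gallon", ylab = "Horsepower")
dev.off()                                 # close the device to finish the file

file.exists(plot_path)
## [1] TRUE
```

Until dev.off() is called, the file is incomplete, so it is worth making that call a habit.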

There are packages that have more extensive plotting capabilities, such as ggplot2, which has become a standard plotting package in the past few years.

library(ggplot2)

ggplot(mtcars, aes(x = mpg, y = hp)) +
  geom_point(aes(color = efficient)) +
  labs(title = "MTCARS")

There are many other packages that are used for more specialized graphics, such as network graphs.

Cleaning up workspace

For now, let’s clean up our working environment. We can remove a single object with rm().

rm(cats)

If we want to clean the environment entirely, we can do so like this:

rm(list = ls())

Using Network Data

Let’s practice a bit with network data.

library(tidyr)
library(sna)
## Loading required package: statnet.common
## 
## Attaching package: 'statnet.common'
## The following object is masked from 'package:base':
## 
##     order
## Loading required package: network
## network: Classes for Relational Data
## Version 1.13.0 created on 2015-08-31.
## copyright (c) 2005, Carter T. Butts, University of California-Irvine
##                     Mark S. Handcock, University of California -- Los Angeles
##                     David R. Hunter, Penn State University
##                     Martina Morris, University of Washington
##                     Skye Bender-deMoll, University of Washington
##  For citation information, type citation("network").
##  Type help("network-package") to get started.
## sna: Tools for Social Network Analysis
## Version 2.4 created on 2016-07-23.
## copyright (c) 2005, Carter T. Butts, University of California-Irvine
##  For citation information, type citation("sna").
##  Type help(package="sna") to get started.
net <- read.csv("./data/friendship.csv")

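If we’d like to turn this data frame into an adjacency matrix (stored as a wide data frame, not necessarily a matrix like we discussed before), we can do so with a function called spread() from the tidyr package. (In tidyr 1.0.0 and later, spread() still works but has been superseded by pivot_wider().)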

net_matrix <- spread(data = net, key = V2, value = V3)
net_matrix

We’ll have to remove the first column, though, which we can do like this:

net_matrix <- net_matrix[,-1]
net_matrix

Now we’ll use the gplot() function from the sna package to plot the network this matrix describes.

gplot(net_matrix, displaylabels = TRUE)

If we want to gather this back into a three-column data frame, we can do so with the gather() function from tidyr (superseded by pivot_longer() in tidyr 1.0.0 and later). First we’ll make a copy as a new variable.

net_tidy <- net_matrix
net_tidy

Then we’ll add our row names back as a new column.

net_tidy$V1 <- rownames(net_tidy)
net_tidy

Now we’ll gather the wide data into a long format.

net_tidy <- gather(net_tidy, key = V2, value = V3, 1:21)
net_tidy
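
For comparison, the same long-to-wide idea can be sketched in base R, producing a true matrix instead of a data frame. This is not the tidyr approach used above: it relies on indexing a matrix by row and column names. The edge list below is made-up data that only mimics the V1, V2, V3 layout; the contents of friendship.csv are not assumed.

```r
# Made-up edge list: V1 is the sender, V2 the receiver, V3 the tie value
edges <- data.frame(V1 = c("ann", "ann", "bob", "cat"),
                    V2 = c("bob", "cat", "cat", "ann"),
                    V3 = c(1, 1, 1, 1),
                    stringsAsFactors = FALSE)

# Every person who appears as a sender or a receiver
people <- sort(unique(c(edges$V1, edges$V2)))

# Start from an all-zero square matrix with people as row and column names
adj <- matrix(0, nrow = length(people), ncol = length(people),
              dimnames = list(people, people))

# A two-column character matrix indexes cells by (row name, column name)
adj[cbind(edges$V1, edges$V2)] <- edges$V3

adj["ann", "bob"]   # ann names bob
## [1] 1
adj["bob", "ann"]   # bob does not name ann
## [1] 0
```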

  1. Much of this is inspired by and borrowed from lessons by Software Carpentry.