Introduction to R (BTC)

Why R? R is a free software environment for statistical computing and data visualization. Seems pretty relevant to our class, eh?

Of course, R can just be a calculator for things like 3 - 4 = -1 or 3 ^ 2 = 9. I’ll throw in a few common arithmetic operations you might use in R. There are many more, but these come up often!

# This is a comment (words following the hashtag). I use this to annotate my code so that it is easier for me to look back and understand what I did. 
1 + 7

## [1] 8

16 / 2

## [1] 8

2 * 4

## [1] 8

13 - 5

## [1] 8

2 ^ 3 # "2 raised to the 3rd power"

## [1] 8

100 %% 92 # remainder

## [1] 8

65 %/% 8 # number of times 8 goes into 65 cleanly

## [1] 8

log(256, 2) #log, base 2, of 256; default of log function is natural log (base e)

## [1] 8
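
To see how `%/%` and `%%` fit together, here is a small sketch (the names `a`, `b`, `quotient`, and `remainder` are just ones I made up for illustration). The whole-number quotient and the remainder should rebuild the original number:

```r
a <- 65
b <- 8

quotient  <- a %/% b   # whole number of times b fits into a
remainder <- a %% b    # what is left over afterwards

quotient * b + remainder == a   # TRUE: the two pieces rebuild the original number
```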

You can also use R to compare objects, even characters! You will get a logical value (TRUE, FALSE, or NA) in return.

5 > 3

## [1] TRUE

6 == 7 #note that for comparisons, "==" is used instead of "="

## [1] FALSE

"cat" == "dog"

## [1] FALSE

4 <= 4

## [1] TRUE

6 != 2 # "!=" denotes "not equal to"

## [1] TRUE
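
None of the comparisons above return NA, so here is a short sketch of when that happens. The rough idea is that NA means "unknown", so R cannot say whether a comparison involving it is true or false:

```r
NA > 1           # NA: R cannot tell whether an unknown value is bigger than 1
NA == NA         # NA: two unknown values may or may not be equal
is.na(NA)        # TRUE: use is.na() to test for a missing value instead of ==

"cat" == "Cat"   # FALSE: character comparisons are case-sensitive
```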

The beauty of technology, however, lies in its ability to work with a whole bunch of numbers and data. Specifically, R is pretty great for working with large datasets and strings of information (i.e. variables). You can use those algebraic expressions above on not just individual numbers, but on objects like vectors, lists, matrices… more on that below.

One note before we start: R is a function-based programming language. As we saw with Google Sheets functions like `AVERAGE()`, the inputs that go into the parentheses are manipulated by the function machine to produce the output of interest: in this case, the mean of the set of numeric input values. R functions work the same way, though the function names may differ (as is often the case across programming languages). That’s why learning a programming language is like learning a new language: the more you use it, the easier and more intuitive it becomes!
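
As a rough illustration, R’s counterpart to `AVERAGE()` is `mean()`. The input values below are made up, and `c()`, which bundles them together, is introduced in the vectors section:

```r
values <- c(4, 8, 15)   # hypothetical input values

mean(values)            # 9: the inputs go inside the parentheses, the mean comes out

# Many functions take extra named inputs (arguments) that adjust what they do
round(3.14159, digits = 2)   # 3.14
```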

Now, let’s familiarize ourselves with basic R “anatomy” before we dive into coding.

Objects: 5 basic types of R objects

[Atomic] Vectors: These are similar to vectors in math. A vector stores data of a single type (character, integer, logical, etc.), which is similar to how we have different variable types (categorical, quantitative, etc.). Some examples of vectors:

# Assign information into variables (w, x, y, z) to create vectors using "<-" or "=". The `c()` function combines (concatenates) values. Think of it as just linking the individual items together in a series.
w = c(TRUE, FALSE, NA)
x <- c(1, 2, 3, 4) 
y <- c("a", "b", "c")
z = 5

# Do you see them in your global environment (to the right)?

Now say I want to view them and check what type of vector they are:

# Print vectors using `print()` function to see the values of w, x, y, and z
print(w); print(x) ; print(y); print(z)

## [1]  TRUE FALSE    NA

## [1] 1 2 3 4

## [1] "a" "b" "c"

## [1] 5

# Print the class (type) of each vector using `class()` function
print(class(w)) ; print(class(x)); print(class(y)); print(class(z))

## [1] "logical"

## [1] "numeric"

## [1] "character"

## [1] "numeric"

Note how I can embed functions within other functions. I can do this when the output of the inner function is an acceptable input for the outer function. Can you think of why I might embed functions in functions?
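
Here is a minimal sketch of the same idea with two other functions, `mean()` and `sqrt()`. The variable names are only for illustration. Both versions give the same answer, but the nested one skips the intermediate variable:

```r
x <- c(1, 2, 3, 4)   # same x as above

# Step by step, storing the intermediate result
x_mean       <- mean(x)
sqrt_of_mean <- sqrt(x_mean)

# Nested: the inner call runs first and its output feeds the outer call
nested <- sqrt(mean(x))

identical(sqrt_of_mean, nested)   # TRUE
```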

As I alluded to above, I can also perform operations on objects like vectors. Check out the examples below:

#recall x from above. 
x - 1 ; x / 2 ; x ^ 2 ; x * 3

## [1] 0 1 2 3

## [1] 0.5 1.0 1.5 2.0

## [1]  1  4  9 16

## [1]  3  6  9 12

#recall x, y, and z from above
x - z # I can do this because they're both numeric

## [1] -4 -3 -2 -1
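
A little more detail on what happens here, as a sketch: arithmetic on vectors is done element by element, and a single value like z is reused for every element of the longer vector. The vector `v` is a made-up example with the same length as x:

```r
x <- c(1, 2, 3, 4)
z <- 5
v <- c(10, 20, 30, 40)   # hypothetical second vector, same length as x

x + v       # 11 22 33 44 : elements are paired up by position
x - z       # -4 -3 -2 -1 : the single value in z is reused for each element of x
length(z)   # 1 : z is itself a vector of length one
```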

Checkpoint: Can I do x - y? Why or why not?

List: these look almost identical to vectors but can contain different types of data. A list can also include vectors, matrices, and even other lists – list-inception!

# List 1, `ls1`, is created using vectors x, y, and z
ls1 <- list(x, y, z)
ls2 <- list(x, ls1) #create list using ls1 

print(ls1) ; print(ls2)

## [[1]]
## [1] 1 2 3 4
## 
## [[2]]
## [1] "a" "b" "c"
## 
## [[3]]
## [1] 5

## [[1]]
## [1] 1 2 3 4
## 
## [[2]]
## [[2]][[1]]
## [1] 1 2 3 4
## 
## [[2]][[2]]
## [1] "a" "b" "c"
## 
## [[2]][[3]]
## [1] 5

print(class(ls1)); print(class(ls2)) #class type is list

## [1] "list"

## [1] "list"
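
To see the "different types" point in action, here is a small sketch with made-up values. `sapply()` is not covered in this handout; it simply applies `class()` to each item of the list:

```r
mixed_vector <- c(1, "a", TRUE)      # c() converts everything to one type
mixed_list   <- list(1, "a", TRUE)   # list() keeps each item's own type

class(mixed_vector)         # "character"
print(mixed_vector)         # "1" "a" "TRUE"
sapply(mixed_list, class)   # "numeric" "character" "logical"
```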

Checkpoint: how do lists and vectors differ and how are they similar? Can a vector be part of a list, and can a list be part of a vector?

Matrices: these are also known as 2D arrays and are created using matrix() with the following inputs: data, nrow, ncol, byrow, and dimnames.

This is a great time to introduce the help tool in R. I like to think of this as R’s dictionary. If you are unsure of how a certain function works, you can write ?[some_function] in your code and some helpful reading should pop up in the bottom-right pane. If you don’t know the specific function name, you can write ??[some_idea] to search the help pages for that idea. See below:

?matrix

Back to matrices. As you have read in the help section, the default matrix holds a single missing value (NA) if you leave the data argument blank. I can define the number of rows (nrow) and columns (ncol), which are each 1 by default. The byrow argument is FALSE by default, which tells me that my matrix will be filled with data column-wise (set it to TRUE for row-wise). Finally, dimnames allows me to give my rows and columns names (respectively) and should be a list.
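
If you would rather check these defaults from the console than from the help page, one possible way is sketched below. `args()` and `formals()` are base R functions that this handout does not otherwise cover:

```r
# Show matrix()'s arguments together with their default values
args(matrix)

# Just the argument names, in order
names(formals(matrix))   # "data" "nrow" "ncol" "byrow" "dimnames"

# The default value of one argument
formals(matrix)$byrow    # FALSE
```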

matrix() #default empty matrix

##      [,1]
## [1,]   NA

#recall x from above
matrix(x) ; matrix(x, nrow = 2) ; matrix(x , nrow = 2, byrow = TRUE)

##      [,1]
## [1,]    1
## [2,]    2
## [3,]    3
## [4,]    4

##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4

##      [,1] [,2]
## [1,]    1    2
## [2,]    3    4

mat <- matrix(x, nrow = 2, dimnames = list(c("row1", "row2"), c("column1", "column2")))
print(mat)

##      column1 column2
## row1       1       3
## row2       2       4

class(mat)

## [1] "matrix" "array"

Checkpoint: Use the code chunk below to learn about the dim() function, and apply it to our matrix, mat.

Array: Like a matrix, but not limited to 2D (it can be n-dimensional). We won’t delve too much into these, but you can use ? and the code chunk below to gain some intuition for arrays.

?array
arr <- array(x, dim = c(4, 4)) ; print(arr)

##      [,1] [,2] [,3] [,4]
## [1,]    1    1    1    1
## [2,]    2    2    2    2
## [3,]    3    3    3    3
## [4,]    4    4    4    4
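
The array above still has only two dimensions, and its 16 cells are filled by reusing the four values of x. For a sense of what "n-dimensional" means, here is a sketch of a made-up 3D array, `arr3`:

```r
# A hypothetical 3D array: 2 rows, 3 columns, 4 "layers", filled with 1 to 24
arr3 <- array(1:24, dim = c(2, 3, 4))

dim(arr3)      # 2 3 4 : rows, columns, layers
length(arr3)   # 24 values in total (2 * 3 * 4)
```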

Before we jump into our last data object for the day, we should touch on the concept of “subsetting” for R objects. First, let’s introduce the dim() function, which tells us the dimensions of our input objects.

# recall mat from above
dim(mat) # output returns # of rows and # of columns (respectively)

## [1] 2 2

dim(arr) # another example using array

## [1] 4 4

dim(x) # note that it returns NULL for plain vectors, which have no dimensions set; use it on matrices, arrays, and data frames

## NULL

Subsetting allows us to pull specific values or subsets of values from our object of interest. We do this using brackets [ ].

For 1D objects, the notation [n] pulls out the nth item in the object:

x[2] # 2nd item in the x vector

## [1] 2

x[5] # Why does this return NA?

## [1] NA

For 2D objects, the notation goes [row, column]. I can pull out specific values, or values of entire rows/columns by leaving either the row or column specification blank.

print(mat); mat[1, 2] # value from the 1st row, 2nd column

##      column1 column2
## row1       1       3
## row2       2       4

## [1] 3

print(arr); arr[4, ] # all the values from the 4th row

##      [,1] [,2] [,3] [,4]
## [1,]    1    1    1    1
## [2,]    2    2    2    2
## [3,]    3    3    3    3
## [4,]    4    4    4    4

## [1] 4 4 4 4
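
Brackets accept more than a single position. A few other patterns, sketched with the same x and mat as above:

```r
x   <- c(1, 2, 3, 4)
mat <- matrix(x, nrow = 2,
              dimnames = list(c("row1", "row2"), c("column1", "column2")))

x[c(1, 3)]               # 1 3   : several positions at once
x[-1]                    # 2 3 4 : a negative index drops that position
mat["row2", "column1"]   # 2     : row and column names work as indices too
```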

Checkpoint: subset arr to get all the values from the 4th column.

Data Frames: These are 2D tabular R objects (like a dataset). They can contain multiple columns and rows, with each column representing a vector (like variables). Unlike matrices, each column in a data frame can have different types of data.

a <- 1:5 # another way to create a numeric vector that sequentially goes 1, 2, 3, 4, 5
b <- c(0.67, 0.23, 1.25, 3, .20) 
c <- c("apples", "bananas", "cabbage", "dragonfruit", "eggs")

#create dataframe
df <- data.frame(a, b, c) ; print(df)

##   a    b           c
## 1 1 0.67      apples
## 2 2 0.23     bananas
## 3 3 1.25     cabbage
## 4 4 3.00 dragonfruit
## 5 5 0.20        eggs

Now, data frames usually have informative column names because they represent things with context. Let’s make our columns (VARIABLES) more informative using colnames().

colnames(df) <- c("item_id", "cost_per_item", "grocery_item") ; print(df)

##   item_id cost_per_item grocery_item
## 1       1          0.67       apples
## 2       2          0.23      bananas
## 3       3          1.25      cabbage
## 4       4          3.00  dragonfruit
## 5       5          0.20         eggs

What if I want to add a column that tells me how much of each item I will be purchasing? There are a couple of ways to do this, but I will define the fourth (new) column using subsetting techniques:

df[ , 4] <- c(4, 5, 1, 1, 12) ; print(df)

##   item_id cost_per_item grocery_item V4
## 1       1          0.67       apples  4
## 2       2          0.23      bananas  5
## 3       3          1.25      cabbage  1
## 4       4          3.00  dragonfruit  1
## 5       5          0.20         eggs 12

colnames(df)[4] <- "quantity" ; print(df) #rename to more useful variable name

##   item_id cost_per_item grocery_item quantity
## 1       1          0.67       apples        4
## 2       2          0.23      bananas        5
## 3       3          1.25      cabbage        1
## 4       4          3.00  dragonfruit        1
## 5       5          0.20         eggs       12

Finally, if I want to pull out a specific variable from the data frame, I can use the $ operator to do so. The general notation is [dataframe]$[variable name].

df$grocery_item #extracts the variable named "grocery_item"

## [1] "apples"      "bananas"     "cabbage"     "dragonfruit" "eggs"

df$cost_per_item[1] #extracts the value of "cost_per_item" for the first observation

## [1] 0.67

# CHALLENGE EXAMPLE 
df$cost_per_item[df$grocery_item == "apples"] #any idea of what's going on here?

## [1] 0.67
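
One way to unpack the challenge line is to split it into two steps. This is only a sketch: it rebuilds just the two columns it needs, and `is_apples` is a name I made up:

```r
df <- data.frame(
  cost_per_item = c(0.67, 0.23, 1.25, 3, 0.20),
  grocery_item  = c("apples", "bananas", "cabbage", "dragonfruit", "eggs")
)

# Step 1: the comparison gives one TRUE/FALSE per row
is_apples <- df$grocery_item == "apples"
print(is_apples)              # TRUE FALSE FALSE FALSE FALSE

# Step 2: a logical vector inside [ ] keeps only the positions that are TRUE
df$cost_per_item[is_apples]   # 0.67
```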

Checkpoint: Create a 5th variable titled “total_cost” for each item. Then use a function to calculate the overall cost for all my groceries.

df[ , 5] <- df["cost_per_item"] * df["quantity"] #new variable using existing variables
colnames(df)[5] <- "total_cost" #rename
sum(df["total_cost"]) # the overall cost is $10.48

## [1] 10.48
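
Another possible solution uses `$` to create and name the new column in one step. The sketch below rebuilds only the two columns it needs, so here total_cost ends up as the 3rd column rather than the 5th; `overall_cost` is an illustrative name:

```r
df <- data.frame(
  cost_per_item = c(0.67, 0.23, 1.25, 3, 0.20),
  quantity      = c(4, 5, 1, 1, 12)
)

df$total_cost <- df$cost_per_item * df$quantity   # multiply the two columns row by row
overall_cost  <- sum(df$total_cost)
print(overall_cost)                               # 10.48
```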


Miss. Yunge

2024-07-28

Welcome to R, or more specifically, RStudio! To clarify, R is the programming language (like Python, JavaScript, C++), and RStudio is the environment that most people prefer to code R in.
