Week 2 Lecture

When you make an Rmarkdown file, always keep this chunk:
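The chunk itself does not appear in this rendered page. As an assumption, it is the setup chunk that RStudio puts at the top of every new Rmarkdown file, whose body looks like this:

```r
# Assumed contents: the setup chunk from RStudio's default Rmarkdown template.
# It tells knitr to show the code of every chunk in the knitted document.
knitr::opts_chunk$set(echo = TRUE)
```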

Let's also load tidyverse:

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

What is R?

To put it simply: R is a programming language, but you can think of it as a spreadsheet program (like Excel) that you operate using text commands.

To put it even more simply: it’s a big calculator! Example:

1+1

## [1] 2

You’ll notice that there is a [1] in front of the “2”. This may seem strange, but it’s put there to tell you that the value next to it (2) is the “1st” value of the answer. This will make more sense later, when we have answers with multiple values.
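The bracketed number is more useful when an answer has too many values to fit on one line. Here is a small sketch (the name long_answer is made up for this illustration) that prints 30 numbers so you can see how each output line starts:

```r
# long_answer is a hypothetical name used only for this illustration.
long_answer <- 101:130   # the whole numbers from 101 to 130

# When the output wraps onto several lines, each line begins with the
# position of its first value, such as [1] on the first line.
print(long_answer)
```

Where the lines break depends on how wide your console is, so the bracketed numbers you see may differ from your neighbor's.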

2*2

## [1] 4

Get the idea? It’s pretty simple.

More complex problems are do-able, and in my opinion, even easier than doing them with a regular calculator:

2*2+4*8-3*7

## [1] 15
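R follows the usual order of operations here: multiplication and division are done before addition and subtraction. As a short sketch, the same numbers give a different answer once brackets change that order:

```r
# Without brackets: 4 + 32 - 21
2*2+4*8-3*7      # 15

# With brackets, (2*2+4) is worked out first: 8 * 8 - 21
(2*2+4)*8-3*7    # 43
```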

Division works too, right alongside multiplication!

12/4

## [1] 3

12*4

## [1] 48

Important: Data can be stored in objects!

Have a look:

# let's store the result of 2+2 in an object called "a"

a<-2+2

print(a)

## [1] 4

It’s possible to build more complex calculations using this method:

b<-(4*a)+(3*a)

print(b)

## [1] 28

c<-b+a

print(c)

## [1] 32
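One difference from Excel is worth seeing early: an object stores the number that was calculated, not the formula. The sketch below reuses “a” and “b” from above and shows that changing “a” afterwards does not change “b” until you run the line for “b” again:

```r
a <- 2+2          # a is 4
b <- (4*a)+(3*a)  # b is 28, worked out using the current value of a

a <- 10           # give a a new value
print(b)          # still 28: b stored the number, not the formula

b <- (4*a)+(3*a)  # run the line again to update b
print(b)          # 70
```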

It’s also possible to put entire lists of things into a single variable. We can put a bunch of numbers inside a single variable, with the “c()” command. The “c” in “c()” stands for “concatenate” or “combine”. When we put something inside a command like “c()”, we often say that we are “wrapping” it inside “c()”.

So, for example, “wrapping 1,2,3,4 in c()” looks like this: c(1,2,3,4)

Try wrapping 1,2,3,4 into a new variable called “test1234” in the console. It should look something like this:

test1234<-c(1,2,3,4)

Now let's do the same in code chunks:

d_list<-c(1,2,3,4,5,6)
e_list<-c(1,2,3,4,5,6)

print(d_list)

## [1] 1 2 3 4 5 6

print(e_list)

## [1] 1 2 3 4 5 6

# these lists can be multiplied together, position by position (1st*1st, 2nd*2nd, ...):

combined_list<-d_list*e_list

print(combined_list)

## [1]  1  4  9 16 25 36

When we create lists like these, the lists are called “vectors”.
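The same position-by-position rule applies to other arithmetic. A short sketch, reusing d_list and e_list from above (the names “doubled” and “added” are made up for this example):

```r
d_list <- c(1,2,3,4,5,6)
e_list <- c(1,2,3,4,5,6)

doubled <- d_list*2        # every element is multiplied by 2
print(doubled)             # 2 4 6 8 10 12

added <- d_list+e_list     # 1st + 1st, 2nd + 2nd, and so on
print(added)               # 2 4 6 8 10 12
```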

We can also ask R how long a vector is (this will be very useful later, trust me):

length(d_list)

## [1] 6

This will also be useful: we can ask R to give us the value at a specific position in a vector. Here we are asking R, “give us the 3rd value of new_list”:

new_list<-c(5,10,15,20,25)

new_list[3]

## [1] 15
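The square brackets can do a bit more than pick one position. A few variations, as a sketch using the same new_list:

```r
new_list <- c(5,10,15,20,25)

new_list[c(1,5)]   # 1st and 5th values: 5 25
new_list[2:4]      # 2nd through 4th values: 10 15 20
new_list[-1]       # everything except the 1st value: 10 15 20 25
```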

We can add all elements of a vector together:

sum(new_list)

## [1] 75

Or we can just add together some parts of the vector:

sum(new_list[3:4])

## [1] 35

There are easier ways to do this, but you can use “sum()” and “length()” to calculate an average. For example:

sum(new_list)/length(new_list)

## [1] 15
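The easier way is the built-in “mean()” command. This sketch checks that it gives the same answer as the calculation by hand (the names “by_hand” and “built_in” are made up for this example):

```r
new_list <- c(5,10,15,20,25)

by_hand <- sum(new_list)/length(new_list)
built_in <- mean(new_list)

print(by_hand)     # 15
print(built_in)    # 15
```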

You can also check logical statements, such as “is x bigger than y?”

x<-50

y<-2

x>y

## [1] TRUE

print(new_list)

## [1]  5 10 15 20 25

new_list[3]>new_list[4]

## [1] FALSE
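A comparison also works on a whole vector at once, giving one TRUE or FALSE per position. A small sketch with the same new_list (the name “bigger_than_12” is made up for this example):

```r
new_list <- c(5,10,15,20,25)

bigger_than_12 <- new_list>12
print(bigger_than_12)         # FALSE FALSE  TRUE  TRUE  TRUE

sum(bigger_than_12)           # TRUE counts as 1, so this counts them: 3
new_list[bigger_than_12]      # keeps only the values that passed: 15 20 25
```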

You can also find the minimum and maximum elements of a vector very easily:

min(new_list)

## [1] 5

max_b<-max(new_list)

print(max_b)

## [1] 25
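Two related commands are worth knowing, shown here as a short sketch: “range()” returns the minimum and maximum together, and “which.min()” and “which.max()” tell you where in the vector they sit.

```r
new_list <- c(5,10,15,20,25)

range(new_list)       # smallest and largest together: 5 25
which.min(new_list)   # position of the smallest value: 1
which.max(new_list)   # position of the largest value: 5
```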

How do we get files into R?

We can import files from the working directory. Remember how we set the working directory earlier? The dataset you are working on should be there already.

For this lecture, we will be using a small dataset called “city.csv”:

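# check which folder R is reading from (the working directory)
wd<-getwd()
# read the CSV file from the working directory into a data frame
city<-read.csv("city.csv")

# show the data frame as a tibble, which prints more neatly
as_tibble(city)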

## # A tibble: 7 × 6
##       X city      high_far low_far high_cent low_cent
##   <int> <chr>        <int>   <int>     <int>    <int>
## 1     1 Mumbai          91      72        33       22
## 2     2 Nairobi         88      57        31       14
## 3     3 Paris           48      36         9        2
## 4     4 Sao Paulo       82      68        28       20
## 5     5 Sydney          79      68        27       20
## 6     6 Tokyo           48      37         9        3
## 7     7 Toronto         27      10        -3      -12

R can load almost any type of data, using commands such as read.csv() for CSV files and read_excel() (from the readxl package) for Excel files.
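You may wonder where the “X” column in the table came from. One common way such a column arises (an assumption here, since we did not see how city.csv was made) is that the file was saved with row numbers included. The sketch below uses a made-up data frame and a temporary file to show the effect:

```r
# A tiny made-up data frame, using two rows copied from the table above.
small_city <- data.frame(city=c("Paris","Tokyo"), high_far=c(48,48))

# Save it to a temporary file. By default write.csv() also writes row names.
tmp_file <- tempfile(fileext=".csv")
write.csv(small_city, tmp_file)

# Reading the file back gives the unnamed row-name column the name "X".
read_back <- read.csv(tmp_file)
print(colnames(read_back))     # "X" "city" "high_far"

# Writing with row.names=FALSE avoids the extra column.
write.csv(small_city, tmp_file, row.names=FALSE)
clean <- read.csv(tmp_file)
print(colnames(clean))         # "city" "high_far"
```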

Getting help in R

What if something is off and you need to read its documentation? For specific commands, like “print()”, you can just write “?print” in the command console. Try it now, without the quotation marks.

This will open up a window on the lower right called “Help”. Here you see a brief description of what “print()” does, followed by the options (called “arguments”) that can be passed to “print()”. At the very bottom of the page, you’ll see a section called “Examples”, which is very valuable. It shows how the “print()” command can be used in real life.

You can do this with almost any command in R! I’m not kidding when I say that “RTFM” (Read The Freakin’ Manual) is great advice!
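If you only want a quick reminder of which arguments a command accepts, you can also ask R directly in the console. A minimal sketch using “read.csv()” (the name “arg_names” is made up for this example):

```r
# List the names of the arguments that read.csv() accepts.
arg_names <- names(formals(read.csv))
print(arg_names)

# args() prints the same information in the form shown under "Usage".
args(read.csv)
```

This does not replace the Help page, which also explains what each argument means.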

One more thing

It’s also important to know: we use the dollar sign “$” to call specific variables (columns) in a dataset.

For example, “city” has 6 columns:

colnames(city)

## [1] "X"         "city"      "high_far"  "low_far"   "high_cent" "low_cent"

If we want to find the average of a specific column (say, “high_far”), we call mean(city$high_far), like this:

mean(city$high_far)

## [1] 66.14286
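Because “$” hands back a plain vector, everything we did with vectors earlier works on a column too. The sketch below uses “city_demo”, a hypothetical stand-in for “city” typed in by hand from the table above, so that it runs even without city.csv:

```r
# city_demo is a made-up stand-in for "city", copied from the table above.
city_demo <- data.frame(
  city=c("Mumbai","Nairobi","Paris","Sao Paulo","Sydney","Tokyo","Toronto"),
  high_far=c(91,88,48,82,79,48,27)
)

city_demo$high_far             # the whole column, as a vector
city_demo$high_far[3]          # 3rd value of that column: 48
max(city_demo$high_far)        # 91

# The name of the city whose high_far equals the maximum: "Mumbai"
city_demo$city[city_demo$high_far==max(city_demo$high_far)]
```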

Keep this in mind for later.

Plots

Unlike Excel, it’s very easy to make simple plots in R. For very simple x-y plots, there is just one command: plot()

plot(city$high_far,main="CityPlot",ylab="Y axis",type="b",sub="Data from Unknown Source")

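When we give “plot()” only one set of values, R automatically uses the position of each value (1, 2, 3 and so on, labeled “Index”) for the horizontal X axis. Here those positions happen to match the “X” column. The vertical Y axis is the “city$high_far” temperature for that city.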
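To convince yourself that the bottom axis is just the position of each value, you can supply those positions yourself. In this sketch “high_far” is typed in by hand from the table above, and “position” is a made-up name:

```r
# high_far is typed in by hand from the table above so the example runs on its own.
high_far <- c(91,88,48,82,79,48,27)

# seq_along() gives the positions 1, 2, ..., 7.
position <- seq_along(high_far)
print(position)

# Supplying the positions explicitly draws the same points as plot(high_far,type="b").
plot(position,high_far,type="b",xlab="Index",ylab="high_far")
```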

Let's explore “plot()” a little bit more. Type the following in the console:

?plot

In this case, we’ll click on the first link “Generic X-Y Plotting”

Under “Usage” we see this:

plot(x, y, …)

This means that “plot()” takes an “x” and a “y”, and can also accept “…” (additional arguments). In the example we just did, we gave only one set of values, so R plotted them on the Y axis and filled in the X axis with whatever it thought was best (the index). We can add some additional arguments now. Look below “Usage” to “Arguments”:

plot(city$high_far,type="l")

# the "l" stands for "line". We are asking R to dispense with points and replace with lines.

Much nicer! Now try this:

plot(city$high_far,type="b")

# "b" stands for "both" -- both lines and points

I think you’ll agree this is a good plot to see the difference in cities. We can also add labels to the plot. Check out the arguments like “main” and “sub”.

plot(city$high_far,type="b",main="City High F temps",sub="Data From Unknown Source")

Can we add city names? Yes, but we have to explicitly call “city$city” in the “X” argument, as a “factor()”. Don’t worry about “factor()” for now, I’ll explain it in detail later.

plot(factor(city$city),city$high_far,type="b",main="City High F temps",sub="Data From Unknown Source")

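This looks kind of weird. Because the X values are now categories (a factor), “plot()” switches to drawing a box plot for each city, and with only one temperature per city each box collapses into a flat line. In future classes we will discuss easier ways to make graphs look however we want them, using ggplot.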
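If you want to keep the connected points and still show city names, one possible approach in base R (a sketch, not the only way) is to hide the default X axis and draw your own with “axis()”. The two vectors below are typed in by hand from the table above so the example runs on its own:

```r
city_names <- c("Mumbai","Nairobi","Paris","Sao Paulo","Sydney","Tokyo","Toronto")
high_far <- c(91,88,48,82,79,48,27)

# xaxt="n" tells plot() not to draw the default X axis.
plot(high_far,type="b",xaxt="n",xlab="City",ylab="High temp (F)",main="City High F temps")

# axis(1, ...) draws the bottom axis, with a city name at each position.
axis(1,at=seq_along(city_names),labels=city_names)
```

If the plot window is narrow, R may leave out some names so they do not overlap. Adding las=2 inside “axis()” turns the labels sideways so more of them fit.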

Homework

Create a new Rmarkdown document in a folder with cars_data.csv
Load cars_data.csv into the Rmarkdown document

Answer these questions in the Rmarkdown document. Show your code:

What is the average (mean()) mpg for all cars?
What is the minimum mpg for all cars?
What is the maximum mpg for all cars?
How many cars are there in the dataset? (Hint: length() of mpg)
Upload this to RPubs and submit the URL via CANVAS. Good luck!