Welcome to R! This is an RMarkdown notebook. This allows you to intersperse writing text with calculations, making it perfect for lab writeups and reproducible data analyses.
Below this writing, you can see a code chunk. It loads packages, which are code other people have written to make your life easier. The tidyverse package is life changing and is used very commonly throughout the sciences. Different disciplines use different software, but the bonus here is that you will also develop light coding skills.
You need to install this package first, by writing
install.packages("tidyverse") (quotation marks included) into the
console down at the bottom. You should only ever have
to do this once on any given machine!
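Since installing is a one-time step, it can be handy to check whether a package is already on your machine before installing it again. The following is a minimal sketch rather than part of the original notebook. It relies only on base R's requireNamespace() and does not install anything itself.

```r
# Check whether a package is already installed, without attaching it.
pkg <- "tidyverse"
already_installed <- requireNamespace(pkg, quietly = TRUE)

if (already_installed) {
  message(pkg, " is installed; load it with library(", pkg, ")")
} else {
  message(pkg, " is missing; run install.packages(\"", pkg, "\") in the console")
}
```

Installing (install.packages) happens once per machine, while loading (library) happens every time you start a new R session.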
The first thing to do when you use R is to load packages. Pressing
ctrl-shift-enter (all at once) will run
the following cell/chunk. The same thing will happen if
you press the green arrow.
The stuff in the curly braces is extra
instructions to R. The load-packages just gives this block
of code a name and is optional, but it makes it easier to troubleshoot
code (you should always put a name in the blocks!). Then, the other
parts tell R not to spit out a bunch of information. For this class, you
don’t have to worry about changing what is in the curly braces basically
ever, unless you really want to.
There is a line with a # in front of it in the code block that does nothing. The # is a comment symbol and will make R ignore anything that comes after it on a line.
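As a small illustration of how comments behave (a sketch with a made-up variable name, total), the lines below mix comments and code. Only the uncommented code runs.

```r
# A full-line comment: R skips this entire line.
total <- 2 + 3  # R runs "total <- 2 + 3" and ignores this note
# total <- 100  # commented-out code never runs
total           # prints 5
```

Putting a # in front of a line of code is a quick way to switch it off temporarily without deleting it.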
If you ever want to add another code block, press
ctrl-alt-i to insert a block.
(Or go to Insert>Executable Cell>R, but that takes so many
clicks…)
R can do most calculator computations. In addition, for multiple-step calculations it can store the value of a variable at intermediate steps.
Try running the following cells multiple times. Tweak a few things. Then, below the code chunk, answer the questions.
Note: you can also do any of these one-line calculations in the console (bottom left). Sometimes, you may want to test some things in the console before putting them into a code chunk, or you’re just using R as a calculator for short calculations and you don’t need to track multiple code chunks.
5218+4683
## [1] 9901
x <- 500
y <- 7
x^2
## [1] 250000
y+9
## [1] 16
x <- x+3
x
## [1] 503
y <- y*2
y
## [1] 14
x+y
## [1] 517
When does a line print something, and when doesn’t it? (Hint: Consider when there are arrows <-)
-It acts as an equal sign
When is a value stored or updated? What does <- do?
-It is updated when a new value is assigned
Is anything about the behavior of these code chunks surprising to you?
-There is a new number every time you click play again
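The difference between storing and printing can be seen in a short sketch. The variable z is made up for this example. The extra parentheses in the last line are a base R feature that was not used in the chunks above.

```r
z <- 10       # assignment only: the value is stored, nothing is printed
z             # the name alone: prints the stored value, 10
z + 1         # a calculation with no arrow: prints 11, but z is still 10
z <- z + 1    # assignment again: z is updated to 11, nothing is printed
(z <- z * 2)  # extra parentheses: assigns 22 and also prints it
```

This is also why rerunning a chunk that contains a line like x <- x+3 gives a new number each time: every run starts from the value stored by the previous run.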
Our textbook gives us several datasets. The first line of code
in this code chunk instructs R to load some data.
In this case, it is a list of cars that were sold in 1993 at a
particular dealership. The command head displays the
head, or first 6 rows of the dataset. To see the FULL dataset,
use the environment tab.
cars_table <- cars93
head(cars_table)
To access just one column of this dataset, use the $
symbol. For example, we can get the column of prices by the following.
Here price is in thousands of dollars, so a price of 8.0
corresponds to an $8,000 car.
cars_table$price
## [1] 15.9 33.9 37.7 30.0 15.7 20.8 23.7 26.3 34.7 40.1 15.9 18.8 18.4 29.5 9.2
## [16] 11.3 15.6 12.2 19.3 7.4 10.1 20.2 20.9 8.4 12.1 8.0 10.0 13.9 47.9 28.0
## [31] 35.2 34.3 36.1 8.3 11.6 61.9 14.9 10.3 26.1 11.8 21.5 16.3 20.7 9.0 18.5
## [46] 24.4 11.1 8.4 10.9 8.6 9.8 18.2 9.1 26.7
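To see what $ returns without loading the textbook data, here is a sketch on a tiny hypothetical table. The name toy_cars and the type labels are made up. The three prices are the first three values printed above.

```r
# Hypothetical three-row table for illustration only.
toy_cars <- data.frame(
  type  = c("small", "midsize", "midsize"),  # made-up labels
  price = c(15.9, 33.9, 37.7)                # thousands of dollars
)

toy_cars$price          # the whole price column as a vector
toy_cars[["price"]]     # the same column, selected by name
toy_cars$price[2]       # just the second price: 33.9
toy_cars$price * 1000   # prices converted from thousands of dollars to dollars
```

Because a column is an ordinary vector, anything you can do with a vector of numbers also works on a column.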
You can ask R to get information from a single column by putting a
column name into a function, like max or min.
The following computes the minimum (lowest) priced car sold at this
dealership in 1993. The function max computes the maximum
(highest).
min(cars_table$price)
## [1] 7.4
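A few related base R summaries work the same way. The sketch below uses only the first six prices from the output above, stored in a stand-alone vector named price, so its results describe that subset and not the whole dataset.

```r
# First six prices from the output above, in thousands of dollars.
price <- c(15.9, 33.9, 37.7, 30.0, 15.7, 20.8)

min(price)        # lowest price in this subset: 15.7
max(price)        # highest price in this subset: 37.7
range(price)      # both at once: 15.7 37.7
which.max(price)  # position of the highest price: 3
```

which.max is useful when you want to know which row holds the largest value and not just the value itself.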
How many observations does this dataset have?
-54 observations in the data set
What are the variables in this dataset, and are they categorical
or quantitative?
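-6 variables. Judging from how they are used in this notebook, type and drive_train are categorical, while price, weight, and mpg_city are quantitative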
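Counting rows and checking variable types can also be done with code. This is a sketch on a hypothetical three-row table (the name toy_cars and most of its values are made up), using only base R functions.

```r
# Hypothetical table for illustration only.
toy_cars <- data.frame(
  type   = c("small", "midsize", "large"),  # made-up values
  price  = c(15.9, 33.9, 37.7),
  weight = c(2705, 3560, 3405)              # made-up values
)

nrow(toy_cars)                # number of observations (rows): 3
ncol(toy_cars)                # number of variables (columns): 3
names(toy_cars)               # the variable names
str(toy_cars)                 # each variable with its storage type
sapply(toy_cars, is.numeric)  # TRUE for columns stored as numbers
```

A column stored as text is categorical. A column stored as numbers is usually quantitative, though numbers are sometimes used as category codes, so it is still worth reading the variable descriptions.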
What was the most expensive car that was sold in this dataset?
Use R to compute the answer.
-$61,900 was the most expensive (61.9 in thousands of dollars)
max(cars_table$price)
## [1] 61.9
What proportion of the cars were more expensive than $35,000?
Count them, and use R as a calculator below to get the answer.
6/54
## [1] 0.1111111
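The hand count can be double-checked with code. The sketch below copies the 54 prices printed earlier into a stand-alone vector named price, so it runs without the textbook data. With the dataset loaded, cars_table$price could be used in its place.

```r
# The 54 prices printed above, in thousands of dollars.
price <- c(
  15.9, 33.9, 37.7, 30.0, 15.7, 20.8, 23.7, 26.3, 34.7, 40.1, 15.9, 18.8, 18.4, 29.5, 9.2,
  11.3, 15.6, 12.2, 19.3, 7.4, 10.1, 20.2, 20.9, 8.4, 12.1, 8.0, 10.0, 13.9, 47.9, 28.0,
  35.2, 34.3, 36.1, 8.3, 11.6, 61.9, 14.9, 10.3, 26.1, 11.8, 21.5, 16.3, 20.7, 9.0, 18.5,
  24.4, 11.1, 8.4, 10.9, 8.6, 9.8, 18.2, 9.1, 26.7
)

length(price)     # number of cars: 54
sum(price > 35)   # cars above $35,000: 6
mean(price > 35)  # proportion above $35,000: 0.1111111
```

The comparison price > 35 gives TRUE or FALSE for each car. R treats TRUE as 1 and FALSE as 0, so sum counts the matches and mean gives the proportion.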
Was this data collected from an observational study or an
experiment? Explain how this tells you what types of conclusions you can
draw from this data.
- The expensive cars were driven less than the cheaper
cars
R has some powerful functions for making graphics. We
can create a simple plot of the relationship between the weight of a car
and its price as follows. When we describe one of these plots, we always
write y versus x. In this case, we are plotting price
versus weight.
cars_table %>%
  ggplot(aes(x=weight, y=price))+
  geom_point()+
  ggtitle("Price versus Weight in cars")
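The "y versus x" convention lines up with the aes() mapping, which a stand-alone sketch can make explicit. It assumes the ggplot2 package is installed, uses a hypothetical table toy_cars with made-up values, and passes the data directly to ggplot() without the pipe. The labs() call for axis labels is an addition not used in the chunk above.

```r
library(ggplot2)

# Hypothetical values, made up for illustration only.
toy_cars <- data.frame(
  weight = c(2300, 2700, 3100, 3500),
  price  = c(9.5, 14.0, 19.5, 28.0)
)

# y is price and x is weight, so the title reads "Price versus Weight".
p <- ggplot(toy_cars, aes(x = weight, y = price)) +
  geom_point() +
  ggtitle("Price versus Weight in toy data") +
  labs(x = "Weight", y = "Price (thousands of dollars)")

p  # printing the object draws the plot
```

Swapping the two variables inside aes() would flip the axes, and the title should then be swapped to match.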
Is there an apparent trend between price and
weight? How would you describe it?
-The heavier the car, the higher the
price.
cars_table %>%
  ggplot(aes(x=mpg_city, y=price))+
  geom_point()+
  ggtitle("Price versus mpg_city in cars")
The below makes a different type of plot by using a different
geometry. Give it a title by adding
+ ggtitle("YOUR TITLE HERE") that is appropriate for what
it is measuring.
cars_table %>%
  ggplot(aes(x=type, fill=drive_train))+
  geom_bar()+
  ggtitle("Number of cars by type and drive train")
The cars04 dataset contains another set of cars from
a different dealership that were sold in – you guessed it – 2004.
cars_table_2 <- cars04
head(cars_table_2)
How many cars were represented by this dataset?
-263 cars in the data set
How do the prices of cars in this dataset compare to the prices
of cars in the cars93 dataset? Explain what you are looking at to give
your answer.
–I compared the prices in cars93 with the msrp prices in the
cars04 data set. The cars in the cars04 data set are far more expensive
than the 1993 cars.
Make a plot that displays the price of each car (from
dealer_cost) plotted against weight, with an
appropriate title. Do you see a similar trend as in cars93?
Explain.
cars_table_2 %>%
  ggplot(aes(x=weight, y=dealer_cost))+
  geom_point()+
  ggtitle("dealer_cost versus weight in cars")
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
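A warning like this usually means some rows have a missing value, written NA in R, in one of the plotted columns. The output does not say which column is affected here, so the sketch below uses a hypothetical vector named cost to show how missing values behave in base R.

```r
# Hypothetical column with one missing value (NA); not data from cars04.
cost <- c(21000, NA, 35500, 18250)

sum(is.na(cost))         # how many values are missing: 1
max(cost)                # NA, because one value is unknown
max(cost, na.rm = TRUE)  # 35500, ignoring the missing value
mean(cost, na.rm = TRUE) # average of the three known values
```

If a summary such as max or mean ever returns NA, adding na.rm = TRUE tells R to drop the missing values before computing.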
What was the most expensive car sold in this dealership in 2004?
Use R code to answer.
max(cars_table_2$msrp)
## [1] 192465
That was a short introduction to R and RStudio, but we will provide you with more functions and a more complete sense of the language as the course progresses.
In this course we will be using the suite of R packages from the tidyverse. The book R For Data Science by Grolemund and Wickham is a fantastic resource for data analysis in R with the tidyverse. If you are Googling for R code, make sure to also include these package names in your search query. For example, instead of Googling “scatterplot in R”, Google “scatterplot in R with the tidyverse”.
These may come in handy throughout the semester:
Note that some of the code on these cheatsheets may be too advanced for this course. However, the majority of it will become useful throughout the semester.