We’ve already thrown around a few new terms, so let’s define them before we proceed.
I like to think of R as the engine of the car, and RStudio is the dashboard.
As the labs progress, you are encouraged to explore beyond what the labs dictate; a willingness to experiment will make you a much better programmer. Before we get to that stage, however, you need to build some basic fluency in R. Today we begin with the fundamental building blocks of R and RStudio: the interface, reading in data, and basic commands.
And to make versioning simpler, this is a solo lab. Additionally, we want to make sure everyone gets a significant amount of time at the steering wheel.
Go to the CRAN and download R, make sure you get the version that matches your operating system.
If you have R installed run the following code - We can see my R Version is 4.0.2.
R.version
## _
## platform x86_64-apple-darwin17.0
## arch x86_64
## os darwin17.0
## system x86_64, darwin17.0
## status
## major 4
## minor 0.2
## year 2020
## month 06
## day 22
## svn rev 78730
## language R
## version.string R version 4.0.2 (2020-06-22)
## nickname Taking Off Again
This should tell you what version of R you are currently using. If your R version is lower then 3.6.0 I would strongly recommend updating. In general it is a good idea to update your R version, unless you have a project right now that depend on a specific version of R.
We recommend using RStudio as your IDE if you don’t already have it installed. You can go to the RStudio website to download and install the software.
You can also open the RStudio application first and then create a project by going
file -> new project...
file -> new file -> R markdown...
RStudio is comprised of four panes.
test <- 2 + 2
test
## [1] 4
On the bottom right is the Files pane, as well as other panes that will come handy as we start our analysis.
If you click on a file, it will open in the editor, on the top left pane.
Finally, the top right pane shows your Environment. If you define a variable it would show up there. Try typing x <- 2 in the Console and hit enter, what do you get in the Environment pane? Ans: I will see the integer 2 in the global environment.
x <- 2
R is an open-source language, and developers contribute functionality to R via packages. In this lab we will work with three packages: palmerpenguins which contains the dataset, and tidyverse which is a collection of packages for doing data analysis in a “tidy” way.
Load these packages by running the following in the Console.
- Reference: https://github.com/tidymodels/parsnip
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.3 ✓ purrr 0.3.4
## ✓ tibble 3.1.2 ✓ dplyr 1.0.6
## ✓ tidyr 1.1.3 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 0.1.1 ──
## ✓ broom 0.7.6 ✓ recipes 0.1.14
## ✓ dials 0.0.9 ✓ rsample 0.0.9
## ✓ infer 0.5.3 ✓ tune 0.1.1
## ✓ modeldata 0.1.0 ✓ workflows 0.2.1
## ✓ parsnip 0.1.5 ✓ yardstick 0.0.7
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## x scales::discard() masks purrr::discard()
## x dplyr::filter() masks stats::filter()
## x recipes::fixed() masks stringr::fixed()
## x dplyr::lag() masks stats::lag()
## x yardstick::spec() masks readr::spec()
## x recipes::step() masks stats::step()
library(palmerpenguins)
# install.packages("devtools")
# devtools::install_github("tidymodels/parsnip")
# install.packages("parsnip")
If you haven’t installed these packages yet and R complains, then you can install these packages by running the following command. (Note that R package names are case-sensitive)
# install.packages(c("tidyverse", "palmerpenguins"))
Note that the packages are also loaded with the same commands in your R Markdown document.
Before we introduce the data, let’s warm up with some simple exercises.
The top portion of your R Markdown file (between the three dashed lines) is called YAML. It stands for “YAML Ain’t Markup Language”. It is a human friendly data serialization standard for all programming languages. All you need to know is that this area is called the YAML (we will refer to it as such) and that it contains meta information about your document.
Open the R Markdown (Rmd) file in your project, change the author name to your name, and knit the document.
The data frame we will be working with today is called penguins and it’s in the palmerpenguins package.
- count the number of species and islands with dplyr::count()
penguins %>%
count(species, island) %>%
rename(count = n)
## # A tibble: 5 x 3
## species island count
## <fct> <fct> <int>
## 1 Adelie Biscoe 44
## 2 Adelie Dream 56
## 3 Adelie Torgersen 52
## 4 Chinstrap Dream 68
## 5 Gentoo Biscoe 124
Visualize the distribution of body_mass_g with ggplot
penguins %>%
ggplot(aes(body_mass_g)) +
geom_histogram(bins = 30) +
theme_bw()
Look at the correlation between
body_mass_g and some of the other variables
ggplot(penguins, aes(body_mass_g, flipper_length_mm)) +
geom_point() +
ggtitle("body_mass_g vs flipper_length_mm")
ggplot(penguins, aes(body_mass_g, bill_depth_mm)) +
geom_point() +
ggtitle("body_mass_g vs bill_depth_mm")
## Warning: Removed 2 rows containing missing values (geom_point).
Fit a linear model using parsnip to model body_mass_g
lm_spec <- linear_reg() %>%
set_engine("lm")
lm_fit <- lm_spec %>%
fit(body_mass_g ~ species + island + bill_length_mm + bill_depth_mm + flipper_length_mm,
data = penguins)
lm_fit
## parsnip model object
##
## Fit time: 4ms
##
## Call:
## stats::lm(formula = body_mass_g ~ species + island + bill_length_mm +
## bill_depth_mm + flipper_length_mm, data = data)
##
## Coefficients:
## (Intercept) speciesChinstrap speciesGentoo islandDream
## -4353.133 -530.857 906.293 -4.306
## islandTorgersen bill_length_mm bill_depth_mm flipper_length_mm
## -59.852 41.310 140.252 20.531
Get parameter estimates:
tidy(lm_fit)
## # A tibble: 8 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -4353. 497. -8.77 9.45e-17
## 2 speciesChinstrap -531. 89.7 -5.92 8.19e- 9
## 3 speciesGentoo 906. 149. 6.09 3.06e- 9
## 4 islandDream -4.31 64.0 -0.0673 9.46e- 1
## 5 islandTorgersen -59.9 65.5 -0.914 3.61e- 1
## 6 bill_length_mm 41.3 7.18 5.75 2.01e- 8
## 7 bill_depth_mm 140. 19.0 7.38 1.28e-12
## 8 flipper_length_mm 20.5 3.13 6.56 2.03e-10