Now that you have installed and set up RStudio, we are ready to begin familiarizing ourselves with R. There are several functions that are useful whenever we begin a new project. I refer to these as “housekeeping” items.
Let’s begin by loading a package with library() (after you have installed the relevant packages):
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.2 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.0
✔ ggplot2 3.4.2 ✔ tibble 3.2.1
✔ lubridate 1.9.2 ✔ tidyr 1.3.0
✔ purrr 1.0.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Next, here is how you clear all objects from your environment:
rm(list = ls())
Caution
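This will clear the objects in your environment, but it will not result in a “clean” R session (e.g., packages you have loaded will still be loaded). To start completely fresh, restart R (in RStudio: Session > Restart R).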
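To see the caution in action, here is a minimal sketch. The object names scratch_value and scratch_text are made up for illustration. Be aware that running it removes everything in your current environment, which is exactly the point.

```r
# Create two throwaway objects (hypothetical names, for illustration only).
scratch_value <- 42
scratch_text <- "hello"
ls()

# Remove every object in the current environment.
rm(list = ls())
ls()

# Packages are untouched: the search path still lists everything that was attached.
search()
```

After rm(), the second ls() no longer lists the two objects, while search() still shows entries such as "package:stats" and any packages you loaded with library().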
Next, let’s set a working directory (the local folder where you will access and store files). One way to do this is through the RStudio menu: Session > Set Working Directory > Choose Directory. Another is to call setwd() with the folder path. Note that the path needs to use “/” rather than “\”, because R treats a single backslash inside a quoted string as an escape character (alternatively, you can use “\\”). This is natural for Mac users, but Windows users should be mindful of it:
setwd('C:/YOURWD')
To check your current working directory you can use:
getwd()
[1] "C:/Users/qtswanquist/OneDrive/Documents/Teaching/Data Science in Accounting/Scripts/01 R Basics"
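If you want to try setwd() and getwd() without touching your real folders, here is a small sketch that uses R's per-session temporary folder as a stand-in for 'C:/YOURWD'. The names old_wd and scratch_dir are my own, not part of the course files.

```r
# Remember where we started so we can switch back afterwards.
old_wd <- getwd()

# tempdir() is a folder R creates for each session; it stands in for 'C:/YOURWD'.
scratch_dir <- normalizePath(tempdir(), winslash = "/")
setwd(scratch_dir)

# getwd() reports the new location using forward slashes.
getwd()

# A doubled backslash inside quotes is a single backslash in the actual string,
# so these two paths have the same number of characters.
nchar("C:\\YOURWD") == nchar("C:/YOURWD")

# Switch back to the original working directory.
setwd(old_wd)
```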
Note
This is how we will set up working directories in this course. However, there are other ways to manage working directories that are better for collaboration. Check out the here package or RStudio Projects for more information. We will not use these in this course.
Importing files
R can handle almost any file type you can think of. Let’s start simple and work with importing a file with “comma-separated values” (i.e., a csv file). Save the “Coozie Data” file from the course website into your working directory and run the following code (information about this dataset can be found here):
coozie_data <- read_csv('coozie_data.csv')
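If you do not have the course file handy, the following sketch shows the same read_csv() call on a tiny file that it writes itself. The file contents and the names demo_path and demo_data are invented for illustration; they are not the Coozie data.

```r
library(readr)

# Hypothetical three-column file written to a temporary location (not the course data).
demo_path <- tempfile(fileext = ".csv")
writeLines(c("id,label,value", "1,a,2.5", "2,b,4.0"), demo_path)

# show_col_types = FALSE only silences the column-type message.
demo_data <- read_csv(demo_path, show_col_types = FALSE)

# read_csv() returns a tibble and guesses a type for each column.
demo_data
names(demo_data)
```

The first line of the file becomes the column names, and each remaining line becomes a row.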
The tidyverse has several example datasets for illustrative purposes. Let’s store the “diamonds” dataset (which ships with ggplot2) in an object called “diamonds” so that it appears in your environment. Information about this dataset can be found here.
diamonds <- diamonds
Practice with mathematical symbols and operators
Now let’s familiarize ourselves with mathematical functions and operators:
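As a starting point, here is a short sketch of the most common operators and a few built-in functions. These are my own examples using base R only; the variable name total is made up.

```r
# Arithmetic operators
7 + 2      # addition
7 - 2      # subtraction
7 * 2      # multiplication
7 / 2      # division
7^2        # exponentiation
7 %% 2     # remainder after division: 1
7 %/% 2    # integer division: 3

# A few built-in mathematical functions
sqrt(16)                     # square root: 4
abs(-3)                      # absolute value: 3
round(3.14159, digits = 2)   # rounding to two decimals
log(100, base = 10)          # logarithm with a chosen base
exp(1)                       # e raised to the power 1

# Parentheses control the order of operations; <- stores the result.
total <- (3 + 4) * 2

# Comparison operators return TRUE or FALSE.
total == 14
total > 20
```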
# Assumes coozie_data_long is a long-format table with time, temperature, and treatment columns
ggplot(coozie_data_long, aes(x = time, y = temperature, color = treatment, shape = treatment)) +
  geom_line() +
  geom_point() +
  labs(
    title = 'Coozie Experiment',
    x = 'Time (in Mins)',
    y = 'Temperature (in F)',
    color = 'Legend',
    shape = 'Legend'
  )
Neat, huh!
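The plot relies on an object called coozie_data_long, which is not created anywhere in this section. I do not show the actual reshaping step here, so treat the following as a sketch under assumptions: it builds a made-up wide table (toy_wide, with invented column names and values) and reshapes it with pivot_longer() into the time / treatment / temperature layout that the plot expects.

```r
library(tibble)
library(tidyr)

# Hypothetical wide table: one row per time point, one column per treatment.
# The values are invented and are not the course data.
toy_wide <- tibble(
  time = c(0, 5, 10),
  coozie = c(40, 42, 44),
  no_coozie = c(40, 45, 50)
)

# Stack the two treatment columns into name/value pairs.
toy_long <- pivot_longer(
  toy_wide,
  cols = c(coozie, no_coozie),
  names_to = "treatment",
  values_to = "temperature"
)

toy_long
```

Each original row becomes two rows, one per treatment, which is what lets ggplot map treatment to color and shape.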
Explore and wrangle the diamonds data
Take a look at the dataset:
diamonds
# A tibble: 53,940 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
# ℹ 53,930 more rows
Report the structure and summary statistics for the data:
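One way to do this, sketched with functions I know exist in base R and dplyr: str() and glimpse() report the structure, and summary() reports summary statistics for every column. The name diamonds_dim is my own.

```r
library(ggplot2)  # provides the diamonds dataset
library(dplyr)    # provides glimpse()

# Structure: column names, types, and the first few values.
str(diamonds)
glimpse(diamonds)

# Summary statistics: quartiles and means for numeric columns, counts for factors.
summary(diamonds)

# Number of rows and columns.
diamonds_dim <- dim(diamonds)
diamonds_dim
```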
It is important that you write clean and consistent code. The tidyverse has a style guide detailing best practices for code format and syntax. Not everyone follows the same format, but maintaining consistent organization and syntax makes it easier for others to follow.
Here is an example of well-formatted code that produces a scatterplot of flower sepal sizes from the iris dataset (information about this dataset can be found here):
iris %>%
  filter(Species != 'setosa') %>%
  ggplot(aes(x = Sepal.Width, y = Sepal.Length, color = Species)) +
  geom_point() +
  labs(
    x = 'Sepal Width (in cm)',
    y = 'Sepal Length (in cm)',
    title = 'Sepal Length vs. Width of Irises'
  )
Here is an example of poorly formatted code (that produces the exact same graphic):
iris %>%filter(Species!='setosa') |>ggplot(aes(x = Sepal.Width, y = Sepal.Length, color = Species)) +geom_point() +labs(x='Sepal Width (in cm)',y ="Sepal Length (in cm)",title ="Sepal Length vs. Width of Irises")