R Basics

General housekeeping items

Now that you have installed and set up RStudio, we are ready to begin familiarizing ourselves with R. There are several functions that are useful whenever we begin a new project. I refer to these as “housekeeping” items.

Let’s begin by loading a package with library() (install the package first if you have not already done so):

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.2     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Next, here is how you clear all objects from your environment:

rm(list=ls())
Caution

This will clear your environment, but it will not result in a “clean” R session (e.g., any packages you have loaded will remain loaded).
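A minimal sketch of what this caution means, using the splines package (which ships with R) as a stand-in for any loaded package. The object name demo_value is made up for illustration:

```r
# Create an object and load a package (splines is included with R)
demo_value <- 42
library(splines)

# Remove every object from the global environment
rm(list = ls())

# The object is gone...
exists("demo_value")              # FALSE

# ...but the package is still attached to the search path
"package:splines" %in% search()   # TRUE
```

For a truly fresh session, restart R (in RStudio: Session > Restart R).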

Next, let’s set a working directory (the local folder where you will access and store files). One way to do this is to go to Session > Set Working Directory > Choose Directory. Another is to call setwd() with the folder path. Note that the path must use “/” rather than “\” as the separator (alternatively, you can use “\\”). This is natural for Mac users, but Windows users, whose file paths use “\” by default, should be mindful of it:

setwd('C:/YOURWD')
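The sketch below illustrates why the slashes matter. Inside an R string, a single backslash starts an escape sequence, so a literal backslash has to be typed twice. The folder names are placeholders, and the object names are made up for illustration:

```r
# "\\" typed in code is stored as ONE backslash in the string
windows_style <- "C:\\YOURWD"
nchar(windows_style)               # 9

# Convert backslashes to forward slashes
forward_style <- gsub("\\", "/", windows_style, fixed = TRUE)
forward_style                      # "C:/YOURWD"

# file.path() joins folder names with "/" on every operating system
file.path("C:", "YOURWD", "data")  # "C:/YOURWD/data"
```

Building paths with file.path() avoids typing separators by hand.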

To check your current working directory you can use:

getwd()
[1] "C:/Users/qtswanquist/OneDrive/Documents/Teaching/Data Science in Accounting/Scripts/01 R Basics"
Note

This is how we will set up working directories in this course. However, there are other ways to manage working directories that are better for collaboration. Check out the here package or RStudio Projects for more information. We will not use these in this course.

Importing files

R can handle almost any file type you can think of. Let’s start simple and work with importing a file with “comma-separated values” (i.e., a csv file). Save the “Coozie Data” file from the course website into your working directory and run the following code (information about this dataset can be found here):

coozie_data <- read_csv('coozie_data.csv')
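If you do not have the course file handy, the round trip below is a sketch of the same import step on a throwaway file. It assumes the readr package (part of the tidyverse) is installed. The two rows are copied from the coozie data, and the object names are made up for illustration:

```r
library(readr)

# Write a tiny CSV to a temporary file
tmp_file <- tempfile(fileext = ".csv")
writeLines(c("time,control", "0,40.5", "5,46.8"), tmp_file)

# Confirm that the file exists before importing it
file.exists(tmp_file)   # TRUE

# Import the file; show_col_types = FALSE silences the column-type message
demo_data <- read_csv(tmp_file, show_col_types = FALSE)
demo_data
```

Note that read_csv() (with an underscore) comes from readr and returns a tibble, whereas base R's read.csv() (with a dot) returns a plain data frame.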

The tidyverse includes several example datasets for illustrative purposes (the “diamonds” dataset ships with the ggplot2 package). Let’s store the “diamonds” dataset in an object called “diamonds” so that it appears in our environment. Information about this dataset can be found here.

diamonds <- diamonds

Practice with mathematical symbols and operators

Now let’s familiarize ourselves with mathematical functions and operators:

#Addition
10 + 2
[1] 12
#Subtraction
10 - 2
[1] 8
#Multiplication
10 * 2
[1] 20
#Division
10 / 2
[1] 5
#Exponents
10 ^ 2
[1] 100
#Natural logarithm
log(10)
[1] 2.302585
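A few related operations are worth knowing early on. The sketch below uses only base R and shows operator precedence, integer division, and logarithms in other bases:

```r
# Precedence: exponents first, then multiplication/division, then addition
2 + 3 * 4 ^ 2        # 50
(2 + 3) * 4 ^ 2      # 80

# Integer division and remainder (modulo)
17 %/% 5             # 3
17 %% 5              # 2

# log() is the natural log by default; other bases are available
log(100, base = 10)  # 2
log10(100)           # 2
exp(log(10))         # 10

# Square root
sqrt(16)             # 4
```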

Logical operators:

#Equal to
10 == 2
[1] FALSE
#Not equal to
10 != 2
[1] TRUE
#NOT (! negates a logical value)
!FALSE
[1] TRUE
#Greater than
10 > 2
[1] TRUE
#Greater than or equal to
10 >= 2
[1] TRUE
#AND
(10 >= 2) & (10 <= 2)
[1] FALSE
#OR
(10 >= 2) | (10 <= 2)
[1] TRUE
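These operators also work element by element on vectors, which is how they are typically used with data. A short sketch in base R follows. The temperatures are the first three control readings from the coozie data, and the object name temps is made up for illustration:

```r
temps <- c(40.5, 46.8, 54.5)

# Comparisons return one TRUE/FALSE per element
temps > 45                  # FALSE  TRUE  TRUE
temps > 45 & temps < 50     # FALSE  TRUE FALSE

# any() and all() collapse a logical vector to a single value
any(temps > 50)             # TRUE
all(temps > 50)             # FALSE

# %in% tests membership in a set of values
"IF" %in% c("IF", "VVS1")   # TRUE

# Caution: == can surprise you with decimal arithmetic
0.1 + 0.2 == 0.3                    # FALSE
isTRUE(all.equal(0.1 + 0.2, 0.3))   # TRUE
```

The last two lines show why all.equal() is a safer choice than == when comparing computed decimals.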

Assignment:

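#The = operator assigns, but <- is the more common convention in R
a = 5
b <- 5
#The arrow can also point to the right
5 -> c
#c() combines values into a vector
d <- c(1, 2, 3)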
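Once values are stored, you can compute with them. The sketch below, in base R, works with a vector like d and also shows that an object named c does not stop the c() function from working, although reusing function names is best avoided:

```r
d <- c(1, 2, 3)

# Arithmetic applies to every element
d * 2        # 2 4 6

# Square brackets pick out elements by position (counting starts at 1)
d[2]         # 2

length(d)    # 3
mean(d)      # 2

# R still finds the c() function even when an object is named c
c <- 5
c(c, 10)     # 5 10
rm(c)        # remove the object to avoid confusion
```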

Wrangle and plot the “coozie data”

Take a look at the coozie dataset:

coozie_data
# A tibble: 13 × 4
    time control soft_coozie steel_coozie
   <dbl>   <dbl>       <dbl>        <dbl>
 1     0    40.5        41.4         40.6
 2     5    46.8        44.2         41.2
 3    10    54.5        47.3         41.5
 4    15    59.5        50.5         41.5
 5    20    63.9        53.6         42.3
 6    25    67.3        56.1         42.1
 7    30    69.8        58.5         42.4
 8    35    72.7        60.8         42.8
 9    40    74.3        62.8         43.3
10    45    75.7        65.1         43.9
11    50    76.8        66.1         44.8
12    55    77.5        68.5         45.3
13    60    78.1        70.3         46  

The coozie data is not “tidy” (a concept we will cover later in the course). We will want to reshape from wide to long and store in a new object:

coozie_data_long <- coozie_data %>%
  pivot_longer(-time, names_to = 'treatment', values_to = 'temperature')

Take a look at the new dataset:

coozie_data_long
# A tibble: 39 × 3
    time treatment    temperature
   <dbl> <chr>              <dbl>
 1     0 control             40.5
 2     0 soft_coozie         41.4
 3     0 steel_coozie        40.6
 4     5 control             46.8
 5     5 soft_coozie         44.2
 6     5 steel_coozie        41.2
 7    10 control             54.5
 8    10 soft_coozie         47.3
 9    10 steel_coozie        41.5
10    15 control             59.5
# ℹ 29 more rows
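The row count follows from the reshape: each of the 13 original rows contributes one row per treatment column, so 13 × 3 = 39. Here is a minimal sketch on the first two rows of the coozie data, assuming the tidyr and tibble packages (both part of the tidyverse) are installed. The object names are made up for illustration:

```r
library(tidyr)
library(tibble)

# First two rows of the coozie data, in wide form
wide <- tibble(
  time = c(0, 5),
  control = c(40.5, 46.8),
  soft_coozie = c(41.4, 44.2),
  steel_coozie = c(40.6, 41.2)
)

# Wide to long: every column except time becomes a treatment/temperature pair
long <- pivot_longer(wide, -time, names_to = "treatment", values_to = "temperature")
nrow(long)        # 6 (2 rows x 3 treatment columns)

# Long back to wide: pivot_wider() reverses the reshape
wide_again <- pivot_wider(long, names_from = treatment, values_from = temperature)
dim(wide_again)   # 2 4
```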

Export the dataset to your working directory (save as a csv file):

write_csv(coozie_data_long, "coozie_data_long.csv")

Then let’s plot!

ggplot(coozie_data_long, aes(x = time, y = temperature, color = treatment, shape = treatment)) +
  geom_line() +
  geom_point() +
  labs(
    title = 'Coozie Experiment', 
    x = 'Time (in Mins)', 
    y = 'Temperature (in F)', 
    color = 'Legend', 
    shape = 'Legend'
  )

Neat, huh!
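A plot can also be stored in an object and written to a file, much like a dataset. The sketch below is one way to do this with ggsave() from ggplot2. The three data points are control readings from the coozie data, while the object names, file name, and dimensions are assumptions chosen for illustration:

```r
library(ggplot2)

# A small stand-in dataset
demo_points <- data.frame(
  time = c(0, 5, 10),
  temperature = c(40.5, 46.8, 54.5)
)

# Store the plot in an object instead of printing it
demo_plot <- ggplot(demo_points, aes(x = time, y = temperature)) +
  geom_line() +
  geom_point()

# Save to a temporary folder; width and height are in inches by default
plot_file <- file.path(tempdir(), "coozie_demo.png")
ggsave(plot_file, plot = demo_plot, width = 6, height = 4)
```

To save into your working directory instead, pass a plain file name such as 'coozie_demo.png'.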

Explore and wrangle the diamonds data

Take a look at the dataset:

diamonds
# A tibble: 53,940 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39
# ℹ 53,930 more rows

Report the structure and summary statistics for the data:

str(diamonds)
tibble [53,940 × 10] (S3: tbl_df/tbl/data.frame)
 $ carat  : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
 $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
 $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
 $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
 $ depth  : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
 $ table  : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
 $ price  : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
 $ x      : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
 $ y      : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
 $ z      : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
summary(diamonds)
     carat               cut        color        clarity          depth      
 Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065   Min.   :43.00  
 1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258   1st Qu.:61.00  
 Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194   Median :61.80  
 Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171   Mean   :61.75  
 3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066   3rd Qu.:62.50  
 Max.   :5.0100                     I: 5422   VVS1   : 3655   Max.   :79.00  
                                    J: 2808   (Other): 2531                  
     table           price             x                y         
 Min.   :43.00   Min.   :  326   Min.   : 0.000   Min.   : 0.000  
 1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710   1st Qu.: 4.720  
 Median :57.00   Median : 2401   Median : 5.700   Median : 5.710  
 Mean   :57.46   Mean   : 3933   Mean   : 5.731   Mean   : 5.735  
 3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540   3rd Qu.: 6.540  
 Max.   :95.00   Max.   :18823   Max.   :10.740   Max.   :58.900  
                                                                  
       z         
 Min.   : 0.000  
 1st Qu.: 2.910  
 Median : 3.530  
 Mean   : 3.539  
 3rd Qu.: 4.040  
 Max.   :31.800  
                 

Create a new variable that captures “price per carat” - the “base R” way:

diamonds$price_carat <- diamonds$price / diamonds$carat

Create a new variable that captures “price per carat” - the “tidyverse” way:

diamonds <- diamonds %>%
  mutate(price_carat = price / carat)

Create a new variable that indicates “big diamonds” (those over 1 carat):

diamonds <- diamonds %>%
  mutate(big_diamond = carat > 1)
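To see that the two approaches agree, here is a sketch on a made-up three-row dataset (the values are hypothetical, not taken from the diamonds data). It assumes the dplyr package is installed and also shows that mutate() can create several variables in one call:

```r
library(dplyr)

# Hypothetical mini dataset for illustration
demo <- tibble(carat = c(0.5, 1.0, 1.5), price = c(1000, 5000, 9000))

# The "base R" way: one assignment per new column
demo_base <- demo
demo_base$price_carat <- demo_base$price / demo_base$carat
demo_base$big_diamond <- demo_base$carat > 1

# The "tidyverse" way: both columns in a single mutate()
demo_tidy <- demo %>%
  mutate(
    price_carat = price / carat,
    big_diamond = carat > 1
  )

demo_tidy$price_carat   # 2000 5000 6000
demo_tidy$big_diamond   # FALSE FALSE  TRUE
```

Note that a diamond of exactly 1 carat is not flagged, because the condition is strictly greater than 1.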

Report a frequency table for each unique value of clarity:

table(diamonds$clarity)

   I1   SI2   SI1   VS2   VS1  VVS2  VVS1    IF 
  741  9194 13065 12258  8171  5066  3655  1790 
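Counts are often easier to read as shares or in sorted order. The sketch below assumes ggplot2 (which provides the diamonds data) and dplyr are installed. The object name clarity_counts is made up for illustration:

```r
library(ggplot2)  # provides the diamonds dataset
library(dplyr)

clarity_counts <- table(diamonds$clarity)

# Share of diamonds in each clarity category
round(prop.table(clarity_counts), 3)

# Categories from most to least common
sort(clarity_counts, decreasing = TRUE)

# A "tidyverse" equivalent that returns a tibble
count(diamonds, clarity, sort = TRUE)
```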

Isolate only “flawless” diamonds - the “base R” way:

diamonds_data_flawless <- diamonds[diamonds$clarity == 'IF',]

Isolate only “flawless” diamonds - the “tidyverse” way:

diamonds_data_flawless <- diamonds %>%
  filter(clarity == 'IF')
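For the diamonds data both approaches return the same rows, because clarity has no missing values (the frequency table sums to all 53,940 rows). They differ when the column contains NA, as this sketch with a made-up three-row data frame shows. It assumes dplyr is installed:

```r
library(dplyr)

# Hypothetical data with one missing clarity value
demo <- data.frame(id = 1:3, clarity = c("IF", NA, "SI1"))

# Base R: the NA comparison comes back as an extra row filled with NA
base_result <- demo[demo$clarity == "IF", ]
nrow(base_result)   # 2

# dplyr: filter() keeps only rows where the condition is TRUE
tidy_result <- demo %>% filter(clarity == "IF")
nrow(tidy_result)   # 1

# Base R fix: which() drops the NA positions
base_fixed <- demo[which(demo$clarity == "IF"), ]
nrow(base_fixed)    # 1
```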

Code formatting example

It is important that you write clean and consistent code. The tidyverse has a style guide detailing best practices for code format and syntax. Not everyone follows the same format, but maintaining consistent organization and syntax makes it easier for others to follow.

Here is an example of well-formatted code that produces a scatterplot of flower sepal sizes from the iris dataset (information about this dataset can be found here):

iris %>%
  filter(Species != 'setosa') %>%
  ggplot(aes(x = Sepal.Width, y = Sepal.Length, color = Species)) +
  geom_point() +
  labs(
    x = 'Sepal Width (in cm)',
    y = 'Sepal Length (in cm)',
    title = 'Sepal Length vs. Width of Irises'
  )

Here is an example of poorly formatted code (that produces the exact same graphic):

iris %>% filter(Species!= 'setosa') |>
ggplot(aes(x = Sepal.Width, y = Sepal.Length, color = Species)) +geom_point() +
labs(x='Sepal Width (in cm)',y ="Sepal Length (in cm)",title ="Sepal Length vs. Width of Irises") 
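Among other inconsistencies, the poorly formatted version mixes two pipe operators: %>% (from the magrittr package, which is loaded with the tidyverse) and |> (the native pipe, available in R 4.1 and later). For simple chains like this one they behave the same way, which is why the mixed code still runs. A minimal sketch, assuming magrittr is installed and using a made-up vector:

```r
library(magrittr)

values <- c(1, 4, 9)

# magrittr pipe
values %>% sqrt() %>% sum()   # 6

# native pipe (R 4.1 or later)
values |> sqrt() |> sum()     # 6

# the same computation without a pipe
sum(sqrt(values))             # 6
```

Whichever pipe you choose, using it consistently makes code easier to follow.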

Asking for help

Typing ? before the name of a function (or package) will pull up its documentation:

?base
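A few other ways to reach the documentation, sketched with base R functions. The choice of mean() and filter() as examples is arbitrary:

```r
# Documentation for a single function
?mean

# help() does the same thing and takes the name as a string
help("mean")

# When two packages share a function name, add the package prefix
?stats::filter

# args() prints a function's arguments without opening the help page
args(mean)
```

The package prefix is handy for names such as filter() that exist in both stats and dplyr. ?dplyr::filter works the same way once dplyr is installed.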