R Code Notes

Introduction

This R Markdown report was created for self-study and code review. The original contents can be found on Sylvia’s GitHub.

Packages used

dplyr
janitor
palmerpenguins
skimr
tidyverse

Part 1: Concepts

Functions and variables
Vectors and data frames

Part 2: Basic

Install and load packages
Prepare data
Inspect data
Organize data
Manipulate data
Nested query and pipe
Logical operator and conditional statement
Visualize data
Save the visuals

Part 3: Tips

Do not assume
Check for bias

Part 1: Concepts

1-1. Functions and variables

function: a body of reusable code that performs a specific task. Type the function name, then add its arguments within the parentheses (). Example:

print("It's a beautiful day!")

## [1] "It's a beautiful day!"

variable: a name for a value that is stored for later use. Variables are also called objects. Define one by writing the variable name, followed by <- and a value or a function call.

var_x <- "this is variable"
var_y <- 123.45

A calculation can be performed with variable as below:

var_y - 23.45

## [1] 100

Look up the documentation of a function by adding ? before the function name:

?print

Or browse the vignettes of a package to find out more about it:

browseVignettes("tidyverse")

1-2. Vectors and data frames

vector: a group of data elements of the same type stored in a sequence. Once a vector is created, typing its name prints its elements.

vec_1 <- c(12,34,56,78.9)
vec_2 <- c(1L, 5L, 10L)
vec_3 <- c("Anna", "Beta", "Cera", "Delta")

Now, let’s print one of the vectors (vec_3):

vec_3

## [1] "Anna"  "Beta"  "Cera"  "Delta"

list: similar to a vector, but its elements can be of different data types.

list_1 <- list("a", 1L, 3.5, FALSE)
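To make the difference between a vector and a list concrete, here is a minimal sketch. The names vec_mixed and list_mixed are hypothetical and only used for illustration. It shows that c() coerces mixed input to one common type, while list() keeps the type of each element.

```r
# c() coerces mixed input to a single common type (here: character)
vec_mixed <- c(1L, 2.5, "a")
class(vec_mixed)

# list() keeps the original type of every element
list_mixed <- list(1L, 2.5, "a")
sapply(list_mixed, class)
```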

Other operations can be performed on vectors, including calculating the length of a vector:

length(vec_3)

## [1] 4

Names can also be assigned to the elements of a vector:

names(vec_1) <- c("spring", "summer", "fall", "winter")
print(vec_1)

## spring summer   fall winter 
##   12.0   34.0   56.0   78.9

The same can be done for a list:

list1 <- list('x-axis'=1, 'y-axis'=2, 'z-axis'=3)
print(list1)

## $`x-axis`
## [1] 1
## 
## $`y-axis`
## [1] 2
## 
## $`z-axis`
## [1] 3
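Once elements have names, they can be looked up by name. The sketch below rebuilds the named vector and list from above and shows a few common ways to access a single element; it is an illustration, not the only way to do it.

```r
vec_1 <- c(12, 34, 56, 78.9)
names(vec_1) <- c("spring", "summer", "fall", "winter")
list1 <- list('x-axis' = 1, 'y-axis' = 2, 'z-axis' = 3)

vec_1["summer"]      # single brackets keep the name
vec_1[["summer"]]    # double brackets return the bare value
list1[["y-axis"]]    # list element by name
list1$`z-axis`       # backticks are needed because of the hyphen
```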

data frame: a collection of columns of equal length, similar to a spreadsheet table, typically imported from an external source. Example below:

df_1 <- data.frame(city=c("NY", "SF", "CO"), days=c(2.4, 4.4, 5.1), rank=c(2,1,3))

Alternatively, the data frame can be built step by step by first defining the variables:

city <- c("NY", "SF", "CO")
days <- c(2.4, 4.4, 5.1)
rank <- c(2,1,3)
df_1 <- data.frame(city,days,rank)

Now, let’s print df_1 to see the result:

df_1

##   city days rank
## 1   NY  2.4    2
## 2   SF  4.4    1
## 3   CO  5.1    3
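A few base R operations are useful right after building a data frame. The following sketch recreates df_1 and shows how to check its size, pull out one column, and pick rows by a condition.

```r
df_1 <- data.frame(city = c("NY", "SF", "CO"),
                   days = c(2.4, 4.4, 5.1),
                   rank = c(2, 1, 3))

dim(df_1)                 # number of rows, number of columns
df_1$days                 # one column as a vector
df_1[df_1$rank == 1, ]    # rows where rank equals 1
mean(df_1$days)           # a calculation on one column
```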

matrix: a two-dimensional collection of data elements, containing a single data type.

matrix(c(3:10), nrow=2)

##      [,1] [,2] [,3] [,4]
## [1,]    3    5    7    9
## [2,]    4    6    8   10

matrix(c(3:10), ncol=2)

##      [,1] [,2]
## [1,]    3    7
## [2,]    4    8
## [3,]    5    9
## [4,]    6   10

Note that matrix(c(3:9), nrow=2) will give a warning, as 7 elements is not a multiple of 2 rows. R still builds the matrix and recycles the first value to fill the last cell.
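To see exactly what R reports for matrix(c(3:9), nrow=2), the sketch below captures the signalled condition instead of guessing. In current R versions this is a warning rather than an error, and the matrix is still created with recycled values. The variable names msg and m are hypothetical.

```r
# Capture whatever condition R signals for the mismatched length
msg <- tryCatch(
  {
    matrix(c(3:9), nrow = 2)
    "no condition"
  },
  warning = function(w) paste("warning:", conditionMessage(w)),
  error   = function(e) paste("error:", conditionMessage(e))
)
print(msg)

# The matrix is still built; the last cell is filled by recycling the first value
m <- suppressWarnings(matrix(c(3:9), nrow = 2))
print(m)
```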

Part 2: Basic

2-1. Install and load packages

Check which packages are already installed. If a package is missing, install it before use.

installed.packages()
install.packages("palmerpenguins")

Load the package and/or dataset before using it. Nothing will show up in the console:

library(palmerpenguins) 
data(penguins)
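Reinstalling a package on every run is slow, so a common pattern is to install only when the package is missing. Below is a minimal sketch of that idea; needs_install is a hypothetical helper name, not part of any package.

```r
# Hypothetical helper: TRUE when a package is not installed yet
needs_install <- function(pkg) {
  !requireNamespace(pkg, quietly = TRUE)
}

pkg <- "palmerpenguins"
if (needs_install(pkg)) {
  install.packages(pkg)   # only runs when the package is missing
}

library(palmerpenguins)
data(penguins)
```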

2-2. Prepare data

Import a csv file and select certain columns (read_csv() comes from readr, which is loaded with tidyverse):

library(tidyverse)
csv_file <- read_csv("test.csv")
select(csv_file, column1, column2, column3)

Import an Excel file, list its sheets and read a specific sheet (these functions come from the readxl package):

library(readxl)
read_excel("test.xls")
excel_sheets("test.xls")
read_excel("test.xls", sheet="sales")
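The import steps above depend on files that may not exist on your machine. As a self-contained sketch, the code below first writes a small throwaway csv file to a temporary location and then reads it back. The column names and values are made up for illustration.

```r
library(readr)   # read_csv(), write_csv()
library(dplyr)   # select()

# Write a small throwaway csv so the example does not depend on test.csv
tmp_csv <- tempfile(fileext = ".csv")
write_csv(data.frame(column1 = 1:3,
                     column2 = c("a", "b", "c"),
                     column3 = c(2.5, 3.5, 4.5)),
          tmp_csv)

# Read it back and keep two of the three columns
csv_file <- read_csv(tmp_csv, show_col_types = FALSE)
selected <- select(csv_file, column1, column3)
print(selected)
```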

2-3. Inspect data

View() displays the data in a reader-friendly table format in a separate tab:

View(penguins)

A summary of the data can also be reviewed with the functions below.

head(penguins)
colnames(penguins)
glimpse(penguins)
str(penguins)

As an example, str() has been run as below:

str(penguins)

## tibble [344 x 8] (S3: tbl_df/tbl/data.frame)
##  $ species          : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ island           : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ bill_length_mm   : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
##  $ bill_depth_mm    : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
##  $ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
##  $ body_mass_g      : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
##  $ sex              : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
##  $ year             : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...

Summary statistics can also be checked:

skim_without_charts(penguins) 
summary(penguins)

Example of summary():

summary(penguins)

##       species          island    bill_length_mm  bill_depth_mm  
##  Adelie   :152   Biscoe   :168   Min.   :32.10   Min.   :13.10  
##  Chinstrap: 68   Dream    :124   1st Qu.:39.23   1st Qu.:15.60  
##  Gentoo   :124   Torgersen: 52   Median :44.45   Median :17.30  
##                                  Mean   :43.92   Mean   :17.15  
##                                  3rd Qu.:48.50   3rd Qu.:18.70  
##                                  Max.   :59.60   Max.   :21.50  
##                                  NA's   :2       NA's   :2      
##  flipper_length_mm  body_mass_g       sex           year     
##  Min.   :172.0     Min.   :2700   female:165   Min.   :2007  
##  1st Qu.:190.0     1st Qu.:3550   male  :168   1st Qu.:2007  
##  Median :197.0     Median :4050   NA's  : 11   Median :2008  
##  Mean   :200.9     Mean   :4202                Mean   :2008  
##  3rd Qu.:213.0     3rd Qu.:4750                3rd Qu.:2009  
##  Max.   :231.0     Max.   :6300                Max.   :2009  
##  NA's   :2         NA's   :2

Inspect specific column(s):

select(penguins, species)

Adding a - sign before a column name returns all columns except that column. The example below is the same as select(penguins, -species):

penguins %>%
  select(-species)

## # A tibble: 344 x 7
##    island  bill_length_mm bill_depth_mm flipper_length_~ body_mass_g sex    year
##    <fct>            <dbl>         <dbl>            <int>       <int> <fct> <int>
##  1 Torger~           39.1          18.7              181        3750 male   2007
##  2 Torger~           39.5          17.4              186        3800 fema~  2007
##  3 Torger~           40.3          18                195        3250 fema~  2007
##  4 Torger~           NA            NA                 NA          NA <NA>   2007
##  5 Torger~           36.7          19.3              193        3450 fema~  2007
##  6 Torger~           39.3          20.6              190        3650 male   2007
##  7 Torger~           38.9          17.8              181        3625 fema~  2007
##  8 Torger~           39.2          19.6              195        4675 male   2007
##  9 Torger~           34.1          18.1              193        3475 <NA>   2007
## 10 Torger~           42            20.2              190        4250 <NA>   2007
## # ... with 334 more rows
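The - sign also works for several columns at once. A minimal sketch, assuming the dplyr and palmerpenguins packages are installed; the name dropped is hypothetical:

```r
library(dplyr)
library(palmerpenguins)

# Drop two columns in one call; all other columns are kept
dropped <- penguins %>%
  select(-species, -island)

colnames(dropped)
```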

2-4. Organize data

Sort the data with arrange(). To sort in descending order instead of the default ascending order, wrap the column in desc() or add a - sign before a numeric column:

arrange(penguins, bill_length_mm)
arrange(penguins, desc(bill_length_mm))
penguins %>% arrange(bill_length_mm)
penguins %>% arrange(-bill_length_mm)

Filter values:

filter(penguins, species=='Gentoo')

Example of using arrange and filter functions:

penguins %>% 
  filter(species=="Gentoo") %>%
  select(species, island, body_mass_g) %>%
  arrange(-body_mass_g)

## # A tibble: 124 x 3
##    species island body_mass_g
##    <fct>   <fct>        <int>
##  1 Gentoo  Biscoe        6300
##  2 Gentoo  Biscoe        6050
##  3 Gentoo  Biscoe        6000
##  4 Gentoo  Biscoe        6000
##  5 Gentoo  Biscoe        5950
##  6 Gentoo  Biscoe        5950
##  7 Gentoo  Biscoe        5850
##  8 Gentoo  Biscoe        5850
##  9 Gentoo  Biscoe        5850
## 10 Gentoo  Biscoe        5800
## # ... with 114 more rows

Find max, min, mean. If the column contains NA, the result will show as NA unless na.rm=TRUE is passed:

min(penguins$year)
max(penguins$year)
mean(penguins$year)
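The effect of NA values is easy to check on a small vector. The sketch below uses a hypothetical vector body_mass with one missing value and compares the result with and without na.rm=TRUE.

```r
# Hypothetical vector with one missing value
body_mass <- c(3750, 3800, NA, 3450)

mean(body_mass)                  # NA, because one element is missing
mean(body_mass, na.rm = TRUE)    # mean of the three known values
max(body_mass, na.rm = TRUE)     # largest known value
```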

Group data for summary statistics. The example below illustrates the process: grouping the data -> removing rows with NA values -> assigning new column names -> summarizing.

penguins %>% 
  group_by(species, island) %>% drop_na() %>%
  summarize(mean_bl=mean(bill_length_mm),
            max_bl=max(bill_length_mm))

## `summarise()` has grouped output by 'species'. You can override using the `.groups` argument.

## # A tibble: 5 x 4
## # Groups:   species [3]
##   species   island    mean_bl max_bl
##   <fct>     <fct>       <dbl>  <dbl>
## 1 Adelie    Biscoe       39.0   45.6
## 2 Adelie    Dream        38.5   44.1
## 3 Adelie    Torgersen    39.0   46  
## 4 Chinstrap Dream        48.8   58  
## 5 Gentoo    Biscoe       47.6   59.6
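Two variations on the grouping example may be useful. drop_na() removes every row that has an NA in any column, while na.rm=TRUE inside the summary function only ignores missing values in that one column. Setting .groups="drop" returns an ungrouped result and silences the message printed by summarise(). The sketch below uses a small made-up tibble named toy in place of penguins.

```r
library(dplyr)

# Hypothetical toy data standing in for penguins
toy <- tibble(
  species = c("A", "A", "B", "B", "B"),
  island  = c("X", "X", "X", "Y", "Y"),
  bill    = c(39, 41, 48, NA, 50)
)

toy_summary <- toy %>%
  group_by(species, island) %>%
  summarize(n_rows  = n(),                        # rows per group, NA rows included
            mean_bl = mean(bill, na.rm = TRUE),   # NA ignored for this column only
            .groups = "drop")                     # return an ungrouped tibble
print(toy_summary)
```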

Save cleaned data frame:

cleaned_penguins <- penguins %>% arrange(bill_length_mm)
cleaned2_penguins <- penguins %>% select(island, species)

2-5. Manipulate data

Rename column or variable: rename(dataset, new_name=old_name):

rename(penguins, weight=body_mass_g)

Convert all column names to upper/lower case:

rename_with(penguins,toupper)
rename_with(penguins,tolower)

Clean column names so that they contain only characters, numbers and _ (clean_names() comes from the janitor package):

clean_names(penguins)
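The penguins column names are already clean, so the effect of clean_names() is easier to see on messy names. A minimal sketch, assuming the janitor package is installed; the data frame messy and its column names are made up:

```r
library(janitor)

# Hypothetical data frame with spaces, brackets and hyphens in the column names
messy <- data.frame(a = 1:2, b = 3:4, c = 5:6)
names(messy) <- c("Body Mass (g)", "Flipper-Length", "island name")

cleaned <- clean_names(messy)
colnames(cleaned)   # compare with names(messy)
```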

Combine columns:

unite(data_set, 
      'new_column_name', 
      column1_to_unite, 
      column2_to_unite, 
      sep=' ,')

Applying as below:

example <- bookings_data %>%
  select(arrival_date_year, arrival_date_month) %>%
  unite(arrival_year_month, c("arrival_date_year",
                              "arrival_date_month"), sep = " ,")

Separate column:

separate(data_set, 
         column_to_separate, 
         into=c('column1', 'column2'), 
         sep= ' ')
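The unite and separate templates above can be tried end to end on a small data frame. The sketch below assumes the tidyr package is installed; the data frame bookings and its values are hypothetical and only mimic the booking columns used earlier.

```r
library(tidyr)

# Hypothetical booking data
bookings <- data.frame(arrival_year  = c(2015, 2016),
                       arrival_month = c("July", "August"))

# Combine two columns into one
united <- unite(bookings, "arrival_year_month",
                arrival_year, arrival_month, sep = ", ")
print(united)

# Split the combined column back into two (both come back as character)
separated <- separate(united, arrival_year_month,
                      into = c("arrival_year", "arrival_month"), sep = ", ")
print(separated)
```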

Add column: mutate(dataset, new_column=expression)

mutate(penguins, body_mass_kg=body_mass_g/1000)

Example below:

penguins %>%
  mutate(body_mass_kg=body_mass_g/1000) %>%
  select(species, island, body_mass_kg) %>%
  arrange(-body_mass_kg)

## # A tibble: 344 x 3
##    species island body_mass_kg
##    <fct>   <fct>         <dbl>
##  1 Gentoo  Biscoe         6.3 
##  2 Gentoo  Biscoe         6.05
##  3 Gentoo  Biscoe         6   
##  4 Gentoo  Biscoe         6   
##  5 Gentoo  Biscoe         5.95
##  6 Gentoo  Biscoe         5.95
##  7 Gentoo  Biscoe         5.85
##  8 Gentoo  Biscoe         5.85
##  9 Gentoo  Biscoe         5.85
## 10 Gentoo  Biscoe         5.8 
## # ... with 334 more rows

Summary statistics: summarize(dataset, col_name=mean(col))

penguins %>%
  drop_na() %>%
  summarize(avg_g=mean(body_mass_g), sum_g=sum(body_mass_g))

2-6. Nested query and pipe

Nested query can be written as below:

arrange(filter(ToothGrowth, dose==0.5), len)

Using pipe, process step is clearer and less cluttered:

filtered_toothgrowth <- ToothGrowth %>%
  filter(dose==0.5) %>%
  arrange(len)
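The nested and piped forms are two spellings of the same computation. The sketch below checks this with identical() and also shows the base R pipe |>, which is available from R 4.1 onward. It assumes dplyr is installed; ToothGrowth ships with R.

```r
library(dplyr)

nested <- arrange(filter(ToothGrowth, dose == 0.5), len)
piped  <- ToothGrowth %>% filter(dose == 0.5) %>% arrange(len)
native <- ToothGrowth |> filter(dose == 0.5) |> arrange(len)   # base R pipe (R >= 4.1)

identical(nested, piped)
identical(piped, native)
```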

2-7. Conditional statement

Logical operators: AND (&), OR (|), NOT (!)

x <- 10

x<12 & x>11      # false as 10 is not bigger than 11.

## [1] FALSE

x<12 | x>11      # true, less than 12 or bigger than 11.

## [1] TRUE

!x>11            # true, not bigger than 11.

## [1] TRUE

!(x>15 | x<5)    # true, not bigger than 15 or less than 5

## [1] TRUE
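The same operators work element by element on a whole vector, which is what makes them useful for filtering. A short sketch with a hypothetical vector x_vec:

```r
x_vec <- c(4, 10, 16)

x_vec > 5 & x_vec < 15         # TRUE only where both conditions hold
!(x_vec > 15 | x_vec < 5)      # TRUE where neither condition holds
x_vec[x_vec > 5 & x_vec < 15]  # keep only the matching elements
```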

Conditional statement: if (condition) {then} else {otherwise}

x <- 5

if (x > 0) {
  print("x is a positive number")
} else {
  print("x is zero or a negative number")
}

## [1] "x is a positive number"

y <- 1982

if (y > 1990) {
  print("Group1")
} else if (y > 1980) {
  print("Group2")
} else {
  print("Group3")
}

## [1] "Group2"
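if/else checks a single value. To apply the same grouping rule to every element of a vector, base R offers ifelse(). The sketch below uses a hypothetical vector years and nests two ifelse() calls to reproduce the three groups.

```r
years <- c(1995, 1982, 1975)

# Nested ifelse() applies the rule to each element in turn
groups <- ifelse(years > 1990, "Group1",
                 ifelse(years > 1980, "Group2", "Group3"))
print(groups)
```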

2-8. Visualize data

The general pattern is ggplot(data)+geom_xxx(mapping=aes(...)), where geom_xxx stands for a geom function: geom_bar will display a bar chart, geom_point will display a scatter plot. See below:

ggplot(data=penguins)+ 
  geom_point(mapping=aes(x=flipper_length_mm, y=body_mass_g))

The same chart can also be written with the mapping inside ggplot():

ggplot(data=penguins, 
       mapping=aes(x=flipper_length_mm,y=body_mass_g))+
  geom_point()

Enhance the chart by mapping a variable to color, size, shape or alpha (transparency):

ggplot(data=penguins)+
  geom_point(mapping=aes(x=flipper_length_mm,
                        y=body_mass_g,
                        color=species,
                        shape=species))

See the chart with alpha aes:

ggplot(data=penguins)+
  geom_point(mapping=aes(x=flipper_length_mm,
                         y=body_mass_g,
                         alpha=species))

Below are alternative ways of writing the code, kept for reference:

ggplot(data=penguins, aes(x=flipper_length_mm,y=body_mass_g,color=species))+
  geom_point()

ggplot(data=penguins, aes(x=flipper_length_mm,y=body_mass_g,color=species))+
  geom_point(color="purple")

ggplot(data=penguins, aes(x=flipper_length_mm,y=body_mass_g,color=species))+
  geom_point()+
  facet_wrap(~species)

If one color needs to be applied to the entire chart instead of being mapped to a variable, set it outside aes():

ggplot(data=penguins)+
  geom_point(mapping=aes(x=flipper_length_mm,y=body_mass_g),
             color="purple")

Different chart types can be created by changing geom. For example, smooth line as below:

ggplot(data=penguins)+
  geom_smooth(mapping=aes(x=flipper_length_mm,y=body_mass_g))

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Bar chart below. If color is used instead of fill, only the outline will be colored:

ggplot(data=penguins)+
  geom_bar(mapping=aes(x=species,fill=species))

Stacked bar chart:

ggplot(data=penguins)+
  geom_bar(mapping=aes(x=species,fill=island))

Examine the relationship between the trend line and the data points by adding two geoms:

ggplot(data=penguins)+
  geom_smooth(mapping=aes(x=flipper_length_mm, y=body_mass_g))+
  geom_point(mapping=aes(x=flipper_length_mm, y=body_mass_g))

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

There are also different types of smoothing lines. For reference:

ggplot(data=penguins,aes(x=flipper_length_mm,y=body_mass_g))+
  geom_smooth(method="loess")                                       #Loess smoothing
  
ggplot(data=penguins,aes(x=flipper_length_mm,y=body_mass_g))+
  geom_smooth(method="gam",formula=y~s(x))                          #Gam smoothing

Facet functions split the chart into one panel per group, which is handy for focusing on specific subsets of the data:

ggplot(data=penguins)+
  geom_point(mapping=aes(x=flipper_length_mm, y=body_mass_g))+
  facet_wrap(~species)

Use facet_grid to facet by two variables:

ggplot(data=penguins)+
  geom_point(mapping=aes(x=flipper_length_mm,y=body_mass_g,color=species))+
  facet_grid(sex~species)

Other functions, such as filter(), can be piped into ggplot() as usual:

penguins %>%
  filter(island=="Biscoe") %>%
  ggplot(mapping=aes(x=flipper_length_mm,y=body_mass_g,color=species))+
  geom_point()

Adding labels and annotations will help readers understand the data:

ggplot(data=penguins)+
  geom_point(mapping=aes(x=flipper_length_mm,y=body_mass_g, color=species))+
  labs(title="Palmer Penguins: Relationship between Body Mass vs Flipper Length",
       subtitle="Body Mass (gram) / Flipper Length (mm) ",
       caption="R Package used: palmerpenguins")+
  annotate("text", x=220,y=4000,label="positive relationship observed",
           angle=45, fontface="bold", size=3)

Label can be added using variables too:

minamount <- min(dataframe$column)
maxamount <- max(dataframe$column)
  
labs(caption=paste0("The minimum is ",minamount," and the maximum is ", maxamount))

Add x/y axis header title if missing:

labs(x="X-axis name", y="Y-axis name")

Rotate x-axis headers:

theme(axis.text.x=element_text(angle=45))
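The label, axis and theme fragments above are meant to be added to an existing plot with +. The sketch below puts them together on the penguins scatter plot. It assumes ggplot2 and palmerpenguins are installed; the names min_mass, max_mass, caption_text and p_labeled are hypothetical.

```r
library(ggplot2)
library(palmerpenguins)

# na.rm = TRUE is needed because body_mass_g contains missing values
min_mass <- min(penguins$body_mass_g, na.rm = TRUE)
max_mass <- max(penguins$body_mass_g, na.rm = TRUE)
caption_text <- paste0("The minimum is ", min_mass,
                       " and the maximum is ", max_mass)

p_labeled <- ggplot(data = penguins) +
  geom_point(mapping = aes(x = flipper_length_mm, y = body_mass_g)) +
  labs(x = "Flipper length (mm)",       # x-axis title
       y = "Body mass (g)",             # y-axis title
       caption = caption_text) +        # caption built from variables
  theme(axis.text.x = element_text(angle = 45))   # rotate x-axis labels

print(caption_text)
```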

2-9. Save the visuals

ggsave will save the latest visual:

ggsave("test_visuals.png", width=5, height=5)
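Relying on the latest visual can save the wrong chart when several plots were drawn. As a sketch of a more explicit approach, the plot can be stored in a variable and passed to ggsave() through its plot argument. The file is written to a temporary directory here so that nothing in the working directory is overwritten; p and out_file are hypothetical names.

```r
library(ggplot2)
library(palmerpenguins)

# Store the plot in a variable instead of relying on the last plot drawn
p <- ggplot(data = penguins) +
  geom_point(mapping = aes(x = flipper_length_mm, y = body_mass_g))

out_file <- file.path(tempdir(), "test_visuals.png")
ggsave(out_file, plot = p, width = 5, height = 5)   # width and height are in inches by default

file.exists(out_file)
```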

Part 3: Tips

3-1. Do not assume

Summary statistics may suggest that the groups below are nearly identical:

library('Tmisc')
data(quartet)
quartet %>%
  group_by(set) %>%
  summarize(mean(x),sd(x),mean(y),sd(y),cor(x,y))

## # A tibble: 4 x 6
##   set   `mean(x)` `sd(x)` `mean(y)` `sd(y)` `cor(x, y)`
##   <fct>     <dbl>   <dbl>     <dbl>   <dbl>       <dbl>
## 1 I             9    3.32      7.50    2.03       0.816
## 2 II            9    3.32      7.50    2.03       0.816
## 3 III           9    3.32      7.5     2.03       0.816
## 4 IV            9    3.32      7.50    2.03       0.817

But visualizing them will reveal the difference between the grouped data:

ggplot(quartet,aes(x,y))+
  geom_point()+
  geom_smooth(method=lm,se=FALSE)+
  facet_wrap(~set)

## `geom_smooth()` using formula 'y ~ x'
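If the Tmisc package is not available, a similar check can be sketched with the anscombe dataset that ships with R, which stores Anscombe's four sets as columns x1 to x4 and y1 to y4. This is an alternative illustration, not the code used above; the name quartet_stats is hypothetical.

```r
# Compute the same summary statistics for each of the four sets
quartet_stats <- sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  c(mean_x = mean(x), sd_x = sd(x),
    mean_y = mean(y), sd_y = sd(y),
    cor_xy = cor(x, y))
})
colnames(quartet_stats) <- c("I", "II", "III", "IV")

round(quartet_stats, 2)
```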

3-2. Check for bias

Package needed:

install.packages("SimDesign")
library("SimDesign")

Compare data:

actual_data <- c(10,20,30,40,50)
predicted_data <-c(8,14,22,39,45)
bias(actual_data,predicted_data)

## [1] 4.4

Another example:

actual_data2 <- c(10,20,30,40,50)
predicted_data2 <- c(12,24,39,47,55)
bias(actual_data2, predicted_data2)

## [1] -5.4

The closer the result is to 0, the less biased the predictions are.
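The numbers above can be reproduced without the SimDesign package. The sketch below assumes that bias(), with its default settings, reports the average of the first argument minus the second, which is consistent with both outputs shown above. The name manual_bias is hypothetical.

```r
actual_data    <- c(10, 20, 30, 40, 50)
predicted_data <- c(8, 14, 22, 39, 45)

# Average difference between actual and predicted values
# (assumed to match what bias() reports by default)
manual_bias <- mean(actual_data - predicted_data)
print(manual_bias)
```

A positive value means the predictions are, on average, lower than the actual values; a negative value means they are higher.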

END

If you have any feedback, please do not hesitate to reach out at sylviahk416@gmail.com


created by Sylvia Kim

version 2022-02-05
