Introduction

This R Markdown report was created for self-study and code review. The original contents can be found on Sylvia’s GitHub.

Packages used

  • dplyr
  • janitor
  • palmerpenguins
  • skimr
  • tidyverse

Table of contents

Part 1: Concepts

  1. Functions and variables
  2. Vectors and data frames

Part 2: Basic

  1. Install and load packages
  2. Prepare data
  3. Inspect data
  4. Organize data
  5. Manipulate data
  6. Nested query and pipe
  7. Logical operator and conditional statement
  8. Visualize data
  9. Save the visuals

Part 3: Tips

  1. Do not assume
  2. Check for bias

Part 1: Concepts

1-1. Functions and variables

function: a piece of code that performs a specific task. Type the function name and then add the arguments within the (). Example:

print("It's a beautiful day!")
## [1] "It's a beautiful day!"

variable: a value which can be stored for later use. Variables are also called objects. Write the variable name, followed by <- and the value or function call to store.

var_x <- "this is variable"
var_y <- 123.45

A calculation can be performed with variable as below:

var_y - 23.45
## [1] 100

To read the help page for a function, add ? before the function name:

?print

Or browse the vignettes to find out more about a package:

browseVignettes("tidyverse")

1-2. Vectors and data frames

vector: a group of data elements of the same type stored in a sequence. Printing a vector shows its elements in order.

vec_1 <- c(12,34,56,78.9)
vec_2 <- c(1L, 5L, 10L)
vec_3 <- c("Anna", "Beta", "Cera", "Delta")

Now, let’s print one of the vectors (vec_3):

vec_3
## [1] "Anna"  "Beta"  "Cera"  "Delta"

list: similar to a vector, but it can contain elements of different data types.

list_1 <- list("a", 1L, 3.5, FALSE)
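
As a minimal sketch of the difference, using base R only: c() coerces mixed values to one common type, while list() keeps the type of each element. The names mixed_vec and mixed_list are illustrative and do not appear in the original report.

```r
# c() coerces mixed inputs to a single common type (here: character)
mixed_vec <- c("a", 1L, 3.5, FALSE)
class(mixed_vec)

# list() keeps the original type of every element
mixed_list <- list("a", 1L, 3.5, FALSE)
sapply(mixed_list, class)
```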

Other operations can be performed on vectors, such as calculating the length of a vector:

length(vec_3)
## [1] 4

The elements of a vector can also be given names:

names(vec_1) <- c("spring", "summer", "fall", "winter")
print(vec_1)
## spring summer   fall winter 
##   12.0   34.0   56.0   78.9

The same can be done with a list.

list1 <- list('x-axis'=1, 'y-axis'=2, 'z-axis'=3)
print(list1)
## $`x-axis`
## [1] 1
## 
## $`y-axis`
## [1] 2
## 
## $`z-axis`
## [1] 3

data frame: a collection of columns of equal length, typically imported from another source. Example below:

df_1 <- data.frame(city=c("NY", "SF", "CO"), days=c(2.4, 4.4, 5.1), rank=c(2,1,3))

Alternatively, the data frame can be built step by step by first defining the variables:

city <- c("NY", "SF", "CO")
days <- c(2.4, 4.4, 5.1)
rank <- c(2,1,3)
df_1 <- data.frame(city,days,rank)

Now, let’s print df_1 to see the result.

df_1
##   city days rank
## 1   NY  2.4    2
## 2   SF  4.4    1
## 3   CO  5.1    3

matrix: a two-dimensional collection of data elements, all of a single data type.

matrix(c(3:10), nrow=2)
##      [,1] [,2] [,3] [,4]
## [1,]    3    5    7    9
## [2,]    4    6    8   10
matrix(c(3:10), ncol=2)
##      [,1] [,2]
## [1,]    3    7
## [2,]    4    8
## [3,]    5    9
## [4,]    6   10

Note that matrix(c(3:9), nrow=2) will give a warning, as 7 elements is not a multiple of 2. R still returns a 2 x 4 matrix, recycling the first value to fill the last cell.
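
A small base R sketch of the 7-element case, with the warning suppressed so that the resulting matrix can be inspected. The variable name m is illustrative.

```r
# 7 values cannot fill 2 rows evenly, so R warns and recycles from the start
m <- suppressWarnings(matrix(3:9, nrow = 2))
dim(m)     # number of rows and columns
m[2, 4]    # the last cell is filled by recycling the first value
```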


Part 2: Basic

2-1. Install and load packages

Check which packages are installed, and install any that are missing:

installed.packages()
install.packages("palmerpenguins")
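
The check can be narrowed to the packages this report needs. Below is a minimal sketch; missing_packages is a hypothetical helper written for this illustration, not a function from any package, and the install step is left commented out so nothing is installed by accident.

```r
# Hypothetical helper: return the entries of `pkgs` that are not installed
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

needed <- c("dplyr", "janitor", "palmerpenguins", "skimr", "tidyverse")
to_install <- missing_packages(needed)
print(to_install)

# Uncomment to install whatever is missing:
# if (length(to_install) > 0) install.packages(to_install)
```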

Load the package and/or dataset before using it. Nothing will show up in the console:

library(palmerpenguins) 
data(penguins)

2-2. Prepare data

Import a CSV file and select certain columns:

csv_file <- read_csv("test.csv")
select(csv_file, column1, column2, column3)

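Import an Excel file and read a specific sheet. These functions come from the readxl package, which library(tidyverse) does not load:

library(readxl)
read_excel("test.xls")
excel_sheets("test.xls")
read_excel("test.xls", sheet="sales")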

2-3. Inspect data

View() displays the data as a reader-friendly table in a separate tab:

View(penguins)  

The structure of the data can also be reviewed with the functions below.

head(penguins)
colnames(penguins)
glimpse(penguins)
str(penguins)

As an example, the output of str() is shown below:

str(penguins)
## tibble [344 x 8] (S3: tbl_df/tbl/data.frame)
##  $ species          : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ island           : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ bill_length_mm   : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
##  $ bill_depth_mm    : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
##  $ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
##  $ body_mass_g      : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
##  $ sex              : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
##  $ year             : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...

Summary statistics can also be checked:

skim_without_charts(penguins) 
summary(penguins)                  

Example of summary():

summary(penguins)                  
##       species          island    bill_length_mm  bill_depth_mm  
##  Adelie   :152   Biscoe   :168   Min.   :32.10   Min.   :13.10  
##  Chinstrap: 68   Dream    :124   1st Qu.:39.23   1st Qu.:15.60  
##  Gentoo   :124   Torgersen: 52   Median :44.45   Median :17.30  
##                                  Mean   :43.92   Mean   :17.15  
##                                  3rd Qu.:48.50   3rd Qu.:18.70  
##                                  Max.   :59.60   Max.   :21.50  
##                                  NA's   :2       NA's   :2      
##  flipper_length_mm  body_mass_g       sex           year     
##  Min.   :172.0     Min.   :2700   female:165   Min.   :2007  
##  1st Qu.:190.0     1st Qu.:3550   male  :168   1st Qu.:2007  
##  Median :197.0     Median :4050   NA's  : 11   Median :2008  
##  Mean   :200.9     Mean   :4202                Mean   :2008  
##  3rd Qu.:213.0     3rd Qu.:4750                3rd Qu.:2009  
##  Max.   :231.0     Max.   :6300                Max.   :2009  
##  NA's   :2         NA's   :2

Inspect specific column(s):

select(penguins, species) 

Adding a - sign before a column name displays all columns except that one. The example below is equivalent to select(penguins, -species):

penguins %>%
  select(-species)
## # A tibble: 344 x 7
##    island  bill_length_mm bill_depth_mm flipper_length_~ body_mass_g sex    year
##    <fct>            <dbl>         <dbl>            <int>       <int> <fct> <int>
##  1 Torger~           39.1          18.7              181        3750 male   2007
##  2 Torger~           39.5          17.4              186        3800 fema~  2007
##  3 Torger~           40.3          18                195        3250 fema~  2007
##  4 Torger~           NA            NA                 NA          NA <NA>   2007
##  5 Torger~           36.7          19.3              193        3450 fema~  2007
##  6 Torger~           39.3          20.6              190        3650 male   2007
##  7 Torger~           38.9          17.8              181        3625 fema~  2007
##  8 Torger~           39.2          19.6              195        4675 male   2007
##  9 Torger~           34.1          18.1              193        3475 <NA>   2007
## 10 Torger~           42            20.2              190        4250 <NA>   2007
## # ... with 334 more rows

2-4. Organize data

Sort the data. Wrapping the column in desc(), or adding a - sign before a numeric column, sorts in descending order instead of the default ascending order:

arrange(penguins, bill_length_mm)
arrange(penguins, desc(bill_length_mm))
penguins %>% arrange(bill_length_mm)
penguins %>% arrange(-bill_length_mm)
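
The two ways of sorting in descending order are not fully interchangeable. The sketch below uses a small made-up data frame (birds is not part of the original report) to show that desc() also works on text columns, where a minus sign cannot be applied.

```r
library(dplyr)

# Small made-up data frame, for illustration only
birds <- data.frame(name = c("b", "c", "a"), mass = c(2, 3, 1))

# desc() works for text columns, where a leading minus sign would fail
by_name <- arrange(birds, desc(name))
by_name$name

# For numeric columns, desc(mass) and -mass give the same order
by_mass_desc  <- arrange(birds, desc(mass))
by_mass_minus <- arrange(birds, -mass)
all.equal(by_mass_desc, by_mass_minus)
```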

Filter values:

filter(penguins, species=='Gentoo')

Example of using arrange and filter functions:

penguins %>% 
  filter(species=="Gentoo") %>%
  select(species, island, body_mass_g) %>%
  arrange(-body_mass_g)
## # A tibble: 124 x 3
##    species island body_mass_g
##    <fct>   <fct>        <int>
##  1 Gentoo  Biscoe        6300
##  2 Gentoo  Biscoe        6050
##  3 Gentoo  Biscoe        6000
##  4 Gentoo  Biscoe        6000
##  5 Gentoo  Biscoe        5950
##  6 Gentoo  Biscoe        5950
##  7 Gentoo  Biscoe        5850
##  8 Gentoo  Biscoe        5850
##  9 Gentoo  Biscoe        5850
## 10 Gentoo  Biscoe        5800
## # ... with 114 more rows

Find max, min, mean. If the column contains NA, the result will be NA unless na.rm=TRUE is passed:

min(penguins$year)
max(penguins$year)
mean(penguins$year)
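
The effect of missing values can be seen on a short made-up vector (masses is illustrative and stands in for a column that contains NA):

```r
masses <- c(3750, NA, 3250)

mean(masses)                 # NA, because one value is missing
mean(masses, na.rm = TRUE)   # 3500, computed from the two known values
max(masses, na.rm = TRUE)    # 3750
```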

Group data for summary statistics. The example below groups the data, removes rows containing NA values, then summarizes each group into newly named columns.

penguins %>% 
  group_by(species, island) %>% drop_na() %>%
  summarize(mean_bl=mean(bill_length_mm),
            max_bl=max(bill_length_mm))
## `summarise()` has grouped output by 'species'. You can override using the `.groups` argument.
## # A tibble: 5 x 4
## # Groups:   species [3]
##   species   island    mean_bl max_bl
##   <fct>     <fct>       <dbl>  <dbl>
## 1 Adelie    Biscoe       39.0   45.6
## 2 Adelie    Dream        38.5   44.1
## 3 Adelie    Torgersen    39.0   46  
## 4 Chinstrap Dream        48.8   58  
## 5 Gentoo    Biscoe       47.6   59.6
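
One thing to keep in mind is that drop_na() removes a row if any of its columns is NA, including columns the summary does not use. A possible alternative, sketched below on made-up data (birds and its columns are illustrative), is to pass na.rm=TRUE inside summarize(). The .groups="drop" argument returns an ungrouped result and avoids the message printed above.

```r
library(dplyr)

# Made-up measurements standing in for penguins (illustrative only).
# Every species-A row has an NA somewhere, so drop_na() would remove them all.
birds <- data.frame(
  species = c("A", "A", "B", "B"),
  bill    = c(40, NA, 50, 48),
  mass    = c(NA, 3500, 5000, 5200)
)

# na.rm = TRUE ignores missing bill values only, keeping species A
bill_summary <- birds %>%
  group_by(species) %>%
  summarize(mean_bill = mean(bill, na.rm = TRUE),
            max_bill  = max(bill, na.rm = TRUE),
            .groups   = "drop")
bill_summary
```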

Save cleaned data frame:

cleaned_penguins <- penguins %>% arrange(bill_length_mm)
cleaned2_penguins <- penguins %>% select(island, species)

2-5. Manipulate data

Rename a column: rename(dataset, new_name=old_name)

rename(penguins, weight=body_mass_g)

Change all column names to upper/lower case:

rename_with(penguins,toupper)
rename_with(penguins,tolower)

Clean column names so that they contain only letters, numbers and _ (clean_names() is from the janitor package):

clean_names(penguins)

Combine columns:

unite(data_set, 
      'new_column_name', 
      column1_to_unite, 
      column2_to_unite, 
      sep=' ,')

Applied to a data frame named bookings_data (assumed to be loaded already):

example <- bookings_data %>%
  select(arrival_date_year, arrival_date_month) %>%
  unite(arrival_year_month, c("arrival_date_year",
                              "arrival_date_month"), sep = " ,")

Separate column:

separate(data_set, 
         column_to_separate, 
         into=c('column1', 'column2'), 
         sep=' ')
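
The two templates can be tried on a small made-up data frame. In this sketch, bookings and its columns are illustrative and are not the bookings_data referred to earlier. Note that separate() returns the split pieces as character columns.

```r
library(tidyr)

# Made-up bookings, for illustration only
bookings <- data.frame(year = c(2015, 2016), month = c("July", "August"))

# Combine two columns into one, separated by a space
united <- unite(bookings, "year_month", year, month, sep = " ")
united$year_month

# Split the combined column back into two character columns
split_again <- separate(united, year_month,
                        into = c("year", "month"), sep = " ")
split_again
```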

Add column: mutate(dataset, new_column=expression)

mutate(penguins, body_mass_kg=body_mass_g/1000)

Example below:

penguins %>%
  mutate(body_mass_kg=body_mass_g/1000) %>%
  select(species, island, body_mass_kg) %>%
  arrange(-body_mass_kg)
## # A tibble: 344 x 3
##    species island body_mass_kg
##    <fct>   <fct>         <dbl>
##  1 Gentoo  Biscoe         6.3 
##  2 Gentoo  Biscoe         6.05
##  3 Gentoo  Biscoe         6   
##  4 Gentoo  Biscoe         6   
##  5 Gentoo  Biscoe         5.95
##  6 Gentoo  Biscoe         5.95
##  7 Gentoo  Biscoe         5.85
##  8 Gentoo  Biscoe         5.85
##  9 Gentoo  Biscoe         5.85
## 10 Gentoo  Biscoe         5.8 
## # ... with 334 more rows

Summary statistics: summarize(dataset, col_name=mean(col))

penguins %>%
  drop_na() %>%
  summarize(avg_g=mean(body_mass_g), sum_g=sum(body_mass_g))

2-6. Nested query and pipe

A nested query wraps one function call inside another and is read from the inside out:

arrange(filter(ToothGrowth, dose==0.5), len)

Using pipe, process step is clearer and less cluttered:

filtered_toothgrowth <- ToothGrowth %>%
  filter(dose==0.5) %>%
  arrange(len)
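
As a quick check that the two styles are equivalent, the sketch below runs both on the built-in ToothGrowth dataset and compares the results. The names nested and piped are illustrative.

```r
library(dplyr)

# Nested form: read from the inside out
nested <- arrange(filter(ToothGrowth, dose == 0.5), len)

# Piped form: read from top to bottom
piped <- ToothGrowth %>%
  filter(dose == 0.5) %>%
  arrange(len)

identical(nested, piped)   # both forms return the same data frame
nrow(piped)                # number of rows with dose 0.5
```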

2-7. Logical operator and conditional statement

Logical operators: and (&), or (|), not (!)

x <-10

x<12 & x>11      # false as 10 is not bigger than 11.
## [1] FALSE
x<12 | x>11      # true, less than 12 or bigger than 11.
## [1] TRUE
!x>11            # true, not bigger than 11.
## [1] TRUE
!(x>15 | x<5)    # true, not bigger than 15 or less than 5
## [1] TRUE
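
The same operators work element by element on vectors, which makes them useful for picking out values. A minimal base R sketch (the values in x are made up):

```r
x <- c(3, 10, 20)

in_range <- x > 5 & x < 15   # TRUE only where both conditions hold
x[in_range]                  # values between 5 and 15
x[!in_range]                 # everything else
x[x < 5 | x > 15]            # values below 5 or above 15
```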

Conditional statement: if (condition) {code to run} else {alternative code}

x <-5

if (x>0) {
  print("x is a positive number")
} else {
  print("x is not a positive number")
}
## [1] "x is a positive number"

y <-1982

if (y>1990) {
  print("Group1")
} else if (y>1980) {
  print("Group2")
} else {
  print("Group3")
}
## [1] "Group2"
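
An if statement checks a single value. To label several values, one option is to wrap the chain in a function and apply it to each element. In the sketch below, assign_group is a hypothetical helper and years is made-up data.

```r
# Hypothetical helper wrapping the if / else if / else chain
assign_group <- function(y) {
  if (y > 1990) {
    "Group1"
  } else if (y > 1980) {
    "Group2"
  } else {
    "Group3"
  }
}

years <- c(1975, 1982, 1995)

# Apply the helper to each year in turn; each call returns one string
groups <- vapply(years, assign_group, character(1))
groups
```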

2-8. Visualize data

The general form is ggplot(data)+geom_xxx(mapping=aes(...)), where geom_xxx stands for a geom function: geom_bar will display a bar chart, geom_point will display a scatter plot. See below:

ggplot(data=penguins)+ 
  geom_point(mapping=aes(x=flipper_length_mm, y=body_mass_g))

Above can be also written as below:

ggplot(data=penguins, 
       mapping=aes(x=flipper_length_mm,y=body_mass_g))+
  geom_point()

Enhance the chart by mapping color, size, shape or alpha (transparency) to a variable:

ggplot(data=penguins)+
  geom_point(mapping=aes(x=flipper_length_mm,
                        y=body_mass_g,
                        color=species,
                        shape=species))

The same chart with the alpha aesthetic:

ggplot(data=penguins)+
  geom_point(mapping=aes(x=flipper_length_mm,
                         y=body_mass_g,
                         alpha=species))

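For reference, the mapping can also be passed to ggplot() without naming the mapping argument. In the second chart, color="purple" set inside geom_point() overrides the species color mapping: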

ggplot(data=penguins, aes(x=flipper_length_mm,y=body_mass_g,color=species))+
  geom_point()

ggplot(data=penguins, aes(x=flipper_length_mm,y=body_mass_g,color=species))+
  geom_point(color="purple")

ggplot(data=penguins, aes(x=flipper_length_mm,y=body_mass_g,color=species))+
  geom_point()+
  facet_wrap(~species)

To apply one color to all points instead of mapping color to a variable, set color outside aes():

ggplot(data=penguins)+
  geom_point(mapping=aes(x=flipper_length_mm,y=body_mass_g),
             color="purple")

Different chart types can be created by changing geom. For example, smooth line as below:

ggplot(data=penguins)+
  geom_smooth(mapping=aes(x=flipper_length_mm,y=body_mass_g))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Bar chart below. If color is used instead of fill, only the outline will be colored:

ggplot(data=penguins)+
  geom_bar(mapping=aes(x=species,fill=species)) 

Stacked bar chart:

ggplot(data=penguins)+
  geom_bar(mapping=aes(x=species,fill=island)) 

Examine the relationship between the trend line and the data points by adding two geoms:

ggplot(data=penguins)+
  geom_smooth(mapping=aes(x=flipper_length_mm, y=body_mass_g))+
  geom_point(mapping=aes(x=flipper_length_mm, y=body_mass_g))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

There are also different types of smoothing lines. For reference:

ggplot(data=penguins,aes(x=flipper_length_mm,y=body_mass_g))+
  geom_smooth(method="loess")                                       #Loess smoothing
  
ggplot(data=penguins,aes(x=flipper_length_mm,y=body_mass_g))+
  geom_smooth(method="gam",formula=y~s(x))                          #Gam smoothing

Facets split the chart into one panel per subset of the data, which is handy for focusing on specific groups:

ggplot(data=penguins)+
  geom_point(mapping=aes(x=flipper_length_mm, y=body_mass_g))+
  facet_wrap(~species)

facet_grid splits the chart by two variables (rows~columns):

ggplot(data=penguins)+
  geom_point(mapping=aes(x=flipper_length_mm,y=body_mass_g,color=species))+
  facet_grid(sex~species)

Other functions, such as filter, can be piped into ggplot as usual:

penguins %>%
  filter(island=="Biscoe") %>%
  ggplot(mapping=aes(x=flipper_length_mm,y=body_mass_g,color=species))+
  geom_point()

Adding labels and annotations helps readers understand the data:

ggplot(data=penguins)+
  geom_point(mapping=aes(x=flipper_length_mm,y=body_mass_g, color=species))+
  labs(title="Palmer Penguins: Relationship between Body Mass vs Flipper Length",
       subtitle="Body Mass (gram) / Flipper Length (mm) ",
       caption="R Package used: palmerpenguins")+
  annotate("text", x=220,y=4000,label="positive relationship observed",
           angle=45, fontface="bold", size=3)

Labels can also be built from variables:

minamount <- min(dataframe$column)
maxamount <- max(dataframe$column)
  
labs(caption=paste0("The minimum is ",minamount," and the maximum is ", maxamount))
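
To see the text that paste0() produces, the template can be run on made-up values. In this sketch, body_mass stands in for dataframe$column and caption_text is an illustrative name. The resulting string could then be passed to labs(caption=caption_text).

```r
# Made-up values standing in for dataframe$column
body_mass <- c(2700, 4050, 6300)

minamount <- min(body_mass)
maxamount <- max(body_mass)

# paste0() joins the pieces without adding any separator
caption_text <- paste0("The minimum is ", minamount,
                       " and the maximum is ", maxamount)
caption_text
```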

Add x/y axis titles if missing:

labs(x="X-axis name", y="Y-axis name")

Rotate x-axis labels:

theme(axis.text.x=element_text(angle=45))

2-9. Save the visuals

ggsave saves the most recently displayed plot:

ggsave("test_visuals.png", width=5, height=5)
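
To avoid depending on which plot was displayed last, a plot can be stored in a variable and passed to ggsave through its plot argument. The sketch below uses the built-in mtcars dataset and a temporary file so that it runs on its own. Both choices, and the names p and out_file, are assumptions made for this illustration.

```r
library(ggplot2)

# Keep the plot in a variable so it can be saved explicitly
p <- ggplot(data = mtcars) +
  geom_point(mapping = aes(x = wt, y = mpg))

# tempfile() is used only to avoid writing into the working directory
out_file <- tempfile(fileext = ".png")
ggsave(out_file, plot = p, width = 5, height = 5)

file.exists(out_file)
```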

Part 3: Tips

3-1. Do not assume

Summary statistics may suggest that the groups below are nearly identical:

library('Tmisc')
data(quartet)
quartet %>%
  group_by(set) %>%
  summarize(mean(x),sd(x),mean(y),sd(y),cor(x,y))
## # A tibble: 4 x 6
##   set   `mean(x)` `sd(x)` `mean(y)` `sd(y)` `cor(x, y)`
##   <fct>     <dbl>   <dbl>     <dbl>   <dbl>       <dbl>
## 1 I             9    3.32      7.50    2.03       0.816
## 2 II            9    3.32      7.50    2.03       0.816
## 3 III           9    3.32      7.5     2.03       0.816
## 4 IV            9    3.32      7.50    2.03       0.817

But visualizing them reveals how different the groups actually are:

ggplot(quartet,aes(x,y))+
  geom_point()+
  geom_smooth(method=lm,se=FALSE)+
  facet_wrap(~set)
## `geom_smooth()` using formula 'y ~ x'
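
The summary figures can also be cross-checked without Tmisc. The sketch below assumes that quartet holds Anscombe's quartet, which base R ships in wide form as anscombe (columns x1 to x4 and y1 to y4). The names x_means, y_means and cors are illustrative.

```r
# anscombe stores the four sets side by side: x1..x4 and y1..y4
x_means <- sapply(anscombe[, c("x1", "x2", "x3", "x4")], mean)
y_means <- sapply(anscombe[, c("y1", "y2", "y3", "y4")], mean)

# Correlation of each x column with its matching y column
cors <- mapply(cor, anscombe[, 1:4], anscombe[, 5:8])

round(x_means, 2)
round(y_means, 2)
round(cors, 3)
```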

3-2. Check for bias

Package needed:

install.packages("SimDesign")
library("SimDesign")

Compare data:

actual_data <- c(10,20,30,40,50)
predicted_data <-c(8,14,22,39,45)
bias(actual_data,predicted_data)
## [1] 4.4

Another example:

actual_data2 <- c(10,20,30,40,50)
predicted_data2 <- c(12,24,39,47,55)
bias(actual_data2, predicted_data2)
## [1] -5.4

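The closer the result is to 0, the less biased the predictions are. With the arguments in this order, a positive result means the predictions are lower than the actual values on average, and a negative result means they are higher.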
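
Both results can be reproduced by hand in base R, assuming that bias() with its default settings returns the mean of the first argument minus the second. That assumption is consistent with the two outputs shown in this section.

```r
actual_data     <- c(10, 20, 30, 40, 50)
predicted_data  <- c(8, 14, 22, 39, 45)
predicted_data2 <- c(12, 24, 39, 47, 55)

# Mean difference between actual and predicted values
manual_bias  <- mean(actual_data - predicted_data)    # predictions too low
manual_bias2 <- mean(actual_data - predicted_data2)   # predictions too high

manual_bias
manual_bias2
```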

END

If you have any feedback, please do not hesitate to reach out.