Introduction
This R Markdown report was created for self-study and code review. The original contents can be found on Sylvia’s GitHub.
Packages used
- dplyr
- janitor
- palmerpenguins
- skimr
- tidyverse
Table of contents
I. Concepts
- Function, variable and documents
- Vector, list, matrix and data frame
II. Basic
- Install and load packages
- Prepare data
- Inspect data
- Organize data
- Manipulate data
- Nested query and pipe
- Logical operators and conditional statement
- Visualize data
- Save the visuals
III. Tips
- Do not assume
- Check for bias
I. Concepts
I-1. Function, variable and documents
functions: a function is a piece of code that performs a specific task. Type the function name, then add its arguments within the (). Example:
print("It's a beautiful day!")
## [1] "It's a beautiful day!"
variable: a value which can be stored for later use. Variables are also called objects. Define a variable by writing its name followed by <- and a value or function call.
var_x <- "this is variable"
var_y <- 123.45
A calculation can be performed with a variable as below:
var_y - 23.45
## [1] 100
Ask about a function by adding ? before the function name.
?print
Or find out more about a package.
browseVignettes("tidyverse")
I-2. Vector, list, matrix and data frame
vector: a group of data elements of the same type stored in a sequence. Create one with c(); printing the vector displays its elements.
vec_1 <- c(12,34,56,78.9)
vec_2 <- c(1L, 5L, 10L)
vec_3 <- c("Anna", "Beta", "Cera", "Delta")
Now, let’s print one of the vectors (vec_3) below.
vec_3
## [1] "Anna" "Beta" "Cera" "Delta"
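As a small sketch of what "the same type" means in practice, class() reports the type of each vector defined above, and c() converts mixed inputs to one common type. The name vec_mixed is introduced here only for illustration.

```r
vec_1 <- c(12, 34, 56, 78.9)
vec_2 <- c(1L, 5L, 10L)
vec_3 <- c("Anna", "Beta", "Cera", "Delta")

class(vec_1)  # "numeric"
class(vec_2)  # "integer"
class(vec_3)  # "character"

# Mixing types in c() coerces every element to one common type
vec_mixed <- c(1, "a", TRUE)
class(vec_mixed)  # "character"
```

This coercion is the reason a list is needed when the elements must keep different types.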
list: similar to a vector, but it can contain elements of different data types.
list_1 <- list("a", 1L, 3.5, FALSE)
Other operations can be performed on vectors, including:
calculating the length of a vector
length(vec_3)
## [1] 4
or assigning names to the elements of a vector.
names(vec_1) <- c("spring", "summer", "fall", "winter")
print(vec_1)
## spring summer fall winter
## 12.0 34.0 56.0 78.9
The same can be done with a list.
list1 <- list('x-axis'=1, 'y-axis'=2, 'z-axis'=3)
print(list1)
## $`x-axis`
## [1] 1
##
## $`y-axis`
## [1] 2
##
## $`z-axis`
## [1] 3
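Once a list has names, its elements can be retrieved by name. The following is a minimal sketch using the same list1 as above; it only relies on base R indexing.

```r
list1 <- list('x-axis' = 1, 'y-axis' = 2, 'z-axis' = 3)

names(list1)        # the three element names
list1[["y-axis"]]   # double brackets return the element itself: 2
list1$`z-axis`      # $ also works; backticks are needed because of the hyphen
list1["x-axis"]     # single brackets return a list of length one
```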
data frame: a collection of columns, typically imported from another source such as a file. Example below:
df_1 <- data.frame(city=c("NY", "SF", "CO"), days=c(2.4, 4.4, 5.1), rank=c(2,1,3))
Or the code can be written step by step by first defining the variables:
city <- c("NY", "SF", "CO")
days <- c(2.4, 4.4, 5.1)
rank <- c(2,1,3)
df_1 <- data.frame(city,days,rank)
Now, let’s print df_1 to see the result.
df_1
## city days rank
## 1 NY 2.4 2
## 2 SF 4.4 1
## 3 CO 5.1 3
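A short sketch of a few common ways to look inside the data frame just created. It uses only base R and the same df_1 values as above.

```r
df_1 <- data.frame(city = c("NY", "SF", "CO"),
                   days = c(2.4, 4.4, 5.1),
                   rank = c(2, 1, 3))

dim(df_1)               # number of rows and columns: 3 3
df_1$days               # one column, returned as a vector
df_1[df_1$rank == 1, ]  # the row(s) where rank equals 1
mean(df_1$days)         # a column works with ordinary vector functions
```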
matrix: a two-dimensional collection of data elements, containing a single data type.
matrix(c(3:10), nrow=2)
## [,1] [,2] [,3] [,4]
## [1,] 3 5 7 9
## [2,] 4 6 8 10
matrix(c(3:10), ncol=2)
## [,1] [,2]
## [1,] 3 7
## [2,] 4 8
## [3,] 5 9
## [4,] 6 10
Note that matrix(c(3:9), nrow=2) will give a warning, as 7 elements is not a multiple of the 2 rows; R still builds the matrix by recycling the first value to fill the last cell.
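The behaviour with 7 elements can be checked directly. This is a minimal sketch; m_ok and m_short are names made up for the illustration, and suppressWarnings() is used only to keep the output quiet.

```r
# 8 elements fill a 2-row matrix exactly
m_ok <- matrix(3:10, nrow = 2)
dim(m_ok)  # 2 4

# 7 elements do not: R warns and recycles the first value (3) into the last cell
m_short <- suppressWarnings(matrix(3:9, nrow = 2))
m_short[2, 4]  # 3

# Checking the length up front avoids the surprise
length(3:9) %% 2 == 0  # FALSE
```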
II. Basic
II-1. Install and load packages
Check installed packages. If needed, install the package before use.
installed.packages()
install.packages("palmerpenguins")
Load the package and/or dataset before using it. Nothing will show up in the console:
library(palmerpenguins)
data(penguins)
II-2. Prepare data
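Import a csv file and select certain columns (read_csv() comes from readr, which is loaded with tidyverse):
library(tidyverse)
csv_file <- read_csv("test.csv")
select(csv_file, column1, column2, column3)
Import an excel file, list its sheets and read a specific sheet (these functions come from the readxl package):
library(readxl)
read_excel("test.xls")
excel_sheets("test.xls")
read_excel("test.xls", sheet="sales")
II-3. Inspect data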
View() will display the data in a reader-friendly table format in a separate tab:
View(penguins)
The data can also be summarized with the functions below.
head(penguins)
colnames(penguins)
glimpse(penguins)
str(penguins)
As an example, str() has been run as below:
str(penguins)
## tibble [344 x 8] (S3: tbl_df/tbl/data.frame)
## $ species : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ island : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ bill_length_mm : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
## $ bill_depth_mm : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
## $ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
## $ body_mass_g : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
## $ sex : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
## $ year : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
Summary statistics can also be checked:
skim_without_charts(penguins)
summary(penguins)
Example of summary():
summary(penguins)
## species island bill_length_mm bill_depth_mm
## Adelie :152 Biscoe :168 Min. :32.10 Min. :13.10
## Chinstrap: 68 Dream :124 1st Qu.:39.23 1st Qu.:15.60
## Gentoo :124 Torgersen: 52 Median :44.45 Median :17.30
## Mean :43.92 Mean :17.15
## 3rd Qu.:48.50 3rd Qu.:18.70
## Max. :59.60 Max. :21.50
## NA's :2 NA's :2
## flipper_length_mm body_mass_g sex year
## Min. :172.0 Min. :2700 female:165 Min. :2007
## 1st Qu.:190.0 1st Qu.:3550 male :168 1st Qu.:2007
## Median :197.0 Median :4050 NA's : 11 Median :2008
## Mean :200.9 Mean :4202 Mean :2008
## 3rd Qu.:213.0 3rd Qu.:4750 3rd Qu.:2009
## Max. :231.0 Max. :6300 Max. :2009
## NA's :2 NA's :2
Inspect specific column(s):
select(penguins, species)
Adding a - sign before a column name displays all columns except that one. The example below is equivalent to select(penguins, -species):
penguins %>%
select(-species)
## # A tibble: 344 x 7
## island bill_length_mm bill_depth_mm flipper_length_~ body_mass_g sex year
## <fct> <dbl> <dbl> <int> <int> <fct> <int>
## 1 Torger~ 39.1 18.7 181 3750 male 2007
## 2 Torger~ 39.5 17.4 186 3800 fema~ 2007
## 3 Torger~ 40.3 18 195 3250 fema~ 2007
## 4 Torger~ NA NA NA NA <NA> 2007
## 5 Torger~ 36.7 19.3 193 3450 fema~ 2007
## 6 Torger~ 39.3 20.6 190 3650 male 2007
## 7 Torger~ 38.9 17.8 181 3625 fema~ 2007
## 8 Torger~ 39.2 19.6 195 4675 male 2007
## 9 Torger~ 34.1 18.1 193 3475 <NA> 2007
## 10 Torger~ 42 20.2 190 4250 <NA> 2007
## # ... with 334 more rows
II-4. Organize data
Sort the data. The default order is ascending; wrap the column in desc() or add a - sign before it to sort in descending order:
arrange(penguins, bill_length_mm)
arrange(penguins, desc(bill_length_mm))
penguins %>% arrange(bill_length_mm)
penguins %>% arrange(-bill_length_mm)
Filter values:
filter(penguins, species=='Gentoo')
Example of using the arrange and filter functions:
penguins %>%
filter(species=="Gentoo") %>%
select(species, island, body_mass_g) %>%
arrange(-body_mass_g)
## # A tibble: 124 x 3
## species island body_mass_g
## <fct> <fct> <int>
## 1 Gentoo Biscoe 6300
## 2 Gentoo Biscoe 6050
## 3 Gentoo Biscoe 6000
## 4 Gentoo Biscoe 6000
## 5 Gentoo Biscoe 5950
## 6 Gentoo Biscoe 5950
## 7 Gentoo Biscoe 5850
## 8 Gentoo Biscoe 5850
## 9 Gentoo Biscoe 5850
## 10 Gentoo Biscoe 5800
## # ... with 114 more rows
Find max, min, mean: if the column contains NA, the result will show as NA unless na.rm=TRUE is passed.
min(penguins$year)
max(penguins$year)
mean(penguins$year)
Group data for summary statistics. The example below illustrates the process of grouping data -> removing rows with NA values -> assigning new column names -> summarizing them.
penguins %>%
group_by(species, island) %>% drop_na() %>%
summarize(mean_bl=mean(bill_length_mm),
max_bl=max(bill_length_mm))
## `summarise()` has grouped output by 'species'. You can override using the `.groups` argument.
## # A tibble: 5 x 4
## # Groups: species [3]
## species island mean_bl max_bl
## <fct> <fct> <dbl> <dbl>
## 1 Adelie Biscoe 39.0 45.6
## 2 Adelie Dream 38.5 44.1
## 3 Adelie Torgersen 39.0 46
## 4 Chinstrap Dream 48.8 58
## 5 Gentoo Biscoe 47.6 59.6
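drop_na() as used above removes every row that has an NA in any column. When only one column matters, passing na.rm=TRUE to the summary function is a lighter alternative. A minimal base R sketch with a made-up vector (bill_lengths is hypothetical, not taken from penguins):

```r
bill_lengths <- c(39.1, 39.5, NA, 36.7)

mean(bill_lengths)                # NA, because one value is missing
mean(bill_lengths, na.rm = TRUE)  # mean of the three non-missing values
max(bill_lengths, na.rm = TRUE)   # 39.5
sum(is.na(bill_lengths))          # number of missing values: 1
```

Counting the missing values first, as in the last line, shows how much data either approach would discard.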
Save cleaned data frame:
cleaned_penguins <- penguins %>% arrange(bill_length_mm)
cleaned2_penguins <- penguins %>% select(island, species)
II-5. Manipulate data
Rename a column or variable: rename(dataset, new_name=old_name):
rename(penguins, weight=body_mass_g)
Update all column names to upper/lower case:
rename_with(penguins,toupper)
rename_with(penguins,tolower)
Clean column names so that they contain only characters, numbers and _ (clean_names() comes from the janitor package):
clean_names(penguins)
Combine columns:
unite(data_set,
'new_column_name',
column1_to_unite,
column2_to_unite,
sep=' ,')
Applying as below:
example <- bookings_data %>%
select(arrival_date_year, arrival_date_month) %>%
unite(arrival_year_month, c("arrival_date_year",
"arrival_date_month"), sep = " ,")
Separate column:
separate(data_set,
column_to_separate,
into=c('column1', 'column2'),
sep= ' ')
Add column: mutate(dataset, new_column=expression)
mutate(penguins, body_mass_kg=body_mass_g/1000)
Example below:
penguins %>%
mutate(body_mass_kg=body_mass_g/1000) %>%
select(species, island, body_mass_kg) %>%
arrange(-body_mass_kg)
## # A tibble: 344 x 3
## species island body_mass_kg
## <fct> <fct> <dbl>
## 1 Gentoo Biscoe 6.3
## 2 Gentoo Biscoe 6.05
## 3 Gentoo Biscoe 6
## 4 Gentoo Biscoe 6
## 5 Gentoo Biscoe 5.95
## 6 Gentoo Biscoe 5.95
## 7 Gentoo Biscoe 5.85
## 8 Gentoo Biscoe 5.85
## 9 Gentoo Biscoe 5.85
## 10 Gentoo Biscoe 5.8
## # ... with 334 more rows
Summary statistics: summarize(dataset, col_name=mean(col))
penguins %>%
drop_na() %>%
summarize(avg_g=mean(body_mass_g), sum_g=sum(body_mass_g))
II-6. Nested query and pipe
A nested query can be written as below:
arrange(filter(ToothGrowth, dose==0.5), len)
Using the pipe, each processing step is clearer and the code is less cluttered:
filtered_toothgrowth <- ToothGrowth %>%
filter(dose==0.5) %>%
arrange(len)
II-7. Logical operators and conditional statement
Logical operators: and &, or |, not !
x <- 10
x<12 & x>11 # false as 10 is not bigger than 11.
## [1] FALSE
x<12 | x>11 # true, as 10 is less than 12.
## [1] TRUE
!x>11 # true, not bigger than 11.
## [1] TRUE
!(x>15 | x<5) # true, neither bigger than 15 nor less than 5
## [1] TRUE
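The same operators work element by element on whole vectors, which is what filter() relies on. A small sketch with a made-up vector (ages is a hypothetical name used only here):

```r
ages <- c(4, 17, 35, 70)

ages > 10 & ages < 65  # FALSE TRUE TRUE FALSE
ages < 10 | ages > 65  # TRUE FALSE FALSE TRUE
!(ages > 10)           # TRUE FALSE FALSE FALSE

# A logical vector can be used to pick out the matching elements
ages[ages > 10 & ages < 65]  # 17 35
```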
Conditional statement: if (condition) {code to run when the condition is true} else {code to run otherwise}
x <- 5
if(x>0){
print("x is a positive number")
} else {
print("x is zero or a negative number")
}
## [1] "x is a positive number"
y <-1982
if (y>1990) {
print("Group1")
} else if (y>1980) {
print("Group2")
} else {
print("Group3")
}
## [1] "Group2"
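if/else tests a single value at a time. For a whole vector, base R's ifelse() applies the same kind of rule to every element. The sketch below repeats the grouping logic above on a made-up vector (years and groups are hypothetical names):

```r
years <- c(1975, 1985, 1995)

# Nested ifelse() mirrors the if / else if / else chain
groups <- ifelse(years > 1990, "Group1",
                 ifelse(years > 1980, "Group2", "Group3"))
groups  # "Group3" "Group2" "Group1"
```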
II-8. Visualize data
The general pattern is ggplot(data)+geom_*(mapping=aes(...)), where geom_* is the geom function for the chart type: geom_bar will display a bar chart, geom_point will display a scatter plot. See below:
ggplot(data=penguins)+
geom_point(mapping=aes(x=flipper_length_mm, y=body_mass_g))
The above can also be written as below:
ggplot(data=penguins,
mapping=aes(x=flipper_length_mm,y=body_mass_g))+
geom_point()
Enhance the chart by mapping color, size, shape or alpha (transparency) to a variable:
ggplot(data=penguins)+
geom_point(mapping=aes(x=flipper_length_mm,
y=body_mass_g,
color=species,
shape=species))
See the chart with the alpha aesthetic:
ggplot(data=penguins)+
geom_point(mapping=aes(x=flipper_length_mm,
y=body_mass_g,
alpha=species))
Below are alternative ways of writing the code, kept for reference:
ggplot(data=penguins, aes(x=flipper_length_mm,y=body_mass_g,color=species))+
geom_point()
ggplot(data=penguins, aes(x=flipper_length_mm,y=body_mass_g,color=species))+
geom_point(color="purple")
ggplot(data=penguins, aes(x=flipper_length_mm,y=body_mass_g,color=species))+
geom_point()+
facet_wrap(~species)
If a color needs to be applied to the entire chart instead of being mapped to a variable, set it outside aes():
ggplot(data=penguins)+
geom_point(mapping=aes(x=flipper_length_mm,y=body_mass_g),
color="purple")
Different chart types can be created by changing the geom. For example, a smooth line as below:
ggplot(data=penguins)+
geom_smooth(mapping=aes(x=flipper_length_mm,y=body_mass_g))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Bar chart below. If color is used instead of fill, only the outline of each bar will be colored:
ggplot(data=penguins)+
geom_bar(mapping=aes(x=species,fill=species))
Stacked bar chart:
ggplot(data=penguins)+
geom_bar(mapping=aes(x=species,fill=island))
Examine the relationship between the trend line and the data points by adding two geoms:
ggplot(data=penguins)+
geom_smooth(mapping=aes(x=flipper_length_mm, y=body_mass_g))+
geom_point(mapping=aes(x=flipper_length_mm, y=body_mass_g))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
There are also different types of smoothing lines. For reference:
ggplot(data=penguins,aes(x=flipper_length_mm,y=body_mass_g))+
geom_smooth(method="loess") #Loess smoothing
ggplot(data=penguins,aes(x=flipper_length_mm,y=body_mass_g))+
geom_smooth(method="gam",formula=y~s(x)) #Gam smoothing
The facet functions are also handy for focusing on specific subsets of the data:
ggplot(data=penguins)+
geom_point(mapping=aes(x=flipper_length_mm, y=body_mass_g))+
facet_wrap(~species)
Use facet_grid to facet by two variables:
ggplot(data=penguins)+
geom_point(mapping=aes(x=flipper_length_mm,y=body_mass_g,color=species))+
facet_grid(sex~species)
Other functions can be used as normal:
penguins %>%
filter(island=="Biscoe") %>%
ggplot(mapping=aes(x=flipper_length_mm,y=body_mass_g,color=species))+
geom_point()
Adding labels and annotations will help readers understand the data:
ggplot(data=penguins)+
geom_point(mapping=aes(x=flipper_length_mm,y=body_mass_g, color=species))+
labs(title="Palmer Penguins: Relationship between Body Mass vs Flipper Length",
subtitle="Body Mass (gram) / Flipper Length (mm) ",
caption="R Package used: palmerpenguins")+
annotate("text", x=220,y=4000,label="positive relationship observed",
angle=45, fontface="bold", size=3)
Labels can be built from variables too:
minamount <- min(dataframe$column)
maxamount <- max(dataframe$column)
labs(caption=paste0("The minimum is ",minamount," and the maximum is ", maxamount))
Add x/y axis titles if missing:
labs(x="X-axis name", y="Y-axis name")
Rotate x-axis labels:
theme(axis.text.x=element_text(angle=45))
II-9. Save the visuals
ggsave will save the most recently displayed plot:
ggsave("test_visuals.png", width=5, height=5)
III. Tips
III-1. Do not assume
Summary statistics may suggest that the groups below are nearly identical datasets:
library('Tmisc')
data(quartet)
quartet %>%
group_by(set) %>%
summarize(mean(x),sd(x),mean(y),sd(y),cor(x,y))
## # A tibble: 4 x 6
## set `mean(x)` `sd(x)` `mean(y)` `sd(y)` `cor(x, y)`
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 I 9 3.32 7.50 2.03 0.816
## 2 II 9 3.32 7.50 2.03 0.816
## 3 III 9 3.32 7.5 2.03 0.816
## 4 IV 9 3.32 7.50 2.03 0.817
But visualizing them will reveal the difference between the grouped data:
ggplot(quartet,aes(x,y))+
geom_point()+
geom_smooth(method=lm,se=FALSE)+
facet_wrap(~set)
## `geom_smooth()` using formula 'y ~ x'
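The quartet data appears to be Anscombe's quartet, which base R also ships in wide form as the built-in anscombe data frame (columns x1 to x4 and y1 to y4). Assuming that correspondence, the near-identical summaries can be reproduced without installing Tmisc. The names x_means, y_means and correlations are made up for this sketch.

```r
# anscombe is part of the datasets package that loads with R
x_means <- sapply(anscombe[, c("x1", "x2", "x3", "x4")], mean)
y_means <- sapply(anscombe[, c("y1", "y2", "y3", "y4")], mean)

# Pair each x column with its y column and compute the correlation
correlations <- mapply(cor, anscombe[, 1:4], anscombe[, 5:8])

x_means                 # 9 for every set
round(y_means, 2)       # about 7.5 for every set
round(correlations, 3)  # roughly 0.816 for every set
```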
III-2. Check for bias
Package needed:
install.packages("SimDesign")
library("SimDesign")
Compare data:
actual_data <- c(10,20,30,40,50)
predicted_data <- c(8,14,22,39,45)
bias(actual_data,predicted_data)
## [1] 4.4
Another example:
actual_data2 <- c(10,20,30,40,50)
predicted_data2 <- c(12,24,39,47,55)
bias(actual_data2, predicted_data2)
## [1] -5.4
The closer the result is to 0, the less biased the predictions are.
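The two results above are consistent with bias() returning the average of the element-wise differences between its first and second argument. Assuming that definition, the same numbers can be reproduced in base R without SimDesign:

```r
actual_data <- c(10, 20, 30, 40, 50)
predicted_data <- c(8, 14, 22, 39, 45)
mean(actual_data - predicted_data)    # 4.4

actual_data2 <- c(10, 20, 30, 40, 50)
predicted_data2 <- c(12, 24, 39, 47, 55)
mean(actual_data2 - predicted_data2)  # -5.4
```

Under this reading, a positive result means the predictions are on average lower than the actual values, and a negative result means they are on average higher.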