This R Markdown report was created for self-study and code review. The original contents can be found on Sylvia’s GitHub.
functions: a body of reusable code that performs a specific task. Type the function name and then add the arguments within the (). Example:
print("It's a beautiful day!")
## [1] "It's a beautiful day!"
variable: a value which can be stored for later use, also called an object. Define a variable by writing its name followed by <- and a value or function call.
var_x <- "this is variable"
var_y <- 123.45
A calculation can be performed with a variable as below:
var_y - 23.45
## [1] 100
Ask about a function by adding ? before the function name.
?print
Or find out more about a package through its vignettes.
browseVignettes("tidyverse")
vector: a group of data elements of the same type stored in a sequence. Printing a vector displays its elements in order.
vec_1 <- c(12,34,56,78.9)
vec_2 <- c(1L, 5L, 10L)
vec_3 <- c("Anna", "Beta", "Cera", "Delta")
Now, let’s print one of the vectors (vec_3):
vec_3
## [1] "Anna" "Beta" "Cera" "Delta"
list: similar to a vector, but it can contain elements of different data types.
list_1 <- list("a", 1L, 3.5, FALSE)
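To make the difference between a vector and a list concrete, the sketch below builds both from the same four values. The names mixed_vec and mixed_list are made up for this illustration; c() coerces everything to one common type, while list() keeps each element's own type.

```r
# Hypothetical names, used only for this illustration.
mixed_vec <- c("a", 1L, 3.5, FALSE)      # c() coerces to one common type
mixed_list <- list("a", 1L, 3.5, FALSE)  # list() keeps each element's type

typeof(mixed_vec)            # "character"
sapply(mixed_list, typeof)   # "character" "integer" "double" "logical"
```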
Other operations can be performed on vectors, including:
calculating the length of a vector
length(vec_3)
## [1] 4
Names can also be assigned to the elements of a vector.
names(vec_1) <- c("spring", "summer", "fall", "winter")
print(vec_1)
## spring summer fall winter
## 12.0 34.0 56.0 78.9
The same can be done for a list.
list1 <- list('x-axis'=1, 'y-axis'=2, 'z-axis'=3)
print(list1)
## $`x-axis`
## [1] 1
##
## $`y-axis`
## [1] 2
##
## $`z-axis`
## [1] 3
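Once elements have names, they can be looked up by name instead of by position. The short sketch below rebuilds the named vector and list from above and reads single elements from each; it only uses base R indexing.

```r
vec_1 <- c(12, 34, 56, 78.9)
names(vec_1) <- c("spring", "summer", "fall", "winter")
list1 <- list('x-axis' = 1, 'y-axis' = 2, 'z-axis' = 3)

vec_1["summer"]          # element of a vector by name (the name is kept)
unname(vec_1["summer"])  # just the value: 34
list1[["y-axis"]]        # element of a list by name: 2
list1$`z-axis`           # backticks are needed because of the hyphen: 3
```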
data frame: a collection of columns of equal length, typically imported from an external source. Example below:
df_1 <- data.frame(city=c("NY", "SF", "CO"), days=c(2.4, 4.4, 5.1), rank=c(2,1,3))
Or the code can be written step by step, first defining the variables:
city <- c("NY", "SF", "CO")
days <- c(2.4, 4.4, 5.1)
rank <- c(2,1,3)
df_1 <- data.frame(city,days,rank)
Now, let’s print the data frame to see the result.
df_1
## city days rank
## 1 NY 2.4 2
## 2 SF 4.4 1
## 3 CO 5.1 3
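A few base R ways to look inside a data frame are sketched below, using the same df_1. The name top_city is made up for this illustration, and stringsAsFactors = FALSE is added here only so that the city column stays plain text on older R versions.

```r
df_1 <- data.frame(city = c("NY", "SF", "CO"),
                   days = c(2.4, 4.4, 5.1),
                   rank = c(2, 1, 3),
                   stringsAsFactors = FALSE)  # keep city as character

dim(df_1)     # number of rows and columns: 3 3
df_1$days     # one column as a vector: 2.4 4.4 5.1

# Rows where rank is 1, keeping only the city column.
top_city <- df_1[df_1$rank == 1, "city"]
top_city      # "SF"
```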
matrix: a two-dimensional collection of data elements, containing a single data type.
matrix(c(3:10), nrow=2)
## [,1] [,2] [,3] [,4]
## [1,] 3 5 7 9
## [2,] 4 6 8 10
matrix(c(3:10), ncol=2)
## [,1] [,2]
## [1,] 3 7
## [2,] 4 8
## [3,] 5 9
## [4,] 6 10
Note that matrix(c(3:9), nrow=2) will give a warning rather than an error: 7 elements is not a multiple of 2 rows, so R recycles the first value to fill the last cell.
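The behavior of matrix() with 7 elements and 2 rows can be checked directly. The sketch below catches the warning so the result can still be inspected; got_warning and m are names made up for this illustration.

```r
got_warning <- FALSE

# Build the matrix while recording (and silencing) any warning.
m <- withCallingHandlers(
  matrix(3:9, nrow = 2),
  warning = function(w) {
    got_warning <<- TRUE
    invokeRestart("muffleWarning")
  }
)

got_warning  # TRUE: R warns about the data length
dim(m)       # 2 4: a fourth column is still created
m[2, 4]      # 3: the first value is recycled into the last cell
```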
Check installed packages. If needed, install for use.
installed.packages()
install.packages("palmerpenguins")
Load the package and/or dataset before using it. Nothing will show up in the console:
library(palmerpenguins)
data(penguins)
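Import a CSV file and select certain columns (read_csv() comes from readr and select() from dplyr, both loaded with the tidyverse):
library(tidyverse)
csv_file <- read_csv("test.csv")
select(csv_file, column1, column2, column3)
Import an Excel file, list its sheet names, and read a specific sheet (these functions come from readxl):
library(readxl)
read_excel("test.xls")
excel_sheets("test.xls")
read_excel("test.xls", sheet="sales")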
View() displays the data in a reader-friendly table format in a separate tab:
View(penguins)
The structure of the data can also be previewed with the functions below.
head(penguins)
colnames(penguins)
glimpse(penguins)
str(penguins)
As an example, str() has been run as below:
str(penguins)
## tibble [344 x 8] (S3: tbl_df/tbl/data.frame)
## $ species : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ island : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ bill_length_mm : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
## $ bill_depth_mm : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
## $ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
## $ body_mass_g : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
## $ sex : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
## $ year : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
Summary statistics can also be checked (skim_without_charts() comes from the skimr package):
skim_without_charts(penguins)
summary(penguins)
Example of summary():
summary(penguins)
## species island bill_length_mm bill_depth_mm
## Adelie :152 Biscoe :168 Min. :32.10 Min. :13.10
## Chinstrap: 68 Dream :124 1st Qu.:39.23 1st Qu.:15.60
## Gentoo :124 Torgersen: 52 Median :44.45 Median :17.30
## Mean :43.92 Mean :17.15
## 3rd Qu.:48.50 3rd Qu.:18.70
## Max. :59.60 Max. :21.50
## NA's :2 NA's :2
## flipper_length_mm body_mass_g sex year
## Min. :172.0 Min. :2700 female:165 Min. :2007
## 1st Qu.:190.0 1st Qu.:3550 male :168 1st Qu.:2007
## Median :197.0 Median :4050 NA's : 11 Median :2008
## Mean :200.9 Mean :4202 Mean :2008
## 3rd Qu.:213.0 3rd Qu.:4750 3rd Qu.:2009
## Max. :231.0 Max. :6300 Max. :2009
## NA's :2 NA's :2
Inspect specific column(s):
select(penguins, species)
Adding a - sign before a column name displays all columns except that one. The example below is the same as select(penguins, -species):
penguins %>%
select(-species)
## # A tibble: 344 x 7
## island bill_length_mm bill_depth_mm flipper_length_~ body_mass_g sex year
## <fct> <dbl> <dbl> <int> <int> <fct> <int>
## 1 Torger~ 39.1 18.7 181 3750 male 2007
## 2 Torger~ 39.5 17.4 186 3800 fema~ 2007
## 3 Torger~ 40.3 18 195 3250 fema~ 2007
## 4 Torger~ NA NA NA NA <NA> 2007
## 5 Torger~ 36.7 19.3 193 3450 fema~ 2007
## 6 Torger~ 39.3 20.6 190 3650 male 2007
## 7 Torger~ 38.9 17.8 181 3625 fema~ 2007
## 8 Torger~ 39.2 19.6 195 4675 male 2007
## 9 Torger~ 34.1 18.1 193 3475 <NA> 2007
## 10 Torger~ 42 20.2 190 4250 <NA> 2007
## # ... with 334 more rows
Sort the data. Rows are sorted in ascending order by default; wrap the column in desc() or, for numeric columns, add a - sign to sort in descending order:
arrange(penguins, bill_length_mm)
arrange(penguins, desc(bill_length_mm))
penguins %>% arrange(bill_length_mm)
penguins %>% arrange(-bill_length_mm)
Filter values:
filter(penguins, species=='Gentoo')
Example of using arrange and filter functions:
penguins %>%
filter(species=="Gentoo") %>%
select(species, island, body_mass_g) %>%
arrange(-body_mass_g)
## # A tibble: 124 x 3
## species island body_mass_g
## <fct> <fct> <int>
## 1 Gentoo Biscoe 6300
## 2 Gentoo Biscoe 6050
## 3 Gentoo Biscoe 6000
## 4 Gentoo Biscoe 6000
## 5 Gentoo Biscoe 5950
## 6 Gentoo Biscoe 5950
## 7 Gentoo Biscoe 5850
## 8 Gentoo Biscoe 5850
## 9 Gentoo Biscoe 5850
## 10 Gentoo Biscoe 5800
## # ... with 114 more rows
Find max, min, mean. If the column contains NA, the result will be NA unless na.rm=TRUE is supplied:
min(penguins$year)
max(penguins$year)
mean(penguins$year)
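The effect of missing values can be seen on a small vector. In the sketch below, values is a hypothetical vector made up for this illustration; na.rm = TRUE tells the function to drop NA before calculating.

```r
values <- c(2007, NA, 2009)  # hypothetical data with one missing value

mean(values)                 # NA, because one element is missing
mean(values, na.rm = TRUE)   # 2008
max(values, na.rm = TRUE)    # 2009
```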
Group data for summary statistics. The example below illustrates the process of grouping the data -> removing rows with NA values -> summarizing into newly named columns.
penguins %>%
group_by(species, island) %>% drop_na() %>%
summarize(mean_bl=mean(bill_length_mm),
max_bl=max(bill_length_mm))
## `summarise()` has grouped output by 'species'. You can override using the `.groups` argument.
## # A tibble: 5 x 4
## # Groups: species [3]
## species island mean_bl max_bl
## <fct> <fct> <dbl> <dbl>
## 1 Adelie Biscoe 39.0 45.6
## 2 Adelie Dream 38.5 44.1
## 3 Adelie Torgersen 39.0 46
## 4 Chinstrap Dream 48.8 58
## 5 Gentoo Biscoe 47.6 59.6
Save cleaned data frame:
cleaned_penguins <- penguins %>% arrange(bill_length_mm)
cleaned2_penguins <- penguins %>% select(island, species)
Rename column or variable: rename(dataset, new_name=old_name):
rename(penguins, weight=body_mass_g)
Convert all column names to upper/lower case:
rename_with(penguins,toupper)
rename_with(penguins,tolower)
Clean column names so that they contain only letters, numbers and _ (clean_names() comes from the janitor package):
clean_names(penguins)
Combine columns:
unite(data_set,
'new_column_name',
column1_to_unite,
column2_to_unite,
sep=' ,')
Applying as below:
example <- bookings_data %>%
select(arrival_date_year, arrival_date_month) %>%
unite(arrival_year_month, c("arrival_date_year",
"arrival_date_month"), sep = " ,")
Separate column:
separate(data_set,
column_to_separate,
into=c('column1', 'column2'),
sep= ' ')
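As a small worked example of both templates, the sketch below unites two columns and then separates them again. The data frame people and its columns given and family are hypothetical, made up for this illustration, and tidyr is assumed to be installed.

```r
library(tidyr)

# Hypothetical data frame, made up for this illustration.
people <- data.frame(given = c("Ada", "Alan"),
                     family = c("Lovelace", "Turing"),
                     stringsAsFactors = FALSE)

# Combine two columns into one, separated by a single space.
united <- unite(people, "full_name", given, family, sep = " ")
united$full_name   # "Ada Lovelace" "Alan Turing"

# Split the combined column back into its two parts.
split_again <- separate(united, full_name,
                        into = c("given", "family"), sep = " ")
split_again$family # "Lovelace" "Turing"
```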
Add column: mutate(dataset, new_column=expression)
mutate(penguins, body_mass_kg=body_mass_g/1000)
Example below:
penguins %>%
mutate(body_mass_kg=body_mass_g/1000) %>%
select(species, island, body_mass_kg) %>%
arrange(-body_mass_kg)
## # A tibble: 344 x 3
## species island body_mass_kg
## <fct> <fct> <dbl>
## 1 Gentoo Biscoe 6.3
## 2 Gentoo Biscoe 6.05
## 3 Gentoo Biscoe 6
## 4 Gentoo Biscoe 6
## 5 Gentoo Biscoe 5.95
## 6 Gentoo Biscoe 5.95
## 7 Gentoo Biscoe 5.85
## 8 Gentoo Biscoe 5.85
## 9 Gentoo Biscoe 5.85
## 10 Gentoo Biscoe 5.8
## # ... with 334 more rows
Summary statistics: summarize(dataset, col_name=mean(col))
penguins %>%
drop_na() %>%
summarize(avg_g=mean(body_mass_g), sum_g=sum(body_mass_g))
Nested function calls can be written as below:
arrange(filter(ToothGrowth, dose==0.5), len)
With the pipe, each processing step is clearer and the code is less cluttered:
filtered_toothgrowth <- ToothGrowth %>%
filter(dose==0.5) %>%
arrange(len)
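To confirm that the nested and piped forms do the same work, the sketch below runs both on the built-in ToothGrowth data and compares the results. The names nested and piped are made up for this illustration, and dplyr is assumed to be installed.

```r
library(dplyr)

# Nested form: read from the inside out.
nested <- arrange(filter(ToothGrowth, dose == 0.5), len)

# Piped form: read from top to bottom.
piped <- ToothGrowth %>%
  filter(dose == 0.5) %>%
  arrange(len)

identical(nested, piped)  # TRUE
nrow(piped)               # 20 rows have dose == 0.5
```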
Logical operators: and &, or |, not !
x <- 10
x<12 & x>11 # false: 10 is less than 12 but not bigger than 11.
## [1] FALSE
x<12 | x>11 # true: 10 is less than 12, so one side holds.
## [1] TRUE
!x>11 # true: 10 is not bigger than 11.
## [1] TRUE
!(x>15 | x<5) # true: 10 is neither bigger than 15 nor less than 5.
## [1] TRUE
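These operators also work element by element on whole vectors, which is what makes them useful inside filter(). A minimal sketch, where nums is a hypothetical vector made up for this illustration:

```r
nums <- c(3, 10, 14, 20)   # hypothetical values

nums > 5 & nums < 15       # FALSE  TRUE  TRUE FALSE
nums < 5 | nums > 15       #  TRUE FALSE FALSE  TRUE
!(nums > 5)                #  TRUE FALSE FALSE FALSE

# A logical vector can be used to keep only the matching elements.
nums[nums > 5 & nums < 15] # 10 14
```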
Conditional statement: if (condition) {then} else {otherwise}
x <- 5
if(x>0){
print("x is a positive number")
} else {
# x is zero or negative here, since the only test was x>0
print("x is not a positive number")
}
## [1] "x is a positive number"
y <-1982
if (y>1990) {
print("Group1")
} else if (y>1980) {
print("Group2")
} else {
print("Group3")
}
## [1] "Group2"
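The same branching can be wrapped in a function so it is reusable, or applied to a whole vector with ifelse(). In the sketch below, assign_group and years are hypothetical names made up for this illustration.

```r
# Hypothetical helper: returns the group label for a single year.
assign_group <- function(y) {
  if (y > 1990) {
    "Group1"
  } else if (y > 1980) {
    "Group2"
  } else {
    "Group3"
  }
}
assign_group(1982)  # "Group2"

# if () handles one value at a time; ifelse() handles a whole vector.
years <- c(1975, 1982, 1995)
groups <- ifelse(years > 1990, "Group1",
                 ifelse(years > 1980, "Group2", "Group3"))
groups              # "Group3" "Group2" "Group1"
```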
The basic ggplot2 pattern is ggplot(data)+geom_*(mapping=aes(...)), where geom_* stands for any geom function: geom_bar will display a bar chart, geom_point will display a scatter plot. See below:
ggplot(data=penguins)+
geom_point(mapping=aes(x=flipper_length_mm, y=body_mass_g))
The above can also be written as below:
ggplot(data=penguins,
mapping=aes(x=flipper_length_mm,y=body_mass_g))+
geom_point()
Enhance the chart by mapping color, size, shape or alpha (transparency) to a variable:
ggplot(data=penguins)+
geom_point(mapping=aes(x=flipper_length_mm,
y=body_mass_g,
color=species,
shape=species))
See the chart with alpha aes:
ggplot(data=penguins)+
geom_point(mapping=aes(x=flipper_length_mm,
y=body_mass_g,
alpha=species))
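Below are alternative ways of writing the code, kept for reference. Note that in the second example the constant color="purple" inside geom_point() overrides the species color mapping: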
ggplot(data=penguins, aes(x=flipper_length_mm,y=body_mass_g,color=species))+
geom_point()
ggplot(data=penguins, aes(x=flipper_length_mm,y=body_mass_g,color=species))+
geom_point(color="purple")
ggplot(data=penguins, aes(x=flipper_length_mm,y=body_mass_g,color=species))+
geom_point()+
facet_wrap(~species)
If one color needs to be applied to the entire chart instead of being mapped to a variable, set it outside aes():
ggplot(data=penguins)+
geom_point(mapping=aes(x=flipper_length_mm,y=body_mass_g),
color="purple")
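The difference between mapping a color and setting a color can be sketched side by side. The example below uses the built-in mtcars data as a stand-in so it runs without extra data packages; p_mapped and p_fixed are names made up for this illustration, and ggplot2 is assumed to be installed.

```r
library(ggplot2)

# Color mapped to a variable: goes inside aes(), one color per cylinder count.
p_mapped <- ggplot(data = mtcars) +
  geom_point(mapping = aes(x = wt, y = mpg, color = factor(cyl)))

# Color set to a constant: goes outside aes(), every point is purple.
p_fixed <- ggplot(data = mtcars) +
  geom_point(mapping = aes(x = wt, y = mpg), color = "purple")

# Printing either object (for example, typing p_fixed) draws the chart.
```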
Different chart types can be created by changing geom. For example, smooth line as below:
ggplot(data=penguins)+
geom_smooth(mapping=aes(x=flipper_length_mm,y=body_mass_g))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Bar chart below. If color is used instead of fill, only the outline will be colored:
ggplot(data=penguins)+
geom_bar(mapping=aes(x=species,fill=species))
Stacked bar chart:
ggplot(data=penguins)+
geom_bar(mapping=aes(x=species,fill=island))
Examine the relationship between the trend line and the data points by adding two geoms:
ggplot(data=penguins)+
geom_smooth(mapping=aes(x=flipper_length_mm, y=body_mass_g))+
geom_point(mapping=aes(x=flipper_length_mm, y=body_mass_g))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
There are also different types of smoothing lines. For reference:
ggplot(data=penguins,aes(x=flipper_length_mm,y=body_mass_g))+
geom_smooth(method="loess") #Loess smoothing
ggplot(data=penguins,aes(x=flipper_length_mm,y=body_mass_g))+
geom_smooth(method="gam",formula=y~s(x)) #Gam smoothing
Facet function is also handy for focusing on specific data points:
ggplot(data=penguins)+
geom_point(mapping=aes(x=flipper_length_mm, y=body_mass_g))+
facet_wrap(~species)
facet_grid splits the chart by two variables, written as rows~columns:
ggplot(data=penguins)+
geom_point(mapping=aes(x=flipper_length_mm,y=body_mass_g,color=species))+
facet_grid(sex~species)
Other functions, such as filter(), can be piped into ggplot() as normal:
penguins %>%
filter(island=="Biscoe") %>%
ggplot(mapping=aes(x=flipper_length_mm,y=body_mass_g,color=species))+
geom_point()
Adding labels and annotations will help readers understand the data:
ggplot(data=penguins)+
geom_point(mapping=aes(x=flipper_length_mm,y=body_mass_g, color=species))+
labs(title="Palmer Penguins: Relationship between Body Mass vs Flipper Length",
subtitle="Body Mass (gram) / Flipper Length (mm) ",
caption="R Package used: palmerpenguins")+
annotate("text", x=220,y=4000,label="positive relationship observed",
angle=45, fontface="bold", size=3)
Label can be added using variables too:
minamount <- min(dataframe$column)
maxamount <- max(dataframe$column)
labs(caption=paste0("The minimum is ",minamount," and the maximum is ", maxamount))
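To see what text paste0() builds before it is passed to labs(), the sketch below runs it on a small vector. The vector amounts and the name caption_text are hypothetical, made up for this illustration.

```r
amounts <- c(12.5, 3, 48)   # hypothetical column values

minamount <- min(amounts)
maxamount <- max(amounts)

# paste0() joins its arguments with no separator, converting numbers to text.
caption_text <- paste0("The minimum is ", minamount,
                       " and the maximum is ", maxamount)
caption_text  # "The minimum is 3 and the maximum is 48"
```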
Add x/y axis header title if missing:
labs(x="X-axis name", y="Y-axis name")
Rotate x-axis headers:
theme(axis.text.x=element_text(angle=45))
ggsave will save the latest visual:
ggsave("test_visuals.png", width=5, height=5)
Summary statistics can suggest that the groups below are nearly identical:
library('Tmisc')
data(quartet)
quartet %>%
group_by(set) %>%
summarize(mean(x),sd(x),mean(y),sd(y),cor(x,y))
## # A tibble: 4 x 6
## set `mean(x)` `sd(x)` `mean(y)` `sd(y)` `cor(x, y)`
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 I 9 3.32 7.50 2.03 0.816
## 2 II 9 3.32 7.50 2.03 0.816
## 3 III 9 3.32 7.5 2.03 0.816
## 4 IV 9 3.32 7.50 2.03 0.817
But visualizing them will reveal the difference between the grouped data:
ggplot(quartet,aes(x,y))+
geom_point()+
geom_smooth(method=lm,se=FALSE)+
facet_wrap(~set)
## `geom_smooth()` using formula 'y ~ x'
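The same check can be sketched without extra packages, assuming that quartet is Anscombe's quartet in long format: base R ships the same numbers as the built-in anscombe data set, with columns x1 to x4 and y1 to y4. The name quartet_stats is made up for this illustration.

```r
# Compute the same five statistics for each of the four x/y pairs.
quartet_stats <- sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  c(mean_x = mean(x), sd_x = sd(x),
    mean_y = mean(y), sd_y = sd(y),
    cor_xy = cor(x, y))
})

# One column per set; the columns are almost identical.
round(quartet_stats, 3)
```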
Package needed:
install.packages("SimDesign")
library("SimDesign")
Compare data:
actual_data <- c(10,20,30,40,50)
predicted_data <-c(8,14,22,39,45)
bias(actual_data,predicted_data)
## [1] 4.4
Another example:
actual_data2 <- c(10,20,30,40,50)
predicted_data2 <- c(12,24,39,47,55)
bias(actual_data2, predicted_data2)
## [1] -5.4
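The closer the result is to 0, the less biased the predictions are. A positive result means the predictions run below the actual values on average, and a negative result means they run above.
If you have any feedback, please do not hesitate to reach out at sylviahk416@gmail.com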
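The bias() outputs above can be reproduced by hand, on the assumption that bias() with these two arguments reduces to the mean of the differences (first argument minus second), which matches both results. The name manual_bias is made up for this illustration.

```r
actual_data <- c(10, 20, 30, 40, 50)
predicted_data <- c(8, 14, 22, 39, 45)

# Differences are 2, 6, 8, 1, 5; their mean is 22 / 5.
manual_bias <- mean(actual_data - predicted_data)
manual_bias   # 4.4

actual_data2 <- c(10, 20, 30, 40, 50)
predicted_data2 <- c(12, 24, 39, 47, 55)

# Differences are -2, -4, -9, -7, -5; their mean is -27 / 5.
manual_bias2 <- mean(actual_data2 - predicted_data2)
manual_bias2  # -5.4
```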