Set the working directory in the Session menu, or with setwd("path/to/dir"). To check the current working directory:
getwd()
df = read.csv("fileName.csv") reads a CSV file into a data frame.
Looking for information about a built-in dataset? help(iris)
Looking for information about a dataset from a package? ?house_prices
Arguments you can add inside read.csv to alter parameters include stringsAsFactors = FALSE.
To write a data frame to a CSV: write.csv(data, "name.csv", row.names = FALSE, na = "").
By explicitly using dplyr::, you can ensure that the functions are called from the dplyr package namespace. Example: dplyr::summarize(). This can be done with any package; dplyr is just an example.
Tidy data:
Clean data:
Before these tools were around, it was very common to simply remove observations with NAs. This was, and still is, not ideal.
References: Tidy Data by Hadley Wickham
All text in this section is pulled directly from this paper, formatted for easier daily reference.
Vocabulary:
Notes:
While the order of variables and observations does not affect analysis, a good ordering makes it easier to scan the raw values. There are fixed variables, which contain values that are fixed by the design of the data collection, and there are measured variables, which contain values that are measured during the course of the experiment.
To tidy this messy data, it needs to be melted (or stacked). The columns that are not variables need to be turned into rows. The result of melting is a molten dataset.
library(tibble)
df = tibble(
Name = c('John', 'Jane', 'Bob'),
Math = c(90, 85, 78),
Biology = c(88, 92, 80),
English = c(75, 80, 88),
History = c(82, 78, 90))
df
## # A tibble: 3 × 5
## Name Math Biology English History
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 John 90 88 75 82
## 2 Jane 85 92 80 78
## 3 Bob 78 80 88 90
Note: Wickham would call this arrangement messy; however, in some cases it can be extremely useful. It provides efficient storage for completely crossed designs, and it can lead to extremely efficient computation if the desired operations can be expressed as matrix operations. This issue is discussed in depth in Section 6 of the paper.
There are a few different ways to melt this messy data: functions from tidyr, reshape2, and data.table can all help us.
library(tidyr)
melted_df <- pivot_longer(df,
cols = "Math":"History",
names_to = "Subject",
values_to = "Score")
#pivot_longer(df, cols = -id_vars, names_to = "key", values_to = "value", values_drop_na = FALSE)
#NOTE: for cols, can use "Math":"History", -Name (minus the ones you don't want melted), or c("Math", "Biology", "English", "History"), many ways to do this.
library(reshape2)
melted_df <- reshape2::melt(df,
id.vars = c("Name"),
measure.vars = c("Math", "Biology", "English", "History"),
variable.name = "Subject",
value.name = "Score")
data.table vs reshape2: The
data.table::melt function provides additional features and
optimization, making it more efficient, especially for large datasets.
While the basic usage and syntax of the two functions are similar,
data.table::melt is known for its performance improvements
and additional capabilities. If you’re working with relatively small
datasets, the differences in performance may not be significant, and you
can choose either based on your preference.
library(data.table)
dt <- as.data.table(df) # Convert the data.frame to data.table
melted_dt <- data.table::melt(dt,
id.vars = c("Name"),
measure.vars = c("Math", "Biology", "English", "History"),
variable.name = "Subject",
value.name = "Score")
Tidy data:
## Name Subject Score
## 1 John Math 90
## 2 Jane Math 85
## 3 Bob Math 78
## 4 John Biology 88
## 5 Jane Biology 92
## 6 Bob Biology 80
## 7 John English 75
## 8 Jane English 80
## 9 Bob English 88
## 10 John History 82
## 11 Jane History 78
## 12 Bob History 90
To tidy this messy dataset, it first needs to be melted. This puts the multiple variables into one column. This column is then split into separate real variables.
Column headers in this format are often separated by some character (., -, _, :). While the string can be broken into pieces using that character as a divider, in other cases more careful string processing is required. For example, the variable names can be matched to a lookup table that converts a single compound value into multiple component values.
In this example the class and month variables are stored in one column and the value is the student’s score:
df <- tibble(
Name = c('John', 'Jane', 'Bob'),
Gender = c('Male', 'Female', 'Male'),
Math_Score_Jan = c(90, 85, 78),
Math_Score_Feb = c(88, 92, 80),
English_Score_Jan = c(75, 80, 88),
English_Score_Feb = c(82, 78, 90),
Science_Score_Jan = c(85, 88, 92),
Science_Score_Feb = c(90, 85, 88))
df
## # A tibble: 3 × 8
## Name Gender Math_Score_Jan Math_Score_Feb English_Score_Jan English_Score_Feb
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 John Male 90 88 75 82
## 2 Jane Female 85 92 80 78
## 3 Bob Male 78 80 88 90
## # ℹ 2 more variables: Science_Score_Jan <dbl>, Science_Score_Feb <dbl>
Step 1: Melting
library(tidyr)
melted_df <- pivot_longer(df,
cols = c(
"Math_Score_Jan", "Math_Score_Feb",
"English_Score_Jan", "English_Score_Feb",
"Science_Score_Jan", "Science_Score_Feb"),
names_to = "Column",
values_to = "Score")
head(melted_df)
## # A tibble: 6 × 4
## Name Gender Column Score
## <chr> <chr> <chr> <dbl>
## 1 John Male Math_Score_Jan 90
## 2 John Male Math_Score_Feb 88
## 3 John Male English_Score_Jan 75
## 4 John Male English_Score_Feb 82
## 5 John Male Science_Score_Jan 85
## 6 John Male Science_Score_Feb 90
Step 2: Splitting
library(tidyr)
library(dplyr)
split_df <- melted_df %>%
separate(Column, into = c("Subject", NA, "Month"), sep = "_", remove = TRUE)
# NA in `into` drops the constant "Score" piece of the old header, so the numeric Score column is preserved
head(split_df)
## # A tibble: 6 × 5
## Name Gender Subject Month Score
## <chr> <chr> <chr> <chr> <dbl>
## 1 John Male Math Jan 90
## 2 John Male Math Feb 88
## 3 John Male English Jan 75
## 4 John Male English Feb 82
## 5 John Male Science Jan 85
## 6 John Male Science Feb 90
The most complicated form of messy data!
In this example, multiple variables are stored in one column, Element, and the values for those variables are in the column Measurement.
df <- tibble(
Site = rep(c("A", "B", "C"), each = 9),
Month = rep(rep(c("January", "February", "March"), each = 3), times = 3),
Element = rep(c("pH", "Dissolved_Oxygen", "Nitrate"), times = 9),
Measurement = c(
7.2, 6.8, 7.5,
8.2, 7.5, 8.0,
2.0, 1.5, 1.8,
7.0, 6.5, 7.8,
7.8, 7.2, 7.5,
1.8, 1.4, 1.6,
7.4, 6.9, 7.2,
8.0, 7.3, 8.2,
2.2, 1.7, 1.9))
head(df)
## # A tibble: 6 × 4
## Site Month Element Measurement
## <chr> <chr> <chr> <dbl>
## 1 A January pH 7.2
## 2 A January Dissolved_Oxygen 6.8
## 3 A January Nitrate 7.5
## 4 A February pH 8.2
## 5 A February Dissolved_Oxygen 7.5
## 6 A February Nitrate 8
library(tidyr)
library(dplyr)
tidy_df <- df %>%
pivot_wider(names_from = Element, values_from = Measurement)
tidy_df
## # A tibble: 9 × 5
## Site Month pH Dissolved_Oxygen Nitrate
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 A January 7.2 6.8 7.5
## 2 A February 8.2 7.5 8
## 3 A March 2 1.5 1.8
## 4 B January 7 6.5 7.8
## 5 B February 7.8 7.2 7.5
## 6 B March 1.8 1.4 1.6
## 7 C January 7.4 6.9 7.2
## 8 C February 8 7.3 8.2
## 9 C March 2.2 1.7 1.9
Datasets often involve values collected at multiple levels, on different types of observational units. During tidying, each type of observational unit should be stored in its own table.
Normalisation is useful for tidying and eliminating inconsistencies. However, there are few data analysis tools that work directly with relational data, so analysis usually also requires denormalisation, or merging the datasets back into one table.
It’s also common to find data values about a single type of observational unit spread out over multiple tables or files. These tables and files are often split up by another variable, so that each represents a single year, person, or location. As long as the format for individual records is consistent, this is an easy problem to fix:
The plyr package makes this a straightforward task in R.
The following code generates a vector of file names in a directory
(data/) which match a regular expression (ends in
.csv). Next we name each element of the vector with the
name of the file. We do this because plyr will preserve the
names in the following step, ensuring that each row in the final data
frame is labelled with its source. Finally, ldply() loops
over each path, reading in the csv file and combining the results into a
single data frame.
library(plyr)
paths <- dir("data", pattern = "\\.csv$", full.names = TRUE)
names(paths) <- basename(paths)
ldply(paths, read.csv, stringsAsFactors = FALSE)
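Since plyr has been retired in favor of newer packages, here is a minimal base-R sketch of the same task (assuming the same data/ folder and that all files share the same columns; read_one and combined are illustrative names):
paths <- dir("data", pattern = "\\.csv$", full.names = TRUE)
# Read each file, labelling rows with their source file name
read_one <- function(p) cbind(source = basename(p), read.csv(p, stringsAsFactors = FALSE))
combined <- do.call(rbind, lapply(paths, read_one)) # assumes consistent columns across files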
When examining missing values, it is important to consider the experimental design to determine if the values can be safely dropped. If a missing value represents an observation that was supposed to be made, it's important to keep it. However, structural missing values, which are measurements that cannot be made (e.g., for pregnant males), can be safely removed.
Quick inspection functions: head(df), summary(df), names(df), str(df), dim(df)
Setting factors:
The factor() function lets you specify the levels and, if needed, the order of the levels.
as.factor() assumes that the levels should be set based on the unique values present in the column.
levels() retrieves the levels of a factor; nlevels() returns the number of levels in a factor.
titanic = read.csv("titanic_train.csv", stringsAsFactors = FALSE)
titanic$Embarked = as.factor(titanic$Embarked)
levels(titanic$Embarked)
## [1] "" "C" "Q" "S"
titanic$Embarked = factor(titanic$Embarked, levels = c("C", "Q", "S"))
levels(titanic$Embarked)
## [1] "C" "Q" "S"
summary(data) to see the number of NA's per variable.
head(levels(factor(data$variable)), n = 10): "" in the output indicates missing values.
When subsetting, this format is followed: [rows, columns]. There are different ways to subset:
data[1:3, 5:9]
data[c(1,2,3), c(5,6,7,8,9)]
data[15, ] (gets all columns for row 15)
Note: it is also acceptable to use the uniqueRowID and the colName instead of indices. Ex: df["uniqueRowID", "colName"]
To split data: split = split(df, df$discreteCatCol)
sapply(split, function(x) colMeans(x[, c("col1", "col2", "col3")], na.rm = TRUE))
Note: colMeans calculates the mean of each given column for each category of the split.
In this airquality example, Month is a
column containing discrete categories (5,6,7,8,9) and there are columns
containing measurements c("Ozone", "Solar.R","Wind"). We
can calculate the average measurement for each measurement type for each
Month.
df=airquality
split=split(df,df$Month)
sapply(split, function(x) colMeans(x[,c("Ozone", "Solar.R","Wind")],na.rm=TRUE))
## 5 6 7 8 9
## Ozone 23.61538 29.44444 59.115385 59.961538 31.44828
## Solar.R 181.29630 190.16667 216.483871 171.857143 167.43333
## Wind 11.62258 10.26667 8.941935 8.793548 10.18000
There is another way to do this using the dplyr pipe (%>%) with group_by() and summarize(), as sketched below and covered in detail later.
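A minimal sketch of the dplyr equivalent for the airquality summary above (group_by() and summarize() are explained in the following sections):
library(dplyr)
airquality %>%
group_by(Month) %>%
dplyr::summarize(Ozone = mean(Ozone, na.rm = TRUE),
Solar.R = mean(Solar.R, na.rm = TRUE),
Wind = mean(Wind, na.rm = TRUE))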
Data manipulation includes variable-by-variable transformation (e.g., log or sqrt), as well as aggregation, filtering, and reordering. These are the four fundamental verbs of data manipulation: filter, transform, aggregate, and sort.
Filtering and transforming are performed by the base
R functions subset() and transform(). These
are input and output-tidy. The aggregate() function
performs group-wise aggregation. It is input-tidy.
Provided that a single aggregation method is used, it is also
output-tidy. The plyr package provides tidy
summarise() and arrange() functions for
aggregation and sorting.
With dplyr, you are never actually altering the original data frame. Note that dplyr functions don't modify data; they return modified copies of the original data. You should save the function results to a variable if you want to keep using them.
select() filters: chooses variables (columns) / subsets.
mutate() transforms: creates new variables as functions of existing variables.
filter() filters: chooses observations (rows) that meet a logical test.
arrange() sorts: sorts the rows of your data based on the values in a column (or columns) in ascending or descending order, or based on the value returned from an expression.
summarize() aggregates: calculates summary statistics (summing, taking means, etc.).
Other important tools:
%>% to chain commands together.
group_by() allows us to calculate summary stats over subgroupings of data.
Tests can be combined using and &, or |, or not !. Ex: filter(df, a > 0 & b > 0), filter(df, !is.na(x))
select()
The select function is used to choose which variables (columns) to keep (or remove) in a data frame or tibble. Like all dplyr functions, select expects a data frame or tibble as the first argument. The remaining arguments indicate the variables to select. In all dplyr functions, you don't need quotes around variable names as you do in most base-R functions.
select(df, col1, col2, col3) selects the listed columns
select(df, col1:col4) selects col1, col2, col3, col4 (inclusive:inclusive)
select(df, -c(col1, col2)) excludes col1 and col2, selects all other columns
select helper functions: dplyr comes with helper functions for selecting groups of variables:
starts_with("X"): every name that starts with "X"
ends_with("X"): every name that ends with "X"
contains("X"): every name that contains "X"
matches("X"): every name that matches "X", where "X" can be a regular expression
num_range("x", 1:5): the variables named x01, x02, x03, x04 and x05
one_of(x): every name that appears in x, which should be a character vector
Example: select(flightData, ends_with("Delay")) selects all columns in flightData that end with "Delay".
Important note: none of these helper functions are case sensitive by default.
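For a runnable illustration of the helpers on a built-in dataset:
library(dplyr)
select(iris, starts_with("Sepal")) # Sepal.Length, Sepal.Width
select(iris, contains("Width")) # Sepal.Width, Petal.Width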
mutate()
The mutate function adds new variables to a dataset based on existing variables. For example, we could create a volume variable based on height, length, and width. You can also use it to change the values of an existing variable, for example, converting a column in centimeters to meters: mutate(df, colName = colName/100)
To call mutate, pass a tibble/data frame as the first argument and definitions of new variables using a new_name = transformation expression.
mutate(df, newColName = expression)
Example: mutate(flightData, loss = ArrDelay - DepDelay)
makes a new column loss, which contains the result of the
expression ArrDelay - DepDelay for each row in
flightData.
mutate() is great for making date columns!
mutate(df, Date = paste(Year, Month, DayofMonth, sep = "-")) # adjust to your desires!
Note: you can add as many new columns at once as you want in mutate(); just separate them by commas.
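A minimal runnable sketch on iris (Sepal.Area is a made-up column name for illustration):
library(dplyr)
mutate(iris, Sepal.Area = Sepal.Length * Sepal.Width) # new column as a function of two existing columns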
filter()
The filter function chooses observations (rows) to keep based on a logical expression and filters out the rest. To use filter, give it the name of a data frame/tibble and one or more logical tests.
Especially helpful for binary columns.
filter(df, colName <insert logic here>)
Example: filter(flightData, Cancelled == 1) keeps every row in the df flightData where Cancelled == 1.
the %in% operator
The %in% operator can be used with filter to retain only rows whose value in a specified column colName appears in a given set of values c("a", "b", "c").
filter(df, colName %in% c("a", "b", "c"))
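Runnable illustrations of both filter patterns on iris:
library(dplyr)
filter(iris, Sepal.Length > 7) # rows passing a logical test
filter(iris, Species %in% c("setosa", "virginica")) # rows whose Species is in the given set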
arrange()
The arrange() function reorders the rows of the dataset. It is used for sorting.
Pass arrange() a tibble or data frame to arrange, and
tell it what column(s) to sort by. You can also indicate whether to sort
in ascending or descending order. (default = ascending order, to get
descending order: desc(colName))
arrange(df, colName...) list as many colNames as
desired
You can also sort by expressions!
arrange(df, <your expression>)
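Runnable illustrations on iris:
library(dplyr)
arrange(iris, Sepal.Length) # ascending (the default)
arrange(iris, desc(Sepal.Length)) # descending
arrange(iris, Species, desc(Sepal.Width)) # multiple sort keys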
summarize()
The summarize() function uses existing data to create a new dataset of summary statistics. The summarize syntax is like mutate: indicate the data frame or tibble to summarize, then give a list of summary names and operations using a name = expression pattern, as in this example:
summarize(df, newColName = <summarizing function(column)>...)
aggregate functions in summarize():
sum(x) - sum of vector x.
min(x) - minimum value of vector x.
max(x) - maximum value of vector x.
mean(x) - mean value of vector x.
median(x) - median value of vector x.
quantile(x, p) - pth quantile of vector x.
sd(x) - standard deviation of vector x.
var(x) - variance of vector x.
IQR(x) - interquartile range (IQR) of vector x.
diff(range(x)) - total range of vector x.
first(x) - the first element of vector x.
last(x) - the last element of vector x.
nth(x, n) - the nth element of vector x.
n() - the number of rows in the data.frame or group of observations that summarize() describes.
n_distinct(x) - the number of unique values in vector x.
Custom functions in summarize():
You can use your own custom functions; fun() below stands in for any user-defined function. For example:
summarize(df, mean_value = mean(variable), custom_summary = fun(variable))
To define and use your custom function:
# Example custom function that calculates the range of a vector
custom_range <- function(x) {
max_value <- max(x, na.rm = TRUE)
min_value <- min(x, na.rm = TRUE)
return(max_value - min_value)
}
library(dplyr)
df <- data.frame(
group = rep(c("A", "B", "C"), each = 4),
value = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12))
# Use the custom function in summarize
result <- df %>%
group_by(group) %>%
dplyr::summarize(custom_range_value = custom_range(value))
result
## # A tibble: 3 × 2
## group custom_range_value
## <chr> <dbl>
## 1 A 3
## 2 B 3
## 3 C 3
Using group_by() with
summarize():
If you’re using summarize() within a group_by() operation, you can apply the above functions to each group.
In this library(yarrr) example, sex is a
column containing discrete categories and there are columns containing
measurements age and beard.length. We can
calculate the average measurement for each measurement type for each
sex.
library(yarrr)
library(dplyr)
group_by(pirates, sex) %>%
dplyr::summarize(
meanAge = mean(age),
meanBeardLength = mean(beard.length, na.rm = TRUE)
)
## # A tibble: 3 × 3
## sex meanAge meanBeardLength
## <chr> <dbl> <dbl>
## 1 female 29.9 0.399
## 2 male 25.0 19.4
## 3 other 27 14.9
The fivenum function will compute the Tukey five-number summary: minimum, lower hinge, median, upper hinge, and maximum for any specified variable in your dataset. Notice it produces 'five numbers' for the specified variable.
fivenum(iris$Sepal.Length)
## [1] 4.3 5.1 5.8 6.4 7.9
We can use the Hmisc package to compute: n, nmiss,
unique, mean, 5,10,25,50,75,90,95th percentiles, 5 lowest and 5 highest
scores for our dataset.
library(Hmisc) # Activates the Hmisc package from the library
describe(iris) # runs the descriptive stats function `describe` on the `iris` data frame
# Note: the psych package (used below) also defines describe() and masks Hmisc's version when both are loaded; call Hmisc::describe(iris) to be explicit.
## vars n mean sd median trimmed mad min max range skew
## Sepal.Length 1 150 5.84 0.83 5.80 5.81 1.04 4.3 7.9 3.6 0.31
## Sepal.Width 2 150 3.06 0.44 3.00 3.04 0.44 2.0 4.4 2.4 0.31
## Petal.Length 3 150 3.76 1.77 4.35 3.76 1.85 1.0 6.9 5.9 -0.27
## Petal.Width 4 150 1.20 0.76 1.30 1.18 1.04 0.1 2.5 2.4 -0.10
## Species* 5 150 2.00 0.82 2.00 2.00 1.48 1.0 3.0 2.0 0.00
## kurtosis se
## Sepal.Length -0.61 0.07
## Sepal.Width 0.14 0.04
## Petal.Length -1.42 0.14
## Petal.Width -1.36 0.06
## Species* -1.52 0.07
We can use the pastecs package to compute useful
descriptive stats for each variable of our dataset at one time! We can
get the number of values, number of null values, number of NA values,
min, max, range, sum, median, mean, SE.mean, CI.mean, std.dev, coef.var
ALL AT ONCE!!!
library(pastecs) # activate pastecs package
stat.desc(iris) # Run pastecs descriptive stats function on the iris dataset
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## nbr.val 150.00000000 150.00000000 150.0000000 150.00000000 NA
## nbr.null 0.00000000 0.00000000 0.0000000 0.00000000 NA
## nbr.na 0.00000000 0.00000000 0.0000000 0.00000000 NA
## min 4.30000000 2.00000000 1.0000000 0.10000000 NA
## max 7.90000000 4.40000000 6.9000000 2.50000000 NA
## range 3.60000000 2.40000000 5.9000000 2.40000000 NA
## sum 876.50000000 458.60000000 563.7000000 179.90000000 NA
## median 5.80000000 3.00000000 4.3500000 1.30000000 NA
## mean 5.84333333 3.05733333 3.7580000 1.19933333 NA
## SE.mean 0.06761132 0.03558833 0.1441360 0.06223645 NA
## CI.mean.0.95 0.13360085 0.07032302 0.2848146 0.12298004 NA
## var 0.68569351 0.18997942 3.1162779 0.58100626 NA
## std.dev 0.82806613 0.43586628 1.7652982 0.76223767 NA
## coef.var 0.14171126 0.14256420 0.4697441 0.63555114 NA
The psych package
This one will give us: item name, item number, number of valid data points (n), mean, SD, median, mad, min, max, skew, kurtosis, and SE. What a useful tool!!
The describeBy function allows you to examine if the
descriptive stats change significantly based on a categorical
variable.
library(psych) # Activate psych package from the library
describe(iris)
## vars n mean sd median trimmed mad min max range skew
## Sepal.Length 1 150 5.84 0.83 5.80 5.81 1.04 4.3 7.9 3.6 0.31
## Sepal.Width 2 150 3.06 0.44 3.00 3.04 0.44 2.0 4.4 2.4 0.31
## Petal.Length 3 150 3.76 1.77 4.35 3.76 1.85 1.0 6.9 5.9 -0.27
## Petal.Width 4 150 1.20 0.76 1.30 1.18 1.04 0.1 2.5 2.4 -0.10
## Species* 5 150 2.00 0.82 2.00 2.00 1.48 1.0 3.0 2.0 0.00
## kurtosis se
## Sepal.Length -0.61 0.07
## Sepal.Width 0.14 0.04
## Petal.Length -1.42 0.14
## Petal.Width -1.36 0.06
## Species* -1.52 0.07
describe(iris$Sepal.Length) #df$numVariable
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 150 5.84 0.83 5.8 5.81 1.04 4.3 7.9 3.6 0.31 -0.61 0.07
describeBy(iris$Sepal.Length, iris$Species) #df$numVariable, df$chrVariable
##
## Descriptive statistics by group
## group: setosa
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 50 5.01 0.35 5 5 0.3 4.3 5.8 1.5 0.11 -0.45 0.05
## ------------------------------------------------------------
## group: versicolor
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 50 5.94 0.52 5.9 5.94 0.52 4.9 7 2.1 0.1 -0.69 0.07
## ------------------------------------------------------------
## group: virginica
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 50 6.59 0.64 6.5 6.57 0.59 4.9 7.9 3 0.11 -0.2 0.09
Visual inspection can sometimes be unreliable and/or difficult to judge. It’s possible to use a test of statistical significance to compare the distribution of your data to a normal distribution in order to ascertain whether the data actually exhibit a serious deviation from normality.
There are several methods for normality testing, such as the Kolmogorov-Smirnov (K-S) normality test and Shapiro-Wilk's test.
Shapiro-Wilk's method is widely recommended for normality testing, and it provides better power than K-S. It is based on the correlation between the data and the corresponding normal scores.
Note: normality tests are sensitive to sample size. Small samples most often pass normality tests. Therefore, it's important to combine visual inspection and significance testing in order to make the right decision.
shapiro.test(iris$Sepal.Length) #df$numVariable
##
## Shapiro-Wilk normality test
##
## data: iris$Sepal.Length
## W = 0.97609, p-value = 0.01018
with(iris, shapiro.test(Sepal.Length[Species == "setosa"])) #(responseVariable[factor == "name"]))
##
## Shapiro-Wilk normality test
##
## data: Sepal.Length[Species == "setosa"]
## W = 0.9777, p-value = 0.4595
Transforming Data
Log transformations allow us to alter the scale of a variable to focus on multiplicative changes instead of additive changes. In other words, they shift the view to be on relative changes instead of absolute changes. Such multiplicative/relative changes are also called changes in orders of magnitude.
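A minimal sketch of the idea: on the log scale, equal steps correspond to equal multiplicative changes.
prices <- c(10, 100, 1000, 10000) # each value is 10x the previous
log10(prices) # 1 2 3 4: equal steps of one order of magnitude
hist(log(iris$Sepal.Length)) # compare with hist(iris$Sepal.Length)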
Scatterplot matrices can reveal patterns in data.
plot(df) creates a simple scatterplot matrix. These plots can also be colored in ways that make trends more obvious.
References: Plotting Scatterplot matrices in R & Adding non-overlapping legend in pairs()
Category colored scatterplot matrix
library(grDevices) # Required for the `adjustcolor` function
library(viridis) # Required for customColors
# Create a custom color palette
customColors = viridis(3)
# Create a pairs plot with custom point colors
pairs(iris[, 1:4],
col = customColors[iris$Species],
pch = 1, oma=c(3,3,3,15))
par(xpd = TRUE)
legend("bottomright", legend = levels(iris$Species), fill = customColors)
Comparison colored scatterplot matrix
data("Seatbelts")
Seatbelts.df = as.data.frame(Seatbelts)
# Coloring points by different level of variable "law"
pairs(Seatbelts.df[,1:7], col=ifelse(Seatbelts.df$law==0, "black", "red"))
Scatter plots display the relationship between two numerical variables. They show how changes in one variable relate to changes in another and help identify patterns, trends, and correlations in the data.
plot(iris$Petal.Length, iris$Petal.Width,
pch = 16,
ylab = "numVariable (y)",
xlab = "numVariable (x)",
main = "title")
Box plots, or box-and-whisker plots, can show us important information about the measure of central tendency and spread if we want to compare our data among factors, or categories.
The rectangular box represents the IQR, with a line inside the box indicating the median. Two "whiskers" extend from the box, showing the range of the data. Outliers are plotted individually as points.
boxplot(iris$Sepal.Length ~ iris$Species,
horizontal = FALSE,
ylab = "numvariable",
xlab = "chrvariable",
main = "title",
col = c("#999999", "#E69F00", "#56B4E9"))
Histograms are used to examine the distributions in
our dataset. We can use the hist command to create
histograms for any parameters we are interested in. We can use the
abline tool to also represent the median and mean of the
data variable.
hist(iris$Sepal.Length) # Create histogram for numvariable
abline(v=mean(iris$Sepal.Length),col="blue") # add a line to the histogram to represent the mean value of numvariable
abline(v=median(iris$Sepal.Length),col="red") # add a line to the histogram to represent the median value of numvariable
Strip charts
stripchart(iris$Sepal.Length ~ iris$Species,
pch = 19, frame = FALSE, vertical = TRUE,
method = "jitter",
ylab = "numvariable",
xlab = "chrvariable",
main = "title",
col = c("#999999", "#E69F00", "#56B4E9"))
Bar plots: barplot(x) makes a bar plot of one variable.
# Bar plot of one variable (counts per category); iris is used here for illustration
barplot(table(iris$Species))
# Horizontal bar plot: add horiz = TRUE
barplot(table(iris$Species), horiz = TRUE)
Stacked bar plots
myColors = c("lightblue", "mistyrose", "lightcyan","lavender", "cornsilk")
barplot(VADeaths,
col = myColors)
legend("topright", legend = rownames(VADeaths),
fill = myColors, box.lty = 0, cex = 0.8)
Grouped bar plots
myColors = c("lightblue", "mistyrose", "lightcyan","lavender", "cornsilk")
barplot(VADeaths,
col = myColors, beside = TRUE)
legend("topleft", legend = rownames(VADeaths),
fill = myColors, box.lty = 0, cex = 0.8)
QQ plots are used to check whether the data appears to be normally distributed. Inside the gray region = appears normal.
library(ggpubr)
ggqqplot(iris, x = "Sepal.Length") +
labs(title="QQ Plot")
ECDF plots show the fraction of data smaller than or equal to x: how the data is distributed across its range, revealing the proportion of values that are less than or equal to a specific value. They provide a visual representation of the dataset's cumulative distribution.
ggecdf(iris, x = "Sepal.Length",
main = "Fraction of data smaller than 'X'")
Strip charts let us look at the spread and central tendency of the data as well as the actual data points.
ggstripchart(iris, x = "Species", y = "Sepal.Length", color = "Species", add = "mean_sd")
library(ggplot2)
ggplot(diamonds) # if only the dataset is known
ggplot(diamonds, aes(x = carat)) # if only the X-axis is known; the Y-axis can be specified in future geoms
ggplot(diamonds, aes(x = carat, y = price)) # if both X and Y axes are fixed for all layers
ggplot(diamonds, aes(x = carat, color = cut)) # each category of 'cut' will have a distinct color, once a geom is added
In many situations, you will need to follow a style guide. You can make a custom theme to resolve this issue.
References: stack
List of default font options (there are more, these are just the ones I find relevant):
Methods Theme
This theme will be a masterpiece and will meet the following requirements:
The function will take in the following arguments:
font = "nameOfFont" default = NA,base_size = n default = 12,What can’t be done in the theme:
scale_y_continuous(expand = c(0, 0)) to your ggplot for
vertical bar plot, and scale_x_continuous(expand = c(0, 0))
to your ggplot for a horizontal bar plot.fill = "black" or color = "black".library(ggplot2)
methodsTheme =
function(font = NA, base_size = 12){
theme(
#Text font
text = element_text(family = font, size = base_size),
#Color
line = element_line(color = "black"),
panel.background = element_rect(fill = "white"), #fill plot background
legend.key = element_rect(fill = "white"), #fill background of legend key
axis.line = element_line(color = "black"), #color axes lines
#Positioning
legend.position = "right", #change legend position
#options: none, left, right, bottom, top
#Text sizes and options
plot.title = element_text(
size = base_size*1.3, #title size
face = "plain"), #options: plain, italic, bold, bold.italic
axis.title = element_text(size = base_size), #axis titles text size
axis.text = element_text(size = base_size, color = "black"), #axis labels/tick numbers text size
legend.text = element_text(size = base_size), #legend text size
legend.title = element_text(size = base_size) #legend title text size
)
}
Scatterplots display the relationship between two numerical variables. They show how changes in one variable relate to changes in another and help identify patterns, trends, and correlations in the data.
ggplot(iris, aes(x=Petal.Width, y=Petal.Length, shape=Species)) +
geom_point() +
methodsTheme(font = "Times")
Facet wrap separates a ggplot based on a discrete categorical variable. It can assist in making sense of crowded plots!
library(ggplot2)
plot = ggplot(diamonds, aes(x = carat, y = price, color = cut)) +
geom_point(alpha = 0.5) +
geom_smooth(aes(fill=cut)) +
theme(legend.position = "bottom") +
labs(title = "Carat v price")
plot + facet_wrap(~cut)
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Box plots, or box-and-whisker plots, can show us important information about the measure of central tendency and spread if we want to compare our data among factors, or categories.
The rectangular box represents the IQR, with a line inside the box indicating the median. Two "whiskers" extend from the box, showing the range of the data. Outliers are plotted individually as points.
ggplot(iris, aes(x=Species, y=Sepal.Length, fill=Species)) +
geom_boxplot()
Bar plots
ggplot(iris, aes(x = Species, y = Sepal.Length, fill = Species)) +
geom_bar(stat = "summary",
fun = "mean",
position = "dodge") +
labs(title = "Average Sepal Length by Species",
x = "Species",
y = "Average Sepal Length")
Density plots
ggplot(iris, aes(x=Petal.Width)) +
geom_density(color = "red", fill = "red", alpha = 0.5)
ggplot(iris, aes(x=Petal.Width, color=Species, fill=Species)) +
geom_density(alpha = 0.5)
Subfigures can be arranged into figures using the
grid.arrange() function from
library(gridExtra). Example:
grid.arrange(p1, p2, p3, ncol = 2).
coord_cartesian(ylim = c(n, n), xlim = c(n, n)) zooms the axes.
labs(title = "title", x = "label", y = "label") sets the title and axis labels.
Themes can be set for all plots in a session with: theme_set()
Note that the theme functions can take the two arguments below:
base_size: base font size (to change the size of all plot text elements)
base_family: base font family
Helpful add-ons:
scale_x_continuous(labels = dollar_format()): change the x axis to dollars (dollar_format() comes from the scales package)
Viridis palettes are the most robust under various forms of colorblindness, and are quite pretty.
Key functions:
scale_color_viridis() and scale_fill_viridis() for ggplot.
viridis(n), magma(n), inferno(n), and plasma(n): generate color palettes for base plots, where n is the number of colors to return.
library(viridis)
#use in ggplot
library(ggplot2)
# Gradient color
ggplot(iris, aes(Sepal.Length, Sepal.Width))+
geom_point(aes(color = Sepal.Length)) +
scale_color_viridis(option = "D")+
theme_minimal() +
theme(legend.position = "bottom")
# Discrete color. use the argument discrete = TRUE
ggplot(iris, aes(Sepal.Length, Sepal.Width))+
geom_point(aes(color = Species)) +
geom_smooth(aes(color = Species, fill = Species), method = "lm") +
scale_color_viridis(discrete = TRUE, option = "D")+
scale_fill_viridis(discrete = TRUE) +
theme_minimal() +
theme(legend.position = "bottom")
## `geom_smooth()` using formula = 'y ~ x'
#use in base R plots
barplot(1:10, col = viridis(10))
RColorBrewer palettes are beautiful and include colorblind-friendly palettes. The package contains 3 types of color palettes: sequential, diverging, and qualitative.
display.brewer.pal(n = 8, name = 'Dark2') displays a palette.
brewer.pal(n = 8, name = 'Dark2') returns the hexadecimal codes, where n is the number of colors.
scale_fill_brewer() for box plots, bar plots, violin plots, dot plots, etc.
scale_color_brewer() for lines and points.
library(RColorBrewer)
#Use in ggplot
boxplot = ggplot(iris, aes(x = Species, y = Sepal.Length, fill = Species)) +
geom_boxplot()
boxplot + scale_fill_brewer(palette = "Dark2")
#Use in base R
boxplot(iris$Sepal.Length ~ iris$Species,
col = brewer.pal(n = 3, name = "Dark2")) # generate a vector of 3 colors from the named palette
Grey color palettes are black and white!
Key functions:
scale_fill_grey() for box plots, bar plots, violin plots, dot plots, etc.
scale_colour_grey() for points, lines, etc.
library(ggplot2) # these are ggplot functions, so not applicable to base R plots
#Use in ggplot
boxplot = ggplot(iris, aes(x = Species, y = Sepal.Length, fill = Species)) +
geom_boxplot()
boxplot + scale_fill_grey(start = 0.8, end = 0.3) #start and end are to avoid colors too light and too dark
Wes Anderson color palettes
names(wes_palettes) gets the palette names.
library(wesanderson)
#use in ggplot
boxplot = ggplot(iris, aes(x = Species, y = Sepal.Length, fill = Species)) +
geom_boxplot()
#wes_palette(name, n, type = c("discrete", "continuous"))
boxplot + scale_fill_manual(values = wes_palette("Zissou1", n = 3, type = c("discrete")))
#use in base R plots
#wes_palette(name, n, type = "discrete" or "continuous")
boxplot(iris$Sepal.Length ~ iris$Species,
col = wes_palette("Zissou1", 3, type = "discrete"))
Base R color palettes are included in base R.
To generate a vector of n contiguous colors:
col = name(n), where name is rainbow, heat.colors, terrain.colors, topo.colors, or cm.colors, and n is the number of colors desired.
# Use rainbow, heat.colors, terrain.colors, topo.colors, or cm.colors
barplot(1:5, col=rainbow(5))