Basic Tips

Working directory

Set the working directory in the RStudio Session tab. To check the current working directory: getwd().
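A quick check in code (the setwd() path below is hypothetical; substitute your own):

getwd() # print the current working directory
# setwd("~/projects/myAnalysis") # set it by path instead of via the Session tab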

Importing data

df = read.csv("fileName.csv")

Looking for information about a built-in dataset? help(iris)
Looking for information about a dataset from a package? ?house_prices

Arguments you can add inside read.csv to alter parameters:

  • stringsAsFactors = FALSE (keep text columns as character rather than factor)
  • row.names = 1 (treat the first column as row names; note the argument is row.names, not rowNames)
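For example, a sketch assuming a file fileName.csv whose first column holds unique IDs:

df <- read.csv("fileName.csv",
               stringsAsFactors = FALSE, # keep text columns as character
               row.names = 1)            # treat the first column as row names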

Writing CSVs

write.csv(data, "name.csv")

  • If your data already contain unique identifiers (so R's automatic row numbers are redundant): row.names = FALSE
  • To control how NAs are written to the file: na = ""
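Putting both together (a sketch; data and name.csv are placeholders):

write.csv(data, "name.csv",
          row.names = FALSE, # don't write an extra column of row numbers
          na = "")           # write NAs as empty cells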

Preventing bugs

By explicitly using dplyr::, you ensure that functions are called from the dplyr package namespace, preventing bugs where another loaded package masks a function of the same name. Example: dplyr::summarize(). This can be done with any package; dplyr is just an example.

Tidy data and Clean data

Tidy data:

  • Each variable forms a column
  • Each observation forms a row
  • Each type of observational unit forms a table
  • Observations are not pseudoreplicated; each row must be a single observation

Clean data:

  • Attributes have correct data types
  • There are no missing values

Before these tools were around, it was very common to simply remove observations with NAs. That was not ideal then, and it still isn't.

Tidying data

References: Tidy Data by Hadley Wickham

All text in this section is pulled directly from this paper, formatted for easier daily reference.

Vocabulary:

  • A dataset is a collection of values. Every value belongs to a variable and an observation.
  • A variable contains all values that measure the same underlying attribute (temperature, coral cover, salinity, fish abundance)
  • An observation contains all values measured of the same unit (sample, person, transect) across attributes. (Wickham 2014)

Notes:

  • It is easier to describe functional relationships between variables than between rows, and it is easier to make comparisons between groups of observations than between groups of columns. (Wickham 2014)

Ordering variables and observations

While the order of variables and observations does not affect analysis, a good ordering makes it easier to scan the raw values. There are fixed variables, which contain values that are fixed by the design of the data collection, and there are measured variables, which contain values that are measured during the course of the experiment.

  • Fixed variables should come first, followed by measured variables, each ordered so that related variables are contiguous (grouped together/ “touching”).
  • Rows can be ordered by the first variable, breaking ties with the second and subsequent (fixed) variables.

Common issues with messy datasets

  • Column headers are values, not variable names.
  • Multiple variables are stored in one column
  • Variables are stored in both rows and columns
  • Multiple types of observational units are stored in the same table
  • A single observational unit is stored in multiple tables

Column headers are values, not variables

To tidy this messy data it needs to be melted (or stacked). The columns that are not variables need to be turned into rows. The result of melting is a molten dataset.

library(tibble)

df = tibble(
  Name = c('John', 'Jane', 'Bob'),
  Math = c(90, 85, 78),
  Biology = c(88, 92, 80),
  English = c(75, 80, 88),
  History = c(82, 78, 90))
df
## # A tibble: 3 × 5
##   Name   Math Biology English History
##   <chr> <dbl>   <dbl>   <dbl>   <dbl>
## 1 John     90      88      75      82
## 2 Jane     85      92      80      78
## 3 Bob      78      80      88      90

Note: Wickham would call this arrangement messy; however, in some cases it can be extremely useful. It provides efficient storage for completely crossed designs, and it can lead to extremely efficient computation if the desired operations can be expressed as matrix operations. This issue is discussed in depth in Section 6 of the paper.

There are a few different ways to melt this messy data; functions from tidyr, reshape2, and data.table can all help us.

library(tidyr)

melted_df <- pivot_longer(df,
  cols = "Math":"History",
  names_to = "Subject",
  values_to = "Score")
#pivot_longer(df, cols = -id_vars, names_to = "key", values_to = "value", values_drop_na = FALSE)

#NOTE: for cols, you can use "Math":"History", -Name (everything except the columns listed), or c("Math", "Biology", "English", "History"); there are many ways to do this.
library(reshape2)

melted_df <- reshape2::melt(df,
  id.vars = c("Name"),
  measure.vars = c("Math", "Biology", "English", "History"),
  variable.name = "Subject",
  value.name = "Score")

data.table vs reshape2: The data.table::melt function provides additional features and optimization, making it more efficient, especially for large datasets. While the basic usage and syntax of the two functions are similar, data.table::melt is known for its performance improvements and additional capabilities. If you’re working with relatively small datasets, the differences in performance may not be significant, and you can choose either based on your preference.

library(data.table)

dt <- as.data.table(df) # Convert the data.frame to data.table

melted_dt <- data.table::melt(dt,
  id.vars = c("Name"),
  measure.vars = c("Math", "Biology", "English", "History"),
  variable.name = "Subject",
  value.name = "Score")

Tidy data:

##    Name Subject Score
## 1  John    Math    90
## 2  Jane    Math    85
## 3   Bob    Math    78
## 4  John Biology    88
## 5  Jane Biology    92
## 6   Bob Biology    80
## 7  John English    75
## 8  Jane English    80
## 9   Bob English    88
## 10 John History    82
## 11 Jane History    78
## 12  Bob History    90

Multiple variables stored in one column

To tidy this messy dataset, it first needs to be melted. This puts the multiple variables into one column. This column is then split into separate real variables.

Column headers in this format are often separated by some character (., -, _, :). Often the string can be broken into pieces using that character as a divider; in other cases, more careful string processing is required. For example, the variable names can be matched to a lookup table that converts each compound value into multiple component values.

In this example the class and month variables are stored in one column and the value is the student’s score:

df <- tibble(
  Name = c('John', 'Jane', 'Bob'),
  Gender = c('Male', 'Female', 'Male'),
  Math_Score_Jan = c(90, 85, 78),
  Math_Score_Feb = c(88, 92, 80),
  English_Score_Jan = c(75, 80, 88),
  English_Score_Feb = c(82, 78, 90),
  Science_Score_Jan = c(85, 88, 92),
  Science_Score_Feb = c(90, 85, 88))
df
## # A tibble: 3 × 8
##   Name  Gender Math_Score_Jan Math_Score_Feb English_Score_Jan English_Score_Feb
##   <chr> <chr>           <dbl>          <dbl>             <dbl>             <dbl>
## 1 John  Male               90             88                75                82
## 2 Jane  Female             85             92                80                78
## 3 Bob   Male               78             80                88                90
## # ℹ 2 more variables: Science_Score_Jan <dbl>, Science_Score_Feb <dbl>

Step 1: Melting

library(tidyr)

melted_df <- pivot_longer(df,
  cols = c(
  "Math_Score_Jan", "Math_Score_Feb",
  "English_Score_Jan", "English_Score_Feb",
  "Science_Score_Jan", "Science_Score_Feb"),
  names_to = "Column",
  values_to = "Score")
head(melted_df)
## # A tibble: 6 × 4
##   Name  Gender Column            Score
##   <chr> <chr>  <chr>             <dbl>
## 1 John  Male   Math_Score_Jan       90
## 2 John  Male   Math_Score_Feb       88
## 3 John  Male   English_Score_Jan    75
## 4 John  Male   English_Score_Feb    82
## 5 John  Male   Science_Score_Jan    85
## 6 John  Male   Science_Score_Feb    90

Step 2: Splitting

library(tidyr)
library(dplyr)

split_df <- melted_df %>%
  separate(Column, into = c("Subject", "Label", "Month"), sep = "_", remove = TRUE) %>%
  select(-Label) # naming this piece "Score" would clobber the numeric Score column, so give it another name and drop it
head(split_df)
## # A tibble: 6 × 5
##   Name  Gender Subject Month Score
##   <chr> <chr>  <chr>   <chr> <dbl>
## 1 John  Male   Math    Jan      90
## 2 John  Male   Math    Feb      88
## 3 John  Male   English Jan      75
## 4 John  Male   English Feb      82
## 5 John  Male   Science Jan      85
## 6 John  Male   Science Feb      90

Variables stored in both rows and columns

The most complicated form of messy data!

In this example, multiple variables are stored in one column, Element, and the values for those variables are in the column Measurement.

df <- tibble(
  Site = rep(c("A", "B", "C"), each = 9),
  Month = rep(rep(c("January", "February", "March"), each = 3), times = 3),
  Element = rep(c("pH", "Dissolved_Oxygen", "Nitrate"), times = 9),
  Measurement = c(
    7.2, 6.8, 7.5,
    8.2, 7.5, 8.0,
    2.0, 1.5, 1.8,
    7.0, 6.5, 7.8,
    7.8, 7.2, 7.5,
    1.8, 1.4, 1.6,
    7.4, 6.9, 7.2,
    8.0, 7.3, 8.2,
    2.2, 1.7, 1.9))
head(df)
## # A tibble: 6 × 4
##   Site  Month    Element          Measurement
##   <chr> <chr>    <chr>                  <dbl>
## 1 A     January  pH                       7.2
## 2 A     January  Dissolved_Oxygen         6.8
## 3 A     January  Nitrate                  7.5
## 4 A     February pH                       8.2
## 5 A     February Dissolved_Oxygen         7.5
## 6 A     February Nitrate                  8
library(tidyr)
library(dplyr)

tidy_df <- df %>%
  pivot_wider(names_from = Element, values_from = Measurement)
tidy_df
## # A tibble: 9 × 5
##   Site  Month       pH Dissolved_Oxygen Nitrate
##   <chr> <chr>    <dbl>            <dbl>   <dbl>
## 1 A     January    7.2              6.8     7.5
## 2 A     February   8.2              7.5     8  
## 3 A     March      2                1.5     1.8
## 4 B     January    7                6.5     7.8
## 5 B     February   7.8              7.2     7.5
## 6 B     March      1.8              1.4     1.6
## 7 C     January    7.4              6.9     7.2
## 8 C     February   8                7.3     8.2
## 9 C     March      2.2              1.7     1.9

Multiple types in one table

Datasets often involve values collected at multiple levels, on different types of observational units. During tidying, each type of observational unit should be stored in its own table.

Normalisation is useful for tidying and eliminating inconsistencies. However, few data analysis tools work directly with relational data, so analysis usually also requires denormalisation, or merging the datasets back into one table.

One type in multiple tables

An example

It’s also common to find data values about a single type of observational unit spread out over multiple tables or files. These tables and files are often split up by another variable, so that each represents a single year, person, or location. As long as the format for individual records is consistent, this is an easy problem to fix:

  1. Read the files into a list of tables.
  2. For each table, add a new column that records the original file name (because the file name is often the value of an important variable).
  3. Combine all tables into a single table.

The plyr package makes this a straightforward task in R. The following code generates a vector of file names in a directory (data/) which match a regular expression (ends in .csv). Next we name each element of the vector with the name of the file. We do this because plyr will preserve the names in the following step, ensuring that each row in the final data frame is labelled with its source. Finally, ldply() loops over each path, reading in the csv file and combining the results into a single data frame.

library(plyr)

paths <- dir("data", pattern = "\\.csv$", full.names = TRUE)

names(paths) <- basename(paths)

ldply(paths, read.csv, stringsAsFactors = FALSE)
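plyr is no longer under active development; a rough modern equivalent uses purrr's map_dfr() (a sketch, assuming the same data/ directory; map_dfr() also needs dplyr installed for the row-binding):

library(purrr)

paths <- dir("data", pattern = "\\.csv$", full.names = TRUE)
names(paths) <- basename(paths)

combined <- map_dfr(paths, read.csv, .id = "source") # .id records each file's name in a "source" column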

Missing values

When examining missing values, it is important to consider the experimental design to determine if the values can be safely dropped. If a missing value represents an observation that was supposed to be made, it’s important to keep it. However, structural missing values, which are measurements that cannot be made (pregnant males), can be safely removed.

Cleaning data

Understanding the data

  • head(df)
  • summary(df)
  • names(df)
  • str(df)
  • dim(df)

Setting factors

  • Categorical Representation: Factors represent discrete categorical data, such as “Male” or “Female,” “High,” “Medium,” or “Low,” etc.
  • Levels: A factor consists of a set of predefined levels, which are the distinct categories or groups that the variable can take. Levels define the possible values of the factor.
  • Ordering: Factors can be ordered or unordered. Ordered factors have a meaningful order among their levels, while unordered factors do not.

Setting factors:

  • Factors are created using the factor() function. You can specify the levels and, if needed, the order of the levels.
  • as.factor() assumes that the levels should be set based on the unique values present in the column.
  • Checking Factor Properties: levels() retrieves the levels of a factor. nlevels() returns the number of levels in a factor.
titanic = read.csv("titanic_train.csv", stringsAsFactors = FALSE)

titanic$Embarked = as.factor(titanic$Embarked)
levels(titanic$Embarked)
## [1] ""  "C" "Q" "S"
titanic$Embarked = factor(titanic$Embarked, levels = c("C", "Q", "S")) # values not among the levels (here "") become NA
levels(titanic$Embarked)
## [1] "C" "Q" "S"

Identifying and handling missing values

  • Use summary(data) to see the number of NAs per variable
  • Use head(levels(factor(data$variable)), n=10); "" in the output indicates missing values
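Another quick check (shown here on the built-in airquality data): count the NAs in every column at once.

colSums(is.na(airquality)) # number of NAs per column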

Subsetting base R

When subsetting, this format is followed: [rows, columns]. There are different ways to subset:

  • Ranges of indices: data[1:3,5:9]
  • Vectors of indices: data[c(1,2,3),c(5,6,7,8,9)]
  • Individual indices: data[15,] (gets all columns for row 15)

Note: it is also acceptable to use the uniqueRowID and the colName instead of indices. Ex: df["uniqueRowID", "colName"]
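For instance, with the built-in airquality data:

airquality[1:3, c("Ozone", "Temp")] # rows 1-3 of two named columns
airquality[15, ] # all columns for row 15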

Calculating colMeans for measurement columns associated with a column of discrete categories

  1. To split the column of discrete categories into dataframes for each category: split=split(df,df$discreteCatCol)
  2. To calculate descriptive stats for the split data: sapply(split, function(x) colMeans(x[,c("col1", "col2","col3")],na.rm=TRUE))

Note: colMeans calculates the mean of each of the specified columns, within each category of the split

In this airquality example, Month is a column containing discrete categories (5,6,7,8,9) and there are columns containing measurements c("Ozone", "Solar.R","Wind"). We can calculate the average measurement for each measurement type for each Month.

df=airquality
split=split(df,df$Month)
sapply(split, function(x) colMeans(x[,c("Ozone", "Solar.R","Wind")],na.rm=TRUE)) 
##                 5         6          7          8         9
## Ozone    23.61538  29.44444  59.115385  59.961538  31.44828
## Solar.R 181.29630 190.16667 216.483871 171.857143 167.43333
## Wind     11.62258  10.26667   8.941935   8.793548  10.18000

There is another way to do this using dplyr and the pipe operator.
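A sketch of the dplyr version of the airquality summary above (group_by(), summarize(), and the pipe %>% are covered in the next section; across() applies the same summary to several columns):

library(dplyr)

airquality %>%
  group_by(Month) %>%
  dplyr::summarize(across(c(Ozone, Solar.R, Wind), ~ mean(.x, na.rm = TRUE)))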

Manipulating tidy data with dplyr

Data manipulation includes variable-by-variable transformation (e.g., log or sqrt), as well as aggregation, filtering and reordering. These are the four fundamental verbs of data manipulation:

  • Filter: subsetting or removing observations based on some condition.
  • Transform: adding or modifying variables. These modifications can involve either a single variable (e.g., log-transformation), or multiple variables (e.g., computing density from weight and volume).
  • Aggregate: collapsing multiple values into a single value (by summing or taking means, etc.)
  • Sort: changing the order of observations.

Filtering and transforming are performed by the base R functions subset() and transform(). These are input and output-tidy. The aggregate() function performs group-wise aggregation. It is input-tidy. Provided that a single aggregation method is used, it is also output-tidy. The plyr package provides tidy summarise() and arrange() functions for aggregation and sorting.

With dplyr, you never actually alter the original data frame. Note that dplyr functions don't modify data; they return modified copies of the original data. Save a function's result to a variable if you want to keep using it.

dplyr’s five main functions

  • select() filters, chooses variables (cols) / subsets
  • mutate() transforms, creates new variables as function of existing variables
  • filter() filters, chooses observations (row) that meet a logical test
  • arrange() sorts, sorts the rows of your data based on the values in a column (or columns) in ascending or descending order, or based on the value returned from an expression.
  • summarize() aggregates, it calculates summary statistics (summing, taking means etc.)

Other important tools:

  • The pipe operator %>% to chain commands together
  • group_by allows us to calculate summary stats over subgroupings of data

Basics of logical operators

Tests can be combined with and (&), or (|), and not (!). Ex: filter(df, a > 0 & b > 0), filter(df, !is.na(x))

select()

The select function is used to choose which variables (columns) to keep (or remove) in a data frame or tibble. Like all dplyr functions, select expects a data frame or tibble as the first argument. The remaining arguments indicate the variables to select. In all dplyr functions, you don’t need quotes around variable names as you do in most base-R functions.

select(df, col1, col2, col3) selects listed columns

select(df, col1:col4) selects col1, col2, col3, col4 (inclusive:inclusive)

select(df, -c(col1, col2)) excludes col1 and col2, selects all other columns

select helper functions: dplyr comes with helper functions for selecting groups of variables

  • starts_with("X"): every name that starts with "X"
  • ends_with("X"): every name that ends with "X"
  • contains("X"): every name that contains "X"
  • matches("X"): every name that matches "X", where "X" can be a regular expression
  • num_range("x", 1:5): the variables named x1, x2, x3, x4 and x5
  • one_of(x): every name that appears in x, which should be a character vector

example: select(flightData, ends_with("Delay")) selects all columns in flightData that end with “Delay”.

Important note: by default, these helper functions are not case sensitive (they take an ignore.case argument).

mutate()

The mutate function adds new variables to dataset based on existing variables. For example, we could create a volume variable based on height, length, width. You can also use it to change the values of an existing variable, for example, converting a column in centimeters to meters: mutate(df, colName=colName/100)

To call mutate, pass a tibble/data frame as the first argument and definitions of new variables using a new_name = transformation expression.

mutate(df, newColName = expression)

Example: mutate(flightData, loss = ArrDelay - DepDelay) makes a new column loss, which contains the result of the expression ArrDelay - DepDelay for each row in flightData.

mutate() is great for making date columns! mutate(df, Date = paste(Year, Month, DayofMonth, sep = "-")) adjust to your desires!

Note: you can add as many new columns at once as you want in mutate(); just separate them by commas

filter()

The filter function chooses observations (rows) to keep based on a logical expression and filters out the rest. To use filter, give it the name of a data frame/tibble and one or more logical tests.

Especially helpful for binary columns.

filter(df, colName <insert logic here>)

Example: filter(flightData, Cancelled == 1) keeps every row in flightData where Cancelled == 1.

the %in% operator

The %in% operator can be used with filter to retain only rows that contain a certain value c("a", "b", "c") within a specified column colName in the df.

filter(df, colName %in% c("a", "b", "c"))

arrange()

The arrange() function reorders rows of the data set. It is used for sorting.

Pass arrange() a tibble or data frame to arrange, and tell it what column(s) to sort by. You can also indicate whether to sort in ascending or descending order. (default = ascending order, to get descending order: desc(colName))

arrange(df, colName...) list as many colNames as desired

You can also sort by expressions! arrange(df, <your expression>)

summarize()

The summarize() function uses existing data to create a new dataset of summary statistics. The summarize syntax is like mutate: indicate the data frame or tibble to summarize, then give a list of summary names and operations using a name = expression pattern, as in this example: summarize(df, newColName = <summarizing function(column)>...)
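A concrete sketch on the built-in iris data, using the name = expression pattern:

library(dplyr)

dplyr::summarize(iris,
  meanSL = mean(Sepal.Length), # mean sepal length
  sdSL = sd(Sepal.Length),     # standard deviation
  n = n())                     # number of rows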

aggregate functions in summarize():

  • sum(x) - sum of vector x.
  • min(x) - minimum value of vector x.
  • max(x) - maximum value of vector x.
  • mean(x) - mean value of vector x.
  • median(x) - median value of vector x.
  • quantile(x, p) - pth quantile of vector x.
  • sd(x) - standard deviation of vector x.
  • var(x) - variance of vector x.
  • IQR(x) - Inter Quartile Range (IQR) of vector x.
  • diff(range(x)) - total range of vector x.
  • first(x) - The first element of vector x.
  • last(x) - The last element of vector x.
  • nth(x, n) - The nth element of vector x.
  • n() - The number of rows in the data.frame or group of observations that summarize() describes.
  • n_distinct(x) - The number of unique values in vector x.

Custom functions in summarize():

You can use your own functions inside summarize(); fun below is a placeholder for any function you have defined. For example, summarize(df, mean_value = mean(variable), custom_summary = fun(variable))

To define and use your custom function:

# Example custom function that calculates the range of a vector
custom_range <- function(x) {
  max_value <- max(x, na.rm = TRUE)
  min_value <- min(x, na.rm = TRUE)
  return(max_value - min_value)
}

library(dplyr)

df <- data.frame(
  group = rep(c("A", "B", "C"), each = 4),
  value = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12))

# Use the custom function in summarize
result <- df %>%
  group_by(group) %>%
  dplyr::summarize(custom_range_value = custom_range(value))

result
## # A tibble: 3 × 2
##   group custom_range_value
##   <chr>              <dbl>
## 1 A                      3
## 2 B                      3
## 3 C                      3

Using group_by() with summarize():

If you’re using summarize() within a group_by() operation, you can apply the above functions to each group.

In this library(yarrr) example, sex is a column containing discrete categories and there are columns containing measurements, age and beard.length. We can calculate the average measurement for each measurement type for each sex.

library(yarrr)
library(dplyr)

group_by(pirates, sex) %>%
  dplyr::summarize(
    meanAge = mean(age),
    meanBeardLength = mean(beard.length, na.rm = TRUE)
  )
## # A tibble: 3 × 3
##   sex    meanAge meanBeardLength
##   <chr>    <dbl>           <dbl>
## 1 female    29.9           0.399
## 2 male      25.0          19.4  
## 3 other     27            14.9

Examining Data

Descriptive statistics

The fivenum function computes Tukey's five-number summary (minimum, lower hinge, median, upper hinge, and maximum) for any numeric variable in your dataset. Notice it produces 'five numbers' for the specified variable.

fivenum(iris$Sepal.Length)
## [1] 4.3 5.1 5.8 6.4 7.9

We can use the Hmisc package's describe() to compute n, the number missing, the number of unique values, the mean, the 5th/10th/25th/50th/75th/90th/95th percentiles, and the 5 lowest and 5 highest values for our dataset. Caution: if the psych package is also attached, its describe() masks Hmisc's and produces the psych-style table shown below; call Hmisc::describe(iris) to be explicit (see Preventing bugs above).

library(Hmisc) # Activates the Hmisc package from the library
describe(iris) # runs the descriptive stats function `describe` from Hmisc package on the `iris` data frame
##              vars   n mean   sd median trimmed  mad min max range  skew
## Sepal.Length    1 150 5.84 0.83   5.80    5.81 1.04 4.3 7.9   3.6  0.31
## Sepal.Width     2 150 3.06 0.44   3.00    3.04 0.44 2.0 4.4   2.4  0.31
## Petal.Length    3 150 3.76 1.77   4.35    3.76 1.85 1.0 6.9   5.9 -0.27
## Petal.Width     4 150 1.20 0.76   1.30    1.18 1.04 0.1 2.5   2.4 -0.10
## Species*        5 150 2.00 0.82   2.00    2.00 1.48 1.0 3.0   2.0  0.00
##              kurtosis   se
## Sepal.Length    -0.61 0.07
## Sepal.Width      0.14 0.04
## Petal.Length    -1.42 0.14
## Petal.Width     -1.36 0.06
## Species*        -1.52 0.07

We can use the pastecs package to compute useful descriptive stats for each variable of our dataset at one time! We can get the number of values, number of null values, number of NA values, min, max, range, sum, median, mean, SE.mean, CI.mean, std.dev, coef.var ALL AT ONCE!!!

library(pastecs) # activate pastecs package
stat.desc(iris) # Run pastecs descriptive stats function on the iris dataset
##              Sepal.Length  Sepal.Width Petal.Length  Petal.Width Species
## nbr.val      150.00000000 150.00000000  150.0000000 150.00000000      NA
## nbr.null       0.00000000   0.00000000    0.0000000   0.00000000      NA
## nbr.na         0.00000000   0.00000000    0.0000000   0.00000000      NA
## min            4.30000000   2.00000000    1.0000000   0.10000000      NA
## max            7.90000000   4.40000000    6.9000000   2.50000000      NA
## range          3.60000000   2.40000000    5.9000000   2.40000000      NA
## sum          876.50000000 458.60000000  563.7000000 179.90000000      NA
## median         5.80000000   3.00000000    4.3500000   1.30000000      NA
## mean           5.84333333   3.05733333    3.7580000   1.19933333      NA
## SE.mean        0.06761132   0.03558833    0.1441360   0.06223645      NA
## CI.mean.0.95   0.13360085   0.07032302    0.2848146   0.12298004      NA
## var            0.68569351   0.18997942    3.1162779   0.58100626      NA
## std.dev        0.82806613   0.43586628    1.7652982   0.76223767      NA
## coef.var       0.14171126   0.14256420    0.4697441   0.63555114      NA

The psych package's describe() gives us: item name, item number, number of valid data points (n), mean, SD, median, mad, min, max, skew, kurtosis, and SE. What a useful tool!!

The describeBy function allows you to examine whether the descriptive stats differ across the levels of a categorical variable.

library(psych) # Activate psych package from the library
describe(iris)
##              vars   n mean   sd median trimmed  mad min max range  skew
## Sepal.Length    1 150 5.84 0.83   5.80    5.81 1.04 4.3 7.9   3.6  0.31
## Sepal.Width     2 150 3.06 0.44   3.00    3.04 0.44 2.0 4.4   2.4  0.31
## Petal.Length    3 150 3.76 1.77   4.35    3.76 1.85 1.0 6.9   5.9 -0.27
## Petal.Width     4 150 1.20 0.76   1.30    1.18 1.04 0.1 2.5   2.4 -0.10
## Species*        5 150 2.00 0.82   2.00    2.00 1.48 1.0 3.0   2.0  0.00
##              kurtosis   se
## Sepal.Length    -0.61 0.07
## Sepal.Width      0.14 0.04
## Petal.Length    -1.42 0.14
## Petal.Width     -1.36 0.06
## Species*        -1.52 0.07
describe(iris$Sepal.Length) #df$numVariable
##    vars   n mean   sd median trimmed  mad min max range skew kurtosis   se
## X1    1 150 5.84 0.83    5.8    5.81 1.04 4.3 7.9   3.6 0.31    -0.61 0.07
describeBy(iris$Sepal.Length, iris$Species) #df$numVariable, df$chrVariable
## 
##  Descriptive statistics by group 
## group: setosa
##    vars  n mean   sd median trimmed mad min max range skew kurtosis   se
## X1    1 50 5.01 0.35      5       5 0.3 4.3 5.8   1.5 0.11    -0.45 0.05
## ------------------------------------------------------------ 
## group: versicolor
##    vars  n mean   sd median trimmed  mad min max range skew kurtosis   se
## X1    1 50 5.94 0.52    5.9    5.94 0.52 4.9   7   2.1  0.1    -0.69 0.07
## ------------------------------------------------------------ 
## group: virginica
##    vars  n mean   sd median trimmed  mad min max range skew kurtosis   se
## X1    1 50 6.59 0.64    6.5    6.57 0.59 4.9 7.9     3 0.11     -0.2 0.09

Assessing Normality

Visual inspection can sometimes be unreliable and/or difficult to judge. It’s possible to use a test of statistical significance to compare the distribution of your data to a normal distribution in order to ascertain whether the data actually exhibit a serious deviation from normality.

There are several methods for normality testing, such as the Kolmogorov-Smirnov (K-S) normality test and Shapiro-Wilk's test.

Shapiro-Wilk’s test

Shapiro-Wilk's method is widely recommended for normality testing, and it provides better power than K-S. It is based on the correlation between the data and the corresponding normal scores.

Note: normality tests are sensitive to sample size. Small samples most often pass normality tests. Therefore, it's important to combine visual inspection and a significance test in order to make the right decision.

  • Null hypothesis: sample distribution is normal
  • Alternative hypothesis: sample distribution is non-normal
shapiro.test(iris$Sepal.Length) #df$numVariable
## 
##  Shapiro-Wilk normality test
## 
## data:  iris$Sepal.Length
## W = 0.97609, p-value = 0.01018
with(iris, shapiro.test(Sepal.Length[Species == "setosa"])) #(responseVariable[factor == "name"]))
## 
##  Shapiro-Wilk normality test
## 
## data:  Sepal.Length[Species == "setosa"]
## W = 0.9777, p-value = 0.4595

Equal Variances
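One common option is Bartlett's test for homogeneity of variance across groups (bartlett.test() in base R). As with Shapiro-Wilk, it is a significance test; here the null hypothesis is that the group variances are equal. A minimal sketch:

bartlett.test(Sepal.Length ~ Species, data = iris) # numVariable ~ factor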

Transforming Data

Log Transformations

Log transformations allow us to alter the scale of a variable to focus on multiplicative changes instead of additive changes. In other words, they shift the view to be on relative changes instead of absolute changes. Such multiplicative/relative changes are also called changes in orders of magnitude.
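A quick base R illustration using the built-in islands data (landmass areas, heavily right-skewed):

hist(islands) # raw scale: a few huge values dominate
hist(log(islands)) # log scale: multiplicative differences become additive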

Data Visualization

Base R Visualizations

Scatterplot matrices can reveal patterns in data. plot(df) creates a simple matrix. These plots can also be colored in ways that make trends more obvious.

References: Plotting Scatterplot matrices in R & Adding non-overlapping legend in pairs()

Category colored scatterplot matrix

library(grDevices) # Required for the `adjustcolor` function
library(viridis) # Required for customColors

# Create a custom color palette
customColors = viridis(3)

# Create a pairs plot with custom point colors
pairs(iris[, 1:4], 
      col = customColors[iris$Species], 
      pch = 1, oma=c(3,3,3,15))
par(xpd = TRUE)
legend("bottomright", legend = levels(iris$Species), fill = customColors)

Comparison colored scatterplot matrix

data("Seatbelts")
Seatbelts.df = as.data.frame(Seatbelts)
# Coloring points by different level of variable "law"
pairs(Seatbelts.df[,1:7], col=ifelse(Seatbelts.df$law==0, "black", "red"))

Scatter plots display the relationship between two numerical variables. They show how changes in one variable relate to changes in another and help identify patterns, trends, and correlations in the data.

plot(iris$Petal.Length,iris$Petal.Width,
     pch=16,
     ylab = "numvariable",
     xlab = "numvariable", # both axes are numeric in a scatter plot
     main = "title")

Box plots, or box-and-whisker plots, can show us important information about the measure of central tendency and spread if we want to compare our data among factors, or categories.

The rectangular box represents the IQR, with a line inside the box indicating the median. Two “whiskers” extend from the box, showing the range of the data. Outliers are plotted individually as points.

boxplot(iris$Sepal.Length ~ iris$Species,
        horizontal = FALSE,
        ylab = "numvariable",
        xlab = "chrvariable",
        main = "title",
        col = c("#999999", "#E69F00", "#56B4E9"))

Histograms are used to examine the distributions in our dataset. We can use the hist command to create histograms for any parameters we are interested in. We can use the abline tool to also represent the median and mean of the data variable.

hist(iris$Sepal.Length) # Create histogram for numvariable
abline(v=mean(iris$Sepal.Length),col="blue") # add a line to the histogram to represent the mean value of numvariable
abline(v=median(iris$Sepal.Length),col="red") # add a line to the histogram to represent the median value of numvariable

Strip charts

stripchart(iris$Sepal.Length ~ iris$Species,
         pch = 19, frame = FALSE, vertical = TRUE,
         method = "jitter",
         ylab = "numvariable",
         xlab = "chrvariable",
         main = "title",
         col = c("#999999", "#E69F00", "#56B4E9"))

Bar plots: barplot(x) creates a bar plot of one variable.

# Bar plot of one variable
barplot(table(iris$Species)) # counts per category
# Horizontal bar plot: add horiz = TRUE

Stacked bar plots

myColors = c("lightblue", "mistyrose", "lightcyan","lavender", "cornsilk")
barplot(VADeaths,
        col = myColors)
legend("topright", legend = rownames(VADeaths), 
       fill = myColors, box.lty = 0, cex = 0.8)

Grouped bar plots

myColors = c("lightblue", "mistyrose", "lightcyan","lavender", "cornsilk")
barplot(VADeaths,
         col = myColors, beside = TRUE)
legend("topleft", legend = rownames(VADeaths), 
       fill = myColors, box.lty = 0, cex = 0.8)

ggplot Visualizations

QQ plots are used to check whether the data appears to be normally distributed. Inside the gray region = appears normal.

library(ggpubr)
ggqqplot(iris, x = "Sepal.Length") +
  labs(title="QQ Plot")

ECDF plots show the fraction of data smaller than or equal to x: how the data are distributed across their range, revealing the proportion of values that are less than or equal to a specific value. They provide a visual representation of the dataset's cumulative distribution.

ggecdf(iris, x = "Sepal.Length", 
       main = "Fraction of data smaller than 'X'")

Strip charts let us look at the spread and central tendency of the data as well as the actual data points.

ggstripchart(iris, x = "Species", y = "Sepal.Length", color = "Species", add = "mean_sd")

Introduction to ggplots

library(ggplot2)

ggplot(diamonds) #if only the dataset is known.

ggplot(diamonds, aes(x=carat)) #if only X-axis is known. The Y-axis can be specified in future geoms.

ggplot(diamonds, aes(x=carat, y=price)) #if both X and Y axes are fixed for all layers.

ggplot(diamonds, aes(x=carat, color=cut)) #Each category of the ‘cut’ variable will now have a distinct color, once a geom is added.

Custom themes

In many situations, you will need to follow a style guide. You can make a custom theme to resolve this issue.

References: stack

List of default font options (there are more, these are just the ones I find relevant):

  • “Times” is Times New Roman
  • “sans” is Arial

Methods Theme

This theme will be a masterpiece and will meet the following requirements:

  • All theme elements black and white
  • No lines around plot or grid lines, only black axes lines
  • Always display test statistic and p value
  • Same font as the paper the plot will be included in
  • Axis labels must be very large, so size will need to be manually altered

The function will take in the following arguments:

  • font = "nameOfFont" default = NA,
  • base_size = n default = 12,

What can’t be done in the theme:

  • Bars in bar graphs must touch the x axis: add scale_y_continuous(expand = c(0, 0)) to your ggplot for vertical bar plot, and scale_x_continuous(expand = c(0, 0)) to your ggplot for a horizontal bar plot.
  • The color of points, bars, etc.: specify in geom() with fill = "black" or color = "black".
  • Adding the p-values and test statistics
library(ggplot2)

methodsTheme = 
  function(font = NA, base_size = 12){
  theme(
    #Text font
    text = element_text(family = font, size = base_size),
    
    #Color
    line = element_line(color = "black"),
    panel.background = element_rect(fill = "white"), #fill plot background
    legend.key = element_rect(fill = "white"), #fill background of legend key
    axis.line = element_line(color = "black"), #color axes lines
    
    #Positioning
    legend.position = "right", #change legend position
      #options: none, left, right, bottom, top
    
    #Text sizes and options
    plot.title = element_text(
      size = base_size*1.3, #title size
      face = "plain"), #options: plain, italic, bold, bold.italic
    axis.title = element_text(size = base_size), #axis titles text size
    axis.text = element_text(size = base_size, color = "black"), #axis labels/tick numbers text size
    legend.text = element_text(size = base_size), #legend text size
    legend.title = element_text(size = base_size) #legend title text size
  )
}

Visualizing continuous data by continuous data

Scatterplots display the relationship between two numerical variables. They show how changes in one variable relate to changes in another and help identify patterns, trends, and correlations in the data.

ggplot(iris, aes(x=Petal.Width, y=Petal.Length, shape=Species)) +
  geom_point() + 
  methodsTheme(font = "Times")

Facet wrap separates a ggplot based on a discrete categorical variable. It can assist in making sense of crowded plots!

library(ggplot2)

plot = ggplot(diamonds, aes(x = carat, y = price, color = cut)) +
  geom_point(alpha = 0.5) +
  geom_smooth(aes(fill=cut)) +
  theme(legend.position = "bottom") +
  labs(title = "Carat v price")

plot + facet_wrap(~cut)
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

Visualizing continuous data by categories

Box plots, or box-and-whisker plots, can show us important information about the measure of central tendency and spread if we want to compare our data among factors, or categories.

The rectangular box represents the IQR, with a line inside the box indicating the median. Two “whiskers” extend from the box, showing the range of the data. Outliers are plotted individually as points.

ggplot(iris, aes(x=Species, y=Sepal.Length, fill=Species)) +
  geom_boxplot()

Bar plots

ggplot(iris, aes(x = Species, y = Sepal.Length, fill = Species)) +
  geom_bar(stat = "summary", 
           fun = "mean", 
           position = "dodge") +
  labs(title = "Average Sepal Length by Species",
       x = "Species",
       y = "Average Sepal Length")

Density plots

ggplot(iris, aes(x=Petal.Width)) +
  geom_density(color = "red", fill = "red", alpha = 0.5)

ggplot(iris, aes(x=Petal.Width, color=Species, fill=Species)) +
  geom_density(alpha = 0.5)

Combining plots with gridExtra

Subfigures can be arranged into figures using the grid.arrange() function from library(gridExtra). Example: grid.arrange(p1, p2, p3, ncol = 2).
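A minimal sketch:

library(ggplot2)
library(gridExtra)

p1 <- ggplot(iris, aes(x = Sepal.Length)) + geom_histogram()
p2 <- ggplot(iris, aes(x = Species, y = Sepal.Length)) + geom_boxplot()
grid.arrange(p1, p2, ncol = 2) # arrange the two plots side by side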

Customizing ggplots

  • Zooming in is done using coord_cartesian(ylim=c(n, n), xlim=c(n,n)).
  • Labels are done using labs(title="title", x="label", y="label").

ggplot Themes

Themes can be set for all plots in a session with theme_set(); a sketch follows the list below.

Note that the theme functions can take the two arguments below:

  • base_size: base font size (to change the size of all plot text elements)
  • base_family: base font family
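For example, to apply a built-in theme with larger text to every subsequent plot in the session (a sketch):

library(ggplot2)

theme_set(theme_minimal(base_size = 14, base_family = "sans"))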

Helpful add ons:

  • scale_x_continuous(labels=dollar_format()): change x axis to dollars (dollar_format() comes from the scales package)

Color palettes

Viridis palettes are the most robust under various forms of colorblindness, and are quite pretty.

Key functions:

  • scale_color_viridis()
  • scale_fill_viridis()
  • viridis(n), magma(n), inferno(n) and plasma(n): generate color palettes for base plots, where n is the number of colors to return.

library(viridis)

#use in ggplot
library(ggplot2)
# Gradient color
ggplot(iris, aes(Sepal.Length, Sepal.Width))+
  geom_point(aes(color = Sepal.Length)) +
  scale_color_viridis(option = "D")+
  theme_minimal() +
  theme(legend.position = "bottom")

# Discrete color. use the argument discrete = TRUE
ggplot(iris, aes(Sepal.Length, Sepal.Width))+
  geom_point(aes(color = Species)) +
  geom_smooth(aes(color = Species, fill = Species), method = "lm") + 
  scale_color_viridis(discrete = TRUE, option = "D")+
  scale_fill_viridis(discrete = TRUE) +
  theme_minimal() +
  theme(legend.position = "bottom")
## `geom_smooth()` using formula = 'y ~ x'

#use in base R plots
barplot(1:10, col = viridis(10))

RColorBrewer palettes are beautiful and include colorblind-friendly palettes. The package contains 3 types of color palettes: sequential, diverging, and qualitative.

  • Sequential palettes (first list of colors), which are suited to ordered data that progress from low to high (gradient).
  • Qualitative palettes (second list of colors), which are best suited to represent nominal or categorical data. They do not imply magnitude differences between groups.
  • Diverging palettes (third list of colors), which put equal emphasis on mid-range critical values and extremes at both ends of the data range.

(Palette chart omitted; run display.brewer.all() to view all RColorBrewer palettes.)

  • display.brewer.pal(n = 8, name = 'Dark2') displays a palette
  • brewer.pal(n = 8, name = 'Dark2') returns the hexadecimal color codes, where n is the number of colors
  • scale_fill_brewer() for box plot, bar plot, violin plot, dot plot, etc
  • scale_color_brewer() for lines and points
library(RColorBrewer)

#Use in ggplot
boxplot = ggplot(iris, aes(x = Species, y = Sepal.Length, fill = Species)) +
  geom_boxplot()
boxplot + scale_fill_brewer(palette = "Dark2")

#Use in base R
boxplot(iris$Sepal.Length ~ iris$Species,
        col = brewer.pal(n=3, name="Dark2")) # generate a vector of n colors from the Dark2 palette

Grey color palettes are black and white!

Key functions:

  • scale_fill_grey() for box plot, bar plot, violin plot, dot plot, etc
  • scale_colour_grey() for points, lines, etc
library(ggplot2) #this is a ggplot function, so not applicable to base R plots

#Use in ggplot
boxplot = ggplot(iris, aes(x = Species, y = Sepal.Length, fill = Species)) +
  geom_boxplot()
boxplot + scale_fill_grey(start = 0.8, end = 0.3) # start and end avoid colors that are too light or too dark

Wes Anderson color palettes

  • names(wes_palettes) lists the available palette names
library(wesanderson)

#use in ggplot
boxplot = ggplot(iris, aes(x = Species, y = Sepal.Length, fill = Species)) +
  geom_boxplot()
#wes_palette(name, n, type = c("discrete", "continuous"))
boxplot + scale_fill_manual(values = wes_palette("Zissou1", n = 3, type = c("discrete"))) 

#use in base R plots
#wes_palette(name, n, type = "discrete", "continuous")
boxplot(iris$Sepal.Length ~ iris$Species,
        col = wes_palette("Zissou1", 3, type = "discrete")) 

Base R color palettes are included in base R.

To generate a vector of n contiguous colors:

  • col = name(n) where name is rainbow, heat.colors, terrain.colors, topo.colors, or cm.colors, and n is the number of colors desired.
# Use rainbow, heat.colors, terrain.colors, topo.colors, cm.colors
barplot(1:5, col=rainbow(5))