Lesson 2 | 31 Jan 2020

Reading and exploring the data (with style)

For example, to add striped lines (alternating row colors) to your table and highlight the hovered row, pass bootstrap_options = c("striped", "hover") to kable_styling().

The condensed option is also handy when you don’t want your table to take up too much space: it makes each row slightly shorter.

For small tables with only a few columns, a page-wide table looks awful. To avoid this, you can specify in kable_styling() whether the table should be full_width. By default, full_width is TRUE for HTML tables; for LaTeX the default is FALSE, so the usual LaTeX table look is kept unless you ask for something else.
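As a rough sketch of these options used together, the snippet below styles a small made-up table. It assumes the knitr and kableExtra packages are installed; toy_table is a hypothetical stand-in, not the lesson data.

```r
library(knitr)
library(kableExtra)

# Hypothetical stand-in data; the real lesson data is loaded later.
toy_table <- data.frame(sex = c("F", "F", "M"), beak = c(6.29, 8.44, 6.00))

styled <- kable(toy_table, format = "html") %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = FALSE   # do not stretch the table across the whole page
  )

styled  # in an R Markdown document this renders the styled HTML table
```

Setting format = "html" explicitly is a choice made here so the sketch behaves the same outside of a knitted document.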

What is the difference between the tidy() and kable() functions for table styling?

What I am noticing so far is that tidy() (from the broom package) works with matrices and vectors. For example, you can convert your data with as.matrix() and then call tidy() on the result, although the warning in the output below shows that calling tidy() on a matrix is deprecated.

On the other hand, kable() seems to be much more flexible in what it accepts: it can take matrices, data frames, and vectors.
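The notes do not show which packages are loaded. Judging from the functions called in this lesson, a setup along these lines is probably needed; treat the list as an assumption rather than the original setup chunk.

```r
# Assumed package setup, inferred from the functions called in this lesson.
library(readr)       # read_csv()
library(dplyr)       # select(), filter(), %>%
library(broom)       # tidy()
library(knitr)       # kable()
library(kableExtra)  # kable_styling()
library(ggformula)   # gf_histogram(), gf_boxplot(), gf_point()
library(gridExtra)   # grid.arrange()
```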

morphology_data <- read_csv("mate_trials_summer_2019.csv")

Parsed with column specification:
cols(
  ID_num = col_double(),
  TgroupID = col_character(),
  GgroupID = col_character(),
  sex = col_character(),
  beak = col_double(),
  thorax = col_double(),
  wing = col_double(),
  body = col_double(),
  w_morph = col_character(),
  recorder = col_character(),
  computer = col_character(),
  date_recorded_by_hand = col_character(),
  data_recorded_on_excel = col_character(),
  notes = col_character()
)

# Using tidy() to display the table 
morph_matrix <- as.matrix(morphology_data)
tidy(head(morph_matrix))

Warning: 'tidy.matrix' is deprecated.
See help("Deprecated")

# Using kable() and kable_styling() to display a styled table
kable(head(morphology_data)) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE)

ID_num	TgroupID	GgroupID	sex	beak	thorax	wing	body	w_morph	recorder	computer	date_recorded_by_hand	data_recorded_on_excel	notes
475	NA	T1	F	6.29	3.69	7.81	11.22	S	A	yes	09.11.19	09.11.19	NA
268	NA	T2	F	8.44	3.73	10.20	14.00	L	A	yes	09.11.19	09.11.19	NA
261	NA	T3	F	8.44	3.68	9.57	13.21	S	A	yes	09.11.19	09.11.19	too big for microscope
261	NA	T3	F	8.55	3.56	9.49	13.13	S	A	no	09.17.19	09.17.19	NA
284	NA	T4	F	8.42	3.83	9.54	13.15	L	A	yes	09.11.19	09.11.19	NA
327	NA	T5	F	8.82	3.79	10.20	13.90	L	A	yes	09.11.19	09.11.19	NA

What we learned together last time

The differences between vectors and lists in R.

list1 = c(1,2,3) # this is a vector NOT a list

list2 = list(1,2,3) # this is a list NOT a vector
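A small sketch of how the two behave differently in practice, using only base R. The variable names are made up for illustration.

```r
vec <- c(1, 2, 3)      # atomic vector: every element has the same type
lst <- list(1, 2, 3)   # list: each element is stored separately

is.list(vec)   # FALSE
is.list(lst)   # TRUE

# Arithmetic is vectorized over an atomic vector...
doubled <- vec * 2     # 2 4 6

# ...but fails on a list, so we catch the error to show the difference.
list_result <- tryCatch(lst * 2, error = function(e) "error")

# A list can mix types; c() coerces its inputs to a single type instead.
mixed_list <- list(1, "a", TRUE)
mixed_vec  <- c(1, "a", TRUE)   # becomes a character vector
```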

Updates from the last lesson.

When using select(), there is no need to pass the column names you want to pull out of your data frame as a vector of strings. You can just list the bare column names inside the parentheses.
‘$’ extracts a single column from a specific data frame by name. For example, female_data$beak returns the beak column of female_data as a vector.
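To make both points concrete, here is a minimal sketch on a made-up data frame (toy is hypothetical and stands in for morphology_data). It assumes the dplyr package is installed.

```r
library(dplyr)

# Hypothetical stand-in for morphology_data.
toy <- data.frame(
  sex  = c("F", "M", "F"),
  beak = c(6.29, 6.00, 8.44),
  wing = c(7.81, 6.90, 10.20)
)

by_name   <- select(toy, sex, beak)          # bare column names
by_string <- select(toy, c("sex", "beak"))   # strings also work, but are not required

names(by_name)     # "sex" "beak"
names(by_string)   # "sex" "beak"

beak_values <- toy$beak   # the beak column as a plain numeric vector
mean(beak_values)
```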

female_data <- morphology_data %>%
  select(sex, beak) %>%
  filter(sex =="F")

kable(head(female_data)) %>%
    kable_styling(bootstrap_options = c("striped", "hover", "condensed"), 
                  full_width = FALSE, position = "left")

sex	beak
F	6.29
F	8.44
F	8.44
F	8.55
F	8.42
F	8.82

f_mean <- mean(female_data$beak)
f_sd <- sd(female_data$beak)
f_max <- max(female_data$beak)
f_min <- min(female_data$beak)

# Get rid of rows with NA values

male_data <- morphology_data %>%
  select(sex, beak) %>%
  filter(sex == "M", !is.na(beak))  # is.na() is the reliable test for missing values
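Why is.na() rather than a comparison with NA? The base R sketch below, on a made-up vector, shows that comparing with NA never yields TRUE or FALSE, and that many summary functions need na.rm = TRUE when missing values are present.

```r
beak <- c(6.00, NA, 5.82)   # hypothetical values with one missing entry

beak != NA               # NA NA NA: a comparison with NA is always NA
keep  <- !is.na(beak)    # TRUE FALSE TRUE
clean <- beak[keep]      # 6.00 5.82

mean(beak)                 # NA, because one value is missing
mean(beak, na.rm = TRUE)   # 5.91
```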

kable(head(male_data)) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), 
                full_width = FALSE, position = "left")

sex	beak
M	6.00
M	6.40
M	5.82
M	6.35
M	5.45
M	5.97

m_mean <- mean(male_data$beak)
m_sd <- sd(male_data$beak)
m_max <- max(male_data$beak)
m_min <- min(male_data$beak)

# Let's build a summary table (a matrix) of the statistics from scratch.

summary_table <- matrix(c(f_mean, f_sd, f_max, f_min,
                          m_mean, m_sd, m_max, m_min),
                        nrow = 2, byrow = TRUE,
                        dimnames = list(c("female", "male"),
                                        c("mean", "sd", "max", "min")))
kable(summary_table) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), 
                full_width = FALSE, position = "left")

	mean	sd	max	min
female	7.641	0.9773	9.34	5.77
male	5.781	0.4585	6.91	4.69
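The same kind of summary can also be computed in one step with dplyr's group_by() and summarise(), which avoids creating a separate variable for each statistic. This is an alternative sketch on made-up data (toy is hypothetical), not the method used in the lesson.

```r
library(dplyr)

# Hypothetical stand-in for morphology_data.
toy <- data.frame(
  sex  = c("F", "F", "M", "M"),
  beak = c(6.29, 8.44, 6.00, 6.40)
)

summary_df <- toy %>%
  group_by(sex) %>%
  summarise(
    mean = mean(beak, na.rm = TRUE),
    sd   = sd(beak, na.rm = TRUE),
    max  = max(beak, na.rm = TRUE),
    min  = min(beak, na.rm = TRUE)
  )

summary_df   # one row per sex, one column per statistic
```

The result is a data frame, so it can be passed to kable() in the same way as the matrix.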

Data Visualization: Graphing the data.

To visualize the data we can draw histograms, which show how the values are distributed. Base R provides quick generic functions for this, such as:

hist()
boxplot()
plot()

female_hist <- hist(female_data$beak)

male_hist <- hist(male_data$beak)

female_bp <- boxplot(female_data$beak)

male_bp <- boxplot(male_data$beak)

But the ggformula package has specific functions that can make data visualization more dynamic:

gf_histogram() - count and density histograms in ggformula
gf_boxplot() - displays the distribution of a continuous variable. It visualises five summary statistics (the median, two hinges and two whiskers), and all “outlying” points individually.
gf_point() - scatterplots in ggformula

We can then use grid.arrange() (from the gridExtra package) to control how our graphs are laid out when we knit to a PDF. grid.arrange() arranges multiple grobs on a page. What is a grob? A grid graphical object (“grob”) is a description of a graphical item, such as a plot, that the grid graphics system can validate, draw, and modify.

R Colors! https://status.rstudio.com

Generic argument input into the ggformula functions:

gf_*(y ~ x, data = _____, ...)

y can be left out (write ~ x) or it can be your y/response variable (usually omitted for histograms and boxplots, and supplied for ‘point’ graphs)
x is your x variable
data can be your full data, or subsets of the data

# How does beak length differ by sex?
# (bins = 15 was dropped: when both are given, binwidth overrides bins)
p0 <- gf_histogram(~ beak, data = morphology_data, binwidth = 1, color = ~sex, fill = ~sex,
                   title="Beak length distribution by sex",
                   xlab= "Beak length (mm)",
                   ylab= "Number of soapberry bugs")

# How does thorax length differ by wing morph?
p5 <- gf_boxplot( ~thorax, data=morphology_data, color= ~w_morph)

# What is the relationship between beak length and thorax?
p6 <- gf_point(beak ~ thorax, data=morphology_data, color= ~w_morph)

# Let's arrange the graphs using grid.arrange()
grid.arrange(p0, p5, p6, ncol = 2)  # ncol must be a whole number of columns

Warning: Removed 1 rows containing non-finite values (stat_bin).

Warning: Removed 1 rows containing non-finite values (stat_boxplot).

Warning: Removed 1 rows containing missing values (geom_point).

# More graphs we can make to see different relationships:
p1 <- gf_histogram(~beak, data = female_data, color = "white", bins = 15, 
                   title = "Female Beak Lengths", xlab = "beak lengths (mm)")
p2 <- gf_histogram(~beak, data=male_data, color = "white", bins = 15, 
                   title = "Male Beak Lengths", xlab = "beak lengths (mm)")
p3 <- gf_boxplot(~beak, data= female_data)
p4 <- gf_boxplot(~beak, data=male_data)

grid.arrange(grobs = list(p1, p2, p3, p4), ncol=2)

Data visualization: Plotting the data.

Using the generic plot() function, we can plot our data and take advantage of the many arguments the function can take in order to produce a presentable graph.

Arguments:

main = title of graph
xlab = x-axis title
ylab = y-axis title
sub = subtitle placed below the x-axis
col = colors (http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf)

Font size can be modified using the graphical parameter cex. The default value is 1. A cex value below 1 shrinks the text, and a value above 1 enlarges it. The following arguments change the font size of specific elements:

cex.main : text size for main title
cex.lab : text size for axis title
cex.sub : text size of the sub-title
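As an illustration of these arguments, the sketch below plots the first few male beak lengths from the table above. The specific cex values and the subtitle text are arbitrary choices for demonstration.

```r
# First six male beak lengths (mm) from the table above.
beak <- c(6.00, 6.40, 5.82, 6.35, 5.45, 5.97)

plot(beak,
     main = "Male beak lengths",
     xlab = "Index",
     ylab = "Beak length (mm)",
     sub  = "Example subtitle",
     col  = "blue",
     cex.main = 1.5,   # main title at 1.5 times the base text size
     cex.lab  = 1.2,   # axis titles at 1.2 times the base text size
     cex.sub  = 0.8)   # subtitle at 0.8 times the base text size
```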

plot(female_data$beak)

plot(male_data$beak, main="male beak lengths",
     xlab = "Index",
     ylab = "beak lengths (mm)",
     sub = "Figure 2. Male beak lengths....",
     col= 'blue')

R Fundamentals

Data Types in R:

Numbers (integers and doubles, i.e. floating-point values: e.g. 1L and 6.88; a plain 1 is stored as a double)
Strings (anything in quotation marks). A string in R is stored as a character vector.

String Specific Functions:

print()
cat() - The cat function can be used to concatenate strings and then output them.
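A short base R sketch of the difference between the two, plus paste(), which builds a string without printing it. The value 5.77 is taken from the output later in this lesson; the message text is made up.

```r
beak_length <- 5.77

print("Beak length:")                      # prints the string with quotes and an index
cat("Beak length:", beak_length, "mm\n")   # joins the pieces with spaces and prints them

# paste() returns the joined string instead of printing it.
msg <- paste("Beak length:", beak_length, "mm")
```

Note that cat() does not add a newline on its own, which is why "\n" is included.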

Basic operations that can be used in R

+ or - addition or subtraction

* multiplication

/ division

%% remainder

^ (or **) exponent
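A quick worked example of each operator in base R, with arbitrary values a = 7 and b = 2:

```r
a <- 7
b <- 2

add_ab <- a + b    # 9
sub_ab <- a - b    # 5
mul_ab <- a * b    # 14
div_ab <- a / b    # 3.5
rem_ab <- a %% b   # 1, the remainder of 7 divided by 2
pow_ab <- a ^ b    # 49; a ** b gives the same result
```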

What are some if statements that you can do with these operations?

l <- c(1, 2, 3, 4, 5)
for (i in l) {
  # print only the even numbers (remainder 0 when divided by 2)
  if (i %% 2 == 0) {
    print(i)
  }
}

[1] 2
[1] 4

For loops, making counts, and if statements.

i <- 0                                   # count of short beaks
l <- vector(mode = "list", length = 0)   # empty list to collect them
for (b in female_data$beak) {
  if (b < 6.00) {
    i <- i + 1
    l <- c(l, b)
    cat("Beak length that's smaller than 6: ", b, "\n")
  }
}

Beak length that's smaller than 6:  5.77

print(i)

[1] 1

print(l)

[[1]]
[1] 5.77
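The same count can be obtained without a loop by using a logical condition to subset the vector. This is an alternative sketch in base R on made-up input: the values are the first six female beak lengths from the table above plus the 5.77 reported by the loop.

```r
beak <- c(6.29, 8.44, 8.44, 8.55, 8.42, 8.82, 5.77)

small   <- beak[beak < 6]   # keep only the values below 6
n_small <- length(small)    # how many there are

small     # 5.77
n_small   # 1
```

Unlike the loop, this keeps the result as a numeric vector rather than a list.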

Questions

What would this look like on a different/larger dataset?
What sort of questions do you want to answer with the morphology data and write up in a comprehensive HTML and/or PDF file?
What variables can you already hypothesize will be correlated?

Morphology Data Analysis

Anastasia Bernat