Assignment 1 - R practice

require(dplyr)
require(ggplot2)
require(tidyr)
require(vegan)
require(corrplot)

1) Loading data

Load both data files – Oak_data_47x216.csv and Oak_species_189x5.csv

Verify that the files have been loaded correctly. What dimensionality and class is each?

A: Oak dim: 47x217; class: data frame. Oak_species dim: 189x5; class: data frame.

# 1) loading data ----
Oak <- read.csv("data/Oak_data_47x216.csv", header = TRUE)
Oak_species <- read.csv("data/Oak_species_189x5.csv", header = TRUE)
#View(Oak) 
dim(Oak)
class(Oak)
#View(Oak_species) 
dim(Oak_species)
class(Oak_species)

2) Searching For Help

We will focus on four functions available in dplyr: select(), arrange(), filter(), and mutate(). Briefly describe what each function does and illustrate it using the oak dataset that you loaded above.

dplyr::select() = this function subset columns using their names and types (e.g., numeric). Operators ( :, ! , &, |, c()) and selection helpers (e.g., everything(), matches(), etc.) can be used in the select() function to target desired columns based on different criteria.
dplyr::arrange() = orders the rows of a data frame by the values of selected columns. Order can be descendent if desc() is used.
dplyr::filter() = The filter() function is used to subset a data frame, retaining all rows that satisfy your conditions. To be retained, the row must produce a value of TRUE for all conditions. Note that when a condition evaluates to NA the row will be dropped, unlike base subsetting with [. Filter functions can be used to refine our conditions, for instance we could use: ==, >, >= , &, |, !, xor(), is.na(), between(), near().
dplyr::mutate() = Create, modify, and delete columns. mutate() adds new variables and preserves existing ones; transmute() adds new variables and drops existing ones. New variables overwrite existing variables of the same name. Variables can be removed by setting their value to NULL. Some useful mutate() functions are: +, -, log(), etc., lead(), lag(), dense_rank(), min_rank(), percent_rank(), row_number(), cume_dist(), ntile(), cumsum(), cummean(), cummin(), cummax(), cumany(), cumall(), na_if(), coalesce(), if_else(), recode(), case_when().

# 2) Searching For Help ----
?dplyr
browseVignettes(package = "dplyr")

# select() example 
#colnames(Oak)
# select Oak spp. that start with "La" or "Li" and exclude "LatAppx", "Landform"
oak_select <- Oak %>%
  dplyr::select(starts_with("La") | starts_with("Li"), -c("LatAppx", "Landform"))
#head(oak_select)
#colnames(oak_select)

# arrange() example
?dplyr::arrange()
# arrange Oak data by ascending elevation values
oak_arrange <- Oak %>%
  arrange(Elev.m)
#head(oak_arrange)

# filter() example
?dplyr::filter()
# filter rows that include Liap values >= 0.5 at elevation greater than mean(Elev.m)
oak_filter <- Oak %>%
  filter(Liap >= 0.5 & Elev.m > mean(Elev.m))
#head(oak_filter)
# check if the conditions were satisfied 
# min(oak_filter$Liap) #0.5
# mean(Oak$Elev.m) #145.5745
# min(oak_filter$Elev.m) #152

# mutate() example
?dplyr::mutate
# create a new column Syal.s_elev that represents average Syal.s abundance by 
# elevation 
oak_mutate <- Oak %>%
  group_by(Elev.m) %>%
  mutate(Syal.s_elev = mean(Syal.s))
#View(oak_mutate)
#head(oak_mutate)
# check is the conditions were satisfied (e.g., where elevation is 152 
# Syal.s_elev is always 16.85714)
# oak_mutate$Elev.m
# oak_mutate$Syal.s_elev

3) Data Manipulation and Indexing

Answer the following questions. You may use regular indexing or the dplyr functions you explored above.

Which soil series is associated with Stand 17?

A: Carlton

Which soil series is most common? Least common?

A: Most common: Steiwer, count 2. Least common: Olympia, count 22.

Corvallis, Oregon, is 44.5667 N latitude, 123.2833 W longitude. How many plots are located east of Corvallis? West of Corvallis?

A: 20 plots are East of Corvallis, and 27 plots are West of Corvallis.

Which stand contained the most species, and how many species did it contain?

A: Stand 43, 61 species

Does species richness differ between stands that were grazed or ungrazed when the data were collected?

A: Based on the plot “SppRich vs GrazCurr”, stands with the lowest species richness were not grazed. Around the SppRich’s mean (red line) I observe both grazed and ungrazed stands, while the highest SppRich is found in a grazed stand. Moreover, SppRich’s values in the 1st quantile (below lower blue line) are mostly ungrazed, while those in the 4th quantile (above upper blue line) are more abundant in the grazed class. Thus, species richness seems to differ between stands that were grazed or ungrazed when the data were collected, although further analysis is needed to evaluate if the difference is significant. Lastly, I think the analysis should include past grazing (GrazPast) as well to provide more informative results.

On average, how many shrubs occurred in each stand?

A: On average, each stand had 13 shrub species (R output = 12.87234).

# 3) Data Manipulation and Indexing ----
# * a) Which soil series is associated with Stand 17? ----
stand17_soil <- Oak %>%
  filter(X == "Stand17") %>%
  select(SoilSeriesName)

print(stand17_soil)
# SoilSeriesName
# 1        Carlton

# * b) Which soil series is most common? Least common? ----
Oak$SoilSeriesName

oak_soil <- Oak %>%
  mutate(count = as.numeric(1)) %>%
  group_by(SoilSeriesName) %>%
  mutate(soil_count = sum(count)) %>%
  select(SoilSeriesName, soil_count) %>%
  distinct(soil_count, .keep_all = TRUE) %>%
  arrange(soil_count)

#View(oak_soil)
nrow(oak_soil) # 7 
oak_soil[1, ] # least common 
# SoilSeriesName soil_count
# <chr>               <dbl>
#   1 Olympia                 2
oak_soil[7, ] # most common
# SoilSeriesName soil_count
# <chr>               <dbl>
#   1 Steiwer                22

# * c) Corvallis, Oregon, is 44.5667 N latitude, 123.2833 W longitude. ----
# How many plots are located east of Corvallis? West of Corvallis?

oak_east <- Oak %>%
  filter(LongAppx > 123.2833) 
nrow(oak_east) #20
oak_west <- Oak %>%
  filter(LongAppx < 123.2833)
nrow(oak_west) # 27

# * d) Which stand contained the most species, and how many species did it contain? ----
length(unique(Oak$X)) #47
oak_SppRich <- Oak %>%
  arrange(desc(SppRich)) %>%
  select(X, SppRich)
#View(oak_SppRich)

oak_SppRich[1,]
# X SppRich
# 1 Stand43      61

# * e) Does species richness differ between stands that were grazed or ungrazed ----
# when the data were collected? GrazCurr 
oak_graz <- oak_SppRich <- Oak %>%
  arrange(desc(SppRich)) %>%
  select(SppRich, GrazCurr)
#View(oak_graz)
summary(oak_graz$SppRich)
#   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
# 16.00   29.50   34.00   34.45   39.00   61.00 
quantile((oak_graz$SppRich))
#   0%  25%  50%  75% 100% 
# 16.0 29.5 34.0 39.0 61.0 
#library(ggplot2)
oak_graz %>%
  ggplot(aes(GrazCurr, SppRich)) +
  geom_point(alpha = 0.5) +
  geom_hline(yintercept = mean(oak_graz$SppRich), col = "red") +
  geom_hline(yintercept = quantile(oak_graz$SppRich)[c(2,4)], col = "blue") +
  ggtitle("SppRich vs GrazCurr")

# * f) On average, how many shrubs occurred in each stand? ----
# don't know how to answer this question since there is not a field that counts the number of shrubs 
# library(tidyr)
# library(vegan)

oak_shrub <- Oak %>%
  select(contains(".s")) %>%
  decostand(method = "pa") 

Oak_shrubRich <- Oak %>% 
  mutate(shrubRich = rowSums(oak_shrub))

mean(Oak_shrubRich$shrubRich) #12.87234

4) Relativizing Data

Someone has asked you whether the number of shrub species relates to the depth of the A horizon.

Answer their question using the raw depth data.

A: In the plot “shrubRich ~ AHoriz” the blue line represents the linear fit of the data (assuming that both variables satisfy the assumption of normal distribution). Based on visual interpretation, I observe that the relationship between shrub richness and the depth of the A horizon is weak and slightly negative. This observation is also supported by their Spearman’s and Pearson’s correlation coefficients, -0.13 and -0.21 respectively.

Would relativizing the depth of the A horizon by its maximum value affect your conclusion? Why or why not?

A: It is unlikely that relativizing the depth of the A horizon by its maximum value will affect my conclusion. The A horizon distribution is right-skewed (graph “AHoriz”) while the shrub richness as a more bell-shaped distribution (graph “shrubRich”), and applying a linear standardization may not reduce the skewness. Additionally, the CV of the A horizon has a moderate (between 50-100) magnitude of 64.3, and according to the material covered in class, relativization tends to be more effective when the CV’s magnitude is large. A logarithmic transformation could be explored to see if a possible skewness reduction could improve the relationship between the two variables.

Demonstrate the effect of this relativization by applying it, using the relativized data to answer their question, and comparing with your results above. Note: pay attention to more than just the p-value.

A: As show in the graph “shrubRich ~ AHorizMax”, the relativization did not visibly affect the relationship between the two variables. Moreover, as shown by the graph “AHorizMax”, the relativization did not reduce the skewness of the A horizon distribution. Finally, in both cases the p-value is not significant (0.166), the adjusted R-squared is extremely low (0.02), and by relativinzing the A horizon the standard error increased form 0.04 to 2.26.

I further explored the data by applying a natural log and a log 10 transformations, and although the skewness was reduced, the fitted models produced results showing a weak relationship between the two variables and a non significant p-value (0.099) assuming an alpha = 0.05 (see graphs “AHorizLog”, “AHorizLog10”, “shrubRich ~ AHorizLog”, “shrubRich ~ AHorizLog10”)

You also have information about the B horizon. Propose a relativization that would allow you to incorporate this information into your analysis, while still relating the A horizon to shrub richness. How does this change the question being addressed?

A: I propose the relativization method “Total”, which assigns to each variable its proportional value by row. I’m going to apply this method to the variables AHoriz and BHoriz. By using this method. the question should be: “Considering the A horizon depth as relative to the B horizon depth, does the number of shrub specie relate to the proportion of the A horizon?

A: As shown in the graph “shrubRich ~ AHorizTot”, this relativization did not improve the relationship between shrub richness and A horizon, and did not reduce the skewness of the A horizon (graph “AHorizTot”). Moreover, the fitted model produced the worst results so far, including a p-value = 0.601, a standard error = 2.68, and an adjusted R-squared = -0.02.

# 4 Relativizing Data ----
# Someone has asked you whether the number of shrub species relates to the depth 
# of the A horizon

# * a) Answer their question using the raw depth data ----
# plot the data 
Oak_shrubRich %>%
  ggplot(aes(AHoriz, shrubRich)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm") +
  ggtitle("shrubRich ~ AHoriz")

#?geom_smooth

# check correlation  
#library(corrplot)
Oak_shrubRich %>%
  select(AHoriz, shrubRich) %>%
  cor(., method = "spearman")
#             AHoriz    shrubRich 
# AHoriz     1.0000000 -0.1268875
# shrubRich -0.1268875  1.0000000

Oak_shrubRich %>%
  select(AHoriz, shrubRich) %>%
  cor(., method = "pearson")
#               AHoriz shrubRich
# AHoriz     1.000000 -0.205476
# shrubRich -0.205476  1.000000

# * b)  Would relativizing the depth of the A horizon by its maximum value ----
# affect your conclusion? Why or why not?

# check the distribution of the data 
hist(Oak$AHoriz, main = "AHoriz") # right-skewed

# CV function 
CV <- function(x) {100 * sd(x) / mean(x)}
CV(Oak$AHoriz) # 64.31733

# claculate CV manually (sd/mu) * 100 to double check if the function works 
(sd(Oak$AHoriz)/mean(Oak$AHoriz))*100 # 64.31733, the magnitude is moderate (within 50-100)


# * c) Demonstrate the effect of this relativization by applying it, ----
# using the relativized data to answer their question, and comparing with your 
# results above. Note: pay attention to more than just the p-value.

?decostand

AHoriz_r <- Oak %>%
  select(AHoriz) %>%
  decostand(method = "max")
summary(AHoriz_r) # check if the range of values is between 0 and 1

Oak_AHoriz_r <- Oak %>% 
  mutate(shrubRich = rowSums(oak_shrub)) %>%
  mutate(AHorizMax = AHoriz_r$AHoriz)

# check again the distribution of the data 
#hist(Oak_AHoriz_r$AHoriz, main = "AHoriz") # right-skewed 
hist(Oak_AHoriz_r$AHorizMax, main = "AHorizMax") # right-skewed

hist(Oak_AHoriz_r$shrubRich, main = "shrubRich") # ~ bell-shaped

# plot the data 
Oak_AHoriz_r %>%
  ggplot(aes(AHorizMax, shrubRich)) +
  geom_point(alpha = 0.5) +
  #geom_abline(intercept = 0, slope = 3, size=0.5, color = "red") +
  geom_smooth(method = "lm") +
  ggtitle("shrubRich ~ AHorizMax")

# * d) You also have information about the B horizon. ----
# Propose a relativization that would allow you to incorporate this information 
# into your analysis, while still relating the A horizon to shrub richness. 
# How does this change the question being addressed?

# I suggest to use "total" to get the proportion of the A and B horizon. 
ABHoriz_r <- Oak %>%
  select(AHoriz, BHoriz) %>%
  decostand(method = "total")

#check if the sum by row adds to 1 
rowSums(ABHoriz_r) # yes!

# * e) Demonstrate the effect of this relativization by applying it, ----
#using the relativized data to answer their question, and comparing with your 
#results above. Note: pay attention to more than just the p-value.

ABHoriz_r <- Oak %>%
  select(AHoriz, BHoriz) %>%
  decostand(method = "total")

Oak_ABHoriz_r <- Oak %>% 
  mutate(shrubRich = rowSums(oak_shrub)) %>%
  mutate(AHorizMax = AHoriz_r$AHoriz) %>%
  mutate(AHorizTot = ABHoriz_r$AHoriz) %>%
  mutate(BHorizTot = ABHoriz_r$BHoriz)

# check if the A and B horizons were correctly added to the Oak dataset
summary(ABHoriz_r$AHoriz)
summary(Oak_ABHoriz_r$AHorizTot)
summary(ABHoriz_r$BHoriz)
summary(Oak_ABHoriz_r$BHorizTot)

# plot the data - total transformation 
Oak_ABHoriz_r %>%
  ggplot(aes(AHorizTot, shrubRich)) +
  geom_point(alpha = 0.5) +
  geom_abline(intercept = 0, slope = 3, size=0.5, color = "red") +
  geom_smooth(method = "lm") +
  ggtitle("shrubRich ~ AHorizTot")

hist(Oak_ABHoriz_r$AHorizTot, main = "AHorizTot")

# try log transformation 
AHoriz_r <- Oak %>%
  select(AHoriz) %>%
  decostand(method = "log")
summary(AHoriz_r) # check if the range of values is between 0 and 1

Oak_AHoriz_log <- Oak %>% 
  mutate(shrubRich = rowSums(oak_shrub)) %>%
  mutate(AHorizLog = AHoriz_r$AHoriz) %>%
  mutate(AHorizLog10 = log10(Oak$AHoriz))

# plot the data - natural log # BEST TRANSFORMATION 
Oak_AHoriz_log %>%
  ggplot(aes(AHorizLog, shrubRich)) +
  geom_point(alpha = 0.5) +
  #geom_abline(intercept = 0, slope = 3, size=0.5, color = "red") +
  geom_smooth(method = "lm") +
  ggtitle("shrubRich ~ AHorizLog")

hist(Oak_AHoriz_log$AHorizLog, main = "AHorizLog")

# plot the data - log 10
Oak_AHoriz_log %>%
  ggplot(aes(AHorizLog10, shrubRich)) +
  geom_point(alpha = 0.5) +
  #geom_abline(intercept = 0, slope = 3, size=0.5, color = "red") +
  geom_smooth(method = "lm") +
  ggtitle("shrubRich ~ AHorizLog10")

hist(Oak_AHoriz_log$AHorizLog10, main = "AHorizLog10")

lm.raw <- lm(shrubRich ~ AHoriz, data = Oak_shrubRich)
summary(lm.raw) # pvalue 0.166, std. err 0.04341, Adjusted R-squared:  0.02094 

lm.max <- lm(shrubRich ~ AHorizMax, data = Oak_AHoriz_r)
summary(lm.max) # pvalue 0.166, std. err 2.2575, Adjusted R-squared:  0.02094 

lm.tot <- lm(shrubRich ~ AHorizTot, data = Oak_ABHoriz_r)
summary(lm.tot) # pvalue 0.601, std. err 2.68, Adjusted R-squared:  -0.01594

lm.log <- lm(shrubRich ~ AHorizLog, data = Oak_AHoriz_log)
summary(lm.log) # pvalue 0.09922 , std. err 0.4707, Adjusted R-squared:  0.03834 

lm.log10 <- lm(shrubRich ~ AHorizLog10, data = Oak_AHoriz_log)
summary(lm.log10) # pvalue 0.0992, std err 1.564, Adjusted R-squared:  0.03834

Assignment 1 - R practice

Astrid Sanna

1/12/2023

1) Loading data

2) Searching For Help

3) Data Manipulation and Indexing

4) Relativizing Data