Data Science in R: Tree strength comparison

Introduction
Part 1: Comparison of redwoods against dataset
Part 2: Comparing tree strength as a function of species
Part 3: Using existing analysis as a baseline

Introduction

This is a script to compare strength of Redwoods against other trees and to display that data visually, based on CSV data.

Part 1: Comparison of redwoods against dataset

First – for nice graphs, let’s use the ggplot2 and gpairs packages.

require(ggplot2)

## Loading required package: ggplot2

require(gpairs)

## Loading required package: gpairs

Cleaning the data

Next, input the CSV file. The na.string arguement drops 0 and - values. This will help us work with the data by removing meaningless entries from the start. Let’s further clean the data by creating a subset of the dataframe to remove trees without groups (null in the Tree.Group header) and 0’s from the na header. Otherwise, we would have some trees grouped that are not related and with no label or R would return an error for no-finite value (the NA’s), but still complete the qplot command:

my.Trees<- read.csv("trees.csv",
                    header = TRUE,
                    na.string = c("-","0"))
named.Trees <- subset(my.Trees, my.Trees$Tree.Group != "null")
named.Trees <- subset(named.Trees, named.Trees$Strength != "NA")

Exploring the data: headers and quartiles

Now let’s take a look at the dataset we’re dealing with. This has the added benefit of displaying header names, which is required for manipulating the dataframe.

head(named.Trees)

##      Tree.Species Tree.Group SpecG Modulus Break.Height Compression
## 2      Ash, Black        Ash  0.49    1.60           35        5970
## 3       Ash, Blue        Ash  0.58    1.40           NA        6980
## 4      Ash, Green        Ash  0.56    1.66           32        7080
## 5     Ash, Oregon        Ash  0.55    1.36           33        6040
## 6      Ash, White        Ash  0.60    1.74           43        7410
## 7 Aspen, Bigtooth      Aspen  0.39    1.43           NA        5300
##   Compress.perp Strength
## 2           760     1570
## 3          1420     2030
## 4          1310     1910
## 5          1250     1790
## 6          1160     1910
## 7           450     1080

summary(named.Trees)

##           Tree.Species   Tree.Group     SpecG         Modulus    
##  Ash, Black     : 1    Oak    :17   Min.   :0.31   Min.   :0.80  
##  Ash, Blue      : 1    Pine   :15   1st Qu.:0.40   1st Qu.:1.35  
##  Ash, Green     : 1    Fir    : 7   Median :0.49   Median :1.57  
##  Ash, Oregon    : 1    Cedar  : 6   Mean   :0.51   Mean   :1.57  
##  Ash, White     : 1    Ash    : 5   3rd Qu.:0.63   3rd Qu.:1.77  
##  Aspen, Bigtooth: 1    Hickory: 5   Max.   :0.88   Max.   :2.28  
##  (Other)        :79    (Other):30                                
##   Break.Height   Compression   Compress.perp     Strength   
##  Min.   :12.0   Min.   :3960   Min.   : 300   Min.   : 790  
##  1st Qu.:22.8   1st Qu.:5410   1st Qu.: 550   1st Qu.:1110  
##  Median :30.5   Median :6250   Median : 770   Median :1390  
##  Mean   :33.2   Mean   :6399   Mean   : 860   Mean   :1480  
##  3rd Qu.:41.0   3rd Qu.:7200   3rd Qu.:1060   3rd Qu.:1850  
##  Max.   :88.0   Max.   :9210   Max.   :2840   Max.   :2660  
##  NA's   :9

quantile(named.Trees$Strength)

##   0%  25%  50%  75% 100% 
##  790 1110 1390 1850 2660

Now we need to determine the mean, median, and quartile ranges to add value to the plots. The output of the quantile command can be indexed. In R, indexes start at 1, not 0.

grand.mean = mean(named.Trees$Strength)
median = median(named.Trees$Strength)
LQ = quantile(named.Trees$Strength)[2]
UQ = quantile(named.Trees$Strength)[4]

Data wrangling: isolate group a group by creating subset

Create subset for redwood and plot the strength against the entire dataset to see how redwood strength compares.

my.Redwoods <- subset(named.Trees, Tree.Group == "Redwood")
boxplot(my.Redwoods$Strength, named.Trees$Strength,
        names = c("Redwoods", "All trees"),
        xlab = "Tree type", ylab = "Strength [psi]",
        col = "darkgreen",
        main = "Comparison of Redwood vs Aggregate Tree Strength",
        las = 2)
abline(h = grand.mean, col = "red", lty = 2)
abline(h = LQ, col = "blue", lty = 4)
abline(h = UQ, col = "blue", lty = 4)

plot of chunk unnamed-chunk-5

Exploring the data: summary

As good practice, let’s also produce a summary

tree.Summary <- list(summary(my.Redwoods$Strength),
                   summary(my.Trees$Strength))
names(tree.Summary) <- c("Redwoods stength [psi]", "All trees strength [psi]")
tree.Summary

## $`Redwoods stength [psi]`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     940     982    1020    1020    1070    1110 
## 
## $`All trees strength [psi]`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     790    1130    1390    1470    1820    2660       5

Interepting the data: Redwood strength

As you can see, Redwoods are relatively soft, below both IQR, grand mean, and median strength.

Part 2: Comparing tree strength as a function of species

Perhaps there are other species of trees that are soft like redwood. Let’s see all the strength of all the trees by species.

q <- qplot(Tree.Group, Strength,
         data = named.Trees,
         geom = "boxplot",
         main = "Tree strength by species",
         xlab = "Tree Species", ylab = "Strength [psi]")
tree.average <- c(my.Trees$Strength)

q + geom_hline(yintercept=grand.mean,
               colour = "red",
               linetype = 2,
               show_guide = TRUE) + 
  geom_hline(yintercept=LQ,
               colour = "blue",
               linetype = 4,
               show_guide = TRUE) +
  geom_hline(yintercept=UQ,
               colour = "blue",
               linetype = 4,
               show_guide = TRUE) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

plot of chunk unnamed-chunk-7

Part 3: Using existing analysis as a baseline

But perhaps there are other mechanical benefits to Redwood compared to the data set. Let’s see how the other material properties compare to strength, which we know that Redwoods are lacking, compared to the dataset.

Data wrangling: subsetting to generate regressions

Let’s create a new dataframe with the material properties and do a regression. Regressions against other data, as displayed with the head() command, would be meaningless. The best way to do this is to create a new dataframe by combining the columns of interest from the original dataframe.

trees.Properties <- data.frame(my.Trees[c("SpecG","Modulus", "Break.Height", "Compression", "Strength")])

gpairs(trees.Properties)

plot of chunk unnamed-chunk-8

Exploring the data: linear modeling & coefficients of determination

We can observe general linear trends with the regressions, but let’s get more of a quantitative feel for the data.

SpecG.lm = lm(Strength ~ SpecG, data = trees.Properties)
Modulus.lm = lm(Strength ~ Modulus, data = trees.Properties)
Break.Height.lm = lm(Strength ~ Break.Height, data = trees.Properties)
Compression.lm = lm(Strength ~ Compression, data = trees.Properties)

Calling the summary function gives a detailed list of statitics… let’s take a look at the coefficients of determination by calling the r.squared attribute:

summary(SpecG.lm)$r.squared

## [1] 0.8174

summary(Modulus.lm)$r.squared

## [1] 0.3855

summary(Break.Height.lm)$r.squared

## [1] 0.6362

summary(Compression.lm)$r.squared

## [1] 0.5999

Interpreting the data: predicting other properties

As we can see, althought there is a general linear relationship, strength is not always the best predictor of other mechanical properties. To determine how redwoods or other species compare, a treatment similar to the strength analysis may be used.