Introduction and Background

This report models how to read publicly available data from .txt files into R for analysis. The intended audience is a secondary-level science research class that may use R to conduct research and prepare reports for presentation.

The data sets herein were taken from the National Oceanic and Atmospheric Administration’s (NOAA) buoy array system, specifically buoy station 44017, Montauk Point, NY.

1. Importing and “Cleaning” Data Sets

The data of interest includes both average annual wave height and air temperature from the years 2003-2006 and will be presented as boxplots for ease of comparison.

The first objective is reading the data, which is available as a .txt file download from the NOAA site and can then be uploaded to your RStudio session.

This is done by calling the read.delim function on the selected file and supplying a sep argument, which tells R how to separate the data into columns.

Note that in this specific data set, columns are separated by spaces in the .txt file, so the appropriate “sep=” value, an empty string, is applied. With an empty string, R treats any run of whitespace as a column separator. This may require adjustment if your columns are separated by other characters, such as “|”.

ds03 <- read.delim("03_Data.txt", sep="") #2003 Data
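As a brief illustration of adjusting the separator, the sketch below reads a few lines of pipe-delimited text. The values are made up and are supplied inline through the text argument of read.delim, so the example runs without any file; the names pipe_text and pipe_demo are used only for illustration.

```r
# Made-up, pipe-delimited sample supplied inline instead of a file.
pipe_text <- "YYYY|MM|WVHT
2003|01|1.21
2003|01|1.33"

# sep = "|" tells read.delim to split columns on the pipe character.
pipe_demo <- read.delim(text = pipe_text, sep = "|")

# Inspect the column names and types that were detected.
str(pipe_demo)
```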

Certain values in the data set are placeholders written by the automatic recording software, namely a “99.00” code for unrecorded values. These are converted to NA by supplying an na.strings argument to read.delim that lists the codes to treat as missing.

ds03 <- read.delim("03_Data.txt", sep="", na.strings="99.00") #2003 Data
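To see the effect of na.strings without downloading anything, the following sketch reads a few made-up rows twice, once without and once with the placeholder code declared. The sample values and the names sample_text, raw_demo and clean_demo are assumptions for illustration, not part of the NOAA file.

```r
# Made-up sample in the same space-separated layout; 99.00 marks a missing reading.
sample_text <- "YYYY MM DD hh WVHT
2003 01 01 00 1.21
2003 01 01 01 99.00
2003 01 01 02 1.49"

raw_demo   <- read.delim(text = sample_text, sep = "")
clean_demo <- read.delim(text = sample_text, sep = "", na.strings = "99.00")

raw_demo$WVHT     # the placeholder is still present as an ordinary number
clean_demo$WVHT   # the placeholder has been replaced by NA

# The mean changes noticeably once the placeholder is treated as missing.
mean(raw_demo$WVHT)
mean(clean_demo$WVHT, na.rm = TRUE)
```

Leaving the placeholder in place would pull summary statistics upward, which is why it is handled at the reading step.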

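Lastly, we select a specific column for analysis. In this case wave height is isolated with the $ operator using its column heading, “WVHT.” This column is then stored in a new object, “O3_hdata,” for ease of interpretation. (The name begins with the letter O rather than a zero, because an object name in R cannot start with a digit.)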

ds03 <- read.delim("03_Data.txt", sep="", na.strings="99.00") #2003 Data

O3_hdata <- ds03$WVHT

head(O3_hdata) #head(...) shows a small sample (the first 6 values) of the set.
## [1] 1.21 1.33 1.49 1.57 1.59 1.45

This code is applied to the .txt files for all of the years within the target data set, isolating the wave height data from 2003-2006.
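Repeating the same lines for every year can also be wrapped in a small helper. The sketch below is one possible way to do that, not the approach used in this report: read_buoy_column is a hypothetical function, the file name 04_Data.txt is an assumed continuation of the naming pattern, and the files are tiny made-up stand-ins written to a temporary folder so the code runs anywhere.

```r
# Hypothetical helper: read one column from a space-separated buoy file,
# treating the given code as missing.
read_buoy_column <- function(path, column, missing_code) {
  ds <- read.delim(path, sep = "", na.strings = missing_code)
  ds[[column]]
}

# Stand-in files with made-up values, written to a temporary folder.
demo_dir <- tempfile("buoy_demo")
dir.create(demo_dir)
writeLines(c("YYYY MM DD hh WVHT ATMP",
             "2003 01 01 00 1.21 10.7",
             "2003 01 01 01 99.00 10.8"),
           file.path(demo_dir, "03_Data.txt"))
writeLines(c("YYYY MM DD hh WVHT ATMP",
             "2004 01 01 00 1.27 999.0",
             "2004 01 01 01 1.54 9.9"),
           file.path(demo_dir, "04_Data.txt"))

# One named entry per year; the names carry over to the result.
files <- file.path(demo_dir, c("03_Data.txt", "04_Data.txt"))
names(files) <- c("y2003", "y2004")

# Apply the helper to every file, collecting the wave height column of each.
hdata <- lapply(files, read_buoy_column, column = "WVHT", missing_code = "99.00")
hdata$y2003
hdata$y2004
```

With real downloads, the temporary files would be replaced by the actual paths, and the same call with column = "ATMP" and missing_code = "999.0" would collect the temperature data.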

A similar second set of code is run in order to isolate the air temperature data from this range, taken from the same .txt files.

Note that the only functional differences in the code are the air temperature column heading, “ATMP”; the unrecorded value code, “999.0”; and the assignment of an appropriate object name, “O3_tdata.”

ds03 <- read.delim("03_Data.txt", sep ="", na.strings = "999.0") #2003 Data

O3_tdata <- ds03$ATMP

head(O3_tdata)
## [1] 10.7 10.8 10.8 10.7 10.9 10.0
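As a variation on the two separate reads, na.strings also accepts several codes at once, so both columns can be taken from a single read. The sketch below uses made-up inline rows and illustrative names (both_demo, h_demo, t_demo). It assumes that neither code ever appears as a genuine reading in a column of interest, because every listed code is treated as missing in every column.

```r
# Made-up sample with both placeholder codes present.
sample_text <- "YYYY MM DD hh WVHT ATMP
2003 01 01 00 1.21 10.7
2003 01 01 01 99.00 10.8
2003 01 01 02 1.49 999.0"

# A character vector of codes: each one becomes NA wherever it appears.
both_demo <- read.delim(text = sample_text, sep = "",
                        na.strings = c("99.00", "999.0"))

h_demo <- both_demo$WVHT   # wave height, with 99.00 replaced by NA
t_demo <- both_demo$ATMP   # air temperature, with 999.0 replaced by NA

h_demo
t_demo
```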

2. Data Frames

Once all of the data sets of interest have been isolated, they must be prepared for plotting.

In order for multiple boxes to be included on the same plot, a data frame must be prepared. This is most often achieved using the “data.frame” function. An example:

df <- data.frame(A = c(1,2,3),
                 B = c(4,5,6),
                 C = c(7,8,9))

df
##   A B C
## 1 1 4 7
## 2 2 5 8
## 3 3 6 9

For a successful data.frame we must ensure all columns are the same length. Notice that sets A, B and C in our example each contain 3 values. If the lengths differ, we may receive the following error:

“arguments imply differing number of rows”
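The error can be reproduced safely with a deliberately mismatched example. In the sketch below, tryCatch captures the message instead of stopping the script; the object name msg is only for illustration.

```r
# Column A has 3 values and column B has 2, so data.frame cannot line them up.
msg <- tryCatch(
  data.frame(A = c(1, 2, 3), B = c(4, 5)),
  error = function(e) conditionMessage(e)  # return the error text as a string
)

msg
```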

Sometimes our raw data will not be so aligned. If we want to include objects in our data.frame, such as our “O3_tdata,” we must make sure it is the same length as all other objects in the frame.

One way to achieve this is to assign one set as the standard, and align all other sets to be the same length. Here we will assign the 2004 wave height data to be the same length as the 2003 set.

length(O4_hdata) <- length(O3_hdata)

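This is then done for all data sets of the same type: all wave height sets are aligned to the 2003 height set, and all temperature sets are aligned to the 2003 temperature set. Note that setting the length pads a shorter set with NA values and truncates a longer one, so any values beyond the length of the 2003 set are dropped.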
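It can help to see what assigning a length actually does to a vector. The short sketch below uses made-up vectors (standard, shorter, longer) rather than the buoy data.

```r
standard <- c(1.21, 1.33, 1.49)         # length 3, used as the reference
shorter  <- c(1.27, 1.54)               # length 2
longer   <- c(0.92, 1.02, 1.19, 1.29)   # length 4

length(shorter) <- length(standard)   # grows to 3; the new slot is NA
length(longer)  <- length(standard)   # shrinks to 3; the last value is dropped

shorter
longer
```

Because boxplot ignores NA values, the padding does not distort the plot, but truncation does discard real observations.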

Once lengths have been aligned, we can build our data frame.

When we insert objects, such as “O3_hdata,” we assign them headings in the frame (A, B, C from earlier). These headings will later appear on our plot, so we must plan ahead.

We will name each year “y200X,” as a column name should not begin with a digit. Each heading and its assigned object must be separated from the next pair by a comma (,).

Here’s our wave height data frame (dfh):

dfh <- data.frame(y2003 = O3_hdata,
                  y2004 = O4_hdata,
                  y2005 = O5_hdata,
                  y2006 = O6_hdata)

head(dfh) #This function will produce a small sample of our frame.
##   y2003 y2004 y2005 y2006
## 1  1.21  1.27  0.92  0.95
## 2  1.33  1.54  1.02  0.99
## 3  1.49  1.87  1.19  0.88
## 4  1.57  2.42  1.29  0.87
## 5  1.59  2.55  1.39  0.94
## 6  1.45  3.01  1.52  0.77

We will do the same for our temperature data, being sure to name it appropriately (dft) and include the appropriate objects (“O3_tdata”, etc).

dft <- data.frame(y2003 = O3_tdata,
                  y2004 = O4_tdata,
                  y2005 = O5_tdata,
                  y2006 = O6_tdata)
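An alternative to aligning every set to the 2003 length is to pad every set to the longest one, so that no observations are cut off. The sketch below is a variation on the approach above, not the method used in this report; it uses made-up vectors, and the names series, padded and dfh_demo are illustrative.

```r
# Made-up yearly series of unequal length.
series <- list(y2003 = c(1.21, 1.33, 1.49),
               y2004 = c(1.27, 1.54),
               y2005 = c(0.92, 1.02, 1.19, 1.29))

# Length of the longest series.
n <- max(lengths(series))

# Pad each series with NA up to that length.
padded <- lapply(series, function(x) {
  length(x) <- n
  x
})

# All columns now match, so the frame can be built.
dfh_demo <- as.data.frame(padded)
dfh_demo

# Count how many NA values were added to each column.
colSums(is.na(dfh_demo))
```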

3. Boxplots

Your data frames can be represented in a number of ways. For this particular data set we will use box plots. They include several features that can be useful in comparing data.

The basic box plot function, “boxplot(…)”, may be used:

boxplot(dfh)

The plot includes the basic features (median, quartile ranges, minimum/maximum) but can be further refined.

Several additions to the boxplot function will allow us to complete the following:

  1. Add a chart title: main = “Average Annual Wave Height”

  2. Label the x-axis: xlab = “Year”

  3. Label the y-axis: ylab = “Wave Height (m)”

  4. Adjust the y-axis range to better frame our data, in our case from 0-4: ylim = c(0,4)

  5. Hide outliers, since the large quantity in our data set is distracting and not necessary for the comparisons we wish to make: outline = FALSE

Let’s put it all together:

boxplot(dfh, main="Average Annual Wave Height", 
        xlab= "Year", ylab = "Wave Height (m)" , ylim=c(0,4), outline=FALSE)

We can see now that there is not a great difference in average wave height year-to-year at this site during 2003-2006.
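A visual comparison can be backed up with a few numbers. The sketch below computes the median (the center line of each box) and the mean for every column. To stay self-contained it uses only the six rows printed by head(dfh) earlier, stored under the illustrative name dfh_head, so its results describe that small sample and not the full data set.

```r
# The six rows shown earlier by head(dfh); not the full data set.
dfh_head <- data.frame(y2003 = c(1.21, 1.33, 1.49, 1.57, 1.59, 1.45),
                       y2004 = c(1.27, 1.54, 1.87, 2.42, 2.55, 3.01),
                       y2005 = c(0.92, 1.02, 1.19, 1.29, 1.39, 1.52),
                       y2006 = c(0.95, 0.99, 0.88, 0.87, 0.94, 0.77))

# sapply applies the function to each column; na.rm = TRUE skips NA padding.
medians <- sapply(dfh_head, median, na.rm = TRUE)
means   <- sapply(dfh_head, mean, na.rm = TRUE)

# Show both summaries side by side, rounded for readability.
round(rbind(median = medians, mean = means), 2)
```

Running the same two sapply calls on the full dfh would give the values that correspond to the plot.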

4. Adjusting for Temperature Plot(s)

For the temperature data we can run very similar code. Our titles and labels must be adjusted, as well as our y-axis range.

Most notable, however, is including the appropriate unit, degrees Celsius. This requires a special character, the degree sign. In R we can use a Unicode escape to produce it, specifically “\u00B0” followed by “C”. Note that a backslash, “\”, must be included in front of the u in order for the Unicode character to appear (see code below).

boxplot(dft, main="Average Annual Air Temperature",
        xlab= "Year", ylab = "Air Temperature (\u00B0C)" , ylim=c(0,30), outline=FALSE)
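The escape can be checked on its own before it is placed in a plot label. The sketch below builds the label text and confirms that the escape stands for a single character with Unicode code point B0 (176 in decimal); the name temp_label is only for illustration.

```r
# Build the axis label; \u00B0 is the Unicode escape for the degree sign.
temp_label <- "Air Temperature (\u00B0C)"
print(temp_label)

# The six typed characters of the escape become one character in the string.
nchar("\u00B0")

# Its code point is hexadecimal B0, which is 176 in decimal.
utf8ToInt("\u00B0")
```

If the degree sign does not display correctly on a given system, the font or the encoding of the output device is the usual place to look.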

There appears to be a difference in the average air temperature during the years represented, with 2005 & 2006 noticeably higher. This aligns with NOAA’s information on global temperature year-to-year, having those 2 years ranked in the top 10 hottest years on record.

Conclusion

R is a great tool for organizing and analyzing data collected from scientific experiments. There is great control in how we can manipulate and present the data we collect by using R. I hope this document has been helpful in clarifying how to isolate data, construct data frames and produce boxplots to compare these sets.