Task

Collect and document several datasets. There are three main steps:

1: Select and Download data

First we need to find a dataset and download the raw data (in .txt or .csv).

  1. Go to http://archive.ics.uci.edu/ml/datasets.html
  2. Select a dataset by clicking on its name
  3. Record basic information about the dataset in the Google Sheet: Google Sheets link
  4. Download the main data files (e.g., .txt, .csv, .data) and save them to the txt_files folder in the Dropbox main folder.
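If you prefer to download from within R rather than through the browser, the last step can be sketched roughly as follows. This is only a sketch: fetch_raw() is a hypothetical helper (not part of this project), and the URL in the commented call is a placeholder that you would replace with the data file link from the dataset's page.

```r
# Hypothetical helper: download one raw data file into a folder.
# dest_name lets you save "abalone.data" as "abalone.txt", for example.
fetch_raw <- function(url, dest_dir, dest_name = basename(url)) {
  dir.create(dest_dir, recursive = TRUE, showWarnings = FALSE)
  dest <- file.path(dest_dir, dest_name)
  # mode = "wb" copies the file byte for byte on every platform
  download.file(url, destfile = dest, mode = "wb", quiet = TRUE)
  dest  # return the path of the saved file
}

# Example call (not run here, since it needs network access and a real URL):
# fetch_raw("<URL of the data file>",
#           "~/Dropbox/datarepository/txt_files",
#           dest_name = "abalone.txt")
```

The helper returns the path of the saved file, so you can pass it straight to read.table() in the next step.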

2: Load data and save as .RData file

  1. Open the data in a text editor to see how the columns are separated and whether there is a header row. Next, load the data into R with read.table(). Store the data as an object with an appropriate name (i.e., no spaces, all lower-case).
# Load the datafile into R as a new object.

# Give the data a short name (in this case, just abalone)
abalone <- read.table("~/Dropbox/datarepository/txt_files/abalone.txt",
                      sep = ",",        # How are columns separated? (usually ",")
                      header = FALSE,   # If there is a header row, use header = TRUE
                      stringsAsFactors = FALSE # Keeps text columns as character. Always include this!
                      )
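If you would rather not leave R to inspect the raw file, the separator and header check can also be done there. The sketch below writes three abalone rows to a temporary file as a stand-in for the downloaded file (raw_path is an assumption; point it at your own file), prints the raw lines, and counts the fields per line for a candidate separator.

```r
# Stand-in for the downloaded file: three abalone rows in a temporary file.
raw_path <- tempfile(fileext = ".txt")
writeLines(c("M,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15",
             "M,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7",
             "F,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9"),
           raw_path)

# Look at the first raw lines: which character separates the columns,
# and does the first line hold names rather than values?
first_lines <- readLines(raw_path, n = 3)
print(first_lines)

# Count the fields on each line for a candidate separator. The same count
# (greater than 1) on every line suggests the separator is right.
n_fields <- count.fields(raw_path, sep = ",")
print(n_fields)
```

If the counts differ from line to line, or are all 1, try another separator (e.g., sep = "\t" or sep = ";") before calling read.table().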

Once you’ve loaded the data, take a quick look at it with head() to make sure it loaded correctly. Here are the first few rows of abalone:

head(abalone)
##   V1    V2    V3    V4     V5     V6     V7    V8 V9
## 1  M 0.455 0.365 0.095 0.5140 0.2245 0.1010 0.150 15
## 2  M 0.350 0.265 0.090 0.2255 0.0995 0.0485 0.070  7
## 3  F 0.530 0.420 0.135 0.6770 0.2565 0.1415 0.210  9
## 4  M 0.440 0.365 0.125 0.5160 0.2155 0.1140 0.155 10
## 5  I 0.330 0.255 0.080 0.2050 0.0895 0.0395 0.055  7
## 6  I 0.425 0.300 0.095 0.3515 0.1410 0.0775 0.120  8
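Beyond head(), a few more checks can help confirm that the file loaded as expected. This is a minimal sketch on a four-row sample typed in by hand (abalone_sample is a stand-in; run the same calls on your own object). It checks the dimensions, the column types, and whether any values are missing.

```r
# Four rows of the abalone data, read from text instead of a file.
abalone_sample <- read.table(text = "M,0.455,0.365,0.095,0.5140,0.2245,0.1010,0.150,15
M,0.350,0.265,0.090,0.2255,0.0995,0.0485,0.070,7
F,0.530,0.420,0.135,0.6770,0.2565,0.1415,0.210,9
I,0.330,0.255,0.080,0.2050,0.0895,0.0395,0.055,7",
                             sep = ",", header = FALSE,
                             stringsAsFactors = FALSE)

# Rows and columns: a single column usually means the wrong separator.
print(dim(abalone_sample))

# Column types: text columns should be character, measurements numeric.
col_classes <- sapply(abalone_sample, class)
print(col_classes)

# Missing values per column.
n_missing <- colSums(is.na(abalone_sample))
print(n_missing)
```

A measurement column that shows up as character often means the file uses a special code for missing values, which you can pass to read.table() through its na.strings argument.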
  2. If the data do not have column names, you’ll need to add them. You may need to look at the documentation for the data in order to know what the columns are. For the Abalone dataset, I found the names here.

Use short column names (i.e., roughly 10 to 15 characters at most), all in lower-case and with no spaces.

# Add column names
#  The names should all be in lower-case and shouldn't be too long.

names(abalone) <- c("sex", "length", "diameter", "height", "whole_weight", 
                    "shucked_weight", "viscera_weight", "shell_weight", "rings")

Here’s how the data look now:

head(abalone)
##   sex length diameter height whole_weight shucked_weight viscera_weight
## 1   M  0.455    0.365  0.095       0.5140         0.2245         0.1010
## 2   M  0.350    0.265  0.090       0.2255         0.0995         0.0485
## 3   F  0.530    0.420  0.135       0.6770         0.2565         0.1415
## 4   M  0.440    0.365  0.125       0.5160         0.2155         0.1140
## 5   I  0.330    0.255  0.080       0.2050         0.0895         0.0395
## 6   I  0.425    0.300  0.095       0.3515         0.1410         0.0775
##   shell_weight rings
## 1        0.150    15
## 2        0.070     7
## 3        0.210     9
## 4        0.155    10
## 5        0.055     7
## 6        0.120     8
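The naming rules above (lower-case, no spaces, not too long) are easy to check in code. The sketch below uses a hypothetical helper, check_names(), which is not part of this project. The 15-character limit is an assumption taken from the upper end of the guideline.

```r
# Hypothetical helper: return the names that break at least one rule.
check_names <- function(x, max_chars = 15) {
  ok <- x == tolower(x) &            # all lower-case
    !grepl("[[:space:]]", x) &       # no spaces
    nchar(x) <= max_chars            # not too long
  x[!ok]
}

col_names <- c("sex", "length", "diameter", "height", "whole_weight",
               "shucked_weight", "viscera_weight", "shell_weight", "rings")

# An empty result means every name follows the rules.
bad_names <- check_names(col_names)
print(bad_names)

# A name with capitals and a space is reported.
print(check_names(c("Whole Weight", "sex")))
```

On your own data you would call check_names(names(abalone)) after assigning the column names.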
  3. Save the data as a .RData file in the RData_files folder with save():
save(abalone, file = "~/Dropbox/datarepository/RData_files/abalone.RData")
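It can be worth confirming that the saved file restores the same object. The sketch below saves a small stand-in data frame to a temporary folder (the two-row abalone object and rdata_path are assumptions for illustration), loads it into a separate environment so nothing in your workspace is overwritten, and compares the result with the original.

```r
# Small stand-in for the real abalone data frame.
abalone <- data.frame(sex = c("M", "F"), rings = c(15L, 9L),
                      stringsAsFactors = FALSE)

# Save to a temporary location instead of the Dropbox folder.
rdata_path <- file.path(tempdir(), "abalone.RData")
save(abalone, file = rdata_path)

# Load into a fresh environment. load() returns the names of the
# objects it restored.
check_env <- new.env()
loaded_names <- load(rdata_path, envir = check_env)
print(loaded_names)

# TRUE if the restored object is identical to the one that was saved.
round_trip_ok <- identical(check_env$abalone, abalone)
print(round_trip_ok)
```

Note that load() restores the object under the name it had when saved, which is why the object name and the file name should match.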

3: Document the data with a NAME_doc.R file

Now you need to document the dataset in R by creating a new file called NAME_doc.R.

  • Open the template_doc.R file (located in the documentation_files folder) in RStudio.
  • Save the file under a new name, NAME_doc.R (where NAME is the name of the dataset), in the documentation_files folder (e.g., abalone_doc.R).
  • Fill in the template (e.g., dataset name, description, column names). You should be able to get the column names from the main data webpage. You can see a completed version in abalone_doc.
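The first two bullets can also be done from the R console instead of through RStudio's menus. This is a sketch under assumptions: it builds a throwaway documentation_files folder in a temporary directory with a one-line placeholder template, because the real template_doc.R is not reproduced here.

```r
# Throwaway stand-in for the documentation_files folder and its template.
doc_dir <- file.path(tempdir(), "documentation_files")
dir.create(doc_dir, showWarnings = FALSE)
template_path <- file.path(doc_dir, "template_doc.R")
writeLines("#' NAME dataset", template_path)  # placeholder, not the real template

# Build the new file name from the dataset name and copy the template.
dataset_name <- "abalone"
doc_path <- file.path(doc_dir, paste0(dataset_name, "_doc.R"))

# overwrite = FALSE protects a documentation file you have already filled in.
copied <- file.copy(template_path, doc_path, overwrite = FALSE)
print(basename(doc_path))
```

file.copy() returns FALSE without touching the target if the file already exists, so rerunning the code will not wipe out your edits.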

Here’s how a completed version should look:

#' abalone dataset
#'
#' Predict the age of abalone from physical measurements
#'
#' @format A data frame containing 4177 rows and 9 columns
#' \describe{
#'   \item{sex}{either M, F, or I (infant)}
#'   \item{length}{Longest shell measurement}
#'   \item{diameter}{perpendicular to length}
#'   \item{height}{with meat in shell}
#'   \item{whole_weight}{whole abalone}
#'   \item{shucked_weight}{weight of meat}
#'   \item{viscera_weight}{gut weight (after bleeding)}
#'   \item{shell_weight}{after being dried}
#'   \item{rings}{+1.5 gives the age in years}
#'   ...
#' }
#' @source http://archive.ics.uci.edu/ml/datasets/Abalone
#' @export
#'
"abalone"