Module 2: Basic Functions

By the end of this lesson, you should be able to:

Load Project Mosaic
Differentiate between quantitative and categorical variables
Use basic statistical functions on variables of datasets
- Mean
- Standard Deviation
- Median
- Interquartile Range

Project Mosaic

Project Mosaic provides mosaic, an R package that contains all of the statistical functions you will use in this course. To access these functions, you must load the package. In RStudio, you can check whether it is loaded by selecting 'Packages' in the Files/Plots/Packages pane and making sure the box next to mosaic is checked.

We can also load Project Mosaic with the following command:

library(mosaic)
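
If you prefer to check from the console rather than the Packages pane, the following is a minimal sketch of one way to do it. The helper name is_attached is made up for this illustration and is not part of mosaic; it only looks for the package on R's search path.

```r
# Report whether a package is currently attached to the search path.
# `is_attached` is a hypothetical helper name used only for this sketch.
is_attached <- function(pkg) {
  paste0("package:", pkg) %in% search()
}

is_attached("stats")   # TRUE in a default R session, where stats is attached at startup
is_attached("mosaic")  # TRUE only after library(mosaic) has been run
```

A package that is installed but not yet loaded will return FALSE here until you run library() on it.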

We will use the root data in this module, which is loaded for you below. It is called 'rootDataUpdated' because one row has been deleted from the original dataset. Don't worry about what the code is doing for now, but if you'd like to follow along, copy and paste the following three lines of code into your console. The fetchGoogle() function needs the RCurl package, so install RCurl first if R reports it as missing.

rootData = fetchGoogle("https://docs.google.com/spreadsheet/pub?key=0AnFamthOzwySdEFXbUNtNkp5VWhieUo5N3VSMVhPemc&output=csv")

rootData$faa = as.factor(rootData$faa)

rootDataUpdated = rootData[-4,]

Types of Variables

Reminder! Variables are attributes that describe cases. This section looks more closely at the two types of variables you will work with.

Quantitative variables hold numeric values that can be measured and averaged, while categorical variables hold discrete values that each name one group from a fixed set. In the root dataset, we will work with root_length, which is a quantitative variable (think of it as having continuous values), and faa, which is a categorical variable.
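
To see the difference in R, here is a small sketch on made-up data. The data frame toy and its values are invented for illustration and are not taken from the root dataset; only the column names mirror the ones used in this module.

```r
# Hypothetical toy data; these values are NOT from the root dataset.
toy <- data.frame(
  root_length = c(0.0, 3.2, 7.5, 12.1),  # measured values (quantitative)
  faa         = c("0", "1", "1", "2")    # treatment labels (categorical)
)

# Tell R to treat the treatment labels as categories (a factor).
toy$faa <- as.factor(toy$faa)

is.numeric(toy$root_length)  # TRUE: R stores this column as numbers
is.factor(toy$faa)           # TRUE: R stores this column as a factor
str(toy)                     # prints the type of every column at once
```

R calls a categorical variable a factor, which is why the loading code above uses as.factor() on the faa column.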

A. Functions on Categorical Variables

Before reading on, keep in mind that the syntax dataset$column_name is a way of accessing individual columns of a dataset.

Categorical variables hold discrete values, and these values are referred to as levels. Levels are easy to confuse with variables, so keep the distinction in mind: the 'faa' variable in the root dataset has eight levels, one for each treatment of FAA, and these levels are different options for the same attribute. In R, we can type the following command to list the levels of a categorical variable:

levels(rootData$faa)

Now that we know faa has eight levels, suppose we want to know the number of cases for each level. We can use the table() function to determine these values.

table(rootData$faa)
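
Here is a sketch of what these two functions return, using a made-up categorical variable. The data frame toy and its level names are invented for illustration; the real faa column has different levels and counts.

```r
# Hypothetical toy treatments; NOT the real faa column.
toy <- data.frame(faa = factor(c("low", "high", "low", "none", "low")))

# levels() lists the distinct categories (sorted alphabetically by default).
levels(toy$faa)
# [1] "high" "low"  "none"

# table() counts how many cases fall under each level.
counts <- table(toy$faa)
counts
# high  low none
#    1    3    1
```

Notice that levels() answers "which options exist?" while table() answers "how many cases chose each option?".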

B. Functions on Quantitative Variables

The following functions will be used throughout this course on quantitative variables:

The mean() function returns the arithmetic average of a variable in the dataset. The following command returns a mean of 6.12 mm.

mean(rootDataUpdated$root_length)
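
To see what mean() is doing, here is a small sketch that computes the average by hand and compares it with the built-in function. The vector x holds made-up lengths, not values from the root dataset.

```r
# Hypothetical lengths in mm; NOT from the root dataset.
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# The arithmetic average is the sum of the values divided by how many there are.
by_hand <- sum(x) / length(x)
by_hand   # 5

mean(x)   # 5, the same answer
```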

The sd() function computes the standard deviation, which is a value that describes how much variation there is in the data. Each reported mean should be accompanied by a standard deviation to give an idea of the variable's spread. A standard deviation measures distance from the mean, so it is expressed in the same units as the variable itself. The following command gives the standard deviation of the root_length variable as 6.81 mm. To interpret this, add and subtract this value from the mean you found previously. If the data are roughly bell-shaped, we can say that about 68% of our cases have root lengths between -0.69 mm and 12.92 mm (don't worry about the negative value for now).

sd(rootDataUpdated$root_length)
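
The sketch below shows the calculation behind sd() on a made-up vector, and then repeats the "mean plus or minus one standard deviation" arithmetic using the rounded values quoted in this section. The vector x is invented for illustration; the variable names m and s are just labels for the reported mean and standard deviation.

```r
# Hypothetical lengths in mm; NOT from the root dataset.
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Sample standard deviation: square root of the summed squared
# distances from the mean, divided by (number of values - 1).
sd_by_hand <- sqrt(sum((x - mean(x))^2) / (length(x) - 1))
sd_by_hand  # about 2.138
sd(x)       # same value

# Interval of one standard deviation around the mean, using the
# rounded mean (6.12) and standard deviation (6.81) quoted in the text.
m <- 6.12
s <- 6.81
interval <- c(lower = m - s, upper = m + s)
interval    # -0.69 and 12.93
```

Because the inputs here are rounded to two decimals, the last digit of the interval can differ slightly from one computed with the unrounded values in R.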

Another measure of the spread of the data is the interquartile range, which is used in certain types of graphs, such as boxplots. If we rank all the root length values from lowest to highest, we can then separate the values into quartiles. Using the quantile() function:

quantile(rootDataUpdated$root_length)

we can see that the lowest 25% of our cases have a length of 0 mm, and the highest 25% of our cases fall above 9.96 mm. The region between these two cutoffs, which holds the middle 50% of the cases, is called the interquartile range. According to Wikipedia, “the interquartile range (IQR), also called the midspread or middle fifty, is a measure of statistical dispersion, being equal to the difference between the upper and lower quartiles: IQR = Q3 − Q1. In other words, the IQR is the 1st quartile subtracted from the 3rd quartile”.

We use the middle section of our data because it is less affected by outliers, which are observations that lie unusually far from the rest of the data. To find the size of the interquartile range, we use the iqr() function:

iqr(rootDataUpdated$root_length)
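
The following sketch ties the quartiles and the interquartile range together on a made-up vector. The values in x are invented so that, like the root data, a quarter of them are 0. The sketch uses base R's IQR() so that it runs without loading mosaic; that substitution is a choice made for this example only.

```r
# Hypothetical lengths in mm; NOT from the root dataset.
x <- c(0, 0, 0, 2, 5, 7, 10, 11, 14)

# quantile() returns the minimum, the three quartile cutoffs, and the maximum.
q <- quantile(x)
q
#   0%  25%  50%  75% 100%
#    0    0    5   10   14

# The interquartile range is the 3rd quartile minus the 1st quartile.
iqr_by_hand <- unname(q["75%"] - q["25%"])
iqr_by_hand  # 10

IQR(x)       # 10, the same answer from base R's built-in function
```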

The median() function generates the middle value of a variable. Half of the values are above the median and half of the values are below the median.

median(rootDataUpdated$root_length)
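
As a quick illustration on made-up numbers, the sketch below shows how the middle value is chosen when there is an odd or an even number of cases. The vectors odd and even are invented for this example.

```r
# Hypothetical values; NOT from the root dataset.
odd  <- c(1, 3, 5)      # three values: the middle one is the median
even <- c(1, 3, 5, 7)   # four values: the median is the average of the two middle ones

median(odd)   # 3
median(even)  # 4
```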

Now it's your turn to try

The betagal.csv data is loaded into R for the exercises below. The entire dataset is also printed for you to give you a sense of the variables.

betagal = fetchGoogle("https://docs.google.com/spreadsheet/pub?key=0AnFamthOzwySdFlOcmt5bzY4VlFKRmtDdFJRMldTeEE&output=csv")

What are the levels of the inducer variable, and how many cases fall under each level?
What is the average IU value?
What is the 75% cutoff (the third quartile) for the data? Use the quantile() function to calculate this value.
What is the size of the interquartile range?