Module 1: Introduction to R, RStudio, and Datasets

By the end of this lesson, you should be able to:

R As a Calculator

You can use R as a calculator. Try typing the following command into the console:

3+5 
## [1] 8

Note that '## [1] 8' is the output that the computer prints in response to the command '3+5'.

There are several other commands you can use in R. Try these in the console.

3^2 
## [1] 9
9^(.5)
## [1] 3

Note that the “^” sign is the exponent operator: it raises the number on its left to the power on its right. We have used it to find both the square of 3 and the square root of 9.

3*5
## [1] 15
3/5
## [1] 0.6
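One detail worth noticing in 9^(.5) is the parentheses. The short sketch below (the numbers and the names withoutParens and withParens are chosen only for illustration) shows that R applies ^ before /, so leaving the parentheses out changes the answer:

```r
# Exponentiation is done before division, so this is (9^1)/2
withoutParens = 9^1/2
withoutParens   # 4.5

# Parentheses force the division to happen first, so this is 9^(0.5)
withParens = 9^(1/2)
withParens      # 3
```

When in doubt, adding parentheses makes the intended order explicit.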

Functions in R

A function is a set of instructions that performs a specific task. In R, a function is written as a name followed by a set of parentheses. The values written inside the parentheses are called arguments. At the beginning of this module, you practiced basic arithmetic in R. There are built-in functions that perform these tasks as well! For example:

sum(3,5)
## [1] 8

In this example, sum() is a function that has numbers as arguments, and takes the sum of those numbers. When there is more than one argument, the arguments are separated by commas. Another function is the sqrt() function, which takes the square root of a number. Therefore, sqrt() can only take in one argument, because it only performs an operation on one number. For example:

sqrt(9) 
## [1] 3

This gives the same answer, 3, as the command 9^(.5) from earlier.

However, if we give sqrt() more than one argument, we get an error, which is a message telling us that R was unable to perform the operation.

sqrt(9,16) #This will generate an error
## Error: 2 arguments passed to 'sqrt' which requires 1
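As a small sketch of how arguments work (the names total and root are made up for this illustration), sum() accepts several arguments separated by commas, and the result of one function can itself be used as the argument of another:

```r
# sum() accepts any number of arguments, separated by commas
total = sum(9, 16)
total   # 25

# The result of sum() is a single number, so it can be passed to sqrt()
root = sqrt(sum(9, 16))
root    # 5
```

Here sqrt() still receives only one argument, because sum(9, 16) is evaluated first and becomes the single number 25.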

Cases, Variables, and Canonical Data Form

A case is a single trial in an experiment. In general, the more cases we have, the more reliable our results will be. For example, suppose we want to determine the average length a plant root will grow. If we use 50 roots to determine this value, then each of those roots is an individual case.

A variable is a specific aspect of a case. Cases can have only one variable, or they can have several variables. In the root length example, the length that each root grows is a variable, because it is an attribute that applies to each case. Additionally, we can treat roots with different concentrations of a chemical (called FAA) to see if FAA affects root growth. In this scenario, the amount of FAA used on each plant is another variable.

How do we organize this data?

We organize data in canonical data form. There are two important features of canonical data form:

  1. Each row is a case.
  2. Each column is a variable.

To demonstrate, let's take a look at the root length data. Here is an image of part of the experiment, along with the data in canonical data form: [INSERT PICTURE]

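To give a rough idea of what data in canonical data form looks like in R, here is a minimal sketch. The name exampleRoots, the variable names faa and length, and every value are hypothetical and made up for illustration; they are not the real root length data.

```r
# Hypothetical values for illustration only; these are NOT the real experiment data.
# Each row is one root (a case); each column is one variable.
exampleRoots = data.frame(
  faa = c(0, 0, 1, 1, 2, 2),                 # amount of FAA applied (made-up units)
  length = c(5.1, 4.8, 4.2, 4.0, 3.1, 2.9)   # root length (made-up values)
)

# Print the data: six rows (cases) and two columns (variables)
exampleRoots
```

Reading across a row describes one root; reading down a column shows one variable for every root.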

Reading Data from Google Spreadsheets to RStudio

Before you export your data from Google Spreadsheets to R, take a minute to make sure that you have done the following:

  1. Give every variable a one-word name. If a name needs more than one word, use a period in place of the space. Do not use spaces in your variable names!
  2. R is case-sensitive, so it is easiest if everything is lower case.
  3. The file that you want to upload should contain only your data, in canonical data form.
  4. If you have other information that you want to keep, save it in a separate document, called a codebook.
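The naming advice above can also be illustrated inside R. The sketch below goes slightly beyond this lesson: it uses the base R functions make.names() and tolower(), and the names messyNames and cleanNames are hypothetical. It is only meant to show what a tidy variable name looks like, not to replace fixing the names in your spreadsheet.

```r
# Hypothetical column headings that break the naming advice (spaces, capitals)
messyNames = c("Root Length", "FAA amount")

# make.names() replaces each space with a period;
# tolower() converts every letter to lower case
cleanNames = tolower(make.names(messyNames))
cleanNames   # "root.length" "faa.amount"
```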

If you have formatted your data correctly, it can be saved in .csv (comma-separated values) form, which is a format that R can read. So, download your properly formatted spreadsheet as a .csv file.

Next, open RStudio in your browser. In the Files/Plots/Packages/Help pane, upload your data by selecting Files > Upload > Choose File > your_spreadsheet_name.csv. Then, select Open, and you have now uploaded your data to RStudio! You will find the name of your file in your list of Files.

To access this data, you will first have to give it a one-word name (let's call this exampleData). In the R console, type: exampleData = read.csv("your_spreadsheet_name.csv"), using straight quotation marks around the name of your file. read.csv() is a function that takes a file name as an argument and reads the file into R as a data frame, which is stored under the name on the left-hand side of the equals sign. Now, you can use functions on your data by referring to it as exampleData!
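The following is a minimal sketch of read.csv() in action. So that it runs on its own, it first writes a tiny .csv file to a temporary location; the file contents, the variable names faa and length, and the name csvFile are all hypothetical. With your own data, you would skip that step and give read.csv() the name of the file you uploaded.

```r
# Create a small .csv file so that this example runs on its own.
# (Made-up contents; your own file will look different.)
csvFile = tempfile(fileext = ".csv")
writeLines(c("faa,length", "0,5.1", "1,4.2", "2,3.1"), csvFile)

# Read the file into R and store it under the name exampleData
exampleData = read.csv(csvFile)

# Print the data to check that it was read in
exampleData
```

The first line of the file supplies the variable names, and each later line becomes one case.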

Basic Functions on Datasets

Once you have read your data into R, there are a few functions you can use to check that you have done so correctly. All of the following functions take one argument: the name you have assigned to your data. These functions are demonstrated below:

The dim() function returns two numbers: first the number of cases (which is the number of rows), and then the number of variables (which is the number of columns).

dim(rootData)

The names() function returns the names of the variables in the dataset.

names(rootData)

The nrow() function returns the number of cases in the dataset, which corresponds to the number of rows.

nrow(rootData)

The ncol() function returns the number of variables in the dataset, which corresponds to the number of columns.

ncol(rootData)

The head() function returns the first six cases in the dataset.

head(rootData)

The summary() function returns a summary of each variable in the dataset. For numeric variables, this includes the minimum, quartiles, median, mean, and maximum.

summary(rootData)
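To show what these functions return, here is a minimal sketch that uses a small stand-in dataset. The name exampleRoots, the variable names faa and length, and all of the values are hypothetical and made up for illustration; they are not the real rootData.

```r
# Hypothetical data in canonical data form (made-up values, not the real rootData)
exampleRoots = data.frame(
  faa = c(0, 0, 1, 1, 2, 2, 3, 3),
  length = c(5.1, 4.8, 4.2, 4.0, 3.1, 2.9, 2.2, 2.0)
)

dim(exampleRoots)       # 8 cases (rows) and 2 variables (columns)
names(exampleRoots)     # "faa" and "length"
nrow(exampleRoots)      # 8
ncol(exampleRoots)      # 2
head(exampleRoots)      # only the first six of the eight cases are shown
summary(exampleRoots)   # one summary for each variable
```

If dim() or names() gives something you did not expect, it usually helps to go back and check the formatting of the spreadsheet before reading it in again.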

Student assignment to submit (include your R code)

Download the First_Day file from Moodle and open it in Excel. Put the data in long format, and save it as a .csv file. Then, upload the file to RStudio. Read the data into R, giving it an appropriate name, and then use the dim() and names() functions to check that it was read in correctly. The correct answers are at the bottom of this document.

  1. How many cases and variables are there in your dataset?
# Insert code here!
  2. List the names of your variables below.
# Insert code here!

Answers:

  1. 12 cases, 2 variables.

  2. Answers may vary. The output should match the variable names from your spreadsheet.