Lab 1: Getting Started with R

POLS 3316, FALL 2023, University of Houston, Instructor: Tom Hanna

We’re going to start by loading some data!

This data is actually built into R as a dataset called “cars,” but we want to practice loading it from a file. (Later, we’ll practice loading the built-in “iris” dataset directly.) While loading and examining this simple data set, we’ll look at some basic R concepts like objects, data types, variables, observations, commands, and functions. We will also briefly discuss good workflow, which is often considered an advanced topic but actually makes life simpler if you observe it from the start. Finally, we’ll look at some summary statistics and graphical representations or plots.

Loading data: three methods

We’ll look at three methods.

            + Bad method: Click on the file
            + Okay method: *base R* method, which you can use with just the minimum R installation.
            + Better method: the here package inside an R Studio Project
            

Bad method: Point and click

  • Why is it bad?

              + Temporary
              + Not saved to script, etc.
              + Code won't run properly
              + Bad habit that will get you in trouble
  • What is it useful for:

              + running a quick one time check
              + getting the code to put in your scripts
  • Main Point: Don’t rely on this

base R Method

  • Clear the Global Environment
  • Set a working directory

What is a directory?

  • You may have heard directories referred to as “folders”
  • Specific section of your computer’s disk drive
  • disk drive - The permanent storage that remains even if you turn the computer off

R Code Chunk

  • This is our first “R Code Chunk”
  • This chunk clears the Global Environment and sets the Working Directory
#This is a comment
#Comments can go on their own line, or after a line of actual code

rm(list = ls()) #this clears the Global Environment

setwd("~/3 - R Studio Projects/pols3316-summer2022") #This sets the working directory

Loading the Data

With the working directory set, we’re ready to actually load the data. The next line of code actually does multiple things.

  • Create a new object called “cars_data”.

What is an object?

Everything we work on in R is an object - a data structure. When we create a new object, it appears in the upper right Global Environment. Once an object is created, it stays there and we can do multiple things with it, including changing the object itself.

Loading the Data

The line of code that loads the data also does a second thing:

  • Assign something to the object using the assignment operator

[Figure: the assignment operator. Image from: https://shansabri.github.io/post/post1/]

The assignment operator

  • The assignment operator: usually <- (left arrow)
  • Tells R to load the result on the right into the object on the left
  • It can assign a number or character, a sequence of numbers, the result of a math operation, or even an entire set of data

Why do this?

  • Technical reasons become apparent as you do more
  • Big practical reason - it allows us to change the object without changing the original data
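
As a small illustration of the assignment operator, the sketch below assigns a number, a character string, a sequence, the result of a math operation, and a whole data set. The object names are made up for this example, and R's built-in cars data stands in for the CSV file used in the lab. The last lines show the practical reason: changing a copy leaves the original object alone.

```r
# Assign a single number, a character string, a sequence, and a math result
# (all object names here are hypothetical, chosen for illustration)
one_number <- 42
one_word <- "Houston"
a_sequence <- 1:5
a_result <- 10 / 4

# Assign a whole data set; assumption: the built-in cars data
# stands in for the cars.csv file
cars_data <- cars

# Work on a copy: changing the copy leaves cars_data untouched
cars_copy <- cars_data
cars_copy$speed <- cars_copy$speed * 2

head(cars_data$speed, 3)  # original values
head(cars_copy$speed, 3)  # doubled values
```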

Loading the Data

The same line of code does a third thing:

  • The something that gets assigned will be the result of a command, read.csv

read.csv is an R command that tells R to read the file we specify.

Loading the data

This code chunk will load the data

#because of the way R Notebooks work, I have to repeat this command
#in this "chunk"
setwd("~/3 - R Studio Projects/pols3316-summer2022")
cars_data <- read.csv("./data/cars.csv") #read the file and store it in cars_data

Loading Data: Better method!

The working directory method is not the preferred method for working with data. First, if you run my script on your computer it probably won’t work. The reason is that your directory structure is unique to your computer. For example, I have all my R work in a directory called “~/3 - R Studio Projects/” which you almost certainly don’t have on your computer.

Second, we don’t want to set a working directory since the R Studio Project, which we’ll discuss more later, does this for us. R Studio Projects, especially when used with GitHub, provide a much better organized way of handling your work, with the added bonus of a simple way to back up your work to the cloud.

Loading Data: Project Oriented Workflow

Jenny Bryan, an R Studio (Posit) developer and educator, takes this so seriously that she said:

If the first line of your R script is

setwd(“C:”)

I will come into your office and SET YOUR COMPUTER ON FIRE 🔥.

If the first line of your R script is

rm(list = ls())

I will come into your office and SET YOUR COMPUTER ON FIRE 🔥.

Loading Data: The Better Method

The recommended method is to use an R package called here.

  • package - a collection of pre-written code intended to do specific things in R
  • the here package builds file paths, which helps organize, load, and store data
  • here allows you to always find data relative to the project’s top-level directory
  • No need for working directories
  • Portable between your own devices
  • Easily shared with others

Loading Data: Better Method Example

For example, in the following code, the data is in a subdirectory (or subfolder) called “data”.

rm(list = ls()) #this clears the Global Environment - this is not the first line of my Quarto document

#uncomment the following line the first time to install the "here" package
#install.packages("here")

library("here") #This loads the here package
cars_data <- read.csv(here("data","cars.csv")) #builds the path to data/cars.csv and reads the file

Working with the data: head

Now that we have some data to work with, let’s take a look.

First, we can look at just the first several rows of data with the head command. (There is also a tail command that shows the last several rows. Both show six rows by default.)

#
head(cars_data)
  speed dist
1     4    2
2     4   10
3     7    4
4     7   22
5     8   16
6     9   10
#

  • Data in long format
  • Rows are observations
  • Columns are variables
  • First column is an index added by R

In some datasets, there might be a column with observation names such as state names, country names, business names, etc. Since this dataset doesn’t have observation names, the index will allow us to use the information for specific observations or ranges of observations.
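
As a rough sketch of how the index can be used, the example below pulls out single observations and ranges of observations with square brackets. It assumes R's built-in cars data as a stand-in for the file loaded earlier.

```r
# Assumption: the built-in cars data stands in for the loaded CSV file
cars_data <- cars

cars_data[1, ]             # the first observation (row 1, all columns)
cars_data[1:3, ]           # observations 1 through 3
cars_data$speed[50]        # speed for observation 50
cars_data[45:50, "dist"]   # stopping distance for observations 45 to 50
```

Inside the brackets, the part before the comma selects rows (observations) and the part after the comma selects columns (variables). Leaving the column part empty keeps every variable.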

Working with the data: tail

   speed dist
45    23   54
46    24   70
47    24   92
48    24   93
49    24  120
50    25   85

The second and third columns represent variables named speed and dist (stopping distance). This is just toy data, so we don’t really need to worry about this much. In real data, we’d be looking at variables, thinking about possible theoretical relationships between them, and then using statistics to either explore or confirm those relationships.

Data types

Working with the data: Data types

It is important to know the type of data we are working with, so we can use the str command, str(cars_data), to check it:

'data.frame':   50 obs. of  2 variables:
 $ speed: int  4 4 7 7 8 9 10 10 10 11 ...
 $ dist : int  2 10 4 22 16 10 18 26 34 17 ...

The str command returns the structure of the data, the data type of each variable, and the first several observations of each variable.
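
A minimal sketch of checking data types follows. So that it runs anywhere, it writes R's built-in cars data to a temporary CSV file and reads it back with read.csv; the temporary file is an assumption of this example and is not the lab's data/cars.csv.

```r
# Assumption: write the built-in cars data to a temporary CSV file,
# then read it back the same way the lab reads data/cars.csv
temp_file <- tempfile(fileext = ".csv")
write.csv(cars, temp_file, row.names = FALSE)
cars_data <- read.csv(temp_file)

str(cars_data)           # structure of the whole data frame
class(cars_data)         # the kind of object
class(cars_data$speed)   # the data type of one variable
```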

  • Next to both variables, you see int.
  • int indicates a variable type of integer
  • This means a number with no decimal points
  • Whole numbers

Other data types

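  • float - numbers with decimal places

              + 1.2
              + 7.2341
              + 1.0
  • double - a double-precision float, which can hold roughly twice as many significant digits as a single-precision float; this is how R stores numbers with decimal places

  • numeric - numerals with or without decimals; R reports doubles as numeric

  • character - alphanumeric text, such as "Texas" or "A1"

  • logical - TRUE or FALSE, also known as Boolean

  • complex - complex numbers with an imaginary part, like 2 + 3i

  • raw - very low-level data stored as raw bytes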
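
One way to see these types in practice is to ask R directly with class() and typeof(). The values below are arbitrary examples chosen for illustration.

```r
class(1.5)           # "numeric"
typeof(1.5)          # "double"
class(7L)            # "integer" (the L suffix marks a whole number)
class("POLS 3316")   # "character"
class(TRUE)          # "logical"
class(2 + 3i)        # "complex"
class(as.raw(255))   # "raw"
```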

Warning! Alphanumeric data type

Note the alphanumeric! Sometimes numbers can be stored as character type. If that happens, you have to convert them to a numeric type before you can do math operations!
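
A short sketch of that conversion, using a made-up vector of numbers stored as text (the name speed_text and its values are hypothetical):

```r
# Hypothetical values: numbers that were stored as character data
speed_text <- c("4", "7", "8")
class(speed_text)        # character, so sum(speed_text) would fail

# Convert to a numeric type, then math operations work
speed_numbers <- as.numeric(speed_text)
class(speed_numbers)
sum(speed_numbers)
```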

Some simple statistics

We can do simple statistics in R two ways.

  • Perform math operations using standard math operators

              • +
              • -
              • /
              • *
  • Built in R functions

              • sum()
              • mean()
              • median()

Some simple statistics: Sum

First, let’s find the sum of both variables.

#create an object called speed_sum and assign it the value of
#the function sum(cars_data$speed)
#the command/function is sum, cars_data is the data frame,
#and speed is the variable
speed_sum <- sum(cars_data$speed)

#print the value of speed_sum to the screen
speed_sum
[1] 770
#Now let's find the sum of distance
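
One possible way to find the sum of distance is sketched below. The object name dist_sum is a choice made for this example, and the built-in cars data is assumed to match the CSV file.

```r
# Assumption: the built-in cars data stands in for the loaded CSV file
cars_data <- cars

# Same pattern as speed_sum, applied to the dist variable
dist_sum <- sum(cars_data$dist)

# print the value of dist_sum to the screen
dist_sum
```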

Some simple statistics: Mean

Now let’s find the mean, first using math operations, then using the built-in function.

#create an object called speed_mean and compute it using math operations
#50 is the number of observations in the data
speed_mean <- speed_sum/50

#print speed_mean to the screen
speed_mean
[1] 15.4
#How would we create a variable called distance_mean and compute it?

Some simple statistics: Mean

Now, let’s do the same thing using the built in function.

#create an object speed_mean2 and compute it with the mean function
speed_mean2 <- mean(cars_data$speed)

#print speed_mean2 to the screen
speed_mean2
[1] 15.4
#Now let's do the same thing for distance
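
A sketch of both approaches for distance follows. It uses nrow() to count the observations instead of typing 50 by hand, which is a small departure from the speed example; the object names and the use of the built-in cars data are assumptions of this example.

```r
# Assumption: the built-in cars data stands in for the loaded CSV file
cars_data <- cars

# Math operations: the sum divided by the number of observations
distance_mean <- sum(cars_data$dist) / nrow(cars_data)

# Built-in function
distance_mean2 <- mean(cars_data$dist)

# Both methods should print the same value
distance_mean
distance_mean2
```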

Some simple statistics: Other statistics

Some other basic statistics that R can compute are: median, standard deviation, and variance. It can also give us a summary of several statistics.

#median
cat("The median speed is:")
The median speed is:
speed_median <- median(cars_data$speed)
speed_median
[1] 15
#variance
cat("The variance is:")
The variance is:
speed_variance <- var(cars_data$speed)
speed_variance
[1] 27.95918
#standard deviation
cat("The standard deviation is:")
The standard deviation is:
speed_sd <- sd(cars_data$speed)
speed_sd
[1] 5.287644
#minimum and maximum
cat("The minimum speed is:")
The minimum speed is:
min(cars_data$speed)
[1] 4
cat("The maximum speed is:")
The maximum speed is:
max(cars_data$speed)
[1] 25
#summary 
summary(cars_data$speed)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    4.0    12.0    15.0    15.4    19.0    25.0 

Operations on two variables

R can also perform operations on two variables. This is a preview of where we are heading later.

#
#correlation between two variables
cat("The correlation of speed and stopping distance is:")
The correlation of speed and stopping distance is:
cor(cars_data$speed,cars_data$dist)
[1] 0.8068949
#covariance of two variables
cat("The covariance of speed and stopping distance is:")
The covariance of speed and stopping distance is:
cov(cars_data$speed,cars_data$dist)
[1] 109.9469
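
The two numbers are related: the correlation is the covariance divided by the product of the two standard deviations. The sketch below checks this by hand, assuming the built-in cars data matches the file; manual_cor is a name invented for the example.

```r
# Assumption: the built-in cars data stands in for the loaded CSV file
cars_data <- cars

# correlation = covariance / (sd of speed * sd of dist)
manual_cor <- cov(cars_data$speed, cars_data$dist) /
  (sd(cars_data$speed) * sd(cars_data$dist))

manual_cor
cor(cars_data$speed, cars_data$dist)  # should match manual_cor
```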

More complex statistics

  • Ultimately we would like to compute OLS regressions on two or more variables. This gives us:

              + a formula for predicting the value of one variable based on one or more other variables
              + an estimate of how meaningful the relationship is compared to random chance 
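
As a preview, a regression of stopping distance on speed could be fitted with base R's lm function, roughly as sketched here. The object name model is a choice for this example, and the built-in cars data is assumed to match the CSV file.

```r
# Assumption: the built-in cars data stands in for the loaded CSV file
cars_data <- cars

# Fit an OLS regression: dist is predicted by speed
model <- lm(dist ~ speed, data = cars_data)

# Full regression table: coefficients, standard errors, p-values, R-squared
summary(model)

# Just the intercept and slope
coef(model)
```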

Regression example


Call:
lm(formula = dist ~ speed, data = cars_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-29.069  -9.525  -2.272   9.215  43.201 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -17.5791     6.7584  -2.601   0.0123 *  
speed         3.9324     0.4155   9.464 1.49e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 15.38 on 48 degrees of freedom
Multiple R-squared:  0.6511,    Adjusted R-squared:  0.6438 
F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

\(\text{stopping distance} = -17.5791 + 3.9324 \times \text{speed} + \epsilon\)

Sample regression plot

The formula is the equation for a line
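
A regression plot of this kind could be drawn with base R as sketched below, and the same line can be used to compute a prediction by hand. The axis labels, the color, and the example speed of 10 are choices made for illustration, and the built-in cars data is assumed to match the file.

```r
# Assumption: the built-in cars data stands in for the loaded CSV file
cars_data <- cars
model <- lm(dist ~ speed, data = cars_data)

# Scatterplot of the data with the fitted regression line on top
plot(cars_data$speed, cars_data$dist,
     xlab = "Speed", ylab = "Stopping distance",
     main = "Stopping distance by speed")
abline(model, col = "red")

# Equation of a line: intercept + slope * x, here for a speed of 10
predicted_dist <- coef(model)[["(Intercept)"]] + coef(model)[["speed"]] * 10
predicted_dist
```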

Simple plots

We can do some other plots with the built in plotting function.

More advanced plots can be done with packages like:

              + ggplot2
              + coefplot
              + sjPlot
              + many others

Other packages help share results in print or online:

              + stargazer
              + modelsummary
              + Shiny

Simple plots: Simple histograms

#plot histograms
hist(cars_data$speed, col = "lightblue")

hist(cars_data$dist, col = "lightyellow")

Simple plots: Combined

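#plot a combined histogram
#xlim and ylim widen the axes so the larger dist values are not cut off
hist(cars_data$speed, col = "lightblue", xlim = c(0, 120), ylim = c(0, 20))

hist(cars_data$dist, col = "lightyellow", add = TRUE)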

Other plots

To show a couple of other plotting functions, I’m going to use a built-in dataset from R. This data has information on iris flowers. Specifically, we’ll use the data on the length and width of the petals and sepals. (Sepals are the green leaf-like structures around the petals of the flower.) Check out R Coder for some more advanced plots.

# Numerical variables from the iris dataset
iris_data <- iris[, 1:4] 

pairs(iris_data)

Math operations

R can perform basic mathematical operations, as well as some that are not so basic. This can be useful when you just need to quickly check something about a result by typing a math operation in the console (the lower left corner). You can also include math operations in programs and functions. For example, it’s possible to create new variables using math operations.

#basic math
1 + 1
[1] 2
2 / 2
[1] 1
1 * 5
[1] 5
1 - 1
[1] 0

Create objects based on a math operation

new_object <- 1 + 1 - 5
new_object
[1] -3
new_object2 <- 5 - 1 - 1
new_object2
[1] 3

Perform math operations on objects.

#perform a math operation on the new objects to create a third object
new_object3 <- new_object + new_object2

#print result to the screen
new_object3
[1] 0

Create new variables with math operations

Create a new variable for speed above the average (mean)

#create a new variable for the excess speed over the average (mean)
cars_data$excess_speed <- cars_data$speed - speed_mean

Create new variables with math operations

#look at the head and tail of the data frame with the new variable
head(cars_data)
  speed dist excess_speed
1     4    2        -11.4
2     4   10        -11.4
3     7    4         -8.4
4     7   22         -8.4
5     8   16         -7.4
6     9   10         -6.4
tail(cars_data)
   speed dist excess_speed
45    23   54          7.6
46    24   70          8.6
47    24   92          8.6
48    24   93          8.6
49    24  120          8.6
50    25   85          9.6

License

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.