This data is actually built into R as a dataset called “cars” (part of the datasets package), but we want to practice loading it from a file. (Later, we’ll practice loading the built-in “iris” dataset directly.) While loading and examining this simple data set, we’ll look at some basic R concepts like objects, data types, variables, observations, commands, and functions. We will also briefly discuss good workflow, which is often considered an advanced topic but actually makes life simpler if you observe it from the start. Finally, we’ll look at some summary statistics and graphical representations or plots.
We’ll look at three methods.
+ Bad method: Click on the file in R Studio.
+ Okay method: the *base R* method, which you can use with just the minimum R installation.
+ Better method: the here package, used together with an R Studio Project.
Why is it bad?
+ Temporary
+ Not saved to script, etc.
+ Code won't run properly
+ Bad habit that will get you in trouble
What is it useful for:
+ running a quick one time check
+ getting the code to put in your scripts
Main Point: Don’t rely on this
With the working directory set, we’re ready to actually load the data. The next line of code actually does multiple things.
Everything we work on in R is an object - a data structure. When we create a new object, it appears in the Global Environment pane in the upper right of R Studio. Once an object is created, it stays there and we can do multiple things with it, including changing the object itself.
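As a small illustration of creating and changing objects, here is a minimal sketch. The object names `x` and `y` are made up for this example and are not part of the course data.

```r
# Create an object named x that holds the number 5
x <- 5

# The object stays in the environment, so later commands can use it
y <- x * 2

# We can change the object itself by assigning to the same name again
x <- x + 1

# ls() lists the names of the objects in the current environment
ls()
```

After running these lines in R Studio, both `x` and `y` should be listed in the Global Environment pane.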
Next is the assignment operator, <-, which assigns the value on its right to the object named on its left.
(Image from: https://shansabri.github.io/post/post1/)
Last, read.csv is an R function that reads the CSV (comma-separated values) file we specify and returns its contents as a data frame.
This code chunk will load the data.
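A minimal sketch of the working directory method is shown here. It makes two assumptions so that it runs on any computer: a temporary folder stands in for your own working directory, and the built-in cars dataset is written out to stand in for the course's cars.csv file.

```r
# Assumption: a temporary folder stands in for your own working directory,
# and the built-in cars dataset stands in for the course's cars.csv file.
work_dir <- tempdir()
write.csv(datasets::cars, file.path(work_dir, "cars.csv"), row.names = FALSE)

# Set the working directory; setwd() returns the previous one, so we keep it
old_dir <- setwd(work_dir)

# Read the file and assign the resulting data frame to the object cars_data
cars_data <- read.csv("cars.csv")

# Restore the previous working directory
setwd(old_dir)
```

In your own script the path passed to setwd() would be a folder on your computer, which is exactly why this approach does not travel well between machines.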
This is not the preferred method for working with data, for two reasons. First, with the working directory method, if you run my script on your computer it probably won’t work. The reason is that your directory structure is unique to your computer. For example, I have all my R work in a directory called “~/3 - R Studio Projects/” which you almost certainly don’t have on your computer.
Second, we don’t need to set a working directory since the R Studio Project, which we’ll discuss more later, does this for us. R Studio Projects, especially when used with GitHub, provide a much better organized way of handling your work with the added bonus of a simple way to back up your work to the cloud.
Jenny Bryan, an R Studio (Posit) developer and educator, takes this so seriously that she said:
If the first line of your R script is
setwd(“C:”)
I will come into your office and SET YOUR COMPUTER ON FIRE 🔥.
If the first line of your R script is
rm(list = ls())
I will come into your office and SET YOUR COMPUTER ON FIRE 🔥.
The recommended method is to use an R package called here.
For example, in the following code, the data is in a subdirectory (or subfolder) called “data”.
rm(list = ls()) # this clears the Global Environment - this is not the first line of my Quarto document
# uncomment the following line the first time to install the "here" package
# install.packages("here")
library("here") # this loads the here package
cars_data <- read.csv(here("data", "cars.csv")) # read data/cars.csv, relative to the project root
Now that we have some data to work with, let’s take a look.
First, we can look at just the first several rows of data with the head command. (There is also a tail command that shows the last several rows. By default, both show six rows.)
speed dist
1 4 2
2 4 10
3 7 4
4 7 22
5 8 16
6 9 10
In some datasets, there might be a column with observation names such as state names, country names, business names, etc. Since this dataset doesn’t have observation names, the index (the row number shown in the first column of the output) will allow us to use the information for specific observations or ranges of observations.
speed dist
45 23 54
46 24 70
47 24 92
48 24 93
49 24 120
50 25 85
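The outputs above can be reproduced with commands along these lines. This is a sketch that assumes the built-in cars dataset as a stand-in for the cars_data object loaded from the file.

```r
# Assumption: the built-in cars dataset stands in for the data loaded from file
cars_data <- datasets::cars

head(cars_data)          # first six rows (the default)
tail(cars_data)          # last six rows (the default)
head(cars_data, n = 3)   # ask for a different number of rows with n

# Use the index inside square brackets: [rows, columns]
cars_data[10, ]          # one observation (row 10, all columns)
cars_data[45:50, ]       # a range of observations
cars_data[10, "speed"]   # a single value: the speed for observation 10
```

Leaving the part after the comma empty means "all columns", which is why `cars_data[45:50, ]` returns the same rows as `tail(cars_data)`.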
The second and third columns represent variables named speed and dist (stopping distance). This is just toy data, so we don’t really need to worry about this much. In real data, we’d be looking at variables thinking about possible theoretical relationships between them and then using statistics to either explore or confirm those relationships.
It is important to know the type of data we are working with, so we can use the str command to check.
'data.frame': 50 obs. of 2 variables:
$ speed: int 4 4 7 7 8 9 10 10 10 11 ...
$ dist : int 2 10 4 22 16 10 18 26 34 17 ...
The str command returns the structure of the data, the data type of each variable, and the first several observations of each variable.
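Here is a sketch of the call that produces this kind of output. One assumption to note: the built-in cars dataset stores its values as num (double), so the sketch writes it to a temporary CSV file and reads it back, which mirrors loading from a file and yields the int columns shown above.

```r
# Assumption: round-trip the built-in cars data through a temporary CSV file
# to mimic loading the course's cars.csv with read.csv()
tmp <- tempfile(fileext = ".csv")
write.csv(datasets::cars, tmp, row.names = FALSE)
cars_data <- read.csv(tmp)

str(cars_data)           # structure: object class, dimensions, type of each variable
class(cars_data)         # the class of the whole object
class(cars_data$speed)   # the data type of a single variable
```

The `$` operator picks one variable (column) out of the data frame by name.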
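integer - whole numbers stored without decimal places (shown as int in the str output above)
float - numbers with decimal places (R has no separate float type; it stores these values as double)
+ 1.2
+ 7.2341
+ 1.0
double - double-precision floating point numbers, R’s default storage for numbers with decimals
numeric - the general term for numbers, covering both integer and double
character - text, which may mix letters, digits, and symbols (alphanumeric)
logical - TRUE or FALSE, also known as Boolean
complex - numbers with an imaginary part, like 3+2i
raw - individual bytes, a very low level type that is rarely needed for data analysis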
Note the alphanumeric! Sometimes numbers can be stored as character type. If that happens, you have to convert them to a numeric type before you can do math operations!
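To illustrate the conversion, here is a minimal sketch. The vector `speeds_text` is a made-up example, not part of the cars data.

```r
# Hypothetical example: numbers that were stored as text
speeds_text <- c("4", "7", "8")
class(speeds_text)        # "character"

# sum(speeds_text) would stop with an error, because text cannot be added

# Convert to a numeric type first, then the math works
speeds_num <- as.numeric(speeds_text)
class(speeds_num)         # "numeric"
sum(speeds_num)           # 19
```

If a value cannot be read as a number (for example "fast"), as.numeric() returns NA for it and gives a warning, so it is worth checking the result.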
We can do simple statistics in R two ways.
Perform math operations using standard math operators
• +
• -
• /
• *
Built in R functions
• sum()
• mean()
• median()
First, let’s find the sum of both variables.
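One way to compute these sums is sketched here, assuming the built-in cars dataset as a stand-in for the cars_data object loaded earlier.

```r
# Assumption: the built-in cars dataset stands in for the data loaded from file
cars_data <- datasets::cars

sum(cars_data$speed)   # total of the speed variable
sum(cars_data$dist)    # total of the dist variable

# colSums() computes the sum of every column in one step
colSums(cars_data)
```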
Now let’s find the mean, first using math operations then using the built-in function.
Now, let’s do the same thing using the built in function.
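Both approaches can be sketched as follows, again assuming the built-in cars dataset as a stand-in for cars_data. The object names `mean_speed_manual` and `mean_speed_builtin` are made up for this example.

```r
# Assumption: the built-in cars dataset stands in for the data loaded from file
cars_data <- datasets::cars

# Math operations: the mean is the sum divided by the number of observations
mean_speed_manual <- sum(cars_data$speed) / length(cars_data$speed)

# Built-in function
mean_speed_builtin <- mean(cars_data$speed)

mean_speed_manual
mean_speed_builtin
```

Both lines should print the same value, which matches the Mean shown in the summary output for speed.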
Some other basic statistics that R can compute are: median, standard deviation, and variance. It can also give us a summary of several statistics.
The median speed is:
[1] 15
The variance is:
[1] 27.95918
The standard deviation is:
[1] 5.287644
The minimum speed is:
[1] 4
The maximum speed is:
[1] 25
Min. 1st Qu. Median Mean 3rd Qu. Max.
4.0 12.0 15.0 15.4 19.0 25.0
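The results above come from functions like the ones in this sketch, which assumes the built-in cars dataset as a stand-in for cars_data.

```r
# Assumption: the built-in cars dataset stands in for the data loaded from file
cars_data <- datasets::cars

median(cars_data$speed)    # median
var(cars_data$speed)       # variance
sd(cars_data$speed)        # standard deviation (square root of the variance)
min(cars_data$speed)       # minimum
max(cars_data$speed)       # maximum
summary(cars_data$speed)   # several statistics at once
```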
R can also perform operations on two variables. This is a preview of where we are heading later.
The correlation of speed and stopping distance is:
[1] 0.8068949
The covariance of speed and stopping distance is:
[1] 109.9469
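A sketch of the two-variable calculations, assuming the built-in cars dataset as a stand-in for cars_data:

```r
# Assumption: the built-in cars dataset stands in for the data loaded from file
cars_data <- datasets::cars

# Correlation: strength of the linear relationship, always between -1 and 1
cor(cars_data$speed, cars_data$dist)

# Covariance: how the two variables vary together, in the units of the data
cov(cars_data$speed, cars_data$dist)
```

The two are related: the correlation is the covariance divided by the product of the two standard deviations.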
Ultimately we would like to compute OLS regressions on two or more variables. This gives us:
+ a formula for predicting the value of one variable based on one or more other variables
+ an estimate of how meaningful the relationship is compared to random chance
Call:
lm(formula = dist ~ speed, data = cars_data)
Residuals:
Min 1Q Median 3Q Max
-29.069 -9.525 -2.272 9.215 43.201
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -17.5791 6.7584 -2.601 0.0123 *
speed 3.9324 0.4155 9.464 1.49e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 15.38 on 48 degrees of freedom
Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
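Output like this is produced by fitting the model with lm and then summarizing it. The sketch below assumes the built-in cars dataset as a stand-in for cars_data; the object name `model` is made up for this example.

```r
# Assumption: the built-in cars dataset stands in for the data loaded from file
cars_data <- datasets::cars

# Fit an OLS regression of stopping distance (dist) on speed.
# The formula dist ~ speed reads "dist is modeled as a function of speed".
model <- lm(dist ~ speed, data = cars_data)

# Print the coefficient table, R-squared, and F-statistic
summary(model)

# Extract just the estimated intercept and slope
coef(model)
```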
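\(\text{stopping distance} = -17.5791 + 3.9324 \times \text{speed} + \epsilon\)
The formula is the equation for a line: the intercept is -17.5791 and the slope is 3.9324, so each additional unit of speed adds about 3.93 units to the predicted stopping distance.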
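To see the line used as a prediction formula, here is a sketch that predicts the stopping distance at a speed of 20. The speed of 20 and the object names are chosen only for illustration, and the built-in cars dataset stands in for cars_data.

```r
# Assumption: the built-in cars dataset stands in for the data loaded from file
cars_data <- datasets::cars
model <- lm(dist ~ speed, data = cars_data)

# By hand: intercept + slope * speed
b <- coef(model)
pred_manual <- unname(b["(Intercept)"] + b["speed"] * 20)

# With the built-in predict() function
pred_builtin <- unname(predict(model, newdata = data.frame(speed = 20)))

pred_manual
pred_builtin
```

Both values should agree and come out at roughly 61, which is -17.5791 + 3.9324 * 20 up to rounding of the coefficients.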
We can do some other plots with the built in plotting function.
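A few base R plots are sketched below, assuming the built-in cars dataset as a stand-in for cars_data. The titles and axis labels are illustrative choices, and the first line after loading the data only matters when the code is run as a script rather than in R Studio.

```r
# Assumption: the built-in cars dataset stands in for the data loaded from file
cars_data <- datasets::cars

# When run non-interactively (e.g. with Rscript), send plots to a null device
# instead of creating an Rplots.pdf file
if (!interactive()) pdf(NULL)

# Histogram of one variable; hist() also returns the bin counts it used
speed_hist <- hist(cars_data$speed, main = "Histogram of speed", xlab = "Speed")

# Boxplot of one variable
boxplot(cars_data$dist, main = "Stopping distance", ylab = "Distance")

# Scatterplot of two variables with the fitted regression line added
plot(dist ~ speed, data = cars_data, main = "Stopping distance by speed")
abline(lm(dist ~ speed, data = cars_data))
```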
More advanced plots can be done with packages like:
+ ggplot2
+ coefplot
+ sjPlot
+ many others
Other packages help share results in print or online:
+ stargazer
+ modelsummary
+ Shiny
To show a couple of other plotting functions, I’m going to use a built-in dataset from R. This data has information on iris flowers. Specifically, we’ll use the data on the length and width of the petals and sepals. (Sepals are the green leaf-like structures around the petals of the flower.) Check out R Coder for some more advanced plots.
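As a sketch of what this can look like, the code below loads the built-in iris data and draws three base R plots. The particular plots and labels are illustrative choices, not necessarily the ones used originally.

```r
# Load the built-in iris dataset directly (no file needed)
data(iris)
head(iris)

# When run non-interactively (e.g. with Rscript), send plots to a null device
if (!interactive()) pdf(NULL)

# Scatterplot of petal length against petal width, colored by species
plot(iris$Petal.Length, iris$Petal.Width,
     col = iris$Species,
     xlab = "Petal length", ylab = "Petal width")

# Boxplots of sepal length, one box per species
boxplot(Sepal.Length ~ Species, data = iris, ylab = "Sepal length")

# Scatterplot matrix of the four measurement columns
pairs(iris[, 1:4])
```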
R can perform basic mathematical operations, as well as some that are not so basic. This can be useful when you just need to quickly check something about a result by typing a math operation in the console (the lower left corner). You can also include math operations in programs and functions. For example, it’s possible to create new variables using math operations.
Create a new variable for speed above the average (mean)
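One possible reading of this step is sketched here: a new column holding how far each speed is above (or below) the mean, plus a TRUE/FALSE column marking the above-average observations. The column names `speed_above_mean` and `is_above_mean` are made up for this example, and the built-in cars dataset stands in for cars_data.

```r
# Assumption: the built-in cars dataset stands in for the data loaded from file
cars_data <- datasets::cars

# New variable: speed minus the mean speed (negative values are below average)
cars_data$speed_above_mean <- cars_data$speed - mean(cars_data$speed)

# New logical variable: TRUE when the speed is above the mean
cars_data$is_above_mean <- cars_data$speed > mean(cars_data$speed)

head(cars_data)
```

Because the new variables are stored as columns of cars_data, they show up alongside speed and dist in head, str, and summary.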
This work is licensed under a [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License][cc-by-nc-sa].