Introduction
When using R for the first time, I had a hard time
figuring out how to load my data for analysis. In my search, I found a
couple of ways to load data into the R environment.
For this exercise, I used RStudio, which is an
integrated development environment (IDE). In other words, it’s a nice
user interface that allows you to use R in a clean and
convenient environment.
Motivating example
Let’s download some data for this exercise.
We will use the Printed Circuit Board Processed Image data from the University of California Irvine Machine Learning Repository, which can be downloaded here.
The file name is TestPad_PCB_XYRGB_V2.csv and it will be
in the *.csv file format.
Download it someplace where you can retrieve it. In this exercise,
let’s download the data into the Downloads folder for
convenience. After you download the data, unzip it in the
Downloads folder.
I’ll walk you through two methods to load the data into
R.
The first method sets the working directory and loads a
*.csv file.
The second method loads a *.xlsx Excel
file.
Note:
Although I’m using macOS for this exercise, the instructions will also work on a Windows-based system.
Method 1 - Setting the working directory and loading a
*.csv file format
The first method to load data in R involves setting
the working directory. The working directory is the folder where
R looks for files during the current session. Once it is
set, R will use it to locate and
load data. (Note: When you start R, you are working in a
session.)
In RStudio, click on Session ->
Set Working Directory -> Choose Directory,
then navigate to the Downloads folder. Select “Open” to
open the Downloads folder. (Note: For this example, the
data was saved in the Downloads folder. But you may
have the data saved in a different location. Navigate to that location
using the Choose Directory feature.)
RStudio will automatically enter the code in the Console that
sets the working directory to the Downloads folder.
You can copy and paste this R code into your script file
so that you can run it again without having to click through the menus
to locate and set the working directory.
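The menu steps above can also be done entirely in code. The following is a minimal sketch using the base R functions getwd(), setwd(), and list.files(). It uses tempdir() as a stand-in for the Downloads folder so that it runs on any machine; the variable names (old_wd, target_dir, csv_files) are only illustrative.

```r
# Remember the current working directory so it can be restored afterwards.
old_wd <- getwd()

# tempdir() stands in for the Downloads folder in this sketch.
target_dir <- tempdir()

# setwd() changes the working directory for the current session.
setwd(target_dir)

# Confirm the change. normalizePath() resolves symbolic links so the
# two paths can be compared reliably.
wd_changed <- normalizePath(getwd()) == normalizePath(target_dir)
print(wd_changed)

# List the *.csv files R can now find without a full path.
csv_files <- list.files(pattern = "\\.csv$")
print(csv_files)

# Restore the original working directory.
setwd(old_wd)
```

Running getwd() before and after setwd() is a quick way to check that R is looking in the folder you expect.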
Loading data by choosing the working directory
Once the working directory is set, you can load your data using the
read.csv() function.
Pass the file name and the argument
header = TRUE, which indicates that the first row contains
the column labels. (header = TRUE is already the default for
read.csv(), but stating it makes the intent explicit.)
## Set the working directory
setwd("~/Downloads")
## We will call this dataframe object `df1`
df1 <- read.csv("TestPad_PCB_XYRGB_V2.csv", header = TRUE)
## After the data has been loaded, you can view the first 6 rows:
head(df1)
## X Y R G B Grey
## 1 105 0 0.9098039 0.9764706 0.9372549 0
## 2 106 0 0.7921569 0.9019608 0.8431373 0
## 3 107 0 0.6313725 0.7882353 0.6941176 0
## 4 108 0 0.4745098 0.6705882 0.5568627 0
## 5 109 0 0.3411765 0.5843137 0.4392157 0
## 6 110 0 0.2745098 0.5411765 0.3882353 0
The data has been loaded and assigned to a dataframe object called
df1. You can use any name you like. I personally like
to keep the name of the dataframe short so that it is easier to
type.
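To see what the header argument actually does, here is a small sketch that does not depend on the downloaded file. It writes a tiny made-up CSV (example_pixels.csv, a hypothetical file and not the PCB data) to a temporary folder, then reads it back with and without header = TRUE. It also shows that passing a full path built with file.path() is an alternative to setting the working directory.

```r
# Write a tiny stand-in CSV to a temporary folder (made-up values).
csv_path <- file.path(tempdir(), "example_pixels.csv")
writeLines(c("X,Y,Grey", "105,0,0", "106,0,0"), csv_path)

# Check that the file exists before trying to read it.
stopifnot(file.exists(csv_path))

# header = TRUE: the first row becomes the column names.
with_header <- read.csv(csv_path, header = TRUE)

# header = FALSE: the first row is treated as data, and the columns
# get default names (V1, V2, ...).
without_header <- read.csv(csv_path, header = FALSE)

# str() prints the column names, types, and the first few values.
str(with_header)
str(without_header)
```

With header = FALSE, the labels end up as the first data row, which is usually a sign that the argument was set incorrectly.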
Method 2 - Loading a *.xlsx Excel file format
In the above example, we loaded the
TestPad_PCB_XYRGB_V2.csv data into R.
Fortunately, this data was in the *.csv format, which
allowed us to use the read.csv() function to load the data
into the R environment.
What if the data was in an Excel format (e.g., *.xlsx)?
How would we load the data?
You can open the data in Excel and then save it as a
*.csv file. This is the easiest workaround.
Or, you can use the openxlsx package. You can read about
the openxlsx package on its GitHub site here.
You will need to install the openxlsx package before you
can use it in R.
Once you install the package, you can load it using the
library() function.
After the openxlsx package has been loaded, you can load
the data into the R environment using the
read.xlsx() function. Make sure to include the argument
colNames = TRUE to let R know that the first
row contains the column labels.
(Note: For this exercise, I created the
TestPad_PCB_XYRGB_V2.xlsx as an *.xlsx
file.)
## Install the `openxlsx` package (You only need to do this once)
## install.packages("openxlsx")
## Load the `openxlsx` package (You need to do this each time you start `R`)
library("openxlsx")
## Set the working directory
setwd("~/Downloads")
## Load the data as a dataframe `df2`
df2 <- read.xlsx("TestPad_PCB_XYRGB_V2.xlsx", colNames = TRUE)
## After the data has been loaded, you can view the first 6 rows:
head(df2)
## X Y R G B Grey
## 1 105 0 0.9098039 0.9764706 0.9372549 0
## 2 106 0 0.7921569 0.9019608 0.8431373 0
## 3 107 0 0.6313725 0.7882353 0.6941176 0
## 4 108 0 0.4745098 0.6705882 0.5568627 0
## 5 109 0 0.3411765 0.5843137 0.4392157 0
## 6 110 0 0.2745098 0.5411765 0.3882353 0
The data is now loaded in the R environment as a
dataframe object with the name df2. You can now use this
dataframe for your analyses.
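If you don’t have an *.xlsx file handy, the following sketch creates one and reads it back, which is a quick way to check that openxlsx is working. It assumes the openxlsx package is already installed (it skips everything otherwise), and the data frame and file are made up for illustration. The functions write.xlsx(), getSheetNames(), and read.xlsx() all come from openxlsx.

```r
# Only run if openxlsx is available; this sketch installs nothing.
if (requireNamespace("openxlsx", quietly = TRUE)) {
  # A temporary *.xlsx file and a small made-up data frame.
  xlsx_path <- tempfile(fileext = ".xlsx")
  example_df <- data.frame(X = c(105, 106), Y = c(0, 0), Grey = c(0, 0))

  # write.xlsx() saves the data frame to a worksheet in a new workbook.
  openxlsx::write.xlsx(example_df, xlsx_path)

  # getSheetNames() lists the worksheets in the workbook.
  sheets <- openxlsx::getSheetNames(xlsx_path)
  print(sheets)

  # sheet = 1 reads the first worksheet; colNames = TRUE uses the
  # first row as the column names.
  df_check <- openxlsx::read.xlsx(xlsx_path, sheet = 1, colNames = TRUE)
  print(df_check)
}
```

The sheet argument is worth knowing about because Excel workbooks often contain more than one worksheet, and read.xlsx() reads the first one unless told otherwise.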
Final thoughts
I reviewed how you can load data saved in the
*.csv and *.xlsx file formats after setting the
working directory. However, there are many other data formats that
R can load into its environment. Chances are there is
a function or package that will allow you to load that data type into
R.
When you have a data format that is not a *.csv or
*.xlsx file, don’t worry. Try to search online for an
R package that will load the data into its environment. I’m
confident that you will figure things out after going through this
exercise.
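As a small illustration, some formats need no extra package at all. The sketch below uses only base R: read.delim() for tab-separated text and saveRDS()/readRDS() for R’s own *.rds format. The files and values are made up and written to temporary locations.

```r
# Tab-separated text: read.delim() is the tab-delimited counterpart
# of read.csv().
tsv_path <- tempfile(fileext = ".tsv")
writeLines(c("X\tY\tGrey", "105\t0\t0", "106\t0\t0"), tsv_path)
df_tsv <- read.delim(tsv_path, header = TRUE)

# R's native *.rds format stores a single R object in a file.
rds_path <- tempfile(fileext = ".rds")
saveRDS(df_tsv, rds_path)
df_rds <- readRDS(rds_path)

# Compare the dataframe before and after the round trip.
print(all.equal(df_tsv, df_rds))
```

The pattern is the same as in the two methods above: find the reader function for the format, point it at the file, and assign the result to a dataframe.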
Acknowledgements
I got the data used in this exercise from the University of California Irvine Machine Learning Repository. There is a ton of data in their repository, and I highly encourage you to explore its contents.
To learn more about the openxlsx package, you can review
the source files on its GitHub site.
Disclaimer and Disclosures
This is a work in progress; thus, I may update this in the future.
Any errors or mistakes are my fault, and I’ll try my best to fix them.
Lastly, this is for educational purposes only.