You have already finished the first chapter of the Data Import course at DataCamp, and you might have come across several online resources about importing files. However, the best way to learn programming is to practice what you’ve just learned. Let’s import a csv file that we generate in our Desktop folder.
The data we’ll be importing is the states.csv file used in the DataCamp course:
state,capital,pop_mill,area_sqm
South Dakota,Pierre,0.853,77116
New York,Albany,19.746,54555
Oregon,Salem,3.970,98381
Vermont,Montpelier,0.627,9616
Hawaii,Honolulu,1.420,10931
Please open a new text file on your Desktop (in Windows, Notepad can be used), paste the contents above, and save the file as states.csv on the Desktop.
Now, let’s switch over to RStudio and try to import the file we just generated. The code to import the file should be:
read.csv("states.csv")
Most probably you ended up with an error message after running this code. What does the error message say? cannot open file ____: No such file or directory.
Note: Unfortunately, most students simply ignore the error message and get stuck. However, the error messages printed on screen contain quite helpful information and can help with troubleshooting. So, PLEASE READ THE ERROR MESSAGES CAREFULLY.
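If you want to look at the message from inside your script, the sketch below captures it with tryCatch() instead of letting it stop the run. The file name no_such_file_for_demo.csv is a made-up name chosen so that the import fails, and the exact wording of the message can differ between systems.

```r
# Hypothetical file name that should not exist in the working directory.
missing_file <- "no_such_file_for_demo.csv"

# Capture the first condition (warning or error) raised by read.csv()
# and keep its text instead of stopping the script.
msg <- tryCatch(
  read.csv(missing_file),
  warning = function(w) conditionMessage(w),
  error = function(e) conditionMessage(e)
)

print(msg)  # print the captured message text so it can be read carefully
```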
Why can’t R find the file? Because R (like most other programs) searches for files in one specific folder called the working directory. The working directory was mentioned in Lesson02. If the file is not in the working directory, R will complain that the file or directory was not found. In that case there are two ways to import the file successfully:
Change the working directory to “Desktop” (or whichever folder you placed the file in) and then import the file:
setwd("C:/Users/your_username/Desktop")
read.csv("states.csv")
Without changing the working directory, provide the full path of the file to be imported:
path <- file.path("C:/Users/your_username/Desktop","states.csv")
path # should print "C:/Users/your_username/Desktop/states.csv"
read.csv(path)
The paths indicated above are for Windows, and your_username should be replaced with the correct username. Mac or Linux users can use "~/Desktop" as the path to their Desktop folder.
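The two approaches can be tried side by side in the sketch below. It is only an illustration: a folder inside tempdir() stands in for the Desktop (an assumption, so that the code runs on any machine), and the names folder, states_by_path and states_by_wd are made up for this example.

```r
# A temporary folder stands in for the Desktop (assumption for this sketch).
folder <- file.path(tempdir(), "desktop_demo")
dir.create(folder, showWarnings = FALSE)

# Create states.csv with the same contents as above.
csv_lines <- c(
  "state,capital,pop_mill,area_sqm",
  "South Dakota,Pierre,0.853,77116",
  "New York,Albany,19.746,54555",
  "Oregon,Salem,3.970,98381",
  "Vermont,Montpelier,0.627,9616",
  "Hawaii,Honolulu,1.420,10931"
)
writeLines(csv_lines, file.path(folder, "states.csv"))

# Way 2: keep the working directory and build the full path.
path <- file.path(folder, "states.csv")
file.exists(path)                 # check that R can see the file first
states_by_path <- read.csv(path)

# Way 1: change the working directory, then use the bare file name.
old_wd <- setwd(folder)           # setwd() returns the previous directory
getwd()                           # confirm where R is looking now
states_by_wd <- read.csv("states.csv")
setwd(old_wd)                     # restore the original working directory

# Both ways should yield the same data frame.
identical(states_by_path, states_by_wd)
```

Calling file.exists() before read.csv() is a cheap way to tell a wrong path apart from a problem with the file contents.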
Some students might still be getting the same error although they have the states.csv file on their Desktop and have changed the working directory correctly. If you select “Go To Working Directory” under the Files pane’s More menu (see figure below), you’ll be able to view the contents of the Desktop folder, and there you’ll notice something strange: the file you generated was named states.csv.txt instead of states.csv. This is because Notepad appends .txt as the extension unless otherwise indicated explicitly in the save dialog. Since Windows (usually) hides extensions, you might be misled into believing that the file was created as intended. Using TAB might have prevented such a surprise during csv import.
How to go to working directory
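The same check can be done from the console. The sketch below is a minimal illustration that imitates Notepad’s behaviour in a temporary folder (an assumption, so that nothing on your real Desktop is touched); the variable names are made up for this example.

```r
# A fresh temporary folder stands in for the Desktop (assumption).
folder <- tempfile("notepad_demo")
dir.create(folder)

# Imitate Notepad silently appending .txt to the file name.
writeLines("state,capital,pop_mill,area_sqm",
           file.path(folder, "states.csv.txt"))

# List what is really in the folder, extensions included.
found <- list.files(folder, pattern = "^states")
print(found)

# Rename the file only if the wrong name exists and the right one does not.
wrong <- file.path(folder, "states.csv.txt")
right <- file.path(folder, "states.csv")
if (file.exists(wrong) && !file.exists(right)) {
  file.rename(wrong, right)
}

print(list.files(folder, pattern = "^states"))
```

Unlike the Windows file browser, list.files() always shows the full file name, so a hidden .txt extension becomes visible immediately.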
Note: Please use TAB as much as possible. RStudio supports code completion with TAB very well. TAB can complete library names, function names, data frame column names, and file or folder names (when within double quotes). Using TAB not only saves typing but also prevents No such file or directory type errors. If TAB does not complete a file or folder name, then that file or folder does NOT exist.
Csv files are supposed to be viewable in MS Excel, but that’s another annoying story. If you successfully evade Notepad’s intrusion in file naming and manage to give your file the .csv extension, its icon changes and looks like the MS Excel icon. This means MS Excel has registered the .csv extension in the system, and a csv file should open in MS Excel when it is double-clicked. If you double-click the csv file, you may encounter a frustrating irony: MS Excel opens the file but doesn’t recognize the comma as separator (in a comma separated values file that it told the system it would open!) and imports the file into a single column. In such a case, please go to Data --> Text to Columns, select Delimited as the file type, and then select comma as the delimiter.
If you successfully import the states.csv file, you’ll notice that four columns are recognized along with their column headers. Although everything looks fine (below), there’s more than meets the eye.
> read.csv("states.csv")
         state    capital pop_mill area_sqm
1 South Dakota     Pierre    0.853    77116
2     New York     Albany   19.746    54555
3       Oregon      Salem    3.970    98381
4      Vermont Montpelier    0.627     9616
5       Hawaii   Honolulu    1.420    10931
As you might remember from the DataCamp course content, the read.csv() function is used there with the stringsAsFactors = FALSE option. In R versions before 4.0.0, the default value for this option is TRUE, which causes an undesired result (since R 4.0.0 the default is FALSE).
> states <- read.csv("states.csv")
> str(states)
'data.frame':	5 obs. of  4 variables:
 $ state   : Factor w/ 5 levels "Hawaii","New York",..: 4 2 3 5 1
 $ capital : Factor w/ 5 levels "Albany","Honolulu",..: 4 1 5 3 2
 $ pop_mill: num  0.853 19.746 3.97 0.627 1.42
 $ area_sqm: int  77116 54555 98381 9616 10931
As you can see, the state and capital columns are imported as factors since they contain strings. These columns can cause errors or unexpected results during data manipulation and visualization steps. Thus, the stringsAsFactors = FALSE option should be specified while importing csv files.
> states <- read.csv("states.csv", stringsAsFactors = FALSE)
> str(states)
'data.frame':	5 obs. of  4 variables:
 $ state   : chr  "South Dakota" "New York" "Oregon" "Vermont" ...
 $ capital : chr  "Pierre" "Albany" "Salem" "Montpelier" ...
 $ pop_mill: num  0.853 19.746 3.97 0.627 1.42
 $ area_sqm: int  77116 54555 98381 9616 10931
Now, the columns are of character type.
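To see the difference without depending on which default your R version uses, the sketch below sets the option explicitly both ways. It reads the data from a string through the text argument of read.csv(), so no file is needed; the names as_factor and as_char are made up for this example.

```r
# The same data as states.csv, held in a single string.
csv_text <- paste(
  "state,capital,pop_mill,area_sqm",
  "South Dakota,Pierre,0.853,77116",
  "New York,Albany,19.746,54555",
  "Oregon,Salem,3.970,98381",
  "Vermont,Montpelier,0.627,9616",
  "Hawaii,Honolulu,1.420,10931",
  sep = "\n"
)

# Import twice, stating the option explicitly each time.
as_factor <- read.csv(text = csv_text, stringsAsFactors = TRUE)
as_char   <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Compare the class of every column.
print(sapply(as_factor, class))
print(sapply(as_char, class))

# A factor stores integer codes that point into its sorted levels,
# which is where surprises come from when it is treated as text.
print(levels(as_factor$state))
print(as.integer(as_factor$state))
```

The integer codes printed at the end are the same numbers that str() shows after the factor levels.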
If you type ?read.csv and go over the manual page for read.csv() (read.table() actually), you’ll notice that there are plenty of default options already defined behind the scenes. For instance:
read.csv(file, header = TRUE, sep = ",", quote = "\"",
dec = ".", fill = TRUE, comment.char = "", ...)
The read.csv() function expects the first line to be the header, the separator to be a comma, the quote character to be a double quote, and the decimal separator to be a period, unless otherwise indicated.
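One way to convince yourself that these defaults are really applied is to spell them out and compare the results. The sketch below writes the data to a temporary file (an assumption made so the code is self-contained) and uses made-up variable names.

```r
# Write the example data to a temporary csv file.
csv_lines <- c(
  "state,capital,pop_mill,area_sqm",
  "South Dakota,Pierre,0.853,77116",
  "New York,Albany,19.746,54555",
  "Oregon,Salem,3.970,98381",
  "Vermont,Montpelier,0.627,9616",
  "Hawaii,Honolulu,1.420,10931"
)
csv_file <- tempfile(fileext = ".csv")
writeLines(csv_lines, csv_file)

# Import once relying on the defaults and once stating them explicitly.
implicit <- read.csv(csv_file)
explicit <- read.csv(csv_file, header = TRUE, sep = ",", quote = "\"",
                     dec = ".", fill = TRUE, comment.char = "")
print(identical(implicit, explicit))

# Overriding a default changes the result: without a header,
# the first line is treated as data and becomes an extra row.
no_header <- read.csv(csv_file, header = FALSE)
print(dim(no_header))
```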
A Side Note: the extension of a file does not define the content of the file. A txt file can very well be a csv file as long as it is a plain text file where columns are separated with commas. Another valid csv file might have .dat as its extension, and read.csv() won’t refuse to open the file due to an unexpected extension.
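A quick way to check this is to save the same comma separated content under two different extensions and import both. The temporary files and variable names below are assumptions made for the sake of a runnable sketch.

```r
csv_lines <- c(
  "state,capital,pop_mill,area_sqm",
  "South Dakota,Pierre,0.853,77116",
  "New York,Albany,19.746,54555",
  "Oregon,Salem,3.970,98381",
  "Vermont,Montpelier,0.627,9616",
  "Hawaii,Honolulu,1.420,10931"
)

# Same content, two "unexpected" extensions.
dat_file <- tempfile(fileext = ".dat")
txt_file <- tempfile(fileext = ".txt")
writeLines(csv_lines, dat_file)
writeLines(csv_lines, txt_file)

# read.csv() looks at the content, not at the file name.
from_dat <- read.csv(dat_file)
from_txt <- read.csv(txt_file)
print(identical(from_dat, from_txt))
print(names(from_dat))
```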
If your file is not csv and uses a delimiter other than comma, there are other options to import it. If the file uses another common delimiter, the tab character, then read.delim() can be used to import such a file. If the delimiter is neither comma nor tab, then you can use the read.table() function. Actually, the read.csv() and read.delim() functions are wrappers for the read.table() function that call it with preset defaults.
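As a rough illustration, the sketch below writes the states data once with tabs and once with a pipe character as delimiter and reads each file back with the matching function. The pipe delimiter, the temporary files and the variable names are assumptions picked for this example.

```r
# Start from the states data, read from a string.
csv_text <- paste(
  "state,capital,pop_mill,area_sqm",
  "South Dakota,Pierre,0.853,77116",
  "New York,Albany,19.746,54555",
  "Oregon,Salem,3.970,98381",
  "Vermont,Montpelier,0.627,9616",
  "Hawaii,Honolulu,1.420,10931",
  sep = "\n"
)
states <- read.csv(text = csv_text)

# Tab-delimited file: read.delim() expects a header and tab separators.
tab_file <- tempfile(fileext = ".txt")
write.table(states, tab_file, sep = "\t", row.names = FALSE, quote = FALSE)
from_tab <- read.delim(tab_file)

# Any other delimiter: read.table() with header and sep stated explicitly.
pipe_file <- tempfile(fileext = ".txt")
write.table(states, pipe_file, sep = "|", row.names = FALSE, quote = FALSE)
from_pipe <- read.table(pipe_file, header = TRUE, sep = "|")

# Both imports should reproduce the original data frame.
print(all.equal(states, from_tab))
print(all.equal(states, from_pipe))
```

Note that read.table() assumes no header by default, which is why header = TRUE has to be stated there while read.csv() and read.delim() set it for you.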