Hands-on file import

You have already finished the first chapter of the Data Import course at DataCamp, and you might have come across several online resources about importing files. However, the best way to learn programming is to practice what you’ve just learned. Let’s import a csv file that we create in our Desktop folder.

The data we’ll be importing is the states.csv file used in the DataCamp course:

state,capital,pop_mill,area_sqm
South Dakota,Pierre,0.853,77116
New York,Albany,19.746,54555
Oregon,Salem,3.970,98381
Vermont,Montpelier,0.627,9616
Hawaii,Honolulu,1.420,10931

Please open a new text file on your Desktop (on Windows, Notepad can be used), paste the contents above, and save the file as states.csv on the Desktop.

Now, let’s switch over to RStudio and try to import the file we just generated. The code to import the file should be:

read.csv("states.csv")

Most probably you ended up with an error message after running this code. What does the error message say? cannot open file ____: No such file or directory.

Note: Unfortunately, most students simply ignore the error message and get stuck. However, the error messages printed on screen give quite helpful information and can help with troubleshooting. So, PLEASE READ THE ERROR MESSAGES CAREFULLY.

Why can’t R find the file? Because R (like most other programs) searches for files in one specific folder called the working directory. The working directory was mentioned in Lesson02. If the file is not in the working directory, R will complain that the file or directory was not found. In that case there are two ways to import the file successfully: change the working directory to the folder that contains the file, or give read.csv() the full path to the file.

The paths indicated above are for Windows, and your_username should be replaced with your actual username. Mac or Linux users can use "~/Desktop" as the path to their Desktop folder.
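The two routes (changing the working directory, or passing a full path) can be sketched as follows. This is only an illustration under assumptions: a temporary folder stands in for the Desktop so that the code runs on any machine, and the names desktop, states_wd and states_path are made up for this example. On your own computer you would point desktop at your real Desktop folder, for example "~/Desktop" on Mac or Linux.

```r
# Stand-in for the Desktop folder (an assumption, so the sketch runs anywhere).
desktop <- tempdir()

# Create states.csv there with the same contents as in the text.
writeLines(c(
  "state,capital,pop_mill,area_sqm",
  "South Dakota,Pierre,0.853,77116",
  "New York,Albany,19.746,54555",
  "Oregon,Salem,3.970,98381",
  "Vermont,Montpelier,0.627,9616",
  "Hawaii,Honolulu,1.420,10931"
), file.path(desktop, "states.csv"))

# Way 1: change the working directory, then use the bare file name.
old_wd <- setwd(desktop)             # setwd() returns the previous directory
states_wd <- read.csv("states.csv")
setwd(old_wd)                        # restore the previous working directory

# Way 2: keep the working directory and pass the full path instead.
states_path <- read.csv(file.path(desktop, "states.csv"))

# Compare the results of the two routes.
identical(states_wd, states_path)
```

The first route is convenient when you import several files from the same folder; the second leaves the working directory untouched.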

Some students might still get the same error even though they have the states.csv file on their Desktop and have changed the working directory correctly. If you select “Go to Working Directory” under the Files pane’s More menu (see figure below), you’ll be able to view the contents of the Desktop folder, and there you’ll notice something strange: the file you created was named states.csv.txt instead of states.csv. This is because Notepad appends .txt as the extension unless otherwise indicated explicitly in the save dialog. Since Windows (usually) hides extensions, you might be misled into believing that the file was created as intended. Using TAB might have prevented such a surprise during csv import.

How to go to working directory
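The same check can be done from the R console. The sketch below is an illustration under assumptions: it recreates the Notepad mishap in a temporary folder (a stand-in for the Desktop), and the names folder, wrong and right are hypothetical. list.files(), file.exists() and file.rename() are base R functions.

```r
folder <- tempdir()                           # stand-in for the Desktop
wrong  <- file.path(folder, "states.csv.txt")
right  <- file.path(folder, "states.csv")
unlink(c(wrong, right))                       # start from a clean slate

# Simulate what Notepad did: the file got an extra .txt extension.
writeLines("state,capital,pop_mill,area_sqm", wrong)

# list.files() shows the real file names, hidden extensions included.
list.files(folder, pattern = "^states")

# The intended name does not exist yet, which is why
# read.csv("states.csv") fails.
file.exists(right)

# Rename the file so that it carries the intended name.
file.rename(wrong, right)
file.exists(right)
```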


Note: Please use TAB as much as possible. RStudio makes very good use of code completion with TAB. TAB can complete library names, function names, data frame column names, and file or folder names (when within double quotes). Using TAB not only saves typing but also prevents No such file or directory type errors. If TAB does not complete a name, it means the file or folder does NOT exist at that location.

Csv files are often viewed with MS Excel, but that’s another annoying story. If you successfully evade Notepad’s intrusion in file naming and manage to give your file the .csv extension, its icon changes to look like the MS Excel icon. This means MS Excel has registered the .csv extension with the system, so a csv file should open in MS Excel when it is double-clicked. If you double-click the csv file, you may encounter a frustrating irony: MS Excel opens the file but doesn’t recognize the comma as the separator (in a comma separated values file that it told the system it would open!) and imports the file into a single column. In such a case, please go to Data --> Text to Columns, select Delimited as the file type, and then select comma as the delimiter.

If you successfully import the states.csv file, you’ll notice that four columns are recognized along with their column headers. Although everything looks fine (below), there’s more than meets the eye.

> read.csv("states.csv")
         state    capital pop_mill area_sqm
1 South Dakota     Pierre    0.853    77116
2     New York     Albany   19.746    54555
3       Oregon      Salem    3.970    98381
4      Vermont Montpelier    0.627     9616
5       Hawaii   Honolulu    1.420    10931

The Defaults

As you might remember from the DataCamp course content, the read.csv() function is used there with the stringsAsFactors = FALSE option. In R versions before 4.0.0 the default value for this option is TRUE, which causes an undesired result (since R 4.0.0 the default is FALSE).

> states <- read.csv("states.csv")
> str(states)
'data.frame':   5 obs. of  4 variables:
 $ state   : Factor w/ 5 levels "Hawaii","New York",..: 4 2 3 5 1
 $ capital : Factor w/ 5 levels "Albany","Honolulu",..: 4 1 5 3 2
 $ pop_mill: num  0.853 19.746 3.97 0.627 1.42
 $ area_sqm: int  77116 54555 98381 9616 10931

As you can see, the state and capital columns are imported as factors since they contain strings. Such columns can cause errors or unexpected results during data manipulation and visualization steps. Thus, the stringsAsFactors = FALSE option should be specified when importing csv files.

> states <- read.csv("states.csv", stringsAsFactors = FALSE)
> str(states)
'data.frame':   5 obs. of  4 variables:
 $ state   : chr  "South Dakota" "New York" "Oregon" "Vermont" ...
 $ capital : chr  "Pierre" "Albany" "Salem" "Montpelier" ...
 $ pop_mill: num  0.853 19.746 3.97 0.627 1.42
 $ area_sqm: int  77116 54555 98381 9616 10931

Now, the columns are of character type.
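To illustrate the kind of surprise a factor column can cause, here is a small sketch with a made-up vector (the value "Juneau" and the names capitals, as_factor and as_char are only for illustration). A factor accepts only values from its existing levels, whereas a character vector accepts any string.

```r
capitals <- c("Pierre", "Albany", "Salem")

as_factor <- factor(capitals)
as_char   <- capitals

# Assigning a value that is not an existing level triggers a warning
# ("invalid factor level, NA generated") and stores NA instead.
as_factor[1] <- "Juneau"
is.na(as_factor[1])

# The same assignment on a character vector simply works.
as_char[1] <- "Juneau"
as_char[1]
```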

If you type ?read.csv and go over the manual page for read.csv() (read.table() actually), you’ll notice that there are plenty of default options already defined behind the scenes. For instance:

read.csv(file, header = TRUE, sep = ",", quote = "\"",
         dec = ".", fill = TRUE, comment.char = "", ...)

Unless otherwise indicated, the read.csv() function expects the first line to be the header, the separator to be a comma, the quote character to be a double quote, and the decimal separator to be a period.
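As a sketch of how these defaults can be overridden, the example below reads a small made-up data set that uses semicolons as separators and commas as decimal marks. The inline text and the names csv_text and eu are assumptions for illustration; the text = argument is passed through to read.table(), so no file is needed.

```r
# Made-up data: semicolon-separated, with a comma as the decimal mark.
csv_text <- "state;pop_mill
Vermont;0,627
Hawaii;1,420"

# Override the default separator and decimal mark explicitly.
eu <- read.csv(text = csv_text, sep = ";", dec = ",")
str(eu)
```

Base R also provides read.csv2(), which uses exactly these two settings as its defaults.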

A Side Note: the extension of a file does not define the content of the file. A txt file can very well be a csv file as long as it is a plain text file where columns are separated with commas. Another valid csv file might have .dat as its extension, and read.csv() won’t refuse to open the file because of an unexpected extension.
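A quick way to convince yourself is the following sketch, which writes comma separated text to a temporary file with a .dat extension and reads it back. The file and the names dat_file and from_dat are hypothetical and exist only for this example.

```r
# A temporary file with a .dat extension (hypothetical, for illustration).
dat_file <- tempfile(fileext = ".dat")
writeLines(c("state,capital",
             "Oregon,Salem",
             "Vermont,Montpelier"), dat_file)

# read.csv() looks at the content, not at the extension.
from_dat <- read.csv(dat_file)
from_dat
```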

If your file is not a csv and uses a delimiter other than a comma, there are other options for importing it. If the file uses another common delimiter, the tab character, then read.delim() can be used. If the delimiter is neither a comma nor a tab, then you can use the read.table() function. In fact, read.csv() and read.delim() are wrappers around read.table() with preset defaults.
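The sketch below shows both cases on made-up inline data (the names tab_text, pipe_text, from_tab and from_pipe are assumptions for illustration). Note that read.table() does not assume a header line by default, so header = TRUE has to be stated explicitly.

```r
# Made-up data: once tab-delimited, once delimited with the pipe character.
tab_text  <- "state\tcapital\nOregon\tSalem\nHawaii\tHonolulu"
pipe_text <- "state|capital\nOregon|Salem\nHawaii|Honolulu"

# Tab-delimited data: read.delim() already uses sep = "\t" and header = TRUE.
from_tab <- read.delim(text = tab_text)

# Any other delimiter: read.table() with sep and header given explicitly.
from_pipe <- read.table(text = pipe_text, sep = "|", header = TRUE)

from_tab
from_pipe
```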

Assignment for next week