Via a link
Let’s start with a very basic case and load a data set from a csv file via a link. The csv format stands for comma-separated values: by default, columns in such a file are separated by commas. The column separator can be different, though, so we might need to add an argument specifying it; we will discuss these cases later. To load data from a file we use functions whose names start with read. and end with the file extension.
df <- read.csv("http://math-info.hse.ru/f/2018-19/pep/r/Titanic.csv") # don't forget about quotes "" or ''
While loading data via a link, make sure your Internet connection is stable. If everything is OK, you will see no error messages in the console, and the variable df with the data set will appear in the Environment tab.
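As a small sketch of the separator argument mentioned above: the snippet below reads the same toy table twice, once with commas and once with semicolons. The inline data and the names csv_comma and csv_semicolon are invented for illustration; the text argument is used only so that the example runs without a file or an Internet connection.

```r
# Toy data invented for illustration; not part of the Titanic file
csv_comma     <- "name,age\nAnna,22\nBoris,38"
csv_semicolon <- "name;age\nAnna;22\nBoris;38"

# The default separator is a comma
df_comma <- read.csv(text = csv_comma)

# A different separator is passed via the sep argument
df_semicolon <- read.csv(text = csv_semicolon, sep = ";")

# Both calls should give the same table with 2 rows and 2 columns
dim(df_comma)
dim(df_semicolon)
```

If sep = ";" is omitted in the second call, the whole line is read as a single column, which is a typical sign that the separator is wrong.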
Here we see general information about this data set, namely its dimensions: the number of observations and the number of variables. If we click on df (exactly on the name df, not on the blue circle to its left), we will see the table in a separate tab (the so-called View mode):
In this mode we can view the table in a convenient way: scroll it down or to the right, and enlarge it by clicking on the icon in the top right corner.
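The same information can also be requested from the console. A minimal sketch, using a small made-up data frame toy (an assumption for illustration) instead of df so that it runs on its own:

```r
# A tiny data frame invented for illustration
toy <- data.frame(
  id  = 1:4,
  age = c(22, 38, 26, 35)
)

dim(toy)      # number of rows, then number of columns
nrow(toy)     # number of observations
ncol(toy)     # number of variables
head(toy, 3)  # first three rows of the table

# View(toy)   # opens the table in a separate tab (interactive RStudio only)
```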
Before starting to describe a table we have to learn how to load data from a file saved in a local folder.
From a local file
First, let’s see how to find out which folder on our computer is the working one. A working directory is the folder from which RStudio launches. By default, R looks for files that are given by name only (without a path) in this folder. To get the path to the working directory, we need the function getwd():
getwd() # wd - from working directory
## [1] "/Users/allat/Desktop"
In my case RStudio launches from the folder Desktop, which is inside the folder allat. This means that I can place my csv file on the Desktop, pass its name with the extension to read.csv(), and load it without difficulties:
df <- read.csv("Titanic.csv") # file is in Desktop
If a file is not stored in the working directory, typing just its name will not work: it results in the error cannot open file: No such file or directory. To overcome this problem (if we do not want to move the file to the working directory), we can write the full path to the file:
df <- read.csv("/Users/allat/Downloads/Titanic.csv")
We can get this path by right-clicking on the file and choosing Properties (Свойства if you work in class at a computer with a Russian interface). The properties always include a line with the file location (Расположение файла in the Russian interface). We should copy the location and paste it inside the quotes in the parentheses of read.csv(). Please mind the slashes. R does not accept the single backslashes (\) commonly used on Windows, because inside a string a backslash starts an escape sequence; use forward slashes (/) or doubled backslashes (\\) instead. Change all the slashes in the path, or add them if something went wrong (on Mac, paths might be copied without slashes or with dots instead). Make sure your path ends with the file name and its correct extension. The function read.csv() opens files, not folders!
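The path clean-up described above can be sketched as follows. The Windows-style path is hypothetical (the original example uses a Mac path), and the file is not expected to exist on your computer, so the last step only prints a message.

```r
# Hypothetical Windows-style path, as it might be copied from Properties.
# Inside an R string every backslash has to be typed twice.
win_path <- "C:\\Users\\allat\\Downloads\\Titanic.csv"

# Replace every backslash with a forward slash
fixed_path <- gsub("\\", "/", win_path, fixed = TRUE)
fixed_path

# The path should end with a file name and its extension
basename(fixed_path)
tools::file_ext(fixed_path)

# Check that the file exists before trying to read it
if (file.exists(fixed_path)) {
  df <- read.csv(fixed_path)
} else {
  message("File not found: ", fixed_path)
}
```

Checking with file.exists() first gives a clearer message than the cannot open file error raised by read.csv().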
Changing a working directory
One more way to access files is to change the working directory. This approach is useful when we have a lot of files to work with and do not plan to copy or move them to the current working directory (and, of course, are too lazy to write full paths all the time). To change the working directory, we need setwd() and the path to the desired directory:
setwd("/Users/allat/Downloads")
Before setting a new path, check whether the new directory exists. If it does not, R will not create it automatically and will return an error.
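One possible way to make this check in code is sketched below. The temporary folder returned by tempdir() is only a stand-in for a real folder such as "/Users/allat/Downloads"; the names new_dir and old_dir are chosen for illustration.

```r
new_dir <- tempdir()  # stand-in for the folder you actually want to use
old_dir <- getwd()    # remember the current working directory

# Change the working directory only if the target folder exists
if (dir.exists(new_dir)) {
  setwd(new_dir)
} else {
  message("Directory does not exist: ", new_dir)
}

getwd()         # shows the working directory after the change
setwd(old_dir)  # switch back when the work is done
```

Saving the old path first makes it easy to return to the original folder, so later code that relies on it keeps working.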
Now let’s proceed to more interesting things.
We have the table df with information on the passengers of the “Titanic”: it contains data on people’s characteristics and an indicator of whether a passenger survived this notorious shipwreck.
For detailed information, see the codebook for the data set, available by this link. It is a good example of a codebook, since it gives a detailed description of all variables with clear explanations of their values.
Now let’s look at the structure of df:
str(df)
## 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : Factor w/ 148 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
## $ Embarked : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
It returns the following information: the number of observations and variables, the variable names and their types (numeric, integer, factor, character), as well as the first several values of each variable. Character variables are often read as factors because of default R settings (before R 4.0.0, read.csv() converted text columns to factors by default; newer versions keep them as character). For a factor, R stores integer codes that point to the unique text values (levels) of the column.
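To see what a factor stores, here is a minimal sketch with a made-up three-row table (the names toy_csv, toy_fct and toy_chr and the data are assumptions for illustration). The stringsAsFactors argument of read.csv() is set explicitly, so the result does not depend on the R version.

```r
# Toy data invented for illustration
toy_csv <- "Name,Sex\nAnna,female\nBoris,male\nClara,female"

# Read text columns as factors (the old default) and as plain character
toy_fct <- read.csv(text = toy_csv, stringsAsFactors = TRUE)
toy_chr <- read.csv(text = toy_csv, stringsAsFactors = FALSE)

# The $ sign picks one column of a data frame
class(toy_fct$Sex)       # "factor"
levels(toy_fct$Sex)      # "female" "male" (sorted alphabetically)
as.integer(toy_fct$Sex)  # 1 2 1: integer codes pointing to the levels

class(toy_chr$Sex)       # "character"
```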
Now we will look at more substantial things and ask R for descriptive statistics. This can be done with the summary() function:
summary(df)
## PassengerId Survived Pclass
## Min. : 1.0 Min. :0.0000 Min. :1.000
## 1st Qu.:223.5 1st Qu.:0.0000 1st Qu.:2.000
## Median :446.0 Median :0.0000 Median :3.000
## Mean :446.0 Mean :0.3838 Mean :2.309
## 3rd Qu.:668.5 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :891.0 Max. :1.0000 Max. :3.000
##
## Name Sex Age
## Abbing, Mr. Anthony : 1 female:314 Min. : 0.42
## Abbott, Mr. Rossmore Edward : 1 male :577 1st Qu.:20.12
## Abbott, Mrs. Stanton (Rosa Hunt) : 1 Median :28.00
## Abelson, Mr. Samuel : 1 Mean :29.70
## Abelson, Mrs. Samuel (Hannah Wizosky): 1 3rd Qu.:38.00
## Adahl, Mr. Mauritz Nils Martin : 1 Max. :80.00
## (Other) :885 NA's :177
## SibSp Parch Ticket Fare
## Min. :0.000 Min. :0.0000 1601 : 7 Min. : 0.00
## 1st Qu.:0.000 1st Qu.:0.0000 347082 : 7 1st Qu.: 7.91
## Median :0.000 Median :0.0000 CA. 2343: 7 Median : 14.45
## Mean :0.523 Mean :0.3816 3101295 : 6 Mean : 32.20
## 3rd Qu.:1.000 3rd Qu.:0.0000 347088 : 6 3rd Qu.: 31.00
## Max. :8.000 Max. :6.0000 CA 2144 : 6 Max. :512.33
## (Other) :852
## Cabin Embarked
## :687 : 2
## B96 B98 : 4 C:168
## C23 C25 C27: 4 Q: 77
## G6 : 4 S:644
## C22 C26 : 3
## D : 3
## (Other) :186
For numeric variables this function returns standard descriptive statistics: minimum (Min.) and maximum (Max.), lower (1st Qu.) and upper (3rd Qu.) quartiles, average (Mean) and median (Median).
For factor variables it returns counts, that is, absolute frequencies showing how many times each unique value occurs in a column. (For plain character columns summary() reports only the length and the class.)
For this table we can, for instance, conclude that the passengers of the “Titanic” were on average not very old (the mean age is 29.7), the oldest passenger was 80 years old, some people paid nothing for the ticket (the minimum fare is 0), most people embarked in Southampton (644 vs 168 and 77), and there were more males than females (577 vs 314).
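The numbers reported by summary() can also be computed one by one. A minimal sketch on made-up vectors (ages and sex are assumptions for illustration, not the Titanic data):

```r
# Toy vectors invented for illustration
ages <- c(22, 38, 26, 35, NA, 54)
sex  <- factor(c("male", "female", "female", "female", "male", "male"))

mean(ages)                  # NA, because one value is missing
mean(ages, na.rm = TRUE)    # 35: missing values are dropped first
median(ages, na.rm = TRUE)  # 35
range(ages, na.rm = TRUE)   # minimum and maximum: 22 54

# Lower and upper quartiles
quantile(ages, probs = c(0.25, 0.75), na.rm = TRUE)

# Counts (absolute frequencies) for a factor
table(sex)
```

The argument na.rm = TRUE matters for columns such as Age that contain missing values: without it, most of these functions return NA.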
To get a description of a particular variable, we can access it using the dollar sign $:
summary(df$Age) # take Age from df
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.42 20.12 28.00 29.70 38.00 80.00 177
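Note that R also counts missing values, which are coded as NA (from Not Available). R has two special values that are easy to confuse: NA and NaN (Not a Number). They are not interchangeable: NA marks a missing value (an empty cell), while NaN is the result of an undefined numeric operation such as 0/0. The function is.na() returns TRUE for both of them. Infinity (\(\infty\)) is neither: it is a separate value, Inf, which is not a finite number but is not an empty cell either.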
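How R treats these special values can be checked directly in the console. A minimal sketch with a made-up vector x and a made-up data frame toy (both are assumptions for illustration):

```r
# 0 / 0 gives NaN, 1 / 0 gives Inf
x <- c(1, NA, 0 / 0, 1 / 0)
x

is.na(x)        # FALSE  TRUE  TRUE FALSE: TRUE for both NA and NaN
is.nan(x)       # FALSE FALSE  TRUE FALSE: TRUE only for NaN
is.infinite(x)  # FALSE FALSE FALSE  TRUE: Inf is neither NA nor NaN

# Counting missing values in one column of a data frame
toy <- data.frame(Age = c(22, NA, 26, NA))
sum(is.na(toy$Age))  # 2
```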