Now we will work with the table containing data on famous Chilean plebiscite held in 1988 to decide whether to extend the rule of Augusto Pinochet or not. See details about variables here. We will load a table using a link:
df <- read.csv("http://math-info.hse.ru/f/2017-18/ps-ms/Chile.csv")
Let’s look at the data frame using the code:
View(df) # V is capital
And explore its structure:
str(df)
## 'data.frame': 2700 obs. of 9 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ region : Factor w/ 5 levels "C","M","N","S",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ population: int 175000 175000 175000 175000 175000 175000 175000 175000 175000 175000 ...
## $ sex : Factor w/ 2 levels "F","M": 2 2 1 1 1 1 2 1 1 2 ...
## $ age : int 65 29 38 49 23 28 26 24 41 41 ...
## $ education : Factor w/ 3 levels "P","PS","S": 1 2 1 1 3 1 2 3 1 1 ...
## $ income : int 35000 7500 15000 35000 35000 7500 35000 15000 15000 15000 ...
## $ statusquo : num 1.01 -1.3 1.23 -1.03 -1.1 ...
## $ vote : Factor w/ 4 levels "A","N","U","Y": 4 2 4 2 2 2 2 2 3 2 ...
We can request the number of rows or columns separately:
nrow(df)
## [1] 2700
ncol(df)
## [1] 9
We can check whether rows contain missing values:
head(complete.cases(df)) # vector of TRUE/FALSE for all rows
## [1] TRUE TRUE TRUE TRUE TRUE TRUE
And count the number of complete cases, so rows with no ‘NAs’.
sum(complete.cases(df)) # rows without NA's
## [1] 2431
Although TRUE
and FALSE
values are logical, the sum()
function works correctly since they are automatically converted into integers: 1 for TRUE
and 0 for FALSE
.
And how to count, vice versa, the number of rows with missing values? Simply using a negation sign before complete.cases()
, which is often represented by !
in programming.
sum(!complete.cases(df)) # rows with NA's
## [1] 269
If we want to look at the rows with NA’s, we can filter them from the whole data set. Filtering is usually done by writing conditions in square brackets. At the first place we should indicate conditions for rows and at the second place - for columns. Now we need to choose rows that do not belong to complete cases, and all columns.
head(df[!complete.cases(df), ])
## X region population sex age education income statusquo vote
## 13 13 N 175000 F 27 PS NA 1.43448 Y
## 15 15 N 175000 M 36 PS 35000 1.49026 <NA>
## 28 28 N 175000 F 43 P NA 0.15489 A
## 76 76 N 125000 F 32 S NA -0.85035 N
## 98 98 N 125000 F 34 P 2500 0.10807 <NA>
## 113 113 N 250000 F 46 S NA 0.15489 <NA>
The second position we left blank since if we want to see all the columns, there is no need to specify them.
So as to take a close look at these rows, we can save them into a separate data frame and then view in a more convenient mode:
with_na <- df[!complete.cases(df), ]
View(with_na)
Sometimes it is important to know the patterns of missing values so as understand whether the lack of data is systematic or random. If it seems to be random, there should not be problems, but if missing values are systematic, i.e. occur in certain columns, it might be a sign of bias. For instance, if we see that there are NA’s in answers to three questions simultaneously, in the same rows, it can serve as an evidence of poorly worded questions or questions on sensitive topics like money or politics.
To visualise the patterns we have to install two libraries: mice
and VIM
. All libraries (or packages) are installed via the function install.packages()
. It is enough to install a library once. Then a library can be launched during any RStudio session.
# make sure your Internet connection is ok
install.packages("mice")
install.packages("VIM")
Once libraries are installed, we can use its functions. However, we usually have to launch a library first by referring to it via library()
. This should be done in any new R session:
library(mice)
library(VIM)
Without these lines R will return an error Could not find function ...
while typing commands from these libraries as it will not understand where to find the commands.
Now everything is ready to plot graphs for missing values.
aggr(df)
This graph is comprised of two parts. The left graph illustrates the frequencies of NA’s in each column. Judging by this bar chart we can conclude that most missing values are concentrated in the columns vote
and income
. However, the percentage of NA’s in vote
is not very high (0.06 or 6%) if we consider the size of our data set (2700 rows). The right graph shows the patterns of missing values, the combinations of variables with NA’s. From this graph we can see that people who did not indicate their income, said nothing about their intentions to vote. And they did not provide any information on their attitude towards statusquo as well. Thus, missing values in this data set are not distributed randomly, there is a certain pattern.
matrixplot(df)
##
## Click in a column to sort by the corresponding variable.
## To regain use of the VIM GUI and the R console, click outside the plot region.
Let’s look at another graph. It also illustrates patterns of missing values. By y-axis go numbers of rows (indices of rows), by x-axis go variables. Cell colours range from white to black, where darker shades correspond to larger values. NA’s are represented by red cells. So, if we see big red rectangles, it means that there are a lot of missing values. Besides, again we can see what columns contain NA’s simultaneously.
Now for simplicity we will delete rows with missing values so as to avoid problems with further calculations and graphs:
df <- na.omit(df)
We can check:
sum(!complete.cases(df)) # 0 NA's
## [1] 0
Now let’s see the statistical summary, i.e. descriptive statistics, and proceed to visualisation of variables’ distribution.
summary(df)
## X region population sex age
## Min. : 1.0 C :548 Min. : 3750 F:1250 Min. :18.00
## 1st Qu.: 636.5 M : 75 1st Qu.: 25000 M:1181 1st Qu.:25.00
## Median :1326.0 N :305 Median :175000 Median :36.00
## Mean :1326.9 S :655 Mean :151605 Mean :38.29
## 3rd Qu.:2006.5 SA:848 3rd Qu.:250000 3rd Qu.:49.00
## Max. :2700.0 Max. :250000 Max. :70.00
## education income statusquo vote
## P :1002 Min. : 2500 Min. :-1.72594 A:177
## PS: 419 1st Qu.: 7500 1st Qu.:-1.00974 N:867
## S :1010 Median : 15000 Median :-0.08924 U:551
## Mean : 34020 Mean :-0.01127 Y:836
## 3rd Qu.: 35000 3rd Qu.: 0.96969
## Max. :200000 Max. : 1.71355
First of all, for a nominal variable we can make a frequency table. For instance, we will take the variable vote
and count the number of occurrences of each unique value of vote
. To choose a variable from df
, we need a dollar sign ($
):
table(df$vote)
##
## A N U Y
## 177 867 551 836
By default R computes absolute frequencies, but be can get relative ones by hand:
table(df$vote)/sum(table(df$vote)) # shares, fractions
##
## A N U Y
## 0.07280954 0.35664336 0.22665570 0.34389140
table(df$vote)/sum(table(df$vote)) * 100 # in percent (%)
##
## A N U Y
## 7.280954 35.664336 22.665570 34.389140
res <- table(df$vote)/sum(table(df$vote)) * 100
round_res <- round(res, 2) # round to the 2nd digit after a point
Secondly, now we can plot a basic R bar chart for this variable:
barplot(table(df$vote))
A basic plot is grey and dull. Let’s change its colour:
barplot(table(df$vote), col = "red")
R has a very rich pallete of colours, see the full catalogue here. Most colours have specific non-standard names, but by default R recognizes them all. Besides, if we want to get columns of different colours, we can pass a vector of their names as an argument col
:
barplot(table(df$vote),
col = c("red", "blue", "red", "blue"),
main = "Vote") # add a title as well
Thirdly, we can create a pie chart with frequencies in %:
pie(table(df$vote))
Not an aesthetic thing, though. But we might change default colours:
cols2 <- c("skyblue", "thistle", "hotpink", "grey100")
pie(table(df$vote), col = cols2)
For a quantitative variable we can plot a histogram. Although a histogram sometimes resembles a bar chart, there is a principal difference between them: a histogram cannot be created for nominal data since it is impossible to arrange nominal values in an ascending order.
Let’s plot a histogram for respondents’ income:
hist(df$income)
As we can see, 1) most people in our table earn less than 50000 pesos; 2) there is a small group of respondents that are much richer than others and earn more than 180000 pesos. The distribution of income is right-skewed: there is a long right tail.
We can make our graph more beautiful:
hist(df$income, col = "lightgreen",
main = "Income (in pesos)")
Another way to summarise descriptive statistics of a variable and visualise them is to create a box plot:
boxplot(df$income, col = "yellow")
From this picture it is clear why this type of graph is called a box plot (box-and-whiskers plot). There is a “box” and there are two “whiskers”, lower and upper one.
How is this graph plotted? Horizontal lines that contours a “box” are lower and upper quartiles, and a thick horizontal line inside a box is a median. A median should not be exactly in the middle of a box, it depends on data, so it can be closer either to a lower quartile or to an upper quartile.
So as to understand what whiskers stand for, let us describe how to detect non-typical values of a variable. Typical values of a variable belong to the following interval:
\[[Q_1 - 1.5 \times IQR;\mbox{ } Q_3 + 1.5 \times IQR]\]
Values that fall behind this interval are non-typical and called outliers. In a box plot outliers are marked as dots that lie outside whiskers. Endpoints of whiskers, however, are defined in different ways:
if there are no outliers on the lower side, the endpoint of a lower whisker corresponds to the minimum value;
if there are no outliers on the upper side, the endpoint of an upper whisker corresponds to the maximum value;
if there are outliers on either side, the endpoints of whiskers correspond to \(Q_1 - 1.5 \times IQR\) or \(Q_3 + 1.5 \times IQR\).
Thus, in our case we see that there two outliers that refer to people with very high income. Two outliers usually mean two objects, e.g. two people, but sometimes one outlier might correspond to several objects (values coincide and point overlap). Further we will learn how to calculate a number of outliers and how to know which rows include non-typical values.
Now let’s look how we can use box plots to compare distributions. We will plot graphs by groups based on values of vote
. In other words, we will compare the income distribution of people who voted for or against Pinochet, intended to abstain or were undecided. When we want to plot graphs by groups, we need to add a tile ~
and indicate the grouping variable.
boxplot(df$income ~ df$vote)
Interestingly, but the shape of income distribution of undecided respondents and respondents for Pinochet is the same! People who intended to vote against, in general, are less rich than those who wanted to vote for. Respondents who were undecided or for Pinochet are more diverse in their income rather than ones prone to abstain or vote against (box plots for the latter are more narrow, with no upper whiskers at all). One more remarkable thing: the presence of very rich people is not a distinctive feature of a particular group. In all groups there are chileans with high income.
At the end of this lecture we have some time to learn one more type of graphs that are called violin plots or bean plots (some people see violins, other - beans, nothing special). To create a basic violin plot, we have to install the package vioplot
.
install.packages("vioplot")
Now we can plot a violin plot for the variable income
:
library(vioplot)
vioplot(df$income)
This plot might seem to be strange if you have not seen it before, but it is still useful. It shows the shape of a variable distribution in a more proper way than a box plot, and it can be regarded as a combination of a histogram and a box plot. If we divide it into two symmetric halves, we will see that each half is just a smoothed histogram turned by 90 degrees (from the horizontal mode to a vertical one). If it is not clear, you can compare:
hist(df$income, col="magenta", prob = TRUE) # probabilities instead of frequencies
lines(density(df$income, adjust=5)) # adjust - smoothing parameter
vioplot(df$income, horizontal = TRUE) # horizontal
In the center of a “violin” there is a small box plot with a white dot corresponding to the mean value. Judging by this graph we can conclude that the distribution of income
is right-skewed (long tail on the right), and the lower values prevail. Next time we will discuss when it is more helpful to use a violin plot rather than a box plot and what are the assumptions inherited in each type of graph.