2 Preparing the data

For this tutorial we will use the training set of the Titanic dataset, which can be downloaded here. We also need to load the ggplot2 and dplyr packages:

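library(ggplot2)  # plotting
library(dplyr)    # data manipulation
# titanic is assumed to already hold the training set as a data frame
str(titanic)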

## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
##  $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : Factor w/ 147 levels "A10","A14","A16",..: NA 82 NA 56 NA NA 130 NA NA NA ...
##  $ Embarked   : Factor w/ 3 levels "C","Q","S": 3 1 3 3 3 2 3 3 3 1 ...

summary(titanic)

##   PassengerId       Survived          Pclass     
##  Min.   :  1.0   Min.   :0.0000   Min.   :1.000  
##  1st Qu.:223.5   1st Qu.:0.0000   1st Qu.:2.000  
##  Median :446.0   Median :0.0000   Median :3.000  
##  Mean   :446.0   Mean   :0.3838   Mean   :2.309  
##  3rd Qu.:668.5   3rd Qu.:1.0000   3rd Qu.:3.000  
##  Max.   :891.0   Max.   :1.0000   Max.   :3.000  
##                                                  
##                                     Name         Sex           Age       
##  Abbing, Mr. Anthony                  :  1   female:314   Min.   : 0.42  
##  Abbott, Mr. Rossmore Edward          :  1   male  :577   1st Qu.:20.12  
##  Abbott, Mrs. Stanton (Rosa Hunt)     :  1                Median :28.00  
##  Abelson, Mr. Samuel                  :  1                Mean   :29.70  
##  Abelson, Mrs. Samuel (Hannah Wizosky):  1                3rd Qu.:38.00  
##  Adahl, Mr. Mauritz Nils Martin       :  1                Max.   :80.00  
##  (Other)                              :885                NA's   :177    
##      SibSp           Parch             Ticket         Fare       
##  Min.   :0.000   Min.   :0.0000   1601    :  7   Min.   :  0.00  
##  1st Qu.:0.000   1st Qu.:0.0000   347082  :  7   1st Qu.:  7.91  
##  Median :0.000   Median :0.0000   CA. 2343:  7   Median : 14.45  
##  Mean   :0.523   Mean   :0.3816   3101295 :  6   Mean   : 32.20  
##  3rd Qu.:1.000   3rd Qu.:0.0000   347088  :  6   3rd Qu.: 31.00  
##  Max.   :8.000   Max.   :6.0000   CA 2144 :  6   Max.   :512.33  
##                                   (Other) :852                   
##          Cabin     Embarked  
##  B96 B98    :  4   C   :168  
##  C23 C25 C27:  4   Q   : 77  
##  G6         :  4   S   :644  
##  C22 C26    :  3   NA's:  2  
##  D          :  3             
##  (Other)    :186             
##  NA's       :687
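
The step that reads the file into `titanic` is not shown, so the following is only a sketch of how a data frame of this shape could be produced. It assumes that text columns were read as factors and that empty fields were treated as missing, which would match the `Factor` columns and the `NA's` counts above. The inline CSV text, the name `toy`, and the chosen arguments are illustrative assumptions, not the author's actual call.

```r
# Hypothetical miniature of the CSV; the real file has 891 rows and 12 columns.
csv_text <- "PassengerId,Survived,Sex,Embarked
1,0,male,S
2,1,female,C
3,1,female,
4,0,male,Q"

# stringsAsFactors = TRUE turns text columns into factors
# (this was the default before R 4.0.0).
# na.strings = "" makes empty fields count as NA.
toy <- read.csv(text = csv_text, stringsAsFactors = TRUE, na.strings = "")

str(toy)
summary(toy$Embarked)
```

For a file on disk, the `text =` argument would be replaced by the path to the downloaded file.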

To make the analysis easier, we will remove the rows with missing values in the Embarked variable:

# Keep only the rows where Embarked is not missing
titanic <- titanic[!is.na(titanic$Embarked), ]
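
One pitfall is worth knowing when dropping rows by index: a negative `which()` index behaves as expected only when at least one row matches, whereas a logical mask works in both cases. The sketch below illustrates this on two small data frames whose names and values are made up for the example.

```r
# Toy data frames (hypothetical values) with and without a missing port.
with_na <- data.frame(id = 1:4, port = c("S", "C", NA, "S"))
no_na   <- data.frame(id = 1:3, port = c("S", "C", "Q"))

# Logical mask: keeps the rows where port is present.
kept_mask_na    <- with_na[!is.na(with_na$port), ]
kept_mask_clean <- no_na[!is.na(no_na$port), ]

# Negative which(): fine when NAs exist, but it drops every row when there
# are none, because -integer(0) is still integer(0), which selects nothing.
kept_which_na    <- with_na[-which(is.na(with_na$port)), ]
kept_which_clean <- no_na[-which(is.na(no_na$port)), ]

nrow(kept_mask_na)      # 3
nrow(kept_mask_clean)   # 3
nrow(kept_which_na)     # 3
nrow(kept_which_clean)  # 0
```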

Now let’s look at the distribution of the port of embarkation:

emb_dist <- ggplot(titanic, aes(x = Embarked)) + geom_bar()
emb_dist

As we can see, the bars follow the order of the factor levels, which is alphabetical by default (C, Q, S). The next step is to arrange the bars by the number of passengers in each category.
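
To see why the bars come out in this order, it can help to inspect the levels of a factor directly. The following minimal sketch uses a made-up vector `port` rather than the Titanic data.

```r
# Hypothetical vector of ports: S appears most often, Q least often.
port <- factor(c("S", "C", "S", "Q", "S", "C"))

levels(port)  # levels are sorted alphabetically, regardless of the counts
table(port)   # counts are reported in level order as well
```

Because plotting follows the level order, changing the levels is what changes the order of the bars.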

3 Changing the order of the bar chart by frequency

To arrange the bars of the previous chart by frequency, we must reorder the factor levels of the Embarked variable accordingly. First, we create a table with the frequency of each port of embarkation:

freq_table <- titanic %>% group_by(Embarked) %>% summarise(freq = n())
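
As a rough illustration of what such a frequency table looks like, the sketch below runs the same pipeline on a small made-up data frame called `toy`. It also shows `count()`, a dplyr shorthand that should give the same counts and can sort them in one step. The names `toy`, `toy_freq` and `toy_sorted` are assumptions for the example.

```r
library(dplyr)

# Hypothetical data: three ports with different frequencies.
toy <- data.frame(Embarked = factor(c("S", "C", "S", "Q", "S", "C")))

# Same pattern as above: one row per level, with its number of rows.
toy_freq <- toy %>% group_by(Embarked) %>% summarise(freq = n())
toy_freq

# Shorthand: count() groups, counts and (optionally) sorts by frequency.
toy_sorted <- toy %>% count(Embarked, name = "freq", sort = TRUE)
toy_sorted
```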

Next, we reorder the levels of the Embarked variable by the frequency of each level, from most to least frequent:

titanic$Embarked <- factor(titanic$Embarked,
                           levels = freq_table$Embarked[order(freq_table$freq, decreasing = TRUE)])
str(titanic$Embarked)

##  Factor w/ 3 levels "S","C","Q": 1 2 1 1 1 3 1 1 1 2 ...
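
The same reordering can also be sketched in base R without an intermediate table, by sorting the output of `table()` and reusing its names as the new levels. This is an alternative to the approach above, shown on a made-up vector `port`. The forcats package offers `fct_infreq()` for the same purpose, which is not used here.

```r
# Hypothetical vector of ports: S appears 3 times, C twice, Q once.
port <- factor(c("S", "C", "S", "Q", "S", "C"))

# Sort the counts in decreasing order and take the level names in that order.
levels_by_freq <- names(sort(table(port), decreasing = TRUE))

# Rebuild the factor with the new level order; the values themselves are unchanged.
port_by_freq <- factor(port, levels = levels_by_freq)

levels(port_by_freq)
```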

Finally, we plot the ordered bar chart:

emb_dist_ordered <- ggplot(titanic, aes(x = Embarked)) + geom_bar()
emb_dist_ordered
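
If modifying the data frame is undesirable, a possible alternative is to reorder the factor on the fly with `reorder()` from base R. The sketch below scores each level by minus its count, so that the most frequent level comes first. It uses a made-up vector `port`, not the tutorial's data.

```r
# Hypothetical vector of ports: S appears 3 times, C twice, Q once.
port <- factor(c("S", "C", "S", "Q", "S", "C"))

# reorder() sorts the levels by a per-level score in increasing order;
# using -length(v) as the score puts the largest group first.
port_ordered <- reorder(port, port, FUN = function(v) -length(v))

levels(port_ordered)
```

In a plot, the same expression could be placed directly in the mapping, for example `aes(x = reorder(Embarked, Embarked, FUN = function(v) -length(v)))`, although the axis label would then need to be set by hand.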

How to order a factor variable by frequency

Rubén Guerrero

November 9, 2017
