

Dee Chiluiza, PhD
Northeastern University
Introduction to data analysis using R, R Studio and R Markdown
Short manual series: Bar Graphs


As always, I start my R Markdown files by listing all libraries and data sets in the first R chunk.

# Libraries used in this report

library(markdown)
library(readxl)
library(readr)
library(tidyverse)
library(dplyr)
library(DT)
library(knitr)
library(kableExtra)
library(graphics)    # For arrows
library(magrittr)

# Data sets used in this report

carsData <- read_excel("DataSets/carSales.xlsx")

Here’s a link to the data set. Save it on your computer and use the same code to import it.
https://figshare.com/s/f50ceaec7c8ebd3ed0f1

1 Basic concepts

This data set contains 11 variables and 5844 observations. The information is about car sales in different cities in South America. The first variable, “Model”, was removed (see code above) because it contains too many categories that add no important information to our practice. Below is a list of the first 10 observations in the data set.
The data is not real; it was created for practice purposes only.


Table. List of variables and first 10 observations with their corresponding values.
Location    Year  FuelType  Transmission  Owner  Efficiency  Engine Size  Engine Power  Seats  Km      Price
LaPaz       2013  Hybrid    Manual        Third  18          1500         100           5      145393  7215.0
Lima        2012  Hybrid    Automatic     Third  18          2000         200           5      179922  6465.0
Lima        2019  Hybrid    Manual        First  18          1500         100           5      13154   31302.0
Bogota      2019  Hybrid    Manual        First  18          1000         100           5      14425   47427.2
Bogota      2019  Hybrid    Manual        First  18          1000         50            8      14572   33676.8
PanamaCity  2019  Hybrid    Manual        First  18          1500         100           5      15610   40741.0
Bogota      2019  Hybrid    Manual        First  18          1000         50            5      17178   42499.2
Bogota      2019  Hybrid    Manual        First  18          1500         100           5      20294   30571.2
Lima        2019  Hybrid    Automatic     First  18          1000         100           5      20334   39717.0
Bogota      2019  Hybrid    Manual        First  18          1500         100           5      21620   46417.6


# Variables
Using the code names(), we can list all the variables in our data set. Add the codes as.data.frame() and knitr::kable() to improve visualization.

# Obtain list of variables and create a table to visualize information.
# The basic code to obtain this information is names(carsData)

names(carsData) %>%
  as.data.frame()%>%
  knitr::kable(format = "html", 
             table.attr = "style='width:30%;'",
             align = "l",
             col.names = "Variables in data set")%>%
  kable_classic_2(bootstrap_options=c("hover","bordered"),
              html_font = "Cambria",
              position = "center",
              font_size = 16)
Variables in data set
Location
Year
FuelType
Transmission
Owner
Efficiency
Engine Size
Engine Power
Seats
Km
Price



2 Using Bar Graphs to present frequencies

Let’s work with the variable Engine Size.
To call a variable from a data set, enter the data set name and then the variable of interest, separated by the dollar symbol. Because this variable name contains a space, it must be written inside quotation marks. Pay attention to the code sequences in the two R chunks below; they allow us to call the data (engine size), obtain the unique values in the variable, sort them, create a data frame, count frequencies, and present the information using a table.

First, let’s check the unique values in the variable engine size. Start by creating an object to store the variable information; the only reason to do this is to shorten the code. After that, use code unique().

# Create object to store information

engineSize = carsData$"Engine Size"

# Obtain list of unique values

engineSize  %>%
  unique() %>%
  sort()%>%
  as.data.frame() %>%
  knitr::kable(col.names = "Engine size (cc)",
               align = "c",
               format = "html",
               table.attr = "style='width:30%;'") %>%
  kable_paper()
Engine size (cc)
1000
1500
2000
2500
3000
4000
6000

Next, let’s obtain the frequency of each unique value. Start with code table().

Now it is possible to see how many groups or categories there are (unique values) and how many observations there are per value (frequencies). Review section Working with Categorical and Numerical Variables to understand why we are using engine size as a categorical variable.

engineSize  %>%
  table() %>%
  as.data.frame() %>%
  knitr::kable(col.names = c("Engine size (cc)", "Frequency"),
               align = "c",
               format = "html",
               table.attr = "style='width:30%;'") %>%
  kable_paper()
Engine size (cc) Frequency
1000 675
1500 2832
2000 1168
2500 712
3000 429
4000 16
6000 12
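
As a quick check of the sequence above, here is a minimal sketch on a small made-up vector. The names toySizes, toyUnique and toyFreq are hypothetical and are not part of carsData. It uses only base R, so it should run without loading any library.

```r
# Hypothetical toy vector standing in for carsData$"Engine Size"
toySizes <- c(1500, 1000, 1500, 2000, 1500, 1000)

# Unique values, sorted from smallest to largest
toyUnique <- sort(unique(toySizes))

# Frequency of each unique value, stored as a data frame
toyFreq <- as.data.frame(table(toySizes))
names(toyFreq) <- c("EngineSize", "Frequency")

print(toyUnique)
print(toyFreq)
```

The data frame has one row per unique value, which is the same structure knitr::kable() receives in the chunk above.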

3 Using the barplot() code

Observe the bar plot we obtain by calling the variable directly. The graph below is not the proper way to present the data: barplot() draws one bar per observation (5844 bars), not one bar per engine size.

barplot(engineSize)

Observe the bar plot we obtain by using code table().

# Create an object to store the table
engine_table = table(engineSize)

# Present bar plot using the table, compare with previous graph
barplot(engine_table)
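
The difference between the two graphs can be checked without drawing anything. The sketch below assumes a small made-up vector (toySizes is a hypothetical name) and uses the plot = FALSE argument of barplot(), which returns the bar positions only.

```r
# Hypothetical toy vector (not from carsData)
toySizes <- c(1500, 1000, 1500, 2000, 1500, 1000)

# plot = FALSE returns the bar positions without drawing the graph
barsRaw   <- barplot(toySizes, plot = FALSE)
barsTable <- barplot(table(toySizes), plot = FALSE)

length(barsRaw)    # one bar per observation
length(barsTable)  # one bar per unique value
```

With the raw vector there are as many bars as observations; with table() there is one bar per category, which is what a frequency bar plot needs.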

Observe the following codes in increasing order of complexity. Pay attention to the changes obtained by each additional code.

barplot(engine_table, 
                main="Plot 2. Add a title to the plot \n main =")

barplot(engine_table, 
                main="Plot 3. Add x-axis label \n xlab=",
                xlab="Engine size in cubic centimeters")

barplot(engine_table, 
                main="Plot 4. Add y-axis label \n ylab=",
                xlab="Engine size in cubic centimeters",
                ylab = "Car counts")

barplot(engine_table, 
                main="Plot 5. Add colors \n col=",
                xlab="Engine size in cubic centimeters",
                ylab = "Car counts",
                col = c("blue", "yellow", "pink", "green",  "azure", "red", 
                        "aquamarine3", "coral"))

barplot(engine_table, 
                main="Plot 6. Increase y-axis limits \n ylim=c(,)",
                xlab="Engine size in cubic centimeters",
                ylab = "Car counts",
                col = c("blue", "yellow", "pink", "green",  "azure", "red", 
                        "aquamarine3", "coral"),
                ylim = c(0,3500))

barplot(engine_table, 
                main="Plot 7. Add border color to bars \n border=",
                xlab="Engine size in cubic centimeters",
                ylab = "Car counts",
                col = c("blue", "yellow", "pink", "green",  "azure", "red", 
                        "aquamarine3", "coral"),
                ylim = c(0,3500),
                border = "red")

barplot(engine_table, 
                main="Plot 8. Turn y-axis labels horizontally \n las=1",
                xlab="Engine size in cubic centimeters",
                ylab = "Car counts",
                col = c("blue", "yellow", "pink", "green",  "azure", "red", 
                        "aquamarine3", "coral"),
                ylim = c(0,3500),
                border = "red",
                las = 1)

barplot(engine_table, 
                 main="Plot 9. Change font size \n cex.axis, cex.names",
                 xlab="Engine size in cubic centimeters",
                 ylab = "Car counts",
                 col = c("blue", "yellow", "pink", "green",  "azure", "red", 
                         "aquamarine3", "coral"),
                 ylim = c(0,3500),
                 border = "red",
                 las = 1,
                 cex.axis = 1.1,
                 cex.names = 1.1)

plot10 = barplot(engine_table, 
                 main="Plot 10. Add data labels \n text()",
                 xlab="Engine size in cubic centimeters",
                 ylab = "Car counts",
                 col = c("blue", "yellow", "pink", "green",  "azure", "red", 
                         "aquamarine3", "coral"),
                 ylim = c(0,3500),
                 border = "red",
                 las = 1,
                 cex.axis = 1.1,
                 cex.names = 1.1)

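text(x = plot10,             # x: bar midpoints returned by barplot()
     y = engine_table,       # y: bar heights
     labels = engine_table,  # text to display: the counts
     cex = 0.8, 
     pos = 3)                # pos = 3 places each label above its point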
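
The reason plot 10 is stored in an object is that barplot() returns the x position of the center of each bar, and text() needs those positions. A minimal sketch with made-up counts (toyCounts and mids are hypothetical names):

```r
# Hypothetical toy counts (not from carsData)
toyCounts <- c(A = 3, B = 5, C = 2)

# barplot() returns the x position of the center of each bar
mids <- barplot(toyCounts, ylim = c(0, 6), las = 1)

# Place each count just above its bar: x = bar center, y = bar height
text(x = mids, y = toyCounts, labels = toyCounts, pos = 3, cex = 0.8)

print(mids)
```

With the default bar width and spacing, the centers are not at 1, 2, 3, which is why typing the x positions by hand tends to misplace the labels.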

3.1 Orientation and margins

Change orientation of the graph and fix margins to fit labels.

Notice the use of code horiz = T.
Also, since the labels on the left side now take more space, you need to increase the margin. Use code par(mai=c(1, 1, 1, 1)). In this code, mai sets the margins in inches. The numbers follow the sequence: bottom margin, left margin, top margin, right margin. The values can be changed. Use a number different from one on the left side, for example, 1.4 to increase the margin or 0.6 to decrease it:
par(mai=c(1, 1.4, 1, 1))
par(mai=c(1, 0.6, 1, 1))

par(mai=c(1, 1.4, 1, 1))

plot11 = barplot(engine_table, 
                 main="Plot 11. Change orientation of graph \n horiz=T",
                 ylab="Engine size in cubic centimeters",
                 xlab = "Car counts",
                 col = c("blue", "yellow", "pink", "green",  "azure", "red", 
                         "aquamarine3", "coral"),
                 xlim = c(0,3500),
                 border = "red",
                 las = 1,
                 cex.axis = 1.1,
                 cex.names = 1.1,
                 horiz = T)

In plot 11, ylim was changed to xlim since orientation changed. Also, xlab and ylab were interchanged to add labels in the proper axes.
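
One optional habit when changing margins is to save the previous settings and restore them afterwards. The sketch below is only an illustration with made-up counts; oldPar, currentMai and toyCounts are hypothetical names.

```r
# par() returns the previous values of the settings it changes.
# mai values are in inches: bottom, left, top, right
oldPar <- par(mai = c(1, 1.4, 1, 1))

# Check the margins now in use
currentMai <- par("mai")

# Hypothetical toy counts with long labels (not from carsData)
toyCounts <- c("Long label A" = 3, "Long label B" = 5)
barplot(toyCounts, horiz = TRUE, las = 1, xlim = c(0, 6))

# Restore the previous margins so later plots are not affected
par(oldPar)
```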

Play with the values in code mai=c(), for example: 0.2, 0.6, 0.8, 1.2, etc., and observe the differences.
In the plot below, the size of the x- and y-axis fonts is reduced so the labels fit on the left side.

plot12 = barplot(engine_table, 
                 main="Plot 12. Change space between bars \n space=",
                 xlab="Car counts",
                 ylab = "Engine size in cubic centimeters",
                 col = c("blue", "yellow", "pink", "green",  "azure", "red", 
                         "aquamarine3", "coral"),
                 xlim = c(0,3500),
                 border = "red",
                 las = 1,
                 cex.axis = 0.8,
                 cex.names = 0.8,
                 horiz = T,
                 space = 0.8)  

text(engine_table, 
     plot12, 
     engine_table, 
     cex=0.8, 
     pos=4)

4 Basic concepts 2

Order bars by decreasing value.
Let us go back to the last plot we created.
Use code [order(data, decreasing = TRUE)] to display the bars in order of counts.
The word “data” inside the code must be replaced with the name of the data used, in this case engine_table.
Compare plot 13 to plot 12.

plot13 = barplot(engine_table[order(engine_table,decreasing = TRUE)], 
                 main="Plot 13. Data plotted in order",
                 xlab="Car counts",
                 ylab = "Engine size in cubic centimeters",
                 col = c("blue", "yellow", "pink", "green",  "azure", "red", 
                         "aquamarine3", "coral"),
                 border = "red",
                 las = 1,
                 cex.axis = 0.8,
                 cex.names = 0.8,
                 horiz = T,
                 space = 0.4)

text(engine_table[order(engine_table,decreasing = TRUE)], 
     plot13, 
     engine_table[order(engine_table,decreasing = TRUE)], 
     cex=0.8, 
     pos=4)
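
To see what order() does before it is placed inside barplot(), here is a minimal sketch on a made-up frequency table (toyTable, idx and toySorted are hypothetical names):

```r
# Hypothetical toy frequency table (not from carsData): a = 2, b = 3, c = 1
toyTable <- table(c("b", "a", "b", "c", "b", "a"))

# order() returns the positions of the values from largest to smallest
idx <- order(toyTable, decreasing = TRUE)

# Indexing the table with those positions reorders the categories
toySorted <- toyTable[idx]

print(idx)        # 2 1 3
print(toySorted)  # b, then a, then c
```

The code sort(toyTable, decreasing = TRUE) should give the same reordered table in one step; the order() version is shown because it is the one used in this manual.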

Vertical bar plot, sorted by frequency.

plot14 = barplot(engine_table[order(engine_table,decreasing = TRUE)], 
                 main="Plot 14. Data plotted in order",
                 xlab="Engine size in cubic centimeters",
                 ylab = "Frequency",
                 col = c("blue", "yellow", "pink", "green",  "azure", "red", 
                         "aquamarine3", "coral"),
                 border = "red",
                 las = 1,
                 cex.axis = 0.8,
                 cex.names = 0.8,
                 space = 0.4,
                 ylim = c(0,3500))

text(y=engine_table[order(engine_table,decreasing = TRUE)], 
     plot14, 
     engine_table[order(engine_table,decreasing = TRUE)], 
     cex=0.8, 
     pos=3)

5 Using color palettes

Colors can also be assigned using hexadecimal color codes stored in your own vectors.

Visit the following website: https://www.rapidtables.com/web/color/RGB_Color.html. Once on the website, click on any color inside the RGB color picker or the RGB color codes chart and observe the changes inside the box with the hashtag #.
Select any colors of your preference and let’s create two color vectors containing 7 color codes each. Always use quotation marks and start each code with a hashtag:

color1 = c("#872828", "#E5FFCC", "#F272E1", "#8369CB", "#29B7B3", "#B2EC9D", "#E8DEA1")

color2 = c("#BFF6FA", "#77DCE4", "#3AAFB7", "#197E86", "#36A12E", "#50DF45", "#99F692")

Observe the R chunk below; we will reuse plot 5 from above.

# Create your custom color vectors
color1 = c("#872828", "#E5FFCC", "#F272E1", "#8369CB", "#29B7B3", "#B2EC9D", "#E8DEA1")

color2 = c("#BFF6FA", "#77DCE4", "#3AAFB7", "#197E86", "#36A12E", "#50DF45", "#99F692")

# Apply color vectors to the plots
par(mfrow=c(1,2))

barplot(engine_table, 
                main="Plot 15A. Add colors \n Color vector 1",
                xlab="Engine size in cubic centimeters",
                ylab = "Frequency (Car counts)",
                ylim=c(0, 3000),
                las=1,
                col = color1)

barplot(engine_table, 
                main="Plot 15B. Add colors \n Color vector 2",
                xlab="Engine size in cubic centimeters",
                ylab = "Frequency (Car counts)",
                ylim=c(0, 3000),
                las=1,
                col = color2)

Colors can also be assigned using terrain.colors(), rainbow(), or several other codes. Search for more information in the console by using the ? character, for example ?rainbow, or by searching on the Internet.
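
A minimal sketch of these palette codes follows. The object names (nBars, colTerrain, colRainbow, myRamp, colRamp) are hypothetical, and colorRampPalette() is an additional base R function not used elsewhere in this manual.

```r
# Number of bars to color (7 unique engine sizes in this manual)
nBars <- 7

# Built-in palette functions return a vector of n color codes
colTerrain <- terrain.colors(nBars)
colRainbow <- rainbow(nBars)

# colorRampPalette() builds a palette function from colors you choose
myRamp  <- colorRampPalette(c("#BFF6FA", "#197E86"))
colRamp <- myRamp(nBars)

# Hypothetical toy heights, used only to display the colors
barplot(rep(1, nBars), col = colTerrain, main = "terrain.colors(7)")
```

Any of the three vectors can be passed to the col argument of barplot() in place of color1 or color2.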

6 Clustered bar chart

What happens when we want to plot two categorical variables in one bar plot?
Observe the following bar plot of locations and their corresponding frequencies.

par(mai=c(1.4,1,0.4,0.4))

location     = carsData$Location

barplot(table(location),
        las=2,
        ylim=c(0,1000),
        col="#DF8E7A")

Now imagine we need to know how many cars, in each location, are automatic and how many are manual. Observe how the two variables are incorporated inside the code table(). Practice changing the order: table(Location, Transmission) versus table(Transmission, Location).

Location     = carsData$Location

Transmission = carsData$Transmission

tableLT = table(Location, Transmission)

tableLT %>%
    knitr::kable(format = "html", 
             table.attr = "style='width:30%;'",
             align = "c")%>%
  kable_classic_2(bootstrap_options=c("hover","bordered"),
              html_font = "Cambria",
              position = "center",
              font_size = 16)
Location     Automatic  Manual
Asuncion     78         441
Bogota       203        440
BuenosAires  51         350
Caracas      141        202
LaPaz        118        357
Lima         244        382
Montevideo   139        451
PanamaCity   295        478
Quito        188        525
SanJose      168        375
Santiago     56         162
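
The effect of changing the order of the variables inside table() can be sketched with two made-up vectors (toyCity and toyGear are hypothetical names, not columns of carsData):

```r
# Hypothetical toy vectors (not from carsData)
toyCity <- c("Lima", "Lima", "Quito", "Quito", "Quito")
toyGear <- c("Manual", "Automatic", "Manual", "Manual", "Automatic")

# The first variable goes to the rows, the second to the columns
tabCG <- table(toyCity, toyGear)
tabGC <- table(toyGear, toyCity)

print(tabCG)
print(tabGC)

# Same counts, transposed layout
sameCounts <- all(t(tabCG) == tabGC)
print(sameCounts)
```

This matters for barplot() because it draws one bar (or one cluster of bars) per column of the table, so the variable you want on the x-axis should be the second one.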

Now we can create a bar plot using the two-variable table. Notice the use of two colors to differentiate between the two categories in the variable transmission.
The two graphs are placed next to each other for proper comparison.

  • Notice that the table used for this graph has the sequence table(Transmission, Location). Practice creating a bar plot with the other sequence explained above; compare results.
  • Notice the use of legend.text = rownames(tableTL) to add a legend.
  • The code args.legend = list(x="topleft") is used to indicate the location of the legend.

tableTL = table(Transmission, Location)

par(mfrow=c(1,2), mai=c(1.4,1,0.5,0.5))

barplot(table(Location),
        las=2,
        col="#DF8E7A",
        ylim=c(0,1000),
        cex.names = 0.8, cex.axis = 0.8,
        ylab = "Frequency")

plot16=barplot(tableTL,
        las=2,
        col=c("#DF8E7A","#C6340F"),
        ylim=c(0,1000),
        cex.names = 0.8, cex.axis = 0.8,
        ylab = "Frequency",
        legend.text = rownames(tableTL),
        args.legend = list(x="topleft"))

Observe in the bar plot below the use of code beside = TRUE to place the data for each transmission type in separate, side-by-side bars. Also, compare the data display using plot(), which draws a mosaic plot when given a table.

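# mai (inches) and mar (lines of text) set the same margins; when both are
# given, the later one (mar) takes effect, so only mar is kept here
par(mfrow=c(1,2), 
    mar=c(5,4,2,1))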

barplot(tableTL,
        las=2,
        col=c("#DF8E7A","#C6340F"),
        ylim=c(0,600),
        cex.names = 0.8, cex.axis = 0.8,
        ylab = "Frequency",
        legend.text = rownames(tableTL),
        args.legend = list(x="topleft"),
        beside = TRUE)

plot(tableLT,
     las=2,
     col=c("blue","yellow"),
     main="")
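
The difference between stacked and side-by-side bars can also be checked with a small made-up matrix of counts. In this sketch toyCounts, stackedPos and besidePos are hypothetical names, and plot = FALSE is used so only the bar positions are returned.

```r
# Hypothetical 2 x 3 matrix of counts: rows are groups, columns are categories
toyCounts <- matrix(c(1, 4, 2, 5, 3, 6), nrow = 2,
                    dimnames = list(c("Automatic", "Manual"),
                                    c("CityA", "CityB", "CityC")))

# Stacked (default): one bar per column, so one position per column
stackedPos <- barplot(toyCounts, plot = FALSE)

# beside = TRUE: one bar per cell, so positions come back as a 2 x 3 matrix
besidePos <- barplot(toyCounts, beside = TRUE, plot = FALSE)

length(stackedPos)  # 3
dim(besidePos)      # 2 3

# Stacked bars reach the column totals; side-by-side bars reach the largest cell
max(colSums(toyCounts))  # 9
max(toyCounts)           # 6
```

This is the reason the y-axis limit can be lowered (from 1000 to 600 in the plots above) when beside = TRUE is used.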

6.1 Cluster example 2

Another example of a clustered bar plot, compared with a density line graph.
What question did we ask of the data that required the following graphs?
In other words, what question is best answered with the two graphs below?
Can you understand and replicate the codes?

par(mfcol=c(2,1),
    mai=c(0.5,0.8,0.5,0.5))

 # PRESENT clustered bar plot with beside=T

barplot(table(carsData$FuelType, carsData$Efficiency),
        beside = T, las=1,
        col=c("red","blue","green"))

# CREATE objects to filter fuel type and to store their densities based on efficiency

diesel = carsData %>%
  filter(FuelType=="Diesel")
dieselDensity = density(diesel$Efficiency, adjust=3)

gasoline = carsData %>%
  filter(FuelType=="Gasoline")
gasolineDensity = density(gasoline$Efficiency, adjust=3)

hybrid = carsData %>%
  filter(FuelType=="Hybrid")
hybridDensity = density(hybrid$Efficiency, adjust=3)

# PRESENT density lines 
# PLOT a first line and add the remaining using lines() 

plot(dieselDensity, las=1,
     xlim=c(0,40), col="red",
     main="", ylab="",
     xaxp=c(0,40,8),
     ylim=c(0,0.2))

lines(gasolineDensity, col="blue")
lines(hybridDensity, col="green")
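
The three filter-and-density steps above follow the same pattern, so they could also be written with split() and lapply(). The following is only a sketch on simulated data; toyCars, effByFuel, densities and lineCols are hypothetical names, and the numbers are random values, not the car sales data.

```r
# Hypothetical simulated data standing in for carsData (FuelType, Efficiency)
set.seed(1)
toyCars <- data.frame(
  FuelType   = rep(c("Diesel", "Gasoline", "Hybrid"), each = 30),
  Efficiency = c(rnorm(30, 15, 2), rnorm(30, 20, 3), rnorm(30, 25, 2))
)

# Split efficiency values by fuel type and compute one density per group
effByFuel <- split(toyCars$Efficiency, toyCars$FuelType)
densities <- lapply(effByFuel, density, adjust = 3)

# Plot the first density, then add the others with lines()
lineCols <- c("red", "blue", "green")
plot(densities[[1]], col = lineCols[1], las = 1, main = "",
     xlim = c(0, 40), ylim = c(0, 0.2))
for (i in 2:length(densities)) {
  lines(densities[[i]], col = lineCols[i])
}
legend("topright", legend = names(densities), col = lineCols, lty = 1)
```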



7 Using bar graphs to plot mean and sd

7.1 Error bars with code segments()

# CREATE objects to store data Km under analysis per location
# Location (capital L) is the column name inside carsData

asuncionKm = carsData %>%
  filter(Location=="Asuncion")%>%
  select(Km)

bogotaKm = carsData %>%
  filter(Location=="Bogota")%>%
  select(Km)

caracasKm = carsData %>%
  filter(Location=="Caracas")%>%
  select(Km)

quitoKm = carsData %>%
  filter(Location=="Quito")%>%
  select(Km)

# CREATE objects to store mean and standard deviation values

asuncionKm_mean = mean(asuncionKm$Km)/1000
bogotaKm_mean   = mean(bogotaKm$Km)/1000
caracasKm_mean  = mean(caracasKm$Km)/1000
quitoKm_mean    = mean(quitoKm$Km)/1000

asuncionKm_sd   = sd(asuncionKm$Km)/1000
bogotaKm_sd     = sd(bogotaKm$Km)/1000
caracasKm_sd    = sd(caracasKm$Km)/1000
quitoKm_sd      = sd(quitoKm$Km)/1000

# CREATE a vector to store mean values

meansCities = c(asuncionKm_mean, bogotaKm_mean, caracasKm_mean, quitoKm_mean)

# CREATE a vector to store standard deviation values

sdCities = c(asuncionKm_sd, bogotaKm_sd, caracasKm_sd, quitoKm_sd)

# CREATE vector to store city names, for x-axis values.

nameCities = c("Asuncion", "Bogota", "Caracas", "Quito")

# CREATE color vector 

color3 = c("#BFF6FA", "#77DCE4", "#3AAFB7", "#197E86")
           
# PRESENT plot

par(mai=c(1,1.5,1,1))            # ADJUST margins

plot18 = barplot(meansCities, 
        las=1,
        ylim = c(0,160),
        names.arg = nameCities,
        col=color3,
        ylab="Mean Km (x1,000)")

segments(plot18,                  # PLOT error bars
         meansCities-sdCities,
         plot18,
         meansCities+sdCities)

7.2 Error bars with code arrows()

par(mai=c(0.6,1,0.5,0.4))         # ADJUST margins

# PRESENT a bar plot of meansCities as we did in the previous R Chunk 

plot19 = barplot(meansCities, 
        las=1,
        ylim = c(0,160),
        names.arg = nameCities,
        col=color3,
        ylab="Mean Km (x1,000)")

arrows(x0=plot19,                  # PLOT error bars
       y0 = meansCities-sdCities, 
       x1 = plot19, 
       y1 = meansCities+sdCities, 
       code = 3, 
       angle = 90, 
       length = 0.1)
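
As a compact alternative to creating one object per city, the means and standard deviations can be computed per group with tapply(). This is a minimal sketch on made-up values; toyCars, meansKm, sdKm and bars are hypothetical names, and the numbers are not from the car sales data.

```r
# Hypothetical toy data standing in for carsData (Location, Km)
toyCars <- data.frame(
  Location = c("Asuncion", "Asuncion", "Bogota", "Bogota", "Quito", "Quito"),
  Km       = c(100000, 140000, 60000, 80000, 90000, 150000)
)

# Mean and standard deviation of Km per location, in thousands
meansKm <- tapply(toyCars$Km, toyCars$Location, mean) / 1000
sdKm    <- tapply(toyCars$Km, toyCars$Location, sd) / 1000

# Bar plot of the means; ylim leaves room for the upper error bars
bars <- barplot(meansKm, las = 1,
                ylim = c(0, max(meansKm + sdKm) * 1.1),
                ylab = "Mean Km (x1,000)")

# Error bars: mean minus and plus one standard deviation
arrows(x0 = bars, y0 = meansKm - sdKm,
       x1 = bars, y1 = meansKm + sdKm,
       code = 3, angle = 90, length = 0.1)
```

Replacing arrows() with segments(bars, meansKm - sdKm, bars, meansKm + sdKm) would draw the same error bars without the horizontal caps, as in section 7.1.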



Disclaimer: This short manual series project is a work in progress. Unless otherwise clearly stated, this material is considered to be a draft version.



Dee Chiluiza, PhD
20 August 2021
Last update: 05 June, 2022
Boston, Massachusetts, USA
