

Dee Chiluiza, PhD
Northeastern University
Introduction to data analysis using R, R Studio and R Markdown
Short manual series: Bar Graphs


As always, I start my R Markdown files by listing all libraries and data sets in the first R chunk.

#{r libraries data, message=FALSE, warning=FALSE}
# Libraries
library(readxl)
library(readr)
library(tidyverse)
library(dplyr)
library(DT)
library(knitr)

# Data
carsIndia <- read_excel("DataSets/carSalesDCVersion.xlsx")
carsIndia =  carsIndia[-1]

# Here's a link to the data set. Save it on your computer and use the same code to import it.
# https://figshare.com/s/f50ceaec7c8ebd3ed0f1
Basic concepts

This data set contains 11 variables and 5945 observations. The information is about car sales in different cities in India. The first variable, “Model”, was removed (see code above) because its entries are very long and add no important information to our practice. The first 10 observations in the data set are listed below.

Table. List of variables and first 10 observations of data set carsIndia.


Location     Year  FuelType  Transmission  Owner   Efficiency  Engine_cc  Power_bhp  Seats      Km   Price
BuenosAires  2001  Gasoline  Manual        Third           20       1000        100      5  231735  1640.0
BuenosAires  2001  Gasoline  Manual        First           20       1000         50      4  184262  4718.4
LaPaz        2001  Gasoline  Manual        Third           20       1000        100      5  167861  1056.0
Montevideo   2001  Gasoline  Manual        Second          16       1500        100      5  225012   627.0
LaPaz        2001  Gasoline  Manual        Third           16       1500        100      5  452805  1031.4
Montevideo   2001  Gasoline  Manual        Second          16       1500        150      5   96635  2073.0
BuenosAires  2002  Gasoline  Manual        Third           20       1000        100      5  344020  8315.2
Montevideo   2002  Gasoline  Manual        First           20       1000         50      4  114576  1516.0
PanamaCity   2002  Gasoline  Manual        Third           20       1000        100      5  174924  2792.0
BuenosAires  2002  Gasoline  Manual        Second          20       1000         50      4  225324  6256.0

The following code lists all the variables in your data set. I added the code knitr::kable(as.data.frame()) to improve visualization.

# Check the list you obtain
knitr::kable(as.data.frame(names(carsIndia)),
             format = "html", 
             table.attr = "style='width:60%;'")
names(carsIndia)
Location
Year
FuelType
Transmission
Owner
Efficiency
Engine_cc
Power_bhp
Seats
Km
Price



Work with variable Engine_cc.

To call a variable from a data set, write the name of the data set and the name of the variable separated by the dollar symbol ($). Pay attention to the sequence of codes.

Observe how the variable is presented as a table. Now it is possible to see how many groups (unique values) there are and how many observations there are in each group.

table(carsIndia$Engine_cc)   
## 
## 1000 1500 2000 2500 3000 4000 6000 
##  679 2852 1191  742  453   16   12
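
To try the same steps on data that ships with R, here is a minimal sketch that assumes the built-in mtcars data set (used only so the example runs without the Excel file). The variable cyl plays the role that Engine_cc plays above.

```r
# mtcars is a built-in data set; it stands in for carsIndia here
cyl_values <- mtcars$cyl          # call a variable with data$variable

length(cyl_values)                # number of observations
unique(cyl_values)                # the distinct groups
cyl_table <- table(cyl_values)    # counts per group
cyl_table
```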


BAR PLOTS

Observe the bar plot obtained by calling the variable directly. This is not the proper way to present the data, because R draws one bar per observation instead of one bar per group.

barplot(carsIndia$Engine_cc)

Observe bar plot by calling the table

barplot(table(carsIndia$Engine_cc))
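
One way to see why the first plot is not appropriate is to compare how many bars each call draws. The sketch below assumes the built-in mtcars data set in place of carsIndia. Since barplot() returns the position of every bar it draws, the length of that result is the number of bars.

```r
# Bars drawn from the raw variable: one per observation
raw_bars <- barplot(mtcars$cyl, main = "Raw variable")
length(raw_bars)

# Bars drawn from the table: one per group
table_bars <- barplot(table(mtcars$cyl), main = "Table of counts")
length(table_bars)
```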

To reduce the size of the code, give the table a name.

engine_table = table(carsIndia$Engine_cc)
engine_table
## 
## 1000 1500 2000 2500 3000 4000 6000 
##  679 2852 1191  742  453   16   12
barplot(engine_table)

Add labels, colors, etc. to your bar graph. Additional codes are added inside the code barplot(), all separated by commas.
Use code: main = "Text" to add a title to the figure.
Use code: xlab = "Text" to add a title to the x-axis.
Use code: ylab = "Text" to add a title to the y-axis.
Use code: col = c("", "", "") to add colors; use quotations for each color.
Use code: ylim = c(minimum, maximum) to set the limits of the y-axis.
Use code: las = 1 to change the orientation of the y-axis labels.
Use code: cex.axis = to change the font size on the y-axis.
Use code: cex.names = to change the font size on the x-axis.

Observe the following codes in sequence of complexity. Pay attention to the changes obtained by each additional code.

plot1 = barplot(engine_table)

plot2 = barplot(engine_table, 
                main="Plot 2. Bar plot of numbers of cars per cylinder content"
                )

plot3 = barplot(engine_table, 
                main="Plot 3. Add x-axis label",
                xlab="Engine size in cubic centimeters"
                )

plot4 = barplot(engine_table, 
                main="Plot 4. Add y-axis label",
                xlab="Engine size in cubic centimeters",
                ylab = "Car counts"
                )

plot5 = barplot(engine_table, 
                main="Plot 5. Add colors",
                xlab="Engine size in cubic centimeters",
                ylab = "Car counts",
                col = c("blue", "yellow", "pink", "green",  "azure", "red", 
                        "aquamarine3", "coral")
                )

plot6 = barplot(engine_table, 
                main="Plot 6. Add y-axis limits",
                xlab="Engine size in cubic centimeters",
                ylab = "Car counts",
                col = c("blue", "yellow", "pink", "green",  "azure", "red", 
                        "aquamarine3", "coral"),
                ylim = c(0, 3000)   # limits chosen to cover the tallest bar (2852)
                )

plot7 = barplot(engine_table, 
                main="Plot 7. Add border color to bars",
                xlab="Engine size in cubic centimeters",
                ylab = "Car counts",
                col = c("blue", "yellow", "pink", "green",  "azure", "red", 
                        "aquamarine3", "coral"),
                border = "red"
                )

plot8 = barplot(engine_table, 
                main="Plot 8. Turn y-axis labels horizontally",
                xlab="Engine size in cubic centimeters",
                ylab = "Car counts",
                col = c("blue", "yellow", "pink", "green",  "azure", "red", 
                        "aquamarine3", "coral"),
                border = "red",
                las = 1
                )

plot9 = barplot(engine_table, 
                 main="Plot 9. Change font size, add data labels",
                 xlab="Engine size in cubic centimeters",
                 ylab = "Car counts",
                 col = c("blue", "yellow", "pink", "green",  "azure", "red", 
                         "aquamarine3", "coral"),
                 border = "red",
                 las = 1,
                 cex.axis = 1.1,
                 cex.names = 1.1
                 )
text(y=engine_table, 
     plot9, 
     engine_table, 
     cex=0.8, 
     pos = 3)
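
The data labels work because barplot() returns the position of each bar along the category axis, and text() uses those positions as x coordinates. Here is a minimal sketch with hypothetical counts (the three values are made up for illustration):

```r
counts <- c(small = 3, medium = 5, large = 2)   # hypothetical counts

# Storing the plot keeps the bar midpoints
mids <- barplot(counts, ylim = c(0, 6), las = 1)
mids                                            # one x position per bar

# Place each count just above its bar (pos = 3 means "above")
text(x = mids, y = counts, labels = counts, pos = 3, cex = 0.8)
```

The extra room in ylim keeps the label of the tallest bar inside the plot area.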

Orientation and margins

Change orientation of the graph and margins to fit labels.

Notice the use of code horiz = T.
Also, since the labels on the left side now take more space, you need to increase the margin. Use code par(mai=c(1, 1, 1, 1)). In this code, mai sets the margins, measured in inches. The numbers follow the sequence: bottom margin, left margin, top margin, right margin. The values can be changed. Use a number different from one, for example, 1.4 to increase or 0.6 to decrease the margin on the left side:
par(mai=c(1, 1.4, 1, 1))
par(mai=c(1, 0.6, 1, 1))

par(mai=c(1, 1.4, 1, 1))
plot10 = barplot(engine_table, 
                 main="Plot 10. Change orientation of graph",
                 ylab="Engine size in cubic centimeters",
                 xlab = "Car counts",
                 col = c("blue", "yellow", "pink", "green",  "azure", "red", 
                         "aquamarine3", "coral"),
                 border = "red",
                 las = 1,
                 cex.axis = 1.1,
                 cex.names = 1.1,
                 horiz = T
                 )

When the orientation changes, the counts move to the x-axis, so any limits set with ylim must be set with xlim instead. Also, in plot 10, xlab and ylab were interchanged to add the labels on the proper axes.
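
Because changes made with par() stay in effect for later plots on the same graphics device, it can help to save the previous settings and restore them afterwards. The following is only a sketch of that habit, using hypothetical counts and the hypothetical object name old_par:

```r
counts <- c(first = 12, second = 7, third = 3)  # hypothetical counts

# par() returns the old values of the settings it changes
old_par <- par(mai = c(1, 1.4, 1, 1))           # bottom, left, top, right (inches)

barplot(counts, horiz = TRUE, las = 1,
        xlab = "Car counts", ylab = "Owner")

par(old_par)                                    # restore the previous margins
```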

Can you tell which code differs between plot 10 and plot 11? Clue: observe the changes in the cex.axis and cex.names codes.

Change the value for code space; play with, for example, 0.2, 0.6, 0.8, 1.2, etc., and see the differences.

plot11 = barplot(engine_table, 
                 main="Plot 11. Change space between bars. Add data labels",
                 xlab = "Car counts",
                 ylab="Engine size in cubic centimeters",
                 col = c("blue", "yellow", "pink", "green",  "azure", "red", 
                         "aquamarine3", "coral"),
                 border = "red",
                 las = 1,
                 cex.axis = 0.8,
                 cex.names = 0.8,
                 horiz = T,
                 space = 0.4
                 )   
text(engine_table, 
     plot11, 
     engine_table, 
     cex=0.8, 
     pos=4)

EXERCISE
Using the same strategy explained above, prepare a bar plot to show the count of cars based on Owner.
Add labels, colors, etc., as shown above.
END OF EXERCISE

Basic concepts 2

Order bars by decreasing value.
Let us go back to the last plot we created.
Use code: [order(data, decreasing = TRUE)] to display the bars in order of counts.
The word “data” inside the code must be replaced with the name of the data used, in this case engine_table.
Compare plot 12 to plot 11.

plot12 = barplot(engine_table[order(engine_table,decreasing = TRUE)], 
                 main="Plot 12. Data plotted in order. Add data labels",
                 xlab = "Car counts",
                 ylab="Engine size in cubic centimeters",
                 col = c("blue", "yellow", "pink", "green",  "azure", "red", 
                         "aquamarine3", "coral"),
                 border = "red",
                 las = 1,
                 cex.axis = 0.8,
                 cex.names = 0.8,
                 horiz = T,
                 space = 0.4
                 )
text(engine_table[order(engine_table,decreasing = TRUE)], 
     plot12, 
     engine_table[order(engine_table,decreasing = TRUE)], 
     cex=0.8, 
     pos=4)
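
To see what order() does on its own, the sketch below uses a small named vector with made-up counts. order() returns the positions of the values from largest to smallest, and indexing with those positions rearranges the data. sort() reaches the same arrangement in one step.

```r
counts <- c(a = 3, b = 7, c = 5)                # hypothetical counts

positions <- order(counts, decreasing = TRUE)
positions                                       # 2 3 1

by_order <- counts[positions]                   # b = 7, c = 5, a = 3
by_sort  <- sort(counts, decreasing = TRUE)

all.equal(by_order, by_sort)                    # compare the two results
```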

COLOR PALETTE

Color can also be given using color palette codes.

Visit website: https://www.rapidtables.com/web/color/RGB_Color.html. On that website, click on any color and observe the changes inside the # box. That box contains a 6-character code made of numbers and/or letters (a hexadecimal color code). See the example below.

plot13 = barplot(engine_table[order(engine_table,decreasing = TRUE)], 
                 main="Plot 13. Color palette",
                 xlab = "Car counts",
                 ylab="Engine size in cubic centimeters",
                 border = "red",
                 las = 1,
                 cex.axis = 0.8,
                 cex.names = 0.8,
                 horiz = T,
                 space = 0.4,
                 col = c("#872828", "#E5FFCC", "#F272E1", "#8369CB",
                         "#29B7B3", "#B2EC9D", "#E8DEA1", "#F6865D")
                 )
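
Each six-character code is made of three two-digit hexadecimal numbers giving the amounts of red, green and blue (0 to 255). As a small illustration, base R can convert in both directions. The color used here is the first one from plot 13, and the object names are chosen only for this sketch.

```r
# Decode a hex color into its red, green and blue components
components <- col2rgb("#872828")
components                                      # red 135, green 40, blue 40

# Build the same hex code from the three components
hex_code <- rgb(135, 40, 40, maxColorValue = 255)
hex_code                                        # "#872828"
```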

Color can also be given using terrain.colors(), rainbow(), or several other codes. Search for more information in the console by using the ? character (for example, ?rainbow), or by searching on the Internet.
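
These palette functions take the number of colors wanted, so passing the number of bars avoids unused or recycled colors. Here is a minimal sketch with hypothetical counts:

```r
counts <- c(a = 4, b = 9, c = 6, d = 2)         # hypothetical counts
n_bars <- length(counts)

rainbow_cols <- rainbow(n_bars)                 # one color per bar
terrain_cols <- terrain.colors(n_bars)

barplot(counts, col = rainbow_cols, main = "rainbow()")
barplot(counts, col = terrain_cols, main = "terrain.colors()")
```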

CLUSTERED BAR CHART

What happens when we want to plot two categorical variables in one bar plot?

For this exercise we will use the public data set mpg (included in the ggplot2 package, which is loaded with tidyverse) and the variables drv (drive train) and class (type of car).

Drive train categories:
Four-wheel drive (4), front-wheel drive (f), rear-wheel drive (r).

Class categories:
2seater, compact, midsize, minivan, pickup, subcompact, suv

par(mai=c(1,1.4,1,1)) Don’t worry about learning this code yet. Just run it.

names(mpg)
##  [1] "manufacturer" "model"        "displ"        "year"         "cyl"         
##  [6] "trans"        "drv"          "cty"          "hwy"          "fl"          
## [11] "class"
table(mpg$drv)
## 
##   4   f   r 
## 103 106  25
table(mpg$class)
## 
##    2seater    compact    midsize    minivan     pickup subcompact        suv 
##          5         47         41         11         33         35         62

Observe how the two variables are called inside the table code; they need to be separated by a comma. Observe the new table: it is a 3x7 table (3 drive trains by 7 classes).

table(mpg$drv, mpg$class)
##    
##     2seater compact midsize minivan pickup subcompact suv
##   4       0      12       3       0     33          4  51
##   f       0      35      38      11      0         22   0
##   r       5       0       0       0      0          9  11
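
The same idea can be tried without loading any package. The sketch below assumes the built-in mtcars data set instead of mpg, crossing the number of cylinders (cyl) with the number of gears (gear). dim() confirms the shape of the table, and addmargins() adds row and column totals.

```r
# Two categorical variables, separated by a comma inside table()
cyl_gear <- table(mtcars$cyl, mtcars$gear)
cyl_gear

dim(cyl_gear)                                   # rows x columns
rowSums(cyl_gear)                               # same counts as table(mtcars$cyl)
addmargins(cyl_gear)                            # table with totals added
```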

Create a new table and provide a name.

driveClass = table(mpg$drv, mpg$class)

par(mai=c(1.4,1,0.6,0.6))
plot14 = barplot(driveClass,
                 main = "Plot 14",
                 las=2,
                 legend.text = rownames(driveClass)
                 )

plot15 = barplot(driveClass, 
                 beside = TRUE,
                 main = "Plot 15. beside = TRUE added",
                 legend.text = rownames(driveClass),
                 args.legend = list(x = "topleft"))

Compare plots 14 and 15 and notice the effect of code: beside = TRUE
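
With a two-way table, barplot() stacks the rows within each column by default, while beside = TRUE draws them next to each other. The value returned by barplot() reflects this difference. The sketch below assumes the built-in mtcars data set in place of mpg.

```r
cyl_gear <- table(mtcars$cyl, mtcars$gear)      # 3 x 3 table from built-in data

# Stacked: one bar per column, so one position per column
stacked <- barplot(cyl_gear, legend.text = rownames(cyl_gear))
length(stacked)

# Clustered: one bar per cell, positions returned as a matrix
clustered <- barplot(cyl_gear, beside = TRUE,
                     legend.text = rownames(cyl_gear),
                     args.legend = list(x = "topleft"))
dim(clustered)
```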

Analyze the codes below

par(mai=c(1.4,1.4,0.6,0.6))

plot16 = barplot(driveClass, 
                 main = "Plot 16. Horizontal graph",
                 beside = TRUE, 
                 horiz = T,
                 las = 1,
                 col = c("#E75656","#CB1FD6","#0CAD6F","green"),
                 xlab = "Counts",
                 xlim = c(0,60)
                 )

ADD A LEGEND

Use code legend.text, call it by rownames, and add the name of the table used.

par(mai=c(1.4,1.4,0.6,0.6))

plot17 = barplot(driveClass, 
                 main = "Plot 17. Add an internal legend",
                 beside = TRUE, 
                 horiz = T,
                 las = 1,
                 col = c("#E75656","#CB1FD6","#0CAD6F","green"),
                 legend.text = rownames(driveClass),
                 xlab = "Counts",
                 xlim = c(0,60)
                 )

ADD ERROR BARS 1

Obtain the mean and sd of a data set. Let’s create three vectors.

names = c("Squirrel", "Rabbit", "Chipmunk")

means_set = c(23, 28, 19)

standard_dev = c(2, 4, 6)

plot18Top = max(means_set+standard_dev*2)

plot18 = barplot(means_set,
                 main = "Plot18. Error bars 1",
                 names.arg=names, 
                 col="gray", 
                 las=1, 
                 ylim=c(0,plot18Top))
segments(plot18, 
         means_set-standard_dev, 
         plot18, 
         means_set+standard_dev, 
         lwd=2)
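
When the raw measurements are available, the means and standard deviations do not have to be typed by hand. As a sketch, assuming the built-in iris data set as a stand-in for real measurements, tapply() computes one value per group:

```r
# Mean and standard deviation of sepal length for each species
group_means <- tapply(iris$Sepal.Length, iris$Species, mean)
group_sds   <- tapply(iris$Sepal.Length, iris$Species, sd)

top <- max(group_means + 2 * group_sds)         # leave room above the bars

bars <- barplot(group_means, col = "gray", las = 1, ylim = c(0, top))
segments(bars, group_means - group_sds,
         bars, group_means + group_sds, lwd = 2)
```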

ADD ERROR BARS 2

par(mai=c(1,1,1,0.4))

plot19= barplot(means_set,
                main = "Plot 19. Error bars 2",
                names.arg = names,
                cex.names = 0.6, 
                las=1, 
                col=rainbow(12), 
                ylim=c(0,35))
arrows(x0=plot19, 
       y0=means_set-standard_dev, 
       x1=plot19, 
       y1=means_set+standard_dev, 
       code=3, 
       angle=90, 
       length=0.1)

ADD ERROR BARS 3

Create random data.

m = matrix(runif(1000, min=1, max=10), ncol=10)
y_randomData = apply(m, 2, mean)
y.sd = apply(m, 2, sd)

Create an object named plot20 to hold the bar plot of the means. Plotting the graph also saves the bar positions in plot20, and those positions are then used to set the error bars with the arrows() code.

plot20 = barplot(y_randomData,
                 main = "Plot 20. Error bars 3",
                 ylim=c(0, 10),
                 col=rainbow(10))

arrows(x0=plot20, 
       y0=y_randomData-y.sd, 
       x1=plot20, 
       y1=y_randomData+y.sd, 
       code=3, 
       angle=90, 
       length=0.1)
text(y=y_randomData, 
     plot20, 
     round(y_randomData,2), 
     cex=0.8, 
     pos=3)
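
In apply(m, 2, mean), the 2 tells R to work column by column (1 would work row by row). The sketch below uses a smaller matrix and a fixed seed, both chosen only so the result is repeatable, and checks the column means against colMeans():

```r
set.seed(123)                                   # arbitrary seed for repeatability
m_small <- matrix(runif(20, min = 1, max = 10), ncol = 4)

col_means <- apply(m_small, 2, mean)            # one mean per column
col_sds   <- apply(m_small, 2, sd)              # one sd per column

all.equal(col_means, colMeans(m_small))         # both approaches agree
```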

SORT AND CUMULATIVE

sample_3244 = c(2,4,6,3,8,3,7,4,5,9,2,3,8,7,3)

cumsum(sample_3244)
##  [1]  2  6 12 15 23 26 33 37 42 51 53 56 64 71 74
barplot(sample_3244,
        col=terrain.colors(13))

barplot(sort(sample_3244),
        col=terrain.colors(13))

barplot(cumsum(sample_3244),
        col=terrain.colors(13))
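
cumsum() adds the values one after another, so its last element equals the total, while sort() only rearranges the values. Here is a quick check with the same sample (the object name x is used only for this sketch). It also passes length(x) to terrain.colors() so that every bar gets its own color.

```r
x <- c(2, 4, 6, 3, 8, 3, 7, 4, 5, 9, 2, 3, 8, 7, 3)

running_total <- cumsum(x)
tail(running_total, 1) == sum(x)                # last running total is the sum

sorted <- sort(x)
range(sorted)                                   # smallest and largest value

barplot(running_total, col = terrain.colors(length(x)))
```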

cInd_GasDies = carsIndia %>% 
        filter(FuelType %in% c('Gasoline', 'Diesel'))

cInd_GasDies %>% 
        ggplot(aes(Km/1000)) +
        geom_histogram(fill="#A11515",
                       color="white",
                       binwidth=10)+
        theme_minimal()

cInd_GasDies %>% 
        ggplot(aes(Km/1000, fill=Owner)) +
        geom_density(alpha=0.4)+
        theme_minimal()

cInd_GasDies %>% 
        ggplot(aes(x=Km/1000, y=Owner, fill=Owner)) +
        geom_boxplot() +
        labs(y="", x="Kilometers (x1,000)")+
        theme_minimal() +
        theme(legend.position = "none")


Disclaimer: This short manual series is a work in progress. Until otherwise clearly stated, this material is considered to be a draft version.


Dee Chiluiza, PhD
20 August, 2021
Boston, Massachusetts, USA

Bruno Dog