Bar graph for categorical visualization

Author

Manoj

Problem statement: Develop an R script that produces a bar graph showing the frequency distribution of categorical data, grouped by a specific variable, using ggplot2.


Steps

In this program, we follow these steps:

  • Load the required libraries
  • Load and inspect the dataset
  • Perform exploratory data analysis (EDA)
  • Convert the numeric variables into categorical variables
  • Examine the frequency distribution
  • Create a grouped bar chart
  • Examine and interpret the graph

Step 1: Load required libraries

#install.packages('ggplot2')
library(ggplot2)
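
If the script may run on a machine where ggplot2 is not yet installed, a guarded install can replace the commented-out line. The following is a minimal sketch, not part of the original script; the CRAN mirror URL is an assumption and can be swapped for any other mirror.

```r
# Install ggplot2 only when it is missing (the mirror URL is an assumption)
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2", repos = "https://cloud.r-project.org")
}
library(ggplot2)
```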

Step 2: Load and inspect the dataset

We load the built-in dataset mtcars and view the first few rows to understand its structure.

# Work on a copy so the built-in mtcars stays unchanged
data <- mtcars
head(data)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Step 3: Exploratory data analysis

Before creating any visualization, we explore the dataset to understand its variables and their types.

str(data)
'data.frame':   32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
summary(data)
      mpg             cyl             disp             hp       
 Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
 1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
 Median :19.20   Median :6.000   Median :196.3   Median :123.0  
 Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
 3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
 Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
      drat             wt             qsec             vs        
 Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
 1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
 Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
 Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
 3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
 Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
       am              gear            carb      
 Min.   :0.0000   Min.   :3.000   Min.   :1.000  
 1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
 Median :0.0000   Median :4.000   Median :2.000  
 Mean   :0.4062   Mean   :3.688   Mean   :2.812  
 3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
 Max.   :1.0000   Max.   :5.000   Max.   :8.000  
  • str(data) shows the data type of each variable
  • summary(data) provides summary statistics for each variable
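
To see at a glance which numeric columns behave like categories, one option is to count the distinct values in each column. This is a sketch in base R; the cut-off of five distinct values is an arbitrary assumption chosen for illustration, and the names col_classes, n_distinct, and candidates are hypothetical.

```r
data <- mtcars

# Class of each column (all are "numeric" in mtcars)
col_classes <- sapply(data, class)

# Number of distinct values per column
n_distinct <- sapply(data, function(x) length(unique(x)))

# Columns with few distinct values are candidates for conversion to factors
# (the threshold of 5 is an illustrative assumption, not a rule)
candidates <- names(n_distinct)[n_distinct <= 5]
print(candidates)
```

For mtcars this flags cyl, vs, am, and gear; this tutorial converts only cyl and gear because those are the two variables being plotted.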

Step 4: Convert the numeric variables to factors

To visualize categorical data correctly, we convert the relevant variables (cyl and gear) into factors.

data$cyl
 [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
table(data$cyl)

 4  6  8 
11  7 14 
data$gear
 [1] 4 4 4 3 3 3 3 4 4 4 4 3 3 3 3 3 3 4 4 4 3 3 3 3 3 4 5 5 5 5 5 4
table(data$gear)

 3  4  5 
15 12  5 
class(data$cyl)
[1] "numeric"
class(data$gear)
[1] "numeric"
# Convert the numeric codes to factors so they are treated as categories
data$cyl <- as.factor(data$cyl)
data$gear <- as.factor(data$gear)
class(data$cyl)
[1] "factor"
data$cyl
 [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
Levels: 4 6 8
class(data$gear)
[1] "factor"
data$gear
 [1] 4 4 4 3 3 3 3 4 4 4 4 3 3 3 3 3 3 4 4 4 3 3 3 3 3 4 5 5 5 5 5 4
Levels: 3 4 5
summary(data)
      mpg        cyl         disp             hp             drat      
 Min.   :10.40   4:11   Min.   : 71.1   Min.   : 52.0   Min.   :2.760  
 1st Qu.:15.43   6: 7   1st Qu.:120.8   1st Qu.: 96.5   1st Qu.:3.080  
 Median :19.20   8:14   Median :196.3   Median :123.0   Median :3.695  
 Mean   :20.09          Mean   :230.7   Mean   :146.7   Mean   :3.597  
 3rd Qu.:22.80          3rd Qu.:326.0   3rd Qu.:180.0   3rd Qu.:3.920  
 Max.   :33.90          Max.   :472.0   Max.   :335.0   Max.   :4.930  
       wt             qsec             vs               am         gear  
 Min.   :1.513   Min.   :14.50   Min.   :0.0000   Min.   :0.0000   3:15  
 1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000   1st Qu.:0.0000   4:12  
 Median :3.325   Median :17.71   Median :0.0000   Median :0.0000   5: 5  
 Mean   :3.217   Mean   :17.85   Mean   :0.4375   Mean   :0.4062         
 3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000   3rd Qu.:1.0000         
 Max.   :5.424   Max.   :22.90   Max.   :1.0000   Max.   :1.0000         
      carb      
 Min.   :1.000  
 1st Qu.:2.000  
 Median :2.000  
 Mean   :2.812  
 3rd Qu.:4.000  
 Max.   :8.000  
str(data)
'data.frame':   32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: Factor w/ 3 levels "3","4","5": 2 2 2 1 1 1 1 2 2 2 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

Why this step?

  • It ensures that cyl and gear are interpreted as categories
  • It helps ggplot2 group the data consistently
  • It prevents the categories from being treated as continuous values
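
The effect of the conversion can be checked directly in base R. The sketch below contrasts the numeric and factor versions of cyl; the label text passed to factor() is a hypothetical choice for illustration and is not used elsewhere in this tutorial.

```r
cyl_num <- mtcars$cyl

# As a number, cyl has a mean, which is not meaningful for a category
mean(cyl_num)

# factor() with explicit levels fixes the category order;
# the labels are hypothetical and only make the output more readable
cyl_fac <- factor(cyl_num,
                  levels = c(4, 6, 8),
                  labels = c("4 cyl", "6 cyl", "8 cyl"))

levels(cyl_fac)   # the three categories, in the order given above
table(cyl_fac)    # counts per category
```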

Step 5: Examine frequency distribution

Before plotting, we analyze how the data is distributed across categories.

table(data$cyl)

 4  6  8 
11  7 14 
table(data$gear)

 3  4  5 
15 12  5 
table(data$cyl, data$gear)
   
     3  4  5
  4  1  8  2
  6  2  4  1
  8 12  0  2
  • Helps us understand the count in each category
  • Provides insight into the relationship between the two variables
  • Prepares us for interpreting the visualization
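
Counts can be easier to compare as proportions. The following sketch, which takes cyl and gear straight from mtcars, converts the cross-tabulation to row proportions and to a long data frame. The names counts, row_props, and counts_long are illustrative.

```r
# Cross-tabulate cylinders (rows) by gears (columns)
counts <- table(cyl = mtcars$cyl, gear = mtcars$gear)

# Share of each gear type within each cylinder group (rows sum to 1)
row_props <- prop.table(counts, margin = 1)
round(row_props, 2)

# Long format: one row per cyl/gear combination, including zero counts
counts_long <- as.data.frame(counts, responseName = "n")

# Show the combinations that do not occur in the data
counts_long[counts_long$n == 0, ]
```

The only empty combination is 8 cylinders with 4 gears, which is worth remembering when reading the grouped bar chart.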

Step 6: Create the bar graph

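# Start with the data and aesthetic mappings; no layer yet, so the plot is empty
ggplot(data, aes(x=cyl, fill=gear))

# Add bars; the default position stacks the gear groups
ggplot(data, aes(x=cyl, fill=gear))+geom_bar()

# Place the gear groups side by side
ggplot(data, aes(x=cyl, fill=gear))+geom_bar(position="dodge")

# Apply a minimal theme and move the legend to the top
ggplot(data, aes(x=cyl, fill=gear))+
  geom_bar(position="dodge")+
  theme_minimal()+
  theme(legend.position = 'top')

# Add a title and axis labels
ggplot(data, aes(x=cyl, fill=gear))+
  geom_bar(position="dodge")+
  theme_minimal()+
  theme(legend.position = 'top')+
  labs(title="Bar graph displaying the frequency distribution of categorical data", y="Count", x="Number of cylinders")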
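
Because no 8-cylinder car has 4 gears, the dodged bars for 8 cylinders are shared between only two gear groups and so appear wider than the others. One way to keep all bars the same width is position_dodge(preserve = "single"). This is an optional variation, not part of the original script; the object name p_single and the legend title are assumptions.

```r
library(ggplot2)

data <- mtcars
data$cyl <- as.factor(data$cyl)
data$gear <- as.factor(data$gear)

# preserve = "single" keeps every bar the same width,
# even where a cyl/gear combination has no cars
p_single <- ggplot(data, aes(x = cyl, fill = gear)) +
  geom_bar(position = position_dodge(preserve = "single")) +
  theme_minimal() +
  theme(legend.position = "top") +
  labs(x = "Number of cylinders", y = "Count", fill = "Number of gears")

print(p_single)
```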

Step 7: Discussion

After generating the graph, we analyze the patterns and relationships between the number of cylinders and the number of gears.

  1. Modify the graph to create a stacked bar chart.
  2. What happens if we do not convert the variables to categorical variables?
  3. How does grouping improve the visualization?
  4. Why is a grouped bar chart useful in this case?
  5. What insights can be derived from the visualization?
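
As a possible starting point for the first question, the sketch below switches the position argument of geom_bar(). It assumes cyl and gear are converted to factors as in Step 4. The position = "fill" variant is an extra, not asked for above, that shows proportions instead of counts. The object names p_stack and p_fill are hypothetical.

```r
library(ggplot2)

data <- mtcars
data$cyl <- as.factor(data$cyl)
data$gear <- as.factor(data$gear)

# Stacked bars: the default position for geom_bar(), written out explicitly
p_stack <- ggplot(data, aes(x = cyl, fill = gear)) +
  geom_bar(position = "stack") +
  labs(x = "Number of cylinders", y = "Count")

# Filled bars: each bar is scaled to 1 so gear shares can be compared
p_fill <- ggplot(data, aes(x = cyl, fill = gear)) +
  geom_bar(position = "fill") +
  labs(x = "Number of cylinders", y = "Proportion")

print(p_stack)
print(p_fill)
```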