Program 4

Author

Manoj

Published

March 21, 2026

Objective

Develop a script in R to produce a bar graph displaying the frequency distribution of categorical data, grouped by a specific variable, using the ggplot2 package.


Overview of Steps

In this program, we will follow these steps:

  1. Load the required library
  2. Load and inspect the dataset
  3. Perform exploratory data analysis (EDA)
  4. Convert numerical variables into categorical variables
  5. Examine frequency distributions
  6. Create a grouped bar chart
  7. Interpret the results

Step 1: Load the Required Library

library(ggplot2)

Step 2: Load and Inspect the Dataset

We load the built-in dataset mtcars and view the first few rows to understand its structure.

data <- mtcars
head(data)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Step 3: Exploratory Data Analysis

Before creating any visualization, we explore the dataset to understand the variables and their types.

str(data)
'data.frame':   32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
summary(data)
      mpg             cyl             disp             hp       
 Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
 1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
 Median :19.20   Median :6.000   Median :196.3   Median :123.0  
 Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
 3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
 Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
      drat             wt             qsec             vs        
 Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
 1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
 Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
 Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
 3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
 Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
       am              gear            carb      
 Min.   :0.0000   Min.   :3.000   Min.   :1.000  
 1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
 Median :0.0000   Median :4.000   Median :2.000  
 Mean   :0.4062   Mean   :3.688   Mean   :2.812  
 3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
 Max.   :1.0000   Max.   :5.000   Max.   :8.000  

Explanation

  • str(data) shows the data type of each variable along with its first few values
  • summary(data) reports the minimum, quartiles, median, mean, and maximum of each numeric variable

Key Observation

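  • Variables like cyl and gear are stored as numeric but take only a few distinct values (the summary shows cyl ranging from 4 to 8 and gear from 3 to 5), so they represent categories rather than measurements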
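
One way to spot such columns is to count the distinct values in each one. The sketch below assumes that a column with at most 5 distinct values is a candidate category; that cutoff is an arbitrary choice for illustration, not a rule.

```r
# Count the distinct values in each column of mtcars
data <- mtcars
n_unique <- sapply(data, function(col) length(unique(col)))

# Flag columns with few distinct values (5 is an illustrative threshold)
candidates <- names(n_unique)[n_unique <= 5]

print(n_unique)
print(candidates)
```

For mtcars this flags cyl, vs, am, and gear. Whether a flagged column should really be treated as categorical is still a judgment about what the variable means.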

Step 4: Convert Numeric Variables to Factors

To correctly visualize categorical data, we convert relevant variables into factors.

data$cyl <- as.factor(data$cyl)
data$gear <- as.factor(data$gear)

Why This Step?

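  • Ensures R treats cyl and gear as categories with a fixed set of levels
  • Lets ggplot2 split the bars by gear: a discrete fill defines groups, whereas a numeric fill is treated as continuous and does not
  • Keeps the axis and legend discrete, showing only the values that occur instead of a continuous scale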
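
The conversion can be checked with a few base R calls. This is a minimal sketch on a fresh copy of mtcars; it also shows a common pitfall when converting a factor back to numbers.

```r
data <- mtcars
was_numeric <- is.numeric(data$cyl)   # TRUE before conversion

data$cyl  <- as.factor(data$cyl)
data$gear <- as.factor(data$gear)

is.factor(data$cyl)    # TRUE after conversion
levels(data$cyl)       # "4" "6" "8"
levels(data$gear)      # "3" "4" "5"

# Pitfall: as.numeric() on a factor returns the internal level codes,
# not the original values. Go through as.character() to recover them.
codes  <- as.numeric(data$cyl)[1:3]                 # 2 2 1
values <- as.numeric(as.character(data$cyl))[1:3]   # 6 6 4
```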

Step 5: Examine Frequency Distribution

Before plotting, we analyze how the data is distributed across categories.

table(data$cyl)

 4  6  8 
11  7 14 
table(data$gear)

 3  4  5 
15 12  5 
table(data$cyl, data$gear)
   
     3  4  5
  4  1  8  2
  6  2  4  1
  8 12  0  2

Explanation

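  • table(data$cyl) and table(data$gear) give the count of each category
  • table(data$cyl, data$gear) cross-tabulates the two variables (rows are cylinders, columns are gears) and shows how they relate
  • These counts are the bar heights we expect to see, which makes the chart easy to verify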
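
The same cross-tabulation can be extended with totals and within-group shares using base R. The following is a sketch; the names counts and counts_long are chosen here for illustration.

```r
data <- mtcars
data$cyl  <- as.factor(data$cyl)
data$gear <- as.factor(data$gear)

# Two-way table with named dimensions
counts <- table(cyl = data$cyl, gear = data$gear)

# Add row and column totals
addmargins(counts)

# Share of each gear count within a cylinder group (each row sums to 1)
shares <- prop.table(counts, margin = 1)
round(shares, 2)

# Long format: one row per cyl/gear combination, with the count in column n
counts_long <- as.data.frame(counts, responseName = "n")
counts_long
```

The long format is convenient when the counts are needed as a data frame, for example to plot pre-computed values.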

Step 6: Create the Bar Graph

We now create a grouped bar chart using ggplot2.

ggplot(data, aes(x = cyl, fill = gear)) +
  geom_bar(position = "dodge") +
  labs(title = "Frequency of Cylinders Grouped by Gear Type",
       x = "Number of Cylinders",
       y = "Count",
       fill = "Gears") +
  theme_minimal()
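
No car in the dataset has 8 cylinders and 4 gears, so with the default dodge the two remaining bars in the 8-cylinder group are drawn wider than the bars in the other groups. A possible variation, assuming a ggplot2 version that supports the preserve argument of position_dodge(), keeps every bar the same width:

```r
library(ggplot2)

data <- mtcars
data$cyl  <- as.factor(data$cyl)
data$gear <- as.factor(data$gear)

# preserve = "single" keeps each bar at the width of a single bar,
# even when a cyl/gear combination has no observations
p <- ggplot(data, aes(x = cyl, fill = gear)) +
  geom_bar(position = position_dodge(preserve = "single")) +
  labs(title = "Frequency of Cylinders Grouped by Gear Type",
       x = "Number of Cylinders",
       y = "Count",
       fill = "Gears") +
  theme_minimal()

print(p)
```

Whether to use it is a matter of taste: the default makes the missing combination less obvious, while equal widths leave a visible gap.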


Step 7: Interpretation

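After generating the graph, we analyze patterns and relationships between the number of cylinders and gear types. The counts from Step 5 indicate what the chart should show: 8-cylinder cars are concentrated at 3 gears (12 of 14), 4-cylinder cars at 4 gears (8 of 11), and no 8-cylinder car has 4 gears, so that bar is absent. Cars with 5 gears are rare in every cylinder group (5 of 32 overall).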


Discussion Points

  1. Why is it necessary to convert numeric variables like cyl into factors?
  2. What happens if we do not convert them?
  3. How does grouping improve interpretation?
  4. Why is a grouped bar chart useful in this case?
  5. What insights can be derived from the visualization?

Follow-up Questions

  1. Modify the graph to create a stacked bar chart
  2. Use another variable such as am or vs
  3. Customize the colors of the bars
  4. Add labels to display counts on bars
  5. Try the same process with a different dataset
  6. Create a percentage-based bar chart
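
As one possible starting point for items 1 and 4, the sketch below stacks the bars and prints each count in the middle of its segment. It assumes a ggplot2 version that provides after_stat(); the object name p_stacked is chosen here for illustration.

```r
library(ggplot2)

data <- mtcars
data$cyl  <- as.factor(data$cyl)
data$gear <- as.factor(data$gear)

p_stacked <- ggplot(data, aes(x = cyl, fill = gear)) +
  # Stack the gear groups on top of each other within each cylinder bar
  geom_bar(position = "stack") +
  # Reuse the count statistic to label each segment at its midpoint
  geom_text(stat = "count",
            aes(label = after_stat(count)),
            position = position_stack(vjust = 0.5)) +
  labs(title = "Frequency of Cylinders Stacked by Gear Type",
       x = "Number of Cylinders",
       y = "Count",
       fill = "Gears") +
  theme_minimal()

print(p_stacked)
```

For item 6, switching the position to "fill" rescales every bar to the same height so that segments show proportions; the label position then needs to change to match.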

Conclusion

In this program, we:

  • Performed exploratory data analysis
  • Converted variables into categorical form
  • Created a grouped bar chart using ggplot2
  • Interpreted relationships between categorical variables