Program 1 - Part A

Author

Stephen George

USN: 1NT24IS227

Develop an R program to quickly explore a given dataset, including categorical analysis using the group_by() command, visualise

What We Will Do

In this program, we will:

  1. Load the required libraries and dataset.
  2. Explore the structure of the dataset.
  3. Convert a numeric variable into a categorical variable.
  4. Perform categorical analysis using group_by() and summarise().
  5. Visualize the result.

Step 1: Load the required Libraries and Dataset

  • tidyverse is a collection of packages for data science, including dplyr and ggplot2.
  • dplyr, which is part of the tidyverse, is used for grouping and summarizing data.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)  # already attached by tidyverse; loaded explicitly for clarity

# Load built-in dataset
data <- mtcars

Step 2: Explore the Dataset

Before performing any analysis, we should understand the dataset.

We will check:

  • Number of rows and columns
  • Column names
  • Data types
  • Summary statistics
  • First few rows
# Dimensions (rows and columns)
dim(data)
[1] 32 11
# Column names
names(data)
 [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
[11] "carb"
# Structure of dataset
str(data)
'data.frame':   32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
# Summary statistics
summary(data)
      mpg             cyl             disp             hp       
 Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
 1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
 Median :19.20   Median :6.000   Median :196.3   Median :123.0  
 Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
 3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
 Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
      drat             wt             qsec             vs        
 Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
 1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
 Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
 Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
 3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
 Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
       am              gear            carb      
 Min.   :0.0000   Min.   :3.000   Min.   :1.000  
 1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
 Median :0.0000   Median :4.000   Median :2.000  
 Mean   :0.4062   Mean   :3.688   Mean   :2.812  
 3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
 Max.   :1.0000   Max.   :5.000   Max.   :8.000  
# First six rows
head(data)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
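
As an optional extra check during exploration, the sketch below counts missing values per column and lists each column's class. It works on a fresh copy of mtcars (the name cars_check is chosen only for this illustration), so it does not depend on the earlier steps.

```r
# Illustrative check on a fresh copy of the built-in dataset
# (cars_check is a hypothetical name used only in this sketch)
cars_check <- mtcars

# Number of missing values in each column
na_counts <- colSums(is.na(cars_check))
na_counts

# Class of each column, returned as a named character vector
col_classes <- sapply(cars_check, class)
col_classes
```

If any column showed a non-zero count, it would be worth deciding how to handle the missing values before grouping and summarising.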

Step 3: Convert Numeric Value to Categorical

The variable cyl represents the number of cylinders in a car.

Although it is stored as a number (4, 6, 8), it represents categories.
For categorical analysis, we convert it into a factor.

# Convert 'cyl' to factor
data$cyl <- as.factor(data$cyl)

# Confirm conversion
str(data$cyl)
 Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
levels(data$cyl)
[1] "4" "6" "8"
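
One quick way to see the effect of the conversion is to count how many cars fall into each level. The following is a minimal sketch on a fresh copy of mtcars (cars_fct is a name assumed here for illustration only).

```r
# Fresh copy so the sketch runs on its own
# (cars_fct is a hypothetical name used only here)
cars_fct <- mtcars
cars_fct$cyl <- as.factor(cars_fct$cyl)

# Frequency of each cylinder category
cyl_counts <- table(cars_fct$cyl)
cyl_counts

# Number of distinct categories
nlevels(cars_fct$cyl)
```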

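Why Is This Important?

A factor tells R to treat cyl as a set of discrete groups rather than a continuous quantity. When cyl is later mapped to the x-axis and fill colour of a plot, this produces a discrete axis and one distinct colour per cylinder count instead of a continuous colour gradient.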

Step 4: Perform Categorical Analysis

We calculate the average miles per gallon (mpg) for each cylinder category.

How These Functions Work Together

  • %>% passes output from one function to the next.
  • group_by(cyl) splits the dataset into groups.
  • summarise() calculates statistics per group.
  • mean(mpg) computes average mileage.
  • .groups = "drop" removes grouping afterward.
summary_data <- data %>%
  group_by(cyl) %>%
  summarise(
    avg_mpg = mean(mpg),
    .groups = "drop"
  )
summary_data
# A tibble: 3 × 2
  cyl   avg_mpg
  <fct>   <dbl>
1 4        26.7
2 6        19.7
3 8        15.1
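
The same group_by() and summarise() pattern can return several statistics at once. The sketch below is an illustrative extension, not part of the original program; the names cars_grp, detailed_summary, n_cars, median_mpg, and sd_mpg are assumptions made for this example.

```r
library(dplyr)

# Fresh copy with cyl converted to a factor, so the sketch is self-contained
# (cars_grp and the new column names are hypothetical)
cars_grp <- mtcars
cars_grp$cyl <- as.factor(cars_grp$cyl)

detailed_summary <- cars_grp %>%
  group_by(cyl) %>%
  summarise(
    n_cars     = n(),          # number of cars in each group
    avg_mpg    = mean(mpg),    # mean mileage
    median_mpg = median(mpg),  # median mileage
    sd_mpg     = sd(mpg),      # spread of mileage within the group
    .groups    = "drop"
  )

detailed_summary
```

Reporting the group size alongside the mean helps judge how much weight each average deserves, since the groups are not equally large.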

Step 5: Visualize Using a Bar Plot

Understanding ggplot Components

  • ggplot(summary_data, aes(...)) defines dataset and mappings.
  • aes(x=cyl, y=avg_mpg, fill=cyl) sets axes and colors.
  • geom_bar(stat = "identity") draws each bar at the height given by avg_mpg instead of counting rows, which is the default behaviour of geom_bar().
  • labs() adds titles and labels.
  • theme_minimal() applies a clean theme.
ggplot(summary_data, aes(x = cyl, y = avg_mpg, fill = cyl)) +
  geom_bar(stat = "identity") +
  labs(
    title = "Average MPG by Cylinder Count",
    x = "Number of Cylinders",
    y = "Average MPG"
  ) +
  theme_minimal()
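
As an alternative sketch, ggplot2 also provides geom_col(), which plots the supplied y values directly and so avoids the stat = "identity" argument. The example below rebuilds the summary from mtcars so that it runs on its own; the names plot_data and p are assumptions made for this illustration.

```r
library(dplyr)
library(ggplot2)

# Rebuild the per-cylinder summary so the sketch is self-contained
# (plot_data and p are hypothetical names)
plot_data <- mtcars %>%
  mutate(cyl = as.factor(cyl)) %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg), .groups = "drop")

# geom_col() uses the y values as bar heights
p <- ggplot(plot_data, aes(x = cyl, y = avg_mpg, fill = cyl)) +
  geom_col() +
  labs(
    title = "Average MPG by Cylinder Count",
    x = "Number of Cylinders",
    y = "Average MPG"
  ) +
  theme_minimal()

p
```

Storing the plot in an object also makes it possible to save it later, for example with ggsave().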