Program 1-Part A

Author

Nethra 1NT24IS138

Develop an R program to quickly explore a given dataset, including categorical analysis using the group_by() command, and visualize the findings using ggplot2.

Aim of the program

The objective of this program is to quickly explore a dataset, perform categorical (group-wise) analysis using the group_by() command from dplyr, and visualize the results using ggplot2.

What We Will Do

In this program we will:

  1. Load the required libraries and dataset.
  2. Explore the structure of the dataset.
  3. Convert a numeric variable into a categorical variable.
  4. Perform categorical analysis using group_by() and summarise().
  5. Visualize the results using ggplot2.

Step 1: Load the Required Libraries and Dataset

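  • tidyverse is a collection of packages for data science; it includes dplyr and ggplot2.
  • dplyr is used for grouping and summarizing data. Loading tidyverse already attaches it, so the separate library(dplyr) call is optional.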
 library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.1     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
 library(dplyr)

# Load built-in dataset
data <- mtcars

Step 2: Explore the Dataset

Before performing an analysis, we should understand the dataset.

We will check:

  • Number of rows and columns
  • Column names
  • Data types
  • Summary statistics
  • First few rows
 # Dimensions (rows and columns)
dim(data)
[1] 32 11
# Column names
names(data)
 [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
[11] "carb"
# Structure of Dataset
str(data)
'data.frame':   32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
# Summary statistics
summary(data)
      mpg             cyl             disp             hp       
 Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
 1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
 Median :19.20   Median :6.000   Median :196.3   Median :123.0  
 Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
 3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
 Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
      drat             wt             qsec             vs        
 Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
 1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
 Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
 Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
 3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
 Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
       am              gear            carb      
 Min.   :0.0000   Min.   :3.000   Min.   :1.000  
 1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
 Median :0.0000   Median :4.000   Median :2.000  
 Mean   :0.4062   Mean   :3.688   Mean   :2.812  
 3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
 Max.   :1.0000   Max.   :5.000   Max.   :8.000  
# First six rows
head(data)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
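
As a complement to the checks above, the sketch below uses glimpse() from dplyr for a compact overview and adds two quick checks: missing values per column and the number of cars per cylinder category. These extra checks are an addition, not part of the original program, and the block works on mtcars directly so that it runs on its own.

```r
library(dplyr)

# Compact overview, one line per column (an alternative to str())
glimpse(mtcars)

# Number of missing values in each column
# (an extra check, not part of the original steps)
missing_per_column <- colSums(is.na(mtcars))
missing_per_column

# Number of cars in each cylinder category
cyl_counts <- count(mtcars, cyl)
cyl_counts
```

Knowing the group sizes up front helps when reading group-wise averages later, since a mean over a small group is less stable than one over a large group.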

Step 3: Convert Numeric Variable to Categorical

The variable cyl represents the number of cylinders in the car.

Although it is numeric (4, 6, 8), it represents categories. For categorical analysis, we convert it into a factor.

# Convert 'cyl' to factor
data$cyl <- as.factor(data$cyl)

# Confirm conversion
str(data$cyl)
 Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
levels(data$cyl)
[1] "4" "6" "8"
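
To illustrate why the conversion matters, here is a minimal sketch comparing cyl as a number and as a factor. The variable names cyl_numeric and cyl_factor are introduced only for this illustration.

```r
cyl_numeric <- mtcars$cyl
cyl_factor  <- factor(mtcars$cyl)

# Treated as a number, cyl yields an average that matches no real car
mean(cyl_numeric)

# Treated as a factor, cyl yields a count per category
table(cyl_factor)

# Caution: as.numeric() on a factor returns the internal level codes (1, 2, 3);
# go through as.character() to recover the original values
head(as.numeric(cyl_factor))
head(as.numeric(as.character(cyl_factor)))
```

The last two lines show a common pitfall when a factor has to be turned back into numbers.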

Step 4: Perform Categorical Analysis

We calculate the average miles per gallon (‘mpg’) for each cylinder category.

How these functions work together

  • %>% passes the output of one function to the next as its first argument.
  • group_by(cyl) splits the dataset into groups, one per level of cyl.
  • summarise() calculates statistics per group.
  • mean(mpg) computes the average mileage.
  • .groups = "drop" removes grouping afterward.
summary_data <- data %>%
  group_by(cyl) %>%
  summarise(
    avg_mpg = mean(mpg),
    .groups = "drop"
  )

summary_data
# A tibble: 3 × 2
  cyl   avg_mpg
  <fct>   <dbl>
1 4        26.7
2 6        19.7
3 8        15.1
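
The effect of .groups = "drop" is easier to see with two grouping variables. The sketch below is an illustration beyond the original program: it groups by cyl and by am (transmission type, another mtcars column) and compares the default behaviour with the explicit drop. The names by_default and by_drop are hypothetical.

```r
library(dplyr)

# Default: summarise() drops only the last grouping level (am),
# so the result is still grouped by cyl (dplyr prints a message about this)
by_default <- mtcars %>%
  group_by(cyl, am) %>%
  summarise(avg_mpg = mean(mpg))

# Explicit drop: the result carries no grouping at all
by_drop <- mtcars %>%
  group_by(cyl, am) %>%
  summarise(avg_mpg = mean(mpg), .groups = "drop")

group_vars(by_default)  # remaining grouping variables after the default
group_vars(by_drop)     # empty: no grouping left
```

With a single grouping variable, as in the program above, the default would also return an ungrouped result; writing .groups = "drop" makes that intent explicit and silences the message.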

Step 5: Visualize Using a Bar Plot

ggplot(summary_data, aes(x = cyl, y = avg_mpg, fill = cyl)) +
  geom_bar(stat = "identity") +
  labs(
    title = "Average MPG by Cylinder Count",
  labs(
    x = "Number of Cylinders",
    y = "Average MPG"
  ) +
  theme_minimal()
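
As an alternative sketch, geom_col() is ggplot2's shorthand for bars whose heights are taken directly from the data, so it can stand in for geom_bar(stat = "identity"). The block below rebuilds the summary from mtcars so that it runs on its own; the names plot_data and p are introduced only for this illustration.

```r
library(dplyr)
library(ggplot2)

# Recompute the per-cylinder averages (same logic as Step 4)
plot_data <- mtcars %>%
  mutate(cyl = factor(cyl)) %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg), .groups = "drop")

# geom_col() uses the y values as bar heights, no stat argument needed
p <- ggplot(plot_data, aes(x = cyl, y = avg_mpg, fill = cyl)) +
  geom_col() +
  labs(
    title = "Average MPG by Cylinder Count",
    x = "Number of Cylinders",
    y = "Average MPG"
  ) +
  theme_minimal()

print(p)
```

Storing the plot in an object before printing makes it easy to reuse, for example to save it to a file later.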