Program 8

Author

1NT23IS080 - Section B - Harsh Deep B Nair

Develop a report on all 7 previous programs.

Program 1

Develop an R program to quickly explore a given data set, including categorical analysis using the group_by command, and visualize the findings using ggplot2 features.

Step 1: Load the necessary libraries

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(ggplot2)

Step 2: Load the dataset

#Load dataset
data <- mtcars

#Convert 'cyl' to a factor for categorical analysis
data$cyl <- as.factor(data$cyl)

Step 3: Group by categorical variables

#Summarize average mpg by cylinder category
summary_data <- data %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg), .groups= 'drop')

#Display summary
print(summary_data)
# A tibble: 3 × 2
  cyl   avg_mpg
  <fct>   <dbl>
1 4        26.7
2 6        19.7
3 8        15.1
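
A quick exploration usually looks at more than one statistic per group. The following is a minimal sketch, assuming the same mtcars data; the name cyl_stats and the extra columns (n_cars, median_mpg, sd_mpg) are illustrative and not part of the original program.

```r
library(dplyr)

# Summarise several statistics per cylinder group in a single pass
cyl_stats <- mtcars %>%
  mutate(cyl = as.factor(cyl)) %>%
  group_by(cyl) %>%
  summarise(
    n_cars     = n(),          # number of cars in the group
    avg_mpg    = mean(mpg),    # same statistic as in the report
    median_mpg = median(mpg),  # less sensitive to extreme values
    sd_mpg     = sd(mpg),      # spread within the group
    .groups    = "drop"
  )

print(cyl_stats)
```

Reporting the group size next to the mean helps judge how much weight each average deserves.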

Step 4: Visualizing the findings

#Create a bar plot using ggplot2
ggplot(summary_data, aes( x= cyl, y = avg_mpg, fill = cyl))+
  geom_bar(stat = "identity") +
  labs(title = "Average MPG by cylinder count",
       x = "Number of Cylinders",
       y= "Average MPG") +
  theme_minimal()
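
As a possible variation, geom_col() can stand in for geom_bar(stat = "identity") when the bar heights are already computed. The sketch below rebuilds the summary so that it runs on its own and adds value labels with geom_text(); the object name p_col is an assumption made for illustration.

```r
library(ggplot2)
library(dplyr)

# Recompute the per-cylinder averages so this block is self-contained
summary_data <- mtcars %>%
  mutate(cyl = as.factor(cyl)) %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg), .groups = "drop")

# geom_col() plots the y values as given, without counting rows
p_col <- ggplot(summary_data, aes(x = cyl, y = avg_mpg, fill = cyl)) +
  geom_col() +
  geom_text(aes(label = round(avg_mpg, 1)), vjust = -0.5) +  # value above each bar
  labs(title = "Average MPG by cylinder count",
       x = "Number of Cylinders",
       y = "Average MPG") +
  theme_minimal()

print(p_col)
```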

Program 2

Write an R script to create a scatter plot, incorporating categorical analysis through color-coded data points representing different groups, using ggplot2.

Step 1: Load the necessary libraries

# Load necessary libraries
library(ggplot2)
library(dplyr)

Step 2: Load the dataset

#Load the iris dataset
data <- iris

#Display first few rows
head(data)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

Step 3: Create a scatter plot

#Create a scatter plot using ggplot2
ggplot(data, aes(x = Sepal.Length, y = Sepal.Width, color = Species))+
  geom_point(size = 3, alpha = 0.7) + #Larger, semi-transparent points
  labs(title = "Scatter Plot of Sepal Dimensions",
       x = "Sepal Length",
       y = "Sepal Width",
       color = "Species") + #Legend title
  theme_minimal()+ #Clean layout
  theme(legend.position = "top") #Move legend to the top
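
Color-coding by group matters here because the overall relationship and the within-species relationships point in different directions. The sketch below checks this numerically and adds one fitted line per species with geom_smooth(); the names overall_cor, within_cor and p_trend are illustrative assumptions.

```r
library(ggplot2)

# Correlation across all flowers, ignoring species
overall_cor <- cor(iris$Sepal.Length, iris$Sepal.Width)

# Correlation computed separately within each species
within_cor <- sapply(split(iris, iris$Species),
                     function(d) cor(d$Sepal.Length, d$Sepal.Width))

print(overall_cor)
print(within_cor)

# One linear trend line per species, drawn on top of the colored points
p_trend <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point(size = 3, alpha = 0.7) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Sepal Dimensions with Per-Species Trend Lines",
       x = "Sepal Length",
       y = "Sepal Width") +
  theme_minimal() +
  theme(legend.position = "top")

print(p_trend)
```

The pooled correlation is negative while each species shows a positive one, which a single-color scatter plot would hide.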

Program 3

Implement an R function to generate a line graph depicting the trend of a time-series dataset, with separate lines for each group, utilizing ggplot2’s group aesthetic.

Introduction

This document demonstrates how to create a time-series line graph using the built-in AirPassengers dataset in R.

The dataset contains monthly airline passenger counts from 1949 to 1960. We will use ggplot2 to visualize trends, with separate lines for each year.

Step 1: Load the necessary libraries

library(ggplot2)
library(dplyr)
library(tidyr)

Step 2: Load the built-in AirPassengers Dataset

The AirPassengers dataset is a time series object in R.

We first convert it into a dataframe to use it with ggplot2.

  • Date: Represents the month and year (from January 1949 to December 1960).

  • Passengers: Monthly airline passenger counts.

  • Year: Extracted year from the date column, which will be used to group the data.

# Convert time-series data to a dataframe
data <- data.frame(
  Date = seq(as.Date("1949-01-01"), by = "month", length.out = length(AirPassengers)),
  
  Passengers = as.numeric(AirPassengers),
  
  Year = as.factor(format(seq(as.Date("1949-01-01"), by = "month", length.out = length(AirPassengers)), "%Y"))
)

# Display first few rows
head(data, n=20)
         Date Passengers Year
1  1949-01-01        112 1949
2  1949-02-01        118 1949
3  1949-03-01        132 1949
4  1949-04-01        129 1949
5  1949-05-01        121 1949
6  1949-06-01        135 1949
7  1949-07-01        148 1949
8  1949-08-01        148 1949
9  1949-09-01        136 1949
10 1949-10-01        119 1949
11 1949-11-01        104 1949
12 1949-12-01        118 1949
13 1950-01-01        115 1950
14 1950-02-01        126 1950
15 1950-03-01        141 1950
16 1950-04-01        135 1950
17 1950-05-01        125 1950
18 1950-06-01        149 1950
19 1950-07-01        170 1950
20 1950-08-01        170 1950

Step 3: Define a Function for Time-Series Line Graph

We define a function to create a time-series line graph where:

  • The x-axis represents time (Date).

  • The y-axis represents the number of passengers (Passengers).

  • Each year has a separate line to compare trends.

Function Inputs

  1. data – The dataset containing time-series data.

  2. x_col – The column representing time (Date).

  3. y_col – The column representing values (Passengers).

  4. group_col – The categorical variable for grouping (Year).

  5. title – Custom plot title.

Features of the Line Graph

  • Group-based visualization
  1. Each year has a distinct line color.

  2. The group aesthetic ensures lines are drawn separately for each year.

  • geom_line(linewidth = 1.2)
  1. Connects the observations within each group, which makes the trend visible.

  • geom_point(size = 2)
  1. Highlights individual data points.

  • theme_minimal() & theme(legend.position = "top")
  1. Enhances readability with a clean layout.

  2. Moves the legend to the top so it does not take width away from the plot.

# Function to plot time-series trend
plot_time_series <- function(data, x_col, y_col, group_col, title = "Air Passenger Trends") {
  # Column names arrive as strings, so look them up through the .data pronoun
  # (aes_string() has been deprecated since ggplot2 3.0.0)
  ggplot(data, aes(x = .data[[x_col]], y = .data[[y_col]],
                   color = .data[[group_col]], group = .data[[group_col]])) +
    geom_line(linewidth = 1.2) +  # Line graph (linewidth replaces size for lines since ggplot2 3.4.0)
    geom_point(size = 2) +   # Add points for clarity
    labs(title = title,
         x = "Year",
         y = "Number of Passengers",
         color = "Year") +  # Legend title
    theme_minimal() +
    theme(legend.position = "top")
}

# Call the function
plot_time_series(data, "Date", "Passengers", "Year", "Trend of Airline Passengers Over Time")
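
With Date on the x-axis the yearly lines sit end to end. To compare years directly, one option is to put the month of the year on the x-axis so the lines overlap. This is a sketch of that alternative view, not the layout used above; the names ap and p_overlay are assumptions, and the year calculation relies on the series starting in January.

```r
library(ggplot2)

# Month and year are derived from the ts object itself
ap <- data.frame(
  Month = factor(month.abb[as.integer(cycle(AirPassengers))], levels = month.abb),
  # Integer arithmetic: valid because the series starts in January
  Year = factor(start(AirPassengers)[1] +
                  (seq_along(AirPassengers) - 1) %/% frequency(AirPassengers)),
  Passengers = as.numeric(AirPassengers)
)

# group = Year is required so that lines connect points across the discrete x-axis
p_overlay <- ggplot(ap, aes(x = Month, y = Passengers, color = Year, group = Year)) +
  geom_line(linewidth = 1) +
  labs(title = "Monthly Airline Passengers, One Line per Year",
       x = "Month",
       y = "Number of Passengers",
       color = "Year") +
  theme_minimal() +
  theme(legend.position = "top")

print(p_overlay)
```

In this view the seasonal summer peak and the year-over-year growth are visible at the same time.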

Program 4

Develop a script in R to produce a bar graph displaying the frequency distribution of categorical data in a given dataset, grouped by a specific variable, using ggplot2.

Step 1: Load the necessary libraries

#load necessary library
library(ggplot2)

Step 2: Load the dataset

#Load dataset
data <- mtcars
head(data)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Step 3: Convert numerical data to categorical data

data$cyl <- as.factor(data$cyl)
data$gear <- as.factor(data$gear)

Step 4: Create a bar graph

#create a bar graph
ggplot(data, aes(x=cyl,fill = gear)) +
  geom_bar(position = "dodge") + #Grouped bar chart
  labs(title = "Frequency of Cylinders Grouped by Gear Type",
       x = "Number of Cylinders",
       y = "Count",
       fill = "Gears") + #Legend title
  theme_minimal()

Explanation of the Plot

X-Axis (cyl)

  • Displays cylinder categories (4, 6, 8 cylinders).

Y-Axis (Frequency Count)

  • Represents the number of cars in each category.

Color Fill (gear)

  • Differentiates cars based on number of gears (3, 4, 5 gears).

Grouped Bars (position = "dodge")

  • Ensures bars are side by side instead of stacked.

Minimal Theme (theme_minimal())

  • Provides a clean and readable layout.
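
The bar heights can be cross-checked with a contingency table. The sketch below also shows position_dodge(preserve = "single"), which keeps bars at a uniform width when a combination has no cars (here 8 cylinders with 4 gears); the names cars, counts and p_dodge are illustrative assumptions.

```r
library(ggplot2)

cars <- mtcars
cars$cyl  <- as.factor(cars$cyl)
cars$gear <- as.factor(cars$gear)

# Cross-tabulate: rows are cylinder counts, columns are gear counts
counts <- table(cyl = cars$cyl, gear = cars$gear)
print(counts)

# preserve = "single" keeps every bar the same width even when a group is empty
p_dodge <- ggplot(cars, aes(x = cyl, fill = gear)) +
  geom_bar(position = position_dodge(preserve = "single")) +
  labs(title = "Frequency of Cylinders Grouped by Gear Type",
       x = "Number of Cylinders",
       y = "Count",
       fill = "Gears") +
  theme_minimal()

print(p_dodge)
```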

Program 5

Implement an R program to create a histogram illustrating the distribution of a continuous variable, with overlays of density curves for each group, using ggplot2.

Step 1: Load Required Library

#Load ggplot2 package for visualization
library(ggplot2)

Step 2: Explore the Inbuilt Dataset

#Use the built-in 'iris' dataset
#'Petal.Length' is a continuous variable
#'Species' is a categorical grouping variable
str(iris)  #shows the structure of the dataset
'data.frame':   150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
head(iris)  #view the first few rows of the dataset
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

Step 3: Create Histogram with Group-wise Density Curves

Step 3.1: Initialize the ggplot with aesthetic mappings

#Start ggplot with iris dataset
#Map Petal.Length to x-axis and fill by Species (grouping variable)
p <- ggplot(data = iris, aes(x = Petal.Length, fill = Species))
p

Explanation:

This initializes the plot and tells ggplot to map:

Petal.Length (continuous variable) to the x-axis

Species (categorical) to fill aesthetic to distinguish groups

Step 3.2: Add Histogram Layer

# Add histogram with density scaling

p <- p + geom_histogram(aes(y = after_stat(density)),
         alpha = 0.4, # Set transparency
         position = "identity",# Overlap histograms
         bins = 30)            # Number of bins
p

Explanation:

aes(y = after_stat(density)) rescales each group's histogram so that its bar areas sum to 1, which puts it on the same scale as a density curve (the older ..density.. notation has been deprecated since ggplot2 3.4.0)

alpha = 0.4 makes bars semi-transparent so overlaps are visible

position = "identity" draws every group's bars from the baseline, so the histograms overlap instead of being stacked

bins = 30 sets the number of bins, which controls histogram resolution
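
The effect of density scaling can be checked numerically. The sketch below rebuilds the histogram layer on its own and uses layer_data() to read the computed bar heights; if the scaling works as described, the bar areas within each species should sum to 1. The names p_hist, bars and area_by_group are assumptions made for illustration.

```r
library(ggplot2)

p_hist <- ggplot(iris, aes(x = Petal.Length, fill = Species)) +
  geom_histogram(aes(y = after_stat(density)),
                 alpha = 0.4, position = "identity", bins = 30)

# layer_data() returns the values computed by the stat for each bar
bars <- layer_data(p_hist, 1)

# Area of a bar = height (density) * width; add up the areas within each group
area_by_group <- tapply(bars$density * (bars$xmax - bars$xmin), bars$group, sum)
print(area_by_group)
```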

Step 3.3: Add Density Curve Layer

#Overlay density curves for each group
p <- p +
  geom_density(aes(color = Species), # Line color by group
               linewidth = 1.2) #Line thickness
p

Explanation: This overlays smooth density curves for each species using color. The aes(color = Species) ensures each curve is colored by group.
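
For intuition about what the curves represent, a similar kernel density estimate can be computed per species with base R's density(). This is a rough sketch using density()'s default settings, so the numbers may differ slightly from the curve ggplot2 draws; the names dens and peaks are illustrative.

```r
# Kernel density estimate of Petal.Length for each species
dens <- lapply(split(iris$Petal.Length, iris$Species), density)

# x position of the highest point of each estimated curve
peaks <- sapply(dens, function(d) d$x[which.max(d$y)])
print(round(peaks, 2))
```

The peak positions show where each species' petal lengths are concentrated, which is what the overlaid curves display visually.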

Step 3.4: Add Labels and Theme

#Add title and axis labels, and apply clean theme

p <- p + labs(
  title = "Distribution of Petal Length with Group-wise Density Curves",
  x = "Petal Length",
  y = "Density") +
  theme_minimal()

Explanation:

  • labs() adds a title and axis labels

  • theme_minimal() applies a clean, modern plot style

Step 3.5: Display the Plot

# Finally, render the plot
p

Program 6

Write an R script to construct a box plot showcasing the distribution of a continuous variable, grouped by a categorical variable, using ggplot2’s fill aesthetic.

Step 1: Load the required library

# Load ggplot2 package for visualization
library(ggplot2)

Step 2: Explore the inbuilt dataset

# Use the built-in 'iris' dataset
# 'Petal.Width' is a continuous variable
# 'Species' is a categorical grouping variable

str(iris)  # View structure of the dataset
'data.frame':   150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
head(iris) # View sample data
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

Step 3: Construct Box Plot with Grouping

Step 3.1: Initialize the ggplot with aesthetic mappings

# Initialize ggplot with data and aesthetic mappings

p <- ggplot(data = iris, aes(x = Species, y = Petal.Width, fill = Species))

Explanation:

  • x = Species: Grouping variable (categorical)

  • y = Petal.Width: Continuous variable to show distribution

  • fill = Species: Fill box colors by species

Step 3.2: Add box plot layer

# Add the box plot layer

p <- p + geom_boxplot()

Explanation:

  • geom_boxplot() creates box plots for each group

  • Automatically shows median, quartiles, and outliers
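
The statistics a box plot displays can also be computed directly. The sketch below uses quantile() and the common 1.5 × IQR rule for flagging outliers; the name box_stats and its column names are assumptions made for illustration.

```r
library(dplyr)

box_stats <- iris %>%
  group_by(Species) %>%
  summarise(
    q1     = unname(quantile(Petal.Width, 0.25)),  # lower edge of the box
    median = median(Petal.Width),                  # line inside the box
    q3     = unname(quantile(Petal.Width, 0.75)),  # upper edge of the box
    # Points farther than 1.5 * IQR from the box edges count as outliers
    n_outliers = sum(Petal.Width < q1 - 1.5 * (q3 - q1) |
                       Petal.Width > q3 + 1.5 * (q3 - q1)),
    .groups = "drop"
  )

print(box_stats)
```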

Step 3.3: Add labels and theme

# Add title and labels and use a minimal theme

p <- p + labs(title = "Box Plot of Petal Width by Species",
              x = "Species",
              y = "Petal Width") +
         theme_minimal()

Explanation:

  • labs() adds a descriptive title and axis labels

  • theme_minimal() gives a clean, modern look

Step 3.4: Display the plot

# Render the final plot
p

Summary

  • Used the iris dataset

  • Visualized Petal.Width as a box plot

  • Grouped by Species

  • Used fill = Species for colorful grouping

  • Each box represents the distribution of values for one species

Program 7

Develop a function in R to plot a function curve based on a mathematical equation provided as input, with different curve styles for each group, using ggplot2.

Step 1: Load the required library

library(ggplot2)

Step 2: Create data for the functions

# Create a sequence of x values ranging from -2pi to 2pi
x <- seq(-2*pi, 2*pi, length.out = 500)

# Evaluate sin(x) and cos(x) over the x range
y1 <- sin(x)
y2 <- cos(x)

# Combine data into one data frame
df <- data.frame(
  x = rep(x, 2),                        # Repeat x values for both functions
  y = c(y1, y2),                        # Combine y values: first sin(x), then cos(x)
  group = rep(c("sin(x)", "cos(x)"), each = length(x))  # Label each row by function
)

Step 3: Plot the Function Curves

Step 3.1: Initialize the ggplot Object

# Start building the ggplot using the data frame and aesthetics

p <- ggplot(df, aes(x = x, y = y, color = group, linetype = group))

Step 3.2: Add the Line Geometry

# Add smooth lines to represent each function curve
p <- p + geom_line(linewidth = 1.2)

Step 3.3: Add Plot Labels

# Add title, axis labels, and legends

p <- p + labs(title = "Function Curves: sin(x) and cos(x)",
              x = "x",
              y = "y = f(x)",
              color = "Function",
              linetype = "Function")
p

Step 3.4: Apply a Clean Theme

# Use a clean and simple background theme
p <- p + theme_minimal()
p
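
The steps above can also be wrapped into a reusable function that accepts the equations as input. The following is a minimal sketch under stated assumptions: build_curve_data() and plot_function_curves() are hypothetical helper names, and the equations are passed as a named list of R functions.

```r
library(ggplot2)

# Hypothetical helper: evaluate each function on a shared grid of x values
build_curve_data <- function(funs, from, to, n = 500) {
  x <- seq(from, to, length.out = n)
  do.call(rbind, lapply(names(funs), function(name) {
    data.frame(x = x, y = funs[[name]](x), group = name)
  }))
}

# Hypothetical helper: one curve per function, distinguished by color and line type
plot_function_curves <- function(funs, from, to, n = 500, title = "Function Curves") {
  df <- build_curve_data(funs, from, to, n)
  ggplot(df, aes(x = x, y = y, color = group, linetype = group)) +
    geom_line(linewidth = 1.2) +
    labs(title = title, x = "x", y = "y = f(x)",
         color = "Function", linetype = "Function") +  # same title merges both legends
    theme_minimal()
}

# Usage: the names of the list become the legend labels
curves <- list("sin(x)" = sin, "cos(x)" = cos, "x^2 / 10" = function(x) x^2 / 10)
print(plot_function_curves(curves, -2 * pi, 2 * pi))
```

Keeping the data construction in its own helper makes it easy to inspect the values before plotting.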