Program 3

Author

Manoj

Implement an R function that generates a line graph showing the trend of a time-series dataset, with a separate line for each group, using the ggplot2 group aesthetic.

Steps:


Step 1: Loading the necessary Libraries

We load:

  • ggplot2: for creating the line plot
  • dplyr: for optional data handling (filtering and summarization)
  • tidyr: for optional reshaping
library(ggplot2)
library(dplyr)
Warning: package 'dplyr' was built under R version 4.5.2

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(tidyr)

Step 2: Load the built-in dataset AirPassengers and convert it into a data frame

AirPassengers is a time-series object, not a data frame. ggplot2 expects data in tabular format:

  • each row should be an observation
  • each column should be one variable
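Before converting, it can help to look at the time-series attributes, since they explain where the start date and monthly step of the date sequence come from. A minimal sketch using only base R and the built-in dataset:

```r
# Inspect the built-in time-series object (base R only)
class(AirPassengers)        # "ts": a time series, not a data frame
start(AirPassengers)        # first observation: year 1949, period 1 (January)
end(AirPassengers)          # last observation: year 1960, period 12 (December)
frequency(AirPassengers)    # 12 observations per year, i.e. monthly data
length(AirPassengers)       # 144 monthly values in total
```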

So we build a data frame with:

  • Date: a sequence of monthly dates (the first day of each month) from 01 Jan 1949 to 01 Dec 1960
  • Passengers: the numeric values from the time series
  • Year: extracted from the date and used as the grouping variable
# Create a monthly date sequence that matches AirPassengers length
date_seq = seq(as.Date("1949-01-01"), by="month", length.out= length(AirPassengers))

# Convert the time-series object into a data frame for ggplot2: one row per month, 1949 to 1960 (passenger counts are in thousands)

data = data.frame(
  Date= date_seq, 
  Passengers= as.numeric(AirPassengers),
  Year=as.factor(format(date_seq, "%Y"))
  )

head(data, n=15)
         Date Passengers Year
1  1949-01-01        112 1949
2  1949-02-01        118 1949
3  1949-03-01        132 1949
4  1949-04-01        129 1949
5  1949-05-01        121 1949
6  1949-06-01        135 1949
7  1949-07-01        148 1949
8  1949-08-01        148 1949
9  1949-09-01        136 1949
10 1949-10-01        119 1949
11 1949-11-01        104 1949
12 1949-12-01        118 1949
13 1950-01-01        115 1950
14 1950-02-01        126 1950
15 1950-03-01        141 1950

Step 3: Understand the data structure

In this step we check:

  • the type of each column (Date, numeric, factor, etc.)
  • the range of the Date column
  • how many observations there are per year
str(data)
'data.frame':   144 obs. of  3 variables:
 $ Date      : Date, format: "1949-01-01" "1949-02-01" ...
 $ Passengers: num  112 118 132 129 121 135 148 148 136 119 ...
 $ Year      : Factor w/ 12 levels "1949","1950",..: 1 1 1 1 1 1 1 1 1 1 ...
range(data$Date)
[1] "1949-01-01" "1960-12-01"
table(data$Year)

1949 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 
  12   12   12   12   12   12   12   12   12   12   12   12 
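The same per-year check can be extended with dplyr, which was loaded in Step 1 for optional summarization. The sketch below is one possible summary, not part of the original program; the name `yearly_summary` and its column names are illustrative assumptions.

```r
library(dplyr)

# Rebuild the data frame from Step 2 so this chunk runs on its own
date_seq = seq(as.Date("1949-01-01"), by = "month", length.out = length(AirPassengers))
data = data.frame(
  Date = date_seq,
  Passengers = as.numeric(AirPassengers),
  Year = as.factor(format(date_seq, "%Y"))
)

# One row per year: number of months, yearly total, and mean monthly value
# (yearly_summary, Months, Total and Mean are hypothetical names)
yearly_summary = data %>%
  group_by(Year) %>%
  summarise(
    Months = n(),
    Total = sum(Passengers),
    Mean = mean(Passengers)
  )

print(yearly_summary)
```

A rising Total from year to year is the overall trend that the grouped line plot in Step 4 makes visible.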

Step 4: Define a function to create a grouped Time-Series Line Graph

Why use a function?

A function lets us reuse the same plotting logic for other time series later.

Instead of rewriting the plot code again and again, we write it once and call it with different values.

Function inputs

  • data: the data frame with the time-series values
  • x_col: name of the time column, as a string (here, "Date")
  • y_col: name of the numeric column, as a string (here, "Passengers")
  • group_col: name of the grouping column, as a string (here, "Year")
  • title: plot title (optional; a default is provided)
plot_time_series = function(data, x_col, y_col, group_col, title="Air passengers trend analysis from 1949 to 1960")
  ggplot(data, 
         aes_string(
           x=x_col,
           y=y_col,
           color=group_col, 
           group=group_col))
plot_time_series(data, "Date", "Passengers", "Year")
Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
ℹ Please use tidy evaluation idioms with `aes()`.
ℹ See also `vignette("ggplot2-in-packages")` for more information.

plot_time_series = function(data, x_col, y_col, group_col, title="Air passengers trend analysis from 1949 to 1960")
  ggplot(data, 
         aes_string(
           x=x_col,
           y=y_col,
           color=group_col, 
           group=group_col
           )
         )+geom_line(size=3)
plot_time_series(data, "Date", "Passengers", "Year")
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.

plot_time_series = function(data, x_col, y_col, group_col, title="Air passengers trend analysis from 1949 to 1960")
  ggplot(data, 
         aes_string(
           x=x_col,
           y=y_col,
           color=group_col, 
           group=group_col
           )
         )+
  geom_line(linewidth=1)+
  geom_point(size=3, alpha=0.3)
plot_time_series = function(data, x_col, y_col, group_col, title="Air passengers trend analysis from 1949 to 1960")
  ggplot(data, 
         aes_string(
           x=x_col,
           y=y_col,
           color=group_col, 
           group=group_col
           )
         )+
  geom_line(linewidth=1)+
  geom_point(size=3, alpha=0.3)+
  labs(
    title=title,
    x='Date',
    y="Number of passengers",
    color="Year"
  )
plot_time_series(data, "Date", "Passengers", "Year")

plot_time_series = function(data, x_col, y_col, group_col, title="Air passengers trend analysis from 1949 to 1960")
  ggplot(data, 
         aes_string(
           x=x_col,
           y=y_col,
           color=group_col, 
           group=group_col
           )
         )+
  geom_line(linewidth=1)+
  geom_point(size=3, alpha=0.3)+
  labs(
    title=title,
    x='Date',
    y="Number of passengers",
    color="Year"
  )+theme_minimal()+theme(legend.position = "top")
plot_time_series(data, "Date", "Passengers", "Year")
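The deprecation warnings shown earlier point to two replacements: `aes()` with tidy evaluation instead of `aes_string()`, and `linewidth` instead of `size` for lines. The following is a hedged sketch of how the function could be written that way, not the original author's version. The name `plot_time_series_v2` and the parameters `line_width` and `point_size` are assumptions introduced for illustration; the axis and legend labels are taken from the column names so the function can be reused with other data.

```r
library(ggplot2)

# Rebuild the data frame from Step 2 so this chunk runs on its own
date_seq = seq(as.Date("1949-01-01"), by = "month", length.out = length(AirPassengers))
data = data.frame(
  Date = date_seq,
  Passengers = as.numeric(AirPassengers),
  Year = as.factor(format(date_seq, "%Y"))
)

# Hypothetical variant of plot_time_series(): column names are still passed
# as strings, but are looked up with the .data pronoun inside aes()
plot_time_series_v2 = function(data, x_col, y_col, group_col,
                               title = "Air passengers trend analysis from 1949 to 1960",
                               line_width = 1, point_size = 3) {
  ggplot(data,
         aes(x = .data[[x_col]],
             y = .data[[y_col]],
             color = .data[[group_col]],
             group = .data[[group_col]])) +
    geom_line(linewidth = line_width) +            # linewidth replaces size for lines
    geom_point(size = point_size, alpha = 0.3) +   # points still use size
    labs(title = title, x = x_col, y = y_col, color = group_col) +
    theme_minimal() +
    theme(legend.position = "top")
}

# Thinner lines and smaller points than the defaults
p = plot_time_series_v2(data, "Date", "Passengers", "Year",
                        line_width = 0.8, point_size = 2)
print(p)
```

Passing `x_col`, `y_col` and `group_col` to `labs()` keeps the labels in step with whatever columns are plotted, at the cost of less descriptive text than the hand-written labels above.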

Discussion questions

  • Why do we need group=Year if we already use color=Year?
  • What changes if we remove geom_point()?
  • What does the plot look like if we group by Month instead of Year (hint: you would need to create a Month column)?
  • Can you modify the function so that the line thickness and point size can be changed through extra parameters?
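As a starting point for the first and third questions, here is a small exploratory sketch. When the color variable is a factor, ggplot2 already forms one group per level, so `group` is redundant but harmless in that case. It becomes necessary when the color variable is continuous. The columns `YearNum` and `Month` and the plot names below are assumptions added for illustration.

```r
library(ggplot2)

# Rebuild the data frame from Step 2 so this chunk runs on its own
date_seq = seq(as.Date("1949-01-01"), by = "month", length.out = length(AirPassengers))
data = data.frame(
  Date = date_seq,
  Passengers = as.numeric(AirPassengers),
  Year = as.factor(format(date_seq, "%Y"))
)

# A numeric year gives a continuous color scale, which does not split the data
data$YearNum = as.numeric(format(data$Date, "%Y"))

p_color_only = ggplot(data, aes(x = Date, y = Passengers, color = YearNum)) +
  geom_line()
p_with_group = ggplot(data, aes(x = Date, y = Passengers,
                                color = YearNum, group = YearNum)) +
  geom_line()

# Count the groups ggplot2 actually uses when drawing the line layer
length(unique(layer_data(p_color_only, 1)$group))  # a single path through all points
length(unique(layer_data(p_with_group, 1)$group))  # one path per year

# Month column built from the month number, using the built-in month.abb
# so the labels do not depend on the system locale
data$Month = factor(month.abb[as.integer(format(data$Date, "%m"))],
                    levels = month.abb)

# One line per calendar month, followed across the years
p_month = ggplot(data, aes(x = YearNum, y = Passengers,
                           color = Month, group = Month)) +
  geom_line(linewidth = 1) +
  labs(x = "Year", y = "Number of passengers", color = "Month") +
  theme_minimal()
print(p_month)
```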