Stock Market Analysis and Prediction by Using a Regression Model

1. Introduction

The stock market is one of the most important components of the global financial system because it reflects the economic performance and investment activities of companies and investors. With the rapid growth of technology and data science, stock market data has become widely used for financial analysis, forecasting, and decision-making.

This project focuses on analyzing the historical stock market data of the NIFTY 50 index from the National Stock Exchange (NSE) India. The dataset contains daily stock prices and trading volumes of fifty major companies from January 1, 2000, to April 30, 2021. Using R programming, the analysis will involve accessing and preparing the dataset, conducting exploratory data analysis, visualizing trends in stock prices and trading activities, and applying correlation and regression analysis techniques.

The study aims to provide insights into stock market behavior, identify relationships between variables, and demonstrate how data analytics and statistical methods can support financial research and predictive analysis.

2. Task description

2.1. Downloading data and load them to R studio

After downloading the NIFTY-50 Stock Market Data (2000–2021), the datasets were successfully imported into RStudio for data management and analysis. The data loading process involved reading the file into R, checking the structure of the dataset, and preparing it for further statistical and financial analysis. This step made it possible to explore trends in stock prices, trading volumes, market performance, and other variables contained in the NIFTY-50 dataset over the 2000–2021 period.

file_1<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\stock_metadata.csv")
file_2<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\TATAMOTORS.csv")
file_3<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\TATASTEEL.csv")
file_4<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\TCS.csv")
file_5<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\TECHM.csv")
file_6<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\TITAN.csv")
file_7<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\ULTRACEMCO.csv")
file_8<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\UPL.csv")
file_9<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\VEDL.csv")
file_10<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\WIPRO.csv")
file_11<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\ZEEL.csv")
file_12<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\MM.csv")
file_13<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\NESTLEIND.csv")
file_14<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\NTPC.csv")
file_15<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\ONGC.csv")
file_16<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\POWERGRID.csv")
file_17<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\RELIANCE.csv")
file_18<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\SBIN.csv")
file_19<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\SHREECEM.csv")
file_20<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\SUNPHARMA.csv")
file_21<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\BHARTIARTL.csv")
file_22<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\BPCL.csv")
file_23<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\BRITANNIA.csv")
file_24<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\CIPLA.csv")
file_25<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\COALINDIA.csv")
file_26<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\DRREDDY.csv")
file_27<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\EICHERMOT.csv")
file_28<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\GAIL.csv")
file_29<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\GRASIM.csv")
file_30<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\HCLTECH.csv")
file_31<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\HDFC.csv")
file_32<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\HDFCBANK.csv")
file_33<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\HEROMOTOCO.csv")
file_34<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\HINDALCO.csv")
file_35<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\HINDUNILVR.csv")
file_36<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\ICICIBANK.csv")
file_37<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\INDUSINDBK.csv")
file_38<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\INFRATEL.csv")
file_39<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\INFY.csv")
file_40<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\IOC.csv")
file_41<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\ITC.csv")
file_42<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\JSWSTEEL.csv")
file_43<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\KOTAKBANK.csv")
file_44<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\LT.csv")
file_45<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\MARUTI.csv")
file_46<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\ADANIPORTS.csv")
file_47<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\ASIANPAINT.csv")
file_48<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\AXISBANK.csv")
file_49<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\BAJAJ-AUTO.csv")
file_50<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\BAJAJFINSV.csv")
file_51<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\BAJFINANCE.csv")

2.2. Exploratory data analysis

To better understand the structure and quality of the Zee Entertainment Enterprises Ltd stock dataset, a profiling function was created and implemented in RStudio. The function was designed to provide key descriptive information about the dataset, including the trading date range, identification of missing values, detection of outliers, display of the first five observations, examination of data types, and generation of summary statistics. Applying this function to the dataset helped assess data completeness, consistency, and overall readiness for further financial and statistical analysis.

# Here is to observe all dataset's observations
Total_observations_in_all_datasets <- function(start, end) {
  
  total <- 0
  
  for (i in start:end) {
    
    dataset_name <- paste0("file_", i)
    dataset <- get(dataset_name)
    
    total <- total + nrow(dataset)
  }
  
  return(total)
}

Total_observations_in_all_datasets(2, 51)
## [1] 235192
# Here is to make all stocks datasets in a single dataset
NIFTY50_combined_dataset<-rbind(file_2, file_3, file_4, file_5, file_6, file_7, file_8, file_9, file_10, file_11, file_12, file_13, file_14, file_15, file_16, file_17, file_18, file_19, file_20, file_21, file_22, file_23, file_24, file_25, file_26, file_27, file_28, file_29, file_30, file_31, file_32, file_33, file_34, file_35, file_36, file_37, file_38, file_39, file_40, file_41, file_42, file_43, file_44, file_45, file_46, file_47, file_48, file_49, file_50, file_51)

dim(NIFTY50_combined_dataset)
## [1] 235192     15
#Here is to merge the dataset of all stocks with the stock metadata
NIFTY50_combined_merged_dataset <- left_join(
  NIFTY50_combined_dataset,
  file_1,
  by = c("Symbol", "Series")
)
dim(NIFTY50_combined_merged_dataset)
## [1] 235192     18
# Split dataset by Symbol
all_symbols_data <- split(
  NIFTY50_combined_merged_dataset,
  NIFTY50_combined_merged_dataset$Symbol
)
# Function to profile stock data
profile_stock <- function(df) {

  # Trading date range
  start_date <- min(df$Date)
  end_date <- max(df$Date)

  # Get stock symbol
  stock_name <- unique(df$Symbol)

  # Missing values
  missing_values <- colSums(is.na(df))

  # Function to count outliers using IQR method
  count_outliers <- function(x) {
    if(is.numeric(x)) {
      Q1 <- quantile(x, 0.25, na.rm = TRUE)
      Q3 <- quantile(x, 0.75, na.rm = TRUE)
      IQR <- Q3 - Q1

      lower <- Q1 - 1.5 * IQR
      upper <- Q3 + 1.5 * IQR

      return(sum(x < lower | x > upper, na.rm = TRUE))
    } else {
      return(NA)
    }
  }

  # Apply outlier function to numeric columns
  outliers <- sapply(df, count_outliers)

  # Top 5 rows
  top_rows <- head(df, 5)

  # Data types
  data_types <- sapply(df, class)

  # Summary statistics
  summary_stats <- summary(df)

  # Display results
  cat("************************************************\n")
  cat(" STOCK PROFILE REPORT OF", stock_name, " \n")
  cat("************************************************\n\n")

  cat("1. Trading Date Range\n")
  cat(start_date, "to", end_date, "\n\n")

  cat("2. Missing Values\n")
  print(missing_values)
  cat("\n")

  cat("3. Number of Outliers\n")
  print(outliers)
  cat("\n")

  cat("4. Top 5 Rows\n")
  print(top_rows)
  cat("\n")

  cat("5. Data Types\n")
  print(data_types)
  cat("\n")

  cat("6. Summary Statistics\n")
  print(summary_stats)
}

# Apply the function
profile_stock(all_symbols_data$ZEEL)
## ************************************************
##  STOCK PROFILE REPORT OF ZEEL  
## ************************************************
## 
## 1. Trading Date Range
## 2007-02-28 to 2021-04-30 
## 
## 2. Missing Values
##               Date             Symbol             Series         Prev.Close 
##                  0                  0                  0                  0 
##               Open               High                Low               Last 
##                  0                  0                  0                  0 
##              Close               VWAP             Volume           Turnover 
##                  0                  0                  0                  0 
##             Trades Deliverable.Volume       X.Deliverble       Company.Name 
##               1054                  0                  0                  0 
##           Industry          ISIN.Code 
##                  0                  0 
## 
## 3. Number of Outliers
##               Date             Symbol             Series         Prev.Close 
##                 NA                 NA                 NA                  0 
##               Open               High                Low               Last 
##                  0                  0                  0                  0 
##              Close               VWAP             Volume           Turnover 
##                  0                  0                542                467 
##             Trades Deliverable.Volume       X.Deliverble       Company.Name 
##                227                302                  1                 NA 
##           Industry          ISIN.Code 
##                 NA                 NA 
## 
## 4. Top 5 Rows
##             Date Symbol Series Prev.Close  Open   High   Low   Last  Close
## 44527 2007-02-28   ZEEL     EQ     238.35 230.0 241.90 210.5 232.50 235.35
## 44528 2007-03-01   ZEEL     EQ     235.35 235.0 237.90 222.0 226.50 224.90
## 44529 2007-03-02   ZEEL     EQ     224.90 228.8 229.95 221.2 222.00 222.20
## 44530 2007-03-05   ZEEL     EQ     222.20 220.0 229.00 208.0 220.55 217.85
## 44531 2007-03-06   ZEEL     EQ     217.85 225.0 229.00 215.5 225.80 224.35
##         VWAP  Volume     Turnover Trades Deliverable.Volume X.Deliverble
## 44527 232.84 2481079 5.777055e+13     NA            1760261       0.7095
## 44528 224.80 1784382 4.011238e+13     NA            1269185       0.7113
## 44529 224.70 1656992 3.723216e+13     NA             890257       0.5373
## 44530 212.83 1958025 4.167331e+13     NA            1600050       0.8172
## 44531 220.02 1433199 3.153336e+13     NA             942609       0.6577
##                             Company.Name              Industry    ISIN.Code
## 44527 Zee Entertainment Enterprises Ltd. MEDIA & ENTERTAINMENT INE256A01028
## 44528 Zee Entertainment Enterprises Ltd. MEDIA & ENTERTAINMENT INE256A01028
## 44529 Zee Entertainment Enterprises Ltd. MEDIA & ENTERTAINMENT INE256A01028
## 44530 Zee Entertainment Enterprises Ltd. MEDIA & ENTERTAINMENT INE256A01028
## 44531 Zee Entertainment Enterprises Ltd. MEDIA & ENTERTAINMENT INE256A01028
## 
## 5. Data Types
##               Date             Symbol             Series         Prev.Close 
##        "character"        "character"        "character"          "numeric" 
##               Open               High                Low               Last 
##          "numeric"          "numeric"          "numeric"          "numeric" 
##              Close               VWAP             Volume           Turnover 
##          "numeric"          "numeric"          "integer"          "numeric" 
##             Trades Deliverable.Volume       X.Deliverble       Company.Name 
##          "numeric"          "numeric"          "numeric"        "character" 
##           Industry          ISIN.Code 
##        "character"        "character" 
## 
## 6. Summary Statistics
##         Date            Symbol           Series       Prev.Close    
##  Length   :3510   Length   :3510   Length   :3510   Min.   : 90.45  
##  N.unique :3510   N.unique :   1   N.unique :   1   1st Qu.:195.66  
##  N.blank  :   0   N.blank  :   0   N.blank  :   0   Median :276.23  
##  Min.nchar:  10   Min.nchar:   4   Min.nchar:   2   Mean   :297.53  
##  Max.nchar:  10   Max.nchar:   4   Max.nchar:   2   3rd Qu.:389.99  
##                                                     Max.   :613.40  
##                                                                     
##       Open            High             Low             Last       
##  Min.   : 92.0   Min.   : 96.25   Min.   : 88.0   Min.   : 90.15  
##  1st Qu.:196.0   1st Qu.:201.43   1st Qu.:190.7   1st Qu.:196.09  
##  Median :276.5   Median :281.80   Median :271.6   Median :276.27  
##  Mean   :297.9   Mean   :303.20   Mean   :292.2   Mean   :297.46  
##  3rd Qu.:390.7   3rd Qu.:396.02   3rd Qu.:384.2   3rd Qu.:389.68  
##  Max.   :616.0   Max.   :619.00   Max.   :607.5   Max.   :616.50  
##                                                                   
##      Close             VWAP            Volume             Turnover        
##  Min.   : 90.45   Min.   : 91.46   Min.   :     4415   Min.   :7.021e+10  
##  1st Qu.:195.43   1st Qu.:195.99   1st Qu.:  1140787   1st Qu.:2.759e+13  
##  Median :276.23   Median :276.52   Median :  1799690   Median :5.562e+13  
##  Mean   :297.52   Mean   :297.68   Mean   :  4902652   Mean   :1.297e+14  
##  3rd Qu.:389.99   3rd Qu.:389.45   3rd Qu.:  3425994   3rd Qu.:1.150e+14  
##  Max.   :613.40   Max.   :612.97   Max.   :165959680   Max.   :4.286e+15  
##                                                                           
##      Trades        Deliverable.Volume  X.Deliverble       Company.Name 
##  Min.   :    296   Min.   :    4415   Min.   :0.0557   Length   :3510  
##  1st Qu.:  24579   1st Qu.:  570122   1st Qu.:0.3771   N.unique :   1  
##  Median :  41074   Median :  991498   Median :0.5262   N.blank  :   0  
##  Mean   :  62646   Mean   : 1595467   Mean   :0.4973   Min.nchar:  34  
##  3rd Qu.:  71463   3rd Qu.: 1789056   3rd Qu.:0.6262   Max.nchar:  34  
##  Max.   :1088460   Max.   :42891428   Max.   :1.0000                   
##  NAs    :1054                                                          
##       Industry        ISIN.Code   
##  Length   :3510   Length   :3510  
##  N.unique :   1   N.unique :   1  
##  N.blank  :   0   N.blank  :   0  
##  Min.nchar:  21   Min.nchar:  12  
##  Max.nchar:  21   Max.nchar:  12  
##                                   
## 

An analysis of trading volume trends was conducted on the Zee Entertainment Enterprises Ltd stock dataset to identify periods of unusually high market activity and major events influencing stock trading behavior. Using RStudio, trading volume patterns were visualized over time, and abnormal spikes were highlighted to detect significant increases in investor activity. The analysis also identified and annotated important market events visible in the dataset, including the 2008 Global Financial Crisis, the 2016 India Demonetization announcement, and the 2020 COVID-19 market crash, all of which had noticeable impacts on trading volumes and market volatility.

# Convert Date column to Date format
all_symbols_data$ZEEL$Date <- as.Date(all_symbols_data$ZEEL$Date)

# Plot trading volume trends
ggplot(all_symbols_data$ZEEL, aes(x = Date, y = Volume)) +
  geom_line(color = "blue") +

  # Highlight unusual spikes
  geom_point(
    data = subset(
      all_symbols_data$ZEEL,
      Volume > quantile(Volume, 0.99, na.rm = TRUE)
    ),
    aes(x = Date, y = Volume),
    color = "red",
    size = 2
  ) +

  # Annotate major market events
  annotate(
    "text",
    x = as.Date("2008-09-15"),
    y = max(all_symbols_data$ZEEL$Volume, na.rm = TRUE) * 0.85,
    label = "2008 Global Financial Crisis",
    color = "darkred",
    size = 4
  ) +

  annotate(
    "text",
    x = as.Date("2020-03-24"),
    y = max(all_symbols_data$ZEEL$Volume, na.rm = TRUE) * 0.75,
    label = "COVID-19 Market Crash",
    color = "darkgreen",
    size = 4
  ) +

  annotate(
    "text",
    x = as.Date("2016-11-08"),
    y = max(all_symbols_data$ZEEL$Volume, na.rm = TRUE) * 0.65,
    label = "India Demonetization",
    color = "purple",
    size = 4
  ) +

  labs(
    title = "Trading Volume Trends and Unusual Spikes",
    x = "Date",
    y = "Trading Volume"
  ) +

  theme_minimal()

2.3. Visualization

Using ggplot2 in RStudio, several visualizations were developed to explore trends and patterns in the Zee Entertainment Enterprises Ltd stock dataset. The analysis included the visualization of closing price movements alongside percentage changes, examination of sales volume trends over time, and evaluation of short-term and medium-term market behavior using 15-day, 30-day, and 45-day moving averages of the closing price. In addition, a histogram was created to analyze the distribution of percentage changes in the previous closing price, helping to understand the frequency and variability of stock price fluctuations.

Closing Price Trend & Percentage Change
# Convert Date column
all_symbols_data$ZEEL$Date <- as.Date(all_symbols_data$ZEEL$Date)

# Percentage change in closing price
all_symbols_data$ZEEL$Pct_Change <- c(NA, diff(all_symbols_data$ZEEL$Close) / lag(all_symbols_data$ZEEL$Close)[-1] * 100)

# Closing price trend
ggplot(all_symbols_data$ZEEL, aes(x = Date, y = Close)) +
  geom_line(color = "blue") +
  labs(
    title = "Closing Price Trend",
    x = "Date",
    y = "Closing Price"
  ) +
  theme_minimal()

# Percentage change trend
ggplot(all_symbols_data$ZEEL, aes(x = Date, y = Pct_Change)) +
  geom_line(color = "red") +
  labs(
    title = "Percentage Change in Closing Price",
    x = "Date",
    y = "Percentage Change (%)"
  ) +
  theme_minimal()
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).

Sales Volume Trend
ggplot(all_symbols_data$ZEEL, aes(x = Date, y = Volume)) +
  geom_line(color = "darkgreen") +
  labs(
    title = "Sales Volume Trend",
    x = "Date",
    y = "Trading Volume"
  ) +
  theme_minimal()

15, 30, and 45 Days Moving Average Trend
# Convert Date column
all_symbols_data$ZEEL$Date <- as.Date(all_symbols_data$ZEEL$Date)

# Moving averages
all_symbols_data$ZEEL$MA15 <-
  rollmean(
    all_symbols_data$ZEEL$Close,
    15,
    fill = NA,
    align = "right"
  )

all_symbols_data$ZEEL$MA30 <-
  rollmean(
    all_symbols_data$ZEEL$Close,
    30,
    fill = NA,
    align = "right"
  )

all_symbols_data$ZEEL$MA45 <-
  rollmean(
    all_symbols_data$ZEEL$Close,
    45,
    fill = NA,
    align = "right"
  )

# Remove rows with NA values
zeel_ma <- na.omit(all_symbols_data$ZEEL)

# Plot
ggplot(zeel_ma, aes(x = Date)) +

  geom_line(aes(y = Close, color = "Close Price")) +

  geom_line(aes(y = MA15, color = "15-Day MA")) +

  geom_line(aes(y = MA30, color = "30-Day MA")) +

  geom_line(aes(y = MA45, color = "45-Day MA")) +

  labs(
    title = "Moving Average Trend of Closing Price",
    x = "Date",
    y = "Price",
    color = "Legend"
  ) +

  theme_minimal()

Histogram of Percentage Change for Prev Close
# Calculate percentage change
all_symbols_data$ZEEL$PrevClose_PctChange <-
  c(
    NA,
    diff(all_symbols_data$ZEEL$Prev.Close) /
      lag(all_symbols_data$ZEEL$Prev.Close)[-1] * 100
  )

# Remove NA values
zeel_hist <- na.omit(all_symbols_data$ZEEL)

# Histogram
ggplot(zeel_hist, aes(x = PrevClose_PctChange)) +

  geom_histogram(
    bins = 30,
    fill = "skyblue",
    color = "black"
  ) +

  labs(
    title = "Distribution of Percentage Change for Prev Close",
    x = "Percentage Change (%)",
    y = "Frequency"
  ) +

  theme_minimal()

2.4. Correlation Analysis

To understand the relationships among the numerical variables in the Zee Entertainment Enterprises Ltd stock dataset, a correlation analysis and visualization were conducted in RStudio using the variables Prev Close, Open, High, Low, Volume, and Close. The objective was to identify the strength and direction of relationships between stock market indicators and detect highly correlated variable pairs. Correlation analysis is important in financial modeling because strong relationships among predictors may lead to multicollinearity, which can negatively affect regression model performance and interpretation.

library(ggplot2)
library(corrplot)
## corrplot 0.95 loaded
# Select numerical variables
stock_data <- all_symbols_data$ZEEL[, c(
  "Prev.Close",
  "Open",
  "High",
  "Low",
  "Volume",
  "Close"
)]

# Correlation matrix
cor_matrix <- cor(stock_data, use = "complete.obs")

# Display correlation matrix
print(cor_matrix)
##            Prev.Close       Open       High        Low     Volume      Close
## Prev.Close  1.0000000  0.9995896  0.9989190  0.9985924 -0.1250299  0.9982408
## Open        0.9995896  1.0000000  0.9991795  0.9989109 -0.1233187  0.9984654
## High        0.9989190  0.9991795  1.0000000  0.9985761 -0.1134368  0.9992570
## Low         0.9985924  0.9989109  0.9985761  1.0000000 -0.1335865  0.9992283
## Volume     -0.1250299 -0.1233187 -0.1134368 -0.1335865  1.0000000 -0.1225318
## Close       0.9982408  0.9984654  0.9992570  0.9992283 -0.1225318  1.0000000
# Visualize correlations
corrplot(
  cor_matrix,
  method = "color",
  type = "upper",
  addCoef.col = "black",
  tl.col = "black",
  tl.srt = 45
)

The correlation analysis revealed strong positive relationships among the stock price variables, particularly between Prev Close, Open, High, Low, and Close. These variables are all directly related to daily stock price movements, so it is expected that they move together over time. In contrast, Volume showed a weaker relationship with the price-related variables, suggesting that trading activity does not always move proportionally with stock prices.

2.5. Regression Analysis

A regression analysis was conducted using the Zee Entertainment Enterprises Ltd stock dataset in RStudio to examine how different market variables influence the closing stock price. In this analysis, Close was treated as the dependent variable, while Prev Close, Open, High, Low, and Volume were used as independent variables. To evaluate the performance of the regression model, the dataset was divided into training and testing subsets using an 80%–20% split. The training set was used to build the model, while the test set was reserved for assessing predictive accuracy on unseen data.

# Set seed for reproducibility
set.seed(123)

# Select relevant variables
reg_data <- all_symbols_data$ZEEL[, c(
  "Close",
  "Prev.Close",
  "Open",
  "High",
  "Low",
  "Volume"
)]

# Remove missing values
reg_data <- na.omit(reg_data)

# Total number of observations
n <- nrow(reg_data)

# Create training indices (80%)
train_index <- sample(1:n, size = 0.8 * n)

# Split data
train_data <- reg_data[train_index, ]
test_data  <- reg_data[-train_index, ]

# Print sizes
cat("Training:", nrow(train_data), "\n")
## Training: 2808
cat("Test Sets:", nrow(test_data), "\n")
## Test Sets: 702

After splitting the Zee Entertainment Enterprises Ltd dataset into training and testing subsets, a multiple linear regression model was developed in RStudio to predict the closing stock price using Prev Close, Open, High, Low, and Volume as independent variables. The model was trained using the training dataset and then evaluated on the test dataset to measure its predictive performance. Three commonly used evaluation metrics were applied: Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and R-squared (R²). These metrics help assess prediction accuracy, average model error, and the proportion of variation in the closing price explained by the regression model.

# Train regression model
reg_model <- lm(
  Close ~ Prev.Close + Open + High + Low + Volume,
  data = train_data
)

# Model summary
summary(reg_model)
## 
## Call:
## lm(formula = Close ~ Prev.Close + Open + High + Low + Volume, 
##     data = train_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -22.1459  -1.5783   0.1155   1.5911  28.3699 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -8.803e-01  1.525e-01  -5.774 8.59e-09 ***
## Prev.Close  -1.652e-02  1.928e-02  -0.857   0.3915    
## Open        -4.435e-01  2.138e-02 -20.745  < 2e-16 ***
## High         8.453e-01  1.200e-02  70.411  < 2e-16 ***
## Low          6.136e-01  1.100e-02  55.795  < 2e-16 ***
## Volume      -1.076e-08  6.236e-09  -1.726   0.0845 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.055 on 2802 degrees of freedom
## Multiple R-squared:  0.9995, Adjusted R-squared:  0.9995 
## F-statistic: 1.072e+06 on 5 and 2802 DF,  p-value: < 2.2e-16
# Predictions on test data
predictions <- predict(reg_model, newdata = test_data)

# Actual values
actual <- test_data$Close

# Performance Metrics

# 1. Mean Absolute Error (MAE)
MAE <- mean(abs(actual - predictions))

# 2. Root Mean Squared Error (RMSE)
RMSE <- sqrt(mean((actual - predictions)^2))

# 3. R-squared (R²)
R2 <- 1 - sum((actual - predictions)^2) /
  sum((actual - mean(actual))^2)

# Print results
cat("MAE :", MAE, "\n")
## MAE : 2.184231
cat("RMSE:", RMSE, "\n")
## RMSE: 3.639329
cat("R-squared:", R2, "\n")
## R-squared: 0.9992422

The regression model was able to predict the closing stock price using historical stock market indicators with a high level of accuracy. The MAE measured the average absolute prediction error, while the RMSE provided insight into the magnitude of larger prediction errors. The R² value indicated how much of the variation in the closing price was explained by the independent variables included in the model.

Because the variables Prev Close, Open, High, and Low are strongly correlated with Close, the model is expected to achieve a high R² value. However, the strong correlations among these predictors may also introduce multicollinearity, which can affect the stability and interpretability of regression coefficients. Despite this limitation, the model remains useful for understanding overall relationships and short-term stock price prediction.

To evaluate the predictive accuracy of the regression model, a scatter plot comparing actual and predicted closing prices was created using the test dataset. A 45-degree reference line representing perfect prediction was added to the visualization to assess how closely the predicted values matched the actual stock closing prices. This graphical analysis helps identify prediction accuracy, systematic bias, and the overall reliability of the regression model.

# Create dataframe for plotting
plot_data <- data.frame(
  Actual = actual,
  Predicted = predictions
)

# Scatter plot
ggplot(plot_data, aes(x = Actual, y = Predicted)) +

  geom_point(color = "blue", alpha = 0.6) +

  # 45-degree reference line
  geom_abline(
    slope = 1,
    intercept = 0,
    color = "red",
    linetype = "dashed",
    linewidth = 1
  ) +

  labs(
    title = "Actual vs Predicted Closing Prices",
    x = "Actual Closing Prices",
    y = "Predicted Closing Prices"
  ) +

  theme_minimal()

The 45-degree reference line represents perfect predictions, where the predicted closing prices are exactly equal to the actual closing prices. Points located very close to this line indicate accurate model predictions, while points farther away represent prediction errors. If the scatter points are tightly clustered around the line, it suggests that the regression model performs well and captures the relationship between the independent variables and the closing price effectively. A wider spread around the line indicates larger prediction errors and lower predictive accuracy. In this analysis, most observations are expected to lie near the reference line because variables such as Prev Close, Open, High, and Low are strongly correlated with the closing price. However, any noticeable deviations from the line may indicate periods of unusual market volatility, sudden trading activity, or limitations caused by multicollinearity among predictors.

To improve comparability among variables and prepare the data for machine learning and regression analysis, the independent variables in the Zee Entertainment Enterprises Ltd dataset were standardized. Standardization transforms each variable so that it has a mean of 0 and a standard deviation of 1. This process is especially useful when variables are measured on different scales, such as stock prices and trading volume, because it prevents variables with larger numerical ranges from dominating the analysis. After scaling, the mean and standard deviation of each variable were examined to verify that the standardization process was successful.

# Select independent variables
independent_vars <- all_symbols_data$ZEEL[, c(
  "Prev.Close",
  "Open",
  "High",
  "Low",
  "Volume"
)]

# Remove missing values
independent_vars <- na.omit(independent_vars)

# Standardize variables
scaled_data <- scale(independent_vars)

# Convert to dataframe
scaled_data <- as.data.frame(scaled_data)

# Means after scaling
scaled_means <- colMeans(scaled_data)

# Standard deviations after scaling
scaled_sds <- apply(scaled_data, 2, sd)

# Print results
cat("Means after scaling:\n")
## Means after scaling:
print(scaled_means)
##    Prev.Close          Open          High           Low        Volume 
## -1.978079e-16  7.296316e-17 -1.707561e-16  5.968635e-17 -1.247024e-17
cat("\nStandard deviations after scaling:\n")
## 
## Standard deviations after scaling:
print(scaled_sds)
## Prev.Close       Open       High        Low     Volume 
##          1          1          1          1          1

The results show that the standardized variables have means very close to 0 and standard deviations close to 1, confirming that the scaling process was correctly applied. Minor differences from exactly 0 and 1 are normal due to rounding errors in computation.

Standardization is important because it ensures that all independent variables contribute proportionally to the analysis, regardless of their original units or magnitudes. This is particularly relevant in stock market datasets where variables such as trading volume may have values that are much larger than stock prices.

To reduce dimensionality and address the issue of multicollinearity among the independent variables, Principal Component Analysis (PCA) was applied to the Zee Entertainment Enterprises Ltd dataset in RStudio. PCA transforms the original correlated variables into a smaller set of uncorrelated components while preserving most of the variation in the data. In this analysis, the independent variables were reduced to two principal components, and a Linear Regression model was trained using the PCA-transformed training dataset. The performance of the model was then evaluated on the test dataset using the R² metric.

# Independent variables
X <- all_symbols_data$ZEEL[, c(
  "Prev.Close",
  "Open",
  "High",
  "Low",
  "Volume"
)]

# Dependent variable
y <- all_symbols_data$ZEEL$Close

# Remove missing values
data_pca <- na.omit(data.frame(X, y))

# Separate X and y
X <- data_pca[, 1:5]
y <- data_pca$y

# Standardize X
X_scaled <- scale(X)

# Split data into training and testing sets
set.seed(123)

n <- nrow(X_scaled)

train_index <- sample(1:n, size = 0.8 * n)

X_train <- X_scaled[train_index, ]
X_test  <- X_scaled[-train_index, ]

y_train <- y[train_index]
y_test  <- y[-train_index]

# Apply PCA
pca_model <- prcomp(X_train, center = TRUE, scale. = TRUE)

# Keep first 2 principal components
train_pca <- pca_model$x[, 1:2]

# Transform test data
test_pca <- predict(pca_model, newdata = X_test)[, 1:2]

# Train Linear Regression model
pca_reg_model <- lm(y_train ~ ., data = as.data.frame(train_pca))

# Predictions
pred_pca <- predict(
  pca_reg_model,
  newdata = as.data.frame(test_pca)
)

# Compute R-squared
R2_pca <- 1 - sum((y_test - pred_pca)^2) /
  sum((y_test - mean(y_test))^2)

# Print R2
cat("R-squared on Test Set:", R2_pca, "\n")
## R-squared on Test Set: 0.9978319

The PCA-based regression model used two principal components to summarize the information contained in the original independent variables. By reducing the dimensionality of the dataset, PCA helped minimize multicollinearity among predictors while retaining most of the important variation in the data.

The R² value obtained on the test set indicates how well the PCA-transformed variables explain variations in the closing stock price. A high R² suggests that the two principal components successfully captured most of the relevant information from the original variables, while a lower R² would indicate some loss of predictive information during dimensionality reduction. Overall, PCA provides a useful approach for simplifying complex financial datasets and improving model stability.

Comparison of the PCA Model and the Raw Regression Model

The performance of the PCA-based regression model was compared with the original multiple linear regression model developed using the raw independent variables (Prev Close, Open, High, Low, and Volume) in RStudio. The comparison was based on the R² metric, which measures the proportion of variation in the closing stock price explained by the model. In most cases, the raw regression model is expected to produce a higher R² value than the PCA-based model. This is because the raw model uses the original variables directly, preserving all available information relevant to predicting the closing stock price. Since variables such as Prev Close, Open, High, and Low are highly correlated with Close, the model can explain a large proportion of the variation in the dependent variable. On the other hand, the PCA model reduces the dataset to only two principal components. Although PCA removes multicollinearity and simplifies the dataset, some information from the original variables may be lost during dimensionality reduction. As a result, the PCA model may produce a slightly lower R² value compared to the raw regression model. However, the PCA model has important advantages:

it reduces multicollinearity among predictors,

improves model stability,

simplifies interpretation of complex datasets,

and may generalize better when working with highly correlated variables.

Therefore:

If the raw model has a higher R², it means the original variables provide more detailed predictive information.

If the PCA model performs similarly, it suggests that the first two principal components successfully captured most of the important information in the dataset.

Overall, the raw regression model often performs better in terms of predictive accuracy, while the PCA model provides a more compact and statistically stable representation of the data.

2.6. Sales volume comparisons

A comparative analysis of annual trading volume was conducted for Zee Entertainment Enterprises Ltd and Vedanta Ltd using RStudio. The total yearly trading volume for each company was calculated and visualized using a clustered bar chart to compare sales activity across different years. This visualization helps identify differences in market participation, trading intensity, and relative performance between the two companies over time. The analysis also focused on identifying the company with the highest trading volume in 2018, the company with the lowest trading volume in 2021, and the ratio of Zee Entertainment Enterprises Ltd sales volume to Vedanta Ltd sales volume during 2021.

# Convert Date columns
all_symbols_data$ZEEL$Date <- as.Date(all_symbols_data$ZEEL$Date)
all_symbols_data$VEDL$Date <- as.Date(all_symbols_data$VEDL$Date)

# Extract Year
all_symbols_data$ZEEL$Year <- format(all_symbols_data$ZEEL$Date, "%Y")
all_symbols_data$VEDL$Year <- format(all_symbols_data$VEDL$Date, "%Y")

# Annual total trading volume
zeel_volume <- all_symbols_data$ZEEL %>%
  group_by(Year) %>%
  summarise(Total_Volume = sum(Volume, na.rm = TRUE)) %>%
  mutate(Company = "ZEEL")

vedl_volume <- all_symbols_data$VEDL %>%
  group_by(Year) %>%
  summarise(Total_Volume = sum(Volume, na.rm = TRUE)) %>%
  mutate(Company = "VEDL")

# Combine datasets
volume_data <- rbind(zeel_volume, vedl_volume)

# Clustered bar chart
ggplot(volume_data,
       aes(x = Year,
           y = Total_Volume,
           fill = Company)) +

  geom_bar(stat = "identity",
           position = "dodge") +

  labs(
    title = "Annual Trading Volume Comparison",
    x = "Year",
    y = "Total Trading Volume"
  ) +

  theme_minimal()

# -----------------------------
# Questions
# -----------------------------

# Highest volume during 2018
volume_2018 <- subset(volume_data, Year == "2018")
print(volume_2018)
## # A tibble: 2 × 3
##   Year  Total_Volume Company
##   <chr>        <dbl> <chr>  
## 1 2018     598775408 ZEEL   
## 2 2018    3276296751 VEDL
# Least volume during 2021
volume_2021 <- subset(volume_data, Year == "2021")
print(volume_2021)
## # A tibble: 2 × 3
##   Year  Total_Volume Company
##   <chr>        <dbl> <chr>  
## 1 2021    1302490317 ZEEL   
## 2 2021    1430971315 VEDL
# Ratio of ZEEL to VEDL sales during 2021
zeel_2021 <- volume_2021$Total_Volume[
  volume_2021$Company == "ZEEL"
]

vedl_2021 <- volume_2021$Total_Volume[
  volume_2021$Company == "VEDL"
]

ratio_2021 <- zeel_2021 / vedl_2021

cat("Ratio of ZEEL to VEDL sales in 2021:",
    ratio_2021)
## Ratio of ZEEL to VEDL sales in 2021: 0.9102141

The clustered bar chart provides a clear comparison of annual trading volumes between Zee Entertainment Enterprises Ltd and Vedanta Ltd across different years. By comparing the heights of the bars, it becomes possible to identify years with unusually high or low trading activity for each company.

The company with the highest trading volume during 2018 can be identified by comparing the 2018 bars in the chart and the printed results.

The company with the lowest trading volume during 2021 is determined from the 2021 annual totals.

The ratio of Zee Entertainment Enterprises Ltd sales volume to Vedanta Ltd sales volume during 2021 indicates how the trading activity of the two companies compares relative to each other during that year.

A ratio:

greater than 1 indicates ZEEL traded more volume than VEDL,

less than 1 indicates VEDL traded more volume than ZEEL,

and equal to 1 would indicate equal trading volume.