The stock market is one of the most important components of the global financial system because it reflects the economic performance and investment activities of companies and investors. With the rapid growth of technology and data science, stock market data has become widely used for financial analysis, forecasting, and decision-making.
This project focuses on analyzing the historical stock market data of the NIFTY 50 index from the National Stock Exchange (NSE) India. The dataset contains daily stock prices and trading volumes of fifty major companies from January 1, 2000, to April 30, 2021. Using R programming, the analysis will involve accessing and preparing the dataset, conducting exploratory data analysis, visualizing trends in stock prices and trading activities, and applying correlation and regression analysis techniques.
The study aims to provide insights into stock market behavior, identify relationships between variables, and demonstrate how data analytics and statistical methods can support financial research and predictive analysis.
After downloading the NIFTY-50 Stock Market Data (2000–2021), the datasets were successfully imported into RStudio for data management and analysis. The data loading process involved reading the file into R, checking the structure of the dataset, and preparing it for further statistical and financial analysis. This step made it possible to explore trends in stock prices, trading volumes, market performance, and other variables contained in the NIFTY-50 dataset over the 2000–2021 period.
file_1<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\stock_metadata.csv")
file_2<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\TATAMOTORS.csv")
file_3<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\TATASTEEL.csv")
file_4<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\TCS.csv")
file_5<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\TECHM.csv")
file_6<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\TITAN.csv")
file_7<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\ULTRACEMCO.csv")
file_8<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\UPL.csv")
file_9<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\VEDL.csv")
file_10<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\WIPRO.csv")
file_11<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\ZEEL.csv")
file_12<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\MM.csv")
file_13<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\NESTLEIND.csv")
file_14<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\NTPC.csv")
file_15<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\ONGC.csv")
file_16<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\POWERGRID.csv")
file_17<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\RELIANCE.csv")
file_18<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\SBIN.csv")
file_19<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\SHREECEM.csv")
file_20<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\SUNPHARMA.csv")
file_21<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\BHARTIARTL.csv")
file_22<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\BPCL.csv")
file_23<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\BRITANNIA.csv")
file_24<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\CIPLA.csv")
file_25<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\COALINDIA.csv")
file_26<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\DRREDDY.csv")
file_27<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\EICHERMOT.csv")
file_28<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\GAIL.csv")
file_29<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\GRASIM.csv")
file_30<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\HCLTECH.csv")
file_31<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\HDFC.csv")
file_32<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\HDFCBANK.csv")
file_33<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\HEROMOTOCO.csv")
file_34<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\HINDALCO.csv")
file_35<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\HINDUNILVR.csv")
file_36<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\ICICIBANK.csv")
file_37<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\INDUSINDBK.csv")
file_38<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\INFRATEL.csv")
file_39<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\INFY.csv")
file_40<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\IOC.csv")
file_41<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\ITC.csv")
file_42<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\JSWSTEEL.csv")
file_43<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\KOTAKBANK.csv")
file_44<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\LT.csv")
file_45<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\MARUTI.csv")
file_46<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\ADANIPORTS.csv")
file_47<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\ASIANPAINT.csv")
file_48<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\AXISBANK.csv")
file_49<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\BAJAJ-AUTO.csv")
file_50<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\BAJAJFINSV.csv")
file_51<-read.csv("C:\\Users\\NISR\\Downloads\\archive (1)\\BAJFINANCE.csv")
To better understand the structure and quality of the Zee Entertainment Enterprises Ltd stock dataset, a profiling function was created and implemented in RStudio. The function was designed to provide key descriptive information about the dataset, including the trading date range, identification of missing values, detection of outliers, display of the first five observations, examination of data types, and generation of summary statistics. Applying this function to the dataset helped assess data completeness, consistency, and overall readiness for further financial and statistical analysis.
# Here is to observe all dataset's observations
Total_observations_in_all_datasets <- function(start, end) {
total <- 0
for (i in start:end) {
dataset_name <- paste0("file_", i)
dataset <- get(dataset_name)
total <- total + nrow(dataset)
}
return(total)
}
Total_observations_in_all_datasets(2, 51)
## [1] 235192
# Here is to make all stocks datasets in a single dataset
NIFTY50_combined_dataset<-rbind(file_2, file_3, file_4, file_5, file_6, file_7, file_8, file_9, file_10, file_11, file_12, file_13, file_14, file_15, file_16, file_17, file_18, file_19, file_20, file_21, file_22, file_23, file_24, file_25, file_26, file_27, file_28, file_29, file_30, file_31, file_32, file_33, file_34, file_35, file_36, file_37, file_38, file_39, file_40, file_41, file_42, file_43, file_44, file_45, file_46, file_47, file_48, file_49, file_50, file_51)
dim(NIFTY50_combined_dataset)
## [1] 235192 15
#Here is to merge the dataset of all stocks with the stock metadata
NIFTY50_combined_merged_dataset <- left_join(
NIFTY50_combined_dataset,
file_1,
by = c("Symbol", "Series")
)
dim(NIFTY50_combined_merged_dataset)
## [1] 235192 18
# Split dataset by Symbol
all_symbols_data <- split(
NIFTY50_combined_merged_dataset,
NIFTY50_combined_merged_dataset$Symbol
)
# Function to profile stock data
profile_stock <- function(df) {
# Trading date range
start_date <- min(df$Date)
end_date <- max(df$Date)
# Get stock symbol
stock_name <- unique(df$Symbol)
# Missing values
missing_values <- colSums(is.na(df))
# Function to count outliers using IQR method
count_outliers <- function(x) {
if(is.numeric(x)) {
Q1 <- quantile(x, 0.25, na.rm = TRUE)
Q3 <- quantile(x, 0.75, na.rm = TRUE)
IQR <- Q3 - Q1
lower <- Q1 - 1.5 * IQR
upper <- Q3 + 1.5 * IQR
return(sum(x < lower | x > upper, na.rm = TRUE))
} else {
return(NA)
}
}
# Apply outlier function to numeric columns
outliers <- sapply(df, count_outliers)
# Top 5 rows
top_rows <- head(df, 5)
# Data types
data_types <- sapply(df, class)
# Summary statistics
summary_stats <- summary(df)
# Display results
cat("************************************************\n")
cat(" STOCK PROFILE REPORT OF", stock_name, " \n")
cat("************************************************\n\n")
cat("1. Trading Date Range\n")
cat(start_date, "to", end_date, "\n\n")
cat("2. Missing Values\n")
print(missing_values)
cat("\n")
cat("3. Number of Outliers\n")
print(outliers)
cat("\n")
cat("4. Top 5 Rows\n")
print(top_rows)
cat("\n")
cat("5. Data Types\n")
print(data_types)
cat("\n")
cat("6. Summary Statistics\n")
print(summary_stats)
}
# Apply the function
profile_stock(all_symbols_data$ZEEL)
## ************************************************
## STOCK PROFILE REPORT OF ZEEL
## ************************************************
##
## 1. Trading Date Range
## 2007-02-28 to 2021-04-30
##
## 2. Missing Values
## Date Symbol Series Prev.Close
## 0 0 0 0
## Open High Low Last
## 0 0 0 0
## Close VWAP Volume Turnover
## 0 0 0 0
## Trades Deliverable.Volume X.Deliverble Company.Name
## 1054 0 0 0
## Industry ISIN.Code
## 0 0
##
## 3. Number of Outliers
## Date Symbol Series Prev.Close
## NA NA NA 0
## Open High Low Last
## 0 0 0 0
## Close VWAP Volume Turnover
## 0 0 542 467
## Trades Deliverable.Volume X.Deliverble Company.Name
## 227 302 1 NA
## Industry ISIN.Code
## NA NA
##
## 4. Top 5 Rows
## Date Symbol Series Prev.Close Open High Low Last Close
## 44527 2007-02-28 ZEEL EQ 238.35 230.0 241.90 210.5 232.50 235.35
## 44528 2007-03-01 ZEEL EQ 235.35 235.0 237.90 222.0 226.50 224.90
## 44529 2007-03-02 ZEEL EQ 224.90 228.8 229.95 221.2 222.00 222.20
## 44530 2007-03-05 ZEEL EQ 222.20 220.0 229.00 208.0 220.55 217.85
## 44531 2007-03-06 ZEEL EQ 217.85 225.0 229.00 215.5 225.80 224.35
## VWAP Volume Turnover Trades Deliverable.Volume X.Deliverble
## 44527 232.84 2481079 5.777055e+13 NA 1760261 0.7095
## 44528 224.80 1784382 4.011238e+13 NA 1269185 0.7113
## 44529 224.70 1656992 3.723216e+13 NA 890257 0.5373
## 44530 212.83 1958025 4.167331e+13 NA 1600050 0.8172
## 44531 220.02 1433199 3.153336e+13 NA 942609 0.6577
## Company.Name Industry ISIN.Code
## 44527 Zee Entertainment Enterprises Ltd. MEDIA & ENTERTAINMENT INE256A01028
## 44528 Zee Entertainment Enterprises Ltd. MEDIA & ENTERTAINMENT INE256A01028
## 44529 Zee Entertainment Enterprises Ltd. MEDIA & ENTERTAINMENT INE256A01028
## 44530 Zee Entertainment Enterprises Ltd. MEDIA & ENTERTAINMENT INE256A01028
## 44531 Zee Entertainment Enterprises Ltd. MEDIA & ENTERTAINMENT INE256A01028
##
## 5. Data Types
## Date Symbol Series Prev.Close
## "character" "character" "character" "numeric"
## Open High Low Last
## "numeric" "numeric" "numeric" "numeric"
## Close VWAP Volume Turnover
## "numeric" "numeric" "integer" "numeric"
## Trades Deliverable.Volume X.Deliverble Company.Name
## "numeric" "numeric" "numeric" "character"
## Industry ISIN.Code
## "character" "character"
##
## 6. Summary Statistics
## Date Symbol Series Prev.Close
## Length :3510 Length :3510 Length :3510 Min. : 90.45
## N.unique :3510 N.unique : 1 N.unique : 1 1st Qu.:195.66
## N.blank : 0 N.blank : 0 N.blank : 0 Median :276.23
## Min.nchar: 10 Min.nchar: 4 Min.nchar: 2 Mean :297.53
## Max.nchar: 10 Max.nchar: 4 Max.nchar: 2 3rd Qu.:389.99
## Max. :613.40
##
## Open High Low Last
## Min. : 92.0 Min. : 96.25 Min. : 88.0 Min. : 90.15
## 1st Qu.:196.0 1st Qu.:201.43 1st Qu.:190.7 1st Qu.:196.09
## Median :276.5 Median :281.80 Median :271.6 Median :276.27
## Mean :297.9 Mean :303.20 Mean :292.2 Mean :297.46
## 3rd Qu.:390.7 3rd Qu.:396.02 3rd Qu.:384.2 3rd Qu.:389.68
## Max. :616.0 Max. :619.00 Max. :607.5 Max. :616.50
##
## Close VWAP Volume Turnover
## Min. : 90.45 Min. : 91.46 Min. : 4415 Min. :7.021e+10
## 1st Qu.:195.43 1st Qu.:195.99 1st Qu.: 1140787 1st Qu.:2.759e+13
## Median :276.23 Median :276.52 Median : 1799690 Median :5.562e+13
## Mean :297.52 Mean :297.68 Mean : 4902652 Mean :1.297e+14
## 3rd Qu.:389.99 3rd Qu.:389.45 3rd Qu.: 3425994 3rd Qu.:1.150e+14
## Max. :613.40 Max. :612.97 Max. :165959680 Max. :4.286e+15
##
## Trades Deliverable.Volume X.Deliverble Company.Name
## Min. : 296 Min. : 4415 Min. :0.0557 Length :3510
## 1st Qu.: 24579 1st Qu.: 570122 1st Qu.:0.3771 N.unique : 1
## Median : 41074 Median : 991498 Median :0.5262 N.blank : 0
## Mean : 62646 Mean : 1595467 Mean :0.4973 Min.nchar: 34
## 3rd Qu.: 71463 3rd Qu.: 1789056 3rd Qu.:0.6262 Max.nchar: 34
## Max. :1088460 Max. :42891428 Max. :1.0000
## NAs :1054
## Industry ISIN.Code
## Length :3510 Length :3510
## N.unique : 1 N.unique : 1
## N.blank : 0 N.blank : 0
## Min.nchar: 21 Min.nchar: 12
## Max.nchar: 21 Max.nchar: 12
##
##
An analysis of trading volume trends was conducted on the Zee Entertainment Enterprises Ltd stock dataset to identify periods of unusually high market activity and major events influencing stock trading behavior. Using RStudio, trading volume patterns were visualized over time, and abnormal spikes were highlighted to detect significant increases in investor activity. The analysis also identified and annotated important market events visible in the dataset, including the 2008 Global Financial Crisis, the 2016 India Demonetization announcement, and the 2020 COVID-19 market crash, all of which had noticeable impacts on trading volumes and market volatility.
# Convert Date column to Date format
all_symbols_data$ZEEL$Date <- as.Date(all_symbols_data$ZEEL$Date)
# Plot trading volume trends
ggplot(all_symbols_data$ZEEL, aes(x = Date, y = Volume)) +
geom_line(color = "blue") +
# Highlight unusual spikes
geom_point(
data = subset(
all_symbols_data$ZEEL,
Volume > quantile(Volume, 0.99, na.rm = TRUE)
),
aes(x = Date, y = Volume),
color = "red",
size = 2
) +
# Annotate major market events
annotate(
"text",
x = as.Date("2008-09-15"),
y = max(all_symbols_data$ZEEL$Volume, na.rm = TRUE) * 0.85,
label = "2008 Global Financial Crisis",
color = "darkred",
size = 4
) +
annotate(
"text",
x = as.Date("2020-03-24"),
y = max(all_symbols_data$ZEEL$Volume, na.rm = TRUE) * 0.75,
label = "COVID-19 Market Crash",
color = "darkgreen",
size = 4
) +
annotate(
"text",
x = as.Date("2016-11-08"),
y = max(all_symbols_data$ZEEL$Volume, na.rm = TRUE) * 0.65,
label = "India Demonetization",
color = "purple",
size = 4
) +
labs(
title = "Trading Volume Trends and Unusual Spikes",
x = "Date",
y = "Trading Volume"
) +
theme_minimal()
Using ggplot2 in RStudio, several visualizations were developed to explore trends and patterns in the Zee Entertainment Enterprises Ltd stock dataset. The analysis included the visualization of closing price movements alongside percentage changes, examination of sales volume trends over time, and evaluation of short-term and medium-term market behavior using 15-day, 30-day, and 45-day moving averages of the closing price. In addition, a histogram was created to analyze the distribution of percentage changes in the previous closing price, helping to understand the frequency and variability of stock price fluctuations.
# Convert Date column
all_symbols_data$ZEEL$Date <- as.Date(all_symbols_data$ZEEL$Date)
# Percentage change in closing price
all_symbols_data$ZEEL$Pct_Change <- c(NA, diff(all_symbols_data$ZEEL$Close) / lag(all_symbols_data$ZEEL$Close)[-1] * 100)
# Closing price trend
ggplot(all_symbols_data$ZEEL, aes(x = Date, y = Close)) +
geom_line(color = "blue") +
labs(
title = "Closing Price Trend",
x = "Date",
y = "Closing Price"
) +
theme_minimal()
# Percentage change trend
ggplot(all_symbols_data$ZEEL, aes(x = Date, y = Pct_Change)) +
geom_line(color = "red") +
labs(
title = "Percentage Change in Closing Price",
x = "Date",
y = "Percentage Change (%)"
) +
theme_minimal()
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).
ggplot(all_symbols_data$ZEEL, aes(x = Date, y = Volume)) +
geom_line(color = "darkgreen") +
labs(
title = "Sales Volume Trend",
x = "Date",
y = "Trading Volume"
) +
theme_minimal()
# Convert Date column
all_symbols_data$ZEEL$Date <- as.Date(all_symbols_data$ZEEL$Date)
# Moving averages
all_symbols_data$ZEEL$MA15 <-
rollmean(
all_symbols_data$ZEEL$Close,
15,
fill = NA,
align = "right"
)
all_symbols_data$ZEEL$MA30 <-
rollmean(
all_symbols_data$ZEEL$Close,
30,
fill = NA,
align = "right"
)
all_symbols_data$ZEEL$MA45 <-
rollmean(
all_symbols_data$ZEEL$Close,
45,
fill = NA,
align = "right"
)
# Remove rows with NA values
zeel_ma <- na.omit(all_symbols_data$ZEEL)
# Plot
ggplot(zeel_ma, aes(x = Date)) +
geom_line(aes(y = Close, color = "Close Price")) +
geom_line(aes(y = MA15, color = "15-Day MA")) +
geom_line(aes(y = MA30, color = "30-Day MA")) +
geom_line(aes(y = MA45, color = "45-Day MA")) +
labs(
title = "Moving Average Trend of Closing Price",
x = "Date",
y = "Price",
color = "Legend"
) +
theme_minimal()
# Calculate percentage change
all_symbols_data$ZEEL$PrevClose_PctChange <-
c(
NA,
diff(all_symbols_data$ZEEL$Prev.Close) /
lag(all_symbols_data$ZEEL$Prev.Close)[-1] * 100
)
# Remove NA values
zeel_hist <- na.omit(all_symbols_data$ZEEL)
# Histogram
ggplot(zeel_hist, aes(x = PrevClose_PctChange)) +
geom_histogram(
bins = 30,
fill = "skyblue",
color = "black"
) +
labs(
title = "Distribution of Percentage Change for Prev Close",
x = "Percentage Change (%)",
y = "Frequency"
) +
theme_minimal()
To understand the relationships among the numerical variables in the Zee Entertainment Enterprises Ltd stock dataset, a correlation analysis and visualization were conducted in RStudio using the variables Prev Close, Open, High, Low, Volume, and Close. The objective was to identify the strength and direction of relationships between stock market indicators and detect highly correlated variable pairs. Correlation analysis is important in financial modeling because strong relationships among predictors may lead to multicollinearity, which can negatively affect regression model performance and interpretation.
library(ggplot2)
library(corrplot)
## corrplot 0.95 loaded
# Select numerical variables
stock_data <- all_symbols_data$ZEEL[, c(
"Prev.Close",
"Open",
"High",
"Low",
"Volume",
"Close"
)]
# Correlation matrix
cor_matrix <- cor(stock_data, use = "complete.obs")
# Display correlation matrix
print(cor_matrix)
## Prev.Close Open High Low Volume Close
## Prev.Close 1.0000000 0.9995896 0.9989190 0.9985924 -0.1250299 0.9982408
## Open 0.9995896 1.0000000 0.9991795 0.9989109 -0.1233187 0.9984654
## High 0.9989190 0.9991795 1.0000000 0.9985761 -0.1134368 0.9992570
## Low 0.9985924 0.9989109 0.9985761 1.0000000 -0.1335865 0.9992283
## Volume -0.1250299 -0.1233187 -0.1134368 -0.1335865 1.0000000 -0.1225318
## Close 0.9982408 0.9984654 0.9992570 0.9992283 -0.1225318 1.0000000
# Visualize correlations
corrplot(
cor_matrix,
method = "color",
type = "upper",
addCoef.col = "black",
tl.col = "black",
tl.srt = 45
)
The correlation analysis revealed strong positive relationships among the stock price variables, particularly between Prev Close, Open, High, Low, and Close. These variables are all directly related to daily stock price movements, so it is expected that they move together over time. In contrast, Volume showed a weaker relationship with the price-related variables, suggesting that trading activity does not always move proportionally with stock prices.
A regression analysis was conducted using the Zee Entertainment Enterprises Ltd stock dataset in RStudio to examine how different market variables influence the closing stock price. In this analysis, Close was treated as the dependent variable, while Prev Close, Open, High, Low, and Volume were used as independent variables. To evaluate the performance of the regression model, the dataset was divided into training and testing subsets using an 80%–20% split. The training set was used to build the model, while the test set was reserved for assessing predictive accuracy on unseen data.
# Set seed for reproducibility
set.seed(123)
# Select relevant variables
reg_data <- all_symbols_data$ZEEL[, c(
"Close",
"Prev.Close",
"Open",
"High",
"Low",
"Volume"
)]
# Remove missing values
reg_data <- na.omit(reg_data)
# Total number of observations
n <- nrow(reg_data)
# Create training indices (80%)
train_index <- sample(1:n, size = 0.8 * n)
# Split data
train_data <- reg_data[train_index, ]
test_data <- reg_data[-train_index, ]
# Print sizes
cat("Training:", nrow(train_data), "\n")
## Training: 2808
cat("Test Sets:", nrow(test_data), "\n")
## Test Sets: 702
After splitting the Zee Entertainment Enterprises Ltd dataset into training and testing subsets, a multiple linear regression model was developed in RStudio to predict the closing stock price using Prev Close, Open, High, Low, and Volume as independent variables. The model was trained using the training dataset and then evaluated on the test dataset to measure its predictive performance. Three commonly used evaluation metrics were applied: Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and R-squared (R²). These metrics help assess prediction accuracy, average model error, and the proportion of variation in the closing price explained by the regression model.
# Train regression model
reg_model <- lm(
Close ~ Prev.Close + Open + High + Low + Volume,
data = train_data
)
# Model summary
summary(reg_model)
##
## Call:
## lm(formula = Close ~ Prev.Close + Open + High + Low + Volume,
## data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.1459 -1.5783 0.1155 1.5911 28.3699
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.803e-01 1.525e-01 -5.774 8.59e-09 ***
## Prev.Close -1.652e-02 1.928e-02 -0.857 0.3915
## Open -4.435e-01 2.138e-02 -20.745 < 2e-16 ***
## High 8.453e-01 1.200e-02 70.411 < 2e-16 ***
## Low 6.136e-01 1.100e-02 55.795 < 2e-16 ***
## Volume -1.076e-08 6.236e-09 -1.726 0.0845 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.055 on 2802 degrees of freedom
## Multiple R-squared: 0.9995, Adjusted R-squared: 0.9995
## F-statistic: 1.072e+06 on 5 and 2802 DF, p-value: < 2.2e-16
# Predictions on test data
predictions <- predict(reg_model, newdata = test_data)
# Actual values
actual <- test_data$Close
# Performance Metrics
# 1. Mean Absolute Error (MAE)
MAE <- mean(abs(actual - predictions))
# 2. Root Mean Squared Error (RMSE)
RMSE <- sqrt(mean((actual - predictions)^2))
# 3. R-squared (R²)
R2 <- 1 - sum((actual - predictions)^2) /
sum((actual - mean(actual))^2)
# Print results
cat("MAE :", MAE, "\n")
## MAE : 2.184231
cat("RMSE:", RMSE, "\n")
## RMSE: 3.639329
cat("R-squared:", R2, "\n")
## R-squared: 0.9992422
The regression model was able to predict the closing stock price using historical stock market indicators with a high level of accuracy. The MAE measured the average absolute prediction error, while the RMSE provided insight into the magnitude of larger prediction errors. The R² value indicated how much of the variation in the closing price was explained by the independent variables included in the model.
Because the variables Prev Close, Open, High, and Low are strongly correlated with Close, the model is expected to achieve a high R² value. However, the strong correlations among these predictors may also introduce multicollinearity, which can affect the stability and interpretability of regression coefficients. Despite this limitation, the model remains useful for understanding overall relationships and short-term stock price prediction.
To evaluate the predictive accuracy of the regression model, a scatter plot comparing actual and predicted closing prices was created using the test dataset. A 45-degree reference line representing perfect prediction was added to the visualization to assess how closely the predicted values matched the actual stock closing prices. This graphical analysis helps identify prediction accuracy, systematic bias, and the overall reliability of the regression model.
# Create dataframe for plotting
plot_data <- data.frame(
Actual = actual,
Predicted = predictions
)
# Scatter plot
ggplot(plot_data, aes(x = Actual, y = Predicted)) +
geom_point(color = "blue", alpha = 0.6) +
# 45-degree reference line
geom_abline(
slope = 1,
intercept = 0,
color = "red",
linetype = "dashed",
linewidth = 1
) +
labs(
title = "Actual vs Predicted Closing Prices",
x = "Actual Closing Prices",
y = "Predicted Closing Prices"
) +
theme_minimal()
The 45-degree reference line represents perfect predictions, where the predicted closing prices are exactly equal to the actual closing prices. Points located very close to this line indicate accurate model predictions, while points farther away represent prediction errors. If the scatter points are tightly clustered around the line, it suggests that the regression model performs well and captures the relationship between the independent variables and the closing price effectively. A wider spread around the line indicates larger prediction errors and lower predictive accuracy. In this analysis, most observations are expected to lie near the reference line because variables such as Prev Close, Open, High, and Low are strongly correlated with the closing price. However, any noticeable deviations from the line may indicate periods of unusual market volatility, sudden trading activity, or limitations caused by multicollinearity among predictors.
To improve comparability among variables and prepare the data for machine learning and regression analysis, the independent variables in the Zee Entertainment Enterprises Ltd dataset were standardized. Standardization transforms each variable so that it has a mean of 0 and a standard deviation of 1. This process is especially useful when variables are measured on different scales, such as stock prices and trading volume, because it prevents variables with larger numerical ranges from dominating the analysis. After scaling, the mean and standard deviation of each variable were examined to verify that the standardization process was successful.
# Select independent variables
independent_vars <- all_symbols_data$ZEEL[, c(
"Prev.Close",
"Open",
"High",
"Low",
"Volume"
)]
# Remove missing values
independent_vars <- na.omit(independent_vars)
# Standardize variables
scaled_data <- scale(independent_vars)
# Convert to dataframe
scaled_data <- as.data.frame(scaled_data)
# Means after scaling
scaled_means <- colMeans(scaled_data)
# Standard deviations after scaling
scaled_sds <- apply(scaled_data, 2, sd)
# Print results
cat("Means after scaling:\n")
## Means after scaling:
print(scaled_means)
## Prev.Close Open High Low Volume
## -1.978079e-16 7.296316e-17 -1.707561e-16 5.968635e-17 -1.247024e-17
cat("\nStandard deviations after scaling:\n")
##
## Standard deviations after scaling:
print(scaled_sds)
## Prev.Close Open High Low Volume
## 1 1 1 1 1
The results show that the standardized variables have means very close to 0 and standard deviations close to 1, confirming that the scaling process was correctly applied. Minor differences from exactly 0 and 1 are normal due to rounding errors in computation.
Standardization is important because it ensures that all independent variables contribute proportionally to the analysis, regardless of their original units or magnitudes. This is particularly relevant in stock market datasets where variables such as trading volume may have values that are much larger than stock prices.
To reduce dimensionality and address the issue of multicollinearity among the independent variables, Principal Component Analysis (PCA) was applied to the Zee Entertainment Enterprises Ltd dataset in RStudio. PCA transforms the original correlated variables into a smaller set of uncorrelated components while preserving most of the variation in the data. In this analysis, the independent variables were reduced to two principal components, and a Linear Regression model was trained using the PCA-transformed training dataset. The performance of the model was then evaluated on the test dataset using the R² metric.
# Independent variables
X <- all_symbols_data$ZEEL[, c(
"Prev.Close",
"Open",
"High",
"Low",
"Volume"
)]
# Dependent variable
y <- all_symbols_data$ZEEL$Close
# Remove missing values
data_pca <- na.omit(data.frame(X, y))
# Separate X and y
X <- data_pca[, 1:5]
y <- data_pca$y
# Standardize X
X_scaled <- scale(X)
# Split data into training and testing sets
set.seed(123)
n <- nrow(X_scaled)
train_index <- sample(1:n, size = 0.8 * n)
X_train <- X_scaled[train_index, ]
X_test <- X_scaled[-train_index, ]
y_train <- y[train_index]
y_test <- y[-train_index]
# Apply PCA
pca_model <- prcomp(X_train, center = TRUE, scale. = TRUE)
# Keep first 2 principal components
train_pca <- pca_model$x[, 1:2]
# Transform test data
test_pca <- predict(pca_model, newdata = X_test)[, 1:2]
# Train Linear Regression model
pca_reg_model <- lm(y_train ~ ., data = as.data.frame(train_pca))
# Predictions
pred_pca <- predict(
pca_reg_model,
newdata = as.data.frame(test_pca)
)
# Compute R-squared
R2_pca <- 1 - sum((y_test - pred_pca)^2) /
sum((y_test - mean(y_test))^2)
# Print R2
cat("R-squared on Test Set:", R2_pca, "\n")
## R-squared on Test Set: 0.9978319
The PCA-based regression model used two principal components to summarize the information contained in the original independent variables. By reducing the dimensionality of the dataset, PCA helped minimize multicollinearity among predictors while retaining most of the important variation in the data.
The R² value obtained on the test set indicates how well the PCA-transformed variables explain variations in the closing stock price. A high R² suggests that the two principal components successfully captured most of the relevant information from the original variables, while a lower R² would indicate some loss of predictive information during dimensionality reduction. Overall, PCA provides a useful approach for simplifying complex financial datasets and improving model stability.
Comparison of the PCA Model and the Raw Regression Model
The performance of the PCA-based regression model was compared with the original multiple linear regression model developed using the raw independent variables (Prev Close, Open, High, Low, and Volume) in RStudio. The comparison was based on the R² metric, which measures the proportion of variation in the closing stock price explained by the model. In most cases, the raw regression model is expected to produce a higher R² value than the PCA-based model. This is because the raw model uses the original variables directly, preserving all available information relevant to predicting the closing stock price. Since variables such as Prev Close, Open, High, and Low are highly correlated with Close, the model can explain a large proportion of the variation in the dependent variable. On the other hand, the PCA model reduces the dataset to only two principal components. Although PCA removes multicollinearity and simplifies the dataset, some information from the original variables may be lost during dimensionality reduction. As a result, the PCA model may produce a slightly lower R² value compared to the raw regression model. However, the PCA model has important advantages:
it reduces multicollinearity among predictors,
improves model stability,
simplifies interpretation of complex datasets,
and may generalize better when working with highly correlated variables.
Therefore:
If the raw model has a higher R², it means the original variables provide more detailed predictive information.
If the PCA model performs similarly, it suggests that the first two principal components successfully captured most of the important information in the dataset.
Overall, the raw regression model often performs better in terms of predictive accuracy, while the PCA model provides a more compact and statistically stable representation of the data.
A comparative analysis of annual trading volume was conducted for Zee Entertainment Enterprises Ltd and Vedanta Ltd using RStudio. The total yearly trading volume for each company was calculated and visualized using a clustered bar chart to compare sales activity across different years. This visualization helps identify differences in market participation, trading intensity, and relative performance between the two companies over time. The analysis also focused on identifying the company with the highest trading volume in 2018, the company with the lowest trading volume in 2021, and the ratio of Zee Entertainment Enterprises Ltd sales volume to Vedanta Ltd sales volume during 2021.
# Convert Date columns
all_symbols_data$ZEEL$Date <- as.Date(all_symbols_data$ZEEL$Date)
all_symbols_data$VEDL$Date <- as.Date(all_symbols_data$VEDL$Date)
# Extract Year
all_symbols_data$ZEEL$Year <- format(all_symbols_data$ZEEL$Date, "%Y")
all_symbols_data$VEDL$Year <- format(all_symbols_data$VEDL$Date, "%Y")
# Annual total trading volume
zeel_volume <- all_symbols_data$ZEEL %>%
group_by(Year) %>%
summarise(Total_Volume = sum(Volume, na.rm = TRUE)) %>%
mutate(Company = "ZEEL")
vedl_volume <- all_symbols_data$VEDL %>%
group_by(Year) %>%
summarise(Total_Volume = sum(Volume, na.rm = TRUE)) %>%
mutate(Company = "VEDL")
# Combine datasets
volume_data <- rbind(zeel_volume, vedl_volume)
# Clustered bar chart
ggplot(volume_data,
aes(x = Year,
y = Total_Volume,
fill = Company)) +
geom_bar(stat = "identity",
position = "dodge") +
labs(
title = "Annual Trading Volume Comparison",
x = "Year",
y = "Total Trading Volume"
) +
theme_minimal()
# -----------------------------
# Questions
# -----------------------------
# Highest volume during 2018
volume_2018 <- subset(volume_data, Year == "2018")
print(volume_2018)
## # A tibble: 2 × 3
## Year Total_Volume Company
## <chr> <dbl> <chr>
## 1 2018 598775408 ZEEL
## 2 2018 3276296751 VEDL
# Least volume during 2021
volume_2021 <- subset(volume_data, Year == "2021")
print(volume_2021)
## # A tibble: 2 × 3
## Year Total_Volume Company
## <chr> <dbl> <chr>
## 1 2021 1302490317 ZEEL
## 2 2021 1430971315 VEDL
# Ratio of ZEEL to VEDL sales during 2021
zeel_2021 <- volume_2021$Total_Volume[
volume_2021$Company == "ZEEL"
]
vedl_2021 <- volume_2021$Total_Volume[
volume_2021$Company == "VEDL"
]
ratio_2021 <- zeel_2021 / vedl_2021
cat("Ratio of ZEEL to VEDL sales in 2021:",
ratio_2021)
## Ratio of ZEEL to VEDL sales in 2021: 0.9102141
The clustered bar chart provides a clear comparison of annual trading volumes between Zee Entertainment Enterprises Ltd and Vedanta Ltd across different years. By comparing the heights of the bars, it becomes possible to identify years with unusually high or low trading activity for each company.
The company with the highest trading volume during 2018 can be identified by comparing the 2018 bars in the chart and the printed results.
The company with the lowest trading volume during 2021 is determined from the 2021 annual totals.
The ratio of Zee Entertainment Enterprises Ltd sales volume to Vedanta Ltd sales volume during 2021 indicates how the trading activity of the two companies compares relative to each other during that year.
A ratio:
greater than 1 indicates ZEEL traded more volume than VEDL,
less than 1 indicates VEDL traded more volume than ZEEL,
and equal to 1 would indicate equal trading volume.