Introduction

One of the datasets that will be tidy, transformed, and analyzed in this report is a dataset that contains the information for well performing Indian stocks in the NSE. This dataset was pulled from kaggle. There are 63 variables in this dataset. Unfortunately, the author of the dataset did not provide an ancillary document providing a description of what each variable means. This fact also means that the actual currency for the prices is unknown. Since the data is for the Indian stock market, the currency is assumed to be in Indian Rupees. For this dataset, an analysis detailed in this report was done to address the following:

  1. Finding the mean trading high and low prices for each sector.
  2. The distribution of those means for both high and low prices.

Importing of the Data

The .csv containing the data was stored onto a Github repository which was then imported into our workspace with the following code.

url <- 'https://raw.githubusercontent.com/peterphung2043/DATA-607---Project-2/main/Stock%20Market%20data%20.csv'

stock_market_data <- read.csv(url(url), stringsAsFactors = FALSE, na.strings = "#N/A")
knitr::kable(stock_market_data[1:5, 1:5])
Share Category Sector RM Up
20MICRONS Exit CHEMICALS NA NA
3IINFOTECH BWLV IT NA NA
3MINDIA Exit DIVERSIFIED 0 0
63MOONS RS IT NA NA
8KMILES HWV IT 0 0

The output above only shows the first five variables and the first five observations. The rest of the data can be viewed on the kaggle hyperlink in the Introduction.

Tidying and Transforming of the Data

The columns that were analyzed in this dataset were the Sector, High.Price, and Low.Price.

A check was done to see if the number of missing values for the high prices was the same as the number of missing values for the low prices. This is to see if any of the non-missing high prices had any corresponding missing low prices for each observation and vice versa.

stock_market_data %>%
  summarise(missing_high_prices = sum(is.na(High.Price)),
            missing_low_prices = sum(is.na(Low.Price)))
##   missing_high_prices missing_low_prices
## 1                 599                599

The output above shows that the High.Price column and the Low.Price column have the same amount of missing values.

The code block below selects the 3 variables that we need for the analysis. then discards observations containing missing values in the dataset. The resulting dataframe is then stored as parsed_stock_market_data. Only the first 5 observations are shown below.

parsed_stock_market_data <- stock_market_data %>%
  select(Sector, High.Price, Low.Price) %>%
  drop_na()
knitr::kable(parsed_stock_market_data[1:5,])
Sector High.Price Low.Price
DIVERSIFIED 12399.95 12115.0
IT 76.70 75.6
IT 640.00 624.0
INFRA-CONSTRUCTION ENGINEERING AND MATERIALS 50.50 47.8
PHARMA 543.75 531.0

The code block below does the following to the parsed_stock_market_data dataframe.

  1. Grouping by the Sector column. Since the analysis calls for determining the mean high and low price by sector.

  2. Using a mutate operation to assign a row number to each sector in each group. For example, for the first observation, assume that sector = "IT". For the first observation, row = 1. For the 2nd observation, sector = "DIVERSIFIED", so row = 1 for the 2nd observation. For the 3rd observation sector = "IT", so ’row = 2` for the 3rd observation. Notice that the sectors for the first and third observations were the same, so when assigning a row number, the row number increments if subsequent observations contain the same sector.

  3. pivot_wider makes it so that all of the High.Prices corresponding to a unique Sector gets put into a column. Similarily, all the Low.Prices corresponding to a unique Sector gets put into a column.

  4. The row column that was generated from step 2 was deleted.

parsed_stock_market_data <- parsed_stock_market_data %>%
  group_by(Sector) %>%
  mutate(row = row_number()) %>%
  pivot_wider(names_from = Sector, values_from = c(High.Price, Low.Price)) %>%
  select(-row)

knitr::kable((parsed_stock_market_data %>%
               select(High.Price_DIVERSIFIED, Low.Price_DIVERSIFIED,
                      High.Price_IT, Low.Price_IT))[1:5,])
High.Price_DIVERSIFIED Low.Price_DIVERSIFIED High.Price_IT Low.Price_IT
12399.95 12115.00 76.70 75.6
1713.45 1684.35 640.00 624.0
233.65 231.10 383.85 370.6
1212.70 1192.00 221.45 215.1
381.70 376.00 131.80 124.9

The output above shows just the first five values for two of the sectors in parsed_stock_market_data.

The data was then stored in a nested dataframe. In this nested dataframe, the High.Price columns and the Low.Price columns in the parsed stock market data were grouped together separately then moved into list-columns. The resulting nested dataframe is stored as nested_stock_market_data.

nested_stock_market_data <- parsed_stock_market_data %>%
  nest(high_price = starts_with("High.Price"), low_price = starts_with("Low.Price"))

The high_price column in nested_stock_market_data contains all of the High.Price values arranged by Sector. Similarly, the low_price column contains all of the Low.Price values arranged by Sector. The output for nested_stock_market_data is shown below, each list column has 36 variables, which means that there are 36 sectors with multiple high prices and 36 sectors and low prices.

On RStudio, clicking on an element on each of the list-columns in nested_stock_market_data will output the data with respect to the list-column. The outputs for each list-column are shown in the subsequent dataframe.

high_price low_price
36 variables 36 variables

We can index the list-columns in nested_stock_market_data in order to bring up all of the High.Price values for each of the Sectors and similarly for the Low.Price``Sectors.

knitr::kable((nested_stock_market_data$high_price[[1]] %>%
                select(High.Price_DIVERSIFIED, High.Price_IT))[1:5,])
High.Price_DIVERSIFIED High.Price_IT
12399.95 76.70
1713.45 640.00
233.65 383.85
1212.70 221.45
381.70 131.80
knitr::kable((nested_stock_market_data$low_price[[1]] %>%
                select(Low.Price_DIVERSIFIED, Low.Price_IT))[1:5,])
Low.Price_DIVERSIFIED Low.Price_IT
12115.00 75.6
1684.35 624.0
231.10 370.6
1192.00 215.1
376.00 124.9

The output for the code block above shows the high and low prices for just the DIVERSIFIED and IT sectors in nested_stock_market_data. There are 36 sectors in total so only two are shown.

Displaying the Means of the High and Low Prices for Each Sector

The following block of code computes the means for each of the sectors for Both high and low prices from the nested_stock_market_data dataframe. The high price and low price means for each sector are stored in the trading_means dataframe.

trading_means <- data.frame(
  high_price_means = colMeans(nested_stock_market_data$high_price[[1]], na.rm = TRUE),
  low_price_means = colMeans(nested_stock_market_data$low_price[[1]], na.rm = TRUE)
)
rownames(trading_means) <- str_extract(names(nested_stock_market_data$high_price[[1]]), '(?<=_).+')

knitr::kable(trading_means)
high_price_means low_price_means
DIVERSIFIED 1710.0864 1672.5409
IT 456.5485 447.1265
INFRA-CONSTRUCTION ENGINEERING AND MATERIALS 237.6061 231.1000
PHARMA 761.4545 745.0689
CHEMICALS 587.5079 573.4456
ENERGY-OIL & GAS 678.6229 668.0771
INFRA-MACHINERY EQUIPMENT 647.1958 628.6992
TEXTILES AND APPAREL 577.4724 564.8362
CEMENTS 1380.3732 1324.1214
INFRA-POWER 307.8364 300.5318
TRANSPORTATION LOGISTICS 459.6792 446.8792
FMCG-FOOD 553.7417 538.5354
ENTERTAINMENT-HOTELS AND LEISURE 244.4423 237.7462
REALTY 202.5462 195.9981
REALTY-HOUSEHOLD 626.4444 609.9491
FINANCIAL SERVICES 668.2844 651.2984
BANKS-PSU 144.4000 140.6895
METALS AND MINING 254.3843 247.3588
AUTO-AUTO ANCL AND COMPONENTS 1412.6091 1377.3561
FMCG-SUGAR 149.4000 144.8938
AUTO-TYRES AND TUBES 12516.9583 12245.5833
AUTO-AUTOMOBILES AND AUTO PARTS 1732.0577 1650.6551
FMCG-BEVERAGES 416.9969 390.2125
DEFENCE 140.4375 135.4250
BANKS-PRIVATE 478.6176 471.2441
FMCG 1694.8804 1659.8283
FINANCIAL SERVICES-NBFC 803.6853 781.1368
MEDIA-ENTERTAINMENT 313.7667 304.4714
PAPER 164.2063 157.3000
FOOTWEAR 300.8000 293.8333
INFRA-TELECOM 216.8643 212.3786
FERTILISERS 238.3562 229.5938
FMCG-PACKAGING 239.1867 233.0367
JEWELLERY 313.7312 306.1375
MEDIA 152.9417 148.2583
AIRLINES 838.2750 800.4250

The Distribution of the Means for the High and Low prices

ggplot(data = trading_means, aes(x = high_price_means)) +
  geom_histogram() + xlab("High Price Means (Indian Rupees)") + ylab("Count")

Fig. 1: High price means histogram.

ggplot(data = trading_means, aes(x = low_price_means)) +
  geom_histogram() + xlab("Low Price Means (Indian Rupees)") + ylab("Count")

Fig. 2: Low price means histogram.

Figure 1 and Figure 2 show that there is a huge outlier for both the high prices and low prices.

Hmisc::describe(trading_means)
## trading_means 
## 
##  2  Variables      36  Observations
## --------------------------------------------------------------------------------
## high_price_means 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##       36        0       36        1    906.2     1121    148.2    158.6 
##      .25      .50      .75      .90      .95 
##    239.0    458.1    699.3   1553.7   1715.6 
## 
## lowest :   140.4375   144.4000   149.4000   152.9417   164.2062
## highest:  1412.6091  1694.8804  1710.0864  1732.0577 12516.9583
## --------------------------------------------------------------------------------
## low_price_means 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##       36        0       36        1    882.4     1093    143.8    152.8 
##      .25      .50      .75      .90      .95 
##    232.6    447.0    687.3   1514.0   1663.0 
## 
## lowest :   135.4250   140.6895   144.8938   148.2583   157.3000
## highest:  1377.3561  1650.6551  1659.8283  1672.5409 12245.5833
## --------------------------------------------------------------------------------

The Hmisc library has a describe function which shows the means, the 5th, 10th, 25th, 50th, 75th, 90th, 95th percentiles, the 5 lowest values, and 5 highest values for each variable for a given dataframe. The output above reveals that the highest value for both the high_price_means and low_price_means are a order of magnitude larger than the 2nd highest, which is why there a huge outlier for both graphs.

trading_means %>%
  filter(high_price_means > 10000 | low_price_means > 10000)
##                      high_price_means low_price_means
## AUTO-TYRES AND TUBES         12516.96        12245.58

The output for the code block above shows that the AUTO-TYRES AND TUBES Sector has the highest high and low price means. In fact, this is the only sector with a high or low price mean of over 10,000. This is a significant outlier. Therefore, it was omitted in the following two graphs.

ggplot(data = trading_means %>% filter(high_price_means < 10000), aes(x = high_price_means)) +
  geom_histogram() + xlab("High Price Means (Indian Rupees)") + ylab("Count")

Fig. 3: High price means histogram. Mean high prices above 10,000 Indian rupees were filtered out of the data.

ggplot(data = trading_means %>% filter(low_price_means < 10000), aes(x = low_price_means)) +
  geom_histogram() + xlab("Low Price Means (Indian Rupees)") + ylab("Count")

Fig. 4: Low price means histogram. Mean low prices above 10,000 Indian rupees were filtered out of the data.

After removing the huge outlier from both the high price and low price means, it is shown that the distribution for both graphs looks to be unimodal and right skewed. The histograms for both the high and low trade price means imply that trade prices in the Indian stock market typically tend to stay below 1000 Indian Rupees a share.

Conclusions

By analyzing the Indian stock market data from Kaggle, it has been shown that this data is unimodal and right skewed. It would have been interesting to see the distribution of the data for those high and low prices that were not missing from the original dataset, since there were 599 missing stock prices from the original data. A future analysis could involve finding an association between the high price and low price and other variables in the dataset.