Project 2

Introduction

One of the datasets that will be tidy, transformed, and analyzed in this report is a dataset that contains the information for well performing Indian stocks in the NSE. This dataset was pulled from kaggle. There are 63 variables in this dataset. Unfortunately, the author of the dataset did not provide an ancillary document providing a description of what each variable means. This fact also means that the actual currency for the prices is unknown. Since the data is for the Indian stock market, the currency is assumed to be in Indian Rupees. For this dataset, an analysis detailed in this report was done to address the following:

Finding the mean trading high and low prices for each sector.
The distribution of those means for both high and low prices.

Importing of the Data

The .csv containing the data was stored onto a Github repository which was then imported into our workspace with the following code.

url <- 'https://raw.githubusercontent.com/peterphung2043/DATA-607---Project-2/main/Stock%20Market%20data%20.csv'

stock_market_data <- read.csv(url(url), stringsAsFactors = FALSE, na.strings = "#N/A")
knitr::kable(stock_market_data[1:5, 1:5])

Share	Category	Sector	RM	Up
20MICRONS	Exit	CHEMICALS	NA	NA
3IINFOTECH	BWLV	IT	NA	NA
3MINDIA	Exit	DIVERSIFIED	0	0
63MOONS	RS	IT	NA	NA
8KMILES	HWV	IT	0	0

The output above only shows the first five variables and the first five observations. The rest of the data can be viewed on the kaggle hyperlink in the Introduction.

Tidying and Transforming of the Data

The columns that were analyzed in this dataset were the Sector, High.Price, and Low.Price.

A check was done to see if the number of missing values for the high prices was the same as the number of missing values for the low prices. This is to see if any of the non-missing high prices had any corresponding missing low prices for each observation and vice versa.

stock_market_data %>%
  summarise(missing_high_prices = sum(is.na(High.Price)),
            missing_low_prices = sum(is.na(Low.Price)))

##   missing_high_prices missing_low_prices
## 1                 599                599

The output above shows that the High.Price column and the Low.Price column have the same amount of missing values.

The code block below selects the 3 variables that we need for the analysis. then discards observations containing missing values in the dataset. The resulting dataframe is then stored as parsed_stock_market_data. Only the first 5 observations are shown below.

parsed_stock_market_data <- stock_market_data %>%
  select(Sector, High.Price, Low.Price) %>%
  drop_na()
knitr::kable(parsed_stock_market_data[1:5,])

Sector	High.Price	Low.Price
DIVERSIFIED	12399.95	12115.0
IT	76.70	75.6
IT	640.00	624.0
INFRA-CONSTRUCTION ENGINEERING AND MATERIALS	50.50	47.8
PHARMA	543.75	531.0

The code block below does the following to the parsed_stock_market_data dataframe.

Grouping by the Sector column. Since the analysis calls for determining the mean high and low price by sector.
Using a mutate operation to assign a row number to each sector in each group. For example, for the first observation, assume that sector = "IT". For the first observation, row = 1. For the 2nd observation, sector = "DIVERSIFIED", so row = 1 for the 2nd observation. For the 3rd observation sector = "IT", so ’row = 2` for the 3rd observation. Notice that the sectors for the first and third observations were the same, so when assigning a row number, the row number increments if subsequent observations contain the same sector.
pivot_wider makes it so that all of the High.Prices corresponding to a unique Sector gets put into a column. Similarily, all the Low.Prices corresponding to a unique Sector gets put into a column.
The row column that was generated from step 2 was deleted.

parsed_stock_market_data <- parsed_stock_market_data %>%
  group_by(Sector) %>%
  mutate(row = row_number()) %>%
  pivot_wider(names_from = Sector, values_from = c(High.Price, Low.Price)) %>%
  select(-row)

knitr::kable((parsed_stock_market_data %>%
               select(High.Price_DIVERSIFIED, Low.Price_DIVERSIFIED,
                      High.Price_IT, Low.Price_IT))[1:5,])

High.Price_DIVERSIFIED	Low.Price_DIVERSIFIED	High.Price_IT	Low.Price_IT
12399.95	12115.00	76.70	75.6
1713.45	1684.35	640.00	624.0
233.65	231.10	383.85	370.6
1212.70	1192.00	221.45	215.1
381.70	376.00	131.80	124.9

The output above shows just the first five values for two of the sectors in parsed_stock_market_data.

The data was then stored in a nested dataframe. In this nested dataframe, the High.Price columns and the Low.Price columns in the parsed stock market data were grouped together separately then moved into list-columns. The resulting nested dataframe is stored as nested_stock_market_data.

nested_stock_market_data <- parsed_stock_market_data %>%
  nest(high_price = starts_with("High.Price"), low_price = starts_with("Low.Price"))

The high_price column in nested_stock_market_data contains all of the High.Price values arranged by Sector. Similarly, the low_price column contains all of the Low.Price values arranged by Sector. The output for nested_stock_market_data is shown below, each list column has 36 variables, which means that there are 36 sectors with multiple high prices and 36 sectors and low prices.

On RStudio, clicking on an element on each of the list-columns in nested_stock_market_data will output the data with respect to the list-column. The outputs for each list-column are shown in the subsequent dataframe.

high_price	low_price
36 variables	36 variables

We can index the list-columns in nested_stock_market_data in order to bring up all of the High.Price values for each of the Sectors and similarly for the Low.Price``Sectors.

knitr::kable((nested_stock_market_data$high_price[[1]] %>%
                select(High.Price_DIVERSIFIED, High.Price_IT))[1:5,])

High.Price_DIVERSIFIED	High.Price_IT
12399.95	76.70
1713.45	640.00
233.65	383.85
1212.70	221.45
381.70	131.80

knitr::kable((nested_stock_market_data$low_price[[1]] %>%
                select(Low.Price_DIVERSIFIED, Low.Price_IT))[1:5,])

Low.Price_DIVERSIFIED	Low.Price_IT
12115.00	75.6
1684.35	624.0
231.10	370.6
1192.00	215.1
376.00	124.9

The output for the code block above shows the high and low prices for just the DIVERSIFIED and IT sectors in nested_stock_market_data. There are 36 sectors in total so only two are shown.

Displaying the Means of the High and Low Prices for Each Sector

The following block of code computes the means for each of the sectors for Both high and low prices from the nested_stock_market_data dataframe. The high price and low price means for each sector are stored in the trading_means dataframe.

trading_means <- data.frame(
  high_price_means = colMeans(nested_stock_market_data$high_price[[1]], na.rm = TRUE),
  low_price_means = colMeans(nested_stock_market_data$low_price[[1]], na.rm = TRUE)
)
rownames(trading_means) <- str_extract(names(nested_stock_market_data$high_price[[1]]), '(?<=_).+')

knitr::kable(trading_means)

	high_price_means	low_price_means
DIVERSIFIED	1710.0864	1672.5409
IT	456.5485	447.1265
INFRA-CONSTRUCTION ENGINEERING AND MATERIALS	237.6061	231.1000
PHARMA	761.4545	745.0689
CHEMICALS	587.5079	573.4456
ENERGY-OIL & GAS	678.6229	668.0771
INFRA-MACHINERY EQUIPMENT	647.1958	628.6992
TEXTILES AND APPAREL	577.4724	564.8362
CEMENTS	1380.3732	1324.1214
INFRA-POWER	307.8364	300.5318
TRANSPORTATION LOGISTICS	459.6792	446.8792
FMCG-FOOD	553.7417	538.5354
ENTERTAINMENT-HOTELS AND LEISURE	244.4423	237.7462
REALTY	202.5462	195.9981
REALTY-HOUSEHOLD	626.4444	609.9491
FINANCIAL SERVICES	668.2844	651.2984
BANKS-PSU	144.4000	140.6895
METALS AND MINING	254.3843	247.3588
AUTO-AUTO ANCL AND COMPONENTS	1412.6091	1377.3561
FMCG-SUGAR	149.4000	144.8938
AUTO-TYRES AND TUBES	12516.9583	12245.5833
AUTO-AUTOMOBILES AND AUTO PARTS	1732.0577	1650.6551
FMCG-BEVERAGES	416.9969	390.2125
DEFENCE	140.4375	135.4250
BANKS-PRIVATE	478.6176	471.2441
FMCG	1694.8804	1659.8283
FINANCIAL SERVICES-NBFC	803.6853	781.1368
MEDIA-ENTERTAINMENT	313.7667	304.4714
PAPER	164.2063	157.3000
FOOTWEAR	300.8000	293.8333
INFRA-TELECOM	216.8643	212.3786
FERTILISERS	238.3562	229.5938
FMCG-PACKAGING	239.1867	233.0367
JEWELLERY	313.7312	306.1375
MEDIA	152.9417	148.2583
AIRLINES	838.2750	800.4250

The Distribution of the Means for the High and Low prices

ggplot(data = trading_means, aes(x = high_price_means)) +
  geom_histogram() + xlab("High Price Means (Indian Rupees)") + ylab("Count")

Fig. 1: High price means histogram.

ggplot(data = trading_means, aes(x = low_price_means)) +
  geom_histogram() + xlab("Low Price Means (Indian Rupees)") + ylab("Count")

Fig. 2: Low price means histogram.

Figure 1 and Figure 2 show that there is a huge outlier for both the high prices and low prices.

Hmisc::describe(trading_means)

## trading_means 
## 
##  2  Variables      36  Observations
## --------------------------------------------------------------------------------
## high_price_means 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##       36        0       36        1    906.2     1121    148.2    158.6 
##      .25      .50      .75      .90      .95 
##    239.0    458.1    699.3   1553.7   1715.6 
## 
## lowest :   140.4375   144.4000   149.4000   152.9417   164.2062
## highest:  1412.6091  1694.8804  1710.0864  1732.0577 12516.9583
## --------------------------------------------------------------------------------
## low_price_means 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##       36        0       36        1    882.4     1093    143.8    152.8 
##      .25      .50      .75      .90      .95 
##    232.6    447.0    687.3   1514.0   1663.0 
## 
## lowest :   135.4250   140.6895   144.8938   148.2583   157.3000
## highest:  1377.3561  1650.6551  1659.8283  1672.5409 12245.5833
## --------------------------------------------------------------------------------

The Hmisc library has a describe function which shows the means, the 5th, 10th, 25th, 50th, 75th, 90th, 95th percentiles, the 5 lowest values, and 5 highest values for each variable for a given dataframe. The output above reveals that the highest value for both the high_price_means and low_price_means are a order of magnitude larger than the 2nd highest, which is why there a huge outlier for both graphs.

trading_means %>%
  filter(high_price_means > 10000 | low_price_means > 10000)

##                      high_price_means low_price_means
## AUTO-TYRES AND TUBES         12516.96        12245.58

The output for the code block above shows that the AUTO-TYRES AND TUBES Sector has the highest high and low price means. In fact, this is the only sector with a high or low price mean of over 10,000. This is a significant outlier. Therefore, it was omitted in the following two graphs.

ggplot(data = trading_means %>% filter(high_price_means < 10000), aes(x = high_price_means)) +
  geom_histogram() + xlab("High Price Means (Indian Rupees)") + ylab("Count")

Fig. 3: High price means histogram. Mean high prices above 10,000 Indian rupees were filtered out of the data.

ggplot(data = trading_means %>% filter(low_price_means < 10000), aes(x = low_price_means)) +
  geom_histogram() + xlab("Low Price Means (Indian Rupees)") + ylab("Count")

Fig. 4: Low price means histogram. Mean low prices above 10,000 Indian rupees were filtered out of the data.

After removing the huge outlier from both the high price and low price means, it is shown that the distribution for both graphs looks to be unimodal and right skewed. The histograms for both the high and low trade price means imply that trade prices in the Indian stock market typically tend to stay below 1000 Indian Rupees a share.

Conclusions

By analyzing the Indian stock market data from Kaggle, it has been shown that this data is unimodal and right skewed. It would have been interesting to see the distribution of the data for those high and low prices that were not missing from the original dataset, since there were 599 missing stock prices from the original data. A future analysis could involve finding an association between the high price and low price and other variables in the dataset.