Introduction
One of the datasets that will be tidy, transformed, and analyzed in this report is a dataset that contains the information for well performing Indian stocks in the NSE. This dataset was pulled from kaggle. There are 63 variables in this dataset. Unfortunately, the author of the dataset did not provide an ancillary document providing a description of what each variable means. This fact also means that the actual currency for the prices is unknown. Since the data is for the Indian stock market, the currency is assumed to be in Indian Rupees. For this dataset, an analysis detailed in this report was done to address the following:
- Finding the mean trading high and low prices for each sector.
- The distribution of those means for both high and low prices.
Importing of the Data
The .csv containing the data was stored onto a Github repository which was then imported into our workspace with the following code.
url <- 'https://raw.githubusercontent.com/peterphung2043/DATA-607---Project-2/main/Stock%20Market%20data%20.csv'
stock_market_data <- read.csv(url(url), stringsAsFactors = FALSE, na.strings = "#N/A")
knitr::kable(stock_market_data[1:5, 1:5])| Share | Category | Sector | RM | Up |
|---|---|---|---|---|
| 20MICRONS | Exit | CHEMICALS | NA | NA |
| 3IINFOTECH | BWLV | IT | NA | NA |
| 3MINDIA | Exit | DIVERSIFIED | 0 | 0 |
| 63MOONS | RS | IT | NA | NA |
| 8KMILES | HWV | IT | 0 | 0 |
The output above only shows the first five variables and the first five observations. The rest of the data can be viewed on the kaggle hyperlink in the Introduction.
Tidying and Transforming of the Data
The columns that were analyzed in this dataset were the Sector, High.Price, and Low.Price.
A check was done to see if the number of missing values for the high prices was the same as the number of missing values for the low prices. This is to see if any of the non-missing high prices had any corresponding missing low prices for each observation and vice versa.
stock_market_data %>%
summarise(missing_high_prices = sum(is.na(High.Price)),
missing_low_prices = sum(is.na(Low.Price)))## missing_high_prices missing_low_prices
## 1 599 599
The output above shows that the High.Price column and the Low.Price column have the same amount of missing values.
The code block below selects the 3 variables that we need for the analysis. then discards observations containing missing values in the dataset. The resulting dataframe is then stored as parsed_stock_market_data. Only the first 5 observations are shown below.
parsed_stock_market_data <- stock_market_data %>%
select(Sector, High.Price, Low.Price) %>%
drop_na()
knitr::kable(parsed_stock_market_data[1:5,])| Sector | High.Price | Low.Price |
|---|---|---|
| DIVERSIFIED | 12399.95 | 12115.0 |
| IT | 76.70 | 75.6 |
| IT | 640.00 | 624.0 |
| INFRA-CONSTRUCTION ENGINEERING AND MATERIALS | 50.50 | 47.8 |
| PHARMA | 543.75 | 531.0 |
The code block below does the following to the parsed_stock_market_data dataframe.
Grouping by the
Sectorcolumn. Since the analysis calls for determining the mean high and low price by sector.Using a
mutateoperation to assign a row number to each sector in each group. For example, for the first observation, assume thatsector = "IT". For the first observation,row = 1. For the 2nd observation,sector = "DIVERSIFIED", sorow = 1for the 2nd observation. For the 3rd observationsector = "IT", so ’row = 2` for the 3rd observation. Notice that the sectors for the first and third observations were the same, so when assigning a row number, the row number increments if subsequent observations contain the same sector.pivot_widermakes it so that all of theHigh.Prices corresponding to a uniqueSectorgets put into a column. Similarily, all theLow.Prices corresponding to a uniqueSectorgets put into a column.The
rowcolumn that was generated from step 2 was deleted.
parsed_stock_market_data <- parsed_stock_market_data %>%
group_by(Sector) %>%
mutate(row = row_number()) %>%
pivot_wider(names_from = Sector, values_from = c(High.Price, Low.Price)) %>%
select(-row)
knitr::kable((parsed_stock_market_data %>%
select(High.Price_DIVERSIFIED, Low.Price_DIVERSIFIED,
High.Price_IT, Low.Price_IT))[1:5,])| High.Price_DIVERSIFIED | Low.Price_DIVERSIFIED | High.Price_IT | Low.Price_IT |
|---|---|---|---|
| 12399.95 | 12115.00 | 76.70 | 75.6 |
| 1713.45 | 1684.35 | 640.00 | 624.0 |
| 233.65 | 231.10 | 383.85 | 370.6 |
| 1212.70 | 1192.00 | 221.45 | 215.1 |
| 381.70 | 376.00 | 131.80 | 124.9 |
The output above shows just the first five values for two of the sectors in parsed_stock_market_data.
The data was then stored in a nested dataframe. In this nested dataframe, the High.Price columns and the Low.Price columns in the parsed stock market data were grouped together separately then moved into list-columns. The resulting nested dataframe is stored as nested_stock_market_data.
nested_stock_market_data <- parsed_stock_market_data %>%
nest(high_price = starts_with("High.Price"), low_price = starts_with("Low.Price"))The high_price column in nested_stock_market_data contains all of the High.Price values arranged by Sector. Similarly, the low_price column contains all of the Low.Price values arranged by Sector. The output for nested_stock_market_data is shown below, each list column has 36 variables, which means that there are 36 sectors with multiple high prices and 36 sectors and low prices.
On RStudio, clicking on an element on each of the list-columns in nested_stock_market_data will output the data with respect to the list-column. The outputs for each list-column are shown in the subsequent dataframe.
| high_price | low_price |
|---|---|
| 36 variables | 36 variables |
We can index the list-columns in nested_stock_market_data in order to bring up all of the High.Price values for each of the Sectors and similarly for the Low.Price``Sectors.
knitr::kable((nested_stock_market_data$high_price[[1]] %>%
select(High.Price_DIVERSIFIED, High.Price_IT))[1:5,])| High.Price_DIVERSIFIED | High.Price_IT |
|---|---|
| 12399.95 | 76.70 |
| 1713.45 | 640.00 |
| 233.65 | 383.85 |
| 1212.70 | 221.45 |
| 381.70 | 131.80 |
knitr::kable((nested_stock_market_data$low_price[[1]] %>%
select(Low.Price_DIVERSIFIED, Low.Price_IT))[1:5,])| Low.Price_DIVERSIFIED | Low.Price_IT |
|---|---|
| 12115.00 | 75.6 |
| 1684.35 | 624.0 |
| 231.10 | 370.6 |
| 1192.00 | 215.1 |
| 376.00 | 124.9 |
The output for the code block above shows the high and low prices for just the DIVERSIFIED and IT sectors in nested_stock_market_data. There are 36 sectors in total so only two are shown.
Displaying the Means of the High and Low Prices for Each Sector
The following block of code computes the means for each of the sectors for Both high and low prices from the nested_stock_market_data dataframe. The high price and low price means for each sector are stored in the trading_means dataframe.
trading_means <- data.frame(
high_price_means = colMeans(nested_stock_market_data$high_price[[1]], na.rm = TRUE),
low_price_means = colMeans(nested_stock_market_data$low_price[[1]], na.rm = TRUE)
)
rownames(trading_means) <- str_extract(names(nested_stock_market_data$high_price[[1]]), '(?<=_).+')
knitr::kable(trading_means)| high_price_means | low_price_means | |
|---|---|---|
| DIVERSIFIED | 1710.0864 | 1672.5409 |
| IT | 456.5485 | 447.1265 |
| INFRA-CONSTRUCTION ENGINEERING AND MATERIALS | 237.6061 | 231.1000 |
| PHARMA | 761.4545 | 745.0689 |
| CHEMICALS | 587.5079 | 573.4456 |
| ENERGY-OIL & GAS | 678.6229 | 668.0771 |
| INFRA-MACHINERY EQUIPMENT | 647.1958 | 628.6992 |
| TEXTILES AND APPAREL | 577.4724 | 564.8362 |
| CEMENTS | 1380.3732 | 1324.1214 |
| INFRA-POWER | 307.8364 | 300.5318 |
| TRANSPORTATION LOGISTICS | 459.6792 | 446.8792 |
| FMCG-FOOD | 553.7417 | 538.5354 |
| ENTERTAINMENT-HOTELS AND LEISURE | 244.4423 | 237.7462 |
| REALTY | 202.5462 | 195.9981 |
| REALTY-HOUSEHOLD | 626.4444 | 609.9491 |
| FINANCIAL SERVICES | 668.2844 | 651.2984 |
| BANKS-PSU | 144.4000 | 140.6895 |
| METALS AND MINING | 254.3843 | 247.3588 |
| AUTO-AUTO ANCL AND COMPONENTS | 1412.6091 | 1377.3561 |
| FMCG-SUGAR | 149.4000 | 144.8938 |
| AUTO-TYRES AND TUBES | 12516.9583 | 12245.5833 |
| AUTO-AUTOMOBILES AND AUTO PARTS | 1732.0577 | 1650.6551 |
| FMCG-BEVERAGES | 416.9969 | 390.2125 |
| DEFENCE | 140.4375 | 135.4250 |
| BANKS-PRIVATE | 478.6176 | 471.2441 |
| FMCG | 1694.8804 | 1659.8283 |
| FINANCIAL SERVICES-NBFC | 803.6853 | 781.1368 |
| MEDIA-ENTERTAINMENT | 313.7667 | 304.4714 |
| PAPER | 164.2063 | 157.3000 |
| FOOTWEAR | 300.8000 | 293.8333 |
| INFRA-TELECOM | 216.8643 | 212.3786 |
| FERTILISERS | 238.3562 | 229.5938 |
| FMCG-PACKAGING | 239.1867 | 233.0367 |
| JEWELLERY | 313.7312 | 306.1375 |
| MEDIA | 152.9417 | 148.2583 |
| AIRLINES | 838.2750 | 800.4250 |
The Distribution of the Means for the High and Low prices
ggplot(data = trading_means, aes(x = high_price_means)) +
geom_histogram() + xlab("High Price Means (Indian Rupees)") + ylab("Count")Fig. 1: High price means histogram.
ggplot(data = trading_means, aes(x = low_price_means)) +
geom_histogram() + xlab("Low Price Means (Indian Rupees)") + ylab("Count")Fig. 2: Low price means histogram.
Figure 1 and Figure 2 show that there is a huge outlier for both the high prices and low prices.
Hmisc::describe(trading_means)## trading_means
##
## 2 Variables 36 Observations
## --------------------------------------------------------------------------------
## high_price_means
## n missing distinct Info Mean Gmd .05 .10
## 36 0 36 1 906.2 1121 148.2 158.6
## .25 .50 .75 .90 .95
## 239.0 458.1 699.3 1553.7 1715.6
##
## lowest : 140.4375 144.4000 149.4000 152.9417 164.2062
## highest: 1412.6091 1694.8804 1710.0864 1732.0577 12516.9583
## --------------------------------------------------------------------------------
## low_price_means
## n missing distinct Info Mean Gmd .05 .10
## 36 0 36 1 882.4 1093 143.8 152.8
## .25 .50 .75 .90 .95
## 232.6 447.0 687.3 1514.0 1663.0
##
## lowest : 135.4250 140.6895 144.8938 148.2583 157.3000
## highest: 1377.3561 1650.6551 1659.8283 1672.5409 12245.5833
## --------------------------------------------------------------------------------
The Hmisc library has a describe function which shows the means, the 5th, 10th, 25th, 50th, 75th, 90th, 95th percentiles, the 5 lowest values, and 5 highest values for each variable for a given dataframe. The output above reveals that the highest value for both the high_price_means and low_price_means are a order of magnitude larger than the 2nd highest, which is why there a huge outlier for both graphs.
trading_means %>%
filter(high_price_means > 10000 | low_price_means > 10000)## high_price_means low_price_means
## AUTO-TYRES AND TUBES 12516.96 12245.58
The output for the code block above shows that the AUTO-TYRES AND TUBES Sector has the highest high and low price means. In fact, this is the only sector with a high or low price mean of over 10,000. This is a significant outlier. Therefore, it was omitted in the following two graphs.
ggplot(data = trading_means %>% filter(high_price_means < 10000), aes(x = high_price_means)) +
geom_histogram() + xlab("High Price Means (Indian Rupees)") + ylab("Count")Fig. 3: High price means histogram. Mean high prices above 10,000 Indian rupees were filtered out of the data.
ggplot(data = trading_means %>% filter(low_price_means < 10000), aes(x = low_price_means)) +
geom_histogram() + xlab("Low Price Means (Indian Rupees)") + ylab("Count")Fig. 4: Low price means histogram. Mean low prices above 10,000 Indian rupees were filtered out of the data.
After removing the huge outlier from both the high price and low price means, it is shown that the distribution for both graphs looks to be unimodal and right skewed. The histograms for both the high and low trade price means imply that trade prices in the Indian stock market typically tend to stay below 1000 Indian Rupees a share.
Conclusions
By analyzing the Indian stock market data from Kaggle, it has been shown that this data is unimodal and right skewed. It would have been interesting to see the distribution of the data for those high and low prices that were not missing from the original dataset, since there were 599 missing stock prices from the original data. A future analysis could involve finding an association between the high price and low price and other variables in the dataset.