Project 2 Team Breakdown

Peter Phung - Jiho Kim’s Indian Stock Market Data (Section 1)

Coffy Andrews - So Much Candy Data (Section 2)

Chinedu Onyeka - Eric Lehmphul’s Student Testing Dataset (Section 3)


Jiho Kim’s Indian Stock Market Data

Introduction

One of the datasets that will be tidied, transformed, and analyzed in this report contains information on well-performing Indian stocks on the NSE. This dataset was pulled from Kaggle. There are 63 variables in this dataset. Unfortunately, the author of the dataset did not provide an ancillary document describing what each variable means, which also means that the actual currency of the prices is unknown. Since the data is for the Indian stock market, the currency is assumed to be Indian rupees. For this dataset, the analysis detailed in this report addresses the following:

  1. Finding the mean trading high and low prices for each sector.
  2. The distribution of those means for both high and low prices.

Importing of the Data

The .csv containing the data was stored in a GitHub repository and then imported into our workspace with the following code.

# Read the data directly from GitHub; "#N/A" entries are treated as missing
url <- 'https://raw.githubusercontent.com/peterphung2043/DATA-607---Project-2/main/Stock%20Market%20data%20.csv'
stock_market_data <- read.csv(url, stringsAsFactors = FALSE, na.strings = "#N/A")
knitr::kable(stock_market_data[1:5, 1:5])
Share        Category   Sector        RM   Up
-----------  ---------  ------------  ---  ---
20MICRONS    Exit       CHEMICALS     NA   NA
3IINFOTECH   BWLV       IT            NA   NA
3MINDIA      Exit       DIVERSIFIED   0    0
63MOONS      RS         IT            NA   NA
8KMILES      HWV        IT            0    0

The output above shows only the first five variables and the first five observations. The rest of the data can be viewed via the Kaggle hyperlink in the Introduction.
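Since no data dictionary is available, the full set of columns can at least be inspected programmatically. A minimal sketch using base R:

dim(stock_market_data)    # number of observations and the 63 variables
names(stock_market_data)  # the variable names, absent a data dictionary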

Tidying and Transforming of the Data

The columns that were analyzed in this dataset were the Sector, High.Price, and Low.Price.

A check was done to see whether the number of missing values for the high prices matches the number of missing values for the low prices, as a first indication of whether any non-missing high price has a corresponding missing low price, or vice versa.

# Count the missing values in each price column
stock_market_data %>%
  summarise(missing_high_prices = sum(is.na(High.Price)),
            missing_low_prices = sum(is.na(Low.Price)))
##   missing_high_prices missing_low_prices
## 1                 599                599

The output above shows that the High.Price column and the Low.Price column have the same number of missing values.
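Equal counts alone would not rule out mismatched rows, so a stricter row-wise check can confirm that every missing high price pairs with a missing low price. A small sketch on the same columns:

# count rows where exactly one of the two prices is missing; 0 confirms alignment
stock_market_data %>%
  summarise(mismatched_rows = sum(is.na(High.Price) != is.na(Low.Price)))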

The code block below selects the three variables needed for the analysis, then discards observations containing missing values. The resulting dataframe is stored as parsed_stock_market_data. Only the first five observations are shown below.

parsed_stock_market_data <- stock_market_data %>%
  select(Sector, High.Price, Low.Price) %>%
  drop_na()
knitr::kable(parsed_stock_market_data[1:5,])
Sector                                          High.Price   Low.Price
---------------------------------------------  -----------  ----------
DIVERSIFIED                                        12399.95     12115.0
IT                                                    76.70        75.6
IT                                                   640.00       624.0
INFRA-CONSTRUCTION ENGINEERING AND MATERIALS          50.50        47.8
PHARMA                                               543.75       531.0

The code block below does the following to the parsed_stock_market_data dataframe.

  1. Groups by the Sector column, since the analysis calls for determining the mean high and low price by sector.

  2. Uses a mutate operation to assign a row number to each observation within its group. For example, suppose sector = "IT" for the first observation; then row = 1. If sector = "DIVERSIFIED" for the second observation, then row = 1 there as well, since it begins a new group. If sector = "IT" again for the third observation, then row = 2, because the row number increments whenever a later observation repeats a sector.

  3. pivot_wider puts all of the High.Price values corresponding to each unique Sector into their own column. Similarly, all of the Low.Price values corresponding to each unique Sector are put into their own column.

  4. The row column generated in step 2 is deleted.

parsed_stock_market_data <- parsed_stock_market_data %>%
  group_by(Sector) %>%
  mutate(row = row_number()) %>%
  pivot_wider(names_from = Sector, values_from = c(High.Price, Low.Price)) %>%
  select(-row)

knitr::kable((parsed_stock_market_data %>%
               select(High.Price_DIVERSIFIED, Low.Price_DIVERSIFIED,
                      High.Price_IT, Low.Price_IT))[1:5,])
High.Price_DIVERSIFIED   Low.Price_DIVERSIFIED   High.Price_IT   Low.Price_IT
-----------------------  ----------------------  --------------  -------------
              12399.95                12115.00           76.70           75.6
               1713.45                 1684.35          640.00          624.0
                233.65                  231.10          383.85          370.6
               1212.70                 1192.00          221.45          215.1
                381.70                  376.00          131.80          124.9

The output above shows just the first five values for two of the sectors in parsed_stock_market_data.

The data was then stored in a nested dataframe, in which the High.Price columns and the Low.Price columns of the parsed stock market data were grouped separately and moved into list-columns. The resulting nested dataframe is stored as nested_stock_market_data.

nested_stock_market_data <- parsed_stock_market_data %>%
  nest(high_price = starts_with("High.Price"), low_price = starts_with("Low.Price"))

The high_price column in nested_stock_market_data contains all of the High.Price values arranged by Sector. Similarly, the low_price column contains all of the Low.Price values arranged by Sector. The output for nested_stock_market_data is shown below; each list-column contains 36 variables, which means that there are 36 sectors, each with its own set of high and low prices.

In RStudio, clicking on an element of each of the list-columns in nested_stock_market_data brings up the data for that list-column. The nested dataframe itself displays as shown below.

high_price     low_price
-------------  -------------
36 variables   36 variables

We can index the list-columns in nested_stock_market_data in order to bring up all of the High.Price values for each of the Sectors, and similarly for the Low.Price values.

knitr::kable((nested_stock_market_data$high_price[[1]] %>%
                select(High.Price_DIVERSIFIED, High.Price_IT))[1:5,])
High.Price_DIVERSIFIED   High.Price_IT
-----------------------  --------------
              12399.95           76.70
               1713.45          640.00
                233.65          383.85
               1212.70          221.45
                381.70          131.80
knitr::kable((nested_stock_market_data$low_price[[1]] %>%
                select(Low.Price_DIVERSIFIED, Low.Price_IT))[1:5,])
Low.Price_DIVERSIFIED   Low.Price_IT
----------------------  -------------
             12115.00           75.6
              1684.35          624.0
               231.10          370.6
              1192.00          215.1
               376.00          124.9

The output for the code block above shows the high and low prices for just the DIVERSIFIED and IT sectors in nested_stock_market_data. There are 36 sectors in total so only two are shown.

Displaying the Means of the High and Low Prices for Each Sector

The following block of code computes the means for each of the sectors for both the high and low prices from the nested_stock_market_data dataframe. The high price and low price means for each sector are stored in the trading_means dataframe.

# Column means over each nested table give the per-sector means; na.rm = TRUE
# ignores the NA padding that pivot_wider added for sectors with fewer stocks
trading_means <- data.frame(
  high_price_means = colMeans(nested_stock_market_data$high_price[[1]], na.rm = TRUE),
  low_price_means = colMeans(nested_stock_market_data$low_price[[1]], na.rm = TRUE)
)
# Keep only the sector name (everything after the underscore) as the row name
rownames(trading_means) <- str_extract(names(nested_stock_market_data$high_price[[1]]), '(?<=_).+')

knitr::kable(trading_means)
Sector                                          high_price_means   low_price_means
----------------------------------------------  -----------------  ----------------
DIVERSIFIED                                             1710.0864         1672.5409
IT                                                       456.5485          447.1265
INFRA-CONSTRUCTION ENGINEERING AND MATERIALS             237.6061          231.1000
PHARMA                                                   761.4545          745.0689
CHEMICALS                                                587.5079          573.4456
ENERGY-OIL & GAS                                         678.6229          668.0771
INFRA-MACHINERY EQUIPMENT                                647.1958          628.6992
TEXTILES AND APPAREL                                     577.4724          564.8362
CEMENTS                                                 1380.3732         1324.1214
INFRA-POWER                                              307.8364          300.5318
TRANSPORTATION LOGISTICS                                 459.6792          446.8792
FMCG-FOOD                                                553.7417          538.5354
ENTERTAINMENT-HOTELS AND LEISURE                         244.4423          237.7462
REALTY                                                   202.5462          195.9981
REALTY-HOUSEHOLD                                         626.4444          609.9491
FINANCIAL SERVICES                                       668.2844          651.2984
BANKS-PSU                                                144.4000          140.6895
METALS AND MINING                                        254.3843          247.3588
AUTO-AUTO ANCL AND COMPONENTS                           1412.6091         1377.3561
FMCG-SUGAR                                               149.4000          144.8938
AUTO-TYRES AND TUBES                                   12516.9583        12245.5833
AUTO-AUTOMOBILES AND AUTO PARTS                         1732.0577         1650.6551
FMCG-BEVERAGES                                           416.9969          390.2125
DEFENCE                                                  140.4375          135.4250
BANKS-PRIVATE                                            478.6176          471.2441
FMCG                                                    1694.8804         1659.8283
FINANCIAL SERVICES-NBFC                                  803.6853          781.1368
MEDIA-ENTERTAINMENT                                      313.7667          304.4714
PAPER                                                    164.2063          157.3000
FOOTWEAR                                                 300.8000          293.8333
INFRA-TELECOM                                            216.8643          212.3786
FERTILISERS                                              238.3562          229.5938
FMCG-PACKAGING                                           239.1867          233.0367
JEWELLERY                                                313.7312          306.1375
MEDIA                                                    152.9417          148.2583
AIRLINES                                                 838.2750          800.4250

The Distribution of the Means for the High and Low Prices

ggplot(data = trading_means, aes(x = high_price_means)) +
  geom_histogram() + xlab("High Price Means (Indian Rupees)") + ylab("Count")

Fig. 1: High price means histogram.

ggplot(data = trading_means, aes(x = low_price_means)) +
  geom_histogram() + xlab("Low Price Means (Indian Rupees)") + ylab("Count")

Fig. 2: Low price means histogram.

Figure 1 and Figure 2 show that there is a single large outlier in both the high price and low price means.

Hmisc::describe(trading_means)
## trading_means 
## 
##  2  Variables      36  Observations
## --------------------------------------------------------------------------------
## high_price_means 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##       36        0       36        1    906.2     1121    148.2    158.6 
##      .25      .50      .75      .90      .95 
##    239.0    458.1    699.3   1553.7   1715.6 
## 
## lowest :   140.4375   144.4000   149.4000   152.9417   164.2062
## highest:  1412.6091  1694.8804  1710.0864  1732.0577 12516.9583
## --------------------------------------------------------------------------------
## low_price_means 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##       36        0       36        1    882.4     1093    143.8    152.8 
##      .25      .50      .75      .90      .95 
##    232.6    447.0    687.3   1514.0   1663.0 
## 
## lowest :   135.4250   140.6895   144.8938   148.2583   157.3000
## highest:  1377.3561  1650.6551  1659.8283  1672.5409 12245.5833
## --------------------------------------------------------------------------------

The Hmisc library has a describe function which shows the mean; the 5th, 10th, 25th, 50th, 75th, 90th, and 95th percentiles; and the five lowest and five highest values for each variable in a given dataframe. The output above reveals that the highest value for both high_price_means and low_price_means is nearly an order of magnitude larger than the second highest, which is why there is such a large outlier in both graphs.
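As a quick check of that claim, the ratio of the largest to the second-largest mean can be computed directly from trading_means (a small sketch):

sorted_means <- sort(trading_means$high_price_means, decreasing = TRUE)
sorted_means[1] / sorted_means[2]  # roughly 7.2 given the table above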

trading_means %>%
  filter(high_price_means > 10000 | low_price_means > 10000)
##                      high_price_means low_price_means
## AUTO-TYRES AND TUBES         12516.96        12245.58

The output above shows that the AUTO-TYRES AND TUBES sector has the highest high and low price means; in fact, it is the only sector with a high or low price mean over 10,000. Because it is such a significant outlier, it was omitted from the following two graphs.

ggplot(data = trading_means %>% filter(high_price_means < 10000), aes(x = high_price_means)) +
  geom_histogram() + xlab("High Price Means (Indian Rupees)") + ylab("Count")

Fig. 3: High price means histogram. Mean high prices above 10,000 Indian rupees were filtered out of the data.

ggplot(data = trading_means %>% filter(low_price_means < 10000), aes(x = low_price_means)) +
  geom_histogram() + xlab("Low Price Means (Indian Rupees)") + ylab("Count")

Fig. 4: Low price means histogram. Mean low prices above 10,000 Indian rupees were filtered out of the data.

After removing this outlier from both the high price and low price means, the distributions in both graphs appear unimodal and right skewed. The histograms for both the high and low trade price means suggest that mean trade prices in the Indian stock market typically stay below 1,000 Indian rupees a share.
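That claim can be checked numerically from trading_means; a small sketch:

mean(trading_means$high_price_means < 1000)  # about 0.83 given the table above
mean(trading_means$low_price_means < 1000)   # likewise about 0.83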

Conclusions

By analyzing the Indian stock market data from Kaggle, it has been shown that the per-sector mean high and low prices are unimodal and right skewed. It would have been interesting to see the distribution of the high and low prices that were missing from the original dataset, since 599 stock prices were absent from the original data. A future analysis could involve finding associations between the high and low prices and other variables in the dataset.
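As a starting point for that future analysis, a minimal sketch of the high/low price association, using the stock_market_data loaded earlier:

# correlation between each stock's high and low price over the non-missing rows
stock_market_data %>%
  drop_na(High.Price, Low.Price) %>%
  summarise(high_low_correlation = cor(High.Price, Low.Price))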


So Much Candy Data

This analysis was sourced from The Science Creative Quarterly, written by David Ng and published at BoingBoing. Halloween is right around the corner, and everyone is stocking up on sweets. Why would individuals continue to buy candy in bulk during a pandemic? Simple! We all love candy: Kit Kat, Hershey, Snickers, Twix, and a handful of others. The researchers, Ng & Cohen, have compiled this hierarchy of candy preferences for years as a geology joke.

The “Candy Hierarchy Data 2017” data was compiled from survey responses, with ratings on how respondents feel when they receive each item in their Halloween haul.

Reading in Data

# load the data from GitHub
data <- read.csv(url("https://raw.githubusercontent.com/peterphung2043/DATA-607---Project-2-Final/main/candyhierarchy2017.csv"), header = FALSE, stringsAsFactors = FALSE)
data <- data %>% mutate_all(na_if, "")  # change the blank cells to NA
data[1:4, 1:5]  # show the first four rows and five columns of the data
##            V1             V2         V3      V4          V5
## 1 Internal ID Q1: GOING OUT? Q2: GENDER Q3: AGE Q4: COUNTRY
## 2    90258773           <NA>       <NA>    <NA>        <NA>
## 3    90272821             No       Male      44        USA 
## 4    90272829           <NA>       Male      49         USA

Clean candy names

newDf <- data                                           # work on a copy of the raw data
names(newDf) <- as.character(unlist(newDf[1, ]))        # copy the first row to the header
newDf <- newDf[-1:-2, ]                                 # delete the header row and the empty second row
newDf <- newDf[rowSums(is.na(newDf)) != ncol(newDf), ]  # drop rows that are entirely NA
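The raw 2017 survey headers prefix each item with its question number (for example, "Q6 | Butterfinger" and "Q3: AGE"). Assuming the headers follow that pattern, a minimal sketch of stripping the prefixes so that columns can be referenced by candy name alone, as the later code does:

# strip prefixes such as "Q6 | " and "Q3: " (assumes the 2017 header format)
names(newDf) <- sub("^Q[0-9]+(: | \\| )", "", names(newDf))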

Analysis - Type of Feelings when Candy is Received

The data contains 2459 completed survey responses. The survey informed individuals that they could skip an option, leave a question blank, or indicate that they don’t know the candy.
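A quick sanity check of that count (a small sketch):

nrow(newDf)  # expected: 2459 survey responses, per the description above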

Feeling Values:

JOY - Does it make you happy?

DESPAIR - Is it something that you automatically place in the junk pile?

MEH - Indifference

BLANK - No idea what the item is.

We are interested in the preferences for two candies: Butterfinger (Plot 1) and Snickers (Plot 2). The bar plots below show the distribution of feelings (JOY, DESPAIR, MEH, BLANK) for each.

# First, clean up age: non-numeric entries become NA, then 0
newDf$AGE <- as.numeric(newDf$AGE)
newDf$AGE[is.na(newDf$AGE)] <- 0

age_candy <- newDf %>% select(AGE, Butterfinger, Snickers, `Heath Bar`)
library(ggplot2)

# First plot: distribution of feelings for Butterfinger
ggplot(age_candy, aes(x = Butterfinger)) +
  geom_bar()  # geom_bar counts the categories directly, avoiding the stat = "count" warning

# Second plot: distribution of feelings for Snickers
ggplot(age_candy, aes(x = Snickers)) +
  geom_bar()
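As a numeric companion to the two plots, the same distributions can be tabulated; blank responses appear as NA (a small sketch):

table(age_candy$Butterfinger, useNA = "ifany")
table(age_candy$Snickers, useNA = "ifany")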


Student Testing

Objective: To determine if study time impacts test scores, using a sample of 11 students.

Data Source: Eric Lehmphul

library(tidyverse)

Read the file

url <- "https://raw.githubusercontent.com/peterphung2043/DATA-607---Project-2-Final/main/test_scores"
test_scores <- read_csv(url)
test_scores
## # A tibble: 11 × 7
##     ...1 Student  `Test1, TimeStudiedTest1` `Test2, TimeStudi… `Test3, TimeStud…
##    <dbl> <chr>    <chr>                     <chr>              <chr>            
##  1     1 Bob      95, 45                    88, 40             92, 50           
##  2     2 John     85, 35                    60, 8              75, 10           
##  3     3 Sam      78, 15                    75, 16             80, 17           
##  4     4 Jenna    92, 60                    94, 65             84, 60           
##  5     5 Sara     97, 40                    98, 50             95, 45           
##  6     6 Jacob    50, 5                     40, 2              <NA>             
##  7     7 Melinda  <NA>                      90, 47             92, 55           
##  8     8 Billy    78, 15                    80, 25             81, 36           
##  9     9 Kayla    100, 40                   100,40             100, 45          
## 10    10 Nick     90, 35                    94, 32             94, 30           
## 11    11 Nicolete 75, 20                    80, 20             85, 23           
## # … with 2 more variables: Test4, TimeStudiedTest4 <chr>, Gender <chr>

Separate the columns

# Separate each paired "score, time" column into a Test column and a
# TimeStudiedTest column; separate() splits on the comma by default
test_scores <- test_scores %>% separate(`Test1, TimeStudiedTest1`, 
                                        into = c("Test1", "TimeStudiedTest1"))
test_scores <- test_scores %>% separate(`Test2, TimeStudiedTest2`, 
                                        into = c("Test2", "TimeStudiedTest2"))
test_scores <- test_scores %>% separate(`Test3, TimeStudiedTest3`, 
                                        into = c("Test3", "TimeStudiedTest3"))
# select the columns in order
test <- test_scores %>% select(Student, Test1, Test2, Test3, 
                               TimeStudiedTest1, TimeStudiedTest2, TimeStudiedTest3)
test
## # A tibble: 11 × 7
##    Student  Test1 Test2 Test3 TimeStudiedTest1 TimeStudiedTest2 TimeStudiedTest3
##    <chr>    <chr> <chr> <chr> <chr>            <chr>            <chr>           
##  1 Bob      95    88    92    45               40               50              
##  2 John     85    60    75    35               8                10              
##  3 Sam      78    75    80    15               16               17              
##  4 Jenna    92    94    84    60               65               60              
##  5 Sara     97    98    95    40               50               45              
##  6 Jacob    50    40    <NA>  5                2                <NA>            
##  7 Melinda  <NA>  90    92    <NA>             47               55              
##  8 Billy    78    80    81    15               25               36              
##  9 Kayla    100   100   100   40               40               45              
## 10 Nick     90    94    94    35               32               30              
## 11 Nicolete 75    80    85    20               20               23

Convert all the columns from Test1 to TimeStudiedTest3 to numeric

test_student <- test %>% select(Student)
test_others <- test %>% select(-Student)
# convert all the columns except the Student column to numeric
dat_df <- as.data.frame(sapply(test_others, as.numeric))
test_results <- cbind(test_student, dat_df)
# Replace NA with 0 so missing tests and study times count as zero
test_results <- test_results %>% replace(is.na(.), 0)
test_results
##     Student Test1 Test2 Test3 TimeStudiedTest1 TimeStudiedTest2
## 1       Bob    95    88    92               45               40
## 2      John    85    60    75               35                8
## 3       Sam    78    75    80               15               16
## 4     Jenna    92    94    84               60               65
## 5      Sara    97    98    95               40               50
## 6     Jacob    50    40     0                5                2
## 7   Melinda     0    90    92                0               47
## 8     Billy    78    80    81               15               25
## 9     Kayla   100   100   100               40               40
## 10     Nick    90    94    94               35               32
## 11 Nicolete    75    80    85               20               20
##    TimeStudiedTest3
## 1                50
## 2                10
## 3                17
## 4                60
## 5                45
## 6                 0
## 7                55
## 8                36
## 9                45
## 10               30
## 11               23

Gather the Tests and the time studied

test_resultsa <- test_results %>% select(Student:Test3)
# gather Test1 to Test3 into a long Test/Score pair
test_resultsa <- test_resultsa %>% gather(key = "Test", value = "Score", Test1:Test3) %>% arrange(Student)
# gather TimeStudiedTest1 to TimeStudiedTest3 into a long StudyTime column
test_resultsb <- test_results %>% select(Student, TimeStudiedTest1:TimeStudiedTest3)
test_resultsb <- test_resultsb %>% gather(key = "TestType", value = "StudyTime",
                                          TimeStudiedTest1:TimeStudiedTest3) %>% arrange(Student) %>% select(-Student)
# bind the two long tables; both are sorted by Student, so the rows align
test_results_long <- cbind(test_resultsa, test_resultsb)
test_results_long <- test_results_long %>% select(-TestType)
test_results_long
##     Student  Test Score StudyTime
## 1     Billy Test1    78        15
## 2     Billy Test2    80        25
## 3     Billy Test3    81        36
## 4       Bob Test1    95        45
## 5       Bob Test2    88        40
## 6       Bob Test3    92        50
## 7     Jacob Test1    50         5
## 8     Jacob Test2    40         2
## 9     Jacob Test3     0         0
## 10    Jenna Test1    92        60
## 11    Jenna Test2    94        65
## 12    Jenna Test3    84        60
## 13     John Test1    85        35
## 14     John Test2    60         8
## 15     John Test3    75        10
## 16    Kayla Test1   100        40
## 17    Kayla Test2   100        40
## 18    Kayla Test3   100        45
## 19  Melinda Test1     0         0
## 20  Melinda Test2    90        47
## 21  Melinda Test3    92        55
## 22     Nick Test1    90        35
## 23     Nick Test2    94        32
## 24     Nick Test3    94        30
## 25 Nicolete Test1    75        20
## 26 Nicolete Test2    80        20
## 27 Nicolete Test3    85        23
## 28      Sam Test1    78        15
## 29      Sam Test2    75        16
## 30      Sam Test3    80        17
## 31     Sara Test1    97        40
## 32     Sara Test2    98        50
## 33     Sara Test3    95        45

We now have a long table with four columns: Student, Test, Score, and StudyTime.

Next, we find the average test score and average study time for each student.

test_res <- test_results_long %>% group_by(Student) %>%
  summarise(Avg_Score = round(mean(Score), 0), Avg_StudyTime = round(mean(StudyTime), 0))
# summarise() sorts students alphabetically, so Gender is joined by
# Student name rather than bound on by row position
student_test_results <- test_res %>%
  left_join(test_scores %>% select(Student, Gender), by = "Student") %>%
  select(Student, Gender, Avg_Score, Avg_StudyTime)
student_test_results
##     Student Gender Avg_Score Avg_StudyTime
## 1     Billy   Male        80            25
## 2       Bob   Male        92            45
## 3     Jacob Female        30             2
## 4     Jenna Female        90            62
## 5      John Female        73            18
## 6     Kayla   Male       100            42
## 7   Melinda Female        61            34
## 8      Nick   Male        93            32
## 9  Nicolete Female        80            21
## 10      Sam   Male        78            16
## 11     Sara Female        97            45

Plot a scatter plot of Avg_Score vs Avg_StudyTime

base <- ggplot(data = student_test_results, aes(Avg_StudyTime, Avg_Score)) + geom_point() +
  geom_smooth(se = FALSE) + ylab("Average Score") + xlab("Average Study Time") + theme_bw() + 
  labs(title = "Average Test Score vs Average Study Time")
base
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Find correlation

x <- student_test_results$Avg_StudyTime
y <- student_test_results$Avg_Score
cor.test(x, y)
## 
##  Pearson's product-moment correlation
## 
## data:  x and y
## t = 3.1843, df = 9, p-value = 0.01111
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2271646 0.9242026
## sample estimates:
##       cor 
## 0.7278546

Conclusion: From the scatter plot above and the correlation coefficient of about 0.73 (p ≈ 0.011), we can infer that as students’ average study time increases, their average test score tends to increase as well. Hence, study time appears to be positively associated with test scores, at least within this small sample.
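As a follow-up sketch (not part of the analysis above), a simple linear regression would quantify the same association; Avg_Score and Avg_StudyTime come from the student_test_results dataframe computed earlier:

# regress average score on average study time; the slope is the estimated
# points gained per additional minute of study
fit <- lm(Avg_Score ~ Avg_StudyTime, data = student_test_results)
summary(fit)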