Project 2 Team Breakdown
Peter Phung - Jiho Kim’s Indian Stock Market Data (Section 1)
Coffy Andrews - So Much Candy Data (Section 2)
Chinedu Onyeka - Eric Lehmphul’s Student Testing Dataset (Section 3)
Jiho Kim’s Indian Stock Market Data
Introduction
One of the datasets that will be tidied, transformed, and analyzed in this report contains information on well-performing Indian stocks in the NSE. This dataset was pulled from Kaggle. There are 63 variables in this dataset. Unfortunately, the author of the dataset did not provide an ancillary document describing what each variable means, which also means that the actual currency of the prices is unknown. Since the data is for the Indian stock market, the currency is assumed to be Indian rupees. For this dataset, an analysis detailed in this report was done to address the following:
- Finding the mean trading high and low prices for each sector.
- The distribution of those means for both high and low prices.
Importing of the Data
The .csv containing the data was stored in a GitHub repository and imported into our workspace with the following code.
url <- 'https://raw.githubusercontent.com/peterphung2043/DATA-607---Project-2/main/Stock%20Market%20data%20.csv'
stock_market_data <- read.csv(url(url), stringsAsFactors = FALSE, na.strings = "#N/A")
knitr::kable(stock_market_data[1:5, 1:5])
| Share | Category | Sector | RM | Up |
|---|---|---|---|---|
| 20MICRONS | Exit | CHEMICALS | NA | NA |
| 3IINFOTECH | BWLV | IT | NA | NA |
| 3MINDIA | Exit | DIVERSIFIED | 0 | 0 |
| 63MOONS | RS | IT | NA | NA |
| 8KMILES | HWV | IT | 0 | 0 |
The output above shows only the first five variables and the first five observations. The rest of the data can be viewed via the Kaggle hyperlink in the Introduction.
Tidying and Transforming of the Data
The columns that were analyzed in this dataset were the Sector, High.Price, and Low.Price.
A check was done to see whether the number of missing high prices equals the number of missing low prices. This gives a first indication of whether any non-missing high price has a corresponding missing low price for an observation, or vice versa.
stock_market_data %>%
summarise(missing_high_prices = sum(is.na(High.Price)),
missing_low_prices = sum(is.na(Low.Price)))
## missing_high_prices missing_low_prices
## 1 599 599
The output above shows that the High.Price and Low.Price columns have the same number of missing values.
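Equal counts alone do not guarantee that the two prices are missing in the same rows, however. A minimal row-wise sketch can confirm the pairing (unpaired is a hypothetical name):
stock_market_data %>%
  summarise(unpaired = sum(is.na(High.Price) != is.na(Low.Price))) # 0 means the high and low prices are always missing together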
The code block below selects the three variables needed for the analysis, then discards observations containing missing values. The resulting dataframe is stored as parsed_stock_market_data. Only the first five observations are shown below.
parsed_stock_market_data <- stock_market_data %>%
select(Sector, High.Price, Low.Price) %>%
drop_na()
knitr::kable(parsed_stock_market_data[1:5,])
| Sector | High.Price | Low.Price |
|---|---|---|
| DIVERSIFIED | 12399.95 | 12115.0 |
| IT | 76.70 | 75.6 |
| IT | 640.00 | 624.0 |
| INFRA-CONSTRUCTION ENGINEERING AND MATERIALS | 50.50 | 47.8 |
| PHARMA | 543.75 | 531.0 |
The code block below does the following to the parsed_stock_market_data dataframe:
1. Groups by the Sector column, since the analysis calls for determining the mean high and low price by sector.
2. Uses a mutate operation to assign a row number to each observation within its group. For example, assume that sector = "IT" for the first observation, so row = 1. For the 2nd observation, sector = "DIVERSIFIED", so row = 1. For the 3rd observation, sector = "IT", so row = 2. Notice that the first and third observations share a sector, so when assigning row numbers, the row number increments whenever subsequent observations contain the same sector.
3. Calls pivot_wider so that all of the High.Prices corresponding to a unique Sector are put into one column. Similarly, all of the Low.Prices corresponding to a unique Sector are put into one column.
4. Deletes the row column that was generated in step 2.
parsed_stock_market_data <- parsed_stock_market_data %>%
group_by(Sector) %>%
mutate(row = row_number()) %>%
pivot_wider(names_from = Sector, values_from = c(High.Price, Low.Price)) %>%
select(-row)
knitr::kable((parsed_stock_market_data %>%
select(High.Price_DIVERSIFIED, Low.Price_DIVERSIFIED,
High.Price_IT, Low.Price_IT))[1:5,])
| High.Price_DIVERSIFIED | Low.Price_DIVERSIFIED | High.Price_IT | Low.Price_IT |
|---|---|---|---|
| 12399.95 | 12115.00 | 76.70 | 75.6 |
| 1713.45 | 1684.35 | 640.00 | 624.0 |
| 233.65 | 231.10 | 383.85 | 370.6 |
| 1212.70 | 1192.00 | 221.45 | 215.1 |
| 381.70 | 376.00 | 131.80 | 124.9 |
The output above shows just the first five values for two of the sectors in parsed_stock_market_data.
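As an aside, the row index from step 2 is what makes this pivot possible: without it, pivot_wider cannot uniquely identify each value. A minimal sketch of the failing call, for illustration only:
stock_market_data %>%
  select(Sector, High.Price, Low.Price) %>%
  drop_na() %>%
  pivot_wider(names_from = Sector, values_from = c(High.Price, Low.Price)) # warns that values are not uniquely identified and returns list-columns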
The data was then stored in a nested dataframe. In this nested dataframe, the High.Price columns and the Low.Price columns in the parsed stock market data were grouped together separately then moved into list-columns. The resulting nested dataframe is stored as nested_stock_market_data.
nested_stock_market_data <- parsed_stock_market_data %>%
nest(high_price = starts_with("High.Price"), low_price = starts_with("Low.Price"))
The high_price column in nested_stock_market_data contains all of the High.Price values arranged by Sector. Similarly, the low_price column contains all of the Low.Price values arranged by Sector. The output for nested_stock_market_data is shown below. Each list-column holds 36 variables, meaning there are 36 sectors, each with its own set of high and low prices.
In RStudio, clicking on an element of each list-column in nested_stock_market_data displays the data for that list-column. The printed summary of the two list-columns is shown below.
| high_price | low_price |
|---|---|
| 36 variables | 36 variables |
We can index the list-columns in nested_stock_market_data to bring up all of the High.Price values for each Sector, and similarly for the Low.Price values.
knitr::kable((nested_stock_market_data$high_price[[1]] %>%
select(High.Price_DIVERSIFIED, High.Price_IT))[1:5,])
| High.Price_DIVERSIFIED | High.Price_IT |
|---|---|
| 12399.95 | 76.70 |
| 1713.45 | 640.00 |
| 233.65 | 383.85 |
| 1212.70 | 221.45 |
| 381.70 | 131.80 |
knitr::kable((nested_stock_market_data$low_price[[1]] %>%
select(Low.Price_DIVERSIFIED, Low.Price_IT))[1:5,])
| Low.Price_DIVERSIFIED | Low.Price_IT |
|---|---|
| 12115.00 | 75.6 |
| 1684.35 | 624.0 |
| 231.10 | 370.6 |
| 1192.00 | 215.1 |
| 376.00 | 124.9 |
The output for the code block above shows the high and low prices for just the DIVERSIFIED and IT sectors in nested_stock_market_data. There are 36 sectors in total so only two are shown.
Displaying the Means of the High and Low Prices for Each Sector
The following block of code computes, for each sector, the means of both the high and low prices from the nested_stock_market_data dataframe. The high price and low price means for each sector are stored in the trading_means dataframe.
trading_means <- data.frame(
high_price_means = colMeans(nested_stock_market_data$high_price[[1]], na.rm = TRUE),
low_price_means = colMeans(nested_stock_market_data$low_price[[1]], na.rm = TRUE)
)
rownames(trading_means) <- str_extract(names(nested_stock_market_data$high_price[[1]]), '(?<=_).+') # keep only the sector name after the underscore
knitr::kable(trading_means)
| | high_price_means | low_price_means |
|---|---|---|
| DIVERSIFIED | 1710.0864 | 1672.5409 |
| IT | 456.5485 | 447.1265 |
| INFRA-CONSTRUCTION ENGINEERING AND MATERIALS | 237.6061 | 231.1000 |
| PHARMA | 761.4545 | 745.0689 |
| CHEMICALS | 587.5079 | 573.4456 |
| ENERGY-OIL & GAS | 678.6229 | 668.0771 |
| INFRA-MACHINERY EQUIPMENT | 647.1958 | 628.6992 |
| TEXTILES AND APPAREL | 577.4724 | 564.8362 |
| CEMENTS | 1380.3732 | 1324.1214 |
| INFRA-POWER | 307.8364 | 300.5318 |
| TRANSPORTATION LOGISTICS | 459.6792 | 446.8792 |
| FMCG-FOOD | 553.7417 | 538.5354 |
| ENTERTAINMENT-HOTELS AND LEISURE | 244.4423 | 237.7462 |
| REALTY | 202.5462 | 195.9981 |
| REALTY-HOUSEHOLD | 626.4444 | 609.9491 |
| FINANCIAL SERVICES | 668.2844 | 651.2984 |
| BANKS-PSU | 144.4000 | 140.6895 |
| METALS AND MINING | 254.3843 | 247.3588 |
| AUTO-AUTO ANCL AND COMPONENTS | 1412.6091 | 1377.3561 |
| FMCG-SUGAR | 149.4000 | 144.8938 |
| AUTO-TYRES AND TUBES | 12516.9583 | 12245.5833 |
| AUTO-AUTOMOBILES AND AUTO PARTS | 1732.0577 | 1650.6551 |
| FMCG-BEVERAGES | 416.9969 | 390.2125 |
| DEFENCE | 140.4375 | 135.4250 |
| BANKS-PRIVATE | 478.6176 | 471.2441 |
| FMCG | 1694.8804 | 1659.8283 |
| FINANCIAL SERVICES-NBFC | 803.6853 | 781.1368 |
| MEDIA-ENTERTAINMENT | 313.7667 | 304.4714 |
| PAPER | 164.2063 | 157.3000 |
| FOOTWEAR | 300.8000 | 293.8333 |
| INFRA-TELECOM | 216.8643 | 212.3786 |
| FERTILISERS | 238.3562 | 229.5938 |
| FMCG-PACKAGING | 239.1867 | 233.0367 |
| JEWELLERY | 313.7312 | 306.1375 |
| MEDIA | 152.9417 | 148.2583 |
| AIRLINES | 838.2750 | 800.4250 |
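For reference, the same sector means could also be computed directly from the long data, without the pivot-and-nest detour. A minimal sketch, recomputing from stock_market_data (trading_means_direct is a hypothetical name):
trading_means_direct <- stock_market_data %>%
  select(Sector, High.Price, Low.Price) %>%
  drop_na() %>%
  group_by(Sector) %>%
  summarise(high_price_means = mean(High.Price), # mean high price per sector
            low_price_means = mean(Low.Price)) # mean low price per sector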
The Distribution of the Means for the High and Low Prices
ggplot(data = trading_means, aes(x = high_price_means)) +
geom_histogram() + xlab("High Price Means (Indian Rupees)") + ylab("Count")Fig. 1: High price means histogram.
ggplot(data = trading_means, aes(x = low_price_means)) +
geom_histogram() + xlab("Low Price Means (Indian Rupees)") + ylab("Count")Fig. 2: Low price means histogram.
Figure 1 and Figure 2 show that there is a large outlier in both the high price and low price means.
Hmisc::describe(trading_means)
## trading_means
##
## 2 Variables 36 Observations
## --------------------------------------------------------------------------------
## high_price_means
## n missing distinct Info Mean Gmd .05 .10
## 36 0 36 1 906.2 1121 148.2 158.6
## .25 .50 .75 .90 .95
## 239.0 458.1 699.3 1553.7 1715.6
##
## lowest : 140.4375 144.4000 149.4000 152.9417 164.2062
## highest: 1412.6091 1694.8804 1710.0864 1732.0577 12516.9583
## --------------------------------------------------------------------------------
## low_price_means
## n missing distinct Info Mean Gmd .05 .10
## 36 0 36 1 882.4 1093 143.8 152.8
## .25 .50 .75 .90 .95
## 232.6 447.0 687.3 1514.0 1663.0
##
## lowest : 135.4250 140.6895 144.8938 148.2583 157.3000
## highest: 1377.3561 1650.6551 1659.8283 1672.5409 12245.5833
## --------------------------------------------------------------------------------
The Hmisc library has a describe function which shows, for each variable in a dataframe, the mean; the 5th, 10th, 25th, 50th, 75th, 90th, and 95th percentiles; the 5 lowest values; and the 5 highest values. The output above reveals that the highest value for both high_price_means and low_price_means is an order of magnitude larger than the second highest, which is why there is such a large outlier in both graphs.
trading_means %>%
filter(high_price_means > 10000 | low_price_means > 10000)
## high_price_means low_price_means
## AUTO-TYRES AND TUBES 12516.96 12245.58
The output above shows that the AUTO-TYRES AND TUBES sector has the highest high and low price means. In fact, it is the only sector with a high or low price mean over 10,000. Because it is such a significant outlier, it was omitted from the following two graphs.
ggplot(data = trading_means %>% filter(high_price_means < 10000), aes(x = high_price_means)) +
geom_histogram() + xlab("High Price Means (Indian Rupees)") + ylab("Count")Fig. 3: High price means histogram. Mean high prices above 10,000 Indian rupees were filtered out of the data.
ggplot(data = trading_means %>% filter(low_price_means < 10000), aes(x = low_price_means)) +
geom_histogram() + xlab("Low Price Means (Indian Rupees)") + ylab("Count")Fig. 4: Low price means histogram. Mean low prices above 10,000 Indian rupees were filtered out of the data.
After removing this outlier from both the high price and low price means, the distribution in both graphs appears unimodal and right-skewed. The histograms for both the high and low trade price means imply that mean trade prices in the Indian stock market typically stay below 1,000 Indian rupees a share.
Conclusions
By analyzing the Indian stock market data from Kaggle, it has been shown that the distributions of the mean high and low prices are unimodal and right-skewed. It would have been interesting to see how the distributions would change had the 599 missing high and low prices been present in the original dataset. A future analysis could involve finding associations between the high and low prices and other variables in the dataset.
So Much Candy Data
This analysis was sourced from The Science Creative Quarterly, written by David Ng and published at BoingBoing. Halloween is right around the corner, and everyone is stocking up on sweets. Why would individuals continue to buy candy in bulk during a pandemic? Simple! We all love candy: Kit Kat, Hershey, Snickers, Twix, and a handful of others. The researchers, Ng & Cohen, have compiled candy preference hierarchies for years, originally as a geology joke.
The “Candy Hierarchy Data 2017” dataset was compiled from survey responses rating how respondents feel when they receive each item in their Halloween haul.
Reading in Data
# load the data from GitHub
data <- read.csv(url("https://raw.githubusercontent.com/peterphung2043/DATA-607---Project-2-Final/main/candyhierarchy2017.csv"), header = FALSE, stringsAsFactors = FALSE)
# show the first parts of the dataframe
data <- data %>% mutate_all(na_if,"") # change the blank cells to "NA"
head(data[1:4, 1:5, drop = FALSE]) # show the first four rows of the data
## V1 V2 V3 V4 V5
## 1 Internal ID Q1: GOING OUT? Q2: GENDER Q3: AGE Q4: COUNTRY
## 2 90258773 <NA> <NA> <NA> <NA>
## 3 90272821 No Male 44 USA
## 4 90272829 <NA> Male 49 USA
Clean candy names
# `df` is assumed to hold the cleaned candy names as a one-row dataframe (built in an earlier, unshown step)
newDf <- rbind(df, data) # stack the cleaned-name row on top of the raw data
names(newDf) <- df[1, ] # promote the cleaned names to the header
newDf <- newDf[-1:-2, ] # delete the cleaned-name row and the original header row
newDf <- newDf[!apply(newDf == "", 1, all), ] # drop rows that are entirely blank
newDf <- newDf[rowSums(is.na(newDf)) != ncol(newDf), ] # drop rows that are entirely NA
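The construction of df is not shown above; the following is a hypothetical sketch of how it might be built by stripping the question tags from the raw 2017 header row. The tag formats (e.g. "Q6 | Butterfinger", "Q3: AGE") are assumptions about the published survey, not taken from this report.
raw_names <- as.character(unname(data[1, ])) # the raw header row read in as data
df <- data[1, ] # one-row dataframe sharing data's column layout, so rbind(df, data) works
df[1, ] <- sub("^Q\\d+\\s*[:|]\\s*", "", raw_names) # e.g. "Q6 | Butterfinger" -> "Butterfinger"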
Analysis - Type of Feelings when Candy is Received
The data has 2,459 individuals who completed the survey. Respondents were told that they could skip an option/question, leave a question blank, or indicate that they don't know the candy.
Feeling Values:
JOY - Does it make you happy?
DESPAIR - Is it something that you automatically place in the junk pile?
MEH - Indifference
BLANK - No idea what the item is.
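A quick tabulation shows how these feeling values appear in a single candy column; a minimal sketch, assuming newDf as cleaned above:
table(newDf$Butterfinger, useNA = "ifany") # counts of JOY, DESPAIR, MEH, and missing (BLANK) responses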
Of interest is the preference for two candies: Butterfinger (Plot 1) and Snickers (Plot 2). The bar plots below show the distribution of feelings (JOY, DESPAIR, MEH, BLANK) for each.
# First, clean up age
newDf$AGE <- as.numeric(newDf$AGE)
newDf$AGE[is.na(newDf$AGE)] <- 0
age_candy <- newDf %>% select(AGE, Butterfinger, Snickers, `Heath Bar`)
library(ggplot2)
library(gcookbook)
# First plot
ggplot(age_candy, aes(x = Butterfinger)) +
geom_bar()
# Second plot
ggplot(age_candy, aes(x = Snickers)) +
geom_bar()
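The two plots could also be combined into a single faceted chart; a minimal sketch, assuming age_candy as built above (the long-form column names Candy and Feeling are hypothetical):
library(tidyr)
age_candy %>%
  pivot_longer(c(Butterfinger, Snickers), names_to = "Candy", values_to = "Feeling") %>% # long form: one row per respondent per candy
  ggplot(aes(x = Feeling)) +
  geom_bar() +
  facet_wrap(~ Candy) # one panel per candy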
Student Testing
Objective: To determine whether study time impacts test scores, using a sample of 11 students.
Data Source: Eric Lehmphul
library(tidyverse)
Read the file
url <- "https://raw.githubusercontent.com/peterphung2043/DATA-607---Project-2-Final/main/test_scores"
test_scores <- read_csv(url)
test_scores
## # A tibble: 11 × 7
## ...1 Student `Test1, TimeStudiedTest1` `Test2, TimeStudi… `Test3, TimeStud…
## <dbl> <chr> <chr> <chr> <chr>
## 1 1 Bob 95, 45 88, 40 92, 50
## 2 2 John 85, 35 60, 8 75, 10
## 3 3 Sam 78, 15 75, 16 80, 17
## 4 4 Jenna 92, 60 94, 65 84, 60
## 5 5 Sara 97, 40 98, 50 95, 45
## 6 6 Jacob 50, 5 40, 2 <NA>
## 7 7 Melinda <NA> 90, 47 92, 55
## 8 8 Billy 78, 15 80, 25 81, 36
## 9 9 Kayla 100, 40 100,40 100, 45
## 10 10 Nick 90, 35 94, 32 94, 30
## 11 11 Nicolete 75, 20 80, 20 85, 23
## # … with 2 more variables: Test4, TimeStudiedTest4 <chr>, Gender <chr>
Separate the columns
# Separate the comma-joined columns; separate() splits on its default (non-alphanumeric) separator, here the comma
test_scores <- test_scores %>% separate(`Test1, TimeStudiedTest1`,
into = c("Test1", "TimeStudiedTest1"))
test_scores <- test_scores %>% separate(`Test2, TimeStudiedTest2`,
into = c("Test2", "TimeStudiedTest2"))
test_scores <- test_scores %>% separate(`Test3, TimeStudiedTest3`,
into = c("Test3", "TimeStudiedTest3"))
#select the columns in order
test <- test_scores %>% select(Student, Test1, Test2, Test3,
TimeStudiedTest1, TimeStudiedTest2, TimeStudiedTest3)
test## # A tibble: 11 × 7
## Student Test1 Test2 Test3 TimeStudiedTest1 TimeStudiedTest2 TimeStudiedTest3
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Bob 95 88 92 45 40 50
## 2 John 85 60 75 35 8 10
## 3 Sam 78 75 80 15 16 17
## 4 Jenna 92 94 84 60 65 60
## 5 Sara 97 98 95 40 50 45
## 6 Jacob 50 40 <NA> 5 2 <NA>
## 7 Melinda <NA> 90 92 <NA> 47 55
## 8 Billy 78 80 81 15 25 36
## 9 Kayla 100 100 100 40 40 45
## 10 Nick 90 94 94 35 32 30
## 11 Nicolete 75 80 85 20 20 23
Convert all the columns from Test1 to TimeStudiedTest3 to numeric
test_student <- test %>% select(Student)
test_others <- test %>% select(-Student)
dat_df <- sapply(test_others, as.numeric) # convert all the columns except the Student column to numeric
dat_df <- as.data.frame(dat_df)
test_results <- cbind(test_student, dat_df)
# Replace NA with 0
test_results <- test_results %>% replace(is.na(.), 0)
test_results## Student Test1 Test2 Test3 TimeStudiedTest1 TimeStudiedTest2
## 1 Bob 95 88 92 45 40
## 2 John 85 60 75 35 8
## 3 Sam 78 75 80 15 16
## 4 Jenna 92 94 84 60 65
## 5 Sara 97 98 95 40 50
## 6 Jacob 50 40 0 5 2
## 7 Melinda 0 90 92 0 47
## 8 Billy 78 80 81 15 25
## 9 Kayla 100 100 100 40 40
## 10 Nick 90 94 94 35 32
## 11 Nicolete 75 80 85 20 20
## TimeStudiedTest3
## 1 50
## 2 10
## 3 17
## 4 60
## 5 45
## 6 0
## 7 55
## 8 36
## 9 45
## 10 30
## 11 23
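For reference, the same conversion and NA replacement can be written more compactly with dplyr's across(); a minimal sketch (test_results2 is a hypothetical name):
test_results2 <- test %>%
  mutate(across(-Student, as.numeric)) %>% # convert every column except Student to numeric
  replace(is.na(.), 0) # replace NA with 0, as above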
Gather the Tests and the time studied
test_resultsa <- test_results %>% select(Student:Test3)
# gather Test1 to Test3 into Test (names) and Score (values)
test_resultsa <- test_resultsa %>% gather(key = "Test", value = "Score", Test1:Test3) %>% arrange(Student)
# gather TimeStudiedTest1 to TimeStudiedTest3 into TestType (names) and StudyTime (values)
test_resultsb <- test_results %>% select(Student, TimeStudiedTest1:TimeStudiedTest3)
test_resultsb <- test_resultsb %>% gather(key = "TestType", value = "StudyTime",
TimeStudiedTest1:TimeStudiedTest3) %>% arrange(Student) %>% select(-Student)
test_results_long <- cbind(test_resultsa, test_resultsb)
test_results_long <- test_results_long %>% select(-TestType)
test_results_long
## Student Test Score StudyTime
## 1 Billy Test1 78 15
## 2 Billy Test2 80 25
## 3 Billy Test3 81 36
## 4 Bob Test1 95 45
## 5 Bob Test2 88 40
## 6 Bob Test3 92 50
## 7 Jacob Test1 50 5
## 8 Jacob Test2 40 2
## 9 Jacob Test3 0 0
## 10 Jenna Test1 92 60
## 11 Jenna Test2 94 65
## 12 Jenna Test3 84 60
## 13 John Test1 85 35
## 14 John Test2 60 8
## 15 John Test3 75 10
## 16 Kayla Test1 100 40
## 17 Kayla Test2 100 40
## 18 Kayla Test3 100 45
## 19 Melinda Test1 0 0
## 20 Melinda Test2 90 47
## 21 Melinda Test3 92 55
## 22 Nick Test1 90 35
## 23 Nick Test2 94 32
## 24 Nick Test3 94 30
## 25 Nicolete Test1 75 20
## 26 Nicolete Test2 80 20
## 27 Nicolete Test3 85 23
## 28 Sam Test1 78 15
## 29 Sam Test2 75 16
## 30 Sam Test3 80 17
## 31 Sara Test1 97 40
## 32 Sara Test2 98 50
## 33 Sara Test3 95 45
We now have a long table with four columns: Student, Test, Score, and StudyTime.
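As an aside, gather() has since been superseded by pivot_longer(), which can produce the same long table in one step; a minimal sketch (test_results_long2 and TestNum are hypothetical names):
test_results_long2 <- test_results %>%
  pivot_longer(-Student,
               names_to = c(".value", "TestNum"), # split e.g. "TimeStudiedTest2" into a value name and a test number
               names_pattern = "(\\D+)(\\d)") %>%
  rename(Score = Test, StudyTime = TimeStudiedTest) %>% # the ".value" columns come out named Test and TimeStudiedTest
  mutate(Test = paste0("Test", TestNum)) %>%
  select(Student, Test, Score, StudyTime) %>%
  arrange(Student)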
Next we find the average test score and average study time for each student.
test_res <- test_results_long %>% group_by(Student) %>%
summarise(Avg_Score = round(mean(Score), 0), Avg_StudyTime = round(mean(StudyTime),0))
student_test_results <- test_res %>%
  left_join(test_scores %>% select(Student, Gender), by = "Student") %>% # join on Student: test_res is sorted alphabetically, so a positional cbind would misalign Gender
  select(Student, Gender, Avg_Score, Avg_StudyTime) %>%
  as.data.frame()
student_test_results
## Student Gender Avg_Score Avg_StudyTime
## 1 Billy Male 80 25
## 2 Bob Male 92 45
## 3 Jacob Male 30 2
## 4 Jenna Female 90 62
## 5 John Male 73 18
## 6 Kayla Female 100 42
## 7 Melinda Female 61 34
## 8 Nick Male 93 32
## 9 Nicolete Female 80 21
## 10 Sam Female 78 16
## 11 Sara Female 97 45
Plot a scatter plot of Avg_Score vs Avg_StudyTime
base <- ggplot(data = student_test_results, aes(Avg_StudyTime, Avg_Score)) + geom_point() +
geom_smooth(se = FALSE) + ylab("Average Score") + xlab("Average Study Time") + theme_bw() +
labs(title = "Average Test Score vs Average Study Time")
base
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Find correlation
x <- student_test_results$Avg_StudyTime
y <- student_test_results$Avg_Score
cor.test(x, y)
##
## Pearson's product-moment correlation
##
## data: x and y
## t = 3.1843, df = 9, p-value = 0.01111
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2271646 0.9242026
## sample estimates:
## cor
## 0.7278546
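For a rough sense of effect size to accompany the correlation, a simple linear model could also be fit; a minimal sketch (fit is a hypothetical name):
fit <- lm(Avg_Score ~ Avg_StudyTime, data = student_test_results) # regress average score on average study time
coef(fit) # the slope estimates the score change per additional unit of study time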
Conclusion: From the scatter plot above and the correlation coefficient of about 0.73, we can infer that as the average study time of students increases, their average test score tends to increase as well. Hence, the data suggest that study time impacts test scores.