Probem Statement

This is a project for all the members of an assigned group to work on together, since being able to work effectively on a virtual team is a key “soft skill” for data scientists. Use data to answer the question, “Which are the most valued data science skills?” Please note especially the requirement about making a presentation during the class meetup on the following Wednesday.

Effective communication skill is essential for data scientists, and the ability to communicate the findings to team and explain what value these insights can bring to technical or non-technical stakeholders also are very important. Curiosity is a trait that makes data science more interesting and beneficial. Being curious about business problems will let you to the solution as well. Team works in a team setting and being a team player is critical. We should always listen to their team members for their input and take into consideration to work as a team project.

library(RCurl)
library(kableExtra)
library(DBI)
library(RMySQL)
library(RODBC)
conn <- dbConnect(MySQL(), user= "root", password = "M2max2023", dbname='MyDB1', host='localhost', port=3306)

Load realational database from MySQL Workbench to R

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:kableExtra':
## 
##     group_rows
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggpubr)
## Loading required package: ggplot2
library(ggplot2)
library(ggcorrplot)
cryto_dbms <- dbGetQuery(conn , "select theDate,opening, opening, highest, lowest, closing, volume, marketcap, DailyPrice.CoinSymbol, coalesce(CoinDesc,'') CoinDesc
from DailyPrice
left join CoinList
on DailyPrice.CoinSymbol = CoinList.CoinSymbol")
glimpse(cryto_dbms)
## Rows: 17,333
## Columns: 10
## $ theDate    <chr> "2013-04-01", "2013-05-01", "2013-06-01", "2013-07-01", "20…
## $ opening    <dbl> 135.300003, 139.000000, 128.815002, 97.510002, 106.212997, …
## $ opening    <dbl> 135.300003, 139.000000, 128.815002, 97.510002, 106.212997, …
## $ highest    <dbl> 147.48801, 139.89000, 129.78000, 111.34400, 140.89000, 146.…
## $ lowest     <dbl> 132.100006, 79.099998, 88.500000, 65.526001, 101.210999, 11…
## $ closing    <dbl> 139.000000, 129.000000, 96.613998, 106.089996, 135.350006, …
## $ volume     <dbl> 0, 0, 0, 0, 0, 0, 619926, 673892, 784439, 663994, 5847500, …
## $ marketcap  <dbl> 1542813125, 1447870200, 1096957771, 1218777834, 1574672051,…
## $ CoinSymbol <chr> "BTC", "BTC", "BTC", "BTC", "BTC", "BTC", "ETH", "ETH", "ET…
## $ CoinDesc   <chr> "Bitcoin", "Bitcoin", "Bitcoin", "Bitcoin", "Bitcoin", "Bit…
# Export the data frame to a CSV file local directory
write.csv(cryto_dbms, "DATA 607 Group Project3.csv", row.names = FALSE)
#Read the .csv file 
cryptocurrencies <- read.csv("DATA 607 Group Project3.csv")
head(cryptocurrencies)
##      theDate opening opening.1 highest  lowest closing volume  marketcap
## 1 2013-04-01 135.300   135.300 147.488 132.100 139.000      0 1542813125
## 2 2013-05-01 139.000   139.000 139.890  79.100 129.000      0 1447870200
## 3 2013-06-01 128.815   128.815 129.780  88.500  96.614      0 1096957771
## 4 2013-07-01  97.510    97.510 111.344  65.526 106.090      0 1218777834
## 5 2013-08-01 106.213   106.213 140.890 101.211 135.350      0 1574672051
## 6 2013-09-01 135.140   135.140 146.500 118.460 133.000      0 1566038425
##   CoinSymbol CoinDesc
## 1        BTC  Bitcoin
## 2        BTC  Bitcoin
## 3        BTC  Bitcoin
## 4        BTC  Bitcoin
## 5        BTC  Bitcoin
## 6        BTC  Bitcoin

What would be the total monthly volume of each coin in the year 2023?

total_volume_in_month <- dbGetQuery(conn, "select theMonth, CoinSymbol, sum(volume) TotalVolumeInMonth
from
(
select monthname(theDate) theMonth, CoinSymbol, volume 
from DailyPrice
where year(theDate) = 2021 and CoinSymbol in ('ADA', 'ETH') 
) A
group by theMonth, CoinSymbol;")
total_volume_in_month
##     theMonth CoinSymbol TotalVolumeInMonth
## 1    January        ADA       1.822667e+11
## 2   February        ADA       2.426997e+11
## 3      March        ADA       1.675218e+11
## 4      April        ADA       1.262510e+11
## 5        May        ADA       2.257907e+11
## 6       June        ADA       9.218966e+10
## 7       July        ADA       5.194009e+10
## 8     August        ADA       1.604260e+11
## 9  September        ADA       1.521636e+11
## 10   October        ADA       8.590928e+10
## 11  November        ADA       7.988261e+10
## 12  December        ADA       5.562197e+10
## 13   January        ETH       2.224801e+12
## 14  February        ETH       1.007718e+12
## 15     March        ETH       7.308322e+11
## 16     April        ETH       9.953342e+11
## 17       May        ETH       1.510330e+12
## 18      June        ETH       8.135231e+11
## 19      July        ETH       6.299535e+11
## 20    August        ETH       7.502077e+11
## 21 September        ETH       6.539849e+11
## 22   October        ETH       5.747985e+11
## 23  November        ETH       5.735468e+11
## 24  December        ETH       6.210420e+11
cryptocurrencies %>%
count(CoinSymbol) # count number of coins types 
##    CoinSymbol    n
## 1         ADA   71
## 2       ALEPH 1201
## 3        ALGO  225
## 4         APT  368
## 5          AR  173
## 6       AUDIO  155
## 7        AVAX  160
## 8         AXL   55
## 9         BCH   74
## 10       BICO   97
## 11        BNB   74
## 12        BTC    6
## 13      BTRST  109
## 14        CCD   87
## 15       CELO  175
## 16        CHZ  223
## 17       COMP  153
## 18       DODO  156
## 19        DOT  164
## 20        EDG  187
## 21       EGLD  162
## 22        ERG   75
## 23        ETH   97
## 24       FIDA 1033
## 25       FLOW  141
## 26       FORT   69
## 27        FOX  144
## 28        FTM  258
## 29        FTT 1539
## 30       FUSE  140
## 31         GF   97
## 32        GRT  113
## 33         GT 1518
## 34       HBAR  212
## 35        HFT  344
## 36        KAS   71
## 37        KDA  176
## 38       KEEP  178
## 39        LUN    2
## 40        LYX   12
## 41       MAPS  993
## 42      MATIC  104
## 43       MINA  124
## 44       MSOL  740
## 45       NEAR   89
## 46        OXY  950
## 47        RAD  138
## 48       RARI  168
## 49        RBN  105
## 50        RLY  146
## 51      ROUTE  142
## 52        RSR  228
## 53       RSS3   87
## 54        RSV  163
## 55        RXD  316
## 56        SEI   64
## 57        SFI  152
## 58        SNX   66
## 59        SOL  183
## 60       SOUL   64
## 61        SRM 1165
## 62        SUI   24
## 63       TRAC   68
## 64        UMA  176
## 65        UNI   16
## 66       USDC   22
## 67       VEGA  110
## 68        XDC   65
## 69        XEC  118
## 70        XRD  108
## 71        XTZ   63
## 72        ZBC   82
summary(cryto_dbms)
##    theDate             opening            opening            highest         
##  Length:17333       Min.   :   0.000   Min.   :   0.000   Min.   :    0.000  
##  Class :character   1st Qu.:   0.159   1st Qu.:   0.159   1st Qu.:    0.174  
##  Mode  :character   Median :   0.849   Median :   0.849   Median :    0.927  
##                     Mean   :  19.357   Mean   :  19.357   Mean   :   24.979  
##                     3rd Qu.:   4.454   3rd Qu.:   4.454   3rd Qu.:    4.792  
##                     Max.   :4623.679   Max.   :4623.679   Max.   :29455.032  
##      lowest            closing             volume            marketcap        
##  Min.   :   0.000   Min.   :   0.000   Min.   :0.000e+00   Min.   :0.000e+00  
##  1st Qu.:   0.146   1st Qu.:   0.158   1st Qu.:1.440e+06   1st Qu.:1.313e+07  
##  Median :   0.772   Median :   0.845   Median :8.988e+06   Median :5.311e+07  
##  Mean   :  16.361   Mean   :  19.391   Mean   :1.966e+09   Mean   :1.446e+09  
##  3rd Qu.:   4.218   3rd Qu.:   4.446   3rd Qu.:9.791e+07   3rd Qu.:3.538e+08  
##  Max.   :3933.506   Max.   :4631.479   Max.   :2.225e+12   Max.   :5.491e+11  
##   CoinSymbol          CoinDesc        
##  Length:17333       Length:17333      
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
## 
filter(cryptocurrencies,CoinSymbol == "ADA", theDate == "2023-09-13")
##  [1] theDate    opening    opening.1  highest    lowest     closing   
##  [7] volume     marketcap  CoinSymbol CoinDesc  
## <0 rows> (or 0-length row.names)

Since there are many coins in this data set they were also split up by marketcap size and a new column called coin_size was created. Those under the median Market cap value are labeled “Low marketcap” and those above the median value are labeled “High marketcap”. We can also see which coins fell into each category and how many observations of daily prices we have of each coin.

median(cryptocurrencies$marketcap)
## [1] 53111044
cryptocurrencies <- cryptocurrencies %>% group_by(CoinSymbol) %>% 
  mutate(coin_size = ifelse(mean(marketcap) > 5334422528,"high marketcap", "low marketcap")) 
head(cryptocurrencies)
## # A tibble: 6 × 11
## # Groups:   CoinSymbol [1]
##   theDate   opening opening.1 highest lowest closing volume marketcap CoinSymbol
##   <chr>       <dbl>     <dbl>   <dbl>  <dbl>   <dbl>  <dbl>     <dbl> <chr>     
## 1 2013-04-…   135.      135.     147.  132.    139        0    1.54e9 BTC       
## 2 2013-05-…   139       139      140.   79.1   129        0    1.45e9 BTC       
## 3 2013-06-…   129.      129.     130.   88.5    96.6      0    1.10e9 BTC       
## 4 2013-07-…    97.5      97.5    111.   65.5   106.       0    1.22e9 BTC       
## 5 2013-08-…   106.      106.     141.  101.    135.       0    1.57e9 BTC       
## 6 2013-09-…   135.      135.     146.  118.    133        0    1.57e9 BTC       
## # ℹ 2 more variables: CoinDesc <chr>, coin_size <chr>
cryptocurrencies %>% group_by(CoinSymbol) %>% count(CoinSymbol)
## # A tibble: 72 × 2
## # Groups:   CoinSymbol [72]
##    CoinSymbol     n
##    <chr>      <int>
##  1 ADA           71
##  2 ALEPH       1201
##  3 ALGO         225
##  4 APT          368
##  5 AR           173
##  6 AUDIO        155
##  7 AVAX         160
##  8 AXL           55
##  9 BCH           74
## 10 BICO          97
## # ℹ 62 more rows

We also want to add a numeric column that takes the value from calculating the percent change in the closing price and opening price to use later in our exploratory data analysis.

cryptocurrencies <- cryptocurrencies %>% group_by(CoinSymbol) %>% 
  mutate(pct_change = round(((closing - opening)/opening)*100, 2))

# Delete the 'opening.1'column
cryptocurrencies <- cryptocurrencies[, !names(cryptocurrencies) %in% "opening.1"]
head(cryptocurrencies)
## # A tibble: 6 × 11
## # Groups:   CoinSymbol [1]
##   theDate    opening highest lowest closing volume marketcap CoinSymbol CoinDesc
##   <chr>        <dbl>   <dbl>  <dbl>   <dbl>  <dbl>     <dbl> <chr>      <chr>   
## 1 2013-04-01   135.     147.  132.    139        0    1.54e9 BTC        Bitcoin 
## 2 2013-05-01   139      140.   79.1   129        0    1.45e9 BTC        Bitcoin 
## 3 2013-06-01   129.     130.   88.5    96.6      0    1.10e9 BTC        Bitcoin 
## 4 2013-07-01    97.5    111.   65.5   106.       0    1.22e9 BTC        Bitcoin 
## 5 2013-08-01   106.     141.  101.    135.       0    1.57e9 BTC        Bitcoin 
## 6 2013-09-01   135.     146.  118.    133        0    1.57e9 BTC        Bitcoin 
## # ℹ 2 more variables: coin_size <chr>, pct_change <dbl>

Exploratory Data Analysis

1. To get an initial idea of how the coins did over time we can visualize the coins closing price. Below, the coins are grouped by the coin size for comparison using the facet_grid fucntion in ggplot2.

high_market_captilization <- cryptocurrencies %>% filter(coin_size == "high marketcap") 
count(high_market_captilization)
## # A tibble: 7 × 2
## # Groups:   CoinSymbol [7]
##   CoinSymbol     n
##   <chr>      <int>
## 1 ADA           71
## 2 AVAX         160
## 3 BCH           74
## 4 BNB           74
## 5 DOT          164
## 6 ETH           97
## 7 SOL          183
low_market_captilization <- cryptocurrencies %>% filter(coin_size == "low marketcap") 
count(low_market_captilization)
## # A tibble: 65 × 2
## # Groups:   CoinSymbol [65]
##    CoinSymbol     n
##    <chr>      <int>
##  1 ALEPH       1201
##  2 ALGO         225
##  3 APT          368
##  4 AR           173
##  5 AUDIO        155
##  6 AXL           55
##  7 BICO          97
##  8 BTC            6
##  9 BTRST        109
## 10 CCD           87
## # ℹ 55 more rows

In both groups we see varied closing prices with some closing prices fluctuating between 20,000 and 60,000 dollars (“BTC”) and some between 0 and 9 dollars (“XRP”). There is also some interesting activity in the trend of the closing prices between 2021 and 2022 which correlates to the height of when cryptos received much of it’s funding.

2. Next we were interested in the percent changes overtime to you can gain a better understanding of their volatility and risk. Again we looked at the coins in FTX based on the high and low market cap category we created earlier.

high_percent <- cryptocurrencies %>% filter(pct_change >= 5) 
count(high_percent)
## # A tibble: 70 × 2
## # Groups:   CoinSymbol [70]
##    CoinSymbol     n
##    <chr>      <int>
##  1 ADA           25
##  2 ALEPH        169
##  3 ALGO          67
##  4 APT           31
##  5 AR            69
##  6 AUDIO         51
##  7 AVAX          52
##  8 AXL           10
##  9 BCH           27
## 10 BICO          21
## # ℹ 60 more rows
low_percent <- cryptocurrencies %>% filter(pct_change <= 5) 
count(low_percent)
## # A tibble: 72 × 2
## # Groups:   CoinSymbol [72]
##    CoinSymbol     n
##    <chr>      <int>
##  1 ADA           46
##  2 ALEPH       1032
##  3 ALGO         158
##  4 APT          337
##  5 AR           104
##  6 AUDIO        104
##  7 AVAX         108
##  8 AXL           45
##  9 BCH           47
## 10 BICO          76
## # ℹ 62 more rows

In each group there are some obvious spikes in the coins and this would be something to look at further to see if any coins follow the generally trend of the market and if not then is there certain coins that follow others.

3. Looking back at the closign prices from above. We wanted to zero in on the coins whose closing prices trended downward.

vertical_line_date <- as.Date("2022-11-02")

cryptocurrencies %>% filter(CoinSymbol %in% c("ALEPH", "FIDA", "FTT", "MAPS", 
                           "MSOL","OXY", "SOL", "SRM")) %>%
  ggplot(aes(x = as.Date(theDate), y = closing, color = CoinSymbol)) +
  geom_line() +
  geom_vline(aes(xintercept = vertical_line_date), 
             linetype = 2, 
             color = "black", 
             size = 0.75) +
  #facet_grid(CoinSymbol~., scales = "free_y") + 
  ggtitle("Closing Price (After FTX Collapse)") + 
  xlab("Date") +
  ylab("Closing Price (USD)") +
  theme(axis.title.x = element_text(color="black",size=10),
                axis.title.y = element_text(color="black", size=10),
                axis.text.x = element_text(size=6),
                axis.text.y = element_text(size=6),
                plot.title = element_text(color = "black",
                                          size=14))
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

In this graphic we are looking at 8 coins in the data set. All but one (“SOL”) were categorized in the low market cap coins. The dashed line in the graph is set at November 2nd, 2022, which has been documented as the catalyst to FTX finally file for bankruptcy.

4. Extending this added layer of insight to the data, we can better visualize any impact the collapse of FTX had on the volume of trades within each coin.

cryptocurrencies %>% filter(coin_size == "high marketcap") %>% 
  ggplot(aes(x = as.Date(theDate), y = volume)) + 
  geom_line() +
  facet_grid(CoinSymbol~., scales = "free_y") + 
  geom_vline(aes(xintercept = vertical_line_date), 
             linetype = 2, 
             color = "red", 
             size = 0.5) +
  ggtitle("Trends in Yearly Trading Volume (high marketcap)") + 
  xlab("Date") +
  ylab("Volume") +
  theme(legend.position = "none",
        axis.title.x = element_text(color="black",size=10),
        axis.title.y = element_text(color="black", size=10),
        axis.text.x = element_text(size=6),
        axis.text.y = element_text(size=6),
        plot.title = element_text(color = "black", size=14))

cryptocurrencies %>% filter(coin_size == "low marketcap") %>% 
  ggplot(aes(x = as.Date(theDate), y = volume)) + 
  geom_line() +
  facet_grid(CoinSymbol~., scales = "free_y") + 
  geom_vline(aes(xintercept = vertical_line_date), 
             linetype = 2, 
             color = "red", 
             size = 0.5) +
  ggtitle("Trends in Yearly Trading Volume (low marketcap)") + 
  xlab("Date") +
  ylab("Volume") +
  theme(legend.position = "none",
        axis.title.x = element_text(color="black",size=10),
        axis.title.y = element_text(color="black", size=10),
        axis.text.x = element_text(size=6),
        axis.text.y = element_text(size=6),
        plot.title = element_text(color = "black", size=14))

In most of the high market coins there was large uptick in volume corresponding to a date on or around November 2nd (represented by the red dashed line).

Plot the correlation between variables

# Line graph for a specific cryptocurrency
ggplot(data = cryptocurrencies, aes(x = CoinSymbol, y = highest, color = CoinSymbol)) +
  geom_line() +
  labs(title = "Cryptocurrency Prices Over Time", x = "CoinSymbol", y = " the highest price")

# Bar chart for comparing trading volumes
ggplot(data = cryptocurrencies, aes(x = CoinSymbol, y = volume)) +
  geom_bar(stat = "identity") +
  labs(title = "Cryptocurrency Trading Volumes", x = "Coins", y = "Volume")

# Histogram of market capitalization
ggplot(data = cryptocurrencies, aes(x = marketcap)) +
  geom_histogram() +
  labs(title = "Distribution of Market Capitalization", x = "Market Cap", y = "Frequency")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Statistical Summary

# Calculate the Mean, Median, and Mode:
mean_price <- mean(cryptocurrencies$volume)
print(mean_price)
## [1] 1965674123
medium_price <- median(cryptocurrencies$marketcap)
print(medium_price)
## [1] 53111044
library(DescTools)
mode_price <- Mode(cryptocurrencies$pct_change)
print(mode_price)
## [1] 0.01
## attr(,"freq")
## [1] 39

Conclusions

Our team members proposed three cryptocurrencies file of FTX daily price, Layer1 and Coinbase value files, those are combined in one table, then, imported it to SQL workbench, built a normalized relational database, get rid of redundant data, organized the cleaned data by creating two new tables named “Cryptolist” and “Dailyprice” using composite keys grouped by (theDate and Coin) variables as a unique data in entire database without any two same data. So, this database obeys absolutely normalization 1NF rule and prevent data redundancy and keep data integrity throughout our analysis.

Working together on a real data set provided an opportunity for our team to reflect on the skills that helped us along the way to completing this project. I believe it did give us a deeper understanding of what skills are most valued data science skills that can be applied to many scenarios.

Technical Skills

Soft skills