This is a project for all the members of an assigned group to work on together, since being able to work effectively on a virtual team is a key “soft skill” for data scientists. Use data to answer the question, “Which are the most valued data science skills?” Please note especially the requirement about making a presentation during the class meetup on the following Wednesday.
Effective communication skill is essential for data scientists, and the ability to communicate the findings to team and explain what value these insights can bring to technical or non-technical stakeholders also are very important. Curiosity is a trait that makes data science more interesting and beneficial. Being curious about business problems will let you to the solution as well. Team works in a team setting and being a team player is critical. We should always listen to their team members for their input and take into consideration to work as a team project.
library(RCurl)
library(kableExtra)
library(DBI)
library(RMySQL)
library(RODBC)
conn <- dbConnect(MySQL(), user= "root", password = "M2max2023", dbname='MyDB1', host='localhost', port=3306)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:kableExtra':
##
## group_rows
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggpubr)
## Loading required package: ggplot2
library(ggplot2)
library(ggcorrplot)
cryto_dbms <- dbGetQuery(conn , "select theDate,opening, opening, highest, lowest, closing, volume, marketcap, DailyPrice.CoinSymbol, coalesce(CoinDesc,'') CoinDesc
from DailyPrice
left join CoinList
on DailyPrice.CoinSymbol = CoinList.CoinSymbol")
glimpse(cryto_dbms)
## Rows: 17,333
## Columns: 10
## $ theDate <chr> "2013-04-01", "2013-05-01", "2013-06-01", "2013-07-01", "20…
## $ opening <dbl> 135.300003, 139.000000, 128.815002, 97.510002, 106.212997, …
## $ opening <dbl> 135.300003, 139.000000, 128.815002, 97.510002, 106.212997, …
## $ highest <dbl> 147.48801, 139.89000, 129.78000, 111.34400, 140.89000, 146.…
## $ lowest <dbl> 132.100006, 79.099998, 88.500000, 65.526001, 101.210999, 11…
## $ closing <dbl> 139.000000, 129.000000, 96.613998, 106.089996, 135.350006, …
## $ volume <dbl> 0, 0, 0, 0, 0, 0, 619926, 673892, 784439, 663994, 5847500, …
## $ marketcap <dbl> 1542813125, 1447870200, 1096957771, 1218777834, 1574672051,…
## $ CoinSymbol <chr> "BTC", "BTC", "BTC", "BTC", "BTC", "BTC", "ETH", "ETH", "ET…
## $ CoinDesc <chr> "Bitcoin", "Bitcoin", "Bitcoin", "Bitcoin", "Bitcoin", "Bit…
# Export the data frame to a CSV file local directory
write.csv(cryto_dbms, "DATA 607 Group Project3.csv", row.names = FALSE)
#Read the .csv file
cryptocurrencies <- read.csv("DATA 607 Group Project3.csv")
head(cryptocurrencies)
## theDate opening opening.1 highest lowest closing volume marketcap
## 1 2013-04-01 135.300 135.300 147.488 132.100 139.000 0 1542813125
## 2 2013-05-01 139.000 139.000 139.890 79.100 129.000 0 1447870200
## 3 2013-06-01 128.815 128.815 129.780 88.500 96.614 0 1096957771
## 4 2013-07-01 97.510 97.510 111.344 65.526 106.090 0 1218777834
## 5 2013-08-01 106.213 106.213 140.890 101.211 135.350 0 1574672051
## 6 2013-09-01 135.140 135.140 146.500 118.460 133.000 0 1566038425
## CoinSymbol CoinDesc
## 1 BTC Bitcoin
## 2 BTC Bitcoin
## 3 BTC Bitcoin
## 4 BTC Bitcoin
## 5 BTC Bitcoin
## 6 BTC Bitcoin
total_volume_in_month <- dbGetQuery(conn, "select theMonth, CoinSymbol, sum(volume) TotalVolumeInMonth
from
(
select monthname(theDate) theMonth, CoinSymbol, volume
from DailyPrice
where year(theDate) = 2021 and CoinSymbol in ('ADA', 'ETH')
) A
group by theMonth, CoinSymbol;")
total_volume_in_month
## theMonth CoinSymbol TotalVolumeInMonth
## 1 January ADA 1.822667e+11
## 2 February ADA 2.426997e+11
## 3 March ADA 1.675218e+11
## 4 April ADA 1.262510e+11
## 5 May ADA 2.257907e+11
## 6 June ADA 9.218966e+10
## 7 July ADA 5.194009e+10
## 8 August ADA 1.604260e+11
## 9 September ADA 1.521636e+11
## 10 October ADA 8.590928e+10
## 11 November ADA 7.988261e+10
## 12 December ADA 5.562197e+10
## 13 January ETH 2.224801e+12
## 14 February ETH 1.007718e+12
## 15 March ETH 7.308322e+11
## 16 April ETH 9.953342e+11
## 17 May ETH 1.510330e+12
## 18 June ETH 8.135231e+11
## 19 July ETH 6.299535e+11
## 20 August ETH 7.502077e+11
## 21 September ETH 6.539849e+11
## 22 October ETH 5.747985e+11
## 23 November ETH 5.735468e+11
## 24 December ETH 6.210420e+11
cryptocurrencies %>%
count(CoinSymbol) # count number of coins types
## CoinSymbol n
## 1 ADA 71
## 2 ALEPH 1201
## 3 ALGO 225
## 4 APT 368
## 5 AR 173
## 6 AUDIO 155
## 7 AVAX 160
## 8 AXL 55
## 9 BCH 74
## 10 BICO 97
## 11 BNB 74
## 12 BTC 6
## 13 BTRST 109
## 14 CCD 87
## 15 CELO 175
## 16 CHZ 223
## 17 COMP 153
## 18 DODO 156
## 19 DOT 164
## 20 EDG 187
## 21 EGLD 162
## 22 ERG 75
## 23 ETH 97
## 24 FIDA 1033
## 25 FLOW 141
## 26 FORT 69
## 27 FOX 144
## 28 FTM 258
## 29 FTT 1539
## 30 FUSE 140
## 31 GF 97
## 32 GRT 113
## 33 GT 1518
## 34 HBAR 212
## 35 HFT 344
## 36 KAS 71
## 37 KDA 176
## 38 KEEP 178
## 39 LUN 2
## 40 LYX 12
## 41 MAPS 993
## 42 MATIC 104
## 43 MINA 124
## 44 MSOL 740
## 45 NEAR 89
## 46 OXY 950
## 47 RAD 138
## 48 RARI 168
## 49 RBN 105
## 50 RLY 146
## 51 ROUTE 142
## 52 RSR 228
## 53 RSS3 87
## 54 RSV 163
## 55 RXD 316
## 56 SEI 64
## 57 SFI 152
## 58 SNX 66
## 59 SOL 183
## 60 SOUL 64
## 61 SRM 1165
## 62 SUI 24
## 63 TRAC 68
## 64 UMA 176
## 65 UNI 16
## 66 USDC 22
## 67 VEGA 110
## 68 XDC 65
## 69 XEC 118
## 70 XRD 108
## 71 XTZ 63
## 72 ZBC 82
summary(cryto_dbms)
## theDate opening opening highest
## Length:17333 Min. : 0.000 Min. : 0.000 Min. : 0.000
## Class :character 1st Qu.: 0.159 1st Qu.: 0.159 1st Qu.: 0.174
## Mode :character Median : 0.849 Median : 0.849 Median : 0.927
## Mean : 19.357 Mean : 19.357 Mean : 24.979
## 3rd Qu.: 4.454 3rd Qu.: 4.454 3rd Qu.: 4.792
## Max. :4623.679 Max. :4623.679 Max. :29455.032
## lowest closing volume marketcap
## Min. : 0.000 Min. : 0.000 Min. :0.000e+00 Min. :0.000e+00
## 1st Qu.: 0.146 1st Qu.: 0.158 1st Qu.:1.440e+06 1st Qu.:1.313e+07
## Median : 0.772 Median : 0.845 Median :8.988e+06 Median :5.311e+07
## Mean : 16.361 Mean : 19.391 Mean :1.966e+09 Mean :1.446e+09
## 3rd Qu.: 4.218 3rd Qu.: 4.446 3rd Qu.:9.791e+07 3rd Qu.:3.538e+08
## Max. :3933.506 Max. :4631.479 Max. :2.225e+12 Max. :5.491e+11
## CoinSymbol CoinDesc
## Length:17333 Length:17333
## Class :character Class :character
## Mode :character Mode :character
##
##
##
filter(cryptocurrencies,CoinSymbol == "ADA", theDate == "2023-09-13")
## [1] theDate opening opening.1 highest lowest closing
## [7] volume marketcap CoinSymbol CoinDesc
## <0 rows> (or 0-length row.names)
Since there are many coins in this data set they were also split up by marketcap size and a new column called coin_size was created. Those under the median Market cap value are labeled “Low marketcap” and those above the median value are labeled “High marketcap”. We can also see which coins fell into each category and how many observations of daily prices we have of each coin.
median(cryptocurrencies$marketcap)
## [1] 53111044
cryptocurrencies <- cryptocurrencies %>% group_by(CoinSymbol) %>%
mutate(coin_size = ifelse(mean(marketcap) > 5334422528,"high marketcap", "low marketcap"))
head(cryptocurrencies)
## # A tibble: 6 × 11
## # Groups: CoinSymbol [1]
## theDate opening opening.1 highest lowest closing volume marketcap CoinSymbol
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 2013-04-… 135. 135. 147. 132. 139 0 1.54e9 BTC
## 2 2013-05-… 139 139 140. 79.1 129 0 1.45e9 BTC
## 3 2013-06-… 129. 129. 130. 88.5 96.6 0 1.10e9 BTC
## 4 2013-07-… 97.5 97.5 111. 65.5 106. 0 1.22e9 BTC
## 5 2013-08-… 106. 106. 141. 101. 135. 0 1.57e9 BTC
## 6 2013-09-… 135. 135. 146. 118. 133 0 1.57e9 BTC
## # ℹ 2 more variables: CoinDesc <chr>, coin_size <chr>
cryptocurrencies %>% group_by(CoinSymbol) %>% count(CoinSymbol)
## # A tibble: 72 × 2
## # Groups: CoinSymbol [72]
## CoinSymbol n
## <chr> <int>
## 1 ADA 71
## 2 ALEPH 1201
## 3 ALGO 225
## 4 APT 368
## 5 AR 173
## 6 AUDIO 155
## 7 AVAX 160
## 8 AXL 55
## 9 BCH 74
## 10 BICO 97
## # ℹ 62 more rows
We also want to add a numeric column that takes the value from calculating the percent change in the closing price and opening price to use later in our exploratory data analysis.
cryptocurrencies <- cryptocurrencies %>% group_by(CoinSymbol) %>%
mutate(pct_change = round(((closing - opening)/opening)*100, 2))
# Delete the 'opening.1'column
cryptocurrencies <- cryptocurrencies[, !names(cryptocurrencies) %in% "opening.1"]
head(cryptocurrencies)
## # A tibble: 6 × 11
## # Groups: CoinSymbol [1]
## theDate opening highest lowest closing volume marketcap CoinSymbol CoinDesc
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 2013-04-01 135. 147. 132. 139 0 1.54e9 BTC Bitcoin
## 2 2013-05-01 139 140. 79.1 129 0 1.45e9 BTC Bitcoin
## 3 2013-06-01 129. 130. 88.5 96.6 0 1.10e9 BTC Bitcoin
## 4 2013-07-01 97.5 111. 65.5 106. 0 1.22e9 BTC Bitcoin
## 5 2013-08-01 106. 141. 101. 135. 0 1.57e9 BTC Bitcoin
## 6 2013-09-01 135. 146. 118. 133 0 1.57e9 BTC Bitcoin
## # ℹ 2 more variables: coin_size <chr>, pct_change <dbl>
high_market_captilization <- cryptocurrencies %>% filter(coin_size == "high marketcap")
count(high_market_captilization)
## # A tibble: 7 × 2
## # Groups: CoinSymbol [7]
## CoinSymbol n
## <chr> <int>
## 1 ADA 71
## 2 AVAX 160
## 3 BCH 74
## 4 BNB 74
## 5 DOT 164
## 6 ETH 97
## 7 SOL 183
low_market_captilization <- cryptocurrencies %>% filter(coin_size == "low marketcap")
count(low_market_captilization)
## # A tibble: 65 × 2
## # Groups: CoinSymbol [65]
## CoinSymbol n
## <chr> <int>
## 1 ALEPH 1201
## 2 ALGO 225
## 3 APT 368
## 4 AR 173
## 5 AUDIO 155
## 6 AXL 55
## 7 BICO 97
## 8 BTC 6
## 9 BTRST 109
## 10 CCD 87
## # ℹ 55 more rows
In both groups we see varied closing prices with some closing prices fluctuating between 20,000 and 60,000 dollars (“BTC”) and some between 0 and 9 dollars (“XRP”). There is also some interesting activity in the trend of the closing prices between 2021 and 2022 which correlates to the height of when cryptos received much of it’s funding.
high_percent <- cryptocurrencies %>% filter(pct_change >= 5)
count(high_percent)
## # A tibble: 70 × 2
## # Groups: CoinSymbol [70]
## CoinSymbol n
## <chr> <int>
## 1 ADA 25
## 2 ALEPH 169
## 3 ALGO 67
## 4 APT 31
## 5 AR 69
## 6 AUDIO 51
## 7 AVAX 52
## 8 AXL 10
## 9 BCH 27
## 10 BICO 21
## # ℹ 60 more rows
low_percent <- cryptocurrencies %>% filter(pct_change <= 5)
count(low_percent)
## # A tibble: 72 × 2
## # Groups: CoinSymbol [72]
## CoinSymbol n
## <chr> <int>
## 1 ADA 46
## 2 ALEPH 1032
## 3 ALGO 158
## 4 APT 337
## 5 AR 104
## 6 AUDIO 104
## 7 AVAX 108
## 8 AXL 45
## 9 BCH 47
## 10 BICO 76
## # ℹ 62 more rows
In each group there are some obvious spikes in the coins and this would be something to look at further to see if any coins follow the generally trend of the market and if not then is there certain coins that follow others.
vertical_line_date <- as.Date("2022-11-02")
cryptocurrencies %>% filter(CoinSymbol %in% c("ALEPH", "FIDA", "FTT", "MAPS",
"MSOL","OXY", "SOL", "SRM")) %>%
ggplot(aes(x = as.Date(theDate), y = closing, color = CoinSymbol)) +
geom_line() +
geom_vline(aes(xintercept = vertical_line_date),
linetype = 2,
color = "black",
size = 0.75) +
#facet_grid(CoinSymbol~., scales = "free_y") +
ggtitle("Closing Price (After FTX Collapse)") +
xlab("Date") +
ylab("Closing Price (USD)") +
theme(axis.title.x = element_text(color="black",size=10),
axis.title.y = element_text(color="black", size=10),
axis.text.x = element_text(size=6),
axis.text.y = element_text(size=6),
plot.title = element_text(color = "black",
size=14))
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
In this graphic we are looking at 8 coins in the data set. All but one
(“SOL”) were categorized in the low market cap coins. The dashed line in
the graph is set at November 2nd, 2022, which has been documented as the
catalyst to FTX finally file for bankruptcy.
cryptocurrencies %>% filter(coin_size == "high marketcap") %>%
ggplot(aes(x = as.Date(theDate), y = volume)) +
geom_line() +
facet_grid(CoinSymbol~., scales = "free_y") +
geom_vline(aes(xintercept = vertical_line_date),
linetype = 2,
color = "red",
size = 0.5) +
ggtitle("Trends in Yearly Trading Volume (high marketcap)") +
xlab("Date") +
ylab("Volume") +
theme(legend.position = "none",
axis.title.x = element_text(color="black",size=10),
axis.title.y = element_text(color="black", size=10),
axis.text.x = element_text(size=6),
axis.text.y = element_text(size=6),
plot.title = element_text(color = "black", size=14))
cryptocurrencies %>% filter(coin_size == "low marketcap") %>%
ggplot(aes(x = as.Date(theDate), y = volume)) +
geom_line() +
facet_grid(CoinSymbol~., scales = "free_y") +
geom_vline(aes(xintercept = vertical_line_date),
linetype = 2,
color = "red",
size = 0.5) +
ggtitle("Trends in Yearly Trading Volume (low marketcap)") +
xlab("Date") +
ylab("Volume") +
theme(legend.position = "none",
axis.title.x = element_text(color="black",size=10),
axis.title.y = element_text(color="black", size=10),
axis.text.x = element_text(size=6),
axis.text.y = element_text(size=6),
plot.title = element_text(color = "black", size=14))
In most of the high market coins there was large uptick in volume corresponding to a date on or around November 2nd (represented by the red dashed line).
# Line graph for a specific cryptocurrency
ggplot(data = cryptocurrencies, aes(x = CoinSymbol, y = highest, color = CoinSymbol)) +
geom_line() +
labs(title = "Cryptocurrency Prices Over Time", x = "CoinSymbol", y = " the highest price")
# Bar chart for comparing trading volumes
ggplot(data = cryptocurrencies, aes(x = CoinSymbol, y = volume)) +
geom_bar(stat = "identity") +
labs(title = "Cryptocurrency Trading Volumes", x = "Coins", y = "Volume")
# Histogram of market capitalization
ggplot(data = cryptocurrencies, aes(x = marketcap)) +
geom_histogram() +
labs(title = "Distribution of Market Capitalization", x = "Market Cap", y = "Frequency")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Calculate the Mean, Median, and Mode:
mean_price <- mean(cryptocurrencies$volume)
print(mean_price)
## [1] 1965674123
medium_price <- median(cryptocurrencies$marketcap)
print(medium_price)
## [1] 53111044
library(DescTools)
mode_price <- Mode(cryptocurrencies$pct_change)
print(mode_price)
## [1] 0.01
## attr(,"freq")
## [1] 39
Our team members proposed three cryptocurrencies file of FTX daily price, Layer1 and Coinbase value files, those are combined in one table, then, imported it to SQL workbench, built a normalized relational database, get rid of redundant data, organized the cleaned data by creating two new tables named “Cryptolist” and “Dailyprice” using composite keys grouped by (theDate and Coin) variables as a unique data in entire database without any two same data. So, this database obeys absolutely normalization 1NF rule and prevent data redundancy and keep data integrity throughout our analysis.
Working together on a real data set provided an opportunity for our team to reflect on the skills that helped us along the way to completing this project. I believe it did give us a deeper understanding of what skills are most valued data science skills that can be applied to many scenarios.
Technical Skills
Soft skills