For this project, I will import three untidy datasets from our week 5 discussion board, tidy them, and analyze them. So first, I started by getting my libraries.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
library(readr)
library(ggplot2)
This second rmd file will be for the dataset is from Haig Bedros’s post and focuses on stock market. This dataset contains the oil and gold prices for different dates. It also contains what a bond that date will yield after different year period.
##Step 1: Import the dataset
stockMarket <- read_csv("market_indicators.csv")
## Rows: 3469 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (6): dxy, oil, gold, treasury_5_years, treasury_10_years, treasury_30_y...
## date (1): Date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
colnames(stockMarket)
## [1] "Date" "dxy" "oil"
## [4] "gold" "treasury_5_years" "treasury_10_years"
## [7] "treasury_30_years"
##Step 2: Tidy the dataset
In order to tidy up this dataset and make it one observation per row, I
ended up making 3 new different csvs: 1 for the oil price data, gold
price data, and treasury year. These are all independent observations so
they can not be on the same dataset. Oil and Gold were easy as I just
subsetted the dataset with the couple columns needed. The clean ups are
for columns that did not have any data of that new subset holds
(oil,gold,treasury) but, probably held data for a different new subset
data. For the treasury columns, I subsetted them into a new dataset and
then used pivot_longer to combine them into 2 new columns.
oilMarket_tidy <- subset(stockMarket, select = c(1,2,3))
oilMarket_tidy <-oilMarket_tidy[rowSums(is.na(oilMarket_tidy)) == 0, ]
oilMarket_tidy
## # A tibble: 3,464 × 3
## Date dxy oil
## <date> <dbl> <dbl>
## 1 2009-03-31 85.4 49.7
## 2 2009-04-01 85.5 48.4
## 3 2009-04-02 84.4 52.6
## 4 2009-04-03 84.2 52.5
## 5 2009-04-06 84.6 51.0
## 6 2009-04-07 85.3 49.2
## 7 2009-04-08 85.4 49.4
## 8 2009-04-09 85.8 52.2
## 9 2009-04-13 84.6 50.0
## 10 2009-04-14 84.6 49.4
## # ℹ 3,454 more rows
goldMarket_tidy <- subset(stockMarket, select = c(1,2,4))
goldMarket_tidy <-goldMarket_tidy[rowSums(is.na(goldMarket_tidy)) == 0, ]
goldMarket_tidy
## # A tibble: 3,462 × 3
## Date dxy gold
## <date> <dbl> <dbl>
## 1 2009-03-31 85.4 923.
## 2 2009-04-01 85.5 926.
## 3 2009-04-02 84.4 907.
## 4 2009-04-03 84.2 896.
## 5 2009-04-06 84.6 872.
## 6 2009-04-07 85.3 882.
## 7 2009-04-08 85.4 885.
## 8 2009-04-09 85.8 882.
## 9 2009-04-13 84.6 895.
## 10 2009-04-14 84.6 891.
## # ℹ 3,452 more rows
treasuryMarket <- subset(stockMarket, select = c(1,2,5,6,7))
treasuryMarket <-treasuryMarket[rowSums(is.na(treasuryMarket)) == 0, ]
treasuryMarket
## # A tibble: 3,462 × 5
## Date dxy treasury_5_years treasury_10_years treasury_30_years
## <date> <dbl> <dbl> <dbl> <dbl>
## 1 2009-03-31 85.4 1.67 2.68 3.56
## 2 2009-04-01 85.5 1.65 2.66 3.49
## 3 2009-04-02 84.4 1.74 2.75 3.58
## 4 2009-04-03 84.2 1.88 2.91 3.72
## 5 2009-04-06 84.6 1.90 2.94 3.76
## 6 2009-04-07 85.3 1.86 2.91 3.73
## 7 2009-04-08 85.4 1.83 2.85 3.66
## 8 2009-04-09 85.8 1.89 2.93 3.76
## 9 2009-04-13 84.6 1.79 2.85 3.69
## 10 2009-04-14 84.6 1.72 2.79 3.66
## # ℹ 3,452 more rows
treasuryMarket_tidy <-
pivot_longer(treasuryMarket,
cols=c('treasury_5_years', 'treasury_10_years', 'treasury_30_years'),
names_to = 'Treasury Year',
values_to = 'Amount')
treasuryMarket_tidy <- treasuryMarket_tidy[order(treasuryMarket_tidy$Date),]
#clean up the Treasury Year names replace?
treasuryMarket_tidy$`Treasury Year`[which(treasuryMarket_tidy$`Treasury Year`=="treasury_5_years")] <- 5
treasuryMarket_tidy$`Treasury Year`[which(treasuryMarket_tidy$`Treasury Year`=="treasury_10_years")] <- 10
treasuryMarket_tidy$`Treasury Year`[which(treasuryMarket_tidy$`Treasury Year`=="treasury_30_years")] <- 30
treasuryMarket_tidy
## # A tibble: 10,386 × 4
## Date dxy `Treasury Year` Amount
## <date> <dbl> <chr> <dbl>
## 1 2009-03-31 85.4 5 1.67
## 2 2009-03-31 85.4 10 2.68
## 3 2009-03-31 85.4 30 3.56
## 4 2009-04-01 85.5 5 1.65
## 5 2009-04-01 85.5 10 2.66
## 6 2009-04-01 85.5 30 3.49
## 7 2009-04-02 84.4 5 1.74
## 8 2009-04-02 84.4 10 2.75
## 9 2009-04-02 84.4 30 3.58
## 10 2009-04-03 84.2 5 1.88
## # ℹ 10,376 more rows
write.csv(oilMarket_tidy,file='/Users/Ari/Data607/project2/oilMarket_tidy.csv')
write.csv(goldMarket_tidy,file='/Users/Ari/Data607/project2/goldMarket_tidy.csv')
write.csv(treasuryMarket_tidy,file='/Users/Ari/Data607/project2/treasuryMarket_tidy.csv')
##Step 3: Analysis
I believe since this data uses dates as a main key and is then recording
data throughout a time period. Line graphs will be perfect to display
and analyize this data.
ggplot(data=oilMarket_tidy, aes(x=Date, y=oil)) +
geom_line()+
geom_point()
The line graph above shows the price of oil throughout the dates. As
expected, there is no pattern to the price hikes & dips. The oil
price, however, mostly stays in the range of
$40-125. It
was going slowly up for a while until 2015 where it dipped and
stabilized. There is one huge drop below $0 in early 2020.
Interestingly, I think this weird drop is caused by the beginning of the
pandemic. Since then, the price is increasing rapidly but, the end of
graph shows a dip has started.
ggplot(data=goldMarket_tidy, aes(x=Date, y=gold)) +
geom_line()+
geom_point()
The line graph above shows the price of gold throughout the dates.
Unlike the oil prices, the gold price is rapidly changing and starts
much higher. There are two hugely increasing hikes up around 2009-2012
and 2018-2021. These is also three time periods where the price is
stable.
While the graph looks like gold is making bigger changing and the
price of oil changes the most. For the oil, the starting price was
around $49 and ending price around $74 so
there was only a $25 difference (~66% increase). On the
other other hand, gold’s starting price was around $900 and
end around $1850 so a $950 difference (around
a 48% increase). Both of these graphs have a lot of points with small
increasing or decreasing from day to day resulting very zigzag
lines.
ggplot(data=treasuryMarket_tidy, aes(x=dxy, y=Amount, color=Date)) +
geom_point()
The last graph shows the relationship between dxp/us dollar index and
the bond yield amount grouped by dates. From the scatter plot we can see
the clean increase in the dxy as opposed to the scattered y axis, bond
yield. The different dates are represented by increasing blue colored
lightness. All the same shade of blue tends to be bundled together with
no stray points around the x axis. The same shade group have points up
and down the y axis. This indicates that the treasury amount or bond
yield has no pattern.