Time Series, Data Formats, Output Formats, Project Introduction
2024-10-07
Quiz 1 is now graded.
10% (Submitted Quarto File) + 90% (Blackboard Answers, .csv files and .png file)
Please don’t worry if you are not happy with your score.
Final grading in this course:
adheres to Whitman grading policy, but is fairly gentle.
takes into account assignments, course project, and class participation.
Quiz 2 will be during Week 11 and will combine previous skills with material from weeks 6 through 10.
If you have questions about your quiz, please let me know.
HW 4 is due on Friday, 10/11.
Group Assignments
Complete HW 4 - Part 1 TODAY! (This should only take 5 min.)
Note: If you do not complete this survey, I will not put you in a project group and you cannot pass this class.
Groups of 5 or 6 will be determined and posted (Hopefully by Monday)
If you have a request to work with someone, include that information in your survey (Not required).
Friday, 10/11, is the last day I will accept any group requests.
I cannot guarantee that requests will be honored, but I will try.
I control assignments to maintain some balance in skill level among groups.
Data Sources, etc. also available, and will be updated as needed.
New in Fall 2024: Students are also required to use AI tools to find data.
Examples from previous semesters not comparable
This Fall is the first semester where the Quarto dashboard was fully functional and usable for this class project.
In previous semesters, students used flexdashboard in RStudio and the storyboard template.
The new format gives students a lot more flexibility BUT has more potential pitfalls which we will cover in HW 5.
Groups assigned by Monday 10/21 at the latest
Thu. 10/31 at 5:00 PM: Draft Proposals Due - NO GRACE PERIOD
These proposals should consist of short bullets and links to data sources.
Ideally, it should take me 5 minutes to read your proposed ideas and check your data.
Proposal Meetings:
Groups should come with questions and be prepared to answer my questions (10-15 min. per group)
Meetings will take place in and outside of class. See sign-up sheet.
Wed. 10/31: HW 5 - Part 1 Due
Thu. 11/7: Quiz 2
Tue. 11/12: Final Proposals Due
Not much longer than the draft proposal
Should still be bullet point format
Questions and issues discussed during meeting should be addressed
Chunk Headers
In Chunk 6 (Part 5), the chunk header in the template appears as follows:
The eval=F option prevents this chunk from being evaluated when the file is knit.
It was included in the template because the original code provided was incomplete and incorrect and would cause errors when rendered.
You are asked to remove the text eval=F.
There are many other chunk header options, such as echo=F and include=F.
Chunk options can also be written as comment lines at the top of the chunk, e.g. #|label: import data and #|echo: false.
NOTE: If two chunks are given the EXACT SAME name, e.g. #|label: importing data, the file will not render.
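As a rough sketch of the in-chunk option style, the chunk body below starts with three #| option lines. The label import-demo-data and the tiny data frame are made up for illustration and are not from the HW template. In a .qmd file this code would sit inside an R code chunk.

```r
#| label: import-demo-data
#| echo: false
#| eval: true

# Made-up data frame, only to give the chunk something to run
demo <- data.frame(x = 1:3, y = c(2, 4, 6))
summary(demo)
```

Changing eval to false would stop the chunk from being run when the file is rendered, and reusing the label import-demo-data on a second chunk would stop the file from rendering.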
So far, all Quarto files in this course have been rendered as HTML (.html) files or slides.
Other common formats are Word documents, PDF documents, PowerPoint slides, and dashboards.
We will use the dashboard format (next slide) in HW 5 and in your projects.
Groups will also write their two project memos in Quarto and publish them as Word documents.
REQUIRED: Download the latest version of Quarto here
Quarto Dashboard is a new feature of Quarto that is extremely flexible and straightforward to use.
The Quarto Dashboard Gallery includes example dashboards made with R, Python, and other languages.
In this course I will provide a simple template for HW 5 that can be used to build your dashboard.
Once you understand how to add pages, rows, columns, and tabsets, and modify them as needed, you are welcome to tailor the template to your project.
A Quarto dashboard is a flexible blank canvas that you can tailor to your project and future endeavors.
In recent weeks, we have worked with Box Office Mojo and Bureau of Labor Statistics data.
These datasets are time series data.
They all include a date variable and another quantitative variable that changes at each time period.
So far we have worked with data in an R format called a tibble.
Two common data formats in R, tibble and data.frame, are needed for creating ggplots of time series.
tibble is the more modern format and is more compatible with tidyverse commands to manage data.
Today, we’ll discuss a third data format, xts, that can be used specifically for time series data.
xts using tidyquant Package
Yahoo Finance, the Federal Reserve Bank, the Wall Street Journal, and others are excellent data sources that can be directly imported into R.
The default source for getSymbols in the tidyquant package is Yahoo Finance.
Data format is xts, which we will cover today.
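The import code itself is not shown in these notes, so here is a minimal sketch of what a getSymbols call can look like. getSymbols comes from the quantmod package, which tidyquant attaches when it is loaded. The wrapper name import_stock and the start date are assumptions, with the start date chosen to match the first rows printed later in these notes.

```r
# Sketch only: import_stock is a hypothetical helper, not part of tidyquant or quantmod
import_stock <- function(symbol, from = "2015-01-01") {
  quantmod::getSymbols(symbol,
                       src = "yahoo",        # Yahoo Finance is the default source
                       from = from,          # assumed start date
                       auto.assign = FALSE)  # return the xts object instead of creating it by name
}

# Requires an internet connection, so the calls are left commented out:
# NFLX <- import_stock("NFLX")
# head(NFLX)
```

Calling getSymbols("NFLX", from = "2015-01-01") directly, with the default auto.assign setting, instead creates an object named NFLX in the Global Environment.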
hchart for One Stock
hchart in the highcharter package is one way to plot xts data.
hcharts display
Stocks can be shown in separate plots, either side by side or in one stacked column.
The command hw_grid is used to display them, and ncol indicates how many columns.
nflx_plt <- hchart(NFLX$NFLX.Adjusted, name="Adjusted", color="green") |>
  hc_add_series(NFLX$NFLX.High, name="High", color="darkgreen") |>
  hc_add_series(NFLX$NFLX.Low, name="Low", color="lightgreen")
amzn_plt <- hchart(AMZN$AMZN.Adjusted, name="Adjusted", color="blue") |>
  hc_add_series(AMZN$AMZN.High, name="High", color="darkblue") |>
  hc_add_series(AMZN$AMZN.Low, name="Low", color="lightblue")
dis_plt <- hchart(DIS$DIS.Adjusted, name="Adjusted", color="mediumpurple") |>
  hc_add_series(DIS$DIS.High, name="High", color="purple4") |>
  hc_add_series(DIS$DIS.Low, name="Low", color="plum")
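The hw_grid call that displays the plots is not shown in these notes. The sketch below illustrates how ncol changes the layout, using two small made-up xts series in place of the imported stock data. All object names and values are assumptions for illustration.

```r
library(xts)
library(highcharter)

# Made-up prices standing in for imported stock data
dates   <- seq(as.Date("2024-01-01"), by = "day", length.out = 5)
stock_a <- xts(c(10, 11, 12, 11, 13), order.by = dates)
stock_b <- xts(c(20, 19, 21, 22, 23), order.by = dates)

plt_a <- hchart(stock_a, name = "Stock A", color = "green")
plt_b <- hchart(stock_b, name = "Stock B", color = "blue")

# ncol = 2 places the plots side by side; ncol = 1 stacks them in one column
side_by_side <- hw_grid(plt_a, plt_b, ncol = 2)
stacked      <- hw_grid(plt_a, plt_b, ncol = 1)

# Type side_by_side or stacked in the console to see the grid in the Viewer pane
```

With the three plots created above, the equivalent call would pass nflx_plt, amzn_plt, and dis_plt to hw_grid along with the desired ncol.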
hcharts Display
Session ID: bua455f24
In the example above, we use the hw_grid command to create a multi-plot composition of hcharts.
Previously, we covered another command to create a composition of non-interactive ggplots of tibble data.
What is that other command?
Hints:
This very useful command is in the gridExtra package, which is loaded.
If gridExtra is loaded in R, start typing grid in the console, and the command and others will appear.
Session ID: bua455f24
Use the provided example of getSymbols code to write code to import the stock time series for Apple (AAPL).
Open the imported xts file by clicking on it in the Global Environment.
Sort the AAPL.Adjusted column by clicking on it.
Answer Question:
xts
When these stock datasets are imported, they are in xts format.
xts stands for eXtensible Time Series, which means the datasets are self-aware: each one keeps track of its own dates.
The key feature is that date is NOT a variable; instead, the dates become row IDs.
Any dataset with a date variable can be converted to an xts dataset.
Any xts dataset can be converted to a tibble or data.frame (two common R data formats).
NFLX.Open NFLX.High NFLX.Low NFLX.Close NFLX.Volume NFLX.Adjusted
2015-01-02 49.15143 50.33143 48.73143 49.84857 13475000 49.84857
2015-01-05 49.25857 49.25857 47.14714 47.31143 18165000 47.31143
2015-01-06 47.34714 47.64000 45.66143 46.50143 16037700 46.50143
2015-01-07 47.34714 47.42143 46.27143 46.74286 9849700 46.74286
2015-01-08 47.12000 47.83571 46.47857 47.78000 9601900 47.78000
2015-01-09 47.63143 48.02000 46.89857 47.04143 9578100 47.04143
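To make the "dates become row IDs" idea concrete, here is a small sketch that converts a made-up tibble with a date variable to xts and back. The dataset and its names (sales, units, price) are invented for illustration. The xts() and fortify.zoo() calls are one way to do each conversion, not necessarily the one used in the exercises.

```r
library(xts)     # also attaches zoo, which provides fortify.zoo()
library(tibble)

# Made-up dataset with a date variable
sales <- tibble(
  date  = as.Date(c("2024-01-31", "2024-02-29", "2024-03-31")),
  units = c(120, 135, 128),
  price = c(9.5, 9.75, 10)
)

# tibble -> xts: numeric columns become the data, dates become the index
sales_xts <- xts(as.matrix(sales[, c("units", "price")]), order.by = sales$date)
index(sales_xts)     # the dates, stored as row IDs
colnames(sales_xts)  # only the two numeric columns; date is no longer a variable

# xts -> tibble: the index returns as a column named Index
sales_back <- as_tibble(fortify.zoo(sales_xts))
names(sales_back)
```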
xts datasets using merge
Converting xts to a tibble or data.frame (R data formats) is required if you want to create a ggplot or use other methods covered previously.
A good first step is to create a merged xts dataset of the desired variables.
#|label: merge xts stock data
# data are merged by matching dates
nflx_amzn_dis <- merge(NFLX$NFLX.Adjusted,
                       AMZN$AMZN.Adjusted,
                       DIS$DIS.Adjusted)
head(nflx_amzn_dis)
NFLX.Adjusted AMZN.Adjusted DIS.Adjusted
2015-01-02 49.84857 15.4260 86.69246
2015-01-05 47.31143 15.1095 85.42558
2015-01-06 46.50143 14.7645 84.97248
2015-01-07 46.74286 14.9210 85.84172
2015-01-08 47.78000 15.0230 86.72946
2015-01-09 47.04143 14.8465 87.15480
xts datasets to tibble format
There are a few ways to convert an xts to a tibble.
In the code below I show the conversion and then I rename the new date variable as date.
# converting data to a tibble requires a couple lines of code
# I prefer to rename the index as date
nflx_amzn_dis_tibble <- nflx_amzn_dis |>
  fortify.zoo() |> as_tibble(.name_repair = "minimal") |>
  rename("date" = "Index")
head(nflx_amzn_dis_tibble)
# A tibble: 6 × 4
date NFLX.Adjusted AMZN.Adjusted DIS.Adjusted
<date> <dbl> <dbl> <dbl>
1 2015-01-02 49.8 15.4 86.7
2 2015-01-05 47.3 15.1 85.4
3 2015-01-06 46.5 14.8 85.0
4 2015-01-07 46.7 14.9 85.8
5 2015-01-08 47.8 15.0 86.7
6 2015-01-09 47.0 14.8 87.2
xts
xts dataset
hchart or dygraph (next topic) for any dataset with a date variable.
hchart)
hchart
dygraph is a more flexible alternative to hchart.
dygraph and hchart allow the viewer to interactively select a date range.
Here is the dataset we will use:
#|label: dataset for dygraphs example
three_stocks <- merge(AMZN$AMZN.Adjusted, DIS$DIS.Adjusted, NFLX$NFLX.Adjusted)
names(three_stocks) <- c("AMZN.adj", "DIS.adj", "NFLX.adj")
head(three_stocks, 3) # print first three rows only
AMZN.adj DIS.adj NFLX.adj
2015-01-02 15.4260 86.69246 49.84857
2015-01-05 15.1095 85.42558 47.31143
2015-01-06 14.7645 84.97248 46.50143
Basic unformatted plot of three stocks with the range selector option
Two useful formatting options (shown below) to make the plot more readable are:
Removing the grid lines
Formatting the axis labels
Vertical lines can be added at specific dates and can be labeled and formatted.
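The code for these dygraph steps is not included in these notes. The sketch below shows one way each step could look, using dygraph, dyRangeSelector, dyOptions, dyAxis, and dyEvent from the dygraphs package. The data, object names, titles, and event date are all made up for illustration.

```r
library(xts)
library(dygraphs)

# Made-up monthly series standing in for the merged stock data
dates    <- seq(as.Date("2020-01-01"), by = "month", length.out = 12)
demo_xts <- xts(cbind(A.adj = c(10, 11, 12, 9, 10, 12, 13, 14, 13, 15, 16, 17),
                      B.adj = c(20, 19, 18, 15, 16, 17, 19, 20, 21, 22, 21, 23)),
                order.by = dates)

# Step 1: basic unformatted plot with the range selector
basic_dg <- dygraph(demo_xts, main = "Two Made-up Series") |>
  dyRangeSelector()

# Step 2: remove the grid lines and format the axis labels
formatted_dg <- dygraph(demo_xts, main = "Two Made-up Series") |>
  dyOptions(drawGrid = FALSE) |>
  dyAxis("x", label = "Date") |>
  dyAxis("y", label = "Adjusted Price ($)") |>
  dyRangeSelector()

# Step 3: add a labeled, formatted vertical line at a specific date
event_dg <- formatted_dg |>
  dyEvent("2020-06-01", label = "Made-up event", labelLoc = "bottom", color = "red")

# Type basic_dg, formatted_dg, or event_dg in the console to view each plot
```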
bls_tidy Function - Labor Data
Before using our function on new data, we ALWAYS examine the .csv files.
The number of rows to skip for these three labor datasets is 11.
bls_tidy <- function(data_file, skip_num, var_name){
  read_csv(data_file, skip = skip_num, show_col_types = F) |>
    pivot_longer(cols = Jan:Dec,
                 names_to = "month",
                 values_to = "value") |>
    filter(!is.na(value)) |>
    rename({{var_name}} := "value")
}
labor_force <- bls_tidy("data/bls_civ_lf.csv", skip_num=11, var_name="lf")
unemp <- bls_tidy("data/bls_civ_unemp.csv", skip_num=11, var_name="unemp")
emp <- bls_tidy("data/bls_civ_emp.csv", skip_num=11, var_name="emp")
head(unemp)
# A tibble: 6 × 3
Year month unemp
<dbl> <chr> <dbl>
1 2014 Jan 10202
2 2014 Feb 10349
3 2014 Mar 10380
4 2014 Apr 9702
5 2014 May 9859
6 2014 Jun 9460
Last week and in HW 4 we covered joining TWO datasets.
The commands we covered (there are 4) all have the same limitation: datasets must be joined two at a time.
Joining with Piping
#|label: joining 3 datasets with pipes
# with piping
lf_all <- labor_force |>
  full_join(emp) |>
  full_join(unemp) |>
  write_csv("data/labor_tidy.csv") #export
head(lf_all)
# A tibble: 6 × 5
Year month lf emp unemp
<dbl> <chr> <dbl> <dbl> <dbl>
1 2014 Jan 155352 145150 10202
2 2014 Feb 155483 145134 10349
3 2014 Mar 156028 145648 10380
4 2014 Apr 155369 145667 9702
5 2014 May 155684 145825 9859
6 2014 Jun 155707 146247 9460
Joining without Piping
#|label: joining 3 datasets without pipes
lf_all <- full_join(labor_force, emp)
lf_all <- full_join(lf_all, unemp)
head(lf_all)
# A tibble: 6 × 5
Year month lf emp unemp
<dbl> <chr> <dbl> <dbl> <dbl>
1 2014 Jan 155352 145150 10202
2 2014 Feb 155483 145134 10349
3 2014 Mar 156028 145648 10380
4 2014 Apr 155369 145667 9702
5 2014 May 155684 145825 9859
6 2014 Jun 155707 146247 9460
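As an aside that is not part of the course material, the repeated two-at-a-time joins can also be written with reduce from the purrr package, which applies full_join to a list of datasets one pair at a time. The sketch below uses the first two rows of each labor dataset shown above. The object names are hypothetical.

```r
library(dplyr)
library(purrr)
library(tibble)

# First two rows of each dataset, copied from the output above
lf_demo    <- tibble(Year = 2014, month = c("Jan", "Feb"), lf    = c(155352, 155483))
emp_demo   <- tibble(Year = 2014, month = c("Jan", "Feb"), emp   = c(145150, 145134))
unemp_demo <- tibble(Year = 2014, month = c("Jan", "Feb"), unemp = c(10202, 10349))

# reduce() still joins two datasets at a time, carrying the result forward
all_demo <- list(lf_demo, emp_demo, unemp_demo) |>
  reduce(full_join, by = c("Year", "month"))

all_demo
```

Naming the by columns explicitly also avoids the message that full_join prints when it has to guess the matching variables.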
Chunk below includes code that is similar to Parts 3 and 4 of HW 4.
BONUS: Code modified to show how to get ‘End of Month’ (eom) date.
#|label: dates and data mod for plot
lf_plt <- lf_all |>
  mutate(date_som = ym(paste(Year, month)),           # create som date var
         date = ceiling_date(date_som, "month")-1,    # create eom date var
         empM = (emp/1000) |> round(2),               # convert thousands to millions
         unempM = (unemp/1000) |> round(2)) |>
  select(date, empM, unempM) |>                       # select vars and reshape
  pivot_longer(cols=empM:unempM, names_to = "type", values_to = "count") |>
  mutate(type = factor(type,                          # create factor var for plot
                       levels = c("unempM", "empM"),
                       labels = c("Unemployed", "Employed")))
head(lf_plt, 4) # examine first 4 rows
# A tibble: 4 × 3
date type count
<date> <fct> <dbl>
1 2014-01-31 Employed 145.
2 2014-01-31 Unemployed 10.2
3 2014-02-28 Employed 145.
4 2014-02-28 Unemployed 10.4
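To isolate the end-of-month step, here is a short sketch using the same lubridate functions as the chunk above on three example months. ceiling_date rounds each start-of-month date up to the first day of the next month, and subtracting 1 steps back one day to the last day of the original month. The object names are made up for illustration.

```r
library(lubridate)

# Start-of-month (som) dates built from a year and a month abbreviation
demo_som <- ym(paste(2014, c("Jan", "Feb", "Dec")))

# End-of-month (eom) dates: first day of the next month, minus one day
demo_eom <- ceiling_date(demo_som, "month") - 1

demo_som
demo_eom
```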
lf_area_plt_slides <- lf_plt |>
  ggplot() +
  geom_area(aes(x=date, y=count, fill=type)) +
  theme_classic() +
  theme(legend.position="bottom") +
  scale_fill_manual(values=c("red", "blue")) +
  scale_x_date(date_breaks = "year", date_labels = "%Y") +
  labs(x="Date", y = "Number of People (Millions)", fill="",
       title="Total Labor Force: Employed and Unemployed ",
       subtitle="Jan. 2014 - June 2024",
       caption="Data Source:www.bls.gov") +
  theme(plot.title = element_text(size = 20),
        plot.subtitle = element_text(size = 15),
        axis.title = element_text(size=18),
        axis.text = element_text(size=15),
        plot.caption = element_text(size = 10),
        legend.text = element_text(size = 12),
        panel.border = element_rect(colour = "lightgrey", fill=NA, linewidth=2),
        plot.background = element_rect(colour = "darkgrey", fill=NA, linewidth=2))
Additional formatting in previous slides can always be added
Plot exported using ggsave, which by default exports the last plot created
#|label: simpler plot code with ggsave export
lf_area_plt <- lf_plt |>
  ggplot() +
  geom_area(aes(x=date, y=count, fill=type)) +
  theme_classic() +
  theme(legend.position="bottom") +
  scale_fill_manual(values=c("red", "blue")) +
  scale_x_date(date_breaks = "year", date_labels = "%Y") +
  labs(x="Date", y = "Number of People (Millions)", fill="",
       title="Total Labor Force: Employed and Unemployed ",
       subtitle="Jan. 2014 - Jun. 2024",
       caption="Data Source:www.bls.gov") +
  theme(plot.title = element_text(size = 20),
        plot.subtitle = element_text(size = 15),
        axis.title = element_text(size=18),
        axis.text = element_text(size=15),
        plot.caption = element_text(size = 10),
        legend.text = element_text(size = 12))
ggsave("img/labor_force_area_plot.png", width=6, height=4)
In this exercise we will:
Import labor_tidy.csv, convert variables to millions, round to 2 decimal places, and select two variables. (Review)
Convert labor_new to an xts format, labor_xts
Create an hchart with two variables, lfM and empM, and save it as labor_hc
Submit screenshots of plot from Viewer pane.
Save R code as an R Script. In the R project folder I have saved an R Script for your work (Updated October 2024).
Copy and paste code into the provided R Script and use save as to save the file with your name, e.g. Week_7_In_Class_Penelope_Pooler.R
R Script should include:
code I provided to import and modify data
tibble to xts conversion of labor dataset
hchart plot code (required) with code comments using #
dygraph plot code (optional but recommended) with code comments using #
Submit final script on Blackboard (counts towards class participation for Week 7)
Due by Friday 10/11. No late submission accepted for In-class Exercises.
Quarto and Markdown files are ‘smart’, i.e. aware of where they are located.
R Scripts (older common file type) are useful BUT not aware of file location.
User must specify working directory
The script I provided is saved to your working directory
To check working directory: getwd()
To set working directory to code_data_output folder: (for working in an R Script)
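The setwd code for this slide is not shown in these notes. As a minimal sketch, assuming code_data_output is a subfolder of the current working directory (the actual location on your computer may differ), the two commands could be used as follows:

```r
# Check the current working directory
getwd()

# Assumed location: code_data_output sits inside the current working directory
target <- file.path(getwd(), "code_data_output")

# Only change the working directory if that folder exists
if (dir.exists(target)) {
  setwd(target)
}

# Confirm where R is now looking for files
getwd()
```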
NOTES:
R users and developers do not recommend setting working directories within code which would have to be changed for each laptop.
Whenever possible, use R Projects and ‘smart’ files such as .qmd and .Rmd files.
Time Series Data
Importing stock data from Yahoo Finance as xts
Converting between xts and tibble
Plotting options include area plots, hcharts and dygraphs
dygraphs and hcharts are useful tools for understanding, managing, and curating time series data.
HW 4 due Friday, 10/11
Grace period in effect.
TAs and I are available to assist if you have questions.
You may submit an ‘Engagement Question’ about each lecture until midnight on the day of the lecture. A minimum of four submissions is required during the semester.