---
title: "Week 8"
subtitle: "Dashboard Project Proposal & Data Cleaning"
author: "Penelope Pooler Eisenbies"
date: last-modified
lightbox: true
toc: true
toc-depth: 3
toc-location: left
toc-title: "Table of Contents"
toc-expand: 1
format:
html:
code-line-numbers: true
code-fold: true
code-tools: true
execute:
echo: fenced
---
## Housekeeping
```{r include=F}
#|label: setup
knitr::opts_chunk$set(echo=T, highlight=T) # specifies default options for all chunks
options(scipen=100) # suppress scientific notation
# install pacman if needed
if (!require("pacman")) install.packages("pacman", repos = "http://lib.stat.cmu.edu/R/CRAN/")
pacman::p_load(pacman, tidyverse, gridExtra, magrittr,
kableExtra, tidyquant, highcharter, dygraphs)
# install and load required packages
p_loaded() # verify loaded packages
```
- Groups are assigned and Posit Cloud groups will be
created after Spring Break.
- Project Proposal Rough Draft is due Thursday, 3/26 (No
grace period)
- HW 5 - Part 1 is due Wed. 3/25 (2 day graceperiod)
- We will talk about HW 5 - Part 1 this week
##
### BUA 455 Dashboard Group Project
::: fragment
**Importance of Proposals:**
:::
- As a data analyst, you have skills that others don't.
- In this course, you are learning to acquire, combine,
manage, and present data.
- In the Business Analytics major, you learn to analyze
and interpret data.
- These skills are unique and are useful to employers,
colleagues, clients BUT you will spend your career
explaining and selling and explaining your skillset.
- Most people will NOT know how your skills can help them.
- It is your role to help them understand:
- what questions can be answered with data
- what data will help them answer their questions and
how to get it.
- How data management and analyses can help them and
will be essential to their goals.
##
### How the Proposal Process fits Wthin this course
- The scope of this course is **limited to data
acquisition, management, and presentation**.
- Much can be learned from combining data sources and
exploring data prior to analysis.
- In other courses, you learn analytical methods that
complement this course.
- The more methods and concepts you know, the better
prepared you are to deal with any data you encounter.
- No matter how much you know, the majority of you time
will be spent using the skills from this course.
- Depending on your work environment and position, others
may be assigned time consuming repetitive data
management tasks by you BUT
- You should still direct the data management and
presentation work because you will be held
responsible.
- **File management, data management, data curation, and
data presentation complement and augment analyses**.
##
### Project Proposal Rough Draft
- The rough draft does not need to be more than a brief
bullet-ed outline.
- **What data does your group plan to work with?**
- Summarize data topic in 1-2 sentences.
- Do some data exploration and look for links to good
data sources (AI can be helpful in search).
- Rough draft proposal should include ***links to
specific data sources that might work.***
- **What questions do you think you want to answer?**
- Propose 3-4 questions (or more) you hope to address
with your data dashboard.
- For each question propose which dataset and
variables might help you address it.
## Data Sources
- The [course website lists some data
sources](https://penelope2040.quarto.pub/bua-455-semester/#interesting-data){target="_blank"}
but there are MANY more.
- Another starting place for inspiration is
[**Statista**](https://login.libezproxy2.syr.edu/login?url=https://www.statista.com){target="_blank"}
- It is free if you login from the SU Library and is a
great starting point.
- **Your project should go beyond the small curated
Statista datasets** but it may provide inspiration.
- For this course, consider the topic choice to be a
iterative process:
1. Develop a topic and potential questions (but don't
become too attached).
2. Explore provided data sources, Google, and AI using
your question(s).
3. Examine data that is somewhat related to question.
4. Edit and refine topic and question based on data
available.
## Data Source Challenges
- If you are interested in **sports** data or **billboard
music** data and don't know how to scrape data, I can
help.
<br>
- There are sports and music data available, but the
websites often don't make the datasets downloadable.
<br>
- Instead, the data must be scraped as we did with the Box
Office Mojo data.
<br>
- Projects that require scraping data are doable, but data
management will take more time and **you should start
ASAP**.
## Data Sourcing Example
- I chose a VERY obscure topic to see how long it would
take me to find a reasonable dataset.
- Initial topic: **Craft Traditions in Europe**
- I started with prompt to CoPilot.
- That led me to links for
[**Eurostat**](https://ec.europa.eu/eurostat){target="_blank"}
and the [**European
Commission**](https://commission.europa.eu/index_en){target="_blank"}
- On the **European Commission** site I found the [**Joint
Research
Center**](https://commission.europa.eu/about-european-commission/departments-and-executive-agencies/joint-research-centre_en){target="_blank"}.
- A combination of website exploration and AI prompts led
to
- [**Competence Centre on Composite Indicators and
Scoreboards**](https://knowledge4policy.ec.europa.eu/composite-indicators_en){target="_blank"}
- [**JRC Composite Indicators
Site**](https://composite-indicators.jrc.ec.europa.eu/){target="_blank"}
and then led to
- and a site about [**cultural and creative European
cities**](https://composite-indicators.jrc.ec.europa.eu/cultural-creative-cities-monitor){target="_blank"}
- and a link to [**Docs and
Data**](https://composite-indicators.jrc.ec.europa.eu/cultural-creative-cities-monitor/docs-and-data){target="_blank"}
and this download link:
- {width="50%"}
## Flexibility Leads to Interesting Data
- Raw Excel Data with interesting variables for over 500
European cities.
- Dataset is not directly related to initial topic but
there are many veriables and observations to build a
multi-page dashboard.
- Going from initial obscure idea to interesting data took
about 30 minutes.
- We (TAs and I) are happy to help you with this process.
**Be flexible and curious.**
::: fragment
{width="65%;"
fig-align="center"}
:::
## Final Proposal
- **Due Thursday 4/9 - 24 Hour Grace Period**
- The final proposal draft should be 1-3 pages at most
(double spaced, 12 pt. font).
- **Bullet points are preferred to paragraphs**
- Although your client(s) are hypothetical, it helps to
envision your target client.
- The goal of the proposal is to
- explain in non-technical language why your dashboard
will be useful.
- convince the prospective client to trust you with
their data.
- How will client use the information provided in the
dashboard to
- improve their marketing plan or their purchasing and
staffing plan?
- improve their overall investment strategy or their
supply chain strategy?
## More about Proposal Final Draft
- **Due Thursday 4/9 - 24 Hour Grace Period**
- The proposal also provides a roadmap for the work you
plan to do so that all group members are informed.
- FINAL Proposal should **ALSO** include:
- a preliminary descriptive list of your planned
pages.
- a tentative plan for what data will be presented on
each page, and how will it be presented (which plots
or tables).
- **a list of which team member is primarily
responsible for each page**.
## Proposals are **TENTATIVE** Agreements
- Understandably, the work is not yet done, so your plans
may change, which is okay.
- Your final plots and tables are not done, but it is good
to have a plan for what they will be.
- If possible show explicit links between each question or
goal you plan to address and the proposed component of
the dashboard that addresses it.
- In a 'real world' situation:
- It is better not to provide final proprietary work
products at proposal stage.
- Ideas and plots should be polished enough to entice
client.
- A good proposal shows that you have creative ideas and
have put some thought into how you can help the client.
- In real life, proposals and client goals are more
concrete, but in this course, I encourage you to
experiment.
- ***Again, it is okay if your plans change based on what
you discover in the data.***
## Introduction to HW 5 - Part 1 Data
::: fragment
**Today: Examine and clean data**
:::
- First steps are to examine .csv file and to import and
examine data in R.
- Look for quirks that might be a challenge
- Hint: Examine `date_added` column in the .csv file
or in R
::: fragment
```{r import netflix data and examine}
# import data and examine
nflx <- read_csv("data/netflix_titles.csv", show_col_types=F) |> glimpse()
```
:::
##
### Week 8 In-class Exercises - Q1-Q3
[***Poll
Everywhere***](https://pollev.com/penelopepoolereisenbies685){target="_blank"} -
My User Name: **penelopepoolereisenbies685**
- **Q1.** How many different date formats do you see in
the date_added column in the `.csv` file?
- **Q2.** How many different date formats do you see in
the date_added column in the imported dataset in the
`Global Environment`?
- **Q3.** How many different commands are required to
convert the date_added column to a date variable?
## Cleaning These Data
New command for dealing with text data: **`separate`**
Review of `lubridate` commands
- Next steps:
- Select variables
- use separate to split `listed_in` into 3 columns
(Why?)
- Convert `date_added` to a date variable and then
create a year variable
::: fragment
```{r select and do prelim variable cleaning}
nflx <- nflx |>
select(show_id, type, date_added, release_year, rating, listed_in) |> # useful variables (subjective)
separate(listed_in, sep=", ", into=c("g1", "g2", "g3")) |> # separate out genre listings (notice ", ")
# will result in lots of NAs
mutate(date_added = mdy(date_added), # convert data_added to a date var
year_added = year(date_added)) # create a separate year var
head(nflx,3)
```
:::
## Week 8 In-class Exercises - Q4
[***Poll
Everywhere***](https://pollev.com/penelopepoolereisenbies685){target="_blank"} -
My User Name: **penelopepoolereisenbies685**
`separate` is the reverse of an R command we have already
learned.
- What command is that?
## Data Exploration and Additional Cleaning
::: fragment
**Finding and Removing Data with Missing Values (Depends on
Subject Matter)**
:::
- Are there observations for which `year_added` is missing
(`NA`)?
- **Cool trick:** `is.na(nflx$year_added)` will create
a vector of `FALSE` and `TRUE` values.
- `FALSE` unless the observation value is `NA`.
- R interprets `FALSE` as 0 and `TRUE` as 1 so you can
sum this vector to find number of missing values.
::: fragment
```{r remove observations for which year_added is missing}
sum(is.na(nflx$year_added)) # number of missing values (there are 10)
table(is.na(nflx$year_added)) # could also summarize with table
nflx <- nflx |> filter(!is.na(year_added)) # remove NA's and check again
sum(is.na(nflx$year_added))
```
:::
## Data Exploration and Additional Cleaning
- It is not always valid to delete data unless it is for a
specific valid transparent reason.
- For example: Are there observations for which rating is
`NR` (not-rated) or `UR` (un-rated)? Are there other
issues?
- In this case we delete observation because we are
restricting our data to more common rating categories.
- Remainder of the categories can be grouped logically.
::: fragment
```{r remove observations with nr or ur rating}
table(nflx$rating) # examine categories
nflx <- nflx |> filter(!rating %in% c("NR", "UR", "66 min", "74 min", "84 min"))
table(nflx$rating) # examine again after filter
```
:::
## When can you delete data?
**If you find raw uncleaned data on the internet or are
given data by a client or co-worker, is it okay to delete
observations?**
Yes, under certain circumstances:
1. **You must be completely transparent about how you are
limiting the data and why.**
- For example, I am limiting these data to specific time
periods and specific genres.
- Other time and genre categories have too little data.
2. **You should never delete outliers or unusual
observations unless you can confirm they are errors or
are not part of the data you are studying.**
- I am simplifying the data by removing categories that
are not of primary interest.
- Someone else might choose different categories and limit
the data differently.
- Always be transparent about data sources and all
decisions you make.
- I am also deleting three observations for which the
ratings were missing.
##
### Modifying text or numeric variables with `ifelse`
- Numeric or text data can be converted to a categorical
variable with a set of explicit `ifelse` statements.
- `ifelse` is not the ONLY way to categorize data, but
it's my tool of choice.
- The input values for `ifelse` are ALWAYS the same:
- Input 1: a test where result is `TRUE` or `FALSE`.
- Input 2: output value if the test result is `TRUE`.
- Input 3: output value if the test result is `FALSE`.
::: fragment
```{r simple example of ifelse, eval = F}
x <- seq(9,90,9) # create var x that is a sequence of values
(d <- tibble(x)) # convert x to a tibble (dataset)
(d <- d |> mutate(y = ifelse(x < 35, "Low", NA))) # add variable y to dataset d
(d <- d |> mutate(y = ifelse(x >= 35 & x < 65, "Medium", y)))
(d <- d |> mutate(y = ifelse(x >= 65, "High", y)))
```
:::
##
### Modifying text multiple `ifelse` statements
- The previous example can be (and should be) condensed
into one mutate statement as shown here.
::: fragment
```{r above example in one mutate command}
# create a dataset with 1 variable
x <- seq(9,90,9)
d <- tibble(x)
d <- d |> mutate(y = ifelse(x < 35, "Low", NA),
y = ifelse(x >= 35 & x < 65, "Medium", y),
y = ifelse(x >= 65, "High", y),
yF = factor(y, levels = c("Low", "Medium", "High"))) |>
# levels option used to specify correct order
glimpse()
```
:::
## Week 8 In-class Exercise
- We will examine the Netflix ratings data again and
simplify the categories to reflect minimum appropriate
ages.
- In this case, I have done some research on what these
ratings mean.
- In your projects, you will be expected to understand the
values in your variables
- You are likely to have to simplify your data
categories for your data management goals.
- My proposed solution for these data is to categorize
both TV shows and Movies into four minimum age
categories:
- Minimum Age Categories: 0, 7, 13, 17
- Minimum Age 0: "G", "TV-G", "TV-Y"
- Minimum Age 7: "PG", "TV-PG", "TV-Y7", "TV-Y7-FV"
- Minimum Age 13: "PG-13", "TV-14"
- Minimum Age 17: "NC-17", "TV-MA", "R"
## What we will do Today (Thu. 3/5):
- Together we will fill in the blanks in the R chunk below
- You will submit the entire chunk (with your added
comments) on Blackboard in a text file.
- Copy it into empty an text file and save it with
your name.
- This counts towards class participation for
Thursday, 10/23 and is due by Friday, 10/24, at
midnight.
```{r using ifelse to create min_age variable for prof, echo=F, eval=T, include=F}
#table(nflx$rating)
nflx <- nflx |>
mutate(min_age = ifelse(rating %in% c("G", "TV-G", "TV-Y"), 0, NA),
min_age = ifelse(rating %in% c("PG", "TV-PG", "TV-Y7", "TV-Y7-FV"), 7, min_age),
min_age = ifelse(rating %in% c("PG-13", "TV-14"), 13, min_age),
min_age = ifelse(rating %in% c("NC-17", "TV-MA", "R"), 17, min_age))
#table(nflx$min_age)
```
::: fragment
```{r using ifelse to create min_age variable}
table(nflx$rating)
# nflx <- nflx |>
# mutate(min_age = ifelse(rating %in% c("G", "TV-G", "TV-Y"), 0, ____),
# min_age = ifelse(rating %in% c("PG", "TV-PG", "TV-Y7", "TV-Y7-FV"), 7, _____),
# min_age = ifelse(rating %in% c("PG-13", "TV-14"), ____, ____),
# min_age = ifelse())
# table(nflx$min_age)
```
:::
## Another `ifelse` Example
- We can use a similar technique to categorize when shows
or movies were first shown (released).
- This code is very similar to our simple numerical
example shown previously.
- In HW 5 - Part 1 (**Available Thu. 3/5**) you will:
- Filter data to remove releases from before 1981.
::: fragment
```{r categorizing release data into periods}
#table(nflx$release_year)
nflx <- nflx |>
mutate(release_period = ifelse(release_year <= 1980, "1925-1980", NA),
release_period = ifelse(release_year > 1980 & release_year <= 2000, "1981-2000", release_period),
release_period = ifelse(release_year > 2000 & release_year <= 2005, "2001-2005", release_period),
release_period = ifelse(release_year > 2005 & release_year <= 2010, "2006-2010", release_period),
release_period = ifelse(release_year > 2010 & release_year <= 2015, "2011-2015", release_period),
release_period = ifelse(release_year > 2015, "2016-2021", release_period))
table(nflx$release_period)
```
:::
## Cleaning Text Variables
- The code that follows **WILL NOT RUN** until the
`min_age` and `release_period` variables are created
(previous chunks).
- The next steps are tedious, but important, and will be
different for every dataset you encounter.
- On a quiz you should know how to use these text
commands:
- `gsub`, `str_trim`,
- `separate`, `paste`, (or `unite`)
- `tolower`, `toupper`, (or `str_to_lower` and
`str_to_upper`)
- `ifelse` (covered in previous slides)
<br>
- Not required but useful: use `?str_trim` to also see
`str_squish` and `str_pad`
##
### Week 8 In-class Exercises - Q5
[***Poll
Everywhere***](https://pollev.com/penelopepoolereisenbies685){target="_blank"} -
My User Name: **penelopepoolereisenbies685**
- `str_trim` is used in an upcoming chunk to trim spaces
before characters in some observations.
::: fragment
**What is another more general command that can used to
replace spaces in character strings with nothing, e.g.
replace " movies" with "movies"?**
:::
##
### Week 8 In-class Exercises - Q6
[***Poll
Everywhere***](https://pollev.com/penelopepoolereisenbies685){target="_blank"} -
My User Name: **penelopepoolereisenbies685**
- Type `?tolower` into the console to view help file for
this command and `toupper`.
- If the tidyverse package is loaded, you can type
`?str_to_title` in the console to view the helpfile for
this command, and others.
- Copy and paste the following lines into the console. The
third line requires tye tidyverse package.
::: fragment
`tolower("WhAT is the Result of this Command?")`
`toupper("WhAT is the Result of this Command?")`
`str_to_title("WhAT is the Result of this Command?")`
:::
::: fragment
**What is the base command (not tidyverse) to covert a text
string to all lowercase, which makes data management
easier?**
:::
## Multi-step Process to clean genre
- Reshape data with `pivot_longer` and convert all text to
lower case with `to_lower`
- Examine in console using this command:
`cbind(table(nflx$genre))`
::: fragment
```{r cleaning up genre}
nflx <- nflx |>
select(show_id, type, year_added, release_period, min_age, g1, g2, g3) |>
pivot_longer(cols=c("g1","g2","g3"), names_to="g", values_to="genre")
# g is a place holder
nflx <- nflx |> select(!g) |> mutate(genre=tolower(genre)) |> glimpse()
```
:::
## Multi-step Process to clean genre
:::::: columns
::: {.column width="48%"}
- Recall when `separate` was used above, delimiter was
specified as ", "
- That removed leading spaces.
- Alternatively, you emove leading spaces with
`str_trim`
- Examine in console using this command:
`cbind(table(nflx$genre))`
:::
::: {.column width="4%"}
:::
::: {.column width="48%"}
{fig-align="center"}
:::
::::::
## Remove Unnecessary Text
::::::: columns
:::: {.column width="48%"}
- This text is unnecessary and redundant because of `type`
variable in data.
- Examine in console using this command:
`cbind(table(nflx$genre))`
::: fragment
```{r remove movies tv shows and tv text}
# remove the text `movies` and `tv shows` and `tv` from genre names
nflx <- nflx |>
mutate(genre = gsub(" movies", "", genre),
genre = gsub("movies", "", genre),
genre = gsub(" tv shows", "", genre),
genre = gsub("tv shows", "", genre),
genre = gsub("tv ", "", genre),
genre = gsub(" tv", "", genre))
```
:::
::::
::: {.column width="4%"}
:::
::: {.column width="48%"}
{fig-align="center"}
:::
:::::::
##
### Consolidate `genre` Categories using `ifelse`
- Below I am only editing and consolidating categories I
plan to use.
- Examine in console using this command:
`cbind(table(nflx$genre))`
::: fragment
```{r consolidate and edit genres of interest}
nflx <- nflx |>
mutate(genre = ifelse(genre == "drama", "dramas", genre),
genre = ifelse(genre %in% c("stand-up comedy",
"stand-up comedy & talk shows"),
"comedies", genre),
genre = ifelse(genre %in% c("documentaries", "docuseries"),
"docs", genre),
genre = ifelse(genre %in% c("children & family", "kids'"),
"kids", genre),
genre = gsub("action & adventure", "action_adventr", genre))
```
:::
## Filter Data to Genres of Interest
::::::: columns
:::: {.column width="60%"}
::: fragment
```{r filter data to six genres we have consolidated}
nflx <- nflx |>
filter(genre %in% c("action_adventr", "comedies", "docs",
"dramas", "international", "kids"))
```
:::
- Examine results in console with these commands
- `cbind(table(nflx$genre))`
- `table(nflx$type, nflx$genre)`
::::
::: {.column width="4%"}
:::
::: {.column width="36%"}
{fig-align="center"}
:::
:::::::
## Create Final Summary Dataset
In the HW 5 - Part 1 Example Dashboard, I use the `nflx_tv`
dataset, `TV Show` observations from this dataset.
In HW 5 - Part 1, Students will create use the `nflx_mv`
dataset, the `Movie` observations from this dataset.
```{r create final summary dataset}
nflx_wide <- nflx |>
group_by(type, release_period, year_added, min_age, genre) |>
summarize(n = n()) |> # n() tallies number of obs
pivot_wider(id_cols = c(type, release_period, year_added, min_age), # col. for each genre
names_from = genre,
values_from = n)
nflx_wide[is.na(nflx_wide)] <- 0 # why should NA's be 0 here?
nflx_tv <- nflx_wide |> # example dashboard made from nflx tv
filter(type == "TV Show")
# nflx_mv <-
```
##
### Optional Group Meetings with Me
:::::: columns
::: {.column width="48%"}
- After break I will post a sign-up schedule
- Meetings will be 5-10 minutes on Zoom (or can be in
person during Office Hours)
- After reading your submitted pre-proposal I will:
- Ask your group questions
- Provide feedback to guide you towards good data and
a good plan
- Answer question you might have before you write your
final proposal.
:::
::: {.column width="4%"}
:::
::: {.column width="48%"}
{fig-align="center"}
:::
::::::
##
### Key Points from This Week
- Group Dashboard Project and Proposals:
- Rough Draft due Thu. 3/26 at 6:00 PM
- Optional proposal group meetings will take place on
4/7 and 4/8
- Final Proposal due Thu. 4/9 at 6:00 PM (24 hour
grace period)
- Data Exploration and Cleaning
- Iterative, subjective process
- Always be transparent and document choices
- Introduction to HW 5 - Part 1 Data and Assignment
- Due Wed. 3/25 at midnight
- Quiz 2 on Thu. 4/7
::: fragment
You may submit an 'Engagement Question' about each lecture
until midnight on the day of the lecture. **A minimum of
four submissions are required during the semester.**
:::