---
title: "Week 9"
subtitle: "Dashboard Project Proposal & Data Cleaning"
author: "Penelope Pooler Eisenbies"
date: last-modified
lightbox: true
toc: true
toc-depth: 3
toc-location: left
toc-title: "Table of Contents"
toc-expand: 1
format:
  html:
    code-line-numbers: true
    code-fold: true
    code-tools: true
execute:
  echo: fenced
---
## Housekeeping
```{r include=F}
#| label: setup
knitr::opts_chunk$set(echo=T, highlight=T) # specifies default options for all chunks
options(scipen=100) # suppress scientific notation
# install pacman if needed
if (!require("pacman")) install.packages("pacman", repos = "http://lib.stat.cmu.edu/R/CRAN/")
# install and load required packages
pacman::p_load(pacman, tidyverse, gridExtra, magrittr,
kableExtra, tidyquant, highcharter, dygraphs)
p_loaded() # verify loaded packages
```
- Groups are assigned and Posit Cloud groups will be created this week.
- Project Proposal Rough Draft is due Thursday, 10/30 (No grace period)
- HW 5 - Part 1 is due Wed. 10/29 (2-day grace period)
- We will talk about HW 5 - Part 1 this week
##
### BUA 455 Dashboard Group Project
::: fragment
**Importance of Proposals:**
:::
- As a data analyst, you have skills that others don't.
- In this course, you are learning to acquire, combine, manage, and present data.
- In the Business Analytics major, you learn to analyze and interpret data.
- These skills are unique and useful to employers, colleagues, and clients, BUT you will spend your career explaining and selling your skill set.
- Most people will NOT know how your skills can help them.
- It is your role to help them understand:
- what questions can be answered with data
- what data will help them answer their questions and how to get it.
- how data management and analyses can help them and will be essential to their goals.
##
### How the Proposal Process Fits Within This Course
- The scope of this course is **limited to data acquisition, management, and presentation**.
- Much can be learned from combining data sources and exploring data prior to analysis.
- In other courses, you learn analytical methods that complement this course.
- The more methods and concepts you know, the better prepared you are to deal with any data you encounter.
- No matter how much you know, the majority of your time will be spent using the skills from this course.
- Depending on your work environment and position, you may assign time-consuming, repetitive data management tasks to others, BUT
- You should still direct the data management and presentation work because you will be held responsible.
- **File management, data management, data curation, and data presentation complement and augment analyses**.
##
### Project Proposal Rough Draft
- The rough draft does not need to be more than a brief bulleted outline.
- **What data does your group plan to work with?**
- Summarize data topic in 1-2 sentences.
- Do some data exploration and look for links to good data sources (AI can be helpful in search).
- Rough draft proposal should include ***links to specific data sources that might work.***
- **What questions do you think you want to answer?**
- Propose 3-4 questions (or more) you hope to address with your data dashboard.
- For each question propose which dataset and variables might help you address it.
## Data Sources
- The [course website lists some data sources](https://penelope2040.quarto.pub/bua-455-semester/#interesting-data){target="_blank"} but there are MANY more.
- Another starting place for inspiration is [**Statista**](https://login.libezproxy2.syr.edu/login?url=https://www.statista.com){target="_blank"}
- It is free if you login from the SU Library and is a great starting point.
- **Your project should go beyond the small curated Statista datasets** but it may provide inspiration.
- For this course, consider the topic choice to be an iterative process:
1. Develop a topic and potential questions (but don't become too attached).
2. Explore provided data sources, Google, and AI using your question(s).
3. Examine data that is somewhat related to your question.
4. Edit and refine topic and question based on data available.
## Data Source Challenges
- If you are interested in **sports** data or **billboard music** data and don't know how to scrape data, I can help.
<br>
- There are sports and music data available, but the websites often don't make the datasets downloadable.
<br>
- Instead, the data must be scraped as we did with the Box Office Mojo data.
<br>
- Projects that require scraping data are doable, but data management will take more time and **you should start ASAP**.
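::: fragment
If you are curious what scraping looks like in R, here is a minimal, untested sketch using the `rvest` package (not loaded in the setup chunk); the URL is a placeholder, and real pages usually need more careful element selection:

```{r rvest scraping sketch, eval=F}
# minimal scraping sketch -- the URL below is a placeholder, not a real data source
library(rvest)

pg  <- read_html("https://www.example.com/some-table-page") # placeholder URL
tbl <- pg |>
  html_element("table") |> # select the first HTML table on the page
  html_table()             # convert it to a tibble
tbl
```
:::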
## Data Sourcing Example
- I chose a VERY obscure topic to see how long it would take me to find a reasonable dataset.
- Initial topic: **Craft Traditions in Europe**
- I started with a prompt to Copilot.
- That led me to links for [**Eurostat**](https://ec.europa.eu/eurostat){target="_blank"} and the [**European Commission**](https://commission.europa.eu/index_en){target="_blank"}
- On the **European Commission** site I found the [**Joint Research Center**](https://commission.europa.eu/about-european-commission/departments-and-executive-agencies/joint-research-centre_en){target="_blank"}.
- A combination of website exploration and AI prompts led to:
- the [**Competence Centre on Composite Indicators and Scoreboards**](https://knowledge4policy.ec.europa.eu/composite-indicators_en){target="_blank"},
- the [**JRC Composite Indicators Site**](https://composite-indicators.jrc.ec.europa.eu/){target="_blank"},
- a site about [**cultural and creative European cities**](https://composite-indicators.jrc.ec.europa.eu/cultural-creative-cities-monitor){target="_blank"},
- and a link to [**Docs and Data**](https://composite-indicators.jrc.ec.europa.eu/cultural-creative-cities-monitor/docs-and-data){target="_blank"} with this download link:
- {width="50%"}
## Flexibility Leads to Interesting Data
- Raw Excel Data with interesting variables for over 500 European cities.
- The dataset is not directly related to the initial topic, but there are many variables and observations to build a multi-page dashboard.
- Going from initial obscure idea to interesting data took about 30 minutes.
- We (TAs and I) are happy to help you with this process. **Be flexible and curious.**
::: fragment
{width="65%;" fig-align="center"}
:::
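::: fragment
If your group ends up with a raw Excel file like this one, a minimal import sketch might look like the chunk below (assuming the `readxl` package is installed; the file path and sheet number are placeholders):

```{r readxl import sketch, eval=F}
# import an Excel file -- the path and sheet below are placeholders
library(readxl)
ccc <- read_excel("data/cultural-creative-cities.xlsx", sheet = 1) # placeholder path
glimpse(ccc)
```
:::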
## Final Proposal
- **Due Tue. 11/11 - 48 Hour Grace Period**
- The final proposal draft should be 1-3 pages at most (double spaced, 12 pt. font).
- **Bullet points are preferred to paragraphs**
- Although your client(s) are hypothetical, it helps to envision your target client.
- The goal of the proposal is to
- explain in non-technical language why your dashboard will be useful.
- convince the prospective client to trust you with their data.
- How will the client use the information provided in the dashboard to
- improve their marketing plan or their purchasing and staffing plan?
- improve their overall investment strategy or their supply chain strategy?
## More about Proposal Final Draft
- **Due Tue. 11/11 - 48 Hour Grace Period**
- The proposal also provides a roadmap for the work you plan to do so that all group members are informed.
- FINAL Proposal should **ALSO** include:
- a preliminary descriptive list of your planned pages.
- a tentative plan for what data will be presented on each page, and how it will be presented (which plots or tables).
- **a list of which team member is primarily responsible for each page**.
## Proposals are **TENTATIVE** Agreements
- Understandably, the work is not yet done, so your plans may change, which is okay.
- Your final plots and tables are not done, but it is good to have a plan for what they will be.
- If possible, show explicit links between each question or goal you plan to address and the proposed component of the dashboard that addresses it.
- In a 'real world' situation:
- It is better not to provide final proprietary work products at the proposal stage.
- Ideas and plots should be polished enough to entice the client.
- A good proposal shows that you have creative ideas and have put some thought into how you can help the client.
- In real life, proposals and client goals are more concrete, but in this course, I encourage you to experiment.
- ***Again, it is okay if your plans change based on what you discover in the data.***
## Introduction to HW 5 - Part 1 Data
::: fragment
**Today: Examine and clean data**
:::
- First steps are to examine the .csv file and then to import and examine the data in R.
- Look for quirks that might be a challenge
- Hint: Examine `date_added` column in the .csv file or in R
::: fragment
```{r import netflix data and examine}
# import data and examine
nflx <- read_csv("data/netflix_titles.csv", show_col_types=F) |> glimpse()
```
:::
##
### Week 9 In-class Exercises - Q1-Q3
[***Poll Everywhere***](https://pollev.com/penelopepoolereisenbies685){target="_blank"} - My User Name: **penelopepoolereisenbies685**
- **Q1.** How many different date formats do you see in the date_added column in the `.csv` file?
- **Q2.** How many different date formats do you see in the date_added column in the imported dataset in the `Global Environment`?
- **Q3.** How many different commands are required to convert the date_added column to a date variable?
## Cleaning These Data
New command for dealing with text data: **`separate`**
Review of `lubridate` commands
- Next steps:
- Select variables
- Use `separate` to split `listed_in` into 3 columns (Why?)
- Convert `date_added` to a date variable and then create a year variable
::: fragment
```{r select and do prelim variable cleaning}
nflx <- nflx |>
select(show_id, type, date_added, release_year, rating, listed_in) |> # useful variables (subjective)
separate(listed_in, sep=", ", into=c("g1", "g2", "g3")) |> # separate out genre listings (notice ", ")
# will result in lots of NAs
mutate(date_added = mdy(date_added), # convert date_added to a date var
year_added = year(date_added)) # create a separate year var
head(nflx,3)
```
:::
## Week 9 In-class Exercises - Q4
[***Poll Everywhere***](https://pollev.com/penelopepoolereisenbies685){target="_blank"} - My User Name: **penelopepoolereisenbies685**
`separate` is the reverse of an R command we have already learned.
- What command is that?
## Data Exploration and Additional Cleaning
::: fragment
**Finding and Removing Data with Missing Values (Depends on Subject Matter)**
:::
- Are there observations for which `year_added` is missing (`NA`)?
- **Cool trick:** `is.na(nflx$year_added)` will create a vector of `FALSE` and `TRUE` values.
- `FALSE` unless the observation value is `NA`.
- R interprets `FALSE` as 0 and `TRUE` as 1 so you can sum this vector to find number of missing values.
::: fragment
```{r remove observations for which year_added is missing}
sum(is.na(nflx$year_added)) # number of missing values (there are 10)
table(is.na(nflx$year_added)) # could also summarize with table
nflx <- nflx |> filter(!is.na(year_added)) # remove NA's and check again
sum(is.na(nflx$year_added))
```
:::
## Data Exploration and Additional Cleaning
- Deleting data is only valid when it is done for a specific, transparent reason.
- For example: Are there observations for which rating is `NR` (not rated) or `UR` (unrated)? Are there other issues?
- In this case, we delete observations because we are restricting our data to more common rating categories.
- The remaining categories can be grouped logically.
::: fragment
```{r remove observations with nr or ur rating}
table(nflx$rating) # examine categories
nflx <- nflx |> filter(!rating %in% c("NR", "UR", "66 min", "74 min", "84 min"))
table(nflx$rating) # examine again after filter
```
:::
## When can you delete data?
**If you find raw uncleaned data on the internet or are given data by a client or co-worker, is it okay to delete observations?**
Yes, under certain circumstances:
1. **You must be completely transparent about how you are limiting the data and why.**
- For example, I am limiting these data to specific time periods and specific genres.
- Other time and genre categories have too little data.
2. **You should never delete outliers or unusual observations unless you can confirm they are errors or are not part of the data you are studying.**
- I am simplifying the data by removing categories that are not of primary interest.
- Someone else might choose different categories and limit the data differently.
- Always be transparent about data sources and all decisions you make.
- I am also deleting three observations for which the ratings were missing.
##
### Modifying text or numeric variables with `ifelse`
- Numeric or text data can be converted to a categorical variable with a set of explicit `ifelse` statements.
- `ifelse` is not the ONLY way to categorize data, but it's my tool of choice.
- The input values for `ifelse` are ALWAYS the same:
- Input 1: a test where result is `TRUE` or `FALSE`.
- Input 2: output value if the test result is `TRUE`.
- Input 3: output value if the test result is `FALSE`.
::: fragment
```{r simple example of ifelse, eval = F}
x <- seq(9,90,9) # create var x that is a sequence of values
(d <- tibble(x)) # convert x to a tibble (dataset)
(d <- d |> mutate(y = ifelse(x < 35, "Low", NA))) # add variable y to dataset d
(d <- d |> mutate(y = ifelse(x >= 35 & x < 65, "Medium", y)))
(d <- d |> mutate(y = ifelse(x >= 65, "High", y)))
```
:::
##
### Modifying text with multiple `ifelse` statements
- The previous example can be (and should be) condensed into one mutate statement as shown here.
::: fragment
```{r above example in one mutate command}
# create a dataset with 1 variable
x <- seq(9,90,9)
d <- tibble(x)
d <- d |> mutate(y = ifelse(x < 35, "Low", NA),
y = ifelse(x >= 35 & x < 65, "Medium", y),
y = ifelse(x >= 65, "High", y),
yF = factor(y, levels = c("Low", "Medium", "High"))) |>
# levels option used to specify correct order
glimpse()
```
:::
## Week 9 In-class Exercise
- We will examine the Netflix ratings data again and simplify the categories to reflect minimum appropriate ages.
- In this case, I have done some research on what these ratings mean.
- In your projects, you will be expected to understand the values in your variables
- You are likely to have to simplify your data categories for your data management goals.
- My proposed solution for these data is to categorize both TV shows and Movies into four minimum age categories:
- Minimum Age Categories: 0, 7, 13, 17
- Minimum Age 0: "G", "TV-G", "TV-Y"
- Minimum Age 7: "PG", "TV-PG", "TV-Y7", "TV-Y7-FV"
- Minimum Age 13: "PG-13", "TV-14"
- Minimum Age 17: "NC-17", "TV-MA", "R"
## What we will do Today (Thu. 10/23):
- Together we will fill in the blanks in the R chunk below
- You will submit the entire chunk (with your added comments) on Blackboard in a text file.
- Copy it into an empty text file and save it with your name.
- This counts towards class participation for Thursday, 10/23 and is due by Friday, 10/24, at midnight.
```{r using ifelse to create min_age variable for prof, echo=F, eval=T, include=F}
#table(nflx$rating)
nflx <- nflx |>
mutate(min_age = ifelse(rating %in% c("G", "TV-G", "TV-Y"), 0, NA),
min_age = ifelse(rating %in% c("PG", "TV-PG", "TV-Y7", "TV-Y7-FV"), 7, min_age),
min_age = ifelse(rating %in% c("PG-13", "TV-14"), 13, min_age),
min_age = ifelse(rating %in% c("NC-17", "TV-MA", "R"), 17, min_age))
#table(nflx$min_age)
```
::: fragment
```{r using ifelse to create min_age variable}
table(nflx$rating)
# nflx <- nflx |>
# mutate(min_age = ifelse(rating %in% c("G", "TV-G", "TV-Y"), 0, ____),
# min_age = ifelse(rating %in% c("PG", "TV-PG", "TV-Y7", "TV-Y7-FV"), 7, _____),
# min_age = ifelse(rating %in% c("PG-13", "TV-14"), ____, ____),
# min_age = ifelse())
# table(nflx$min_age)
```
:::
## Another `ifelse` Example
- We can use a similar technique to categorize when shows or movies were first shown (released).
- This code is very similar to our simple numerical example shown previously.
- In HW 5 - Part 1 (**Now Available**) you will:
- Filter data to remove releases from before 1981.
::: fragment
```{r categorizing release data into periods}
#table(nflx$release_year)
nflx <- nflx |>
mutate(release_period = ifelse(release_year <= 1980, "1925-1980", NA),
release_period = ifelse(release_year > 1980 & release_year <= 2000, "1981-2000", release_period),
release_period = ifelse(release_year > 2000 & release_year <= 2005, "2001-2005", release_period),
release_period = ifelse(release_year > 2005 & release_year <= 2010, "2006-2010", release_period),
release_period = ifelse(release_year > 2010 & release_year <= 2015, "2011-2015", release_period),
release_period = ifelse(release_year > 2015, "2016-2021", release_period))
table(nflx$release_period)
```
:::
## Cleaning Text Variables
- The code that follows **WILL NOT RUN** until the `min_age` and `release_period` variables are created (previous chunks).
- The next steps are tedious, but important, and will be different for every dataset you encounter.
- On a quiz you should know how to use these text commands (a short refresher sketch follows this list):
- `gsub`, `str_trim`
- `separate`, `paste`, (`unite`)
- `tolower`, `toupper`, (`str_to_lower`, `str_to_upper`)
- `ifelse` (covered in previous slides)
<br>
- Not required but useful: use `?str_trim` to also see `str_squish` and `str_pad`
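::: fragment
Here is a minimal refresher sketch of these commands applied to a small made-up tibble (the strings below are illustrative and are not from the Netflix data):

```{r text command refresher}
# small made-up example for practicing the text commands listed above
txt <- tibble(raw = c(" Action & Adventure", "TV Dramas ", "kids' tv"))

txt |>
  mutate(clean = str_trim(raw),                # remove leading/trailing spaces
         clean = tolower(clean),               # convert to lowercase
         clean = gsub(" & ", "_", clean)) |>   # replace " & " with "_"
  separate(clean, sep = " ", into = c("w1", "w2"), fill = "right") |> # split on spaces
  unite("genre", w1, w2, sep = "-", na.rm = TRUE) |> # unite is the reverse of separate
  mutate(label = paste("genre:", genre))       # paste combines text into one string
```
:::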
##
### Week 9 In-class Exercises - Q5
[***Poll Everywhere***](https://pollev.com/penelopepoolereisenbies685){target="_blank"} - My User Name: **penelopepoolereisenbies685**
- `str_trim` is used in an upcoming chunk to trim spaces before characters in some observations.
::: fragment
**What is another, more general command that can be used to replace spaces in character strings with nothing, e.g., replace `" movies"` with `"movies"`?**
:::
##
### Week 9 In-class Exercises - Q6
[***Poll Everywhere***](https://pollev.com/penelopepoolereisenbies685){target="_blank"} - My User Name: **penelopepoolereisenbies685**
- Type `?tolower` into the console to view the help file for this command and for `toupper`.
- If the tidyverse package is loaded, you can type `?str_to_title` in the console to view the help file for this command, and others.
- Copy and paste the following lines into the console. The third line requires the tidyverse package.
::: fragment
`tolower("WhAT is the Result of this Command?")`
`toupper("WhAT is the Result of this Command?")`
`str_to_title("WhAT is the Result of this Command?")`
:::
::: fragment
**What is the base R command (not tidyverse) to convert a text string to all lowercase, which makes data management easier?**
:::
## Multi-step Process to clean genre
- Reshape data with `pivot_longer` and convert all text to lowercase with `tolower`
- Examine in console using this command: `cbind(table(nflx$genre))`
::: fragment
```{r cleaning up genre}
nflx <- nflx |>
select(show_id, type, year_added, release_period, min_age, g1, g2, g3) |>
pivot_longer(cols=c("g1","g2","g3"), names_to="g", values_to="genre")
# g is a place holder
nflx <- nflx |> select(!g) |> mutate(genre=tolower(genre)) |> glimpse()
```
:::
## Multi-step Process to clean genre
:::::: columns
::: {.column width="48%"}
- Recall that when `separate` was used above, the delimiter was specified as `", "`
- That removed leading spaces.
- Alternatively, you can remove leading spaces with `str_trim` (see the sketch below the image).
- Examine in console using this command: `cbind(table(nflx$genre))`
:::
::: {.column width="4%"}
:::
::: {.column width="48%"}
{fig-align="center"}
:::
::::::
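::: fragment
As mentioned in the bullets above, here is a small sketch of the `str_trim` alternative, using made-up strings rather than the `nflx` data:

```{r str_trim demo}
# made-up strings with leading spaces, for illustration only
g <- c(" international movies", "dramas", " comedies")
str_trim(g, side = "left") # trim leading spaces only
str_trim(g)                # default side = "both" trims both ends
```
:::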
## Remove Unnecessary Text
::::::: columns
:::: {.column width="48%"}
- This text is unnecessary and redundant because of the `type` variable in the data.
- Examine in console using this command: `cbind(table(nflx$genre))`
::: fragment
```{r remove movies tv shows and tv text}
# remove the text `movies` and `tv shows` and `tv` from genre names
nflx <- nflx |>
mutate(genre = gsub(" movies", "", genre),
genre = gsub("movies", "", genre),
genre = gsub(" tv shows", "", genre),
genre = gsub("tv shows", "", genre),
genre = gsub("tv ", "", genre),
genre = gsub(" tv", "", genre))
```
:::
::::
::: {.column width="4%"}
:::
::: {.column width="48%"}
{fig-align="center"}
:::
:::::::
##
### Consolidate `genre` Categories using `ifelse`
- Below I am only editing and consolidating categories I plan to use.
- Examine in console using this command: `cbind(table(nflx$genre))`
::: fragment
```{r consolidate and edit genres of interest}
nflx <- nflx |>
mutate(genre = ifelse(genre == "drama", "dramas", genre),
genre = ifelse(genre %in% c("stand-up comedy",
"stand-up comedy & talk shows"),
"comedies", genre),
genre = ifelse(genre %in% c("documentaries", "docuseries"),
"docs", genre),
genre = ifelse(genre %in% c("children & family", "kids'"),
"kids", genre),
genre = gsub("action & adventure", "action_adventr", genre))
```
:::
## Filter Data to Genres of Interest
::::::: columns
:::: {.column width="60%"}
::: fragment
```{r filter data to six genres we have consolidated}
nflx <- nflx |>
filter(genre %in% c("action_adventr", "comedies", "docs",
"dramas", "international", "kids"))
```
:::
- Examine results in console with these commands
- `cbind(table(nflx$genre))`
- `table(nflx$type, nflx$genre)`
::::
::: {.column width="4%"}
:::
::: {.column width="36%"}
{fig-align="center"}
:::
:::::::
## Create Final Summary Dataset
In the HW 5 - Part 1 Example Dashboard, I use the `nflx_tv` dataset, the `TV Show` observations from this dataset.
In HW 5 - Part 1, students will create and use the `nflx_mv` dataset, the `Movie` observations from this dataset.
```{r create final summary dataset}
nflx_wide <- nflx |>
group_by(type, release_period, year_added, min_age, genre) |>
summarize(n = n()) |> # n() tallies number of obs
pivot_wider(id_cols = c(type, release_period, year_added, min_age),
names_from = genre, # creates a column for each genre
values_from = n)
nflx_wide[is.na(nflx_wide)] <- 0 # why should NA's be 0 here?
nflx_tv <- nflx_wide |> # example dashboard made from nflx tv
filter(type == "TV Show")
# nflx_mv <-
```
##
### Optional Group Meetings with Me
:::::: columns
::: {.column width="48%"}
- After break I will post a sign-up schedule
- Meetings will be 5-10 minutes on Zoom
- After reading your submitted pre-proposal I will:
- Ask your group questions
- Provide feedback to guide you towards good data and a good plan
- Answer questions you might have before you write your final proposal.
:::
::: {.column width="4%"}
:::
::: {.column width="48%"}
{fig-align="center"}
:::
::::::
##
### Key Points from This Week
- Group Dashboard Project and Proposals:
- Rough Draft due Thu. 10/30 at 6:00 PM
- Optional proposal group meetings will take place on 11/6 and 11/7
- Final Proposal due Tue. 11/11 at 6:00 PM
- Data Exploration and Cleaning
- Iterative, subjective process
- Always be transparent and document choices
- Introduction to HW 5 - Part 1 Data and Assignment
- Due Wed. 10/29 at midnight
- Quiz 2 on Thu. 11/6
::: fragment
You may submit an 'Engagement Question' about each lecture until midnight on the day of the lecture. **A minimum of four submissions is required during the semester.**
:::