Week 8

Dashboard Project Proposal & Data Cleaning

Author

Penelope Pooler Eisenbies

Published

March 3, 2025

Housekeeping

Groups will be assigned by Thursday, 3/6.
Project Proposal Rough Draft is due Thursday, 3/27
HW 5 - Part 1 is due Wed. 3/26
- We will talk about HW 5 - Part 1 on Thursday 3/6.

BUA 455 Dashboard Group Project

Importance of Proposals:

As a data analyst, you have skills that others don’t.
In this course, you are learning to acquire, combine, manage, and present data.
In the Business Analytics major, you learn to analyze and interpret data.
These skills are unique and are useful to employers, colleagues, clients BUT you will spend your career explaining and selling and explaining your skillset.
Most people will NOT know how your skills can help them.
It is your role to help them understand:
- what questions can be answered with data
- what data will help them answer their questions and how to get it.
- How data management and analyses can help them and will be essential to their goals.

How the Proposal Process fits Wthin this course

The scope of this course is limited to data acquisition, management, and presentation.
There is much that can be learned from combining data sources and exploring data prior to analysis.
In other courses, you are learning analytical methods that complement this course.
The more methods and concepts you know, the better prepared you are to deal with any data you encounter.
No matter how much you know, the majority of you time will be spent using the skills from this course.
Depending on your work environment and position, others may be assigned time consuming repetitive data management tasks by you BUT
- You should still direct the data management and presentation work because you will be held responsible.
File management, data management, data curation, and data presentation complement and augment analyses.

Project Proposal Rough Draft

The rough draft does not need to be more than a brief bullet-ed outline.
What data does your group plan to work with?
- Summarize data topic in 1-2 sentences.
- Do some data exploration and look for links to good data sources (AI may be helpful in search).
- Rough draft proposal should include links to specific data sources that might work.
What questions do you think you want to answer?
- Propose 2-4 questions (or more) you hope to address with your data dashboard.
- For each question propose which dataset and variables might help you convey an answer.

Data Sources

The course website lists some data sources and I will add to this list over Spring Break.
Another starting place for inspiration is Statista
- It is free if you login from the SU Library and is a great starting point.
- Your project should go beyond the small curated Statista datasets but it may provide inspiration.
For this course, consider the topic choice to be a iterative process:
1. Develop a topic and potential questions (but don’t become too attached).
2. Explore provided data sources, Google, and AI using your question(s).
3. Examine data that is somewhat related to question.
4. Edit and refine topic and question based on data available.

Data Source Challenges

If you are interested in sports data or billboard music data and don’t know how to scrape data, I can help.

There are sports and music data available, but the websites often don’t make the datasets downloadable.

Instead, the data must be scraped as we did with the Box Office Mojo data.

Projects that require scraping data are doable, but data management will take more time and you should start ASAP.

Data Sourcing Example

I chose a VERY obscure topic to see how long it would take me to find a reasonable dataset.
Initial topic: Craft Traditions in Europe
I started with posing a question to CoPilot.
That led me to links for Eurostat and the European Commission
One the European Commission site I found the Joint Research Center.
A combination of website exploration and AI questions led to
Competence Centre on Composite Indicators and Scoreboards
- JRC Composite Indicators Site and then led to
- and a site about cultural and creative European cities
- and a link to Docs and Data and this download link:

Flexibility Leads to Interesting Data

Raw Excel Data with interesting variables for over 500 European cities.
Dataset is not directly related to initial topic but there are data to build a multi-page dashboard.
Going from initial obscure idea to interesting data took about 30 minutes.
We (TAs and I) are happy to help you with this process. Be flexible and curious.

Final Proposal

Due Tue. 4/8 - 48 Hour Grace Period
The final proposal draft should be 1-3 pages at most (double spaced, 12 pt. font).
- Bullet points are preferred to paragraphs
Although your client(s) are hypothetical, it helps to envision your target client.
The goal of the proposal is to
- explain in non-technical language why your dashboard will be useful.
- convince the prospective client to trust you with their data.
How will client use the information provided in the dashboard to
- improve their marketing plan or their purchasing and staffing plan?
- improve their overall investment strategy or their supply chain strategy?

More about Proposal Final Draft

Due Tue. 4/8 - 48 Hour Grace Period
The proposal also provides a roadmap for the work you plan to do so that all group members are informed.
- List tentative goals that you hope to achieve and ideas for how to achieve them.
FINAL Proposal should ALSO include:
- a preliminary descriptive list of your planned pages.
- a tentative plan for what data will be presented on each page, and how will it be presented (which plots or tables).
- a list of which team member is primarily responsible for each page.

Proposals are TENTATIVE Agreements

Understandably, the work is not yet done, so your plans may change, which is okay.
Your final plots and tables are not done, but it is good to have a plan for what they will be.
If possible show explicit links between each question or goal you plan to address and the proposed component of the dashboard that addresses it.
In a ‘real world’ situation:
- It is better not to provide final proprietary work products at proposal stage.
- Ideas and plots should be polished enough to entice client.
A good proposal shows that you have creative ideas and have put some thought into how you can help the client.
In real life, proposals and client goals are more concrete, but in this course, I encourage you to experiment.
Again, it is okay if your plans change based on what you discover in the data.

Introduction to HW 5 - Part 1 Data

Today: Examine and clean data

First steps are to examine .csv file and to import and examine data in R.
Look for quirks that might be a challenge
- Hint: Examine date_added column in the .csv file or in R

Code

```{r import netflix data and examine}
# import data and examine
nflx <- read_csv("data/netflix_titles.csv", show_col_types=F) |> glimpse()
```

Rows: 8,807
Columns: 12
$ show_id      <chr> "s1", "s2", "s3", "s4", "s5", "s6", "s7", "s8", "s9", "s1…
$ type         <chr> "Movie", "TV Show", "TV Show", "TV Show", "TV Show", "TV …
$ title        <chr> "Dick Johnson Is Dead", "Blood & Water", "Ganglands", "Ja…
$ director     <chr> "Kirsten Johnson", NA, "Julien Leclercq", NA, NA, "Mike F…
$ cast         <chr> NA, "Ama Qamata, Khosi Ngema, Gail Mabalane, Thabang Mola…
$ country      <chr> "United States", "South Africa", NA, NA, "India", NA, NA,…
$ date_added   <chr> "September 25, 2021", "September 24, 2021", "September 24…
$ release_year <dbl> 2020, 2021, 2021, 2021, 2021, 2021, 2021, 1993, 2021, 202…
$ rating       <chr> "PG-13", "TV-MA", "TV-MA", "TV-MA", "TV-MA", "TV-MA", "PG…
$ duration     <chr> "90 min", "2 Seasons", "1 Season", "1 Season", "2 Seasons…
$ listed_in    <chr> "Documentaries", "International TV Shows, TV Dramas, TV M…
$ description  <chr> "As her father nears the end of his life, filmmaker Kirst…

Week 8 In-class Exercises - Q1-Q3

Session ID: bua455s25

Q1. How many different date formats do you see in the date_added column in the .csv file?
Q2. How many different date formats do you see in the date_added column in the imported dataset in the Global Environment?
Q3. How many different commands are required to convert the date_added column to a date variable?

Cleaning These Data

New command for dealing with text data: separate

Review of lubridate commands

Next steps:
- Select variables
- use separate to split listed_in into 3 columns (Why?)
- Convert date_added to a date variable and then create a year variable

Code

```{r select and do prelim variable cleaning}
nflx <- nflx |>
  select(show_id, type, date_added, release_year, rating, listed_in) |> # useful variables (subjective)

  separate(listed_in, sep=", ", into=c("g1", "g2", "g3")) |>        # separate out genre listings (notice ", ")
                                                                    # will result in lots of NAs
  mutate(date_added = mdy(date_added),                              # convert data_added to a date var
         year_added = year(date_added))                             # create a separate year var
```

Warning: Expected 3 pieces. Missing pieces filled with `NA` in 5078 rows [1, 4, 7, 9,
10, 13, 14, 16, 17, 19, 23, 24, 28, 29, 30, 32, 35, 38, 39, 40, ...].

Code

```{r select and do prelim variable cleaning}
head(nflx,3)
```

# A tibble: 3 × 9
  show_id type    date_added release_year rating g1       g2    g3    year_added
  <chr>   <chr>   <date>            <dbl> <chr>  <chr>    <chr> <chr>      <dbl>
1 s1      Movie   2021-09-25         2020 PG-13  Documen… <NA>  <NA>        2021
2 s2      TV Show 2021-09-24         2021 TV-MA  Interna… TV D… TV M…       2021
3 s3      TV Show 2021-09-24         2021 TV-MA  Crime T… Inte… TV A…       2021

Week 8 In-class Exercises - Q4

Session ID: bua455s25

separate is the reverse of an R command we have already learned.

What command is that?

Data Exploration and Additional Cleaning

Finding and Removing Data with Missing Values (Depends on Subject Matter)

Are there observations for which year_added is missing (NA)?
- Cool trick: is.na(nflx$year_added) will create a vector of FALSE and TRUE values.
- FALSE unless the observation value is NA.
- R interprets FALSE as 0 and TRUE as 1 so you can sum this vector to find number of missing values.

Code

```{r remove observations for which year_added is missing}
sum(is.na(nflx$year_added))                # number of missing values (there are 10)
table(is.na(nflx$year_added))              # could also summarize with table
nflx <- nflx |> filter(!is.na(year_added)) # remove NA's and check again
sum(is.na(nflx$year_added))
```

[1] 10

FALSE  TRUE 
 8797    10 
[1] 0

Data Exploration and Additional Cleaning

It is not always valid to delete data unless it is for a specific valid transparent reason.
For example: Are there observations for which rating is NR (not-rated) or UR (un-rated)? Are there other issues?
In this case we delete observation because we are restricting our data to more common rating categories.
Remainder of the categories can be grouped logically.

Code

```{r remove observations with nr or ur rating}
table(nflx$rating) # examine categories
nflx <- nflx |> filter(!rating %in% c("NR", "UR", "66 min", "74 min", "84 min"))
table(nflx$rating) # examine again after filter
```


  66 min   74 min   84 min        G    NC-17       NR       PG    PG-13 
       1        1        1       41        3       79      287      490 
       R    TV-14     TV-G    TV-MA    TV-PG     TV-Y    TV-Y7 TV-Y7-FV 
     799     2157      220     3205      861      306      333        6 
      UR 
       3 

       G    NC-17       PG    PG-13        R    TV-14     TV-G    TV-MA 
      41        3      287      490      799     2157      220     3205 
   TV-PG     TV-Y    TV-Y7 TV-Y7-FV 
     861      306      333        6

When can you delete data?

If you find raw uncleaned data on the internet or are given data by a client or co-worker, is it okay to delete observations?

Yes, under certain circumstances:

You must be completely transparent about how you are limiting the data and why.

For example, I am limiting these data to specific time periods and specific genres.
Other time and genre categories have too little data.

You should never delete outliers or unusual observations unless you can confirm they are errors or are not part of the data you are studying.

I am simplifying the data by removing categories that are not of primary interest.
Someone else might choose different categories and limit the data differently.
Always be transparent about data sources and all decisions you make.
I am also deleting three observations for which the ratings were missing.

Modifying text or numeric variables with `ifelse`

Numeric or text data can be converted to a categorical variable with a set of explicit ifelse statements.
ifelse is not the ONLY way to categorize data, but it’s my tool of choice.
The input values for ifelse are ALWAYS the same:
- Input 1: a test where result is TRUE or FALSE.
- Input 2: output value if the test result is TRUE.
- Input 3: output value if the test result is FALSE.

Code

```{r simple example of ifelse, eval = F}
x <- seq(9,90,9)    # create var x that is a sequence of values
(d <- tibble(x))    # convert x to a tibble (dataset) 
(d <- d |> mutate(y = ifelse(x < 35, "Low", NA))) # add variable y to dataset d
(d <- d |> mutate(y = ifelse(x >= 35 & x < 65, "Medium", y))) 
(d <- d |> mutate(y = ifelse(x >= 65, "High", y)))           
```

Modifying text multiple `ifelse` statements

The previous example can be (and should be) condensed into one mutate statement as shown here.

Code

```{r above example in one mutate command}
# create a dataset with 1 variable
x <- seq(9,90,9)
d <- tibble(x)
d <- d |> mutate(y = ifelse(x < 35, "Low", NA),                         
                 y = ifelse(x >= 35 & x < 65, "Medium", y),
                 y = ifelse(x >= 65, "High", y),
                 yF = factor(y, levels = c("Low", "Medium", "High"))) |> 
                              # levels option used to specify correct order
glimpse()
```

Rows: 10
Columns: 3
$ x  <dbl> 9, 18, 27, 36, 45, 54, 63, 72, 81, 90
$ y  <chr> "Low", "Low", "Low", "Medium", "Medium", "Medium", "Medium", "High"…
$ yF <fct> Low, Low, Low, Medium, Medium, Medium, Medium, High, High, High

Week 8 In-class Exercise

We will examine the Netflix ratings data again and simplify the categories to reflect minimum appropriate ages.
In this case, I have done some research on what these ratings mean.
In your projects, you will be expected to understand the values in your variables
- You are likely to have to simplify your data categories for your data management goals.
My proposed solution for these data is to categorize both TV shows and Movies into four minimum age categories:
- Minimum Age Categories: 0, 7, 13, 17
- Minimum Age 0: “G”, “TV-G”, “TV-Y”
- Minimum Age 7: “PG”, “TV-PG”, “TV-Y7”, “TV-Y7-FV”
- Minimum Age 13: “PG-13”, “TV-14”
- Minimum Age 17: “NC-17”, “TV-MA”, “R”

What we will do Today (Thu. 3/6):

Together we will fill in the blanks in the R chunk below
- You will submit the entire chunk (with your added comments) on Blackboard in a text file.
- Copy it into empty an text file and save it with your name.
- This counts towards class participation for Thursday, 3/6 and is due by Friday, 3/7, at midnight.

Code

```{r using ifelse to create min_age variable}
table(nflx$rating)

# nflx <- nflx |>
#   mutate(min_age = ifelse(rating %in% c("G", "TV-G", "TV-Y"), 0, ____),
#          min_age = ifelse(rating %in% c("PG", "TV-PG", "TV-Y7", "TV-Y7-FV"), 7, _____),
#          min_age = ifelse(rating %in% c("PG-13", "TV-14"), ____, ____),
#          min_age = ifelse())

# table(nflx$min_age)
```


       G    NC-17       PG    PG-13        R    TV-14     TV-G    TV-MA 
      41        3      287      490      799     2157      220     3205 
   TV-PG     TV-Y    TV-Y7 TV-Y7-FV 
     861      306      333        6

Another `ifelse` Example

We can use a similar technique to categorize when shows or movies were first shown (released).
This code is very similar to our simple numerical example shown previously.
In HW 5 - Part 1 (Now Available) you will:
- Filter data to remove releases from before 1981.

Code

```{r categorizing release data into periods}
#table(nflx$release_year)

nflx <- nflx |>
  mutate(release_period = ifelse(release_year <= 1980, "1925-1980", NA),
         release_period = ifelse(release_year > 1980 & release_year <= 2000, "1981-2000", release_period),
         release_period = ifelse(release_year > 2000 & release_year <= 2005, "2001-2005", release_period),
         release_period = ifelse(release_year > 2005 & release_year <= 2010, "2006-2010", release_period),
         release_period = ifelse(release_year > 2010 & release_year <= 2015, "2011-2015", release_period),
         release_period = ifelse(release_year > 2015, "2016-2021", release_period))

table(nflx$release_period)
```


1925-1980 1981-2000 2001-2005 2006-2010 2011-2015 2016-2021 
      130       424       297       655      1577      5629

Cleaning Text Variables

The code that follows WILL NOT RUN until the min_age and release_period variables are created (previous chunks).
The next steps are tedious, but important, and will be different for every dataset you encounter.
On a quiz you should know how to use these text commands:
- gsub, str_trim,
- separate, paste,
- tolower, toupper,
- ifelse (covered in previous slides)
- Not required but useful: use ?str_trim to also see str_squish and str_pad

Week 8 In-class Exercises - Q5-Q6

Session ID: bua455s25

str_trim is used in an upcoming chunk to trim spaces before characters in some observations.
Q5. Can gsub also be used for this purpose?

Type ?tolower into the console to bring up the help file for this command and toupper.
Copy and paste these lines into the console:

tolower("WhAT is the Result of this Command?")

toupper("WhAT is the Result of this Command?")

Answer Q6.

Multi-step Process to clean genre

Reshape data with pivot_longer and convert all text to lower case with to_lower
Examine in console using this command: cbind(table(nflx$genre))

Code

```{r cleaning up genre}
nflx <- nflx |>
  select(show_id, type, year_added, release_period, min_age, g1, g2, g3) |>
  pivot_longer(cols=c("g1","g2","g3"), names_to="g", values_to="genre") 
                                                 # g is a place holder
nflx <- nflx |> select(!g) |> mutate(genre=tolower(genre)) |> glimpse()
```

Rows: 26,136
Columns: 6
$ show_id        <chr> "s1", "s1", "s1", "s2", "s2", "s2", "s3", "s3", "s3", "…
$ type           <chr> "Movie", "Movie", "Movie", "TV Show", "TV Show", "TV Sh…
$ year_added     <dbl> 2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021, 2…
$ release_period <chr> "2016-2021", "2016-2021", "2016-2021", "2016-2021", "20…
$ min_age        <dbl> 13, 13, 13, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17,…
$ genre          <chr> "documentaries", NA, NA, "international tv shows", "tv …

Multi-step Process to clean genre

Recall when separate was used above, delimiter was specified as “,”
- That removed leading spaces.
- Alternatively, you emove leading spaces with str_trim
Examine in console using this command: cbind(table(nflx$genre))

Remove Unnecessary Text

This text is unnecessary and redundant because of type variable in data.
Examine in console using this command: cbind(table(nflx$genre))

Code

```{r remove movies tv shows and tv text}
# remove the text `movies` and `tv shows` and `tv` from genre names
nflx <- nflx |>
  mutate(genre = gsub(" movies", "", genre),
         genre = gsub("movies", "", genre),
         genre = gsub(" tv shows", "", genre),
         genre = gsub("tv shows", "", genre),
         genre = gsub("tv ", "", genre),
         genre = gsub(" tv", "", genre))
```

Consolidate `genre` Categories using `ifelse`

Below I am only editing and consolidating categories I plan to use.
Examine in console using this command: cbind(table(nflx$genre))

Code

```{r consolidate and edit genres of interest}
nflx <- nflx |>
  mutate(genre = ifelse(genre == "drama", "dramas", genre),
         
         genre = ifelse(genre %in% c("stand-up comedy", 
                                     "stand-up comedy & talk shows"), 
                        "comedies", genre),

         genre = ifelse(genre %in% c("documentaries", "docuseries"), 
                        "docs", genre),
         
         genre = ifelse(genre %in% c("children & family", "kids'"), 
                        "kids", genre),

         genre = gsub("action & adventure", "action_adventr", genre))
```

Filter Data to Genres of Interest

Code

```{r filter data to six genres we have consolidated}
nflx <- nflx |>
  filter(genre %in% c("action_adventr", "comedies", "docs", 
                      "dramas", "international", "kids"))
```

Examine results in console with these commands
- cbind(table(nflx$genre))
- table(nflx$type, nflx$genre)

Create Final Summary Dataset

In the HW 5 - Part 1 Example Dashboard, I use the nflx_tv dataset, TV Show observations from this dataset.

In HW 5 - Part 1, Students will create use the nflx_mv dataset, the Movie observations from this dataset.

Code

```{r create final summary dataset}
nflx_wide <- nflx |>                            
  group_by(type, release_period, year_added, min_age, genre) |>
  summarize(n = n()) |>                   # n() tallies number of obs       
  pivot_wider(id_cols = c(type, release_period, year_added, min_age), # col. for each genre
              names_from = genre, 
              values_from = n)
```

`summarise()` has grouped output by 'type', 'release_period', 'year_added',
'min_age'. You can override using the `.groups` argument.

Code

```{r create final summary dataset}
nflx_wide[is.na(nflx_wide)] <- 0                              # why should NA's be 0 here?


nflx_tv <- nflx_wide |>                                       # example dashboard  made from nflx tv
  filter(type == "TV Show") 

# nflx_mv <- 
```

Optional Group Meetings with Me

After break I will post a sign-up schedule
Meetings will be 5-10 minutes on Zoom
After reading your submitted pre-proposal I will:
- Ask your group questions
- Provide feedback to guide you towards good data and a good plan
- Answer question you might have before you write your final proposal.

Key Points from This Week

Group Dashboard Project and Proposals:
- Rough Draft due Thu. 3/27 at 6:00 PM
- Optional proposal group meetings will take place on 4/3 and 4/4
- Final Proposal due Tue. 4/8 at 6:00 PM
Data Exploration and Cleaning
- Iterative, subjective process
- Always be transparent and document choices
Introduction to HW 5 - Part 1 Data
- Due Wed. 3/26 at midnight
Quiz 2 on Thu. 4/3

You may submit an ‘Engagement Question’ about each lecture until midnight on the day of the lecture. A minimum of four submissions are required during the semester.

--- title: "Week 8" subtitle: "Dashboard Project Proposal & Data Cleaning" author: "Penelope Pooler Eisenbies" date: last-modified toc: true toc-depth: 3 toc-location: left toc-title: "Table of Contents" toc-expand: 1 format: html: code-line-numbers: true code-fold: true code-tools: true execute: echo: fenced --- ## Housekeeping ```{r include=F} #|label: setup knitr::opts_chunk$set(echo=T, highlight=T) # specifies default options for all chunks options(scipen=100) # suppress scientific notation # install pacman if needed if (!require("pacman")) install.packages("pacman", repos = "http://lib.stat.cmu.edu/R/CRAN/") pacman::p_load(pacman, tidyverse, gridExtra, magrittr, kableExtra, tidyquant, highcharter, dygraphs) # install and load required packages p_loaded() # verify loaded packages ``` - Groups will be assigned by Thursday, 3/6. - Project Proposal Rough Draft is due Thursday, 3/27 - HW 5 - Part 1 is due Wed. 3/26 - We will talk about HW 5 - Part 1 on Thursday 3/6. ## ### BUA 455 Dashboard Group Project ::: fragment **Importance of Proposals:** ::: - As a data analyst, you have skills that others don't. - In this course, you are learning to acquire, combine, manage, and present data. - In the Business Analytics major, you learn to analyze and interpret data. - These skills are unique and are useful to employers, colleagues, clients BUT you will spend your career explaining and selling and explaining your skillset. - Most people will NOT know how your skills can help them. - It is your role to help them understand: - what questions can be answered with data - what data will help them answer their questions and how to get it. - How data management and analyses can help them and will be essential to their goals. ## ### How the Proposal Process fits Wthin this course - The scope of this course is **limited to data acquisition, management, and presentation**. - There is much that can be learned from combining data sources and exploring data prior to analysis. - In other courses, you are learning analytical methods that complement this course. - The more methods and concepts you know, the better prepared you are to deal with any data you encounter. - No matter how much you know, the majority of you time will be spent using the skills from this course. - Depending on your work environment and position, others may be assigned time consuming repetitive data management tasks by you BUT - You should still direct the data management and presentation work because you will be held responsible. - **File management, data management, data curation, and data presentation complement and augment analyses**. ## ### Project Proposal Rough Draft - The rough draft does not need to be more than a brief bullet-ed outline. - **What data does your group plan to work with?** - Summarize data topic in 1-2 sentences. - Do some data exploration and look for links to good data sources (AI may be helpful in search). - Rough draft proposal should include ***links to specific data sources that might work.*** - **What questions do you think you want to answer?** - Propose 2-4 questions (or more) you hope to address with your data dashboard. - For each question propose which dataset and variables might help you convey an answer. ## Data Sources - The [course website lists some data sources](https://penelope2040.quarto.pub/bua-455-semester/#interesting-data){target="_blank"} and I will add to this list over Spring Break. - Another starting place for inspiration is [**Statista**](https://login.libezproxy2.syr.edu/login?url=https://www.statista.com){target="_blank"} - It is free if you login from the SU Library and is a great starting point. - **Your project should go beyond the small curated Statista datasets** but it may provide inspiration. - For this course, consider the topic choice to be a iterative process: 1. Develop a topic and potential questions (but don't become too attached). 2. Explore provided data sources, Google, and AI using your question(s). 3. Examine data that is somewhat related to question. 4. Edit and refine topic and question based on data available. ## Data Source Challenges - If you are interested in **sports** data or **billboard music** data and don't know how to scrape data, I can help. - There are sports and music data available, but the websites often don't make the datasets downloadable. - Instead, the data must be scraped as we did with the Box Office Mojo data. - Projects that require scraping data are doable, but data management will take more time and **you should start ASAP**. ## Data Sourcing Example - I chose a VERY obscure topic to see how long it would take me to find a reasonable dataset. - Initial topic: **Craft Traditions in Europe** - I started with posing a question to CoPilot. - That led me to links for [**Eurostat**](https://ec.europa.eu/eurostat){target="_blank"} and the [**European Commission**](https://commission.europa.eu/index_en){target="_blank"} - One the **European Commission** site I found the [**Joint Research Center**](https://commission.europa.eu/about-european-commission/departments-and-executive-agencies/joint-research-centre_en){target="_blank"}. - A combination of website exploration and AI questions led to - [**Competence Centre on Composite Indicators and Scoreboards**](https://knowledge4policy.ec.europa.eu/composite-indicators_en){target="_blank"} - [**JRC Composite Indicators Site**](https://composite-indicators.jrc.ec.europa.eu/){target="_blank"} and then led to - and a site about [**cultural and creative European cities**](https://composite-indicators.jrc.ec.europa.eu/cultural-creative-cities-monitor){target="_blank"} - and a link to [**Docs and Data**](https://composite-indicators.jrc.ec.europa.eu/cultural-creative-cities-monitor/docs-and-data){target="_blank"} and this download link: - ![](img/data_sourcing_example_europe_culture.png){width="50%"} ## Flexibility Leads to Interesting Data - Raw Excel Data with interesting variables for over 500 European cities. - Dataset is not directly related to initial topic but there are data to build a multi-page dashboard. - Going from initial obscure idea to interesting data took about 30 minutes. - We (TAs and I) are happy to help you with this process. **Be flexible and curious.** ::: fragment ![](img/Euro_Culture_2023.png){width="65%;" fig-align="center"} ::: ## Final Proposal - **Due Tue. 4/8 - 48 Hour Grace Period** - The final proposal draft should be 1-3 pages at most (double spaced, 12 pt. font). - **Bullet points are preferred to paragraphs** - Although your client(s) are hypothetical, it helps to envision your target client. - The goal of the proposal is to - explain in non-technical language why your dashboard will be useful. - convince the prospective client to trust you with their data. - How will client use the information provided in the dashboard to - improve their marketing plan or their purchasing and staffing plan? - improve their overall investment strategy or their supply chain strategy? ## More about Proposal Final Draft - **Due Tue. 4/8 - 48 Hour Grace Period** - The proposal also provides a roadmap for the work you plan to do so that all group members are informed. - List tentative goals that you hope to achieve and ideas for how to achieve them. - FINAL Proposal should **ALSO** include: - a preliminary descriptive list of your planned pages. - a tentative plan for what data will be presented on each page, and how will it be presented (which plots or tables). - a list of which team member is primarily responsible for each page. ## Proposals are **TENTATIVE** Agreements - Understandably, the work is not yet done, so your plans may change, which is okay. - Your final plots and tables are not done, but it is good to have a plan for what they will be. - If possible show explicit links between each question or goal you plan to address and the proposed component of the dashboard that addresses it. - In a 'real world' situation: - It is better not to provide final proprietary work products at proposal stage. - Ideas and plots should be polished enough to entice client. - A good proposal shows that you have creative ideas and have put some thought into how you can help the client. - In real life, proposals and client goals are more concrete, but in this course, I encourage you to experiment. - ***Again, it is okay if your plans change based on what you discover in the data.*** ## Introduction to HW 5 - Part 1 Data ::: fragment **Today: Examine and clean data** ::: - First steps are to examine .csv file and to import and examine data in R. - Look for quirks that might be a challenge - Hint: Examine `date_added` column in the .csv file or in R ::: fragment ```{r import netflix data and examine} # import data and examine nflx <- read_csv("data/netflix_titles.csv", show_col_types=F) |> glimpse() ``` ::: ## ### Week 8 In-class Exercises - Q1-Q3 ***Session ID: bua455s25*** - **Q1.** How many different date formats do you see in the date_added column in the `.csv` file? - **Q2.** How many different date formats do you see in the date_added column in the imported dataset in the `Global Environment`? - **Q3.** How many different commands are required to convert the date_added column to a date variable? ## Cleaning These Data New command for dealing with text data: **`separate`** Review of `lubridate` commands - Next steps: - Select variables - use separate to split `listed_in` into 3 columns (Why?) - Convert `date_added` to a date variable and then create a year variable ::: fragment ```{r select and do prelim variable cleaning} nflx <- nflx |> select(show_id, type, date_added, release_year, rating, listed_in) |> # useful variables (subjective) separate(listed_in, sep=", ", into=c("g1", "g2", "g3")) |> # separate out genre listings (notice ", ") # will result in lots of NAs mutate(date_added = mdy(date_added), # convert data_added to a date var year_added = year(date_added)) # create a separate year var head(nflx,3) ``` ::: ## Week 8 In-class Exercises - Q4 ***Session ID: bua455s25*** `separate` is the reverse of an R command we have already learned. - What command is that? ## Data Exploration and Additional Cleaning ::: fragment **Finding and Removing Data with Missing Values (Depends on Subject Matter)** ::: - Are there observations for which `year_added` is missing (`NA`)? - **Cool trick:** `is.na(nflx$year_added)` will create a vector of `FALSE` and `TRUE` values. - `FALSE` unless the observation value is `NA`. - R interprets `FALSE` as 0 and `TRUE` as 1 so you can sum this vector to find number of missing values. ::: fragment ```{r remove observations for which year_added is missing} sum(is.na(nflx$year_added)) # number of missing values (there are 10) table(is.na(nflx$year_added)) # could also summarize with table nflx <- nflx |> filter(!is.na(year_added)) # remove NA's and check again sum(is.na(nflx$year_added)) ``` ::: ## Data Exploration and Additional Cleaning - It is not always valid to delete data unless it is for a specific valid transparent reason. - For example: Are there observations for which rating is `NR` (not-rated) or `UR` (un-rated)? Are there other issues? - In this case we delete observation because we are restricting our data to more common rating categories. - Remainder of the categories can be grouped logically. ::: fragment ```{r remove observations with nr or ur rating} table(nflx$rating) # examine categories nflx <- nflx |> filter(!rating %in% c("NR", "UR", "66 min", "74 min", "84 min")) table(nflx$rating) # examine again after filter ``` ::: ## When can you delete data? **If you find raw uncleaned data on the internet or are given data by a client or co-worker, is it okay to delete observations?** Yes, under certain circumstances: 1. **You must be completely transparent about how you are limiting the data and why.** - For example, I am limiting these data to specific time periods and specific genres. - Other time and genre categories have too little data. 2. **You should never delete outliers or unusual observations unless you can confirm they are errors or are not part of the data you are studying.** - I am simplifying the data by removing categories that are not of primary interest. - Someone else might choose different categories and limit the data differently. - Always be transparent about data sources and all decisions you make. - I am also deleting three observations for which the ratings were missing. ## ### Modifying text or numeric variables with `ifelse` - Numeric or text data can be converted to a categorical variable with a set of explicit `ifelse` statements. - `ifelse` is not the ONLY way to categorize data, but it's my tool of choice. - The input values for `ifelse` are ALWAYS the same: - Input 1: a test where result is `TRUE` or `FALSE`. - Input 2: output value if the test result is `TRUE`. - Input 3: output value if the test result is `FALSE`. ::: fragment ```{r simple example of ifelse, eval = F} x <- seq(9,90,9) # create var x that is a sequence of values (d <- tibble(x)) # convert x to a tibble (dataset) (d <- d |> mutate(y = ifelse(x < 35, "Low", NA))) # add variable y to dataset d (d <- d |> mutate(y = ifelse(x >= 35 & x < 65, "Medium", y))) (d <- d |> mutate(y = ifelse(x >= 65, "High", y))) ``` ::: ## ### Modifying text multiple `ifelse` statements - The previous example can be (and should be) condensed into one mutate statement as shown here. ::: fragment ```{r above example in one mutate command} # create a dataset with 1 variable x <- seq(9,90,9) d <- tibble(x) d <- d |> mutate(y = ifelse(x < 35, "Low", NA), y = ifelse(x >= 35 & x < 65, "Medium", y), y = ifelse(x >= 65, "High", y), yF = factor(y, levels = c("Low", "Medium", "High"))) |> # levels option used to specify correct order glimpse() ``` ::: ## Week 8 In-class Exercise - We will examine the Netflix ratings data again and simplify the categories to reflect minimum appropriate ages. - In this case, I have done some research on what these ratings mean. - In your projects, you will be expected to understand the values in your variables - You are likely to have to simplify your data categories for your data management goals. - My proposed solution for these data is to categorize both TV shows and Movies into four minimum age categories: - Minimum Age Categories: 0, 7, 13, 17 - Minimum Age 0: "G", "TV-G", "TV-Y" - Minimum Age 7: "PG", "TV-PG", "TV-Y7", "TV-Y7-FV" - Minimum Age 13: "PG-13", "TV-14" - Minimum Age 17: "NC-17", "TV-MA", "R" ## What we will do Today (Thu. 3/6): - Together we will fill in the blanks in the R chunk below - You will submit the entire chunk (with your added comments) on Blackboard in a text file. - Copy it into empty an text file and save it with your name. - This counts towards class participation for Thursday, 3/6 and is due by Friday, 3/7, at midnight. ```{r using ifelse to create min_age variable for prof, echo=F, eval=T, include=F} #table(nflx$rating) nflx <- nflx |> mutate(min_age = ifelse(rating %in% c("G", "TV-G", "TV-Y"), 0, NA), min_age = ifelse(rating %in% c("PG", "TV-PG", "TV-Y7", "TV-Y7-FV"), 7, min_age), min_age = ifelse(rating %in% c("PG-13", "TV-14"), 13, min_age), min_age = ifelse(rating %in% c("NC-17", "TV-MA", "R"), 17, min_age)) #table(nflx$min_age) ``` ::: fragment ```{r using ifelse to create min_age variable} table(nflx$rating) # nflx <- nflx |> # mutate(min_age = ifelse(rating %in% c("G", "TV-G", "TV-Y"), 0, ____), # min_age = ifelse(rating %in% c("PG", "TV-PG", "TV-Y7", "TV-Y7-FV"), 7, _____), # min_age = ifelse(rating %in% c("PG-13", "TV-14"), ____, ____), # min_age = ifelse()) # table(nflx$min_age) ``` ::: ## Another `ifelse` Example - We can use a similar technique to categorize when shows or movies were first shown (released). - This code is very similar to our simple numerical example shown previously. - In HW 5 - Part 1 (**Now Available**) you will: - Filter data to remove releases from before 1981. ::: fragment ```{r categorizing release data into periods} #table(nflx$release_year) nflx <- nflx |> mutate(release_period = ifelse(release_year <= 1980, "1925-1980", NA), release_period = ifelse(release_year > 1980 & release_year <= 2000, "1981-2000", release_period), release_period = ifelse(release_year > 2000 & release_year <= 2005, "2001-2005", release_period), release_period = ifelse(release_year > 2005 & release_year <= 2010, "2006-2010", release_period), release_period = ifelse(release_year > 2010 & release_year <= 2015, "2011-2015", release_period), release_period = ifelse(release_year > 2015, "2016-2021", release_period)) table(nflx$release_period) ``` ::: ## Cleaning Text Variables - The code that follows **WILL NOT RUN** until the `min_age` and `release_period` variables are created (previous chunks). - The next steps are tedious, but important, and will be different for every dataset you encounter. - On a quiz you should know how to use these text commands: - `gsub`, `str_trim`, - `separate`, `paste`, - `tolower`, `toupper`, - `ifelse` (covered in previous slides) - Not required but useful: use `?str_trim` to also see `str_squish` and `str_pad` ## ### Week 8 In-class Exercises - Q5-Q6 ***Session ID: bua455s25*** - `str_trim` is used in an upcoming chunk to trim spaces before characters in some observations. - **Q5.** Can `gsub` also be used for this purpose? - Type `?tolower` into the console to bring up the help file for this command and `toupper`. - Copy and paste these lines into the console: ::: fragment ``` tolower("WhAT is the Result of this Command?") toupper("WhAT is the Result of this Command?") ``` ::: - Answer **Q6.** ## Multi-step Process to clean genre - Reshape data with `pivot_longer` and convert all text to lower case with `to_lower` - Examine in console using this command: `cbind(table(nflx$genre))` ::: fragment ```{r cleaning up genre} nflx <- nflx |> select(show_id, type, year_added, release_period, min_age, g1, g2, g3) |> pivot_longer(cols=c("g1","g2","g3"), names_to="g", values_to="genre") # g is a place holder nflx <- nflx |> select(!g) |> mutate(genre=tolower(genre)) |> glimpse() ``` ::: ## Multi-step Process to clean genre :::::: columns ::: {.column width="48%"} - Recall when `separate` was used above, delimiter was specified as ", " - That removed leading spaces. - Alternatively, you emove leading spaces with `str_trim` - Examine in console using this command: `cbind(table(nflx$genre))` ::: ::: {.column width="4%"} ::: ::: {.column width="48%"} ![](img/owl.png){fig-align="center"} ::: :::::: ## Remove Unnecessary Text ::::::: columns :::: {.column width="48%"} - This text is unnecessary and redundant because of `type` variable in data. - Examine in console using this command: `cbind(table(nflx$genre))` ::: fragment ```{r remove movies tv shows and tv text} # remove the text `movies` and `tv shows` and `tv` from genre names nflx <- nflx |> mutate(genre = gsub(" movies", "", genre), genre = gsub("movies", "", genre), genre = gsub(" tv shows", "", genre), genre = gsub("tv shows", "", genre), genre = gsub("tv ", "", genre), genre = gsub(" tv", "", genre)) ``` ::: :::: ::: {.column width="4%"} ::: ::: {.column width="48%"} ![](img/beaver.png){fig-align="center"} ::: ::::::: ## ### Consolidate `genre` Categories using `ifelse` - Below I am only editing and consolidating categories I plan to use. - Examine in console using this command: `cbind(table(nflx$genre))` ::: fragment ```{r consolidate and edit genres of interest} nflx <- nflx |> mutate(genre = ifelse(genre == "drama", "dramas", genre), genre = ifelse(genre %in% c("stand-up comedy", "stand-up comedy & talk shows"), "comedies", genre), genre = ifelse(genre %in% c("documentaries", "docuseries"), "docs", genre), genre = ifelse(genre %in% c("children & family", "kids'"), "kids", genre), genre = gsub("action & adventure", "action_adventr", genre)) ``` ::: ## Filter Data to Genres of Interest ::::::: columns :::: {.column width="60%"} ::: fragment ```{r filter data to six genres we have consolidated} nflx <- nflx |> filter(genre %in% c("action_adventr", "comedies", "docs", "dramas", "international", "kids")) ``` ::: - Examine results in console with these commands - `cbind(table(nflx$genre))` - `table(nflx$type, nflx$genre)` :::: ::: {.column width="4%"} ::: ::: {.column width="36%"} ![](img/owl.png){fig-align="center"} ::: ::::::: ## Create Final Summary Dataset In the HW 5 - Part 1 Example Dashboard, I use the `nflx_tv` dataset, `TV Show` observations from this dataset. In HW 5 - Part 1, Students will create use the `nflx_mv` dataset, the `Movie` observations from this dataset. ```{r create final summary dataset} nflx_wide <- nflx |> group_by(type, release_period, year_added, min_age, genre) |> summarize(n = n()) |> # n() tallies number of obs pivot_wider(id_cols = c(type, release_period, year_added, min_age), # col. for each genre names_from = genre, values_from = n) nflx_wide[is.na(nflx_wide)] <- 0 # why should NA's be 0 here? nflx_tv <- nflx_wide |> # example dashboard made from nflx tv filter(type == "TV Show") # nflx_mv <- ``` ## ### Optional Group Meetings with Me :::::: columns ::: {.column width="48%"} - After break I will post a sign-up schedule - Meetings will be 5-10 minutes on Zoom - After reading your submitted pre-proposal I will: - Ask your group questions - Provide feedback to guide you towards good data and a good plan - Answer question you might have before you write your final proposal. ::: ::: {.column width="4%"} ::: ::: {.column width="48%"} ![](img/beaver.png){fig-align="center"} ::: :::::: ## ### Key Points from This Week - Group Dashboard Project and Proposals: - Rough Draft due Thu. 3/27 at 6:00 PM - Optional proposal group meetings will take place on 4/3 and 4/4 - Final Proposal due Tue. 4/8 at 6:00 PM - Data Exploration and Cleaning - Iterative, subjective process - Always be transparent and document choices - Introduction to HW 5 - Part 1 Data - Due Wed. 3/26 at midnight - Quiz 2 on Thu. 4/3 ::: fragment You may submit an 'Engagement Question' about each lecture until midnight on the day of the lecture. **A minimum of four submissions are required during the semester.** :::

Housekeeping

BUA 455 Dashboard Group Project

How the Proposal Process fits Wthin this course

Project Proposal Rough Draft

Data Sources

Data Source Challenges

Data Sourcing Example

Flexibility Leads to Interesting Data

Final Proposal

More about Proposal Final Draft

Proposals are TENTATIVE Agreements

Introduction to HW 5 - Part 1 Data

Week 8 In-class Exercises - Q1-Q3

Cleaning These Data

Week 8 In-class Exercises - Q4

Data Exploration and Additional Cleaning

Data Exploration and Additional Cleaning

When can you delete data?

Modifying text or numeric variables with ifelse

Modifying text multiple ifelse statements

Week 8 In-class Exercise

What we will do Today (Thu. 3/6):

Another ifelse Example

Cleaning Text Variables

Week 8 In-class Exercises - Q5-Q6

Multi-step Process to clean genre

Multi-step Process to clean genre

Remove Unnecessary Text

Consolidate genre Categories using ifelse

Filter Data to Genres of Interest

Create Final Summary Dataset

Optional Group Meetings with Me

Key Points from This Week

Modifying text or numeric variables with `ifelse`

Modifying text multiple `ifelse` statements

Another `ifelse` Example

Consolidate `genre` Categories using `ifelse`