Project Proposal Rough Draft is due Thursday, 3/27
HW 5 - Part 1 is due Wed. 3/26
We will talk about HW 5 - Part 1 on Thursday 3/6.
BUA 455 Dashboard Group Project
Importance of Proposals:
As a data analyst, you have skills that others don’t.
In this course, you are learning to acquire, combine, manage, and present data.
In the Business Analytics major, you learn to analyze and interpret data.
These skills are unique and are useful to employers, colleagues, clients BUT you will spend your career explaining and selling and explaining your skillset.
Most people will NOT know how your skills can help them.
It is your role to help them understand:
what questions can be answered with data
what data will help them answer their questions and how to get it.
How data management and analyses can help them and will be essential to their goals.
How the Proposal Process fits Wthin this course
The scope of this course is limited to data acquisition, management, and presentation.
There is much that can be learned from combining data sources and exploring data prior to analysis.
In other courses, you are learning analytical methods that complement this course.
The more methods and concepts you know, the better prepared you are to deal with any data you encounter.
No matter how much you know, the majority of you time will be spent using the skills from this course.
Depending on your work environment and position, others may be assigned time consuming repetitive data management tasks by you BUT
You should still direct the data management and presentation work because you will be held responsible.
File management, data management, data curation, and data presentation complement and augment analyses.
Project Proposal Rough Draft
The rough draft does not need to be more than a brief bullet-ed outline.
What data does your group plan to work with?
Summarize data topic in 1-2 sentences.
Do some data exploration and look for links to good data sources (AI may be helpful in search).
Rough draft proposal should include links to specific data sources that might work.
What questions do you think you want to answer?
Propose 2-4 questions (or more) you hope to address with your data dashboard.
For each question propose which dataset and variables might help you convey an answer.
Q1. How many different date formats do you see in the date_added column in the .csv file?
Q2. How many different date formats do you see in the date_added column in the imported dataset in the Global Environment?
Q3. How many different commands are required to convert the date_added column to a date variable?
Cleaning These Data
New command for dealing with text data: separate
Review of lubridate commands
Next steps:
Select variables
use separate to split listed_in into 3 columns (Why?)
Convert date_added to a date variable and then create a year variable
Code
```{r select and do prelim variable cleaning}nflx <- nflx |> select(show_id, type, date_added, release_year, rating, listed_in) |> # useful variables (subjective) separate(listed_in, sep=", ", into=c("g1", "g2", "g3")) |> # separate out genre listings (notice ", ") # will result in lots of NAs mutate(date_added = mdy(date_added), # convert data_added to a date var year_added = year(date_added)) # create a separate year var```
```{r select and do prelim variable cleaning}head(nflx,3)```
# A tibble: 3 × 9
show_id type date_added release_year rating g1 g2 g3 year_added
<chr> <chr> <date> <dbl> <chr> <chr> <chr> <chr> <dbl>
1 s1 Movie 2021-09-25 2020 PG-13 Documen… <NA> <NA> 2021
2 s2 TV Show 2021-09-24 2021 TV-MA Interna… TV D… TV M… 2021
3 s3 TV Show 2021-09-24 2021 TV-MA Crime T… Inte… TV A… 2021
Week 8 In-class Exercises - Q4
Session ID: bua455s25
separate is the reverse of an R command we have already learned.
What command is that?
Data Exploration and Additional Cleaning
Finding and Removing Data with Missing Values (Depends on Subject Matter)
Are there observations for which year_added is missing (NA)?
Cool trick:is.na(nflx$year_added) will create a vector of FALSE and TRUE values.
FALSE unless the observation value is NA.
R interprets FALSE as 0 and TRUE as 1 so you can sum this vector to find number of missing values.
Code
```{r remove observations for which year_added is missing}sum(is.na(nflx$year_added)) # number of missing values (there are 10)table(is.na(nflx$year_added)) # could also summarize with tablenflx <- nflx |> filter(!is.na(year_added)) # remove NA's and check againsum(is.na(nflx$year_added))```
[1] 10
FALSE TRUE
8797 10
[1] 0
Data Exploration and Additional Cleaning
It is not always valid to delete data unless it is for a specific valid transparent reason.
For example: Are there observations for which rating is NR (not-rated) or UR (un-rated)? Are there other issues?
In this case we delete observation because we are restricting our data to more common rating categories.
Remainder of the categories can be grouped logically.
Code
```{r remove observations with nr or ur rating}table(nflx$rating) # examine categoriesnflx <- nflx |> filter(!rating %in% c("NR", "UR", "66 min", "74 min", "84 min"))table(nflx$rating) # examine again after filter```
66 min 74 min 84 min G NC-17 NR PG PG-13
1 1 1 41 3 79 287 490
R TV-14 TV-G TV-MA TV-PG TV-Y TV-Y7 TV-Y7-FV
799 2157 220 3205 861 306 333 6
UR
3
G NC-17 PG PG-13 R TV-14 TV-G TV-MA
41 3 287 490 799 2157 220 3205
TV-PG TV-Y TV-Y7 TV-Y7-FV
861 306 333 6
When can you delete data?
If you find raw uncleaned data on the internet or are given data by a client or co-worker, is it okay to delete observations?
Yes, under certain circumstances:
You must be completely transparent about how you are limiting the data and why.
For example, I am limiting these data to specific time periods and specific genres.
Other time and genre categories have too little data.
You should never delete outliers or unusual observations unless you can confirm they are errors or are not part of the data you are studying.
I am simplifying the data by removing categories that are not of primary interest.
Someone else might choose different categories and limit the data differently.
Always be transparent about data sources and all decisions you make.
I am also deleting three observations for which the ratings were missing.
Modifying text or numeric variables with ifelse
Numeric or text data can be converted to a categorical variable with a set of explicit ifelse statements.
ifelse is not the ONLY way to categorize data, but it’s my tool of choice.
The input values for ifelse are ALWAYS the same:
Input 1: a test where result is TRUE or FALSE.
Input 2: output value if the test result is TRUE.
Input 3: output value if the test result is FALSE.
Code
```{r simple example of ifelse, eval = F}x <- seq(9,90,9) # create var x that is a sequence of values(d <- tibble(x)) # convert x to a tibble (dataset) (d <- d |> mutate(y = ifelse(x < 35, "Low", NA))) # add variable y to dataset d(d <- d |> mutate(y = ifelse(x >= 35 & x < 65, "Medium", y))) (d <- d |> mutate(y = ifelse(x >= 65, "High", y))) ```
Modifying text multiple ifelse statements
The previous example can be (and should be) condensed into one mutate statement as shown here.
Code
```{r above example in one mutate command}# create a dataset with 1 variablex <- seq(9,90,9)d <- tibble(x)d <- d |> mutate(y = ifelse(x < 35, "Low", NA), y = ifelse(x >= 35 & x < 65, "Medium", y), y = ifelse(x >= 65, "High", y), yF = factor(y, levels = c("Low", "Medium", "High"))) |> # levels option used to specify correct orderglimpse()```
```{r filter data to six genres we have consolidated}nflx <- nflx |> filter(genre %in% c("action_adventr", "comedies", "docs", "dramas", "international", "kids"))```
Examine results in console with these commands
cbind(table(nflx$genre))
table(nflx$type, nflx$genre)
Create Final Summary Dataset
In the HW 5 - Part 1 Example Dashboard, I use the nflx_tv dataset, TV Show observations from this dataset.
In HW 5 - Part 1, Students will create use the nflx_mv dataset, the Movie observations from this dataset.
Code
```{r create final summary dataset}nflx_wide <- nflx |> group_by(type, release_period, year_added, min_age, genre) |> summarize(n = n()) |> # n() tallies number of obs pivot_wider(id_cols = c(type, release_period, year_added, min_age), # col. for each genre names_from = genre, values_from = n)```
`summarise()` has grouped output by 'type', 'release_period', 'year_added',
'min_age'. You can override using the `.groups` argument.
Code
```{r create final summary dataset}nflx_wide[is.na(nflx_wide)] <- 0 # why should NA's be 0 here?nflx_tv <- nflx_wide |> # example dashboard made from nflx tv filter(type == "TV Show") # nflx_mv <- ```
Optional Group Meetings with Me
After break I will post a sign-up schedule
Meetings will be 5-10 minutes on Zoom
After reading your submitted pre-proposal I will:
Ask your group questions
Provide feedback to guide you towards good data and a good plan
Answer question you might have before you write your final proposal.
Key Points from This Week
Group Dashboard Project and Proposals:
Rough Draft due Thu. 3/27 at 6:00 PM
Optional proposal group meetings will take place on 4/3 and 4/4
Final Proposal due Tue. 4/8 at 6:00 PM
Data Exploration and Cleaning
Iterative, subjective process
Always be transparent and document choices
Introduction to HW 5 - Part 1 Data
Due Wed. 3/26 at midnight
Quiz 2 on Thu. 4/3
You may submit an ‘Engagement Question’ about each lecture until midnight on the day of the lecture. A minimum of four submissions are required during the semester.
Source Code
---title: "Week 8"subtitle: "Dashboard Project Proposal & Data Cleaning"author: "Penelope Pooler Eisenbies"date: last-modifiedtoc: truetoc-depth: 3toc-location: lefttoc-title: "Table of Contents"toc-expand: 1format: html: code-line-numbers: true code-fold: true code-tools: trueexecute: echo: fenced---## Housekeeping```{r include=F}#|label: setupknitr::opts_chunk$set(echo=T, highlight=T) # specifies default options for all chunksoptions(scipen=100) # suppress scientific notation # install pacman if neededif (!require("pacman")) install.packages("pacman", repos = "http://lib.stat.cmu.edu/R/CRAN/")pacman::p_load(pacman, tidyverse, gridExtra, magrittr, kableExtra, tidyquant, highcharter, dygraphs) # install and load required packages p_loaded() # verify loaded packages```- Groups will be assigned by Thursday, 3/6.- Project Proposal Rough Draft is due Thursday, 3/27- HW 5 - Part 1 is due Wed. 3/26 - We will talk about HW 5 - Part 1 on Thursday 3/6.## ### BUA 455 Dashboard Group Project::: fragment**Importance of Proposals:**:::- As a data analyst, you have skills that others don't.- In this course, you are learning to acquire, combine, manage, and present data.- In the Business Analytics major, you learn to analyze and interpret data.- These skills are unique and are useful to employers, colleagues, clients BUT you will spend your career explaining and selling and explaining your skillset.- Most people will NOT know how your skills can help them.- It is your role to help them understand: - what questions can be answered with data - what data will help them answer their questions and how to get it. - How data management and analyses can help them and will be essential to their goals.## ### How the Proposal Process fits Wthin this course- The scope of this course is **limited to data acquisition, management, and presentation**.- There is much that can be learned from combining data sources and exploring data prior to analysis.- In other courses, you are learning analytical methods that complement this course.- The more methods and concepts you know, the better prepared you are to deal with any data you encounter.- No matter how much you know, the majority of you time will be spent using the skills from this course.- Depending on your work environment and position, others may be assigned time consuming repetitive data management tasks by you BUT - You should still direct the data management and presentation work because you will be held responsible.- **File management, data management, data curation, and data presentation complement and augment analyses**.## ### Project Proposal Rough Draft- The rough draft does not need to be more than a brief bullet-ed outline.- **What data does your group plan to work with?** - Summarize data topic in 1-2 sentences. - Do some data exploration and look for links to good data sources (AI may be helpful in search). - Rough draft proposal should include ***links to specific data sources that might work.***- **What questions do you think you want to answer?** - Propose 2-4 questions (or more) you hope to address with your data dashboard. - For each question propose which dataset and variables might help you convey an answer.## Data Sources- The [course website lists some data sources](https://penelope2040.quarto.pub/bua-455-semester/#interesting-data){target="_blank"} and I will add to this list over Spring Break.- Another starting place for inspiration is [**Statista**](https://login.libezproxy2.syr.edu/login?url=https://www.statista.com){target="_blank"} - It is free if you login from the SU Library and is a great starting point. - **Your project should go beyond the small curated Statista datasets** but it may provide inspiration.- For this course, consider the topic choice to be a iterative process: 1. Develop a topic and potential questions (but don't become too attached). 2. Explore provided data sources, Google, and AI using your question(s). 3. Examine data that is somewhat related to question. 4. Edit and refine topic and question based on data available.## Data Source Challenges- If you are interested in **sports** data or **billboard music** data and don't know how to scrape data, I can help.<br>- There are sports and music data available, but the websites often don't make the datasets downloadable.<br>- Instead, the data must be scraped as we did with the Box Office Mojo data.<br>- Projects that require scraping data are doable, but data management will take more time and **you should start ASAP**.## Data Sourcing Example- I chose a VERY obscure topic to see how long it would take me to find a reasonable dataset.- Initial topic: **Craft Traditions in Europe**- I started with posing a question to CoPilot.- That led me to links for [**Eurostat**](https://ec.europa.eu/eurostat){target="_blank"} and the [**European Commission**](https://commission.europa.eu/index_en){target="_blank"}- One the **European Commission** site I found the [**Joint Research Center**](https://commission.europa.eu/about-european-commission/departments-and-executive-agencies/joint-research-centre_en){target="_blank"}.- A combination of website exploration and AI questions led to- [**Competence Centre on Composite Indicators and Scoreboards**](https://knowledge4policy.ec.europa.eu/composite-indicators_en){target="_blank"} - [**JRC Composite Indicators Site**](https://composite-indicators.jrc.ec.europa.eu/){target="_blank"} and then led to - and a site about [**cultural and creative European cities**](https://composite-indicators.jrc.ec.europa.eu/cultural-creative-cities-monitor){target="_blank"} - and a link to [**Docs and Data**](https://composite-indicators.jrc.ec.europa.eu/cultural-creative-cities-monitor/docs-and-data){target="_blank"} and this download link: - {width="50%"}## Flexibility Leads to Interesting Data- Raw Excel Data with interesting variables for over 500 European cities.- Dataset is not directly related to initial topic but there are data to build a multi-page dashboard.- Going from initial obscure idea to interesting data took about 30 minutes.- We (TAs and I) are happy to help you with this process. **Be flexible and curious.**::: fragment{width="65%;" fig-align="center"}:::## Final Proposal- **Due Tue. 4/8 - 48 Hour Grace Period**- The final proposal draft should be 1-3 pages at most (double spaced, 12 pt. font). - **Bullet points are preferred to paragraphs**- Although your client(s) are hypothetical, it helps to envision your target client.- The goal of the proposal is to - explain in non-technical language why your dashboard will be useful. - convince the prospective client to trust you with their data.- How will client use the information provided in the dashboard to - improve their marketing plan or their purchasing and staffing plan? - improve their overall investment strategy or their supply chain strategy?## More about Proposal Final Draft- **Due Tue. 4/8 - 48 Hour Grace Period**- The proposal also provides a roadmap for the work you plan to do so that all group members are informed. - List tentative goals that you hope to achieve and ideas for how to achieve them.- FINAL Proposal should **ALSO** include: - a preliminary descriptive list of your planned pages. - a tentative plan for what data will be presented on each page, and how will it be presented (which plots or tables). - a list of which team member is primarily responsible for each page.## Proposals are **TENTATIVE** Agreements- Understandably, the work is not yet done, so your plans may change, which is okay.- Your final plots and tables are not done, but it is good to have a plan for what they will be.- If possible show explicit links between each question or goal you plan to address and the proposed component of the dashboard that addresses it.- In a 'real world' situation: - It is better not to provide final proprietary work products at proposal stage. - Ideas and plots should be polished enough to entice client.- A good proposal shows that you have creative ideas and have put some thought into how you can help the client.- In real life, proposals and client goals are more concrete, but in this course, I encourage you to experiment.- ***Again, it is okay if your plans change based on what you discover in the data.***## Introduction to HW 5 - Part 1 Data::: fragment**Today: Examine and clean data**:::- First steps are to examine .csv file and to import and examine data in R.- Look for quirks that might be a challenge - Hint: Examine `date_added` column in the .csv file or in R::: fragment```{r import netflix data and examine}# import data and examinenflx <- read_csv("data/netflix_titles.csv", show_col_types=F) |> glimpse()```:::## ### Week 8 In-class Exercises - Q1-Q3***Session ID: bua455s25***- **Q1.** How many different date formats do you see in the date_added column in the `.csv` file?- **Q2.** How many different date formats do you see in the date_added column in the imported dataset in the `Global Environment`?- **Q3.** How many different commands are required to convert the date_added column to a date variable?## Cleaning These DataNew command for dealing with text data: **`separate`**Review of `lubridate` commands- Next steps: - Select variables - use separate to split `listed_in` into 3 columns (Why?) - Convert `date_added` to a date variable and then create a year variable::: fragment```{r select and do prelim variable cleaning}nflx <- nflx |> select(show_id, type, date_added, release_year, rating, listed_in) |> # useful variables (subjective) separate(listed_in, sep=", ", into=c("g1", "g2", "g3")) |> # separate out genre listings (notice ", ") # will result in lots of NAs mutate(date_added = mdy(date_added), # convert data_added to a date var year_added = year(date_added)) # create a separate year varhead(nflx,3)```:::## Week 8 In-class Exercises - Q4***Session ID: bua455s25***`separate` is the reverse of an R command we have already learned.- What command is that?## Data Exploration and Additional Cleaning::: fragment**Finding and Removing Data with Missing Values (Depends on Subject Matter)**:::- Are there observations for which `year_added` is missing (`NA`)? - **Cool trick:** `is.na(nflx$year_added)` will create a vector of `FALSE` and `TRUE` values. - `FALSE` unless the observation value is `NA`. - R interprets `FALSE` as 0 and `TRUE` as 1 so you can sum this vector to find number of missing values.::: fragment```{r remove observations for which year_added is missing}sum(is.na(nflx$year_added)) # number of missing values (there are 10)table(is.na(nflx$year_added)) # could also summarize with tablenflx <- nflx |> filter(!is.na(year_added)) # remove NA's and check againsum(is.na(nflx$year_added))```:::## Data Exploration and Additional Cleaning- It is not always valid to delete data unless it is for a specific valid transparent reason.- For example: Are there observations for which rating is `NR` (not-rated) or `UR` (un-rated)? Are there other issues?- In this case we delete observation because we are restricting our data to more common rating categories.- Remainder of the categories can be grouped logically.::: fragment```{r remove observations with nr or ur rating}table(nflx$rating) # examine categoriesnflx <- nflx |> filter(!rating %in% c("NR", "UR", "66 min", "74 min", "84 min"))table(nflx$rating) # examine again after filter```:::## When can you delete data?**If you find raw uncleaned data on the internet or are given data by a client or co-worker, is it okay to delete observations?**Yes, under certain circumstances:1. **You must be completely transparent about how you are limiting the data and why.**- For example, I am limiting these data to specific time periods and specific genres.- Other time and genre categories have too little data.2. **You should never delete outliers or unusual observations unless you can confirm they are errors or are not part of the data you are studying.**- I am simplifying the data by removing categories that are not of primary interest.- Someone else might choose different categories and limit the data differently.- Always be transparent about data sources and all decisions you make.- I am also deleting three observations for which the ratings were missing.## ### Modifying text or numeric variables with `ifelse`- Numeric or text data can be converted to a categorical variable with a set of explicit `ifelse` statements.- `ifelse` is not the ONLY way to categorize data, but it's my tool of choice.- The input values for `ifelse` are ALWAYS the same: - Input 1: a test where result is `TRUE` or `FALSE`. - Input 2: output value if the test result is `TRUE`. - Input 3: output value if the test result is `FALSE`.::: fragment```{r simple example of ifelse, eval = F}x <- seq(9,90,9) # create var x that is a sequence of values(d <- tibble(x)) # convert x to a tibble (dataset) (d <- d |> mutate(y = ifelse(x < 35, "Low", NA))) # add variable y to dataset d(d <- d |> mutate(y = ifelse(x >= 35 & x < 65, "Medium", y))) (d <- d |> mutate(y = ifelse(x >= 65, "High", y))) ```:::## ### Modifying text multiple `ifelse` statements- The previous example can be (and should be) condensed into one mutate statement as shown here.::: fragment```{r above example in one mutate command}# create a dataset with 1 variablex <- seq(9,90,9)d <- tibble(x)d <- d |> mutate(y = ifelse(x < 35, "Low", NA), y = ifelse(x >= 35 & x < 65, "Medium", y), y = ifelse(x >= 65, "High", y), yF = factor(y, levels = c("Low", "Medium", "High"))) |> # levels option used to specify correct orderglimpse()```:::## Week 8 In-class Exercise- We will examine the Netflix ratings data again and simplify the categories to reflect minimum appropriate ages.- In this case, I have done some research on what these ratings mean.- In your projects, you will be expected to understand the values in your variables - You are likely to have to simplify your data categories for your data management goals.- My proposed solution for these data is to categorize both TV shows and Movies into four minimum age categories: - Minimum Age Categories: 0, 7, 13, 17 - Minimum Age 0: "G", "TV-G", "TV-Y" - Minimum Age 7: "PG", "TV-PG", "TV-Y7", "TV-Y7-FV" - Minimum Age 13: "PG-13", "TV-14" - Minimum Age 17: "NC-17", "TV-MA", "R"## What we will do Today (Thu. 3/6):- Together we will fill in the blanks in the R chunk below - You will submit the entire chunk (with your added comments) on Blackboard in a text file. - Copy it into empty an text file and save it with your name. - This counts towards class participation for Thursday, 3/6 and is due by Friday, 3/7, at midnight.```{r using ifelse to create min_age variable for prof, echo=F, eval=T, include=F}#table(nflx$rating)nflx <- nflx |>mutate(min_age = ifelse(rating %in% c("G", "TV-G", "TV-Y"), 0, NA), min_age = ifelse(rating %in% c("PG", "TV-PG", "TV-Y7", "TV-Y7-FV"), 7, min_age), min_age = ifelse(rating %in% c("PG-13", "TV-14"), 13, min_age), min_age = ifelse(rating %in% c("NC-17", "TV-MA", "R"), 17, min_age))#table(nflx$min_age)```::: fragment```{r using ifelse to create min_age variable}table(nflx$rating)# nflx <- nflx |># mutate(min_age = ifelse(rating %in% c("G", "TV-G", "TV-Y"), 0, ____),# min_age = ifelse(rating %in% c("PG", "TV-PG", "TV-Y7", "TV-Y7-FV"), 7, _____),# min_age = ifelse(rating %in% c("PG-13", "TV-14"), ____, ____),# min_age = ifelse())# table(nflx$min_age)```:::## Another `ifelse` Example- We can use a similar technique to categorize when shows or movies were first shown (released).- This code is very similar to our simple numerical example shown previously.- In HW 5 - Part 1 (**Now Available**) you will: - Filter data to remove releases from before 1981.::: fragment```{r categorizing release data into periods}#table(nflx$release_year)nflx <- nflx |> mutate(release_period = ifelse(release_year <= 1980, "1925-1980", NA), release_period = ifelse(release_year > 1980 & release_year <= 2000, "1981-2000", release_period), release_period = ifelse(release_year > 2000 & release_year <= 2005, "2001-2005", release_period), release_period = ifelse(release_year > 2005 & release_year <= 2010, "2006-2010", release_period), release_period = ifelse(release_year > 2010 & release_year <= 2015, "2011-2015", release_period), release_period = ifelse(release_year > 2015, "2016-2021", release_period))table(nflx$release_period)```:::## Cleaning Text Variables- The code that follows **WILL NOT RUN** until the `min_age` and `release_period` variables are created (previous chunks).- The next steps are tedious, but important, and will be different for every dataset you encounter.- On a quiz you should know how to use these text commands: - `gsub`, `str_trim`, - `separate`, `paste`, - `tolower`, `toupper`, - `ifelse` (covered in previous slides) <br> - Not required but useful: use `?str_trim` to also see `str_squish` and `str_pad`## ### Week 8 In-class Exercises - Q5-Q6***Session ID: bua455s25***- `str_trim` is used in an upcoming chunk to trim spaces before characters in some observations.- **Q5.** Can `gsub` also be used for this purpose?<br>- Type `?tolower` into the console to bring up the help file for this command and `toupper`.- Copy and paste these lines into the console:::: fragment``` tolower("WhAT is the Result of this Command?")toupper("WhAT is the Result of this Command?")```:::- Answer **Q6.**## Multi-step Process to clean genre- Reshape data with `pivot_longer` and convert all text to lower case with `to_lower`- Examine in console using this command: `cbind(table(nflx$genre))`::: fragment```{r cleaning up genre}nflx <- nflx |> select(show_id, type, year_added, release_period, min_age, g1, g2, g3) |> pivot_longer(cols=c("g1","g2","g3"), names_to="g", values_to="genre") # g is a place holdernflx <- nflx |> select(!g) |> mutate(genre=tolower(genre)) |> glimpse()```:::## Multi-step Process to clean genre:::::: columns::: {.column width="48%"}- Recall when `separate` was used above, delimiter was specified as ", " - That removed leading spaces. - Alternatively, you emove leading spaces with `str_trim`- Examine in console using this command: `cbind(table(nflx$genre))`:::::: {.column width="4%"}:::::: {.column width="48%"}{fig-align="center"}:::::::::## Remove Unnecessary Text::::::: columns:::: {.column width="48%"}- This text is unnecessary and redundant because of `type` variable in data.- Examine in console using this command: `cbind(table(nflx$genre))`::: fragment```{r remove movies tv shows and tv text}# remove the text `movies` and `tv shows` and `tv` from genre namesnflx <- nflx |> mutate(genre = gsub(" movies", "", genre), genre = gsub("movies", "", genre), genre = gsub(" tv shows", "", genre), genre = gsub("tv shows", "", genre), genre = gsub("tv ", "", genre), genre = gsub(" tv", "", genre))```:::::::::: {.column width="4%"}:::::: {.column width="48%"}{fig-align="center"}::::::::::## ### Consolidate `genre` Categories using `ifelse`- Below I am only editing and consolidating categories I plan to use.- Examine in console using this command: `cbind(table(nflx$genre))`::: fragment```{r consolidate and edit genres of interest}nflx <- nflx |> mutate(genre = ifelse(genre == "drama", "dramas", genre), genre = ifelse(genre %in% c("stand-up comedy", "stand-up comedy & talk shows"), "comedies", genre), genre = ifelse(genre %in% c("documentaries", "docuseries"), "docs", genre), genre = ifelse(genre %in% c("children & family", "kids'"), "kids", genre), genre = gsub("action & adventure", "action_adventr", genre))```:::## Filter Data to Genres of Interest::::::: columns:::: {.column width="60%"}::: fragment```{r filter data to six genres we have consolidated}nflx <- nflx |> filter(genre %in% c("action_adventr", "comedies", "docs", "dramas", "international", "kids"))```:::- Examine results in console with these commands - `cbind(table(nflx$genre))` - `table(nflx$type, nflx$genre)`::::::: {.column width="4%"}:::::: {.column width="36%"}{fig-align="center"}::::::::::## Create Final Summary DatasetIn the HW 5 - Part 1 Example Dashboard, I use the `nflx_tv` dataset, `TV Show` observations from this dataset.In HW 5 - Part 1, Students will create use the `nflx_mv` dataset, the `Movie` observations from this dataset.```{r create final summary dataset}nflx_wide <- nflx |> group_by(type, release_period, year_added, min_age, genre) |> summarize(n = n()) |> # n() tallies number of obs pivot_wider(id_cols = c(type, release_period, year_added, min_age), # col. for each genre names_from = genre, values_from = n)nflx_wide[is.na(nflx_wide)] <- 0 # why should NA's be 0 here?nflx_tv <- nflx_wide |> # example dashboard made from nflx tv filter(type == "TV Show") # nflx_mv <- ```## ### Optional Group Meetings with Me:::::: columns::: {.column width="48%"}- After break I will post a sign-up schedule- Meetings will be 5-10 minutes on Zoom- After reading your submitted pre-proposal I will: - Ask your group questions - Provide feedback to guide you towards good data and a good plan - Answer question you might have before you write your final proposal.:::::: {.column width="4%"}:::::: {.column width="48%"}{fig-align="center"}:::::::::## ### Key Points from This Week- Group Dashboard Project and Proposals: - Rough Draft due Thu. 3/27 at 6:00 PM - Optional proposal group meetings will take place on 4/3 and 4/4 - Final Proposal due Tue. 4/8 at 6:00 PM- Data Exploration and Cleaning - Iterative, subjective process - Always be transparent and document choices- Introduction to HW 5 - Part 1 Data - Due Wed. 3/26 at midnight- Quiz 2 on Thu. 4/3::: fragmentYou may submit an 'Engagement Question' about each lecture until midnight on the day of the lecture. **A minimum of four submissions are required during the semester.**:::