Housekeeping

  • Groups will be assigned by Thursday, 3/6.

  • Project Proposal Rough Draft is due Thursday, 3/27

  • HW 5 - Part 1 is due Wed. 3/26

    • We will talk about HW 5 - Part 1 on Thursday 3/6.

BUA 455 Dashboard Group Project

Importance of Proposals:

  • As a data analyst, you have skills that others don’t.

  • In this course, you are learning to acquire, combine, manage, and present data.

  • In the Business Analytics major, you learn to analyze and interpret data.

  • These skills are unique and are useful to employers, colleagues, clients BUT you will spend your career explaining and selling and explaining your skillset.

  • Most people will NOT know how your skills can help them.

  • It is your role to help them understand:

    • what questions can be answered with data

    • what data will help them answer their questions and how to get it.

    • How data management and analyses can help them and will be essential to their goals.

How the Proposal Process fits Wthin this course

  • The scope of this course is limited to data acquisition, management, and presentation.

  • There is much that can be learned from combining data sources and exploring data prior to analysis.

  • In other courses, you are learning analytical methods that complement this course.

  • The more methods and concepts you know, the better prepared you are to deal with any data you encounter.

  • No matter how much you know, the majority of you time will be spent using the skills from this course.

  • Depending on your work environment and position, others may be assigned time consuming repetitive data management tasks by you BUT

    • You should still direct the data management and presentation work because you will be held responsible.
  • File management, data management, data curation, and data presentation complement and augment analyses.

Project Proposal Rough Draft

  • The rough draft does not need to be more than a brief bullet-ed outline.

  • What data does your group plan to work with?

    • Summarize data topic in 1-2 sentences.

    • Do some data exploration and look for links to good data sources (AI may be helpful in search).

    • Rough draft proposal should include links to specific data sources that might work.

  • What questions do you think you want to answer?

    • Propose 2-4 questions (or more) you hope to address with your data dashboard.

    • For each question propose which dataset and variables might help you convey an answer.

Data Sources

  • The course website lists some data sources and I will add to this list over Spring Break.

  • Another starting place for inspiration is Statista

    • It is free if you login from the SU Library and is a great starting point.

    • Your project should go beyond the small curated Statista datasets but it may provide inspiration.

  • For this course, consider the topic choice to be a iterative process:

    1. Develop a topic and potential questions (but don’t become too attached).

    2. Explore provided data sources, Google, and AI using your question(s).

    3. Examine data that is somewhat related to question.

    4. Edit and refine topic and question based on data available.

Data Source Challenges

  • If you are interested in sports data or billboard music data and don’t know how to scrape data, I can help.


  • There are sports and music data available, but the websites often don’t make the datasets downloadable.


  • Instead, the data must be scraped as we did with the Box Office Mojo data.


  • Projects that require scraping data are doable, but data management will take more time and you should start ASAP.

Data Sourcing Example

Flexibility Leads to Interesting Data

  • Raw Excel Data with interesting variables for over 500 European cities.

  • Dataset is not directly related to initial topic but there are data to build a multi-page dashboard.

  • Going from initial obscure idea to interesting data took about 30 minutes.

  • We (TAs and I) are happy to help you with this process. Be flexible and curious.

Final Proposal

  • Due Tue. 4/8 - 48 Hour Grace Period

  • The final proposal draft should be 1-3 pages at most (double spaced, 12 pt. font).

    • Bullet points are preferred to paragraphs
  • Although your client(s) are hypothetical, it helps to envision your target client.

  • The goal of the proposal is to

    • explain in non-technical language why your dashboard will be useful.

    • convince the prospective client to trust you with their data.

  • How will client use the information provided in the dashboard to

    • improve their marketing plan or their purchasing and staffing plan?

    • improve their overall investment strategy or their supply chain strategy?

More about Proposal Final Draft

  • Due Tue. 4/8 - 48 Hour Grace Period

  • The proposal also provides a roadmap for the work you plan to do so that all group members are informed.

    • List tentative goals that you hope to achieve and ideas for how to achieve them.
  • FINAL Proposal should ALSO include:

    • a preliminary descriptive list of your planned pages.

    • a tentative plan for what data will be presented on each page, and how will it be presented (which plots or tables).

    • a list of which team member is primarily responsible for each page.

Proposals are TENTATIVE Agreements

  • Understandably, the work is not yet done, so your plans may change, which is okay.

  • Your final plots and tables are not done, but it is good to have a plan for what they will be.

  • If possible show explicit links between each question or goal you plan to address and the proposed component of the dashboard that addresses it.

  • In a ‘real world’ situation:

    • It is better not to provide final proprietary work products at proposal stage.

    • Ideas and plots should be polished enough to entice client.

  • A good proposal shows that you have creative ideas and have put some thought into how you can help the client.

  • In real life, proposals and client goals are more concrete, but in this course, I encourage you to experiment.

  • Again, it is okay if your plans change based on what you discover in the data.

Introduction to HW 5 - Part 1 Data

Today: Examine and clean data

  • First steps are to examine .csv file and to import and examine data in R.

  • Look for quirks that might be a challenge

    • Hint: Examine date_added column in the .csv file or in R
# import data and examine
nflx <- read_csv("data/netflix_titles.csv", show_col_types=F) |> glimpse()
Rows: 8,807
Columns: 12
$ show_id      <chr> "s1", "s2", "s3", "s4", "s5", "s6", "s7", "s8", "s9", "s1…
$ type         <chr> "Movie", "TV Show", "TV Show", "TV Show", "TV Show", "TV …
$ title        <chr> "Dick Johnson Is Dead", "Blood & Water", "Ganglands", "Ja…
$ director     <chr> "Kirsten Johnson", NA, "Julien Leclercq", NA, NA, "Mike F…
$ cast         <chr> NA, "Ama Qamata, Khosi Ngema, Gail Mabalane, Thabang Mola…
$ country      <chr> "United States", "South Africa", NA, NA, "India", NA, NA,…
$ date_added   <chr> "September 25, 2021", "September 24, 2021", "September 24…
$ release_year <dbl> 2020, 2021, 2021, 2021, 2021, 2021, 2021, 1993, 2021, 202…
$ rating       <chr> "PG-13", "TV-MA", "TV-MA", "TV-MA", "TV-MA", "TV-MA", "PG…
$ duration     <chr> "90 min", "2 Seasons", "1 Season", "1 Season", "2 Seasons…
$ listed_in    <chr> "Documentaries", "International TV Shows, TV Dramas, TV M…
$ description  <chr> "As her father nears the end of his life, filmmaker Kirst…

💥 Week 8 In-class Exercises - Q1-Q3 💥

Session ID: bua455s25

  • Q1. How many different date formats do you see in the date_added column in the .csv file?

  • Q2. How many different date formats do you see in the date_added column in the imported dataset in the Global Environment?

  • Q3. How many different commands are required to convert the date_added column to a date variable?

Cleaning These Data

New command for dealing with text data: separate

Review of lubridate commands

  • Next steps:
    • Select variables
    • use separate to split listed_in into 3 columns (Why?)
    • Convert date_added to a date variable and then create a year variable
nflx <- nflx |>
  select(show_id, type, date_added, release_year, rating, listed_in) |> # useful variables (subjective)

  separate(listed_in, sep=", ", into=c("g1", "g2", "g3")) |>        # separate out genre listings (notice ", ")
                                                                    # will result in lots of NAs
  mutate(date_added = mdy(date_added),                              # convert data_added to a date var
         year_added = year(date_added))                             # create a separate year var
head(nflx,3)
# A tibble: 3 × 9
  show_id type    date_added release_year rating g1       g2    g3    year_added
  <chr>   <chr>   <date>            <dbl> <chr>  <chr>    <chr> <chr>      <dbl>
1 s1      Movie   2021-09-25         2020 PG-13  Documen… <NA>  <NA>        2021
2 s2      TV Show 2021-09-24         2021 TV-MA  Interna… TV D… TV M…       2021
3 s3      TV Show 2021-09-24         2021 TV-MA  Crime T… Inte… TV A…       2021

💥 Week 8 In-class Exercises - Q4 💥

Session ID: bua455s25

separate is the reverse of an R command we have already learned.

  • What command is that?

Data Exploration and Additional Cleaning

Finding and Removing Data with Missing Values (Depends on Subject Matter)

  • Are there observations for which year_added is missing (NA)?

    • Cool trick: is.na(nflx$year_added) will create a vector of FALSE and TRUE values.

    • FALSE unless the observation value is NA.

    • R interprets FALSE as 0 and TRUE as 1 so you can sum this vector to find number of missing values.

sum(is.na(nflx$year_added))                # number of missing values (there are 10)
[1] 10
table(is.na(nflx$year_added))              # could also summarize with table

FALSE  TRUE 
 8797    10 
nflx <- nflx |> filter(!is.na(year_added)) # remove NA's and check again
sum(is.na(nflx$year_added))
[1] 0

Data Exploration and Additional Cleaning

  • It is not always valid to delete data unless it is for a specific valid transparent reason.

  • For example: Are there observations for which rating is NR (not-rated) or UR (un-rated)? Are there other issues?

  • In this case we delete observation because we are restricting our data to more common rating categories.

  • Remainder of the categories can be grouped logically.

table(nflx$rating) # examine categories

  66 min   74 min   84 min        G    NC-17       NR       PG    PG-13 
       1        1        1       41        3       79      287      490 
       R    TV-14     TV-G    TV-MA    TV-PG     TV-Y    TV-Y7 TV-Y7-FV 
     799     2157      220     3205      861      306      333        6 
      UR 
       3 
nflx <- nflx |> filter(!rating %in% c("NR", "UR", "66 min", "74 min", "84 min"))
table(nflx$rating) # examine again after filter

       G    NC-17       PG    PG-13        R    TV-14     TV-G    TV-MA 
      41        3      287      490      799     2157      220     3205 
   TV-PG     TV-Y    TV-Y7 TV-Y7-FV 
     861      306      333        6 

When can you delete data?

If you find raw uncleaned data on the internet or are given data by a client or co-worker, is it okay to delete observations?

Yes, under certain circumstances:

  1. You must be completely transparent about how you are limiting the data and why.
  • For example, I am limiting these data to specific time periods and specific genres.

  • Other time and genre categories have too little data.

  1. You should never delete outliers or unusual observations unless you can confirm they are errors or are not part of the data you are studying.
  • I am simplifying the data by removing categories that are not of primary interest.

  • Someone else might choose different categories and limit the data differently.

  • Always be transparent about data sources and all decisions you make.

  • I am also deleting three observations for which the ratings were missing.

Modifying text or numeric variables with ifelse

  • Numeric or text data can be converted to a categorical variable with a set of explicit ifelse statements.

  • ifelse is not the ONLY way to categorize data, but it’s my tool of choice.

  • The input values for ifelse are ALWAYS the same:

    • Input 1: a test where result is TRUE or FALSE.
    • Input 2: output value if the test result is TRUE.
    • Input 3: output value if the test result is FALSE.
x <- seq(9,90,9)    # create var x that is a sequence of values
(d <- tibble(x))    # convert x to a tibble (dataset) 
(d <- d |> mutate(y = ifelse(x < 35, "Low", NA))) # add variable y to dataset d
(d <- d |> mutate(y = ifelse(x >= 35 & x < 65, "Medium", y))) 
(d <- d |> mutate(y = ifelse(x >= 65, "High", y)))           

Modifying text multiple ifelse statements

  • The previous example can be (and should be) condensed into one mutate statement as shown here.
# create a dataset with 1 variable
x <- seq(9,90,9)
d <- tibble(x)
d <- d |> mutate(y = ifelse(x < 35, "Low", NA),                         
                 y = ifelse(x >= 35 & x < 65, "Medium", y),
                 y = ifelse(x >= 65, "High", y),
                 yF = factor(y, levels = c("Low", "Medium", "High"))) |> 
                              # levels option used to specify correct order
glimpse()
Rows: 10
Columns: 3
$ x  <dbl> 9, 18, 27, 36, 45, 54, 63, 72, 81, 90
$ y  <chr> "Low", "Low", "Low", "Medium", "Medium", "Medium", "Medium", "High"…
$ yF <fct> Low, Low, Low, Medium, Medium, Medium, Medium, High, High, High

Week 8 In-class Exercise

  • We will examine the Netflix ratings data again and simplify the categories to reflect minimum appropriate ages.

  • In this case, I have done some research on what these ratings mean.

  • In your projects, you will be expected to understand the values in your variables

    • You are likely to have to simplify your data categories for your data management goals.
  • My proposed solution for these data is to categorize both TV shows and Movies into four minimum age categories:

    • Minimum Age Categories: 0, 7, 13, 17

    • Minimum Age 0: “G”, “TV-G”, “TV-Y”

    • Minimum Age 7: “PG”, “TV-PG”, “TV-Y7”, “TV-Y7-FV”

    • Minimum Age 13: “PG-13”, “TV-14”

    • Minimum Age 17: “NC-17”, “TV-MA”, “R”

What we will do Today (Thu. 3/6):

  • Together we will fill in the blanks in the R chunk below

    • You will submit the entire chunk (with your added comments) on Blackboard in a text file.

    • Copy it into empty an text file and save it with your name.

    • This counts towards class participation for Thursday, 3/6 and is due by Friday, 3/7, at midnight.

table(nflx$rating)

       G    NC-17       PG    PG-13        R    TV-14     TV-G    TV-MA 
      41        3      287      490      799     2157      220     3205 
   TV-PG     TV-Y    TV-Y7 TV-Y7-FV 
     861      306      333        6 
# nflx <- nflx |>
#   mutate(min_age = ifelse(rating %in% c("G", "TV-G", "TV-Y"), 0, ____),
#          min_age = ifelse(rating %in% c("PG", "TV-PG", "TV-Y7", "TV-Y7-FV"), 7, _____),
#          min_age = ifelse(rating %in% c("PG-13", "TV-14"), ____, ____),
#          min_age = ifelse())

# table(nflx$min_age)

Another ifelse Example

  • We can use a similar technique to categorize when shows or movies were first shown (released).

  • This code is very similar to our simple numerical example shown previously.

  • In HW 5 - Part 1 (Now Available) you will:

    • Filter data to remove releases from before 1981.
#table(nflx$release_year)

nflx <- nflx |>
  mutate(release_period = ifelse(release_year <= 1980, "1925-1980", NA),
         release_period = ifelse(release_year > 1980 & release_year <= 2000, "1981-2000", release_period),
         release_period = ifelse(release_year > 2000 & release_year <= 2005, "2001-2005", release_period),
         release_period = ifelse(release_year > 2005 & release_year <= 2010, "2006-2010", release_period),
         release_period = ifelse(release_year > 2010 & release_year <= 2015, "2011-2015", release_period),
         release_period = ifelse(release_year > 2015, "2016-2021", release_period))

table(nflx$release_period)

1925-1980 1981-2000 2001-2005 2006-2010 2011-2015 2016-2021 
      130       424       297       655      1577      5629 

Cleaning Text Variables

  • The code that follows WILL NOT RUN until the min_age and release_period variables are created (previous chunks).

  • The next steps are tedious, but important, and will be different for every dataset you encounter.

  • On a quiz you should know how to use these text commands:

    • gsub, str_trim,

    • separate, paste,

    • tolower, toupper,

    • ifelse (covered in previous slides)


    • Not required but useful: use ?str_trim to also see str_squish and str_pad

💥 Week 8 In-class Exercises - Q5-Q6 💥

Session ID: bua455s25

  • str_trim is used in an upcoming chunk to trim spaces before characters in some observations.

  • Q5. Can gsub also be used for this purpose?


  • Type ?tolower into the console to bring up the help file for this command and toupper.

  • Copy and paste these lines into the console:

tolower("WhAT is the Result of this Command?")

toupper("WhAT is the Result of this Command?")
  • Answer Q6.

Multi-step Process to clean genre

  • Reshape data with pivot_longer and convert all text to lower case with to_lower

  • Examine in console using this command: cbind(table(nflx$genre))

nflx <- nflx |>
  select(show_id, type, year_added, release_period, min_age, g1, g2, g3) |>
  pivot_longer(cols=c("g1","g2","g3"), names_to="g", values_to="genre") 
                                                 # g is a place holder
nflx <- nflx |> select(!g) |> mutate(genre=tolower(genre)) |> glimpse()
Rows: 26,136
Columns: 6
$ show_id        <chr> "s1", "s1", "s1", "s2", "s2", "s2", "s3", "s3", "s3", "…
$ type           <chr> "Movie", "Movie", "Movie", "TV Show", "TV Show", "TV Sh…
$ year_added     <dbl> 2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021, 2…
$ release_period <chr> "2016-2021", "2016-2021", "2016-2021", "2016-2021", "20…
$ min_age        <dbl> 13, 13, 13, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17,…
$ genre          <chr> "documentaries", NA, NA, "international tv shows", "tv …

Multi-step Process to clean genre

  • Recall when separate was used above, delimiter was specified as “,”

    • That removed leading spaces.

    • Alternatively, you emove leading spaces with str_trim

  • Examine in console using this command: cbind(table(nflx$genre))

Remove Unnecessary Text

  • This text is unnecessary and redundant because of type variable in data.

  • Examine in console using this command: cbind(table(nflx$genre))

# remove the text `movies` and `tv shows` and `tv` from genre names
nflx <- nflx |>
  mutate(genre = gsub(" movies", "", genre),
         genre = gsub("movies", "", genre),
         genre = gsub(" tv shows", "", genre),
         genre = gsub("tv shows", "", genre),
         genre = gsub("tv ", "", genre),
         genre = gsub(" tv", "", genre))

Consolidate genre Categories using ifelse

  • Below I am only editing and consolidating categories I plan to use.

  • Examine in console using this command: cbind(table(nflx$genre))

nflx <- nflx |>
  mutate(genre = ifelse(genre == "drama", "dramas", genre),
         
         genre = ifelse(genre %in% c("stand-up comedy", 
                                     "stand-up comedy & talk shows"), 
                        "comedies", genre),

         genre = ifelse(genre %in% c("documentaries", "docuseries"), 
                        "docs", genre),
         
         genre = ifelse(genre %in% c("children & family", "kids'"), 
                        "kids", genre),

         genre = gsub("action & adventure", "action_adventr", genre))

Filter Data to Genres of Interest

nflx <- nflx |>
  filter(genre %in% c("action_adventr", "comedies", "docs", 
                      "dramas", "international", "kids"))
  • Examine results in console with these commands

    • cbind(table(nflx$genre))

    • table(nflx$type, nflx$genre)

Create Final Summary Dataset

In the HW 5 - Part 1 Example Dashboard, I use the nflx_tv dataset, TV Show observations from this dataset.

In HW 5 - Part 1, Students will create use the nflx_mv dataset, the Movie observations from this dataset.

nflx_wide <- nflx |>                            
  group_by(type, release_period, year_added, min_age, genre) |>
  summarize(n = n()) |>                   # n() tallies number of obs       
  pivot_wider(id_cols = c(type, release_period, year_added, min_age), # col. for each genre
              names_from = genre, 
              values_from = n)

nflx_wide[is.na(nflx_wide)] <- 0                              # why should NA's be 0 here?


nflx_tv <- nflx_wide |>                                       # example dashboard  made from nflx tv
  filter(type == "TV Show") 

# nflx_mv <- 

Optional Group Meetings with Me

  • After break I will post a sign-up schedule

  • Meetings will be 5-10 minutes on Zoom

  • After reading your submitted pre-proposal I will:

    • Ask your group questions

    • Provide feedback to guide you towards good data and a good plan

    • Answer question you might have before you write your final proposal.