---
title: "BUA 345 - Lecture 3"
author: "Penelope Pooler Eisenbies"
date: last-modified
lightbox: true
toc: true
toc-depth: 3
toc-location: left
toc-title: "Table of Contents"
toc-expand: 1
format:
html:
code-line-numbers: true
code-fold: true
code-tools: true
execute:
echo: fenced
---
## Housekeeping
```{r setup, echo=FALSE, warning=F, message=F, include=F}
#| include: false
# this line specifies options for default options for all R Chunks
knitr::opts_chunk$set(echo=F)
# suppress scientific notation
options(scipen=100)
# install helper package that loads and installs other packages, if needed
if (!require("pacman")) install.packages("pacman", repos = "http://lib.stat.cmu.edu/R/CRAN/")
# install and load required packages
pacman::p_load(pacman,tidyverse, magrittr, olsrr, shadowtext, mapproj, knitr, kableExtra,
countrycode, usdata, maps, RColorBrewer, gridExtra, ggthemes, gt,
mosaicData, epiDisplay, vistributions, psych, tidyquant, dygraphs)
# verify packages
# p_loaded()
```
```{r bom_data_prep, eval=F}
bom <- read_csv("data/bom2025.csv", show_col_types = F, skip=9) |>
dplyr::select(1,8,9) |>
filter(!is.na(Gross)) |>
mutate(date = dmy(paste(Date, 2025)))
#bom$date[is.na(bom$date)] <- dmy(29022028) # next leap year is 2028
bom <- bom |>
mutate(DATE = date,
DAY= wday(DATE, label=T, abbr=T),
MONTH = month(DATE, label=T, abbr=T),
GROSS = gsub("$", "", Gross, fixed = T),
GROSS = gsub(",", "", GROSS, fixed = T) |> as.numeric(),
TITLE = `#1 Release`) |>
arrange(Date) |>
dplyr::select(DATE, MONTH, DAY, TITLE, GROSS) |>
write_csv("Excel Files/BOM_Clean.csv")
```
**HW 1 was due Monday, 1/19/26**
**HW 2 is due Monday, 1/26/26**
- You have unlimited attempts to redo assignments.
### Today's plan:
- Review Question
- Data Preparation Continued
- More on Excel Tables
- Number Filters
- Text Filters
- Data Cleaning – Missing Values
- Data Cleaning – Typos
- Conditional Formatting - Introduction
<br>
##
### Lecture 3 In-class Exercise - Q1
[***Poll Everywhere***](https://pollev.com/penelopepoolereisenbies685){target="_blank"} - My User Name: **penelopepoolereisenbies685**
Open the data file `Lecture 3 Review Question Titanic Data.xlsx`.
1. Convert these data to an Excel Table.
2. Check for duplicates based on name and age and remove if needed.
3. **Find the highest fare price (in 1912 UK pounds) paid by a female in 2nd class that is 50 or older (rounded to closest £).**
::: fragment
**HINT:** After you convert the data to an Excel Table and remove duplicates: - filter data for Gender (female) and CLASS (Second) - sort the data by AGE (Largest to smallest)
:::
## Excel Tables
- You are a real estate agent and have data with 1037 obs. (rows).
::: fragment
{fig-align="center"}
:::
- Convert these to an Excel Table:
::: fragment
{fig-align="center"}
:::
## Remove Duplicates
::::::: columns
::: {.column width="48%"}
1. Check for duplicates using all variables except `OBS.NO`.
- `OBS.NO` is unique for each row.
2. In the next polling question, indicate how many duplicates were found and removed.
:::
::: {.column width="4%"}
:::
:::: {.column width="48%"}
::: fragment
{fig-align="center"}
:::
::::
:::::::
##
### Lecture 3 In-class Exercise - Q2
[***Poll Everywhere***](https://pollev.com/penelopepoolereisenbies685){target="_blank"} - My User Name: **penelopepoolereisenbies685**
<br>
How many duplicates were removed from the Lecture 3 Real Estate dataset in Worksheet 1?
## Number Filters
:::::::: columns
::: {.column width="48%"}
- Now we will use `Number Filters` to find a house that meets our client's needs.
- They want a house with
- 4 or more bedrooms
- `BEDROOMS` filtered to 4 and 5
- 3 or more bathrooms
- `BATHROOMS` filtered to 3 or more
- a listed price of `$200K` or less.
- These first two conditions are straightforward to implement.
:::
::: {.column width="4%"}
:::
::::: {.column width="48%"}
::: fragment
{fig-align="center" height="3in"}
:::
::: fragment
{fig-align="center" height="3in"}
:::
:::::
::::::::
##
:::::::: columns
::: {.column width="48%"}
### Number Filters Cont'd
1. In the price column click the filter icon.
2. Click `Number Filters` \> `Less Than or Equal To...`
3. Complete the pop-up box as shown and click `OK`.
:::
::: {.column width="4%"}
:::
::::: {.column width="48%"}
::: fragment
{fig-align="center"}
:::
::: fragment
{fig-align="center"}
:::
:::::
::::::::
##
### Lecture 3 In-class Exercise - Q3
[***Poll Everywhere***](https://pollev.com/penelopepoolereisenbies685){target="_blank"} - My User Name: **penelopepoolereisenbies685**
<br>
What is the year of construction (CONSTRUCTED IN) of the oldest house that meets these criteria:
- 4 or more bedrooms
- 3 or more bathrooms
- a listed price of \$200K or less.
##
::::::::: columns
::: {.column width="48%"}
### Clearing Filters
- Now that we’ve answered this question, we clear all filters and RESET our worksheet.
- Clear Filter from
- PRICE
- BEDROOMS
- BATHROOMS
- Verify that we have all 1009 observations showing.
- Select any column and look at lower right corner.
- It should say `Count 1010` which includes top row of column names.
:::
::: {.column width="4%"}
:::
:::::: {.column width="48%"}
::: fragment
{fig-align="center" height="2in"}
:::
::: fragment
{fig-align="center" height="2in"}
:::
::: fragment
{fig-align="center" height="2in"}
:::
::::::
:::::::::
##
### Another Number Filters Example
:::::::: columns
::: {.column width="48%"}
- A DIFFERENT client is looking for a house and they want:
- Newer Construction
- `CONSTRUCTED IN` filtered to 2010 or greater
- A price of `$250K` or less.
- `PRICE` filtered to 250000 or less
- How many houses fit their criteria?
:::
::: {.column width="4%"}
:::
::::: {.column width="48%"}
::: fragment
{fig-align="center" height="2in"}
:::
::: fragment
{fig-align="center" height="2in"}
:::
:::::
::::::::
##
### Lecture 3 In-class Exercise - Q4
[***Poll Everywhere***](https://pollev.com/penelopepoolereisenbies685){target="_blank"} - My User Name: **penelopepoolereisenbies685**
<br>
How many houses in the real estate dataset meet these criteria?
- Newer Construction (constructed in 2010 or later)
- A price (starting price) of \$250K or less.
<br>
- HINT: The row numbers in filtered data are not helpful, but you can select one column and look at the count in the lower corner.
- Count = Number of Houses + Variable Row
- Number of Houses = Count - 1
<br>
- **Clear all filters after answering this question.**
##
### Text Filters and Sorting
::::::::: columns
::: {.column width="48%"}
- Another client is looking for a house with
- a solar power generator
- What is the LEAST EXPENSIVE house with this amenity?
1. Filter `OTHER AMENITIES` using the ‘Text Filter’ option ‘Contains’ with the text solar (not case-sensitive)
2. Sort `PRICE` from 'Smallest to largest'
:::
::: {.column width="4%"}
:::
:::::: {.column width="48%"}
::: fragment
{fig-align="center" height="2in"}
:::
::: fragment
{fig-align="center" height="2in"}
:::
::: fragment
{fig-align="center" height="1in"}
:::
::::::
:::::::::
##
### Lecture 3 In-class Exercise - Q5
[***Poll Everywhere***](https://pollev.com/penelopepoolereisenbies685){target="_blank"} - My User Name: **penelopepoolereisenbies685**
<br>
- What is the price of the least expensive house that claims to have solar power?
<br>
- Follow-up Question:
- Given this listed price, do you think this claim should be verified with in person?
<br>
- **Clear all filters after answering this question.**
##
### Multiple Text Filters
:::::::: columns
::: {.column width="48%"}
- A final client is looking for
- a house with a Fireplace AND Hardwood Floors.
- Filter `OTHER AMENITIES` using both of the 'Text Filter' options shown.
- How many properties meet these criteria?
:::
::: {.column width="4%"}
:::
::::: {.column width="48%"}
::: fragment
{fig-align="center" height="2in"}
:::
::: fragment
{fig-align="center" height="2in"}
:::
:::::
::::::::
##
### Lecture 3 In-class Exercise - Q6
[***Poll Everywhere***](https://pollev.com/penelopepoolereisenbies685){target="_blank"} - My User Name: **penelopepoolereisenbies685**
<br>
How many properties in our data claim to have both hardwood floors and a fireplace?
## Data Cleaning
- In Lecture 2 we discussed how to remove duplicates.
- Other data cleaning tasks include finding and correcting typos and missing values
- The data on Worksheet 2 show the top grossing Movie in the United States for each day in 2025.
- These data have some missing values and typos that we will find and address.
- First, we convert this sheet to an Excel table.
::: fragment
{fig-align="center"}
:::
##
::::::::: columns
::: {.column width="48%"}
### Missing Values
{fig-align="center" height="1in"}
1. Click the triangle next to `MONTH` and scroll through the listed values to see if `(Blanks)` are listed at the bottom.
2. Deselect all values EXCEPT `(Blanks)`.
3. Fill in these values using the month information in the adjacent date column.
4. Check other column to confirm there are no missing values, `(Blanks)`.
:::
::: {.column width="4%"}
:::
:::::: {.column width="48%"}
::: fragment
{fig-align="center" height="4in"}
:::
::: fragment
{fig-align="center" height="1in"}
:::
::: fragment
{fig-align="center" height="1in"}
:::
::::::
:::::::::
##
### Lecture 3 In-class Exercise - Q7
[***Poll Everywhere***](https://pollev.com/penelopepoolereisenbies685){target="_blank"} - My User Name: **penelopepoolereisenbies685**
<br>
{fig-align="center" height="1in"}
<br>
How many missing values were there in the `MONTH` column of movie data on Worksheet 2?
## Missing Values and Imputation
- In this case the missing value was straightforward to find.
- What if we have data with a missing numeric value?
- It is never a good idea to delete observations, because that will bias the data.
- Instead, we maintain all the data AND document the missing values.
- If possible, we do some research to find the missing value(s) in other data sources.
- **One Advanced method (IMPUTATION):**
- Replace missing value(s) with imputed value(s).
- Imputation estimates a logical substitute for a missing value based on full data set.
- Goal of imputation is to not bias data summaries.
- Imputation methods are covered in more advanced analytics courses.
## Typos
{fig-align="center" height="1in"}
- Text data is prone to typos.
- We will carefully check the `TITLE` column.
- The key is to look for text that is similar to another title above or below it but seems incorrect.
- If you are unsure if the text value is an actual typo, you can use an AI tool to verify.
##
### Finding and Correcting Typos
::::::: columns
::: {.column width="48%"}
1. Carefully examine the values in the `TITLE` column by scolling through the list of unique values.
2. Use AI to determine if the suspected typo is a movie that was released in 2025.
3. Correct the value if needed.
:::
::: {.column width="4%"}
:::
:::: {.column width="48%"}
::: fragment
{fig-align="center" height="3in"}
:::
::::
:::::::
::: fragment
{fig-align="center" height="2in"}
:::
##
### Lecture 3 In-class Exercise - Q8
[***Poll Everywhere***](https://pollev.com/penelopepoolereisenbies685){target="_blank"} - My User Name: **penelopepoolereisenbies685**
<br>
{fig-align="center" height="1in"}
<br>
**Fill in the blank:**
Sinners was the number one movie in the United States in 2025 for `____` days.
## Conditional Formatting
::::::: columns
::: {.column width="48%"}
- After cleaning data, the next common step is descriptive analytics which can be done with conditional formatting.
- Conditional Formatting allows us to visually compare data values efficiently.
- For example, how do these airports compare for percent of flights on time?
- Convert these data to an Excel Table. (`Insert` \> `Table`)
:::
::: {.column width="4%"}
:::
:::: {.column width="48%"}
::: fragment
{fig-align="center"}
:::
::::
:::::::
##
:::::::: columns
::: {.column width="48%"}
### Formatting by Color
Cells are assigned colors based on their value.
1. Go to Home Tab.
2. Select `% ON-TIME` column.
3. Click
- `Conditional Formatting`
- `Color Scales`
- Select first color scale option in the upper right corner.
- Values are now color-coded
- Which two airports are in the red (worst) category?
:::
::: {.column width="4%"}
:::
::::: {.column width="48%"}
::: fragment
{fig-align="center" height="3in"}
:::
::: fragment
{fig-align="center" height="3in"}
:::
:::::
::::::::
##
### Lecture 3 In-class Exercise - Q9
[***Poll Everywhere***](https://pollev.com/penelopepoolereisenbies685){target="_blank"} - My User Name: **penelopepoolereisenbies685**
<br>
Which two airports are in the red (worst) category?
::::::: columns
::: {.column width="48%"}
- `ALT` & `EWR`
- `IAH` & `MIA`
- `DEN` & `MCO`
- `EWR` & `LAX`
- `PHL` & `LAX`
<br>
- **After answering this question, clear this formatting.**
:::
::: {.column width="4%"}
:::
:::: {.column width="48%"}
::: fragment
{fig-align="center" height="3in"}
:::
::::
:::::::
##
:::::::: columns
::: {.column width="58%"}
### Formatting by Icon
- If intended audience is color-blind, color coding is not helpful.
- Another option is conditional formatting with ‘Icon Sets’
- Value ranges are assigned a symbol.
1. Go to Home Tab.
2. Select `% ON-TIME` column.
3. Click
- `Conditional Formatting`
- `Icon Sets`
- Select first option in upper left corner.
- Sort `% ON-TIME` 'Largest to Smallest'
:::
::: {.column width="4%"}
:::
::::: {.column width="38%"}
::: fragment
{fig-align="center" height="3in"}
:::
::: fragment
{fig-align="center" height="3in"}
:::
:::::
::::::::
##
### Lecture 3 In-class Exercise - Q10
[***Poll Everywhere***](https://pollev.com/penelopepoolereisenbies685){target="_blank"} - My User Name: **penelopepoolereisenbies685**
<br>
How many airports get a green checkmark using the icon sets (3 categories) in conditional formatting?
<br>
**NOTE:** It is easier to answer this question after you have sorted ‘% On-TIME’ from ‘Largest to Smallest’.
<br>
- **After answering this question, DO NOT clear this formatting.**
##
### Format Choice can Effect Results
::::::: columns
::: {.column width="48%"}
- We can use both the color formatting and the icon set on the same variable column.
- Note that they they have different cutoffs becuase they have different numbers of categories.
- We can see which airport(s) are classified differently using these two formatting options.
:::
::: {.column width="4%"}
:::
:::: {.column width="48%"}
::: fragment
{fig-align="center"}
:::
::::
:::::::
- When you are done you can save this worksheet with the conditional formatting or clear all formatting.
##
### Key Points from Today
- **Filtering**
- Number Filters – required for continuous data
- Text filter – useful for filtering by category or description
- **Data cleaning**
- Checking data for duplicates (Lecture 2), missing values, and typos
- Never delete data without justification
- Document and report any changes made to data
- **Introduction to Conditional Formatting**
- Very useful for visualizing data
- Number of categories affects classification
::: fragment
**HW 2 is due Monday, 1/26/26**
**To submit an Engagement Question or Comment about material from Lecture 3:** Submit it by midnight today (day of lecture).
:::