Housekeeping

HW 2 is due Monday, 1/27/25

Today’s plan 📋

  • Review Question

  • Data Preparation Continued

    • More on Excel Tables
    • Number Filters
    • Text Filters
    • Data Cleaning – Missing Values
    • Data Cleaning – Typos
  • Conditional Formatting - Introduction


In-class Polling (Session ID: bua345s25)

💥 Lecture 3 In-class Exercise - Q1 💥

Open the data file Lecture 3 Review Question Titanic Data.xlsx.

  1. Convert these data to an Excel Table.
  2. Check for duplicates based on name and age and remove if needed.
  3. Find the highest fare price (in 1912 UK pounds) paid by a female in 2nd class that is 50 or older (rounded to closest £).

HINT: After you convert the data to an Excel Table and remove duplicates: - filter data for Gender (female) and CLASS (Second) - sort the data by AGE (Largest to smallest)

Excel Tables

  • You are a real estate agent and have data with 1037 obs. (rows).

  • Convert these to an Excel Table:

Remove Duplicates

  1. Check for duplicates using all variables except OBS.NO.

    • OBS.NO is unique for each row.
  2. In the next polling question, indicate how many duplicates were found and removed.

💥 Lecture 3 In-class Exercise - Q2 💥


How many duplicates were removed from the Lecture 3 Real Estate dataset?

Number Filters

  • Now we will use Number Filters to find a house that meets our client’s needs.

  • They want a house with

    • 4 or more bedrooms

      • BEDROOMS filtered to 4 and 5
    • 3 or more bathrooms

      • BATHROOMS filtered to 3 or more
    • a listed price of $200K or less.

  • These first two conditions are straightforward to implement.

Number Filters Cont’d

  1. In the price column click the filter icon.

  2. Click Number Filters > Less Than or Equal To...

  3. Complete the pop-up box as shown and click OK.

💥 Lecture 3 In-class Exercise - Q3 💥


What is the year of construction (CONSTRUCTED IN) of the oldest house that meets these criteria:

  • 4 or more bedrooms
  • 3 or more bathrooms
  • a listed price of $200K or less.

Clearing Filters

  • Now that we’ve answered this question, we clear all filters and RESET our worksheet.

  • Clear Filter from

    • PRICE
    • BEDROOMS
    • BATHROOMS
  • Verify that we have all 1009 observations showing.

  • Select any column and look at lower right corner.

  • It should say Count 1010 which includes top row of column names.

Another Number Filters Example

  • A DIFFERENT client is looking for a house and they want:

    • Newer Construction
      • CONSTRUCTED IN filtered to 2010 or greater
    • A price of $250K or less.
      • PRICE filtered to 250000 or less
  • How many houses fit their criteria?

💥 Lecture 3 In-class Exercise - Q4 💥


How many houses in the real estate dataset meet these criteria?

  • Newer Construction (constructed in 2010 or later)
  • A price (starting price) of $250K or less.


  • HINT: The row numbers in filtered data are not helpful, but you can select one column and look at the count in the lower corner.

    • Count = Number of Houses + Variable Row

    • Number of Houses = Count - 1


  • Clear all filters after answering this question.

Text Filters and Sorting

  • Another client is looking for a house with

    • a solar power generator
  • What is the LEAST EXPENSIVE house with this amenity?

  1. Filter OTHER AMENITIES using the ‘Text Filter’ option ‘Contains’ with the text solar (not case-sensitive)

  2. Sort PRICE from ‘Smallest to largest’

💥 Lecture 3 In-class Exercise - Q5 💥


  • What is the price of the least expensive house that claims to have solar power?


  • Not on PointSolutions:

    • Given this listed price, do you think this claim should be verified with in person?


  • Clear all filters after answering this question.

Multiple Text Filters

  • A final client is looking for

    • a house with a Fireplace AND Hardwood Floors.
  • Filter OTHER AMENITIES using both of the ‘Text Filter’ options shown.

  • How many properties meet these criteria?

💥 Lecture 3 In-class Exercise - Q6 💥


How many properties in our data claim to have both hardwood floors and a fireplace?

Data Cleaning

  • In Lecture 2 we discussed how to remove duplicates.

  • Other data cleaning tasks include finding and correcting typos and missing values

  • The data on Worksheet 2 show the top grossing Movie in the United States for each day in 2024.

  • These data have some missing values and typos that we will find and address.

  • First, we convert this sheet to an Excel table.

Missing Values

  1. Click the triangle next to MONTH and scroll through the listed values to see if (Blanks) are listed at the bottom.

  2. Deselect all values EXCEPT (Blanks).

  3. Fill in these values using the month information in the adjacent date column.

  4. Check other column to confirm there are no missing values, (Blanks).

💥 Lecture 3 In-class Exercise - Q7 💥



How many missing values were there in the MONTH column of movie data on Worksheet 2?

Missing Values and Imputation

  • In this case the missing value was straightforward to find.

  • What if we have data with a missing numeric value?

  • It is never a good idea to delete observations, because that will bias the data.

  • Instead, we maintain all the data AND document the missing values.

  • If possible, we do some research to find the missing value(s) in other data sources.

  • One Advanced method (IMPUTATION):

    • Replace missing value(s) with imputed value(s).

    • Imputation estimates a logical substitute for a missing value based on full data set.

    • Goal of imputation is to not bias data summaries.

    • Imputation methods are covered in more advanced analytics courses.

Typos

  • Text data is prone to typos.

  • We will carefully check the TITLE column.

  • The key is to look for text that is similar to another title above or below it but seems incorrect.

  • If you are unsure if the text value is an actual typo, you can use an AI tool to verify.

Finding and Correcting Typos

  1. Carefully examine the values in the TITLE column by scolling through the list of unique values.

  2. Use AI to determine if the suspected typo is movie that was released in 2024.

  3. Correct the value if needed.

💥 Lecture 3 In-class Exercise - Q8 💥



Fill in the blank:

Wicked was the number one movie in the United States in 2024 for ____ days.

Conditional Formatting

  • After cleaning data, the next common step is descriptive analytics which can be done with conditional formatting.

  • Conditional Formatting allows us to visually compare data values efficiently.

  • For example, how do these airports compare for percent of flights on time?

  • Convert these data to an Excel Table. (Insert > Table)

Formatting by Color

Cells are assigned colors based on their value.

  1. Go to Home Tab.

  2. Select % ON-TIME column.

  3. Click

    • Conditional Formatting
    • Color Scales
    • Select first color scale option in the upper right corner.
  • Values are now color-coded

  • Which two airports are in the red (worst) category?

💥 Lecture 3 In-class Exercise - Q9 💥


Which two airports are in the red (worst) category?

A. ALT & EWR

B. IAH & MIA

C. DEN & MCO

D. EWR & LAX

E. PHL & LAX

  • After answering this question, clear this formatting.

Formatting by Icon

  • If intended audience is color-blind, color coding is not helpful.

  • Another option is conditional formatting with ‘Icon Sets’

  • Value ranges are assigned a symbol.

  1. Go to Home Tab.

  2. Select % ON-TIME column.

  3. Click

    • Conditional Formatting
    • Icon Sets
    • Select first option in upper left corner.
    • Sort % ON-TIME ‘Largest to Smallest’

💥 Lecture 3 In-class Exercise - Q10 💥


How many airports get a green checkmark using the icon sets (3 categories) in conditional formatting?


NOTE: It is easier to answer this question after you have sorted ‘% On-TIME’ from ‘Largest to Smallest’.


  • After answering this question, DO NOT clear this formatting.

Format Choice can Effect Results

  • We can use both the color formatting and the icon set on the same variable column.

  • Note that they they have different cutoffs becuase they have different numbers of categories.

  • We can see which airport(s) are classified differently using these two formatting options.

  • When you are done you can save this worksheet with the conditional formatting or clear all formatting.