Data Preparation and Conditional Formatting
2025-01-20
HW 2 is due Monday, 1/27/25
Review Question
Data Preparation Continued
Conditional Formatting - Introduction
In-class Polling (Session ID: bua345s25)
Open the data file Lecture 3 Review Question Titanic Data.xlsx
.
HINT: After you convert the data to an Excel Table and remove duplicates: - filter data for Gender (female) and CLASS (Second) - sort the data by AGE (Largest to smallest)
How many duplicates were removed from the Lecture 3 Real Estate dataset?
What is the year of construction (CONSTRUCTED IN) of the oldest house that meets these criteria:
Now that we’ve answered this question, we clear all filters and RESET our worksheet.
Clear Filter from
Verify that we have all 1009 observations showing.
Select any column and look at lower right corner.
It should say Count 1010
which includes top row of column names.
How many houses in the real estate dataset meet these criteria?
HINT: The row numbers in filtered data are not helpful, but you can select one column and look at the count in the lower corner.
Count = Number of Houses + Variable Row
Number of Houses = Count - 1
Not on PointSolutions:
How many properties in our data claim to have both hardwood floors and a fireplace?
In Lecture 2 we discussed how to remove duplicates.
Other data cleaning tasks include finding and correcting typos and missing values
The data on Worksheet 2 show the top grossing Movie in the United States for each day in 2024.
These data have some missing values and typos that we will find and address.
First, we convert this sheet to an Excel table.
Click the triangle next to MONTH
and scroll through the listed values to see if (Blanks)
are listed at the bottom.
Deselect all values EXCEPT (Blanks)
.
Fill in these values using the month information in the adjacent date column.
Check other column to confirm there are no missing values, (Blanks)
.
How many missing values were there in the MONTH
column of movie data on Worksheet 2?
In this case the missing value was straightforward to find.
What if we have data with a missing numeric value?
It is never a good idea to delete observations, because that will bias the data.
Instead, we maintain all the data AND document the missing values.
If possible, we do some research to find the missing value(s) in other data sources.
One Advanced method (IMPUTATION):
Replace missing value(s) with imputed value(s).
Imputation estimates a logical substitute for a missing value based on full data set.
Goal of imputation is to not bias data summaries.
Imputation methods are covered in more advanced analytics courses.
Text data is prone to typos.
We will carefully check the TITLE
column.
The key is to look for text that is similar to another title above or below it but seems incorrect.
If you are unsure if the text value is an actual typo, you can use an AI tool to verify.
Carefully examine the values in the TITLE
column by scolling through the list of unique values.
Use AI to determine if the suspected typo is movie that was released in 2024.
Correct the value if needed.
Fill in the blank:
Wicked was the number one movie in the United States in 2024 for ____
days.
After cleaning data, the next common step is descriptive analytics which can be done with conditional formatting.
Conditional Formatting allows us to visually compare data values efficiently.
For example, how do these airports compare for percent of flights on time?
Convert these data to an Excel Table. (Insert
> Table
)
Which two airports are in the red (worst) category?
If intended audience is color-blind, color coding is not helpful.
Another option is conditional formatting with ‘Icon Sets’
Value ranges are assigned a symbol.
Go to Home Tab.
Select % ON-TIME
column.
Click
Conditional Formatting
Icon Sets
% ON-TIME
‘Largest to Smallest’How many airports get a green checkmark using the icon sets (3 categories) in conditional formatting?
NOTE: It is easier to answer this question after you have sorted ‘% On-TIME’ from ‘Largest to Smallest’.
We can use both the color formatting and the icon set on the same variable column.
Note that they they have different cutoffs becuase they have different numbers of categories.
We can see which airport(s) are classified differently using these two formatting options.
Filtering
Data cleaning
Introduction to Conditional Formatting
HW 2 is due Monday, 1/27/25
To submit an Engagement Question or Comment about material from Lecture 3: Submit it by midnight today (day of lecture).