BUA 345 - Lecture 2

Data Preparation in Excel

Author

Penelope Pooler Eisenbies

Published

January 13, 2025

Housekeeping

Today’s plan

  • Review Question

  • Text Functions

    • Parsing (Splitting) Text
    • Concatenating (Joining) Text
  • Excel Tables

    • Removing Duplicates
    • Quick Calculations
    • Intro to Sorting and Filtering


In-class Polling (Session ID: bua345s25)

Lecture 2 In-class Exercise - Q1

Review from Lecture 1 and MAS 261 (or previous course)

Open the small data file, Lecture 2 Review Question.xlsx

Use relative and absolute cell references as shown to determine the value of the SMALLEST NEGATIVE deviation.

Round your answer to two decimal places.

Recall from MAS 261: A deviation is the difference from the mean, Observation - Mean

Parsing Text

  • You work at Jet Blue and plan to offer flights to Rapid City, South Dakota

  • Flight data for the state of South Dakota for 2024 is shown in the first sheet of Lecture 2 Excel Worksheets.xlsx.

  • We ONLY want data for flights TO Rapid City so we need to

    1. Separate ‘Origin’ and ‘Destination’ into TWO separate columns

    2. Subset (filter) data to observations with Rapid City as ‘Destination’

Parsing Text Con’t

In Excel:

  1. Select column with text we want to ‘parse’ or ‘split’, column D.

  2. Notice that between each Origin and Destination in the data there is a ‘-’.

  3. Click on ‘Text to Columns’ in Data tab.

  4. In the pop-up window, select Delimited and click Next.

Parsing Text Con’t

In Excel:

  1. Under Delimiters select Other and specify .

    • is the delimiter or separator between ORIGIN and DESTINATION in our data.
  2. The Data preview window below should show that the two variables are split correctly.

  3. Click Finish (No need for Step 3).

Data set now has two SEPARATE variables, ORIGIN, and DESTINATION.

Filtering Data

We want to filter the data to rows (observations) that have Rapid City, SD as the destination.

  1. Select the DESTINATION column and click on the Filter icon.

  2. Click on the triangle that appears next to the variable name.

  3. Uncheck ‘Select All’

Filtering Data Con’t

We want to filter the data to rows (observations) that have Rapid City, SD as the destination.

  1. Scroll down, or enter Rapid City, SD in the search window.

  2. Check the box next to Rapid City, SD and click OK.

  3. Click grey triangle in upper left corner of sheet to select whole sheet and then copy and paste it into Sheet 2.

    • Green dashed lines between rows verify that only the filtered rows are selected.

Formatting Filtered Data

  1. With whole second sheet (Flights - Worksheet 2) selected:

    • click Cells > Format > Autofit Column Width.
  2. Save your Excel file.

Lecture 2 In-class Exercise - Q2-Q3


Question 2. How many observations are in this new data set? That is, how many flights were there to Rapid City, SD in 2024?

  • HINT: Exclude first row which lists the variable names.


Question 3. Now filter the data on Worksheet 2 to determine how many of the flights to Rapid City in 2024 were on Allegiant Air.

  • HINT: Select ‘CARRIER’ column and filter for ‘Allegiant Air’ only.


Undo the filter option from Question 3 before we continue.

Concatentating Text

  • Sometimes we may want to concatenate or join text from two variables into one.

  • This is the reverse of the parsing what we just did.

  • We will take the full data set (148 observations) on Worksheet 2 and create a new variable Origin / Destination.

  • This is done using the =CONCAT command in Excel and uses the RELATIVE cell references from Lecture 1.

  • We start by labeling our new column:

Concatentating Text Cont’d

  1. In cell F2, type the following formula and click Enter or Return:
    • =CONCAT(D2, “ / ”, E2)
    • Note: Quotes are REQUIRED and that there is a space before and after the slash.

  1. Copy and paste cell F2 into cells F3 through F149.

Delinking Columns

  • Column F is LINKED to columns D and E.

  • If we delete the text in cell D2, it will also disappear, in cell F2.

  • We can resolve this issue for ANY column with a formula, by replacing the formula with the values shown in the cells.

Delinking Columns

  1. Select column we want to delink, column F and copy it.

    • Ctrl + C or Cmd + C
  2. Right click to bring up Paste Options and select Values (V) (clipboard with numbers).

  • This replaces formulas with the values in the cells and delinks that text from columns used to create it.

  • We can then DELETE Origin and Destination and our new column will not change.

Excel Tables

  • Worksheet 3 has a famous dataset that shows the passengers on the ill-fated Titanic that sank on April 15th, 1912.

  • We want to convert these data to an Excel Table to make our work more efficient.

  • Note that these data are located in

    • Columns: A – F
    • Rows: 1 - 1311

Excel Tables Cont’d

  1. Click on any NON-EMPTY cell in the worksheet.

  2. In the ‘Tables’ section of the ‘Insert’ tab, click on ‘Table’

  3. Verify (or fill in if incorrect) location of all of your data, and click ‘OK’

Excel Table Created!

  • This Excel table has useful features so we can:

    • remove duplicates automatically.

    • do quick(er) calculations.

    • sort and filter data efficiently.

Duplicates

  • Notice that the 2nd and 3rd rows appear to be duplicates.

  • These errors are not always so easy to spot and there may be others in the data.

Removing Duplicates

  1. Click on any cell in table so that Table Design option appears and then click on it.

  2. Click on Remove Duplicates

Removing Duplicates Cont’d

  1. We opt to select only NAME and AGE to maximize the chance of finding all duplicates. Cick OK.

  2. We see that only one duplicate was removed.

  • Note that the variables selected affects how duplicates are determined.

  • This choice depends on your data and may require trial and error.

Excel Table Calculations

  • The Fares (ticket prices) are what the passengers paid in 1912 in British Pounds.

  • Convert these fares to today’s British Pound (Fare x 146) as follows:

  1. Type FARE TODAY in G1 and notice that table automatically expands.

    • Widen column if needed.
  2. In cell G2 type =F2*146 and click Enter.

  3. Entire column is instantly populated.

Excel Table Calculations Cont’d

  • Convert these fares in today’s British Pounds to today’s U.S. dollars.
  1. Type FARE TODAY ($) in H1 and notice (again) that table automatically expands.

    • Widen column if needed.
  2. In cell H2 type =G2*1.23 and click Enter.

  3. Entire column is instantly populated.

  4. Select column H and convert

Sorting Excel Tables

  • We can easily sort and filter data in Excel tablesto answer questions.

  • For example:

    • How much was the most expensive ticket in US Dollars?
  1. Click triangle next to FARE TODAY ($).

    • This opens a menu for sorting and filtering.
  2. Select Sort Largest to Smallest.

Lecture 2 In-class Exercise - Q4


Based on the sorted data from the Titanic, what was the price of the most expensive fare sold in today’s U.S. dollars?

Round answer to closest whole dollar.

Sorting and Filtering Excel Tables

  • We can sort and filter the data to find the age of the oldest male passenger that survived.
  1. Filter SURVIVED? to Yes.

  2. Filter GENDER to male.

  3. Sort AGE Largest to Smallest.

Lecture 2 In-class Exercise - Q5-Q6


Question 5. What is the age of the oldest male to survive the titanic disaster?


Question 6. What was the age of the oldest female in THIRD class to survive?

  • HINT: Switch the GENDER filter to female and filter CLASS to Third.

Key Points from Today

  • Text Functions

    • Text to Columns is used for parsing (splitting) data.
    • =CONCAT function is used for joining text from 2 columns.
  • Excel Tables

    • Expedites data cleaning
    • Automates column calcuations
    • Includes efficient sorting and filtering options
  • Note: These skills are excellent for single tasks.

    • If you plan to repeat these tasks on multiple similar data sets 5 or 10 or 100 times, then it’s time to learn some coding in R or Python.

    • For 1 or 2 (or 3) tasks, Excel may be more efficient

To submit an Engagement Question or Comment about material from Lecture 2: Submit it by midnight today (day of lecture).