Data Preparation in Excel
2025-01-13
Review Question
Text Functions
Excel Tables
In-class Polling (Session ID: bua345s25)
Review from Lecture 1 and MAS 261 (or previous course)
Open the small data file, Lecture 2 Review Question.xlsx
Use relative and absolute cell references as shown to determine the value of the SMALLEST NEGATIVE deviation.
Round your answer to two decimal places.
Recall from MAS 261: A deviation is the difference from the mean, Observation - Mean
You work at Jet Blue and plan to offer flights to Rapid City, South Dakota
Flight data for the state of South Dakota for 2024 is shown in the first sheet of Lecture 2 Excel Worksheets.xlsx
.
We ONLY want data for flights TO Rapid City so we need to
Separate ‘Origin’ and ‘Destination’ into TWO separate columns
Subset (filter) data to observations with Rapid City as ‘Destination’
In Excel:
Under Delimiters
select Other
and specify –
.
–
is the delimiter or separator between ORIGIN
and DESTINATION
in our data.The Data preview
window below should show that the two variables are split correctly.
Click Finish
(No need for Step 3).
Data set now has two SEPARATE variables, ORIGIN, and DESTINATION.
We want to filter the data to rows (observations) that have Rapid City, SD
as the destination.
Scroll down, or enter Rapid City, SD
in the search window.
Check the box next to Rapid City, SD
and click OK.
Click grey triangle in upper left corner of sheet to select whole sheet and then copy and paste it into Sheet 2.
With whole second sheet (Flights - Worksheet 2
) selected:
Cells
> Format
> Autofit Column Width
.Save your Excel file.
Question 2. How many observations are in this new data set? That is, how many flights were there to Rapid City, SD in 2024?
Question 3. Now filter the data on Worksheet 2 to determine how many of the flights to Rapid City in 2024 were on Allegiant Air.
Undo the filter option from Question 3 before we continue.
Sometimes we may want to concatenate or join text from two variables into one.
This is the reverse of the parsing what we just did.
We will take the full data set (148 observations) on Worksheet 2 and create a new variable Origin / Destination
.
This is done using the =CONCAT
command in Excel and uses the RELATIVE cell references from Lecture 1.
We start by labeling our new column:
F2
, type the following formula and click Enter
or Return
:
=CONCAT(D2, “ / ”, E2)
F2
into cells F3
through F149
.Column F
is LINKED to columns D
and E
.
If we delete the text in cell D2
, it will also disappear, in cell F2
.
Select column we want to delink, column F
and copy it.
Ctrl + C
or Cmd + C
Right click to bring up Paste Options
and select Values (V)
(clipboard with numbers).
This replaces formulas with the values in the cells and delinks that text from columns used to create it.
We can then DELETE Origin and Destination and our new column will not change.
Worksheet 3 has a famous dataset that shows the passengers on the ill-fated Titanic that sank on April 15th, 1912.
We want to convert these data to an Excel Table to make our work more efficient.
Note that these data are located in
This Excel table has useful features so we can:
remove duplicates automatically.
do quick(er) calculations.
sort and filter data efficiently.
Notice that the 2nd and 3rd rows appear to be duplicates.
These errors are not always so easy to spot and there may be others in the data.
Click on any cell in table so that Table Design
option appears and then click on it.
Click on Remove Duplicates
We opt to select only NAME and AGE to maximize the chance of finding all duplicates. Cick OK.
We see that only one duplicate was removed.
Note that the variables selected affects how duplicates are determined.
This choice depends on your data and may require trial and error.
The Fares
(ticket prices) are what the passengers paid in 1912 in British Pounds.
Convert these fares to today’s British Pound (Fare
x 146) as follows:
Based on the sorted data from the Titanic, what was the price of the most expensive fare sold in today’s U.S. dollars?
Round answer to closest whole dollar.
Question 5. What is the age of the oldest male to survive the titanic disaster?
Question 6. What was the age of the oldest female in THIRD class to survive?
GENDER
filter to female
and filter CLASS
to Third
.Text Functions
Text to Columns
is used for parsing (splitting) data.=CONCAT
function is used for joining text from 2 columns.Excel Tables
Note: These skills are excellent for single tasks.
If you plan to repeat these tasks on multiple similar data sets 5 or 10 or 100 times, then it’s time to learn some coding in R or Python.
For 1 or 2 (or 3) tasks, Excel may be more efficient
To submit an Engagement Question or Comment about material from Lecture 2: Submit it by midnight today (day of lecture).