---
title: "BUA 345 - Lecture 2"
subtitle: "Data Preparation in Excel"
author: "Penelope Pooler Eisenbies"
date: last-modified
lightbox: true
toc: true
toc-depth: 3
toc-location: left
toc-title: "Table of Contents"
toc-expand: 1
format:
html:
code-line-numbers: true
code-fold: true
code-tools: true
execute:
echo: fenced
---
## Housekeeping
```{r setup, echo=FALSE, warning=F, message=F, include=F}
#| include: false
# this line specifies options for default options for all R Chunks
knitr::opts_chunk$set(echo=F)
# suppress scientific notation
options(scipen=100)
# install helper package that loads and installs other packages, if needed
if (!require("pacman")) install.packages("pacman", repos = "http://lib.stat.cmu.edu/R/CRAN/")
# install and load required packages
pacman::p_load(pacman,tidyverse, magrittr, olsrr, shadowtext, mapproj, knitr, kableExtra,
countrycode, usdata, maps, RColorBrewer, gridExtra, ggthemes, gt,
mosaicData, epiDisplay, vistributions, psych, tidyquant, dygraphs)
# verify packages
# p_loaded()
```
### Today's plan
- Review Question
- Text Functions
- Parsing (Splitting) Text
- Concatenating (Joining) Text
- Excel Tables
- Removing Duplicates
- Quick Calculations
- Intro to Sorting and Filtering
<br>
##
### :boom: Lecture 2 In-class Exercise - Q1 :boom:
[***Poll Everywhere***](https://pollev.com/penelopepoolereisenbies685){target="_blank"} - My User Name: **penelopepoolereisenbies685**
**Review from Lecture 1 and MAS 261 (or previous course)**
<br>
1. Open the small data file, Lecture 2 Review Question.xlsx
2. Use relative and absolute cell references as shown to determine the value of the SMALLEST NEGATIVE deviation.
3. Round your answer to two decimal places.
<br>
::: fragment
**Recall from MAS 261: A deviation is the difference from the mean, Observation - Mean**
:::
## Parsing Text
- You work at Jet Blue and plan to offer flights to Rapid City, South Dakota
- Flight data for the state of South Dakota for 2024 is shown in the first sheet of `Lecture 2 Excel Worksheets.xlsx`.
::: fragment
{fig-align="center"}
:::
- We ONLY want data for flights TO Rapid City so we need to
1. Separate ‘Origin’ and ‘Destination’ into TWO separate columns
2. Subset (filter) data to observations with Rapid City as ‘Destination’
##
:::::::: columns
::: {.column width="48%"}
### Parsing Text Con't
In Excel:
1. Select column with text we want to ‘parse’ or ‘split’, column `D`.
2. Notice that between each Origin and Destination in the data there is a ‘-’.
3. Click on ‘Text to Columns’ in Data tab.
4. In the pop-up window, select `Delimited` and click `Next`.
:::
::: {.column width="4%"}
:::
::::: {.column width="48%"}
::: fragment
{fig-align="center" height="2.5in"}
:::
::: fragment
{fig-align="center" height="3.5in"}
:::
:::::
::::::::
##
::::::::: columns
:::: {.column width="48%"}
### Parsing Text Con't
In Excel:
1. Under `Delimiters` select `Other` and specify `–`.
- `–` is the delimiter or separator between `ORIGIN` and `DESTINATION` in our data.
2. The `Data preview` window below should show that the two variables are split correctly.
3. Click `Finish` (No need for Step 3).
::: fragment
Data set now has two SEPARATE variables, ORIGIN, and DESTINATION.
:::
::::
::: {.column width="4%"}
:::
::::: {.column width="48%"}
::: fragment
{fig-align="center" height="3.5in"}
:::
::: fragment
{fig-align="center"}
:::
:::::
:::::::::
## Filtering Data
:::::::: columns
::: {.column width="48%"}
We want to filter the data to rows (observations) that have `Rapid City, SD` as the destination.
1. Select the `DESTINATION` column and click on the `Filter` icon.
2. Click on the triangle that appears next to the variable name.
3. Uncheck ‘Select All’
:::
::: {.column width="4%"}
:::
::::: {.column width="48%"}
::: fragment
{fig-align="center" height="2.5in"}
:::
::: fragment
{fig-align="center" height="3.5in"}
:::
:::::
::::::::
##
::::::::: columns
::: {.column width="48%"}
### Filtering Data Con't
We want to filter the data to rows (observations) that have `Rapid City, SD` as the destination.
4. Scroll down, or enter `Rapid City, SD` in the search window.
5. Check the box next to `Rapid City, SD` and click OK.
6. Click grey triangle in upper left corner of sheet to select whole sheet and then copy and paste it into Sheet 2.
- Green dashed lines between rows verify that only the filtered rows are selected.
:::
::: {.column width="4%"}
:::
:::::: {.column width="48%"}
::: fragment
{fig-align="center" height="2.5in"}
:::
::: fragment
{fig-align="center" height="1in"}
:::
::: fragment
{fig-align="center" height="1in"}
:::
::::::
:::::::::
## Formatting Filtered Data
1. With whole second sheet (`Flights - Worksheet 2`) selected:
- click `Cells` \> `Format` \> `Autofit Column Width`.
2. Save your Excel file.
::: fragment
{height="2.5in"}
:::
::: fragment

:::
##
### :boom: Lecture 2 In-class Exercise - Q2-Q3 :boom:
[***Poll Everywhere***](https://pollev.com/penelopepoolereisenbies685){target="_blank"} - My User Name: **penelopepoolereisenbies685**
<br>
**Question 2.** How many observations are in this new data set? That is, how many flights were there to Rapid City, SD in 2024?
- HINT: Exclude first row which lists the variable names.
<br>
**Question 3.** Now filter the data on Worksheet 2 to determine how many of the flights to Rapid City in 2024 were on Allegiant Air.
- HINT: Select ‘CARRIER’ column and filter for ‘Allegiant Air’ only.
<br>
::: fragment
**Undo the filter option from Question 3 before we continue.**
:::
## Concatentating Text
- Sometimes we may want to concatenate or join text from two variables into one.
- This is the reverse of the parsing what we just did.
- We will take the full data set (148 observations) on Worksheet 2 and create a new variable `Origin / Destination`.
- This is done using the `=CONCAT` command in Excel and uses the RELATIVE cell references from Lecture 1.
- We start by labeling our new column:
::: fragment

:::
##
### Concatentating Text Cont'd
1. In cell `F2`, type the following formula and click `Enter` or `Return`:
- `=CONCAT(D2, “ / ”, E2)`
- **Note:** Quotes are REQUIRED and that there is a space before and after the slash.
::: fragment

:::
2. Copy and paste cell `F2` into cells `F3` through `F149`.
::: fragment

:::
## Delinking Columns
- Column `F` is LINKED to columns `D` and `E`.
- If we delete the text in cell `D2`, it will also disappear, in cell `F2`.
::: fragment

:::
- We can resolve this issue for ANY column with a formula, by replacing the formula with the values shown in the cells.
##
::::::::: columns
::: {.column width="48%"}
### Delinking Columns
1. Select column we want to delink, column `F` and copy it.
- `Ctrl + C` or `Cmd + C`
2. Right click to bring up `Paste Options` and select `Values (V)` (clipboard with numbers).
- This replaces formulas with the values in the cells and delinks that text from columns used to create it.
- We can then DELETE Origin and Destination and our new column will not change.
:::
::: {.column width="4%"}
:::
:::::: {.column width="48%"}
::: fragment
{fig-align="center"}
:::
::: fragment
{fig-align="center"}
:::
::: fragment
{fig-align="center"}
:::
::::::
:::::::::
## Excel Tables
:::::: columns
::: {.column width="58%"}
- Worksheet 3 has a famous dataset that shows the passengers on the ill-fated Titanic that sank on April 15th, 1912.
- We want to convert these data to an Excel Table to make our work more efficient.
- Note that these data are located in
- Columns: A – F
- Rows: 1 - 1311
:::
::: {.column width="4%"}
:::
::: {.column width="38%"}
{fig-align="center"}
:::
::::::
::: fragment
{fig-align="center"}
:::
##
:::::::: columns
::: {.column width="58%"}
### Excel Tables Cont'd
{fig-align="center"}
1. Click on any NON-EMPTY cell in the worksheet.
2. In the ‘Tables’ section of the ‘Insert’ tab, click on ‘Table’
3. Verify (or fill in if incorrect) location of all of your data, and click ‘OK’
:::
::: {.column width="4%"}
:::
::::: {.column width="38%"}
::: fragment
{fig-align="center"}
:::
::: fragment
{fig-align="center"}
:::
:::::
::::::::
##
:::::: columns
::: {.column width="58%"}
### Excel Table Created!
- This Excel table has useful features so we can:
- remove duplicates automatically.
- do quick(er) calculations.
- sort and filter data efficiently.
:::
::: {.column width="4%"}
:::
::: {.column width="38%"}
{fig-align="center"}
:::
::::::
::: fragment
{fig-align="center"}
:::
## Duplicates
:::::: columns
::: {.column width="58%"}
- Notice that the 2nd and 3rd rows appear to be duplicates.
- These errors are not always so easy to spot and there may be others in the data.
:::
::: {.column width="4%"}
:::
::: {.column width="38%"}
{fig-align="center"}
:::
::::::
::: fragment
{fig-align="center"}
:::
## Removing Duplicates
1. Click on any cell in table so that `Table Design` option appears and then click on it.
2. Click on `Remove Duplicates`
::: fragment
{fig-align="center"}
:::
::: fragment
{fig-align="center" height="3in"}
:::
##
:::::::: columns
::: {.column width="48%"}
### Removing Duplicates Cont'd
1. We opt to select only NAME and AGE to maximize the chance of finding all duplicates. Cick OK.
2. We see that only one duplicate was removed.
- Note that the variables selected affects how duplicates are determined.
- This choice depends on your data and may require trial and error.
:::
::: {.column width="4%"}
:::
::::: {.column width="48%"}
::: fragment
{fig-align="center"}
:::
::: fragment
{fig-align="center"}
:::
:::::
::::::::
## Excel Table Calculations
- The `Fares` (ticket prices) are what the passengers paid in 1912 in British Pounds.
- Convert these fares to today’s British Pound (`Fare` x 146) as follows:
:::::::: columns
::: {.column width="48%"}
1. Type `FARE TODAY` in `G1` and notice that table automatically expands.
- Widen column if needed.
2. In cell `G2` type `=F2*146` and click `Enter`.
3. Entire column is instantly populated.
:::
::: {.column width="4%"}
:::
::::: {.column width="48%"}
::: fragment
{fig-align="center"}
:::
::: fragment
{fig-align="center"}
:::
:::::
::::::::
##
### Excel Table Calculations Cont'd
- Convert these fares in today’s British Pounds to today's U.S. dollars.
:::::::: columns
::: {.column width="48%"}
1. Type `FARE TODAY ($)` in `H1` and notice (again) that table automatically expands.
- Widen column if needed.
2. In cell `H2` type `=G2*1.23` and click `Enter`.
3. Entire column is instantly populated.
4. Select column `H` and convert
:::
::: {.column width="4%"}
:::
::::: {.column width="48%"}
::: fragment
{fig-align="center"}
:::
::: fragment
{fig-align="center"}
:::
:::::
::::::::
## Sorting Excel Tables
::::::: columns
::: {.column width="58%"}
- We can easily sort and filter data in Excel tablesto answer questions.
- For example:
- How much was the most expensive ticket in US Dollars?
1. Click triangle next to `FARE TODAY ($)`.
- This opens a menu for sorting and filtering.
2. Select `Sort Largest to Smallest`.
:::
::: {.column width="4%"}
:::
:::: {.column width="38%"}
::: fragment
{fig-align="center"}
:::
::::
:::::::
##
### :boom: Lecture 2 In-class Exercise - Q4 :boom:
[***Poll Everywhere***](https://pollev.com/penelopepoolereisenbies685){target="_blank"} - My User Name: **penelopepoolereisenbies685**
<br>
Based on the sorted data from the Titanic, what was the price of the most expensive fare sold in today’s U.S. dollars?
Round answer to closest whole dollar.
##
::::::::: columns
::: {.column width="48%"}
### Sorting and Filtering Excel Tables
- We can sort and filter the data to find the age of the oldest male passenger that survived.
1. Filter `SURVIVED?` to `Yes`.
2. Filter `GENDER` to `male`.
3. Sort `AGE` `Largest to Smallest`.
:::
::: {.column width="4%"}
:::
:::::: {.column width="48%"}
::: fragment
{fig-align="center" height="2in"}
:::
::: fragment
{fig-align="center" height="2in"}
:::
::: fragment
{fig-align="center" height="2in"}
:::
::::::
:::::::::
##
### :boom: Lecture 2 In-class Exercise - Q5-Q6 :boom:
[***Poll Everywhere***](https://pollev.com/penelopepoolereisenbies685){target="_blank"} - My User Name: **penelopepoolereisenbies685**
<br>
**Question 5.** What is the age of the oldest male to survive the titanic disaster?
<br>
**Question 6.** What was the age of the oldest female in THIRD class to survive?
- **HINT:** Switch the `GENDER` filter to `female` and filter `CLASS` to `Third`.
##
### Key Points from Today
- Text Functions
- `Text to Columns` is used for parsing (splitting) data.
- `=CONCAT` function is used for joining text from 2 columns.
- Excel Tables
- Expedites data cleaning
- Automates column calcuations
- Includes efficient sorting and filtering options
- Note: These skills are excellent for single tasks.\
- If you plan to repeat these tasks on multiple similar data sets 5 or 10 or 100 times, then it’s time to learn some coding in R or Python.
- For 1 or 2 (or 3) tasks, Excel may be more efficient
::: fragment
**To submit an Engagement Question or Comment about material from Lecture 2:** Submit it by midnight today (day of lecture).
:::