Data Wrangling in Pandas

Nick Huntington-Klein w/Andrew Hornstra

19 February, 2021

Data Wrangling

Welcome to the Data Wrangling Workshop!

  • The goal of data wrangling
  • How to think about data wrangling
  • Technical tips for data wrangling in Python using the pandas package
  • A walkthrough example

Limitations

  • I will assume you already have some familiarity with Python in general
  • We only have so much time! I won’t be going into great detail on the use of all the technical commands, but by the end of this you will know what’s out there and generally how it’s used
  • Shorthand: pd is from import pandas as pd, and df will be shorthand for our DataFrame object
  • As with any computer skill, a teacher’s comparative advantage is in letting you know what’s out there. The real learning comes from practice and Googling. So take what you see here today, find yourself a project, and do it! It will be awful but you will learn an astounding amount by the end

Data Wrangling

What is data wrangling?

  • You have data
  • It’s not ready for you to run your model
  • You want to get it ready to run your model
  • Ta-da!

The Core of Data Wrangling

  • Always look directly at your data so you know what it looks like
  • Always think about what you want your data to look like when you’re done
  • Think about how you can take information from where it is and put it where you want it to be
  • After every step, look directly at your data again to make sure it’s doing what you think it’s doing

I help a lot of people with their problems with data wrangling. Their issues almost always come from not doing one of these four things, much more so than from trouble with coding or anything like that

The Core of Data Wrangling

  • How can you “look at your data”?
  • Literally looking at it is one way - print the DataFrame to see it printed out
  • Summary statistics tables: df.describe(include = 'all')
  • Checking what values it takes: pd.unique() on individual variables
  • Look for: what values are there, what the observations look like, presence of missing or unusable data, how the data is structured (a few of these checks are sketched below)
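
A minimal sketch of those checks - the DataFrame and column names here are made up purely for illustration:

import pandas as pd

# A tiny made-up DataFrame to inspect
df = pd.DataFrame({"Person": ["Chidi", "Chidi", "Eleanor"],
                   "Year": [2017, 2018, 2018],
                   "Points": [14, 3, 329]})
print(df)                           # literally look at it
print(df.describe(include="all"))   # summary statistics for every column
print(df["Person"].unique())        # what values does Person take?
print(df.dtypes)                    # how is each column stored?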

The Stages of Data Wrangling

  • From records to data
  • From data to tidy data
  • From tidy data to data for your analysis

From Records to Data

From Records to Data

Not something we’ll be focusing on today! But any time the data isn’t in a workable format, like a spreadsheet or database, someone’s got to get it there!

  • “Google Trends has information on the popularity of our marketing terms, go get it!”
  • “Here’s a 600-page unformatted PDF of our sales records for the past three years. Turn it into a database.”
  • “Here are scans of the 15,000 handwritten doctor’s notes at the hospital over the past year”
  • “Here’s access to the website. The records are in there somewhere.”
  • “Go do a survey”

From Records to Data: Tips!

  • Do as little by hand as possible. It’s a lot of work and you will make mistakes
  • Look at the data a lot!
  • Check for changes in formatting - it’s common for things like “this enormous PDF of our tables” or “eight hundred text files with the different responses/orders” to change formatting halfway through
  • When working with something like a PDF or a bunch of text files, think “how can I tell a computer to spot where the actual data is?”
  • If push comes to shove, or if the data set is small enough, you can do by-hand data entry. Be very careful!

Reading Files

One common thing you run across is data split into multiple files. How can we read these in and compile them?

  • glob.glob() from the glob package produces a list of filenames
  • Use a for loop to iterate over that list, reading in each file and doing any processing along the way
  • Combine the results with df.append() (or pd.concat())!

Reading Files

For example, imagine you have 200 monthly sales reports in Excel files. You just want to pull cell C2 (total sales) and cell B43 (employee of the month) and combine them together.

import glob
import os

# Get relative filepaths
partial_paths = glob.glob("../Monthly_reports/*sales*")
# turn them into absolute filepaths
file_list = [os.path.abspath(file) for file in partial_paths]

Reading Files

We can simplify by making a little function that processes each of the reports as it’s read. Then, use pd.read_excel(), then our function, then append it all together!

How do I know to pull .iloc[1, 3], etc.? Because I look straight at the files and check where the data I want is, so I can pull it and put it where I want it!

Reading Files

# Initialize place for data to go
df = pd.DataFrame(columns=["sales", "employee"])
for file in file_list:
    report = pd.read_excel(file)
    # Pull the cells we found by looking at the reports by hand
    sales = report.iloc[1, 3]
    employee = report.iloc[42, 1]
    df = df.append(
        pd.DataFrame(
            {
                "sales": sales,
                "employee": employee
            }, index=[0]
        )
    )

From Data to Tidy Data

From Data to Tidy Data

  • Data is any time you have your records stored in some structured format
  • But there are many such structures! They could be across a bunch of different tables, or perhaps a spreadsheet with different variables stored randomly in different areas, or one table per observation
  • These structures can be great for looking up values. That’s why they are often used in business or other settings where you say “I wonder what the value of X is for person/day/etc. N”
  • They’re rarely good for doing analysis (calculating statistics, fitting models, making visualizations)
  • For that, we will aim to get ourselves tidy data (see this walkthrough)

Tidy Data

In tidy data:

  1. Each variable forms a column
  2. Each observation forms a row
  3. Each type of observational unit forms a table

Tidy Data

The variables in tidy data come in two types:

  1. Identifying Variables/Keys are the columns you’d look at to locate a particular observation.
  2. Measures/Values are the actual data.

Which are they in this data?
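
The original slide displays a table here; a small made-up version, consistent with the description on the next slide, might look like this:

# Hypothetical tidy data: one row per Person-Year
tidy_example = pd.DataFrame({
    "Person": ["Chidi", "Chidi", "Eleanor", "Eleanor"],
    "Year": [2017, 2018, 2017, 2018],
    "Points": [14, 3, 329, 238],
    "ShrimpConsumption": [0, 1, 13, 172]
})
print(tidy_example)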

Tidy Data

  • Person and Year are our identifying variables. The combination of person and year uniquely identifies a row in the data. Our “observation level” is person and year. There’s only one row with Person == “Chidi” and Year == 2018
  • Points and ShrimpConsumption are our measures. They are the things we have measured for each of our observations
  • Notice how there’s one row per observation (combination of Person and Year), and one column per variable
  • Also this table contains only variables that are at the Person-Year observation level. Variables at a different level (perhaps things that vary between Person but don’t change over Year) would go in a different table, although this last one is less important

Tidying Non-Tidy Data

  • So what might data look like when it’s not like this, and how can we get it this way?
  • Here’s one common example, a count table (not tidy!) where each column is a value, not a variable

Tidying Non-tidy Data

  • Here’s another, where the “chart position” variable is split across 52 columns, one for each week

Tidying Non-Tidy Data

  • The first big tool in our tidying toolbox is the pivot
  • A pivot takes a single row with K columns and turns it into K rows with 1 column, using the identifying variables/keys to keep things lined up.
  • This can also be referred to as going from “wide” data to “long” data
  • Long to wide is also an option
  • In every statistics package, pivot functions are notoriously fiddly. Always read the help file, and do trial-and-error! Make sure it worked as intended.

Tidying Non-Tidy Data

Check our steps!

  • We looked at the data
  • Think about how we want the data to look - one row per (keys) artist, track, and week, and a column for the chart position of that artist/track in that week, and the date entered for that artist/track (value)
  • How can we carry information from where it is to where we want it to be? With a pivot!
  • And afterwards we’ll look at the result (and, likely, go back and fix our pivot code - the person who gets a pivot right the first try is a mysterious genius)

Pivot

  • In pandas we have the function pd.wide_to_long() for going from wide to long, and the .pivot() method for going from long to wide (there are also the more-powerful pd.melt() and pd.pivot_table(), but these may be trickier to use). Here we want wide-to-long so we use pd.wide_to_long()
  • This asks for:
  • df (the data set you’re working with)
  • stubnames (the columns to pivot) - a string (or list of strings) with the characters that start the columns to pivot
  • i (the existing ID variables)
  • j (the name of the new ID variable)

Pivot

pd.wide_to_long(
    billboard,
    "wk",
    i=["artist", "track", "date.entered"],
    j="week"
).rename(
    {"wk": "chart_position"},
    axis=1
).dropna()
##                                                           chart_position
## artist          track                   date.entered week               
## 2 Pac           Baby Don't Cry (Keep... 2000-02-26   1                87
##                                                      2                82
##                                                      3                72
##                                                      4                77
##                                                      5                87
## ...                                                                  ...
## matchbox twenty Bent                    2000-04-29   72               NA
##                                                      73               NA
##                                                      74               NA
##                                                      75               NA
##                                                      76               NA
## 
## [24092 rows x 1 columns]

Variables Stored as Rows

  • Here we have tax form data where each variable is a row, but we have multiple tables. For this one we can use the .pivot() method, and then combine multiple individuals with .append()
##    index       Value     TaxFormRow
## 0      0      Person  James Acaster
## 1      0      Income         112341
## 2      0  Deductible          24000
## 3      0         AGI          88341

Variables Stored as Rows

  • pivot() is a DataFrame method that needs:
  • index (the columns that give us the key - what should it be here?)
  • columns (the column containing what will be the new variable names)
  • values (the column containing the new values)

Variables Stored as Rows

tax_data.pivot(
    index="index",
    columns="Value",
    values="TaxFormRow"
)
## Value    AGI Deductible  Income         Person
## index                                         
## 0      88341      24000  112341  James Acaster

Variables Stored as Rows

We can use .append() to stack data sets with the same variables together, handy for compiling data from different sources

tax_data.pivot(
    index="index",
    columns="Value",
    values="TaxFormRow"
).append(tax_data2.pivot(
    index="index",
    columns="Value",
    values="TaxFormRow"
))
## Value     AGI Deductible  Income         Person
## index                                          
## 0       88341      24000  112341  James Acaster
## 1      309122      16000  325122   Eddie Izzard

Merging Data

  • Commonly, you will need to link two datasets together based on some shared keys
  • For example, if one dataset has the variables “Person”, “Year”, and “Income” and the other has “Person” and “Birthplace”
##     Person  Year  Income
## 0   Ramesh  2014   81314
## 1   Ramesh  2015   82155
## 2  Whitney  2014  131292
## 3  Whitney  2015  141262
## 4    David  2014  102452
## 5    David  2015  105133

Merging Data

That was person_year_data. And now for person_data:

##     Person       Birthplace
## 0   Ramesh          Crawley
## 1  Whitney  Washington D.C.

Merging Data

  • The .merge() method will do this. The different how options varieties just determine what to do with rows you don’t find a match for. 'left' keeps non-matching rows from the first dataset but not the second, 'right' from the second not the first, 'outer' from both, 'inner' from neither
  • Can deal with mismatched names on either side by using 'left_on' and 'right_on' instead of 'on' (see the sketch below)
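
For instance, if the key column had a different name in the second data set (the renamed column below is just for illustration), a sketch might look like:

# Hypothetical: the second table calls the person column "Name" instead of "Person"
person_data_renamed = person_data.rename({"Person": "Name"}, axis=1)
person_year_data.merge(
    person_data_renamed,
    how='left',
    left_on='Person',   # key column in the left data set
    right_on='Name'     # key column in the right data set
)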

Merging Data

person_year_data.merge(
    person_data,
    how='left',
    on='Person'
)
##     Person  Year  Income       Birthplace
## 0   Ramesh  2014   81314          Crawley
## 1   Ramesh  2015   82155          Crawley
## 2  Whitney  2014  131292  Washington D.C.
## 3  Whitney  2015  141262  Washington D.C.
## 4    David  2014  102452              NaN
## 5    David  2015  105133              NaN
person_year_data.merge(
    person_data,
    how='right',
    on='Person'
)
##     Person  Year  Income       Birthplace
## 0   Ramesh  2014   81314          Crawley
## 1   Ramesh  2015   82155          Crawley
## 2  Whitney  2014  131292  Washington D.C.
## 3  Whitney  2015  141262  Washington D.C.

Merging Data

  • Things work great if the list of variables passed to on is the exact observation level in at least one of the two data sets
  • But if there are multiple observations per combination of those variables in both, that’s a problem! It will create all the potential matches, which may not be what you want:
a = pd.DataFrame({"name": ["A", "A", "B", "C"],
        "Year": [2014, 2015, 2014, 2014], "Value": range(1, 5) })
b = pd.DataFrame({"name": ["A", "A", "B", "C", "C"],
        "Characteristic": ["Up", "Down", "Up", "Left", "Right"]})
a.merge(b, how='left', on="name")
##   name  Year  Value Characteristic
## 0    A  2014      1             Up
## 1    A  2014      1           Down
## 2    A  2015      2             Up
## 3    A  2015      2           Down
## 4    B  2014      3             Up
## 5    C  2014      4           Left
## 6    C  2014      4          Right

Merging Data

  • This is why it’s super important to always know the observation level of your data. You can check it by seeing if there are any duplicate rows among what you think are your key variables: if we think that name is a key for data set a, then a.duplicated(["name"]).max() will return True, showing us we’re wrong (sketched below)
  • At that point you can figure out how you want to proceed - drop observations so it’s the observation level in one? Accept the multi-match? Pick only one of the multi-matches?
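
A quick sketch of that check and one possible fix, using the a and b data sets from the previous slide:

# Is name really a key for a? True here means there are duplicate names
print(a.duplicated(["name"]).max())
# One possible fix: keep only the first row per name in b before merging
b_deduped = b.drop_duplicates(subset="name")
a.merge(b_deduped, how='left', on="name")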

From Tidy Data to Your Analysis

From Tidy Data to Your Analysis

  • Okay! We now have, hopefully, a nice tidy data set with one column per variable, one row per observation, we know what the observation level is!
  • That doesn’t mean our data is ready to go! We likely have plenty of cleaning and manipulation to go before we are ready for analysis

Filtering

  • Filtering limits the data to the observations that fulfill a certain logical condition. It picks rows.
  • For example, Income > 100000 is True for everyone with income above 100000, and False otherwise. So filtering on Income > 100000 should give you every row with income above 100000.
  • Two main ways in pandas: .query() and .loc[]
full_person_merge = person_year_data.merge(person_data, how='left', on='Person')
full_person_merge.query("Income > 100000")
##     Person  Year  Income       Birthplace
## 2  Whitney  2014  131292  Washington D.C.
## 3  Whitney  2015  141262  Washington D.C.
## 4    David  2014  102452              NaN
## 5    David  2015  105133              NaN
full_person_merge.loc[full_person_merge["Income"] > 100000]
##     Person  Year  Income       Birthplace
## 2  Whitney  2014  131292  Washington D.C.
## 3  Whitney  2015  141262  Washington D.C.
## 4    David  2014  102452              NaN
## 5    David  2015  105133              NaN

Logical Conditions

  • A lot of programming in general is based on writing logical conditions that check whether something is true
  • In Python, if the condition is true, it returns True, which turns into 1 if you do a calculation with it. If false, it returns False, which turns into 0.

Logical Conditions Tips

Handy tools for constructing logical conditions:

a > b, a >= b, a < b, a <= b, a == b, or a != b to compare two numbers and check if a is above, above-or-equal, below, below-or-equal, equal (note == to check equality, not =), or not equal

a in [b, c, d, e, f] checks whether a is any of the values b, c, d, e, or f. Works for text too! (For a whole pandas column, use .isin())

Logical Conditions Tips

  • Whatever your condition is (condition), just put a not in front to reverse True/False. 2 + 2 == 4 is True, but not (2 + 2 == 4) is False. For pandas boolean masks, the equivalent is ~
  • Chain multiple conditions together! Use and and or (or & and | for pandas masks). Be careful with parentheses if combining them - see the sketch below!
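
A minimal sketch combining conditions on the person_year_data from earlier:

# & is "and", | is "or", ~ is "not" for pandas boolean masks
# The parentheses around each condition matter!
mask = (person_year_data["Income"] > 100000) & (person_year_data["Year"] == 2015)
person_year_data.loc[mask]
# .isin() checks membership for a whole column
person_year_data.loc[person_year_data["Person"].isin(["Ramesh", "Whitney"])]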

Selecting columns

  • Indexing and .drop() give you back just a subset of the columns. They pick columns
  • It can do this by name (with a list of column names) or by column number (with .iloc[])
  • Use .drop() to not pick certain columns

If our data has the columns “Person”, “Year”, and “Income”, then all of these do the same thing:

no_income = person_year_data[["Person", "Year"]]
# a few ways to do this, but this is the most readable
no_income = person_year_data.drop("Income", axis=1)
no_income = person_year_data.iloc[:, 0:2]
print(no_income)

.sort_values()

  • .sort_values() sorts the data. That’s it! Give it the column names and it will sort the data by those columns.
  • It’s often a good idea to sort your data before saving it (or looking at it) as it makes it easier to navigate
  • There are also some data manipulation tricks that rely on the position of the data
person_year_data.sort_values(["Person","Year"])
##     Person  Year  Income
## 4    David  2014  102452
## 5    David  2015  105133
## 0   Ramesh  2014   81314
## 1   Ramesh  2015   82155
## 2  Whitney  2014  131292
## 3  Whitney  2015  141262

Assigning Variables

  • We can assign columns/variables by declaring their column names
person_year_data["NextYear"] = person_year_data["Year"] + 1
person_year_data["Above100k"] = person_year_data["Income"] > 100000
print(person_year_data)
##     Person  Year  Income  NextYear  Above100k
## 0   Ramesh  2014   81314      2015      False
## 1   Ramesh  2015   82155      2016      False
## 2  Whitney  2014  131292      2015       True
## 3  Whitney  2015  141262      2016       True
## 4    David  2014  102452      2015       True
## 5    David  2015  105133      2016       True

Case assignment

  • A common need is in creating a categorical variable
  • Use .loc[] to determine which rows to update, and then assign them
  • This is known as a boolean mask
  • (here we will also use between to help with our .loc[])

Case assignment

person_year_data["IncomeBracket"] = "Under 50k"
person_year_data.loc[person_year_data["Income"].between(
    50001, 100000
), "IncomeBracket"] = "50-100k"
person_year_data.loc[person_year_data["Income"].between(
    100001, 120000
), "IncomeBracket"] = "100-120k"
person_year_data.loc[person_year_data["Income"]
                     > 120000, "IncomeBracket"] = "Above 120k"

Case assignment

##     Person  Year  Income  NextYear  Above100k IncomeBracket
## 0   Ramesh  2014   81314      2015      False       50-100k
## 1   Ramesh  2015   82155      2016      False       50-100k
## 2  Whitney  2014  131292      2015       True    Above 120k
## 3  Whitney  2015  141262      2016       True    Above 120k
## 4    David  2014  102452      2015       True      100-120k
## 5    David  2015  105133      2016       True      100-120k

Case assignment

  • Note that the assignment doesn’t have to be a value, it can be a calculation, for example
person_year_data["Inflation_Adjusted_Income"] = person_year_data["Income"]
person_year_data.loc[person_year_data["Year"] ==
                     2014, "Inflation_Adjusted_Income"] *= 1.001
  • Note in that last step we are using boolean masking to change the value of just some of the observations, also handy

.groupby()

  • .groupby() turns the dataset into a grouped data set, splitting it by each combination of the grouping variables
  • Calculations like .transform() then process the data separately by each group
person_year_data["Income_Relative_to_Mean"] = (person_year_data["Income"]
    - person_year_data.groupby("Person")["Income"].transform("mean"))

.groupby()

  • How is this useful in preparing data?
  • Remember, we want to look at where information is and think about how we can get it where we need it to be
  • .groupby() helps us move information from one row to another within a group - otherwise a difficult move!
  • It can also let us change the observation level with .agg()
  • Tip: "count" gives the number of rows in the group - handy!

.agg()

  • .agg() changes the observation level to a broader level
  • It returns only one row per group (or one row total if the data is ungrouped)
  • So now your keys are whatever you gave to .groupby()
person_year_data.groupby(
    "Person"
).agg(
    {"Income": "mean", "Person": "count"}
).rename({"Person": "YearsTracked"}, axis=1)
##            Income  YearsTracked
## Person                         
## David    103792.5             2
## Ramesh    81734.5             2
## Whitney  136277.0             2

Variable Types

Manipulating Variables

  • Those are the base data manipulation approaches we need to think about
  • They can be combined to do all sorts of things!
  • But important in using them is thinking about what kinds of variable manipulations we’re doing
  • That will feed into our variable assignments and our .agg()s
  • A lot of data cleaning is making an already-tidy variable usable!

Variable Types

Common variable types:

  • Numeric (many types!)
  • Character/string
  • Categorical
  • Date

Variable Types

  • You can check the types of your variables with .dtypes
  • You can generally convert between types using .astype
tax_data.pivot(
    index="index",
    columns="Value",
    values="TaxFormRow"
).astype(
    {
        "AGI": "float64",
        "Deductible": "float64",
        "Income": "float64",
        "Person": "category"
    }
).reset_index(drop=True)

Numeric Notes

  • Numeric data actually comes in multiple formats based on the level of acceptable precision: float, int, and so on
  • You can generally convert between types with functions like int() or the .astype() method
  • But a common problem is that reading in very big integers (like ID numbers) will sometimes create floats that are stored in scientific notation - lumping multiple groups together! Avoid this with options like dtype= in your data-reading function (see the sketch below)
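
A sketch of that option, assuming a hypothetical file ids.csv with an id column:

# Force the ID column to be read as a string so big ID numbers
# don't get converted to floats in scientific notation
id_data = pd.read_csv("ids.csv", dtype={"id": str})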

Character/string

  • Specified with "", and '' is also OK, especially if you need a " in the string
  • Use + to stick stuff together, or .join() to paste together a vector! "h"+"ello" is "hello", "_".join(["h","ello"]) is "h_ello"
  • Messy data often defaults to character. For example, a “1,000,000” in your Excel sheet might not be parsed as 1000000 but instead as a literal “1,000,000” with commas
  • Lots of details on working with these - back to them in a moment

Categorical/factor variables

  • Categorical variables are for when you’re in one category or another
  • The pd.Categorical() function lets you specify these - and they can be ordered!

Categorical/factor variables

unsorted = pd.Categorical(
    pd.Series(
        [
            "50k-100k", "Less than 50k", "50k-100k", "100k+", "100k+"
        ]
    ),
    categories=[
        "Less than 50k", "50k-100k", "100k+"
    ],
    ordered=True
)
unsorted.sort_values()
## ['Less than 50k', '50k-100k', '50k-100k', '100k+', '100k+']
## Categories (3, object): ['Less than 50k' < '50k-100k' < '100k+']

Dates

  • Dates are the scourge of data cleaners everywhere. They’re just plain hard to work with!
  • Thankfully pandas does at least consolidate variable types into datetime
  • I won’t go into detail here, but there is a good guide here. Even then it’s tricky - dates never want to do what you want them to! (A tiny example is sketched below)
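
A tiny sketch of the basics (dates here are made up):

# pd.to_datetime() parses strings into proper datetime values
dates = pd.to_datetime(pd.Series(["2021-02-19", "2021-03-05"]))
# Once parsed, the .dt accessor pulls out pieces
print(dates.dt.year)
print(dates.dt.month)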

Characters/strings

  • Back to strings!
  • Even if your data isn’t textual, working with strings is a very common aspect of preparing data for analysis
  • Some are straightforward, for example using boolean masks to fix typos/misspellings in the data
  • But other common tasks in data cleaning include: getting substrings, splitting strings, cleaning strings, and detecting patterns in strings

Getting Substrings

  • When working with things like nested IDs (for example, NAICS codes are six digits, but the first two and first four digits have their own meaning), you will commonly want to pick just a certain range of characters
  • You can index the characters of the string like it’s an array
  • string[start:end] will do this. "hello"[1:4] is 'ell'
  • Note negative values read from the end of the string. "hello"[-1] is 'o'. For a whole pandas column, see the sketch below
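
For a whole pandas column of strings, the .str accessor does the same slicing (the NAICS codes below are just example values):

# Hypothetical column of six-digit NAICS codes stored as strings
naics = pd.Series(["445110", "722511"])
print(naics.str[0:2])          # first two digits
print(naics.str.slice(0, 4))   # first four digits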

Getting Substrings

  • For example, geographic Census Block Group indicators are 13 digits, the first two of which are the state FIPS code (stored as numbers, a leading zero gets dropped, which is why the code below checks the string length)
cbg = pd.DataFrame({"cbg":[152371824231, 1031562977281]},dtype=str)
cbg["state_fips"] = cbg["cbg"].apply(lambda x: x[0:2] if len(x) == 13 else x[0:1])
cbg
##              cbg state_fips
## 0   152371824231          1
## 1  1031562977281         10

Strings

  • Lots of data will try to stick multiple pieces of information in a single cell, so you need to split it out!
  • Generically, str.split() will do this. "a,b".split(",") is ["a","b"]
  • Often in already-tidy data, you want .str.split(). Make sure to rename as appropriate!
category = pd.DataFrame({"category": ["Sales,Marketing", "H&R,Marketing"]})
category["category"].str.split(",", expand=True).rename({0: "Category1",1: "Category2"},axis=1)
##   Category1  Category2
## 0     Sales  Marketing
## 1       H&R  Marketing

Cleaning Strings

  • Strings sometimes come with unwelcome extras! Garbage or extra whitespace at the beginning or end, or badly-used characters
  • .strip() removes beginning/end whitespace, " hi hello ".strip() is "hi hello". See also .rstrip() and .lstrip() for one-sided versions
  • .str.replace() is often handy for eliminating (or fixing) unwanted characters
number = pd.DataFrame({"number": ["1,000", "2,003,124"]})
number["number"].str.replace(",", "").astype(int)
## 0       1000
## 1    2003124
## Name: number, dtype: int32

Detecting Patterns in Strings

  • Often we want to do something a bit more complex. Unfortunately, this requires we dip our toes into the bottomless well that is regular expressions
  • Regular expressions are ways of describing patterns in strings so that the computer can recognize them. Technically this is what we did with .str.replace(",","") - "," is a regular expression saying “look for a comma”
  • There are a lot of options here. See the guide
  • Common: [0-9] to look for a digit, [a-zA-Z] for letters, * to repeat the previous element zero or more times… hard to condense here. Read the guide.

Detecting Patterns in Strings

  • For example, some companies are publicly listed and we want to indicate that but not keep the ticker. .str.split() won’t do it here, not easily!
    • On the next page we’ll use the regular expression '\\([A-Z].*\\)'
  • '\\([A-Z].*\\)' says “look for a (” (note the \\ to treat the usually-special ( character as an actual character), then “look for a capital letter [A-Z]”, then “keep matching any characters .*”, then “look for a )”

Detecting Patterns in Strings

companies = pd.DataFrame({"name":["Amazon (AMZN) Holdings", "Cargill Corp. (cool place!)"]})
companies["publicly_listed"] = companies["name"].str.contains("\\([A-Z].*\\)")
companies["name"] = companies["name"].str.replace("\\([A-Z].*\\)", "")
print(companies)
##                           name  publicly_listed
## 0             Amazon  Holdings             True
## 1  Cargill Corp. (cool place!)            False

Using Data Structure

Using Data Structure

  • One of the core steps of data wrangling we discussed is thinking about how to get information from where it is now to where you want it
  • A tough thing about tidy data is that it can be a little tricky to move data into different rows than it currently is
  • This is often necessary when .agg()ing, or when doing things like “calculate growth from an initial value”
  • But we can solve this with the use of sort_values() along with other-row-referencing functions like indexing, perhaps combined with .head()

Using Data Structure

stock_data = pd.DataFrame({"ticker": ["AMZN", "AMZN", "AMZN", "WMT", "WMT", "WMT"],
        "date": [ "2020-03-04", "2020-03-05", "2020-03-06", "2020-03-04","2020-03-05", "2020-03-06"],
        "stock_price": [103, 103.4, 107, 85.2, 86.3, 85.6]})
stock_data["date"] = pd.to_datetime(stock_data["date"])

Using Data Structure

  • .head() and .tail() refer to the first and last rows, naturally
stock_data["price_growth_since_march_4"] = stock_data.sort_values(["ticker", "date"]).groupby(
    "ticker")["stock_price"].apply(lambda x: x/x.head(1).values[0] - 1)
print(stock_data)
##   ticker       date  stock_price  price_growth_since_march_4
## 0   AMZN 2020-03-04        103.0                    0.000000
## 1   AMZN 2020-03-05        103.4                    0.003883
## 2   AMZN 2020-03-06        107.0                    0.038835
## 3    WMT 2020-03-04         85.2                    0.000000
## 4    WMT 2020-03-05         86.3                    0.012911
## 5    WMT 2020-03-06         85.6                    0.004695

Using Data Structure

  • shift() looks to the row a certain number above/below this one, based on the periods argument
  • Careful! shift() doesn’t care about time structure, it only cares about data structure. If you want daily growth but the row above is last year, too bad! (One way to guard against this is sketched below)
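
One way to guard against that, sketched with the stock_data from before: shift the date alongside the value and check the gap.

sorted_stock = stock_data.sort_values(["ticker", "date"])
# The date of the row we'd be comparing against within each ticker
prev_date = sorted_stock.groupby("ticker")["date"].shift(1)
# True only when the previous row really is exactly one day earlier
is_previous_day = (sorted_stock["date"] - prev_date) == pd.Timedelta(days=1)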

Using Data Structure

stock_data["daily_price_growth"] = (stock_data["stock_price"]/stock_data.sort_values(["ticker", "date"]).groupby(
    "ticker")["stock_price"].shift(1) - 1)
stock_data
##   ticker       date  ...  price_growth_since_march_4  daily_price_growth
## 0   AMZN 2020-03-04  ...                    0.000000                 NaN
## 1   AMZN 2020-03-05  ...                    0.003883            0.003883
## 2   AMZN 2020-03-06  ...                    0.038835            0.034816
## 3    WMT 2020-03-04  ...                    0.000000                 NaN
## 4    WMT 2020-03-05  ...                    0.012911            0.012911
## 5    WMT 2020-03-06  ...                    0.004695           -0.008111
## 
## [6 rows x 5 columns]

Trickier Stuff

  • Sometimes the kind of data you want to move from one row to another is more complex!
  • You can get stuff that might not normally be first or last by filtering on the values you want before .transform()ing

Trickier Stuff

##     person  school_grade  subject  test_score
## 0     Adam             6     Math          80
## 1    James             7     Math          84
## 2    Diego             7  English          67
## 3     Beth             8  Science          87
## 4  Francis             6  English          55
## 5     Qian             7  Science          75
## 6     Ryan             8     Math          85
## 7    Selma             8       PE          70

Trickier Stuff

grades["math_scores"] = grades.loc[
    grades["subject"] == "Math"
].groupby(
    ["school_grade"]
)["test_score"].transform("mean")
grades["Math_Average_In_This_Grade"] = grades.groupby(
    "school_grade"
)["math_scores"].transform("max")
grades.drop("math_scores", axis=1)
##     person  school_grade  subject  test_score  Math_Average_In_This_Grade
## 0     Adam             6     Math          80                        80.0
## 1    James             7     Math          84                        84.0
## 2    Diego             7  English          67                        84.0
## 3     Beth             8  Science          87                        85.0
## 4  Francis             6  English          55                        80.0
## 5     Qian             7  Science          75                        84.0
## 6     Ryan             8     Math          85                        85.0
## 7    Selma             8       PE          70                        85.0

Trickier Stuff

print(grades)
##     person  school_grade  ... math_scores  Math_Average_In_This_Grade
## 0     Adam             6  ...        80.0                        80.0
## 1    James             7  ...        84.0                        84.0
## 2    Diego             7  ...         NaN                        84.0
## 3     Beth             8  ...         NaN                        85.0
## 4  Francis             6  ...         NaN                        80.0
## 5     Qian             7  ...         NaN                        84.0
## 6     Ryan             8  ...        85.0                        85.0
## 7    Selma             8  ...         NaN                        85.0
## 
## [8 rows x 6 columns]

Automation

Automation

  • Data cleaning is often very repetitive
  • You shouldn’t let it be!
  • Not just to save yourself work and tedium, but also because standardizing your process so you only have to write the code once both reduces errors and means that if you have to change something you only have to change it once
  • So let’s automate! The ways we’ll do it here: for loops across columns, for loops more generally, and writing functions

for loops across columns

  • If you have a lot of variables, cleaning them all can be a pain. Who wants to write out the same thing a million times, say to convert all those read-in-as-text variables to numeric?
  • Variable selectors like .startswith() help you apply a given function to a lot of the right columns at once, in addition to just listing the column names or positions

startswith

stock_data["price_growth_since_march_4"] = stock_data.sort_values(["ticker", "date"]).groupby("ticker")["stock_price"].apply(
    lambda x: x/x.head(1).values[0] - 1)

stock_data["price_growth_daily"] = (stock_data["stock_price"] / stock_data.sort_values(["ticker", "date"]).groupby(
    "ticker")["stock_price"].shift(1) - 1)

startswith

  • .startswith("price_growth") is the same here as 4:5 or ["price_growth_since_march_4", "price_growth_daily"]
growth_cols = [col for col in stock_data.columns if col.startswith("price_growth")]
stock_growth = stock_data.copy()
stock_growth[growth_cols] *= 10000

print(stock_growth)
##   ticker       date  ...  daily_price_growth  price_growth_daily
## 0   AMZN 2020-03-04  ...                 NaN                 NaN
## 1   AMZN 2020-03-05  ...            0.003883           38.834951
## 2   AMZN 2020-03-06  ...            0.034816          348.162476
## 3    WMT 2020-03-04  ...                 NaN                 NaN
## 4    WMT 2020-03-05  ...            0.012911          129.107981
## 5    WMT 2020-03-06  ...           -0.008111          -81.112399
## 
## [6 rows x 6 columns]

Multiple functions at once

# Undo what we just did
stock_growth[growth_cols] /= 10000
for col in growth_cols:
    stock_growth[col+"_pct"] = stock_growth[col] * 100
    stock_growth[col+"_bps"] = stock_growth[col] * 10000
print(stock_growth)
##   ticker       date  ...  price_growth_daily_pct  price_growth_daily_bps
## 0   AMZN 2020-03-04  ...                     NaN                     NaN
## 1   AMZN 2020-03-05  ...                0.388350               38.834951
## 2   AMZN 2020-03-06  ...                3.481625              348.162476
## 3    WMT 2020-03-04  ...                     NaN                     NaN
## 4    WMT 2020-03-05  ...                1.291080              129.107981
## 5    WMT 2020-03-06  ...               -0.811124              -81.112399
## 
## [6 rows x 10 columns]
## C:/Users/nhuntington-klein/AppData/Local/r-miniconda/envs/r-reticulate/python.exe:4: SettingWithCopyWarning: 
## A value is trying to be set on a copy of a slice from a DataFrame.
## Try using .loc[row_indexer,col_indexer] = value instead
## 
## See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

Writing Functions

  • Generally, if you’re going to do the same thing more than once, you’re probably better off writing a function
    • Reduces errors, saves time, makes code reusable later!
def function_name(argument1: float = 1.0,
                  argument2: float = 100.0) -> float:
    """This function has type hints AND a doc string. What
    a life of luxury this is."""
    # do some stuff
    some_value = 100*argument1
    another_value = argument2/some_value
    return another_value
    # alternatively, without saving another_value
    # return argument2/some_value

Function-writing tips

  • Make sure to think about what kind of values your function accepts and make sure that what it returns is consistent so you know what you’re getting
  • This is a really deep topic to cover in two slides, and mostly I just want to poke you and encourage you to do it. At least, if you find yourself doing something a bunch of times in a row, just take the code, stick it inside a def wrapper, and instead use a bunch of calls to that function in a row (a small example is sketched below)
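
For example, a made-up cleaning step (not from the deck): if you keep converting comma-formatted text columns to numbers, wrap it once and reuse it.

def clean_number_column(column: pd.Series) -> pd.Series:
    """Strip commas out of a text column and convert it to a number."""
    return column.str.replace(",", "").astype(float)

# Then call it as many times as needed (column names here are hypothetical)
# df["sales"] = clean_number_column(df["sales"])
# df["costs"] = clean_number_column(df["costs"])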

Finishing Up, and an Example!

Some Final Notes

  • We can’t possibly cover everything. So one last note, about saving your data!
  • What to do when you’re done and want to save your processed data?
  • There are a bunch of formats, one of which is parquet with to_parquet()
  • Saving data for sharing: to_csv() makes a CSV. Yay! (Both are sketched below)
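
A quick sketch (file names are made up):

# Save a processed DataFrame for later use
person_year_data.to_parquet("person_year_data.parquet")
# Or as a CSV for sharing (index=False skips writing the row index)
person_year_data.to_csv("person_year_data.csv", index=False)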

Some Final Notes

  • Also, please, please, please DOCUMENT YOUR DATA
  • At the very least, keep a spreadsheet with a set of descriptions for each of your variables

A Walkthrough

  • Let’s clean some data!