Welcome to the Data Wrangling Workshop!
pd is from import pandas as pd, and df will be shorthand for our DataFrame object.
What is data wrangling?
I help a lot of people with their data wrangling problems. Their issues almost always come from not doing one of these four things, much more so than from having trouble coding or anything like that.
Ways to look at the data (a quick sketch follows below):
print the DataFrame to have it print out
df.describe(include='all')
pd.unique() on individual variables
Getting hold of the data in the first place is not something we'll be focusing on today! But any time the data isn't in a workable format, like a spreadsheet or database, someone's got to get it there!
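Here's that sketch - it assumes a DataFrame df is already loaded, and some_column is a hypothetical column name:
print(df)  # print the DataFrame to look at it directly
print(df.describe(include='all'))  # summary statistics for every column
print(pd.unique(df['some_column']))  # the distinct values of one variable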
One common thing you run across is data split into multiple files. How can we read these in and compile them?
glob from the glob package produces a list of filenames
a for loop to iterate over that list and read in the data, as well as do any processing
df.append() to combine it all!
For example, imagine you have 200 monthly sales reports in Excel files. You just want to pull cell C2 (total sales) and cell B43 (employee of the month) and combine them together.
We can simplify by making a little function that processes each of the reports as it's read. Then, read each file with pd.read_excel(), run it through our function, and append it all together! (See the sketch below.)
How do I know it's df.iloc[1, 3] I want, etc.? Because I look straight at the files and check where the data I want is, so I can pull it and put it where I want it!
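A minimal sketch of that whole pipeline, assuming the reports sit in a reports/ folder and the cells are where the example says (pd.concat() stands in for appending in a loop; reading .xlsx files also needs an engine like openpyxl installed):
import glob
import pandas as pd

def process_report(filename):
    # read with no header so positions match Excel's grid
    sheet = pd.read_excel(filename, header=None)
    # C2 is row 2, column C -> .iloc[1, 2]; B43 is row 43, column B -> .iloc[42, 1]
    return pd.DataFrame({"total_sales": [sheet.iloc[1, 2]],
                         "employee_of_month": [sheet.iloc[42, 1]]})

files = glob.glob("reports/*.xlsx")  # the list of report filenames
compiled = pd.concat([process_report(f) for f in files], ignore_index=True)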
In tidy data: each variable gets its own column, each observation gets its own row, and each value gets its own cell.
The variables in tidy data come in two types: identifying variables (keys), which pin down which observation each row describes, and values, the actual measurements.
Which are they in this data?
Check our steps!
pd.wide_to_long() goes wide-to-long, and the .pivot() method goes long-to-wide (there are also the more-powerful pd.melt() and pd.pivot_table(), but these may be trickier to use). Here we want wide-to-long, so we use pd.wide_to_long(), which needs:
df (the data set you're working with)
stubnames (the columns to pivot) - a string (or list) with the characters that start the columns to pivot
i (the existing ID variables)
j (the name of the new ID variable)
pd.wide_to_long(
billboard,
"wk",
i=["artist", "track", "date.entered"],
j="week"
).rename(
{"wk": "chart_position"},
axis=1
)  # adding .dropna() here would also drop the missing weeks after each song leaves the chart
## chart_position
## artist track date.entered week
## 2 Pac Baby Don't Cry (Keep... 2000-02-26 1 87
## 2 82
## 3 72
## 4 77
## 5 87
## ... ...
## matchbox twenty Bent 2000-04-29 72 NA
## 73 NA
## 74 NA
## 75 NA
## 76 NA
##
## [24092 rows x 1 columns]
Going the other way, we can make long data wide with the .pivot() method, and then combine multiple individuals with .append(). Here's our long tax data:
## index Value TaxFormRow
## 0 0 Person James Acaster
## 1 0 Income 112341
## 2 0 Deductible 24000
## 3 0 AGI 88341
.pivot() is a DataFrame method that needs:
index (the columns that give us the key - what should it be here?)
columns (the column containing what will be the new variable names)
values (the column containing the new values)
tax_data.pivot(index="index", columns="Value", values="TaxFormRow")
## Value AGI Deductible Income Person
## index
## 0 88341 24000 112341 James Acaster
We can use .append() to stack data sets with the same variables together, handy for compiling data from different sources. (Note that newer versions of pandas removed .append(); there, pd.concat([df1, df2]) does the same job.)
tax_data.pivot(
index="index",
columns="Value",
values="TaxFormRow"
).append(tax_data2.pivot(
index="index",
columns="Value",
values="TaxFormRow"
))
## Value AGI Deductible Income Person
## index
## 0 88341 24000 112341 James Acaster
## 1 309122 16000 325122 Eddie Izzard
Now for combining data sets at different observation levels. Here's person_year_data:
## Person Year Income
## 0 Ramesh 2014 81314
## 1 Ramesh 2015 82155
## 2 Whitney 2014 131292
## 3 Whitney 2015 141262
## 4 David 2014 102452
## 5 David 2015 105133
That was person_year_data. And now for person_data:
## Person Birthplace
## 0 Ramesh Crawley
## 1 Whitney Washington D.C.
The .merge() method will do this. The different how options just determine what to do with rows you don't find a match for: 'left' keeps non-matching rows from the first dataset but not the second, 'right' from the second but not the first, 'outer' from both, 'inner' from neither. If the key columns are named differently in the two data sets, use 'left_on' and 'right_on' instead of 'on'.
person_year_data.merge(person_data, how='left', on='Person')
## Person Year Income Birthplace
## 0 Ramesh 2014 81314 Crawley
## 1 Ramesh 2015 82155 Crawley
## 2 Whitney 2014 131292 Washington D.C.
## 3 Whitney 2015 141262 Washington D.C.
## 4 David 2014 102452 NaN
## 5 David 2015 105133 NaN
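For comparison, how='inner' drops David's rows entirely, since he has no match in person_data - presumably from a call like:
person_year_data.merge(person_data, how='inner', on='Person')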
## Person Year Income Birthplace
## 0 Ramesh 2014 81314 Crawley
## 1 Ramesh 2015 82155 Crawley
## 2 Whitney 2014 131292 Washington D.C.
## 3 Whitney 2015 141262 Washington D.C.
Make sure the on variables are the exact observation level in at least one of the two data sets. If there are duplicate combinations of the on variables in both, that's a problem! The merge will create all the potential matches, which may not be what you want:
a = pd.DataFrame({"name": ["A", "A", "B", "C"],
                  "Year": [2014, 2015, 2014, 2014],
                  "Value": range(1, 5)})
b = pd.DataFrame({"name": ["A", "A", "B", "C", "C"],
                  "Characteristic": ["Up", "Down", "Up", "Left", "Right"]})
a.merge(b, how='left', on="name")
## name Year Value Characteristic
## 0 A 2014 1 Up
## 1 A 2014 1 Down
## 2 A 2015 2 Up
## 3 A 2015 2 Down
## 4 B 2014 3 Up
## 5 C 2014 4 Left
## 6 C 2014 4 Right
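How can we check this before merging? .duplicated() flags rows whose key values already appeared - a quick sketch using the a data frame from above:
a.duplicated(["name"])  # True for the second A row - "name" repeats
a.duplicated(["name"]).max()  # True, so "name" alone is not a key of a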
If we think Person is a key for data set a, then a.duplicated(["Person"]).max() returning True shows us we're wrong.
On to subsetting rows: Income > 100000 is True for everyone with income above 100000, and False otherwise. So filtering on Income > 100000 should give you every row with income above 100000. Two tools for this are .query() and .loc[]:
full_person_merge = person_year_data.merge(person_data, how='left', on='Person')
full_person_merge.query("Income > 100000")
## Person Year Income Birthplace
## 2 Whitney 2014 131292 Washington D.C.
## 3 Whitney 2015 141262 Washington D.C.
## 4 David 2014 102452 NaN
## 5 David 2015 105133 NaN
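The .loc[] version picks rows with a logical Series and gives the same result (a sketch of the equivalent call):
full_person_merge.loc[full_person_merge["Income"] > 100000]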
## Person Year Income Birthplace
## 2 Whitney 2014 131292 Washington D.C.
## 3 Whitney 2015 141262 Washington D.C.
## 4 David 2014 102452 NaN
## 5 David 2015 105133 NaN
A logical condition returns True wherever it holds, which turns into 1 if you do a calculation with it. If false, it returns False, which turns into 0.
Handy tools for constructing logical conditions:
a > b, a >= b, a < b, a <= b, a == b, or a != b to compare two numbers and check if a is above, above-or-equal, below, below-or-equal, equal (note == to check equality, not =), or not equal
a in [b, c, d, e, f] checks whether a is any of the values b, c, d, e, or f. Works for text too! (For a whole pandas column, use .isin([b, c, d, e, f]).)
To negate a condition, just put a not in front to reverse True/False: 2 + 2 == 4 is True, but not (2 + 2 == 4) is False. Combine conditions with and and or. Be careful with parentheses if combining them!
Square brackets with a list of names, or .drop(), give you back just a subset of the columns. They pick columns (you can also pick them by position with .iloc[]); use .drop() to not pick certain columns. If our data has the columns "Person", "Year", and "Income", then all of these do the same thing:
no_income = person_year_data[["Person", "Year"]]
# a few ways to do this, but this is the most readable
no_income = person_year_data.drop("Income", axis=1)
no_income = person_year_data.iloc[:, 0:2]
print(no_income)
.sort_values() sorts the data. That's it! Give it the column names and it will sort the data by those columns - here by Person, then Year:
person_year_data.sort_values(["Person", "Year"])
## Person Year Income
## 4 David 2014 102452
## 5 David 2015 105133
## 0 Ramesh 2014 81314
## 1 Ramesh 2015 82155
## 2 Whitney 2014 131292
## 3 Whitney 2015 141262
person_year_data["NextYear"] = person_year_data["Year"] + 1
person_year_data["Above100k"] = person_year_data["Income"] > 100000
print(person_year_data)
## Person Year Income NextYear Above100k
## 0 Ramesh 2014 81314 2015 False
## 1 Ramesh 2015 82155 2016 False
## 2 Whitney 2014 131292 2015 True
## 3 Whitney 2015 141262 2016 True
## 4 David 2014 102452 2015 True
## 5 David 2015 105133 2016 True
We use .loc[] to determine which rows to update, and then assign to them (here with .between() to help build our .loc[] conditions):
person_year_data["IncomeBracket"] = "Under 50k"
person_year_data.loc[person_year_data["Income"].between(
50001, 100000
), "IncomeBracket"] = "50-100k"
person_year_data.loc[person_year_data["Income"].between(
100001, 120000
), "IncomeBracket"] = "100-120k"
person_year_data.loc[person_year_data["Income"]
> 120000, "IncomeBracket"] = "Above 120k"
## Person Year Income NextYear Above100k IncomeBracket
## 0 Ramesh 2014 81314 2015 False 50-100k
## 1 Ramesh 2015 82155 2016 False 50-100k
## 2 Whitney 2014 131292 2015 True Above 120k
## 3 Whitney 2015 141262 2016 True Above 120k
## 4 David 2014 102452 2015 True 100-120k
## 5 David 2015 105133 2016 True 100-120k
person_year_data["Inflation_Adjusted_Income"] = person_year_data["Income"]
person_year_data.loc[person_year_data["Year"] ==
2014, "Inflation_Adjusted_Income"] *= 1.001
.groupby() turns the dataset into a grouped data set, splitting it by each combination of the grouping variables.
.transform() then processes the data separately within each group.
.groupby() helps us move information from one row to another within a key variable - otherwise a difficult move!
To aggregate, use .agg(): "count" gives the number of rows in the group - handy! .agg() changes the observation level to a broader one, as set by .groupby():
person_year_data.groupby(
"Person"
).agg(
{"Income": "mean", "Person": "count"}
).rename({"Person": "YearsTracked"}, axis=1)
## Income YearsTracked
## Person
## David 103792.5 2
## Ramesh 81734.5 2
## Whitney 136277.0 2
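While .agg() collapses each group to a single row, .transform() keeps every row and attaches the group statistic to each. A minimal sketch with the same data (the MeanIncome column name is just for illustration):
# every person-year row gets that person's mean income attached
person_year_data["MeanIncome"] = person_year_data.groupby("Person")["Income"].transform("mean")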
Common variable types:
Numeric: float, int, and so on. Check types with .dtypes, and convert with .astype() (or plain int() on a single value).
Watch out for long ID numbers stored as floats in scientific notation - that can lump multiple groups together! Avoid this with options like dtype in your data-reading function.
Strings: delimit with "", and '' is also OK, especially if you need a " in the string. Use + to stick stuff together, or .join() to paste together a list! "h"+"ello" is "hello", "_".join(["h","ello"]) is "h_ello".
Sometimes numbers arrive as strings: not 1000000 but a literal "1,000,000" with commas.
Categories: the pd.Categorical() function lets you specify these - and they can be ordered!
unsorted = pd.Categorical(
pd.Series(
[
"50k-100k", "Less than 50k", "50k-100k", "100k+", "100k+"
]
),
categories=[
"Less than 50k", "50k-100k", "100k+"
],
ordered=True
)
unsorted.sort_values()
## ['Less than 50k', '50k-100k', '50k-100k', '100k+', '100k+']
## Categories (3, object): ['Less than 50k' < '50k-100k' < '100k+']
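A quick sketch tying those type tools together (the file and column names here are hypothetical):
df = pd.read_csv("data.csv", dtype={"cbg": str})  # keep long ID codes as strings while reading
print(df.dtypes)  # check every column's type
df["income"] = df["income"].astype(int)  # convert a column to a numeric type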
Dates and times get their own type, datetime.
To pull out part of a string, string[start:end] will do this: "hello"[1:4] is 'ell' (the end position is exclusive), and "hello"[-1] is 'o'.
cbg = pd.DataFrame({"cbg": [152371824231, 1031562977281]}, dtype=str)
cbg["state_fips"] = cbg["cbg"].apply(lambda x: x[0:2] if len(x) == 13 else x[0:1])
cbg
## cbg state_fips
## 0 152371824231 1
## 1 1031562977281 10
To split a string, str.split() will do this: "a,b".split(",") is ["a", "b"]. For a whole column, use .str.split(). Make sure to rename the resulting columns as appropriate!
category = pd.DataFrame({"category": ["Sales,Marketing", "H&R,Marketing"]})
category["category"].str.split(",", expand=True).rename({0: "Category1", 1: "Category2"}, axis=1)
## Category1 Category2
## 0 Sales Marketing
## 1 H&R Marketing
.strip() removes beginning/end whitespace: " hi hello ".strip() is "hi hello". See also .rstrip() and .lstrip() for one-sided versions.
.str.replace() is often handy for eliminating (or fixing) unwanted characters:
number = pd.DataFrame({"number": ["1,000", "2,003,124"]})
number["number"].str.replace(",", "").astype(int)
## 0 1000
## 1 2003124
## Name: number, dtype: int32
In .str.replace(",", ""), the "," is a regular expression saying "look for a comma".
Regular expressions can do much more: [0-9] to look for a digit, [a-zA-Z] for letters, * to repeat until you see the next thing… hard to condense here. Read the guide!
A simple split won't do it here, not easily - we need a regular expression!
'\\([A-Z].*\\)' says "look for a (" (note the \\ to treat the usually-special ( character as an actual character), then "look for a capital letter [A-Z]", then "keep matching characters with .*", then "look for a )".
companies = pd.DataFrame({"name": ["Amazon (AMZN) Holdings", "Cargill Corp. (cool place!)"]})
companies["publicly_listed"] = companies["name"].str.contains("\\([A-Z].*\\)")
companies["name"] = companies["name"].str.replace("\\([A-Z].*\\)", "", regex=True)
print(companies)
## name publicly_listed
## 0 Amazon Holdings True
## 1 Cargill Corp. (cool place!) False
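A related tool worth a mention (my addition, not part of the original example): .str.extract() pulls out the matched group instead of testing for it or deleting it:
tickers = pd.DataFrame({"name": ["Amazon (AMZN) Holdings"]})
# the inner parentheses in the pattern mark the group to keep
tickers["ticker"] = tickers["name"].str.extract("\\(([A-Z]+)\\)", expand=False)  # "AMZN"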
Sometimes we want just one row per group - handy when .agg()ing, or when doing things like "calculate growth from an initial value". Use .head()! .head() and .tail() refer to the first and last rows, naturally.
stock_data["price_growth_since_march_4"] = stock_data.sort_values(["ticker", "date"]).groupby(
"ticker")["stock_price"].apply(lambda x: x/x.head(1).values[0] - 1)
print(stock_data)
## ticker date stock_price price_growth_since_march_4
## 0 AMZN 2020-03-04 103.0 0.000000
## 1 AMZN 2020-03-05 103.4 0.003883
## 2 AMZN 2020-03-06 107.0 0.038835
## 3 WMT 2020-03-04 85.2 0.000000
## 4 WMT 2020-03-05 86.3 0.012911
## 5 WMT 2020-03-06 85.6 0.004695
shift() looks to the row a certain number above/below this one, based on the periods argument.
shift() doesn't care about time structure, it only cares about data structure. If you want daily growth but the row above is last year, too bad!
stock_data["daily_price_growth"] = (stock_data["stock_price"]/stock_data.sort_values(["ticker", "date"]).groupby(
"ticker")["stock_price"].shift(1) - 1)
stock_data
## ticker date ... price_growth_since_march_4 daily_price_growth
## 0 AMZN 2020-03-04 ... 0.000000 NaN
## 1 AMZN 2020-03-05 ... 0.003883 0.003883
## 2 AMZN 2020-03-06 ... 0.038835 0.034816
## 3 WMT 2020-03-04 ... 0.000000 NaN
## 4 WMT 2020-03-05 ... 0.012911 0.012911
## 5 WMT 2020-03-06 ... 0.004695 -0.008111
##
## [6 rows x 5 columns]
Now let's move information across rows by .transform()ing. Here's our test score data:
## person school_grade subject test_score
## 0 Adam 6 Math 80
## 1 James 7 Math 84
## 2 Diego 7 English 67
## 3 Beth 8 Science 87
## 4 Francis 6 English 55
## 5 Qian 7 Science 75
## 6 Ryan 8 Math 85
## 7 Selma 8 PE 70
grades["math_scores"] = grades.loc[
grades["subject"] == "Math"
].groupby(
["school_grade"]
)["test_score"].transform("mean")
grades["Math_Average_In_This_Grade"] = grades.groupby(
"school_grade"
)["math_scores"].transform("max")
grades.drop("math_scores", axis=1)
## person school_grade subject test_score Math_Average_In_This_Grade
## 0 Adam 6 Math 80 80.0
## 1 James 7 Math 84 84.0
## 2 Diego 7 English 67 84.0
## 3 Beth 8 Science 87 85.0
## 4 Francis 6 English 55 80.0
## 5 Qian 7 Science 75 84.0
## 6 Ryan 8 Math 85 85.0
## 7 Selma 8 PE 70 85.0
Since .drop() returns a copy, grades itself still carries the math_scores helper column:
## person school_grade ... math_scores Math_Average_In_This_Grade
## 0 Adam 6 ... 80.0 80.0
## 1 James 7 ... 84.0 84.0
## 2 Diego 7 ... NaN 84.0
## 3 Beth 8 ... NaN 85.0
## 4 Francis 6 ... NaN 80.0
## 5 Qian 7 ... NaN 84.0
## 6 Ryan 8 ... 85.0 85.0
## 7 Selma 8 ... NaN 85.0
##
## [8 rows x 6 columns]
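An alternative sketch of the same move using earlier tools - aggregate math scores to the grade level, then merge back in (same column name as above; this construction is mine, not from the workshop):
math_avg = grades.loc[grades["subject"] == "Math"].groupby(
    "school_grade", as_index=False
)["test_score"].mean().rename(
    {"test_score": "Math_Average_In_This_Grade"}, axis=1
)
grades.merge(math_avg, how="left", on="school_grade")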
.startswith() helps you apply a given function to a lot of the right columns at once, in addition to regular ways like picking columns by name or position.
stock_data["price_growth_since_march_4"] = stock_data.sort_values(["ticker", "date"]).groupby("ticker")["stock_price"].apply(
lambda x: x/x.head(1).values[0] - 1)
stock_data["price_growth_daily"] = (stock_data["stock_price"] / stock_data.sort_values(["ticker", "date"]).groupby(
"ticker")["stock_price"].shift(1) - 1).startswith("price_growth") is the same here as 4:5 or ["price_growth_since_march_4", "price_growth_daily"]growth_cols = [col for col in stock_data.columns if col.startswith("price_growth")]
stock_growth = stock_data.copy()
stock_growth[growth_cols] *= 10000
print(stock_growth)
## ticker date ... daily_price_growth price_growth_daily
## 0 AMZN 2020-03-04 ... NaN NaN
## 1 AMZN 2020-03-05 ... 0.003883 38.834951
## 2 AMZN 2020-03-06 ... 0.034816 348.162476
## 3 WMT 2020-03-04 ... NaN NaN
## 4 WMT 2020-03-05 ... 0.012911 129.107981
## 5 WMT 2020-03-06 ... -0.008111 -81.112399
##
## [6 rows x 6 columns]
# Undo what we just did
stock_growth[growth_cols] /= 10000
for col in growth_cols:
    stock_growth[col+"_pct"] = stock_growth[col] * 100
    stock_growth[col+"_bps"] = stock_growth[col] * 10000
print(stock_growth)
## ticker date ... price_growth_daily_pct price_growth_daily_bps
## 0 AMZN 2020-03-04 ... NaN NaN
## 1 AMZN 2020-03-05 ... 0.388350 38.834951
## 2 AMZN 2020-03-06 ... 3.481625 348.162476
## 3 WMT 2020-03-04 ... NaN NaN
## 4 WMT 2020-03-05 ... 1.291080 129.107981
## 5 WMT 2020-03-06 ... -0.811124 -81.112399
##
## [6 rows x 10 columns]
def function_name(argument1: list = None,
                  argument2: set = set()) -> set:
    """This function has type hints AND a doc string. What
    a life of luxury this is."""
    # do some stuff
    some_value = 100*argument1  # note: repeating a list 100 times
    another_value = argument2 - set(some_value)  # set difference, so this is valid Python
    return another_value
    # alternatively, without saving another_value:
    # return argument2 - set(some_value)
You often don't even need the def wrapper, and can instead use a bunch of calls to that function in a row.
When you're done, save your work: to_parquet() writes a Parquet file, and to_csv() makes a CSV. Yay!
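A minimal sketch of those save commands (the filenames are placeholders, and to_parquet() needs the pyarrow or fastparquet package installed):
stock_data.to_parquet("stock_data.parquet")  # compact binary format that keeps your column types
stock_data.to_csv("stock_data.csv", index=False)  # plain text, readable anywhere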