Introduction to Pandas

In this notebook we will learn how to use pandas for data analysis. You can think of pandas as an extremely powerful version of Excel with many more features. The following topics are covered:

  • Series
  • DataFrames
  • Missing Data
  • GroupBy
  • Merging, Joining, and Concatenating
  • Operations
  • Data Input and Output

Series

The first main data type we will learn about for pandas is the Series data type. Let's import Pandas and explore the Series object.

A Series is very similar to a NumPy array (in fact it is built on top of the NumPy array object). What differentiates a Series from a NumPy array is that a Series can have axis labels, meaning it can be indexed by a label instead of just a numeric location. It also doesn't need to hold numeric data; it can hold any arbitrary Python object.

Let's explore this concept through some examples:

In [3]:
import numpy as np
import pandas as pd

Creating a Series

You can convert a list, NumPy array, or dictionary to a Series:

In [4]:
labels = ['a','b','c']
my_list = [10,20,30]
arr = np.array([10,20,30])
d = {'a':10,'b':20,'c':30}

Using Lists

In [5]:
pd.Series(data=my_list)
Out[5]:
0    10
1    20
2    30
dtype: int64
In [6]:
pd.Series(data=my_list,index=labels)
Out[6]:
a    10
b    20
c    30
dtype: int64

NumPy Arrays

In [8]:
pd.Series(arr)
Out[8]:
0    10
1    20
2    30
dtype: int64
In [9]:
pd.Series(arr,labels)
Out[9]:
a    10
b    20
c    30
dtype: int64

Dictionary

In [10]:
pd.Series(d)
Out[10]:
a    10
b    20
c    30
dtype: int64

Data in a Series

A pandas Series can hold a variety of object types:

In [11]:
pd.Series(data=labels)
Out[11]:
0    a
1    b
2    c
dtype: object
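
A Series can even hold arbitrary objects such as Python functions. As a quick illustration (not something you would normally need in an analysis):

pd.Series(data=[sum, print, len])   # an object-dtype Series holding three built-in functions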

Using an Index

The key to using a Series is understanding its index. Pandas makes use of these index names or numbers to allow fast lookups of information (it works like a hash table or dictionary).

Let's see some examples of how to grab information from a Series. Let's create two Series, ser1 and ser2:

In [12]:
ser1 = pd.Series([1,2,3,4],index = ['USA', 'Germany','USSR', 'Japan'])      
ser1
Out[12]:
USA        1
Germany    2
USSR       3
Japan      4
dtype: int64
In [13]:
ser2 = pd.Series([1,2,5,4],index = ['USA', 'Germany','Italy', 'Japan'])   
ser2
Out[13]:
USA        1
Germany    2
Italy      5
Japan      4
dtype: int64
In [14]:
ser1['USA']
Out[14]:
1
In [15]:
ser2['Japan']
Out[15]:
4

Operations are then also done based on the index; labels found in only one of the Series produce NaN:

In [16]:
ser1 + ser2
Out[16]:
Germany    4.0
Italy      NaN
Japan      8.0
USA        2.0
USSR       NaN
dtype: float64
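
If you would rather treat the non-matching labels as zero instead of NaN, the .add() method accepts a fill_value. A quick sketch:

ser1.add(ser2, fill_value=0)   # USSR and Italy are kept, with the missing side treated as 0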

DataFrames

DataFrames are the workhorse of pandas and are directly inspired by the R programming language. We can think of a DataFrame as a bunch of Series objects put together to share the same index. Let's use pandas to explore this topic!

In [17]:
import pandas as pd
import numpy as np
In [18]:
from numpy.random import randn
np.random.seed(101)
In [22]:
# create a sample DataFrame using random numbers
df = pd.DataFrame(randn(5,4),index='A B C D E'.split(),columns='W X Y Z'.split()) 
df
Out[22]:
W X Y Z
A 0.386030 2.084019 -0.376519 0.230336
B 0.681209 1.035125 -0.031160 1.939932
C -1.005187 -0.741790 0.187125 -0.732845
D -1.382920 1.482495 0.961458 -2.141212
E 0.992573 1.192241 -1.046780 1.292765

Selection and Indexing

Let's learn the various methods to grab data from a DataFrame.

In [23]:
df['W'] # grab the W column
Out[23]:
A    0.386030
B    0.681209
C   -1.005187
D   -1.382920
E    0.992573
Name: W, dtype: float64
In [24]:
# Pass a list of column names
df[['W','Z']]
Out[24]:
W Z
A 0.386030 0.230336
B 0.681209 1.939932
C -1.005187 -0.732845
D -1.382920 -2.141212
E 0.992573 1.292765

DataFrame Columns are just Series

In [26]:
type(df['W'])
Out[26]:
pandas.core.series.Series

Creating a new column:

In [27]:
df['new'] = df['W'] + df['Y']
In [28]:
df
Out[28]:
W X Y Z new
A 0.386030 2.084019 -0.376519 0.230336 0.009512
B 0.681209 1.035125 -0.031160 1.939932 0.650049
C -1.005187 -0.741790 0.187125 -0.732845 -0.818062
D -1.382920 1.482495 0.961458 -2.141212 -0.421462
E 0.992573 1.192241 -1.046780 1.292765 -0.054206

Removing Columns

In [29]:
df.drop('new',axis=1) # use axis = 1 for columns 
Out[29]:
W X Y Z
A 0.386030 2.084019 -0.376519 0.230336
B 0.681209 1.035125 -0.031160 1.939932
C -1.005187 -0.741790 0.187125 -0.732845
D -1.382920 1.482495 0.961458 -2.141212
E 0.992573 1.192241 -1.046780 1.292765
In [30]:
# Not inplace unless specified!
df
Out[30]:
W X Y Z new
A 0.386030 2.084019 -0.376519 0.230336 0.009512
B 0.681209 1.035125 -0.031160 1.939932 0.650049
C -1.005187 -0.741790 0.187125 -0.732845 -0.818062
D -1.382920 1.482495 0.961458 -2.141212 -0.421462
E 0.992573 1.192241 -1.046780 1.292765 -0.054206
In [31]:
df.drop('new',axis=1,inplace=True)
In [32]:
df
Out[32]:
W X Y Z
A 0.386030 2.084019 -0.376519 0.230336
B 0.681209 1.035125 -0.031160 1.939932
C -1.005187 -0.741790 0.187125 -0.732845
D -1.382920 1.482495 0.961458 -2.141212
E 0.992573 1.192241 -1.046780 1.292765

Can also drop rows this way:

In [33]:
df.drop('E',axis=0) # specify axis = 0 for deleting a row
Out[33]:
W X Y Z
A 0.386030 2.084019 -0.376519 0.230336
B 0.681209 1.035125 -0.031160 1.939932
C -1.005187 -0.741790 0.187125 -0.732845
D -1.382920 1.482495 0.961458 -2.141212

Selecting Rows

In [34]:
df.loc['A']
Out[34]:
W    0.386030
X    2.084019
Y   -0.376519
Z    0.230336
Name: A, dtype: float64

Or select based on position instead of label:

In [35]:
df.iloc[2] # select a row by integer position (this selects the 3rd row)
Out[35]:
W   -1.005187
X   -0.741790
Y    0.187125
Z   -0.732845
Name: C, dtype: float64

Selecting subset of rows and columns

In [36]:
df.loc['B','Y']
Out[36]:
-0.031160481493099617
In [37]:
df.loc[['A','B'],['W','Y']]
Out[37]:
W Y
A 0.386030 -0.376519
B 0.681209 -0.031160

Conditional Selection

An important feature of pandas is conditional selection using bracket notation, very similar to NumPy:

In [38]:
df
Out[38]:
W X Y Z
A 0.386030 2.084019 -0.376519 0.230336
B 0.681209 1.035125 -0.031160 1.939932
C -1.005187 -0.741790 0.187125 -0.732845
D -1.382920 1.482495 0.961458 -2.141212
E 0.992573 1.192241 -1.046780 1.292765
In [39]:
df>0
Out[39]:
W X Y Z
A True True False True
B True True False True
C False False True False
D False True True False
E True True False True
In [40]:
df[df>0]  # select all elements greater than 0; the ones that are not become NaN
Out[40]:
W X Y Z
A 0.386030 2.084019 NaN 0.230336
B 0.681209 1.035125 NaN 1.939932
C NaN NaN 0.187125 NaN
D NaN 1.482495 0.961458 NaN
E 0.992573 1.192241 NaN 1.292765
In [44]:
df[df['W']>0] # select all rows where the W column is greater than 0 (all columns kept)
Out[44]:
W X Y Z
A 0.386030 2.084019 -0.376519 0.230336
B 0.681209 1.035125 -0.031160 1.939932
E 0.992573 1.192241 -1.046780 1.292765
In [45]:
df[df['W']>0]['Y'] # rows where W is greater than 0, then select only the Y column
Out[45]:
A   -0.376519
B   -0.031160
E   -1.046780
Name: Y, dtype: float64
In [43]:
df[df['W']>0][['Y','X']] # rows where W is greater than 0, then select the Y and X columns
Out[43]:
Y X
A -0.376519 2.084019
B -0.031160 1.035125
E -1.046780 1.192241

For multiple conditions you can use | (or) and & (and) with parentheses:

In [50]:
df[(df['W']>0) & (df['Z'] > 0)] # Multiple conditions 
Out[50]:
W X Y Z
A 0.386030 2.084019 -0.376519 0.230336
B 0.681209 1.035125 -0.031160 1.939932
E 0.992573 1.192241 -1.046780 1.292765
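
Note that you must use & and | here; Python's plain and/or raise "ValueError: The truth value of a Series is ambiguous." For example, with | (or):

df[(df['W']>0) | (df['Y']>1)]   # rows where W is positive OR Y is greater than 1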

More Index Details

Let's discuss some more features of indexing, including resetting the index or setting it to something else. We'll also talk about index hierarchy!

In [51]:
df
Out[51]:
W X Y Z
A 0.386030 2.084019 -0.376519 0.230336
B 0.681209 1.035125 -0.031160 1.939932
C -1.005187 -0.741790 0.187125 -0.732845
D -1.382920 1.482495 0.961458 -2.141212
E 0.992573 1.192241 -1.046780 1.292765
In [52]:
# Reset to default 0,1...n index
df.reset_index()
Out[52]:
index W X Y Z
0 A 0.386030 2.084019 -0.376519 0.230336
1 B 0.681209 1.035125 -0.031160 1.939932
2 C -1.005187 -0.741790 0.187125 -0.732845
3 D -1.382920 1.482495 0.961458 -2.141212
4 E 0.992573 1.192241 -1.046780 1.292765
In [53]:
newind = 'CA NY WY OR CO'.split() # creates a list 
In [54]:
df['States'] = newind  # create a new column named States
In [55]:
df
Out[55]:
W X Y Z States
A 0.386030 2.084019 -0.376519 0.230336 CA
B 0.681209 1.035125 -0.031160 1.939932 NY
C -1.005187 -0.741790 0.187125 -0.732845 WY
D -1.382920 1.482495 0.961458 -2.141212 OR
E 0.992573 1.192241 -1.046780 1.292765 CO
In [56]:
df.set_index('States') # use States column as an index
Out[56]:
W X Y Z
States
CA 0.386030 2.084019 -0.376519 0.230336
NY 0.681209 1.035125 -0.031160 1.939932
WY -1.005187 -0.741790 0.187125 -0.732845
OR -1.382920 1.482495 0.961458 -2.141212
CO 0.992573 1.192241 -1.046780 1.292765
In [57]:
df # the change is not saved unless you specify inplace=True
Out[57]:
W X Y Z States
A 0.386030 2.084019 -0.376519 0.230336 CA
B 0.681209 1.035125 -0.031160 1.939932 NY
C -1.005187 -0.741790 0.187125 -0.732845 WY
D -1.382920 1.482495 0.961458 -2.141212 OR
E 0.992573 1.192241 -1.046780 1.292765 CO
In [58]:
df.set_index('States',inplace=True) # now it will be saved 
In [59]:
df
Out[59]:
W X Y Z
States
CA 0.386030 2.084019 -0.376519 0.230336
NY 0.681209 1.035125 -0.031160 1.939932
WY -1.005187 -0.741790 0.187125 -0.732845
OR -1.382920 1.482495 0.961458 -2.141212
CO 0.992573 1.192241 -1.046780 1.292765

Multi-Index and Index Hierarchy

Let's go over how to work with a MultiIndex. First, we'll create a quick example of what a multi-indexed DataFrame looks like:

In [60]:
# Index Levels
outside = ['G1','G1','G1','G2','G2','G2']
inside = [1,2,3,1,2,3]
hier_index = list(zip(outside,inside))
hier_index = pd.MultiIndex.from_tuples(hier_index)
In [61]:
hier_index
Out[61]:
MultiIndex(levels=[['G1', 'G2'], [1, 2, 3]],
           labels=[[0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 1, 2]])
In [62]:
df = pd.DataFrame(np.random.randn(6,2),index=hier_index,columns=['A','B'])
df
Out[62]:
A B
G1 1 -1.467514 -0.494095
2 -0.162535 0.485809
3 0.392489 0.221491
G2 1 -0.855196 1.541990
2 0.666319 -0.538235
3 -0.568581 1.407338

Now let's show how to index this! For index hierarchy we use df.loc[]; if this were on the columns axis, you would just use normal bracket notation df[]. Calling one level of the index returns the sub-DataFrame:

In [63]:
df.loc['G1']
Out[63]:
A B
1 -1.467514 -0.494095
2 -0.162535 0.485809
3 0.392489 0.221491
In [64]:
df.loc['G1'].loc[1]
Out[64]:
A   -1.467514
B   -0.494095
Name: 1, dtype: float64
In [65]:
df.index.names
Out[65]:
FrozenList([None, None])
In [67]:
df.index.names = ['Group','Num']
df
Out[67]:
A B
Group Num
G1 1 -1.467514 -0.494095
2 -0.162535 0.485809
3 0.392489 0.221491
G2 1 -0.855196 1.541990
2 0.666319 -0.538235
3 -0.568581 1.407338
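
The xs (cross-section) method is another convenient way to select data from a MultiIndex; unlike .loc, it can also grab all values at an inner index level directly (shown below with level='Num'):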
In [68]:
df.xs('G1')
Out[68]:
A B
Num
1 -1.467514 -0.494095
2 -0.162535 0.485809
3 0.392489 0.221491
In [69]:
df.xs(['G1',1])
Out[69]:
A   -1.467514
B   -0.494095
Name: (G1, 1), dtype: float64
In [70]:
df.xs(1,level='Num')
Out[70]:
A B
Group
G1 -1.467514 -0.494095
G2 -0.855196 1.541990

Missing Data

Let's show a few convenient methods to deal with Missing Data in pandas:

In [9]:
import numpy as np
import pandas as pd
In [10]:
# create a sample dataframe with NaN values
df = pd.DataFrame({'A':[1,2,np.nan],
                  'B':[5,np.nan,np.nan], 
                  'C':[1,2,3]})
In [11]:
df
Out[11]:
A B C
0 1.0 5.0 1
1 2.0 NaN 2
2 NaN NaN 3
In [12]:
df.dropna() # drop all rows that contain at least one NaN value
Out[12]:
A B C
0 1.0 5.0 1
In [13]:
df.dropna(axis=1) # drop all columns that contain at least one NaN value
Out[13]:
C
0 1
1 2
2 3
In [14]:
df.dropna(thresh=2) # keep only rows with at least 2 non-NaN values (thresh sets that minimum)
Out[14]:
A B C
0 1.0 5.0 1
1 2.0 NaN 2
In [77]:
df['A'].fillna(value=df['A'].mean()) # fill the missing values using mean imputation
Out[77]:
0    1.0
1    2.0
2    1.5
Name: A, dtype: float64
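
fillna also works on an entire DataFrame. Two other common strategies, sketched below (the choice of fill value depends on your data):

df.fillna(value=0)   # replace every NaN with a constant
df.ffill()           # forward-fill: propagate the last valid value down each column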

Groupby

The groupby method allows you to group rows of data together and call aggregate functions on them.

In [2]:
# create a sample dataframe

import pandas as pd
# Create dataframe
data = {'Company':['GOOG','GOOG','MSFT','MSFT','FB','FB'],
       'Person':['Sam','Charlie','Amy','Vanessa','Carl','Sarah'],
       'Sales':[200,120,340,124,243,350]}
In [4]:
df = pd.DataFrame(data)
df
Out[4]:
Company Person Sales
0 GOOG Sam 200
1 GOOG Charlie 120
2 MSFT Amy 340
3 MSFT Vanessa 124
4 FB Carl 243
5 FB Sarah 350

Now you can use the .groupby() method to group rows together based on a column name. For instance, let's group by Company. This will create a DataFrameGroupBy object:

In [5]:
df.groupby('Company')
Out[5]:
<pandas.core.groupby.DataFrameGroupBy object at 0x110333e80>

You can save this object as a new variable:

In [7]:
by_comp = df.groupby("Company")

And then call aggregate methods off the object:

In [8]:
by_comp.mean()
Out[8]:
Sales
Company
FB 296.5
GOOG 160.0
MSFT 232.0
In [9]:
df.groupby('Company').mean()  # one liner
Out[9]:
Sales
Company
FB 296.5
GOOG 160.0
MSFT 232.0

More examples of aggregate methods:

In [10]:
by_comp.std() # standard deviation of sales 
Out[10]:
Sales
Company
FB 75.660426
GOOG 56.568542
MSFT 152.735065
In [11]:
by_comp.min() # column-wise minimum per company (alphabetical for strings)
Out[11]:
Person Sales
Company
FB Carl 243
GOOG Charlie 120
MSFT Amy 124
In [12]:
by_comp.max() # column-wise maximum per company
Out[12]:
Person Sales
Company
FB Sarah 350
GOOG Sam 200
MSFT Vanessa 340
In [13]:
by_comp.count()    # number of entries per company
Out[13]:
Person Sales
Company
FB 2 2
GOOG 2 2
MSFT 2 2
In [14]:
by_comp.describe() # describe gives detailed summary statistics
Out[14]:
Sales
Company
FB count 2.000000
mean 296.500000
std 75.660426
min 243.000000
25% 269.750000
50% 296.500000
75% 323.250000
max 350.000000
GOOG count 2.000000
mean 160.000000
std 56.568542
min 120.000000
25% 140.000000
50% 160.000000
75% 180.000000
max 200.000000
MSFT count 2.000000
mean 232.000000
std 152.735065
min 124.000000
25% 178.000000
50% 232.000000
75% 286.000000
max 340.000000
In [15]:
by_comp.describe().transpose() # transposed, if you prefer a wide layout
Out[15]:
Company FB GOOG MSFT
count mean std min 25% 50% 75% max count mean ... 75% max count mean std min 25% 50% 75% max
Sales 2.0 296.5 75.660426 243.0 269.75 296.5 323.25 350.0 2.0 160.0 ... 180.0 200.0 2.0 232.0 152.735065 124.0 178.0 232.0 286.0 340.0

1 rows × 24 columns

In [16]:
by_comp.describe().transpose()['GOOG'] # can also select a particular company name 
Out[16]:
count mean std min 25% 50% 75% max
Sales 2.0 160.0 56.568542 120.0 140.0 160.0 180.0 200.0
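
You can also compute several aggregates in one call with .agg(). A minimal sketch using the by_comp object from above:

by_comp['Sales'].agg(['mean', 'std', 'min', 'max'])   # one row per company, one column per statistic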

Merging, Joining, and Concatenating

There are three main ways of combining DataFrames: merging, joining, and concatenating. We will discuss each with examples.

Example DataFrames

In [17]:
import pandas as pd

Create three sample DataFrames:

In [18]:
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                        'B': ['B0', 'B1', 'B2', 'B3'],
                        'C': ['C0', 'C1', 'C2', 'C3'],
                        'D': ['D0', 'D1', 'D2', 'D3']},
                        index=[0, 1, 2, 3])
In [19]:
df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                        'B': ['B4', 'B5', 'B6', 'B7'],
                        'C': ['C4', 'C5', 'C6', 'C7'],
                        'D': ['D4', 'D5', 'D6', 'D7']},
                         index=[4, 5, 6, 7]) 
In [20]:
df3 = pd.DataFrame({'A': ['A8', 'A9', 'A10', 'A11'],
                        'B': ['B8', 'B9', 'B10', 'B11'],
                        'C': ['C8', 'C9', 'C10', 'C11'],
                        'D': ['D8', 'D9', 'D10', 'D11']},
                        index=[8, 9, 10, 11])
In [21]:
df1
Out[21]:
A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
3 A3 B3 C3 D3
In [22]:
df2
Out[22]:
A B C D
4 A4 B4 C4 D4
5 A5 B5 C5 D5
6 A6 B6 C6 D6
7 A7 B7 C7 D7
In [23]:
df3
Out[23]:
A B C D
8 A8 B8 C8 D8
9 A9 B9 C9 D9
10 A10 B10 C10 D10
11 A11 B11 C11 D11

Concatenation

Concatenation basically glues together DataFrames. Keep in mind that dimensions should match along the axis you are concatenating on. You can use pd.concat and pass in a list of DataFrames to concatenate together:

In [24]:
pd.concat([df1,df2,df3])
Out[24]:
A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
3 A3 B3 C3 D3
4 A4 B4 C4 D4
5 A5 B5 C5 D5
6 A6 B6 C6 D6
7 A7 B7 C7 D7
8 A8 B8 C8 D8
9 A9 B9 C9 D9
10 A10 B10 C10 D10
11 A11 B11 C11 D11
In [25]:
pd.concat([df1,df2,df3],axis=1)
Out[25]:
A B C D A B C D A B C D
0 A0 B0 C0 D0 NaN NaN NaN NaN NaN NaN NaN NaN
1 A1 B1 C1 D1 NaN NaN NaN NaN NaN NaN NaN NaN
2 A2 B2 C2 D2 NaN NaN NaN NaN NaN NaN NaN NaN
3 A3 B3 C3 D3 NaN NaN NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN A4 B4 C4 D4 NaN NaN NaN NaN
5 NaN NaN NaN NaN A5 B5 C5 D5 NaN NaN NaN NaN
6 NaN NaN NaN NaN A6 B6 C6 D6 NaN NaN NaN NaN
7 NaN NaN NaN NaN A7 B7 C7 D7 NaN NaN NaN NaN
8 NaN NaN NaN NaN NaN NaN NaN NaN A8 B8 C8 D8
9 NaN NaN NaN NaN NaN NaN NaN NaN A9 B9 C9 D9
10 NaN NaN NaN NaN NaN NaN NaN NaN A10 B10 C10 D10
11 NaN NaN NaN NaN NaN NaN NaN NaN A11 B11 C11 D11
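
pd.concat also accepts a keys argument, which labels each input frame with an outer index level. A quick sketch (the key names here are arbitrary):

pd.concat([df1, df2, df3], keys=['one', 'two', 'three'])   # the outer index level identifies the source frame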

Example DataFrames

In [26]:
left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
   
right = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                          'C': ['C0', 'C1', 'C2', 'C3'],
                          'D': ['D0', 'D1', 'D2', 'D3']})    
In [27]:
left
Out[27]:
A B key
0 A0 B0 K0
1 A1 B1 K1
2 A2 B2 K2
3 A3 B3 K3
In [28]:
right
Out[28]:
C D key
0 C0 D0 K0
1 C1 D1 K1
2 C2 D2 K2
3 C3 D3 K3

Merging

The merge function allows you to merge DataFrames together using logic similar to joining SQL tables. For example:

In [29]:
pd.merge(left,right,how='inner',on='key')   # inner join 
Out[29]:
A B key C D
0 A0 B0 K0 C0 D0
1 A1 B1 K1 C1 D1
2 A2 B2 K2 C2 D2
3 A3 B3 K3 C3 D3

Or to show a more complicated example:

In [30]:
left_complicated = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                     'key2': ['K0', 'K1', 'K0', 'K1'],
                        'A': ['A0', 'A1', 'A2', 'A3'],
                        'B': ['B0', 'B1', 'B2', 'B3']})
    
right_complicated = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                               'key2': ['K0', 'K0', 'K0', 'K0'],
                                  'C': ['C0', 'C1', 'C2', 'C3'],
                                  'D': ['D0', 'D1', 'D2', 'D3']})
In [31]:
left_complicated
Out[31]:
A B key1 key2
0 A0 B0 K0 K0
1 A1 B1 K0 K1
2 A2 B2 K1 K0
3 A3 B3 K2 K1
In [32]:
right_complicated
Out[32]:
C D key1 key2
0 C0 D0 K0 K0
1 C1 D1 K1 K0
2 C2 D2 K1 K0
3 C3 D3 K2 K0
In [34]:
pd.merge(left_complicated, right_complicated, on=['key1', 'key2'])  #inner join based on two key columns 
Out[34]:
A B key1 key2 C D
0 A0 B0 K0 K0 C0 D0
1 A2 B2 K1 K0 C1 D1
2 A2 B2 K1 K0 C2 D2
In [35]:
pd.merge(left_complicated, right_complicated, how='outer', on=['key1', 'key2'])  # full outer join
Out[35]:
A B key1 key2 C D
0 A0 B0 K0 K0 C0 D0
1 A1 B1 K0 K1 NaN NaN
2 A2 B2 K1 K0 C1 D1
3 A2 B2 K1 K0 C2 D2
4 A3 B3 K2 K1 NaN NaN
5 NaN NaN K2 K0 C3 D3
In [36]:
pd.merge(left_complicated, right_complicated, how='right', on=['key1', 'key2'])   # right join
Out[36]:
A B key1 key2 C D
0 A0 B0 K0 K0 C0 D0
1 A2 B2 K1 K0 C1 D1
2 A2 B2 K1 K0 C2 D2
3 NaN NaN K2 K0 C3 D3
In [37]:
pd.merge(left_complicated, right_complicated, how='left', on=['key1', 'key2'])   # left join
Out[37]:
A B key1 key2 C D
0 A0 B0 K0 K0 C0 D0
1 A1 B1 K0 K1 NaN NaN
2 A2 B2 K1 K0 C1 D1
3 A2 B2 K1 K0 C2 D2
4 A3 B3 K2 K1 NaN NaN

Joining

Joining is a convenient method for combining the columns of two potentially differently-indexed DataFrames into a single result DataFrame.

In [38]:
left = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                     'B': ['B0', 'B1', 'B2']},
                      index=['K0', 'K1', 'K2']) 

right = pd.DataFrame({'C': ['C0', 'C2', 'C3'],
                    'D': ['D0', 'D2', 'D3']},
                      index=['K0', 'K2', 'K3'])
In [39]:
left
Out[39]:
A B
K0 A0 B0
K1 A1 B1
K2 A2 B2
In [40]:
right
Out[40]:
C D
K0 C0 D0
K2 C2 D2
K3 C3 D3
In [41]:
left.join(right)
Out[41]:
A B C D
K0 A0 B0 C0 D0
K1 A1 B1 NaN NaN
K2 A2 B2 C2 D2
In [42]:
left.join(right, how='outer')
Out[42]:
A B C D
K0 A0 B0 C0 D0
K1 A1 B1 NaN NaN
K2 A2 B2 C2 D2
K3 NaN NaN C3 D3
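
join supports the same how options as merge; for example, an inner join keeps only the index labels present in both frames:

left.join(right, how='inner')   # only K0 and K2 survive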

Operations

There are lots of pandas operations that will be really useful to you. Let's explore some of them:

In [43]:
import pandas as pd
df = pd.DataFrame({'col1':[1,2,3,4],'col2':[444,555,666,444],'col3':['abc','def','ghi','xyz']})
In [44]:
df.head() # view the first few rows of the dataframe
Out[44]:
col1 col2 col3
0 1 444 abc
1 2 555 def
2 3 666 ghi
3 4 444 xyz

Info on Unique Values

In [45]:
df['col2'].unique()       # returns an array of unique values 
Out[45]:
array([444, 555, 666])
In [46]:
df['col2'].nunique()     # returns the number of unique values 
Out[46]:
3
In [47]:
df['col2'].value_counts()  # returns value counts
Out[47]:
444    2
555    1
666    1
Name: col2, dtype: int64

Selecting Data

In [48]:
# Select from the DataFrame using criteria from multiple columns
newdf = df[(df['col1']>2) & (df['col2']==444)]
In [49]:
newdf
Out[49]:
col1 col2 col3
3 4 444 xyz

Applying Functions

In [50]:
# a function that takes a value and returns it multiplied by two
def times2(x):
    return x*2
In [51]:
df['col1'].apply(times2)   # apply the function on the col1 column
Out[51]:
0    2
1    4
2    6
3    8
Name: col1, dtype: int64
In [52]:
df['col3'].apply(len)   # return the length of each string
Out[52]:
0    3
1    3
2    3
3    3
Name: col3, dtype: int64
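
You don't have to define a named function; a lambda works just as well:

df['col1'].apply(lambda x: x*2)   # same result as times2 above
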
In [53]:
df['col1'].sum()   # return sum 
Out[53]:
10

Permanently Removing a Column

In [54]:
del df['col1']
In [55]:
df
Out[55]:
col2 col3
0 444 abc
1 555 def
2 666 ghi
3 444 xyz

Get column and index names:

In [56]:
df.columns
Out[56]:
Index(['col2', 'col3'], dtype='object')
In [57]:
df.index
Out[57]:
RangeIndex(start=0, stop=4, step=1)

Sorting and Ordering a DataFrame:

In [58]:
df
Out[58]:
col2 col3
0 444 abc
1 555 def
2 666 ghi
3 444 xyz
In [59]:
df.sort_values(by='col2') # inplace=False by default
Out[59]:
col2 col3
0 444 abc
3 444 xyz
1 555 def
2 666 ghi
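
sort_values can also sort in descending order, or by several columns at once. For example:

df.sort_values(by='col2', ascending=False)   # largest col2 values first
df.sort_values(by=['col2','col3'])           # break ties in col2 using col3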

Finding and Checking for Null Values

In [60]:
df.isnull()
Out[60]:
col2 col3
0 False False
1 False False
2 False False
3 False False
In [61]:
# Drop rows with NaN values; in this case there are none
df.dropna()
Out[61]:
col2 col3
0 444 abc
1 555 def
2 666 ghi
3 444 xyz

Pivot Table example

In [62]:
data = {'A':['foo','foo','foo','bar','bar','bar'],
     'B':['one','one','two','two','one','one'],
       'C':['x','y','x','y','x','y'],
       'D':[1,3,2,5,4,1]}

df = pd.DataFrame(data)
In [63]:
df
Out[63]:
A B C D
0 foo one x 1
1 foo one y 3
2 foo two x 2
3 bar two y 5
4 bar one x 4
5 bar one y 1
In [64]:
df.pivot_table(values='D',index=['A', 'B'],columns=['C'])
Out[64]:
C x y
A B
bar one 4.0 1.0
two NaN 5.0
foo one 1.0 3.0
two 2.0 NaN
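
pivot_table also takes a fill_value argument if you prefer a constant over NaN for empty cells, and an aggfunc argument to change the aggregation (the default is the mean). A sketch:

df.pivot_table(values='D', index=['A','B'], columns=['C'], fill_value=0)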

Data Input and Output

This final section focuses on importing and exporting data in Python with pandas. Pandas can read a variety of file types using its pd.read_ methods. Let's take a look at the most common data types:

In [2]:
import numpy as np
import pandas as pd

CSV

CSV Input

In [67]:
df = pd.read_csv('example')     # 'example' is a CSV file in the current working directory
df
Out[67]:
a b c d
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
3 12 13 14 15

CSV Output

In [68]:
df.to_csv('example',index=False)     # write a CSV file to the current working directory; index=False omits the row index

Excel

Pandas can read and write Excel files. Keep in mind that this only imports data, not formulas or images; files containing images or macros may cause the read_excel method to fail.

Excel Input

In [70]:
pd.read_excel('Excel_Sample.xlsx',sheet_name='Sheet1')  # the sample file is in the current working directory (older pandas versions used sheetname=)
Out[70]:
a b c d
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
3 12 13 14 15

Excel Output

In [71]:
df.to_excel('Excel_Sample.xlsx',sheet_name='Sheet1')    # write an Excel file to the current working directory

HTML

You may need to install lxml, html5lib, and BeautifulSoup4. In your terminal/command prompt run:

conda install lxml
conda install html5lib
conda install beautifulsoup4

Then restart Jupyter Notebook. (or use pip install if you aren't using the Anaconda Distribution)

Pandas can read tables from HTML pages. For example:

HTML Input

The pandas read_html function reads the tables on a webpage and returns a list of DataFrame objects:

In [3]:
df = pd.read_html('http://www.fdic.gov/bank/individual/failed/banklist.html')
In [6]:
df[0]
Out[6]:
Bank Name City ST CERT Acquiring Institution Closing Date Updated Date
0 Guaranty Bank, (d/b/a BestBank in Georgia & Mi... Milwaukee WI 30003 First-Citizens Bank & Trust Company May 5, 2017 May 11, 2017
1 First NBC Bank New Orleans LA 58302 Whitney Bank April 28, 2017 May 10, 2017
2 Proficio Bank Cottonwood Heights UT 35495 Cache Valley Bank March 3, 2017 April 13, 2017
3 Seaway Bank and Trust Company Chicago IL 19328 State Bank of Texas January 27, 2017 April 21, 2017
4 Harvest Community Bank Pennsville NJ 34951 First-Citizens Bank & Trust Company January 13, 2017 April 13, 2017
5 Allied Bank Mulberry AR 91 Today's Bank September 23, 2016 November 17, 2016
6 The Woodbury Banking Company Woodbury GA 11297 United Bank August 19, 2016 April 27, 2017
7 First CornerStone Bank King of Prussia PA 35312 First-Citizens Bank & Trust Company May 6, 2016 September 6, 2016
8 Trust Company Bank Memphis TN 9956 The Bank of Fayette County April 29, 2016 September 6, 2016
9 North Milwaukee State Bank Milwaukee WI 20364 First-Citizens Bank & Trust Company March 11, 2016 March 13, 2017
10 Hometown National Bank Longview WA 35156 Twin City Bank October 2, 2015 April 13, 2016
11 The Bank of Georgia Peachtree City GA 35259 Fidelity Bank October 2, 2015 October 24, 2016
12 Premier Bank Denver CO 34112 United Fidelity Bank, fsb July 10, 2015 August 17, 2016
13 Edgebrook Bank Chicago IL 57772 Republic Bank of Chicago May 8, 2015 July 12, 2016
14 Doral Bank En Espanol San Juan PR 32102 Banco Popular de Puerto Rico February 27, 2015 May 13, 2015
15 Capitol City Bank & Trust Company Atlanta GA 33938 First-Citizens Bank & Trust Company February 13, 2015 April 21, 2015
16 Highland Community Bank Chicago IL 20290 United Fidelity Bank, fsb January 23, 2015 April 21, 2015
17 First National Bank of Crestview Crestview FL 17557 First NBC Bank January 16, 2015 May 8, 2017
18 Northern Star Bank Mankato MN 34983 BankVista December 19, 2014 January 6, 2016
19 Frontier Bank, FSB D/B/A El Paseo Bank Palm Desert CA 34738 Bank of Southern California, N.A. November 7, 2014 November 10, 2016
20 The National Republic Bank of Chicago Chicago IL 916 State Bank of Texas October 24, 2014 January 6, 2016
21 NBRS Financial Rising Sun MD 4862 Howard Bank October 17, 2014 March 26, 2015
22 GreenChoice Bank, fsb Chicago IL 28462 Providence Bank, LLC July 25, 2014 December 12, 2016
23 Eastside Commercial Bank Conyers GA 58125 Community & Southern Bank July 18, 2014 July 11, 2016
24 The Freedom State Bank Freedom OK 12483 Alva State Bank & Trust Company June 27, 2014 March 25, 2016
25 Valley Bank Fort Lauderdale FL 21793 Landmark Bank, National Association June 20, 2014 June 29, 2015
26 Valley Bank Moline IL 10450 Great Southern Bank June 20, 2014 June 26, 2015
27 Slavie Federal Savings Bank Bel Air MD 32368 Bay Bank, FSB May 30, 2014 December 12, 2016
28 Columbia Savings Bank Cincinnati OH 32284 United Fidelity Bank, fsb May 23, 2014 November 10, 2016
29 AztecAmerica Bank En Espanol Berwyn IL 57866 Republic Bank of Chicago May 16, 2014 October 20, 2016
... ... ... ... ... ... ... ...
522 ANB Financial, NA Bentonville AR 33901 Pulaski Bank and Trust Company May 9, 2008 August 28, 2012
523 Hume Bank Hume MO 1971 Security Bank March 7, 2008 August 28, 2012
524 Douglass National Bank Kansas City MO 24660 Liberty Bank and Trust Company January 25, 2008 October 26, 2012
525 Miami Valley Bank Lakeview OH 16848 The Citizens Banking Company October 4, 2007 September 12, 2016
526 NetBank Alpharetta GA 32575 ING DIRECT September 28, 2007 August 28, 2012
527 Metropolitan Savings Bank Pittsburgh PA 35353 Allegheny Valley Bank of Pittsburgh February 2, 2007 October 27, 2010
528 Bank of Ephraim Ephraim UT 1249 Far West Bank June 25, 2004 April 9, 2008
529 Reliance Bank White Plains NY 26778 Union State Bank March 19, 2004 April 9, 2008
530 Guaranty National Bank of Tallahassee Tallahassee FL 26838 Hancock Bank of Florida March 12, 2004 June 5, 2012
531 Dollar Savings Bank Newark NJ 31330 No Acquirer February 14, 2004 April 9, 2008
532 Pulaski Savings Bank Philadelphia PA 27203 Earthstar Bank November 14, 2003 July 22, 2005
533 First National Bank of Blanchardville Blanchardville WI 11639 The Park Bank May 9, 2003 June 5, 2012
534 Southern Pacific Bank Torrance CA 27094 Beal Bank February 7, 2003 October 20, 2008
535 Farmers Bank of Cheneyville Cheneyville LA 16445 Sabine State Bank & Trust December 17, 2002 October 20, 2004
536 Bank of Alamo Alamo TN 9961 No Acquirer November 8, 2002 March 18, 2005
537 AmTrade International Bank En Espanol Atlanta GA 33784 No Acquirer September 30, 2002 September 11, 2006
538 Universal Federal Savings Bank Chicago IL 29355 Chicago Community Bank June 27, 2002 April 9, 2008
539 Connecticut Bank of Commerce Stamford CT 19183 Hudson United Bank June 26, 2002 February 14, 2012
540 New Century Bank Shelby Township MI 34979 No Acquirer March 28, 2002 March 18, 2005
541 Net 1st National Bank Boca Raton FL 26652 Bank Leumi USA March 1, 2002 April 9, 2008
542 NextBank, NA Phoenix AZ 22314 No Acquirer February 7, 2002 February 5, 2015
543 Oakwood Deposit Bank Co. Oakwood OH 8966 The State Bank & Trust Company February 1, 2002 October 25, 2012
544 Bank of Sierra Blanca Sierra Blanca TX 22002 The Security State Bank of Pecos January 18, 2002 November 6, 2003
545 Hamilton Bank, NA En Espanol Miami FL 24382 Israel Discount Bank of New York January 11, 2002 September 21, 2015
546 Sinclair National Bank Gravette AR 34248 Delta Trust & Bank September 7, 2001 February 10, 2004
547 Superior Bank, FSB Hinsdale IL 32646 Superior Federal, FSB July 27, 2001 August 19, 2014
548 Malta National Bank Malta OH 6629 North Valley Bank May 3, 2001 November 18, 2002
549 First Alliance Bank & Trust Co. Manchester NH 34264 Southern New Hampshire Bank & Trust February 2, 2001 February 18, 2003
550 National State Bank of Metropolis Metropolis IL 3815 Banterra Bank of Marion December 14, 2000 March 17, 2005
551 Bank of Honolulu Honolulu HI 21029 Bank of the Orient October 13, 2000 March 17, 2005

552 rows × 7 columns

I hope you liked this comprehensive notebook on the pandas module in Python for performing data analysis with DataFrames. If you want to explore more, I recommend visiting the official pandas documentation page, which has tons of useful examples and other resources: Official Documentation
