In this notebook we will learn how to use pandas for data analysis. You can think of pandas as an extremely powerful version of Excel, with many more features. We will cover Series, DataFrames, missing data, groupby, merging and joining, common operations, and data input and output.
The first main data type we will learn about for pandas is the Series data type. Let's import Pandas and explore the Series object.
A Series is very similar to a NumPy array (in fact it is built on top of the NumPy array object). What differentiates a Series from a NumPy array is that a Series can have axis labels, meaning it can be indexed by a label instead of just a number location. It also doesn't need to hold numeric data; it can hold any arbitrary Python object.
Let's explore this concept through some examples:
import numpy as np
import pandas as pd
You can convert a list, NumPy array, or dictionary to a Series:
labels = ['a','b','c']
my_list = [10,20,30]
arr = np.array([10,20,30])
d = {'a':10,'b':20,'c':30}
Using Lists
pd.Series(data=my_list)
pd.Series(data=my_list,index=labels)
NumPy Arrays
pd.Series(arr)
pd.Series(arr,labels)
Dictionary
pd.Series(d)
A pandas Series can hold a variety of object types:
pd.Series(data=labels)
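A Series can even hold references to functions (rarely useful in practice, but it shows that the data really can be arbitrary Python objects):
pd.Series([sum,print,len]) # built-in functions stored as Series values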
The key to using a Series is understanding its index. Pandas makes use of these index names or numbers by allowing for fast look ups of information (works like a hash table or dictionary).
Let's see some examples of how to grab information from a Series. Let us create two Series, ser1 and ser2:
ser1 = pd.Series([1,2,3,4],index = ['USA', 'Germany','USSR', 'Japan'])      
ser1
ser2 = pd.Series([1,2,5,4],index = ['USA', 'Germany','Italy', 'Japan'])   
ser2
ser1['USA']
ser2['Japan']
Operations are then also done based on the index:
ser1 + ser2
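Labels present in only one of the two Series (here 'Italy' and 'USSR') produce NaN in the result. If you'd rather treat a missing label as zero, one option is the add method with its fill_value argument:
ser1.add(ser2, fill_value=0) # missing labels count as 0 instead of producing NaN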
DataFrames are the workhorse of pandas and are directly inspired by the R programming language. We can think of a DataFrame as a bunch of Series objects put together to share the same index. Let's use pandas to explore this topic!
import pandas as pd
import numpy as np
from numpy.random import randn
np.random.seed(101)
# create a sample DataFrame using random numbers
df = pd.DataFrame(randn(5,4),index='A B C D E'.split(),columns='W X Y Z'.split()) 
df
Let's learn the various methods to grab data from a DataFrame
df['W'] # grab the W column
# Pass a list of column names
df[['W','Z']]
DataFrame Columns are just Series
type(df['W'])
Creating a new column:
df['new'] = df['W'] + df['Y']
df
Removing Columns
df.drop('new',axis=1) # use axis = 1 for columns 
# Not inplace unless specified!
df
df.drop('new',axis=1,inplace=True)
df
Can also drop rows this way:
df.drop('E',axis=0) # specify axis = 0 for deleting a row
Selecting Rows
df.loc['A']
Or select based on position instead of label:
df.iloc[2] # select a row by its integer position (here the third row, 'C')
Selecting subset of rows and columns
df.loc['B','Y']
df.loc[['A','B'],['W','Y']]
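The same subset can also be selected by integer position using iloc, for example:
df.iloc[0:2, [0, 2]] # rows A and B by position, columns W and Y by position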
An important feature of pandas is conditional selection using bracket notation, very similar to NumPy:
df
df>0
df[df>0]  # select all elements greater than 0; elements failing the condition show up as NaN
df[df['W']>0] # select the rows where column W is greater than 0 (all columns kept)
df[df['W']>0]['Y'] # rows where W > 0, then only column Y
df[df['W']>0][['Y','X']] # rows where W > 0, then columns Y and X
For two conditions you can use | and & with parentheses:
df[(df['W']>0) & (df['Z'] > 0)] # Multiple conditions 
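The same pattern works for 'or' conditions with the | operator, for example:
df[(df['W']>0) | (df['Y']>1)] # rows where W > 0 or Y > 1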
Let's discuss some more features of indexing, including resetting the index or setting it to something else. We'll also talk about index hierarchy!
df
# Reset to default 0,1...n index
df.reset_index()
newind = 'CA NY WY OR CO'.split() # creates a list 
df['States'] = newind  # create a new column named States
df
df.set_index('States') # use States column as an index
df # the change is not saved unless you specify inplace=True
df.set_index('States',inplace=True) # now it will be saved 
df
Let us go over how to work with a Multi-Index. First we'll create a quick example of what a Multi-Indexed DataFrame would look like:
# Index Levels
outside = ['G1','G1','G1','G2','G2','G2']
inside = [1,2,3,1,2,3]
hier_index = list(zip(outside,inside))
hier_index = pd.MultiIndex.from_tuples(hier_index)
hier_index
df = pd.DataFrame(np.random.randn(6,2),index=hier_index,columns=['A','B'])
df
Now let's show how to index this! For index hierarchy we use df.loc[]; if this were on the columns axis, you would just use normal bracket notation df[]. Calling one level of the index returns the sub-DataFrame:
df.loc['G1']
df.loc['G1'].loc[1]
df.index.names
df.index.names = ['Group','Num']
df
df.xs('G1')
df.xs(('G1',1))  # pass a tuple to select from multiple index levels
df.xs(1,level='Num')
Let's show a few convenient methods to deal with Missing Data in pandas:
import numpy as np
import pandas as pd
# create a sample dataframe with NaN values
df = pd.DataFrame({'A':[1,2,np.nan],
                  'B':[5,np.nan,np.nan], 
                  'C':[1,2,3]})
df
df.dropna() # drop all rows that contain even a single NaN value
df.dropna(axis=1) # drop all columns that contain any NaN values
df.dropna(thresh=2) # keep only the rows with at least 2 non-NaN values (using the thresh parameter)
df['A'].fillna(value=df['A'].mean()) # fill the missing values using mean imputation
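You can also fill missing values with any fixed placeholder (the string below is an arbitrary choice):
df.fillna(value='FILL VALUE') # replace every NaN with the placeholder string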
The groupby method allows you to group rows of data together and call aggregate functions:
# create a sample dataframe
import pandas as pd
# Create dataframe
data = {'Company':['GOOG','GOOG','MSFT','MSFT','FB','FB'],
       'Person':['Sam','Charlie','Amy','Vanessa','Carl','Sarah'],
       'Sales':[200,120,340,124,243,350]}
df = pd.DataFrame(data)
df
Now you can use the .groupby() method to group rows together based on a column name. For instance, let's group based on the Company column. This will create a DataFrameGroupBy object:
df.groupby('Company')
You can save this object as a new variable:
by_comp = df.groupby("Company")
And then call aggregate methods off the object:
by_comp.mean(numeric_only=True) # numeric_only=True skips non-numeric columns such as Person
df.groupby('Company').mean(numeric_only=True)  # one-liner
More examples of aggregate methods:
by_comp.std(numeric_only=True) # standard deviation of sales
by_comp.min() # minimum sales 
by_comp.max() # maximum sales 
by_comp.count()    # total counts of each company
by_comp.describe() # the describe method gives detailed summary statistics
by_comp.describe().transpose() # if you prefer this way 
by_comp.describe().transpose()['GOOG'] # can also select a particular company name 
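If you need several aggregates at once rather than calling each method separately, .agg accepts a list of function names; a quick sketch:
by_comp['Sales'].agg(['mean','std','max']) # mean, standard deviation and maximum of Sales per company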
There are 3 main ways of combining DataFrames together: Merging, Joining and Concatenating. We will discuss these 3 methods with examples.
import pandas as pd
Let's create three sample DataFrames:
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                        'B': ['B0', 'B1', 'B2', 'B3'],
                        'C': ['C0', 'C1', 'C2', 'C3'],
                        'D': ['D0', 'D1', 'D2', 'D3']},
                        index=[0, 1, 2, 3])
df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                        'B': ['B4', 'B5', 'B6', 'B7'],
                        'C': ['C4', 'C5', 'C6', 'C7'],
                        'D': ['D4', 'D5', 'D6', 'D7']},
                         index=[4, 5, 6, 7]) 
df3 = pd.DataFrame({'A': ['A8', 'A9', 'A10', 'A11'],
                        'B': ['B8', 'B9', 'B10', 'B11'],
                        'C': ['C8', 'C9', 'C10', 'C11'],
                        'D': ['D8', 'D9', 'D10', 'D11']},
                        index=[8, 9, 10, 11])
df1
df2
df3
Concatenation basically glues together DataFrames. Keep in mind that dimensions should match along the axis you are concatenating on. You can use pd.concat and pass in a list of DataFrames to concatenate together:
pd.concat([df1,df2,df3])
pd.concat([df1,df2,df3],axis=1)
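Note that concatenating along axis=1 aligns on the index, so non-overlapping indexes produce NaN. You can also tag each source frame using the keys argument (the labels 'x', 'y', 'z' below are arbitrary):
pd.concat([df1,df2,df3],keys=['x','y','z']) # adds an outer index level identifying the source frame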
left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
   
right = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                          'C': ['C0', 'C1', 'C2', 'C3'],
                          'D': ['D0', 'D1', 'D2', 'D3']})    
left
right
The merge function allows you to merge DataFrames together using logic similar to merging SQL tables. For example:
pd.merge(left,right,how='inner',on='key')   # inner join 
Or to show a more complicated example:
left_complicated = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                     'key2': ['K0', 'K1', 'K0', 'K1'],
                        'A': ['A0', 'A1', 'A2', 'A3'],
                        'B': ['B0', 'B1', 'B2', 'B3']})
    
right_complicated = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                               'key2': ['K0', 'K0', 'K0', 'K0'],
                                  'C': ['C0', 'C1', 'C2', 'C3'],
                                  'D': ['D0', 'D1', 'D2', 'D3']})
left_complicated
right_complicated
pd.merge(left_complicated, right_complicated, on=['key1', 'key2'])  #inner join based on two key columns 
pd.merge(left_complicated, right_complicated, how='outer', on=['key1', 'key2'])  # full outer join
pd.merge(left_complicated, right_complicated, how='right', on=['key1', 'key2'])   # right join
pd.merge(left_complicated, right_complicated, how='left', on=['key1', 'key2'])   # left join
Joining is a convenient method for combining the columns of two potentially differently-indexed DataFrames into a single result DataFrame.
left = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                     'B': ['B0', 'B1', 'B2']},
                      index=['K0', 'K1', 'K2']) 
right = pd.DataFrame({'C': ['C0', 'C2', 'C3'],
                    'D': ['D0', 'D2', 'D3']},
                      index=['K0', 'K2', 'K3'])
left
right
left.join(right)
left.join(right, how='outer')
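join also accepts how='inner' to keep only the index labels present in both DataFrames:
left.join(right, how='inner') # only K0 and K2 appear in both indexes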
There are lots of operations with pandas that will be really useful to you; let's explore some of them:
import pandas as pd
df = pd.DataFrame({'col1':[1,2,3,4],'col2':[444,555,666,444],'col3':['abc','def','ghi','xyz']})
df.head() # view the first few rows of the dataframe
df['col2'].unique()       # returns an array of unique values 
df['col2'].nunique()     # returns the number of unique values 
df['col2'].value_counts()  # returns value counts
#Select from DataFrame using criteria from multiple columns
newdf = df[(df['col1']>2) & (df['col2']==444)]
newdf
def times2(x):
    # a function that takes a value and returns it multiplied by two
    return x*2
df['col1'].apply(times2)   # apply the function on the col1 column
df['col3'].apply(len)   # return length 
df['col1'].sum()   # return sum 
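The same transformation can be written inline with a lambda instead of a named function:
df['col1'].apply(lambda x: x*2) # equivalent to applying times2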
Permanently Removing a Column
del df['col1']
df
Get column and index names:
df.columns
df.index
Sorting and Ordering a DataFrame:
df
df.sort_values(by='col2') #inplace=False by default
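Sorting in descending order just takes the ascending flag:
df.sort_values(by='col2',ascending=False) # largest values of col2 first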
Checking for Null Values
df.isnull()
# Drop rows with NaN Values; in this case no NaN values
df.dropna()
Pivot Table example
data = {'A':['foo','foo','foo','bar','bar','bar'],
     'B':['one','one','two','two','one','one'],
       'C':['x','y','x','y','x','y'],
       'D':[1,3,2,5,4,1]}
df = pd.DataFrame(data)
df
df.pivot_table(values='D',index=['A', 'B'],columns=['C'])
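Combinations of index and column values that don't occur in the data show up as NaN; the fill_value argument replaces them (0 below is an arbitrary choice):
df.pivot_table(values='D',index=['A', 'B'],columns=['C'],fill_value=0)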
This final section focuses on importing and exporting data in Python with pandas. Pandas can read a variety of file types using its pd.read_ methods. Let's take a look at the most common data types:
import numpy as np
import pandas as pd
df = pd.read_csv('example')     # example is a csv file which is in the current working directory
df
df.to_csv('example',index=False)     # export the csv file to your current working directory 
Pandas can read and write Excel files. Keep in mind that this only imports data, not formulas or images; files containing images or macros may cause the read_excel method to fail.
pd.read_excel('Excel_Sample.xlsx',sheet_name='Sheet1')  # the sample file is in the current working directory
df.to_excel('Excel_Sample.xlsx',sheet_name='Sheet1')    # export the excel file to the CWD
You may need to install html5lib, lxml, and BeautifulSoup4. In your terminal/command prompt run:
conda install lxml
conda install html5lib
conda install beautifulsoup4
Then restart Jupyter Notebook. (or use pip install if you aren't using the Anaconda Distribution)
Pandas can read tables off of HTML pages. The read_html function will read the tables from a webpage and return a list of DataFrame objects:
df = pd.read_html('http://www.fdic.gov/bank/individual/failed/banklist.html')
df[0]
I hope you liked this comprehensive notebook on the pandas module in Python for data analysis. If you want to explore more, I recommend visiting the official pandas documentation (https://pandas.pydata.org/docs/), which has tons of useful examples and other resources.