NumPy Arrays


NumPy arrays can be vectors or matrices. Vectors are 1D arrays and matrices are 2D (though a matrix can still have only one row or one column).

Creating NumPy Arrays

From a Python List
> import numpy as np
+ my_list = [1,2,3]
+ np.array(my_list)
array([1, 2, 3])
> my_matrix = [[1,2,3],[4,5,6],[7,8,9]]
+ my_matrix
[[1, 2, 3], [4, 5, 6], [7, 8, 9]]
> # Cast to 2d array
+ np.array(my_matrix)
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])
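
The shape attribute (covered in more detail below) makes the vector-vs-matrix distinction concrete; a quick sketch:

> # 1D vector vs. 2D one-column matrix
+ np.array([1,2,3]).shape
(3,)
> np.array([[1],[2],[3]]).shape
(3, 1)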

Built-in Methods

arange

Returns evenly spaced values within a given interval.

> # up to but not including 10
+ np.arange(0,10) 
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
> #by 2
+ np.arange(0,11,2) 
array([ 0,  2,  4,  6,  8, 10])
zeros and ones

Generate arrays of zeros or ones.

> np.zeros(3)
array([0., 0., 0.])
> # tuple (rows, columns)
+ np.zeros((5,4)) 
array([[0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]])
> np.ones(3)
array([1., 1., 1.])
> np.ones((3,4))
array([[1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.]])
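
A related creation helper (not shown above, but standard NumPy): np.full fills an array of a given shape with an arbitrary constant.

> # Fill a 2x3 array with 7s
+ np.full((2,3),7)
array([[7, 7, 7],
       [7, 7, 7]])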
linspace

Returns evenly spaced numbers over a specified interval. The arguments are start, stop, and num (the number of points).

> np.linspace(0,10,3)
array([ 0.,  5., 10.])
> np.linspace(0,5,10)
array([0.        , 0.55555556, 1.11111111, 1.66666667, 2.22222222,
       2.77777778, 3.33333333, 3.88888889, 4.44444444, 5.        ])
> # one opening bracket
+ # still 1 dimensional
+ np.linspace(0,10,50)
array([ 0.        ,  0.20408163,  0.40816327,  0.6122449 ,  0.81632653,
        1.02040816,  1.2244898 ,  1.42857143,  1.63265306,  1.83673469,
        2.04081633,  2.24489796,  2.44897959,  2.65306122,  2.85714286,
        3.06122449,  3.26530612,  3.46938776,  3.67346939,  3.87755102,
        4.08163265,  4.28571429,  4.48979592,  4.69387755,  4.89795918,
        5.10204082,  5.30612245,  5.51020408,  5.71428571,  5.91836735,
        6.12244898,  6.32653061,  6.53061224,  6.73469388,  6.93877551,
        7.14285714,  7.34693878,  7.55102041,  7.75510204,  7.95918367,
        8.16326531,  8.36734694,  8.57142857,  8.7755102 ,  8.97959184,
        9.18367347,  9.3877551 ,  9.59183673,  9.79591837, 10.        ])
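
One way to keep arange and linspace straight: arange takes a step size (stop excluded), while linspace takes a count of points (stop included by default). A minimal comparison:

> # arange: step of 0.25, stop excluded
+ np.arange(0,1,0.25)
array([0.  , 0.25, 0.5 , 0.75])
> # linspace: 5 points, stop included
+ np.linspace(0,1,5)
array([0.  , 0.25, 0.5 , 0.75, 1.  ])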
eye

Creates an identity matrix

> np.eye(4)
array([[1., 0., 0., 0.],
       [0., 1., 0., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.]])

Random

rand

Create an array of the given shape and populate it with random samples from a uniform distribution over [0, 1).

> np.random.rand(2)
array([0.59110145, 0.97873794])
> #2d
+ np.random.rand(5,4)
array([[0.49683779, 0.41011857, 0.49680767, 0.19783712],
       [0.38087874, 0.09527378, 0.53229016, 0.9158716 ],
       [0.9875367 , 0.82660478, 0.6605952 , 0.90010448],
       [0.05382331, 0.50698107, 0.58781927, 0.29309634],
       [0.46095471, 0.38694441, 0.87989227, 0.46206292]])
randn

Returns a sample (or samples) from the “standard normal” distribution.

> np.random.randn(2)
array([0.00592357, 1.21541429])
> np.random.randn(5,4)
array([[ 0.8479065 ,  1.22475886,  0.66784301, -0.96586068],
       [-1.42468007,  1.07965643, -0.36176629,  0.31531068],
       [-2.48265365,  0.51033764, -0.5693635 , -0.960074  ],
       [-0.89532982,  0.89758788, -0.62306251, -0.30047938],
       [-0.10816238,  0.54934125, -0.45687432,  2.90258011]])
randint

Return random integers from low (inclusive) to high (exclusive).

> np.random.randint(1,100)
24
> #10 numbers
+ np.random.randint(1,100,10)
array([70, 86, 33,  1, 66, 28, 86, 83, 61, 90])
> # set the seed so results are replicable
+ # this will always produce 45
+ np.random.seed(0)
+ np.random.randint(1,100)
45

You can import the function directly to save some typing.

> from numpy.random import randint
+ randint(2,10)
9

Array Attributes and Methods

> #Some arrays to work with
+ arr = np.arange(25)
+ ranarr = np.random.randint(0,50,10)
+ arr
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24])
> ranarr
array([ 0,  3,  3, 39,  9, 19, 21, 36, 23,  6])
reshape

Returns an array containing the same data with a new shape.

> # errors if the elements can't fill the new shape exactly
+ arr.reshape(5,5) 
array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19],
       [20, 21, 22, 23, 24]])
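
You can also pass -1 for one dimension and NumPy will infer it from the length of the array; a small sketch with arr from above:

> # -1 lets NumPy infer the second dimension (25/5 = 5)
+ arr.reshape(5,-1).shape
(5, 5)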
max, min, argmax, argmin

Find the max or min values, or their index locations, with argmax and argmin:

> ranarr.max()
39
> #index location
+ ranarr.argmax() 
3
> ranarr.min()
0
> #index location
+ ranarr.argmin() 
0
Shape

Shape is an attribute that arrays have (not a method):

> arr.shape
(25,)
> # Notice the two sets of brackets
+ arr.reshape(5,5)
array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19],
       [20, 21, 22, 23, 24]])
> arr.reshape(5,5).shape
(5, 5)
dtype

You can also grab the data type of the objects in the array:

> arr.dtype
dtype('int32')
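
The dtype can also be set explicitly when creating the array; a quick sketch:

> np.array([1,2,3],dtype=np.float64)
array([1., 2., 3.])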

Indexing and Selection


> #Creating sample array
+ arr = np.arange(0,11)
+ arr
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10])

Bracket Indexing and Selection

> #Get a value at an index
+ arr[8]
8
> #Get values in a range
+ #slice notation
+ arr[1:5]
array([1, 2, 3, 4])
> #Get values in a range
+ arr[0:5]
array([0, 1, 2, 3, 4])
> arr[:6]
array([0, 1, 2, 3, 4, 5])
> arr[5:]
array([ 5,  6,  7,  8,  9, 10])

Broadcasting

NumPy arrays differ from normal Python lists because of their ability to broadcast:

> #Setting a value with index range (Broadcasting)
+ arr[0:5]=100
+ arr
array([100, 100, 100, 100, 100,   5,   6,   7,   8,   9,  10])
> # Reset array to remove 100s
+ arr = np.arange(0,11)
+ arr
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10])
> # Take a slice
+ slice_of_arr = arr[0:6]
+ slice_of_arr
array([0, 1, 2, 3, 4, 5])
> #Change Slice
+ #select everything
+ slice_of_arr[:]=99
+ slice_of_arr
array([99, 99, 99, 99, 99, 99])

Assigning a slice of the array to slice_of_arr does not actually create a separate array; the slice is a view that points to the original data, so any changes will impact both.

> arr
array([99, 99, 99, 99, 99, 99,  6,  7,  8,  9, 10])

If you don’t want to change the original, you must make a copy with arr.copy().

> #To get a copy, need to be explicit
+ arr_copy = arr.copy()
+ arr_copy
array([99, 99, 99, 99, 99, 99,  6,  7,  8,  9, 10])
> arr_copy[:]=100
+ arr_copy
array([100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100])

Now the original is not affected.

> arr 
array([99, 99, 99, 99, 99, 99,  6,  7,  8,  9, 10])
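
A quick way to check whether you are holding a view or a copy (using the arrays above): a view's .base attribute points back to the array it was sliced from, while a copy owns its own data.

> slice_of_arr.base is arr
True
> arr_copy.base is None
True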

Indexing a 2D Array (Matrices)

The general format is arr_2d[row][col] or arr_2d[row,col].

> arr_2d = np.array(([5,10,15],[20,25,30],[35,40,45]))
+ arr_2d
array([[ 5, 10, 15],
       [20, 25, 30],
       [35, 40, 45]])
> #Indexing row
+ arr_2d[1]
array([20, 25, 30])
> # Getting individual element value
+ arr_2d[1][0]
20
> arr_2d[1,0]
20
> # 2D array slicing
+ #Shape (2,2) from top right corner
+ arr_2d[:2,1:]
array([[10, 15],
       [25, 30]])
> # Bottom row
+ arr_2d[2]
array([35, 40, 45])
> arr_2d[2,:]
array([35, 40, 45])
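
Grabbing a column works the same way, with a slice in the row slot:

> # Middle column
+ arr_2d[:,1]
array([10, 25, 40])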
Fancy Indexing

Fancy indexing allows you to select entire rows or columns out of order.

> #Set up matrix
+ arr2d = np.zeros((10,10))
> # Number of columns (equal to the number of rows here)
+ arr_length = arr2d.shape[1]
> #Set up array
+ for i in range(arr_length):
+     arr2d[i] = i
+     
+ arr2d
array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
       [3., 3., 3., 3., 3., 3., 3., 3., 3., 3.],
       [4., 4., 4., 4., 4., 4., 4., 4., 4., 4.],
       [5., 5., 5., 5., 5., 5., 5., 5., 5., 5.],
       [6., 6., 6., 6., 6., 6., 6., 6., 6., 6.],
       [7., 7., 7., 7., 7., 7., 7., 7., 7., 7.],
       [8., 8., 8., 8., 8., 8., 8., 8., 8., 8.],
       [9., 9., 9., 9., 9., 9., 9., 9., 9., 9.]])

Fancy indexing allows the following:

> arr2d[[2,4,6,8]]
array([[2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
       [4., 4., 4., 4., 4., 4., 4., 4., 4., 4.],
       [6., 6., 6., 6., 6., 6., 6., 6., 6., 6.],
       [8., 8., 8., 8., 8., 8., 8., 8., 8., 8.]])
> #Allows in any order
+ arr2d[[6,4,2,7]]
array([[6., 6., 6., 6., 6., 6., 6., 6., 6., 6.],
       [4., 4., 4., 4., 4., 4., 4., 4., 4., 4.],
       [2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
       [7., 7., 7., 7., 7., 7., 7., 7., 7., 7.]])

Selection

You can use brackets for selection with comparison operators.

> arr = np.arange(1,11)
+ arr
array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10])
> arr > 4
array([False, False, False, False,  True,  True,  True,  True,  True,
        True])
> bool_arr = arr>4
+ bool_arr
array([False, False, False, False,  True,  True,  True,  True,  True,
        True])
> arr[bool_arr]
array([ 5,  6,  7,  8,  9, 10])
> arr[arr>2]
array([ 3,  4,  5,  6,  7,  8,  9, 10])
> x = 2
+ arr[arr>x]
array([ 3,  4,  5,  6,  7,  8,  9, 10])
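
Conditions can be combined with & (AND) and | (OR), with each condition wrapped in parentheses; a small sketch with arr from above:

> arr[(arr>2) & (arr<8)]
array([3, 4, 5, 6, 7])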

NumPy Operations


Arithmetic

You can easily perform array-with-array arithmetic, or scalar-with-array arithmetic.

> arr = np.arange(0,10)
+ arr + arr
array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18])
> arr * arr
array([ 0,  1,  4,  9, 16, 25, 36, 49, 64, 81])
> arr+100
array([100, 101, 102, 103, 104, 105, 106, 107, 108, 109])
> arr*100
array([  0, 100, 200, 300, 400, 500, 600, 700, 800, 900])
> arr - arr
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
> # Warning on division by zero
+ # Just replaced with nan
+ arr/arr
array([nan,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.])

C:/Users/pbj20/Anaconda3/python.exe:1: RuntimeWarning: invalid value encountered in true_divide
> # Also warning
+ # infinity instead of error
+ 1/arr
array([       inf, 1.        , 0.5       , 0.33333333, 0.25      ,
       0.2       , 0.16666667, 0.14285714, 0.125     , 0.11111111])

C:/Users/pbj20/Anaconda3/python.exe:1: RuntimeWarning: divide by zero encountered in true_divide
> arr**3
array([  0,   1,   8,  27,  64, 125, 216, 343, 512, 729], dtype=int32)

Universal Array Functions

Universal array functions (ufuncs) apply a mathematical operation element-wise across the array.

See here for more details: http://docs.scipy.org/doc/numpy/reference/ufuncs.html

Some common ones are shown below:

> #Taking Square Roots
+ np.sqrt(arr)
array([0.        , 1.        , 1.41421356, 1.73205081, 2.        ,
       2.23606798, 2.44948974, 2.64575131, 2.82842712, 3.        ])
> # Calculating the exponential (e^x)
+ np.exp(arr)
array([1.00000000e+00, 2.71828183e+00, 7.38905610e+00, 2.00855369e+01,
       5.45981500e+01, 1.48413159e+02, 4.03428793e+02, 1.09663316e+03,
       2.98095799e+03, 8.10308393e+03])
> np.max(arr) #same as arr.max()
9
> np.sin(arr)
array([ 0.        ,  0.84147098,  0.90929743,  0.14112001, -0.7568025 ,
       -0.95892427, -0.2794155 ,  0.6569866 ,  0.98935825,  0.41211849])
> np.log(arr)
array([      -inf, 0.        , 0.69314718, 1.09861229, 1.38629436,
       1.60943791, 1.79175947, 1.94591015, 2.07944154, 2.19722458])

C:/Users/pbj20/Anaconda3/python.exe:1: RuntimeWarning: divide by zero encountered in log
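
NumPy also has aggregation methods that reduce an array to a single number; a quick sketch with arr = np.arange(0,10) from above (the exact float repr may vary):

> arr.sum(), arr.mean(), arr.std()
(45, 4.5, 2.8722813232690143)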

NumPy Exercises


  1. Create an array of 10 zeros
> np.zeros(10)
array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
  2. Create an array of 10 ones
> np.ones(10)
array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])
  3. Create an array of 10 fives
> np.ones(10) * 5
array([5., 5., 5., 5., 5., 5., 5., 5., 5., 5.])
> np.zeros(10) + 5
array([5., 5., 5., 5., 5., 5., 5., 5., 5., 5.])
  4. Create an array of the integers from 10 to 50
> np.arange(10,51)
array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26,
       27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43,
       44, 45, 46, 47, 48, 49, 50])
  5. Create an array of all the even integers from 10 to 50
> np.arange(10,51,2)
array([10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42,
       44, 46, 48, 50])
  6. Create a 3x3 matrix with values ranging from 0 to 8
> np.arange(9).reshape(3,3)
array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])
  7. Create a 3x3 identity matrix
> np.eye(3)
array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])
  8. Use NumPy to generate a random number between 0 and 1
> np.random.rand(1)
array([0.05671298])
  9. Use NumPy to generate an array of 25 random numbers sampled from a standard normal distribution
> np.random.randn(25)
array([-0.1730696 , -1.76165167, -0.08767307,  1.36687937,  1.1253141 ,
       -0.35899555,  1.2206081 , -1.33949555,  0.42837337, -0.12346315,
        1.41437719, -0.12405066,  2.00815709,  0.22988654,  0.60489373,
        1.62715982,  1.59456053,  0.23043417, -0.06491034, -0.96898025,
        0.59124281, -0.7827755 , -0.44423283, -0.34518616, -0.88180055])
  10. Create the following matrix:
array([[0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1 ],
       [0.11, 0.12, 0.13, 0.14, 0.15, 0.16, 0.17, 0.18, 0.19, 0.2 ],
       [0.21, 0.22, 0.23, 0.24, 0.25, 0.26, 0.27, 0.28, 0.29, 0.3 ],
       [0.31, 0.32, 0.33, 0.34, 0.35, 0.36, 0.37, 0.38, 0.39, 0.4 ],
       [0.41, 0.42, 0.43, 0.44, 0.45, 0.46, 0.47, 0.48, 0.49, 0.5 ],
       [0.51, 0.52, 0.53, 0.54, 0.55, 0.56, 0.57, 0.58, 0.59, 0.6 ],
       [0.61, 0.62, 0.63, 0.64, 0.65, 0.66, 0.67, 0.68, 0.69, 0.7 ],
       [0.71, 0.72, 0.73, 0.74, 0.75, 0.76, 0.77, 0.78, 0.79, 0.8 ],
       [0.81, 0.82, 0.83, 0.84, 0.85, 0.86, 0.87, 0.88, 0.89, 0.9 ],
       [0.91, 0.92, 0.93, 0.94, 0.95, 0.96, 0.97, 0.98, 0.99, 1.  ]])
> np.arange(1,101).reshape(10,10) / 100
array([[0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1 ],
       [0.11, 0.12, 0.13, 0.14, 0.15, 0.16, 0.17, 0.18, 0.19, 0.2 ],
       [0.21, 0.22, 0.23, 0.24, 0.25, 0.26, 0.27, 0.28, 0.29, 0.3 ],
       [0.31, 0.32, 0.33, 0.34, 0.35, 0.36, 0.37, 0.38, 0.39, 0.4 ],
       [0.41, 0.42, 0.43, 0.44, 0.45, 0.46, 0.47, 0.48, 0.49, 0.5 ],
       [0.51, 0.52, 0.53, 0.54, 0.55, 0.56, 0.57, 0.58, 0.59, 0.6 ],
       [0.61, 0.62, 0.63, 0.64, 0.65, 0.66, 0.67, 0.68, 0.69, 0.7 ],
       [0.71, 0.72, 0.73, 0.74, 0.75, 0.76, 0.77, 0.78, 0.79, 0.8 ],
       [0.81, 0.82, 0.83, 0.84, 0.85, 0.86, 0.87, 0.88, 0.89, 0.9 ],
       [0.91, 0.92, 0.93, 0.94, 0.95, 0.96, 0.97, 0.98, 0.99, 1.  ]])
> np.linspace(0.01,1,100).reshape(10,10)
array([[0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1 ],
       [0.11, 0.12, 0.13, 0.14, 0.15, 0.16, 0.17, 0.18, 0.19, 0.2 ],
       [0.21, 0.22, 0.23, 0.24, 0.25, 0.26, 0.27, 0.28, 0.29, 0.3 ],
       [0.31, 0.32, 0.33, 0.34, 0.35, 0.36, 0.37, 0.38, 0.39, 0.4 ],
       [0.41, 0.42, 0.43, 0.44, 0.45, 0.46, 0.47, 0.48, 0.49, 0.5 ],
       [0.51, 0.52, 0.53, 0.54, 0.55, 0.56, 0.57, 0.58, 0.59, 0.6 ],
       [0.61, 0.62, 0.63, 0.64, 0.65, 0.66, 0.67, 0.68, 0.69, 0.7 ],
       [0.71, 0.72, 0.73, 0.74, 0.75, 0.76, 0.77, 0.78, 0.79, 0.8 ],
       [0.81, 0.82, 0.83, 0.84, 0.85, 0.86, 0.87, 0.88, 0.89, 0.9 ],
       [0.91, 0.92, 0.93, 0.94, 0.95, 0.96, 0.97, 0.98, 0.99, 1.  ]])
  11. Create an array of 20 linearly spaced points between 0 and 1
> np.linspace(0,1,20)
array([0.        , 0.05263158, 0.10526316, 0.15789474, 0.21052632,
       0.26315789, 0.31578947, 0.36842105, 0.42105263, 0.47368421,
       0.52631579, 0.57894737, 0.63157895, 0.68421053, 0.73684211,
       0.78947368, 0.84210526, 0.89473684, 0.94736842, 1.        ])
  12. Create the following matrix, named mat:
array([[ 1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10],
       [11, 12, 13, 14, 15],
       [16, 17, 18, 19, 20],
       [21, 22, 23, 24, 25]])
> mat = np.arange(1,26).reshape(5,5)
+ mat
array([[ 1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10],
       [11, 12, 13, 14, 15],
       [16, 17, 18, 19, 20],
       [21, 22, 23, 24, 25]])
  13. Create the following matrix:
array([[12, 13, 14, 15],
       [17, 18, 19, 20],
       [22, 23, 24, 25]])
> mat[2:,1:]
array([[12, 13, 14, 15],
       [17, 18, 19, 20],
       [22, 23, 24, 25]])
  14. Extract 20
> mat[3,4]
20
  15. Create the following matrix:
array([[ 2],
       [ 7],
       [12]])
> # note that 1:2 makes it 2d
+ # just 1 makes it 1d
+ mat[:3,1:2]
array([[ 2],
       [ 7],
       [12]])
  16. Create the following matrix:
array([21, 22, 23, 24, 25])
> mat[4,:]
array([21, 22, 23, 24, 25])
  17. Create the following matrix:
array([[16, 17, 18, 19, 20],
       [21, 22, 23, 24, 25]])
> mat[3:5,:]
array([[16, 17, 18, 19, 20],
       [21, 22, 23, 24, 25]])
  18. Get the sum of all the values in mat
> mat.sum()
325
  19. Get the standard deviation of the values in mat
> mat.std()
7.211102550927978
  20. Get the sum of all the columns in mat
> mat.sum(axis=0)
array([55, 60, 65, 70, 75])

Pandas Series and DataFrames


Series

A Series is very similar to a NumPy array. What differentiates the NumPy array from a Series is that a Series can have axis labels, meaning it can be indexed by a label instead of just a number location. It also doesn’t need to hold numeric data; it can hold any arbitrary Python object.

Creating a Series
  • You can convert a list, NumPy array, or dictionary to a Series:
> import pandas as pd
+ labels = ['a','b','c']
+ my_list = [10,20,30]
+ arr = np.array([10,20,30])
+ d = {'a':10,'b':20,'c':30}
Using Lists
> pd.Series(data=my_list)
0    10
1    20
2    30
dtype: int64
> pd.Series(data=my_list,index=labels)
a    10
b    20
c    30
dtype: int64
> pd.Series(my_list,labels)
a    10
b    20
c    30
dtype: int64
NumPy Arrays
> pd.Series(arr)
0    10
1    20
2    30
dtype: int32
> pd.Series(arr,labels)
a    10
b    20
c    30
dtype: int32
Dictionaries
> pd.Series(d)
a    10
b    20
c    30
dtype: int64
Data in a Series

A pandas Series can hold a variety of object types:

> pd.Series(data=labels)
0    a
1    b
2    c
dtype: object
> # Even functions 
+ pd.Series([sum,print,len])
0      <built-in function sum>
1    <built-in function print>
2      <built-in function len>
dtype: object
Using an Index

The key to using a Series is understanding its index. Pandas makes use of these index names or numbers by allowing for fast lookups of information.

> ser1 = pd.Series([1,2,3,4],
+     index = ['USA', 'Germany','USSR', 'Japan'])  
+ ser1
USA        1
Germany    2
USSR       3
Japan      4
dtype: int64
> ser2 = pd.Series([1,2,5,4],
+     index = ['USA', 'Germany','Italy', 'Japan'])  
+ ser2
USA        1
Germany    2
Italy      5
Japan      4
dtype: int64
> ser1['USA']
1

Operations are then also done based off of index:

> # Null where can't find match
+ # integers converted to float
+ ser1 + ser2 
Germany    4.0
Italy      NaN
Japan      8.0
USA        2.0
USSR       NaN
dtype: float64
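
If you would rather keep unmatched labels than get NaN, the .add method accepts a fill_value; a small sketch:

> # Treat missing labels as 0 instead of NaN
+ ser1.add(ser2,fill_value=0)
Germany    4.0
Italy      5.0
Japan      8.0
USA        2.0
USSR       3.0
dtype: float64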

DataFrames

DataFrames are inspired by the R programming language. A DataFrame is essentially a bunch of Series that share an index.

> from numpy.random import randn
+ np.random.seed(101) 
+ df = pd.DataFrame(randn(5,4),
+     index='A B C D E'.split(),
+     columns='W X Y Z'.split())
+ df
          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965
C -2.018168  0.740122  0.528813 -0.589001
D  0.188695 -0.758872 -0.933237  0.955057
E  0.190794  1.978757  2.605967  0.683509
Selection and Indexing
> df['W']
A    2.706850
B    0.651118
C   -2.018168
D    0.188695
E    0.190794
Name: W, dtype: float64

DataFrame Columns are just Series

> type(df['W'])
<class 'pandas.core.series.Series'>
> type(df)
<class 'pandas.core.frame.DataFrame'>
> # Pass a list of column names
+ df[['W','Z']]
          W         Z
A  2.706850  0.503826
B  0.651118  0.605965
C -2.018168 -0.589001
D  0.188695  0.955057
E  0.190794  0.683509
Creating a New Column:
> df['new'] = df['W'] + df['Y']
+ df
          W         X         Y         Z       new
A  2.706850  0.628133  0.907969  0.503826  3.614819
B  0.651118 -0.319318 -0.848077  0.605965 -0.196959
C -2.018168  0.740122  0.528813 -0.589001 -1.489355
D  0.188695 -0.758872 -0.933237  0.955057 -0.744542
E  0.190794  1.978757  2.605967  0.683509  2.796762
Removing Columns
> df.drop('new',axis=1)
          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965
C -2.018168  0.740122  0.528813 -0.589001
D  0.188695 -0.758872 -0.933237  0.955057
E  0.190794  1.978757  2.605967  0.683509
> # Not inplace unless specified!
+ df
          W         X         Y         Z       new
A  2.706850  0.628133  0.907969  0.503826  3.614819
B  0.651118 -0.319318 -0.848077  0.605965 -0.196959
C -2.018168  0.740122  0.528813 -0.589001 -1.489355
D  0.188695 -0.758872 -0.933237  0.955057 -0.744542
E  0.190794  1.978757  2.605967  0.683509  2.796762
> df.drop('new',axis=1,inplace=True)
+ df
          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965
C -2.018168  0.740122  0.528813 -0.589001
D  0.188695 -0.758872 -0.933237  0.955057
E  0.190794  1.978757  2.605967  0.683509

You can also drop rows this way:

> df.drop('E',axis=0)
          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965
C -2.018168  0.740122  0.528813 -0.589001
D  0.188695 -0.758872 -0.933237  0.955057

The shape is a tuple, with rows at index 0 and columns at index 1. You can think of the axis as the tuple index.

> df.shape
(5, 4)
Selecting Rows
> df.loc['A']
W    2.706850
X    0.628133
Y    0.907969
Z    0.503826
Name: A, dtype: float64

Or select based off of position instead of label.

> df.iloc[2]
W   -2.018168
X    0.740122
Y    0.528813
Z   -0.589001
Name: C, dtype: float64
Selecting Subset of Rows and Columns
> df.loc['B','Y']
-0.8480769834036315
> df.loc[['A','B'],['W','Y']]
          W         Y
A  2.706850  0.907969
B  0.651118 -0.848077
> df.iloc[[0,1],[0,2]]
          W         Y
A  2.706850  0.907969
B  0.651118 -0.848077
Conditional Selection

An important feature of pandas is conditional selection using bracket notation, very similar to NumPy:

> df
          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965
C -2.018168  0.740122  0.528813 -0.589001
D  0.188695 -0.758872 -0.933237  0.955057
E  0.190794  1.978757  2.605967  0.683509
> df>0
       W      X      Y      Z
A   True   True   True   True
B   True  False  False   True
C  False   True   True  False
D   True  False  False   True
E   True   True   True   True
> df[df>0]
          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118       NaN       NaN  0.605965
C       NaN  0.740122  0.528813       NaN
D  0.188695       NaN       NaN  0.955057
E  0.190794  1.978757  2.605967  0.683509
> df['W']>0
A     True
B     True
C    False
D     True
E     True
Name: W, dtype: bool
> df[df['W']>0]
          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965
D  0.188695 -0.758872 -0.933237  0.955057
E  0.190794  1.978757  2.605967  0.683509
> df[df['Z']<0]
          W         X         Y         Z
C -2.018168  0.740122  0.528813 -0.589001
> df[df['W']>0]['Y']
A    0.907969
B   -0.848077
D   -0.933237
E    2.605967
Name: Y, dtype: float64
> df[df['W']>0][['Y','X']]
          Y         X
A  0.907969  0.628133
B -0.848077 -0.319318
D -0.933237 -0.758872
E  2.605967  1.978757

For two conditions you can use | (OR) and & (AND) with parentheses:

> df[(df['W']>0) & (df['Y'] > 1)]
          W         X         Y         Z
E  0.190794  1.978757  2.605967  0.683509
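
The same pattern works with | for OR (rows where either condition holds):

> df[(df['W']>0) | (df['Y'] > 1)]
          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965
D  0.188695 -0.758872 -0.933237  0.955057
E  0.190794  1.978757  2.605967  0.683509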
More Index Details
  • Resetting the index
  • Setting it to something else
  • Index hierarchy
> df
          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965
C -2.018168  0.740122  0.528813 -0.589001
D  0.188695 -0.758872 -0.933237  0.955057
E  0.190794  1.978757  2.605967  0.683509
> # Reset to default 0,1...n index
+ # old index will become column
+ # not inplace
+ df.reset_index()
  index         W         X         Y         Z
0     A  2.706850  0.628133  0.907969  0.503826
1     B  0.651118 -0.319318 -0.848077  0.605965
2     C -2.018168  0.740122  0.528813 -0.589001
3     D  0.188695 -0.758872 -0.933237  0.955057
4     E  0.190794  1.978757  2.605967  0.683509
> df
          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965
C -2.018168  0.740122  0.528813 -0.589001
D  0.188695 -0.758872 -0.933237  0.955057
E  0.190794  1.978757  2.605967  0.683509
> newind = 'CA NY WY OR CO'.split()
+ newind
['CA', 'NY', 'WY', 'OR', 'CO']
> df['States'] = newind
+ df
          W         X         Y         Z States
A  2.706850  0.628133  0.907969  0.503826     CA
B  0.651118 -0.319318 -0.848077  0.605965     NY
C -2.018168  0.740122  0.528813 -0.589001     WY
D  0.188695 -0.758872 -0.933237  0.955057     OR
E  0.190794  1.978757  2.605967  0.683509     CO
> df.set_index('States',inplace=True)
+ df
               W         X         Y         Z
States                                        
CA      2.706850  0.628133  0.907969  0.503826
NY      0.651118 -0.319318 -0.848077  0.605965
WY     -2.018168  0.740122  0.528813 -0.589001
OR      0.188695 -0.758872 -0.933237  0.955057
CO      0.190794  1.978757  2.605967  0.683509
Multi-Index and Index Hierarchy
> # Index Levels
+ outside = ['G1','G1','G1','G2','G2','G2']
+ inside = [1,2,3,1,2,3]
+ hier_index = list(zip(outside,inside))
+ hier_index = pd.MultiIndex.from_tuples(hier_index)
> hier_index #list of tuple pairs
MultiIndex([('G1', 1),
            ('G1', 2),
            ('G1', 3),
            ('G2', 1),
            ('G2', 2),
            ('G2', 3)],
           )
> df = pd.DataFrame(np.random.randn(6,2),
+     index=hier_index,columns=['A','B'])
+ df
             A         B
G1 1  0.302665  1.693723
   2 -1.706086 -1.159119
   3 -0.134841  0.390528
G2 1  0.166905  0.184502
   2  0.807706  0.072960
   3  0.638787  0.329646

For index hierarchy we use df.loc[]; if this were on the columns axis, you would just use normal bracket notation df[]. Calling one level of the index returns the sub-DataFrame:

> df.loc['G1']
          A         B
1  0.302665  1.693723
2 -1.706086 -1.159119
3 -0.134841  0.390528
> df.loc['G1'].loc[1]
A    0.302665
B    1.693723
Name: 1, dtype: float64
> # indices aren't labeled
+ df.index.names 
FrozenList([None, None])
> df.index.names = ['Group','Num']
+ df
                  A         B
Group Num                    
G1    1    0.302665  1.693723
      2   -1.706086 -1.159119
      3   -0.134841  0.390528
G2    1    0.166905  0.184502
      2    0.807706  0.072960
      3    0.638787  0.329646
> df.loc['G2'].loc[2]['B']
0.07295967531703869

Use .xs to grab a cross section; it makes it easy to select data at a particular level of a MultiIndex.

> df.xs('G1') #cross section
            A         B
Num                    
1    0.302665  1.693723
2   -1.706086 -1.159119
3   -0.134841  0.390528
> df.xs(['G1',1])
A    0.302665
B    1.693723
Name: (G1, 1), dtype: float64
> df.xs(['G1',1])[1]
1.693722925204035
> df.xs(1,level='Num')
              A         B
Group                    
G1     0.302665  1.693723
G2     0.166905  0.184502

Missing Data

> df = pd.DataFrame({'A':[1,2,np.nan],
+                   'B':[5,np.nan,np.nan],
+                   'C':[1,2,3]})
+ df
     A    B  C
0  1.0  5.0  1
1  2.0  NaN  2
2  NaN  NaN  3
> #rows with missing values dropped
+ df.dropna() 
     A    B  C
0  1.0  5.0  1
> df.dropna(axis=1)
   C
0  1
1  2
2  3
> df['A'].dropna()
0    1.0
1    2.0
Name: A, dtype: float64
> df.dropna(subset=['A', 'C'])
     A    B  C
0  1.0  5.0  1
1  2.0  NaN  2

With thresh you can require a minimum number of non-NA values for a row (or column) to be kept.

> df.dropna(thresh=2)
     A    B  C
0  1.0  5.0  1
1  2.0  NaN  2
> df.fillna(value='FILL VALUE')
            A           B  C
0           1           5  1
1           2  FILL VALUE  2
2  FILL VALUE  FILL VALUE  3

Fill missing values with the mean:

> df['A'].fillna(value=df['A'].mean())
0    1.0
1    2.0
2    1.5
Name: A, dtype: float64
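
The same idea extends to the whole DataFrame, since fillna accepts a Series of per-column fill values such as the column means; a small sketch:

> df.fillna(value=df.mean())
     A    B  C
0  1.0  5.0  1
1  2.0  5.0  2
2  1.5  5.0  3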

Group, Merge, Join


Groupby

The groupby method allows you to group rows of data together and call aggregate functions.

> import pandas as pd
+ # Create dataframe
+ data = {'Company':['GOOG','GOOG','MSFT','MSFT','FB','FB'],
+   'Person':['Sam','Charlie','Amy','Vanessa','Carl','Sarah'],
+   'Sales':[200,120,340,124,243,350]}
> df = pd.DataFrame(data)
+ df
  Company   Person  Sales
0    GOOG      Sam    200
1    GOOG  Charlie    120
2    MSFT      Amy    340
3    MSFT  Vanessa    124
4      FB     Carl    243
5      FB    Sarah    350

Now you can use the .groupby() method to group rows together based off of a column name. For instance grouping based off of Company will create a DataFrameGroupBy object:

> df.groupby('Company')
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000000049E0D508>

You can save this object as a new variable:

> by_comp = df.groupby("Company")

And then call aggregate methods off the object:

> by_comp.mean()
         Sales
Company       
FB       296.5
GOOG     160.0
MSFT     232.0
> df.groupby('Company').mean()
         Sales
Company       
FB       296.5
GOOG     160.0
MSFT     232.0
> by_comp.std()
              Sales
Company            
FB        75.660426
GOOG      56.568542
MSFT     152.735065
> by_comp.min()
          Person  Sales
Company                
FB          Carl    243
GOOG     Charlie    120
MSFT         Amy    124
> by_comp.max()
          Person  Sales
Company                
FB         Sarah    350
GOOG         Sam    200
MSFT     Vanessa    340
> by_comp.count()
         Person  Sales
Company               
FB            2      2
GOOG          2      2
MSFT          2      2
> by_comp.describe()
        Sales                                                        
        count   mean         std    min     25%    50%     75%    max
Company                                                              
FB        2.0  296.5   75.660426  243.0  269.75  296.5  323.25  350.0
GOOG      2.0  160.0   56.568542  120.0  140.00  160.0  180.00  200.0
MSFT      2.0  232.0  152.735065  124.0  178.00  232.0  286.00  340.0
> by_comp.describe().transpose()
Company              FB        GOOG        MSFT
Sales count    2.000000    2.000000    2.000000
      mean   296.500000  160.000000  232.000000
      std     75.660426   56.568542  152.735065
      min    243.000000  120.000000  124.000000
      25%    269.750000  140.000000  178.000000
      50%    296.500000  160.000000  232.000000
      75%    323.250000  180.000000  286.000000
      max    350.000000  200.000000  340.000000
> by_comp.describe().transpose()['GOOG']
Sales  count      2.000000
       mean     160.000000
       std       56.568542
       min      120.000000
       25%      140.000000
       50%      160.000000
       75%      180.000000
       max      200.000000
Name: GOOG, dtype: float64
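
You can also request several aggregates at once with .agg; a sketch (values follow from the describe output above, column layout approximate):

> by_comp['Sales'].agg(['mean','min','max'])
          mean  min  max
Company                 
FB       296.5  243  350
GOOG     160.0  120  200
MSFT     232.0  124  340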

Example DataFrames

> df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
+                         'B': ['B0', 'B1', 'B2', 'B3'],
+                         'C': ['C0', 'C1', 'C2', 'C3'],
+                         'D': ['D0', 'D1', 'D2', 'D3']},
+                         index=[0, 1, 2, 3])
> df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
+                         'B': ['B4', 'B5', 'B6', 'B7'],
+                         'C': ['C4', 'C5', 'C6', 'C7'],
+                         'D': ['D4', 'D5', 'D6', 'D7']},
+                          index=[4, 5, 6, 7]) 
> df3 = pd.DataFrame({'A': ['A8', 'A9', 'A10', 'A11'],
+                         'B': ['B8', 'B9', 'B10', 'B11'],
+                         'C': ['C8', 'C9', 'C10', 'C11'],
+                         'D': ['D8', 'D9', 'D10', 'D11']},
+                         index=[8, 9, 10, 11])
> df1
    A   B   C   D
0  A0  B0  C0  D0
1  A1  B1  C1  D1
2  A2  B2  C2  D2
3  A3  B3  C3  D3
> df2
    A   B   C   D
4  A4  B4  C4  D4
5  A5  B5  C5  D5
6  A6  B6  C6  D6
7  A7  B7  C7  D7
> df3
      A    B    C    D
8    A8   B8   C8   D8
9    A9   B9   C9   D9
10  A10  B10  C10  D10
11  A11  B11  C11  D11

Concatenation

Concatenation glues together DataFrames. The dimensions should match along the axis on which you are concatenating. You can use pd.concat and pass in a list of DataFrames to concatenate together:

> pd.concat([df1,df2,df3])
      A    B    C    D
0    A0   B0   C0   D0
1    A1   B1   C1   D1
2    A2   B2   C2   D2
3    A3   B3   C3   D3
4    A4   B4   C4   D4
5    A5   B5   C5   D5
6    A6   B6   C6   D6
7    A7   B7   C7   D7
8    A8   B8   C8   D8
9    A9   B9   C9   D9
10  A10  B10  C10  D10
11  A11  B11  C11  D11

There will be missing values where indices don’t match.

> pd.concat([df1,df2,df3],axis=1)
      A    B    C    D    A    B    C    D    A    B    C    D
0    A0   B0   C0   D0  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
1    A1   B1   C1   D1  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
2    A2   B2   C2   D2  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
3    A3   B3   C3   D3  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
4   NaN  NaN  NaN  NaN   A4   B4   C4   D4  NaN  NaN  NaN  NaN
5   NaN  NaN  NaN  NaN   A5   B5   C5   D5  NaN  NaN  NaN  NaN
6   NaN  NaN  NaN  NaN   A6   B6   C6   D6  NaN  NaN  NaN  NaN
7   NaN  NaN  NaN  NaN   A7   B7   C7   D7  NaN  NaN  NaN  NaN
8   NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN   A8   B8   C8   D8
9   NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN   A9   B9   C9   D9
10  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  A10  B10  C10  D10
11  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  A11  B11  C11  D11

Example DataFrames

> left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
+                      'A': ['A0', 'A1', 'A2', 'A3'],
+                      'B': ['B0', 'B1', 'B2', 'B3']})
+    
+ right = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
+                           'C': ['C0', 'C1', 'C2', 'C3'],
+                           'D': ['D0', 'D1', 'D2', 'D3']})    
> left
  key   A   B
0  K0  A0  B0
1  K1  A1  B1
2  K2  A2  B2
3  K3  A3  B3
> right
  key   C   D
0  K0  C0  D0
1  K1  C1  D1
2  K2  C2  D2
3  K3  C3  D3

Merging

The merge function allows you to merge DataFrames together using logic similar to what’s found in SQL.

  • Inner Join - Matches observations where the keys are equal. Unmatched keys are not included in the result.
  • Left Join - Keeps all keys from the left table.
  • Right Join - Keeps all keys from the right table.
  • Outer Join - Keeps all keys from both the left and right tables. Known as Full Join in R.
> #default is inner join
+ pd.merge(left,right,how='inner',on='key')
  key   A   B   C   D
0  K0  A0  B0  C0  D0
1  K1  A1  B1  C1  D1
2  K2  A2  B2  C2  D2
3  K3  A3  B3  C3  D3

A more complicated example, with two keys:

> left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
+                      'key2': ['K0', 'K1', 'K0', 'K1'],
+                         'A': ['A0', 'A1', 'A2', 'A3'],
+                         'B': ['B0', 'B1', 'B2', 'B3']})
+     
+ right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
+                       'key2': ['K0', 'K0', 'K0', 'K0'],
+                          'C': ['C0', 'C1', 'C2', 'C3'],
+                          'D': ['D0', 'D1', 'D2', 'D3']})
+                          
+ left
  key1 key2   A   B
0   K0   K0  A0  B0
1   K0   K1  A1  B1
2   K1   K0  A2  B2
3   K2   K1  A3  B3
> right
  key1 key2   C   D
0   K0   K0  C0  D0
1   K1   K0  C1  D1
2   K1   K0  C2  D2
3   K2   K0  C3  D3
> pd.merge(left, right, on=['key1', 'key2'])
  key1 key2   A   B   C   D
0   K0   K0  A0  B0  C0  D0
1   K1   K0  A2  B2  C1  D1
2   K1   K0  A2  B2  C2  D2
> pd.merge(left, right, how='outer', on=['key1', 'key2'])
  key1 key2    A    B    C    D
0   K0   K0   A0   B0   C0   D0
1   K0   K1   A1   B1  NaN  NaN
2   K1   K0   A2   B2   C1   D1
3   K1   K0   A2   B2   C2   D2
4   K2   K1   A3   B3  NaN  NaN
5   K2   K0  NaN  NaN   C3   D3
> pd.merge(left, right, how='right', on=['key1', 'key2'])
  key1 key2    A    B   C   D
0   K0   K0   A0   B0  C0  D0
1   K1   K0   A2   B2  C1  D1
2   K1   K0   A2   B2  C2  D2
3   K2   K0  NaN  NaN  C3  D3
> pd.merge(left, right, how='left', on=['key1', 'key2'])
  key1 key2   A   B    C    D
0   K0   K0  A0  B0   C0   D0
1   K0   K1  A1  B1  NaN  NaN
2   K1   K0  A2  B2   C1   D1
3   K1   K0  A2  B2   C2   D2
4   K2   K1  A3  B3  NaN  NaN

Joining

Join combines the columns of two DataFrames, using their indices as the keys:

> left = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
+                      'B': ['B0', 'B1', 'B2']},
+                       index=['K0', 'K1', 'K2']) 
+ 
+ right = pd.DataFrame({'C': ['C0', 'C2', 'C3'],
+                     'D': ['D0', 'D2', 'D3']},
+                       index=['K0', 'K2', 'K3'])
> left
     A   B
K0  A0  B0
K1  A1  B1
K2  A2  B2
> right
     C   D
K0  C0  D0
K2  C2  D2
K3  C3  D3
> left.join(right) #left join default
     A   B    C    D
K0  A0  B0   C0   D0
K1  A1  B1  NaN  NaN
K2  A2  B2   C2   D2
> left.join(right, how='outer')
      A    B    C    D
K0   A0   B0   C0   D0
K1   A1   B1  NaN  NaN
K2   A2   B2   C2   D2
K3  NaN  NaN   C3   D3

Pandas Operations


> import pandas as pd
+ df = pd.DataFrame({'col1':[1,2,3,4],
+                 'col2':[444,555,666,444],
+                 'col3':['abc','def','ghi','xyz']})
+ df.head()
   col1  col2 col3
0     1   444  abc
1     2   555  def
2     3   666  ghi
3     4   444  xyz
Info on Unique Values
> df['col2'].unique()
array([444, 555, 666], dtype=int64)
> len(df['col2'].unique())
3
> df['col2'].nunique()
3
> df['col2'].value_counts()
444    2
555    1
666    1
Name: col2, dtype: int64
Selecting Data
> #Select from DataFrame using 
+ #criteria from multiple columns
+ newdf = df[(df['col1']>2) & (df['col2']==444)]
+ newdf
   col1  col2 col3
3     4   444  xyz
Applying Functions
> def times2(x):
+     return x*2
> df['col1'].apply(times2)
0    2
1    4
2    6
3    8
Name: col1, dtype: int64
> df['col2'].apply(lambda x:x*2)
0     888
1    1110
2    1332
3     888
Name: col2, dtype: int64
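
For simple arithmetic like the lambda above, a plain vectorized expression gives the same result and is usually faster:

> df['col2'] * 2
0     888
1    1110
2    1332
3     888
Name: col2, dtype: int64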
> df['col3'].apply(len)
0    3
1    3
2    3
3    3
Name: col3, dtype: int64
> df['col1'].sum()
10
> df.drop('col1', axis=1)
   col2 col3
0   444  abc
1   555  def
2   666  ghi
3   444  xyz
Permanently Removing a Column
> del df['col1']
+ df
   col2 col3
0   444  abc
1   555  def
2   666  ghi
3   444  xyz
Get Column and Index Names
> df.columns
Index(['col2', 'col3'], dtype='object')
> df.index
RangeIndex(start=0, stop=4, step=1)
Sorting and Ordering a DataFrame
> df
   col2 col3
0   444  abc
1   555  def
2   666  ghi
3   444  xyz
> #inplace=False by default
+ df.sort_values(by='col2') 
   col2 col3
0   444  abc
3   444  xyz
1   555  def
2   666  ghi
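
Pass ascending=False for a descending sort; a quick sketch on col3, which has no ties:

> df.sort_values(by='col3',ascending=False)
   col2 col3
3   444  xyz
2   666  ghi
1   555  def
0   444  abc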
Finding or Checking for Null Values
> df.isnull()
    col2   col3
0  False  False
1  False  False
2  False  False
3  False  False
> # Drop rows with NaN Values
+ df.dropna()
   col2 col3
0   444  abc
1   555  def
2   666  ghi
3   444  xyz
Filling in NaN values with something else
> df = pd.DataFrame({'col1':[1,2,3,np.nan],
+                    'col2':[np.nan,555,666,444],
+                    'col3':['abc','def','ghi','xyz']})
+ df.head()
   col1   col2 col3
0   1.0    NaN  abc
1   2.0  555.0  def
2   3.0  666.0  ghi
3   NaN  444.0  xyz
> df.fillna('FILL')
   col1  col2 col3
0     1  FILL  abc
1     2   555  def
2     3   666  ghi
3  FILL   444  xyz
Pivot Tables

The pivot_table method builds a spreadsheet-style pivot table; duplicate index/column combinations are aggregated with the mean by default.

> data = {'A':['foo','foo','foo','bar','bar','bar'],
+      'B':['one','one','two','two','one','one'],
+        'C':['x','y','x','y','x','y'],
+        'D':[1,3,2,5,4,1]}
+ 
+ df = pd.DataFrame(data)
+ df
     A    B  C  D
0  foo  one  x  1
1  foo  one  y  3
2  foo  two  x  2
3  bar  two  y  5
4  bar  one  x  4
5  bar  one  y  1
> df.pivot_table(values='D',
+             index=['A', 'B'],
+             columns=['C'])
C          x    y
A   B            
bar one  4.0  1.0
    two  NaN  5.0
foo one  1.0  3.0
    two  2.0  NaN

Data Input and Output


You may need to install some packages:

> conda install sqlalchemy
> conda install lxml
> conda install html5lib
> conda install BeautifulSoup4
> conda install xlrd
CSV Input
> df = pd.read_csv('C:\\Users\\pbj20\\Documents\\example')
> df
    a   b   c   d
0   0   1   2   3
1   4   5   6   7
2   8   9  10  11
3  12  13  14  15
CSV Output
> df.to_csv('C:\\Users\\pbj20\\Documents\\My_output',index=False)
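
index=False keeps the row index out of the file (compare the Unnamed: 0 column in the Excel example below, where the index was written). Reading the file back shows the clean round trip; a sketch:

> pd.read_csv('C:\\Users\\pbj20\\Documents\\My_output')
    a   b   c   d
0   0   1   2   3
1   4   5   6   7
2   8   9  10  11
3  12  13  14  15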
Excel Input

Pandas can read and write Excel files, but this only imports the data, not formulas or images. Having images or macros may cause the read_excel method to crash.

> pd.read_excel('C:\\Users\\pbj20\\Documents\\Excel_Sample.xlsx',
+     sheet_name='Sheet1')
   Unnamed: 0   a   b   c   d
0           0   0   1   2   3
1           1   4   5   6   7
2           2   8   9  10  11
3           3  12  13  14  15
Excel Output
> df.to_excel('C:\\Users\\pbj20\\Documents\\Excel_Sample2.xlsx',
+     sheet_name='Sheet1')
HTML

The pandas read_html function will read tables off of a webpage and return a list of DataFrame objects:

> df = pd.read_html('http://www.fdic.gov/bank/individual/failed/banklist.html')
> type(df)
<class 'list'>
> df[0]
                             Bank Name  ...       Closing Date
0                 The First State Bank  ...      April 3, 2020
1                   Ericson State Bank  ...  February 14, 2020
2     City National Bank of New Jersey  ...   November 1, 2019
3                        Resolute Bank  ...   October 25, 2019
4                Louisa Community Bank  ...   October 25, 2019
..                                 ...  ...                ...
556                 Superior Bank, FSB  ...      July 27, 2001
557                Malta National Bank  ...        May 3, 2001
558    First Alliance Bank & Trust Co.  ...   February 2, 2001
559  National State Bank of Metropolis  ...  December 14, 2000
560                   Bank of Honolulu  ...   October 13, 2000

[561 rows x 6 columns]

Python Exercises 1


These use the SF Salaries Dataset from Kaggle:

https://www.kaggle.com/kaggle/sf-salaries

  1. Read Salaries.csv as a DataFrame called sal.
> sal = pd.read_csv('C:\\Users\\pbj20\\Documents\\Salaries.csv')
  2. Check the head of the DataFrame
> sal.head()
   Id       EmployeeName  ...         Agency  Status
0   1     NATHANIEL FORD  ...  San Francisco     NaN
1   2       GARY JIMENEZ  ...  San Francisco     NaN
2   3     ALBERT PARDINI  ...  San Francisco     NaN
3   4  CHRISTOPHER CHONG  ...  San Francisco     NaN
4   5    PATRICK GARDNER  ...  San Francisco     NaN

[5 rows x 13 columns]
  3. Use the .info() method to find out how many entries there are
> sal.info() # 148654 Entries
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 148654 entries, 0 to 148653
Data columns (total 13 columns):
Id                  148654 non-null int64
EmployeeName        148654 non-null object
JobTitle            148654 non-null object
BasePay             148045 non-null float64
OvertimePay         148650 non-null float64
OtherPay            148650 non-null float64
Benefits            112491 non-null float64
TotalPay            148654 non-null float64
TotalPayBenefits    148654 non-null float64
Year                148654 non-null int64
Notes               0 non-null float64
Agency              148654 non-null object
Status              0 non-null float64
dtypes: float64(8), int64(2), object(3)
memory usage: 14.7+ MB
  4. What is the average BasePay?
> sal['BasePay'].mean()
66325.44884050643
  5. What is the highest amount of OvertimePay in the dataset?
> sal['OvertimePay'].max()
245131.88
  6. What is the job title of JOSEPH DRISCOLL? Note: use all caps, otherwise you may get an answer that doesn’t match up (there is also a lowercase Joseph Driscoll)
> sal[sal['EmployeeName']=='JOSEPH DRISCOLL']['JobTitle']
24    CAPTAIN, FIRE SUPPRESSION
Name: JobTitle, dtype: object
  7. How much does JOSEPH DRISCOLL make (including benefits)?
> sal[sal['EmployeeName']==
+     'JOSEPH DRISCOLL']['TotalPayBenefits']
24    270324.91
Name: TotalPayBenefits, dtype: float64
  8. What is the name of the highest paid person (including benefits)?
> sal[sal['TotalPayBenefits']== 
+     sal['TotalPayBenefits'].max()]['EmployeeName']
0    NATHANIEL FORD
Name: EmployeeName, dtype: object
> sal.loc[sal['TotalPayBenefits'].
+         idxmax()]['EmployeeName']
'NATHANIEL FORD'
  9. What is the name of the lowest paid person (including benefits)? Do you notice something strange about how much he or she is paid?
> sal[sal['TotalPayBenefits']==
+     sal['TotalPayBenefits'].min()]['EmployeeName']
148653    Joe Lopez
Name: EmployeeName, dtype: object
> sal.loc[sal['TotalPayBenefits'].
+       idxmin()]
Id                                      148654
EmployeeName                         Joe Lopez
JobTitle            Counselor, Log Cabin Ranch
BasePay                                      0
OvertimePay                                  0
OtherPay                               -618.13
Benefits                                     0
TotalPay                               -618.13
TotalPayBenefits                       -618.13
Year                                      2014
Notes                                      NaN
Agency                           San Francisco
Status                                     NaN
Name: 148653, dtype: object
  10. What was the average (mean) BasePay of all employees per year (2011-2014)?
> sal.groupby('Year').mean()['BasePay']
Year
2011    63595.956517
2012    65436.406857
2013    69630.030216
2014    66564.421924
Name: BasePay, dtype: float64
  11. How many unique job titles are there?
> sal['JobTitle'].nunique()
2159
  12. What are the top 5 most common jobs?
> sal['JobTitle'].value_counts().head(5)
Transit Operator                7036
Special Nurse                   4389
Registered Nurse                3736
Public Svc Aide-Public Works    2518
Police Officer 3                2421
Name: JobTitle, dtype: int64
  13. How many Job Titles were represented by only one person in 2013? (i.e. Job Titles with only one occurrence in 2013)
> sum(sal[sal['Year']==2013]['JobTitle'].
+     value_counts() == 1)
202
  14. How many people have the word Chief in their job title?
> def chief_string(title):
+     return 'chief' in title.lower()
+ 
+ sum(sal['JobTitle'].apply(chief_string))
627
  15. Is there a correlation between length of the Job Title string and Salary?
> # No Correlation
+ sal['title_len'] = sal['JobTitle'].apply(len)
+ sal[['title_len','TotalPayBenefits']].corr()
                  title_len  TotalPayBenefits
title_len          1.000000         -0.036878
TotalPayBenefits  -0.036878          1.000000

Python Exercises 2


These exercises involve fake data about some purchases made through Amazon.

  1. Read in the Ecommerce Purchases csv file and set it to a DataFrame called ecom.
> ecom = pd.read_csv('C:\\Users\\Documents\\Ecommerce Purchases')
  2. Check the head of the DataFrame.
> ecom.head()
                                             Address  ... Purchase Price
0  16629 Pace Camp Apt. 448\nAlexisborough, NE 77...  ...          98.14
1  9374 Jasmine Spurs Suite 508\nSouth John, TN 8...  ...          70.73
2                   Unit 0065 Box 5052\nDPO AP 27450  ...           0.95
3              7780 Julia Fords\nNew Stacy, WA 45798  ...          78.04
4  23012 Munoz Drive Suite 337\nNew Cynthia, TX 5...  ...          77.82

[5 rows x 14 columns]
  3. How many rows and columns are there?
> ecom.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
Address             10000 non-null object
Lot                 10000 non-null object
AM or PM            10000 non-null object
Browser Info        10000 non-null object
Company             10000 non-null object
Credit Card         10000 non-null int64
CC Exp Date         10000 non-null object
CC Security Code    10000 non-null int64
CC Provider         10000 non-null object
Email               10000 non-null object
Job                 10000 non-null object
IP Address          10000 non-null object
Language            10000 non-null object
Purchase Price      10000 non-null float64
dtypes: float64(1), int64(2), object(11)
memory usage: 1.1+ MB
  4. What is the average Purchase Price?
> ecom['Purchase Price'].mean()
50.34730200000025
  5. What were the highest and lowest purchase prices?
> ecom['Purchase Price'].max()
99.99
> ecom['Purchase Price'].min()
0.0
  6. How many people have English ‘en’ as their Language of choice on the website?
> ecom[ecom['Language']=='en'].count()
Address             1098
Lot                 1098
AM or PM            1098
Browser Info        1098
Company             1098
Credit Card         1098
CC Exp Date         1098
CC Security Code    1098
CC Provider         1098
Email               1098
Job                 1098
IP Address          1098
Language            1098
Purchase Price      1098
dtype: int64
  7. How many people have the job title of “Lawyer”?
> ecom[ecom['Job'] == 'Lawyer'].info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 30 entries, 470 to 9979
Data columns (total 14 columns):
Address             30 non-null object
Lot                 30 non-null object
AM or PM            30 non-null object
Browser Info        30 non-null object
Company             30 non-null object
Credit Card         30 non-null int64
CC Exp Date         30 non-null object
CC Security Code    30 non-null int64
CC Provider         30 non-null object
Email               30 non-null object
Job                 30 non-null object
IP Address          30 non-null object
Language            30 non-null object
Purchase Price      30 non-null float64
dtypes: float64(1), int64(2), object(11)
memory usage: 3.5+ KB
  8. How many people made the purchase during the AM and how many people made the purchase during PM?
> ecom['AM or PM'].value_counts()
PM    5068
AM    4932
Name: AM or PM, dtype: int64
  9. What are the 5 most common Job Titles?
> ecom['Job'].value_counts().head(5)
Interior and spatial designer        31
Lawyer                               30
Social researcher                    28
Research officer, political party    27
Designer, jewellery                  27
Name: Job, dtype: int64
  10. Someone made a purchase that came from Lot “90 WT”; what was the Purchase Price for this transaction?
> ecom[ecom['Lot']=='90 WT']['Purchase Price']
513    75.1
Name: Purchase Price, dtype: float64
  11. What is the email of the person with the following Credit Card Number: 4926535242672853
> ecom[ecom["Credit Card"] == 4926535242672853]['Email'] 
1234    bondellen@williams-garza.com
Name: Email, dtype: object
  12. How many people have American Express as their Credit Card Provider and made a purchase above $95?
> ecom[(ecom['CC Provider']=='American Express') &
+      (ecom['Purchase Price']>95)].count()
Address             39
Lot                 39
AM or PM            39
Browser Info        39
Company             39
Credit Card         39
CC Exp Date         39
CC Security Code    39
CC Provider         39
Email               39
Job                 39
IP Address          39
Language            39
Purchase Price      39
dtype: int64
  13. Hard: How many people have a credit card that expires in 2025?
> sum(ecom['CC Exp Date'].apply(lambda x:
+                     x[3:]) == '25')
1033
  14. Hard: What are the top 5 most popular email providers/hosts (e.g. gmail.com, yahoo.com, etc.)?
> ecom['Email'].apply(lambda x: 
+     x.split('@')[1]).value_counts().head(5)
hotmail.com     1638
yahoo.com       1616
gmail.com       1605
smith.com         42
williams.com      37
Name: Email, dtype: int64