NumPy Arrays


NumPy arrays can be vectors or matrices. Vectors are 1D arrays and matrices are 2D (though a matrix can still have only one row or one column).

Creating NumPy Arrays

From a Python List
> import numpy as np
+ my_list = [1,2,3]
+ np.array(my_list)
array([1, 2, 3])
> my_matrix = [[1,2,3],[4,5,6],[7,8,9]]
+ my_matrix
[[1, 2, 3], [4, 5, 6], [7, 8, 9]]
> # Cast to 2d array
+ np.array(my_matrix)
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])
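
The shape attribute (covered in more detail below) makes the vector-vs-matrix distinction concrete; a quick sketch:

> # 1D vector vs. 2D one-column matrix
+ np.array([1,2,3]).shape
(3,)
> np.array([[1],[2],[3]]).shape
(3, 1)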

Built-in Methods

arange

Returns evenly spaced values within a given interval.

> # up to but not including 10
+ np.arange(0,10) 
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
> #by 2
+ np.arange(0,11,2) 
array([ 0,  2,  4,  6,  8, 10])
zeros and ones

Generate arrays of zeros or ones.

> np.zeros(3)
array([0., 0., 0.])
> # tuple (rows, columns)
+ np.zeros((5,4)) 
array([[0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]])
> np.ones(3)
array([1., 1., 1.])
> np.ones((3,4))
array([[1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.]])
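
A related creation helper (not shown above, but standard NumPy): np.full fills an array of a given shape with an arbitrary constant.

> # Fill a 2x3 array with 7s
+ np.full((2,3),7)
array([[7, 7, 7],
       [7, 7, 7]])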
linspace

Returns evenly spaced numbers over a specified interval. The arguments are start, stop, and num (the number of points).

> np.linspace(0,10,3)
array([ 0.,  5., 10.])
> np.linspace(0,5,10)
array([0.        , 0.55555556, 1.11111111, 1.66666667, 2.22222222,
       2.77777778, 3.33333333, 3.88888889, 4.44444444, 5.        ])
> # one opening bracket
+ # still 1 dimensional
+ np.linspace(0,10,50)
array([ 0.        ,  0.20408163,  0.40816327,  0.6122449 ,  0.81632653,
        1.02040816,  1.2244898 ,  1.42857143,  1.63265306,  1.83673469,
        2.04081633,  2.24489796,  2.44897959,  2.65306122,  2.85714286,
        3.06122449,  3.26530612,  3.46938776,  3.67346939,  3.87755102,
        4.08163265,  4.28571429,  4.48979592,  4.69387755,  4.89795918,
        5.10204082,  5.30612245,  5.51020408,  5.71428571,  5.91836735,
        6.12244898,  6.32653061,  6.53061224,  6.73469388,  6.93877551,
        7.14285714,  7.34693878,  7.55102041,  7.75510204,  7.95918367,
        8.16326531,  8.36734694,  8.57142857,  8.7755102 ,  8.97959184,
        9.18367347,  9.3877551 ,  9.59183673,  9.79591837, 10.        ])
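
One way to keep arange and linspace straight: arange takes a step size (stop excluded), while linspace takes a count of points (stop included by default). A minimal comparison:

> # arange: step of 0.25, stop excluded
+ np.arange(0,1,0.25)
array([0.  , 0.25, 0.5 , 0.75])
> # linspace: 5 points, stop included
+ np.linspace(0,1,5)
array([0.  , 0.25, 0.5 , 0.75, 1.  ])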
eye

Creates an identity matrix

> np.eye(4)
array([[1., 0., 0., 0.],
       [0., 1., 0., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.]])

Random

rand

Create an array of the given shape and populate it with random samples from a uniform distribution over [0, 1).

> np.random.rand(2)
array([0.59110145, 0.97873794])
> #2d
+ np.random.rand(5,4)
array([[0.49683779, 0.41011857, 0.49680767, 0.19783712],
       [0.38087874, 0.09527378, 0.53229016, 0.9158716 ],
       [0.9875367 , 0.82660478, 0.6605952 , 0.90010448],
       [0.05382331, 0.50698107, 0.58781927, 0.29309634],
       [0.46095471, 0.38694441, 0.87989227, 0.46206292]])
randn

Returns a sample (or samples) from the “standard normal” distribution.

> np.random.randn(2)
array([0.00592357, 1.21541429])
> np.random.randn(5,4)
array([[ 0.8479065 ,  1.22475886,  0.66784301, -0.96586068],
       [-1.42468007,  1.07965643, -0.36176629,  0.31531068],
       [-2.48265365,  0.51033764, -0.5693635 , -0.960074  ],
       [-0.89532982,  0.89758788, -0.62306251, -0.30047938],
       [-0.10816238,  0.54934125, -0.45687432,  2.90258011]])
randint

Return random integers from low (inclusive) to high (exclusive).

> np.random.randint(1,100)
24
> #10 numbers
+ np.random.randint(1,100,10)
array([70, 86, 33,  1, 66, 28, 86, 83, 61, 90])
> # set the seed so results are replicable
+ # this will always produce 45
+ np.random.seed(0)
+ np.random.randint(1,100)
45

You can import the function directly to save some typing.

> from numpy.random import randint
+ randint(2,10)
9

Array Attributes and Methods

> #Some arrays to work with
+ arr = np.arange(25)
+ ranarr = np.random.randint(0,50,10)
+ arr
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24])
> ranarr
array([ 0,  3,  3, 39,  9, 19, 21, 36, 23,  6])
reshape

Returns an array containing the same data with a new shape.

> # errors if the elements can't fill the new shape exactly
+ arr.reshape(5,5) 
array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19],
       [20, 21, 22, 23, 24]])
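
You can also pass -1 for one dimension and NumPy will infer it from the length of the array; a small sketch with arr from above:

> # -1 lets NumPy infer the second dimension (25/5 = 5)
+ arr.reshape(5,-1).shape
(5, 5)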
max, min, argmax, argmin

Find the max or min values, or their index locations, with argmax and argmin:

> ranarr.max()
39
> #index location
+ ranarr.argmax() 
3
> ranarr.min()
0
> #index location
+ ranarr.argmin() 
0
Shape

Shape is an attribute that arrays have (not a method):

> arr.shape
(25,)
> # Notice the two sets of brackets
+ arr.reshape(5,5)
array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19],
       [20, 21, 22, 23, 24]])
> arr.reshape(5,5).shape
(5, 5)
dtype

You can also grab the data type of the objects in the array:

> arr.dtype
dtype('int32')
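
The dtype can also be set explicitly when creating the array; a quick sketch:

> np.array([1,2,3],dtype=np.float64)
array([1., 2., 3.])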

Indexing and Selection


> #Creating sample array
+ arr = np.arange(0,11)
+ arr
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10])

Bracket Indexing and Selection

> #Get a value at an index
+ arr[8]
8
> #Get values in a range
+ #slice notation
+ arr[1:5]
array([1, 2, 3, 4])
> #Get values in a range
+ arr[0:5]
array([0, 1, 2, 3, 4])
> arr[:6]
array([0, 1, 2, 3, 4, 5])
> arr[5:]
array([ 5,  6,  7,  8,  9, 10])

Broadcasting

NumPy arrays differ from normal Python lists because of their ability to broadcast:

> #Setting a value with index range (Broadcasting)
+ arr[0:5]=100
+ arr
array([100, 100, 100, 100, 100,   5,   6,   7,   8,   9,  10])
> # Reset array to remove 100s
+ arr = np.arange(0,11)
+ arr
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10])
> # Take a slice
+ slice_of_arr = arr[0:6]
+ slice_of_arr
array([0, 1, 2, 3, 4, 5])
> #Change Slice
+ #select everything
+ slice_of_arr[:]=99
+ slice_of_arr
array([99, 99, 99, 99, 99, 99])

Assigning a slice of the array to slice_of_arr does not actually create a separate array; the slice is a view that points to the original data, so any changes will impact both.

> arr
array([99, 99, 99, 99, 99, 99,  6,  7,  8,  9, 10])

If you don’t want to change the original, you must make a copy with arr.copy().

> #To get a copy, need to be explicit
+ arr_copy = arr.copy()
+ arr_copy
array([99, 99, 99, 99, 99, 99,  6,  7,  8,  9, 10])
> arr_copy[:]=100
+ arr_copy
array([100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100])

Now the original is not affected.

> arr 
array([99, 99, 99, 99, 99, 99,  6,  7,  8,  9, 10])
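
A quick way to check whether you are holding a view or a copy (using the arrays above): a view's .base attribute points back to the array it was sliced from, while a copy owns its own data.

> slice_of_arr.base is arr
True
> arr_copy.base is None
True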

Indexing a 2D Array (Matrices)

The general format is arr_2d[row][col] or arr_2d[row,col].

> arr_2d = np.array(([5,10,15],[20,25,30],[35,40,45]))
+ arr_2d
array([[ 5, 10, 15],
       [20, 25, 30],
       [35, 40, 45]])
> #Indexing row
+ arr_2d[1]
array([20, 25, 30])
> # Getting individual element value
+ arr_2d[1][0]
20
> arr_2d[1,0]
20
> # 2D array slicing
+ #Shape (2,2) from top right corner
+ arr_2d[:2,1:]
array([[10, 15],
       [25, 30]])
> # Bottom row
+ arr_2d[2]
array([35, 40, 45])
> arr_2d[2,:]
array([35, 40, 45])
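
Grabbing a column works the same way, with a slice in the row slot:

> # Middle column
+ arr_2d[:,1]
array([10, 25, 40])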
Fancy Indexing

Fancy indexing allows you to select entire rows or columns out of order.

> #Set up matrix
+ arr2d = np.zeros((10,10))
> # Number of columns (equal to the number of rows here)
+ arr_length = arr2d.shape[1]
> #Set up array
+ for i in range(arr_length):
+     arr2d[i] = i
+     
+ arr2d
array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
       [3., 3., 3., 3., 3., 3., 3., 3., 3., 3.],
       [4., 4., 4., 4., 4., 4., 4., 4., 4., 4.],
       [5., 5., 5., 5., 5., 5., 5., 5., 5., 5.],
       [6., 6., 6., 6., 6., 6., 6., 6., 6., 6.],
       [7., 7., 7., 7., 7., 7., 7., 7., 7., 7.],
       [8., 8., 8., 8., 8., 8., 8., 8., 8., 8.],
       [9., 9., 9., 9., 9., 9., 9., 9., 9., 9.]])

Fancy indexing allows the following:

> arr2d[[2,4,6,8]]
array([[2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
       [4., 4., 4., 4., 4., 4., 4., 4., 4., 4.],
       [6., 6., 6., 6., 6., 6., 6., 6., 6., 6.],
       [8., 8., 8., 8., 8., 8., 8., 8., 8., 8.]])
> #Allows in any order
+ arr2d[[6,4,2,7]]
array([[6., 6., 6., 6., 6., 6., 6., 6., 6., 6.],
       [4., 4., 4., 4., 4., 4., 4., 4., 4., 4.],
       [2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
       [7., 7., 7., 7., 7., 7., 7., 7., 7., 7.]])

Selection

You can use brackets for selection with comparison operators.

> arr = np.arange(1,11)
+ arr
array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10])
> arr > 4
array([False, False, False, False,  True,  True,  True,  True,  True,
        True])
> bool_arr = arr>4
+ bool_arr
array([False, False, False, False,  True,  True,  True,  True,  True,
        True])
> arr[bool_arr]
array([ 5,  6,  7,  8,  9, 10])
> arr[arr>2]
array([ 3,  4,  5,  6,  7,  8,  9, 10])
> x = 2
+ arr[arr>x]
array([ 3,  4,  5,  6,  7,  8,  9, 10])
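
Conditions can be combined with & (AND) and | (OR), with each condition wrapped in parentheses; a small sketch with arr from above:

> arr[(arr>2) & (arr<8)]
array([3, 4, 5, 6, 7])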

NumPy Operations


Arithmetic

You can easily perform array-with-array arithmetic, or scalar-with-array arithmetic.

> arr = np.arange(0,10)
+ arr + arr
array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18])
> arr * arr
array([ 0,  1,  4,  9, 16, 25, 36, 49, 64, 81])
> arr+100
array([100, 101, 102, 103, 104, 105, 106, 107, 108, 109])
> arr*100
array([  0, 100, 200, 300, 400, 500, 600, 700, 800, 900])
> arr - arr
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
> # Warning on division by zero
+ # Just replaced with nan
+ arr/arr
array([nan,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.])

C:/Users/pbj20/Anaconda3/python.exe:1: RuntimeWarning: invalid value encountered in true_divide
> # Also warning
+ # infinity instead of error
+ 1/arr
array([       inf, 1.        , 0.5       , 0.33333333, 0.25      ,
       0.2       , 0.16666667, 0.14285714, 0.125     , 0.11111111])

C:/Users/pbj20/Anaconda3/python.exe:1: RuntimeWarning: divide by zero encountered in true_divide
> arr**3
array([  0,   1,   8,  27,  64, 125, 216, 343, 512, 729], dtype=int32)

Universal Array Functions

Universal array functions (ufuncs) apply a mathematical operation element-wise across the array.

See here for more details: http://docs.scipy.org/doc/numpy/reference/ufuncs.html

Some common ones are shown below:

> #Taking Square Roots
+ np.sqrt(arr)
array([0.        , 1.        , 1.41421356, 1.73205081, 2.        ,
       2.23606798, 2.44948974, 2.64575131, 2.82842712, 3.        ])
> # Calculating the exponential (e^x)
+ np.exp(arr)
array([1.00000000e+00, 2.71828183e+00, 7.38905610e+00, 2.00855369e+01,
       5.45981500e+01, 1.48413159e+02, 4.03428793e+02, 1.09663316e+03,
       2.98095799e+03, 8.10308393e+03])
> np.max(arr) #same as arr.max()
9
> np.sin(arr)
array([ 0.        ,  0.84147098,  0.90929743,  0.14112001, -0.7568025 ,
       -0.95892427, -0.2794155 ,  0.6569866 ,  0.98935825,  0.41211849])
> np.log(arr)
array([      -inf, 0.        , 0.69314718, 1.09861229, 1.38629436,
       1.60943791, 1.79175947, 1.94591015, 2.07944154, 2.19722458])

C:/Users/pbj20/Anaconda3/python.exe:1: RuntimeWarning: divide by zero encountered in log
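
NumPy also has aggregation methods that reduce an array to a single number; a quick sketch with arr = np.arange(0,10) from above (the exact float repr may vary):

> arr.sum(), arr.mean(), arr.std()
(45, 4.5, 2.8722813232690143)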

NumPy Exercises


  1. Create an array of 10 zeros
> np.zeros(10)
array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
  2. Create an array of 10 ones
> np.ones(10)
array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])
  3. Create an array of 10 fives
> np.ones(10) * 5
array([5., 5., 5., 5., 5., 5., 5., 5., 5., 5.])
> np.zeros(10) + 5
array([5., 5., 5., 5., 5., 5., 5., 5., 5., 5.])
  4. Create an array of the integers from 10 to 50
> np.arange(10,51)
array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26,
       27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43,
       44, 45, 46, 47, 48, 49, 50])
  5. Create an array of all the even integers from 10 to 50
> np.arange(10,51,2)
array([10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42,
       44, 46, 48, 50])
  6. Create a 3x3 matrix with values ranging from 0 to 8
> np.arange(9).reshape(3,3)
array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])
  7. Create a 3x3 identity matrix
> np.eye(3)
array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])
  8. Use NumPy to generate a random number between 0 and 1
> np.random.rand(1)
array([0.05671298])
  9. Use NumPy to generate an array of 25 random numbers sampled from a standard normal distribution
> np.random.randn(25)
array([-0.1730696 , -1.76165167, -0.08767307,  1.36687937,  1.1253141 ,
       -0.35899555,  1.2206081 , -1.33949555,  0.42837337, -0.12346315,
        1.41437719, -0.12405066,  2.00815709,  0.22988654,  0.60489373,
        1.62715982,  1.59456053,  0.23043417, -0.06491034, -0.96898025,
        0.59124281, -0.7827755 , -0.44423283, -0.34518616, -0.88180055])
  10. Create the following matrix:
array([[0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1 ],
       [0.11, 0.12, 0.13, 0.14, 0.15, 0.16, 0.17, 0.18, 0.19, 0.2 ],
       [0.21, 0.22, 0.23, 0.24, 0.25, 0.26, 0.27, 0.28, 0.29, 0.3 ],
       [0.31, 0.32, 0.33, 0.34, 0.35, 0.36, 0.37, 0.38, 0.39, 0.4 ],
       [0.41, 0.42, 0.43, 0.44, 0.45, 0.46, 0.47, 0.48, 0.49, 0.5 ],
       [0.51, 0.52, 0.53, 0.54, 0.55, 0.56, 0.57, 0.58, 0.59, 0.6 ],
       [0.61, 0.62, 0.63, 0.64, 0.65, 0.66, 0.67, 0.68, 0.69, 0.7 ],
       [0.71, 0.72, 0.73, 0.74, 0.75, 0.76, 0.77, 0.78, 0.79, 0.8 ],
       [0.81, 0.82, 0.83, 0.84, 0.85, 0.86, 0.87, 0.88, 0.89, 0.9 ],
       [0.91, 0.92, 0.93, 0.94, 0.95, 0.96, 0.97, 0.98, 0.99, 1.  ]])
> np.arange(1,101).reshape(10,10) / 100
array([[0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1 ],
       [0.11, 0.12, 0.13, 0.14, 0.15, 0.16, 0.17, 0.18, 0.19, 0.2 ],
       [0.21, 0.22, 0.23, 0.24, 0.25, 0.26, 0.27, 0.28, 0.29, 0.3 ],
       [0.31, 0.32, 0.33, 0.34, 0.35, 0.36, 0.37, 0.38, 0.39, 0.4 ],
       [0.41, 0.42, 0.43, 0.44, 0.45, 0.46, 0.47, 0.48, 0.49, 0.5 ],
       [0.51, 0.52, 0.53, 0.54, 0.55, 0.56, 0.57, 0.58, 0.59, 0.6 ],
       [0.61, 0.62, 0.63, 0.64, 0.65, 0.66, 0.67, 0.68, 0.69, 0.7 ],
       [0.71, 0.72, 0.73, 0.74, 0.75, 0.76, 0.77, 0.78, 0.79, 0.8 ],
       [0.81, 0.82, 0.83, 0.84, 0.85, 0.86, 0.87, 0.88, 0.89, 0.9 ],
       [0.91, 0.92, 0.93, 0.94, 0.95, 0.96, 0.97, 0.98, 0.99, 1.  ]])
> np.linspace(0.01,1,100).reshape(10,10)
array([[0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1 ],
       [0.11, 0.12, 0.13, 0.14, 0.15, 0.16, 0.17, 0.18, 0.19, 0.2 ],
       [0.21, 0.22, 0.23, 0.24, 0.25, 0.26, 0.27, 0.28, 0.29, 0.3 ],
       [0.31, 0.32, 0.33, 0.34, 0.35, 0.36, 0.37, 0.38, 0.39, 0.4 ],
       [0.41, 0.42, 0.43, 0.44, 0.45, 0.46, 0.47, 0.48, 0.49, 0.5 ],
       [0.51, 0.52, 0.53, 0.54, 0.55, 0.56, 0.57, 0.58, 0.59, 0.6 ],
       [0.61, 0.62, 0.63, 0.64, 0.65, 0.66, 0.67, 0.68, 0.69, 0.7 ],
       [0.71, 0.72, 0.73, 0.74, 0.75, 0.76, 0.77, 0.78, 0.79, 0.8 ],
       [0.81, 0.82, 0.83, 0.84, 0.85, 0.86, 0.87, 0.88, 0.89, 0.9 ],
       [0.91, 0.92, 0.93, 0.94, 0.95, 0.96, 0.97, 0.98, 0.99, 1.  ]])
  11. Create an array of 20 linearly spaced points between 0 and 1
> np.linspace(0,1,20)
array([0.        , 0.05263158, 0.10526316, 0.15789474, 0.21052632,
       0.26315789, 0.31578947, 0.36842105, 0.42105263, 0.47368421,
       0.52631579, 0.57894737, 0.63157895, 0.68421053, 0.73684211,
       0.78947368, 0.84210526, 0.89473684, 0.94736842, 1.        ])
  12. Create the following matrix, named mat:
array([[ 1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10],
       [11, 12, 13, 14, 15],
       [16, 17, 18, 19, 20],
       [21, 22, 23, 24, 25]])
> mat = np.arange(1,26).reshape(5,5)
+ mat
array([[ 1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10],
       [11, 12, 13, 14, 15],
       [16, 17, 18, 19, 20],
       [21, 22, 23, 24, 25]])
  13. Create the following matrix:
array([[12, 13, 14, 15],
       [17, 18, 19, 20],
       [22, 23, 24, 25]])
> mat[2:,1:]
array([[12, 13, 14, 15],
       [17, 18, 19, 20],
       [22, 23, 24, 25]])
  14. Extract 20
> mat[3,4]
20
  15. Create the following matrix:
array([[ 2],
       [ 7],
       [12]])
> # note that 1:2 makes it 2d
+ # just 1 makes it 1d
+ mat[:3,1:2]
array([[ 2],
       [ 7],
       [12]])
  16. Create the following matrix:
array([21, 22, 23, 24, 25])
> mat[4,:]
array([21, 22, 23, 24, 25])
  17. Create the following matrix:
array([[16, 17, 18, 19, 20],
       [21, 22, 23, 24, 25]])
> mat[3:5,:]
array([[16, 17, 18, 19, 20],
       [21, 22, 23, 24, 25]])
  18. Get the sum of all the values in mat
> mat.sum()
325
  19. Get the standard deviation of the values in mat
> mat.std()
7.211102550927978
  20. Get the sum of all the columns in mat
> mat.sum(axis=0)
array([55, 60, 65, 70, 75])

Pandas Series and DataFrames


Series

A Series is very similar to a NumPy array. What differentiates the NumPy array from a Series is that a Series can have axis labels, meaning it can be indexed by a label instead of just a number location. It also doesn’t need to hold numeric data; it can hold any arbitrary Python object.

Creating a Series
  • You can convert a list, NumPy array, or dictionary to a Series:
> import pandas as pd
+ labels = ['a','b','c']
+ my_list = [10,20,30]
+ arr = np.array([10,20,30])
+ d = {'a':10,'b':20,'c':30}
Using Lists
> pd.Series(data=my_list)
0    10
1    20
2    30
dtype: int64
> pd.Series(data=my_list,index=labels)
a    10
b    20
c    30
dtype: int64
> pd.Series(my_list,labels)
a    10
b    20
c    30
dtype: int64
NumPy Arrays
> pd.Series(arr)
0    10
1    20
2    30
dtype: int32
> pd.Series(arr,labels)
a    10
b    20
c    30
dtype: int32
Dictionaries
> pd.Series(d)
a    10
b    20
c    30
dtype: int64
Data in a Series

A pandas Series can hold a variety of object types:

> pd.Series(data=labels)
0    a
1    b
2    c
dtype: object
> # Even functions 
+ pd.Series([sum,print,len])
0      <built-in function sum>
1    <built-in function print>
2      <built-in function len>
dtype: object
Using an Index

The key to using a Series is understanding its index. Pandas makes use of these index names or numbers by allowing for fast lookups of information.

> ser1 = pd.Series([1,2,3,4],
+     index = ['USA', 'Germany','USSR', 'Japan'])  
+ ser1
USA        1
Germany    2
USSR       3
Japan      4
dtype: int64
> ser2 = pd.Series([1,2,5,4],
+     index = ['USA', 'Germany','Italy', 'Japan'])  
+ ser2
USA        1
Germany    2
Italy      5
Japan      4
dtype: int64
> ser1['USA']
1

Operations are then also done based off of index:

> # Null where can't find match
+ # integers converted to float
+ ser1 + ser2 
Germany    4.0
Italy      NaN
Japan      8.0
USA        2.0
USSR       NaN
dtype: float64
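
If you would rather keep unmatched labels than get NaN, the .add method accepts a fill_value; a small sketch:

> # Treat missing labels as 0 instead of NaN
+ ser1.add(ser2,fill_value=0)
Germany    4.0
Italy      5.0
Japan      8.0
USA        2.0
USSR       3.0
dtype: float64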

DataFrames

DataFrames are inspired by the R programming language. A DataFrame is essentially a bunch of Series that share an index.

> from numpy.random import randn
+ np.random.seed(101) 
+ df = pd.DataFrame(randn(5,4),
+     index='A B C D E'.split(),
+     columns='W X Y Z'.split())
+ df
          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965
C -2.018168  0.740122  0.528813 -0.589001
D  0.188695 -0.758872 -0.933237  0.955057
E  0.190794  1.978757  2.605967  0.683509
Selection and Indexing
> df['W']
A    2.706850
B    0.651118
C   -2.018168
D    0.188695
E    0.190794
Name: W, dtype: float64

DataFrame Columns are just Series

> type(df['W'])
<class 'pandas.core.series.Series'>
> type(df)
<class 'pandas.core.frame.DataFrame'>
> # Pass a list of column names
+ df[['W','Z']]
          W         Z
A  2.706850  0.503826
B  0.651118  0.605965
C -2.018168 -0.589001
D  0.188695  0.955057
E  0.190794  0.683509
Creating a New Column:
> df['new'] = df['W'] + df['Y']
+ df
          W         X         Y         Z       new
A  2.706850  0.628133  0.907969  0.503826  3.614819
B  0.651118 -0.319318 -0.848077  0.605965 -0.196959
C -2.018168  0.740122  0.528813 -0.589001 -1.489355
D  0.188695 -0.758872 -0.933237  0.955057 -0.744542
E  0.190794  1.978757  2.605967  0.683509  2.796762
Removing Columns
> df.drop('new',axis=1)
          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965
C -2.018168  0.740122  0.528813 -0.589001
D  0.188695 -0.758872 -0.933237  0.955057
E  0.190794  1.978757  2.605967  0.683509
> # Not inplace unless specified!
+ df
          W         X         Y         Z       new
A  2.706850  0.628133  0.907969  0.503826  3.614819
B  0.651118 -0.319318 -0.848077  0.605965 -0.196959
C -2.018168  0.740122  0.528813 -0.589001 -1.489355
D  0.188695 -0.758872 -0.933237  0.955057 -0.744542
E  0.190794  1.978757  2.605967  0.683509  2.796762
> df.drop('new',axis=1,inplace=True)
+ df
          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965
C -2.018168  0.740122  0.528813 -0.589001
D  0.188695 -0.758872 -0.933237  0.955057
E  0.190794  1.978757  2.605967  0.683509

You can also drop rows this way:

> df.drop('E',axis=0)
          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965
C -2.018168  0.740122  0.528813 -0.589001
D  0.188695 -0.758872 -0.933237  0.955057

The shape is a tuple, with rows at index 0 and columns at index 1. You can think of the axis as the tuple index.

> df.shape
(5, 4)
Selecting Rows
> df.loc['A']
W    2.706850
X    0.628133
Y    0.907969
Z    0.503826
Name: A, dtype: float64

Or select based off of position instead of label.

> df.iloc[2]
W   -2.018168
X    0.740122
Y    0.528813
Z   -0.589001
Name: C, dtype: float64
Selecting Subset of Rows and Columns
> df.loc['B','Y']
-0.8480769834036315
> df.loc[['A','B'],['W','Y']]
          W         Y
A  2.706850  0.907969
B  0.651118 -0.848077
> df.iloc[[0,1],[0,2]]
          W         Y
A  2.706850  0.907969
B  0.651118 -0.848077
Conditional Selection

An important feature of pandas is conditional selection using bracket notation, very similar to NumPy:

> df
          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965
C -2.018168  0.740122  0.528813 -0.589001
D  0.188695 -0.758872 -0.933237  0.955057
E  0.190794  1.978757  2.605967  0.683509
> df>0
       W      X      Y      Z
A   True   True   True   True
B   True  False  False   True
C  False   True   True  False
D   True  False  False   True
E   True   True   True   True
> df[df>0]
          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118       NaN       NaN  0.605965
C       NaN  0.740122  0.528813       NaN
D  0.188695       NaN       NaN  0.955057
E  0.190794  1.978757  2.605967  0.683509
> df['W']>0
A     True
B     True
C    False
D     True
E     True
Name: W, dtype: bool
> df[df['W']>0]
          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965
D  0.188695 -0.758872 -0.933237  0.955057
E  0.190794  1.978757  2.605967  0.683509
> df[df['Z']<0]
          W         X         Y         Z
C -2.018168  0.740122  0.528813 -0.589001
> df[df['W']>0]['Y']
A    0.907969
B   -0.848077
D   -0.933237
E    2.605967
Name: Y, dtype: float64
> df[df['W']>0][['Y','X']]
          Y         X
A  0.907969  0.628133
B -0.848077 -0.319318
D -0.933237 -0.758872
E  2.605967  1.978757

For two conditions you can use | (OR) and & (AND) with parentheses:

> df[(df['W']>0) & (df['Y'] > 1)]
          W         X         Y         Z
E  0.190794  1.978757  2.605967  0.683509
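
The same pattern works with | for OR (rows where either condition holds):

> df[(df['W']>0) | (df['Y'] > 1)]
          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965
D  0.188695 -0.758872 -0.933237  0.955057
E  0.190794  1.978757  2.605967  0.683509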
More Index Details
  • Resetting the index
  • Setting it to something else
  • Index hierarchy
> df
          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965
C -2.018168  0.740122  0.528813 -0.589001
D  0.188695 -0.758872 -0.933237  0.955057
E  0.190794  1.978757  2.605967  0.683509
> # Reset to default 0,1...n index
+ # old index will become column
+ # not inplace
+ df.reset_index()
  index         W         X         Y         Z
0     A  2.706850  0.628133  0.907969  0.503826
1     B  0.651118 -0.319318 -0.848077  0.605965
2     C -2.018168  0.740122  0.528813 -0.589001
3     D  0.188695 -0.758872 -0.933237  0.955057
4     E  0.190794  1.978757  2.605967  0.683509
> df
          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965
C -2.018168  0.740122  0.528813 -0.589001
D  0.188695 -0.758872 -0.933237  0.955057
E  0.190794  1.978757  2.605967  0.683509
> newind = 'CA NY WY OR CO'.split()
+ newind
['CA', 'NY', 'WY', 'OR', 'CO']
> df['States'] = newind
+ df
          W         X         Y         Z States
A  2.706850  0.628133  0.907969  0.503826     CA
B  0.651118 -0.319318 -0.848077  0.605965     NY
C -2.018168  0.740122  0.528813 -0.589001     WY
D  0.188695 -0.758872 -0.933237  0.955057     OR
E  0.190794  1.978757  2.605967  0.683509     CO
> df.set_index('States',inplace=True)
+ df
               W         X         Y         Z
States                                        
CA      2.706850  0.628133  0.907969  0.503826
NY      0.651118 -0.319318 -0.848077  0.605965
WY     -2.018168  0.740122  0.528813 -0.589001
OR      0.188695 -0.758872 -0.933237  0.955057
CO      0.190794  1.978757  2.605967  0.683509
Multi-Index and Index Hierarchy
> # Index Levels
+ outside = ['G1','G1','G1','G2','G2','G2']
+ inside = [1,2,3,1,2,3]
+ hier_index = list(zip(outside,inside))
+ hier_index = pd.MultiIndex.from_tuples(hier_index)
> hier_index #list of tuple pairs
MultiIndex([('G1', 1),
            ('G1', 2),
            ('G1', 3),
            ('G2', 1),
            ('G2', 2),
            ('G2', 3)],
           )
> df = pd.DataFrame(np.random.randn(6,2),
+     index=hier_index,columns=['A','B'])
+ df
             A         B
G1 1  0.302665  1.693723
   2 -1.706086 -1.159119
   3 -0.134841  0.390528
G2 1  0.166905  0.184502
   2  0.807706  0.072960
   3  0.638787  0.329646

For index hierarchy we use df.loc[]; if this were on the columns axis, you would just use normal bracket notation df[]. Calling one level of the index returns the sub-DataFrame:

> df.loc['G1']
          A         B
1  0.302665  1.693723
2 -1.706086 -1.159119
3 -0.134841  0.390528
> df.loc['G1'].loc[1]
A    0.302665
B    1.693723
Name: 1, dtype: float64
> # indices aren't labeled
+ df.index.names 
FrozenList([None, None])
> df.index.names = ['Group','Num']
+ df
                  A         B
Group Num                    
G1    1    0.302665  1.693723
      2   -1.706086 -1.159119
      3   -0.134841  0.390528
G2    1    0.166905  0.184502
      2    0.807706  0.072960
      3    0.638787  0.329646
> df.loc['G2'].loc[2]['B']
0.07295967531703869

Use .xs to grab a cross section; it makes it easy to select data at a particular level of a MultiIndex.

> df.xs('G1') #cross section
            A         B
Num                    
1    0.302665  1.693723
2   -1.706086 -1.159119
3   -0.134841  0.390528
> df.xs(['G1',1])
A    0.302665
B    1.693723
Name: (G1, 1), dtype: float64
> df.xs(['G1',1])[1]
1.693722925204035
> df.xs(1,level='Num')
              A         B
Group                    
G1     0.302665  1.693723
G2     0.166905  0.184502

Missing Data

> df = pd.DataFrame({'A':[1,2,np.nan],
+                   'B':[5,np.nan,np.nan],
+                   'C':[1,2,3]})
+ df
     A    B  C
0  1.0  5.0  1
1  2.0  NaN  2
2  NaN  NaN  3
> #rows with missing values dropped
+ df.dropna() 
     A    B  C
0  1.0  5.0  1
> df.dropna(axis=1)
   C
0  1
1  2
2  3
> df['A'].dropna()
0    1.0
1    2.0
Name: A, dtype: float64
> df.dropna(subset=['A', 'C'])
     A    B  C
0  1.0  5.0  1
1  2.0  NaN  2

With thresh you can require a minimum number of non-NA values for a row (or column) to be kept.

> df.dropna(thresh=2)
     A    B  C
0  1.0  5.0  1
1  2.0  NaN  2
> df.fillna(value='FILL VALUE')
            A           B  C
0           1           5  1
1           2  FILL VALUE  2
2  FILL VALUE  FILL VALUE  3

Fill missing values with the mean:

> df['A'].fillna(value=df['A'].mean())
0    1.0
1    2.0
2    1.5
Name: A, dtype: float64
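
The same idea extends to the whole DataFrame, since fillna accepts a Series of per-column fill values such as the column means; a small sketch:

> df.fillna(value=df.mean())
     A    B  C
0  1.0  5.0  1
1  2.0  5.0  2
2  1.5  5.0  3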

Group, Merge, Join


Groupby

The groupby method allows you to group rows of data together and call aggregate functions.

> import pandas as pd
+ # Create dataframe
+ data = {'Company':['GOOG','GOOG','MSFT','MSFT','FB','FB'],
+   'Person':['Sam','Charlie','Amy','Vanessa','Carl','Sarah'],
+   'Sales':[200,120,340,124,243,350]}
> df = pd.DataFrame(data)
+ df
  Company   Person  Sales
0    GOOG      Sam    200
1    GOOG  Charlie    120
2    MSFT      Amy    340
3    MSFT  Vanessa    124
4      FB     Carl    243
5      FB    Sarah    350

Now you can use the .groupby() method to group rows together based off of a column name. For instance grouping based off of Company will create a DataFrameGroupBy object:

> df.groupby('Company')
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000000049E0D508>

You can save this object as a new variable:

> by_comp = df.groupby("Company")

And then call aggregate methods off the object:

> by_comp.mean()
         Sales
Company       
FB       296.5
GOOG     160.0
MSFT     232.0
> df.groupby('Company').mean()
         Sales
Company       
FB       296.5
GOOG     160.0
MSFT     232.0
> by_comp.std()
              Sales
Company            
FB        75.660426
GOOG      56.568542
MSFT     152.735065
> by_comp.min()
          Person  Sales
Company                
FB          Carl    243
GOOG     Charlie    120
MSFT         Amy    124
> by_comp.max()
          Person  Sales
Company                
FB         Sarah    350
GOOG         Sam    200
MSFT     Vanessa    340
> by_comp.count()
         Person  Sales
Company               
FB            2      2
GOOG          2      2
MSFT          2      2
> by_comp.describe()
        Sales                                                        
        count   mean         std    min     25%    50%     75%    max
Company                                                              
FB        2.0  296.5   75.660426  243.0  269.75  296.5  323.25  350.0
GOOG      2.0  160.0   56.568542  120.0  140.00  160.0  180.00  200.0
MSFT      2.0  232.0  152.735065  124.0  178.00  232.0  286.00  340.0
> by_comp.describe().transpose()
Company              FB        GOOG        MSFT
Sales count    2.000000    2.000000    2.000000
      mean   296.500000  160.000000  232.000000
      std     75.660426   56.568542  152.735065
      min    243.000000  120.000000  124.000000
      25%    269.750000  140.000000  178.000000
      50%    296.500000  160.000000  232.000000
      75%    323.250000  180.000000  286.000000
      max    350.000000  200.000000  340.000000
> by_comp.describe().transpose()['GOOG']
Sales  count      2.000000
       mean     160.000000
       std       56.568542
       min      120.000000
       25%      140.000000
       50%      160.000000
       75%      180.000000
       max      200.000000
Name: GOOG, dtype: float64
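
You can also request several aggregates at once with .agg; a sketch (values follow from the describe output above, column layout approximate):

> by_comp['Sales'].agg(['mean','min','max'])
          mean  min  max
Company                 
FB       296.5  243  350
GOOG     160.0  120  200
MSFT     232.0  124  340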

Example DataFrames

> df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
+                         'B': ['B0', 'B1', 'B2', 'B3'],
+                         'C': ['C0', 'C1', 'C2', 'C3'],
+                         'D': ['D0', 'D1', 'D2', 'D3']},
+                         index=[0, 1, 2, 3])
> df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
+                         'B': ['B4', 'B5', 'B6', 'B7'],
+                         'C': ['C4', 'C5', 'C6', 'C7'],
+                         'D': ['D4', 'D5', 'D6', 'D7']},
+                          index=[4, 5, 6, 7]) 
> df3 = pd.DataFrame({'A': ['A8', 'A9', 'A10', 'A11'],
+                         'B': ['B8', 'B9', 'B10', 'B11'],
+                         'C': ['C8', 'C9', 'C10', 'C11'],
+                         'D': ['D8', 'D9', 'D10', 'D11']},
+                         index=[8, 9, 10, 11])
> df1
    A   B   C   D
0  A0  B0  C0  D0
1  A1  B1  C1  D1
2  A2  B2  C2  D2
3  A3  B3  C3  D3
> df2
    A   B   C   D
4  A4  B4  C4  D4
5  A5  B5  C5  D5
6  A6  B6  C6  D6
7  A7  B7  C7  D7
> df3
      A    B    C    D
8    A8   B8   C8   D8
9    A9   B9   C9   D9
10  A10  B10  C10  D10
11  A11  B11  C11  D11

Concatenation

Concatenation glues together DataFrames. The dimensions should match along the axis on which you are concatenating. You can use pd.concat and pass in a list of DataFrames to concatenate together:

> pd.concat([df1,df2,df3])
      A    B    C    D
0    A0   B0   C0   D0
1    A1   B1   C1   D1
2    A2   B2   C2   D2
3    A3   B3   C3   D3
4    A4   B4   C4   D4
5    A5   B5   C5   D5
6    A6   B6   C6   D6
7    A7   B7   C7   D7
8    A8   B8   C8   D8
9    A9   B9   C9   D9
10  A10  B10  C10  D10
11  A11  B11  C11  D11

There will be missing values where indices don’t match.

> pd.concat([df1,df2,df3],axis=1)
      A    B    C    D    A    B    C    D    A    B    C    D
0    A0   B0   C0   D0  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
1    A1   B1   C1   D1  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
2    A2   B2   C2   D2  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
3    A3   B3   C3   D3  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
4   NaN  NaN  NaN  NaN   A4   B4   C4   D4  NaN  NaN  NaN  NaN
5   NaN  NaN  NaN  NaN   A5   B5   C5   D5  NaN  NaN  NaN  NaN
6   NaN  NaN  NaN  NaN   A6   B6   C6   D6  NaN  NaN  NaN  NaN
7   NaN  NaN  NaN  NaN   A7   B7   C7   D7  NaN  NaN  NaN  NaN
8   NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN   A8   B8   C8   D8
9   NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN   A9   B9   C9   D9
10  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  A10  B10  C10  D10
11  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  A11  B11  C11  D11

Example DataFrames

> left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
+                      'A': ['A0', 'A1', 'A2', 'A3'],
+                      'B': ['B0', 'B1', 'B2', 'B3']})
+    
+ right = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
+                           'C': ['C0', 'C1', 'C2', 'C3'],
+                           'D': ['D0', 'D1', 'D2', 'D3']})    
> left
  key   A   B
0  K0  A0  B0
1  K1  A1  B1
2  K2  A2  B2
3  K3  A3  B3
> right
  key   C   D
0  K0  C0  D0
1  K1  C1  D1
2  K2  C2  D2
3  K3  C3  D3

Merging

The merge function allows you to merge DataFrames together using logic similar to what’s found in SQL.

  • Inner Join - Matches observations where the keys are equal. Unmatched keys are not included in the result.
  • Left Join - Keeps all keys from the left table.
  • Right Join - Keeps all keys from the right table.
  • Outer Join - Keeps all keys from both the left and right tables. Known as Full Join in R.
> #default is inner join
+ pd.merge(left,right,how='inner',on='key')
  key   A   B   C   D
0  K0  A0  B0  C0  D0
1  K1  A1  B1  C1  D1
2  K2  A2  B2  C2  D2
3  K3  A3  B3  C3  D3

A more complicated example, with two keys:

> left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
+                      'key2': ['K0', 'K1', 'K0', 'K1'],
+                         'A': ['A0', 'A1', 'A2', 'A3'],
+                         'B': ['B0', 'B1', 'B2', 'B3']})
+     
+ right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
+                       'key2': ['K0', 'K0', 'K0', 'K0'],
+                          'C': ['C0', 'C1', 'C2', 'C3'],
+                          'D': ['D0', 'D1', 'D2', 'D3']})
+                          
+ left
  key1 key2   A   B
0   K0   K0  A0  B0
1   K0   K1  A1  B1
2   K1   K0  A2  B2
3   K2   K1  A3  B3
> right
  key1 key2   C   D
0   K0   K0  C0  D0
1   K1   K0  C1  D1
2   K1   K0  C2  D2
3   K2   K0  C3  D3
> pd.merge(left, right, on=['key1', 'key2'])
  key1 key2   A   B   C   D
0   K0   K0  A0  B0  C0  D0
1   K1   K0  A2  B2  C1  D1
2   K1   K0  A2  B2  C2  D2
> pd.merge(left, right, how='outer', on=['key1', 'key2'])
  key1 key2    A    B    C    D
0   K0   K0   A0   B0   C0   D0
1   K0   K1   A1   B1  NaN  NaN
2   K1   K0   A2   B2   C1   D1
3   K1   K0   A2   B2   C2   D2
4   K2   K1   A3   B3  NaN  NaN
5   K2   K0  NaN  NaN   C3   D3
> pd.merge(left, right, how='right', on=['key1', 'key2'])
  key1 key2    A    B   C   D
0   K0   K0   A0   B0  C0  D0
1   K1   K0   A2   B2  C1  D1
2   K1   K0   A2   B2  C2  D2
3   K2   K0  NaN  NaN  C3  D3
> pd.merge(left, right, how='left', on=['key1', 'key2'])
  key1 key2   A   B    C    D
0   K0   K0  A0  B0   C0   D0
1   K0   K1  A1  B1  NaN  NaN
2   K1   K0  A2  B2   C1   D1
3   K1   K0  A2  B2   C2   D2
4   K2   K1  A3  B3  NaN  NaN

Joining

Join combines the columns of two DataFrames, using their indices as the keys:

> left = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
+                      'B': ['B0', 'B1', 'B2']},
+                       index=['K0', 'K1', 'K2']) 
+ 
+ right = pd.DataFrame({'C': ['C0', 'C2', 'C3'],
+                     'D': ['D0', 'D2', 'D3']},
+                       index=['K0', 'K2', 'K3'])
> left
     A   B
K0  A0  B0
K1  A1  B1
K2  A2  B2
> right
     C   D
K0  C0  D0
K2  C2  D2
K3  C3  D3
> left.join(right) #left join default
     A   B    C    D
K0  A0  B0   C0   D0
K1  A1  B1  NaN  NaN
K2  A2  B2   C2   D2
> left.join(right, how='outer')
      A    B    C    D
K0   A0   B0   C0   D0
K1   A1   B1  NaN  NaN
K2   A2   B2   C2   D2
K3  NaN  NaN   C3   D3

Pandas Operations


> import pandas as pd
+ df = pd.DataFrame({'col1':[1,2,3,4],
+                 'col2':[444,555,666,444],
+                 'col3':['abc','def','ghi','xyz']})
+ df.head()
   col1  col2 col3
0     1   444  abc
1     2   555  def
2     3   666  ghi
3     4   444  xyz
Info on Unique Values
> df['col2'].unique()
array([444, 555, 666], dtype=int64)
> len(df['col2'].unique())
3
> df['col2'].nunique()
3
> df['col2'].value_counts()
444    2
555    1
666    1
Name: col2, dtype: int64
Selecting Data
> #Select from DataFrame using 
+ #criteria from multiple columns
+ newdf = df[(df['col1']>2) & (df['col2']==444)]
+ newdf
   col1  col2 col3
3     4   444  xyz
Applying Functions
> def times2(x):
+     return x*2
> df['col1'].apply(times2)
0    2
1    4
2    6
3    8
Name: col1, dtype: int64
> df['col2'].apply(lambda x:x*2)
0     888
1    1110
2    1332
3     888
Name: col2, dtype: int64
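
For simple arithmetic like the lambda above, a plain vectorized expression gives the same result and is usually faster:

> df['col2'] * 2
0     888
1    1110
2    1332
3     888
Name: col2, dtype: int64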
> df['col3'].apply(len)
0    3
1    3
2    3
3    3
Name: col3, dtype: int64
> df['col1'].sum()
10
> df.drop('col1', axis=1)
   col2 col3
0   444  abc
1   555  def
2   666  ghi
3   444  xyz
Permanently Removing a Column
> del df['col1']
+ df
   col2 col3
0   444  abc
1   555  def
2   666  ghi
3   444  xyz
Get Column and Index Names
> df.columns
Index(['col2', 'col3'], dtype='object')
> df.index
RangeIndex(start=0, stop=4, step=1)
Sorting and Ordering a DataFrame
> df
   col2 col3
0   444  abc
1   555  def
2   666  ghi
3   444  xyz
> #inplace=False by default
+ df.sort_values(by='col2') 
   col2 col3
0   444  abc
3   444  xyz
1   555  def
2   666  ghi
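
Pass ascending=False for a descending sort; a quick sketch on col3, which has no ties:

> df.sort_values(by='col3',ascending=False)
   col2 col3
3   444  xyz
2   666  ghi
1   555  def
0   444  abc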
Finding or Checking for Null Values
> df.isnull()
    col2   col3
0  False  False
1  False  False
2  False  False
3  False  False
> # Drop rows with NaN Values
+ df.dropna()
   col2 col3
0   444  abc
1   555  def
2   666  ghi
3   444  xyz
Filling in NaN values with something else
> df = pd.DataFrame({'col1':[1,2,3,np.nan],
+                    'col2':[np.nan,555,666,444],
+                    'col3':['abc','def','ghi','xyz']})
+ df.head()
   col1   col2 col3
0   1.0    NaN  abc
1   2.0  555.0  def
2   3.0  666.0  ghi
3   NaN  444.0  xyz
> df.fillna('FILL')
   col1  col2 col3
0     1  FILL  abc
1     2   555  def
2     3   666  ghi
3  FILL   444  xyz
Pivot Tables

The pivot_table method builds a spreadsheet-style pivot table; duplicate index/column combinations are aggregated with the mean by default.

> data = {'A':['foo','foo','foo','bar','bar','bar'],
+      'B':['one','one','two','two','one','one'],
+        'C':['x','y','x','y','x','y'],
+        'D':[1,3,2,5,4,1]}
+ 
+ df = pd.DataFrame(data)
+ df
     A    B  C  D
0  foo  one  x  1
1  foo  one  y  3
2  foo  two  x  2
3  bar  two  y  5
4  bar  one  x  4
5  bar  one  y  1
> df.pivot_table(values='D',
+             index=['A', 'B'],
+             columns=['C'])
C          x    y
A   B            
bar one  4.0  1.0
    two  NaN  5.0
foo one  1.0  3.0
    two  2.0  NaN

Data Input and Output


You may need to install some packages:

> conda install sqlalchemy
> conda install lxml
> conda install html5lib
> conda install BeautifulSoup4
> conda install xlrd
CSV Input
> df = pd.read_csv('C:\\Users\\pbj20\\Documents\\example')
> df
    a   b   c   d
0   0   1   2   3
1   4   5   6   7
2   8   9  10  11
3  12  13  14  15
CSV Output
> df.to_csv('C:\\Users\\pbj20\\Documents\\My_output',index=False)
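
index=False keeps the row index out of the file (compare the Unnamed: 0 column in the Excel example below, where the index was written). Reading the file back shows the clean round trip; a sketch:

> pd.read_csv('C:\\Users\\pbj20\\Documents\\My_output')
    a   b   c   d
0   0   1   2   3
1   4   5   6   7
2   8   9  10  11
3  12  13  14  15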
Excel Input

Pandas can read and write Excel files, but this only imports the data, not formulas or images. Having images or macros may cause the read_excel method to crash.

> pd.read_excel('C:\\Users\\pbj20\\Documents\\Excel_Sample.xlsx',
+     sheet_name='Sheet1')
   Unnamed: 0   a   b   c   d
0           0   0   1   2   3
1           1   4   5   6   7
2           2   8   9  10  11
3           3  12  13  14  15
Excel Output
> df.to_excel('C:\\Users\\pbj20\\Documents\\Excel_Sample2.xlsx',
+     sheet_name='Sheet1')
HTML

The pandas read_html function will read tables off of a webpage and return a list of DataFrame objects:

> df = pd.read_html('http://www.fdic.gov/bank/individual/failed/banklist.html')
> type(df)
<class 'list'>
> df[0]
                             Bank Name  ...       Closing Date
0                 The First State Bank  ...      April 3, 2020
1                   Ericson State Bank  ...  February 14, 2020
2     City National Bank of New Jersey  ...   November 1, 2019
3                        Resolute Bank  ...   October 25, 2019
4                Louisa Community Bank  ...   October 25, 2019
..                                 ...  ...                ...
556                 Superior Bank, FSB  ...      July 27, 2001
557                Malta National Bank  ...        May 3, 2001
558    First Alliance Bank & Trust Co.  ...   February 2, 2001
559  National State Bank of Metropolis  ...  December 14, 2000
560                   Bank of Honolulu  ...   October 13, 2000

[561 rows x 6 columns]

Python Exercises 1


These use the SF Salaries Dataset from Kaggle:

https://www.kaggle.com/kaggle/sf-salaries

  1. Read Salaries.csv as a DataFrame called sal.
> sal = pd.read_csv('C:\\Users\\pbj20\\Documents\\Salaries.csv')
  2. Check the head of the DataFrame
> sal.head()
   Id       EmployeeName  ...         Agency  Status
0   1     NATHANIEL FORD  ...  San Francisco     NaN
1   2       GARY JIMENEZ  ...  San Francisco     NaN
2   3     ALBERT PARDINI  ...  San Francisco     NaN
3   4  CHRISTOPHER CHONG  ...  San Francisco     NaN
4   5    PATRICK GARDNER  ...  San Francisco     NaN

[5 rows x 13 columns]
  3. Use the .info() method to find out how many entries there are
> sal.info() # 148654 Entries
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 148654 entries, 0 to 148653
Data columns (total 13 columns):
Id                  148654 non-null int64
EmployeeName        148654 non-null object
JobTitle            148654 non-null object
BasePay             148045 non-null float64
OvertimePay         148650 non-null float64
OtherPay            148650 non-null float64
Benefits            112491 non-null float64
TotalPay            148654 non-null float64
TotalPayBenefits    148654 non-null float64
Year                148654 non-null int64
Notes               0 non-null float64
Agency              148654 non-null object
Status              0 non-null float64
dtypes: float64(8), int64(2), object(3)
memory usage: 14.7+ MB
  4. What is the average BasePay?
> sal['BasePay'].mean()
66325.44884050643
  5. What is the highest amount of OvertimePay in the dataset?
> sal['OvertimePay'].max()
245131.88
  6. What is the job title of JOSEPH DRISCOLL? Note: use all caps, otherwise you may get an answer that doesn’t match up (there is also a lowercase Joseph Driscoll)
> sal[sal['EmployeeName']=='JOSEPH DRISCOLL']['JobTitle']
24    CAPTAIN, FIRE SUPPRESSION
Name: JobTitle, dtype: object
  7. How much does JOSEPH DRISCOLL make (including benefits)?
> sal[sal['EmployeeName']==
+     'JOSEPH DRISCOLL']['TotalPayBenefits']
24    270324.91
Name: TotalPayBenefits, dtype: float64
  8. What is the name of the highest paid person (including benefits)?
> sal[sal['TotalPayBenefits']== 
+     sal['TotalPayBenefits'].max()]['EmployeeName']
0    NATHANIEL FORD
Name: EmployeeName, dtype: object
> sal.loc[sal['TotalPayBenefits'].
+         idxmax()]['EmployeeName']
'NATHANIEL FORD'
  9. What is the name of the lowest paid person (including benefits)? Do you notice something strange about how much he or she is paid?
> sal[sal['TotalPayBenefits']==
+     sal['TotalPayBenefits'].min()]['EmployeeName']
148653    Joe Lopez
Name: EmployeeName, dtype: object
> sal.loc[sal['TotalPayBenefits'].
+       idxmin()]
Id                                      148654
EmployeeName                         Joe Lopez
JobTitle            Counselor, Log Cabin Ranch
BasePay                                      0
OvertimePay                                  0
OtherPay                               -618.13
Benefits                                     0
TotalPay                               -618.13
TotalPayBenefits                       -618.13
Year                                      2014
Notes                                      NaN
Agency                           San Francisco
Status                                     NaN
Name: 148653, dtype: object
  10. What was the average (mean) BasePay of all employees per year (2011-2014)?
> sal.groupby('Year').mean()['BasePay']
Year
2011    63595.956517
2012    65436.406857
2013    69630.030216
2014    66564.421924
Name: BasePay, dtype: float64
  11. How many unique job titles are there?
> sal['JobTitle'].nunique()
2159
  12. What are the top 5 most common jobs?
> sal['JobTitle'].value_counts().head(5)
Transit Operator                7036
Special Nurse                   4389
Registered Nurse                3736
Public Svc Aide-Public Works    2518
Police Officer 3                2421
Name: JobTitle, dtype: int64
  13. How many Job Titles were represented by only one person in 2013? (i.e. Job Titles with only one occurrence in 2013)
> sum(sal[sal['Year']==2013]['JobTitle'].
+     value_counts() == 1)
202
  14. How many people have the word Chief in their job title?
> def chief_string(title):
+     return 'chief' in title.lower()
+ 
+ sum(sal['JobTitle'].apply(chief_string))
627
  15. Is there a correlation between length of the Job Title string and Salary?
> # No Correlation
+ sal['title_len'] = sal['JobTitle'].apply(len)
+ sal[['title_len','TotalPayBenefits']].corr()
                  title_len  TotalPayBenefits
title_len          1.000000         -0.036878
TotalPayBenefits  -0.036878          1.000000

Python Exercises 2


These exercises involve fake data about some purchases made through Amazon.

  1. Read in the Ecommerce Purchases csv file and set it to a DataFrame called ecom.
> ecom = pd.read_csv('C:\\Users\\Documents\\Ecommerce Purchases')
  2. Check the head of the DataFrame.
> ecom.head()
                                             Address  ... Purchase Price
0  16629 Pace Camp Apt. 448\nAlexisborough, NE 77...  ...          98.14
1  9374 Jasmine Spurs Suite 508\nSouth John, TN 8...  ...          70.73
2                   Unit 0065 Box 5052\nDPO AP 27450  ...           0.95
3              7780 Julia Fords\nNew Stacy, WA 45798  ...          78.04
4  23012 Munoz Drive Suite 337\nNew Cynthia, TX 5...  ...          77.82

[5 rows x 14 columns]
  3. How many rows and columns are there?
> ecom.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
Address             10000 non-null object
Lot                 10000 non-null object
AM or PM            10000 non-null object
Browser Info        10000 non-null object
Company             10000 non-null object
Credit Card         10000 non-null int64
CC Exp Date         10000 non-null object
CC Security Code    10000 non-null int64
CC Provider         10000 non-null object
Email               10000 non-null object
Job                 10000 non-null object
IP Address          10000 non-null object
Language            10000 non-null object
Purchase Price      10000 non-null float64
dtypes: float64(1), int64(2), object(11)
memory usage: 1.1+ MB
  4. What is the average Purchase Price?
> ecom['Purchase Price'].mean()
50.34730200000025
  5. What were the highest and lowest purchase prices?
> ecom['Purchase Price'].max()
99.99
> ecom['Purchase Price'].min()
0.0
  6. How many people have English ‘en’ as their Language of choice on the website?
> ecom[ecom['Language']=='en'].count()
Address             1098
Lot                 1098
AM or PM            1098
Browser Info        1098
Company             1098
Credit Card         1098
CC Exp Date         1098
CC Security Code    1098
CC Provider         1098
Email               1098
Job                 1098
IP Address          1098
Language            1098
Purchase Price      1098
dtype: int64
  7. How many people have the job title of “Lawyer”?
> ecom[ecom['Job'] == 'Lawyer'].info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 30 entries, 470 to 9979
Data columns (total 14 columns):
Address             30 non-null object
Lot                 30 non-null object
AM or PM            30 non-null object
Browser Info        30 non-null object
Company             30 non-null object
Credit Card         30 non-null int64
CC Exp Date         30 non-null object
CC Security Code    30 non-null int64
CC Provider         30 non-null object
Email               30 non-null object
Job                 30 non-null object
IP Address          30 non-null object
Language            30 non-null object
Purchase Price      30 non-null float64
dtypes: float64(1), int64(2), object(11)
memory usage: 3.5+ KB
  8. How many people made the purchase during the AM and how many people made the purchase during PM?
> ecom['AM or PM'].value_counts()
PM    5068
AM    4932
Name: AM or PM, dtype: int64
  9. What are the 5 most common Job Titles?
> ecom['Job'].value_counts().head(5)
Interior and spatial designer        31
Lawyer                               30
Social researcher                    28
Research officer, political party    27
Designer, jewellery                  27
Name: Job, dtype: int64
  10. Someone made a purchase that came from Lot “90 WT”; what was the Purchase Price for this transaction?
> ecom[ecom['Lot']=='90 WT']['Purchase Price']
513    75.1
Name: Purchase Price, dtype: float64
  11. What is the email of the person with the following Credit Card Number: 4926535242672853
> ecom[ecom["Credit Card"] == 4926535242672853]['Email'] 
1234    bondellen@williams-garza.com
Name: Email, dtype: object
  12. How many people have American Express as their Credit Card Provider and made a purchase above $95?
> ecom[(ecom['CC Provider']=='American Express') &
+      (ecom['Purchase Price']>95)].count()
Address             39
Lot                 39
AM or PM            39
Browser Info        39
Company             39
Credit Card         39
CC Exp Date         39
CC Security Code    39
CC Provider         39
Email               39
Job                 39
IP Address          39
Language            39
Purchase Price      39
dtype: int64
  13. Hard: How many people have a credit card that expires in 2025?
> sum(ecom['CC Exp Date'].apply(lambda x:
+                     x[3:]) == '25')
1033
  14. Hard: What are the top 5 most popular email providers/hosts (e.g. gmail.com, yahoo.com, etc.)?
> ecom['Email'].apply(lambda x: 
+     x.split('@')[1]).value_counts().head(5)
hotmail.com     1638
yahoo.com       1616
gmail.com       1605
smith.com         42
williams.com      37
Name: Email, dtype: int64