Numpy arrays can be vectors or matrices. Vectors are 1d arrays and matrices are 2d (but a matrix can still have only one row or one column).
Returns evenly spaced values within a given interval.
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
array([ 0, 2, 4, 6, 8, 10])
Generate arrays of zeros or ones.
array([0., 0., 0.])
array([[0., 0., 0., 0.],
[0., 0., 0., 0.],
[0., 0., 0., 0.],
[0., 0., 0., 0.],
[0., 0., 0., 0.]])
array([1., 1., 1.])
array([[1., 1., 1., 1.],
[1., 1., 1., 1.],
[1., 1., 1., 1.]])
Returns evenly spaced numbers over a specified interval. start, stop, num
array([ 0., 5., 10.])
array([0. , 0.55555556, 1.11111111, 1.66666667, 2.22222222,
2.77777778, 3.33333333, 3.88888889, 4.44444444, 5. ])
array([ 0. , 0.20408163, 0.40816327, 0.6122449 , 0.81632653,
1.02040816, 1.2244898 , 1.42857143, 1.63265306, 1.83673469,
2.04081633, 2.24489796, 2.44897959, 2.65306122, 2.85714286,
3.06122449, 3.26530612, 3.46938776, 3.67346939, 3.87755102,
4.08163265, 4.28571429, 4.48979592, 4.69387755, 4.89795918,
5.10204082, 5.30612245, 5.51020408, 5.71428571, 5.91836735,
6.12244898, 6.32653061, 6.53061224, 6.73469388, 6.93877551,
7.14285714, 7.34693878, 7.55102041, 7.75510204, 7.95918367,
8.16326531, 8.36734694, 8.57142857, 8.7755102 , 8.97959184,
9.18367347, 9.3877551 , 9.59183673, 9.79591837, 10. ])
Create an array of the given shape and populate it with random samples from a uniform distribution over [0, 1).
array([0.59110145, 0.97873794])
array([[0.49683779, 0.41011857, 0.49680767, 0.19783712],
[0.38087874, 0.09527378, 0.53229016, 0.9158716 ],
[0.9875367 , 0.82660478, 0.6605952 , 0.90010448],
[0.05382331, 0.50698107, 0.58781927, 0.29309634],
[0.46095471, 0.38694441, 0.87989227, 0.46206292]])
Returns a sample (or samples) from the “standard normal” distribution.
array([0.00592357, 1.21541429])
array([[ 0.8479065 , 1.22475886, 0.66784301, -0.96586068],
[-1.42468007, 1.07965643, -0.36176629, 0.31531068],
[-2.48265365, 0.51033764, -0.5693635 , -0.960074 ],
[-0.89532982, 0.89758788, -0.62306251, -0.30047938],
[-0.10816238, 0.54934125, -0.45687432, 2.90258011]])
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22, 23, 24])
array([ 0, 3, 3, 39, 9, 19, 21, 36, 23, 6])
Returns an array containing the same data with a new shape.
array([[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14],
[15, 16, 17, 18, 19],
[20, 21, 22, 23, 24]])
Finding max or min values or their index locations using argmin or argmax
39
3
0
0
Shape is an attribute that arrays have (not a method):
(25,)
array([[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14],
[15, 16, 17, 18, 19],
[20, 21, 22, 23, 24]])
(5, 5)
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
8
array([1, 2, 3, 4])
array([0, 1, 2, 3, 4])
array([0, 1, 2, 3, 4, 5])
array([ 5, 6, 7, 8, 9, 10])
Numpy arrays differ from a normal Python list because of their ability to broadcast:
array([100, 100, 100, 100, 100, 5, 6, 7, 8, 9, 10])
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
array([0, 1, 2, 3, 4, 5])
array([99, 99, 99, 99, 99, 99])
Assigning a slice of the array to slice_of_arr will not actually create a separate array. It will just point to the original. Any changes will impact both.
array([99, 99, 99, 99, 99, 99, 6, 7, 8, 9, 10])
If you don’t want to change the original you must make a copy with arr.copy().
array([99, 99, 99, 99, 99, 99, 6, 7, 8, 9, 10])
array([100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100])
Now the original is not affected.
array([99, 99, 99, 99, 99, 99, 6, 7, 8, 9, 10])
The general format is arr_2d[row][col] or arr_2d[row,col].
array([[ 5, 10, 15],
[20, 25, 30],
[35, 40, 45]])
array([20, 25, 30])
20
20
array([[10, 15],
[25, 30]])
array([35, 40, 45])
array([35, 40, 45])
Fancy indexing allows you to select entire rows or columns out of order.
array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
[2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
[3., 3., 3., 3., 3., 3., 3., 3., 3., 3.],
[4., 4., 4., 4., 4., 4., 4., 4., 4., 4.],
[5., 5., 5., 5., 5., 5., 5., 5., 5., 5.],
[6., 6., 6., 6., 6., 6., 6., 6., 6., 6.],
[7., 7., 7., 7., 7., 7., 7., 7., 7., 7.],
[8., 8., 8., 8., 8., 8., 8., 8., 8., 8.],
[9., 9., 9., 9., 9., 9., 9., 9., 9., 9.]])
Fancy indexing allows the following:
array([[2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
[4., 4., 4., 4., 4., 4., 4., 4., 4., 4.],
[6., 6., 6., 6., 6., 6., 6., 6., 6., 6.],
[8., 8., 8., 8., 8., 8., 8., 8., 8., 8.]])
array([[6., 6., 6., 6., 6., 6., 6., 6., 6., 6.],
[4., 4., 4., 4., 4., 4., 4., 4., 4., 4.],
[2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
[7., 7., 7., 7., 7., 7., 7., 7., 7., 7.]])
You can use brackets for selection with comparison operators.
array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
array([False, False, False, False, True, True, True, True, True,
True])
array([False, False, False, False, True, True, True, True, True,
True])
array([ 5, 6, 7, 8, 9, 10])
array([ 3, 4, 5, 6, 7, 8, 9, 10])
array([ 3, 4, 5, 6, 7, 8, 9, 10])
You can easily perform array with array arithmetic, or scalar with array arithmetic.
array([ 0, 2, 4, 6, 8, 10, 12, 14, 16, 18])
array([ 0, 1, 4, 9, 16, 25, 36, 49, 64, 81])
array([100, 101, 102, 103, 104, 105, 106, 107, 108, 109])
array([ 0, 100, 200, 300, 400, 500, 600, 700, 800, 900])
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
array([nan, 1., 1., 1., 1., 1., 1., 1., 1., 1.])
C:/Users/pbj20/Anaconda3/python.exe:1: RuntimeWarning: invalid value encountered in true_divide
array([ inf, 1. , 0.5 , 0.33333333, 0.25 ,
0.2 , 0.16666667, 0.14285714, 0.125 , 0.11111111])
C:/Users/pbj20/Anaconda3/python.exe:1: RuntimeWarning: divide by zero encountered in true_divide
array([ 0, 1, 8, 27, 64, 125, 216, 343, 512, 729], dtype=int32)
Universal array functions perform a mathematical operation across the array.
See here for more details: http://docs.scipy.org/doc/numpy/reference/ufuncs.html
Some common ones are shown below:
array([0. , 1. , 1.41421356, 1.73205081, 2. ,
2.23606798, 2.44948974, 2.64575131, 2.82842712, 3. ])
array([1.00000000e+00, 2.71828183e+00, 7.38905610e+00, 2.00855369e+01,
5.45981500e+01, 1.48413159e+02, 4.03428793e+02, 1.09663316e+03,
2.98095799e+03, 8.10308393e+03])
9
array([ 0. , 0.84147098, 0.90929743, 0.14112001, -0.7568025 ,
-0.95892427, -0.2794155 , 0.6569866 , 0.98935825, 0.41211849])
array([ -inf, 0. , 0.69314718, 1.09861229, 1.38629436,
1.60943791, 1.79175947, 1.94591015, 2.07944154, 2.19722458])
C:/Users/pbj20/Anaconda3/python.exe:1: RuntimeWarning: divide by zero encountered in log
array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])
array([5., 5., 5., 5., 5., 5., 5., 5., 5., 5.])
array([5., 5., 5., 5., 5., 5., 5., 5., 5., 5.])
array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26,
27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43,
44, 45, 46, 47, 48, 49, 50])
array([10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42,
44, 46, 48, 50])
array([[0, 1, 2],
[3, 4, 5],
[6, 7, 8]])
array([[1., 0., 0.],
[0., 1., 0.],
[0., 0., 1.]])
array([0.05671298])
array([-0.1730696 , -1.76165167, -0.08767307, 1.36687937, 1.1253141 ,
-0.35899555, 1.2206081 , -1.33949555, 0.42837337, -0.12346315,
1.41437719, -0.12405066, 2.00815709, 0.22988654, 0.60489373,
1.62715982, 1.59456053, 0.23043417, -0.06491034, -0.96898025,
0.59124281, -0.7827755 , -0.44423283, -0.34518616, -0.88180055])
array([[0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1 ],
[0.11, 0.12, 0.13, 0.14, 0.15, 0.16, 0.17, 0.18, 0.19, 0.2 ],
[0.21, 0.22, 0.23, 0.24, 0.25, 0.26, 0.27, 0.28, 0.29, 0.3 ],
[0.31, 0.32, 0.33, 0.34, 0.35, 0.36, 0.37, 0.38, 0.39, 0.4 ],
[0.41, 0.42, 0.43, 0.44, 0.45, 0.46, 0.47, 0.48, 0.49, 0.5 ],
[0.51, 0.52, 0.53, 0.54, 0.55, 0.56, 0.57, 0.58, 0.59, 0.6 ],
[0.61, 0.62, 0.63, 0.64, 0.65, 0.66, 0.67, 0.68, 0.69, 0.7 ],
[0.71, 0.72, 0.73, 0.74, 0.75, 0.76, 0.77, 0.78, 0.79, 0.8 ],
[0.81, 0.82, 0.83, 0.84, 0.85, 0.86, 0.87, 0.88, 0.89, 0.9 ],
[0.91, 0.92, 0.93, 0.94, 0.95, 0.96, 0.97, 0.98, 0.99, 1. ]])
array([[0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1 ],
[0.11, 0.12, 0.13, 0.14, 0.15, 0.16, 0.17, 0.18, 0.19, 0.2 ],
[0.21, 0.22, 0.23, 0.24, 0.25, 0.26, 0.27, 0.28, 0.29, 0.3 ],
[0.31, 0.32, 0.33, 0.34, 0.35, 0.36, 0.37, 0.38, 0.39, 0.4 ],
[0.41, 0.42, 0.43, 0.44, 0.45, 0.46, 0.47, 0.48, 0.49, 0.5 ],
[0.51, 0.52, 0.53, 0.54, 0.55, 0.56, 0.57, 0.58, 0.59, 0.6 ],
[0.61, 0.62, 0.63, 0.64, 0.65, 0.66, 0.67, 0.68, 0.69, 0.7 ],
[0.71, 0.72, 0.73, 0.74, 0.75, 0.76, 0.77, 0.78, 0.79, 0.8 ],
[0.81, 0.82, 0.83, 0.84, 0.85, 0.86, 0.87, 0.88, 0.89, 0.9 ],
[0.91, 0.92, 0.93, 0.94, 0.95, 0.96, 0.97, 0.98, 0.99, 1. ]])
array([[0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1 ],
[0.11, 0.12, 0.13, 0.14, 0.15, 0.16, 0.17, 0.18, 0.19, 0.2 ],
[0.21, 0.22, 0.23, 0.24, 0.25, 0.26, 0.27, 0.28, 0.29, 0.3 ],
[0.31, 0.32, 0.33, 0.34, 0.35, 0.36, 0.37, 0.38, 0.39, 0.4 ],
[0.41, 0.42, 0.43, 0.44, 0.45, 0.46, 0.47, 0.48, 0.49, 0.5 ],
[0.51, 0.52, 0.53, 0.54, 0.55, 0.56, 0.57, 0.58, 0.59, 0.6 ],
[0.61, 0.62, 0.63, 0.64, 0.65, 0.66, 0.67, 0.68, 0.69, 0.7 ],
[0.71, 0.72, 0.73, 0.74, 0.75, 0.76, 0.77, 0.78, 0.79, 0.8 ],
[0.81, 0.82, 0.83, 0.84, 0.85, 0.86, 0.87, 0.88, 0.89, 0.9 ],
[0.91, 0.92, 0.93, 0.94, 0.95, 0.96, 0.97, 0.98, 0.99, 1. ]])
array([0. , 0.05263158, 0.10526316, 0.15789474, 0.21052632,
0.26315789, 0.31578947, 0.36842105, 0.42105263, 0.47368421,
0.52631579, 0.57894737, 0.63157895, 0.68421053, 0.73684211,
0.78947368, 0.84210526, 0.89473684, 0.94736842, 1. ])
matarray([[ 1, 2, 3, 4, 5],
[ 6, 7, 8, 9, 10],
[11, 12, 13, 14, 15],
[16, 17, 18, 19, 20],
[21, 22, 23, 24, 25]])
array([[ 1, 2, 3, 4, 5],
[ 6, 7, 8, 9, 10],
[11, 12, 13, 14, 15],
[16, 17, 18, 19, 20],
[21, 22, 23, 24, 25]])
array([[12, 13, 14, 15],
[17, 18, 19, 20],
[22, 23, 24, 25]])
array([[12, 13, 14, 15],
[17, 18, 19, 20],
[22, 23, 24, 25]])
2020
array([[ 2],
[ 7],
[12]])
array([[ 2],
[ 7],
[12]])
array([21, 22, 23, 24, 25])
array([21, 22, 23, 24, 25])
array([[16, 17, 18, 19, 20],
[21, 22, 23, 24, 25]])
array([[16, 17, 18, 19, 20],
[21, 22, 23, 24, 25]])
325
7.211102550927978
array([55, 60, 65, 70, 75])
A Series is very similar to a NumPy array. What differentiates the NumPy array from a Series is that a Series can have axis labels, meaning it can be indexed by a label, instead of just a number location. It also doesn’t need to hold numeric data, it can hold any arbitrary Python Object.
> import pandas as pd
+ labels = ['a','b','c']
+ my_list = [10,20,30]
+ arr = np.array([10,20,30])
+ d = {'a':10,'b':20,'c':30}0 10
1 20
2 30
dtype: int64
a 10
b 20
c 30
dtype: int64
a 10
b 20
c 30
dtype: int64
0 10
1 20
2 30
dtype: int32
a 10
b 20
c 30
dtype: int32
A pandas Series can hold a variety of object types:
0 a
1 b
2 c
dtype: object
0 <built-in function sum>
1 <built-in function print>
2 <built-in function len>
dtype: object
The key to using a Series is understanding its index. Pandas makes use of these index names or numbers by allowing for fast look ups of information.
USA 1
Germany 2
USSR 3
Japan 4
dtype: int64
USA 1
Germany 2
Italy 5
Japan 4
dtype: int64
1
Operations are then also done based off of index:
Germany 4.0
Italy NaN
Japan 8.0
USA 2.0
USSR NaN
dtype: float64
DataFrames are inspired by the R programming language.
> from numpy.random import randn
+ np.random.seed(101)
+ df = pd.DataFrame(randn(5,4),
+ index='A B C D E'.split(),
+ columns='W X Y Z'.split())
+ df W X Y Z
A 2.706850 0.628133 0.907969 0.503826
B 0.651118 -0.319318 -0.848077 0.605965
C -2.018168 0.740122 0.528813 -0.589001
D 0.188695 -0.758872 -0.933237 0.955057
E 0.190794 1.978757 2.605967 0.683509
A 2.706850
B 0.651118
C -2.018168
D 0.188695
E 0.190794
Name: W, dtype: float64
DataFrame Columns are just Series
<class 'pandas.core.series.Series'>
<class 'pandas.core.frame.DataFrame'>
W Z
A 2.706850 0.503826
B 0.651118 0.605965
C -2.018168 -0.589001
D 0.188695 0.955057
E 0.190794 0.683509
W X Y Z new
A 2.706850 0.628133 0.907969 0.503826 3.614819
B 0.651118 -0.319318 -0.848077 0.605965 -0.196959
C -2.018168 0.740122 0.528813 -0.589001 -1.489355
D 0.188695 -0.758872 -0.933237 0.955057 -0.744542
E 0.190794 1.978757 2.605967 0.683509 2.796762
W X Y Z
A 2.706850 0.628133 0.907969 0.503826
B 0.651118 -0.319318 -0.848077 0.605965
C -2.018168 0.740122 0.528813 -0.589001
D 0.188695 -0.758872 -0.933237 0.955057
E 0.190794 1.978757 2.605967 0.683509
W X Y Z new
A 2.706850 0.628133 0.907969 0.503826 3.614819
B 0.651118 -0.319318 -0.848077 0.605965 -0.196959
C -2.018168 0.740122 0.528813 -0.589001 -1.489355
D 0.188695 -0.758872 -0.933237 0.955057 -0.744542
E 0.190794 1.978757 2.605967 0.683509 2.796762
W X Y Z
A 2.706850 0.628133 0.907969 0.503826
B 0.651118 -0.319318 -0.848077 0.605965
C -2.018168 0.740122 0.528813 -0.589001
D 0.188695 -0.758872 -0.933237 0.955057
E 0.190794 1.978757 2.605967 0.683509
You can also drop rows this way:
W X Y Z
A 2.706850 0.628133 0.907969 0.503826
B 0.651118 -0.319318 -0.848077 0.605965
C -2.018168 0.740122 0.528813 -0.589001
D 0.188695 -0.758872 -0.933237 0.955057
The shape is a tuple, with rows at index 0 and columns at index 1. You can think of the axis as the tuple index.
(5, 4)
W 2.706850
X 0.628133
Y 0.907969
Z 0.503826
Name: A, dtype: float64
Or select based off of position instead of label.
W -2.018168
X 0.740122
Y 0.528813
Z -0.589001
Name: C, dtype: float64
An important feature of pandas is conditional selection using bracket notation, very similar to numpy:
W X Y Z
A 2.706850 0.628133 0.907969 0.503826
B 0.651118 -0.319318 -0.848077 0.605965
C -2.018168 0.740122 0.528813 -0.589001
D 0.188695 -0.758872 -0.933237 0.955057
E 0.190794 1.978757 2.605967 0.683509
W X Y Z
A True True True True
B True False False True
C False True True False
D True False False True
E True True True True
W X Y Z
A 2.706850 0.628133 0.907969 0.503826
B 0.651118 NaN NaN 0.605965
C NaN 0.740122 0.528813 NaN
D 0.188695 NaN NaN 0.955057
E 0.190794 1.978757 2.605967 0.683509
A True
B True
C False
D True
E True
Name: W, dtype: bool
W X Y Z
A 2.706850 0.628133 0.907969 0.503826
B 0.651118 -0.319318 -0.848077 0.605965
D 0.188695 -0.758872 -0.933237 0.955057
E 0.190794 1.978757 2.605967 0.683509
W X Y Z
C -2.018168 0.740122 0.528813 -0.589001
A 0.907969
B -0.848077
D -0.933237
E 2.605967
Name: Y, dtype: float64
Y X
A 0.907969 0.628133
B -0.848077 -0.319318
D -0.933237 -0.758872
E 2.605967 1.978757
For two conditions you can use | (OR) and & (AND) with parenthesis:
W X Y Z
E 0.190794 1.978757 2.605967 0.683509
W X Y Z
A 2.706850 0.628133 0.907969 0.503826
B 0.651118 -0.319318 -0.848077 0.605965
C -2.018168 0.740122 0.528813 -0.589001
D 0.188695 -0.758872 -0.933237 0.955057
E 0.190794 1.978757 2.605967 0.683509
> # Reset to default 0,1...n index
+ # old index will become column
+ # not inplace
+ df.reset_index() index W X Y Z
0 A 2.706850 0.628133 0.907969 0.503826
1 B 0.651118 -0.319318 -0.848077 0.605965
2 C -2.018168 0.740122 0.528813 -0.589001
3 D 0.188695 -0.758872 -0.933237 0.955057
4 E 0.190794 1.978757 2.605967 0.683509
W X Y Z
A 2.706850 0.628133 0.907969 0.503826
B 0.651118 -0.319318 -0.848077 0.605965
C -2.018168 0.740122 0.528813 -0.589001
D 0.188695 -0.758872 -0.933237 0.955057
E 0.190794 1.978757 2.605967 0.683509
['CA', 'NY', 'WY', 'OR', 'CO']
W X Y Z States
A 2.706850 0.628133 0.907969 0.503826 CA
B 0.651118 -0.319318 -0.848077 0.605965 NY
C -2.018168 0.740122 0.528813 -0.589001 WY
D 0.188695 -0.758872 -0.933237 0.955057 OR
E 0.190794 1.978757 2.605967 0.683509 CO
W X Y Z
States
CA 2.706850 0.628133 0.907969 0.503826
NY 0.651118 -0.319318 -0.848077 0.605965
WY -2.018168 0.740122 0.528813 -0.589001
OR 0.188695 -0.758872 -0.933237 0.955057
CO 0.190794 1.978757 2.605967 0.683509
> # Index Levels
+ outside = ['G1','G1','G1','G2','G2','G2']
+ inside = [1,2,3,1,2,3]
+ hier_index = list(zip(outside,inside))
+ hier_index = pd.MultiIndex.from_tuples(hier_index)MultiIndex([('G1', 1),
('G1', 2),
('G1', 3),
('G2', 1),
('G2', 2),
('G2', 3)],
)
A B
G1 1 0.302665 1.693723
2 -1.706086 -1.159119
3 -0.134841 0.390528
G2 1 0.166905 0.184502
2 0.807706 0.072960
3 0.638787 0.329646
For index hierarchy we use df.loc[], if this was on the columns axis, you would just use normal bracket notation df[]. Calling one level of the index returns the sub-dataframe:
A B
1 0.302665 1.693723
2 -1.706086 -1.159119
3 -0.134841 0.390528
A 0.302665
B 1.693723
Name: 1, dtype: float64
FrozenList([None, None])
A B
Group Num
G1 1 0.302665 1.693723
2 -1.706086 -1.159119
3 -0.134841 0.390528
G2 1 0.166905 0.184502
2 0.807706 0.072960
3 0.638787 0.329646
0.07295967531703869
Use .xs for cross section
A B
Num
1 0.302665 1.693723
2 -1.706086 -1.159119
3 -0.134841 0.390528
A 0.302665
B 1.693723
Name: (G1, 1), dtype: float64
1.693722925204035
A B
Group
G1 0.302665 1.693723
G2 0.166905 0.184502
A B C
0 1.0 5.0 1
1 2.0 NaN 2
2 NaN NaN 3
A B C
0 1.0 5.0 1
C
0 1
1 2
2 3
0 1.0
1 2.0
Name: A, dtype: float64
A B C
0 1.0 5.0 1
1 2.0 NaN 2
You can set a threshold of NAs with thresh.
A B C
0 1.0 5.0 1
1 2.0 NaN 2
A B C
0 1 5 1
1 2 FILL VALUE 2
2 FILL VALUE FILL VALUE 3
Fill missing values the mean
0 1.0
1 2.0
2 1.5
Name: A, dtype: float64
The groupby method allows you to group rows of data together and call aggregate functions.
> import pandas as pd
+ # Create dataframe
+ data = {'Company':['GOOG','GOOG','MSFT','MSFT','FB','FB'],
+ 'Person':['Sam','Charlie','Amy','Vanessa','Carl','Sarah'],
+ 'Sales':[200,120,340,124,243,350]} Company Person Sales
0 GOOG Sam 200
1 GOOG Charlie 120
2 MSFT Amy 340
3 MSFT Vanessa 124
4 FB Carl 243
5 FB Sarah 350
Now you can use the .groupby() method to group rows together based off of a column name. For instance grouping based off of Company will create a DataFrameGroupBy object:
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000000049E0D508>
You can save this object as a new variable:
And then call aggregate methods off the object:
Sales
Company
FB 296.5
GOOG 160.0
MSFT 232.0
Sales
Company
FB 296.5
GOOG 160.0
MSFT 232.0
Sales
Company
FB 75.660426
GOOG 56.568542
MSFT 152.735065
Person Sales
Company
FB Carl 243
GOOG Charlie 120
MSFT Amy 124
Person Sales
Company
FB Sarah 350
GOOG Sam 200
MSFT Vanessa 340
Person Sales
Company
FB 2 2
GOOG 2 2
MSFT 2 2
Sales
count mean std min 25% 50% 75% max
Company
FB 2.0 296.5 75.660426 243.0 269.75 296.5 323.25 350.0
GOOG 2.0 160.0 56.568542 120.0 140.00 160.0 180.00 200.0
MSFT 2.0 232.0 152.735065 124.0 178.00 232.0 286.00 340.0
Company FB GOOG MSFT
Sales count 2.000000 2.000000 2.000000
mean 296.500000 160.000000 232.000000
std 75.660426 56.568542 152.735065
min 243.000000 120.000000 124.000000
25% 269.750000 140.000000 178.000000
50% 296.500000 160.000000 232.000000
75% 323.250000 180.000000 286.000000
max 350.000000 200.000000 340.000000
Sales count 2.000000
mean 160.000000
std 56.568542
min 120.000000
25% 140.000000
50% 160.000000
75% 180.000000
max 200.000000
Name: GOOG, dtype: float64
> df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
+ 'B': ['B0', 'B1', 'B2', 'B3'],
+ 'C': ['C0', 'C1', 'C2', 'C3'],
+ 'D': ['D0', 'D1', 'D2', 'D3']},
+ index=[0, 1, 2, 3])> df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
+ 'B': ['B4', 'B5', 'B6', 'B7'],
+ 'C': ['C4', 'C5', 'C6', 'C7'],
+ 'D': ['D4', 'D5', 'D6', 'D7']},
+ index=[4, 5, 6, 7]) > df3 = pd.DataFrame({'A': ['A8', 'A9', 'A10', 'A11'],
+ 'B': ['B8', 'B9', 'B10', 'B11'],
+ 'C': ['C8', 'C9', 'C10', 'C11'],
+ 'D': ['D8', 'D9', 'D10', 'D11']},
+ index=[8, 9, 10, 11]) A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
3 A3 B3 C3 D3
A B C D
4 A4 B4 C4 D4
5 A5 B5 C5 D5
6 A6 B6 C6 D6
7 A7 B7 C7 D7
A B C D
8 A8 B8 C8 D8
9 A9 B9 C9 D9
10 A10 B10 C10 D10
11 A11 B11 C11 D11
Concatenation glues together DataFrames. The dimensions should match along the axis on which you are concatenating. You can use pd.concat and pass in a list of DataFrames to concatenate together:
A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
3 A3 B3 C3 D3
4 A4 B4 C4 D4
5 A5 B5 C5 D5
6 A6 B6 C6 D6
7 A7 B7 C7 D7
8 A8 B8 C8 D8
9 A9 B9 C9 D9
10 A10 B10 C10 D10
11 A11 B11 C11 D11
There will be missing values where indices don’t match.
A B C D A B C D A B C D
0 A0 B0 C0 D0 NaN NaN NaN NaN NaN NaN NaN NaN
1 A1 B1 C1 D1 NaN NaN NaN NaN NaN NaN NaN NaN
2 A2 B2 C2 D2 NaN NaN NaN NaN NaN NaN NaN NaN
3 A3 B3 C3 D3 NaN NaN NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN A4 B4 C4 D4 NaN NaN NaN NaN
5 NaN NaN NaN NaN A5 B5 C5 D5 NaN NaN NaN NaN
6 NaN NaN NaN NaN A6 B6 C6 D6 NaN NaN NaN NaN
7 NaN NaN NaN NaN A7 B7 C7 D7 NaN NaN NaN NaN
8 NaN NaN NaN NaN NaN NaN NaN NaN A8 B8 C8 D8
9 NaN NaN NaN NaN NaN NaN NaN NaN A9 B9 C9 D9
10 NaN NaN NaN NaN NaN NaN NaN NaN A10 B10 C10 D10
11 NaN NaN NaN NaN NaN NaN NaN NaN A11 B11 C11 D11
> left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
+ 'A': ['A0', 'A1', 'A2', 'A3'],
+ 'B': ['B0', 'B1', 'B2', 'B3']})
+
+ right = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
+ 'C': ['C0', 'C1', 'C2', 'C3'],
+ 'D': ['D0', 'D1', 'D2', 'D3']}) key A B
0 K0 A0 B0
1 K1 A1 B1
2 K2 A2 B2
3 K3 A3 B3
key C D
0 K0 C0 D0
1 K1 C1 D1
2 K2 C2 D2
3 K3 C3 D3
The merge function allows you to merge DataFrames together using logic similar to what’s found in SQL.
key A B C D
0 K0 A0 B0 C0 D0
1 K1 A1 B1 C1 D1
2 K2 A2 B2 C2 D2
3 K3 A3 B3 C3 D3
A more complicated example, with two keys:
> left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
+ 'key2': ['K0', 'K1', 'K0', 'K1'],
+ 'A': ['A0', 'A1', 'A2', 'A3'],
+ 'B': ['B0', 'B1', 'B2', 'B3']})
+
+ right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
+ 'key2': ['K0', 'K0', 'K0', 'K0'],
+ 'C': ['C0', 'C1', 'C2', 'C3'],
+ 'D': ['D0', 'D1', 'D2', 'D3']})
+
+ left key1 key2 A B
0 K0 K0 A0 B0
1 K0 K1 A1 B1
2 K1 K0 A2 B2
3 K2 K1 A3 B3
key1 key2 C D
0 K0 K0 C0 D0
1 K1 K0 C1 D1
2 K1 K0 C2 D2
3 K2 K0 C3 D3
key1 key2 A B C D
0 K0 K0 A0 B0 C0 D0
1 K1 K0 A2 B2 C1 D1
2 K1 K0 A2 B2 C2 D2
key1 key2 A B C D
0 K0 K0 A0 B0 C0 D0
1 K0 K1 A1 B1 NaN NaN
2 K1 K0 A2 B2 C1 D1
3 K1 K0 A2 B2 C2 D2
4 K2 K1 A3 B3 NaN NaN
5 K2 K0 NaN NaN C3 D3
key1 key2 A B C D
0 K0 K0 A0 B0 C0 D0
1 K1 K0 A2 B2 C1 D1
2 K1 K0 A2 B2 C2 D2
3 K2 K0 NaN NaN C3 D3
key1 key2 A B C D
0 K0 K0 A0 B0 C0 D0
1 K0 K1 A1 B1 NaN NaN
2 K1 K0 A2 B2 C1 D1
3 K1 K0 A2 B2 C2 D2
4 K2 K1 A3 B3 NaN NaN
Join uses DataFrame indices as keys
> left = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
+ 'B': ['B0', 'B1', 'B2']},
+ index=['K0', 'K1', 'K2'])
+
+ right = pd.DataFrame({'C': ['C0', 'C2', 'C3'],
+ 'D': ['D0', 'D2', 'D3']},
+ index=['K0', 'K2', 'K3']) A B
K0 A0 B0
K1 A1 B1
K2 A2 B2
C D
K0 C0 D0
K2 C2 D2
K3 C3 D3
A B C D
K0 A0 B0 C0 D0
K1 A1 B1 NaN NaN
K2 A2 B2 C2 D2
A B C D
K0 A0 B0 C0 D0
K1 A1 B1 NaN NaN
K2 A2 B2 C2 D2
K3 NaN NaN C3 D3
> import pandas as pd
+ df = pd.DataFrame({'col1':[1,2,3,4],
+ 'col2':[444,555,666,444],
+ 'col3':['abc','def','ghi','xyz']})
+ df.head() col1 col2 col3
0 1 444 abc
1 2 555 def
2 3 666 ghi
3 4 444 xyz
array([444, 555, 666], dtype=int64)
3
3
444 2
555 1
666 1
Name: col2, dtype: int64
> #Select from DataFrame using
+ #criteria from multiple columns
+ newdf = df[(df['col1']>2) & (df['col2']==444)]
+ newdf col1 col2 col3
3 4 444 xyz
0 2
1 4
2 6
3 8
Name: col1, dtype: int64
0 888
1 1110
2 1332
3 888
Name: col2, dtype: int64
0 3
1 3
2 3
3 3
Name: col3, dtype: int64
10
col2 col3
0 444 abc
1 555 def
2 666 ghi
3 444 xyz
col2 col3
0 444 abc
1 555 def
2 666 ghi
3 444 xyz
Index(['col2', 'col3'], dtype='object')
RangeIndex(start=0, stop=4, step=1)
col2 col3
0 444 abc
1 555 def
2 666 ghi
3 444 xyz
col2 col3
0 444 abc
3 444 xyz
1 555 def
2 666 ghi
col2 col3
0 False False
1 False False
2 False False
3 False False
col2 col3
0 444 abc
1 555 def
2 666 ghi
3 444 xyz
> df = pd.DataFrame({'col1':[1,2,3,np.nan],
+ 'col2':[np.nan,555,666,444],
+ 'col3':['abc','def','ghi','xyz']})
+ df.head() col1 col2 col3
0 1.0 NaN abc
1 2.0 555.0 def
2 3.0 666.0 ghi
3 NaN 444.0 xyz
col1 col2 col3
0 1 FILL abc
1 2 555 def
2 3 666 ghi
3 FILL 444 xyz
> data = {'A':['foo','foo','foo','bar','bar','bar'],
+ 'B':['one','one','two','two','one','one'],
+ 'C':['x','y','x','y','x','y'],
+ 'D':[1,3,2,5,4,1]}
+
+ df = pd.DataFrame(data)
+ df A B C D
0 foo one x 1
1 foo one y 3
2 foo two x 2
3 bar two y 5
4 bar one x 4
5 bar one y 1
C x y
A B
bar one 4.0 1.0
two NaN 5.0
foo one 1.0 3.0
two 2.0 NaN
You may need to install some packages:
> conda install sqlalchemy
> conda install lxml
> conda install html5lib
> conda install BeautifulSoup4
> conda install xlrd a b c d
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
3 12 13 14 15
Pandas can read and write excel files but this only imports data, not formulas or images Having images or macros may cause this read_excel method to crash.
Unnamed: 0 a b c d
0 0 0 1 2 3
1 1 4 5 6 7
2 2 8 9 10 11
3 3 12 13 14 15
Pandas read_html function will read tables off of a webpage and return a list of DataFrame objects:
<class 'list'>
Bank Name ... Closing Date
0 The First State Bank ... April 3, 2020
1 Ericson State Bank ... February 14, 2020
2 City National Bank of New Jersey ... November 1, 2019
3 Resolute Bank ... October 25, 2019
4 Louisa Community Bank ... October 25, 2019
.. ... ... ...
556 Superior Bank, FSB ... July 27, 2001
557 Malta National Bank ... May 3, 2001
558 First Alliance Bank & Trust Co. ... February 2, 2001
559 National State Bank of Metropolis ... December 14, 2000
560 Bank of Honolulu ... October 13, 2000
[561 rows x 6 columns]
These use the SF Salaries Dataset from Kaggle:
https://www.kaggle.com/kaggle/sf-salaries
Id EmployeeName ... Agency Status
0 1 NATHANIEL FORD ... San Francisco NaN
1 2 GARY JIMENEZ ... San Francisco NaN
2 3 ALBERT PARDINI ... San Francisco NaN
3 4 CHRISTOPHER CHONG ... San Francisco NaN
4 5 PATRICK GARDNER ... San Francisco NaN
[5 rows x 13 columns]
.info() method to find out how many entries there are<class 'pandas.core.frame.DataFrame'>
RangeIndex: 148654 entries, 0 to 148653
Data columns (total 13 columns):
Id 148654 non-null int64
EmployeeName 148654 non-null object
JobTitle 148654 non-null object
BasePay 148045 non-null float64
OvertimePay 148650 non-null float64
OtherPay 148650 non-null float64
Benefits 112491 non-null float64
TotalPay 148654 non-null float64
TotalPayBenefits 148654 non-null float64
Year 148654 non-null int64
Notes 0 non-null float64
Agency 148654 non-null object
Status 0 non-null float64
dtypes: float64(8), int64(2), object(3)
memory usage: 14.7+ MB
66325.44884050643
245131.88
24 CAPTAIN, FIRE SUPPRESSION
Name: JobTitle, dtype: object
24 270324.91
Name: TotalPayBenefits, dtype: float64
0 NATHANIEL FORD
Name: EmployeeName, dtype: object
'NATHANIEL FORD'
148653 Joe Lopez
Name: EmployeeName, dtype: object
Id 148654
EmployeeName Joe Lopez
JobTitle Counselor, Log Cabin Ranch
BasePay 0
OvertimePay 0
OtherPay -618.13
Benefits 0
TotalPay -618.13
TotalPayBenefits -618.13
Year 2014
Notes NaN
Agency San Francisco
Status NaN
Name: 148653, dtype: object
Year
2011 63595.956517
2012 65436.406857
2013 69630.030216
2014 66564.421924
Name: BasePay, dtype: float64
2159
Transit Operator 7036
Special Nurse 4389
Registered Nurse 3736
Public Svc Aide-Public Works 2518
Police Officer 3 2421
Name: JobTitle, dtype: int64
202
> def chief_string(title):
+ if 'chief' in title.lower():
+ return True
+ else:
+ return False
+
+ sum(sal['JobTitle'].apply(lambda x: chief_string(x)))
+ 627
> # No Correlation
+ sal['title_len'] = sal['JobTitle'].apply(len)
+ sal[['title_len','TotalPayBenefits']].corr() title_len TotalPayBenefits
title_len 1.000000 -0.036878
TotalPayBenefits -0.036878 1.000000
These exercises involve fake data about some purchases made through Amazon.
Address ... Purchase Price
0 16629 Pace Camp Apt. 448\nAlexisborough, NE 77... ... 98.14
1 9374 Jasmine Spurs Suite 508\nSouth John, TN 8... ... 70.73
2 Unit 0065 Box 5052\nDPO AP 27450 ... 0.95
3 7780 Julia Fords\nNew Stacy, WA 45798 ... 78.04
4 23012 Munoz Drive Suite 337\nNew Cynthia, TX 5... ... 77.82
[5 rows x 14 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
Address 10000 non-null object
Lot 10000 non-null object
AM or PM 10000 non-null object
Browser Info 10000 non-null object
Company 10000 non-null object
Credit Card 10000 non-null int64
CC Exp Date 10000 non-null object
CC Security Code 10000 non-null int64
CC Provider 10000 non-null object
Email 10000 non-null object
Job 10000 non-null object
IP Address 10000 non-null object
Language 10000 non-null object
Purchase Price 10000 non-null float64
dtypes: float64(1), int64(2), object(11)
memory usage: 1.1+ MB
50.34730200000025
99.99
0.0
Address 1098
Lot 1098
AM or PM 1098
Browser Info 1098
Company 1098
Credit Card 1098
CC Exp Date 1098
CC Security Code 1098
CC Provider 1098
Email 1098
Job 1098
IP Address 1098
Language 1098
Purchase Price 1098
dtype: int64
<class 'pandas.core.frame.DataFrame'>
Int64Index: 30 entries, 470 to 9979
Data columns (total 14 columns):
Address 30 non-null object
Lot 30 non-null object
AM or PM 30 non-null object
Browser Info 30 non-null object
Company 30 non-null object
Credit Card 30 non-null int64
CC Exp Date 30 non-null object
CC Security Code 30 non-null int64
CC Provider 30 non-null object
Email 30 non-null object
Job 30 non-null object
IP Address 30 non-null object
Language 30 non-null object
Purchase Price 30 non-null float64
dtypes: float64(1), int64(2), object(11)
memory usage: 3.5+ KB
PM 5068
AM 4932
Name: AM or PM, dtype: int64
Interior and spatial designer 31
Lawyer 30
Social researcher 28
Research officer, political party 27
Designer, jewellery 27
Name: Job, dtype: int64
513 75.1
Name: Purchase Price, dtype: float64
1234 bondellen@williams-garza.com
Name: Email, dtype: object
Address 39
Lot 39
AM or PM 39
Browser Info 39
Company 39
Credit Card 39
CC Exp Date 39
CC Security Code 39
CC Provider 39
Email 39
Job 39
IP Address 39
Language 39
Purchase Price 39
dtype: int64
1033
hotmail.com 1638
yahoo.com 1616
gmail.com 1605
smith.com 42
williams.com 37
Name: Email, dtype: int64