Introduction to Python

Lecturer: Hugo Bowne-Anderson


1 Course Description

Python is a general-purpose programming language that is becoming ever more popular for data science. Companies worldwide are using Python to harvest insights from their data and gain a competitive edge. Unlike other Python tutorials, this course focuses on Python specifically for data science. In our Introduction to Python course, you’ll learn about powerful ways to store and manipulate data, and helpful data science tools to begin conducting your own analyses.

Course materials can be found here. The data can be found here.

2 Python Basics

An introduction to the basic concepts of Python. Learn how to use Python interactively and by using a script. Create your first variables and acquaint yourself with Python’s basic data types.

2.1 Lecture: Hello Python!


Below are simple examples of Python executions.

5 / 8
## 0.625
print(7 + 10)
## 17

2.2 When to use Python?

Python is a pretty versatile language. You can use it for a wide range of purposes, for example:

  • You want to do some quick calculations.
  • For your new business, you want to develop a database-driven website.
  • Your boss asks you to clean and analyze the results of the latest satisfaction survey.

2.3 Any Comments?

Something that Hugo didn’t mention in his videos is that you can add comments to your Python scripts. Comments are important to make sure that you and others can understand what your code is about.

To add comments to your Python script, you can use the # character. Everything after the # on that line is not run as Python code, so it will not influence your result. As an example, the comment below is completely ignored during execution.

# Division
print(5 / 8)
## 0.625

2.4 Python as a Calculator

Python is perfectly suited to do basic calculations. Apart from addition, subtraction, multiplication and division, there is also support for more advanced operations such as exponentiation \(**\) and modulo \(\%\). The code below gives some examples.

# Addition - Subtraction
print('5 + 5 =', 5 + 5, ' ||  5 - 5 =', 5 - 5)
## 5 + 5 = 10  ||  5 - 5 = 0
# Multiplication - Division - Modulo - Exponentiation
print('3 * 5 =', 3 * 5, ' ||  10 / 2 =', 10 / 2,
      ' ||  18 % 7 =', 18 % 7, ' ||  4 ** 2 =', 4 ** 2)
## 3 * 5 = 15  ||  10 / 2 = 5.0  ||  18 % 7 = 4  ||  4 ** 2 = 16

Suppose you have \(\$100\), which you can invest with a \(10\%\) return each year. After one year, it’s \(100*1.1=110\) dollars, and after two years it’s \(100*1.1*1.1=121\). The code below calculates how much money you end up with after \(7\) years, and prints the result.

# How much is your $100 worth after 7 years?
print(100 * (1.1 ** 7))
## 194.87171000000012
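
The long tail of digits in the printout comes from how floats are stored. As a small sketch that is not part of the original exercise, the built-in round() function can tidy up the value for display; the variable name worth is chosen here purely for illustration.

```python
# Value of $100 after 7 years at a 10% yearly return
worth = 100 * (1.1 ** 7)

# Round to 2 decimals (cents) for display
print(round(worth, 2))  # 194.87
```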

2.5 Lecture: Variables and Types

2.6 Variable Assignment

In Python, a variable allows you to refer to a value with a name. To create a variable, use \(=\) like the example below.

x = 5
print(x)
## 5

You can now use the name of this variable, \(x\), instead of the actual value \(5\). Remember, \(=\) in Python means assignment; it doesn’t test equality!
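
To make the difference concrete, here is a minimal sketch that puts assignment next to an equality test. The == operator is not covered in this chapter; it is shown only for contrast.

```python
x = 5          # assignment: the name x now refers to the value 5

print(x == 5)  # equality test: True
print(x == 6)  # equality test: False
```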

2.7 Calculations with Variables

Remember how you calculated the money you ended up with after \(7\) years of investing \(\$100\)? You did something like this:

100 * (1.1 ** 7)
## 194.87171000000012

Instead of calculating with the actual values, you can use variables.

savings = 100
growth_multiplier = 1.1
result = savings * growth_multiplier ** 7
result
## 194.87171000000012

2.8 Other Variable Types

Previously, you worked with two Python data types:

  • int, or integer: a number without a fractional part. savings, with the value \(100\), is an example of an integer.
  • float, or floating point: a number that has both an integer and fractional part, separated by a point. growth_multiplier, with the value \(1.1\), is an example of a float.

Next to numerical data types, there are two other very common data types:

  • str, or string: a type to represent text. You can use single or double quotes to build a string.
  • bool, or boolean: a type to represent logical values. Can only be True or False (the capitalization is important!).

To find out the type of a value or a variable that refers to that value, you can use the type() function.

# Create a variable desc
desc = "compound interest"

# Create a variable profitable
profitable = True

print(type(desc))
## <class 'str'>

2.9 Operations with other Types

Hugo mentioned that different types behave differently in Python.

When you sum two strings, for example, you’ll get different behavior than when you sum two integers or two booleans. Notice how desc \(+\) desc causes “compound interest” and “compound interest” to be pasted together.

desc = "compound interest"

# Assign sum of desc and desc to doubledesc
doubledesc = desc + desc

# Print out doubledesc
print(doubledesc)
## compound interestcompound interest

2.10 Type Conversion

Using the \(+\) operator to paste together two strings can be very useful in building custom messages.

Suppose, for example, that you’ve calculated the return of your investment and want to summarize the results in a string. Assuming the integer savings and float result are defined, you can try something like this:

print("I started with $" + savings + " and now have $" + result + ". Awesome!")

This will not work, though, as you cannot simply sum strings and integers/floats.
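
If you want to see the failure without stopping your script, the sketch below wraps the printout in try/except. This construct is not covered in the course and the flag name failed is an assumption made for illustration; the point is only that Python raises a TypeError here.

```python
savings = 100
result = 100 * 1.10 ** 7

failed = False
try:
    print("I started with $" + savings + " and now have $" + result + ". Awesome!")
except TypeError as err:
    # Python refuses to add a str and an int
    failed = True
    print("TypeError:", err)
```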

To fix the error, you’ll need to explicitly convert the types of your variables. More specifically, you’ll need str() to convert a value into a string. str(savings), for example, will convert the integer savings to a string.

# Definition of savings and result
savings = 100
result = 100 * 1.10 ** 7

# Fix the printout
print("I started with $" + str(savings),
      "and now have $" + str(result) + ". Awesome!")
## I started with $100 and now have $194.87171000000012. Awesome!

You have a profit of around \(\$95\); that’s pretty awesome indeed! Similar functions such as int(), float() and bool() will help you convert Python values to other types.
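
A few conversions with these functions might look as follows. The input values are made up for illustration and do not come from the exercises.

```python
print(int("42") + 1)     # 43: the string "42" becomes the integer 42
print(float("3.5") * 2)  # 7.0
print(int(2.9))          # 2: int() drops the fractional part
print(bool(0), bool(7))  # False True: zero is False, other numbers are True
```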

3 Python Lists

Learn to store, access, and manipulate data in lists: the first step toward efficiently working with huge amounts of data.

3.1 Lecture: Python Lists

3.2 Create a List

As opposed to int, bool etc., a list is a compound data type; you can group values together:

a = 'is'
b = 'nice'
my_list = ['my', 'list', a, b]

Let’s say after measuring the height of your family, you decide to collect some information on the house you’re living in.

# area variables (in square meters)
hall = 11.25
kit = 18.0
liv = 20.0
bed = 10.75
bath = 9.50

# Create list areas
areas = [hall, kit, liv, bed, bath]

# Print areas
print(areas)
## [11.25, 18.0, 20.0, 10.75, 9.5]

3.3 Create List with different Types

A list can contain any Python type. Although it’s not really common, a list can also contain a mix of Python types including strings, floats, booleans, etc.

The previous printout wasn’t really satisfying. It’s just a list of numbers representing the areas, but you can’t tell which area corresponds to which part of your house. The code below is the start of a solution.

# area variables (in square meters)
hall = 11.25
kit = 18.0
liv = 20.0
bed = 10.75
bath = 9.50

# Adapt list areas
areas = ["hallway", hall, "kitchen", kit, "living room",
         liv, "bedroom", bed, "bathroom", bath]

# Print areas
print(areas)
## ['hallway', 11.25, 'kitchen', 18.0, 'living room', 20.0, 'bedroom', 10.75, 'bathroom', 9.5]

Is the printout more informative this time? The list contains both strings and floats, but that’s not a problem for Python!

3.4 List of Lists

As a data scientist, you’ll often be dealing with a lot of data, and it will make sense to group some of this data.

Instead of creating a flat list containing strings and floats, representing the names and areas of the rooms in your house, you can create a list of lists. Don’t get confused here: “hallway” is a string, while hall is a variable that represents the float \(11.25\) you specified earlier.

# area variables (in square meters)
hall = 11.25
kit = 18.0
liv = 20.0
bed = 10.75
bath = 9.50

# house information as list of lists
house = [["hallway", hall],
         ["kitchen", kit],
         ["living room", liv],
         ["bedroom", bed],
         ["bathroom", bath]]

# Print out house
print(house)
## [['hallway', 11.25], ['kitchen', 18.0], ['living room', 20.0], ['bedroom', 10.75], ['bathroom', 9.5]]

Get ready to learn about list subsetting!

3.5 Lecture: Subsetting Lists

3.6 Subset and conquer

Subsetting Python lists is a piece of cake. Take the code sample below, which creates a list x and then selects “b” from it. Remember that this is the second element, so it has index \(1\). You can also use negative indexing.

x = ["a", "b", "c", "d"]
print(x[1], '\n',
      x[-3]) # same result!
## b 
##  b

Remember the areas list from before, containing both strings and floats? Let’s do some subsetting with it!

# Create the areas list
areas = ["hallway", 11.25, "kitchen", 18.0, "living room", 20.0,
         "bedroom", 10.75, "bathroom", 9.50]

# Print out second element from areas
print(areas[1])
## 11.25
# Print out last element from areas
print(areas[-1])
## 9.5
# Print out the area of the living room
print(areas[5])
## 20.0

3.7 Subset and Calculate

After you’ve extracted values from a list, you can use them to perform additional calculations. Take this example, where the second and fourth elements of a list x are extracted. The strings that result are pasted together using the \(+\) operator:

x = ["a", "b", "c", "d"]
print(x[1] + x[3])
## bd

Further examples are given below.

# Create the areas list
areas = ["hallway", 11.25, "kitchen", 18.0, "living room", 20.0,
         "bedroom", 10.75, "bathroom", 9.50]

# Sum of kitchen and bedroom area: eat_sleep_area
eat_sleep_area = areas[3] + areas[7]

# Print the variable eat_sleep_area
print(eat_sleep_area)
## 28.75

3.8 Slicing and dicing

Selecting single values from a list is just one part of the story. It’s also possible to slice your list, which means selecting multiple elements from your list. Use the following syntax:

my_list[start:end]

The start index will be included, while the end index is not.

The code sample below shows an example. A list containing “b” and “c”, the elements at indexes \(1\) and \(2\), is selected from a list x:

x = ["a", "b", "c", "d"]
x[1:3]
## ['b', 'c']

The elements with index \(1\) and \(2\) are included, while the element with index \(3\) is not. Below is another example.

# Create the areas list
areas = ["hallway", 11.25, "kitchen", 18.0, "living room", 20.0,
         "bedroom", 10.75, "bathroom", 9.50]

# Use slicing to create downstairs
downstairs = areas[2:6]

# Use slicing to create upstairs
upstairs = areas[-4:-2]

# Print out downstairs and upstairs
print(downstairs, '\n', upstairs)
## ['kitchen', 18.0, 'living room', 20.0] 
##  ['bedroom', 10.75]

If you don’t specify the start index, Python figures out that you want to start your slice at the beginning of your list. If you don’t specify the end index, the slice will go all the way to the last element of your list. Note that the slices below therefore cover more elements than the ones above.

# Omit the start index: slice from the beginning of the list
downstairs = areas[:6]

# Omit the end index: slice up to the end of the list
upstairs = areas[-4:]

# Print out downstairs and upstairs
print(downstairs, '\n', upstairs)
## ['hallway', 11.25, 'kitchen', 18.0, 'living room', 20.0] 
##  ['bedroom', 10.75, 'bathroom', 9.5]
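
Because the start index is included and the end index is not, two slices that share a boundary never overlap. The following sketch checks this on the areas list; the names first_part and second_part are chosen here for illustration.

```python
areas = ["hallway", 11.25, "kitchen", 18.0, "living room", 20.0,
         "bedroom", 10.75, "bathroom", 9.50]

first_part = areas[:6]   # indexes 0 up to and including 5
second_part = areas[6:]  # index 6 up to the end

# The element at index 6 appears exactly once, in second_part
print(first_part + second_part == areas)  # True
```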

3.9 Subsetting Lists of Lists

You saw before that a Python list can contain practically anything; even other lists! To subset lists of lists, you can use the same technique as before: square brackets.

x = [["a", "b", "c"],
     ["d", "e", "f"],
     ["g", "h", "i"]]
print(x[2][0], '\n', x[2][:2])
## g 
##  ['g', 'h']

x[2] results in a list that you can subset again by adding another pair of square brackets.

3.10 Lecture: Manipulating Lists

3.11 Replace List Elements

Replacing list elements is pretty easy. Simply subset the list and assign new values to the subset. You can select single elements or you can change entire list slices at once.

x = ["a", "b", "c", "d"]
x[1] = "r"
x[2:] = ["s", "t"]
x
## ['a', 'r', 's', 't']

Let’s continue working on the areas list that contains the names and areas of different rooms in a house.

# Create the areas list
areas = ["hallway", 11.25, "kitchen", 18.0, "living room", 20.0,
         "bedroom", 10.75, "bathroom", 9.50]

# Correct the bathroom area
areas[9] = 10.50

# Change "living room" to "chill zone"
areas[4] = "chill zone"

areas
## ['hallway', 11.25, 'kitchen', 18.0, 'chill zone', 20.0, 'bedroom', 10.75, 'bathroom', 10.5]

3.12 Extend a List

If you can change elements in a list, you sure want to be able to add elements to it, right? You can use the \(+\) operator:

x = ["a", "b", "c", "d"]
x + ["e", "f"]
## ['a', 'b', 'c', 'd', 'e', 'f']
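
Keep in mind that \(+\) builds a new list and leaves the original alone. A minimal sketch, where the name y is introduced only for illustration:

```python
x = ["a", "b", "c", "d"]
y = x + ["e", "f"]

print(x)  # ['a', 'b', 'c', 'd']: x itself is unchanged
print(y)  # ['a', 'b', 'c', 'd', 'e', 'f']
```

To keep the longer list, you have to store the result in a variable, as is done with areas_1 and areas_2 in this section.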

You just won the lottery, awesome! You decide to build a poolhouse and a garage. Let’s add the information to the areas list!

# Create the areas list and make some changes
areas = ["hallway", 11.25, "kitchen", 18.0, "chill zone", 20.0,
         "bedroom", 10.75, "bathroom", 10.50]

# Add poolhouse data to areas, new list is areas_1
areas_1 = areas + ["poolhouse", 24.5]

# Add garage data to areas_1, new list is areas_2
areas_2 = areas_1 + ["garage", 15.45]

areas_2
## ['hallway', 11.25, 'kitchen', 18.0, 'chill zone', 20.0, 'bedroom', 10.75, 'bathroom', 10.5, 'poolhouse', 24.5, 'garage', 15.45]

Cool! The list is shaping up nicely!

3.13 Delete List Elements

Finally, you can also remove elements from your list. You can do this with the del statement:

x = ["a", "b", "c", "d"]
del x[1]
x
## ['a', 'c', 'd']

Pay attention here: as soon as you remove an element from a list, the indexes of the elements that come after the deleted element all change!
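
The shift is easy to see on a small list. This is only an illustrative sketch of the behavior just described:

```python
x = ["a", "b", "c", "d"]
print(x[2])  # c

del x[1]     # remove "b"
print(x[2])  # d: "c" moved to index 1 and "d" to index 2
```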

areas = ["hallway", 11.25, "kitchen", 18.0,
         "chill zone", 20.0, "bedroom", 10.75,
         "bathroom", 10.50, "poolhouse", 24.5,
         "garage", 15.45]

The updated and extended version of areas that you’ve built in the previous exercises is coded above. There was a mistake! The amount you won in the lottery is not that big after all, and it looks like the poolhouse isn’t going to happen. You decide to remove the corresponding string and float from the areas list.

del areas[-4:-2]
areas
## ['hallway', 11.25, 'kitchen', 18.0, 'chill zone', 20.0, 'bedroom', 10.75, 'bathroom', 10.5, 'garage', 15.45]

You’ll learn about easier ways to remove specific elements from Python lists later on.

3.14 Inner workings of Lists

At the end of the video, Hugo explained how Python lists work behind the scenes. Now you’ll get some hands-on experience with this.

# Create list areas
areas = [11.25, 18.0, 20.0, 10.75, 9.50]

# Create areas_copy
areas_copy = areas

# Change areas_copy
areas_copy[0] = 5.0

# Print areas
print(areas)
## [5.0, 18.0, 20.0, 10.75, 9.5]

The Python code above creates a list areas and a copy areas_copy. Next, the first element in areas_copy is changed and areas is printed out. You see that, although you’ve changed areas_copy, the change also takes effect in areas. That’s because areas and areas_copy point to the same list.

If you want to prevent changes in areas_copy from also taking effect in areas, you’ll have to do a more explicit copy of areas. You can do this with list() or by using [:].

areas = [11.25, 18.0, 20.0, 10.75, 9.50]
areas_copy = areas[:]
areas_copy[0] = 5.0
print(areas)
## [11.25, 18.0, 20.0, 10.75, 9.5]
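
The list() variant works the same way. The sketch below also uses the is operator, which is not covered in this course, to check whether two names point to the same list; the names same_list and new_list are assumptions made for illustration.

```python
areas = [11.25, 18.0, 20.0, 10.75, 9.50]

same_list = areas       # a second name for the same list
new_list = list(areas)  # an explicit copy

print(same_list is areas)  # True
print(new_list is areas)   # False

new_list[0] = 5.0
print(areas[0])            # 11.25: the original is untouched
```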

Nice! The difference between explicit and reference-based copies is subtle, but can be really important. Try to keep in mind how a list is stored in the computer’s memory.

4 Functions and Packages

You’ll learn how to use functions, methods, and packages to efficiently leverage the code that brilliant Python developers have written. The goal is to reduce the amount of code you need to solve challenging problems!

4.1 Lecture: Functions

4.2 Familiar Functions

Out of the box, Python offers a bunch of built-in functions to make your life as a data scientist easier. You already know two such functions: print() and type(). You’ve also used the functions str(), int(), bool() and float() to switch between data types. These are built-in functions as well.

Calling a function is easy. To get the type of \(3.0\) and store the output as a new variable, result, you can use the following:

result = type(3.0)
result
## <class 'float'>

The general recipe for calling functions and saving the result to a variable is thus:

# output = function_name(input)

Examples are provided below.

var1 = [1, 2, 3, 4]
var2 = True
out = int(var2)
print(type(var1), '\n', len(var1), '\n', type(out))
## <class 'list'> 
##  4 
##  <class 'int'>

The len() function is extremely useful; it also works on strings to count the number of characters!
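
For instance, applied to a string and to a list (both values are just examples):

```python
place = "poolhouse"
print(len(place))       # 9 characters

print(len([1, 2, 3, 4]))  # 4 elements
```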

4.3 Help!

Maybe you already know the name of a Python function, but you still have to figure out how to use it. Ironically, you have to ask for information about a function with another function: help().

help(max)
## Help on built-in function max in module builtins:
## 
## max(...)
##     max(iterable, *[, default=obj, key=func]) -> value
##     max(arg1, arg2, *args, *[, key=func]) -> value
##     
##     With a single iterable argument, return its biggest item. The
##     default keyword-only argument specifies an object to return if
##     the provided iterable is empty.
##     With two or more arguments, return the largest argument.
help(pow)
## Help on built-in function pow in module builtins:
## 
## pow(base, exp, mod=None)
##     Equivalent to base**exp with 2 arguments or base**exp % mod with 3 arguments
##     
##     Some types, such as ints, are able to use a more efficient algorithm when
##     invoked using the three argument form.

Using help() can help you understand how functions work, unleashing their full potential!

4.4 Multiple Arguments

In the previous exercise, you identified optional arguments by viewing the documentation with help(). We will now apply this to change the behavior of the sorted() function. Let’s have a look at its documentation.

help(sorted)
## Help on built-in function sorted in module builtins:
## 
## sorted(iterable, /, *, key=None, reverse=False)
##     Return a new list containing all items from the iterable in ascending order.
##     
##     A custom key function can be supplied to customize the sort order, and the
##     reverse flag can be set to request the result in descending order.

You’ll see that sorted() takes three arguments: iterable, key, and reverse.

key = None means that if you don’t specify the key argument, it will be None. reverse = False means that if you don’t specify the reverse argument, it will be False, by default.

In the code below, we only have to specify iterable and reverse, not key. The first input you pass to sorted() will be matched to the iterable argument, but what about the second input? To tell Python you want to specify reverse without changing anything about key, you can use \(=\) to assign it a new value:

sorted(____, reverse=____)

Code and results are given below.

# Create lists first and second
first = [11.25, 18.0, 20.0]
second = [10.75, 9.50]

# Paste together first and second: full
full = first + second

# Sort full in descending order: full_sorted
full_sorted = sorted(full, reverse = True)

# Print out full_sorted
print(full_sorted)
## [20.0, 18.0, 11.25, 10.75, 9.5]
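
The key argument was left at its default above. As a hedged sketch of what it can do, the example below passes the built-in len() as key to sort strings by length; the list rooms and the name by_length are made up for illustration.

```python
rooms = ["kitchen", "hall", "bathroom"]

# key=len sorts by the length of each string instead of alphabetically
by_length = sorted(rooms, key=len)

print(by_length)  # ['hall', 'kitchen', 'bathroom']
print(rooms)      # ['kitchen', 'hall', 'bathroom']: sorted() returns a new list
```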

4.5 Lecture: Methods

4.6 String Methods

Strings come with a bunch of methods. If you want to discover them in more detail, you can always type help(str) in the execution shell.

# string to experiment with: place
place = "poolhouse"

# Use upper() on place: place_up
place_up = place.upper()

print(place, '\n',
      place_up, '\n',
      place.count('o'))
## poolhouse 
##  POOLHOUSE 
##  3

Nice! Notice from the printouts that the upper() method does not change the object it is called on. This will be different for lists, as will be shown below!

4.7 List Methods

Strings are not the only Python types that have methods associated with them. Lists, floats, integers and booleans are also types that come packaged with a bunch of useful methods. Let’s first experiment with:

  • index(), to get the index of the first element of a list that matches its input;
  • count(), to get the number of times an element appears in a list.

areas = [11.25, 18.0, 20.0, 10.75, 9.50]
print(areas.index(20.0), '\n', areas.count(9.50))
## 2 
##  1

Above are examples of list methods that did not change the list they were called on. Many list methods, however, do change the list they’re called on. Examples are:

  • append(), that adds an element to the list it is called on,
  • remove(), that removes the first element of a list that matches the input,
  • reverse(), that reverses the order of the elements in the list it is called on.

# Create list areas
areas = [11.25, 18.0, 20.0, 10.75, 9.50]

# Use append twice to add poolhouse and garage size
areas.append(24.5)
areas.append(15.45)

# Print out areas
print(areas)
## [11.25, 18.0, 20.0, 10.75, 9.5, 24.5, 15.45]
# Reverse the orders of the elements in areas
areas.reverse()

# Print out areas
print(areas)
## [15.45, 24.5, 9.5, 10.75, 20.0, 18.0, 11.25]
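
The remove() method from the list above is not used in the exercise. A minimal sketch, using a made-up version of areas in which 18.0 appears twice, shows that only the first match is removed:

```python
areas = [11.25, 18.0, 20.0, 10.75, 9.50, 18.0]

# remove() deletes only the first element equal to 18.0
areas.remove(18.0)
print(areas)  # [11.25, 20.0, 10.75, 9.5, 18.0]
```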

4.8 Lecture: Packages

4.9 Import Package

As a data scientist, some notions of geometry never hurt. Let’s refresh some of the basics.

For a fancy clustering algorithm, you want to find the circumference \(C\) and area \(A\) of a circle. When the radius of the circle is \(r\), you can calculate \(C\) and \(A\) as \[C=2\pi r\textrm{ and }A=\pi r^2.\] To use the constant \(\pi\), you’ll need the math package.

# Definition of radius
r = 0.43

# Import the math package
import math

# Calculate C
C = 2 * math.pi * r

# Calculate A
A = math.pi * r * r

# Build printout
print("Circumference: " + str(C), '\n', "Area: " + str(A))
## Circumference: 2.701769682087222 
##  Area: 0.5808804816487527

Nice! If you know how to deal with functions from packages, the power of a lot of Python programmers is at your fingertips!

4.10 Selective Import

General imports, like import math, make all functionality from the math package available to you. However, if you decide to only use a specific part of a package, you can always make your import more selective:

from math import pi

Let’s say the Moon’s orbit around planet Earth is a perfect circle, with a radius \(r\) (in km) that is defined in the script. The code below calculates the distance traveled by the Moon over \(12\) degrees of its orbit.

# Definition of radius
r = 192500

# Import radians function of math package
from math import radians

# Travel distance of Moon over 12 degrees. Store in dist.
dist = r * radians(12)

# Print out dist
print(dist)
## 40317.10572106901
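
To compare the two import styles side by side, the sketch below recomputes the circumference from the previous exercise with a selectively imported pi. It is only an illustration; both styles refer to the same constant.

```python
import math
from math import pi

r = 0.43

# With a selective import, no math. prefix is needed
C = 2 * pi * r

print(pi == math.pi)            # True: the same constant under two names
print(C == 2 * math.pi * r)     # True
```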

5 NumPy

NumPy is a fundamental Python package to efficiently practice data science. Learn to work with powerful tools in the NumPy array, and get started with data exploration.

5.1 Lecture: NumPy

5.2 Your first NumPy Array

In this chapter, we’re going to dive into the world of baseball. Along the way, you’ll get comfortable with the basics of numpy, a powerful package to do data science.

# Create list baseball
baseball = [180, 215, 210, 210, 188, 176, 209, 200]

# Import the numpy package as np
import numpy as np

# Create a numpy array from baseball: np_baseball
np_baseball = np.array(baseball)

# Print out type of np_baseball
print(type(np_baseball))
## <class 'numpy.ndarray'>

5.3 Baseball players’ height

You are a huge baseball fan. You decide to call the MLB (Major League Baseball) and ask around for some more statistics on the height of the main players. They pass along data on more than a thousand players, which is stored in a regular Python list height_in. The data can be found here.

# height_in is available as a regular list
height_in[:6]
## [74, 74, 72, 72, 73, 69]

The height is expressed in inches. Let’s make a numpy array out of it and convert the units to meters!

# Import numpy
import numpy as np

# Create a numpy array from height_in: np_height_in
np_height_in = np.array(height_in)

# Print out np_height_in
print(np_height_in)
## [74 74 72 ... 75 75 73]
# Convert np_height_in to m: np_height_m
np_height_m = np_height_in * 0.0254

# Print np_height_m
print(np_height_m)
## [1.8796 1.8796 1.8288 ... 1.905  1.905  1.8542]

Nice! In the blink of an eye, numpy performs multiplications on more than \(1000\) height measurements!
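
To see why the conversion to a numpy array matters, the sketch below tries the same multiplication on a regular list. The list heights holds just the first three values of height_in, and the try/except construct and the flag list_failed are additions for illustration that are not covered in the course.

```python
import numpy as np

heights = [74, 74, 72]  # a regular list

list_failed = False
try:
    heights * 0.0254    # a list cannot be multiplied by a float
except TypeError as err:
    list_failed = True
    print("TypeError:", err)

# The numpy array converts every element to meters in one step
np_heights_m = np.array(heights) * 0.0254
print(np_heights_m)
```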

5.4 Baseball players’ BMI

The MLB also offers to let you analyze their weight data. Again, both are available as regular Python lists: height_in and weight_lb. The data can be found here.

# height_in and weight_lb are available as regular lists
print(height_in[:6], '\n',
      weight_lb[:6])
## [74, 74, 72, 72, 73, 69] 
##  [180, 215, 210, 210, 188, 176]

height_in is in inches and weight_lb is in pounds.

It’s now possible to calculate the BMI of each baseball player, with the formula: \[\textrm{BMI}=\frac{\textrm{weight(kg)}}{\textrm{height(m)}^2}\]

# Import numpy
import numpy as np

# Create array from height_in with metric units: np_height_m
np_height_m = np.array(height_in) * 0.0254

# Create array from weight_lb with metric units: np_weight_kg
np_weight_kg = np.array(weight_lb) * 0.453592

# Calculate the BMI: bmi
bmi = np_weight_kg / np_height_m / np_height_m

# Print out bmi
print(bmi)
## [23.11037639 27.60406069 28.48080465 ... 25.62295933 23.74810865
##  25.72686361]
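
As a sanity check, the calculation can be worked through by hand for the first player (74 inches, 180 pounds) without numpy. The variable names below are chosen for illustration.

```python
height_m = 74 * 0.0254      # about 1.8796 m
weight_kg = 180 * 0.453592  # about 81.64656 kg

bmi_first = weight_kg / height_m ** 2
print(round(bmi_first, 2))  # 23.11
```

This agrees with the first value in the bmi array printed above.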

5.5 Subsetting NumPy Arrays

To subset both regular Python lists and numpy arrays, you can use square brackets:

x = [4 , 9 , 6, 3, 1]
x[1]
## 9
import numpy as np
y = np.array(x)
y[1]
## 9

For numpy specifically, you can also use boolean numpy arrays:

high = y > 5
y[high]
## array([9, 6])
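
It can help to look at the boolean array itself. The following sketch prints it and also counts the True values with sum(), an array method that this chapter does not otherwise use.

```python
import numpy as np

y = np.array([4, 9, 6, 3, 1])
high = y > 5

print(high)        # [False  True  True False False]
print(y[high])     # [9 6]
print(high.sum())  # 2: each True counts as 1
```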

Let’s reveal interesting things from the baseball data!

# height_in and weight_lb are available as regular lists

# Import numpy
import numpy as np

# Store weight and height lists as numpy arrays
np_weight_lb = np.array(weight_lb)
np_height_in = np.array(height_in)

# Print out the weight at index 50
print(np_weight_lb[50])
## 200
# Print out sub-array of np_height_in: index 100 up to and including index 110
print(np_height_in[100:111])
## [73 74 72 73 69 72 73 75 75 73 72]
# Calculate the BMI: bmi
np_height_m = np_height_in * 0.0254
np_weight_kg = np_weight_lb * 0.453592
bmi = np_weight_kg / np_height_m ** 2

# Create the light array
light = bmi < 21

# Print out light
print(light)
## [False False False ... False False False]
# Print out BMIs of all baseball players whose BMI is below 21
print(bmi[light], '\n', len(bmi[light]))
## [20.54255679 20.54255679 20.69282047 20.69282047 20.34343189 20.34343189
##  20.69282047 20.15883472 19.4984471  20.69282047 20.9205219 ] 
##  11

Wow! It appears that only \(11\) of the more than \(1000\) baseball players have a BMI under \(21\)!

5.6 NumPy side effects

As Hugo explained before, numpy is great for doing vector arithmetic. If you compare its functionality with regular Python lists, however, some things have changed.

First of all, numpy arrays cannot contain elements with different types. If you try to build such an array, some of the elements’ types are changed to end up with a homogeneous array. This is known as type coercion.

Second, the typical arithmetic operators, such as \(+\), \(-\), \(*\) and \(/\) have a different meaning for regular Python lists and numpy arrays. Have a look at this line of code:

np.array([True, 1, 2]) + np.array([3, 4, False])
## array([4, 5, 2])
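
Both effects are at work in that line. The sketch below splits it into steps and contrasts it with the same operation on regular lists:

```python
import numpy as np

# Lists: + pastes the two lists together
print([True, 1, 2] + [3, 4, False])
# [True, 1, 2, 3, 4, False]

# Arrays: True and False are first coerced to 1 and 0 ...
print(np.array([True, 1, 2]))  # [1 1 2]

# ... and + then adds the arrays element by element
print(np.array([True, 1, 2]) + np.array([3, 4, False]))  # [4 5 2]
```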

5.7 Lecture: 2D NumPy Arrays

5.8 Your first 2D NumPy Array

Let’s try to create a \(2\textrm{D}\) numpy array from a small list of lists. Here, baseball is a list of lists. The main list contains \(4\) elements, one for each of \(4\) baseball players. Each of these elements is a list containing the height and the weight of that player, in this order.

# Create baseball, a list of lists
baseball = [[180, 78.4],
            [215, 102.7],
            [210, 98.5],
            [188, 75.2]]

# Import numpy
import numpy as np

# Create a 2D numpy array from baseball: np_baseball
np_baseball = np.array(baseball)

print(type(np_baseball), '\n',
      np_baseball.shape)
## <class 'numpy.ndarray'> 
##  (4, 2)

Great! You’re ready to convert the actual MLB data to a \(2\textrm{D}\) numpy array now!

5.9 Baseball data in 2D form

You have another look at the MLB data and realize that it makes more sense to restructure all this information in a \(2\textrm{D}\) numpy array. This array should have \(1015\) rows, corresponding to the \(1015\) baseball players you have information on, and \(2\) columns (for height and weight).

The MLB was, again, very helpful and passed you the data in a different structure, a Python list of lists. In this list of lists, each sublist represents the height and weight of a single baseball player. The name of this embedded list is baseball. The data can be found here.

# baseball is available as a regular list of lists
baseball[:6]
## [[74, 180], [74, 215], [72, 210], [72, 210], [73, 188], [69, 176]]

Let’s store the data as a \(2\textrm{D}\) array to unlock numpy’s extra functionality!

# Import numpy package
import numpy as np

# Create a 2D numpy array from baseball: np_baseball
np_baseball = np.array(baseball)

# Print out the shape of np_baseball
print(np_baseball.shape)
## (1015, 2)

Slick! Time to show off some killer features of multi-dimensional numpy arrays!

5.10 Subsetting 2D NumPy Arrays

If your \(2\textrm{D}\) numpy array has a regular structure, i.e. each row and column has a fixed number of values, complicated ways of subsetting become very easy. Have a look at the code below where the elements “a” and “c” are extracted from a list of lists.

# regular list of lists
x = [["a", "b"], ["c", "d"]]
[x[0][0], x[1][0]]
## ['a', 'c']
# numpy
import numpy as np
np_x = np.array(x)
np_x[:, 0]
## array(['a', 'c'], dtype='<U1')

For regular Python lists, this is a real pain. For \(2\textrm{D}\) numpy arrays, however, it’s pretty intuitive! The indexes before the comma refer to the rows, while those after the comma refer to the columns. The : is for slicing; in this example, it tells Python to include all rows.
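
A few more selections on the same small array, as an illustrative sketch of the row-comma-column pattern:

```python
import numpy as np

np_x = np.array([["a", "b"],
                 ["c", "d"]])

print(np_x[1, :])  # ['c' 'd']: second row, all columns
print(np_x[0, 1])  # b: first row, second column
print(np_x[:, 1])  # ['b' 'd']: all rows, second column
```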

Below is an example for the MLB data.

# baseball is available as a regular list of lists

# Import numpy package
import numpy as np

# Create np_baseball (2 cols)
np_baseball = np.array(baseball)

# Print out the 50th row of np_baseball
print(np_baseball[49])
## [ 70 195]
# Select the entire second column of np_baseball: np_weight_lb
np_weight_lb = np_baseball[:, 1]

# Print out height of 124th player
print(np_baseball[123, 0])
## 75

5.11 2D Arithmetic

Remember how you calculated the BMI for all baseball players? numpy was able to perform all calculations element-wise (i.e. element by element). For \(2\textrm{D}\) numpy arrays this isn’t any different! You can combine matrices with single numbers, with vectors, and with other matrices.

import numpy as np
np_mat = np.array([[1, 2],
                   [3, 4],
                   [5, 6]])
np_mat * 2
## array([[ 2,  4],
##        [ 6,  8],
##        [10, 12]])
np_mat + np.array([10, 10])
## array([[11, 12],
##        [13, 14],
##        [15, 16]])
np_mat + np_mat
## array([[ 2,  4],
##        [ 6,  8],
##        [10, 12]])

Here is another example for MLB data. baseball is available for us; it’s now a \(2\textrm{D}\) list of lists with \(3\) columns representing height (in inches), weight (in pounds) and age (in years).

# baseball is available as a regular list of lists
baseball[:6]
## [[74, 180, 22.99], [74, 215, 34.69], [72, 210, 30.78], [72, 210, 35.43], [73, 188, 35.71], [69, 176, 29.39]]

You want to convert the units of height and weight to metric (meters and kilograms, respectively).

# Import numpy package
import numpy as np

# Create np_baseball (3 cols)
np_baseball = np.array(baseball)

# Create numpy array: conversion
conversion = np.array([0.0254, 0.453592, 1])

# Print out product of np_baseball and conversion
print(np_baseball * conversion)
## [[ 1.8796  81.64656 22.99   ]
##  [ 1.8796  97.52228 34.69   ]
##  [ 1.8288  95.25432 30.78   ]
##  ...
##  [ 1.905   92.98636 25.19   ]
##  [ 1.905   86.18248 31.01   ]
##  [ 1.8542  88.45044 27.92   ]]

Notice how with very little code, you can change all values in your numpy data structure in a very specific way. This will be very useful in your future as a data scientist!

5.12 Lecture: NumPy: Basic Statistics

5.13 Explore the baseball data

After the video, you now know how to use numpy functions to get a better feeling for your data. It basically comes down to importing numpy and then calling several simple functions on the numpy arrays:

import numpy as np
x = [1, 4, 8, 10, 12]
np.mean(x)
## 7.0
np.median(x)
## 8.0

The baseball data is available as a \(2\textrm{D}\) numpy array with \(3\) columns (height, weight, age) and \(1015\) rows. The name of this numpy array is np_baseball. The data can be found here. Let’s print out informative messages with the different summary statistics!

# np_baseball is available

# Import numpy
import numpy as np

# Print mean height (first column)
avg = np.mean(np_baseball[:,0])
print("Average: " + str(avg))
## Average: 73.6896551724138
# Print median height
med = np.median(np_baseball[:,0])
print("Median: " + str(med))
## Median: 74.0
# Print out the standard deviation on height
stddev = np.std(np_baseball[:,0])
print("Standard Deviation: " + str(stddev))
## Standard Deviation: 2.312791881046546
# Print out correlation between first and second column
corr = np.corrcoef(np_baseball[:,0], np_baseball[:,1])
print("Correlation: " + str(corr))
## Correlation: [[1.         0.53153932]
##  [0.53153932 1.        ]]
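
Note that np.corrcoef() returns a \(2\times 2\) matrix rather than a single number. If you only want the correlation between the two inputs, you can pick out one of the off-diagonal entries. The sketch below uses two small made-up arrays, a and b, for illustration.

```python
import numpy as np

a = np.array([1, 2, 3, 4])
b = np.array([2, 4, 6, 9])

corr = np.corrcoef(a, b)
print(corr.shape)  # (2, 2)

# Entries [0, 1] and [1, 0] both hold the correlation between a and b
print(corr[0, 1])
```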

Great! Time to use all of your new data science skills in the last exercise!

5.14 Blend it all together

In the last few exercises you’ve learned everything there is to know about heights and weights of baseball players. Now it’s time to dive into another sport: soccer.

You’ve contacted FIFA for some data and they handed you two lists. The lists are the following:

positions[:6]
## ['GK', 'M', 'A', 'D', 'M', 'D']
heights[:6]
## [191, 184, 185, 180, 181, 187]

Each element in the lists corresponds to a player. The first list, positions, contains strings representing each player’s position. The possible positions are: ‘GK’ (goalkeeper), ‘M’ (midfield), ‘A’ (attack) and ‘D’ (defense). The second list, heights, contains integers representing the height of the player in cm. The first player in the lists is a goalkeeper and is pretty tall (\(191\) cm). The data can be found here.

You’re fairly confident that the median height of goalkeepers is higher than that of other players on the soccer field. Some of your friends don’t believe you, so you are determined to show them using the data you received from FIFA and your newly acquired Python skills.

# Import numpy
import numpy as np

# Convert positions and heights to numpy arrays: np_positions, np_heights
np_positions = np.array(positions)
np_heights = np.array(heights)

# Heights of the goalkeepers: gk_heights
gk_heights = np_heights[np_positions == 'GK']

# Heights of the other players: other_heights
other_heights = np_heights[np_positions != 'GK']

# Print out the median height of goalkeepers
print("Median height of goalkeepers: " + str(np.median(gk_heights)))
## Median height of goalkeepers: 188.0
# Print out the median height of other players
print("Median height of other players: " + str(np.median(other_heights)))
## Median height of other players: 181.0
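
To see what the comparison np_positions == 'GK' produces, the sketch below repeats the steps on just the first six players printed earlier. The name is_gk is chosen here for illustration.

```python
import numpy as np

np_positions = np.array(['GK', 'M', 'A', 'D', 'M', 'D'])
np_heights = np.array([191, 184, 185, 180, 181, 187])

# The comparison yields one boolean per player
is_gk = np_positions == 'GK'
print(is_gk)              # [ True False False False False False]

# The boolean array then selects the matching heights
print(np_heights[is_gk])  # [191]
print(np.median(np_heights[np_positions != 'GK']))  # 184.0
```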

Wonderful! You were right and the disbelievers were wrong!

6 Final Words

Congratulations on completing the course! More courses, tracks and instructions can be found here. Happy learning!