Prerequisites

Before taking this course, you are expected to have completed the following programming introductions available on DataCamp:

  1. introduction-to-python-for-finance
  2. intermediate-python-for-finance

Moreover, I recommend the following textbooks as complementary material for this course:

  • Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython 2nd Edition, by Wes McKinney.
  • Python for Finance: Mastering Data-Driven Finance 2nd Edition, by Yves Hilpisch.

0. Introduction

This course is an introduction to programming with Python, with a focus on data analysis and financial applications. We will start with the basics, then deal with plots, financial packages, portfolio analysis, textual analysis, and regression analysis.

Computers have their own language, called machine code, which tells them what to do. A programming language provides an interface between the programmer and that machine language. When we write code in Python, we write instructions that tell the machine what to do. When we click run or execute, the code is translated into machine instructions the computer can understand.

Python is a free programming language. It possesses an extensive catalog of statistical and graphical methods, including textual analysis tools, linear regression, time series, and statistical inference, to name a few. A great thing with Python is that many users have already created so-called packages, which are collections of functions we can use to perform specific tasks.

To code in Python, we will use Anaconda and Spyder. Anaconda is the world’s most popular Python distribution platform. It is the platform of choice for Python data science. Spyder is a free and open source scientific environment written in Python, for Python, and designed by and for scientists, engineers and data analysts.

I expect you to prepare the class sessions (read the material, understand the code, and test running the code on your machine), so we can focus on practicing:

  • In session 1 (November 12, 2021), we will cover section 1 - the basics.
  • In session 2 (November 16, 2021), we will cover section 2 - financial models & section 3 - dataframes.
  • In session 3 (November 23, 2021), we will cover sections 3 + 4 - dataframes and plots & section 5 - data analysis applications.
  • Your first individual assignment (described in section 6) is due on December 7, 2021.
  • In session 4 (November 30, 2021), we will cover section 7 - financial analysis.
  • In session 5 (December 2, 2021), we will cover section 7 - financial analysis.
  • In session 6 (December 7, 2021), we will cover section 7 - financial analysis.
  • Your second individual assignment (described in section 8) is due on January 7, 2022.
  • In session 7 (December 14, 2021), we will cover sections 9 & 10 - textual analysis and regression analysis.

Notice that all the material and knowledge needed to start working on the assignments is already available to you on this website. Make sure to start looking at the assignment instructions early on, ask for clarifications if needed, plan ahead, and deliver on time.

1. Basics

1.1. Printing “hello”

We are going to write our first line of code in Python. Let us open Anaconda, launch Spyder (take the time to read the tutorial offered by Spyder), and write our first code in Python. Beforehand, please read this to get to know the different elements you can see in Spyder.

We want the machine to print “hello”. That is, we want to see “hello” appear in the console of Spyder. To do so, we open Spyder, create an empty script, write the below code in it, and then click the run button to execute it. We can see the outcome of the execution of our code lines in the console window.

print("hello")
## hello

Notice that this webpage, as you can see right above, shows you the code to write in the script editor as well as the associated outcome that you will see in the console of Spyder. The idea is that you try to obtain the same thing using Spyder.

The code we have written calls a built-in function of Python, print(), to show “hello” in the console. To know how to use this function, or to find such a function in the first place, we can consult Spyder's help (top-right window). We can search for the print() function there and, once we find it, the help tells us what the function does, how to use it, and what the required arguments are. If we happen to know the name of the function but are unsure how to use it, we can also type help(print) directly in the console.

We can also print the result of an arithmetic operation:

print(5+5)
## 10

As you can imagine, this also works with subtractions, divisions, multiplications, and so on (give it a try).
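
For instance, here is a quick sketch with a few more operators (including ** for exponentiation); run it and compare with the console output:

print(7-2)
## 5
print(6*7)
## 42
print(10/4)
## 2.5
print(2**3)
## 8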

1.2. Creating a variable

In Python, a variable allows us to refer to a value with a name. To assign a value to a variable, we use the sign =. For instance:

my_variable = 5
print(my_variable)
## 5

The variable with the name my_variable stores the value 5. When we pass my_variable to the function print, it prints out the value stored within the variable. It also works with words or sentences:

my_variable = "Dodo"
print(my_variable)
## Dodo

We can use variables to perform calculations:

equity = 100
income = 10
return_on_equity = income/equity
print(return_on_equity)
## 0.1

Ideally, we want to pick a variable name that is telling and inform us about what is stored in the variable (like return_on_equity).

1.3. Variable types

There are four broad types of variables:

  • int, or integer: a number without a fractional part. 100 for instance is an integer.
  • float, or floating point: a number that has both an integer and fractional part, separated by a point. 1.12 for instance is a float.
  • str, or string: is a piece of text. We can use single or double quotes to build a string. For instance “Hello” is a string. ‘Hello’ works too.
  • bool, or boolean: is a logical value. It can only be True or False. Notice that the capitalization is important.

To find out the type of a value or of a variable that refers to that value, we can use the type() function of Python. For example:

my_variable = "abc"
type(my_variable)
## <class 'str'>
my_number=100
type(my_number)
## <class 'int'>

We can convert one type to another. To convert a string to an integer or a float, we use the functions int() and float(), respectively:

my_number="100"
print(my_number)
## 100
type(my_number)
## <class 'str'>
my_number=int(my_number)
type(my_number)
## <class 'int'>

We can turn an integer or float value into a string if needed by calling the function str().

my_number=100
my_number=str(my_number)
type(my_number)
## <class 'str'>

Notice that to concatenate strings in Python, we use the + operator:

first_str="Hello"
second_str=" "
third_str="World"
my_string=first_str+second_str+third_str
print(my_string)
## Hello World

We can now improve the piece of code presented at the end of section 1.2., so that we print out a percentage:

equity = 100
income = 10
return_on_equity = (income/equity)*100
return_on_equity_with_pct = str(return_on_equity) + "%"
print(return_on_equity_with_pct)
## 10.0%

1.4. Functions and methods

We have used several built-in functions of Python already (e.g., type() or str()). We know that we can check the help to know how to call a function and the arguments that the function expects.

function: my_function(arguments)

A method in Python is somewhat similar to a function, except that it is called on an object and has access to the data contained within that object. For instance, the upper() method returns a string where all characters are in upper case. It can only be applied to an object that is a string. Try it on an integer, for instance: it will not work.

my_name = "Alex"
my_name = my_name.upper()
print(my_name)
## ALEX

method: my_variable.method(arguments)

1.5. Relational operators

In Python, we can test how two variables or values relate to each other by means of the relational operators: ==, !=, >, <, >=, <=:

print(2==2)
## True
print(2>3)
## False
print(2!=3)
## True
print(2<1)
## False

1.6. Logical operators

A logical operator is a symbol or word used to connect two or more expressions (in Python: and, or, not). For instance:

print (15 == 15 and 15 > 13)
## True
print (15 == 15 and 15 > 16)
## False
print (15 == 15 or "B"=="A")
## True
print (15 == 17 or "A"=="A")
## True

1.7. Conditional statements

A conditional statement is a statement that gives an algorithm the ability to make decisions (for instance: if it rains, then show me the TV program). Below is an if statement stating that if the number of views (we assume these are Instagram views, for instance) exceeds 15, then the user can be considered popular. Notice that, in Python, the body of the if statement is indicated by the indentation. Generally, four whitespaces are used for indentation and are preferred over the tab key (but you can also use the tab key). The body starts with an indentation and the first unindented line marks the end.

num_views = 16
if num_views > 15:
  print("You are popular!")
## You are popular!

We can add an else statement to explicitly consider the case where the number of views does not exceed 15:

num_views = 16
if num_views > 15:
  print("You are popular!")
else:
  print("You are not popular!")
## You are popular!

If we want to consider several alternatives, we can use elif statements (stands for else if):

num_views = 16
if num_views<=10:
  print("You are really not popular!")
elif num_views>10 and num_views<=15:
  print("You are not popular!")
elif num_views>15 and num_views<=20:
  print("You are popular!")
elif num_views>20:
  print("You are really popular!")
## You are popular!

1.8. Loops

With loops, we can have the algorithm keep doing something while a condition holds. In the below example, we use a while loop that reduces the value of the variable speed as long as the condition (speed value > 30) is met:

speed = 64
while speed > 30: 
  print("Slow down! Your speed is:", speed)
  speed = speed - 7
## Slow down! Your speed is: 64
## Slow down! Your speed is: 57
## Slow down! Your speed is: 50
## Slow down! Your speed is: 43
## Slow down! Your speed is: 36

We can combine a while loop with a break. For instance, in the below example we use a while loop to increase the variable speed as long as speed is greater than 30. We add a break command, so that when the speed exceeds 80, we break the loop and stop the algorithm.

speed = 35
while speed > 30:
  speed = speed+3
  print("Your speed is:", speed)
  if speed > 80: 
    print("Speed is too high, we stop and break the loop.")
    break
## Your speed is: 38
## Your speed is: 41
## Your speed is: 44
## Your speed is: 47
## Your speed is: 50
## Your speed is: 53
## Your speed is: 56
## Your speed is: 59
## Your speed is: 62
## Your speed is: 65
## Your speed is: 68
## Your speed is: 71
## Your speed is: 74
## Your speed is: 77
## Your speed is: 80
## Your speed is: 83
## Speed is too high, we stop and break the loop.

There is also a for loop. The latter performs an action for each value in a list of values. For instance, below, we create a list of prime numbers, and we then have our code tell the machine, with a for loop, to print each of the prime numbers included in that list. Notice that there are several ways to make that happen:

primes = [2, 3, 5, 7, 11, 13]
for p in primes:
  print(p)
## 2
## 3
## 5
## 7
## 11
## 13
primes = [2, 3, 5, 7, 11, 13]
for i in range(len(primes)):
  print(primes[i])
## 2
## 3
## 5
## 7
## 11
## 13

1.9. Lists

A list is a compound data type: it allows us to group values, possibly of different types, together. To create a list, we use the following syntax:

my_list = [A, B, C, D]

For instance:

a = "Hello"
b = True
c = 100
d = 5.55
my_list = [a, b, c, d]
print(my_list)
## ['Hello', True, 100, 5.55]

We can also use directly:

my_list = ["Hello", True, 100,5.55]
print(my_list)
## ['Hello', True, 100, 5.55]

We may want to retrieve only a portion of a list. We say that we subset the list. To do so, we use the implicit index of the list: the first item has index 0, the second item has index 1, and so on. Importantly, notice that the index starts at 0. Hence, to know what the first item of a list is, we go:

my_list = ["Paul", "Marc", "Agatha", "Sophie"]
print(my_list[0])
## Paul

To know what the third one is, we go:

my_list = ["Paul", "Marc", "Agatha", "Sophie"]
print(my_list[2])
## Agatha

We know now how to select single values from a list. It is also possible to select multiple elements from it. We use the following syntax:

my_list[start:end]

Notice that the start index is included, while the end index is not. To retrieve the first two items, we go:

my_list = ["Paul", "Marc", "Agatha", "Sophie"]
print(my_list[0:2])
## ['Paul', 'Marc']

To retrieve the last two items, we go:

my_list = ["Paul", "Marc", "Agatha", "Sophie"]
print(my_list[2:4])
## ['Agatha', 'Sophie']

If we do not specify a start index, Python figures out that we want to start at the beginning of our list (index = 0). If we do not specify an end index, Python figures out that we want to go all the way to the end of our list. See:

my_list = ["Paul", "Marc", "Agatha", "Sophie"]
print(my_list[1:])
## ['Marc', 'Agatha', 'Sophie']
my_list = ["Paul", "Marc", "Agatha", "Sophie"]
print(my_list[:2])
## ['Paul', 'Marc']

To replace some of the list elements (for instance to correct errors), we can subset the list and assign new values to the subset. For instance:

my_list = ["Paul", "Marc", "Agatha", "Sophie"]
my_list[0] = "Jean-Paul"
my_list[2:] = ["Agathe", "Sophia"]
print(my_list)
## ['Jean-Paul', 'Marc', 'Agathe', 'Sophia']

To add an element (or several ones) to a list, we can use the + operator:

my_list = ["Paul", "Marc", "Agatha", "Sophie"]
my_list = my_list + ["Bruno", "Claire"]
print(my_list)
## ['Paul', 'Marc', 'Agatha', 'Sophie', 'Bruno', 'Claire']

To remove elements from a list, we use the del() function. For example, in the below code, we delete the first element of the list and then show the list. As soon as we remove an element from a list, the indexes of the elements that come after the deleted element are updated.

my_list = ["Paul", "Marc", "Agatha", "Sophie"]
del(my_list[0])
print(my_list)
## ['Marc', 'Agatha', 'Sophie']
print(my_list[0])
## Marc

We can sort the elements of a list. To that end, we use the function sorted(). This function returns a new list containing all items in ascending order. By typing in the console help(sorted), we can obtain even more information on this function, for instance how to use a descending order:

my_list = ["Paul", "Marc", "Agatha", "Sophie"]
my_list = sorted(my_list)
print(my_list)
## ['Agatha', 'Marc', 'Paul', 'Sophie']
my_list = ["Paul", "Marc", "Agatha", "Sophie"]
my_list = sorted(my_list, reverse=True)
print(my_list)
## ['Sophie', 'Paul', 'Marc', 'Agatha']

The following methods are available for lists:

  1. index(): to get the index of the first element of a list that matches its input.
  2. count(): to get the number of times an element appears in a list.
  3. append(): that adds an element to the list it is called on.
  4. remove(): that removes the first element of a list that matches the input.
  5. reverse(): that reverses the order of the elements in the list it is called on.

We have already seen the function del(); there are often several ways to achieve the same result.

my_list = ["Paul", "Marc", "Agatha", "Sophie"]
my_list.index("Paul")
## 0
my_list = ["Paul", "Marc", "Paul", "Sophie"]
my_list.count("Paul")
## 2
my_list = ["Paul", "Marc", "Agatha", "Sophie"]
my_list.append("Robert")
print(my_list)
## ['Paul', 'Marc', 'Agatha', 'Sophie', 'Robert']
my_list = ["Paul", "Marc", "Agatha", "Sophie"]
my_list.remove("Paul")
print(my_list)
## ['Marc', 'Agatha', 'Sophie']
my_list = ["Paul", "Marc", "Agatha", "Sophie"]
my_list.reverse()
print(my_list)
## ['Sophie', 'Agatha', 'Marc', 'Paul']

1.10. Dictionaries

Dictionaries are used to store data values in key:value pairs. A dictionary is a collection which is ordered, changeable, and does not allow duplicate keys. Dictionaries are written with curly brackets, and have keys and values. We create a dictionary as follows:

my_dictionary = {"France":"Paris", "Italy":"Roma", "Germany" :"Berlin"}
print(my_dictionary)
## {'France': 'Paris', 'Italy': 'Roma', 'Germany': 'Berlin'}

To access a value, we refer to its key in the dictionary:

my_dictionary["key"]

my_dictionary["France"]
## 'Paris'

We can also assign a new value to an existing key. To add a new key-value pair to our dictionary, we go:

my_dictionary = {"France":"Paris", "Italy":"Roma", "Germany" :"Berlin"}
my_dictionary['Poland'] = 'Warsaw'
print(my_dictionary)
## {'France': 'Paris', 'Italy': 'Roma', 'Germany': 'Berlin', 'Poland': 'Warsaw'}

To remove an element from a dictionary, we can use the function del():

my_dictionary = {"France":"Paris", "Italy":"Roma", "Germany" :"Berlin"}
del(my_dictionary["France"])
print(my_dictionary)
## {'Italy': 'Roma', 'Germany': 'Berlin'}
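
Since we already know the for loop, we can also loop over the key-value pairs of a dictionary; a small sketch using the items() method:

my_dictionary = {"France":"Paris", "Italy":"Roma", "Germany":"Berlin"}
for country, capital in my_dictionary.items():
  print(country, capital)
## France Paris
## Italy Roma
## Germany Berlin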

1.11. Manipulating strings

Methods are available to manipulate strings. Among others:

  1. replace(): returns a string where a specified value is replaced with another specified value.
  2. find(): searches the string for a specified value and returns the position where it was found.
  3. count(): returns the number of non-overlapping occurrences of a specific substring in the string.

For an overview of the other available methods, visit this page.

Consider the following string:

my_string= "This is my sentAnce"
print(my_string)
## This is my sentAnce

We can replace the “A” with “e” to correct the typo.

my_string_corrected=my_string.replace("A","e")
print(my_string_corrected)
## This is my sentence

We can find the typo, delete it, and then introduce an “e” (less optimal):

my_string= "This is my sentAnce"
find_pb=my_string.find("A")
print(find_pb)
## 15
my_string_corrected=my_string[0:find_pb]+"e"+my_string[(find_pb+1):]
print(my_string_corrected)
## This is my sentence

To know how many “s” there are in a string, we use the count() method:

my_string = "Is it a success?"
print(my_string.count("s"))
## 4
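
Notice that find() returns -1 when the value is not found; a quick sketch:

my_string = "Is it a success?"
print(my_string.find("z"))
## -1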

1.12. Packages

A key feature of Python is its packages. Packages are an essential building block in programming. They are libraries of functions that we can import into Python and use to perform specific tasks (e.g., call financial functions, run a regression analysis, plot a world map, get a list of tweets from Twitter…). Without packages, we would need to spend a lot of time writing code that has already been written. The syntax to import a package so we can use it in our code is (at the beginning of our script):

import package_name as pick_a_name

Let us say we need to use the constant pi. We are lucky, there is a package called math that already contains mathematical variables and functions useful to do mathematical operations. We import it into Python as follows:

import math as mt

We can then refer to the variables from the math package and call its functions, for instance:

print(mt.pi)
## 3.141592653589793

For each package there is a documentation available. For math, we can access it online here.
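
The math package contains many other functions; as a small example, we can compute a square root:

print(mt.sqrt(16))
## 4.0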

2. Financial models

2.1 Mortgage calculator

A first use of Python in finance is to design simple financial models. We want to create a mortgage calculator. To build such a model, we need financial functions and a user-friendly interface. We want the model to use the user's inputs (interest rate, home price, down payment, maturity) to generate some outputs (monthly payment, principal paid, interest paid, total paid).

As a benchmark, consider this mortgage calculator for instance.

We assume we know how to calculate the annuity of a mortgage (monthly payment).

Let us assume that Rose wants to buy a house that costs 100,000 euros. The annual interest rate is 5% a year. She already has some cash for a down payment (50,000), which means that she will have to borrow only 50,000. She wants to borrow money with a maturity of 10 years.

Recall that, by construction, for a loan, the amount borrowed (principal) has to be equal to the present value of all the monthly payments of the borrower: \[\begin{array}{ccc} Principal = Monthly Payment \times (1-(1+Monthly Interest)^{-Number Months})/Monthly Interest \end{array}\]

How do we find the monthly payment? It is equal to (we derive it from the annuity formula): \[\begin{array}{ccc} Monthly Interest = 0.05/12 = 0.004166 \\ Number Months = 12 \times 10 = 120 \\ Monthly Payment= 50,000/((1-1.004166^{(-120)})/0.004166) = 530.31 € \end{array}\]

Once we have calculated the monthly payment, we can extrapolate how much the borrower will pay to the bank in total: \[\begin{array}{ccc} Total Payment = 120*530.31 = 63,637 € \end{array}\]

Then we can separate what corresponds to the payment of the principal to what corresponds to the payment of the interests: \[\begin{array}{ccc} Principal Paid = 50,000 € \\ Total Interest paid = 63,637 - 50,000 = 13,637€ \end{array}\]
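
If we want, we can already check the pen & paper numbers with plain Python (a small sketch; the slight difference with the 530.31 above comes from rounding the monthly interest in the manual calculation):

need_to_borrow = 50000
interest_monthly = 0.05/12
nb_months = 10*12
monthly_payment = need_to_borrow/((1-(1+interest_monthly)**(-nb_months))/interest_monthly)
print(round(monthly_payment, 2))
## 530.33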

Now, in Python, ideally, we want a package to make the annuity calculation, so that it saves us some heavy computation. We install and import numpy_financial to get access to annuity related calculations.

If not already installed, to install the package, we execute Anaconda Prompt (Anaconda 3) and type:

pip install numpy_financial

To import it into our script, we type as the beginning of the script:

import numpy_financial as nf

To check what we can do with the pmt() function of numpy_financial, we can write in the console:

help(nf.pmt)

The pmt() function computes the payment against loan principal plus interest. It takes the following arguments: pmt(rate, nb periods, principal). We give a first go with default values, without any user interface:

# we import numpy_financial package
import numpy_financial as nf

#user inputs - we use default values for the moment
price_house = 100000
interest_rate_yearly=0.05
down_payment=50000
need_to_borrow=price_house-down_payment
maturity_in_years=10

#computation of monthly payment and other outputs 
interest_monthly=interest_rate_yearly/12
nb_payments_in_month=maturity_in_years*12
need_to_borrow=price_house-down_payment
monthly_repayment = -nf.pmt(interest_monthly, nb_payments_in_month, need_to_borrow)
total_paid = (monthly_repayment*maturity_in_years*12)
principal_paid = (need_to_borrow)
interest_paid = (monthly_repayment*nb_payments_in_month-need_to_borrow)

#we want to show rounded figures
monthly_repayment = round(monthly_repayment)
total_paid = round(total_paid)
principal_paid = round(principal_paid)
interest_paid = round(interest_paid)

#outputs we want to show to the user
print("monthly repayment: ", monthly_repayment)
## monthly repayment:  530
print("total paid:",total_paid)
## total paid: 63639
print("total principal paid :",principal_paid)
## total principal paid : 50000
print("total interest paid:", interest_paid)
## total interest paid: 13639

Notice that, if we want, we can create a user interface so that the user of the model can enter her/his own values. To create a GUI (graphical user interface), there is also a package in Python called PySimpleGUI. We install and import it as follows:

To install the package, we execute Anaconda Prompt (Anaconda 3) and type:

pip install PySimpleGUI

To import it into our script, we type at the beginning of it:

import PySimpleGUI as sg

A commented implementation of the model using a GUI (with this package) is available on Blackboard under financial_model_with_GUI.py. In your own time, study it and try to replicate it altering the elements of your choice. Get to understand how to add/remove text and input boxes, how the model can react to the user’s inputs and update the fields in the output section. Detailed information about how to use PySimpleGUI can be found here.

2.2. If you are stuck…

You are about to tackle an exercise. When coding on your own, you will make mistakes and get stuck, that is how one learns programming. When this happens, I recommend going through these steps:

  • Check list:
    • Most of the time, functions or variables are misspelled; this is the first thing to double-check.
    • When you are stuck, read the error messages, they give you clues, google them if needed.
    • Use Python’s help to see which functions are available and what they do, how to call them, what type of arguments they require and so on…
    • Search for alternative ways to achieve the same goal.
    • Make use of the print() function to see what your variables contain, whether this is what you expect.
    • If a solution file is provided, carefully compare your code to it to spot your mistakes or clear up a potential misunderstanding.
    • Python is open-source. Sharing coding issues and providing solutions is central in the community. Refer to forums and other Python websites to see whether someone has already encountered a similar issue and check the answers provided by the community.
    • As a last resort, contact your lecturer - if you do so without carefully considering the above options, you cannot expect an answer.

The way to go: break a problem into small logical steps and, for each step, do some research on the best way to achieve it with Python. For instance, if we want to find a function that compares two strings to know whether they match, we google it, find the right package/function to do so, and then use Python's help to get to know how to use the function.

A sustainable and efficient way to learn coding is to develop the ability to find help on the internet and correct your mistakes by yourself. When you work on your own applications or develop one for a company, you cannot expect someone to help you every time you are stuck. This is key for you to continue improving your programming skills once you know the basics and are done with this course.

At this stage, we can easily create an executable file, for instance with the package PyInstaller. For more information, please check this link.
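
As a rough sketch, assuming the script we want to package is the financial_model_with_GUI.py file mentioned above, we would type in the Anaconda Prompt:

pip install pyinstaller
pyinstaller --onefile financial_model_with_GUI.py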

2.3. Exercise

Find below the information needed to build a pension simulator.

Create a model in which a user can select her current age and simulate how much she could get from a retirement account. We assume a retirement age of 62. We assume that, at the end of each year until the retirement age, she will make a deposit into the retirement account (we assume this amount to be constant over time). The user has to indicate how much she thinks the account will earn per year (for instance 3%). After retirement at age 62, the user must indicate how many more years she anticipates living to enjoy the money she has set aside in her retirement account. The model will tell the user how much she can withdraw yearly over the duration of her retirement. The account balance will continue to earn the same interest as before.

To help you designing the model, find below a pen & paper solution:

Let us use the following inputs:

  1. Current age: 25
  2. Annual deposits on the retirement account: 10,000€
  3. Annual interest rate earned on the retirement account: 4%
  4. Life expectancy: 85
  5. Retirement age: 62

In finance terms, we want to find the periodic payment of an annuity. The latter starts somewhere in the future, after the retirement account has been loaded with money and has earned some interest.

We first compute the future value of the retirement account by the time the user will reach her retirement age: \[\begin{array}{ccc} PV(RetirementAccount)=10,000({(1-(1/(1.04)^{(62-25)}))}/0.04)=191,426€ \\ FV(RetirementAccount)=191,426(1.04)^{(62-25)}=817,023€ \\ \end{array}\]

Then, we want to know how much the user can withdraw from that account each year once she will be 62, knowing that she will withdraw money yearly until the age of 85. \[\begin{array}{ccc} PV(Withdrawals)=C((1-(1/(1.04)^{(85-62)} ))/0.04)=817,023€\\ C=817,023/(((1-(1/(1.04)^{(85-62)}))/0.04) )=54,993€ \\ \end{array}\]

Based on these inputs, the user can expect a yearly pension of 54,993€.

You must build an appropriate financial model in Python based on this set of instructions. A hidden solution code is provided below (not for the GUI part but for the computation one - try to do it by yourself first).

Solution without using numpy_financial (with raw computations):


#user inputs - we use default values for the moment
current_age = 25
annual_deposit=10000
annual_interest_rate=0.04
retirement_age=62
life_expectancy=85

#computation of outputs 
present_value_retirement_account=annual_deposit*((1-(1/(1+annual_interest_rate)**(retirement_age-current_age)))/annual_interest_rate)
future_value_retirement_account=present_value_retirement_account*(1+annual_interest_rate)**(retirement_age-current_age)
withdrawal=future_value_retirement_account/(((1-(1/(1+annual_interest_rate)**(life_expectancy-retirement_age)))/annual_interest_rate))

#we want to show rounded figures
future_value_retirement_account = round(future_value_retirement_account)
withdrawal=round(withdrawal)

#output we want to show to the user
print("User will have on his/her account by the age of retirement: ", future_value_retirement_account, "euros.\n", "It will allow the user to withdraw per year until death:", withdrawal, "euros")

Solution using numpy_financial:

# we import numpy_financial package
import numpy_financial as nf

#user inputs - we use default values for the moment
current_age = 25
annual_deposit=10000
annual_interest_rate=0.04
retirement_age=62
life_expectancy=85

#computation
present_value_retirement_account = -nf.pv(annual_interest_rate,(retirement_age-current_age),annual_deposit,fv=0)
future_value_retirement_account= -nf.fv(annual_interest_rate,(retirement_age-current_age),annual_deposit,pv=0)
withdrawal=-nf.pmt(annual_interest_rate, (life_expectancy-retirement_age), future_value_retirement_account)

#we want to show rounded figures
future_value_retirement_account = round(future_value_retirement_account)
withdrawal=round(withdrawal)

#output we want to show to the user
print("User will have on his/her account by the age of retirement: ", future_value_retirement_account, "euros.\n", "It will allow the user to withdraw per year until death:", withdrawal, "euros")

3. Arrays and dataframes

3.1. Numpy arrays

We import the package Numpy. Numpy is the core library for scientific computing in Python. It provides a high-performance multidimensional array object, and tools for working with these arrays. A numpy array is a grid of values, all of the same type, and is indexed.

Lists have limitations that we can overcome with numpy arrays. For instance, we cannot multiply the elements of two different lists element by element. If you try to run the below code, it is not going to work.

movie_ticker_price_list = [5.0, 6.0, 7.0, 5.0]
movie_nb_visitors_list = [100000, 20000, 300000, 140000]
movie_revenue_list=movie_ticker_price_list*movie_nb_visitors_list
print(movie_revenue_list)

We can overcome this by using numpy arrays. We import numpy into Python and turn the lists into numpy arrays, we then can perform the calculation we are after:

import numpy as np

movie_ticker_price_list = [5.0, 6.0, 7.0, 5.0]
movie_ticker_price_array = np.array(movie_ticker_price_list)

movie_nb_visitors_list = [100000, 20000, 300000, 140000]
movie_nb_visitors_array = np.array(movie_nb_visitors_list)

movie_revenue_array=movie_ticker_price_array*movie_nb_visitors_array
print(movie_revenue_array)
## [ 500000.  120000. 2100000.  700000.]

We can subset an array as we do for a list:

movie_ticker_price_list = [5.0, 6.0, 7.0, 5.0]
movie_ticker_price_array = np.array(movie_ticker_price_list)
print(movie_ticker_price_array[0])
## 5.0

We can run a test on the elements of an array, for instance greater or equal to 6:

movie_ticker_price_list = [5.0, 6.0, 7.0, 5.0]
movie_ticker_price_array = np.array(movie_ticker_price_list)
print(movie_ticker_price_array>=6)
## [False  True  True False]

To retrieve only the elements of an array that meet a condition, we go:

movie_ticker_price_list = [5.0, 6.0, 7.0, 5.0]
movie_ticker_price_array = np.array(movie_ticker_price_list)
print(movie_ticker_price_array[movie_ticker_price_array>=6])
## [6. 7.]

We can compute basic statistics with numpy, such as the mean, median, and standard deviation of a variable. We carry on with our example of movie ticket prices.

movie_ticker_price_list = [5.0, 6.0, 7.0, 5.0]
movie_ticker_price_array = np.array(movie_ticker_price_list)
movie_nb_visitors_list = [100000, 20000, 300000, 140000]
movie_nb_visitors_array = np.array(movie_nb_visitors_list)
movie_revenue_array=movie_ticker_price_array*movie_nb_visitors_array
print(movie_revenue_array)
## [ 500000.  120000. 2100000.  700000.]
print(np.mean(movie_revenue_array))
## 855000.0
print(np.median(movie_revenue_array))
## 600000.0
print(np.std(movie_revenue_array))
## 748381.5871599193
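
numpy also offers other aggregation functions; a quick sketch with the minimum, maximum, and sum of the same array:

print(np.min(movie_revenue_array))
## 120000.0
print(np.max(movie_revenue_array))
## 2100000.0
print(np.sum(movie_revenue_array))
## 3420000.0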

3.2. Pandas dataframes

To prepare this section, feel free to check the following DataCamp introductions:

  1. data-manipulation-with-pandas.
  2. joining-data-with-pandas.

Pandas is an open source library providing high-performance, easy-to-use data structures and data analysis tools for Python. The DataFrame is one of Pandas’ most important data structures. It’s basically a way to store tabular data where you can label the rows and the columns. Data can be of different types.

In what follows, we will create a dataframe out of lists. We will be working with vehicle data from different countries. Each observation corresponds to a country and the columns give information about the number of vehicles per capita and whether people drive on the left or on the right. We create three lists and combine them into a dataframe with pandas. We will use the function zip() to combine the lists together in a dataframe. Notice that, as usual, we start our script by importing the package we will use, namely pandas, in order to deal with dataframes here.

import pandas as pd
#we create lists
names = ['United States', 'Australia', 'Japan', 'India', 'Russia', 'Morocco', 'Egypt']
drive_on_right =  [True, False, False, False, True, True, True]
nb_vehicle_per_capita = [809, 731, 588, 18, 200, 70, 45]
#we combine the lists in a dataframe using the zip function
mydata = pd.DataFrame(zip(names, drive_on_right,nb_vehicle_per_capita)) 
print(mydata)
##                0      1    2
## 0  United States   True  809
## 1      Australia  False  731
## 2          Japan  False  588
## 3          India  False   18
## 4         Russia   True  200
## 5        Morocco   True   70
## 6          Egypt   True   45

We can specify the names of the columns when we create the dataframe:

mydata = pd.DataFrame(list(zip(names, drive_on_right,nb_vehicle_per_capita)), columns=["Countries","Drive_Right_Dummy","Nb_Vehicle_Per_Capita"]) 
print(mydata)
##        Countries  Drive_Right_Dummy  Nb_Vehicle_Per_Capita
## 0  United States               True                    809
## 1      Australia              False                    731
## 2          Japan              False                    588
## 3          India              False                     18
## 4         Russia               True                    200
## 5        Morocco               True                     70
## 6          Egypt               True                     45

We can also specify the names of the rows (called the index):

mydata.index=["obs_1","obs_2","obs_3","obs_4","obs_5","obs_6","obs_7",]
print(mydata)
##            Countries  Drive_Right_Dummy  Nb_Vehicle_Per_Capita
## obs_1  United States               True                    809
## obs_2      Australia              False                    731
## obs_3          Japan              False                    588
## obs_4          India              False                     18
## obs_5         Russia               True                    200
## obs_6        Morocco               True                     70
## obs_7          Egypt               True                     45

Using column and index names or numbers, we can subset some elements of a dataframe:

print(mydata["Countries"])
## obs_1    United States
## obs_2        Australia
## obs_3            Japan
## obs_4            India
## obs_5           Russia
## obs_6          Morocco
## obs_7            Egypt
## Name: Countries, dtype: object
print(mydata.Countries)
## obs_1    United States
## obs_2        Australia
## obs_3            Japan
## obs_4            India
## obs_5           Russia
## obs_6          Morocco
## obs_7            Egypt
## Name: Countries, dtype: object
print(mydata["Countries"]["obs_1"])
## United States
print(mydata.Countries[0])
## United States
print(mydata.Countries[0:3])
## obs_1    United States
## obs_2        Australia
## obs_3            Japan
## Name: Countries, dtype: object

We can describe the variables enclosed in a dataframe using the method describe(). It generates descriptive statistics:

mydata.describe()
##        Nb_Vehicle_Per_Capita
## count               7.000000
## mean              351.571429
## std               345.595552
## min                18.000000
## 25%                57.500000
## 50%               200.000000
## 75%               659.500000
## max               809.000000
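
By default, describe() only summarizes the numeric columns. If we also want counts and frequencies for the non-numeric columns, we can pass the parameter include='all' (a small sketch, output not shown here):

mydata.describe(include='all')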

We can get a snapshot of a dataset contained in the dataframe by using the method head(). It returns the first n rows (5 by default, otherwise we have to specify a value for n):

mydata.head(n=3)
##            Countries  Drive_Right_Dummy  Nb_Vehicle_Per_Capita
## obs_1  United States               True                    809
## obs_2      Australia              False                    731
## obs_3          Japan              False                    588

Similarly, we can have a snapshot of the tail of the dataframe by calling the method tail():

mydata.tail(n=3)
##       Countries  Drive_Right_Dummy  Nb_Vehicle_Per_Capita
## obs_5    Russia               True                    200
## obs_6   Morocco               True                     70
## obs_7     Egypt               True                     45

Most of the time, the raw data will come in the form of a csv (comma-separated values) file - such data follows a regular structure. Either we have created the file ourselves or we got it from a web source. To import csv data into Python as a dataframe, we use the pandas function read_csv(). See the below example (you can find the csv file on Blackboard under mkt_to_gdp_2020.csv):

import pandas as pd
data = pd.read_csv("mkt_to_gdp_2020.csv",sep=";")
data.head()
##   Country Name  mkt_cap_per_GDP
## 0    Australia       129.277555
## 1      Austria        30.791120
## 2   Bangladesh        27.687481
## 3     Barbados        63.690986
## 4       Brazil        68.412236

We are going to see how to use the following dataframe functions and methods:

  1. groupby(): Group dataframe by a series of columns. More information there.

  2. merge(): Merge dataframes or named series objects with a database-style join. More information there.

  3. loc(): Helps to access a group of rows and columns in a dataset, a slice of the dataset. More information there.

  4. sort_values(): Sort by the values along either axis. More information there.

  5. unique(): Returns the unique values of a column, in order of appearance. More information there.

  6. transpose(): Used to transpose index and columns. More information there.

  7. pivot_table(): Returns a reshaped DataFrame organized by given index / column values. More information there.

  8. rename(): Alter axes labels. More information there.

To review these functions and methods, we consider climate-related disasters data over 1900-2021 from this public source. You can find the csv file on Blackboard under disasters.csv. The below code is also available on Blackboard (the Python file disasters.py).

import pandas as pd
#ISO 8859-1 encodes what it refers to as "Latin alphabet no. 1", consisting of 191 characters from the Latin script. This is to get the country names right.
data = pd.read_csv("disasters.csv",sep=";", encoding = "ISO-8859-1")
data.head()
##    Year Disaster Group  ... Continent Total Deaths
## 0  1900        Natural  ...    Africa      11000.0
## 1  1900        Natural  ...      Asia    1250000.0
## 2  1901  Technological  ...    Europe         18.0
## 3  1902        Natural  ...  Americas       2000.0
## 4  1902        Natural  ...  Americas       1000.0
## 
## [5 rows x 9 columns]
data.info()
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 25244 entries, 0 to 25243
## Data columns (total 9 columns):
## Year                 25244 non-null int64
## Disaster Group       25244 non-null object
## Disaster Subgroup    25244 non-null object
## Disaster Type        25244 non-null object
## Disaster Subtype     22105 non-null object
## Country              25244 non-null object
## Region               25244 non-null object
## Continent            25244 non-null object
## Total Deaths         19956 non-null float64
## dtypes: float64(1), int64(1), object(7)
## memory usage: 1.7+ MB
data.describe()
##                Year  Total Deaths
## count  25244.000000  1.995600e+04
## mean    1997.501070  1.926358e+03
## std       19.080074  6.296132e+04
## min     1900.000000  1.000000e+00
## 25%     1990.000000  1.100000e+01
## 50%     2002.000000  2.100000e+01
## 75%     2010.000000  4.900000e+01
## max     2021.000000  5.000000e+06

What are the broad categories of disaster?

print(data['Disaster Group'].unique())
## ['Natural' 'Technological' 'Complex Disasters']

To restrict our data to observations with a specific value in one of the columns, we use the syntax:

new_data_frame = my_data_frame[my_data_frame['column_name'] == "XXXX"]

To only consider natural disasters:

data2 = data[data['Disaster Group'] == 'Natural']
data2.head()
##    Year Disaster Group  ... Continent Total Deaths
## 0  1900        Natural  ...    Africa      11000.0
## 1  1900        Natural  ...      Asia    1250000.0
## 3  1902        Natural  ...  Americas       2000.0
## 4  1902        Natural  ...  Americas       1000.0
## 5  1902        Natural  ...  Americas       6000.0
## 
## [5 rows x 9 columns]

To create a smaller dataframe, we can use the syntax:

new_data_frame = my_data_frame[["name_column_1", "name_column_2"]]

To group observations by a category, and then run actions at the group level, we use the syntax:

new_data_frame = my_data_frame.groupby(['Category']).action_we_want()

We then want to know the total number of deaths caused by the recorded disasters per country over the entire period. First, we reduce the set of columns to the ones of interest, then we group the observations by country using groupby() and sum the deaths.

#first we restrict our dataframe to specific columns, notice how we do it:
data3 = data2[['Total Deaths', 'Country']]
#we group observations by country and compute the sum of total deaths across countries:
data3 = data3.groupby(['Country']).sum()
data3.head()
##                 Total Deaths
## Country                     
## Afghanistan          25197.0
## Albania                803.0
## Algeria              11874.0
## American Samoa         130.0
## Angola                6222.0

Oftentimes, we will want to change the value of some of the rows of a column in a dataframe conditional on something. To do that, the best practice is to combine pandas and numpy and make use of the numpy function where(). If we want to turn the names of countries (the index values here) to uppercase when there are at least 1000 deaths caused by disasters in the country, we go:

import numpy as np
data3.index=np.where(data3['Total Deaths']>=1000, data3.index.str.upper(), data3.index)
data3[0:12]

We can then show the 20 countries most affected by natural disasters in terms of total deaths:

data4 = data3.sort_values(by=['Total Deaths'], ascending=False)
data4[:20]
##                             Total Deaths
## Country                                 
## China                         12521858.0
## India                          9133229.0
## Soviet Union                   3857096.0
## Bangladesh                     2993988.0
## Ethiopia                        416530.0
## Haiti                           252595.0
## Indonesia                       242591.0
## Japan                           239374.0
## Uganda                          205073.0
## Niger (the)                     195935.0
## Pakistan                        178860.0
## Sudan (the)                     163827.0
## Iran (Islamic Republic of)      163353.0
## Myanmar                         146894.0
## Italy                           140565.0
## Mozambique                      107060.0
## Turkey                           97086.0
## Peru                             97029.0
## Cabo Verde                       85295.0
## Guatemala                        80081.0

For a given country, we can show, over time, the total deaths associated with natural disasters by disaster type. Let us do it for France:

data5 = data2[data2['Country'] == "France"]
data5 = data5[['Total Deaths', 'Disaster Type', 'Year']]
data5 = data5.pivot_table('Total Deaths', index='Year', columns='Disaster Type', aggfunc=sum)
data5.head()
## Disaster Type  Drought  Earthquake  Epidemic  ...  Landslide  Storm  Wildfire
## Year                                          ...                            
## 1909               NaN        46.0       NaN  ...        NaN    NaN       NaN
## 1926               NaN         NaN       NaN  ...       28.0    NaN       NaN
## 1930               NaN         NaN       NaN  ...       40.0    NaN       NaN
## 1932               NaN         NaN       NaN  ...       30.0    NaN       NaN
## 1949               NaN         NaN       NaN  ...        NaN    NaN      80.0
## 
## [5 rows x 8 columns]
data5.tail()
#NaN stands for Not a Number; it is a member of a numeric data type that can be interpreted as a value that is undefined or unrepresentable. 
#For our data: to be interpreted as the absence of this disaster type in a given year.
## Disaster Type  Drought  Earthquake  Epidemic  ...  Landslide  Storm  Wildfire
## Year                                          ...                            
## 2017               NaN         NaN       NaN  ...        NaN    2.0       0.0
## 2018               NaN         NaN       NaN  ...        NaN    6.0       NaN
## 2019               NaN         0.0       NaN  ...        NaN    8.0       NaN
## 2020               NaN         NaN       NaN  ...        NaN   18.0       NaN
## 2021               NaN         NaN       NaN  ...        NaN    0.0       NaN
## 
## [5 rows x 8 columns]

We have done it using pivot_table(), which creates one column per disaster type. We can also use groupby(), but the format of the outcome is different: it creates one row per disaster type and year pair:

data5 = data2[data2['Country'] == "France"]
data5 = data5[['Total Deaths', 'Disaster Type', 'Year']]
data5 = data5.groupby(['Disaster Type', 'Year']).sum()
data5.head()
##                     Total Deaths
## Disaster Type Year              
## Drought       1982           0.0
##               1989           0.0
##               1991           0.0
##               1997           0.0
## Earthquake    1909          46.0
data5.tail()
##                     Total Deaths
## Disaster Type Year              
## Wildfire      1999           0.0
##               2003           5.0
##               2005           2.0
##               2009           0.0
##               2017           0.0
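
Note that groupby() places the grouping columns in the index of the result. If we prefer to keep them as regular columns, we can call the reset_index() method afterwards (a quick sketch, output not shown here):

data5 = data5.reset_index()
data5.head()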

We can also sum the information by disaster type for the entire period; here we do it with groupby() because we want disaster types as rows:

data6 = data2[data2['Country'] == "France"]
data6 = data6[['Total Deaths', 'Disaster Type']]
data6 = data6.groupby(['Disaster Type']).sum()
data6
##                      Total Deaths
## Disaster Type                    
## Drought                       0.0
## Earthquake                   58.0
## Epidemic                     21.0
## Extreme temperature       27593.0
## Flood                       322.0
## Landslide                   297.0
## Storm                       461.0
## Wildfire                    112.0

Notice that we can transpose the resulting table using the transpose() method if we want to have rows as columns:

data6.transpose()
## Disaster Type  Drought  Earthquake  Epidemic  ...  Landslide  Storm  Wildfire
## Total Deaths       0.0        58.0      21.0  ...      297.0  461.0     112.0
## 
## [1 rows x 8 columns]

Now, let us assume we are after the number of deaths caused by storms in Iran by year:

#as before we select a subset of columns
data7 = data2[['Total Deaths', 'Disaster Type', 'Year', 'Country']]
# we filter out observations that correspond to the disaster type "Storm" and the country "Iran (Islamic Republic of)"
data7 = data7[(data7['Disaster Type'] == "Storm") & (data7['Country'] == "Iran (Islamic Republic of)")]
data7
##        Total Deaths Disaster Type  Year                     Country
## 2068           60.0         Storm  1972  Iran (Islamic Republic of)
## 2181           14.0         Storm  1975  Iran (Islamic Republic of)
## 3821          200.0         Storm  1981  Iran (Islamic Republic of)
## 6432           22.0         Storm  1991  Iran (Islamic Republic of)
## 6653            NaN         Storm  1992  Iran (Islamic Republic of)
## 6846            NaN         Storm  1993  Iran (Islamic Republic of)
## 8609            9.0         Storm  1996  Iran (Islamic Republic of)
## 10567           3.0         Storm  2000  Iran (Islamic Republic of)
## 14153           NaN         Storm  2004  Iran (Islamic Republic of)
## 14286           NaN         Storm  2005  Iran (Islamic Republic of)
## 17249          12.0         Storm  2007  Iran (Islamic Republic of)
## 17500          28.0         Storm  2008  Iran (Islamic Republic of)
## 22748           4.0         Storm  2016  Iran (Islamic Republic of)
## 24777          10.0         Storm  2021  Iran (Islamic Republic of)
## 24898           6.0         Storm  2021  Iran (Islamic Republic of)

We have not explained how the loc method works yet; we do it now. We go back to the dataframe with the information about natural disasters and aim to retrieve specific rows based on index names:

#we restart from data2
data8 = data2[['Total Deaths', 'Disaster Type', 'Year', 'Country']]

#we introduce restrictions
data8 = data8[data8['Year'] == 2020]
data8 = data8[data8['Country'] == "Brazil"]

# we change the index to disaster type
data8.index=data8['Disaster Type'] 

#then with loc we can find the information for specific disaster type in Brazil 2020
data8.loc[['Storm']]
##                Total Deaths Disaster Type  Year Country
## Disaster Type                                          
## Storm                  12.0         Storm  2020  Brazil

Even if the index is not based on row numbers, we can force a reference to the row number by using iloc():

data8.iloc[0]
## Total Deaths         12
## Disaster Type     Storm
## Year               2020
## Country          Brazil
## Name: Storm, dtype: object

To have a better view of the casualties per country, we need to express the number of deaths as a ratio of the country's population (we use the population as of 2020). We retrieve population data from the World Bank as of 2020 and import it into a dataframe. You can find the file pop_2020.csv on Blackboard.

pop = pd.read_csv("pop_2020.csv",sep=",")
pop.head()
##    Unnamed: 0    Country Name  Population
## 0           0     Afghanistan  38928341.0
## 1           1         Albania   2837743.0
## 2           2         Algeria  43851043.0
## 3           3  American Samoa     55197.0
## 4           4         Andorra     77265.0

We change the name of the country column in the population dataset and set it to “Country”:

pop = pop.rename(columns={"Country Name": "Country"})
#we concentrate on the two following columns
pop=pop[["Country","Population"]]
pop.head()
##           Country  Population
## 0     Afghanistan  38928341.0
## 1         Albania   2837743.0
## 2         Algeria  43851043.0
## 3  American Samoa     55197.0
## 4         Andorra     77265.0

We then compute the total number of deaths per country caused by Earthquakes between 2000 and 2021.

data3 = data2[(data2['Disaster Type'] == "Earthquake") & (data2['Year'] >= 2000)]
data3 = data3[['Total Deaths', 'Country']]
data3 = data3.groupby(['Country']).sum()
data3.head()
##                 Total Deaths
## Country                     
## Afghanistan           1453.0
## Albania                 51.0
## Algeria               2285.0
## American Samoa          34.0
## Argentina                0.0

We merge the disaster data with the population data (two different dataframes) by calling the pandas function merge():

#when using the merge function, we have to indicate the dataframes we want to merge and then on what variable (name of the column) we want to merge them
#we set the parameter how to 'inner': this means that we keep only the intersection of keys from both dataframes
merged_data=pd.merge(data3,pop,left_on="Country",right_on="Country",how='inner')
merged_data.head()
##           Country  Total Deaths  Population
## 0     Afghanistan        1453.0  38928341.0
## 1         Albania          51.0   2837743.0
## 2         Algeria        2285.0  43851043.0
## 3  American Samoa          34.0     55197.0
## 4       Argentina           0.0  45376763.0

We then can create a new variable. We express the number of deaths as a ratio of the country’s population (as a percentage):

merged_data["deaths_to_pop"]=(merged_data["Total Deaths"]/merged_data["Population"])*100
merged_data.describe()
##         Total Deaths    Population  deaths_to_pop
## count      70.000000  7.000000e+01      70.000000
## mean     9923.271429  7.488983e+07       0.036223
## std     36849.046239  2.336475e+08       0.236239
## min         0.000000  5.519700e+04       0.000000
## 25%         3.000000  4.507151e+06       0.000021
## 50%        13.500000  1.637578e+07       0.000314
## 75%       588.750000  4.694893e+07       0.003516
## max    225164.000000  1.402112e+09       1.974684

Now, we rank countries according to the total number of deaths from earthquakes expressed as a percentage of the population.

merged_data_sorted=merged_data.sort_values(by=['deaths_to_pop'], ascending=False)
merged_data_sorted[:20]
##             Country  Total Deaths    Population  deaths_to_pop
## 25            Haiti      225164.0  1.140253e+07       1.974684
## 60        Sri Lanka       35399.0  2.191900e+07       0.161499
## 52            Samoa         148.0  1.984100e+05       0.074593
## 30        Indonesia      180552.0  2.735236e+08       0.066010
## 3    American Samoa          34.0  5.519700e+04       0.061598
## 46         Pakistan       74391.0  2.208923e+08       0.033677
## 43            Nepal        8976.0  2.913681e+07       0.030806
## 38         Maldives         102.0  5.405420e+05       0.018870
## 19      El Salvador        1161.0  6.486201e+06       0.017900
## 33            Japan       20034.0  1.258360e+08       0.015921
## 62         Thailand        8347.0  6.979998e+07       0.011958
## 56  Solomon Islands          62.0  6.868780e+05       0.009026
## 64            Tonga           9.0  1.056970e+05       0.008515
## 13            China       92179.0  1.402112e+09       0.006574
## 2           Algeria        2285.0  4.385104e+07       0.005211
## 18          Ecuador         676.0  1.764306e+07       0.003832
## 0       Afghanistan        1453.0  3.892834e+07       0.003732
## 44      New Zealand         184.0  5.084300e+06       0.003619
## 12            Chile         613.0  1.911621e+07       0.003207
## 54       Seychelles           3.0  9.846200e+04       0.003047

We can run a test to know for which countries the ratio is greater than 1% and add that information to the dataset:

merged_data_sorted["Greater_than_1_pct"]=merged_data_sorted["deaths_to_pop"]>1
merged_data_sorted[:20]
##             Country  Total Deaths  ...  deaths_to_pop  Greater_than_1_pct
## 25            Haiti      225164.0  ...       1.974684                True
## 60        Sri Lanka       35399.0  ...       0.161499               False
## 52            Samoa         148.0  ...       0.074593               False
## 30        Indonesia      180552.0  ...       0.066010               False
## 3    American Samoa          34.0  ...       0.061598               False
## 46         Pakistan       74391.0  ...       0.033677               False
## 43            Nepal        8976.0  ...       0.030806               False
## 38         Maldives         102.0  ...       0.018870               False
## 19      El Salvador        1161.0  ...       0.017900               False
## 33            Japan       20034.0  ...       0.015921               False
## 62         Thailand        8347.0  ...       0.011958               False
## 56  Solomon Islands          62.0  ...       0.009026               False
## 64            Tonga           9.0  ...       0.008515               False
## 13            China       92179.0  ...       0.006574               False
## 2           Algeria        2285.0  ...       0.005211               False
## 18          Ecuador         676.0  ...       0.003832               False
## 0       Afghanistan        1453.0  ...       0.003732               False
## 44      New Zealand         184.0  ...       0.003619               False
## 12            Chile         613.0  ...       0.003207               False
## 54       Seychelles           3.0  ...       0.003047               False
## 
## [20 rows x 5 columns]

Notice that in Spyder, at any point, we can go to the variable explorer tab in the top-right window and explore the content of an entire dataframe at will.

Now it is time to practice. Please find:

  • A. The number of deaths due to floods for Germany over time (do you find a specific pattern for 2021? This is expected).
  • B. The total number of deaths by type of natural disaster over the entire period. Do more people die from animal accidents than from storms?

3.3. Export outputs

We have seen that we can easily explore a dataframe, for example generating statistics per country or per disaster type. So far, we have used Spyder to generate the outputs. Another option is to use Jupyter Notebook. The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text. When writing a report mixing Python code, table outputs, charts, and text, this can be a good option. Jupyter Notebook is available on Anaconda. We will open Jupyter and run the code below:

import pandas as pd
data = pd.read_csv("disasters.csv",sep=";", encoding = "ISO-8859-1")
data.head()
data.describe()
print(data['Disaster Group'].unique())
data2 = data[data['Disaster Group'] == 'Natural']
data3 = data2[['Total Deaths', 'Country']]
data3 = data3.groupby(['Country']).sum()
data3

We will add chunks of markdown text to add titles or discussions and split the code into logical pieces, commenting on what we do and on the generated outputs. You can find such a Jupyter document under first_jupyter_file.ipynb, available on Blackboard. As you can see, the outputs are displayed nicely, and you can also add pieces of text to the Jupyter document (using the markdown format). Once we have the code we want in Spyder, we can use Jupyter to write a nice data analysis report. For more documentation, please visit this link. To generate a report that we can share, we can download the Jupyter document as an html or pdf file using the download as option.

Another option is to save the outputs generated by our Spyder code as an xlsx file using the pandas method to_excel(), so we can insert them into a report later on:

import pandas as pd
data = pd.read_csv("disasters.csv",sep=";", encoding = "ISO-8859-1")
data.head()
data.describe()
print(data['Disaster Group'].unique())
data2 = data[data['Disaster Group'] == 'Natural']
data3 = data2[['Total Deaths', 'Country']]
data3 = data3.groupby(['Country']).sum()
data3.to_excel("total_deaths_per_country.xlsx")

4. Creating plots

Please check introduction-to-data-visualization-with-matplotlib of DataCamp before starting this section.

All the code covered in this section is available in the Python file plots.py that you can find on Blackboard.

The Matplotlib package helps us to create plots. It allows the data to tell their own story by plotting them wisely. We start by importing the package.

import matplotlib.pyplot as plt

4.1. Key functions

We can create a bunch of different plots in Python with matplotlib. The most basic plot is the line plot (we plot y as a function of x; in other words, x will be the horizontal axis and y the vertical one). We call the plot() function to do so:

import matplotlib.pyplot as plt
import numpy as np
x=np.array([1,2,3,4,5,6])
y=np.array([10,20,30,40,50,60])
plt.clf()
plt.plot(x,y)
plt.show()

To see the plot in Spyder, we copy-paste the code and then execute it. In the Plots pane (in the top-right window, next to the variable explorer), we should be able to see the resulting plot. In the same window, using the menu, we can also save it as a PNG file if we want to.

Alternatively, we can use:

x=np.array([1,2,3,4,5,6])
plt.clf()
plt.plot(x,x*10)
plt.show()

Besides line plots, other plot types are available, such as the scatter plot (dots). Before drawing another plot, we make sure we erase the content of the current figure by calling clf().

plt.clf()
x=np.array([1,2,3,4,5,6])
y=np.array([10,20,30,40,50,60])
plt.scatter(x,y)
plt.show()

Bar plots are also available:

plt.clf()
x=np.array(["France","Germany","Japan","US","China","Brazil"])
y=np.array([10,12,14,20,22,16])
plt.bar(x,y)
plt.show()

We can also show a histogram. We will use hist() from the matplotlib package; we can use the help function of Python to get more information about it. To do so, we execute the following command in the console:

help(plt.hist)

Python sets the number of bins to 10 by default. The number of bins is pretty important. Too few bins will oversimplify reality and will not show the details. Too many bins will overcomplicate reality and will not show the bigger picture. We can set the bins argument if needed.

plt.clf()
my_array=np.array([10,10,30,50,100,60,70,50,65,76,87,10,10,30,50,100,60,70,50,65,76,87,10,10,30,50,100,60,70,50,65,76,87,10,10,30,50,100,60,70,50,65,76,87,10,10,30,50,100,60,70,50,65,76,87,10,10,30,50,100,60,70,50,65,76,87,10,10,30,50,100,60,70,50,65,76,87,10,10,30,50,100,60,70,50,65,76,87])
plt.hist(my_array)
plt.show()
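As an illustration of the bins argument (the choice of 5 bins below is arbitrary), we can redraw the histogram with fewer bins and see how the picture coarsens:

import matplotlib.pyplot as plt
import numpy as np
#same kind of data as above (shortened here for brevity)
my_array=np.array([10,10,30,50,100,60,70,50,65,76,87,10,10,30,50,100,60,70,50,65,76,87])
plt.clf()
#override the default of 10 bins with 5 bins
plt.hist(my_array, bins=5)
plt.show()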

We can customize the plots using the methods xlabel(), ylabel(), and title(), which allow us to specify the horizontal axis title, the vertical axis title, and the plot title, respectively. For instance, building on the last plot:

plt.clf()
my_array=np.array([10,10,30,50,100,60,70,50,65,76,87,10,10,30,50,100,60,70,50,65,76,87,10,10,30,50,100,60,70,50,65,76,87,10,10,30,50,100,60,70,50,65,76,87,10,10,30,50,100,60,70,50,65,76,87,10,10,30,50,100,60,70,50,65,76,87,10,10,30,50,100,60,70,50,65,76,87,10,10,30,50,100,60,70,50,65,76,87])
plt.hist(my_array)
plt.xlabel("Random numbers")
plt.ylabel("Frequency")
plt.title("My first histogram with Python")
plt.show()

When we make a scatter plot, we can change the size of the dots according to the value of a variable: in the plot below, the larger the y value, the bigger the dot. We achieve this by setting the option s of the scatter() method equal to y.

plt.clf()
x=np.array([1,2,3,4,5,6])
y=np.array([10,20,30,40,50,60])
plt.scatter(x,y, s=y)
plt.xlabel("Value_X")
plt.ylabel("Value_Y")
plt.title("My_scatter_plot")
plt.show()

We have not touched on colors so far. To make a plot more colorful, we can create a list of colors (c in the illustration below). In the example, we report the GDP per capita of six countries (these are fictitious numbers). We can also specify the color of the bar edges (option edgecolor in the method plt.bar()):

plt.clf()
x=np.array(["France","Germany","Japan","US","China","Brazil"])
y=np.array([10,12,14,20,22,16])
c=['violet', 'red', 'green', 'blue', 'grey','brown']
plt.bar(x,y,color=c, edgecolor='black')
plt.xlabel("Countries")
plt.ylabel("GDP per capita")
plt.title("GDP per capita accross selected countries")
plt.show()

Finally, we can add a grid if needed by calling plt.grid(True) and add text by means of plt.text():

plt.clf()
x=np.array([1,2,3,4,5,6])
y=np.array([10,20,30,40,50,60])
plt.scatter(x,y, s=y)
plt.xlabel("Value X")
plt.xlabel("Value Y")
plt.title("My scatter plot")
plt.grid(True)
plt.text(1,10,"First observation")
plt.text(2,20,"Second observation")
plt.show()

4.2. Visualizing a random walk

import matplotlib.pyplot as plt
import numpy as np
# Initialization of the list
random_walk = [0]
#create a loop
#range(100) means for values of x from 0 to 99
for x in range(100) :
  #get last value of the random walk
  step = random_walk[-1]
  #draw a random integer between 1 and 6
  dice = np.random.randint(1,7)
  #if the draw is <= 2, then step is the max between 0 and the last value of the random walk -1
  if dice <= 2:
    step = max(0, step - 1)
  #if the draw is >2 and <= 5, we increase the last value of the random walk by 1
  elif dice <= 5:
    step = step + 1
  #if the draw is >5, the random walk value is set to last value known + a random integer between 1 and 6
  else:
    step = step + np.random.randint(1,7)
  #we add one step to the random walk
  random_walk.append(step)
#clear plt if any
plt.clf()
# create the random_walk plot
plt.plot(random_walk)
# show the plot => one possible outcome (remember this is random)
plt.show()

4.3. Plots with dataframes

Once we have created a dataframe, we can easily create a plot out of it combining pandas and matplotlib. More information about that here.

Let us assume we have created the following dataframe:

import pandas as pd
import matplotlib.pyplot as plt
Unemployment_Rate =  [6.1,5.8,5.7,5.7,5.8,5.6,5.5,5.3,5.2,5.2]
Stock_Index_Price = [1500,1520,1525,1523,1515,1540,1545,1560,1555,1565]
df = pd.DataFrame(zip(Unemployment_Rate, Stock_Index_Price), columns=["Unemployment_Rate","Stock_Index_Price"]) 
df.head()
##    Unemployment_Rate  Stock_Index_Price
## 0                6.1               1500
## 1                5.8               1520
## 2                5.7               1525
## 3                5.7               1523
## 4                5.8               1515

We can directly plot the dataframe by using the following syntax:

plt.clf()
df.plot(x ='Unemployment_Rate', y='Stock_Index_Price', kind = 'scatter')
plt.show()

Below we show how to plot a pie chart out of a dataframe:

import pandas as pd
import matplotlib.pyplot as plt
data =[300,500,700]
df = pd.DataFrame(data,columns=['Tasks'],index = ['Tasks Pending','Tasks Ongoing','Tasks Completed'])
plt.clf()
df.plot.pie(y='Tasks',figsize=(5, 5),autopct='%1.1f%%', startangle=90)
plt.show()

5. Data analysis applications

In this section, we perform data analysis in several contexts using the tools reviewed so far. We will follow a similar pattern across applications:

  1. Download/Import the data.
  2. Format the data.
  3. Explore the data.
  4. Extract valuable information out of the data.
  5. Visualize the data if appropriate.

5.1. GDP and stock market financing

Our starting point is World Bank data. We want to investigate the relation between GDP per capita and the importance of the stock market in a country. Please refer to the Python file gdp_market_analysis.py on Blackboard; make sure you also download the associated csv files gdp_data_2020.csv and mkt_to_gdp_2020.csv.
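Before opening that file, here is a hedged sketch of the general approach (the column names Country, GDP_per_capita, and Mkt_to_GDP are assumptions for illustration only - check the headers of the actual csv files and adjust the code accordingly):

import pandas as pd
import matplotlib.pyplot as plt
#the column names used below are assumptions - adapt them to the actual csv files
gdp = pd.read_csv("gdp_data_2020.csv")
mkt = pd.read_csv("mkt_to_gdp_2020.csv")
#merge the two datasets on the country identifier, keeping only countries present in both files
merged = pd.merge(gdp, mkt, how="inner", on="Country")
merged.describe()
#visualize the relation between GDP per capita and stock market capitalization over GDP
plt.clf()
merged.plot(x="GDP_per_capita", y="Mkt_to_GDP", kind="scatter")
plt.title("Stock market financing vs GDP per capita (2020)")
plt.show()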

5.2. Movie ratings

Inspired by Python for Data Analysis by Wes McKinney.

We next examine the movie ratings given by women and men and collected from users of MovieLens in the late 1990s. The data source is available here. The data provide movie ratings, movie metadata (genres and year), and demographic data about the users (age, zip code, gender, and occupation). Such data is often of interest in the development of recommendation systems. We can download the dataset here.

The MovieLens 1M data set contains 1 million ratings collected from 6,000 users on 4,000 movies. It is spread across 3 tables: ratings, user information, and movie information. After extracting the data from the zip file, each table can be loaded into a pandas dataframe object using read_table(). Note that ages and occupations are coded as integers indicating groups described in the dataset’s README file.

Our objective is to explore the datasets and extract some interesting facts and insights. Analyzing the data spread across three tables is not a simple task: for example, suppose we want to compute mean ratings for a particular movie by gender and age. This is much easier to do with all of the data merged together into a single table. Using pandas’s merge() function, we first merge ratings with users and then merge this new dataframe with the movies data.
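As a hedged sketch of these two steps (the '::' separator and the column names below follow the dataset's README; double-check them against your extracted files):

import pandas as pd
#load the three tables; the MovieLens files use '::' as separator and have no header row
users = pd.read_table("users.dat", sep="::", header=None, engine="python",
                      names=["user_id", "gender", "age", "occupation", "zip"])
ratings = pd.read_table("ratings.dat", sep="::", header=None, engine="python",
                        names=["user_id", "movie_id", "rating", "timestamp"])
movies = pd.read_table("movies.dat", sep="::", header=None, engine="python",
                       names=["movie_id", "title", "genres"], encoding="ISO-8859-1")
#merge ratings with users, then with movies, to obtain a single table
data = pd.merge(pd.merge(ratings, users, on="user_id"), movies, on="movie_id")
#example: mean rating of each movie by gender
mean_ratings = data.pivot_table("rating", index="title", columns="gender", aggfunc="mean")
mean_ratings.head()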

Please refer to the Python file movies_analysis.py available on Blackboard; make sure you also download the associated files: movies.dat, users.dat, and ratings.dat.

5.3. Baby names

Inspired by Python for Data Analysis by Wes McKinney.

The United States Social Security Administration has made available data on the frequency of baby names from 1880 through the present. Data is publicly available at this link. We concentrate on the top 1,000 names in the U.S. by year for men and women. Initially, in the zip file, there is one .txt document per year; in the code, we turn them into one big csv file, names_1880_2020.csv. This is our starting point. Make sure you download names_1880_2020.csv from Blackboard, then check the Python code that examines the data and generates some outputs: baby_names_analysis.py, also available on Blackboard.

There are many things we can do with the dataset:

  1. Visualize the proportion of babies given a particular name over time.
  2. Determine the relative rank of a name.
  3. Determine the most popular names in each year or the names with largest increases or decreases.
  4. Analyze trends in names: vowels, consonants, length, overall diversity, changes in spelling, first and last letters.
  5. Analyze external sources of trends: biblical names, celebrities, demographic changes.

We cover some of it in the Python file.
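As a hedged starting point for item 1 (the column names year, name, sex, and births are assumptions about names_1880_2020.csv - check the file and baby_names_analysis.py and adjust):

import pandas as pd
import matplotlib.pyplot as plt
#the column names below are assumptions - check names_1880_2020.csv and adapt if needed
names = pd.read_csv("names_1880_2020.csv")
#total births per year and births for a given name per year
total_per_year = names.groupby("year")["births"].sum()
john_per_year = names[names["name"] == "John"].groupby("year")["births"].sum()
#proportion of babies named John over time
plt.clf()
(john_per_year / total_per_year).plot(kind="line", title="Proportion of babies named John over time")
plt.xlabel("")
plt.show()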

5.4. Exercise

Using the file names_1880_2020.csv, analyze the naming trends for the names Barbara and Bruce in the US. Also, what can you say about the preference for names starting with the letter b for boys versus girls, and how has this preference evolved for girls over time? Use tables and plots. Try to present the results of your investigation in Jupyter after writing up the entire code in Spyder.

6. Assignment 1 (50%)

Go to Kaggle, register, sign in, and then download a dataset of your choice (.csv ideally). Import it as a dataframe and explore it; try to extract valuable information out of it and render it nicely in the form of tables and plots. Write a very short report presenting the outputs in a nice format and commenting on them. Within a few pages, the reader should understand what the data are, what the central features of the dataset are, what is worth examining with it, and what can then be said based on the tables and plots you show in your document. Marking criteria are as follows (out of 20 marks):

  • 2 marks for identifying and downloading a dataset and formatting it the right way in Python.
  • 8 marks for generating relevant tables and plots.
  • 4 marks for introducing the dataset and the quality of the discussion of the tables and plots.
  • 2 marks for commenting the code appropriately.
  • 4 marks for ability to incorporate features not covered in class (for instance to generate nice table outputs - there are many options).

Everything you do should be well explained, so that by reading your code and the document your lecturer understands what you examine and what is reported. Send the code, the data file, and the report by the due date to your lecturer at , making sure to use PYTHON - ASSIGNMENT 1 - NAME as the title of the email. The size of all files together should not exceed 10 MB.

7. Financial analysis

Please first check the Datacamp training: importing-managing-financial-data-in-python.

In this section, we will review how to deal with time series, and especially financial time series. Then, we will build portfolios and analyze their performance over time. Finally, we will see how to implement and test trading strategies.

Points 7.1 to 7.4 are also covered in the Python file financial analysis first part.py available on Blackboard. Points 7.5 to 7.7 are also covered in the Python file financial analysis second part.py available on Blackboard.

7.1. Time series

A time series is a sequence of observations recorded at regular time intervals. Depending on the frequency of observations, a time series may be hourly, daily, weekly, monthly, quarterly or annual. Sometimes, you might even have second- or minute-wise time series (e.g., the number of clicks on ads).

Let us use a first time series (drug_sales.csv, a csv file on Australian drug sales). We add parse_dates=['date'] so that the date column is parsed as a date field. You can find this csv file on Blackboard. When it comes to time series with dataframes, common practice is to define the index as the dates, i.e., the names of the rows are dates:

import pandas as  pd
df = pd.read_csv('drug_sales.csv', parse_dates=['date'], index_col='date')
df.head()
##                value
## date                
## 1991-07-01  3.526591
## 1991-08-01  3.180891
## 1991-09-01  3.252221
## 1991-10-01  3.611003
## 1991-11-01  3.565869

Alternatively, one can first import the dataframe and then indicate to Python, with another line of code, which column contains the dates in string format and how to turn these string dates into proper dates that Python can read. In the format string, we indicate where the year (Y), month (m) and day (d) are.

import pandas as  pd
df = pd.read_csv('drug_sales.csv')
df.index = pd.to_datetime(df['date'], format = '%Y-%m-%d')
#we drop date column now we have the dates as the index
df=df[['value']]
df.head()
##                value
## date                
## 1991-07-01  3.526591
## 1991-08-01  3.180891
## 1991-09-01  3.252221
## 1991-10-01  3.611003
## 1991-11-01  3.565869

Let us plot the time series. We isolate the column of the dataframe we want to plot; Python will automatically plot the values as a function of time (the dates we have in the index).

import matplotlib.pyplot as plt
plt.clf()
df.plot()
plt.title("Drug sales over time")
plt.ylabel("Drug sales")
plt.xlabel("")
plt.show()

We observe a seasonal pattern, so it is better to aggregate the data. We can group the data at yearly or monthly intervals and see how the values are distributed within a given year or month and how this compares over time. Below, we group observations by year (by month), compute the mean values, and plot the corresponding bar charts:

By year:

df = pd.read_csv('drug_sales.csv', parse_dates=['date'], index_col='date')
df['year'] = df.index.year
plt.clf()
df.groupby('year')['value'].mean().plot(kind='bar')
plt.title("Mean sales per year")
plt.ylabel("Mean sales per year")
plt.xlabel("")
plt.show()

By month of the year:

df = pd.read_csv('drug_sales.csv', parse_dates=['date'], index_col='date')
df['month'] = df.index.month
plt.clf()
df.groupby('month')['value'].mean().plot(kind='bar')
plt.title("Mean sales per month of the year")
plt.ylabel("Mean sales per month")
plt.xlabel("")
plt.show()

The plots make the year-wise and month-wise distributions obvious. The months of December and January have higher drug sales, which can be attributed to the holiday discount season.

7.2. Financial series

Let us now consider standard financial time series. We start with a time series of stock prices. A first question is: how can we easily get such a time series with Python?

7.2.1. Data sources

We can use various sources to find financial series. Quandl is a free option, but there is a limit on the number of requests we can make per day. It has the advantage of being a package that we can call from Python (a fully integrated solution). A good practice with Quandl is to download a series once and then store it as a csv file. Otherwise, we can extract financial time series from EIKON (Thomson Refinitiv) or Bloomberg using their Excel add-ins, store the data as csv files, and then import them into Python. The IEX Cloud API is another alternative. At Audencia, you have access to EIKON Refinitiv and Bloomberg; for more information, get in touch with the library.

Let us say a few words about Quandl. We can use this package to import a time series of stock prices directly into Python. The package allows us to connect to the Quandl API (Application Programming Interface, a software intermediary that allows two applications to talk to each other) to retrieve data. Quandl gives users access to a vast collection of economic, financial, and market data collected from central banks, governments, multinational organizations and many other sources. Most of the raw datasets are free to access upon sign-up (provided you have an API key - see below how to generate one), with more advanced and in-depth datasets available at a cost. Each feed of data has a short ID, also known as a Quandl code. For example:

  • Federal Reserve Economic Data (time series), which includes US economic data on growth, employment, inflation, labor, manufacturing and more: FRED.
  • US stock data: WIKI.
  • Leading real estate and rental marketplace: ZILLOW.

The full list of codes can be found here.

Having a Quandl account is important as it allows you to access their API, libraries and tools, and more importantly, download free and/or premium data in any format. Let us go over the steps to open a free account:

  1. Go over to the Quandl website and click on the sign-up button in the top right corner (Nasdaq acquired Quandl in 2018).
  2. A screen will open that requires your first and last name, as well as your purpose of using Quandl (business, academic or personal). So feel free to input those fields. After that, click on “Next”.
  3. Now, another screen will appear where you will input your email address and answer a question on “How will you be using this data?” After that, click on “Next”.
  4. The final screen of your signing up process will appear where you will be asked to create a password. After you have created your password and checked the terms of service box, click on “Create Account”.

Now, a welcome screen will appear with your API key. Copy it and do not forget to verify your email address. You can find your API key in the Account Details section. The API key is a “secret” authentication token as well as a unique identifier for collecting data from Quandl.

Now that we have an API key for Quandl, let us download some financial data into Python. We want to import energy prices.

We install the quandl package; as usual, we open the Anaconda prompt and type:

pip install quandl

Then we import it at the beginning of our script using the syntax:

import quandl as ql

We want to get the WTI Crude Oil price from the United States department of Energy:

  1. We need to import all the packages we are going to use (also pandas)
  2. After that, we need to activate our API key. Use the following command: quandl.ApiConfig.api_key = “YOUR_API_KEY”
  3. The next step is to use the following command mydata = quandl.get(“EIA/PET_RWTC_D”) to get the time series.

For more information, check the documentation. The code to achieve this is shown below:

The time-series data provided by Quandl only contains numerical data which is indexed by one date field. It can directly be used as a time series.

import quandl as quandl
import pandas as pd
import matplotlib.pyplot as plt
#activate the API key - use your own API key (see above how to get one)
quandl.ApiConfig.api_key = "MY_API_KEY"
#get time series of crude oil prices and show it with a plot
#the code for crude oil prices in Quandl is EIA/PET_RWTC_D
#we want to get a pandas dataframe, so we use returns="pandas"
#more information on quandl.get can be obtained with the help: help(quandl.get)
mydata= quandl.get("EIA/PET_RWTC_D", returns="pandas")
print(mydata)
##             Value
## Date             
## 1986-01-02  25.56
## 1986-01-03  26.00
## 1986-01-06  26.53
## 1986-01-07  25.85
## 1986-01-08  25.87
## ...           ...
## 2021-11-19  76.11
## 2021-11-22  76.74
## 2021-11-23  78.32
## 2021-11-24  78.32
## 2021-11-29  69.88
## 
## [9072 rows x 1 columns]
plt.clf()
mydata.plot(title='Crude oil prices over time', legend=False)
plt.xticks(rotation=90)
plt.ylabel("Oil price")
plt.xlabel("")
plt.show()

Notice that we can download data for a specific date range and at a specific frequency:

import quandl as q
import pandas as pd
import matplotlib.pyplot as plt
#get time series of GDP - we want quarterly data, so we use collapse="quarterly"; we also want to retrieve the information for a specific time period
#we indicate this time period with start_date & end_date
mydata= q.get("FRED/GDP", start_date="2020-01-01", end_date="2020-12-31", returns="pandas", collapse="quarterly")
print(mydata)
##                 Value
## Date                 
## 2020-01-01  21481.367
## 2020-04-01  19477.444
## 2020-07-01  21138.574
## 2020-10-01  21477.597
#the dates are the index, as is often the case for time series, but they come in the format Y-M-D, so we change the date format to get year-quarter
#to change the date format of the index we use the pandas function PeriodIndex().
mydata.index = pd.PeriodIndex(mydata.index  ,freq='Q')
print(mydata)
##             Value
## Date             
## 2020Q1  21481.367
## 2020Q2  19477.444
## 2020Q3  21138.574
## 2020Q4  21477.597
#we plot the information
plt.clf()
mydata.plot(title='US GDP by quarter in 2020', legend=False, kind='bar')
plt.ylabel("US GDP")
plt.xlabel("")
#make the plot window a bit bigger to be able to contain the labels.
plt.tight_layout()
plt.show()

As indicated, when it comes to Quandl, the best practice is to download the series once from Quandl and then store it as a csv file, because the number of requests to the API per day is limited for free accounts. We make use of the pandas method to_csv() to do that:

import quandl as quandl
import pandas as pd
#activate the API key - use your own API key (see above how to get one)
quandl.ApiConfig.api_key = "MY_API_KEY"
mydata= quandl.get("EIA/PET_RWTC_D", returns="pandas")
mydata.to_csv("EIA_PET_RWTC_D_Quandl.csv")

An alternative to Quandl for importing financial data directly into Python for free is to use packages that collect data from Yahoo Finance. There is a package (pandas_datareader) giving us access to the stock prices stored there; see the example below (a good thing is that it does not require registration):

import pandas as pd
import pandas_datareader as wb
start = '2000-01-01'
end = '2020-12-31'
ticker = 'AAPL'
aapl = wb.DataReader(ticker, 'yahoo', start, end)
aapl=aapl[['Adj Close']]
aapl=aapl.rename(columns={'Adj Close': 'AAPL'})
aapl
##                   AAPL
## Date                  
## 2000-01-03    0.856887
## 2000-01-04    0.784643
## 2000-01-05    0.796124
## 2000-01-06    0.727229
## 2000-01-07    0.761677
## ...                ...
## 2020-12-24  131.161423
## 2020-12-28  135.852509
## 2020-12-29  134.043640
## 2020-12-30  132.900711
## 2020-12-31  131.877014
## 
## [5284 rows x 1 columns]

We get daily prices, but we can easily turn the series into a monthly one, using the resample method:

aapl= aapl.resample("M").interpolate()
aapl
##                   AAPL
## Date                  
## 2000-01-31    0.794211
## 2000-02-29    0.877460
## 2000-03-31    1.039651
## 2000-04-30    0.841338
## 2000-05-31    0.643024
## ...                ...
## 2020-08-31  128.028488
## 2020-09-30  114.902191
## 2020-10-31  116.611385
## 2020-11-30  118.320580
## 2020-12-31  131.877014
## 
## [252 rows x 1 columns]
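The resample method accepts different aggregation choices; a common alternative to interpolate() (a judgment call, not the course's method) is to keep the last available daily price of each month:

import pandas as pd
import pandas_datareader as wb
#re-download the daily series and keep the last daily price of each month instead of interpolating
aapl_daily = wb.DataReader('AAPL', 'yahoo', '2000-01-01', '2020-12-31')
aapl_daily = aapl_daily[['Adj Close']].rename(columns={'Adj Close': 'AAPL'})
aapl_monthly = aapl_daily.resample("M").last()
aapl_monthly.tail()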

In what follows, we are going to use datasets of prices extracted either from Quandl, Yahoo, EIKON, or Bloomberg, but formatted as csv files. Trainings on collecting financial data and series from EIKON and Bloomberg are available to you from Audencia's library; you are strongly encouraged to take them. We will specify each time what the source of our time series is, so that you can redo the steps starting from the raw data if you wish to.

7.2.2. Create returns

We want to see how to generate returns and cumulative returns. To do so, we will use the time series of Facebook stock prices. The Quandl code for Facebook stock is WIKI/FB. We use it to get the time series of stock prices from Quandl: quandl.get("WIKI/FB", collapse="monthly", returns="pandas"). When calling the get() function from Quandl, we specify the frequency with the option collapse; here, we are interested in monthly prices. We have stored the data we got from Quandl in the csv file facebook_price_quandl.csv, available on Blackboard. Notice that the series we get for free from Quandl stops in 2018.
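For reference, here is a hedged sketch of how such a csv file can be produced and stored (it requires your own API key; the free WIKI series ends in 2018):

import quandl as quandl
#use your own API key (see above how to get one)
quandl.ApiConfig.api_key = "MY_API_KEY"
#monthly Facebook stock prices from Quandl, stored once as a csv file for later use
mydata = quandl.get("WIKI/FB", collapse="monthly", returns="pandas")
mydata.to_csv("facebook_price_quandl.csv")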

We want to generate returns and cumulative returns to see how much we would have earned by investing in Facebook stock. To generate returns, we use adjusted close prices (the adjustment accounts for corporate actions such as dividend payments or stock splits):

import pandas as pd
import matplotlib.pyplot as plt
#data obtained from Quandl for WIKI/FB and a monthly frequency
mydata=pd.read_csv("facebook_price_quandl.csv",parse_dates=['Date'], index_col='Date')
mydata.head()
#we get the monthly stock prices for Facebook, we consider the adjusted close prices
#The adjusted closing price amends a stock's closing price to reflect that stock's value after accounting for any corporate actions. 
##               Open   High    Low  ...  Adj. Low  Adj. Close  Adj. Volume
## Date                              ...                                   
## 2012-05-31  28.545  29.67  26.83  ...     26.83      29.600  111639200.0
## 2012-06-30  31.920  31.99  30.76  ...     30.76      31.095   19526900.0
## 2012-07-31  23.370  23.37  21.61  ...     21.61      21.710   56179400.0
## 2012-08-31  18.680  18.70  18.03  ...     18.03      18.058   58764200.0
## 2012-09-30  20.570  21.95  20.50  ...     20.50      21.660   65486000.0
## 
## [5 rows x 12 columns]
returns = mydata[['Adj. Close']].pct_change()
plt.clf()
returns.plot(kind='line', title='Facebook monthly returns over time')
plt.ylabel("Monthly return")
plt.xlabel("")
plt.show()

Once we have monthly returns, it is easy for us to compute cumulative returns:

import pandas as pd
import matplotlib.pyplot as plt
mydata=pd.read_csv("C:\\Users\\alexa\\Documents\\cours python\\facebook_price_quandl.csv",parse_dates=['Date'], index_col='Date')
#create returns
returns = mydata[['Adj. Close']].pct_change()
#create cumulative returns
cumulative_returns = (1 + returns['Adj. Close']).cumprod() - 1
#show them on a plot
plt.clf()
cumulative_returns.plot(kind='line', title='Facebook cumulative monthly returns over time')
plt.ylabel("Cumulative monthly return")
plt.xlabel("")
plt.show()

Next, we want to assess the performance relative to a benchmark (a plausible alternative to which we compare our investment performance). We take as benchmark the S&P 500 (we get the series of index prices from Quandl with the code MULTPL/SP500_REAL_PRICE_MONTH) and save it under sp500_price_quandl.csv, available on Blackboard. Recall that an index, such as the S&P 500, tracks the performance of an entire market (most of the stocks listed in this market). We can easily buy or sell an index using ETFs (investment funds whose shares are traded on stock exchanges and that themselves replicate the index).

import pandas as pd
import matplotlib.pyplot as plt
#import the time series of prices for Facebook and SP500 index
sp_price=pd.read_csv("sp500_price_quandl.csv",parse_dates=['Date'], index_col='Date')
facebook_price=pd.read_csv("facebook_price_quandl.csv",parse_dates=['Date'], index_col='Date')
#for both time series, we compute returns
returns_sp= sp_price[['Value']].pct_change()
returns_fb= facebook_price[['Adj. Close']].pct_change()
#we merge the series - keep only common dates to both series with how=inner, do the merge based on index values using left_index=True and right_index=True
return_series = pd.merge(returns_sp,returns_fb,how="inner", left_index=True, right_index=True)
#for better readability we rename columns, make sure to use inplace=True, which means it will return nothing and the dataframe is now updated (otherwise start with return_series)
return_series.rename(columns={"Value": "SP500_Returns", "Adj. Close": "Facebook_Returns"}, inplace=True)
#we only keep dates for which returns can be computed (we drop the first observation available for Facebook then)
return_series.dropna(inplace=True)
#show a snapshot of the dataframe we have created.
return_series.head()
##             SP500_Returns  Facebook_Returns
## Date                                       
## 2012-06-30      -0.013264          0.050507
## 2012-07-31       0.027428         -0.301817
## 2012-08-31       0.032115         -0.168217
## 2012-09-30       0.028480          0.199468
## 2012-10-31      -0.003880         -0.025392

Now that we have both time series in the same dataframe, we can produce descriptive statistics and comment on them:

return_series.describe()
##        SP500_Returns  Facebook_Returns
## count      70.000000         70.000000
## mean        0.010314          0.029171
## std         0.022725          0.108615
## min        -0.065957         -0.301817
## 25%        -0.003110         -0.026631
## 50%         0.012747          0.021528
## 75%         0.026313          0.070892
## max         0.061714          0.479100

We observe that both series have 70 monthly returns. On average, Facebook returns are much higher (3% versus 1%). However, Facebook returns are also much more volatile: the standard deviation is about 10% versus 2% for the S&P 500 index (hence Facebook is riskier). For Facebook, on average, we can expect to get 3% plus or minus 10%, while it is 1% plus or minus 2% for the S&P 500 index. Gains are lower but safer for the S&P 500 index on average. Also notice that the average return figure for Facebook is heavily influenced by an extreme monthly return of 48% (see the maximum). When the mean return is likely to be distorted by extreme values, it is good practice to consider the median return (the one we get 50% of the time). When we consider the median return, as expected, we see that the difference in average return between Facebook and the S&P 500 is much smaller (2% for Facebook versus 1% for the S&P 500).

We can use the ratio of mean return to standard deviation of the returns to put a number on the return-risk combination associated with a security (this is equivalent to the Sharpe ratio of a portfolio when the risk-free rate is zero). The best diversified investment strategy is usually the one with the highest Sharpe ratio. We compute it below for Facebook returns and S&P 500 returns:

Sharpe_ratio_SP=return_series["SP500_Returns"].mean()/return_series["SP500_Returns"].std()
Sharpe_ratio_FB=return_series["Facebook_Returns"].mean()/return_series["Facebook_Returns"].std()
print("The sharpe ratio of SP500 is " + str(round(Sharpe_ratio_SP,2)) + ",while the sharpe ratio of Facebook is " + str(round(Sharpe_ratio_FB,2))+ ".")
## The sharpe ratio of SP500 is 0.45,while the sharpe ratio of Facebook is 0.27.

It seems that even if, on average, Facebook delivers a higher return, it does so for a much higher risk, so it is better to invest in the S&P 500 index (diversified strategies usually beat undiversified ones according to this metric) and take on leverage to increase risk up to the level matching the investor's risk aversion. Again, this result is expected because diversification (investing in all the stocks in the market) precisely aims to increase the Sharpe ratio by removing exposure to idiosyncratic risk, i.e., the risk specific to each company, keeping only exposure to systematic risk, i.e., the market-wide risk.
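To make the leverage argument concrete, here is a minimal sketch, assuming we can borrow at a zero rate and ignoring trading costs: we scale the S&P 500 position so that its volatility matches Facebook's. Such scaling leaves the Sharpe ratio unchanged while multiplying the mean return by the leverage factor.

#scale the S&P 500 returns so that their volatility matches Facebook's (zero borrowing rate assumed)
leverage = return_series["Facebook_Returns"].std() / return_series["SP500_Returns"].std()
levered_sp = leverage * return_series["SP500_Returns"]
print("Leverage factor: " + str(round(leverage, 2)))
print("Levered S&P 500 mean return: " + str(round(levered_sp.mean(), 4)))
print("Levered S&P 500 Sharpe ratio: " + str(round(levered_sp.mean()/levered_sp.std(), 2)))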

We now consider the cumulative returns of both series:

import pandas as pd
import matplotlib.pyplot as plt
#compute cumulative returns for both series (starting from the merged return series created above)
cumulative_returns_SP = (1 + return_series["SP500_Returns"]).cumprod() - 1
cumulative_returns_FB = (1 + return_series["Facebook_Returns"]).cumprod() - 1
#we must start accumulating returns from the same point in time, so we merge on overlapping dates
cum_series=pd.merge(cumulative_returns_SP,cumulative_returns_FB,how="inner", left_index=True, right_index=True)
cum_series.head()
##             SP500_Returns  Facebook_Returns
## Date                                       
## 2012-06-30      -0.013264          0.050507
## 2012-07-31       0.013800         -0.266554
## 2012-08-31       0.046359         -0.389932
## 2012-09-30       0.076159         -0.268243
## 2012-10-31       0.071984         -0.286824

We can now plot the series of cumulative returns:

plt.clf()
cum_series.plot(kind='line', title='Cumulative monthly returns over time')
plt.ylabel("Cumulative monthly returns")
plt.xlabel("")
plt.tight_layout()
plt.show()

We can confirm what we have observed earlier with descriptive statistics: the return series of the S&P500 index is much less volatile and delivers on average a lower positive return.

7.3. Trading strategy

Now, let us assume that we believe momentum trading or technical analysis are viable trading strategies; we will try to apply them to single time series of stock or index prices.

Let us start with Facebook stock (we use daily data this time). We have done some research, and it seems that using technical analysis, we could get a better return than with a buy-and-hold strategy (our benchmark: passively holding the stock over the entire period considered). Technical analysis looks at the technical features of the past time series to extrapolate future trends. It relies a lot on moving averages. A moving average is a simple technical analysis tool that smooths out price data by creating a constantly updated average price. Below, we compute and plot a 50-day (250-day) moving average for Facebook's stock price (we use the pandas method rolling()):

import pandas as pd
import matplotlib.pyplot as plt
facebook_price=pd.read_csv("facebook_price_quandl_daily.csv",parse_dates=['Date'], index_col='Date')
facebook_price["MA50"]=facebook_price['Adj. Close'].rolling(window=50, min_periods=50, center=False).mean()
facebook_price["MA250"]=facebook_price['Adj. Close'].rolling(window=250, min_periods=100, center=False).mean()

plt.clf()
facebook_price[["MA50","MA250", 'Adj. Close']].plot(kind='line', title='Cumulative monthly returns over time')
plt.title("Moving averages of Facebook stock price")
plt.ylabel("Price")
plt.xlabel("")
plt.show()

When a shorter moving average (more recent, for instance based on the last 50 days) becomes greater (lower) than a longer moving average (for instance based on the last 100 days), it means that, recently, the price has been increasing (decreasing) more than before. One may want to use this event as a signal to trade (buy or sell). We will test such a technical-analysis-based trading strategy. Below, we plot an MA-50 and an MA-100 for Facebook's stock price:

import pandas as pd
import matplotlib.pyplot as plt
facebook_price=pd.read_csv("facebook_price_quandl_daily.csv",parse_dates=['Date'], index_col='Date')
facebook_price["MA50"]=facebook_price['Adj. Close'].rolling(window=50, min_periods=50, center=False).mean()
facebook_price["MA100"]=facebook_price['Adj. Close'].rolling(window=100, min_periods=100, center=False).mean()

plt.clf()
facebook_price[["MA50","MA100"]].plot(kind='line', title='Cumulative monthly returns over time')
plt.title("Moving averages of Facebook stock price")
plt.ylabel("Price")
plt.xlabel("")
plt.show()

We want to create buy (sell) signals in our dataframe to test our trading strategy, that is, to start accumulating returns as long as we get a certain signal:

import pandas as pd
import matplotlib.pyplot as plt
facebook_price=pd.read_csv("facebook_price_quandl_daily.csv",parse_dates=['Date'], index_col='Date')
#generate usual returns and cumulative returns
facebook_price['FB Ret']= facebook_price['Adj. Close'].pct_change()
facebook_price['FB Buy and Hold'] = (1 + facebook_price['FB Ret']).cumprod() - 1

# We initialize the short and long windows
short_window = 50
long_window = 100
# Add a signal column to the dataframe, we set it to zero
facebook_price['signal'] = 0.0
# Create a simple moving average over the short window
facebook_price['short_mavg'] = facebook_price['Adj. Close'].rolling(window=short_window, min_periods=50, center=False).mean()
# Create a simple moving average over the long window
facebook_price['long_mavg'] = facebook_price['Adj. Close'].rolling(window=long_window, min_periods=100, center=False).mean()
# Create signals
facebook_price.loc[facebook_price['short_mavg'] > facebook_price['long_mavg'], 'signal'] = 1
# Generate trading orders, when signal becomes 1, we send a buy order, when signal stops being 1, that is reverts back to 0, we close the position by selling
facebook_price['positions'] = facebook_price['signal'].diff()
# Print
print(facebook_price)
##               Open    High     Low  ...  short_mavg   long_mavg  positions
## Date                                ...                                   
## 2012-05-18   42.05   45.00   38.00  ...         NaN         NaN        NaN
## 2012-05-21   36.53   36.66   33.00  ...         NaN         NaN        0.0
## 2012-05-22   32.61   33.59   30.94  ...         NaN         NaN        0.0
## 2012-05-23   31.37   32.50   31.36  ...         NaN         NaN        0.0
## 2012-05-24   32.95   33.21   31.77  ...         NaN         NaN        0.0
## ...            ...     ...     ...  ...         ...         ...        ...
## 2018-03-21  164.80  173.40  163.30  ...  181.578398  180.259099        0.0
## 2018-03-22  166.13  170.27  163.72  ...  181.118798  180.201999        0.0
## 2018-03-23  165.44  167.10  159.02  ...  180.549798  180.089599        0.0
## 2018-03-26  160.82  161.10  149.02  ...  179.995598  179.911399        0.0
## 2018-03-27  156.31  162.85  150.75  ...  179.451998  179.634599       -1.0
## 
## [1472 rows x 18 columns]

We can next show the signals on the plot directly to get a visual representation of the strategy:

plt.clf()
facebook_price[["short_mavg","long_mavg"]].plot(kind='line')
# Plot the buy signals
plt.plot(facebook_price.loc[facebook_price.positions == 1.0].index, 
         facebook_price.short_mavg[facebook_price.positions == 1.0],
         '^', markersize=10, color='g')
# Plot the sell signals
plt.plot(facebook_price.loc[facebook_price.positions == -1.0].index, 
         facebook_price.short_mavg[facebook_price.positions == -1.0],
         'v', markersize=10, color='r')
plt.title("Tradign signals on the moving averages")
plt.ylabel("")
plt.xlabel("")
plt.show()

Next, we compute the strategy's returns. We buy when the signal turns to 1 and hold until the signal reverts to 0, at which point we sell; then we wait for the next signal equal to 1 to buy again, and so on. We want to know whether such a strategy beats simply buying and holding the share over the same period (which involves much less effort and fewer transaction costs).

#we generate cumulative returns for both strategies
facebook_price['FB Ret Strat']= facebook_price['Adj. Close'].pct_change()*facebook_price['signal']
facebook_price['FB Our Strat']= (1 + facebook_price['FB Ret Strat']).cumprod() - 1

#check returns
facebook_price[["FB Ret",'FB Ret Strat','FB Buy and Hold',"FB Our Strat",'signal' ]].describe()
##             FB Ret  FB Ret Strat  FB Buy and Hold  FB Our Strat       signal
## count  1471.000000   1471.000000      1471.000000   1471.000000  1472.000000
## mean      0.001205      0.000868         1.341447      0.933171     0.771060
## std       0.023325      0.016709         1.269492      0.739957     0.420293
## min      -0.116968     -0.069425        -0.536276     -0.093071     0.000000
## 25%      -0.009080     -0.004798         0.289019      0.242828     1.000000
## 50%       0.000892      0.000000         1.116301      0.768560     1.000000
## 75%       0.011884      0.007222         2.240888      1.508214     1.000000
## max       0.296077      0.155214         4.050508      2.526003     1.000000

Graphically, we see that our strategy underperforms the buy and hold one.

plt.clf()
facebook_price[["FB Buy and Hold","FB Our Strat"]].plot(kind='line', title='Portfolio balance over time')
plt.ylabel("Cumulative monthly returns")
plt.xlabel("")
plt.show()

The underperformance comes from missing periods of positive returns, as we can see in the chart.

What we have done is backtest the strategy, that is, we have estimated how much return it makes based on historical data. Notice that we have ignored important factors such as trading costs (and other possible transaction costs) that can make an active strategy less profitable than a passive one (a buy-and-hold strategy).
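As a hedged illustration of the impact of such costs (the 0.1% fee per trade below is an arbitrary assumption), we can deduct a fee every time the position changes and recompute the cumulative return of our strategy:

#arbitrary assumption: a 0.1% cost each time we open or close a position
cost_per_trade = 0.001
#positions equals +1 on a buy and -1 on a sell, so its absolute value flags the trading days
trade_flag = facebook_price['positions'].abs().fillna(0)
facebook_price['FB Ret Strat Net'] = facebook_price['FB Ret Strat'] - cost_per_trade * trade_flag
facebook_price['FB Our Strat Net'] = (1 + facebook_price['FB Ret Strat Net']).cumprod() - 1
facebook_price[['FB Our Strat', 'FB Our Strat Net']].tail()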

Now let us assume that we can also short Facebook shares. We assume that the broker can find someone to lend us one share of Facebook, which we can sell right away in the market for a price p1 at time m. Assume we borrow the share for three months: we have to give it back to its owner three months later. After three months, we buy the share in the market at price p2 at time m+3 and give it back to the owner. As you can see, if the stock price has decreased between m and m+3, then p2 is lower than p1, and we make a gain: the money we got from selling the borrowed share at time m is greater than what it costs us to buy it back at time m+3 when it is time to return it to the owner. To sum up, we now assume that when we predict the stock price will go down, we can open a short position and make a positive return out of it (we ignore the fees charged for borrowing a share when shorting).
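A small numeric illustration of these mechanics (the prices are made up):

#made-up prices to illustrate the short sale described above
p1 = 100             # price at which we sell the borrowed share at time m
p2 = 80              # price at which we buy the share back at time m+3
gain = p1 - p2       # 20: positive because the price went down
short_return = gain / p1
print("Return on the short position: " + str(short_return))   # 0.2, before borrowing fees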

We can now refine our trading strategy. When we get a sell signal we are now going to short Facebook stock hoping that the stock price will go down, which will generate a positive return for us. Practically, in our code, we replace signal=0 by signal=-1.

import numpy as np
facebook_price['signal']=np.where(facebook_price['signal'] ==0, -1, facebook_price['signal'])
facebook_price['FB Ret Strat']= facebook_price['Adj. Close'].pct_change()*facebook_price['signal']
facebook_price['FB Our Strat']= (1 + facebook_price['FB Ret Strat']).cumprod() - 1
facebook_price[["FB Ret",'FB Ret Strat','FB Buy and Hold',"FB Our Strat",'signal' ]].describe()
##             FB Ret  FB Ret Strat  FB Buy and Hold  FB Our Strat       signal
## count  1471.000000   1471.000000      1471.000000   1471.000000  1472.000000
## mean      0.001205      0.000531         1.341447      0.185586     0.542120
## std       0.023325      0.023350         1.269492      0.242491     0.840587
## min      -0.116968     -0.296077        -0.536276     -0.386401    -1.000000
## 25%      -0.009080     -0.009294         0.289019     -0.014746     1.000000
## 50%       0.000892      0.000847         1.116301      0.164425     1.000000
## 75%       0.011884      0.011162         2.240888      0.365507     1.000000
## max       0.296077      0.155214         4.050508      0.904652     1.000000

The outcome is even worse because instead of missing periods of positive returns, we now bet against them, earning a negative return on our short positions.

plt.clf()
facebook_price[["FB Buy and Hold","FB Our Strat"]].plot(kind='line', title='Portfolio balance over time')
plt.ylabel("Cumulative monthly returns")
plt.xlabel("")
plt.show()

Now, we try to implement another trading strategy that has been shown to generate some performance for market indexes depending on the context: momentum investing. It consists of buying (shorting) following positive (negative) returns. For instance, it can be implemented based on the previous-month return, or on the previous three, six, or twelve months of returns. The expectation is that increases and decreases tend to persist for some months. We are going to use monthly returns of the S&P 500 and buy (short) if the return in the previous month is positive (negative). See the code below (here we assume we invested 1,000 in the first period instead of 1):

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
#get index prices
sp_data=pd.read_csv("sp500_price_quandl.csv",parse_dates=['Date'], index_col='Date')
#create returns
sp_data["return"]=sp_data["Value"].pct_change()
# look at 2000 - 2020
after_start_date = sp_data.index >= "2000-01-01"
before_end_date =  sp_data.index < "2021-01-01"
between_two_dates = after_start_date & before_end_date
sp_data = sp_data.loc[between_two_dates]
#create cum returns for buy and hold strat
sp_data['Ret CumProd'] = (1 + sp_data["return"]).cumprod() - 1
sp_data['Buy and Hold Strat'] = 1000 + 1000*sp_data['Ret CumProd']
#create momentum returns series
#set signal to 0
sp_data['signal'] = 0
#create a signal based on the return in the previous month, we use shift() to do so.
#shift() => shift index by desired number of periods with an optional time freq.
sp_data['signal'] = np.where(sp_data["return"].shift(periods=1)>0, 1, sp_data['signal'])
sp_data['signal'] = np.where(sp_data["return"].shift(periods=1)<0, -1, sp_data['signal'])
#create returns for momentum strategy
sp_data['Momentum Ret']= sp_data["return"]*sp_data['signal']
sp_data['Momentum CumProd']= (1 + sp_data['Momentum Ret']).cumprod() - 1
sp_data['Momentum Strat']= 1000 + 1000*sp_data['Momentum CumProd']
sp_data[["return",'Momentum Ret',"Buy and Hold Strat","Momentum Strat",'signal' ]].describe()
##            return  Momentum Ret  Buy and Hold Strat  Momentum Strat      signal
## count  252.000000    252.000000          252.000000      252.000000  252.000000
## mean     0.004602      0.007699         1156.752837     2592.717329    0.242063
## std      0.038490      0.037990          472.143645     1157.470417    0.970140
## min     -0.203911     -0.126844          529.950724      940.396613   -1.000000
## 25%     -0.012779     -0.013751          819.581712     1616.092172   -1.000000
## 50%      0.010614      0.009138          972.054624     2397.413694    1.000000
## 75%      0.027462      0.027742         1447.523238     3382.448249    1.000000
## max      0.126844      0.203911         2629.049192     5784.794583    1.000000

It does seem that the momentum strategy beats a buy-and-hold strategy for the S&P 500. Especially in the recent period, such a strategy can be implemented at low cost with ETFs. We have a visual confirmation below:

plt.clf()
sp_data[["Buy and Hold Strat","Momentum Strat"]].plot(kind='line', title='B&H vs momentum S&P 500')
plt.xticks(rotation=90)
plt.ylabel("Portfolio balance")
plt.xlabel("")
plt.show()

We could also look at abnormal return patterns around the end of the week, on specific days of the week, or at specific hours of the day, and try to take advantage of them to implement a trading strategy that beats a mere buy-and-hold one… Monday effects have been documented in some markets, for instance. For a quick test, we get S&P 500 daily prices. The raw csv file is available on Blackboard under sp_500_daily.csv. It covers the 2010-2021 period.

We start by looking at the raw daily series:

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
sp_data=pd.read_csv("sp_500_daily.csv",parse_dates=['Date'], index_col='Date', sep=";")
#important step: we sort the dates in ascending order (so as not to mess up the upcoming return calculation)
sp_data.sort_index(inplace=True)
sp_data['daily return']=sp_data['Adj Close'].pct_change()
sp_data.describe()
##          Adj Close  daily return
## count  2979.000000   2978.000000
## mean   2226.481417      0.000531
## std     849.064105      0.010834
## min    1022.580000     -0.119841
## 25%    1458.330000     -0.003445
## 50%    2081.720000      0.000703
## 75%    2779.315000      0.005372
## max    4613.670000      0.093828
plt.clf()
sp_data['Adj Close'].plot(kind='line', title='daily S&P 500 index price over 2010-2021')
plt.xlabel("")
plt.show()

plt.clf()
sp_data['daily return'].plot(kind='line', title='daily S&P 500 returns over 2010-2021')
plt.xlabel("")
plt.show()

Then, we look at the average daily return for each day of the week:

#we show the average daily return for each day of the week, Monday being = 0
sp_data['daily return'].groupby(sp_data.index.dayofweek).mean()
## Date
## 0    0.000158
## 1    0.000945
## 2    0.000581
## 3    0.000459
## 4    0.000475
## Name: daily return, dtype: float64

We can also show it with a bar plot:

plt.clf()
sp_data['daily return'].groupby(sp_data.index.dayofweek).mean().plot(kind='bar')
plt.show()

Based on the averages above, Monday returns are the weakest and Tuesday returns the strongest. We could create a strategy that takes advantage of this observation by going short on Monday and long on Tuesday. It would give us a limited exposure to the S&P 500 and only harvest the average difference between Mondays and Tuesdays. We backtest it:

#we invest 1,000 in S&P 500, we are long on the SP&500 but sell at the end every Friday, short on Monday, and close the short position at the end of Monday + go back long on Tuesday.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
sp_data=pd.read_csv("sp_500_daily.csv",parse_dates=['Date'], index_col='Date', sep=";")
sp_data.sort_index(inplace=True)
sp_data['daily return']=sp_data['Adj Close'].pct_change()
sp_data['Ret CumProd'] = (1 + sp_data["daily return"]).cumprod() - 1
sp_data['Buy and Hold'] = 1000 + 1000*sp_data['Ret CumProd'] 
sp_data['signal'] = 0
sp_data['signal'] = np.where(sp_data.index.dayofweek==1, 1, sp_data['signal'])
sp_data['signal'] = np.where(sp_data.index.dayofweek==0, -1, sp_data['signal'])
sp_data['day of the week']=sp_data.index.dayofweek
sp_data['Strat Ret']= sp_data["daily return"]*sp_data['signal']
sp_data['Strat CumProd']= (1 + sp_data['Strat Ret']).cumprod() - 1
sp_data['Strat']= 1000 + 1000*sp_data['Strat CumProd']
sp_data[["daily return",'Strat Ret',"Buy and Hold","Strat",'signal' ]].describe()
##        daily return    Strat Ret  Buy and Hold        Strat       signal
## count   2978.000000  2978.000000   2978.000000  2978.000000  2979.000000
## mean       0.000531     0.000164   1965.461837  1278.820435     0.017456
## std        0.010834     0.007046    749.318262   220.501536     0.626559
## min       -0.119841    -0.070331    902.549890   803.355753    -1.000000
## 25%       -0.003445    -0.000000   1288.036523  1120.218691     0.000000
## 50%        0.000703     0.000000   1837.438989  1318.044558     0.000000
## 75%        0.005372     0.000000   2453.205677  1396.548606     0.000000
## max        0.093828     0.119841   4072.118907  2033.939800     1.000000

On average, the strategy yields a positive return, showing that with a limited risk exposure to the market (shorting over Monday and going long over Tuesday), one can harvest a small positive return. However, such an approach does not beat a buy-and-hold strategy in terms of cumulative returns, because, on average, returns are positive for all days of the week.

7.4. Exercise

Using the time series of monthly stock prices for Microsoft (microsoft_price_quandl.csv, available on Blackboard), implement a trading strategy of your choice and compare it to a buy-and-hold one.

7.5. Building a portfolio

We have seen how to compute returns and assess the performance relative to a benchmark for a single stock. Now we assume we invest in multiple stocks. For simplicity, we will work with three stocks. You can find the relevant csv files on Blackboard: facebook_price_quandl.csv, apple_price_quandl.csv, microsoft_price_quandl.csv.

We start by generating the returns of an equally-weighted portfolio of stocks:

#import relevant packages
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

#get individual stock data
facebook_price=pd.read_csv("facebook_price_quandl.csv",parse_dates=['Date'], index_col='Date')
apple_price=pd.read_csv("apple_price_quandl.csv",parse_dates=['Date'], index_col='Date')
microsoft_price=pd.read_csv("microsoft_price_quandl.csv",parse_dates=['Date'], index_col='Date')

#generate returns for each series
microsoft_price['MSFT Ret']=microsoft_price['Adj. Close'].pct_change()
microsoft_ret=microsoft_price[['MSFT Ret']]
facebook_price['FB Ret']=facebook_price['Adj. Close'].pct_change()
facebook_ret=facebook_price[['FB Ret']]
apple_price['AAPL Ret']=apple_price['Adj. Close'].pct_change()
apple_ret=apple_price[['AAPL Ret']]

#merge series of returns
portfolio=pd.merge(microsoft_ret,apple_ret,how='inner',left_index=True,right_index=True)
portfolio=pd.merge(portfolio,facebook_ret,how='inner',left_index=True,right_index=True)

#create the returns for our equally weighted portfolio and show some information
portfolio.dropna(inplace=True)
portfolio['EW Ret']= (portfolio['MSFT Ret']+portfolio['FB Ret']+portfolio['AAPL Ret'])/3
portfolio.head()
##             MSFT Ret  AAPL Ret    FB Ret    EW Ret
## Date                                              
## 2012-06-30  0.047962  0.010853  0.050507  0.036440
## 2012-07-31 -0.036613  0.045822 -0.301817 -0.097536
## 2012-08-31  0.052751  0.093850 -0.168217 -0.007205
## 2012-09-30 -0.034393  0.002803  0.199468  0.055960
## 2012-10-31 -0.040995 -0.107607 -0.025392 -0.057998
#we can show summary statistics
portfolio.describe()
##         MSFT Ret   AAPL Ret     FB Ret     EW Ret
## count  70.000000  70.000000  70.000000  70.000000
## mean    0.020001   0.014115   0.029171   0.021096
## std     0.060147   0.068881   0.108615   0.051157
## min    -0.130248  -0.144094  -0.301817  -0.097536
## 25%    -0.013046  -0.025317  -0.026631  -0.012286
## 50%     0.020251   0.013738   0.021528   0.021190
## 75%     0.051554   0.069492   0.070892   0.053385
## max     0.196409   0.141225   0.479100   0.180674
#we can also show the minimum and maximum dates in our index, to see what period we cover
print ("the min date in our portfolio is", portfolio.index.min(), "and the max date is ",portfolio.index.max())
## the min date in our portfolio is 2012-06-30 00:00:00 and the max date is  2018-03-31 00:00:00

Then we plot the series of cumulative returns:

#we generate and plot the series of cumulative returns of the portfolio and of the individual stocks
portfolio['Cum ret FB']= (1 + portfolio['FB Ret']).cumprod() - 1
portfolio['Cum ret MSFT']= (1 + portfolio['MSFT Ret']).cumprod() - 1
portfolio['Cum ret AAPL']= (1 + portfolio['AAPL Ret']).cumprod() - 1
portfolio['Cum ret EW']= (1 + portfolio['EW Ret']).cumprod() - 1
plt.clf()
portfolio[['Cum ret FB','Cum ret MSFT','Cum ret AAPL','Cum ret EW']].plot(kind='line', title='Cumulative returns over time')
plt.ylabel("Cumulative monthly returns")
plt.xlabel("")
plt.show()

So, does creating a portfolio allow us to earn the average return of the three stocks for less risk? Based on the chart and the summary statistics, yes, to some extent: the standard deviation of the portfolio return is lower than that of any of the three individual stocks. We can surely do better, perhaps by playing with the weights.

We now pick the weights ourselves (weights have to sum to 1):

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
facebook_price=pd.read_csv("facebook_price_quandl.csv",parse_dates=['Date'], index_col='Date')
apple_price=pd.read_csv("apple_price_quandl.csv",parse_dates=['Date'], index_col='Date')
microsoft_price=pd.read_csv("microsoft_price_quandl.csv",parse_dates=['Date'], index_col='Date')
microsoft_price['MSFT Ret']=microsoft_price['Adj. Close'].pct_change()
microsoft_ret=microsoft_price[['MSFT Ret']]
facebook_price['FB Ret']=facebook_price['Adj. Close'].pct_change()
facebook_ret=facebook_price[['FB Ret']]
apple_price['AAPL Ret']=apple_price['Adj. Close'].pct_change()
apple_ret=apple_price[['AAPL Ret']]
portfolio_2=pd.merge(microsoft_ret,facebook_ret,how='inner',left_index=True,right_index=True)
portfolio_2=pd.merge(portfolio_2,apple_ret,how='inner',left_index=True,right_index=True)
portfolio_2.dropna(inplace=True)
#this time we use weights (arbitrary ones)
portfolio_2['Port Ret']= (portfolio_2['MSFT Ret']*0.5+portfolio_2['FB Ret']*0.2+portfolio_2['AAPL Ret']*0.3)
portfolio_2.head()
##             MSFT Ret    FB Ret  AAPL Ret  Port Ret
## Date                                              
## 2012-06-30  0.047962  0.050507  0.010853  0.037338
## 2012-07-31 -0.036613 -0.301817  0.045822 -0.064923
## 2012-08-31  0.052751 -0.168217  0.093850  0.020887
## 2012-09-30 -0.034393  0.199468  0.002803  0.023538
## 2012-10-31 -0.040995 -0.025392 -0.107607 -0.057858
portfolio_2.describe()
##         MSFT Ret     FB Ret   AAPL Ret   Port Ret
## count  70.000000  70.000000  70.000000  70.000000
## mean    0.020001   0.029171   0.014115   0.020070
## std     0.060147   0.108615   0.068881   0.047265
## min    -0.130248  -0.301817  -0.144094  -0.084401
## 25%    -0.013046  -0.026631  -0.025317  -0.017531
## 50%     0.020251   0.021528   0.013738   0.023003
## 75%     0.051554   0.070892   0.069492   0.049676
## max     0.196409   0.479100   0.141225   0.148814
portfolio_2['Cum ret FB']= (1 + portfolio_2['FB Ret']).cumprod() - 1
portfolio_2['Cum ret MSFT']= (1 + portfolio_2['MSFT Ret']).cumprod() - 1
portfolio_2['Cum ret AAPL']= (1 + portfolio_2['AAPL Ret']).cumprod() - 1
portfolio_2['Cum ret Port']= (1 + portfolio_2['Port Ret']).cumprod() - 1
plt.clf()
portfolio_2[['Cum ret FB','Cum ret MSFT','Cum ret AAPL','Cum ret Port']].plot(kind='line', title='Cumulative returns over time')
plt.ylabel("Cumulative monthly returns")
plt.xlabel("")
plt.show()

We merge the returns series from both portfolios to compare them:

to_compare=pd.merge(portfolio[['EW Ret','Cum ret EW']],portfolio_2[['Port Ret','Cum ret Port']],left_index=True,right_index=True,how='inner')
to_compare.describe()
##           EW Ret  Cum ret EW   Port Ret  Cum ret Port
## count  70.000000   70.000000  70.000000     70.000000
## mean    0.021096    1.221384   0.020070      1.062610
## std     0.051157    0.977839   0.047265      0.864152
## min    -0.097536   -0.077919  -0.084401     -0.066433
## 25%    -0.012286    0.424457  -0.017531      0.377866
## 50%     0.021190    1.164019   0.023003      0.969778
## 75%     0.053385    1.743369   0.049676      1.505660
## max     0.180674    3.310285   0.148814      3.006633

To make a proper comparison, we compute for each portfolio its Sharpe ratio: the average return it generates per unit of risk (standard deviation):

Sharpe_ratio_EW=to_compare['EW Ret'].mean()/to_compare['EW Ret'].std()
Sharpe_ratio_W=to_compare['Port Ret'].mean()/to_compare['Port Ret'].std()
print("The sharpe ratio of the equally-weighted portfolio " + str(round(Sharpe_ratio_EW,2)) + ", while the sharpe ratio of the portfolio where we pick our own stocks is  " + str(round(Sharpe_ratio_W,2))+ ".")
## The sharpe ratio of the equally-weighted portfolio 0.41, while the sharpe ratio of the portfolio where we pick our own stocks is  0.42.
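
The Sharpe ratios above are monthly figures (and, as a simplification, they ignore the risk-free rate). If you want an annualized number, a common convention - an addition of ours, not something the course data requires - is to multiply the monthly ratio by the square root of 12:

#optional: annualize the monthly Sharpe ratios by scaling with the square root of 12 (a common convention, added here for illustration)
import numpy as np
print("Annualized Sharpe ratios:", round(Sharpe_ratio_EW*np.sqrt(12),2), round(Sharpe_ratio_W*np.sqrt(12),2))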

The second portfolio (the one with unequal weights) seems slightly better. Notice that the visual representation is not of much help here:

plt.clf()
to_compare[['Cum ret EW','Cum ret Port']].plot(kind='line', title='Cumulative returns of both portfolios')
plt.ylabel("Cumulative monthly returns")
plt.xlabel("")
plt.show()

We conclude that the weights we picked deliver a slightly better performance-to-risk trade-off than the equally-weighted portfolio and should be favored. We could do even better by finding the optimal weights that maximize the Sharpe ratio. Optimization is beyond the scope of this course, but it is something we can do with Python.
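
For the curious, below is a minimal sketch of such an optimization using scipy.optimize.minimize; it is an illustration of ours (not part of the required material), it reuses the portfolio dataframe from above, and it searches for long-only weights summing to 1 that maximize the monthly Sharpe ratio:

#a minimal sketch (illustration only): find the long-only weights that maximize the monthly Sharpe ratio
import numpy as np
from scipy.optimize import minimize
rets = portfolio[['MSFT Ret', 'FB Ret', 'AAPL Ret']]
def neg_sharpe(weights):
    port_ret = rets.dot(weights)
    return -port_ret.mean() / port_ret.std()
start = np.array([1/3, 1/3, 1/3])                              #start from equal weights
constraints = ({'type': 'eq', 'fun': lambda w: w.sum() - 1},)  #weights sum to 1
bounds = ((0, 1), (0, 1), (0, 1))                              #long-only weights
result = minimize(neg_sharpe, start, bounds=bounds, constraints=constraints)
print("optimal weights:", result.x.round(2), "Sharpe ratio:", round(-result.fun, 2))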

7.6. Trading strategy - portfolio

We are now considering trading strategies based on a portfolio of stocks.

We first create a script that extracts the monthly stock prices of the S&P 500 constituents from Yahoo. We store the resulting csv file as sp500_const_monthly_prices.csv, available on Blackboard. The code used to collect the data is also available on Blackboard (it takes some time to run!), under collect_sp500_prices_from_yahoo.py. From here on, we assume we have the csv file at hand and we concentrate on 2000-2020.
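
For reference, here is a minimal sketch of the collection logic, with only three tickers for illustration and assuming the Yahoo endpoint of pandas_datareader is still reachable; it is not the Blackboard script, which loops over all constituents:

#a minimal sketch of the collection logic, for a handful of tickers only (illustration, not the Blackboard script)
import pandas as pd
import pandas_datareader as wb
tickers = ['MMM', 'AOS', 'ABT']   #illustration only
prices = pd.DataFrame()
for t in tickers:
    data = wb.DataReader(t, data_source='yahoo', start='2000-01-01', end='2020-12-31')
    prices[t] = data['Adj Close'].resample('M').last()   #keep the month-end adjusted close
prices.to_csv("my_sample_monthly_prices.csv", sep=";")   #hypothetical output file name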

The first thing we do is look at the performance of an equally-weighted portfolio based on these S&P 500 stocks. We generate its returns as follows:

import pandas as pd
import matplotlib.pyplot as plt
#Import dataframe with monthly stock prices for S&P 500 stocks
port = pd.read_csv("sp500_const_monthly_prices.csv", parse_dates=['Date'], index_col='Date', sep=";")
#generate returns
for i in port.columns:
    port[i]=port[i].pct_change()
#time restriction
port=port[port.index.year>=2000]
port=port[port.index.year<=2020]
#show period covered
# print ("the min date in our portfolio is", port.index.min(), "and the max date is ",port.index.max())
#show stocks in the portfolio
#for i in port.columns:
#    print(i)
#compute equally-weighted portfolio return for each date
port['EW_ret'] = port.mean(axis = 1)
ew_port= port[['EW_ret']].copy()
ew_port['EW_ret'].describe()
## count    251.000000
## mean       0.012779
## std        0.041122
## min       -0.191278
## 25%       -0.004433
## 50%        0.017686
## 75%        0.031819
## max        0.184080
## Name: EW_ret, dtype: float64
#plot cumulative returns
ew_port['EW_ret_cum']= (1 + ew_port['EW_ret']).cumprod() - 1
plt.clf()
ew_port[['EW_ret_cum']].plot(kind='line', title='Cumulative returns of SP500 EW portfolio 2000-2020')
plt.xticks(rotation=90)
plt.ylabel("Cumulative monthly returns")
plt.xlabel("")
plt.tight_layout()
plt.show()

The returns are not exactly those of the S&P 500 index. Why? Because the S&P 500 index weights stocks by their market capitalization: indexers invest in each stock in proportion to its relative importance in the market. To get closer to the index, we need to know the market capitalization of each stock at each date. We use the EIKON Refinitiv Excel add-in to collect the market capitalization over time for our list of tickers (training on EIKON Refinitiv data collection is offered by the library - take advantage of it - and request an account from the library if you want access to the data). The csv file marketcap_sp500_stocks.csv is available on Blackboard. We generate the returns of a value-weighted (i.e., weighted by market capitalization) portfolio based on the S&P 500 stock universe as follows:

#import relevant packages
import pandas as pd
import matplotlib.pyplot as plt
import pandas_datareader as wb
#load the monthly capitalization of SS&P 500 Stock that we have collected with EIKOn Refinitiv Excel Add-in
market_caps = pd.read_csv("marketcap_sp500_stocks.csv", sep=";")
#we look at what the dataset looks like
market_caps.head()
##   ticker        date Company Market Cap
## 0    MMM  29/10/2021        1.02965E+11
## 1    MMM  30/09/2021        1.01086E+11
## 2    MMM  31/08/2021        1.12684E+11
## 3    MMM  30/07/2021        1.14536E+11
## 4    MMM  30/06/2021        1.14935E+11
#Next, we clean the data we get, dropping undefined values
market_caps.dropna(inplace=True)
#we further drop duplicates
market_caps=market_caps.drop_duplicates(subset=['date', 'ticker'], keep='first')
#we pivot the dataframe to get one column per ticker and one row per date
market_caps=market_caps.pivot(index='date', columns='ticker',values='Company Market Cap')
#we add a time restriction
market_caps.index = pd.to_datetime(market_caps.index)
market_caps = market_caps.sort_index()
market_caps=market_caps[market_caps.index.year>=2000]
market_caps=market_caps[market_caps.index.year<=2020]
#corrections
market_caps=market_caps.drop(pd.to_datetime('2007-10-07'))
market_caps=market_caps.drop(pd.to_datetime('2004-05-31'))
market_caps=market_caps.drop(pd.to_datetime('2007-09-21'))
market_caps=market_caps.drop(pd.to_datetime('2010-05-31'))
print ("the min date in our portfolio is", port.index.min(), "and the max date is ",port.index.max())
#we look at what the dataset looks like
## the min date in our portfolio is 2000-01-31 00:00:00 and the max date is  2020-12-31 00:00:00
market_caps.head()
## ticker                A  AAP          ABC  ...         YUM  ZBH  ZTS
## date                                       ...                      
## 2000-01-31  29916750000  NaN  928183243.8  ...  4322375000  NaN  NaN
## 2000-02-29  46951500000  NaN  745747226.9  ...  3946497042  NaN  NaN
## 2000-03-31  47008000000  NaN    768045360  ...  4604246549  NaN  NaN
## 2000-04-28  40058500000  NaN   1024060480  ...  4994519634  NaN  NaN
## 2000-05-31  33298523570  NaN   1241673332  ...  4321826030  NaN  NaN
## 
## [5 rows x 324 columns]
#we now create the weights
#we make sure the stored values are floats
market_caps=market_caps.astype(float)
#we sum the capitalization by month
market_caps['tot_cap'] = market_caps.sum(axis = 1, numeric_only=True, skipna=True)
#then for each column, we express the market cap of the stock for a given month as a percentage of the total market cap of the market this month
for i in market_caps.columns:
    market_caps[i]=market_caps[i]/market_caps['tot_cap']
#we look at what the dataset looks like
market_caps.head()
## ticker             A  AAP       ABC       ABT  ...       YUM  ZBH  ZTS  tot_cap
## date                                           ...                             
## 2000-01-31  0.005484  NaN  0.000170  0.009239  ...  0.000792  NaN  NaN      1.0
## 2000-02-29  0.009282  NaN  0.000147  0.010097  ...  0.000780  NaN  NaN      1.0
## 2000-03-31  0.008344  NaN  0.000136  0.009667  ...  0.000817  NaN  NaN      1.0
## 2000-04-28  0.007071  NaN  0.000181  0.010512  ...  0.000882  NaN  NaN      1.0
## 2000-05-31  0.005764  NaN  0.000215  0.010912  ...  0.000748  NaN  NaN      1.0
## 
## [5 rows x 325 columns]
#we then load return data - we have seen how to do that already
port = pd.read_csv("sp500_const_monthly_prices.csv", parse_dates=['Date'], index_col='Date', sep=";")
for i in port.columns:
    port[i]=port[i].pct_change()
port=port[port.index.year>=2000]
port=port[port.index.year<=2020]
#make sure the date format is the same in both datasets (last day of the month is not consistent, so we go for Year-Month)
port.index=port.index.strftime('%Y-%m')
market_caps.index=market_caps.index.strftime('%Y-%m')
#we create a new dataframe that will be used to store the weighted returns
wr_portfolio = pd.DataFrame(index=market_caps.index)
#we loop through the stocks (the columns) of the port dataframe and, for each row, multiply the returns by their corresponding weights in the market_caps dataframe.
for i in port.columns:
    try:
     wr_portfolio[i]=(port[[i]]*market_caps[[i]]).values
    except:
         continue
#we look at what the dataset looks like
wr_portfolio.head()
##               MMM       AOS       ABT  ACN  ...  XYL       YUM  ZBH  ZTS
## 2000-01       NaN       NaN       NaN  NaN  ...  NaN       NaN  NaN  NaN
## 2000-02 -0.000361 -0.000006  0.000136  NaN  ...  NaN -0.000055  NaN  NaN
## 2000-03  0.000026  0.000002  0.000641  NaN  ...  NaN  0.000136  NaN  NaN
## 2000-04 -0.000077  0.000005  0.000852  NaN  ...  NaN -0.000025  NaN  NaN
## 2000-05 -0.000076  0.000004  0.000818  NaN  ...  NaN -0.000022  NaN  NaN
## 
## [5 rows x 324 columns]
# We next generate the returns of the value-weighted portfolio
wr_portfolio['VW_ret'] = wr_portfolio.sum(axis = 1)
wr_portfolio['VW_ret'].describe()
## count    252.000000
## mean       0.009610
## std        0.034778
## min       -0.145400
## 25%       -0.004952
## 50%        0.014027
## 75%        0.027175
## max        0.120006
## Name: VW_ret, dtype: float64
#We plot the cumulative returns of the vw portfolio:
wr_portfolio['VW_ret_cum']= (1 + wr_portfolio['VW_ret']).cumprod() - 1
plt.clf()
wr_portfolio[['VW_ret_cum']].plot(kind='line', title='Cumulative returns of S&P 500 value-weighted portfolio over 2000-2020')
plt.ylabel("Cumulative monthly returns")
plt.xlabel("")
plt.tight_layout()
plt.show()

This pattern looks much more familiar (it is close to the S&P 500 index charts we saw before), which was expected. Now we have generated the returns of two portfolios using the S&P 500 universe of stocks. We merge the returns of both strategies in the same dataframe to compare their performance:

#we create a new dataframe and add the returns from the equally-weighted and value-weighted portfolios
my_two_port = pd.DataFrame(index=wr_portfolio.index)
my_two_port=pd.merge(my_two_port,wr_portfolio[['VW_ret_cum','VW_ret']],left_index=True, right_index=True)
ew_port.index=ew_port.index.strftime('%Y-%m')
my_two_port=pd.merge(my_two_port,ew_port[['EW_ret_cum','EW_ret']],left_index=True, right_index=True)
my_two_port.describe()
##        VW_ret_cum      VW_ret  EW_ret_cum      EW_ret
## count  252.000000  252.000000  251.000000  251.000000
## mean     2.352893    0.009610    5.007270    0.012779
## std      2.275817    0.034778    4.800174    0.041122
## min     -0.056850   -0.145400    0.016671   -0.191278
## 25%      0.548677   -0.004952    1.281712   -0.004433
## 50%      1.245664    0.014027    2.683182    0.017686
## 75%      3.811630    0.027175    8.135304    0.031819
## max      8.571950    0.120006   18.631502    0.184080
#generating and showing sharpe ratios
Sharpe_ratio_EW=my_two_port["EW_ret"].mean()/my_two_port["EW_ret"].std()
Sharpe_ratio_VW=my_two_port["VW_ret"].mean()/my_two_port["VW_ret"].std()
print("The sharpe ratio of the equally-weighted portfolio " + str(round(Sharpe_ratio_EW,2)) + ", while the sharpe ratio of the portfolio where we pick our own stocks is  " + str(round(Sharpe_ratio_VW,2))+ ".")
## The sharpe ratio of the equally-weighted portfolio 0.31, while the sharpe ratio of the portfolio where we pick our own stocks is  0.28.
#plot of cumulative returns for both series
plt.clf()
my_two_port[['VW_ret_cum','EW_ret_cum']].plot(kind='line', title='Cumulative returns of two S&P 500 portfolios over 2000-2020')
plt.ylabel("Cumulative monthly returns")
plt.xlabel("")
plt.tight_layout()
plt.show()

We notice that the equally-weighted portfolio outperforms the value-weighted one: its average return is higher, even after adjusting for risk (as shown by the Sharpe ratios). This is because small caps tend to beat big caps (referred to as the SMB (Small Minus Big) factor). Relative to a value-weighted approach, the equally-weighted one gives larger weights to small caps relative to big caps, which explains the better performance.

Given that it has been repeatedly shown that small-cap stocks outperform big-cap stocks, can we test a strategy that takes advantage of this? We try to exploit an anomaly (in the sense that this outperformance is not predicted by the CAPM). If this anomaly does not expose us to higher risk, being long on small-cap stocks and short on big-cap stocks seems to be a great strategy, because it exposes us to little risk: systematic (market-wide) risk is cancelled out and, since the portfolio remains diversified, we are not exposed to the specific (idiosyncratic) risk of each company. Let us give it a try. Each month, we are long on stocks with a market capitalization in the bottom quintile (20% lowest values) and short on stocks with a market capitalization in the top quintile (20% highest values):

#import required packages
import pandas as pd
import matplotlib.pyplot as plt
import pandas_datareader as wb
import numpy as np
#load and reprocess the market cap data
market_caps = pd.read_csv("marketcap_sp500_stocks.csv", sep=";")
market_caps=market_caps.drop_duplicates(subset=['date', 'ticker'], keep='first')
market_caps=market_caps.pivot(index='date', columns='ticker',values='Company Market Cap')
market_caps.index = pd.to_datetime(market_caps.index)
market_caps = market_caps.sort_index()
market_caps=market_caps[market_caps.index.year>=2000]
market_caps=market_caps[market_caps.index.year<=2020]
market_caps=market_caps.drop(pd.to_datetime('2007-10-07'))
market_caps=market_caps.drop(pd.to_datetime('2004-05-31'))
market_caps=market_caps.drop(pd.to_datetime('2007-09-21'))
market_caps=market_caps.drop(pd.to_datetime('2010-05-31'))
print ("the min date in our portfolio is", market_caps.index.min(), "and the max date is ",market_caps.index.max())
## the min date in our portfolio is 2000-01-31 00:00:00 and the max date is  2020-12-31 00:00:00
#for each month, generate top quintile and bottom quintile market cap values
market_caps=market_caps.astype(float)
market_caps['bot_quint'] = market_caps.quantile(q=0.20,axis=1)
market_caps['top_quint'] = market_caps.quantile(q=0.80,axis=1)
#look at the data
market_caps.head()
## ticker                 A  AAL  AAP  AAPL  ...  ZION  ZTS     bot_quint     top_quint
## date                                      ...                                       
## 2000-01-31  2.991675e+10  NaN  NaN   NaN  ...   NaN  NaN  1.331235e+09  1.771372e+10
## 2000-02-29  4.695150e+10  NaN  NaN   NaN  ...   NaN  NaN  1.151575e+09  1.685044e+10
## 2000-03-31  4.700800e+10  NaN  NaN   NaN  ...   NaN  NaN  1.306204e+09  1.821853e+10
## 2000-04-28  4.005850e+10  NaN  NaN   NaN  ...   NaN  NaN  1.339296e+09  1.843272e+10
## 2000-05-31  3.329852e+10  NaN  NaN   NaN  ...   NaN  NaN  1.424639e+09  1.824903e+10
## 
## [5 rows x 504 columns]
#we create a copy of the market_caps dataframe
market_caps_small=market_caps.copy()
#we loop through the market cap values; if the value is lower than or equal to the bottom quintile for this month, we set a weight of 1, and 0 otherwise
for i in market_caps_small.columns:
    try:
        market_caps_small[i] = np.where(market_caps_small[i] <= market_caps_small['bot_quint'] , 1, 0)
    except:
         continue
market_caps_small=market_caps_small.drop(columns=['bot_quint'])
market_caps_small=market_caps_small.drop(columns=['top_quint'])

#we follow a similar approach for big caps
market_caps_big=market_caps.copy()
for i in market_caps_big.columns:
    try:
        market_caps_big[i] = np.where(market_caps_big[i] >= market_caps_big['top_quint'] , 1, 0)
    except:
         continue
market_caps_big=market_caps_big.drop(columns=['top_quint'])
market_caps_big=market_caps_big.drop(columns=['bot_quint'])
#load return data as usual
port = pd.read_csv("sp500_const_monthly_prices.csv", parse_dates=['Date'], index_col='Date', sep=";")
for i in port.columns:
    port[i]=port[i].pct_change()
port=port[port.index.year>=2000]
port=port[port.index.year<=2020]

#assign weights to stocks
#make sure both dataframe have the same number of rows (same dates) and columns (same tickers)
#make sure dates have the same format
port.index=port.index.strftime('%Y-%m')
market_caps_small.index=market_caps_small.index.strftime('%Y-%m')

#create small-cap port returns
portfolio_small = pd.DataFrame(index=market_caps_small.index)
for i in port.columns:
    try:
        portfolio_small[i]=(port[[i]]*market_caps_small[[i]]).values
    except:
         continue
#keep where returns defined and different from 0
portfolio_small.replace(0, np.nan, inplace=True)
portfolio_small['small_ret'] = portfolio_small.mean(axis = 1)
portfolio_small['small_ret'].describe()
## count    251.000000
## mean       0.012763
## std        0.051308
## min       -0.259929
## 25%       -0.004045
## 50%        0.018283
## 75%        0.033268
## max        0.245605
## Name: small_ret, dtype: float64
#create big-cap port returns
market_caps_big.index=market_caps_big.index.strftime('%Y-%m')
portfolio_big = pd.DataFrame(index=market_caps_big.index)
for i in port.columns:
    try:
        portfolio_big[i]=(port[[i]]*market_caps_big[[i]]).values
    except:
         continue
#keep where returns defined and different from 0
portfolio_big.replace(0, np.nan, inplace=True)
portfolio_big['big_ret'] = portfolio_big.mean(axis = 1)
portfolio_big['big_ret'].describe()
## count    251.000000
## mean       0.009777
## std        0.035934
## min       -0.136525
## 25%       -0.007003
## 50%        0.014476
## 75%        0.027770
## max        0.127006
## Name: big_ret, dtype: float64
#put small and big returns together in the same dataframe
SMB_port = pd.DataFrame(index=portfolio_big.index)
SMB_port=pd.merge(SMB_port,portfolio_big[['big_ret']],left_index=True, right_index=True)
SMB_port=pd.merge(SMB_port,portfolio_small[['small_ret']],left_index=True, right_index=True)
SMB_port.describe()
##           big_ret   small_ret
## count  251.000000  251.000000
## mean     0.009777    0.012763
## std      0.035934    0.051308
## min     -0.136525   -0.259929
## 25%     -0.007003   -0.004045
## 50%      0.014476    0.018283
## 75%      0.027770    0.033268
## max      0.127006    0.245605
#implement long short strategy:
SMB_port['port_ret'] = SMB_port['small_ret']-SMB_port['big_ret']
SMB_port['port_ret'].describe()
## count    251.000000
## mean       0.002986
## std        0.026447
## min       -0.148546
## 25%       -0.009583
## 50%        0.002620
## 75%        0.014673
## max        0.137161
## Name: port_ret, dtype: float64
#show cumulative return on plot
SMB_port['port_ret_cum']= (1 + SMB_port['port_ret']).cumprod() - 1
plt.clf()
SMB_port[['port_ret_cum']].plot(kind='line', title='Cumulative returns of SMB strategy over 2000-2020')
plt.ylabel("Cumulative monthly returns")
plt.xlabel("")
plt.tight_layout()
plt.show()

Such a strategy delivers a positive return on average; however, one would need to adjust for transaction costs to see whether it is worth pursuing.
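
As a rough illustration (the 10 basis points per leg below is an arbitrary assumption of ours, not an estimate of actual costs), one could subtract a fixed monthly trading cost from the long-short returns and re-examine the statistics:

#a rough sketch: subtract an assumed monthly trading cost from the long-short returns (arbitrary cost assumption)
monthly_cost = 0.0010 * 2   #10 basis points on each leg (long and short)
SMB_port['port_ret_net'] = SMB_port['port_ret'] - monthly_cost
SMB_port['port_ret_net_cum'] = (1 + SMB_port['port_ret_net']).cumprod() - 1
SMB_port[['port_ret', 'port_ret_net']].describe()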

We next use a so-called smart beta approach. A smart beta strategy may use the same constituents as the S&P 500, but assigns alternative weightings to each stock in an effort to boost risk-adjusted returns. Smart beta funds are measured against benchmark indices, aiming for a few percentage points of extra return or reduced volatility, with performance assessed on a risk-adjusted basis. We implement a smart beta portfolio that takes advantage of the superior performance of small-cap stocks, picking weights that overweight the small caps:

import pandas as pd
import matplotlib.pyplot as plt
import pandas_datareader as wb
import numpy as np
market_caps = pd.read_csv("marketcap_sp500_stocks.csv", sep=";")
market_caps=market_caps.drop_duplicates(subset=['date', 'ticker'], keep='first')
market_caps=market_caps.pivot(index='date', columns='ticker',values='Company Market Cap')
market_caps.index = pd.to_datetime(market_caps.index)
market_caps = market_caps.sort_index()
market_caps=market_caps[market_caps.index.year>=2000]
market_caps=market_caps[market_caps.index.year<=2020]
market_caps=market_caps.drop(pd.to_datetime('2007-10-07'))
market_caps=market_caps.drop(pd.to_datetime('2004-05-31'))
market_caps=market_caps.drop(pd.to_datetime('2007-09-21'))
market_caps=market_caps.drop(pd.to_datetime('2010-05-31'))
market_caps=market_caps.astype(float)
market_caps=market_caps.rank(axis=1)
#this time we pick the weights according to the rank of the market capitalization of a stock in a given month; weights grow only linearly with the rank instead of proportionally to the capitalization, which overweights the smaller caps relative to a value-weighted scheme
market_caps['tot_rank'] = market_caps.sum(axis = 1, numeric_only=True, skipna=True)
for i in market_caps.columns:
    market_caps[i]=market_caps[i]/market_caps['tot_rank']
market_caps=market_caps.drop(columns='tot_rank')
#load return data
port = pd.read_csv("sp500_const_monthly_prices.csv", parse_dates=['Date'], index_col='Date', sep=";")
for i in port.columns:
    port[i]=port[i].pct_change()
port=port[port.index.year>=2000]
port=port[port.index.year<=2020]
#create a new dataframe with weighted returns
port.index=port.index.strftime('%Y-%m')
market_caps.index=market_caps.index.strftime('%Y-%m')
new_portfolio = pd.DataFrame(index=market_caps.index)
for i in port.columns:
    try:
     new_portfolio[i]=(port[[i]]*market_caps[[i]]).values
    except:
         continue
#generate port returns
new_portfolio['W_ret'] = new_portfolio.sum(axis = 1)
new_portfolio['W_ret'].describe()
## count    252.000000
## mean       0.010960
## std        0.037667
## min       -0.177444
## 25%       -0.004597
## 50%        0.015634
## 75%        0.028599
## max        0.154123
## Name: W_ret, dtype: float64
#plot cumulative returns
new_portfolio['W_ret_cum']= (1 + new_portfolio['W_ret']).cumprod() - 1
plt.clf()
new_portfolio[['W_ret_cum']].plot(kind='line', title='Cumulative returns of SP500 rank-weighted (smart beta) portfolio 2000-2020')
plt.ylabel("Cumulative monthly returns")
plt.xlabel("")
plt.tight_layout()
plt.show()

#compare to standard value-weighted approach (S&P 500 benchmark)
my_two_port = pd.DataFrame(index=wr_portfolio.index)
my_two_port=pd.merge(my_two_port,wr_portfolio[['VW_ret_cum','VW_ret']],left_index=True, right_index=True)
my_two_port=pd.merge(my_two_port,new_portfolio[['W_ret_cum','W_ret']],left_index=True, right_index=True)
my_two_port.describe()
##        VW_ret_cum      VW_ret   W_ret_cum       W_ret
## count  252.000000  252.000000  252.000000  252.000000
## mean     2.352893    0.009610    3.497749    0.010960
## std      2.275817    0.034778    3.219116    0.037667
## min     -0.056850   -0.145400   -0.065183   -0.177444
## 25%      0.548677   -0.004952    0.959120   -0.004597
## 50%      1.245664    0.014027    1.997892    0.015634
## 75%      3.811630    0.027175    5.684738    0.028599
## max      8.571950    0.120006   12.054691    0.154123
#plot of cum returns
plt.clf()
my_two_port[['VW_ret_cum','W_ret_cum']].plot(kind='line', title='Cumulative returns of SP500 portfolios 2000-2020')
plt.xticks(rotation=90)
plt.ylabel("Cumulative monthly returns")
plt.xlabel("")
plt.tight_layout()
plt.show()

#sharpe ratios
Sharpe_ratio_VW=my_two_port["VW_ret"].mean()/my_two_port["VW_ret"].std()
Sharpe_ratio_W=my_two_port["W_ret"].mean()/my_two_port["W_ret"].std()
print("The sharpe ratio of the value-weighted portfolio " + str(round(Sharpe_ratio_VW,2)) + ", while the sharpe ratio of the portfolio where we pick our own weights is  " + str(round(Sharpe_ratio_W,2))+ ".")
## The sharpe ratio of the value-weighted portfolio 0.28, while the sharpe ratio of the portfolio where we pick our own weights is  0.29.

We could also apply the momentum strategy we saw at the index level to individual stocks (try it!). There are numerous other market factors and anomalies we can use to try to derive a positive return while being hedged against both market risk and idiosyncratic risk, or to create a portfolio that beats the S&P 500 (the value-weighted benchmark). For instance, it has been shown that investors tend to overinvest in stocks whose ticker starts with the letter “A” (top of the list) or in tickers that actually mean something (like “BROS”, “LUCK”, or “DNA”). Feel free to test strategies based on such insights.

7.7. Exercise

Test a momentum strategy of your choice at the stock level within the S&P 500 universe, that is, one that is short on stocks experiencing negative returns and long on stocks experiencing positive returns. Assess its performance. Then, see how you could create a smart beta portfolio using momentum, and compare the performance of such a portfolio to that of the S&P 500 index.

8. Assignment 2 (50%)

Design an investment strategy: you can pick the security of your choice (e.g., stocks, bonds, forex, crypto, commodities) and the strategy of your choice (e.g., factors, anomalies, technical analysis, smart beta). Implement it using Python. Discuss its performance relative to appropriate benchmarks, if any. You can collect the raw data (returns and other characteristics) from the source of your choice (Yahoo, Quandl, Bloomberg, EIKON…). Write a short report about it. The report presents the strategy (motivation), details how you got the data, introduces your analysis, shows the key outputs and comments on them, and finally concludes. The marking criteria are:

  • 2 marks for identifying and downloading a dataset and formatting it the right way in Python.
  • 8 marks for generating relevant tables and plots when it comes to a trading strategy.
  • 4 marks for introducing the dataset and the quality of the discussion of the tables and plots.
  • 2 marks for commenting the code appropriately.
  • 4 marks for the ability to incorporate features not covered in class or to deal with other securities.

Everything you do should be well explained, so that by reading your code and the document your lecturer understands what you examine and what is reported. Send the code, the data files, and the report by the due date to your lecturer at , making sure to use PYTHON - ASSIGNMENT 2 - NAME as the title of the email. The size of all files together should not exceed 10 MB.

9. Textual analysis

With Python, we can also analyze text. This is useful to derive information from news, transcripts of earnings calls, or financial reports. For instance, researchers have tested trading strategies based on the readability of reports, the quality of management answers to analysts’ questions during earnings calls, or the tone/sentiment conveyed by news about a company. In the following section, we show how to conduct such textual analyses.

9.1. Analyzing a piece of text

We will need to install and import the package nltk, which provides a diverse set of natural language processing algorithms. We start with a blog article that you can find here. We save it as a txt file (article_edmans.txt on Blackboard) and open it in Python:

txt=open("article_edmans.txt",'r',encoding="utf8")
txt = txt.read()
#lower case
txt=txt.lower()
#print the first 1000 characters only
print(txt[0:1000])
## are ceos really the problem?
## a reader (in fact, one that provided invaluable input into “grow the pie”) alerted me to the article “ceos are the problem” by renowned mit economist daron acemoglu. the article makes the following arguments:
## 
## the current model of capitalism (shareholder primacy) is excessively influenced by milton friedman’s 1970 article that “the social responsibility of business is to increase its profits”.
## capitalism has erroneously followed the recommendations of michael jensen – that ceo wealth should be more closely tied to shareholder value, to prevent “agency costs” that arise when ceos pursue their own objectives rather than shareholders’.
## “the problem with shareholder primacy … was that it handed a massive amount of power to top managers. many ceos now run their companies according to their own personal vision. there is very little social oversight, and executive compensation has soared”.
## under shareholder primacy, “there is nothing that prevents [ceos] from purs

We can identify sentences in the text using the nltk package:

from nltk.tokenize import sent_tokenize
tokenized_text=sent_tokenize(txt)
tokenized_text[0]
## 'are ceos really the problem?'
tokenized_text[1]
## 'a reader (in fact, one that provided invaluable input into “grow the pie”) alerted me to the article “ceos are the problem” by renowned mit economist daron acemoglu.'

We can also identify words in the text using the nltk package:

from nltk.tokenize import word_tokenize
tokenized_word=word_tokenize(txt)
tokenized_word[1:10]
## ['ceos', 'really', 'the', 'problem', '?', 'a', 'reader', '(', 'in']

Next, still relying on the nltk package, we can show the distribution of words of the text, to get a sense of what the text is about:

from nltk.probability import FreqDist
import matplotlib.pyplot as plt
fdist = FreqDist(tokenized_word)
plt.clf()
fdist.plot(30,cumulative=False)

As you can see, the result is polluted by the presence of non-alphabetical characters and stop words (the, as, I, am, and so on…). They add noise to our analysis, so we want to remove them.

We first remove special characters by defining our own set of characters to exclude:

sign = {".",":","(",")","-","'","e.g","?",";",'’',"!",'.',',','”','“','–'}
filtered_sent=[]
for w in tokenized_word:
    if w not in sign:
        filtered_sent.append(w)
tokenized_word[0:30]  
## ['are', 'ceos', 'really', 'the', 'problem', '?', 'a', 'reader', '(', 'in', 'fact', ',', 'one', 'that', 'provided', 'invaluable', 'input', 'into', '“', 'grow', 'the', 'pie', '”', ')', 'alerted', 'me', 'to', 'the', 'article', '“']
filtered_sent[0:30]        
## ['are', 'ceos', 'really', 'the', 'problem', 'a', 'reader', 'in', 'fact', 'one', 'that', 'provided', 'invaluable', 'input', 'into', 'grow', 'the', 'pie', 'alerted', 'me', 'to', 'the', 'article', 'ceos', 'are', 'the', 'problem', 'by', 'renowned', 'mit']

Another approach that is more general (and more efficient) is to use the package re to get rid of the special characters:

import re
txt = re.sub('[^A-Za-z0-9]+', " ",txt)
print(txt[0:1000])
## are ceos really the problem a reader in fact one that provided invaluable input into grow the pie alerted me to the article ceos are the problem by renowned mit economist daron acemoglu the article makes the following arguments the current model of capitalism shareholder primacy is excessively influenced by milton friedman s 1970 article that the social responsibility of business is to increase its profits capitalism has erroneously followed the recommendations of michael jensen that ceo wealth should be more closely tied to shareholder value to prevent agency costs that arise when ceos pursue their own objectives rather than shareholders the problem with shareholder primacy was that it handed a massive amount of power to top managers many ceos now run their companies according to their own personal vision there is very little social oversight and executive compensation has soared under shareholder primacy there is nothing that prevents ceos from pursuing excesisve automation to reduce

We start from there, recreate the tokens, and now remove the stop words as follows:

from nltk.corpus import stopwords
stop_words=set(stopwords.words("english"))
tokenized_word=word_tokenize(txt)
filtered_sent=[]
for w in tokenized_word:
    if w not in stop_words:
        filtered_sent.append(w)
print("Filtered Sentence:",filtered_sent[1:25])
## Filtered Sentence: ['really', 'problem', 'reader', 'fact', 'one', 'provided', 'invaluable', 'input', 'grow', 'pie', 'alerted', 'article', 'ceos', 'problem', 'renowned', 'mit', 'economist', 'daron', 'acemoglu', 'article', 'makes', 'following', 'arguments', 'current']

We now can show a cleaner distribution:

from nltk.probability import FreqDist
fdist = FreqDist(filtered_sent)
plt.clf()
fdist.plot(30,cumulative=False)

At this point, we may want to show a cloud of words based on the text file using the package wordcloud, to display the words with the most occurrences (the package processes the text in its own way and takes as argument the raw text):

from wordcloud import WordCloud
word_cloud = WordCloud(collocations = False, background_color = 'white').generate(txt)
plt.clf()
plt.imshow(word_cloud, interpolation='bilinear')
plt.show()

We can also count the occurrences of specific words:

short_term={"short-term","now","one day","quarter","recent", "short"}
count_short_term=0
for w in filtered_sent:
    if w in short_term:
        count_short_term=count_short_term+1
    else:
        count_short_term=count_short_term
print("Nb of short-term voc words: ", count_short_term)
## Nb of short-term voc words:  6
long_term={"long-term","future","years","long"}
count_long_term=0
for w in filtered_sent:
    if w in long_term:
        count_long_term=count_long_term+1
    else:
        count_long_term=count_long_term
print("Nb of long-term voc words: ", count_long_term)
## Nb of long-term voc words:  7
esg={"social","esg","responsibility","csr"}
count_esg=0
for w in filtered_sent:
    if w in long_term:
        count_esg=count_esg+1
    else:
        count_esg=count_esg
print("Nb of ESG voc words: ", count_esg)
## Nb of ESG voc words:  7

We can further refine our analysis; we have two options. Stemming is a process of linguistic normalization which reduces a word to its root by chopping off derivational affixes. For example, connection, connected, and connecting all reduce to the common root “connect”. This can be useful to narrow down the list of words and extrapolate the tone, sentiment, or topic of the text. In Python, using the nltk package, we proceed as follows:

from nltk.stem import PorterStemmer
ps = PorterStemmer()
stemmed_words=[]
for w in filtered_sent:
    stemmed_words.append(ps.stem(w))
fdist = FreqDist(stemmed_words)
plt.clf()
fdist.plot(30,cumulative=False)

Alternatively, we can use lemmatization, which reduces words to their base form (the lemma) using a vocabulary and morphological analysis. For example, the word “better” has “good” as its lemma. This is missed by stemming, because such a mapping requires a dictionary look-up (see the small demo after the code below).

from nltk.stem.wordnet import WordNetLemmatizer
lem = WordNetLemmatizer()
lem_words=[]
for w in filtered_sent:
    lem_words.append(lem.lemmatize(w))
fdist = FreqDist(lem_words)
plt.clf()
fdist.plot(30,cumulative=False)
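
As a quick illustration of the difference (a small check of ours, not in the original material), compare how the stemmer and the lemmatizer treat the word “better”: the lemmatizer maps it to “good” once told it is an adjective, while the stemmer leaves it unchanged:

#small demo: stemming vs lemmatization on the word "better"
from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer
print(PorterStemmer().stem("better"))
print(WordNetLemmatizer().lemmatize("better", pos="a"))   #"a" tells the lemmatizer the word is an adjective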

Finally, we can ask: how readable is this article? The SMOG grade is a measure of readability that estimates the years of education needed to understand a piece of writing; the higher the score, the harder the text is to read. More information is available there. We use the readability package to compute such measures. The package analyzes the text and provides us with readability scores as well as other valuable information:

import readability
#we restart from scratch because the package has its own way to process the text
txt=open("article_edmans.txt",'r',encoding="utf8")
txt = txt.read()
results = readability.getmeasures(txt, lang='en')
print(results)
## OrderedDict([('readability grades', OrderedDict([('Kincaid', 33.09553976143141), ('ARI', 40.6877435387674), ('Coleman-Liau', 14.258367613651426), ('FleschReadingEase', -7.831441848906536), ('GunningFogIndex', 37.89371769383698), ('LIX', 107.060337972167), ('SMOGIndex', 23.89258241577618), ('RIX', 23.85), ('DaleChallIndex', 14.536116222664017)])), ('sentence info', OrderedDict([('characters_per_word', 5.178926441351889), ('syll_per_word', 1.63220675944334), ('words_per_sentence', 75.45), ('sentences_per_paragraph', 1.5384615384615385), ('type_token_ratio', 0.4300861497680583), ('characters', 7815), ('syllables', 2463), ('words', 1509), ('wordtypes', 649), ('sentences', 20), ('paragraphs', 13), ('long_words', 477), ('complex_words', 291), ('complex_words_dc', 684)])), ('word usage', OrderedDict([('tobeverb', 59), ('auxverb', 26), ('conjunction', 39), ('pronoun', 100), ('preposition', 178), ('nominalization', 27)])), ('sentence beginnings', OrderedDict([('pronoun', 0), ('interrogative', 0), ('article', 4), ('subordination', 0), ('conjunction', 1), ('preposition', 3)]))])
#SMOG Index in particular
print(results['readability grades']['SMOGIndex'])
## 23.89258241577618

9.2. Financial reports

Now let us say we want to compare the text of Facebook’s 2012 financial report to the 2020 one. First, we download the 2012 and 2020 financial reports from Facebook’s investor webpage and save each as a txt file. You can find both reports on Blackboard under rep_10K_2012.txt and rep_10K_2020.txt.

We load the data:

#load the reports
f_10k_2020 = open('rep_10k_2020.txt','r')
contents_2020 = f_10k_2020.read()
f_rep_2012 =open('rep_10k_2012.txt','r')
contents_2012 = f_rep_2012.read()

#lower cases and remove special characters
import re
contents_2020 = re.sub('[^A-Za-z0-9]+', " ",contents_2020)
contents_2012 = re.sub('[^A-Za-z0-9]+', " ",contents_2012)
contents_2020 = contents_2020.lower()
contents_2012 = contents_2012.lower()

We first compare the cloud of words of both reports:

from wordcloud import WordCloud
word_cloud = WordCloud(collocations = False, background_color = 'white').generate(contents_2020)
plt.clf()
plt.imshow(word_cloud, interpolation='bilinear')
plt.axis("off")
plt.title("2020")
plt.show()

word_cloud = WordCloud(collocations = False, background_color = 'white').generate(contents_2012)
plt.clf()
plt.imshow(word_cloud, interpolation='bilinear')
plt.axis("off")
plt.title("2012")
plt.show()

Then, we compare the presence of climate-change related words:

from nltk.tokenize import word_tokenize
tokenized_word=word_tokenize(contents_2020)
voc={"temperature","global warming","warming","climate change", "heat", "environment", "carbon"}
count_voc=0
for w in tokenized_word:
    if w in voc:
        count_voc=count_voc+1
    else:
        count_voc=count_voc
print("2020: Nb of climate-change-related words: ", count_voc, "over a total of words of: ", len(tokenized_word), "that is a ratio of:", round((count_voc/len(tokenized_word))*100,3), "%")
## 2020: Nb of climate-change-related words:  12 over a total of words of:  70489 that is a ratio of: 0.017 %
tokenized_word=word_tokenize(contents_2012)
voc={"temperature","global warming","warming","climate change", "heat", "environment", "carbon"}
count_voc=0
for w in tokenized_word:
    if w in voc:
        count_voc=count_voc+1
    else:
        count_voc=count_voc
print("2012: Nb of climate-change-related words: ", count_voc, "over a total of words of: ", len(tokenized_word), "that is a ratio of:", round((count_voc/len(tokenized_word))*100,3), "%")
## 2012: Nb of climate-change-related words:  5 over a total of words of:  56380 that is a ratio of: 0.009 %

As you can see, code-wise what we have done is not optimal: we have written almost the same piece of code twice. This is the perfect setting to use a function instead:

#we create the function
#it takes two arguments: the text of the report and the year of the report (to print information)
def assess_prev_climate(txt,year):
  tokenized_word=word_tokenize(txt)
  voc={"temperature","global warming","warming","climate change", "heat", "environment", "carbon"}
  count_voc=0
  for w in tokenized_word:
    if w in voc:
        count_voc=count_voc+1
    else:
        count_voc=count_voc
  print(year, ": Nb climate-change-related words: ", count_voc, "over a total of words of: ", len(tokenized_word), "that is a ratio of:", round((count_voc/len(tokenized_word))*100,3), "%")
#we call the function
assess_prev_climate(contents_2012,2012)
## 2012 : Nb climate-change-related words:  5 over a total of words of:  56380 that is a ratio of: 0.009 %
assess_prev_climate(contents_2020,2020)
## 2020 : Nb climate-change-related words:  12 over a total of words of:  70489 that is a ratio of: 0.017 %

9.3. Earnings calls

Next, we analyze earnings calls transcripts. An earnings call is a conference call between the management of a public company, analysts, investors, and the media to discuss the company’s financial results during a given reporting period, such as a quarter or a fiscal year. An earnings call is usually preceded by an earnings report, which contains summary information on financial performance for the period.

Below, we consider the transcript of Apple’s 2021-Q2 earnings call, available here, which you can also find in txt format on Blackboard under earnings calls apple Q2-2021.txt. We import it into Python and analyze it:

f = open('earnings calls apple Q2-2021.txt','r')
txt = f.read()
print(txt[0:1000])
## Operator
## 
## Good day, and welcome to the Apple Q2 FY 2021 earnings conference call. Today's call is being recorded. At this time, for opening remarks and introductions, I'd like to turn the call over to Tejas Gala, director, investor relations, and corporate finance. Please go ahead.
## 
## Tejas Gala -- Director, Investor Relations, and Corporate Finance
## 
## Thank you. Good afternoon and thank you for joining us. Speaking first today is Apple's CEO, Tim Cook. And he'll be followed by CFO, Luca Maestri.
## 
## After that, we'll open the call to questions from analysts. Please note that some of the information you'll hear during our discussion today will consist of forward-looking statements, including, without limitation, those regarding revenue, gross margin, operating expenses, other income and expense, taxes, capital allocation and future business outlook, including the potential impact of COVID-19 on the company's business results of operation. These statements involve risks and uncertainties that
#prepare set of names to remove them from transcript
names = {
"Operator","Tejas Gala -- Director, Investor Relations, and Corporate Finance", 
"Tim Cook -- Chief Executive Officer", 
"Luca Maestri -- Chief Financial Officer", 
"Shannon Cross -- Cross Research LLC -- Analyst",
"Amit Daryanani -- Evercore ISI -- Analyst",
"Katy Huberty -- Morgan Stanley -- Analyst",
"Wamsi Mohan -- Bank of America Merrill Lynch -- Analyst",
"Aaron Rakers -- Wells Fargo Securities -- Analyst",
"Harsh Kumar -- Piper Sandler -- Analyst",
"Krish Sankar -- Cowen and Company -- Analyst",
"Kyle McNealy -- Jefferies -- Analyst",
"David Vogt -- UBS -- Analsyt",
"Samik Chatterjee -- J.P. Morgan -- Analyst"}
names= [x.lower() for x in names]
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist
from nltk.corpus import stopwords
import matplotlib.pyplot as plt
from wordcloud import WordCloud
import re

#remove uppercase
txt=txt.lower()
#clean non-alphabetical characters
txt = re.sub('[^A-Za-z0-9]+', " ",txt)
#identify words
tokenized_word=word_tokenize(txt)
stop_words=set(stopwords.words("english"))
words_without_names=[]
for w in tokenized_word:
    if w not in names:
        if w not in stop_words:
            words_without_names.append(w)

#show distrib       
fdist = FreqDist(words_without_names)
plt.clf()
fdist.plot(30,cumulative=False)

#show cloud of words
word_cloud = WordCloud(collocations = False, background_color = 'white').generate(txt)
plt.clf()
plt.imshow(word_cloud, interpolation='bilinear')
plt.axis("off")
plt.show()

Now, we can investigate another aspect of a text: the sentiment it conveys. To do that, we rely on the package textblob, a Python library for processing textual data. For more information, visit this link.

from textblob import TextBlob
f = open('earnings calls apple Q2-2021.txt','r')
txt = f.read()
blob = TextBlob(txt)

Now that the text has been processed, we can, for instance, look at the part-of-speech tags:

blob.tags[0:20]
## [('Operator', 'NN'), ('Good', 'NNP'), ('day', 'NN'), ('and', 'CC'), ('welcome', 'VB'), ('to', 'TO'), ('the', 'DT'), ('Apple', 'NNP'), ('Q2', 'NNP'), ('FY', 'NNP'), ('2021', 'CD'), ('earnings', 'NNS'), ('conference', 'NN'), ('call', 'NN'), ('Today', 'NN'), ("'s", 'POS'), ('call', 'NN'), ('is', 'VBZ'), ('being', 'VBG'), ('recorded', 'VBN')]

Next, we can find the tone of the text. The package calculates polarity and subjectivity. Polarity is a float in the range [-1, 1], where 1 means a positive statement and -1 a negative one. Subjectivity is also a float, in the range [0, 1]: subjective sentences generally convey personal opinion, emotion, or judgment, whereas objective ones convey factual information.

print(blob.sentiment)
## Sentiment(polarity=0.20160016726961513, subjectivity=0.4511521125429976)

We can see that the polarity is 0.2, which means that the statements are positive on average, and a subjectivity of 0.45 indicates that mostly factual information (rather than opinion) is provided. This makes sense for an earnings call.

Next, we can count the sentences that are positive, negative, or neutral, using the polarity scores given by the package textblob:

negative = 0
positive = 0
neutral = 0
all_sentences = []

for sentence in blob.sentences:
  if sentence.sentiment.polarity < 0:
    negative +=1
  elif sentence.sentiment.polarity > 0:
    positive += 1
  else:
    neutral += 1
 
  all_sentences.append(sentence.sentiment.polarity) 

print('positive: '  +  str(positive) + " //// " + 'negative: ' +  str(negative) + " //// " + 'neutral: ' + str(neutral))
## positive: 277 //// negative: 43 //// neutral: 181

Another interesting feature is spelling correction; for instance:

blob = TextBlob('I think that she is the best plyer on earthh')
blob.correct()
## TextBlob("I think that she is the best player on earth")

9.4. Exercise

Find a piece of text of your choice (article, news, report, transcript) and analyze it: extract the most used words, the sentiment, information about average sentence length, and readability. Show outputs rendering this information.

10. Regression analysis

We conclude this introductory course by looking at how to run a statistical test and a regression analysis in Python. We want to know the determinants of the performance of French listed companies during the COVID crisis. In other words, we aim to identify determinants that are significantly associated with better or worse performance during the COVID crisis. For more background information, feel free to read this research article. The csv file for this analysis, also available on Blackboard, is data_regression.csv.

The sample we use consists of publicly listed French companies, headquartered in France, listed on Euronext (Euronext Paris or Euronext Growth Paris), with available data to compute crisis returns and baseline control variables, and not belonging to the financials or utilities industries.

We want to explain the performance in the COVID crisis (the variation in the market value of the firm, hence the total return over this period) with a set of usual determinants of firms’ performance in times of crisis (based on prior research), such as the CAPM beta (a measure of exposure to market-wide risk) or financial flexibility (debt or profitability).

Regarding the data, notice that:

  • RIC is the EIKON identifier of the company.
  • bh_ret_covid_crisis is the cumulated return in the COVID-19 crisis period (i.e., the period February 20-March 16, 2020).
  • beta_mkt is the CAPM beta, computed as the slope parameter of a regression of monthly excess stock returns on monthly CAC 40 index excess returns over 2017-2019 (60 months); a minimal sketch of this computation follows the list.
  • market_cap is the market capitalization in euros as of end 2019.
  • book_to_market is the book value of shareholder equity divided by the market capitalization as of end 2019.
  • momentum_3m is the cumulated stock return in the last three months of 2019. Cash Holdings is cash and marketable securities scaled by assets.
  • total_debt is debt in current liabilities and long-term debt divided by total assets as of end 2019.
  • profitability is EBITDA scaled by total assets as of end 2019.
  • naicssectorname is the NAICS sector of the firm.
  • naicsindustrygroupname is the NAICS industry group of the firm.
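
To make the beta_mkt definition concrete, here is a minimal sketch of how such a slope coefficient can be estimated; the two return series below are simulated for illustration, they are not the actual EIKON data:

#a minimal sketch of the CAPM beta estimation on simulated monthly excess returns (illustration only)
import numpy as np
from scipy import stats
rng = np.random.default_rng(0)
market_exc = rng.normal(0.005, 0.04, 60)                 #60 months of market excess returns
stock_exc = 0.8 * market_exc + rng.normal(0, 0.05, 60)   #a stock with a true beta of 0.8 plus noise
result = stats.linregress(market_exc, stock_exc)
print("estimated beta:", round(result.slope, 2))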

We import some packages, load the data, and show some descriptive statistics about our sample firms:

#imports
import pandas as pd
import numpy as np
#to get all the columns displayed in the console:
pd.set_option('display.max_columns', 20)
#get the dataset
data = pd.read_csv("data_regression.csv", sep=";")
#look the first rows
data.head()
##         ric  momentum_3m  beta_mkt  bh_ret_covid_crisis  \
## 0  2CRSI.PA    -0.110000  0.535461            -55.26316   
## 1   ABEO.PA    -0.102273  0.424667            -44.28969   
## 2   ABNX.PA     0.283721  0.978936            -53.80710   
## 3    ABS.PA     0.510280  0.434276            -29.62500   
## 4   ABVX.PA     0.716354  0.856822            -31.21693   
## 
##                             naicsindustrygroupname  \
## 0  Computer and Peripheral Equipment Manufacturing   
## 1                Other Miscellaneous Manufacturing   
## 2     Scientific Research and Development Services   
## 3     Scientific Research and Development Services   
## 4     Scientific Research and Development Services   
## 
##                                     naicssectorname  total_debt  \
## 0                                     Manufacturing    0.136640   
## 1                                     Manufacturing    0.374026   
## 2  Professional, Scientific, and Technical Services    0.115426   
## 3  Professional, Scientific, and Technical Services    0.809865   
## 4  Professional, Scientific, and Technical Services    0.201673   
## 
##    profitability   market_cap  book_to_market  long_term_debt  short_term_debt  
## 0       0.054863   71200000.0        0.715909        0.116309         0.000091  
## 1       0.061362  127000000.0        0.785116        0.314257         0.015813  
## 2      -0.239340    7289216.0        0.914502        0.002244         0.000000  
## 3      -0.924465  224000000.0       -0.119883        0.790755         0.000338  
## 4      -0.616119  229000000.0        0.125273        0.201673         0.000000
#data info
data.info()
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 437 entries, 0 to 436
## Data columns (total 12 columns):
## ric                       437 non-null object
## momentum_3m               437 non-null float64
## beta_mkt                  437 non-null float64
## bh_ret_covid_crisis       437 non-null float64
## naicsindustrygroupname    437 non-null object
## naicssectorname           437 non-null object
## total_debt                437 non-null float64
## profitability             437 non-null float64
## market_cap                437 non-null float64
## book_to_market            437 non-null float64
## long_term_debt            437 non-null float64
## short_term_debt           436 non-null float64
## dtypes: float64(9), object(3)
## memory usage: 41.1+ KB
#create a measure of size (the log of the market capitalization)
data["size"] = np.log(data["market_cap"])
data.drop(columns=['market_cap'], inplace=True)
#make Python treat industry and sector names as categories
data["naicsindustrygroupname"]=data["naicsindustrygroupname"].astype("category")
data["naicssectorname"]=data["naicssectorname"].astype("category")
#get summary statistics on the data
data.describe()
##        momentum_3m    beta_mkt  bh_ret_covid_crisis  total_debt  \
## count   437.000000  437.000000           437.000000  437.000000   
## mean      0.066078    0.745535           -33.721645    0.262290   
## std       0.231867    0.626571            15.035371    0.187516   
## min      -0.337452   -1.528690           -65.730340    0.000199   
## 25%      -0.061188    0.337194           -44.156710    0.129541   
## 50%       0.016949    0.712194           -34.743590    0.240967   
## 75%       0.114942    1.121302           -24.489230    0.344642   
## max       1.174174    2.902456            16.666670    1.084589   
## 
##        profitability  book_to_market  long_term_debt  short_term_debt  \
## count     437.000000      437.000000      437.000000       436.000000   
## mean        0.008379        0.681371        0.187270         0.021007   
## std         0.230737        0.769105        0.162202         0.047808   
## min        -1.055941       -2.345557        0.000000         0.000000   
## 25%         0.007419        0.286042        0.076011         0.000000   
## 50%         0.079472        0.583342        0.159767         0.000560   
## 75%         0.120129        0.959655        0.253803         0.016189   
## max         0.385376        3.425897        0.914144         0.250870   
## 
##              size  
## count  437.000000  
## mean    18.948788  
## std      2.571449  
## min      4.271137  
## 25%     17.126054  
## 50%     18.577684  
## 75%     20.424860  
## max     26.065600
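
Note in the output of data.info() above that short_term_debt has only 436 non-null values, so one observation has a missing value for this variable. The sketch below is purely illustrative (the fillna(0) imputation is an assumption, not something we apply in the rest of this section); it shows how one could locate that observation and, if desired, fill the missing value:

#how many missing values in short_term_debt, and which firm is concerned?
print(data["short_term_debt"].isna().sum())
print(data[data["short_term_debt"].isna()][["ric", "short_term_debt"]])
#one possible treatment (an assumption, commented out on purpose): replace the missing value with 0
#data["short_term_debt"] = data["short_term_debt"].fillna(0)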

Next we can show the average performance in the COVID crisis by sector group:

#what are these sectors?
data["naicssectorname"].unique()
## [Manufacturing, Professional, Scientific, and Technical Services, Accommodation and Food Services, Transportation and Warehousing, Real Estate and Rental and Leasing, ..., Construction, Arts, Entertainment, and Recreation, Other Services (except Public Administration), Health Care and Social Assistance, Finance and Insurance]
## Length: 17
## Categories (17, object): [Manufacturing, Professional, Scientific, and Technical Services,
##                           Accommodation and Food Services, Transportation and Warehousing, ...,
##                           Arts, Entertainment, and Recreation, Other Services (except Public Administration),
##                           Health Care and Social Assistance, Finance and Insurance]
#we truncate sector names to 15 characters so the plot labels stay readable
data["naicssectorname"]=data["naicssectorname"].str[:15]
import matplotlib.pyplot as plt
#we show the number of observations by sector
plt.clf()
data[["bh_ret_covid_crisis","naicssectorname"]].groupby("naicssectorname").count().plot(kind='bar')
plt.title("Nb observations per sector")
plt.tight_layout()
plt.show()

#we show the average performance in the COVID crisis by sector
plt.clf()
data[["bh_ret_covid_crisis","naicssectorname"]].groupby("naicssectorname").mean().plot(kind='bar')
plt.title("Average return in the COVID crisis by sector")
plt.tight_layout()
plt.show()
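
As an optional refinement (not needed for the rest of the analysis), one could sort the sectors by their average crisis return before plotting, which often makes the bar chart easier to read; a minimal sketch:

#optional: sort sectors by average crisis return before plotting
plt.clf()
avg_by_sector = data.groupby("naicssectorname")["bh_ret_covid_crisis"].mean().sort_values()
avg_by_sector.plot(kind='barh')
plt.title("Average return in the COVID crisis by sector (sorted)")
plt.tight_layout()
plt.show()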

Now, it is time to run our first statistical test. We assume you already know how to test, on paper, whether a sample mean is equal to 0. We will use the statistical package scipy to run statistical tests and our first OLS regression:

from scipy import stats
#Is the performance in the crisis on average equal to 0? 
stats.ttest_1samp(data["bh_ret_covid_crisis"], 0)
## Ttest_1sampResult(statistic=-46.88515046582387, pvalue=2.123474203321066e-172)

We strongly reject the null hypothesis that the average performance in the crisis is equal to 0.
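
To connect this output with the formula you know from statistics, here is a minimal sketch (purely illustrative, assuming the data is already loaded as above) that recomputes the one-sample t-statistic by hand as the sample mean divided by its standard error; the result should match the statistic reported above (about -46.9):

#recompute the one-sample t-statistic by hand: t = mean / (std / sqrt(n))
x = data["bh_ret_covid_crisis"]
t_by_hand = x.mean() / (x.std() / np.sqrt(len(x)))
print(t_by_hand)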

Next, as is usual in this type of analysis, we show the correlation matrix for the main variables of interest (the package jinja2 needs to be installed beforehand for the styled output to render):

corr = data[["bh_ret_covid_crisis","total_debt", "beta_mkt", "book_to_market", "momentum_3m"]].corr()
corr.style.background_gradient(cmap='coolwarm').set_precision(2)
                     bh_ret_covid_crisis  total_debt  beta_mkt  book_to_market  momentum_3m
bh_ret_covid_crisis                    1       -0.16     -0.29          0.0043         0.02
total_debt                         -0.16           1     0.065           -0.19       -0.034
beta_mkt                           -0.29       0.065         1         -0.0043       -0.079
book_to_market                    0.0043       -0.19   -0.0043               1       -0.081
momentum_3m                         0.02      -0.034    -0.079          -0.081            1

We observe some positive and negative correlations; the CAPM beta and total debt stand out, both being negatively correlated with the return in the crisis.
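
If you prefer a quick ranking to reading the full matrix, a minimal sketch (using the corr object created above) extracts the correlation of each characteristic with the crisis return and sorts it:

#rank the characteristics by their correlation with the crisis return
print(corr["bh_ret_covid_crisis"].drop("bh_ret_covid_crisis").sort_values())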

Now, we want to be able to say something about the characteristics of the companies that performed poorly in the crisis versus those that performed relatively well. To do so, we first create two groups by splitting the sample at the median value of the return in the crisis (Above-Median Performers and Below-Median Performers), and then we show some summary statistics for our main variables of interest in both groups:

#we create the two groups
data['ret_group'] = np.where(data['bh_ret_covid_crisis']>data['bh_ret_covid_crisis'].median(), "Above Median Perf", "Below Median Perf")
data["ret_group"]=data["ret_group"].astype("category")
data["ret_group"].value_counts()
## Below Median Perf    219
## Above Median Perf    218
## Name: ret_group, dtype: int64
data["ret_group"].value_counts(normalize = True)
## Below Median Perf    0.501144
## Above Median Perf    0.498856
## Name: ret_group, dtype: float64
#for a given variable, we can show the descriptive statistics by group
data.groupby("ret_group")["total_debt"].describe().reset_index()
##            ret_group  count      mean       std       min       25%       50%  \
## 0  Above Median Perf  218.0  0.249689  0.180125  0.000199  0.123336  0.229590   
## 1  Below Median Perf  219.0  0.274833  0.194195  0.000199  0.136607  0.253803   
## 
##         75%       max  
## 0  0.346307  1.084589  
## 1  0.343652  1.084589
data.groupby("ret_group")["beta_mkt"].describe().reset_index()
#data.groupby("ret_group")["profitability"].describe().reset_index()
#data.groupby("ret_group")["book_to_market"].describe().reset_index()
#data.groupby("ret_group")["size"].describe().reset_index()
##            ret_group  count      mean       std       min       25%       50%  \
## 0  Above Median Perf  218.0  0.592393  0.545064 -0.792336  0.231823  0.581912   
## 1  Below Median Perf  219.0  0.897976  0.665322 -1.528690  0.474036  0.918492   
## 
##         75%       max  
## 0  0.925259  2.238012  
## 1  1.272749  2.902456
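
Rather than calling describe() variable by variable, one can also get a compact comparison of several characteristics at once; a minimal sketch using the same grouping:

#compact comparison: mean and standard deviation of the main characteristics by group
data.groupby("ret_group")[["total_debt", "beta_mkt", "book_to_market", "momentum_3m"]].agg(["mean", "std"])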

To render this information in a nice plot, we can compute a standardized mean value per group and then use a bar chart:

plt.clf()
mean = data.groupby("ret_group")[["total_debt", "beta_mkt", "book_to_market", "momentum_3m"]].mean()
std = data.groupby("ret_group")[["total_debt", "beta_mkt", "book_to_market", "momentum_3m"]].std()
std_value = mean / std
std_value.plot(kind='bar')
plt.title("Average standardized characteristics by group")
plt.xlabel("")
plt.tight_layout()
plt.show()
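
Note that dividing each group mean by the group standard deviation is only one way to put the variables on a comparable scale. An alternative sketch (not what is done above) standardizes each variable over the full sample first (z-scores) and then averages by group:

#alternative: z-score each characteristic over the full sample, then average by group
plt.clf()
chars = ["total_debt", "beta_mkt", "book_to_market", "momentum_3m"]
zscores = (data[chars] - data[chars].mean()) / data[chars].std()
zscores["ret_group"] = data["ret_group"]
zscores.groupby("ret_group").mean().plot(kind='bar')
plt.title("Average z-scored characteristics by group")
plt.xlabel("")
plt.tight_layout()
plt.show()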

While most characteristics look about the same in both groups, the beta (book-to-market) seems markedly higher (lower) in the group of bad performers.

Building on these observations, we can test whether there is a statistically significant difference in the mean CAPM beta between the two groups. We run a t-test using the scipy package:

group_1=data[data["ret_group"]=="Above Median Perf"]["beta_mkt"]
group_2=data[data["ret_group"]=="Below Median Perf"]["beta_mkt"]
stats.ttest_ind(group_1,group_2)
## Ttest_indResult(statistic=-5.2506590709109435, pvalue=2.3756415277964736e-07)

We reject the null hypothesis that the mean beta is equal across the two groups. This is worth further investigation…
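
Note that, by default, stats.ttest_ind assumes equal variances in the two groups. As an optional robustness check (not used above), one could run Welch's t-test, which relaxes this assumption:

#Welch's t-test: does not assume equal variances across the two groups
stats.ttest_ind(group_1, group_2, equal_var=False)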

We show, for all sample observations, the relation between the CAPM beta and the performance in the COVID crisis with a scatter plot:

plt.clf()
data[["bh_ret_covid_crisis", "beta_mkt"]].plot(x="beta_mkt", y ="bh_ret_covid_crisis", kind="scatter")
plt.show()

It seems that there is a negative relationship: the higher the CAPM beta, the worse the performance in the crisis, on average. This is to be expected, as the CAPM beta measures the exposure of a stock to market-wide (undiversifiable) shocks. We next want to add the regression line (univariate regression) to the plot; to do so, we use the package seaborn:

import seaborn as sns
plt.clf()
sns.lmplot(x='beta_mkt',y='bh_ret_covid_crisis',data=data,fit_reg=True,line_kws={'color': 'red'}) 
plt.show()

We now want to run the univariate OLS regression and show the estimation results. We could use the scipy package to that end but the output, while correct, is not nicely presented.

stats.linregress(x=data["beta_mkt"], y=data["bh_ret_covid_crisis"])
## LinregressResult(slope=-6.993634679165248, intercept=-28.507648704277972, rvalue=-0.29144676614261505, pvalue=5.283540773708276e-10, stderr=1.1005849511973613)
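
Note that the rvalue returned by linregress is the correlation between the two variables; its square is the R-squared of the univariate regression and should match the value reported by statsmodels below (about 0.085). A quick check:

#the squared correlation equals the R-squared of the univariate regression
lr = stats.linregress(x=data["beta_mkt"], y=data["bh_ret_covid_crisis"])
print(lr.rvalue**2)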

We instead use the package statsmodels:

import statsmodels.formula.api as smf
reg = smf.ols('bh_ret_covid_crisis ~ beta_mkt', data = data)
res = reg.fit()
print(res.summary())
##                              OLS Regression Results                            
## ===============================================================================
## Dep. Variable:     bh_ret_covid_crisis   R-squared:                       0.085
## Model:                             OLS   Adj. R-squared:                  0.083
## Method:                  Least Squares   F-statistic:                     40.38
## Date:                 Mon, 06 Dec 2021   Prob (F-statistic):           5.28e-10
## Time:                         23:05:51   Log-Likelihood:                -1784.6
## No. Observations:                  437   AIC:                             3573.
## Df Residuals:                      435   BIC:                             3581.
## Df Model:                            1                                         
## Covariance Type:             nonrobust                                         
## ==============================================================================
##                  coef    std err          t      P>|t|      [0.025      0.975]
## ------------------------------------------------------------------------------
## Intercept    -28.5076      1.071    -26.610      0.000     -30.613     -26.402
## beta_mkt      -6.9936      1.101     -6.354      0.000      -9.157      -4.831
## ==============================================================================
## Omnibus:                       26.220   Durbin-Watson:                   1.972
## Prob(Omnibus):                  0.000   Jarque-Bera (JB):               36.666
## Skew:                           0.473   Prob(JB):                     1.09e-08
## Kurtosis:                       4.058   Cond. No.                         2.75
## ==============================================================================
## 
## Notes:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

We can observe the global fit of the model (R-squared): it explains 8.5% of the variation in the dependent variable (the performance in the crisis). The slope of the line, that is, the coefficient of the linear relationship between performance and beta, is about -7. It is significantly different from 0, as indicated by the t-statistic (-6.35) and the p-value (0.00) associated with the coefficient. Economically, a one-standard-deviation increase in the CAPM beta (the standard deviation is 0.62, a plausible within-sample variation) corresponds to a performance in the crisis that is lower by about 4.34 percentage points (0.62 * -7). Beta is thus an important determinant.
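
This back-of-the-envelope calculation can also be done directly in Python from the fitted model; a minimal sketch, assuming res still holds the univariate fit estimated above:

#economic magnitude: effect of a one-standard-deviation increase in beta on the crisis return
print(res.params["beta_mkt"] * data["beta_mkt"].std())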

Finally, we can run a multivariate regression to check whether the effect of the CAPM beta survives when controlling for the other determinants. What we do here is very close to what is reported in Table 2 (Pre-Crisis Financial Flexibility and Stock Returns in the COVID-19 crisis) of the benchmark research paper; we just omit the sector dummies (a sketch of how to add them is given after the regression output).

reg = smf.ols('bh_ret_covid_crisis ~ total_debt + beta_mkt + book_to_market + momentum_3m + size ', data = data)
res = reg.fit()
print(res.summary())
##                              OLS Regression Results                            
## ===============================================================================
## Dep. Variable:     bh_ret_covid_crisis   R-squared:                       0.105
## Model:                             OLS   Adj. R-squared:                  0.095
## Method:                  Least Squares   F-statistic:                     10.11
## Date:                 Mon, 06 Dec 2021   Prob (F-statistic):           3.60e-09
## Time:                         23:05:52   Log-Likelihood:                -1779.8
## No. Observations:                  437   AIC:                             3572.
## Df Residuals:                      431   BIC:                             3596.
## Df Model:                            5                                         
## Covariance Type:             nonrobust                                         
## ==================================================================================
##                      coef    std err          t      P>|t|      [0.025      0.975]
## ----------------------------------------------------------------------------------
## Intercept        -24.8945      5.532     -4.500      0.000     -35.768     -14.021
## total_debt       -11.6054      3.737     -3.105      0.002     -18.950      -4.260
## beta_mkt          -6.7742      1.115     -6.074      0.000      -8.966      -4.582
## book_to_market    -0.5118      0.920     -0.557      0.578      -2.319       1.296
## momentum_3m       -0.6349      3.046     -0.208      0.835      -6.622       5.352
## size              -0.0181      0.279     -0.065      0.948      -0.566       0.530
## ==============================================================================
## Omnibus:                       31.492   Durbin-Watson:                   1.953
## Prob(Omnibus):                  0.000   Jarque-Bera (JB):               43.311
## Skew:                           0.554   Prob(JB):                     3.94e-10
## Kurtosis:                       4.072   Cond. No.                         160.
## ==============================================================================
## 
## Notes:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

We find that, holding the other variables constant, higher pre-crisis debt is significantly associated with a worse performance in the COVID crisis. Similarly, holding the other variables constant, a higher pre-crisis CAPM beta is significantly associated with a worse performance in the COVID crisis. As we did before, we can quantify the economic effect of both relations. Size, momentum, and book-to-market do not seem to be related to the performance in the COVID crisis based on this regression.
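
Here is how one could quantify those economic effects directly from the fitted model and, to get even closer to the benchmark paper's Table 2, add sector dummies (fixed effects). The sketch below is illustrative (the names reg_fe and res_fe are ours); C() tells the statsmodels formula interface to treat the sector name as a categorical variable:

#economic magnitude: effect of a one-standard-deviation increase in each significant determinant
print(res.params["total_debt"] * data["total_debt"].std())
print(res.params["beta_mkt"] * data["beta_mkt"].std())
#same regression with sector dummies (fixed effects), closer to the benchmark paper
reg_fe = smf.ols('bh_ret_covid_crisis ~ total_debt + beta_mkt + book_to_market + momentum_3m + size + C(naicssectorname)', data = data)
res_fe = reg_fe.fit()
print(res_fe.summary())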

The plot below shows that, for book-to-market, the linear relationship with the performance in the crisis is not different from zero; the fitted line looks very flat:

plt.clf()
sns.lmplot(x='book_to_market',y='bh_ret_covid_crisis',data=data,fit_reg=True,line_kws={'color': 'red'})
plt.show()

Now it is your turn to run regressions with Python to uncover the association between some variables of interest. Define the association you want to study, collect the data, import the data into Python, apply the necessary treatments, and use what has been covered in this section to explore and show the key features of the data, as well as to assess whether the association you expect is borne out in the data.

11. Contact information

You can contact your lecturer, Alexandre Garel, at . My personal website is there.

12. Beyond our scope

Because it can be of interest to some of you, the Python file extract_sec_filing_data.py (which uses some new packages) is available on Blackboard. It scrapes data from the SEC repository (financial reports in txt format), extracts accounting figures, and creates a table out of them. This process can of course be further automated to build a larger set of accounting data. In this example, it is applied to Facebook's 10-K report to extract balance sheet information.