Introduction to Data Science in Python

Lecturer: Hillary Green-Lerman

1 Course Description

Begin your journey into Data Science! Even if you’ve never written a line of code in your life, you’ll be able to follow this course and witness the power of Python to perform Data Science. You’ll use data to solve the mystery of Bayes, the kidnapped Golden Retriever, and along the way you’ll become familiar with basic Python syntax and popular Data Science modules like Matplotlib (for charts and graphs) and pandas (for tabular data).

2 Getting Started in Python

2.1 Lecture: Dive into Python

2.2 Importing Python Modules

Modules (sometimes called packages or libraries) help group together related sets of tools in Python. Below are sample imports of modules that are frequently used by Data Scientists:

statsmodels: used in machine learning; usually aliased as sm;
seaborn: a visualization library; usually aliased as sns;
numpy: performs math operations; usually aliased as np.

import statsmodels.api as sm
import seaborn as sns
import numpy as np

Note that each module has a standard alias, which allows you to access the tools inside of the module without typing as many characters. For example, aliasing lets us shorten seaborn.scatterplot() to sns.scatterplot().
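As a small illustration of what an alias does, the sketch below imports numpy under both its full name and its alias, and checks that the two names point to the same module. The list `values` is made up for this example.

```python
import numpy
import numpy as np

# The alias and the full name refer to the same module object
same_module = np is numpy

# Made-up sample values, for illustration only
values = [1, 2, 3, 4]

# Both calls run the same function; the alias just saves typing
mean_full = numpy.mean(values)
mean_alias = np.mean(values)

print(same_module, mean_full, mean_alias)
```

In practice you would write only the aliased import; the full-name import is included here just to make the comparison possible.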

2.3 Lecture: Creating Variables

2.4 Creating Numbers & Strings

Before we start looking for Bayes’ kidnapper, we need to fill out a Missing Puppy Report with details of the case. Each piece of information will be stored as a variable.

We define a variable using an equals sign (=). For instance,

# Bayes' favorite toy
favorite_toy = "Mr. Squeaky"
type(favorite_toy)

## <class 'str'>

# Bayes' owner
owner = 'DataCamp'
owner

## 'DataCamp'

# Bayes' height
height = 24
print('height  || ', height, ' || ', type(height))

## height  ||  24  ||  <class 'int'>

# Bayes' age
bayes_age = 4.0
print('bayes_age  || ', bayes_age, ' || ', type(bayes_age))

## bayes_age  ||  4.0  ||  <class 'float'>

Note: it’s easy to make mistakes when typing strings quickly. Two common ones are described below.

Don’t forget to use quotes! Without quotes, you’ll get a name error.

# owner = DataCamp

Use the same type of quotation mark. If you start with a single quote, and end with a double quote, you’ll get a syntax error.

# fur_color = "blonde'
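The two mistakes above can be reproduced safely with the sketch below, which catches each error instead of letting it stop the program. It uses `try`/`except` and the built-in `compile()` function, neither of which is covered in this course; they are only a convenient way to show the errors here.

```python
# Case 1: forgetting the quotes.
# Python looks for a variable named DataCamp and does not find one.
try:
    owner = DataCamp
except NameError as err:
    name_error_message = str(err)

# Case 2: mismatched quotes.
# A syntax error is detected before any code runs, so the faulty line is
# passed to compile() as text in order to catch the error here.
bad_line = "fur_color = \"blonde'"
try:
    compile(bad_line, "<example>", "exec")
    syntax_error_caught = False
except SyntaxError:
    syntax_error_caught = True

print(name_error_message)
print(syntax_error_caught)
```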

2.5 Lecture: Fun with Functions

2.6 Load a DataFrame

A ransom note was left at the scene of Bayes’ kidnapping. Eventually, we’ll want to analyze the frequency with which each letter occurs in the note, to help us identify the kidnapper. For now, we just need to load the data from ransom.csv into Python. The file is read from the URL given in the code below.

We’ll load the data into a DataFrame, a special data type from the pandas module. It represents spreadsheet-like data (something with rows and columns).

We can create a DataFrame from a CSV (comma-separated value) file by using the function pd.read_csv().

# Import pandas
import pandas as pd

# Load the 'ransom.csv' into a DataFrame
url = 'https://raw.githubusercontent.com/QuanNguyenIU/DataCamp/main/Python/Intro.%20to%20Data%20Science%20in%20Python/ransom.csv'
ransom = pd.read_csv(url)

# Display DataFrame
ransom

##     letter_index letter  frequency
## 0              1      A       7.38
## 1              2      B       1.09
## 2              3      C       2.46
## 3              4      D       4.10
## 4              5      E      12.84
## 5              6      F       1.37
## 6              7      G       1.09
## 7              8      H       3.55
## 8              9      I       7.65
## 9             10      J       0.00
## 10            11      K       3.01
## 11            12      L       3.28
## 12            13      M       2.46
## 13            14      N       7.38
## 14            15      O       6.83
## 15            16      P       7.65
## 16            17      Q       0.00
## 17            18      R       4.92
## 18            19      S       4.10
## 19            20      T       6.28
## 20            21      U       4.37
## 21            22      V       1.09
## 22            23      W       2.46
## 23            24      X       0.00
## 24            25      Y       4.64
## 25            26      Z       0.00
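To experiment with pd.read_csv() without a network connection, the following sketch reads a tiny CSV written directly in the code. The three rows are copied from the top of the table above; the variable names `csv_text` and `sample` are chosen just for this example.

```python
import io
import pandas as pd

# A three-row CSV written inline, so no download is needed
csv_text = """letter,frequency
A,7.38
B,1.09
C,2.46
"""

# pd.read_csv() accepts a file path, a URL, or a file-like object
sample = pd.read_csv(io.StringIO(csv_text))

print(sample)
print(sample.shape)  # (number of rows, number of columns)
```

The first line of the CSV becomes the column names, and each following line becomes one row of the DataFrame.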

3 Loading Data in Pandas

3.1 Lecture: What is pandas?

3.2 Loading a DataFrame

We’re still working hard to solve the kidnapping of Bayes, the Golden Retriever. Assume that we have narrowed the list of suspects to:

- Fred Frequentist
- Ronald Aylmer Fisher
- Gertrude Cox
- Kirstine Smith

We’ve obtained credit card records for all four suspects. Perhaps some of them made suspicious purchases before the kidnapping?

The records are in a CSV called “credit_records.csv”. The file is read from the URL given in the code below.

# Import pandas under the alias pd
import pandas as pd

# Load the CSV "credit_records.csv"
url = 'https://raw.githubusercontent.com/QuanNguyenIU/DataCamp/main/Python/Intro.%20to%20Data%20Science%20in%20Python/credit_records.csv'
credit_records = pd.read_csv(url)

# Display the first five rows of credit_records using the .head() method
credit_records.head()

##             suspect         location              date         item  price
## 0    Kirstine Smith   Groceries R Us   January 6, 2018     broccoli   1.25
## 1      Gertrude Cox  Petroleum Plaza   January 6, 2018  fizzy drink   1.90
## 2  Fred Frequentist   Groceries R Us   January 6, 2018     broccoli   1.25
## 3      Gertrude Cox   Groceries R Us  January 12, 2018     broccoli   1.25
## 4    Kirstine Smith    Clothing Club   January 9, 2018        shirt  14.25

3.3 Inspecting a DataFrame

We’ve loaded the credit card records of our four suspects into a DataFrame called credit_records. Let’s learn more about the structure of this DataFrame. How many rows are in credit_records?

credit_records.info()

## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 104 entries, 0 to 103
## Data columns (total 5 columns):
##  #   Column    Non-Null Count  Dtype  
## ---  ------    --------------  -----  
##  0   suspect   104 non-null    object 
##  1   location  104 non-null    object 
##  2   date      104 non-null    object 
##  3   item      104 non-null    object 
##  4   price     104 non-null    float64
## dtypes: float64(1), object(4)
## memory usage: 4.2+ KB
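In the .info() output, the row count appears on the RangeIndex line: 104 entries. If you want the count as a number you can reuse, `len()` and the `.shape` attribute are two common options. The sketch below shows both on a small stand-in DataFrame named `records`, built from three of the rows displayed earlier.

```python
import pandas as pd

# Small stand-in for credit_records, for illustration only
records = pd.DataFrame({
    "suspect": ["Kirstine Smith", "Gertrude Cox", "Fred Frequentist"],
    "item": ["broccoli", "fizzy drink", "broccoli"],
    "price": [1.25, 1.90, 1.25],
})

# len() returns the number of rows
n_rows = len(records)

# .shape returns a (rows, columns) tuple
n_rows_from_shape, n_columns = records.shape

print(n_rows, n_rows_from_shape, n_columns)
```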

3.4 Lecture: Selecting columns

3.5 Two methods for selecting columns

Once again, we’ve loaded the credit card records of our four suspects into a DataFrame called credit_records. Let’s examine the items that they’ve purchased.

# Select the column item from credit_records
# Use brackets and string notation
credit_records["item"]

## 0         broccoli
## 1      fizzy drink
## 2         broccoli
## 3         broccoli
## 4            shirt
##           ...     
## 99           shirt
## 100          pants
## 101          dress
## 102         burger
## 103      cucumbers
## Name: item, Length: 104, dtype: object

# Select the column item from credit_records
# Use dot notation
items = credit_records.item

Another junior detective is examining a DataFrame of Missing Puppy Reports. The file is read from the URL given in the code below.

url = 'https://raw.githubusercontent.com/QuanNguyenIU/DataCamp/main/Python/Intro.%20to%20Data%20Science%20in%20Python/mpr.csv'
mpr = pd.read_csv(url)

# Use info() to inspect mpr
print(mpr.info())

## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 6 entries, 0 to 5
## Data columns (total 5 columns):
##  #   Column      Non-Null Count  Dtype 
## ---  ------      --------------  ----- 
##  0   Dog Name    6 non-null      object
##  1   Owner Name  5 non-null      object
##  2   Dog Breed   6 non-null      object
##  3   Status      6 non-null      object
##  4   Age         6 non-null      int64 
## dtypes: int64(1), object(4)
## memory usage: 368.0+ bytes
## None

# Select column "Dog Name" from mpr
name = mpr["Dog Name"]

# Select column "Status" from mpr
is_missing = mpr["Status"]

# Display the columns
print(name, is_missing)

## 0      Bayes
## 1    Sigmoid
## 2     Sparky
## 3    Theorem
## 4        Ned
## 5      Benny
## Name: Dog Name, dtype: object 0    Still Missing
## 1    Still Missing
## 2            Found
## 3            Found
## 4    Still Missing
## 5            Found
## Name: Status, dtype: object
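The choice between the two notations matters for column names like "Dog Name". The sketch below, using a two-row toy DataFrame called `dogs`, shows that both notations return the same data for a simple name, while a name containing a space can only be written with brackets.

```python
import pandas as pd

# Toy data: one simple column name and one containing a space
dogs = pd.DataFrame({
    "Status": ["Still Missing", "Found"],
    "Dog Name": ["Bayes", "Sparky"],
})

# For a simple name, both notations return the same column
same_column = dogs["Status"].equals(dogs.Status)

# "Dog Name" contains a space, so writing dogs.Dog Name would be a
# syntax error; bracket notation is required
names = dogs["Dog Name"]

print(same_column)
print(names.tolist())
```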

3.6 Lecture: Selecting rows with logic

3.7 Logical testing

Let’s practice writing logical statements and displaying the output.

Recall that we use the following operators:

- == tests that two values are equal.
- != tests that two values are not equal.
- > and < test greater than and less than, respectively.
- >= and <= test greater than or equal to and less than or equal to, respectively.

The variable height_inches represents the height of a suspect. Is height_inches greater than 70 inches?

height_inches = 65
height_inches > 70

## False

The variable plate1 represents a license plate number of a suspect. Is it equal to FRQ123?

plate1 = 'FRQ123'
plate1 == "FRQ123"

## True

The variable fur_color represents the color of Bayes’ fur. Is fur_color not equal to “brown”?

fur_color = 'blonde'
fur_color != "brown"

## True
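For reference, the sketch below runs several of the operators on the same values used above and collects the results in a dictionary. Note the difference between a single equals sign, which assigns a value, and a double equals sign, which compares two values. The dictionary `results` exists only to make the output easy to read.

```python
height_inches = 65       # single = assigns a value
fur_color = "blonde"

# Each comparison evaluates to either True or False
results = {
    "height_inches == 65": height_inches == 65,
    "height_inches != 65": height_inches != 65,
    "height_inches > 70": height_inches > 70,
    "height_inches <= 65": height_inches <= 65,
    "fur_color == 'brown'": fur_color == "brown",
    "fur_color != 'brown'": fur_color != "brown",
}

for expression, value in results.items():
    print(expression, "->", value)
```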

3.8 Selecting missing puppies

Let’s return to our DataFrame of missing puppies, which is loaded as mpr. Let’s select a few different rows to learn more about the other missing dogs.

# Select the dogs where Age is greater than 2
mpr[mpr.Age > 2]

##   Dog Name             Owner Name       Dog Breed Status  Age
## 2   Sparky             Dr. Apache   Border Collie  Found    3
## 3  Theorem  Joseph-Louis Lagrange  French Bulldog  Found    4
## 5    Benny   Hillary Green-Lerman          Poodle  Found    3

# Select the dogs whose Status is equal to Still Missing
mpr[mpr.Status == "Still Missing"]

##   Dog Name    Owner Name         Dog Breed         Status  Age
## 0    Bayes      DataCamp  Golden Retriever  Still Missing    1
## 1  Sigmoid           NaN         Dachshund  Still Missing    2
## 4      Ned  Tim Oliphant          Shih Tzu  Still Missing    2

# Select all dogs whose Dog Breed is not equal to Poodle
mpr[mpr["Dog Breed"] != "Poodle"]

##   Dog Name             Owner Name         Dog Breed         Status  Age
## 0    Bayes               DataCamp  Golden Retriever  Still Missing    1
## 1  Sigmoid                    NaN         Dachshund  Still Missing    2
## 2   Sparky             Dr. Apache     Border Collie          Found    3
## 3  Theorem  Joseph-Louis Lagrange    French Bulldog          Found    4
## 4      Ned           Tim Oliphant          Shih Tzu  Still Missing    2
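It can help to look at what happens inside the brackets. The sketch below splits the selection into two steps, using three columns re-typed from the mpr output shown above; the names `dogs`, `older_than_two`, and `older_dogs` are chosen for this example.

```python
import pandas as pd

# Three columns re-typed from the mpr output
dogs = pd.DataFrame({
    "Dog Name": ["Bayes", "Sigmoid", "Sparky", "Theorem", "Ned", "Benny"],
    "Status": ["Still Missing", "Still Missing", "Found",
               "Found", "Still Missing", "Found"],
    "Age": [1, 2, 3, 4, 2, 3],
})

# Step 1: the comparison produces one True/False value per row
older_than_two = dogs["Age"] > 2

# Step 2: indexing with that result keeps only the rows marked True
older_dogs = dogs[older_than_two]

print(older_than_two.tolist())
print(older_dogs)
```

Writing `dogs[dogs["Age"] > 2]` on one line does the same thing without the intermediate variable.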

3.9 Narrowing the list of suspects

Recall the list of suspects that might have kidnapped Bayes:

- Fred Frequentist
- Ronald Aylmer Fisher
- Gertrude Cox
- Kirstine Smith

We’d like to narrow this list down, so we obtained credit card records for each suspect. We’d like to know if any of them recently purchased dog treats to use in the kidnapping. If they did, they would have visited ‘Pet Paradise’.

The credit records have been loaded into a DataFrame called credit_records.

# Select purchases from 'Pet Paradise'
credit_records[credit_records.location == 'Pet Paradise']

## Empty DataFrame
## Columns: [suspect, location, date, item, price]
## Index: []
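An empty DataFrame means that no row matched the condition, so none of the records list ‘Pet Paradise’ as the location. If you prefer a direct yes/no answer rather than reading the printed output, one option is the `.empty` attribute, sketched below on a small stand-in DataFrame named `records`.

```python
import pandas as pd

# Small stand-in for credit_records, for illustration only
records = pd.DataFrame({
    "suspect": ["Kirstine Smith", "Gertrude Cox", "Fred Frequentist"],
    "location": ["Groceries R Us", "Petroleum Plaza", "Groceries R Us"],
})

# Same kind of selection as above
pet_purchases = records[records["location"] == "Pet Paradise"]

# .empty is True when the selection contains no rows
no_matches = pet_purchases.empty
n_matches = len(pet_purchases)

print(no_matches, n_matches)
```

String comparison is exact, so a different spelling or capitalization in the data would also produce an empty result.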

4 Plotting Data with Matplotlib

4.1 Lecture: Creating line plots

4.2 Working hard

Several police officers have been working hard to help us solve the mystery of Bayes, the kidnapped Golden Retriever. Their commanding officer wants to know exactly how hard each officer has been working on this case. Officer Deshaun has created a DataFrame called deshaun to track the amount of time he spent working on this case. The DataFrame contains two columns:

- day_of_week: a string representing the day of the week.
- hours_worked: the number of hours that a particular officer worked on the Bayes case.

The data can be found here.

# From matplotlib, import pyplot under the alias plt
from matplotlib import pyplot as plt

# Plot Officer Deshaun's hours_worked vs. day_of_week
plt.plot(deshaun.day_of_week, deshaun.hours_worked)

# Display Deshaun's plot
plt.show()
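The code above assumes that the DataFrame deshaun has already been loaded. For a version that runs on its own, the sketch below builds a DataFrame named `officer` with the same two columns and made-up hours, then draws the same kind of line plot.

```python
import matplotlib
matplotlib.use("Agg")  # draws without opening a window; remove to see the plot
import pandas as pd
from matplotlib import pyplot as plt

# Made-up hours, standing in for the real deshaun DataFrame
officer = pd.DataFrame({
    "day_of_week": ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"],
    "hours_worked": [8, 9, 7, 10, 6],
})

# plt.plot() takes the x values first and the y values second
plt.figure()
lines = plt.plot(officer.day_of_week, officer.hours_worked)

# Display the plot
plt.show()
```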

4.3 Or hardly working?

Two other officers have been working with Deshaun to help find Bayes. Their names are Officer Mengfei and Officer Aditya. Deshaun used their time cards to create two more DataFrames: mengfei and aditya. Let’s plot all three lines together to see who was working hard each day.

# Plot Officer Deshaun's hours_worked vs. day_of_week
plt.plot(deshaun.day_of_week, deshaun.hours_worked)

# Plot Officer Aditya's hours_worked vs. day_of_week
plt.plot(aditya.day_of_week, aditya.hours_worked)

# Plot Officer Mengfei's hours_worked vs. day_of_week
plt.plot(mengfei.day_of_week, mengfei.hours_worked)

# Display all three line plots
plt.show()

4.4 Lecture: Adding text to plots

4.5 Adding a legend

Officers Deshaun, Mengfei, and Aditya have all been working with you to solve the kidnapping of Bayes. Their supervisor wants to know how much time each officer has spent working on the case.

Deshaun created a plot of data from the DataFrames deshaun, mengfei, and aditya previously. Now he wants to add a legend to distinguish the three lines.

# Officer Deshaun
plt.plot(deshaun.day_of_week, deshaun.hours_worked, label='Deshaun')

# Add a label to Aditya's plot
plt.plot(aditya.day_of_week, aditya.hours_worked, label='Aditya')

# Add a label to Mengfei's plot
plt.plot(mengfei.day_of_week, mengfei.hours_worked, label='Mengfei')

# Add a command to make the legend display
plt.legend()

# Display plot
plt.show()

4.6 Adding labels

If we give a chart with no labels to Officer Deshaun’s supervisor, she won’t know what the lines represent.

We need to add labels to Officer Deshaun’s plot of hours worked.

# Lines
plt.plot(deshaun.day_of_week, deshaun.hours_worked, label='Deshaun')
plt.plot(aditya.day_of_week, aditya.hours_worked, label='Aditya')
plt.plot(mengfei.day_of_week, mengfei.hours_worked, label='Mengfei')

# Add a title
plt.title('Hours Worked per Day of Week')

# Add y-axis label
plt.ylabel('Hours Worked')

# Legend
plt.legend()

# Display plot
plt.show()
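The same pattern extends to the x-axis with plt.xlabel(). The sketch below is a self-contained variant with two made-up officers; it sets a title, both axis labels, and a legend, then reads the text back from the plot to confirm what was set.

```python
import matplotlib
matplotlib.use("Agg")  # draws without opening a window; remove to see the plot
from matplotlib import pyplot as plt

# Made-up data for two hypothetical officers
days = ["Mon", "Tue", "Wed"]

plt.figure()
plt.plot(days, [8, 9, 7], label="Officer A")
plt.plot(days, [6, 10, 8], label="Officer B")

# Title and axis labels
plt.title("Hours Worked by Day")
plt.xlabel("Day of Week")
plt.ylabel("Hours Worked")

# The legend uses the label= text given to each plt.plot() call
plt.legend()

# Read the text back from the current axes
ax = plt.gca()
legend_labels = [text.get_text() for text in ax.get_legend().get_texts()]

plt.show()
```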

4.7 Adding floating text

Officer Deshaun is examining the number of hours that he worked over the past six months. The file is read from the URL given in the code below.

url = 'https://raw.githubusercontent.com/QuanNguyenIU/DataCamp/main/Python/Intro.%20to%20Data%20Science%20in%20Python/six_months.csv'
six_months = pd.read_csv(url)
six_months

##   month  hours_worked
## 0   Jan           160
## 1   Feb           185
## 2   Mar           182
## 3   Apr           195
## 4   Jun            50

The number for June is low because he only had data for the first week. Let’s help Deshaun by adding an annotation to the graph to explain this.

# Create plot
plt.plot(six_months.month, six_months.hours_worked)

# Add annotation "Missing June data" at (2.5, 80)
plt.text(2.5, 80, "Missing June data")

# Display graph
plt.show()
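The x-coordinate 2.5 may look odd on an axis of month names. When the x values are strings, Matplotlib places them at positions 0, 1, 2, and so on, in order of appearance, so 2.5 falls halfway between Mar and Apr. The sketch below re-types the table above into plain lists (so it runs without the CSV) and reads the positions back.

```python
import matplotlib
matplotlib.use("Agg")  # draws without opening a window; remove to see the plot
from matplotlib import pyplot as plt

# Values re-typed from the six_months table
months = ["Jan", "Feb", "Mar", "Apr", "Jun"]
hours = [160, 185, 182, 195, 50]

plt.figure()
plt.plot(months, hours)

# x = 2.5 is halfway between Mar (position 2) and Apr (position 3)
note = plt.text(2.5, 80, "Missing June data")

# The month names sit at integer positions along the x-axis
tick_positions = list(plt.gca().get_xticks())

plt.show()
```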

4.8 Lecture: Styling graphs

4.9 Tracking crime statistics

Sergeant Laura wants to do some background research to help her better understand the cultural context for Bayes’ kidnapping. She has plotted burglary rates in three U.S. cities using data from the Uniform Crime Reporting Statistics. The data can be found here.

plt.plot(data["Year"], data["Phoenix Police Dept"], label="Phoenix", color="DarkCyan")
plt.plot(data["Year"], data["Los Angeles Police Dept"], label="Los Angeles", linestyle=':')
plt.plot(data["Year"], data["Philadelphia Police Dept"], label="Philadelphia", marker='s')
plt.legend()
plt.show()

Remember:

- You can change linestyle to dotted (':'), dashed ('--'), or no line ('').
- You can change the marker to circle ('o'), diamond ('d'), or square ('s').
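The sketch below tries three of these options on made-up numbers: a dotted line, a dashed line, and square markers with no connecting line. The variable names and values are for illustration only and are not taken from the crime data.

```python
import matplotlib
matplotlib.use("Agg")  # draws without opening a window; remove to see the plot
from matplotlib import pyplot as plt

# Made-up values, for illustration only
years = [2015, 2016, 2017]

plt.figure()
# plt.plot() returns a list of lines; the (name,) form unpacks the single line
(dotted,) = plt.plot(years, [3, 4, 5], linestyle=":", label="dotted")
(dashed,) = plt.plot(years, [2, 3, 4], linestyle="--", label="dashed")
(squares,) = plt.plot(years, [1, 2, 3], linestyle="", marker="s",
                      label="markers only")

plt.legend()
plt.show()
```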

4.10 Playing with styles

Changing the plotting style is a fast way to change the entire look of your plot without having to update individual colors or line styles. Some popular styles include:

- ‘fivethirtyeight’ - Based on the color scheme of the popular website.
- ‘grayscale’ - Great for when you don’t have a color printer!
- ‘seaborn’ - Based on another Python visualization library.
- ‘classic’ - Matplotlib’s original look, which was the default before version 2.0.

# Change the style to fivethirtyeight
plt.style.use('fivethirtyeight')

# Plot lines
plt.plot(data["Year"], data["Phoenix Police Dept"], label="Phoenix")
plt.plot(data["Year"], data["Los Angeles Police Dept"], label="Los Angeles")
plt.plot(data["Year"], data["Philadelphia Police Dept"], label="Philadelphia")

# Add a legend
plt.legend()

# Display the plot
plt.show()

# Change the style to ggplot
plt.style.use('ggplot')

# Plot lines
plt.plot(data["Year"], data["Phoenix Police Dept"], label="Phoenix")
plt.plot(data["Year"], data["Los Angeles Police Dept"], label="Los Angeles")
plt.plot(data["Year"], data["Philadelphia Police Dept"], label="Philadelphia")

# Add a legend
plt.legend()

# Display the plot
plt.show()

# View all available styles
plt.style.available

## ['Solarize_Light2', '_classic_test_patch', '_mpl-gallery', '_mpl-gallery-nogrid', 'bmh', 'classic', 'dark_background', 'fast', 'fivethirtyeight', 'ggplot', 'grayscale', 'seaborn', 'seaborn-bright', 'seaborn-colorblind', 'seaborn-dark', 'seaborn-dark-palette', 'seaborn-darkgrid', 'seaborn-deep', 'seaborn-muted', 'seaborn-notebook', 'seaborn-paper', 'seaborn-pastel', 'seaborn-poster', 'seaborn-talk', 'seaborn-ticks', 'seaborn-white', 'seaborn-whitegrid', 'tableau-colorblind10']
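The list of available styles depends on the installed Matplotlib version. In newer releases, for example, the seaborn styles appear under names beginning with 'seaborn-v0_8'. A cautious approach is to check that a style exists before using it, as in the sketch below. It also uses plt.style.context(), which applies a style only inside the `with` block, instead of changing every later plot. The variable `wanted` and the sample numbers are made up for this example.

```python
import matplotlib
matplotlib.use("Agg")  # draws without opening a window; remove to see the plot
from matplotlib import pyplot as plt

# Check that the style is available in this Matplotlib installation
wanted = "ggplot"
style_exists = wanted in plt.style.available

# Fall back to Matplotlib's default style if the wanted one is missing
chosen = wanted if style_exists else "default"

# The style applies only to plots created inside this block
with plt.style.context(chosen):
    plt.figure()
    plt.plot([1, 2, 3], [3, 1, 2], label="sample")
    plt.legend()
    plt.show()
```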

4.11 Identifying Bayes’ kidnapper

We’ve narrowed the possible kidnappers down to two suspects:

- Fred Frequentist
- Gertrude Cox

The kidnapper left a long ransom note containing several unusual phrases. Let’s use a line plot to compare the frequency of letters in the ransom note to samples from the two main suspects.

Two more DataFrames have been loaded, besides ransom:

- suspect1 contains the letter frequencies for the sample from Fred Frequentist.
- suspect2 contains the letter frequencies for the sample from Gertrude Cox.

# Plot each line
plt.plot(ransom.letter, ransom.frequency,
         label = 'Ransom', linestyle = ':', color = 'gray')
plt.plot(suspect1.letter, suspect1.frequency, label='Fred Frequentist')
plt.plot(suspect2.letter, suspect2.frequency, label='Gertrude Cox')

# Add x- and y-labels
plt.xlabel("Letter")
plt.ylabel("Frequency")

# Add a legend
plt.legend()

# Display plot
plt.show()
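The plot supports a visual comparison. As a possible complement, which is not part of the course, the curves can also be compared with a single number, such as the sum of absolute differences between two frequency profiles. The sketch below uses invented frequencies for four letters; `ransom_freq`, `sample_1`, and `sample_2` are hypothetical names and do not contain the real case data.

```python
import pandas as pd

# Invented frequencies for four letters; not the real case data
letters = ["A", "B", "C", "E"]
ransom_freq = pd.Series([7.0, 1.0, 2.5, 12.5], index=letters)
sample_1 = pd.Series([7.5, 1.5, 2.0, 12.0], index=letters)
sample_2 = pd.Series([4.0, 3.0, 5.0, 9.0], index=letters)

# Sum of absolute differences: smaller means the profiles are more alike
distance_1 = (ransom_freq - sample_1).abs().sum()
distance_2 = (ransom_freq - sample_2).abs().sum()

# Name of the sample whose profile is closer to the note
closer = "sample_1" if distance_1 < distance_2 else "sample_2"

print(distance_1, distance_2, closer)
```

A small distance suggests similar writing habits but is not proof on its own; it is best read alongside the plot.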