Data Pipeline using Python

Build a data pipeline in Python that downloads data using the URLs given below, trains a random forest model on the training dataset using sklearn, and scores the model on the test dataset. The homework will be scored based on code efficiency (hint: use functions, not stream-of-consciousness coding), code cleanliness, code reproducibility, and critical thinking (hint: commenting lets me know what you are thinking!).

pull_data.py

When this is called using python pull_data.py in the command line, this will go to the two Kaggle URLs provided below, authenticate using your own Kaggle sign-on, pull the two datasets, and save them as .csv files in the current local directory. The authentication login details (aka secrets) need to be in a hidden folder (hint: use .gitignore); a sketch of one possible layout follows the dataset URLs below. There must be a data check step to ensure the data has been pulled correctly, and clear commenting and documentation for each step inside the .py file.

Training dataset url: https://www.kaggle.com/c/titanic/download/train.csv
Scoring dataset url: https://www.kaggle.com/c/titanic/download/test.csv
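
One way to keep the credentials out of version control, for example, is a secrets/ folder listed in .gitignore (the folder and file names here are illustrative, not mandated by the assignment). pull_data.py could then read that file when it exists and fall back to prompting otherwise, along the lines of the commented-out credentials line in the script below:

# .gitignore entry that keeps the credentials folder out of the repository:
#   secrets/
import os, getpass

def get_credentials(path='secrets/credentials.txt'):
    """Read the Kaggle username and password from the secrets file, or prompt if it is missing."""
    if os.path.exists(path):
        with open(path) as f:
            user, pw = f.read().splitlines()[:2]
    else:
        user = getpass.getpass(prompt='What is your Kaggle username? ')
        pw = getpass.getpass(prompt='What is your Kaggle password? ')
    return user, pw

The pull script itself:
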
import pandas as pd
import requests, os, getpass

# Point requests at the system CA bundle so HTTPS certificate verification works
os.environ['REQUESTS_CA_BUNDLE'] = '/etc/ssl/certs/ca-certificates.crt'

# Kaggle files, url, and credentials
data_sets = ['train.csv', 'test.csv']
base_url = "https://www.kaggle.com/c/titanic/download/"
user = getpass.getpass(prompt='What is your Kaggle username? ')
pw = getpass.getpass(prompt='What is your Kaggle password? ')
# user, pw = open('secrets/credentials.txt').read().split('\n')

for file in data_sets:

    # Get redirect URL, then post info to obtain data file
    r_get = requests.get(base_url + file)
    kaggle_info = {'UserName': user, 'Password': pw}
    r_post = requests.post(r_get.url, data = kaggle_info)

    # Writes data to local file
    with open(file, 'wb') as f:
        f.write(r_post.content)

    # Data check: read the file back and print descriptive statistics
    df = pd.read_csv(file)
    print(df.describe())
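
The assignment also asks for an explicit check that the pull succeeded. A minimal sketch of such a check, reusing the pandas and os imports already at the top of pull_data.py (the function name is illustrative, and the expected column names are taken from the Titanic data), which could be called on each file right after it is written:

def check_download(path, required_cols=('PassengerId', 'Pclass', 'Sex', 'Age')):
    """Data check: the file exists, is non-empty, parses as CSV, and has the expected columns."""
    if not os.path.exists(path) or os.path.getsize(path) == 0:
        raise IOError("Download failed or produced an empty file: %s" % path)
    df = pd.read_csv(path)
    missing = set(required_cols) - set(df.columns)
    if missing:
        raise ValueError("%s is missing expected columns: %s" % (path, missing))
    return df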

train_model.py

When this is called using python train_model.py in the command line, this will take in the training dataset csv, perform the necessary data cleaning and imputation, and fit a classification model to the dependent variable Y (Survived). There must be data check steps and clear commenting for each step inside the .py file. The output of running this file is the random forest model saved as a .pkl file in the local directory. Remember that the thought process and the decision for why you chose the final model must be clearly documented in this section.

import pickle, pandas as pd
from sklearn.preprocessing import Imputer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# pd.get_dummies ignores NaN by default, so rows with a missing Sex or Embarked
# simply get all-zero dummy columns; the nulls act as the implicit reference
# category, which removes the need for drop_first. The function below creates
# the dummy variables and drops columns that are not used as features.
def prep_data(df):
    dummy_cols = ['Sex', 'Embarked']
    drop_cols = ['Name', 'Ticket', 'Cabin']
    for col in dummy_cols:
        new_cols = pd.get_dummies(df[col])
        df = df.join(new_cols)
    df = df.drop(dummy_cols, axis=1)
    return df.drop(drop_cols, axis=1)

# Data check: load the training data, build features, and confirm all remaining columns are numeric
train = pd.read_csv("train.csv")
df_train = prep_data(train)
print(df_train.dtypes)

# Create arrays for the features and the response variable
X = df_train.drop('Survived', axis=1).values
y = df_train['Survived'].values

# Setup the pipeline with steps for Imputation transformer and classifier
steps = [('imputation', Imputer(missing_values='NaN', strategy='most_frequent', axis=0)),
         ('random_forest', RandomForestClassifier())]

# Create the pipeline: pipeline
pipeline = Pipeline(steps)

# Create training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2017)

# Fit the pipeline (imputation + random forest) on the training split
pipeline.fit(X_train, y_train)

# Save the fitted pipeline and the held-out split for score_model.py to evaluate
with open('pipeline.pkl', 'wb') as f:
    pickle.dump({'pipeline': pipeline, 'X_test': X_test, 'y_test': y_test}, f)

eda.ipynb

[Optional] This supplements the commenting inside train_model.py. This is the place to provide scratch work and plots to convince me why you did certain data imputations and manipulations inside the train_model.py file.

import pandas as pd

train = pd.read_csv("train.csv")
print(train.dtypes, "\n")
print(train.describe(), "\n")
print(train.head())
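
Since this notebook is meant to justify the imputation and cleaning choices with plots, one possible addition (assuming matplotlib is available, which the assignment does not state) is a quick look at missingness and the Age distribution, reusing the train dataframe loaded above:

import matplotlib.pyplot as plt

# How many values are missing per column (motivates imputing Age and dropping Cabin)
print(train.isnull().sum(), "\n")

# Distribution of Age, as a sanity check on the chosen imputation strategy
train['Age'].plot.hist(bins=30, title='Age distribution in train.csv')
plt.xlabel('Age')
plt.show()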

score_model.py

When this is called using python score_model.py in the command line, this will ingest the .pkl random forest file and apply the model to the locally saved scoring dataset csv. There must be data check steps and clear commenting for each step inside the .py file. The output of running this file is a csv file with the predicted score, as well as a png or text file output that contains the model accuracy report (e.g. sklearn’s classification report or any other way of model evaluation).

import pickle, csv, pandas as pd
from train_model import prep_data
from sklearn.metrics import classification_report

# Load the dictionary from the pickle file.
with open("pipeline.pkl", "rb") as f:
    pickle_dict = pickle.load(f)
X_test = pickle_dict['X_test']
y_test = pickle_dict['y_test']
pipeline = pickle_dict['pipeline']

# Predict the labels of the locally saved scoring dataset
test = pd.read_csv("test.csv")
df_test = prep_data(test)
predictions = pipeline.predict(df_test)

# Data check: inspect the predicted labels for the scoring dataset
print(predictions)

# Predict the labels of the train-test split data
y_pred = pipeline.predict(X_test)

# Compute and save model accuracy score
score = pipeline.score(X_test, y_test)
csv_df = pd.Series({'score': score})
print(csv_df)
with open('score.csv', 'w') as f:
    writer = csv.writer(f)
    writer.writerow(csv_df)

# Compute and save classification report
report = classification_report(y_test, y_pred)
print(report)
with open('report.txt', 'w') as f:
    f.write(report)
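
The per-passenger predictions on test.csv are currently only printed. If the graders also want them persisted, a small addition (the column names follow the usual Kaggle submission layout, which is an assumption here; pd, test, and predictions are the objects already defined above) would write them alongside score.csv:

# Save the scoring-dataset predictions to a csv file (illustrative file name)
pd.DataFrame({'PassengerId': test['PassengerId'],
              'Survived': predictions}).to_csv('predictions.csv', index=False)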

requirements.txt

This file documents all dependencies needed on top of the packages already in the Dataquest Docker image. When run with pip install -r requirements.txt, it installs all Python packages needed to run the .py files. (Hint: use pip freeze to generate the .txt file.)

touch requirements.txt
pip freeze > requirements.txt

Note: Using the requirements file to upgrade everything with pip install --upgrade -r requirements.txt does not work here, because pip freeze pins exact versions with == rather than specifying minimum versions with >=.
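
For example, a line written by pip freeze pins an exact version, whereas an upgrade-friendly file would use a range specifier (the version number below is purely illustrative):

pandas==0.20.3    # what pip freeze writes: an exact pin
pandas>=0.20.3    # what pip install --upgrade would need: a minimum version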

References

https://pymotw.com/2/getpass/

https://rpubs.com/josezuniga/308671

https://wiki.python.org/moin/UsingPickle

http://pythonexample.com/code/kaggle%20data%20fetch/

http://docs.python-requests.org/en/master/user/quickstart/

https://gist.github.com/ramhiser/4121260#file-download-data-py

https://chrisalbon.com/python/pandas_dropping_column_and_rows.html

https://campus.datacamp.com/courses/python-data-science-toolbox-part-1

https://campus.datacamp.com/courses/python-data-science-toolbox-part-2

https://chrisalbon.com/python/pandas_convert_categorical_to_dummies.html

https://campus.datacamp.com/courses/supervised-learning-with-scikit-learn

https://chrisalbon.com/machine-learning/preprocessing_categorical_features.html

http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Imputer.html

https://stackoverflow.com/questions/28579468/how-to-use-the-python-getpass-getpass-in-pycharm

https://www.civisanalytics.com/blog/workflows-in-python-using-pipeline-and-gridsearchcv-for-more-compact-and-comprehensive-code/