import pandas as pd
import numpy as np
cc = pd.read_csv("AER_credit_card_data.csv")

Preparation

  • Create the target variable by mapping yes to 1 and no to 0.
  • Split the dataset into 3 parts: train/validation/test with a 60%/20%/20% distribution. Use the train_test_split function for that with random_state=1.
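The two preparation steps can be sketched as follows. This is a minimal illustration on a toy DataFrame with made-up values, not the real `AER_credit_card_data.csv`; the names `df_full_train`, `df_train`, `df_val` and `df_test` are assumptions chosen for the example.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the credit card data (hypothetical values, illustration only)
df = pd.DataFrame({"card": ["yes", "no"] * 50, "reports": list(range(100))})

# Map the target: yes -> 1, no -> 0
df["card"] = (df["card"] == "yes").astype(int)

# 60/20/20 split: hold out 20% for test first,
# then take 25% of the remaining 80% (= 20% of the total) for validation
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=1)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=1)

print(len(df_train), len(df_val), len(df_test))
```

Splitting twice is needed because `train_test_split` only produces two parts per call.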
cc.info()
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 1319 entries, 0 to 1318
## Data columns (total 12 columns):
##  #   Column       Non-Null Count  Dtype  
## ---  ------       --------------  -----  
##  0   card         1319 non-null   object 
##  1   reports      1319 non-null   int64  
##  2   age          1319 non-null   float64
##  3   income       1319 non-null   float64
##  4   share        1319 non-null   float64
##  5   expenditure  1319 non-null   float64
##  6   owner        1319 non-null   object 
##  7   selfemp      1319 non-null   object 
##  8   dependents   1319 non-null   int64  
##  9   months       1319 non-null   int64  
##  10  majorcards   1319 non-null   int64  
##  11  active       1319 non-null   int64  
## dtypes: float64(4), int64(5), object(3)
## memory usage: 123.8+ KB
cc.head()
##   card  reports       age  income  ...  dependents  months majorcards active
## 0  yes        0  37.66667  4.5200  ...           3      54          1     12
## 1  yes        0  33.25000  2.4200  ...           3      34          1     13
## 2  yes        0  33.66667  4.5000  ...           4      58          1      5
## 3  yes        0  30.50000  2.5400  ...           0      25          1      7
## 4  yes        0  32.16667  9.7867  ...           2      64          1      5
## 
## [5 rows x 12 columns]

Question 1

  • Install Pipenv

  • What’s the version of pipenv you installed?

  • Use --version to find out

    Answer: 2022.10.9

Question 2

  • Use Pipenv to install Scikit-Learn version 1.0.2
  • What’s the first hash for scikit-learn you get in Pipfile.lock?

Answer: sha256:480ecffabeb794cfb062424728ab3467aa7dc03e3c3485253b6371ed543d20b9
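Instead of scrolling through the file by hand, the first hash can also be read programmatically, since Pipfile.lock is plain JSON. The sketch below assumes the package is listed under the `default` section; the helper name `first_hash` is made up for this example.

```python
import json

def first_hash(lock_path, package, section="default"):
    """Return the first hash listed for `package` in a Pipfile.lock file."""
    with open(lock_path) as f_in:
        lock = json.load(f_in)
    # Each package entry holds a list of "sha256:..." strings under "hashes"
    return lock[section][package]["hashes"][0]

# Example call, assuming Pipfile.lock is in the current directory:
# print(first_hash("Pipfile.lock", "scikit-learn"))
```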

Pickle

We’ve prepared a dictionary vectorizer and a model.

They were trained (roughly) using this code:

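from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# df is the prepared DataFrame and y the 0/1 target (both defined elsewhere)
features = ['reports', 'share', 'expenditure', 'owner']
dicts = df[features].to_dict(orient='records')

dv = DictVectorizer(sparse=False)
X = dv.fit_transform(dicts)

model = LogisticRegression(solver='liblinear').fit(X, y)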

Note: You don’t need to train the model. This code is just for your reference.

And then saved with Pickle. Download them:

DictVectorizer

LogisticRegression

With wget:

PREFIX=https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/course-zoomcamp/cohorts/2022/05-deployment/homework
wget $PREFIX/model1.bin
wget $PREFIX/dv.bin
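The homework does not show how the two files were written. A plausible sketch of saving and reloading an object with pickle is given below; the helper names `save_pickle` and `load_pickle` are assumptions for illustration.

```python
import pickle

def save_pickle(obj, path):
    """Serialize `obj` to `path`; "wb" opens the file for writing in binary mode."""
    with open(path, "wb") as f_out:
        pickle.dump(obj, f_out)

def load_pickle(path):
    """Read back an object written by save_pickle; "rb" means read, binary."""
    with open(path, "rb") as f_in:
        return pickle.load(f_in)

# Presumably the course files were produced along these lines:
# save_pickle(dv, "dv.bin")
# save_pickle(model, "model1.bin")
```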

Question 3

Write a script for loading these models with pickle

Score this client:

{"reports": 0, "share": 0.001694, "expenditure": 0.12, "owner": "yes"}

What’s the probability that this client will get a credit card?

  • 0.162

  • 0.391

  • 0.601

  • 0.993

If you’re getting errors when unpickling the files, check their checksum.
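One way to compute a checksum without leaving Python is sketched below, using MD5 from the standard library. The expected values are not listed in this write-up, so compare the output against the checksums given on the homework page; the helper name `md5sum` is an assumption.

```python
import hashlib

def md5sum(path, chunk_size=8192):
    """Return the MD5 hex digest of the file at `path`, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f_in:
        # iter() with a sentinel keeps reading until read() returns b""
        for chunk in iter(lambda: f_in.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Example calls, assuming both files were downloaded to the current directory:
# print(md5sum("model1.bin"))
# print(md5sum("dv.bin"))
```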

Get the prediction:

import pickle
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
dv_file = "dv.bin"
with open(dv_file, "rb") as f_in:  # "rb" = open for reading in binary mode
    dv = pickle.load(f_in)
## C:\Users\husad\CONDA~1\envs\ML-ZOO~1\lib\site-packages\sklearn\base.py:329: UserWarning: Trying to unpickle estimator DictVectorizer from version 1.0.2 when using version 1.1.2. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:
## https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
##   warnings.warn(
model_file = "model1.bin"
with open(model_file, "rb") as f_in:  # "rb" = open for reading in binary mode
    model = pickle.load(f_in)
## C:\Users\husad\CONDA~1\envs\ML-ZOO~1\lib\site-packages\sklearn\base.py:329: UserWarning: Trying to unpickle estimator LogisticRegression from version 1.0.2 when using version 1.1.2. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:
## https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
##   warnings.warn(
customer = {"reports": 0, "share": 0.001694, "expenditure": 0.12, "owner": "yes"}
X = dv.transform([customer])
y_pred = model.predict_proba(X)[0, 1]
y_pred 
## 0.16213414434326598

Answer: 0.162

Question 4

Now let’s serve this model as a web service.

  • Install Flask and gunicorn (or waitress, if you’re on Windows)
  • Write Flask code for serving the model
  • Now score this client using requests:

url = "YOUR_URL"
client = {"reports": 0, "share": 0.245, "expenditure": 3.438, "owner": "yes"}
requests.post(url, json=client).json()

What’s the probability that this client will get a credit card?

  • 0.274
  • 0.484
  • 0.698
  • 0.928

Answer: 0.928
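The Flask code itself is not shown above, so here is a minimal sketch of what such a service might look like. The `/predict` route, the port 9696, the response field names and the helpers `load_pickle` and `create_app` are all assumptions made for illustration, not the exact code used to get the answer.

```python
import pickle
from flask import Flask, jsonify, request

def load_pickle(path):
    """Load a pickled object; "rb" opens the file for reading in binary mode."""
    with open(path, "rb") as f_in:
        return pickle.load(f_in)

def create_app(dv, model):
    """Build a Flask app that scores one client per POST request."""
    app = Flask("credit-card")

    @app.route("/predict", methods=["POST"])  # hypothetical route name
    def predict():
        client = request.get_json()          # JSON body sent by requests.post
        X = dv.transform([client])           # vectorize the single record
        proba = float(model.predict_proba(X)[0, 1])
        return jsonify({"card_probability": proba, "card": bool(proba >= 0.5)})

    return app

# To serve the downloaded files (assumed to be in the working directory):
# app = create_app(load_pickle("dv.bin"), load_pickle("model1.bin"))
# app.run(host="0.0.0.0", port=9696)  # port chosen arbitrarily
```

With the service running locally, `YOUR_URL` in the request above would be something like `http://localhost:9696/predict`. Passing `dv` and `model` into `create_app` keeps the route logic independent of where the files live.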

---

So that’s that. I have yet to understand Docker, so I am skipping questions 5 and 6.