Dev-Ops for Data Scientists Workshop

Exercise Worksheet

Activity 1: Git workflow

Fork repository

Click fork on the Posit Conf 2024 Dev-Ops workshop Github repository.

Give your fork a name and click Create.
Click on the Code button in your forked repository and copy the HTTPS URL.
Grab the https URL of your forked repo.
Open a terminal and cd to your desktop.
git clone the repository using the URL.

cd ~
cd Desktop
git clone https://github.com/{your-GH-username}/dev-ops.git

You will be using this repo primarily for resources and instructions. We will be creating a separate git repository for class exercises.

Open a code editor in Workbench

Throughout this workshop we will be writing code in both R and Python. We will be starting with a python model so you will need to use an IDE with a Python interpretor.

Login to your Workbench environment with your credentials.
Open up a vscode session.

Note

If using a different Python editor on your local machine make sure that you have Python installed and on PATH. This is a good article on best practices for setting up Python using Pyenv.

Initialize a new git repository

Open your vscode terminal (press Control + Shift+ dash for quick open) and create a folder where you’ll be saving all of your classwork for this workshop.
Call it something like devops_classwork. See the code example below for changing into {my} Desktop directory, creating a folder, and then changing the directory into that folder.

cd ~
mkdir devops_classwork
cd devops_classwork

Initialize git in your folder.

git init
git config --global user.name "Your Name"
git config --global user.email "youremail@yourdomain.com"
git config --global init.defaultBranch main

Create some code content

Create 3 files in your editor. Copy the code for each file from the Dev-ops repo linked below.

1. README.md - a markdown file used for documentation

2. model_pthon.py - a python file with a linear regression model

3. r_eda.R - an R file with some exploratory data analysis

Add and commit your code locally

git add README.md
git add model_python.py
git add r_eda.R
git status
git commit -m "model and readme added"

Push local repo to remote github repo

Create a new repository on github.com:

Click on New in your github repositories page
Under Owner choose your username
Call the repository devops-classwork
Make it public
Click Create repository
Copy the HTTPS url in the quick setup section

Go back to the terminal in your vscode session and add the remote upstream github repo.

git remote add origin https://github.com/{your-username}/devops-classwork.git
git branch -M main
git push -u origin main

You may be asked to authenticate your Github with vscode. Follow the instructions and authenticate.

Create a branch and merge new file

Create a new branch and a new file called .gitignore

git checkout -b {yourname}/small-edits
touch .gitignore # this creates a file called .gitignore

Copy the code from this file into your newly created .gitignore file.
Add, commit, push and merge your changes.

git add .gitignore
git commit "added .gitignore"
git push origin {yourname}/small-edits
git checkout main
git merge {yourname}/small-edits
git push

Go to your repository in Github and refresh your screen. You should now see all of your files there! 🤩
Delete your local branch with git branch -d {yourname}/small-edits
Create a new branch for the following exercises:

git checkout -b {yourname}/venv-renv.

Activity 2: Virtual Environments

R

Open a new Rstudio IDE session available in Workbench. If you were previously in vscode click on the Posit Workbench icon on the bottom of the screen - this will bring you back to your Workbench homepage.

Create a new project from a working directory

File > New Project > Existing Directory > devops-classwork

Install renv with install.packages("renv")

# library path without renv initialized.
.libPaths()

# initialize renv
renv::init()
y

# library path with renv initialized.
.libPaths()

Install the following packages:

install.packages(c("palmerpenguins", "dplyr", "ggplot2"))

Feel free to run your R code which is reproduced below and see the output of your analysis.

library(palmerpenguins)
library(dplyr)
library(ggplot2)

df <- palmerpenguins::penguins

table1 <- df %>%
  group_by(species, sex) %>%
  summarise(
    across(
      where(is.numeric), 
      \(x) mean(x, na.rm = TRUE)
      )
    ) %>%
  knitr::kable()

## Penguin Size vs Mass by Species


plot <- df %>%
  ggplot(aes(x = bill_length_mm, y = body_mass_g, color = species)) +
  geom_point() + 
  geom_smooth(method = "lm")

Make sure to snapshot your code with renv::snapshot().

Python

Open the model_python.py file in your vscode session.

If you were previously in Rstudio you can click on the R logo on the top left of the screen - this will bring you back to your Workbench homepage.

Type ‘python’ into your terminal to start an executable. You should see something similar to the below output. You can exit back to the terminal with exit().

Python 3.11.4 (main, Jun 12 2024, 14:00:16) [Clang 15.0.0 (clang-1500.3.9.4)] on darwin Type “help”, “copyright”, “credits” or “license” for more information.

>>> [Type code here]

Run the below code to create a virtual environment

# library path without venv initialized
import sys
print(sys.path)
exit()

# run in your bash terminal
pip list
python -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip setuptools wheel
pip list

# library path with venv initialized
import sys
print(sys.path)
exit()

# run in your bash terminal
pip install palmerpenguins
pip install duckdb
pip install scikit-learn

Run your model_python.py in the terminal with python model_python.py. The model code should input some model statistics.

import duckdb
from palmerpenguins import penguins
from pandas import get_dummies
from sklearn.linear_model import LinearRegression

con = duckdb.connect("my-db.duckdb")
df = penguins.load_penguins()
df = con.execute("SELECT * FROM df").fetchdf().dropna()
con.close()

df.head(3)

X = get_dummies(df[["bill_length_mm", "species", "sex"]], drop_first=True)
y = df["body_mass_g"]

model = LinearRegression().fit(X, y)

print(f"R^2 {model.score(X,y)}")
print(f"Intercept {model.intercept_}")
print(f"Columns {X.columns}")
print(f"Coefficients {model.coef_}")

Freeze your pip installed packages in a requirements.txt file. You should do this every time you install additional packages or modules.

# run in bash terminal

pip list > requirements.txt
pip list

Add, commit, push your code

# run in your bash terminal

git add README.md
git add r_eda.R
git add model_python.py
git add requirements.txt
git add devops_classwork.Rproj
git commit -m "creating virtual environments"
git push origin {yourname}/venv-renv
git checkout main
git merge {yourname}/venv-renv
git push

Activity 3: Github Actions

Enable actions in your repository

Go to Actions in your github repository and enable github actions if you have never done so before.

In your project terminal (vscode or Rstudio IDE) create a .yaml file.

Important

Github actions MUST be created as a .yaml file with the directory structure

`.github/workflows/.yaml`

mkdir .github
cd .github/
mkdir workflows
cd workflows/
touch .yaml

Copy the [following(https://docs.github.com/en/actions/writing-workflows/quickstart)] into your yaml file.

Create a new branch, add, commit, and push your code to main.

git checkout -b {yourname}/action
git add ~/devops_classwork/.github/workflows/.yaml
git commit -m "added an action"
git push origin {yourname}/action
git checkout main
git merge {yourname}/action
git push

Visit the Actions tab of your repo to see if the Action succeeded.

Activity 4: Pre-commit hooks

Install pre-commit

Run the below in your terminal

# run in your bash terminal
pip install pre-commit
pip install black

Create a file called .pre-commit-config.yaml and add the following:

Run the below in your terminal

# run in your bash terminal
pre-commit install

Stage and commit your code and fix any errors that the hooks identify.

git checkout -b {yourname}/precommit
git add .pre-commit.config.yaml
git add model_python.py
git commit -m "added pre-commit hook"

You will see that parts of your commits have failed. The hook automatically “fixes” errors it finds. Keep adding and committing your files until your commits have all passed.

If the black commit is skipped try calling it directly with pre-commit run black -a. You can also run a single file outside of the hook using black model_python.py as well. You can open the model python file and see what was modified.

Find out a little more about the hooks and packages that you’ve been using:

Finish pushing your code

git push origin {yourname}/precommit
git checkout main
git merge {yourname}/precommit
git push

Activity 5: Docker Run

The basic docker run command takes this form:

docker run [OPTIONS] [IMAGE:TAG] [COMMAND] [ARG...]

In the below exercise we will practice running docker containers with different options or “flags.”

Currently we have no docker images downloaded. Confirm this with docker image ls -a.
Pull down a Dockerhub linux image. Confirm that the image is downloaded with the ls command. docker pull ubuntu docker image ls -a
Run an interactive container with the bash shell attached. Run a few linux commands to explore your environment and then exit the container.

ls
whoami
hostname

sudo docker run -it ubuntu bash
ls
whoami
hostname

Exit the container with Ctrl+D or exit.

This docker command runs the container in the foreground so you are unable to access the command prompt for your original alpine server. For this reason interactive mode is often used for development and testing.
Run the container in detached mode and then list all your containers.

sudo docker run -d ubuntu
sudo docker container ls -a

You should see that the ubuntu container was created and then exited. The container ID is shown with an exited status and the command line is still accessible.

Detached containers run in the background, so the container keeps running until the application process exits (which is what happened here), or you stop the container. You can stop a container with sudo docker container stop {name or ID of container}. For this reason detached mode is often used for production purposes.

Activity 6: Debugging Containers

The docker exec command is very similar to the docker run -it command. Both are very helpful for debugging containers as they allow you to jump inside your container instance.

The exec command needs a running container to execute any command, whereas the -it flag starts a container and places you into a terminal in interactive mode.
Use the docker exec command to execute a bash command in a running container. This can be used to execute any command within a running container.

Caution

Be careful not to use docker exec to change your container as once it is deleted you will lose any changes you’ve made!

Remember containers are ephemeral! For change you should change the dockerfile or image.

docker exec requires two arguments - the container name and the command you want to run.

docker exec [OPTIONS] CONTAINER [COMMAND] [ARG…]

Run the below in your terminal

sudo docker run -d nginx
sudo docker container ls -a #grab the nginx container ID that you just started
sudo docker exec -it CONTAINER_ID bash
exit
sudo docker exec CONTAINER_ID ls
sudo docker container stop CONTAINER_ID

Lets run a detached MySQL container and then check out some logs.

The database requires a password to work. In production you should never pass credentials directly in your command but we will do it for testing purposes. (The forward slashes below allow you to use a new line for your code)

 sudo docker container run -d --name mydb \
 -e MYSQL_ROOT_PASSWORD=my-secret-pw \ 
 mysql
 
sudo docker container logs mydb
sudo docker container stop mydb

Activity 7: Write a Dockerfile

Create a basic shiny app

Put your app.r file in a directory called /apps

library(shiny)

# Define UI for application that draws a histogram
ui <- fluidPage(

    # Application title
    titlePanel("Old Faithful Geyser Data"),

    # Sidebar with a slider input for number of bins 
    sidebarLayout(
        sidebarPanel(
            sliderInput("bins",
                        "Number of bins:",
                        min = 1,
                        max = 50,
                        value = 30)
        ),

        # Show a plot of the generated distribution
        mainPanel(
           plotOutput("distPlot")
        )
    )
)

# Define server logic required to draw a histogram
server <- function(input, output) {

    output$distPlot <- renderPlot({
        # generate bins based on input$bins from ui.R
        x    <- faithful[, 2]
        bins <- seq(min(x), max(x), length.out = input$bins + 1)

        # draw the histogram with the specified number of bins
        hist(x, breaks = bins, col = 'darkgray', border = 'white',
             xlab = 'Waiting time to next eruption (in mins)',
             main = 'Histogram of waiting times')
    })
}

# Run the application 
shinyApp(ui = ui, server = server)

Create a Dockerfile

Create a file called Dockerfile with no extension! This file should be one directory up from /apps

FROM rocker/shiny:4.3.1
# comes preinstalled with a bunch of packages

RUN apt-get update && apt-get install -y \
    libcurl4-gnutls-dev \
    libssl-dev

RUN R -e "install.packages(('palmerpenguins'), \
    repos = 'https://packagemanager.posit.co/cran/__linux__/jammy/latest')"
    
RUN rm /srv/shiny-server/index.html
ruN rm -rf /srv/shiny-server/sample-apps

COPY ./apps/* /srv/shiny-server/

CMD ["/usr/bin/shiny-server"]

Build the dockerfile

3. docker build -t new_app .

Run the shiny app

4. docker run -d -p 3838:3838 new_app

Open port 3838 and you should see your new app!

Activity 8: APIs

Create a simple FastAPI

In your vscode or Python editor create a file called app.py with the following code:

from fastapi import FastAPI
import uvicorn

app = FastAPI()

@app.get("/say_hello/{name}")
def say_hello(name):
    return {"Hello from Seattle": name}

Run the app with uvicorn app:app --reload in the terminal.

Click into the workbench icon on the left side of the screen and go to the proxied server link that shows up. This link should have an error that says detail: not found. 😥

Add /say_hello/{your-name-here} to the end of the url and click enter. It should work now!!!

Add `/docs#/ to the end of the initial url and click enter. It should bring you to the Swagger documentation! You can also test the API from the Swagger docs as well.

Activity 9: Use API to server model

Create a FastAPI using our penguin model

Note

This code was tested using Python 3.11.4.

In your terminal run:

pip install fastapi
pip install pydantic
pip install joblib
pip install pandas
pip install rsconnect-python
pip freeze > requirements.txt

Modify and run your model_python.py :

Add this line to the beginning of your script: import joblib
Add this line to the end of your script:joblib.dump(model, 'penguin_model.joblib')
Remove the print statements at the end of your script.

"""
This script creates a linear regression model
to predict the body mass of palmer penguins
given their sex, species, and bill length in mm

"""
import joblib
import duckdb
from palmerpenguins import penguins
from pandas import get_dummies
from sklearn.linear_model import LinearRegression

con = duckdb.connect("my-db.duckdb")
df = penguins.load_penguins()
df = con.execute("SELECT * FROM df").fetchdf().dropna()
con.close()

df.head(3)

X = get_dummies(df[["bill_length_mm", "species", "sex"]], drop_first=True)
y = df["body_mass_g"]

model = LinearRegression().fit(X, y)
joblib.dump(model, 'penguin_model.joblib')

Run your script again with python model_python.py
Create a new file called main.py. This will be our API script.

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import joblib
import pandas as pd

# Load the model
model = joblib.load('penguin_model.joblib')

# Initialize FastAPI app
app = FastAPI()

# Define the request body
class PenguinFeatures(BaseModel):
    species: str
    sex: str
    bill_length_mm: float

# Define the endpoint for prediction
@app.post("/predict")
def predict(features: PenguinFeatures):
    # Map species and sex to the appropriate format used in training
    species_map = {"Adelie": 0, "Chinstrap": 1, "Gentoo": 2}
    sex_map = {"male": 0, "female": 1}

    try:
        species = species_map[features.species]
        sex = sex_map[features.sex]
    except KeyError:
        raise HTTPException(status_code=400, detail="Invalid species or sex")

    # Prepare the input data for the model
    input_data = pd.DataFrame([{
        "bill_length_mm": features.bill_length_mm,
        "species_Chinstrap": 1 if features.species == "Chinstrap" else 0,
        "species_Gentoo": 1 if features.species == "Gentoo" else 0,
        "sex_male": 1 if features.sex == "male" else 0,
    }])

    # Make prediction
    prediction = model.predict(input_data)[0]

    # Return the prediction
    return {"body_mass_g": prediction}

Get the server url and API key for our Connect environment

Start the Automated Stock report jumpstart example. Click through until you get to Step 5 and then copy the url. This is the url for your Connect server. Close out of the jumpstart example.

Click on your name > API Keys tab. Create an API key and note it down somewhere safe.

Jump back to your vscode terminal and add your server url and api key to the rsconnect-python CLI.

To add a server, you need the following:

Your server URL
Your API key. See the API Keys section.
A nickname for the server that you provide

pip install rsconnect-python
rsconnect add \
    --server {server url from step 5 jumpstart example} \
    --name myServer \
    --api-key {$CONNECT_API_KEY}

Deploy our FastAPI to the Connect server

Run the below code in your terminal

rsconnect deploy fastapi \
    -n myServer \
    --entrypoint main:app \
    ./

Visit the App in your Connect homepage. To test out the predictions change the example values and click Try it out and then Execute.

{ “species”: “Adelie”, “sex”: “female”, “bill_length_mm”: 40.0 }

Activity 10: Testing the API

Run the below code in your terminal

curl -X 'POST' \
  'https://granite-mole.fd049.fleeting.rstd.io/rsconnect/content/e444fd65-634f-4b6a-bc78-be70c790cc3f/predict' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "species": "Adelie",
  "sex": "female",
  "bill_length_mm": 40.0
}'

Fix the authorization error by passing in your API key.

export CONNECT_API_KEY=xxxxx

curl -X 'POST' \
  'https://granite-mole.fd049.fleeting.rstd.io/rsconnect/content/e444fd65-634f-4b6a-bc78-be70c790cc3f/predict' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -H "Authorization: Key ${CONNECT_API_KEY}" \
  -d '{
  "species": "Adelie",
  "sex": "female",
  "bill_length_mm": 40.0
}'

Activity 11: Create a Shiny UI for the prediction API

Open an Rstudio IDE session and create a new project.
Initialize with renv.
Create a shiny app file called app.R.
Add your API key to your .Renviron file. You can do this easily using the usethis package: usethis::edit_r_environ().

renv::init()
usethis::edit_r_environ()

key="xxxxxx"

library(shiny)
library(httr2)

# API URL
api_url <- "https://granite-mole.fd049.fleeting.rstd.io/rsconnect/content/e444fd65-634f-4b6a-bc78-be70c790cc3f/predict"

ui <- fluidPage(
  titlePanel("Penguin Mass Predictor"),
  
  # Model input values
  sidebarLayout(
    sidebarPanel(
      sliderInput(
        "bill_length",
        "Bill Length (mm)",
        min = 30,
        max = 60,
        value = 45,
        step = 0.1
      ),
      selectInput(
        "sex",
        "Sex",
        c("male", "female") # Ensure values match what FastAPI expects
      ),
      selectInput(
        "species",
        "Species",
        c("Adelie", "Chinstrap", "Gentoo")
      ),
      # Get model predictions
      actionButton(
        "predict",
        "Predict"
      )
    ),
    
    mainPanel(
      h2("Penguin Parameters"),
      verbatimTextOutput("vals"),
      h2("Predicted Penguin Mass (g)"),
      textOutput("pred")
    )
  )
)

server <- function(input, output) {
  # Input params
  vals <- reactive(
    list(
      bill_length_mm = input$bill_length,
      species = input$species, # Send the species directly
      sex = tolower(input$sex) # Ensure "male" and "female" are lowercase
    )
  )
  
  # Fetch prediction from API
  pred <- eventReactive(
    input$predict,
    {
      req <- httr2::request(api_url) |>
        httr2::req_body_json(vals()) |>
        httr2::req_headers(
          "Content-Type" = "application/json",
          "Authorization" = paste("Bearer", Sys.getenv("key")) # Add API key to headers
        ) |>
        httr2::req_perform()
      
      httr2::resp_body_json(req)
    },
    ignoreInit = TRUE
  )
  
  
  # Render to UI
  output$pred <- renderText(pred()$body_mass_g)
  output$vals <- renderPrint(vals())
}

# Run the application
shinyApp(ui = ui, server = server)

Push-button deploy to Connect:

Press on the little blue button in the middle of your IDE and follow the instructions to deploy to Connect.

Give your app a name
Make sure you’re only deploying the app.r file
Click Deploy

Your app should show an error when you press predict because the Connect server doesnt have your API key credentials. Check the app logs to confirm.

Add your key in the Connect app vars tab and press predict again. It should work!