Images are uploaded to the GitHub image folder in case they render too small; attempt to zoom in on the RPubs document if possible.
Set up software on the VM
- Set up a virtual machine on GCP Compute Engine
- I chose a Windows VM with a desktop
- Install Google Chrome
- The first step was to initialize the VM; Google Chrome has an RDP add-on which allows you to access the VM easily
- I chose not to load the VM with a container.
- This could potentially cause workflow problems within the team if other members were to use Linux or some other version/platform.
- Issues aside, loading up the VM and installing Anaconda is extremely easy.
- Anaconda comes with Python 3, pip, Jupyter, and some base packages
- This allows for pipenv setup as well as testing within Jupyter for quick editing; a quick sanity check is sketched below
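A minimal sanity check that the install behaves as expected, assuming it runs inside the Anaconda base environment:

```python
import subprocess
import sys

# Confirm the Anaconda install supplies Python 3, pip, and Jupyter
print(sys.version)                                                 # Python 3 build info
subprocess.check_call([sys.executable, "-m", "pip", "--version"])  # pip ships with Anaconda
subprocess.check_call(["jupyter", "--version"])                    # as does Jupyter
```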
Accessing data
- My initial CSV files were loaded into my scripts from GitHub
- Since this project had us dealing with the cloud, I decided all files would be loaded from Google Cloud Storage buckets
- The initial setup of the buckets was easy, but sorting out the authorizations was not.
- That is why I decided to just make my bucket public
- This can be accomplished through gsutil within the Google command prompt, or through the web browser as in the image below; a Python sketch follows
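For reference, a minimal sketch of the same change in Python, assuming the google-cloud-storage client library and a placeholder bucket name:

```python
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("my-project-bucket")  # placeholder bucket name

# Grant allUsers read access to the bucket's objects, i.e. make it public
policy = bucket.get_iam_policy(requested_policy_version=3)
policy.bindings.append(
    {"role": "roles/storage.objectViewer", "members": {"allUsers"}}
)
bucket.set_iam_policy(policy)
```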
Initial problems
- Writing to buckets, or accessing other Google APIs from Compute Engine, requires access permissions (scopes) which are normally set when the instance is created.
- Luckily, Google has updated the engine to allow you to change these permissions afterward, but the instance needs to be stopped in order to accomplish this
- Writing from Python to buckets within the virtual machine required a helper function from the Google Cloud documentation, reproduced in sketch form below
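That helper resembles the upload_blob sample in the Google Cloud Storage documentation; roughly:

```python
from google.cloud import storage

def upload_blob(bucket_name, source_file_name, destination_blob_name):
    """Upload a local file to a Cloud Storage bucket."""
    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)
    blob = bucket.blob(destination_blob_name)
    blob.upload_from_filename(source_file_name)
    print(f"File {source_file_name} uploaded to {destination_blob_name}.")
```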
End Initial Setup
- We now have a virtual machine with Anaconda loaded and access to read and write to Google Cloud Storage buckets
Create testing environment
- Through experimentation within Jupyter, I was able to edit my original scripts to begin testing in a clean environment.
- This highlights a problem with the data flow process.
- Building a requirements.txt file is more of an iterative process.
- I wanted to make my project flow purely Pythonic
- My requirements.txt file needs to exist outside of my Python code.
- Perhaps this is semantics, as several other aspects of the project were done through the web browser, but it stuck with me that I couldn't build a requirements file from within an empty Python environment
- Therefore, I chose to load a preconstructed requirements.txt file into my virtual environment, which I document below (one way to preconstruct it is sketched after this list)
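One way to preconstruct that file while staying inside Python is to freeze the working environment before moving to the clean one; a sketch:

```python
import subprocess
import sys

# Snapshot the working environment's packages into requirements.txt,
# to be loaded later into the clean virtual environment
with open("requirements.txt", "w") as f:
    subprocess.check_call([sys.executable, "-m", "pip", "freeze"], stdout=f)
```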
Workflow
- All steps below occur in the new virtual environment on the Compute Engine VM
Step 2 - Load requirements
- Load in the requirements.txt file
- Below I show the packages in the clean environment; a sketch of the install step follows
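A sketch of this load step, kept within Python via subprocess:

```python
import subprocess
import sys

# Install the preconstructed requirements into the clean environment
subprocess.check_call([sys.executable, "-m", "pip", "install", "-r", "requirements.txt"])

# Display what is now installed, mirroring the package listing below
subprocess.check_call([sys.executable, "-m", "pip", "list"])
```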
Step 3 - Load CSV files from cloud bucket
- Creates local DataFrames of the train/test CSVs from the original project; a sketch follows
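Since the bucket is public, pandas can read the files directly; a sketch with placeholder names (gs:// paths require the gcsfs package):

```python
import pandas as pd

# Placeholder bucket/object names; a public https://storage.googleapis.com/...
# URL would work equally well here
train = pd.read_csv("gs://my-project-bucket/train.csv")
test = pd.read_csv("gs://my-project-bucket/test.csv")
```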
Step 4 - Begin loading scripts
- Train_model.py does the following:
- Writes a classification report CSV and the pipeline locally to the cwd
- Writes the classification report CSV and the pipeline to the Google Cloud bucket
- Images of the new bucket are displayed below, after a sketch of this step
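A rough sketch of that shape, with toy data and a placeholder bucket name standing in for the real project files:

```python
import joblib
import pandas as pd
from google.cloud import storage
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the real training data
X, y = make_classification(n_samples=200, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit a simple pipeline
pipeline = make_pipeline(StandardScaler(), LogisticRegression())
pipeline.fit(X_train, y_train)

# Write the classification report and the pickled pipeline to the cwd
report = classification_report(y_test, pipeline.predict(X_test), output_dict=True)
pd.DataFrame(report).transpose().to_csv("classification_report.csv")
joblib.dump(pipeline, "pipeline.pkl")

# Push both artifacts to the bucket ("my-project-bucket" is a placeholder)
bucket = storage.Client().bucket("my-project-bucket")
for fname in ("classification_report.csv", "pipeline.pkl"):
    bucket.blob(fname).upload_from_filename(fname)
```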
Step 5 - Update old score script
- Load the pipeline off the cloud
- New updated code posted below
- Incorporating the above code into score.py
- Incorporating a write-to-cloud step for the results; a sketch of the combined flow follows
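A sketch of the combined flow, under the same placeholder names as above:

```python
import joblib
import pandas as pd
from google.cloud import storage

bucket = storage.Client().bucket("my-project-bucket")  # placeholder name

# Pull the pickled pipeline down from the bucket and load it
bucket.blob("pipeline.pkl").download_to_filename("pipeline.pkl")
pipeline = joblib.load("pipeline.pkl")

# Score new data and write the predictions back to the bucket
test = pd.read_csv("gs://my-project-bucket/test.csv")  # requires gcsfs
pd.DataFrame({"prediction": pipeline.predict(test)}).to_csv("predictions.csv", index=False)
bucket.blob("predictions.csv").upload_from_filename("predictions.csv")
```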