I love using R! R is brilliant for statistical methods, data wrangling and visualization. Since many R developers are statisticians or mathematicians, we get to try out their new research findings through R. For example, earlier this week I attended a presentation on Anomaly Detection where applied mathematician Sevvandi Kandanaarachchi introduced her research on applying Item Response Theory to construct unsupervised AD ensembles (preprint of the research article). The algorithm is also published as an R package, outlierensembles, for any R user to test and apply.
However, Python is a general-purpose language and hence much more versatile. Recently, I started to think about how to deploy statistical or ML models in production, say on a website or in a mobile app. After some research, I came to understand that to deploy a statistical/ML model in production, some knowledge of Python would be very helpful, especially when you need to work with other developers and engineers on various cloud platforms. For more discussion on R vs Python, this is a good read.
So here I am, an R user learning Python.
I have some very basic understanding of Python, but only in terms of data visualization. In 2020, when Covid-19 case numbers started to surge in Australia, I started to make some DataViz to help interpret the numbers. Because a) R is not very supportive of dual-axis graphs (for good reason; they should be used with caution) and b) I was curious about how Python works, I tried to use Python (especially the Matplotlib and Seaborn packages) to make some of the Covid-19 visuals (please see an example of a dual-axis graph I made with Python).
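Just as an illustration (a toy example with made-up numbers, not the actual Covid chart I made), the core of a dual-axis plot in Matplotlib is the twinx() method, which adds a second y-axis sharing the same x-axis:
import matplotlib.pyplot as plt

# toy data standing in for daily and cumulative case counts
days = list(range(1, 11))
daily_cases = [5, 8, 12, 20, 18, 25, 30, 28, 35, 40]
cumulative = [sum(daily_cases[:i + 1]) for i in range(len(daily_cases))]

fig, ax1 = plt.subplots()
ax1.bar(days, daily_cases, color="lightblue")
ax1.set_ylabel("Daily cases")

# second y-axis sharing the same x-axis
ax2 = ax1.twinx()
ax2.plot(days, cumulative, color="darkred")
ax2.set_ylabel("Cumulative cases")

plt.show()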
But beyond DataViz, I know very little about Python. So the first material I am using is a beginner-friendly course, Using Python for Automation on LinkedIn Learning by Madecraft and Sam Pettus. I know very little about the topic of automation and am curious to learn more. Upon finishing the first chapter of this course, I will move on to more data-science-specific topics.
The above-mentioned course has four chapters. This document records my notes for Chapter 1, Automate File, Folder, and Terminal Interactions. The notes are quite simple, mainly for my own records and for some reflection from the perspective of an R user. Please refer to the original course on LinkedIn Learning for a fuller understanding of the topic.
Where relevant, I will produce some R code to compare with the Python code, so hopefully these notes might be of interest to R users learning Python, and vice versa. To help distinguish between the two, in this note R code chunks are in light pink, while Python chunks are in light blue.
The IDE I use is RStudio, which allows you to run Python code through the reticulate package.
Let’s start.
The first task in the course is to read a txt file. The txt file contains some hypothetical data: each line holds a name, a number, and a test result (P for pass, F for fail). The values are separated by spaces.
The following code is offered in the Python course mentioned above.
# open the file in "r"ead mode
f = open("Exercise Files/inputFile.txt", "r")
print(f.read())
## Mary 25 P
## John 32 P
## Dylan 19 F
## Julia 23 F
## Chad 17 F
## Jack 20 F
## Hailey 26 P
## Iris 23 P
## Jacob 29 P
## Jalen 37 F
## Anthony 47 F
## Jamie 26 P
## Sydney 19 F
## Chloe 21 P
## Natalie 29 P
## David 23 P
## Mario 51 P
## Jenna 28 F
## Josh 39 P
## Damean 42 F
## Zach 19 F
## Kayla 28 P
## Ben 14 F
## Hunter 61 P
## Albert 29 F
## Lauren 31 F
## Erica 16 P
## Kyle 19 P
## Jordan 29 F
## Kaleb 47 F
## Rosanna 45 P
## Joy 28 P
## Bill 29 F
## Jim 67 P
## Robin 24 F
## Sansa 28 P
## Juan 73 P
## Colin 59 P
## Jake 8 F
## Kate 58 P
## Jade 26 P
## River 29 P
## Chris 31 P
## Grace 18 F
# close the file after the task
f.close()
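An aside before switching to R (my own note, not from the course): the more idiomatic way to read a file in Python is a with block, which closes the file automatically even if an error occurs, so no explicit close() is needed.
# read the same file with a context manager; the file is closed automatically
with open("Exercise Files/inputFile.txt", "r") as f:
    print(f.read())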
In R, using the read_delim() function, we can read the text file as follows. The three columns in inputFile are recognized by setting delim to a space.
library(tidyverse)

f <- read_delim("Exercise Files/inputFile.txt", delim = " ", col_names = FALSE)
f %>%
  # show the first five lines of the data
  head(n = 5)
## # A tibble: 5 x 3
## X1 X2 X3
## <chr> <dbl> <chr>
## 1 Mary 25 P
## 2 John 32 P
## 3 Dylan 19 F
## 4 Julia 23 F
## 5 Chad 17 F
# remove the object f from the environment
rm(f)
Even with such a simple task, a very interesting difference between the R and Python approaches already emerges: the relationship between “objects” and “methods/functions”.
In Python, the file object f carries its own methods (e.g., read() and close()), so when you want to read file f, you use f.read(). In R, you instead call a standalone function, read_delim(), and assign its result to the object f. This leads to a very interesting discussion about functional versus object-oriented programming architecture for data science. Here is a good read on the topic. Maybe I will write more about it after I have gone deeper into my Python learning journey.
BTW, you can do object-oriented programming in R too. But OOP is a bit more challenging in R than in other languages.
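A tiny Python example of this distinction (my own illustration, not from the course): methods are attached to the object and called with dot notation, while functions stand on their own.
text = "hello"
# method: upper() belongs to the string object, so it is called with dot notation
print(text.upper())
## HELLO
# function: len() is a standalone function applied to the object
print(len(text))
## 5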
Here we would like to print only part of the file: people who passed the test.
The approach introduced in the course is to split each line, take the 3rd element, and check whether it equals P. It is important to note that Python indexes from 0, so [2] refers to the 3rd element.
f = open("Exercise Files/inputFile.txt", "r")
for line in f:
    # split each line by space
    line_split = line.split()
    # check whether the 3rd element is P
    if line_split[2] == "P":
        print(line)
In R, since read_delim() automatically recognizes the three columns, we can refer to the third column (X3) by name.
f <- read_delim("Exercise Files/inputFile.txt", delim = " ", col_names = FALSE)
f %>%
  filter(X3 == "P")
## # A tibble: 25 x 3
## X1 X2 X3
## <chr> <dbl> <chr>
## 1 Mary 25 P
## 2 John 32 P
## 3 Hailey 26 P
## 4 Iris 23 P
## 5 Jacob 29 P
## 6 Jamie 26 P
## 7 Chloe 21 P
## 8 Natalie 29 P
## 9 David 23 P
## 10 Mario 51 P
## # ... with 15 more rows
Now let’s try to split the records into two separate files: one for those who passed and one for those who failed.
In Python, note that we need to create the passFile and failFile objects first, and operate on them through open(), write() and close().
f = open("Exercise Files/inputFile.txt", "r")
# create the pass and fail files respectively, and write to them
passFile = open("Exercise Files/passFile.txt", "w")
failFile = open("Exercise Files/failFile.txt", "w")
for line in f:
    line_split = line.split()
    # if P, save to passFile; else save to failFile
    if line_split[2] == "P":
        passFile.write(line)
    else:
        failFile.write(line)
f.close()
passFile.close()
failFile.close()
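Another aside (again my own variation, not the course’s code): the same two-file task can be written with context managers, so that all three files are closed automatically at the end of the block.
# equivalent logic with context managers; no explicit close() calls needed
with open("Exercise Files/inputFile.txt", "r") as f, \
     open("Exercise Files/passFile.txt", "w") as passFile, \
     open("Exercise Files/failFile.txt", "w") as failFile:
    for line in f:
        line_split = line.split()
        if line_split[2] == "P":
            passFile.write(line)
        else:
            failFile.write(line)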
In R, we don’t need to create passFile or failFile objects in advance; we just filter and write to disk directly. The functions filter() and write_csv() are independent of the objects/data they operate on.
f <- read_delim("Exercise Files/inputFile.txt", delim = " ", col_names = FALSE)
# save the R outputs as .csv so they won't overwrite the files created by Python
f %>%
  filter(X3 == "P") %>%
  write_csv("Exercise Files/passFile.csv")
f %>%
  filter(X3 == "F") %>%
  write_csv("Exercise Files/failFile.csv")
I can’t execute the following Python code in RStudio. The error message is CalledProcessError: Command '['python3', 'example_chapter1.py']' returned non-zero exit status 9009.
(I received the same error when running the code in a Jupyter Notebook. I will move forward with the course, but will bear this in mind and come back to it later.)
import subprocess

for i in range(0, 5):
    subprocess.check_call(["python3", "example_chapter1.py"])
# example_chapter1.py contains a simple print() command that is supposed to be repeated five times
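One thing I may try when I come back to this (an assumption on my part, not something I have verified): exit status 9009 on Windows usually indicates that a command could not be found, so pointing subprocess at the Python interpreter that is currently running, via sys.executable, might avoid relying on python3 being on the PATH.
import subprocess
import sys

# hypothetical workaround: call the interpreter running this code
# instead of relying on "python3" being on the PATH
for i in range(0, 5):
    subprocess.check_call([sys.executable, "example_chapter1.py"])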
First, let’s create a function to identify the category of a file based on its suffix.
SUBDIRECTORIES = {
    "DOCUMENTS": ['.pdf', '.rtf', '.txt'],
    "AUDIO": ['.m4a', '.m4b', '.mp3'],
    "VIDEOS": ['.mov', '.avi', '.mp4'],
    "IMAGES": ['.jpg', '.jpeg', '.png']
}

def pickDirectory(value):
    for category, suffixes in SUBDIRECTORIES.items():
        for suffix in suffixes:
            if suffix == value:
                return category
    return 'MISC'  # if the filetype doesn't exist in our dictionary
Let’s try the function pickDirectory().
print(pickDirectory('.pdf'))
## DOCUMENTS
print(pickDirectory('.png'))
## IMAGES
print(pickDirectory('.py')) # when the suffix is not in the dictionary, return MISC
## MISC
In the rest of the session, Sam showed how to reorganize files into relevant folders, and the code worked from RStudio. Please check out the course instructions for details.
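For my own reference, here is a rough sketch of what that reorganization step could look like (my guess at the idea, not the course’s exact code): scan a folder, look up each file’s category with pickDirectory(), create the matching subdirectory if needed, and move the file there.
import os
import shutil

source = "folder_to_organize"  # hypothetical folder of mixed files

# materialize the directory listing first, since we modify the folder as we go
for entry in list(os.scandir(source)):
    if entry.is_file():
        # look up the category for this file's extension
        suffix = os.path.splitext(entry.name)[1].lower()
        category = pickDirectory(suffix)
        # create the target subdirectory if it doesn't exist yet
        target_dir = os.path.join(source, category)
        os.makedirs(target_dir, exist_ok=True)
        # move the file into its category folder
        shutil.move(entry.path, os.path.join(target_dir, entry.name))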
Chapter 1 has been a good learning experience! Moving forward, I will continue with some data-science-focused Python courses, and maybe come back in the future for Chapter 4 on APIs.