I love using R! R is brilliant for statistical methods, data wrangling and visualization. Since many R developers are statisticians or mathematicians, we get to try out their new research findings through R. For example, earlier this week I attended a presentation on Anomaly Detection where applied mathematician Sevvandi Kandanaarachchi introduced her research on applying Item Response Theory to construct unsupervised AD ensembles (preprint of the research article). The algorithm is also published as an R package, outlierensembles, for any R user to test and apply.
However, Python is a general-purpose language and hence much more versatile. Recently, I started to think about how to deploy statistical or ML models in production, say on a website or in a mobile app. After some research, I came to understand that to deploy a statistical/ML model in production, some knowledge of Python would be very helpful, especially when you need to work with other developers and engineers on various cloud platforms. For more discussion on R vs Python, this is a good read.
So here I am, an R user learning Python.
I have some very basic understanding of Python, but only in terms of data visualization. In 2020, when Covid-19 case numbers started to surge in Australia, I started to make some DataViz to help interpret the numbers. Because a) R is not very supportive of dual-axis graphs (for good reason; they should be used with caution) and b) I was curious about how Python works, I tried to use Python (especially the Matplotlib and Seaborn packages) to make some of the Covid-19 visuals (please see an example of a dual-axis graph I made with Python).
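Just as an illustration (a toy example with made-up numbers, not the actual Covid chart I made), the core of a dual-axis plot in Matplotlib is the twinx() method, which adds a second y-axis sharing the same x-axis:
import matplotlib.pyplot as plt

# toy data standing in for daily and cumulative case counts
days = list(range(1, 11))
daily_cases = [5, 8, 12, 20, 18, 25, 30, 28, 35, 40]
cumulative = [sum(daily_cases[:i + 1]) for i in range(len(daily_cases))]

fig, ax1 = plt.subplots()
ax1.bar(days, daily_cases, color="lightblue")
ax1.set_ylabel("Daily cases")

# second y-axis sharing the same x-axis
ax2 = ax1.twinx()
ax2.plot(days, cumulative, color="darkred")
ax2.set_ylabel("Cumulative cases")

plt.show()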
But beyond DataViz, I know very little about Python. So the first material I am using is a beginner-friendly course, Using Python for Automation on LinkedIn Learning by Madecraft and Sam Pettus. I know very little about the topic of automation and am curious to learn more. Upon finishing the first chapter of this course, I will move on to more data-science-specific topics.
The above-mentioned course has four chapters. This document records my notes for Chapter 1, Automate File, Folder, and Terminal Interactions. The notes are quite simple, mainly for my own records and for some reflection from the perspective of an R user. Please refer to the original course on LinkedIn Learning for a fuller understanding of the topic.
Where relevant, I will produce some R code to compare with the Python code, so hopefully these notes might be of interest to R users learning Python, and vice versa. To help distinguish between the two, in this note R code chunks are in light pink, while Python chunks are in light blue.
The IDE I use is RStudio, which allows you to run Python code through the reticulate package.
Let’s start.
The first task in the course is to read a txt file. The txt file contains some hypothetical data: each line holds a name, a number, and a test result (P for pass, F for fail). The values are separated by spaces.
The following code is offered in the Python course mentioned above.
# open the file in "r"ead mode
f = open("Exercise Files/inputFile.txt", "r")
print(f.read())
## Mary 25 P
## John 32 P
## Dylan 19 F
## Julia 23 F
## Chad 17 F
## Jack 20 F
## Hailey 26 P
## Iris 23 P
## Jacob 29 P
## Jalen 37 F
## Anthony 47 F
## Jamie 26 P
## Sydney 19 F
## Chloe 21 P
## Natalie 29 P
## David 23 P
## Mario 51 P
## Jenna 28 F
## Josh 39 P
## Damean 42 F
## Zach 19 F
## Kayla 28 P
## Ben 14 F
## Hunter 61 P
## Albert 29 F
## Lauren 31 F
## Erica 16 P
## Kyle 19 P
## Jordan 29 F
## Kaleb 47 F
## Rosanna 45 P
## Joy 28 P
## Bill 29 F
## Jim 67 P
## Robin 24 F
## Sansa 28 P
## Juan 73 P
## Colin 59 P
## Jake 8 F
## Kate 58 P
## Jade 26 P
## River 29 P
## Chris 31 P
## Grace 18 F
# close the file after the task
f.close()
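An aside before switching to R (my own note, not from the course): the more idiomatic way to read a file in Python is a with block, which closes the file automatically even if an error occurs, so no explicit close() is needed.
# read the same file with a context manager; the file is closed automatically
with open("Exercise Files/inputFile.txt", "r") as f:
    print(f.read())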
In R, using the read_delim() function, we can read the text file as follows. The three columns in inputFile are recognized by setting delim to a space.
library(tidyverse)

f <- read_delim("Exercise Files/inputFile.txt", delim = " ", col_names = FALSE)
f %>%
  # show the first five lines of the data
  head(n = 5)
## # A tibble: 5 x 3
## X1 X2 X3
## <chr> <dbl> <chr>
## 1 Mary 25 P
## 2 John 32 P
## 3 Dylan 19 F
## 4 Julia 23 F
## 5 Chad 17 F
# remove the object f from the environment
rm(f)
Even with such a simple task, a very interesting difference between the R and Python approaches already emerges: the relationship between “objects” and “methods/functions”.
In Python, the file object f carries its own methods (e.g., read() and close()), so when you want to read file f, you use f.read(). In R, you instead call a standalone function, read_delim(), and assign its result to the object f. This leads to a very interesting discussion about functional versus object-oriented programming architecture for data science. Here is a good read on the topic. Maybe I will write more about it after I have gone deeper into my Python learning journey.
BTW, you can do object-oriented programming in R too. But OOP is a bit more challenging in R than in other languages.
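A tiny Python example of this distinction (my own illustration, not from the course): methods are attached to the object and called with dot notation, while functions stand on their own.
text = "hello"
# method: upper() belongs to the string object, so it is called with dot notation
print(text.upper())
## HELLO
# function: len() is a standalone function applied to the object
print(len(text))
## 5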
Here we would like to print only part of the file: people who passed the test.
The approach introduced in the course is to split each line, take the 3rd element, and check whether it equals P. It is important to note that Python indexes from 0, so [2] refers to the 3rd element.
f = open("Exercise Files/inputFile.txt", "r")
for line in f:
    # split each line by space
    line_split = line.split()
    # check whether the 3rd element is P
    if line_split[2] == "P":
        print(line)
In R, since read_delim() automatically recognizes the three columns, we can refer to the third column (X3) by name.
f <- read_delim("Exercise Files/inputFile.txt", delim = " ", col_names = FALSE)
f %>%
  filter(X3 == "P")
## # A tibble: 25 x 3
## X1 X2 X3
## <chr> <dbl> <chr>
## 1 Mary 25 P
## 2 John 32 P
## 3 Hailey 26 P
## 4 Iris 23 P
## 5 Jacob 29 P
## 6 Jamie 26 P
## 7 Chloe 21 P
## 8 Natalie 29 P
## 9 David 23 P
## 10 Mario 51 P
## # ... with 15 more rows
Now let’s try to split the records into two separate files: one for those who passed and one for those who failed.
In Python, note that we need to create the passFile and failFile objects first, and operate on them through open(), write() and close().
f = open("Exercise Files/inputFile.txt", "r")
# create the pass and fail files respectively, and write to them
passFile = open("Exercise Files/passFile.txt", "w")
failFile = open("Exercise Files/failFile.txt", "w")
for line in f:
    line_split = line.split()
    # if P, save to passFile; else save to failFile
    if line_split[2] == "P":
        passFile.write(line)
    else:
        failFile.write(line)
f.close()
passFile.close()
failFile.close()
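Another aside (again my own variation, not the course’s code): the same two-file task can be written with context managers, so that all three files are closed automatically at the end of the block.
# equivalent logic with context managers; no explicit close() calls needed
with open("Exercise Files/inputFile.txt", "r") as f, \
     open("Exercise Files/passFile.txt", "w") as passFile, \
     open("Exercise Files/failFile.txt", "w") as failFile:
    for line in f:
        line_split = line.split()
        if line_split[2] == "P":
            passFile.write(line)
        else:
            failFile.write(line)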
In R, we don’t need to create passFile or failFile objects in advance; we just filter and write to disk directly. The functions filter() and write_csv() are independent of the objects/data they operate on.
f <- read_delim("Exercise Files/inputFile.txt", delim = " ", col_names = FALSE)
# save the R outputs as .csv so they won't overwrite the files created by Python
f %>%
  filter(X3 == "P") %>%
  write_csv("Exercise Files/passFile.csv")
f %>%
  filter(X3 == "F") %>%
  write_csv("Exercise Files/failFile.csv")
I can’t execute the following Python code in RStudio. The error message is CalledProcessError: Command '['python3', 'example_chapter1.py']' returned non-zero exit status 9009.
(I received the same error when running the code in a Jupyter Notebook. I will move forward with the course, but will bear this in mind and come back to it later.)
import subprocess

for i in range(0, 5):
    subprocess.check_call(["python3", "example_chapter1.py"])
# example_chapter1.py contains a simple print() command that is supposed to be repeated five times
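One thing I may try when I come back to this (an assumption on my part, not something I have verified): exit status 9009 on Windows usually indicates that a command could not be found, so pointing subprocess at the Python interpreter that is currently running, via sys.executable, might avoid relying on python3 being on the PATH.
import subprocess
import sys

# hypothetical workaround: call the interpreter running this code
# instead of relying on "python3" being on the PATH
for i in range(0, 5):
    subprocess.check_call([sys.executable, "example_chapter1.py"])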
First, let’s create a function to identify the category of a file based on its suffix.
SUBDIRECTORIES = {
    "DOCUMENTS": ['.pdf', '.rtf', '.txt'],
    "AUDIO": ['.m4a', '.m4b', '.mp3'],
    "VIDEOS": ['.mov', '.avi', '.mp4'],
    "IMAGES": ['.jpg', '.jpeg', '.png']
}

def pickDirectory(value):
    for category, suffixes in SUBDIRECTORIES.items():
        for suffix in suffixes:
            if suffix == value:
                return category
    return 'MISC'  # if the filetype doesn't exist in our dictionary
Let’s try the function pickDirectory().
print(pickDirectory('.pdf'))
## DOCUMENTS
print(pickDirectory('.png'))
## IMAGES
print(pickDirectory('.py')) # when the suffix is not in the dictionary, return MISC
## MISC
In the rest of the session, Sam showed how to reorganize files into relevant folders, and the code worked from RStudio. Please check out the course instructions for details.
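For my own reference, here is a rough sketch of what that reorganization step could look like (my guess at the idea, not the course’s exact code): scan a folder, look up each file’s category with pickDirectory(), create the matching subdirectory if needed, and move the file there.
import os
import shutil

source = "folder_to_organize"  # hypothetical folder of mixed files

# materialize the directory listing first, since we modify the folder as we go
for entry in list(os.scandir(source)):
    if entry.is_file():
        # look up the category for this file's extension
        suffix = os.path.splitext(entry.name)[1].lower()
        category = pickDirectory(suffix)
        # create the target subdirectory if it doesn't exist yet
        target_dir = os.path.join(source, category)
        os.makedirs(target_dir, exist_ok=True)
        # move the file into its category folder
        shutil.move(entry.path, os.path.join(target_dir, entry.name))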
Chapter 1 has been a good learning experience! Moving forward, I will continue with some data-science-focused Python courses, and maybe come back in the future for Chapter 4 on APIs.