Scripted Harmony: Act I

Introduction

Welcome! My name is Trey, and I'll be guiding you through three acts of what I like to call Scripted Harmony: A Guide to R & Python. In collaboration with Dr. Nicole Dalzell, we've built a learning experience that will strengthen your technical abilities and introduce you to working in two different programming languages, practicing with them simultaneously.

When I was first learning a new language, I had no idea where to start, so these activities are designed to be an easy-to-use, thought-provoking introduction. As a note, this is the only time you'll see a long introduction, promise ;).

This is a set of modules geared towards teaching you Python techniques from an R background, or vice versa. By practicing core data science methods, we'll work towards an understanding of the fundamentals of data manipulation in both R and Python code, all inside Quarto documents.

If you are beginning these modules at Wake Forest, you're probably aware that your statistical training so far has been mostly R and RStudio-focused. We want to broaden those coding abilities as much as possible! Diversifying your programming palette will serve you well as you prepare to enter industry (or higher education).

Whether you're an experienced coder or just freshly enrolled in your Introduction to Statistics class, you'll have all the resources you need to practice the fundamentals of coding in both languages. Just remember, this public resource gets progressively more difficult across the three learning modules. You'll encounter many concepts that build your resilience and hone your programming skills as you develop more advanced statistical skills in parallel.

Okay, I’m done with my introduction, so with all of this jargon in mind, let’s start with some basics of data wrangling and usage in both R and Python!

Note: A link to the answer key can be found here: ___, but make sure to give it your best before relying too heavily on it ;)

Another Note: This lab assumes that you have already installed the software and gone through an RMarkdown tutorial. If you have not done so, make sure to do both before starting the lab.

Goal

Starting with some simple functions, we'll explore a data set and its missing data, and begin to look at how the basics of linear regression are coded in both R and Python.

Loading the Data

Quarto lets you switch easily between Python and R code, so you can practice both side by side rather than separately. Open up a new Quarto document.

After you’ve opened up a new Quarto document, we’ll first need to load the data into our environment. Today, we’ll be working with data already set up in the software, but in future Acts, we’ll use data from various sources.

Specifically, we'll be working with a data set that's my personal favorite: the Palmer Penguins data set.

The penguins data set ships with the palmerpenguins package in R, so once that package is installed we can load the data with a single line of code. If you do not have the palmerpenguins library installed on your computer, make sure to do that now.

(If you’re like: “I have no idea how to do that”, don’t fret. Click “Tools” at the top of your screen and select “Install Packages”. Afterwards, type in “palmerpenguins” and hit install. This will take a minute. An intermission, one may say)

Now, open a new chunk in your Quarto document (the shortcut for my friends using Windows is Ctrl + Alt + I; on a Mac it's Cmd + Option + I). Copy this code into the new R chunk:

library(palmerpenguins)
data(penguins)
penguins <- data.frame(penguins)

Hit the little play button in the top right corner of this code chunk. You should see a green bar flash on the side of the code chunk as it runs. Once it finishes, you can see the penguins data frame in your Global Environment pane, located (most likely) on the top right side of your screen.

Question 1

According to your environment, what are the dimensions of the data frame you've just loaded (i.e. how many rows and columns are in the penguins data set)?

Let’s go ahead and try this in Python.

Create another code chunk. This time, rather than running R code, we want to switch to Python. To do so, just delete the lowercase "r" in the curly braces at the top of the code chunk and type "python". Quarto will then treat the chunk as Python code.
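For reference, a Python chunk in a Quarto document looks like this: three backticks and the language name in curly braces on the first line, then three backticks again to close the chunk.

```{python}
# Your Python code goes here
```

Now, input the same code as we did before and run the chunk.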

An error? Gah, already? It hasn't even been two chunks and we've already seen issues. Don't worry, we'll encounter a few of these throughout our journey. Just remember, debugging is just as important as coding itself. Getting data out of an R package and into Python requires a few alterations to our code.

We need to save the penguins data set as a CSV, or comma-separated values file, and load it in from our directory. The exact path is going to differ for everybody, since we're all working in different spaces, folders, and file paths, so I'll give you a general idea of what needs to be done to load the data. First, let's write the data to a CSV using this R code:

write.csv(penguins, "penguins.csv", row.names = FALSE)

Let's load this data using Python. Don't worry about all the intricacies of this code; we'll discuss a lot of these concepts later in this lab, but for now, trust me:

import pandas as pd
penguins_py = pd.read_csv("*Insert Filepath to get to the Dataset*/penguins.csv")

The file path (or directory) I've written as a placeholder is how your computer searches through your files and finds the CSV you're telling it to grab and put in your global environment.

If you need help finding this file path, open the folder that contains the data set in your file explorer, click the address bar at the top (the one showing the chain of folders you clicked through to get there), and copy that path into the asterisk portion of the code I provided.
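For example, a filled-in version might look something like the code below. The path shown is purely made up for illustration; yours will depend on where you saved the file (forward slashes work on Windows too):

import pandas as pd

# Hypothetical example path; replace it with the folder where you saved penguins.csv
penguins_py = pd.read_csv("C:/Users/yourname/Documents/penguins.csv")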

Last error check: If you have a question box that looks like this:

INSERT LAST PIC

Don't worry, this beast isn't a difficult one to tackle. Go ahead and click "No" and continue with your code. You'll usually have to do this each time you restart an R session or exit the program. If you didn't get this box, that's great; ignore everything I just said.

But why did this work? Rather than reusing the R object directly, we wrote it out to a CSV file with R and then read that file into Python with a Python-specific function. Some of this code may seem very similar, like write.csv and read_csv, but there are subtle differences you'll discover through these Acts. In this case, we read the penguins data into Python and called it penguins_py.

Here's another question: you can see the penguins data set in your environment from the R code, but where did our penguins_py data get stored?

Switching to our Python Environment

When switching to Python, you'll notice a button in your Environment pane labeled "R". Click on it and a drop-down menu appears containing "Python". Click on that, and you'll see your Python version of the data set.

Ah! There it is, our penguins_py data set. Click on it. Another window will pop up containing the penguins data set we loaded previously.

Question 2

How many different variables (also known as features) do you see in our data?

Task

We’re going to perform some preliminary data manipulations, exploratory data analysis (EDA), and linear regression modeling.

Our client wants to learn more about penguins in Antarctica and has asked us to monitor their growth and any features related to it. More specifically, they'd like to build a model with a penguin's body mass as the response variable.

Question 3

What is the name of our response variable in the data?

Data Summaries

If you want to find out how this data was collected and the motivation behind it, you can always type ?penguins into your console to access the help/tutorial page on this data set.

Fun note: putting a ? in front of any function you wish to run in R will show you its help page, including the parameters that function needs and proper examples of the code in use! Just make sure the most recent chunk you ran was an R chunk; otherwise the console will still be expecting Python code.

Alongside our response variable, we've got seven other features we could look at in our model:

  • species: the specific species of penguin
  • island: the island the penguin resides on
  • bill_length_mm: a penguin’s bill length in millimeters
  • bill_depth_mm: a penguin’s bill depth in millimeters
  • flipper_length_mm: a penguin’s flipper length in millimeters
  • sex: the penguin’s biological sex
  • year: the year the penguin measurements were recorded

Question 4

How many variables are numeric and how many are categorical?

The Response Variable

After identifying what the client is looking for, we need to look at our Y variable, body mass, to gain insight into its distribution. This process of preliminary data analysis is called Exploratory Data Analysis (EDA). This can look like graphs, tables, charts, etc.

To see some numeric statistics of our response feature, body_mass_g, we can use the summary() function in R.

Question 5

Using the summary() function in R, find the maximum body mass recorded in our penguins data set

In R, to access a column inside a data frame, we can use $ or index by the column's name in square brackets. For example, if we wanted to access the flipper length column, we could type either of these two lines of code:

penguins$flipper_length_mm
penguins[, "flipper_length_mm"]

In Python, we access and execute commands in a slightly different way: we use . (or square brackets) instead of $.
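For example, here's the Python parallel to the flipper length code above, using the penguins_py data frame we loaded earlier. Either line returns that column:

penguins_py.flipper_length_mm
penguins_py["flipper_length_mm"]

Now, let's put this idea to use.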

Question 6

Load the response variable column in Python

Similar to the summary() function in R, the describe() function gives us some summary statistics about a certain column. But we have to access this column first. If you're coming from an R-centered background, this may feel odd. To use the functions belonging to an object or data frame, we attach them to the end with a ., just like we do when accessing columns. This sounds pretty weird, I know, but if you think of it like a pipe (|> in base R, or %>% from magrittr/dplyr), it's actually very similar: the thing on the left gets fed into the function on the right.

In R we've been trained to write function(thing), whereas in Python we need to adapt to thing.function().

Let’s look at an example of this:

penguins_py.body_mass_g.min()

Notice how the function takes no arguments; instead, it's attached with a . after the data set and column I want to access. Comparatively, we would just do min(penguins$body_mass_g) in R.
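Since describe() came up a moment ago but we haven't actually run it, here's what that call looks like. For a numeric column it returns the count, mean, standard deviation, minimum, quartiles, and maximum, much like summary() does in R:

# Summary statistics for the body mass column, analogous to summary() in R
penguins_py.body_mass_g.describe()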

Question 7

Find the maximum body mass recorded in our penguins data set using Python code

Nice! So we now know how to access some information in both languages. This is the foundation for accessing the columns and rows within any data set we’re working with. We’ve also explored our response variable and the different values it takes on.

But wait! Before we can begin fitting models and giving our client results, you may have noticed a couple of NAs, or missing values, in the summary output. That means we don't know every single value in our data set.

Question 8

How many penguins don’t have a body mass recorded for them?

Basics of Missing Data

Before we proceed, we need a data set without any missing values. There are a multitude of methods for handling missing values. For now, we'll use a complete case analysis, which means we keep only the rows that have no missing values (NAs). We'll do this in both of our data sets, naming the results penguins_noNA and penguins_py_noNA, respectively.

In R, not only do we need to drop the rows that contain missing values, but we also need to assign the result to a new name. Create a new code chunk and copy the following R code:

penguins_noNA <- na.omit(penguins)

The new name of our data is penguins_noNA (very subtle). The <- is R's assignment operator, used to give a name to a data frame, variable, function, etc. The na.omit() function takes our previous data frame, penguins, and removes every row that contains a missing value.

In Python, the logic is very similar. Instead of <-, we use =, and instead of na.omit() we use the dropna() function.
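To see the pattern without giving away the next question, here's a tiny sketch on a made-up data frame (the toy data below is purely for illustration):

import pandas as pd

# A toy data frame with a couple of missing values
toy = pd.DataFrame({"a": [1, None, 3], "b": [4, 5, None]})

# dropna() keeps only the complete rows; = assigns the result to a new name
toy_noNA = toy.dropna()
print(toy_noNA)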

Question 9

Make the penguins_py_noNA data set. Remember the syntax for accessing Python data frames

Question 10

How many rows did we get rid of to make our penguins_noNA data set?

There are plenty of pros and cons to this method, and if you want to learn more, there's a lot of literature on the different ways of handling missing data. Now that we've explored our target variable and cleaned our data set, it's time for a little more EDA!

EDA (Data Visualizations)

Our client is particularly interested in exploring the association between body mass and the flipper length of a penguin.

In order to proceed with a linear regression model, we want to make sure it's the right choice and fit for our data. There are certain conditions we need to consider before jumping to any one solution.

In R, we're going to use the package ggplot2. Make sure this is installed, and load the library, like we did with the palmerpenguins package.

Question 11

What code is needed to load the ggplot2 library?

This is a skill you'll use A LOT in your data science journey. There are so many different libraries out there for so many different purposes. But you can't use the methods buried in a library unless you load it, and loading libraries can be a pain: you get messages telling you what's been loaded, what version it is, and so much more information that makes you go "why do I need all this?".

Well, there's actually a way to load these libraries quietly, without printing all the information you don't want to see. When loading the library, you can use the R code:

suppressMessages(library(ggplot2))

This will prevent any automatic reply from R coming at you after you hit play. We'll run through sections of our Exploratory Data Analysis using R, and then repeat the process in Python. Some of this may be review, so take it at your own pace, and feel free to bounce around whenever you deem necessary.

Response Variable EDA

First, let’s look at the distribution of our response variable, body mass. Now, this set of learning modules assumes you are already familiar with the basics of R, especially ggplot. If you feel confident about this topic, every time we make visualizations in R, go ahead and skip to where we parallel the code in Python. However, for those of you who would like a quick refresher/review, stick around.

Now, a bare-bones call like ggplot(penguins_noNA, aes(x = body_mass_g)) + geom_histogram() will draw the histogram, but I know what you're probably thinking: Ew. It's boring. Where's the color, the title, and all the add-ons that ggplot has to offer?

You're right. I hate a plain gray graph too, so let's fix this by adding some color to our graphs!

First, let's understand what the ggplot() function is doing. The ggplot() portion of the code creates the background: after specifying the data, we use aes() to tell R which columns we want as our x (and, later, y) variables. It's laying the foundation for us to draw our graph on. To add to this background, we use the + sign, and geom_histogram() indicates we want to make a histogram. In this function, we can specify the number of bins, the fill and outline colors of our bars, etc.

Now we're ready to add some color and labels. Copy and paste the following code:

ggplot(penguins_noNA, aes(x = body_mass_g)) +
  geom_histogram(fill = 'cyan', col = 'black', bins = 20) +
  labs(title = "Figure 1: A Nice Graph (with color!)", x = "Body Mass (grams)", y = "Count")

The + labs() portion of the code indicates we want to add some labels, with parameters like title, x, and y to show where we want to put these labels on our graph. We can change these arguments to whatever we want them to be, which allows us to create more effective visualizations for whoever needs to see them. You try!

Question 12

Make the same histogram with bins = 30, outlining the bars in yellow and filling them with cyan. Call this plot Figure 1

As a note, keep in mind that your axes and titles shouldn't be too complicated. You've got a limited amount of space to convey the information from the data, and you need to use it wisely. At the same time, try not to be so concise that you start using abbreviations no one can read. It's a difficult equilibrium to reach. Finding that balance comes with practice, and continually asking yourself and others whether a graph can be interpreted without explanation is an invaluable habit in whatever field you choose.

Visualizations in Python

Alright! That was a lot of information in R; now it's time to translate some of those skills into Python. This section will be a little shorter, as Python doesn't have quite as many straightforward, simple functions for visualizing data. Its strength lies more in data manipulation, which will be explored in later Acts (and a little at the end). We'll see more Python-heavy labs, don't you worry.

In Python, we're going to use the package entitled matplotlib. This works very similarly to a library or package in R, but because of the differing syntax and usage, loading it also looks quite different from what we coded above.

In a Python chunk, copy the following code:

import matplotlib.pyplot as plt

(Check-in: Don’t forget to keep switching those rs to pythons and vice versa as you continue bouncing between languages. Isn’t it so hard, coding in two languages at once?!)

Quick Intermission Regarding Libraries

This is known as importing a library. The import statement loads a certain library (here, the pyplot module of matplotlib), and the as plt portion tells Python what you'll type to access its functions. If this weren't specified, you'd have to type matplotlib.pyplot every single time you wanted to use it, which is just super inconvenient and unnecessarily time-consuming. Who has time to type all those extra letters?

Like the library() function in R, we’re telling Python “Hey, I’d like to use this library and all the functions that go along with it, please check it out. Also, anytime I use it, I only want to have to type plt, thanks”.
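Here's a small sketch of what the alias saves us (the numbers are just toy data):

import matplotlib.pyplot as plt   # the alias we set up above

plt.hist([1, 2, 2, 3, 3, 3])      # short and sweet
# Without the alias (i.e. import matplotlib.pyplot), this line would read:
# matplotlib.pyplot.hist([1, 2, 2, 3, 3, 3])
plt.show()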

Back to our Regularly Scheduled Program

Let’s make a histogram for our response variable. Don’t forget the change in syntax here, or else you’ll receive errors galore from Quarto spewing all types of messages at you.

To make any plot using the Matplotlib package, you specify the type of visualization you want to make and then use plt.show() to display the picture.

For example, here's the structure for exploring our response variable's distribution in Python:

x = penguins_py_noNA.body_mass_g
plt.hist(x)
plt.show()

Here, we have to specify the x variable before we actually begin to plot it. This is because the .hist() function takes in a Series (pandas' version of a vector), which is just one column of the full data frame. Remember, there are two ways of declaring this variable:

x = penguins_py_noNA.body_mass_g
x = penguins_py_noNA["body_mass_g"]

You'll see that the default themes and styling of the visualizations in Python are a little different. That's okay; the premise and the data don't change. Let's add some labels to this graph so we can read it:

x = penguins_py_noNA.body_mass_g
plt.hist(x)

plt.title("Python Figure 1")
plt.xlabel("Body Mass (g)")
plt.ylabel("Count")

plt.show()

Not much different, huh? We’re just specifying each line with a new addition to the graph we’d like to make.

Next, let's try the same idea with two numeric variables: a scatter plot of flipper length against body mass.

Question 13

Before any code is actually written, what do you think the commands will be to make a scatter plot using the Matplotlib library?

As you may have guessed, we only need to change one aspect of the code: the type of plot!

x = penguins_py_noNA.flipper_length_mm
y = penguins_py_noNA.body_mass_g
plt.scatter(x,y)

plt.title("Python Figure 2")
plt.xlabel("Flipper Length (mm)")
plt.ylabel("Body Mass (g)"")

But wait! Something’s missing!

Question 14

What code is missing from the chunk above?

You're on a roll! These visualizations are great, but remember, they only use the numeric variables. Let's add a penguin's sex to the plot to see whether the trend holds across groups. To color the dots, we're going to use a different function, lmplot(), from the seaborn package. It works very similarly to ggplot. The code to plot is below:

import seaborn as sns
sns.lmplot(x='flipper_length_mm', y='body_mass_g', data=penguins_py_noNA, hue='sex', fit_reg=False)

Here, most everything is the same: hue plays the role of the color aesthetic in ggplot, and fit_reg=False tells seaborn not to fit a regression line (if we wanted one, we'd change it to True).

Question 15

Explore the same two numeric variables with island or species as your hue (indicator) variable instead. Do there seem to be any patterns or trends in your visualization?

Tables in Python (Categorical EDA)

To view the categorical variables in a table, rather than another visualization, we'll use a feature of pandas, one of the most popular Python libraries. It's the same process as before: find the correct function and translate the syntax from R to Python.

Question 16

Import the pandas package, indicating you want to type pd every time you use it.

We're going to use the value_counts() function, part of the pandas package, which takes the structure df.colName.value_counts():

penguins_py_noNA.sex.value_counts()

Question 17

Are these results similar to what was previously found using R? Why or why not?

After this, it's just a matter of exploring the different options Python and R have to offer. There are cheat sheets and tutorials covering visualizations for many scenarios you'll encounter in the real world. From this lab, though, you've gained very useful insight into the basics of exploratory data analysis in both R and Python.

There are plenty of visualization possibilities to explore in Python. However, R is hard to beat when it comes to visualizing data: ggplot's easy-to-use format sets an exceptional standard for EDA visualizations that are needed quickly.

However, this is just my opinion. If you find Python to be more effective and think it looks prettier, by all means, continue to use it! I may be a little biased in R’s favor, but we’ll hit several trials where Python beats out R any day in future acts of this play, so for my die hard Python fans out there, we’ll show off in a bit.

Base matplotlib doesn't have a simple one-line function that adds a fitted line to our scatter plot (seaborn's lmplot with fit_reg=True will do it for you). To do it in matplotlib itself, you have to fit the model separately and add its line to the plot yourself, which we won't require in this module. Which means you're done!
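(Purely optional: if you're curious, here's a sketch of one way you could add a fitted line yourself, using numpy's polyfit to estimate the slope and intercept. None of this is required for Act I, so feel free to skip straight to rendering.)

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Assumes penguins.csv is in your working directory; use the same path as earlier
penguins_py_noNA = pd.read_csv("penguins.csv").dropna()

x = penguins_py_noNA.flipper_length_mm
y = penguins_py_noNA.body_mass_g

# Fit a simple least-squares line and draw it over the scatter plot
slope, intercept = np.polyfit(x, y, 1)
plt.scatter(x, y)
plt.plot(x, slope * x + intercept, color="red")
plt.title("Optional: Scatter Plot with a Fitted Line")
plt.xlabel("Flipper Length (mm)")
plt.ylabel("Body Mass (g)")
plt.show()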

Rendering/Knitting your Document

If you want to export this beautiful lab of yours, with all your hard work, you can click the Render button towards the top of your screen. After a moment, Quarto will generate an .html (or .pdf) file, depending on the format you chose when creating the document, which you can then right-click to save and (re)view at a later date!

Conclusion

That's it! You've muscled your way through the first act of your journey to becoming more well-versed in a completely new coding language. The second act of this play focuses on more complex statistical learning in both languages. Just as you've seen that visualizations shine in R, you'll find that Python's strength lies in machine learning.

Keep practicing! With practice, switching between languages will become light work, and you'll not only pick up a new programming language but also hone your statistical and R programming skills along the way!

(If you’d like to read the research that went into these modules, and other work I’ve done, click here ____, shout out shameless plugs).

References

Creative Commons License
This work was created by Trey Roark and is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. Last updated December 6, 2023.

The data set used in this lab is from the palmerpenguins library in R: Horst AM, Hill AP, Gorman KB (2020). palmerpenguins: Palmer Archipelago (Antarctica) penguin data. R package version 0.1.0. https://allisonhorst.github.io/palmerpenguins/. doi: 10.5281/zenodo.3960218.