Set Up

library(reticulate) # to use python in RStudio
library(tidyverse) # data wrangling and plotting with R

This is a short note about data structures in R and Python. Throughout the note, R and Python code are compared side by side, with R code chunks in light pink and Python code chunks in light blue. I keep it mainly as a study note, but hopefully it will be of interest to fellow R users learning Python, and to Python users learning R.

Summary

As an R user, what I found particularly interesting about data structures in Python is the idea of mutable objects. In Python, lists and dictionaries are mutable: you can change them in place simply by applying a method (details here).

Another novel and intriguing concept for me is the contrast between aliases, shallow copies and deep copies. For mutable lists and dictionaries, a plain assignment such as b = a does not copy anything: it creates an alias, a second name for the same object, so a change made through either name shows up in both. Check out the examples, and how to make a copy that does not change with the original, in lists and dictionaries, respectively.

Strings in Python

Both R and Python have string data structures. Working with strings is a very helpful skill in many scenarios, for example:

Sometimes numerical values come in string format, because they contain dollar signs or commas.
Text analysis and NLP require us to work with strings.
Manipulating strings helps us print nicely formatted messages (for string formatting, check out two special types of strings in Python, f-strings and raw strings, here).
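The first and third scenarios above can be sketched together as follows. This is a minimal illustration rather than part of the original note: the variable names (amount_text, amount, message, raw_path) are made up for the example, and the Windows-style path is hypothetical.

```python
# A price that arrived as text, with a dollar sign and commas
amount_text = "$12,345,678"

# Strip the non-numeric characters, then convert to an integer
amount = int(amount_text.replace("$", "").replace(",", ""))

# f-string: expressions inside {} are evaluated; ':,' adds thousands separators
message = f"Total: {amount:,} dollars"

# Raw string: backslashes are kept literally, so '\n' and '\t' below
# are NOT turned into a newline or a tab
raw_path = r"C:\new_folder\table.csv"  # hypothetical path

print(amount)
print(message)
print(raw_path)
```

Raw strings are mostly useful for Windows paths and regular expressions, where backslashes would otherwise need to be doubled.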

There are lots of methods that can be applied to strings in Python. There is no need to remember all of them; we can always look them up with dir(). We can also get help on a particular method using help().

myString = "working with Python is fun"

dir(myString) # look up available methods

## ['__add__', '__class__', '__contains__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getnewargs__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__mod__', '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__rmod__', '__rmul__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', 'capitalize', 'casefold', 'center', 'count', 'encode', 'endswith', 'expandtabs', 'find', 'format', 'format_map', 'index', 'isalnum', 'isalpha', 'isdecimal', 'isdigit', 'isidentifier', 'islower', 'isnumeric', 'isprintable', 'isspace', 'istitle', 'isupper', 'join', 'ljust', 'lower', 'lstrip', 'maketrans', 'partition', 'replace', 'rfind', 'rindex', 'rjust', 'rpartition', 'rsplit', 'rstrip', 'split', 'splitlines', 'startswith', 'strip', 'swapcase', 'title', 'translate', 'upper', 'zfill']
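The names wrapped in double underscores are special methods that Python calls behind the scenes, so they are rarely what we are looking for. One possible way to shorten the list, sketched here with a made-up variable name (public_methods), is to filter them out with a list comprehension.

```python
my_string = "working with Python is fun"

# Keep only the names that do not start with an underscore
public_methods = [name for name in dir(my_string) if not name.startswith("_")]

print(len(public_methods))   # how many everyday methods are left
print(public_methods[:5])    # peek at the first few
```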

This is pretty cool. Let’s try some methods. Methods such as title() or lower() are self-explanatory. For methods that we are not familiar with yet, try the help() function.

myString.title()

## 'Working With Python Is Fun'

myString.upper()

## 'WORKING WITH PYTHON IS FUN'

myString.lower()

## 'working with python is fun'

myString.find('i') # where did it first appear?

## 4

myString.rfind('i') # where did it last appear?

## 20

myString.count('i') # how many times did it appear?

## 3

help(myString.split) # find help on a particular method

## Help on built-in function split:
## 
## split(...) method of builtins.str instance
##     S.split(sep=None, maxsplit=-1) -> list of strings
##     
##     Return a list of the words in S, using sep as the
##     delimiter string.  If maxsplit is given, at most maxsplit
##     splits are done. If sep is not specified or is None, any
##     whitespace string is a separator and empty strings are
##     removed from the result.

separatedStrings = myString.split() # split the elements
separatedStrings

## ['working', 'with', 'Python', 'is', 'fun']

help(myString.join) # learned about the "separator" in help

## Help on built-in function join:
## 
## join(...) method of builtins.str instance
##     S.join(iterable) -> str
##     
##     Return a string which is the concatenation of the strings in the
##     iterable.  The separator between elements is S.

'-'.join(separatedStrings) # use '-' as the separator

## 'working-with-Python-is-fun'

myString.replace('Python', 'both R and Python')

## 'working with both R and Python is fun'

'$12,345,678'.replace(',','') # comes in handy when working with numbers, e.g., replace commas with empty strings

## '$12345678'

Lists in Python

Lists are flexible: we can have different data types and structures in one list.
Lists are very helpful when we need to carry lots of data in one package. For example, a list may contain data frames of transaction, customer and product data respectively, and a dictionary of metadata about when the list was created.
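As a rough sketch of such a package in Python: plain lists of dictionaries stand in for the data frames here, which is an assumption made to avoid third-party packages, and all names and values are made up for illustration.

```python
from datetime import date

# Lists of dictionaries stand in for data frames (illustrative data only)
transactions = [{"id": 1, "amount": 25.0}, {"id": 2, "amount": 40.5}]
customers = [{"id": 101, "name": "Ann"}]
metadata = {"created": date(2021, 3, 18), "source": "example"}

# One list carrying different structures: two tables and a dictionary
package = [transactions, customers, metadata]

for item in package:
    # Show the type and size of each element in the package
    print(type(item).__name__, len(item))
```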

Define a List

# use the list() function
list_r <- list(5,"this is an r list with error",True)

## Error in eval(expr, envir, enclos): object 'True' not found

# TRUE in R, while True in Python
list_r <- list(5,"this is another r list",TRUE)


# a vector from 1 to 15 by 2
vector_r <- seq(1,15,2)
vector_r

## [1]  1  3  5  7  9 11 13 15

# use square brackets, no need to explicitly name the list "list"
# TRUE in R, while True in Python
list_p = [5,"this is a python list with error",TRUE]

## Error in py_call_impl(callable, dots$args, dots$keywords): NameError: name 'TRUE' is not defined
## 
## Detailed traceback:
##   File "<string>", line 1, in <module>

list_p = [5,'this is another python list',True]

Get the 1st element in a list

A note to fellow R users: Python is a 0-indexed language, so the 1st element has index 0.

# index=1 refers to the 1st element
# two pairs of square brackets to get a single item from the list
list_r[[1]]

## [1] 5

# single bracket returns a list (although it only contains one element)
list_r[1]

## [[1]]
## [1] 5


# index=0 refers to the 1st element
list_p[0]

## 5

To get a list from a list

Basic format of slicing: data[start:end:stride]
Note that the range from start to end is left inclusive, right exclusive: the item at start is included, the item at end is not.


l=[1,2,3,4,5,6]

# left inclusive: index 0 is the 1st element in the list
l[0]

## 1

# right exclusive; l[:3] (the same as l[0:3]) doesn't include the 4th element in the list
l[:3]

## [1, 2, 3]

l[3]

## 4

# so how do I include the last item in a slice?
l[0:6] # method 1

## [1, 2, 3, 4, 5, 6]

l[0:] # method 2

## [1, 2, 3, 4, 5, 6]

l[0:4:2] # we can use the stride to set how we would like to skip too

## [1, 3]

l[::2] # every other item starting from the 1st; here, all the odd numbers

## [1, 3, 5]

l[1::2] # every other item starting from the 2nd; here, all the even numbers

## [2, 4, 6]

l[-1::-1] # this reverses the list (l[::-1] does the same)

## [6, 5, 4, 3, 2, 1]
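The slices above return lists. To pick out just the last item as a single value, a negative index is probably the most common idiom. The short sketch below uses made-up variable names and the same example list.

```python
l = [1, 2, 3, 4, 5, 6]

last_by_length = l[len(l) - 1]  # the last valid index is len(l) - 1
last_by_negative = l[-1]        # negative indices count from the end
last_two = l[-2:]               # a slice always returns a list

print(last_by_length, last_by_negative, last_two)
# prints: 6 6 [5, 6]
```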

# this creates a vector from 1 to 6
v <- c(1:6)
# get the 1st to 3rd element of the vector
v[1:3]

## [1] 1 2 3

# this creates a list with two elements, each element is an integer vector
l <- list(1:6,7:10)
l[1:2]

## [[1]]
## [1] 1 2 3 4 5 6
## 
## [[2]]
## [1]  7  8  9 10

# get the first and last item of each list
# sapply in R is similar to map() or a list comprehension in Python, more later when we talk about functions
sapply(l,head,1)

## [1] 1 7

sapply(l,tail,1)

## [1]  6 10
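For Python readers, a rough counterpart of the two sapply() calls above could look like the sketch below. The name nested and the other variable names are made up for the example; a list comprehension is usually the more idiomatic choice, with map() shown for comparison.

```python
nested = [[1, 2, 3, 4, 5, 6], [7, 8, 9, 10]]

# List comprehension: take the first / last item of each inner list
firsts = [inner[0] for inner in nested]
lasts = [inner[-1] for inner in nested]

# map() with a lambda does the same job; list() materialises the result
firsts_with_map = list(map(lambda inner: inner[0], nested))

print(firsts, lasts, firsts_with_map)
# prints: [1, 7] [6, 10] [1, 7]
```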

length(l)

## [1] 2

Above in R, in order to get the length of a list, we use the length() function. In Python, we use both functions and methods to operate on an object.

# a function is applied to an object
len(l)

## 6

# methods are functions that an object can call upon itself 
l.append('this is the appended value')
l

## [1, 2, 3, 4, 5, 6, 'this is the appended value']
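The difference between a function and a method matters most when the method changes the object in place. A small sketch with a made-up list (scores) contrasts the built-in sorted() function with the sort() method of lists.

```python
scores = [3, 1, 2]

# Function: sorted() takes the list as an argument and returns a NEW list
ordered = sorted(scores)
print(ordered, scores)   # scores is untouched at this point

# Method: sort() is called on the list itself and changes it in place;
# it returns None, so there is nothing useful to assign
result = scores.sort()
print(result, scores)
```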

Deep vs Shallow copy

As an R user, one thing I found really interesting about Python lists is the idea of aliases versus copies. If we "copy" a list in Python with a plain assignment, no copy is made: we actually create an alias, a second name for the same list, so it changes with the original. I took a note about this here before, but it is always good to review and remind myself of this. An example follows.

l_copy = l # plain assignment: l_copy is a second name for the same list
l_copy.append('this is an appended value added to the copy') # add a value to the copy
l # the original changed too

## [1, 2, 3, 4, 5, 6, 'this is the appended value', 'this is an appended value added to the copy']

l.insert(0, 'this is a value inserted at the beginning of the original') # add one more value to the original
l_copy # the copy changed too

## ['this is a value inserted at the beginning of the original', 1, 2, 3, 4, 5, 6, 'this is the appended value', 'this is an appended value added to the copy']

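Isn’t this fascinating? If you would like a copy that does not change with the original, make a real copy. Slicing with l[:] creates a new list. Strictly speaking this is still a shallow copy (nested mutable items would be shared), which is enough here because the items are numbers and strings. For nested lists, copy.deepcopy() from the copy module makes a fully independent deep copy.

l_deepcopy = l[:] # a new list, independent of l for the flat items used here

l.pop(-1) # remove and return the last element from the original list

## 'this is an appended value added to the copy'

l_copy # the alias changed accordingly

## ['this is a value inserted at the beginning of the original', 1, 2, 3, 4, 5, 6, 'this is the appended value']

l_deepcopy # the separate copy is not changed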

## ['this is a value inserted at the beginning of the original', 1, 2, 3, 4, 5, 6, 'this is the appended value', 'this is an appended value added to the copy']
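The distinction between an alias, a shallow copy and a deep copy only becomes visible when the list contains another mutable object. The sketch below uses a made-up nested list to show all three side by side.

```python
import copy

original = [1, 2, ["a", "b"]]       # note the nested (mutable) inner list

alias = original                    # same object, second name
shallow = original[:]               # new outer list, inner list still shared
deep = copy.deepcopy(original)      # new outer list AND new inner list

original.append(3)                  # change the outer list
original[2].append("c")             # change the nested inner list

print(alias)    # [1, 2, ['a', 'b', 'c'], 3]
print(shallow)  # [1, 2, ['a', 'b', 'c']]
print(deep)     # [1, 2, ['a', 'b']]
```

The alias sees both changes, the shallow copy sees only the change to the shared inner list, and the deep copy sees neither.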

Mutable vs Immutable Objects

Another topic I took notes on before, but need to remind myself of from time to time, is that Python lists are mutable. Therefore, as shown above, I can change a list simply by applying a method.

l.reverse() # this reverses the items in l
l

## ['this is the appended value', 6, 5, 4, 3, 2, 1, 'this is a value inserted at the beginning of the original']

In contrast, remember that strings are immutable in Python. Therefore, applying a method doesn’t change the string; it returns a new one.

myString.upper()

## 'WORKING WITH PYTHON IS FUN'

myString # myString is unchanged

## 'working with Python is fun'

myString = myString.title() # 'changing' myString by creating a new one
myString

## 'Working With Python Is Fun'
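One way to see the difference between changing an object in place and rebinding a name is the is operator, which tests whether two names refer to the same object. This is a minimal sketch with made-up variable names.

```python
numbers = [1, 2, 3]
same_list = numbers
numbers.append(4)                 # in-place change: still the same object
print(numbers is same_list)       # True

text = "working with Python is fun"
old_text = text
text = text.title()               # rebinding: 'text' now names a new string
print(text is old_text)           # False
print(old_text)                   # the original string is untouched

try:
    old_text[0] = "W"             # strings do not support item assignment
except TypeError as error:
    print("TypeError:", error)
```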

In R, most objects are immutable. We can sort of change an object as follows, but we are effectively creating a new object (like what we did with myString in the above code chunk).

l <- list(1,2,3,4,5)

# this shows us the result if we reverse l
rev(l)

## [[1]]
## [1] 5
## 
## [[2]]
## [1] 4
## 
## [[3]]
## [1] 3
## 
## [[4]]
## [1] 2
## 
## [[5]]
## [1] 1

l # but l itself is unchanged

## [[1]]
## [1] 1
## 
## [[2]]
## [1] 2
## 
## [[3]]
## [1] 3
## 
## [[4]]
## [1] 4
## 
## [[5]]
## [1] 5

# to actually change l, I need to assign the reversed list back to it
l <- rev(l)
l # now l is reversed

## [[1]]
## [1] 5
## 
## [[2]]
## [1] 4
## 
## [[3]]
## [1] 3
## 
## [[4]]
## [1] 2
## 
## [[5]]
## [1] 1

Tuples in Python

In Python, tuples are also immutable. So if we would like to change a tuple, we need to explicitly assign the result to a name, as in R. Lots of methods that work for lists don’t work for tuples, because tuples are immutable. But let’s play with some built-in functions in Python.

t = (4,8,2,7) # create a tuple
dir(t) # fewer methods to choose from

## ['__add__', '__class__', '__contains__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getnewargs__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__rmul__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', 'count', 'index']
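The shorter method list reflects immutability: anything that would modify the tuple in place is simply missing. A small sketch of what happens when we try anyway follows; the name sizes is made up so that it does not interfere with t.

```python
sizes = (4, 8, 2, 7)

errors = []
try:
    sizes[0] = 1                  # item assignment is not supported
except TypeError as error:
    errors.append(type(error).__name__)

try:
    sizes.append(9)               # tuples have no append() method
except AttributeError as error:
    errors.append(type(error).__name__)

# "Changing" a tuple means building a new one and rebinding the name
sizes = sizes + (9,)
print(errors, sizes)
# prints: ['TypeError', 'AttributeError'] (4, 8, 2, 7, 9)
```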

dir(__builtins__) # check out Python's built-in names (functions, exceptions and constants)

## ['ArithmeticError', 'AssertionError', 'AttributeError', 'BaseException', 'BlockingIOError', 'BrokenPipeError', 'BufferError', 'BytesWarning', 'ChildProcessError', 'ConnectionAbortedError', 'ConnectionError', 'ConnectionRefusedError', 'ConnectionResetError', 'DeprecationWarning', 'EOFError', 'Ellipsis', 'EnvironmentError', 'Exception', 'False', 'FileExistsError', 'FileNotFoundError', 'FloatingPointError', 'FutureWarning', 'GeneratorExit', 'IOError', 'ImportError', 'ImportWarning', 'IndentationError', 'IndexError', 'InterruptedError', 'IsADirectoryError', 'KeyError', 'KeyboardInterrupt', 'LookupError', 'MemoryError', 'ModuleNotFoundError', 'NameError', 'None', 'NotADirectoryError', 'NotImplemented', 'NotImplementedError', 'OSError', 'OverflowError', 'PendingDeprecationWarning', 'PermissionError', 'ProcessLookupError', 'RecursionError', 'ReferenceError', 'ResourceWarning', 'RuntimeError', 'RuntimeWarning', 'StopAsyncIteration', 'StopIteration', 'SyntaxError', 'SyntaxWarning', 'SystemError', 'SystemExit', 'TabError', 'TimeoutError', 'True', 'TypeError', 'UnboundLocalError', 'UnicodeDecodeError', 'UnicodeEncodeError', 'UnicodeError', 'UnicodeTranslateError', 'UnicodeWarning', 'UserWarning', 'ValueError', 'Warning', 'WindowsError', 'ZeroDivisionError', '_', '__build_class__', '__debug__', '__doc__', '__import__', '__loader__', '__name__', '__package__', '__spec__', 'abs', 'all', 'any', 'ascii', 'bin', 'bool', 'bytearray', 'bytes', 'callable', 'chr', 'classmethod', 'compile', 'complex', 'copyright', 'credits', 'delattr', 'dict', 'dir', 'divmod', 'enumerate', 'eval', 'exec', 'exit', 'filter', 'float', 'format', 'frozenset', 'getattr', 'globals', 'hasattr', 'hash', 'help', 'hex', 'id', 'input', 'int', 'isinstance', 'issubclass', 'iter', 'len', 'license', 'list', 'locals', 'map', 'max', 'memoryview', 'min', 'next', 'object', 'oct', 'open', 'ord', 'pow', 'print', 'property', 'quit', 'range', 'repr', 'reversed', 'round', 'set', 'setattr', 'slice', 'sorted', 'staticmethod', 
## 'str', 'sum', 'super', 'tuple', 'type', 'vars', 'zip']

sorted(t) # apply the sorted function to sort t; it returns a new list

## [2, 4, 7, 8]

t # applying the function doesn't change t itself

## (4, 8, 2, 7)

t = tuple(sorted(t)) # sorted() returns a list, so wrap it in tuple(); this 'changes' t by creating a new object
t

## (2, 4, 7, 8)

print('The sorted tuple:', t)

## The sorted tuple: (2, 4, 7, 8)

print('The reversed tuple:', tuple(reversed(t)))

## The reversed tuple: (8, 7, 4, 2)

len(t) # len() for length

## 4

print(f'The tuple contains {len(t)} items.')

## The tuple contains 4 items.

Dictionaries in Python

Lists are not labeled: items are accessed by position.
In contrast, a dictionary provides key : value pairs, so items are accessed by key.
Dictionaries are important because data is often exported in JSON format, which closely resembles a dictionary.

# Create a dictionary with key : value pairs
# note: dictionaries keep insertion order from Python 3.7 onward; in older versions the order of the keys is not guaranteed
person_dict = {
    'fname': 'John',
    'lname': 'Doe',
    'age': 35,
    'employed': True,
    'twitter_followers': 3985
}

# Get the first and last names from the dict
print(person_dict['lname'])

## Doe

print(person_dict['fname'])

## John
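Because JSON and dictionaries are so close, the json module in the standard library converts between the two directly. The sketch below uses made-up JSON text that mirrors person_dict; it also shows get(), which avoids a KeyError when a key is missing.

```python
import json

# A JSON document as text (made-up data mirroring person_dict above)
json_text = '{"fname": "John", "lname": "Doe", "age": 35, "employed": true}'

person = json.loads(json_text)               # JSON text -> Python dict
print(person["fname"], person["employed"])   # JSON true becomes Python True

# get() returns a default instead of raising KeyError for a missing key
print(person.get("twitter_followers", 0))

# And back again: Python dict -> JSON text
print(json.dumps(person))
```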

Again, we can always use dir() to find out what methods are available for our objects.

dir(person_dict) # explore methods available

## ['__class__', '__contains__', '__delattr__', '__delitem__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setitem__', '__sizeof__', '__str__', '__subclasshook__', 'clear', 'copy', 'fromkeys', 'get', 'items', 'keys', 'pop', 'popitem', 'setdefault', 'update', 'values']

Keys, Values and Items in a dictionary

person_dict.keys() # get all the keys in the dictionary

## dict_keys(['fname', 'lname', 'age', 'employed', 'twitter_followers'])

list(person_dict.keys()) # turn the keys into a list

## ['fname', 'lname', 'age', 'employed', 'twitter_followers']

tuple(person_dict.keys()) # turn the keys into a tuple

## ('fname', 'lname', 'age', 'employed', 'twitter_followers')

for k in person_dict.keys(): # print out all the keys
  print(f'key: {k}')

## key: fname
## key: lname
## key: age
## key: employed
## key: twitter_followers

for k,v in person_dict.items(): # print out all the key:value pairs
  print(f'{k}:{v}')

## fname:John
## lname:Doe
## age:35
## employed:True
## twitter_followers:3985

help(person_dict.pop) # and we can always get help on a particular method

## Help on built-in function pop:
## 
## pop(...) method of builtins.dict instance
##     D.pop(k[,d]) -> v, remove specified key and return the corresponding value.
##     If key is not found, d is returned if given, otherwise KeyError is raised

Deep and shallow copy of dictionaries

# dictionaries are mutable, applying a method will change the object itself
person_dict.pop('age')

## 35

person_dict

## {'fname': 'John', 'lname': 'Doe', 'employed': True, 'twitter_followers': 3985}

import copy
person_dict_deepcopy = copy.deepcopy(person_dict) # make a deep copy before further change
person_dict_shallowcopy = person_dict # plain assignment creates an alias (not a real copy), for comparison

help(person_dict.popitem) # learned that this method removes and returns an item

## Help on built-in function popitem:
## 
## popitem(...) method of builtins.dict instance
##     D.popitem() -> (k, v), remove and return some (key, value) pair as a
##     2-tuple; but raise KeyError if D is empty.

person_dict.popitem()

## ('twitter_followers', 3985)

person_dict # the last key:value pair, `twitter_followers`, is gone

## {'fname': 'John', 'lname': 'Doe', 'employed': True}

person_dict_shallowcopy # the same happened to the alias

## {'fname': 'John', 'lname': 'Doe', 'employed': True}

person_dict_deepcopy # the deep copy was not changed

## {'fname': 'John', 'lname': 'Doe', 'employed': True, 'twitter_followers': 3985}
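As with lists, there is a middle ground between an alias and a deep copy: the copy() method of a dictionary makes a shallow copy. The difference only shows up when a value is itself mutable, as in this sketch with a made-up nested dictionary (output shown assumes Python 3.7 or later, where insertion order is kept).

```python
import copy

person = {"fname": "John", "languages": ["R"]}   # made-up nested example

alias = person                    # same dictionary, second name
shallow = person.copy()           # new dict, but the inner list is shared
deep = copy.deepcopy(person)      # fully independent

person["lname"] = "Doe"                  # add a key to the original
person["languages"].append("Python")     # change the nested list

print(alias)    # {'fname': 'John', 'languages': ['R', 'Python'], 'lname': 'Doe'}
print(shallow)  # {'fname': 'John', 'languages': ['R', 'Python']}
print(deep)     # {'fname': 'John', 'languages': ['R']}
```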

Other Data Structures

Python also supports other data structures, such as sets and the container types in the collections module.

One interesting thing about a set is that it automatically drops duplicates. Please see the following example.

mySet = {'cat','dog','koala','dog','kangaroo'} # create a set
print(mySet) # notice that the duplicate item 'dog' was dropped (sets are unordered, so the print order may vary)

## {'koala', 'kangaroo', 'cat', 'dog'}

print('Is cat in my set?', 'cat' in mySet)

## Is cat in my set? True

print('Is penguin in my set?', 'penguin' in mySet)

## Is penguin in my set? False
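Beyond membership tests, sets support the usual set algebra, and the collections module offers Counter for the case where the duplicate counts matter. The sketch below is illustrative only; the pets set and the variable names are made up.

```python
from collections import Counter

animals = ["cat", "dog", "koala", "dog", "kangaroo"]

unique_animals = set(animals)            # duplicates are dropped
pets = {"cat", "dog", "hamster"}

print(unique_animals & pets)             # intersection: in both sets
print(unique_animals - pets)             # difference: only in the first set
print(len(unique_animals | pets))        # size of the union

# Counter (from the collections module) keeps the counts a set throws away
counts = Counter(animals)
print(counts["dog"], counts["cat"])
# prints: 2 1
```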

R & Python Basics: Data Structure

An R user’s Python learning note

Mena WANG

18/03/2021