R & Python Cheat Sheet

Task 1: Loading Packages

Loading a package in R and Python is a critical task that all data scientists will likely perform during a data science project, usually at the beginning of their workflow.

As R and Python are open source languages there is a large number of community developed packages available. These packages will generally have a specific purpose and provide enhanced functionality and efficiency for given data science tasks.

To load a package it must be previously installed in the user’s environment.

Packages must be installed through the install.packages("packagename") function first.

Once a package is installed there are multiple ways to load it in R, a package is loaded by calling the library() function and passing the name of the package:

R.1.0 - a specific package is loaded
R.1.1 - a specific function from a package is called without loading the entire package, this can also be used when two packages have the same function name to specify the correct one.

#R.1.0 - Load specific package
library(ggplot2)

#R.1.1 Load specific function (sub_str) from a package (stringr) without loading the whole package
stringr::str_sub("programming",1,7)

## [1] "program"

Packages must be installed through Anaconda or via PIP in the terminal first.

Once a package is installed there are multiple ways to load it in Python, a package is loaded by calling the import statement followed by the name of the package:

Py.1.0 - a specific package is loaded
Py.1.1 - multiple packages are loaded at once
Py.1.2 - a specific function from a package (matplotlib) is loaded

#Py.1.0 - Load specific package only
import pandas

#Py.1.1 - Load multiple packages at once
import numpy, sklearn, matplotlib, seaborn

#Py.1.2 Load specific function (pyplot) from a package (matplotlib)
from matplotlib import pyplot

Task 2: Data Importation

Importing data is a key skill needed in any data science project, bringing data into your user environment can be achieved through a variety of ways and will mainly be dependent on your data source.

Data sources could include Excel CSV files or an ODBC connection to a SQL database to name a few.

This example will focus on importing CSV files.

R.2.0 - Use the readr package and read_csv() function to import the CSV file fuel_data.csv into object fuel_data
R.2.1 - Use the head() function to return a snapshot of the data frame

#R.2.0 - Import CSV file from working directory
fuel_data <- read_csv("fuel_data.csv")

#R.2.1 - Return the leading rows in the data frame
head(fuel_data)

## # A tibble: 6 x 8
##   ServiceStationN… Address Suburb Postcode Brand FuelCode PriceUpdatedDate
##   <chr>            <chr>   <chr>     <dbl> <chr> <chr>    <chr>           
## 1 7-Eleven Baulkh… 217-21… BAULK…     2153 7-El… P95      1/7/18          
## 2 BP Port Kembla   144 We… Port …     2505 BP    U91      1/7/18          
## 3 BP Port Kembla   144 We… Port …     2505 BP    E10      1/7/18          
## 4 Caltex Belrose   153 Fo… Belro…     2085 Calt… E10      1/7/18          
## 5 Caltex Belrose   153 Fo… Belro…     2085 Calt… P95      1/7/18          
## 6 Caltex Belrose   153 Fo… Belro…     2085 Calt… P98      1/7/18          
## # … with 1 more variable: Price <dbl>

Py.2.0 - Use the pandas package and read_csv() function to import the CSV file fuel_data.csv into object fuel_data
Py.2.1 - Use the head() function to return a snapshot of the data frame, specifying the first 10 rows.

#Py.2.0 - Import CSV file from working directory
fuel_data = pandas.read_csv("fuel_data.csv")

#Py.2.1 - Return the 10 leading rows in the data frame
fuel_data.head(10)

##         ServiceStationName  ...  Price
## 0  7-Eleven Baulkham Hills  ...  155.9
## 1           BP Port Kembla  ...  132.9
## 2           BP Port Kembla  ...  130.9
## 3           Caltex Belrose  ...  143.9
## 4           Caltex Belrose  ...  157.9
## 5           Caltex Belrose  ...  164.9
## 6          Metro Punchbowl  ...  133.7
## 7          Metro Punchbowl  ...  150.7
## 8          Metro Punchbowl  ...  130.7
## 9   Metro Pertoleum Lurnea  ...  133.9
## 
## [10 rows x 8 columns]

Task 3: Data Types

Understanding data types and the structure of your data is a key skill required to be data scientist.

It is critical to ensure imported data has the correct data type assigned to each variable.

The examples below outline the process to check the data has imported as expected.

R.3.0 - Use the summary() function to return metadata
R.3.1 - Use the class() function to return the variable class of an individual variable
R.3.2 - Convert fuel_data$PriceUpdatedDate to Date

#R.3.0 - Return metadata of the imported data
summary(fuel_data)

##  ServiceStationName   Address             Suburb             Postcode   
##  Length:451793      Length:451793      Length:451793      Min.   :1579  
##  Class :character   Class :character   Class :character   1st Qu.:2151  
##  Mode  :character   Mode  :character   Mode  :character   Median :2232  
##                                                           Mean   :2340  
##                                                           3rd Qu.:2541  
##                                                           Max.   :4383  
##     Brand             FuelCode         PriceUpdatedDate       Price      
##  Length:451793      Length:451793      Length:451793      Min.   : 47.9  
##  Class :character   Class :character   Class :character   1st Qu.:138.9  
##  Mode  :character   Mode  :character   Mode  :character   Median :150.9  
##                                                           Mean   :149.9  
##                                                           3rd Qu.:160.9  
##                                                           Max.   :229.9

#R.3.1 - Return class of individual variable
class(fuel_data$Price)

## [1] "numeric"

#R.3.2 - Convert fuel_data$PriceUpdatedDate to Date
fuel_data$PriceUpdatedDate <- as.Date(fuel_data$PriceUpdatedDate,"%d/%m/%y")

Py.3.0 - Use the info() & describe() function to return metadata
Py.3.1 - Use the dtype method to return the variable class of an individual variable
Py.3.2 - Convert fuel_data.PriceUpdatedDate to Date

#Py.3.0 - Return metadata of the imported data
fuel_data.info()

## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 451793 entries, 0 to 451792
## Data columns (total 8 columns):
## ServiceStationName    451793 non-null object
## Address               451793 non-null object
## Suburb                451793 non-null object
## Postcode              451793 non-null int64
## Brand                 451793 non-null object
## FuelCode              451793 non-null object
## PriceUpdatedDate      451793 non-null object
## Price                 451793 non-null float64
## dtypes: float64(1), int64(1), object(6)
## memory usage: 27.6+ MB

fuel_data.describe()

##             Postcode          Price
## count  451793.000000  451793.000000
## mean     2339.647086     149.916113
## std       245.315055      17.019176
## min      1579.000000      47.900000
## 25%      2151.000000     138.900000
## 50%      2232.000000     150.900000
## 75%      2541.000000     160.900000
## max      4383.000000     229.900000

#Py.3.1 - Return class of individual variable
fuel_data.Price.dtype

## dtype('float64')

#Py.3.2 - Convert fuel_data.PriceUpdatedDate to Date
fuel_data['PriceUpdatedDate'] = pandas.to_datetime(fuel_data['PriceUpdatedDate'],format="%d/%m/%y")

Task 4: Filtering

Filtering a dataframe removes unwanted observations from your raw data set that are not needed for analysis.

In a data science project the raw data set will usually need to be filtered or cleaned to some extent before modelling can begin.

R.4.0 - Use the dplyr package to pipe the dataframe fuel_data into the filter() function where Suburb equals ‘Alexandria’ and FuelCode equals ‘P95’ or ‘P98’.
R.4.1 - Use the unique() function on the Suburb variable to check the results

#R.4.0 - Filter dataframe based on criteria
fuel_data_ALX_P95P98 <- fuel_data %>%
                        filter(Suburb == "Alexandria") %>%
                        filter(FuelCode == "P95" | FuelCode == "P98")

#R.4.1 - Check only Suburb is Alexandria
unique(fuel_data_ALX_P95P98$Suburb)

## [1] "Alexandria"

Py.4.0 - Use numpy functions logical_and() & logical_or() to filter fuel_data where Suburb equals ‘Alexandria’ and FuelCode equals ‘P95’ or ‘P98’.
Py.4.1 - Use the numpy function unique() on the Suburb variable to check the results

#Py.4.0 - Filter dataframe based on criteria
fuel_data_ALX_P95P98 = fuel_data[numpy.logical_and(fuel_data['Suburb'] == "Alexandria",numpy.logical_or(fuel_data['FuelCode'] == "P95",fuel_data['FuelCode'] == "P98"))]

#Py.4.1 - Check only Suburb is Alexandria
numpy.unique(fuel_data_ALX_P95P98.Suburb)

## array(['Alexandria'], dtype=object)

Task 5: Grouping & Summarising

Grouping or aggregating data is a common task data scientists will perform to generate summarised information and bring out insights or trends.

This may include summation, averages, min or max across various dimensions of the data set.

R.5.0 - Use dplyr package to group the dataframe fuel_data_ALX_P95P98 created in Task 4 by PriceUpdatedDate and FuelCode using the group_by() function and summarise data by average using the summarise() function

#R.5.0 - Group data and summarise by average price
fuel_data_ALX_P95P98_avg <- fuel_data_ALX_P95P98 %>%
                            group_by(PriceUpdatedDate,FuelCode) %>%
                            summarise(price_average = mean(Price))

#Check Results
fuel_data_ALX_P95P98_avg[1:10,]

## # A tibble: 10 x 3
## # Groups:   PriceUpdatedDate [7]
##    PriceUpdatedDate FuelCode price_average
##    <date>           <chr>            <dbl>
##  1 2018-07-01       P95               162.
##  2 2018-07-01       P98               169.
##  3 2018-07-03       P98               154.
##  4 2018-07-05       P95               160.
##  5 2018-07-05       P98               167.
##  6 2018-07-06       P95               156.
##  7 2018-07-07       P95               156.
##  8 2018-07-07       P98               163 
##  9 2018-07-11       P98               152.
## 10 2018-07-12       P95               176.

Py.5.0 - Use pandas package to group the dataframe fuel_data_ALX_P95P98 created in Task 4 by PriceUpdatedDate and FuelCode using the groupby() function and summarise data by average using the mean() function

#Py.5.0 - Group data and summarise by average price
fuel_data_ALX_P95P98_avg  = fuel_data_ALX_P95P98.groupby(['PriceUpdatedDate','FuelCode'])['Price'].mean().reset_index() #use reset_index() to store groupby variables as columns rather than indexs

#Check Results
fuel_data_ALX_P95P98_avg.iloc[0:10,] #Check output

##   PriceUpdatedDate FuelCode  Price
## 0       2018-07-01      P95  161.9
## 1       2018-07-01      P98  168.9
## 2       2018-07-03      P98  154.4
## 3       2018-07-05      P95  159.9
## 4       2018-07-05      P98  166.9
## 5       2018-07-06      P95  156.5
## 6       2018-07-07      P95  155.5
## 7       2018-07-07      P98  163.0
## 8       2018-07-11      P98  151.9
## 9       2018-07-12      P95  175.9

Task 6: Structuring

Adding new variables to dataframes is common practice when conducting exploratory data analysis or undertaking feature or data engineering tasks, this may include calculations, bucketing or flagging.

This example will build a new variable in a dataframe to be populated in a later task.

R.6.0 - Add a new variable to an existing dataframe by using the logic dataframe$<newvariable>, leave the observations blank by assigning it <- ""

#R.6.0 - Add new variable price category, leave blank
fuel_data$price_cat <- ""

colnames(fuel_data) #Check variable has been added

## [1] "ServiceStationName" "Address"            "Suburb"            
## [4] "Postcode"           "Brand"              "FuelCode"          
## [7] "PriceUpdatedDate"   "Price"              "price_cat"

Py.6.0 - Add a new variable to an existing dataframe by using the logic dataframe[<newvariable>], leave the observations blank by assigning it = ""

#Py.6.0 - Add new variable price category, leave blank
fuel_data['price_cat'] = ""

list(fuel_data.columns)  #Check variable has been added

## ['ServiceStationName', 'Address', 'Suburb', 'Postcode', 'Brand', 'FuelCode', 'PriceUpdatedDate', 'Price', 'price_cat']

Task 7: Conditional Statements

Programming conditional statements (IF,ELSE) with logical operators (e.g. AND,OR,NOT,EQUAL) is the foundation of building control structures into your code

Data scientists will often perform the task of programming conditional statements to specify what should happen next if a condition was met, or how the code should respond or behave regarding a given outcome.

R.7.0 - Use an IF statement to categorise the price of the first observation in the fuel_data dataframe. If price greater than or equal to 1.40 then ‘expensive’.
R.7.1 - Use an IF and AND operator to categorise the price of the second observation in the fuel_data dataframe. If price between 1.30 and 1.40 then ‘moderate’.

#R.7.0 - Use IF statement to categorise price first observation
if(fuel_data$Price[1] >= 140)
  
    {print("expensive")}

## [1] "expensive"

#R.7.1 - Use IF and AND statement to categorise price second observation
if(fuel_data$Price[2] >= 130 & fuel_data$Price[2] < 140)
    
  {print("moderate")}

## [1] "moderate"

Py.7.0 - Use an IF statement to categorise the price of the first observation in the fuel_data dataframe. If price greater than or equal to 1.40 then ‘expensive’.
Py.7.1 - Use an IF statement and numpy function logical_and to categorise the price of the second observation in the fuel_data dataframe. If price between 1.30 and 1.40 then ‘moderate’.

#Py.7.0 - Use IF statement to categorise price first observation
if fuel_data.Price[0] >= 140:
  print("expensive")

## expensive

#Py.7.1 - Use IF and AND statement to categorise price second observation
if numpy.logical_and(fuel_data.Price[1] >= 130,fuel_data.Price[1] < 140):
  print("moderate")

## moderate

Task 8: Loops

Programming loops in conjunction with the logical operators shown in Task 7 allow a data scientist to add control to the execution of the code.
The most common loops are the FOR and WHILE loops, these can be used to iterate through data and respond to inputs.
This example will focus on the FOR loop.

R.8.0 - Use a FOR loop to iterate through the Price variable in fuel_data dataframe and populate the price_cat variable added in Task 6. The variable will be populated based on the categories in Task 7 with an additional ‘cheap’ category for prices below 1.30.

#R.8.0 - Use a for loop to populate the price category variable.

y <- 1 #Set counter to 1

for(x in fuel_data$Price) #For loop
    
    {
      if(x >= 140)
          
       {fuel_data$price_cat[y] = "expensive"}
          
     else if(x >= 130 & x < 140)
          
       {fuel_data$price_cat[y] = "moderate"}
        
     else
                  
       {fuel_data$price_cat[y] = "cheap"}

      y <- y + 1 #Increment counter
   
    }


fuel_data[1:10,8:9] #Check output

## # A tibble: 10 x 2
##    Price price_cat
##    <dbl> <chr>    
##  1  156. expensive
##  2  133. moderate 
##  3  131. moderate 
##  4  144. expensive
##  5  158. expensive
##  6  165. expensive
##  7  134. moderate 
##  8  151. expensive
##  9  131. moderate 
## 10  134. moderate

Py.8.0 - Use a FOR loop to iterate through the Price variable in the fuel_data dataframe and populate the price_cat variable added in Task 6. The category will be populated based on the conditions in Task 7 with an additional ‘cheap’ category for prices below 1.30.

#Py.8.0 - Use a for loop to populate the price category variable.

y = 0 #Set counter to 0

for x in fuel_data.Price: #For loop
  if x >= 140:
    fuel_data.price_cat[y] = "expensive"
  elif x >= 130 and x < 140:
    fuel_data.price_cat[y] = "moderate"
  else:
    fuel_data.price_cat[y] = "cheap"
  y = y + 1 #Increment counter


fuel_data.iloc[:10,7:] #Check output

##    Price  price_cat
## 0  155.9  expensive
## 1  132.9   moderate
## 2  130.9   moderate
## 3  143.9  expensive
## 4  157.9  expensive
## 5  164.9  expensive
## 6  133.7   moderate
## 7  150.7  expensive
## 8  130.7   moderate
## 9  133.9   moderate

Task 9: Functions

Building functions allow a data scientist to package code that can be called upon to perform a specific task or return a result at a later point

Functions play a critical role in creating good data engineering and flow in your code.

R.9.0 - Build function get_suburbs(fuel_code,price_max) to return a list of suburbs selling the FuelCode at or below the specified Price, the input (arguments) to the function will be the FuelCode and the max Price.
R.9.0 - Call function to return suburbs with P98 fuel at or below 1.25

#R.9.0 - Build function to return suburb based on input arguments
get_suburbs <- function(fuel_code,price_max) 
                    {
            
                    fuel_data_func <- fuel_data %>%
                                filter(FuelCode == fuel_code & Price <= price_max)
            
                    suburbs <- unique(fuel_data_func$Suburb)
                    
                    return(suburbs)
            
                    }


#R.9.1 - Call function to return Suburbs
get_suburbs("P98",125)

##  [1] "Camden"           "TEMPE"            "Kirrawee"        
##  [4] "Chester Hill"     "MINTO"            "Fairfield"       
##  [7] "Warilla"          "TOUKLEY"          "SYLVANIA HEIGHTS"
## [10] "YAGOONA"          "WEST GOSFORD"     "LITHGOW"         
## [13] "Bonnyrigg"        "Punchbowl"        "Lansdowne"       
## [16] "BEXLEY NORTH"

Py.9.0 - Build function get_suburbs(fuel_code,price_max) to return a list of suburbs selling the FuelCode at or below the specified Price, the input (arguments) to the function will be the FuelCode and the max Price.
Py.9.0 - Call function to return suburbs with P98 fuel at or below 1.25

#Py.9.0 - Build function to return suburb based on input arguments
def get_suburbs(fuel_code,price_max):
  fuel_data_func = fuel_data[(fuel_data['FuelCode'] == fuel_code) & (fuel_data['Price'] <= price_max)]
  suburbs = numpy.unique(fuel_data_func.Suburb)
  return suburbs

#Py.9.1 - Call function to return Suburbs
get_suburbs("P98",125)

## array(['BEXLEY NORTH', 'Bonnyrigg', 'Camden', 'Chester Hill', 'Fairfield',
##        'Kirrawee', 'LITHGOW', 'Lansdowne', 'MINTO', 'Punchbowl',
##        'SYLVANIA HEIGHTS', 'TEMPE', 'TOUKLEY', 'WEST GOSFORD', 'Warilla',
##        'YAGOONA'], dtype=object)

Task 10: Visualisation

Investigating trends and generating insights from data will require use of visualisations.

Plotting graphs can quickly and easily convey insights to stakeholders.

The following example will use a line graph to show how average fuel price is trending over time.

R.10.0 - Use the ggpolt2 package to plot the average price of P95 & P98 fuel in Alexandria from the dataframe created in Task 5.

#R.10.0 - Plot average price of P95 & P98 fuel in Alexandria
ggplot(fuel_data_ALX_P95P98_avg, aes(x = PriceUpdatedDate, y = price_average, col = FuelCode)) +
        geom_line() +
        scale_x_date(date_labels = "%m/%Y",breaks = as.Date(c("2018-07-01","2018-08-01","2018-09-01","2018-09-01","2018-10-01","2018-11-01","2018-12-01","2019-01-01")))

Py.10.0 - Use the seaborn package to plot the average price of P95 & P98 fuel in Alexandria from the dataframe created in Task 5

#Py.10.0 - Plot average price of P95 & P98 fuel in Alexandria
seaborn.lineplot(x = 'PriceUpdatedDate',y = 'Price',hue = 'FuelCode',data = fuel_data_ALX_P95P98_avg).set(ylabel='price_average')