Loading a package in R and Python is a critical task that all data scientists will likely perform during a data science project, usually at the beginning of their workflow.
As R and Python are open source languages there is a large number of community developed packages available. These packages will generally have a specific purpose and provide enhanced functionality and efficiency for given data science tasks.
To load a package it must be previously installed in the user’s environment.
Packages must be installed through the install.packages("packagename") function first.
Once a package is installed there are multiple ways to load it in R, a package is loaded by calling the library() function and passing the name of the package:
R.1.0 - a specific package is loaded
R.1.1 - a specific function from a package is called without loading the entire package, this can also be used when two packages have the same function name to specify the correct one.
#R.1.0 - Load specific package
library(ggplot2)
#R.1.1 Load specific function (sub_str) from a package (stringr) without loading the whole package
stringr::str_sub("programming",1,7)
## [1] "program"
Packages must be installed through Anaconda or via PIP in the terminal first.
Once a package is installed there are multiple ways to load it in Python, a package is loaded by calling the import statement followed by the name of the package:
Py.1.0 - a specific package is loaded
Py.1.1 - multiple packages are loaded at once
Py.1.2 - a specific function from a package (matplotlib) is loaded
#Py.1.0 - Load specific package only
import pandas
#Py.1.1 - Load multiple packages at once
import numpy, sklearn, matplotlib, seaborn
#Py.1.2 Load specific function (pyplot) from a package (matplotlib)
from matplotlib import pyplot
Importing data is a key skill needed in any data science project, bringing data into your user environment can be achieved through a variety of ways and will mainly be dependent on your data source.
Data sources could include Excel CSV files or an ODBC connection to a SQL database to name a few.
This example will focus on importing CSV files.
R.2.0 - Use the readr package and read_csv() function to import the CSV file fuel_data.csv into object fuel_data
R.2.1 - Use the head() function to return a snapshot of the data frame
#R.2.0 - Import CSV file from working directory
fuel_data <- read_csv("fuel_data.csv")
#R.2.1 - Return the leading rows in the data frame
head(fuel_data)
## # A tibble: 6 x 8
## ServiceStationN… Address Suburb Postcode Brand FuelCode PriceUpdatedDate
## <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 7-Eleven Baulkh… 217-21… BAULK… 2153 7-El… P95 1/7/18
## 2 BP Port Kembla 144 We… Port … 2505 BP U91 1/7/18
## 3 BP Port Kembla 144 We… Port … 2505 BP E10 1/7/18
## 4 Caltex Belrose 153 Fo… Belro… 2085 Calt… E10 1/7/18
## 5 Caltex Belrose 153 Fo… Belro… 2085 Calt… P95 1/7/18
## 6 Caltex Belrose 153 Fo… Belro… 2085 Calt… P98 1/7/18
## # … with 1 more variable: Price <dbl>
Py.2.0 - Use the pandas package and read_csv() function to import the CSV file fuel_data.csv into object fuel_data
Py.2.1 - Use the head() function to return a snapshot of the data frame, specifying the first 10 rows.
#Py.2.0 - Import CSV file from working directory
fuel_data = pandas.read_csv("fuel_data.csv")
#Py.2.1 - Return the 10 leading rows in the data frame
fuel_data.head(10)
## ServiceStationName ... Price
## 0 7-Eleven Baulkham Hills ... 155.9
## 1 BP Port Kembla ... 132.9
## 2 BP Port Kembla ... 130.9
## 3 Caltex Belrose ... 143.9
## 4 Caltex Belrose ... 157.9
## 5 Caltex Belrose ... 164.9
## 6 Metro Punchbowl ... 133.7
## 7 Metro Punchbowl ... 150.7
## 8 Metro Punchbowl ... 130.7
## 9 Metro Pertoleum Lurnea ... 133.9
##
## [10 rows x 8 columns]
Understanding data types and the structure of your data is a key skill required to be data scientist.
It is critical to ensure imported data has the correct data type assigned to each variable.
The examples below outline the process to check the data has imported as expected.
R.3.0 - Use the summary() function to return metadata
R.3.1 - Use the class() function to return the variable class of an individual variable
R.3.2 - Convert fuel_data$PriceUpdatedDate to Date
#R.3.0 - Return metadata of the imported data
summary(fuel_data)
## ServiceStationName Address Suburb Postcode
## Length:451793 Length:451793 Length:451793 Min. :1579
## Class :character Class :character Class :character 1st Qu.:2151
## Mode :character Mode :character Mode :character Median :2232
## Mean :2340
## 3rd Qu.:2541
## Max. :4383
## Brand FuelCode PriceUpdatedDate Price
## Length:451793 Length:451793 Length:451793 Min. : 47.9
## Class :character Class :character Class :character 1st Qu.:138.9
## Mode :character Mode :character Mode :character Median :150.9
## Mean :149.9
## 3rd Qu.:160.9
## Max. :229.9
#R.3.1 - Return class of individual variable
class(fuel_data$Price)
## [1] "numeric"
#R.3.2 - Convert fuel_data$PriceUpdatedDate to Date
fuel_data$PriceUpdatedDate <- as.Date(fuel_data$PriceUpdatedDate,"%d/%m/%y")
Py.3.0 - Use the info() & describe() function to return metadata
Py.3.1 - Use the dtype method to return the variable class of an individual variable
Py.3.2 - Convert fuel_data.PriceUpdatedDate to Date
#Py.3.0 - Return metadata of the imported data
fuel_data.info()
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 451793 entries, 0 to 451792
## Data columns (total 8 columns):
## ServiceStationName 451793 non-null object
## Address 451793 non-null object
## Suburb 451793 non-null object
## Postcode 451793 non-null int64
## Brand 451793 non-null object
## FuelCode 451793 non-null object
## PriceUpdatedDate 451793 non-null object
## Price 451793 non-null float64
## dtypes: float64(1), int64(1), object(6)
## memory usage: 27.6+ MB
fuel_data.describe()
## Postcode Price
## count 451793.000000 451793.000000
## mean 2339.647086 149.916113
## std 245.315055 17.019176
## min 1579.000000 47.900000
## 25% 2151.000000 138.900000
## 50% 2232.000000 150.900000
## 75% 2541.000000 160.900000
## max 4383.000000 229.900000
#Py.3.1 - Return class of individual variable
fuel_data.Price.dtype
## dtype('float64')
#Py.3.2 - Convert fuel_data.PriceUpdatedDate to Date
fuel_data['PriceUpdatedDate'] = pandas.to_datetime(fuel_data['PriceUpdatedDate'],format="%d/%m/%y")
Filtering a dataframe removes unwanted observations from your raw data set that are not needed for analysis.
In a data science project the raw data set will usually need to be filtered or cleaned to some extent before modelling can begin.
R.4.0 - Use the dplyr package to pipe the dataframe fuel_data into the filter() function where Suburb equals ‘Alexandria’ and FuelCode equals ‘P95’ or ‘P98’.
R.4.1 - Use the unique() function on the Suburb variable to check the results
#R.4.0 - Filter dataframe based on criteria
fuel_data_ALX_P95P98 <- fuel_data %>%
filter(Suburb == "Alexandria") %>%
filter(FuelCode == "P95" | FuelCode == "P98")
#R.4.1 - Check only Suburb is Alexandria
unique(fuel_data_ALX_P95P98$Suburb)
## [1] "Alexandria"
Py.4.0 - Use numpy functions logical_and() & logical_or() to filter fuel_data where Suburb equals ‘Alexandria’ and FuelCode equals ‘P95’ or ‘P98’.
Py.4.1 - Use the numpy function unique() on the Suburb variable to check the results
#Py.4.0 - Filter dataframe based on criteria
fuel_data_ALX_P95P98 = fuel_data[numpy.logical_and(fuel_data['Suburb'] == "Alexandria",numpy.logical_or(fuel_data['FuelCode'] == "P95",fuel_data['FuelCode'] == "P98"))]
#Py.4.1 - Check only Suburb is Alexandria
numpy.unique(fuel_data_ALX_P95P98.Suburb)
## array(['Alexandria'], dtype=object)
Grouping or aggregating data is a common task data scientists will perform to generate summarised information and bring out insights or trends.
This may include summation, averages, min or max across various dimensions of the data set.
R.5.0 - Use dplyr package to group the dataframe fuel_data_ALX_P95P98 created in Task 4 by PriceUpdatedDate and FuelCode using the group_by() function and summarise data by average using the summarise() function
#R.5.0 - Group data and summarise by average price
fuel_data_ALX_P95P98_avg <- fuel_data_ALX_P95P98 %>%
group_by(PriceUpdatedDate,FuelCode) %>%
summarise(price_average = mean(Price))
#Check Results
fuel_data_ALX_P95P98_avg[1:10,]
## # A tibble: 10 x 3
## # Groups: PriceUpdatedDate [7]
## PriceUpdatedDate FuelCode price_average
## <date> <chr> <dbl>
## 1 2018-07-01 P95 162.
## 2 2018-07-01 P98 169.
## 3 2018-07-03 P98 154.
## 4 2018-07-05 P95 160.
## 5 2018-07-05 P98 167.
## 6 2018-07-06 P95 156.
## 7 2018-07-07 P95 156.
## 8 2018-07-07 P98 163
## 9 2018-07-11 P98 152.
## 10 2018-07-12 P95 176.
Py.5.0 - Use pandas package to group the dataframe fuel_data_ALX_P95P98 created in Task 4 by PriceUpdatedDate and FuelCode using the groupby() function and summarise data by average using the mean() function
#Py.5.0 - Group data and summarise by average price
fuel_data_ALX_P95P98_avg = fuel_data_ALX_P95P98.groupby(['PriceUpdatedDate','FuelCode'])['Price'].mean().reset_index() #use reset_index() to store groupby variables as columns rather than indexs
#Check Results
fuel_data_ALX_P95P98_avg.iloc[0:10,] #Check output
## PriceUpdatedDate FuelCode Price
## 0 2018-07-01 P95 161.9
## 1 2018-07-01 P98 168.9
## 2 2018-07-03 P98 154.4
## 3 2018-07-05 P95 159.9
## 4 2018-07-05 P98 166.9
## 5 2018-07-06 P95 156.5
## 6 2018-07-07 P95 155.5
## 7 2018-07-07 P98 163.0
## 8 2018-07-11 P98 151.9
## 9 2018-07-12 P95 175.9
Adding new variables to dataframes is common practice when conducting exploratory data analysis or undertaking feature or data engineering tasks, this may include calculations, bucketing or flagging.
This example will build a new variable in a dataframe to be populated in a later task.
R.6.0 - Add a new variable to an existing dataframe by using the logic dataframe$<newvariable>, leave the observations blank by assigning it <- ""
#R.6.0 - Add new variable price category, leave blank
fuel_data$price_cat <- ""
colnames(fuel_data) #Check variable has been added
## [1] "ServiceStationName" "Address" "Suburb"
## [4] "Postcode" "Brand" "FuelCode"
## [7] "PriceUpdatedDate" "Price" "price_cat"
Py.6.0 - Add a new variable to an existing dataframe by using the logic dataframe[<newvariable>], leave the observations blank by assigning it = ""
#Py.6.0 - Add new variable price category, leave blank
fuel_data['price_cat'] = ""
list(fuel_data.columns) #Check variable has been added
## ['ServiceStationName', 'Address', 'Suburb', 'Postcode', 'Brand', 'FuelCode', 'PriceUpdatedDate', 'Price', 'price_cat']
Programming conditional statements (IF,ELSE) with logical operators (e.g. AND,OR,NOT,EQUAL) is the foundation of building control structures into your code
Data scientists will often perform the task of programming conditional statements to specify what should happen next if a condition was met, or how the code should respond or behave regarding a given outcome.
R.7.0 - Use an IF statement to categorise the price of the first observation in the fuel_data dataframe. If price greater than or equal to 1.40 then ‘expensive’.
R.7.1 - Use an IF and AND operator to categorise the price of the second observation in the fuel_data dataframe. If price between 1.30 and 1.40 then ‘moderate’.
#R.7.0 - Use IF statement to categorise price first observation
if(fuel_data$Price[1] >= 140)
{print("expensive")}
## [1] "expensive"
#R.7.1 - Use IF and AND statement to categorise price second observation
if(fuel_data$Price[2] >= 130 & fuel_data$Price[2] < 140)
{print("moderate")}
## [1] "moderate"
Py.7.0 - Use an IF statement to categorise the price of the first observation in the fuel_data dataframe. If price greater than or equal to 1.40 then ‘expensive’.
Py.7.1 - Use an IF statement and numpy function logical_and to categorise the price of the second observation in the fuel_data dataframe. If price between 1.30 and 1.40 then ‘moderate’.
#Py.7.0 - Use IF statement to categorise price first observation
if fuel_data.Price[0] >= 140:
print("expensive")
## expensive
#Py.7.1 - Use IF and AND statement to categorise price second observation
if numpy.logical_and(fuel_data.Price[1] >= 130,fuel_data.Price[1] < 140):
print("moderate")
## moderate
Programming loops in conjunction with the logical operators shown in Task 7 allow a data scientist to add control to the execution of the code.
The most common loops are the FOR and WHILE loops, these can be used to iterate through data and respond to inputs.
This example will focus on the FOR loop.
R.8.0 - Use a FOR loop to iterate through the Price variable in fuel_data dataframe and populate the price_cat variable added in Task 6. The variable will be populated based on the categories in Task 7 with an additional ‘cheap’ category for prices below 1.30.
#R.8.0 - Use a for loop to populate the price category variable.
y <- 1 #Set counter to 1
for(x in fuel_data$Price) #For loop
{
if(x >= 140)
{fuel_data$price_cat[y] = "expensive"}
else if(x >= 130 & x < 140)
{fuel_data$price_cat[y] = "moderate"}
else
{fuel_data$price_cat[y] = "cheap"}
y <- y + 1 #Increment counter
}
fuel_data[1:10,8:9] #Check output
## # A tibble: 10 x 2
## Price price_cat
## <dbl> <chr>
## 1 156. expensive
## 2 133. moderate
## 3 131. moderate
## 4 144. expensive
## 5 158. expensive
## 6 165. expensive
## 7 134. moderate
## 8 151. expensive
## 9 131. moderate
## 10 134. moderate
Py.8.0 - Use a FOR loop to iterate through the Price variable in the fuel_data dataframe and populate the price_cat variable added in Task 6. The category will be populated based on the conditions in Task 7 with an additional ‘cheap’ category for prices below 1.30.
#Py.8.0 - Use a for loop to populate the price category variable.
y = 0 #Set counter to 0
for x in fuel_data.Price: #For loop
if x >= 140:
fuel_data.price_cat[y] = "expensive"
elif x >= 130 and x < 140:
fuel_data.price_cat[y] = "moderate"
else:
fuel_data.price_cat[y] = "cheap"
y = y + 1 #Increment counter
fuel_data.iloc[:10,7:] #Check output
## Price price_cat
## 0 155.9 expensive
## 1 132.9 moderate
## 2 130.9 moderate
## 3 143.9 expensive
## 4 157.9 expensive
## 5 164.9 expensive
## 6 133.7 moderate
## 7 150.7 expensive
## 8 130.7 moderate
## 9 133.9 moderate
Building functions allow a data scientist to package code that can be called upon to perform a specific task or return a result at a later point
Functions play a critical role in creating good data engineering and flow in your code.
R.9.0 - Build function get_suburbs(fuel_code,price_max) to return a list of suburbs selling the FuelCode at or below the specified Price, the input (arguments) to the function will be the FuelCode and the max Price.
R.9.0 - Call function to return suburbs with P98 fuel at or below 1.25
#R.9.0 - Build function to return suburb based on input arguments
get_suburbs <- function(fuel_code,price_max)
{
fuel_data_func <- fuel_data %>%
filter(FuelCode == fuel_code & Price <= price_max)
suburbs <- unique(fuel_data_func$Suburb)
return(suburbs)
}
#R.9.1 - Call function to return Suburbs
get_suburbs("P98",125)
## [1] "Camden" "TEMPE" "Kirrawee"
## [4] "Chester Hill" "MINTO" "Fairfield"
## [7] "Warilla" "TOUKLEY" "SYLVANIA HEIGHTS"
## [10] "YAGOONA" "WEST GOSFORD" "LITHGOW"
## [13] "Bonnyrigg" "Punchbowl" "Lansdowne"
## [16] "BEXLEY NORTH"
Py.9.0 - Build function get_suburbs(fuel_code,price_max) to return a list of suburbs selling the FuelCode at or below the specified Price, the input (arguments) to the function will be the FuelCode and the max Price.
Py.9.0 - Call function to return suburbs with P98 fuel at or below 1.25
#Py.9.0 - Build function to return suburb based on input arguments
def get_suburbs(fuel_code,price_max):
fuel_data_func = fuel_data[(fuel_data['FuelCode'] == fuel_code) & (fuel_data['Price'] <= price_max)]
suburbs = numpy.unique(fuel_data_func.Suburb)
return suburbs
#Py.9.1 - Call function to return Suburbs
get_suburbs("P98",125)
## array(['BEXLEY NORTH', 'Bonnyrigg', 'Camden', 'Chester Hill', 'Fairfield',
## 'Kirrawee', 'LITHGOW', 'Lansdowne', 'MINTO', 'Punchbowl',
## 'SYLVANIA HEIGHTS', 'TEMPE', 'TOUKLEY', 'WEST GOSFORD', 'Warilla',
## 'YAGOONA'], dtype=object)
Investigating trends and generating insights from data will require use of visualisations.
Plotting graphs can quickly and easily convey insights to stakeholders.
The following example will use a line graph to show how average fuel price is trending over time.
R.10.0 - Use the ggpolt2 package to plot the average price of P95 & P98 fuel in Alexandria from the dataframe created in Task 5.
#R.10.0 - Plot average price of P95 & P98 fuel in Alexandria
ggplot(fuel_data_ALX_P95P98_avg, aes(x = PriceUpdatedDate, y = price_average, col = FuelCode)) +
geom_line() +
scale_x_date(date_labels = "%m/%Y",breaks = as.Date(c("2018-07-01","2018-08-01","2018-09-01","2018-09-01","2018-10-01","2018-11-01","2018-12-01","2019-01-01")))
Py.10.0 - Use the seaborn package to plot the average price of P95 & P98 fuel in Alexandria from the dataframe created in Task 5
#Py.10.0 - Plot average price of P95 & P98 fuel in Alexandria
seaborn.lineplot(x = 'PriceUpdatedDate',y = 'Price',hue = 'FuelCode',data = fuel_data_ALX_P95P98_avg).set(ylabel='price_average')