Congratulations! You have just been hired by a US Venture Capital firm as a data analyst.
The company is considering foreign grain markets to help meet its supply chain requirements for its recent investments in the microbrewery and microdistillery industry, which is involved with the production and distribution of craft beers and spirits.
Your first task is to provide a high level analysis of crop production in Canada. Your stakeholders want to understand the current and historical performance of certain crop types in terms of supply and price volatility. For now they are mainly interested in a macro-view of Canada’s crop farming industry, and how it relates to the relative value of the Canadian and US dollars.
Using this R notebook you will:
To complete the assignment problems in this notebook you will be using subsetted snapshots of two datasets from Statistics Canada, and one from the Bank of Canada. The links to the prepared datasets are provided in the next section; the interested student can explore the landing pages for the source datasets as follows:
This dataset contains agricultural production measures for the principle crops grown in Canada, including a breakdown by province and teritory, for each year from 1908 to 2020.
For this assignment you will use a preprocessed snapshot of this dataset (see below).
A detailed description of this dataset can be obtained from the
StatsCan Data Portal at: https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=3210035901
Detailed information is included in the metadata file and as header text
in the data file, which can be downloaded - look for the ‘download
options’ link.
This dataset contains monthly average farm product prices for Canadian crops and livestock by province and teritory, from 1980 to 2020 (or ‘last year’, whichever is greatest).
For this assignment you will use a preprocessed snapshot of this dataset (see below).
A description of this dataset can be obtained from the StatsCan Data Portal at: https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=3210007701 The information is included in the metadata file, which can be downloaded - look for the ‘download options’ link.
This dataset contains the daily average exchange rates for multiple foreign currencies. Exchange rates are expressed as 1 unit of the foreign currency converted into Canadian dollars. It includes only the latest four years of data, and the rates are published once each business day by 16:30 ET.
For this assignment you will use a snapshot of this dataset with only the USD-CAD exchange rates included (see next section). We have also prepared a monthly averaged version which you will be using below.
A brief description of this dataset and the original dataset can be obtained from the Bank of Canada Data Portal at: https://www.bankofcanada.ca/rates/exchange/daily-exchange-rates/
Annual Crop Data: https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-RP0203EN-SkillsNetwork/labs/Final%20Project/Annual_Crop_Data.csv
Farm product prices: https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-RP0203EN-SkillsNetwork/labs/Final%20Project/Monthly_Farm_Prices.csv
Daily FX Data: https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-RP0203EN-SkillsNetwork/labs/Final%20Project/Daily_FX.csv
Monthly FX Data: https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-RP0203EN-SkillsNetwork/labs/Final%20Project/Monthly_FX.csv
Load the RODBC package:
library(RODBC)
Establish a connection to the Db2 database, and create the following four tables using the RODBC package in R. Use the separate cells provided below to create each of your tables.
# Establish database connection
dsn_driver <- "{IBM DB2 ODBC Driver}"
dsn_database <- "bludb"
dsn_hostname <- "..."
dsn_port <- "..."
dsn_protocol <- "TCPIP"
dsn_uid <- "..."
dsn_pwd <- "..."
dsn_security <- "ssl"
conn_path <- paste("DRIVER=",dsn_driver,
";DATABASE=",dsn_database,
";HOSTNAME=",dsn_hostname,
";PORT=",dsn_port,
";PROTOCOL=",dsn_protocol,
";UID=",dsn_uid,
";PWD=",dsn_pwd,
";SECURITY=",dsn_security,
sep="")
conn <- odbcDriverConnect(conn_path)
conn
RODBC Connection 1
Details:
case=nochange
DRIVER={IBM DB2 ODBC DRIVER}
UID=...
PWD...
DATABASE=bludb
HOSTNAME=...
PORT=...
PROTOCOL=TCPIP
SECURITY=SSL
# Check Connection Details
sql.info <- sqlTypeInfo(conn)
conn.info <- odbcGetInfo(conn)
conn.info["DBMS_Name"]
conn.info["DBMS_Ver"]
conn.info["Driver_ODBC_Ver"]
DBMS_Name: ‘DB2/LINUXX8664’
DBMS_Ver: ‘11.05.0800’
Driver_ODBC_Ver: ‘03.51’
# CROP_DATA:
df1 <- sqlQuery(conn,
"CREATE TABLE CROP_DATA (
CD_ID INTEGER NOT NULL,
YEAR DATE NOT NULL,
CROP_TYPE VARCHAR(20) NOT NULL,
GEO VARCHAR(20) NOT NULL,
SEEDED_AREA INTEGER NOT NULL,
HARVESTED_AREA INTEGER NOT NULL,
PRODUCTION INTEGER NOT NULL,
AVG_YIELD INTEGER NOT NULL,
PRIMARY KEY (CD_ID)
)",
errors=FALSE
)
if (df1 == -1){
cat ("An error has occurred.\n")
msg <- odbcGetErrMsg(conn)
print (msg)
} else {
cat ("Table was created successfully.\n")
}
Table was created successfully.
# FARM_PRICES:
df2 <- sqlQuery(conn,
"CREATE TABLE FARM_PRICES (
CD_ID INTEGER NOT NULL,
DATE DATE NOT NULL,
CROP_TYPE VARCHAR(20) NOT NULL,
GEO VARCHAR(20) NOT NULL,
PRICE_PRERMT DECIMAL NOT NULL,
PRIMARY KEY (CD_ID)
)"
errors=FALSE
)
if (df2 == -1){
cat ("An error has occurred.\n")
msg <- odbcGetErrMsg(conn)
print (msg)
} else {
cat ("Table was created successfully.\n")
}
Table was created successfully.
# DAILY_FX:
df3 <- sqlQuery(conn,
"CREATE TABLE DAILY_FX (
DFX_ID INTEGER NOT NULL,
DATE DATE NOT NULL,
FXUSDCAD FLOAT(6) NOT NULL,
PRIMARY KEY (DFX_ID)
)"
Table was created successfully.
# MONTHLY_FX:
df4 <- sqlQuery(conn,
"CREATE TABLE MONTHLY_FX (
DFX_ID INTEGER NOT NULL,
DATE DATE NOT NULL,
FXUSDCAD FLOAT(6) NOT NULL,
PRIMARY KEY (DFX_ID)
)",
errors=FALSE
)
if (df4 == -1){
shcat ("An error has occurred.\n")
msg <- odbcGetErrMsg(conn)
print (msg)
} else {
cat ("Table was created successfully.\n")
}
Table was created successfully.
Read the datasets into R dataframes using the urls provided above. Then load your tables.
crop_df <- read.csv('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-RP0203EN-SkillsNetwork/labs/Final%20Project/Annual_Crop_Data.csv', colClasses=c(YEAR="character"))
farm_df <- read.csv('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-RP0203EN-SkillsNetwork/labs/Final%20Project/Monthly_Farm_Prices.csv', colClasses=c(DATE="character"))
monthly_df <- read.csv('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-RP0203EN-SkillsNetwork/labs/Final%20Project/Monthly_FX.csv', colClasses=c(DATE="character"))
daily_df <- read.csv('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-RP0203EN-SkillsNetwork/labs/Final%20Project/Daily_FX.csv', colClasses=c(DATE="character"))
head(crop_df)
head(farm_df)
head(monthly_df)
head(daily_df)
| CD_ID | YEAR | CROP_TYPE | GEO | SEEDED_AREA | HARVESTED_AREA | PRODUCTION | AVG_YIELD | |
|---|---|---|---|---|---|---|---|---|
| <int> | <chr> | <fct> | <fct> | <int> | <int> | <int> | <int> | |
| 1 | 0 | 1965-12-31 | Barley | Alberta | 1372000 | 1372000 | 2504000 | 1825 |
| 2 | 1 | 1965-12-31 | Barley | Canada | 2476800 | 2476800 | 4752900 | 1920 |
| 3 | 2 | 1965-12-31 | Barley | Saskatchewan | 708000 | 708000 | 1415000 | 2000 |
| 4 | 3 | 1965-12-31 | Canola | Alberta | 297400 | 297400 | 215500 | 725 |
| 5 | 4 | 1965-12-31 | Canola | Canada | 580700 | 580700 | 512600 | 885 |
| 6 | 5 | 1965-12-31 | Canola | Saskatchewan | 224600 | 224600 | 242700 | 1080 |
| CD_ID | DATE | CROP_TYPE | GEO | PRICE_PRERMT | |
|---|---|---|---|---|---|
| <int> | <chr> | <fct> | <fct> | <dbl> | |
| 1 | 0 | 1985-01-01 | Barley | Alberta | 127.39 |
| 2 | 1 | 1985-01-01 | Barley | Saskatchewan | 121.38 |
| 3 | 2 | 1985-01-01 | Canola | Alberta | 342.00 |
| 4 | 3 | 1985-01-01 | Canola | Saskatchewan | 339.82 |
| 5 | 4 | 1985-01-01 | Rye | Alberta | 100.77 |
| 6 | 5 | 1985-01-01 | Rye | Saskatchewan | 109.75 |
| DFX_ID | DATE | FXUSDCAD | |
|---|---|---|---|
| <int> | <chr> | <dbl> | |
| 1 | 0 | 2017-01-01 | 1.319276 |
| 2 | 1 | 2017-02-01 | 1.310726 |
| 3 | 2 | 2017-03-01 | 1.338643 |
| 4 | 3 | 2017-04-01 | 1.344021 |
| 5 | 4 | 2017-05-01 | 1.360705 |
| 6 | 5 | 2017-06-01 | 1.329805 |
| DFX_ID | DATE | FXUSDCAD | |
|---|---|---|---|
| <int> | <chr> | <dbl> | |
| 1 | 0 | 2017-01-03 | 1.3435 |
| 2 | 1 | 2017-01-04 | 1.3315 |
| 3 | 2 | 2017-01-05 | 1.3244 |
| 4 | 3 | 2017-01-06 | 1.3214 |
| 5 | 4 | 2017-01-09 | 1.3240 |
| 6 | 5 | 2017-01-10 | 1.3213 |
sqlSave(conn, crop_df, "CROP_DATA", append=TRUE, fast=FALSE, rownames=FALSE, colnames=FALSE, verbose=FALSE)
sqlSave(conn, farm_df, "FARM_PRICES", append=TRUE, fast=FALSE, rownames=FALSE, colnames=FALSE, verbose=FALSE)
sqlSave(conn, monthly_df, "MONTHLY_FX", append=TRUE, fast=FALSE, rownames=FALSE, colnames=FALSE, verbose=FALSE)
sqlSave(conn, daily_df, "DAILY_FX", append=TRUE, fast=FALSE, rownames=FALSE, colnames=FALSE, verbose=FALSE)
query = "SELECT COUNT(CD_ID) AS RECORDS_IN_DATASET FROM FARM_PRICES"
sqlQuery(conn,query)
| RECORDS_IN_DATASET | |
|---|---|
| <int> | |
| 1 | 2678 |
query = "SELECT GEO AS GEOGRAPHIES_IN_DATASET FROM FARM_PRICES
GROUP BY GEO"
sqlQuery(conn,query)
| GEOGRAPHIES_IN_DATASET | |
|---|---|
| <fct> | |
| 1 | Alberta |
| 2 | Saskatchewan |
query = "SELECT SUM(HARVESTED_AREA) AS CANADA_HARVESTED_RYE_1968 FROM CROP_DATA
WHERE GEO = 'Canada' AND CROP_TYPE = 'Rye' AND YEAR(YEAR) = 1968"
sqlQuery(conn,query)
| CANADA_HARVESTED_RYE_1968 | |
|---|---|
| <int> | |
| 1 | 274100 |
query = "SELECT * FROM FARM_PRICES
WHERE CROP_TYPE ='Rye' LIMIT 6"
sqlQuery(conn,query)
| CD_ID | DATE | CROP_TYPE | GEO | PRICE_PRERMT | |
|---|---|---|---|---|---|
| <int> | <date> | <fct> | <fct> | <int> | |
| 1 | 4 | 1985-01-01 | Rye | Alberta | 100 |
| 2 | 5 | 1985-01-01 | Rye | Saskatchewan | 109 |
| 3 | 10 | 1985-02-01 | Rye | Alberta | 95 |
| 4 | 11 | 1985-02-01 | Rye | Saskatchewan | 103 |
| 5 | 16 | 1985-03-01 | Rye | Alberta | 96 |
| 6 | 17 | 1985-03-01 | Rye | Saskatchewan | 106 |
query = "SELECT GEO AS BARLEY_GROWING_PROVINCES FROM CROP_DATA
WHERE CROP_TYPE='Barley' AND NOT GEO = 'Canada' GROUP BY GEO"
sqlQuery(conn,query)
| BARLEY_GROWING_PROVINCES | |
|---|---|
| <fct> | |
| 1 | Alberta |
| 2 | Saskatchewan |
query = "SELECT min(DATE) AS FIRST_DATE,
max(DATE) AS LAST_DATE
FROM FARM_PRICES"
sqlQuery(conn,query)
| FIRST_DATE | LAST_DATE | |
|---|---|---|
| <date> | <date> | |
| 1 | 1985-01-01 | 2020-12-01 |
query = "SELECT DISTINCT (CROP_TYPE),
PRICE_PRERMT, YEAR(DATE) AS YEAR FROM FARM_PRICES
WHERE PRICE_PRERMT >= 350 ORDER BY PRICE_PRERMT LIMIT 10"
sqlQuery(conn,query)
| CROP_TYPE | PRICE_PRERMT | YEAR | |
|---|---|---|---|
| <fct> | <int> | <int> | |
| 1 | Canola | 350 | 1985 |
| 2 | Canola | 350 | 1995 |
| 3 | Canola | 350 | 2003 |
| 4 | Canola | 350 | 2007 |
| 5 | Canola | 351 | 1999 |
| 6 | Canola | 351 | 2007 |
| 7 | Canola | 352 | 1997 |
| 8 | Canola | 352 | 1998 |
| 9 | Canola | 352 | 2003 |
| 10 | Canola | 352 | 2007 |
query = "SELECT CROP_TYPE AS CROP_RANK, MAX(AVG_YIELD)AS AVG_YIELD_YEAR_2000
FROM CROP_DATA
WHERE YEAR(YEAR)=2000 AND GEO='Saskatchewan'
GROUP BY CROP_TYPE
ORDER BY AVG_YIELD_YEAR_2000 desc"
sqlQuery(conn,query)
| CROP_RANK | AVG_YIELD_YEAR_2000 | |
|---|---|---|
| <fct> | <int> | |
| 1 | Barley | 2800 |
| 2 | Wheat | 2200 |
| 3 | Rye | 2100 |
| 4 | Canola | 1400 |
query = "SELECT CROP_TYPE AS BEST_PERFORMING_CROP, AVG_YIELD AS YIELD
FROM CROP_DATA
WHERE YEAR(YEAR)=2000 AND GEO='Saskatchewan' LIMIT 1"
sqlQuery(conn,query)
| BEST_PERFORMING_CROP | YIELD | |
|---|---|---|
| <fct> | <int> | |
| 1 | Barley | 2800 |
query = "SELECT CROP_TYPE, GEO, AVG_YIELD
FROM CROP_DATA
WHERE YEAR(YEAR) >=2000
ORDER BY AVG_YIELD DESC LIMIT 10"
sqlQuery(conn,query)
| CROP_TYPE | GEO | AVG_YIELD | |
|---|---|---|---|
| <fct> | <fct> | <int> | |
| 1 | Barley | Alberta | 4100 |
| 2 | Barley | Alberta | 4100 |
| 3 | Barley | Alberta | 3980 |
| 4 | Wheat | Alberta | 3900 |
| 5 | Barley | Alberta | 3900 |
| 6 | Wheat | Alberta | 3900 |
| 7 | Barley | Canada | 3900 |
| 8 | Barley | Alberta | 3890 |
| 9 | Barley | Canada | 3820 |
| 10 | Barley | Canada | 3810 |
query = "SELECT CROP_TYPE, GEO, AVG_YIELD AS HIGHEST_AVG_YIELD
FROM CROP_DATA
WHERE YEAR(YEAR) >=2000
ORDER BY AVG_YIELD DESC LIMIT 1"
sqlQuery(conn,query)
| CROP_TYPE | GEO | HIGHEST_AVG_YIELD | |
|---|---|---|---|
| <fct> | <fct> | <int> | |
| 1 | Barley | Alberta | 4100 |
query = "SELECT MAX(YEAR) AS MOST_RECENT_YEAR,
SUM(HARVESTED_AREA) AS TOT_HARVESTED_WHEAT_CANADA
FROM CROP_DATA
WHERE YEAR(YEAR)=(SELECT MAX(YEAR(YEAR))FROM CROP_DATA
WHERE GEO = 'Canada' AND CROP_TYPE = 'Wheat')"
sqlQuery(conn,query)
| MOST_RECENT_YEAR | TOT_HARVESTED_WHEAT_CANADA | |
|---|---|---|
| <date> | <int> | |
| 1 | 2020-12-31 | 38897100 |
query = "SELECT A.DATE,A.CROP_TYPE,A.GEO,A.PRICE_PRERMT AS CANADIAN_DOLLARS,
(A.PRICE_PRERMT / B.FXUSDCAD) AS US_DOLLARS, B.FXUSDCAD
FROM FARM_PRICES A , MONTHLY_FX B
WHERE YEAR(A.DATE) = YEAR(B.DATE)
AND MONTH(A.DATE) = MONTH(B.DATE)
AND CROP_TYPE ='Canola'
AND GEO = 'Saskatchewan'
ORDER BY DATE DESC LIMIT 6"
sqlQuery(conn,query)
| DATE | CROP_TYPE | GEO | CANADIAN_DOLLARS | US_DOLLARS | FXUSDCAD | |
|---|---|---|---|---|---|---|
| <date> | <fct> | <fct> | <int> | <dbl> | <dbl> | |
| 1 | 2020-12-01 | Canola | Saskatchewan | 507 | 395.8553 | 1.280771 |
| 2 | 2020-11-01 | Canola | Saskatchewan | 495 | 378.7821 | 1.306820 |
| 3 | 2020-10-01 | Canola | Saskatchewan | 474 | 358.6912 | 1.321471 |
| 4 | 2020-09-01 | Canola | Saskatchewan | 463 | 350.0125 | 1.322810 |
| 5 | 2020-08-01 | Canola | Saskatchewan | 464 | 350.9290 | 1.322205 |
| 6 | 2020-07-01 | Canola | Saskatchewan | 462 | 342.2602 | 1.349850 |
close(conn)