SQL for Data Science with R Final Assignment

Assignment Scenario

Congratulations! You have just been hired by a US Venture Capital firm as a data analyst.

The company is considering foreign grain markets to help meet its supply chain requirements for its recent investments in the microbrewery and microdistillery industry, which is involved with the production and distribution of craft beers and spirits.

Your first task is to provide a high level analysis of crop production in Canada. Your stakeholders want to understand the current and historical performance of certain crop types in terms of supply and price volatility. For now they are mainly interested in a macro-view of Canada’s crop farming industry, and how it relates to the relative value of the Canadian and US dollars.

Introduction

Using this R notebook you will:

  1. Understand four datasets
  2. Load the datasets into four separate tables in a Db2 database
  3. Execute SQL queries unsing the RODBC R package to answer assignment questions

Understand the datasets

To complete the assignment problems in this notebook you will be using subsetted snapshots of two datasets from Statistics Canada, and one from the Bank of Canada. The links to the prepared datasets are provided in the next section; the interested student can explore the landing pages for the source datasets as follows:

  1. Canadian Principal Crops (Data & Metadata)
  2. Farm product prices (Data & Metadata)
  3. Bank of Canada daily average exchange rates

1. Canadian Principal Crops Data

This dataset contains agricultural production measures for the principle crops grown in Canada, including a breakdown by province and teritory, for each year from 1908 to 2020.

For this assignment you will use a preprocessed snapshot of this dataset (see below).

A detailed description of this dataset can be obtained from the StatsCan Data Portal at: https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=3210035901
Detailed information is included in the metadata file and as header text in the data file, which can be downloaded - look for the ‘download options’ link.

2. Farm product prices

This dataset contains monthly average farm product prices for Canadian crops and livestock by province and teritory, from 1980 to 2020 (or ‘last year’, whichever is greatest).

For this assignment you will use a preprocessed snapshot of this dataset (see below).

A description of this dataset can be obtained from the StatsCan Data Portal at: https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=3210007701 The information is included in the metadata file, which can be downloaded - look for the ‘download options’ link.

3. Bank of Canada daily average exchange rates

This dataset contains the daily average exchange rates for multiple foreign currencies. Exchange rates are expressed as 1 unit of the foreign currency converted into Canadian dollars. It includes only the latest four years of data, and the rates are published once each business day by 16:30 ET.

For this assignment you will use a snapshot of this dataset with only the USD-CAD exchange rates included (see next section). We have also prepared a monthly averaged version which you will be using below.

A brief description of this dataset and the original dataset can be obtained from the Bank of Canada Data Portal at: https://www.bankofcanada.ca/rates/exchange/daily-exchange-rates/

Dataset URLs

  1. Annual Crop Data: https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-RP0203EN-SkillsNetwork/labs/Final%20Project/Annual_Crop_Data.csv

  2. Farm product prices: https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-RP0203EN-SkillsNetwork/labs/Final%20Project/Monthly_Farm_Prices.csv

  3. Daily FX Data: https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-RP0203EN-SkillsNetwork/labs/Final%20Project/Daily_FX.csv

  4. Monthly FX Data: https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-RP0203EN-SkillsNetwork/labs/Final%20Project/Monthly_FX.csv

Load the RODBC package:

library(RODBC)

1. Create tables

Establish a connection to the Db2 database, and create the following four tables using the RODBC package in R. Use the separate cells provided below to create each of your tables.

  1. CROP_DATA
  2. FARM_PRICES
  3. DAILY_FX
  4. MONTHLY_FX
# Establish database connection
dsn_driver <- "{IBM DB2 ODBC Driver}"
dsn_database <- "bludb"
dsn_hostname <- "..."
dsn_port <- "..."
dsn_protocol <- "TCPIP"
dsn_uid <- "..."
dsn_pwd <- "..." 
dsn_security <- "ssl"

conn_path <- paste("DRIVER=",dsn_driver,
                ";DATABASE=",dsn_database,
                ";HOSTNAME=",dsn_hostname,
                ";PORT=",dsn_port,
                ";PROTOCOL=",dsn_protocol,
                ";UID=",dsn_uid,
                ";PWD=",dsn_pwd,
                ";SECURITY=",dsn_security,       
                    sep="")
conn <- odbcDriverConnect(conn_path)
conn
RODBC Connection 1
Details:
  case=nochange
  DRIVER={IBM DB2 ODBC DRIVER}
  UID=...
  PWD...
  DATABASE=bludb
  HOSTNAME=...
  PORT=...
  PROTOCOL=TCPIP
  SECURITY=SSL
# Check Connection Details
sql.info <- sqlTypeInfo(conn)
conn.info <- odbcGetInfo(conn)
conn.info["DBMS_Name"]
conn.info["DBMS_Ver"]
conn.info["Driver_ODBC_Ver"]

DBMS_Name: ‘DB2/LINUXX8664’

DBMS_Ver: ‘11.05.0800’

Driver_ODBC_Ver: ‘03.51’

# CROP_DATA:
df1 <- sqlQuery(conn, 
                    "CREATE TABLE CROP_DATA (
                                      CD_ID INTEGER NOT NULL,
                                      YEAR DATE NOT NULL,
                                      CROP_TYPE VARCHAR(20) NOT NULL,
                                      GEO VARCHAR(20) NOT NULL, 
                                      SEEDED_AREA INTEGER NOT NULL,
                                      HARVESTED_AREA INTEGER NOT NULL,
                                      PRODUCTION INTEGER NOT NULL,
                                      AVG_YIELD INTEGER NOT NULL,
                                      PRIMARY KEY (CD_ID)
                                      )", 
                    errors=FALSE
                    )

    if (df1 == -1){
        cat ("An error has occurred.\n")
        msg <- odbcGetErrMsg(conn)
        print (msg)
    } else {
        cat ("Table was created successfully.\n")
    }
Table was created successfully.
# FARM_PRICES:
df2 <- sqlQuery(conn,
                    "CREATE TABLE FARM_PRICES (
                                    CD_ID INTEGER NOT NULL,
                                    DATE DATE NOT NULL,
                                    CROP_TYPE VARCHAR(20) NOT NULL,
                                    GEO VARCHAR(20) NOT NULL,
                                    PRICE_PRERMT DECIMAL NOT NULL,
                                    PRIMARY KEY (CD_ID)
                                    )"
                    errors=FALSE
                    )

    if (df2 == -1){
        cat ("An error has occurred.\n")
        msg <- odbcGetErrMsg(conn)
        print (msg)
    } else {
        cat ("Table was created successfully.\n")
    }
Table was created successfully.
# DAILY_FX:
df3 <- sqlQuery(conn,
                    "CREATE TABLE DAILY_FX (
                                    DFX_ID INTEGER NOT NULL,
                                    DATE DATE NOT NULL,
                                    FXUSDCAD FLOAT(6) NOT NULL,
                                    PRIMARY KEY (DFX_ID)
                                    )"
Table was created successfully.
# MONTHLY_FX:
df4 <- sqlQuery(conn,
                    "CREATE TABLE MONTHLY_FX (
                                    DFX_ID INTEGER NOT NULL,
                                    DATE DATE NOT NULL,
                                    FXUSDCAD FLOAT(6) NOT NULL,
                                    PRIMARY KEY (DFX_ID)
                                    )",
                    errors=FALSE
                    )

    if (df4 == -1){
        shcat ("An error has occurred.\n")
        msg <- odbcGetErrMsg(conn)
        print (msg)
    } else {
        cat ("Table was created successfully.\n")
    }
Table was created successfully.

2. Read Datasets and Load Tables

Read the datasets into R dataframes using the urls provided above. Then load your tables.

crop_df <- read.csv('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-RP0203EN-SkillsNetwork/labs/Final%20Project/Annual_Crop_Data.csv', colClasses=c(YEAR="character"))
farm_df <- read.csv('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-RP0203EN-SkillsNetwork/labs/Final%20Project/Monthly_Farm_Prices.csv', colClasses=c(DATE="character"))
monthly_df <- read.csv('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-RP0203EN-SkillsNetwork/labs/Final%20Project/Monthly_FX.csv', colClasses=c(DATE="character"))
daily_df <- read.csv('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-RP0203EN-SkillsNetwork/labs/Final%20Project/Daily_FX.csv', colClasses=c(DATE="character"))

head(crop_df)
head(farm_df)
head(monthly_df)
head(daily_df)
A data.frame: 6 × 8
CD_ID YEAR CROP_TYPE GEO SEEDED_AREA HARVESTED_AREA PRODUCTION AVG_YIELD
<int> <chr> <fct> <fct> <int> <int> <int> <int>
1 0 1965-12-31 Barley Alberta 1372000 1372000 2504000 1825
2 1 1965-12-31 Barley Canada 2476800 2476800 4752900 1920
3 2 1965-12-31 Barley Saskatchewan 708000 708000 1415000 2000
4 3 1965-12-31 Canola Alberta 297400 297400 215500 725
5 4 1965-12-31 Canola Canada 580700 580700 512600 885
6 5 1965-12-31 Canola Saskatchewan 224600 224600 242700 1080
A data.frame: 6 × 5
CD_ID DATE CROP_TYPE GEO PRICE_PRERMT
<int> <chr> <fct> <fct> <dbl>
1 0 1985-01-01 Barley Alberta 127.39
2 1 1985-01-01 Barley Saskatchewan 121.38
3 2 1985-01-01 Canola Alberta 342.00
4 3 1985-01-01 Canola Saskatchewan 339.82
5 4 1985-01-01 Rye Alberta 100.77
6 5 1985-01-01 Rye Saskatchewan 109.75
A data.frame: 6 × 3
DFX_ID DATE FXUSDCAD
<int> <chr> <dbl>
1 0 2017-01-01 1.319276
2 1 2017-02-01 1.310726
3 2 2017-03-01 1.338643
4 3 2017-04-01 1.344021
5 4 2017-05-01 1.360705
6 5 2017-06-01 1.329805
A data.frame: 6 × 3
DFX_ID DATE FXUSDCAD
<int> <chr> <dbl>
1 0 2017-01-03 1.3435
2 1 2017-01-04 1.3315
3 2 2017-01-05 1.3244
4 3 2017-01-06 1.3214
5 4 2017-01-09 1.3240
6 5 2017-01-10 1.3213
sqlSave(conn, crop_df, "CROP_DATA", append=TRUE, fast=FALSE, rownames=FALSE, colnames=FALSE, verbose=FALSE)
sqlSave(conn, farm_df, "FARM_PRICES", append=TRUE, fast=FALSE, rownames=FALSE, colnames=FALSE, verbose=FALSE)
sqlSave(conn, monthly_df, "MONTHLY_FX", append=TRUE, fast=FALSE, rownames=FALSE, colnames=FALSE, verbose=FALSE)
sqlSave(conn, daily_df, "DAILY_FX", append=TRUE, fast=FALSE, rownames=FALSE, colnames=FALSE, verbose=FALSE)

Execute SQL queries using the RODBC R package to answer the following questions.

3. How many records are in the farm prices dataset?

query = "SELECT COUNT(CD_ID) AS RECORDS_IN_DATASET FROM FARM_PRICES"
sqlQuery(conn,query)
A data.frame: 1 × 1
RECORDS_IN_DATASET
<int>
1 2678

4. Which geographies are included in the farm prices dataset?

query = "SELECT GEO AS GEOGRAPHIES_IN_DATASET FROM FARM_PRICES 
GROUP BY GEO"
sqlQuery(conn,query)
A data.frame: 2 × 1
GEOGRAPHIES_IN_DATASET
<fct>
1 Alberta
2 Saskatchewan

5. How many hectares of Rye were harvested in Canada in 1968?

query = "SELECT SUM(HARVESTED_AREA) AS CANADA_HARVESTED_RYE_1968 FROM CROP_DATA
WHERE GEO = 'Canada' AND CROP_TYPE = 'Rye' AND YEAR(YEAR) = 1968"
sqlQuery(conn,query)
A data.frame: 1 × 1
CANADA_HARVESTED_RYE_1968
<int>
1 274100

6. Query and display the first 6 rows of the farm prices table for Rye.

query = "SELECT * FROM FARM_PRICES 
WHERE CROP_TYPE ='Rye' LIMIT 6"
sqlQuery(conn,query)
A data.frame: 6 × 5
CD_ID DATE CROP_TYPE GEO PRICE_PRERMT
<int> <date> <fct> <fct> <int>
1 4 1985-01-01 Rye Alberta 100
2 5 1985-01-01 Rye Saskatchewan 109
3 10 1985-02-01 Rye Alberta 95
4 11 1985-02-01 Rye Saskatchewan 103
5 16 1985-03-01 Rye Alberta 96
6 17 1985-03-01 Rye Saskatchewan 106

7. Which provinces grow Barley?

query = "SELECT GEO AS BARLEY_GROWING_PROVINCES FROM CROP_DATA 
WHERE CROP_TYPE='Barley' AND NOT GEO = 'Canada' GROUP BY GEO"
sqlQuery(conn,query)
A data.frame: 2 × 1
BARLEY_GROWING_PROVINCES
<fct>
1 Alberta
2 Saskatchewan

8. Find the first and last dates for the farm prices data.

query = "SELECT min(DATE) AS FIRST_DATE, 
max(DATE) AS LAST_DATE
FROM FARM_PRICES"
sqlQuery(conn,query)
A data.frame: 1 × 2
FIRST_DATE LAST_DATE
<date> <date>
1 1985-01-01 2020-12-01

9. Which crops have ever reached a farm price greater than or equal to $350 per metric tonne?

query = "SELECT DISTINCT (CROP_TYPE), 
PRICE_PRERMT, YEAR(DATE) AS YEAR FROM FARM_PRICES 
WHERE PRICE_PRERMT >= 350 ORDER BY PRICE_PRERMT LIMIT 10"
sqlQuery(conn,query)
A data.frame: 10 × 3
CROP_TYPE PRICE_PRERMT YEAR
<fct> <int> <int>
1 Canola 350 1985
2 Canola 350 1995
3 Canola 350 2003
4 Canola 350 2007
5 Canola 351 1999
6 Canola 351 2007
7 Canola 352 1997
8 Canola 352 1998
9 Canola 352 2003
10 Canola 352 2007

10.(A) Rank the crop types harvested in Saskatchewan in the year 2000 by their average yield.

query = "SELECT CROP_TYPE AS CROP_RANK, MAX(AVG_YIELD)AS AVG_YIELD_YEAR_2000 
FROM CROP_DATA 
WHERE YEAR(YEAR)=2000 AND GEO='Saskatchewan' 
GROUP BY CROP_TYPE 
ORDER BY AVG_YIELD_YEAR_2000 desc"
sqlQuery(conn,query)
A data.frame: 4 × 2
CROP_RANK AVG_YIELD_YEAR_2000
<fct> <int>
1 Barley 2800
2 Wheat 2200
3 Rye 2100
4 Canola 1400

10.(B) Which crop performed best?

query = "SELECT CROP_TYPE AS BEST_PERFORMING_CROP, AVG_YIELD AS YIELD 
FROM CROP_DATA 
WHERE YEAR(YEAR)=2000 AND GEO='Saskatchewan' LIMIT 1"
sqlQuery(conn,query)
A data.frame: 1 × 2
BEST_PERFORMING_CROP YIELD
<fct> <int>
1 Barley 2800

11.(A) Rank the crops and geographies by their average yield (KG per hectare) since the year 2000.

query = "SELECT CROP_TYPE, GEO, AVG_YIELD 
FROM CROP_DATA
WHERE YEAR(YEAR) >=2000
ORDER BY AVG_YIELD DESC LIMIT 10"
sqlQuery(conn,query)
A data.frame: 10 × 3
CROP_TYPE GEO AVG_YIELD
<fct> <fct> <int>
1 Barley Alberta 4100
2 Barley Alberta 4100
3 Barley Alberta 3980
4 Wheat Alberta 3900
5 Barley Alberta 3900
6 Wheat Alberta 3900
7 Barley Canada 3900
8 Barley Alberta 3890
9 Barley Canada 3820
10 Barley Canada 3810

11. (B) Which crop and province had the highest average yield since the year 2000?

query = "SELECT CROP_TYPE, GEO, AVG_YIELD AS HIGHEST_AVG_YIELD 
FROM CROP_DATA
WHERE YEAR(YEAR) >=2000
ORDER BY AVG_YIELD DESC LIMIT 1"
sqlQuery(conn,query)
A data.frame: 1 × 3
CROP_TYPE GEO HIGHEST_AVG_YIELD
<fct> <fct> <int>
1 Barley Alberta 4100

12. Use a subquery to determine how much wheat was harvested in Canada in the most recent year of the data.

query = "SELECT MAX(YEAR) AS MOST_RECENT_YEAR, 
SUM(HARVESTED_AREA) AS TOT_HARVESTED_WHEAT_CANADA 
FROM CROP_DATA
WHERE YEAR(YEAR)=(SELECT MAX(YEAR(YEAR))FROM CROP_DATA 
WHERE GEO = 'Canada' AND CROP_TYPE  = 'Wheat')"
sqlQuery(conn,query)
A data.frame: 1 × 2
MOST_RECENT_YEAR TOT_HARVESTED_WHEAT_CANADA
<date> <int>
1 2020-12-31 38897100

13. Use an implicit inner join to calculate the monthly price per metric tonne of Canola grown in Saskatchewan in both Canadian and US dollars. Display the most recent 6 months of the data.

query = "SELECT A.DATE,A.CROP_TYPE,A.GEO,A.PRICE_PRERMT AS CANADIAN_DOLLARS, 
(A.PRICE_PRERMT / B.FXUSDCAD) AS US_DOLLARS, B.FXUSDCAD 
FROM FARM_PRICES A , MONTHLY_FX B 
WHERE YEAR(A.DATE) = YEAR(B.DATE) 
AND MONTH(A.DATE) = MONTH(B.DATE) 
AND CROP_TYPE ='Canola' 
AND GEO = 'Saskatchewan' 
ORDER BY DATE DESC LIMIT 6"
sqlQuery(conn,query)
A data.frame: 6 × 6
DATE CROP_TYPE GEO CANADIAN_DOLLARS US_DOLLARS FXUSDCAD
<date> <fct> <fct> <int> <dbl> <dbl>
1 2020-12-01 Canola Saskatchewan 507 395.8553 1.280771
2 2020-11-01 Canola Saskatchewan 495 378.7821 1.306820
3 2020-10-01 Canola Saskatchewan 474 358.6912 1.321471
4 2020-09-01 Canola Saskatchewan 463 350.0125 1.322810
5 2020-08-01 Canola Saskatchewan 464 350.9290 1.322205
6 2020-07-01 Canola Saskatchewan 462 342.2602 1.349850

End Connection With DataBase

close(conn)