SQL for Data Science with R Final Assignment

Assignment Scenario

Congratulations! You have just been hired by a US Venture Capital firm as a data analyst.

The company is considering foreign grain markets to help meet its supply chain requirements for its recent investments in the microbrewery and microdistillery industry, which is involved with the production and distribution of craft beers and spirits.

Your first task is to provide a high level analysis of crop production in Canada. Your stakeholders want to understand the current and historical performance of certain crop types in terms of supply and price volatility. For now they are mainly interested in a macro-view of Canada’s crop farming industry, and how it relates to the relative value of the Canadian and US dollars.

Introduction

Using this R notebook you will:

Understand four datasets
Load the datasets into four separate tables in a Db2 database
Execute SQL queries unsing the RODBC R package to answer assignment questions

Understand the datasets

To complete the assignment problems in this notebook you will be using subsetted snapshots of two datasets from Statistics Canada, and one from the Bank of Canada. The links to the prepared datasets are provided in the next section; the interested student can explore the landing pages for the source datasets as follows:

1. Canadian Principal Crops Data

This dataset contains agricultural production measures for the principle crops grown in Canada, including a breakdown by province and teritory, for each year from 1908 to 2020.

For this assignment you will use a preprocessed snapshot of this dataset (see below).

A detailed description of this dataset can be obtained from the StatsCan Data Portal at: https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=3210035901
Detailed information is included in the metadata file and as header text in the data file, which can be downloaded - look for the ‘download options’ link.

2. Farm product prices

This dataset contains monthly average farm product prices for Canadian crops and livestock by province and teritory, from 1980 to 2020 (or ‘last year’, whichever is greatest).

For this assignment you will use a preprocessed snapshot of this dataset (see below).

A description of this dataset can be obtained from the StatsCan Data Portal at: https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=3210007701 The information is included in the metadata file, which can be downloaded - look for the ‘download options’ link.

3. Bank of Canada daily average exchange rates

This dataset contains the daily average exchange rates for multiple foreign currencies. Exchange rates are expressed as 1 unit of the foreign currency converted into Canadian dollars. It includes only the latest four years of data, and the rates are published once each business day by 16:30 ET.

For this assignment you will use a snapshot of this dataset with only the USD-CAD exchange rates included (see next section). We have also prepared a monthly averaged version which you will be using below.

A brief description of this dataset and the original dataset can be obtained from the Bank of Canada Data Portal at: https://www.bankofcanada.ca/rates/exchange/daily-exchange-rates/

Dataset URLs

Load the RODBC package:

library(RODBC)

1. Create tables

Establish a connection to the Db2 database, and create the following four tables using the RODBC package in R. Use the separate cells provided below to create each of your tables.

CROP_DATA
FARM_PRICES
DAILY_FX
MONTHLY_FX

# Establish database connection
dsn_driver <- "{IBM DB2 ODBC Driver}"
dsn_database <- "bludb"
dsn_hostname <- "..."
dsn_port <- "..."
dsn_protocol <- "TCPIP"
dsn_uid <- "..."
dsn_pwd <- "..." 
dsn_security <- "ssl"

conn_path <- paste("DRIVER=",dsn_driver,
                ";DATABASE=",dsn_database,
                ";HOSTNAME=",dsn_hostname,
                ";PORT=",dsn_port,
                ";PROTOCOL=",dsn_protocol,
                ";UID=",dsn_uid,
                ";PWD=",dsn_pwd,
                ";SECURITY=",dsn_security,       
                    sep="")
conn <- odbcDriverConnect(conn_path)
conn

RODBC Connection 1
Details:
  case=nochange
  DRIVER={IBM DB2 ODBC DRIVER}
  UID=...
  PWD...
  DATABASE=bludb
  HOSTNAME=...
  PORT=...
  PROTOCOL=TCPIP
  SECURITY=SSL

# Check Connection Details
sql.info <- sqlTypeInfo(conn)
conn.info <- odbcGetInfo(conn)
conn.info["DBMS_Name"]
conn.info["DBMS_Ver"]
conn.info["Driver_ODBC_Ver"]

DBMS_Name: ‘DB2/LINUXX8664’

DBMS_Ver: ‘11.05.0800’

Driver_ODBC_Ver: ‘03.51’

# CROP_DATA:
df1 <- sqlQuery(conn, 
                    "CREATE TABLE CROP_DATA (
                                      CD_ID INTEGER NOT NULL,
                                      YEAR DATE NOT NULL,
                                      CROP_TYPE VARCHAR(20) NOT NULL,
                                      GEO VARCHAR(20) NOT NULL, 
                                      SEEDED_AREA INTEGER NOT NULL,
                                      HARVESTED_AREA INTEGER NOT NULL,
                                      PRODUCTION INTEGER NOT NULL,
                                      AVG_YIELD INTEGER NOT NULL,
                                      PRIMARY KEY (CD_ID)
                                      )", 
                    errors=FALSE
                    )

    if (df1 == -1){
        cat ("An error has occurred.\n")
        msg <- odbcGetErrMsg(conn)
        print (msg)
    } else {
        cat ("Table was created successfully.\n")
    }

Table was created successfully.

# FARM_PRICES:
df2 <- sqlQuery(conn,
                    "CREATE TABLE FARM_PRICES (
                                    CD_ID INTEGER NOT NULL,
                                    DATE DATE NOT NULL,
                                    CROP_TYPE VARCHAR(20) NOT NULL,
                                    GEO VARCHAR(20) NOT NULL,
                                    PRICE_PRERMT DECIMAL NOT NULL,
                                    PRIMARY KEY (CD_ID)
                                    )"
                    errors=FALSE
                    )

    if (df2 == -1){
        cat ("An error has occurred.\n")
        msg <- odbcGetErrMsg(conn)
        print (msg)
    } else {
        cat ("Table was created successfully.\n")
    }

Table was created successfully.

# DAILY_FX:
df3 <- sqlQuery(conn,
                    "CREATE TABLE DAILY_FX (
                                    DFX_ID INTEGER NOT NULL,
                                    DATE DATE NOT NULL,
                                    FXUSDCAD FLOAT(6) NOT NULL,
                                    PRIMARY KEY (DFX_ID)
                                    )"

Table was created successfully.

# MONTHLY_FX:
df4 <- sqlQuery(conn,
                    "CREATE TABLE MONTHLY_FX (
                                    DFX_ID INTEGER NOT NULL,
                                    DATE DATE NOT NULL,
                                    FXUSDCAD FLOAT(6) NOT NULL,
                                    PRIMARY KEY (DFX_ID)
                                    )",
                    errors=FALSE
                    )

    if (df4 == -1){
        shcat ("An error has occurred.\n")
        msg <- odbcGetErrMsg(conn)
        print (msg)
    } else {
        cat ("Table was created successfully.\n")
    }

Table was created successfully.

2. Read Datasets and Load Tables

Read the datasets into R dataframes using the urls provided above. Then load your tables.

crop_df <- read.csv('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-RP0203EN-SkillsNetwork/labs/Final%20Project/Annual_Crop_Data.csv', colClasses=c(YEAR="character"))
farm_df <- read.csv('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-RP0203EN-SkillsNetwork/labs/Final%20Project/Monthly_Farm_Prices.csv', colClasses=c(DATE="character"))
monthly_df <- read.csv('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-RP0203EN-SkillsNetwork/labs/Final%20Project/Monthly_FX.csv', colClasses=c(DATE="character"))
daily_df <- read.csv('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-RP0203EN-SkillsNetwork/labs/Final%20Project/Daily_FX.csv', colClasses=c(DATE="character"))

head(crop_df)
head(farm_df)
head(monthly_df)
head(daily_df)

A data.frame: 6 × 8
	CD_ID	YEAR	CROP_TYPE	GEO	SEEDED_AREA	HARVESTED_AREA	PRODUCTION	AVG_YIELD
	<int>	<chr>	<fct>	<fct>	<int>	<int>	<int>	<int>
1	0	1965-12-31	Barley	Alberta	1372000	1372000	2504000	1825
2	1	1965-12-31	Barley	Canada	2476800	2476800	4752900	1920
3	2	1965-12-31	Barley	Saskatchewan	708000	708000	1415000	2000
4	3	1965-12-31	Canola	Alberta	297400	297400	215500	725
5	4	1965-12-31	Canola	Canada	580700	580700	512600	885
6	5	1965-12-31	Canola	Saskatchewan	224600	224600	242700	1080

A data.frame: 6 × 5
	CD_ID	DATE	CROP_TYPE	GEO	PRICE_PRERMT
	<int>	<chr>	<fct>	<fct>	<dbl>
1	0	1985-01-01	Barley	Alberta	127.39
2	1	1985-01-01	Barley	Saskatchewan	121.38
3	2	1985-01-01	Canola	Alberta	342.00
4	3	1985-01-01	Canola	Saskatchewan	339.82
5	4	1985-01-01	Rye	Alberta	100.77
6	5	1985-01-01	Rye	Saskatchewan	109.75

A data.frame: 6 × 3
	DFX_ID	DATE	FXUSDCAD
	<int>	<chr>	<dbl>
1	0	2017-01-01	1.319276
2	1	2017-02-01	1.310726
3	2	2017-03-01	1.338643
4	3	2017-04-01	1.344021
5	4	2017-05-01	1.360705
6	5	2017-06-01	1.329805

A data.frame: 6 × 3
	DFX_ID	DATE	FXUSDCAD
	<int>	<chr>	<dbl>
1	0	2017-01-03	1.3435
2	1	2017-01-04	1.3315
3	2	2017-01-05	1.3244
4	3	2017-01-06	1.3214
5	4	2017-01-09	1.3240
6	5	2017-01-10	1.3213

sqlSave(conn, crop_df, "CROP_DATA", append=TRUE, fast=FALSE, rownames=FALSE, colnames=FALSE, verbose=FALSE)
sqlSave(conn, farm_df, "FARM_PRICES", append=TRUE, fast=FALSE, rownames=FALSE, colnames=FALSE, verbose=FALSE)
sqlSave(conn, monthly_df, "MONTHLY_FX", append=TRUE, fast=FALSE, rownames=FALSE, colnames=FALSE, verbose=FALSE)
sqlSave(conn, daily_df, "DAILY_FX", append=TRUE, fast=FALSE, rownames=FALSE, colnames=FALSE, verbose=FALSE)

Execute SQL queries using the RODBC R package to answer the following questions.

3. How many records are in the farm prices dataset?

query = "SELECT COUNT(CD_ID) AS RECORDS_IN_DATASET FROM FARM_PRICES"
sqlQuery(conn,query)

A data.frame: 1 × 1
	RECORDS_IN_DATASET
	<int>
1	2678

4. Which geographies are included in the farm prices dataset?

query = "SELECT GEO AS GEOGRAPHIES_IN_DATASET FROM FARM_PRICES 
GROUP BY GEO"
sqlQuery(conn,query)

A data.frame: 2 × 1
	GEOGRAPHIES_IN_DATASET
	<fct>
1	Alberta
2	Saskatchewan

5. How many hectares of Rye were harvested in Canada in 1968?

query = "SELECT SUM(HARVESTED_AREA) AS CANADA_HARVESTED_RYE_1968 FROM CROP_DATA
WHERE GEO = 'Canada' AND CROP_TYPE = 'Rye' AND YEAR(YEAR) = 1968"
sqlQuery(conn,query)

A data.frame: 1 × 1
	CANADA_HARVESTED_RYE_1968
	<int>
1	274100

6. Query and display the first 6 rows of the farm prices table for Rye.

query = "SELECT * FROM FARM_PRICES 
WHERE CROP_TYPE ='Rye' LIMIT 6"
sqlQuery(conn,query)

A data.frame: 6 × 5
	CD_ID	DATE	CROP_TYPE	GEO	PRICE_PRERMT
	<int>	<date>	<fct>	<fct>	<int>
1	4	1985-01-01	Rye	Alberta	100
2	5	1985-01-01	Rye	Saskatchewan	109
3	10	1985-02-01	Rye	Alberta	95
4	11	1985-02-01	Rye	Saskatchewan	103
5	16	1985-03-01	Rye	Alberta	96
6	17	1985-03-01	Rye	Saskatchewan	106

7. Which provinces grow Barley?

query = "SELECT GEO AS BARLEY_GROWING_PROVINCES FROM CROP_DATA 
WHERE CROP_TYPE='Barley' AND NOT GEO = 'Canada' GROUP BY GEO"
sqlQuery(conn,query)

A data.frame: 2 × 1
	BARLEY_GROWING_PROVINCES
	<fct>
1	Alberta
2	Saskatchewan

8. Find the first and last dates for the farm prices data.

query = "SELECT min(DATE) AS FIRST_DATE, 
max(DATE) AS LAST_DATE
FROM FARM_PRICES"
sqlQuery(conn,query)

A data.frame: 1 × 2
	FIRST_DATE	LAST_DATE
	<date>	<date>
1	1985-01-01	2020-12-01

9. Which crops have ever reached a farm price greater than or equal to $350 per metric tonne?

query = "SELECT DISTINCT (CROP_TYPE), 
PRICE_PRERMT, YEAR(DATE) AS YEAR FROM FARM_PRICES 
WHERE PRICE_PRERMT >= 350 ORDER BY PRICE_PRERMT LIMIT 10"
sqlQuery(conn,query)

A data.frame: 10 × 3
	CROP_TYPE	PRICE_PRERMT	YEAR
	<fct>	<int>	<int>
1	Canola	350	1985
2	Canola	350	1995
3	Canola	350	2003
4	Canola	350	2007
5	Canola	351	1999
6	Canola	351	2007
7	Canola	352	1997
8	Canola	352	1998
9	Canola	352	2003
10	Canola	352	2007

10.(A) Rank the crop types harvested in Saskatchewan in the year 2000 by their average yield.

query = "SELECT CROP_TYPE AS CROP_RANK, MAX(AVG_YIELD)AS AVG_YIELD_YEAR_2000 
FROM CROP_DATA 
WHERE YEAR(YEAR)=2000 AND GEO='Saskatchewan' 
GROUP BY CROP_TYPE 
ORDER BY AVG_YIELD_YEAR_2000 desc"
sqlQuery(conn,query)

A data.frame: 4 × 2
	CROP_RANK	AVG_YIELD_YEAR_2000
	<fct>	<int>
1	Barley	2800
2	Wheat	2200
3	Rye	2100
4	Canola	1400

10.(B) Which crop performed best?

query = "SELECT CROP_TYPE AS BEST_PERFORMING_CROP, AVG_YIELD AS YIELD 
FROM CROP_DATA 
WHERE YEAR(YEAR)=2000 AND GEO='Saskatchewan' LIMIT 1"
sqlQuery(conn,query)

A data.frame: 1 × 2
	BEST_PERFORMING_CROP	YIELD
	<fct>	<int>
1	Barley	2800

11.(A) Rank the crops and geographies by their average yield (KG per hectare) since the year 2000.

query = "SELECT CROP_TYPE, GEO, AVG_YIELD 
FROM CROP_DATA
WHERE YEAR(YEAR) >=2000
ORDER BY AVG_YIELD DESC LIMIT 10"
sqlQuery(conn,query)

A data.frame: 10 × 3
	CROP_TYPE	GEO	AVG_YIELD
	<fct>	<fct>	<int>
1	Barley	Alberta	4100
2	Barley	Alberta	4100
3	Barley	Alberta	3980
4	Wheat	Alberta	3900
5	Barley	Alberta	3900
6	Wheat	Alberta	3900
7	Barley	Canada	3900
8	Barley	Alberta	3890
9	Barley	Canada	3820
10	Barley	Canada	3810

11. (B) Which crop and province had the highest average yield since the year 2000?

query = "SELECT CROP_TYPE, GEO, AVG_YIELD AS HIGHEST_AVG_YIELD 
FROM CROP_DATA
WHERE YEAR(YEAR) >=2000
ORDER BY AVG_YIELD DESC LIMIT 1"
sqlQuery(conn,query)

A data.frame: 1 × 3
	CROP_TYPE	GEO	HIGHEST_AVG_YIELD
	<fct>	<fct>	<int>
1	Barley	Alberta	4100

12. Use a subquery to determine how much wheat was harvested in Canada in the most recent year of the data.

query = "SELECT MAX(YEAR) AS MOST_RECENT_YEAR, 
SUM(HARVESTED_AREA) AS TOT_HARVESTED_WHEAT_CANADA 
FROM CROP_DATA
WHERE YEAR(YEAR)=(SELECT MAX(YEAR(YEAR))FROM CROP_DATA 
WHERE GEO = 'Canada' AND CROP_TYPE  = 'Wheat')"
sqlQuery(conn,query)

A data.frame: 1 × 2
	MOST_RECENT_YEAR	TOT_HARVESTED_WHEAT_CANADA
	<date>	<int>
1	2020-12-31	38897100

13. Use an implicit inner join to calculate the monthly price per metric tonne of Canola grown in Saskatchewan in both Canadian and US dollars. Display the most recent 6 months of the data.

query = "SELECT A.DATE,A.CROP_TYPE,A.GEO,A.PRICE_PRERMT AS CANADIAN_DOLLARS, 
(A.PRICE_PRERMT / B.FXUSDCAD) AS US_DOLLARS, B.FXUSDCAD 
FROM FARM_PRICES A , MONTHLY_FX B 
WHERE YEAR(A.DATE) = YEAR(B.DATE) 
AND MONTH(A.DATE) = MONTH(B.DATE) 
AND CROP_TYPE ='Canola' 
AND GEO = 'Saskatchewan' 
ORDER BY DATE DESC LIMIT 6"
sqlQuery(conn,query)

A data.frame: 6 × 6
	DATE	CROP_TYPE	GEO	CANADIAN_DOLLARS	US_DOLLARS	FXUSDCAD
	<date>	<fct>	<fct>	<int>	<dbl>	<dbl>
1	2020-12-01	Canola	Saskatchewan	507	395.8553	1.280771
2	2020-11-01	Canola	Saskatchewan	495	378.7821	1.306820
3	2020-10-01	Canola	Saskatchewan	474	358.6912	1.321471
4	2020-09-01	Canola	Saskatchewan	463	350.0125	1.322810
5	2020-08-01	Canola	Saskatchewan	464	350.9290	1.322205
6	2020-07-01	Canola	Saskatchewan	462	342.2602	1.349850

End Connection With DataBase

close(conn)