Objective

  1. Comprehend the structure and content of four distinct datasets.
  2. Systematically load these datasets into four discrete tables within a relational database.
  3. Utilize the RODBC/RSQLite R package to formulate and execute SQL queries, thereby addressing the specified research questions.

Justification

Data analysis is a fundamental discipline in modern scientific research, enabling the extraction of valuable information and the generation of knowledge from large volumes of data. This project focuses on understanding and manipulating four distinct datasets with the aim of answering specific questions using advanced SQL query techniques through the RODBC/RSQLite packages in R. By deepening the understanding of data and its structure, this project contributes significantly to scientific knowledge. The ability to load and manage data in a relational database is a crucial skill for data scientists and researchers.

Database Description

  1. Canadian Principal Crops Data:
    • Description: This dataset contains agricultural production measures for the principal crops grown in Canada, including a breakdown by province and territory, for each year from 1908 to 2020. Detailed information about this dataset can be obtained from the StatsCan Data Portal.
    • Usage in the project: A preprocessed snapshot of this dataset will be used to analyze the current and historical performance of certain crop types in terms of supply and price.
  2. Farm Product Prices:
    • Description: This dataset contains monthly average farm product prices for Canadian crops and livestock by province and territory, from 1980 to 2020. Detailed information about this dataset can be obtained from the StatsCan Data Portal.
    • Usage in the project: A preprocessed snapshot of this dataset will be used to evaluate the prices of agricultural products and their variation over time.
  3. Bank of Canada Daily Average Exchange Rates:
    • Description: This dataset contains the daily average exchange rates for multiple foreign currencies. Exchange rates are expressed as 1 unit of the foreign currency converted into Canadian dollars. It includes only the latest four years of data, and the rates are published once each business day by 16:30 ET. Detailed information about this dataset can be obtained from the Bank of Canada Data Portal.
    • Usage in the project: A snapshot of this dataset, including only the USD-CAD exchange rates, as well as a monthly averaged version, will be used to analyze the relationship between the value of the Canadian and US dollars.

The first step involves establishing a connection to the database.

con <- dbConnect(RSQLite::SQLite(),"Final.sqlite")

To begin, establish a connection, referred to as conn, with the RSQLite database Final.sqlite. This step is crucial to ensure that all subsequent operations are performed efficiently and securely. Once the connection is established, proceed to create the following four tables, each designed to store specific data and facilitate subsequent analysis

  1. CROP_DATA
  2. FARM_PRICES
  3. DAILY_FX
  4. MONTHLY_FX
# Create "CROP_DATA" table in RSQLite

df1 <- dbExecute(con, 
                 "CREATE TABLE CROP_DATA (
                                      CD_ID INTEGER NOT NULL,
                                      YEAR DATE NOT NULL,
                                      CROP_TYPE VARCHAR(20) NOT NULL,
                                      GEO VARCHAR(20) NOT NULL, 
                                      SEEDED_AREA INTEGER NOT NULL,
                                      HARVESTED_AREA INTEGER NOT NULL,
                                      PRODUCTION INTEGER NOT NULL,
                                      AVG_YIELD INTEGER NOT NULL,
                                      PRIMARY KEY (CD_ID)
                                      )"
)

if (df1 == -1){
  cat ("An error has occurred.\n")
  msg <- odbcGetErrMsg(con)
  print (msg)
} else {
  cat ("Table was created successfully.\n")
}
## Table was created successfully.
# Create "FARM_PRICES" table in RSQLite

df2 <- dbExecute(con, 
                 "CREATE TABLE FARM_PRICES (
                                      CD_ID INTEGER NOT NULL,
                                      YEAR DATE NOT NULL,
                                      CROP_TYPE VARCHAR(20) NOT NULL,
                                      GEO VARCHAR(20) NOT NULL, 
                                      SEEDED_AREA INTEGER NOT NULL,
                                      HARVESTED_AREA INTEGER NOT NULL,
                                      PRODUCTION INTEGER NOT NULL,
                                      AVG_YIELD INTEGER NOT NULL,
                                      PRIMARY KEY (CD_ID)
                                      )"
)

if (df1 == -1){
  cat ("An error has occurred.\n")
  msg <- odbcGetErrMsg(con)
  print (msg)
} else {
  cat ("Table was created successfully.\n")
}
## Table was created successfully.
# Create "DAILY_FX" table in RSQLite.


df3 <- dbExecute(con, "CREATE TABLE DAYLY_FX (
                                DFX_ID INTEGER NOT NULL,
                                DATE DATE NOT NULL, 
                                FXUSDCAD FLOAT(6),
                                PRIMARY KEY (DFX_ID)
                                )"
)

if (df3 == -1){
  cat ("An error has occurred.\n")
  msg <- odbcGetErrMsg(con)
  print (msg)
} else {
  cat ("Table was created successfully.\n")
} 
## Table was created successfully.
# Create "MONTHLY_FX" table in RSQLite.


df4 <- dbExecute(con, "CREATE TABLE MONTHLY_FX (
                                DFX_ID INTEGER NOT NULL,
                                DATE DATE NOT NULL, 
                                FXUSDCAD FLOAT(6),
                                PRIMARY KEY (DFX_ID)
                                )"
)

if (df3 == -1){
  cat ("An error has occurred.\n")
  msg <- odbcGetErrMsg(con)
  print (msg)
} else {
  cat ("Table was created successfully.\n")
} 
## Table was created successfully.

Read Datasets and load your tables in database

crop_df <- read.csv('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-RP0203EN-SkillsNetwork/labs/Practice%20Assignment/Annual_Crop_Data.csv', colClasses=c(YEAR="character"))

daily_df <- read.csv('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-RP0203EN-SkillsNetwork/labs/Practice%20Assignment/Daily_FX.csv', colClasses=c(date="character"))
## Warning in read.table(file = file, header = header, sep = sep, quote = quote, :
## not all columns named in 'colClasses' exist
monthly_df <-read.csv ("https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-RP0203EN-SkillsNetwork/labs/Final%20Project/Monthly_FX.csv", colClasses = c(date="character"))
## Warning in read.table(file = file, header = header, sep = sep, quote = quote, :
## not all columns named in 'colClasses' exist
farmprices_df <- read.csv("https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-RP0203EN-SkillsNetwork/labs/Final%20Project/Monthly_Farm_Prices.csv", colClasses=c(YEAR="character"))
## Warning in read.table(file = file, header = header, sep = sep, quote = quote, :
## not all columns named in 'colClasses' exist
# Write the table in table

dbWriteTable(con, "CROP_DATA", crop_df, overwrite=TRUE, header = TRUE)
dbWriteTable(con, "DAILY_DATA", daily_df, overwrite=TRUE, header = TRUE)
dbWriteTable(con, "FARMPRICES_DATA", farmprices_df, overwrite=TRUE, header = TRUE)
dbWriteTable(con, "MONTHLY_DATA", monthly_df, overwrite=TRUE, header = TRUE)
#check list of tables in the present db.
dbListTables(con)
## [1] "CROP_DATA"       "DAILY_DATA"      "DAYLY_FX"        "FARMPRICES_DATA"
## [5] "FARM_PRICES"     "MONTHLY_DATA"    "MONTHLY_FX"

Subsequently, queries will be executed using SQLite to analyze the data stored in the tables. These queries will allow us to extract relevant information and gain insights into crops, agricultural prices, and exchange rates

how many records are in the FARM_PRICES dataset?

To better understand the scope of the data available for analysis, it is essential to determine the size of the dataset.

# Execute the SQL query to count the records in the farm prices dataset
dbGetQuery(con, "SELECT COUNT(*) FROM FARMPRICES_DATA")
##   COUNT(*)
## 1     2678

Which geographies are included in the farm prices dataset?

Additionally, it is equally important to identify the geographies included in the FARM_PRICES dataset. Knowing the geographical regions represented will enable us to evaluate the diversity and coverage of the dataset, as well as to perform comparative analyses between different areas. This can provide valuable insights into regional variations in agricultural prices and help identify specific trends and patterns for each region.

dbGetQuery(con, "SELECT DISTINCT GEO FROM FARMPRICES_DATA")
##            GEO
## 1      Alberta
## 2 Saskatchewan

How many hectares of Rye were harvested in Canada in 1968?

To gain a comprehensive understanding of the agricultural data, it is important to quantify the extent of rye cultivation in historical contexts. Specifically, how many hectares of Rye were harvested in Canada in 1968? This information will help us analyze trends in rye production and its significance within the agricultural sector during that period.

dbGetQuery(con, "
SELECT SUM(HARVESTED_AREA) AS total_hectares
FROM CROP_DATA
WHERE CROP_TYPE = 'Rye' AND GEO = 'Canada' AND strftime('%Y', YEAR) = '1968'")
##   total_hectares
## 1         274100

Execute a query to retrieve and display the initial six rows of the FARM_PRICES table for rye.

To conduct a detailed analysis of rye pricing data, execute a query to retrieve and display the first six rows of the FARM_PRICES table specifically for rye. This will provide an initial overview of the pricing trends and variations for rye within the dataset

dbGetQuery(con, "SELECT PRICE_PRERMT, CROP_TYPE FROM FARMPRICES_DATA WHERE CROP_TYPE = 'Rye' LIMIT  6")
##   PRICE_PRERMT CROP_TYPE
## 1       100.77       Rye
## 2       109.75       Rye
## 3        95.05       Rye
## 4       103.46       Rye
## 5        96.77       Rye
## 6       106.38       Rye

Which provinces have cultivated barley?

To understand the geographical distribution of barley cultivation, it is essential to identify the provinces that have grown barley. This information will provide insights into regional agricultural practices and the extent of barley production across different areas

dbGetQuery(con, "SELECT DISTINCT GEO FROM CROP_DATA WHERE CROP_TYPE = 'Barley'")
##            GEO
## 1      Alberta
## 2       Canada
## 3 Saskatchewan

Determine the initial and final dates of the farm prices dataset.

To better understand the temporal scope of the farm prices dataset, it is necessary to identify the first and last dates of the records. This will provide a clear view of the period covered by the data and enable precise historical analysis.

dbGetQuery(con, 'SELECT min(DATE) FIRST_DATE, max(DATE) LAST_DATE FROM FARMPRICES_DATA')
##   FIRST_DATE  LAST_DATE
## 1 1985-01-01 2020-12-01

Which crops have achieved a farm price of $350 or more per metric tonne?

To identify which crops have ever reached a farm price of $350 or more per metric tonne, it is necessary to perform a query on the FARM_PRICES dataset. This information will allow us to analyze which crops have had significant market value and better understand the dynamics of agricultural prices.

dbGetQuery(con, 'SELECT DISTINCT CROP_TYPE FROM FARMPRICES_DATA WHERE PRICE_PRERMT >= 350')
##   CROP_TYPE
## 1    Canola

Rank the crop types harvested in Saskatchewan in the year 2000 by their average yield. Which crop performed best?

To evaluate the agricultural performance in Saskatchewan in the year 2000, it is essential to rank the crop types harvested by their average yield. This analysis helps identify which crop performed best in terms of production. Ranking crops by average yield not only provides a clear view of the efficiency of each crop type but also aids farmers and agricultural policymakers in making informed decisions about resource management and future planning. By determining which crop had the highest yield, agricultural practices can be optimized, enhancing the sustainability and profitability of agricultural production in the region.

dbGetQuery(con, "SELECT CROP_TYPE, GEO, HARVESTED_AREA
    FROM CROP_DATA 
    WHERE GEO='Saskatchewan'AND strftime('%Y', YEAR) = '2000'
    ORDER BY HARVESTED_AREA DESC
    ")
##   CROP_TYPE          GEO HARVESTED_AREA
## 1     Wheat Saskatchewan        6080300
## 2    Canola Saskatchewan        2371500
## 3    Barley Saskatchewan        1922300
## 4       Rye Saskatchewan          46500

To evaluate agricultural performance since the year 2000, it is essential to rank the crops and geographies by their average yield (KG per hectare). This analysis helps identify which crop and province had the highest average yield, providing valuable insights into regional agricultural efficiency and productivity. By examining the average yield data, farmers and policymakers can make informed decisions about resource allocation, crop selection, and agricultural practices. This information is crucial for optimizing agricultural output, ensuring food security, and enhancing the sustainability and profitability of farming operations. Identifying the top-performing crop and province allows for targeted improvements and strategic planning to boost overall agricultural productivity.

dbGetQuery(con, "SELECT  CROP_TYPE, GEO , SUM(AVG_YIELD) as AVG
    FROM CROP_DATA 
    WHERE strftime('%Y', YEAR) >= '2000'
    GROUP BY CROP_TYPE, GEO
    ORDER BY AVG_YIELD DESC
    LIMIT 10
    ")
##    CROP_TYPE          GEO   AVG
## 1     Barley      Alberta 72465
## 2     Barley       Canada 68329
## 3     Barley Saskatchewan 62392
## 4      Wheat      Alberta 65113
## 5      Wheat       Canada 59752
## 6        Rye       Canada 53422
## 7      Wheat Saskatchewan 51017
## 8        Rye      Alberta 56360
## 9        Rye Saskatchewan 46761
## 10    Canola      Alberta 41984

Use a subquery to determine how much wheat was harvested in Canada in the most recent year of the data.

dbGetQuery(con,
"SELECT SUM(HARVESTED_AREA) AS total_hectares
FROM CROP_DATA
WHERE CROP_TYPE = 'Wheat' AND GEO = 'Canada' AND strftime('%Y', YEAR) = ( 
    SELECT MAX(strftime('%Y', YEAR))
    FROM CROP_DATA
    WHERE CROP_TYPE = 'Wheat' AND GEO = 'Canada')")
##   total_hectares
## 1       10017800

Use an implicit inner join to calculate the monthly price per metric tonne of Canola grown in Saskatchewan in both Canadian and US dollars. Display the most recent 6 months of the data.

dbGetQuery(con, "
SELECT 
    f.PRICE_PRERMT,
    e.FXUSDCAD
FROM 
    FARMPRICES_DATA f ,  MONTHLY_DATA e
WHERE 
    f.CROP_TYPE = 'Canola' AND 
    f.GEO = 'Saskatchewan' AND 
    f.DATE = e.DATE
ORDER BY 
   f.DATE DESC
LIMIT 6
")
##   PRICE_PRERMT FXUSDCAD
## 1       507.33 1.280771
## 2       495.64 1.306820
## 3       474.80 1.321471
## 4       463.52 1.322810
## 5       464.60 1.322205
## 6       462.88 1.349850

Conclusion

The analysis of crop production in Canada highlights the importance of understanding both current and historical performance of various crop types. This knowledge is crucial for meeting the supply chain requirements of the microbrewery and microdistillery industry, ensuring consistent and high-quality grain supplies. Additionally, the impact of exchange rates between the Canadian and US dollars on the cost of importing grains is significant, affecting overall financial planning and investment strategies.

By leveraging database management and data analysis techniques, such as creating tables and executing SQL queries, the venture capital firm can gain valuable insights into crop production metrics. This data-driven approach allows for strategic planning, identifying the most productive regions and crop types, and ultimately enhancing the sustainability and profitability of agricultural investments.