Data analysis is a fundamental discipline in modern scientific
research, enabling the extraction of valuable information and the
generation of knowledge from large volumes of data. This project focuses
on understanding and manipulating four distinct datasets with the aim of
answering specific questions using advanced SQL query techniques through
the RODBC/RSQLite packages in R. By deepening
the understanding of data and its structure, this project contributes
significantly to scientific knowledge. The ability to load and manage
data in a relational database is a crucial skill for data scientists and
researchers.
con <- dbConnect(RSQLite::SQLite(),"Final.sqlite")
To begin, establish a connection, referred to as conn, with the RSQLite database Final.sqlite. This step is crucial to ensure that all subsequent operations are performed efficiently and securely. Once the connection is established, proceed to create the following four tables, each designed to store specific data and facilitate subsequent analysis
# Create "CROP_DATA" table in RSQLite
df1 <- dbExecute(con,
"CREATE TABLE CROP_DATA (
CD_ID INTEGER NOT NULL,
YEAR DATE NOT NULL,
CROP_TYPE VARCHAR(20) NOT NULL,
GEO VARCHAR(20) NOT NULL,
SEEDED_AREA INTEGER NOT NULL,
HARVESTED_AREA INTEGER NOT NULL,
PRODUCTION INTEGER NOT NULL,
AVG_YIELD INTEGER NOT NULL,
PRIMARY KEY (CD_ID)
)"
)
if (df1 == -1){
cat ("An error has occurred.\n")
msg <- odbcGetErrMsg(con)
print (msg)
} else {
cat ("Table was created successfully.\n")
}
## Table was created successfully.
# Create "FARM_PRICES" table in RSQLite
df2 <- dbExecute(con,
"CREATE TABLE FARM_PRICES (
CD_ID INTEGER NOT NULL,
YEAR DATE NOT NULL,
CROP_TYPE VARCHAR(20) NOT NULL,
GEO VARCHAR(20) NOT NULL,
SEEDED_AREA INTEGER NOT NULL,
HARVESTED_AREA INTEGER NOT NULL,
PRODUCTION INTEGER NOT NULL,
AVG_YIELD INTEGER NOT NULL,
PRIMARY KEY (CD_ID)
)"
)
if (df1 == -1){
cat ("An error has occurred.\n")
msg <- odbcGetErrMsg(con)
print (msg)
} else {
cat ("Table was created successfully.\n")
}
## Table was created successfully.
# Create "DAILY_FX" table in RSQLite.
df3 <- dbExecute(con, "CREATE TABLE DAYLY_FX (
DFX_ID INTEGER NOT NULL,
DATE DATE NOT NULL,
FXUSDCAD FLOAT(6),
PRIMARY KEY (DFX_ID)
)"
)
if (df3 == -1){
cat ("An error has occurred.\n")
msg <- odbcGetErrMsg(con)
print (msg)
} else {
cat ("Table was created successfully.\n")
}
## Table was created successfully.
# Create "MONTHLY_FX" table in RSQLite.
df4 <- dbExecute(con, "CREATE TABLE MONTHLY_FX (
DFX_ID INTEGER NOT NULL,
DATE DATE NOT NULL,
FXUSDCAD FLOAT(6),
PRIMARY KEY (DFX_ID)
)"
)
if (df3 == -1){
cat ("An error has occurred.\n")
msg <- odbcGetErrMsg(con)
print (msg)
} else {
cat ("Table was created successfully.\n")
}
## Table was created successfully.
crop_df <- read.csv('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-RP0203EN-SkillsNetwork/labs/Practice%20Assignment/Annual_Crop_Data.csv', colClasses=c(YEAR="character"))
daily_df <- read.csv('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-RP0203EN-SkillsNetwork/labs/Practice%20Assignment/Daily_FX.csv', colClasses=c(date="character"))
## Warning in read.table(file = file, header = header, sep = sep, quote = quote, :
## not all columns named in 'colClasses' exist
monthly_df <-read.csv ("https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-RP0203EN-SkillsNetwork/labs/Final%20Project/Monthly_FX.csv", colClasses = c(date="character"))
## Warning in read.table(file = file, header = header, sep = sep, quote = quote, :
## not all columns named in 'colClasses' exist
farmprices_df <- read.csv("https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-RP0203EN-SkillsNetwork/labs/Final%20Project/Monthly_Farm_Prices.csv", colClasses=c(YEAR="character"))
## Warning in read.table(file = file, header = header, sep = sep, quote = quote, :
## not all columns named in 'colClasses' exist
# Write the table in table
dbWriteTable(con, "CROP_DATA", crop_df, overwrite=TRUE, header = TRUE)
dbWriteTable(con, "DAILY_DATA", daily_df, overwrite=TRUE, header = TRUE)
dbWriteTable(con, "FARMPRICES_DATA", farmprices_df, overwrite=TRUE, header = TRUE)
dbWriteTable(con, "MONTHLY_DATA", monthly_df, overwrite=TRUE, header = TRUE)
#check list of tables in the present db.
dbListTables(con)
## [1] "CROP_DATA" "DAILY_DATA" "DAYLY_FX" "FARMPRICES_DATA"
## [5] "FARM_PRICES" "MONTHLY_DATA" "MONTHLY_FX"
Subsequently, queries will be executed using SQLite to analyze the data stored in the tables. These queries will allow us to extract relevant information and gain insights into crops, agricultural prices, and exchange rates
To better understand the scope of the data available for analysis, it is essential to determine the size of the dataset.
# Execute the SQL query to count the records in the farm prices dataset
dbGetQuery(con, "SELECT COUNT(*) FROM FARMPRICES_DATA")
## COUNT(*)
## 1 2678
Additionally, it is equally important to identify the geographies included in the FARM_PRICES dataset. Knowing the geographical regions represented will enable us to evaluate the diversity and coverage of the dataset, as well as to perform comparative analyses between different areas. This can provide valuable insights into regional variations in agricultural prices and help identify specific trends and patterns for each region.
dbGetQuery(con, "SELECT DISTINCT GEO FROM FARMPRICES_DATA")
## GEO
## 1 Alberta
## 2 Saskatchewan
To gain a comprehensive understanding of the agricultural data, it is important to quantify the extent of rye cultivation in historical contexts. Specifically, how many hectares of Rye were harvested in Canada in 1968? This information will help us analyze trends in rye production and its significance within the agricultural sector during that period.
dbGetQuery(con, "
SELECT SUM(HARVESTED_AREA) AS total_hectares
FROM CROP_DATA
WHERE CROP_TYPE = 'Rye' AND GEO = 'Canada' AND strftime('%Y', YEAR) = '1968'")
## total_hectares
## 1 274100
To conduct a detailed analysis of rye pricing data, execute a query to retrieve and display the first six rows of the FARM_PRICES table specifically for rye. This will provide an initial overview of the pricing trends and variations for rye within the dataset
dbGetQuery(con, "SELECT PRICE_PRERMT, CROP_TYPE FROM FARMPRICES_DATA WHERE CROP_TYPE = 'Rye' LIMIT 6")
## PRICE_PRERMT CROP_TYPE
## 1 100.77 Rye
## 2 109.75 Rye
## 3 95.05 Rye
## 4 103.46 Rye
## 5 96.77 Rye
## 6 106.38 Rye
To understand the geographical distribution of barley cultivation, it is essential to identify the provinces that have grown barley. This information will provide insights into regional agricultural practices and the extent of barley production across different areas
dbGetQuery(con, "SELECT DISTINCT GEO FROM CROP_DATA WHERE CROP_TYPE = 'Barley'")
## GEO
## 1 Alberta
## 2 Canada
## 3 Saskatchewan
To better understand the temporal scope of the farm prices dataset, it is necessary to identify the first and last dates of the records. This will provide a clear view of the period covered by the data and enable precise historical analysis.
dbGetQuery(con, 'SELECT min(DATE) FIRST_DATE, max(DATE) LAST_DATE FROM FARMPRICES_DATA')
## FIRST_DATE LAST_DATE
## 1 1985-01-01 2020-12-01
To identify which crops have ever reached a farm price of $350 or more per metric tonne, it is necessary to perform a query on the FARM_PRICES dataset. This information will allow us to analyze which crops have had significant market value and better understand the dynamics of agricultural prices.
dbGetQuery(con, 'SELECT DISTINCT CROP_TYPE FROM FARMPRICES_DATA WHERE PRICE_PRERMT >= 350')
## CROP_TYPE
## 1 Canola
To evaluate the agricultural performance in Saskatchewan in the year 2000, it is essential to rank the crop types harvested by their average yield. This analysis helps identify which crop performed best in terms of production. Ranking crops by average yield not only provides a clear view of the efficiency of each crop type but also aids farmers and agricultural policymakers in making informed decisions about resource management and future planning. By determining which crop had the highest yield, agricultural practices can be optimized, enhancing the sustainability and profitability of agricultural production in the region.
dbGetQuery(con, "SELECT CROP_TYPE, GEO, HARVESTED_AREA
FROM CROP_DATA
WHERE GEO='Saskatchewan'AND strftime('%Y', YEAR) = '2000'
ORDER BY HARVESTED_AREA DESC
")
## CROP_TYPE GEO HARVESTED_AREA
## 1 Wheat Saskatchewan 6080300
## 2 Canola Saskatchewan 2371500
## 3 Barley Saskatchewan 1922300
## 4 Rye Saskatchewan 46500
To evaluate agricultural performance since the year 2000, it is essential to rank the crops and geographies by their average yield (KG per hectare). This analysis helps identify which crop and province had the highest average yield, providing valuable insights into regional agricultural efficiency and productivity. By examining the average yield data, farmers and policymakers can make informed decisions about resource allocation, crop selection, and agricultural practices. This information is crucial for optimizing agricultural output, ensuring food security, and enhancing the sustainability and profitability of farming operations. Identifying the top-performing crop and province allows for targeted improvements and strategic planning to boost overall agricultural productivity.
dbGetQuery(con, "SELECT CROP_TYPE, GEO , SUM(AVG_YIELD) as AVG
FROM CROP_DATA
WHERE strftime('%Y', YEAR) >= '2000'
GROUP BY CROP_TYPE, GEO
ORDER BY AVG_YIELD DESC
LIMIT 10
")
## CROP_TYPE GEO AVG
## 1 Barley Alberta 72465
## 2 Barley Canada 68329
## 3 Barley Saskatchewan 62392
## 4 Wheat Alberta 65113
## 5 Wheat Canada 59752
## 6 Rye Canada 53422
## 7 Wheat Saskatchewan 51017
## 8 Rye Alberta 56360
## 9 Rye Saskatchewan 46761
## 10 Canola Alberta 41984
dbGetQuery(con,
"SELECT SUM(HARVESTED_AREA) AS total_hectares
FROM CROP_DATA
WHERE CROP_TYPE = 'Wheat' AND GEO = 'Canada' AND strftime('%Y', YEAR) = (
SELECT MAX(strftime('%Y', YEAR))
FROM CROP_DATA
WHERE CROP_TYPE = 'Wheat' AND GEO = 'Canada')")
## total_hectares
## 1 10017800
dbGetQuery(con, "
SELECT
f.PRICE_PRERMT,
e.FXUSDCAD
FROM
FARMPRICES_DATA f , MONTHLY_DATA e
WHERE
f.CROP_TYPE = 'Canola' AND
f.GEO = 'Saskatchewan' AND
f.DATE = e.DATE
ORDER BY
f.DATE DESC
LIMIT 6
")
## PRICE_PRERMT FXUSDCAD
## 1 507.33 1.280771
## 2 495.64 1.306820
## 3 474.80 1.321471
## 4 463.52 1.322810
## 5 464.60 1.322205
## 6 462.88 1.349850
The analysis of crop production in Canada highlights the importance of understanding both current and historical performance of various crop types. This knowledge is crucial for meeting the supply chain requirements of the microbrewery and microdistillery industry, ensuring consistent and high-quality grain supplies. Additionally, the impact of exchange rates between the Canadian and US dollars on the cost of importing grains is significant, affecting overall financial planning and investment strategies.
By leveraging database management and data analysis techniques, such as creating tables and executing SQL queries, the venture capital firm can gain valuable insights into crop production metrics. This data-driven approach allows for strategic planning, identifying the most productive regions and crop types, and ultimately enhancing the sustainability and profitability of agricultural investments.