Topics to be covered
Introduction to data science, evolution of data science, work profile of a data scientist, career in data science, nature of data science, typical working day of a data scientist, importance of data science in agribusiness; defining algorithm, big data, business analytics, statistical learning, defining machine learning, defining artificial intelligence, data mining; difference between analysis and analytics, business intelligence and business analytics, typical process of business analytics cycle.
Data science, as an interdisciplinary field, has a rich history of development and transformation. Its journey, rooted in statistics and mathematics, evolved alongside advancements in computing and data processing. The evolution of data science reflects humanity’s quest to make sense of the growing complexity of data. From its roots in statistics to its modern applications in artificial intelligence, data science continues to drive innovation across industries, shaping the future of technology and decision-making.
The role of a data scientist has emerged as one of the most sought-after professions in the 21st century. Combining skills from statistics, computer science, and domain expertise, a data scientist is responsible for extracting valuable insights from structured and unstructured data to inform decisions and drive innovation.
Key Responsibilities
A data scientist’s responsibilities revolve around the entire lifecycle of data handling and analysis. Here are the primary tasks they handle:
Understanding Business Problems:
Data scientists work closely with stakeholders to define problems, set
goals, and determine how data can solve those problems. They translate
business challenges into analytical tasks.
Data Collection and Preprocessing:
Gathering data from various sources like databases, APIs, or external
datasets is a critical step. They ensure the data is clean, consistent,
and ready for analysis by addressing missing values, outliers, and
inconsistencies.
Exploratory Data Analysis (EDA):
Using statistical tools and visualization techniques, data scientists
explore the data to uncover patterns, trends, and relationships. This
phase helps refine hypotheses and guide further analysis.
Model Development:
Data scientists apply machine learning, deep learning, or statistical
algorithms to build predictive or descriptive models. These models are
designed to solve specific problems, such as forecasting demand,
detecting fraud, or recommending products.
Model Evaluation and Optimization:
Once models are developed, they are tested for accuracy, precision,
recall, and other performance metrics. Optimization techniques are used
to fine-tune the models for the best results.
Deployment and Monitoring:
Data scientists collaborate with software engineers and DevOps teams to
integrate models into production systems. They monitor these models to
ensure they perform as expected in real-world scenarios.
Communication of Insights:
Presenting findings in a clear and actionable manner is critical. Data
scientists use dashboards, reports, and visualizations to communicate
insights to non-technical stakeholders.
Skills Required
To excel as a data scientist, a combination of technical and soft skills is essential: programming, statistics and mathematics, machine learning, and data visualization on the technical side, paired with communication skills and domain knowledge.
Applications of a Data Scientist’s Work
Data scientists work across diverse industries, including e-commerce and retail, finance, healthcare, and technology.
Conclusion
The role of a data scientist is a blend of art and science. With the growing importance of data-driven decision-making, data scientists play a pivotal role in shaping strategies and innovations. Their work not only helps organizations stay competitive but also addresses societal challenges, making it a career of impact and growth.
A career in data science is a dynamic and highly sought-after field that combines expertise in mathematics, statistics, programming, and domain knowledge to analyze and interpret complex data. Data scientists extract insights from structured and unstructured data to drive business decisions, optimize processes, and predict future trends. Here’s an overview of a career in data science:
Key aspects of the career include the skills required, typical roles, the career path, education and training, industry applications, and salary range; each of these varies with experience, industry, and location.
Job Market:
The demand for data scientists continues to grow as more industries recognize the importance of data-driven decision-making. It’s considered a versatile career with opportunities in diverse industries such as finance, healthcare, retail, and technology.
Starting a career in data science typically involves mastering key programming languages, mathematical foundations, and machine learning algorithms, followed by gaining hands-on experience through projects or internships.
Data science is the interdisciplinary field that combines statistical analysis, machine learning, data processing, and domain-specific knowledge to extract meaningful insights and make data-driven decisions. It involves the use of algorithms, data models, and advanced analytics techniques to process and analyze large volumes of structured and unstructured data. In India, data science is transforming industries by providing businesses with the tools they need to innovate, optimize processes, and create value.
Its key components include statistical analysis, machine learning, data processing, and domain-specific knowledge, and its applications in India span e-commerce, healthcare diagnostics, agriculture, and financial services.
Challenges of Data Science in India:
Data Privacy and Security: With the rise of data-driven solutions, concerns about data privacy and security have become paramount. Companies must ensure that sensitive data, especially in sectors like banking and healthcare, is protected from breaches.
Data Availability and Quality: In India, many data sources are fragmented or unreliable. Clean and structured data is often hard to come by, especially in industries like agriculture and healthcare, where data collection infrastructure may be underdeveloped.
Skill Gap: Despite the growing demand for data scientists in India, there is still a gap between the skills required by employers and the skills possessed by many professionals entering the workforce. Bridging this gap through training and education is crucial for further growth in the field.
Infrastructure: In some regions, inadequate internet connectivity and computational infrastructure can make it difficult to collect, store, and analyze large datasets.
Conclusion:
Data science is playing a crucial role in transforming industries across India. Whether it’s improving customer experience in e-commerce, advancing healthcare diagnostics, optimizing agriculture practices, or enhancing financial services, data science is the backbone of many innovations in the country. With a growing focus on data literacy, the rise of AI technologies, and increasing investments in digital infrastructure, India’s data science ecosystem is set to expand significantly in the coming years.
Typical Working Day of a Data Scientist
A data scientist’s typical working day involves a variety of tasks that require both technical expertise and analytical thinking. Their day begins with reviewing emails and communications from team members or stakeholders to understand any urgent needs or issues. Once up to speed, the primary task for a data scientist often revolves around gathering, cleaning, and preparing data. This stage involves accessing databases, extracting raw data, and ensuring it’s in a usable format. The data cleaning process can be time-consuming, as raw data is often messy, incomplete, or inconsistent.
After data preparation, the data scientist dives into exploratory data analysis (EDA). EDA involves visualizing data trends and identifying patterns or outliers that could inform the model-building process. This step typically includes generating summary statistics, creating charts, and using statistical techniques to assess the data’s quality.
Once the data is understood, the data scientist moves on to feature engineering. This involves creating new features or variables from the raw data that may help improve the performance of machine learning models. Then, machine learning models are built and tested. Data scientists frequently work with various algorithms, such as linear regression, decision trees, or neural networks, to find the best one for a given problem.
Following model development, a data scientist performs model evaluation using performance metrics like accuracy, precision, recall, or AUC-ROC to assess how well the model performs. The model may need refinement based on this feedback.
Throughout the day, data scientists also collaborate with stakeholders, data engineers, and other team members, participating in meetings to discuss findings, share progress, and refine strategies. They may end their day with documentation of their work, creating reports or presentations to communicate insights to non-technical stakeholders.
Importance of Data Science in Agribusiness
Data science plays a crucial role in the agribusiness sector by transforming traditional farming and agriculture practices into more efficient, sustainable, and data-driven processes. In agribusiness, data science applications can be seen in crop forecasting, precision farming, yield prediction, pest and disease detection, and resource management.
Crop Forecasting and Yield Prediction: Data scientists use machine learning algorithms and big data to analyze historical weather patterns, soil conditions, and crop health to predict crop yields. This helps farmers and agribusinesses make informed decisions about production volumes and market pricing.
Precision Farming: By analyzing data from various sources such as sensors, drones, and satellite imagery, data science enables precision farming techniques. This allows farmers to optimize resource use (water, fertilizers, etc.) and minimize waste, resulting in cost savings and higher productivity.
Pest and Disease Detection: Data science enables early detection of diseases and pests by analyzing environmental factors and crop images. Machine learning models trained on large datasets of plant health data can identify potential threats before they become widespread, reducing the need for chemical interventions and improving sustainability.
Supply Chain Optimization: Data science improves the efficiency of supply chains by forecasting demand, managing inventories, and optimizing routes for transporting produce. This reduces waste and ensures that goods reach markets in a timely manner, improving profitability for agribusinesses.
Sustainability and Environmental Impact: Data science aids in managing the environmental impact of agribusiness. Predictive models help optimize irrigation, monitor soil health, and assess the impact of various agricultural practices on the ecosystem, promoting sustainable farming practices.
Defining Key Terms in Data Science
Algorithm: An algorithm is a set of instructions or rules designed to perform a specific task or solve a problem. In data science, algorithms are used to process data, learn from it, and make predictions or decisions without human intervention. Common algorithms include decision trees, linear regression, and k-nearest neighbors.
Big Data: Big data refers to large and complex datasets that are beyond the ability of traditional data-processing applications to handle. Big data is characterized by the 3Vs—volume, variety, and velocity. In business and analytics, big data enables organizations to gain deeper insights from massive datasets generated by social media, sensors, and other digital sources.
Business Analytics: Business analytics refers to the process of using data analysis and statistical methods to drive business decisions. It combines descriptive, predictive, and prescriptive analytics to analyze business data and make decisions based on that analysis. Business analytics helps organizations improve operations, marketing strategies, and customer relations.
Statistical Learning: Statistical learning is a framework for understanding data through statistical models and algorithms. It focuses on making predictions or inferences about a dataset using statistical methods. Examples of statistical learning techniques include regression analysis, classification, and clustering.
Machine Learning: Machine learning is a subset of artificial intelligence that involves the development of algorithms that allow computers to learn from and make predictions on data without explicit programming. Machine learning techniques include supervised learning (e.g., linear regression, decision trees), unsupervised learning (e.g., k-means clustering), and reinforcement learning.
Artificial Intelligence (AI): AI refers to the simulation of human intelligence in machines. AI enables machines to perform tasks such as decision-making, problem-solving, and language processing that would normally require human cognition. AI encompasses a wide range of technologies, including machine learning, natural language processing, and robotics.
Data Mining: Data mining is the process of discovering patterns, correlations, and insights from large datasets. It combines methods from statistics, machine learning, and database systems to extract meaningful information. Data mining can be used for tasks like fraud detection, customer segmentation, and market basket analysis.
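To make these definitions concrete, here is a minimal R sketch (using R's built-in iris dataset as an illustrative stand-in for crop measurements; the split size and k value are arbitrary assumptions) showing supervised learning, unsupervised learning, and a simple mined pattern:
# Supervised learning: k-nearest neighbours classifier (package 'class')
library(class)
set.seed(42)
train_idx <- sample(nrow(iris), 100)                  # random training split
pred <- knn(train = iris[train_idx, 1:4],
            test  = iris[-train_idx, 1:4],
            cl    = iris$Species[train_idx], k = 5)
mean(pred == iris$Species[-train_idx])                # hold-out accuracy
# Unsupervised learning: k-means clustering on the same measurements
clusters <- kmeans(iris[, 1:4], centers = 3)
# A simple mined pattern: how the clusters align with the known species
table(clusters$cluster, iris$Species)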
Difference Between Analysis and Analytics
Analysis refers to the process of examining data in order to extract useful information or insights. It involves the breakdown of complex data into smaller components for easier understanding. Data analysis can be performed using various techniques like statistical methods, hypothesis testing, or visualizations.
Analytics, on the other hand, is the broader application of data analysis to improve decision-making. Analytics involves the use of sophisticated tools, techniques, and algorithms to uncover deeper insights, predict trends, and recommend actions. Analytics can be classified into descriptive, predictive, and prescriptive types.
The following table elaborates the difference between analysis and analytics:
Aspect | Analysis | Analytics |
---|---|---|
Definition | The process of examining data to extract useful insights. | The broader application of analysis to drive decision-making. |
Focus | Understanding and interpreting data. | Applying insights to optimize and predict outcomes. |
Tools Used | Statistical methods, visualizations, descriptive techniques. | Predictive modeling, machine learning, optimization techniques. |
Goal | To uncover patterns, trends, and correlations in data. | To recommend actions, predict future trends, and optimize performance. |
Scope | Often limited to understanding past data or summarizing data. | Includes forecasting, decision-making, and process optimization. |
Outcome | Descriptive insights and historical understanding. | Actionable insights, predictions, and recommendations. |
Example Techniques | Descriptive statistics, hypothesis testing, regression. | Predictive analytics, prescriptive analytics, optimization. |
Type of Questions | What has happened? | What will happen, and what should we do about it? |
Difference Between Business Intelligence and Business Analytics
Business Intelligence (BI) focuses on the descriptive aspect of data, such as summarizing past performance using dashboards, reports, and visualizations. BI tools help organizations understand what has happened in the past and provide historical insights into business operations.
Business Analytics (BA), on the other hand, goes beyond descriptive insights and includes predictive and prescriptive analytics. BA involves the use of advanced analytics techniques like machine learning to predict future trends, optimize business processes, and provide actionable recommendations.
Aspect | Business Intelligence (BI) | Business Analytics (BA) |
---|---|---|
Definition | BI refers to the process of analyzing past business data to understand and optimize business performance. | BA involves the use of advanced analytics techniques to predict future trends, optimize business processes, and provide actionable insights. |
Focus | Descriptive insights from historical data. | Predictive and prescriptive insights for decision-making. |
Data Type | Primarily focuses on historical and current data. | Uses both historical data and predictive modeling. |
Tools Used | Dashboards, reporting tools, data visualization, OLAP cubes. | Machine learning, statistical models, data mining, optimization tools. |
Purpose | To monitor and analyze past performance. | To forecast future outcomes and suggest strategies. |
Scope | Narrow focus on understanding what has happened in the business. | Broader focus on predicting what will happen and how to optimize business strategies. |
Example Techniques | Data querying, reporting, OLAP, visualization. | Regression analysis, forecasting, clustering, decision optimization. |
Outcome | Helps businesses understand trends and historical performance. | Helps businesses make data-driven decisions and improve future performance. |
Timeframe | Focuses on the past and present. | Focuses on the future and planning. |
Decision Support | Supports decision-making based on what has happened. | Supports decision-making based on what could happen. |
Typical Process of the Business Analytics Cycle
The business analytics cycle typically follows several steps:
Problem Definition: Identify the business challenge or question that needs to be answered. This involves understanding the business context and the data requirements.
Data Collection: Gather relevant data from various sources, including internal databases, third-party providers, and public data. Data collection can also involve real-time data from sensors or IoT devices.
Data Cleaning and Preparation: The data needs to be cleaned and transformed to ensure its quality and consistency. This involves handling missing values, removing duplicates, and converting data into a usable format.
Data Exploration and Analysis: Analysts explore the data to uncover patterns, trends, and correlations. Statistical techniques and visualizations are used to understand the data better.
Modeling: In this step, various statistical or machine learning models are applied to the data to make predictions or draw inferences. The best models are selected based on performance metrics.
Interpretation and Reporting: After analyzing the models, insights are interpreted, and recommendations are made. This step often involves creating visualizations or reports that communicate the findings to stakeholders.
Model Deployment, Decision Making and Action: The insights derived from analytics are used to inform business decisions. The final step is taking action based on these insights, whether it’s improving operational efficiency, launching a new product, or optimizing marketing campaigns.
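As a compact illustration, the whole cycle can be sketched in a few lines of R; the data and effect sizes below are simulated and purely hypothetical:
# Toy walk-through of the business analytics cycle (hypothetical data)
set.seed(1)
# Steps 1-2. Problem definition and data collection: does ad spend drive revenue?
sales <- data.frame(ad_spend = runif(50, 10, 100))
sales$revenue <- 50 + 3 * sales$ad_spend + rnorm(50, sd = 20)
sales$revenue[c(5, 17)] <- NA                          # simulate missing records
# Step 3. Cleaning: impute missing revenue with the mean
sales$revenue[is.na(sales$revenue)] <- mean(sales$revenue, na.rm = TRUE)
# Step 4. Exploration: strength of the relationship
cor(sales$ad_spend, sales$revenue)
# Step 5. Modelling: fit a linear model
fit <- lm(revenue ~ ad_spend, data = sales)
# Step 6. Interpretation: each extra unit of ad spend adds about 3 units of revenue
coef(summary(fit))
# Step 7. Decision and action: predict revenue for a proposed budget before committing
predict(fit, newdata = data.frame(ad_spend = 120))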
Data science plays a critical role in transforming business operations across industries, including agribusiness, by providing valuable insights and predictive capabilities. As technology continues to evolve, the importance of data science, machine learning, and artificial intelligence in solving real-world problems will only increase, helping businesses make more informed decisions and improve overall performance.
Topics to be covered
Fundamentals of R and RStudio, fundamentals of packages of RStudio, data manipulations, data transformations, normalization, standardization, missing values imputation, dummy variables, data visualization (2D and 3D), basic architecture of machine learning analytical cycle, descriptive analytics case study covering data manipulation, measures of central tendency, measures of dispersion, measures of distribution, measures of associations, t-test, F-test, ANOVA, Chi-square test, basic statistical modeling framework.
2. Fundamentals of R Packages in Agribusiness
In the agribusiness domain, several key R packages are essential for data analysis:
- dplyr: data manipulation, useful for filtering and transforming agricultural datasets.
- ggplot2: visualizing crop yields, sales data, and other agricultural trends.
- tidyr: tidying agricultural datasets, particularly time-series crop data.
- lubridate: working with agricultural data tied to dates, such as planting or harvest dates.
Example: Installing and Loading Packages:
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.4.2
3. Data Manipulation in Agribusiness
Agribusiness often involves datasets related to crops, livestock, weather patterns, or market prices. Data manipulation involves filtering, selecting, or aggregating these datasets.
# Sample data related to crop yields and pricing
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.4.2
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
data <- data.frame(crop=c("Wheat", "Rice", "Maize", "Barley"),
yield=c(4.2, 3.5, 5.1, 2.8), # yield per hectare in tons
price_per_ton=c(300, 400, 250, 350)) # market price per ton
# Filter crops with yield above 4 tons per hectare
data %>% filter(yield > 4)
# Create a new column calculating total revenue (yield * price)
data %>% mutate(revenue = yield * price_per_ton)
4. Data Transformations in Agribusiness
Agricultural datasets often need transformations, such as converting seasonal crop yields into annual totals or aggregating data by region.
# Reshape the crop data from wide to long format
library(tidyr)
## Warning: package 'tidyr' was built under R version 4.4.2
# gather() is superseded in current tidyr; pivot_longer() is the modern equivalent
data_long <- gather(data, key="metric", value="value", yield, price_per_ton)
# Aggregating crop yield by crop type
data_summary <- data %>%
group_by(crop) %>%
summarise(total_yield = sum(yield))
5. Normalization and Standardization in Agribusiness
Normalization and standardization are useful when comparing crop yield data across regions with different scales or for machine learning models.
Normalization Example:
# Min-Max Normalization (crop yield per hectare)
normalize <- function(x) {
return((x - min(x)) / (max(x) - min(x)))
}
data$normalized_yield <- normalize(data$yield)
Standardization Example:
# Standardization (Z-score) of crop yields
standardize <- function(x) {
return((x - mean(x)) / sd(x))
}
data$standardized_yield <- standardize(data$yield)
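A quick sanity check (optional) confirms the standardized column behaves as intended:
# Verify the z-scores: mean should be ~0 (up to floating point), sd exactly 1
mean(data$standardized_yield)
sd(data$standardized_yield)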
6. Missing Values Imputation in Agribusiness
Agribusiness datasets may have missing values due to weather anomalies or incomplete data collection. Common imputation techniques include replacing missing values with the mean or median.
# Impute missing crop yield values with the mean yield
data$yield[is.na(data$yield)] <- mean(data$yield, na.rm=TRUE)
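Where yields are skewed by extreme seasons, the median is a common, more robust alternative, following the same pattern:
# Median imputation (less sensitive to outliers than the mean)
data$yield[is.na(data$yield)] <- median(data$yield, na.rm = TRUE)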
7. Dummy Variables in Agribusiness
Dummy variables are useful for categorical variables like crop type or region when modeling agricultural data.
# Create dummy variables for crop type
data$crop <- factor(c("Wheat", "Rice", "Maize", "Barley"))
data <- cbind(data, model.matrix(~crop - 1, data))
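To verify the encoding, the indicator columns generated by model.matrix() (one per factor level) can be inspected:
# Inspect the generated indicator columns, one per crop level
head(data[, c("cropBarley", "cropMaize", "cropRice", "cropWheat")])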
8. Data Visualization (2D and 3D) in Agribusiness
Install libraries (uncomment the install.packages() block below to install them):
#install.packages(c(
# "colorBlindness", "directlabels", "dplyr", "ggforce", "gghighlight",
# "ggnewscale", "ggplot2", "ggraph", "ggrepel", "ggtext", "ggthemes",
# "hexbin", "Hmisc", "mapproj", "maps", "munsell", "ozmaps",
# "paletteer", "patchwork", "rmapshaper", "scico", "seriation", "sf",
# "stars", "tidygraph", "tidyr", "wesanderson"
# ))
library(directlabels)
## Warning: package 'directlabels' was built under R version 4.4.2
library(dplyr)
library(ggforce)
## Warning: package 'ggforce' was built under R version 4.4.2
library(gghighlight)
## Warning: package 'gghighlight' was built under R version 4.4.2
library(ggnewscale)
## Warning: package 'ggnewscale' was built under R version 4.4.2
library(Hmisc)
## Warning: package 'Hmisc' was built under R version 4.4.2
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
library(ggplot2)
library(ggraph)
## Warning: package 'ggraph' was built under R version 4.4.2
library(ggrepel)
## Warning: package 'ggrepel' was built under R version 4.4.2
library(ggtext)
## Warning: package 'ggtext' was built under R version 4.4.2
library(ggthemes)
## Warning: package 'ggthemes' was built under R version 4.4.2
library(hexbin)
## Warning: package 'hexbin' was built under R version 4.4.2
library(maps)
## Warning: package 'maps' was built under R version 4.4.2
library(munsell)
## Warning: package 'munsell' was built under R version 4.4.2
library(ozmaps)
## Warning: package 'ozmaps' was built under R version 4.4.2
library(paletteer)
## Warning: package 'paletteer' was built under R version 4.4.2
library(patchwork)
## Warning: package 'patchwork' was built under R version 4.4.2
library(rmapshaper)
## Warning: package 'rmapshaper' was built under R version 4.4.2
library(scico)
## Warning: package 'scico' was built under R version 4.4.2
library(seriation)
## Warning: package 'seriation' was built under R version 4.4.2
## Registered S3 methods overwritten by 'registry':
## method from
## print.registry_field proxy
## print.registry_entry proxy
library(sf)
## Warning: package 'sf' was built under R version 4.4.2
## Linking to GEOS 3.12.2, GDAL 3.9.3, PROJ 9.4.1; sf_use_s2() is TRUE
library(stars)
## Warning: package 'stars' was built under R version 4.4.2
## Loading required package: abind
library(tidygraph)
## Warning: package 'tidygraph' was built under R version 4.4.2
##
## Attaching package: 'tidygraph'
## The following object is masked from 'package:stats':
##
## filter
library(tidyr)
library(wesanderson)
## Warning: package 'wesanderson' was built under R version 4.4.2
# The following examples use ggplot2's built-in datasets (mpg, diamonds,
# economics) to demonstrate common 2D plot types.
ggplot(mpg, aes(x=displ, y=hwy)) + geom_point()
ggplot(mpg, aes(x=model, y=manufacturer)) + geom_point()
ggplot(mpg, aes(cty, hwy)) + geom_point()
ggplot(diamonds, aes(carat, price)) + geom_point()
ggplot(economics, aes(date, unemploy)) + geom_line()
ggplot(mpg, aes(cty)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(mpg, aes(displ, hwy, colour = class)) + geom_point()
ggplot(mpg, aes(displ, hwy, shape = drv)) + geom_point()
ggplot(mpg, aes(displ, hwy, size = cyl)) + geom_point()
# Mapping colour inside aes() treats "blue" as a data value and adds a legend;
# setting colour outside aes() actually draws the points blue.
ggplot(mpg, aes(displ, hwy)) + geom_point(aes(colour="blue"))
ggplot(mpg, aes(displ, hwy)) + geom_point(colour="blue")
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
facet_wrap(~class)
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(colour=drv)) +
facet_wrap(~class)
ggplot(mpg, aes(drv, displ)) +
geom_point() +
facet_wrap(~hwy)
ggplot(mpg, aes(drv, displ)) +
geom_point() +
facet_wrap(~cyl)
ggplot(diamonds, aes(carat, price)) +
geom_smooth() +
geom_point(aes(colour=cut), alpha=0.1)
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
ggplot(diamonds, aes(price)) +
geom_boxplot()
ggplot(diamonds, aes(price)) +
geom_boxplot(aes(colour=cut))
ggplot(diamonds, aes(price)) +
geom_boxplot(aes(colour=cut)) +
facet_wrap(~color)
ggplot(diamonds, aes(price)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(diamonds, aes(price)) +
geom_freqpoly()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(diamonds, aes(price)) +
geom_freqpoly() +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(diamonds, aes(color)) +
geom_bar()
ggplot(diamonds, aes(color)) +
geom_bar(aes(fill =clarity)) +
facet_wrap(~cut)
ggplot(economics, aes(date, unemploy)) +
geom_path()
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(shape=manufacturer)) +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning: The shape palette can deal with a maximum of 6 discrete values because more
## than 6 becomes difficult to discriminate
## ℹ you have requested 15 values. Consider specifying shapes manually if you need
## that many have them.
## Warning: Removed 112 rows containing missing values or values outside the scale range
## (`geom_point()`).
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
geom_smooth() +
facet_wrap(~cyl, scales = "free")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
geom_smooth(se=FALSE) +
facet_wrap(~cyl, scales = "free")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
geom_smooth(span=0.2)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
geom_smooth(span=1)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
library(mgcv)
## Loading required package: nlme
##
## Attaching package: 'nlme'
## The following object is masked from 'package:directlabels':
##
## gapply
## The following object is masked from 'package:dplyr':
##
## collapse
## This is mgcv 1.9-1. For overview type 'help("mgcv-package")'.
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
geom_smooth(method = "gam", formula = y ~ s(x))
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:tidygraph':
##
## select
## The following object is masked from 'package:patchwork':
##
## area
## The following object is masked from 'package:dplyr':
##
## select
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
geom_smooth(method = "rlm")
## `geom_smooth()` using formula = 'y ~ x'
ggplot(mpg, aes(drv, hwy)) +
geom_point()
ggplot(mpg, aes(drv, hwy)) + geom_jitter()
ggplot(mpg, aes(drv, hwy)) + geom_boxplot()
ggplot(mpg, aes(drv, hwy)) + geom_violin()
ggplot(mpg, aes(drv, hwy)) + geom_jitter(aes(colour=class))
ggplot(mpg, aes(drv, hwy)) + geom_jitter(aes(colour=cyl))
ggplot(mpg, aes(hwy)) + geom_histogram(binwidth = 2.5)
ggplot(mpg, aes(hwy)) + geom_histogram(binwidth = 1)
ggplot(mpg, aes(hwy, color=drv)) + geom_freqpoly()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(mpg, aes(hwy, fill=drv)) +
geom_histogram(binwidth = 0.5) +
facet_wrap(~drv, ncol=1)
ggplot(economics, aes(unemploy/pop, uempmed)) +
geom_path() +
geom_point()
ggplot(economics, aes(unemploy/pop, uempmed)) +
geom_path(colour="grey50") +
geom_point(aes(colour=date))
ggplot(mpg, aes(cty, hwy)) + geom_point( alpha=1/3)
ggplot(mpg, aes(cty, hwy)) + geom_point( alpha=1/3) +
xlab("City Driving MPG") +
ylab("Highway Driving MPG")
ggplot(mpg, aes(cty, hwy)) + geom_point( alpha=1/3) +
xlab(NULL) +
ylab(NULL)
ggplot(mpg, aes(drv, hwy)) +
geom_jitter(width = .25)
ggplot(mpg, aes(drv, hwy)) +
geom_jitter(width = .25) +
xlim("r","f") +
ylim(20,30)
## Warning: Removed 136 rows containing missing values or values outside the scale range
## (`geom_point()`).
ggplot(mpg, aes(drv, hwy)) +
geom_jitter(width = .25, na.rm = TRUE) +
ylim(NA,30)
p=ggplot(mpg, aes(cty, hwy, label=class)) +
labs(x=NULL, y=NULL) +
theme(plot.title = element_text(size=12))
p + geom_point() + ggtitle("point Plot")
p + geom_text() + ggtitle("text")
p + geom_bar(stat="identity") + ggtitle("bar")
p + geom_tile() + ggtitle("Raster")
p + geom_line() + ggtitle("line")
p + geom_area() + ggtitle("area")
p + geom_path() + ggtitle("path")
p + geom_polygon() + ggtitle("polygon")
ggplot(Oxboys, aes(age, height, group= Subject)) +
geom_point() +
geom_line()
ggplot(Oxboys, aes(age, height)) +
geom_point() +
geom_line()
ggplot(Oxboys, aes(age, height, group= Subject)) +
geom_line() +
geom_smooth(method = "lm", se=FALSE)
## `geom_smooth()` using formula = 'y ~ x'
ggplot(Oxboys, aes(age, height)) +
geom_line(aes(group=Subject)) +
geom_smooth(method = "lm", linewidth=2,se=FALSE)
## `geom_smooth()` using formula = 'y ~ x'
ggplot(Oxboys, aes(Occasion, height)) +
geom_boxplot()
ggplot(Oxboys, aes(Occasion, height)) +
geom_boxplot() +
geom_line(aes(group=Subject), colour="#3366FF", alpha=0.5)
ggplot(mpg, aes(class)) + geom_bar()
ggplot(mpg, aes(class, fill= drv)) + geom_bar()
ggplot(mpg, aes(class, fill=hwy)) + geom_bar()
## Warning: The following aesthetics were dropped during statistical transformation: fill.
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
## the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
## variable into a factor?
ggplot(mpg, aes(class, fill=hwy, group=hwy)) + geom_bar()
ggplot(mpg, aes(displ, cty, group=cyl)) +
geom_boxplot()
ggplot(mpg, aes(drv)) + geom_bar()
ggplot(mpg, aes(drv, fill=hwy, group=hwy)) + geom_bar()
library(dplyr)
mpg2= mpg %>% arrange(hwy) %>% mutate(id=seq_along(hwy))
ggplot(mpg2, aes(drv, fill=hwy, group=id)) + geom_bar()
library(babynames)
## Warning: package 'babynames' was built under R version 4.4.2
hadley = dplyr::filter(babynames, name=="Hadley")
ggplot(hadley, aes(year, n)) + geom_line()
y=c(18, 11, 16)
df=data.frame(x=1:3, y=y, se=c(1.2, 0.5, 1.0))
base = ggplot(df, aes(x,y, ymin=y-se, ymax=y+se))
base +geom_crossbar()
base + geom_pointrange()
base + geom_smooth(stat="identity")
base + geom_errorbar()
base + geom_linerange()
base + geom_ribbon()
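The examples above are all 2D. For the 3D case named in the section heading, base R's persp() gives a quick surface plot; the yield-response function below is a made-up illustration, not real agronomic data:
# 3D surface: hypothetical yield response to rainfall and fertilizer
rain <- seq(0, 100, length.out = 30)
fert <- seq(0, 100, length.out = 30)
yield_surface <- outer(rain, fert,
                       function(r, f) 5 + 0.03*r + 0.02*f - 0.0003*r*f)
persp(rain, fert, yield_surface,
      xlab = "Rainfall", ylab = "Fertilizer", zlab = "Yield",
      theta = 30, phi = 20, col = "lightgreen")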
9. Basic Architecture of the Machine Learning Analytical Cycle in Agribusiness
In agribusiness, machine learning can be used to predict crop yields and market prices, or to detect plant diseases.
# Example: predicting crop yield using linear regression
# (with only 4 rows this toy model is saturated; see the note after the output)
model <- lm(yield ~ price_per_ton + crop, data=data)
summary(model)
##
## Call:
## lm(formula = yield ~ price_per_ton + crop, data = data)
##
## Residuals:
## ALL 4 residuals are 0: no residual degrees of freedom!
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.600 NaN NaN NaN
## price_per_ton -0.028 NaN NaN NaN
## cropMaize -0.500 NaN NaN NaN
## cropRice 2.100 NaN NaN NaN
## cropWheat NA NA NA NA
##
## Residual standard error: NaN on 0 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: NaN
## F-statistic: NaN on 3 and 0 DF, p-value: NA
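As the output shows, four rows leave no residual degrees of freedom, so the toy model is saturated and its standard errors are undefined. A sketch with a larger simulated dataset (all values hypothetical) yields meaningful estimates:
# Simulate 200 field records so the regression has residual degrees of freedom
set.seed(7)
fields <- data.frame(rainfall_mm   = runif(200, 300, 900),
                     fertilizer_kg = runif(200, 50, 150))
fields$yield <- 1 + 0.004 * fields$rainfall_mm +
  0.02 * fields$fertilizer_kg + rnorm(200, sd = 0.4)
fit <- lm(yield ~ rainfall_mm + fertilizer_kg, data = fields)
summary(fit)$coefficients   # estimates now come with valid standard errors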
10. Descriptive Analytics in Agribusiness - Case Study
Descriptive analytics helps summarize key agricultural metrics like crop yields, weather patterns, and market trends.
Data Manipulation and Measures of Central Tendency:
# Summarizing yield by crop type
yield_summary <- data %>%
group_by(crop) %>%
summarise(mean_yield = mean(yield), median_yield = median(yield))
# Standard deviation of crop yields
sd_yield <- sd(data$yield)
# Skewness of yield
library(e1071)
## Warning: package 'e1071' was built under R version 4.4.2
##
## Attaching package: 'e1071'
## The following object is masked from 'package:Hmisc':
##
## impute
skewness(data$yield)
## [1] 0.09469508
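The companion measure of distribution shape, kurtosis, is available from the same e1071 package:
# Kurtosis of yield (e1071 is already loaded)
kurtosis(data$yield)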
# Correlation between price and yield
cor(data$price_per_ton, data$yield)
## [1] -0.8140999
Applying Statistical Tests on the Dummy Dataset
Below is an example of how to create a dummy dataset for performing t-tests, F-tests, ANOVA, and Chi-square tests in the context of agribusiness. The dataset includes information about crop yields, regions, and crop types, allowing us to apply these statistical tests.
Creating the Dummy Dataset
Hypothesis statements and interpretations for each test are summarized in the table after the code.
# Load necessary library
library(dplyr)
# Create a dummy dataset for agribusiness analysis
set.seed(123)
data <- data.frame(
region = sample(c("North", "South", "East", "West"), 100, replace = TRUE),
crop = sample(c("Wheat", "Rice", "Maize", "Barley"), 100, replace = TRUE),
yield = c(rnorm(25, mean = 5, sd = 1), rnorm(25, mean = 4.8, sd = 0.9),
rnorm(25, mean = 4.5, sd = 1.2), rnorm(25, mean = 5.1, sd = 1.1)),
price_per_ton = c(rnorm(25, mean = 300, sd = 30), rnorm(25, mean = 320, sd = 25),
rnorm(25, mean = 290, sd = 35), rnorm(25, mean = 310, sd = 28))
)
# View the first few rows of the dataset
head(data)
1. T-Test: Comparing Crop Yields Between Two Regions (e.g., North vs. South)
# Perform T-test to compare crop yields between North and South regions
t_test_result <- t.test(yield ~ region, data = data[data$region %in% c("North", "South"),])
print(t_test_result)
##
## Welch Two Sample t-test
##
## data: yield by region
## t = 1.5896, df = 50.589, p-value = 0.1182
## alternative hypothesis: true difference in means between group North and group South is not equal to 0
## 95 percent confidence interval:
## -0.1117820 0.9612051
## sample estimates:
## mean in group North mean in group South
## 5.045286 4.620575
2. ANOVA: Comparing Mean Yields Across Crop Types
# Perform ANOVA to compare mean yield between different crop types
anova_result <- aov(yield ~ crop, data = data)
summary(anova_result)
## Df Sum Sq Mean Sq F value Pr(>F)
## crop 3 1.02 0.3402 0.336 0.799
## Residuals 96 97.13 1.0118
3. Chi-Square Test: Examining the Relationship Between Crop Type and Region
# Perform Chi-Square test to examine the relationship between crop type and region
chisq_test <- chisq.test(table(data$crop, data$region))
## Warning in chisq.test(table(data$crop, data$region)): Chi-squared approximation
## may be incorrect
print(chisq_test)
##
## Pearson's Chi-squared test
##
## data: table(data$crop, data$region)
## X-squared = 12.367, df = 9, p-value = 0.1934
4. F-Test: Comparing Variances in Crop Yields Between North and South Regions
# Perform F-test to compare variances in crop yields between North and South regions
var_test <- var.test(yield ~ region, data = data[data$region %in% c("North", "South"),])
print(var_test)
##
## F test to compare two variances
##
## data: yield by region
## F = 1.6336, num df = 27, denom df = 25, p-value = 0.2212
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.7395564 3.5654132
## sample estimates:
## ratio of variances
## 1.633583
Test | Null Hypothesis (H0) | Alternative Hypothesis (Ha) | Interpretation |
---|---|---|---|
T-Test | Mean yields are equal in North and South regions. | Mean yields are not equal. | Use p-value to determine significance. |
ANOVA | Mean yields are equal across crop types. | At least one mean yield differs. | If significant, follow up with post-hoc tests. |
Chi-Square Test | Crop type and region are independent. | Crop type and region are related. | Check p-value for dependence. |
F-Test | Variances in yields are equal in North and South. | Variances are not equal. | Evaluate p-value for variance differences. |
This structure provides clear hypotheses and actionable interpretations, making the results more meaningful for agribusiness analysis.
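As a follow-up sketch for the ANOVA row above: had the test been significant, Tukey's HSD could be run directly on the fitted object (shown here only to illustrate the post-hoc step, since this ANOVA was not significant):
# Post-hoc pairwise comparisons after ANOVA (Tukey's Honest Significant Difference)
TukeyHSD(anova_result)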