N. Uttam Singh¹

Abhishek Thakur¹*

Eric Rani¹

*Corresponding Author Email: thakurabhishek7188@gmail.com

¹ICAR Research Complex for NEH Region

Umiam, Meghalaya

Introduction

Regression analysis is one of the most widely used statistical techniques in agricultural research. It is used to study the relationship between crop yield and various influencing factors such as fertilizer application, rainfall, irrigation, temperature, and soil nutrients.

Multiple regression analysis helps quantify these relationships and enables prediction of crop yield under varying conditions.

RStudio provides an efficient and user-friendly environment for running regression analyses in R, visualizing the results, and interpreting the statistical output.

Objectives

The objectives of this practical tutorial are:

To understand the concept of multiple regression analysis
To create and manage agricultural datasets in RStudio
To fit a multiple linear regression model
To interpret regression coefficients and statistical output
To generate regression plots and diagnostic plots

Software Requirements

Software	Purpose
R Software	Statistical Computing
RStudio	Integrated Development Environment

Recommended Versions

R version 4.0 or above
Latest version of RStudio

Introduction to R and RStudio

R is an open-source programming language widely used for statistical analysis, data visualization, and predictive modelling.

RStudio is an Integrated Development Environment (IDE) for R.

Main Components of RStudio

Source Editor
Console
Environment/History
Files/Plots/Packages/Help

Agricultural Dataset Description

In this tutorial, a hypothetical agricultural dataset of 40 observations is used to study the effect of fertilizer application, rainfall, and irrigation on crop yield.

Variables Used

Variable	Description	Unit
fertilizer	Amount of fertilizer applied	kg/ha
rainfall	Seasonal rainfall received	mm
irrigation	Irrigation hours supplied	hours
yield	Crop yield	quintal/ha

Load Required Packages

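# Install once if needed: install.packages(c("readxl", "ggplot2", "dplyr", "knitr", "car", "lmtest"))
library(readxl)   # read_excel() for importing .xlsx files
library(ggplot2)  # graphics (optional; base graphics are used below)
library(dplyr)    # data manipulation (optional)
library(knitr)    # report tables (optional)
library(car)      # vif()
library(lmtest)   # dwtest()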

Import Data

# The file must be in the current working directory (check with getwd())
Datanew <- read_excel("cropdataregress.xlsx")

head(Datanew)

## # A tibble: 6 × 4
##   fertilizer rainfall irrigation yield
##        <dbl>    <dbl>      <dbl> <dbl>
## 1         40      820         12    28
## 2         42      790         13    30
## 3         38      760         11    26
## 4         50      880         15    36
## 5         55      910         16    40
## 6         48      850         14    34

Structure of Dataset

str(Datanew)

## tibble [40 × 4] (S3: tbl_df/tbl/data.frame)
##  $ fertilizer: num [1:40] 40 42 38 50 55 48 60 62 45 52 ...
##  $ rainfall  : num [1:40] 820 790 760 880 910 850 940 960 810 900 ...
##  $ irrigation: num [1:40] 12 13 11 15 16 14 17 18 13 15 ...
##  $ yield     : num [1:40] 28 30 26 36 40 34 44 46 31 38 ...
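For readers who do not have the Excel file, a small data frame can be typed directly in R. The sketch below uses only the first ten observations printed by str() above. The object name crop10 is an assumption introduced here, and results from this subset will differ from those of the full 40-row dataset.

```r
# First ten observations, copied from the str() output above.
# 'crop10' is a hypothetical name used only for this illustration.
crop10 <- data.frame(
  fertilizer = c(40, 42, 38, 50, 55, 48, 60, 62, 45, 52),            # kg/ha
  rainfall   = c(820, 790, 760, 880, 910, 850, 940, 960, 810, 900),  # mm
  irrigation = c(12, 13, 11, 15, 16, 14, 17, 18, 13, 15),            # hours
  yield      = c(28, 30, 26, 36, 40, 34, 44, 46, 31, 38)             # quintal/ha
)

str(crop10)      # check the variable types
head(crop10, 6)  # should match the first six rows shown by head(Datanew)
```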

Summary Statistics

summary(Datanew)

##    fertilizer       rainfall        irrigation        yield      
##  Min.   :38.00   Min.   : 760.0   Min.   :11.00   Min.   :26.00  
##  1st Qu.:45.75   1st Qu.: 827.5   1st Qu.:13.00   1st Qu.:31.75  
##  Median :53.50   Median : 897.5   Median :15.00   Median :38.50  
##  Mean   :53.35   Mean   : 889.6   Mean   :15.40   Mean   :38.40  
##  3rd Qu.:60.25   3rd Qu.: 942.5   3rd Qu.:17.25   3rd Qu.:44.25  
##  Max.   :69.00   Max.   :1010.0   Max.   :20.00   Max.   :52.00

Scatter Plot

plot(Datanew$fertilizer,
     Datanew$yield,
     main = "Effect of Fertilizer on Crop Yield",
     xlab = "Fertilizer (kg/ha)",
     ylab = "Crop Yield (quintal/ha)",
     pch = 19)

abline(lm(yield ~ fertilizer, data = Datanew),
       col = "blue",
       lwd = 2)
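Because the regression model uses three predictors, it can help to look at all pairwise relationships rather than one scatter plot at a time. A minimal sketch, assuming the ten observations printed by str() earlier (the name crop10 is hypothetical), computes the correlation matrix with base R. Correlations for the full 40 observations may differ.

```r
# Ten-row subset copied from the str() output; 'crop10' is a hypothetical name
crop10 <- data.frame(
  fertilizer = c(40, 42, 38, 50, 55, 48, 60, 62, 45, 52),
  rainfall   = c(820, 790, 760, 880, 910, 850, 940, 960, 810, 900),
  irrigation = c(12, 13, 11, 15, 16, 14, 17, 18, 13, 15),
  yield      = c(28, 30, 26, 36, 40, 34, 44, 46, 31, 38)
)

# Pearson correlations between every pair of variables
cor_mat <- cor(crop10)
round(cor_mat, 3)

# pairs(crop10) would draw the corresponding scatter plot matrix
```

Strong correlations among the predictors themselves are an early warning of multicollinearity.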

Multiple Regression Analysis

Fitting Regression Model

model <- lm(yield ~ fertilizer + rainfall + irrigation,
            data = Datanew)

model

## 
## Call:
## lm(formula = yield ~ fertilizer + rainfall + irrigation, data = Datanew)
## 
## Coefficients:
## (Intercept)   fertilizer     rainfall   irrigation  
##   -17.18765      0.50468      0.02209      0.58514
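The printed coefficients define the fitted equation yield = -17.18765 + 0.50468 × fertilizer + 0.02209 × rainfall + 0.58514 × irrigation. As a rough sketch, the code below applies this equation by hand to one set of input values. The inputs (50 kg/ha, 880 mm, 15 hours) are taken from the fourth row of the dataset, and the object names b, x_new, and pred_yield are assumptions.

```r
# Coefficients as printed by 'model' (rounded to five decimals)
b <- c(intercept = -17.18765, fertilizer = 0.50468,
       rainfall = 0.02209, irrigation = 0.58514)

# Input values: 1 for the intercept, then the three predictors
x_new <- c(1, fertilizer = 50, rainfall = 880, irrigation = 15)

# Predicted yield = sum of coefficient * input value
pred_yield <- sum(b * x_new)
round(pred_yield, 2)   # about 36.26 quintal/ha (observed yield in row 4: 36)

# With the fitted model object available, the equivalent call is:
# predict(model, newdata = data.frame(fertilizer = 50, rainfall = 880, irrigation = 15))
```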

Regression Summary

summary(model)

## 
## Call:
## lm(formula = yield ~ fertilizer + rainfall + irrigation, data = Datanew)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.99759 -0.24834 -0.02828  0.21340  0.93317 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.187649   3.386389  -5.076 1.19e-05 ***
## fertilizer    0.504680   0.069382   7.274 1.44e-08 ***
## rainfall      0.022090   0.007223   3.058  0.00418 ** 
## irrigation    0.585138   0.176587   3.314  0.00211 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3945 on 36 degrees of freedom
## Multiple R-squared:  0.9976, Adjusted R-squared:  0.9974 
## F-statistic:  4913 on 3 and 36 DF,  p-value: < 2.2e-16

ANOVA Table

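# anova() reports sequential (Type I) sums of squares:
# each term is tested after the terms listed above it
anova(model)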

## Analysis of Variance Table
## 
## Response: yield
##            Df  Sum Sq Mean Sq  F value    Pr(>F)    
## fertilizer  1 2290.04 2290.04 14714.42 < 2.2e-16 ***
## rainfall    1    2.25    2.25    14.43 0.0005402 ***
## irrigation  1    1.71    1.71    10.98 0.0021068 ** 
## Residuals  36    5.60    0.16                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Understanding Regression Output

Regression Coefficients

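Each coefficient estimates the change in crop yield for a one-unit increase in that independent variable, holding the other variables constant. For example, the fertilizer coefficient of 0.505 means that each additional kg/ha of fertilizer is associated with about 0.505 quintal/ha more yield at fixed rainfall and irrigation.

A positive coefficient indicates that yield increases as the variable increases
A negative coefficient indicates that yield decreases as the variable increases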

p-value

Decision rule at the 5% level of significance:

p-value	Interpretation
p < 0.05	Significant
p ≥ 0.05	Not Significant
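Each t value in the summary table is the estimate divided by its standard error, and the p-value is the two-sided tail probability of a t distribution with the residual degrees of freedom (36 here). The following sketch reproduces these quantities, together with 95% confidence intervals, from the printed estimates and standard errors. The object names are assumptions, and small differences from the summary output arise from rounding.

```r
# Estimates and standard errors copied from summary(model)
est <- c(fertilizer = 0.504680, rainfall = 0.022090, irrigation = 0.585138)
se  <- c(fertilizer = 0.069382, rainfall = 0.007223, irrigation = 0.176587)
df_resid <- 36                                  # residual degrees of freedom

t_val   <- est / se                             # t statistic
p_val   <- 2 * pt(-abs(t_val), df = df_resid)   # two-sided p-value
ci_half <- qt(0.975, df = df_resid) * se        # half-width of the 95% interval

result <- data.frame(
  estimate    = est,
  t_value     = round(t_val, 3),
  p_value     = signif(p_val, 3),
  lower       = est - ci_half,
  upper       = est + ci_half,
  significant = p_val < 0.05
)
print(result)
```

All three predictors are significant at the 5% level, and none of the 95% intervals includes zero.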

R-squared

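R² measures the proportion of the variation in crop yield that is explained by the regression model. Adjusted R² penalizes additional predictors, so it is the better measure when comparing models with different numbers of variables. In this model R² = 0.9976 and adjusted R² = 0.9974.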

R² Value	Interpretation
Close to 1	Good Fit
Close to 0	Poor Fit
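R² can be recovered from the ANOVA table shown earlier as one minus the ratio of the residual sum of squares to the total sum of squares. A minimal sketch using the printed (rounded) sums of squares follows. The variable names are assumptions, and the F value differs slightly from the summary output because of rounding.

```r
# Sums of squares copied from anova(model)
ss_terms <- c(fertilizer = 2290.04, rainfall = 2.25, irrigation = 1.71)
ss_resid <- 5.60
n <- 40   # number of observations
p <- 3    # number of predictors

ss_total <- sum(ss_terms) + ss_resid
r2       <- 1 - ss_resid / ss_total
adj_r2   <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
f_stat   <- (sum(ss_terms) / p) / (ss_resid / (n - p - 1))

round(c(R2 = r2, adj_R2 = adj_r2), 4)   # 0.9976 and 0.9974, as in summary(model)
f_stat                                   # close to the reported F-statistic
```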

Regression Diagnostics

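par(mfrow = c(2, 2))  # arrange the four diagnostic plots in a 2 x 2 grid
plot(model)
par(mfrow = c(1, 1))  # restore the default single-plot layout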

The above command generates:

Residuals vs Fitted Plot
Normal Q-Q Plot
Scale-Location Plot
Residuals vs Leverage Plot

Interpretation of Diagnostic Plots

Residuals vs Fitted Plot

This plot checks:

Linearity
Constant variance

A random scatter of points around the horizontal zero line, with no curve or funnel shape, suggests that both assumptions are reasonable.

Normal Q-Q Plot

This plot checks normality of residuals.

The points should fall approximately along the straight reference line; marked departures at the ends suggest non-normal residuals.

Scale-Location Plot

This plot checks homoscedasticity (constant variance of the residuals).

A roughly horizontal trend line with an even spread of points along it indicates constant variance.

Residuals vs Leverage Plot

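This plot helps identify influential observations: points that combine high leverage with a large standardized residual, especially those lying beyond the Cook's distance contours.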
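The four plots are built from quantities that can also be extracted directly. The sketch below is illustrative only: it refits the model on the ten observations printed by str() earlier (the names crop10, fit10, and diag_tbl are hypothetical), so the values will not match the full 40-row model. The cutoff 4/n for Cook's distance is a common rule of thumb, not a formal test.

```r
# Ten-row subset copied from the str() output; 'crop10' is a hypothetical name
crop10 <- data.frame(
  fertilizer = c(40, 42, 38, 50, 55, 48, 60, 62, 45, 52),
  rainfall   = c(820, 790, 760, 880, 910, 850, 940, 960, 810, 900),
  irrigation = c(12, 13, 11, 15, 16, 14, 17, 18, 13, 15),
  yield      = c(28, 30, 26, 36, 40, 34, 44, 46, 31, 38)
)

fit10 <- lm(yield ~ fertilizer + rainfall + irrigation, data = crop10)

# Quantities plotted in the four diagnostic panels
diag_tbl <- data.frame(
  fitted    = fitted(fit10),          # x-axis of Residuals vs Fitted
  residual  = resid(fit10),           # y-axis of Residuals vs Fitted
  std_resid = rstandard(fit10),       # used in Q-Q and Scale-Location plots
  leverage  = hatvalues(fit10),       # x-axis of Residuals vs Leverage
  cooks_d   = cooks.distance(fit10)   # influence of each observation
)
round(diag_tbl, 3)

# Observations flagged by the 4/n rule of thumb (may be none)
which(diag_tbl$cooks_d > 4 / nrow(crop10))
```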

Variance Inflation Factor (VIF)

Variance Inflation Factor (VIF) is used to detect multicollinearity among independent variables in a regression model.

car::vif(model)

## fertilizer   rainfall irrigation 
##   97.83653   68.15842   54.81909

Interpretation of VIF

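Commonly used rules of thumb are:

VIF < 5 indicates low multicollinearity
VIF between 5 and 10 indicates moderate multicollinearity
VIF > 10 indicates a serious multicollinearity problem

In this model all three VIF values (97.8, 68.2, and 54.8) are far above 10, which indicates severe multicollinearity among fertilizer, rainfall, and irrigation. The individual coefficient estimates should therefore be interpreted with caution, even though the overall fit is very high.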
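The VIF of a predictor equals 1 / (1 - R²), where R² comes from regressing that predictor on the remaining predictors. The sketch below implements this definition with base R on the ten observations printed by str() earlier. The helper vif_manual() and the name crop10 are assumptions introduced for illustration, and the values will differ from the car::vif() output for the full data.

```r
# Ten-row subset copied from the str() output; 'crop10' is a hypothetical name
crop10 <- data.frame(
  fertilizer = c(40, 42, 38, 50, 55, 48, 60, 62, 45, 52),
  rainfall   = c(820, 790, 760, 880, 910, 850, 940, 960, 810, 900),
  irrigation = c(12, 13, 11, 15, 16, 14, 17, 18, 13, 15),
  yield      = c(28, 30, 26, 36, 40, 34, 44, 46, 31, 38)
)

# Hypothetical helper: VIF from auxiliary regressions
vif_manual <- function(data, predictors) {
  sapply(predictors, function(v) {
    others <- setdiff(predictors, v)
    # Regress predictor v on all other predictors
    aux <- lm(reformulate(others, response = v), data = data)
    1 / (1 - summary(aux)$r.squared)
  })
}

vif10 <- vif_manual(crop10, c("fertilizer", "rainfall", "irrigation"))
round(vif10, 2)

# Auxiliary R-squared implied by the VIF values printed by car::vif(model)
vif_full <- c(fertilizer = 97.83653, rainfall = 68.15842, irrigation = 54.81909)
round(1 - 1 / vif_full, 4)   # each is above 0.98
```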

Durbin-Watson Test

The Durbin-Watson test is used to detect first-order autocorrelation among the residuals of a regression model.

lmtest::dwtest(model)

## 
##  Durbin-Watson test
## 
## data:  model
## DW = 1.617, p-value = 0.1078
## alternative hypothesis: true autocorrelation is greater than 0

Interpretation of Durbin-Watson Test

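The Durbin-Watson statistic ranges from 0 to 4:

A value close to 2 indicates no first-order autocorrelation
A value well below 2 indicates positive autocorrelation
A value well above 2 indicates negative autocorrelation

Here DW = 1.617 with a p-value of 0.1078, so the null hypothesis of no positive autocorrelation is not rejected at the 5% level.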
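The Durbin-Watson statistic is the sum of squared differences between successive residuals divided by the sum of squared residuals. A minimal sketch of this formula, again using the ten observations printed by str() earlier (crop10, fit10, and dw_manual are hypothetical names), is given below. The statistic depends on the order of the observations and will not equal the value reported for the full model.

```r
# Ten-row subset copied from the str() output; 'crop10' is a hypothetical name
crop10 <- data.frame(
  fertilizer = c(40, 42, 38, 50, 55, 48, 60, 62, 45, 52),
  rainfall   = c(820, 790, 760, 880, 910, 850, 940, 960, 810, 900),
  irrigation = c(12, 13, 11, 15, 16, 14, 17, 18, 13, 15),
  yield      = c(28, 30, 26, 36, 40, 34, 44, 46, 31, 38)
)

fit10 <- lm(yield ~ fertilizer + rainfall + irrigation, data = crop10)

e <- resid(fit10)                         # residuals in observation order
dw_manual <- sum(diff(e)^2) / sum(e^2)    # Durbin-Watson statistic
dw_manual                                 # always lies between 0 and 4
```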

Applications in Agriculture

Regression analysis has wide applications in agricultural sciences.

Major Applications

Crop yield prediction
Fertilizer recommendation studies
Rainfall impact assessment
Soil nutrient analysis
Irrigation management
Agricultural economic forecasting
Pest and disease modelling
Climate change impact studies

Conclusion

Multiple regression analysis is an important statistical tool for agricultural research and decision-making.

RStudio provides a powerful environment for:

Data analysis
Model fitting
Graphical visualization
Statistical interpretation
Prediction modelling

The methods explained in this tutorial can be extended to advanced predictive agricultural analytics.

References

Montgomery, D.C., Peck, E.A. and Vining, G.G. Introduction to Linear Regression Analysis.
Kutner, M.H., Nachtsheim, C.J., Neter, J. and Li, W. Applied Linear Statistical Models.
R Core Team. R: A Language and Environment for Statistical Computing.
https://cran.r-project.org/
https://posit.co/download/rstudio-desktop/

Regression Analysis in RStudio Using an Agricultural Dataset

R Codes for Regression Analysis

2026-05-22