N. Uttam Singh¹

Abhishek Thakur¹*

Eric Rani¹

*Corresponding Author Email:

¹ICAR Research Complex for NEH Region

Umiam, Meghalaya

Introduction

Regression analysis is one of the most widely used statistical techniques in agricultural research. It is used to study the relationship between crop yield and various influencing factors such as fertilizer application, rainfall, irrigation, temperature, and soil nutrients.

Multiple regression analysis helps quantify these relationships and enables prediction of crop yield under varying conditions.

RStudio provides an efficient and user-friendly environment for performing regression analysis, graphical visualization, and statistical interpretation.

Objectives

The objectives of this practical tutorial are:

  • To understand the concept of multiple regression analysis
  • To create and manage agricultural datasets in RStudio
  • To fit a multiple linear regression model
  • To interpret regression coefficients and statistical output
  • To generate regression plots and diagnostic plots

Software Requirements

Software    Purpose
R           Statistical computing
RStudio     Integrated development environment (IDE)

Introduction to R and RStudio

R is an open-source programming language widely used for statistical analysis, data visualization, and predictive modelling.

RStudio is an Integrated Development Environment (IDE) for R.

Main Components of RStudio

  1. Source Editor
  2. Console
  3. Environment/History
  4. Files/Plots/Packages/Help

Agricultural Dataset Description

In this tutorial, a hypothetical agricultural dataset is used to study the effect of:

  • Fertilizer application
  • Rainfall
  • Irrigation

on crop yield.

Variables Used

Variable      Description                     Unit
fertilizer    Amount of fertilizer applied    kg/ha
rainfall      Seasonal rainfall received      mm
irrigation    Irrigation hours supplied       hours
yield         Crop yield                      quintal/ha

Load Required Packages

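library(readxl)   # read_excel() for importing the Excel file
library(ggplot2)  # graphics
library(dplyr)    # data manipulation
library(knitr)    # report generation
library(car)      # vif() for multicollinearity
library(lmtest)   # dwtest() for autocorrelation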

Import Data

Datanew <- read_excel("cropdataregress.xlsx")

head(Datanew)
## # A tibble: 6 × 4
##   fertilizer rainfall irrigation yield
##        <dbl>    <dbl>      <dbl> <dbl>
## 1         40      820         12    28
## 2         42      790         13    30
## 3         38      760         11    26
## 4         50      880         15    36
## 5         55      910         16    40
## 6         48      850         14    34

Structure of Dataset

str(Datanew)
## tibble [40 × 4] (S3: tbl_df/tbl/data.frame)
##  $ fertilizer: num [1:40] 40 42 38 50 55 48 60 62 45 52 ...
##  $ rainfall  : num [1:40] 820 790 760 880 910 850 940 960 810 900 ...
##  $ irrigation: num [1:40] 12 13 11 15 16 14 17 18 13 15 ...
##  $ yield     : num [1:40] 28 30 26 36 40 34 44 46 31 38 ...
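
Readers without the Excel file can still follow the later steps on a small stand-in. The sketch below assumes only the first ten values of each column printed by str() above; the name crop10 is hypothetical, and results from ten rows will not match the 40-row output shown in this tutorial.

```r
# Stand-in data: the first ten values of each column as printed by str(Datanew).
# 'crop10' is a hypothetical name; the full dataset has 40 rows.
crop10 <- data.frame(
  fertilizer = c(40, 42, 38, 50, 55, 48, 60, 62, 45, 52),
  rainfall   = c(820, 790, 760, 880, 910, 850, 940, 960, 810, 900),
  irrigation = c(12, 13, 11, 15, 16, 14, 17, 18, 13, 15),
  yield      = c(28, 30, 26, 36, 40, 34, 44, 46, 31, 38)
)

# Same structure check as used for the full dataset
str(crop10)
```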

Summary Statistics

summary(Datanew)
##    fertilizer       rainfall        irrigation        yield      
##  Min.   :38.00   Min.   : 760.0   Min.   :11.00   Min.   :26.00  
##  1st Qu.:45.75   1st Qu.: 827.5   1st Qu.:13.00   1st Qu.:31.75  
##  Median :53.50   Median : 897.5   Median :15.00   Median :38.50  
##  Mean   :53.35   Mean   : 889.6   Mean   :15.40   Mean   :38.40  
##  3rd Qu.:60.25   3rd Qu.: 942.5   3rd Qu.:17.25   3rd Qu.:44.25  
##  Max.   :69.00   Max.   :1010.0   Max.   :20.00   Max.   :52.00

Scatter Plot

plot(Datanew$fertilizer,
     Datanew$yield,
     main = "Effect of Fertilizer on Crop Yield",
     xlab = "Fertilizer (kg/ha)",
     ylab = "Crop Yield (quintal/ha)",
     pch = 19)

abline(lm(yield ~ fertilizer, data = Datanew),
       col = "blue",
       lwd = 2)
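
The scatter plot above covers a single predictor. Since the model uses three, a correlation matrix and a scatterplot matrix can show all pairwise relationships at once. This is a minimal sketch on a ten-row stand-in (the values visible in the str() output; crop10 is a hypothetical name), not on the full dataset.

```r
# Ten-row stand-in taken from the str() output; hypothetical name 'crop10'
crop10 <- data.frame(
  fertilizer = c(40, 42, 38, 50, 55, 48, 60, 62, 45, 52),
  rainfall   = c(820, 790, 760, 880, 910, 850, 940, 960, 810, 900),
  irrigation = c(12, 13, 11, 15, 16, 14, 17, 18, 13, 15),
  yield      = c(28, 30, 26, 36, 40, 34, 44, 46, 31, 38)
)

# Pearson correlations between every pair of variables
cor_mat <- cor(crop10)
round(cor_mat, 3)

# One scatter plot per pair of variables
pairs(crop10, pch = 19, main = "Pairwise Relationships")
```

High correlations among the predictors themselves are an early warning of multicollinearity.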

Multiple Regression Analysis

Fitting Regression Model

model <- lm(yield ~ fertilizer + rainfall + irrigation,
            data = Datanew)

model
## 
## Call:
## lm(formula = yield ~ fertilizer + rainfall + irrigation, data = Datanew)
## 
## Coefficients:
## (Intercept)   fertilizer     rainfall   irrigation  
##   -17.18765      0.50468      0.02209      0.58514
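
The fitted coefficients define a prediction equation. As a worked check, the sketch below plugs the first observation from head(Datanew) into the printed coefficients; the vector names b and x are illustrative only.

```r
# Coefficients copied from the printed model output above
b <- c(intercept = -17.18765, fertilizer = 0.50468,
       rainfall = 0.02209, irrigation = 0.58514)

# Predictor values of the first observation (observed yield = 28)
x <- c(fertilizer = 40, rainfall = 820, irrigation = 12)

# Predicted yield = intercept + sum of (coefficient * predictor value)
yield_hat <- b[["intercept"]] + sum(b[names(x)] * x)
yield_hat
```

The result is about 28.1 quintal/ha, close to the observed value of 28. On the fitted object, predict(model, newdata = ...) performs the same calculation.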

Regression Summary

summary(model)
## 
## Call:
## lm(formula = yield ~ fertilizer + rainfall + irrigation, data = Datanew)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.99759 -0.24834 -0.02828  0.21340  0.93317 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.187649   3.386389  -5.076 1.19e-05 ***
## fertilizer    0.504680   0.069382   7.274 1.44e-08 ***
## rainfall      0.022090   0.007223   3.058  0.00418 ** 
## irrigation    0.585138   0.176587   3.314  0.00211 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3945 on 36 degrees of freedom
## Multiple R-squared:  0.9976, Adjusted R-squared:  0.9974 
## F-statistic:  4913 on 3 and 36 DF,  p-value: < 2.2e-16
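
Each row of the coefficient table follows from the estimate and its standard error. The sketch below reproduces the t value and p-value for fertilizer from the printed numbers and adds a 95% confidence interval; the variable names are illustrative.

```r
# Values copied from the fertilizer row of summary(model)
est      <- 0.504680   # estimate
se       <- 0.069382   # standard error
df_resid <- 36         # residual degrees of freedom

# t value = estimate / standard error
t_value <- est / se

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_value), df = df_resid)

# 95% confidence interval for the coefficient
ci <- est + c(-1, 1) * qt(0.975, df = df_resid) * se

t_value
p_value
ci
```

On the fitted object, confint(model) returns such intervals for all coefficients.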

ANOVA Table

anova(model)
## Analysis of Variance Table
## 
## Response: yield
##            Df  Sum Sq Mean Sq  F value    Pr(>F)    
## fertilizer  1 2290.04 2290.04 14714.42 < 2.2e-16 ***
## rainfall    1    2.25    2.25    14.43 0.0005402 ***
## irrigation  1    1.71    1.71    10.98 0.0021068 ** 
## Residuals  36    5.60    0.16                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
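
anova() on a single lm object reports sequential sums of squares, so each row measures what a term adds after the terms listed before it. The sketch below illustrates this on a ten-row stand-in (values from the str() output; crop10, fit_a and fit_b are hypothetical names) by fitting the same model with the predictors in two different orders.

```r
# Ten-row stand-in taken from the str() output; hypothetical name 'crop10'
crop10 <- data.frame(
  fertilizer = c(40, 42, 38, 50, 55, 48, 60, 62, 45, 52),
  rainfall   = c(820, 790, 760, 880, 910, 850, 940, 960, 810, 900),
  irrigation = c(12, 13, 11, 15, 16, 14, 17, 18, 13, 15),
  yield      = c(28, 30, 26, 36, 40, 34, 44, 46, 31, 38)
)

# Same model, predictors entered in two different orders
fit_a <- lm(yield ~ fertilizer + rainfall + irrigation, data = crop10)
fit_b <- lm(yield ~ irrigation + rainfall + fertilizer, data = crop10)

aov_a <- anova(fit_a)
aov_b <- anova(fit_b)

aov_a   # fertilizer entered first
aov_b   # fertilizer entered last
```

The total and residual sums of squares are identical in both tables, but the amount credited to each predictor changes with the order. This is why the large Sum Sq for fertilizer above partly reflects its position as the first term.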

Understanding Regression Output

Regression Coefficients

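Each coefficient estimates the change in crop yield for a one-unit increase in that independent variable, holding the other variables constant.

  • A positive coefficient indicates an increase in yield
  • A negative coefficient indicates a decrease in yield

For example, the fertilizer coefficient of 0.50468 means that each additional kg/ha of fertilizer is associated with an increase of about 0.50 quintal/ha in yield when rainfall and irrigation are held fixed.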

p-value

Decision rule (at the 5% level of significance):

p-value     Interpretation
p < 0.05    Significant
p ≥ 0.05    Not significant

R-squared

R² measures the goodness of fit of the regression model, that is, the proportion of the variation in crop yield explained by the independent variables.

R² Value      Interpretation
Close to 1    Good fit
Close to 0    Poor fit
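
R² and adjusted R² can be recovered from the sums of squares in the ANOVA table. The sketch below uses the rounded values printed by anova(model), so it agrees with the summary output only to the displayed precision; the variable names are illustrative.

```r
# Sums of squares copied from the ANOVA table
ss_model <- c(fertilizer = 2290.04, rainfall = 2.25, irrigation = 1.71)
ss_resid <- 5.60

n <- 40   # number of observations
p <- 3    # number of predictors

# Total variation = explained + unexplained
ss_total <- sum(ss_model) + ss_resid

# R-squared: share of total variation explained by the model
r2 <- 1 - ss_resid / ss_total

# Adjusted R-squared: penalises the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

round(c(R2 = r2, Adj_R2 = adj_r2), 4)
```

Both values round to the figures reported by summary(model): 0.9976 and 0.9974.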

Regression Diagnostics

par(mfrow = c(2, 2))  # arrange the four diagnostic plots in a 2 x 2 grid
plot(model)

The above command generates:

  1. Residuals vs Fitted Plot
  2. Normal Q-Q Plot
  3. Scale-Location Plot
  4. Residuals vs Leverage Plot

Interpretation of Diagnostic Plots

Residuals vs Fitted Plot

This plot checks:

  • Linearity
  • Constant variance

A random scatter of points around the horizontal zero line, with no curve or funnel shape, suggests that both assumptions are reasonable.

Normal Q-Q Plot

This plot checks the normality of residuals.

The points should fall approximately along the straight reference line.

Scale-Location Plot

This plot checks homoscedasticity.

Equal spread of residuals indicates constant variance.

Residuals vs Leverage Plot

This plot identifies influential observations, that is, points that combine high leverage with a large residual and fall outside the Cook's distance contours.
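
The visual checks can be supplemented with numerical ones. The sketch below, on a ten-row stand-in (values from the str() output; crop10 and fit are hypothetical names), applies the Shapiro-Wilk test to the residuals and computes Cook's distance. The 4/n cut-off is only a common rule of thumb, assumed here for illustration.

```r
# Ten-row stand-in taken from the str() output; hypothetical name 'crop10'
crop10 <- data.frame(
  fertilizer = c(40, 42, 38, 50, 55, 48, 60, 62, 45, 52),
  rainfall   = c(820, 790, 760, 880, 910, 850, 940, 960, 810, 900),
  irrigation = c(12, 13, 11, 15, 16, 14, 17, 18, 13, 15),
  yield      = c(28, 30, 26, 36, 40, 34, 44, 46, 31, 38)
)

fit <- lm(yield ~ fertilizer + rainfall + irrigation, data = crop10)

# Shapiro-Wilk test: a small p-value suggests non-normal residuals
sw <- shapiro.test(residuals(fit))
sw

# Cook's distance: influence of each observation on the fitted model
cd <- cooks.distance(fit)

# Observations above the rule-of-thumb cut-off 4/n (an assumption)
flagged <- which(cd > 4 / nrow(crop10))
flagged
```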

Variance Inflation Factor (VIF)

Variance Inflation Factor (VIF) is used to detect multicollinearity among independent variables in a regression model.

car::vif(model)
## fertilizer   rainfall irrigation 
##   97.83653   68.15842   54.81909

Interpretation of VIF

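  • VIF < 5 indicates low multicollinearity
  • VIF between 5 and 10 indicates moderate multicollinearity
  • VIF > 10 indicates a serious multicollinearity problem

In the output above, all three VIF values (about 98, 68, and 55) are far above 10, so the predictors are strongly correlated with one another and the individual coefficient estimates should be interpreted with caution.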
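
The VIF of a predictor equals 1 / (1 - R²), where R² comes from regressing that predictor on the remaining predictors. The sketch below computes it by hand on a ten-row stand-in (values from the str() output; crop10 and the other names are hypothetical) and cross-checks it against the diagonal of the inverse correlation matrix. It illustrates the definition and is not the internal code of car::vif().

```r
# Ten-row stand-in taken from the str() output; hypothetical name 'crop10'
crop10 <- data.frame(
  fertilizer = c(40, 42, 38, 50, 55, 48, 60, 62, 45, 52),
  rainfall   = c(820, 790, 760, 880, 910, 850, 940, 960, 810, 900),
  irrigation = c(12, 13, 11, 15, 16, 14, 17, 18, 13, 15),
  yield      = c(28, 30, 26, 36, 40, 34, 44, 46, 31, 38)
)

predictors <- c("fertilizer", "rainfall", "irrigation")

# Method 1: regress each predictor on the others, then 1 / (1 - R-squared)
vif_manual <- sapply(predictors, function(v) {
  others <- setdiff(predictors, v)
  aux    <- lm(reformulate(others, response = v), data = crop10)
  1 / (1 - summary(aux)$r.squared)
})

# Method 2: diagonal of the inverse of the predictor correlation matrix
vif_inverse <- diag(solve(cor(crop10[predictors])))

round(rbind(vif_manual, vif_inverse), 2)
```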

Durbin-Watson Test

The Durbin-Watson test is used to detect autocorrelation among residuals in regression analysis.

lmtest::dwtest(model)
## 
##  Durbin-Watson test
## 
## data:  model
## DW = 1.617, p-value = 0.1078
## alternative hypothesis: true autocorrelation is greater than 0

Interpretation of Durbin-Watson Test

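  • A value close to 2 indicates no autocorrelation
  • A value well below 2 (towards 0) indicates positive autocorrelation
  • A value well above 2 (towards 4) indicates negative autocorrelation

In the output above, DW = 1.617 with p-value = 0.1078, so at the 5% level there is no significant evidence of positive autocorrelation.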
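
The Durbin-Watson statistic is the sum of squared differences between successive residuals divided by the sum of squared residuals, so it always lies between 0 and 4. The sketch below computes it by hand on a ten-row stand-in (values from the str() output; crop10, fit, e and dw are hypothetical names). The statistic is only meaningful when the row order reflects a real sequence such as time.

```r
# Ten-row stand-in taken from the str() output; hypothetical name 'crop10'
crop10 <- data.frame(
  fertilizer = c(40, 42, 38, 50, 55, 48, 60, 62, 45, 52),
  rainfall   = c(820, 790, 760, 880, 910, 850, 940, 960, 810, 900),
  irrigation = c(12, 13, 11, 15, 16, 14, 17, 18, 13, 15),
  yield      = c(28, 30, 26, 36, 40, 34, 44, 46, 31, 38)
)

fit <- lm(yield ~ fertilizer + rainfall + irrigation, data = crop10)

# Residuals in the order the observations appear in the data
e <- residuals(fit)

# DW = sum of squared successive differences / sum of squared residuals
dw <- sum(diff(e)^2) / sum(e^2)
dw
```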

Applications in Agriculture

Regression analysis has wide applications in agricultural sciences.

Major Applications

  • Crop yield prediction
  • Fertilizer recommendation studies
  • Rainfall impact assessment
  • Soil nutrient analysis
  • Irrigation management
  • Agricultural economic forecasting
  • Pest and disease modelling
  • Climate change impact studies

Conclusion

Multiple regression analysis is an important statistical tool for agricultural research and decision-making.

RStudio provides a powerful environment for:

  • Data analysis
  • Model fitting
  • Graphical visualization
  • Statistical interpretation
  • Prediction modelling

The methods explained in this tutorial can be extended to advanced predictive agricultural analytics.
